Text-to-speech based on deep learning for Web site using Amazon Polly and Ruby

2016-12-01

Amazon Polly, a text-to-speech service from AWS, was announced today at re:Invent. It is a speech synthesis system based on deep learning.

Amazon Polly — Text to Speech in 47 Voices and 24 Languages

[updated] I added the generated speech of this article.

[updated2] I created a simple CLI tool and a Ruby gem for Polly:

https://rubygems.org/gems/pollynomial

The great thing about Amazon Polly is that we can use TTS easily with the AWS CLI. It is free for up to 5 million characters a month; beyond that limit, it is very cheap at $0.000004 per character. Synthesizing Adventures of Huckleberry Finn costs only about $2.40.
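The pricing arithmetic can be checked with a few lines of Ruby. This is only a rough sketch: the constants are the launch-time prices quoted above, the method names are made up for illustration, and the 600,000-character length of the novel is an assumption inferred from the $2.40 figure, not a measured count.

```ruby
# Rough cost estimate for Amazon Polly, using the prices quoted in this article.
FREE_CHARS_PER_MONTH = 5_000_000          # free tier, characters per month
PRICE_PER_CHAR = Rational(4, 1_000_000)   # $0.000004 per character

# Cost in dollars of synthesizing `chars` characters, ignoring the free tier.
def polly_cost(chars)
  (chars * PRICE_PER_CHAR).to_f
end

# Cost in dollars for one month's usage after subtracting the free allowance.
def polly_monthly_cost(chars)
  billable = [chars - FREE_CHARS_PER_MONTH, 0].max
  polly_cost(billable)
end

# Assumption: the novel is roughly 600,000 characters long.
puts format('$%.2f', polly_cost(600_000))          # prints "$2.40"
puts format('$%.2f', polly_monthly_cost(600_000))  # prints "$0.00"
```

In other words, a single novel fits well inside the free tier; the $2.40 figure only applies once the monthly allowance has been used up.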

Here is an example of calling Polly with the AWS CLI:

$ aws polly synthesize-speech \
    --output-format mp3 --voice-id Joanna \
    --text "Hello my name is Joanna." \
    joanna.mp3
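The same request can be made from Ruby. The following is a minimal sketch, not the code behind this article: it assumes the aws-sdk-polly gem (the current AWS SDK for Ruby package) is installed and that AWS credentials are configured, and the helper names `speech_params` and `synthesize_to_file` as well as the `us-east-1` region are chosen here purely for illustration.

```ruby
# Request parameters mirroring the CLI flags --output-format, --voice-id and --text.
def speech_params(text, voice_id: 'Joanna', output_format: 'mp3')
  { output_format: output_format, voice_id: voice_id, text: text }
end

# Synthesizes `text` and writes the audio to the file at `path`.
# Assumes the aws-sdk-polly gem is installed and AWS credentials are configured.
def synthesize_to_file(text, path, region: 'us-east-1', **opts)
  require 'aws-sdk-polly' # loaded lazily so speech_params works without the gem
  client = Aws::Polly::Client.new(region: region)
  # response_target asks the SDK to stream the audio body into the given file.
  client.synthesize_speech(speech_params(text, **opts).merge(response_target: path))
end

# Usage (needs network access and credentials):
# synthesize_to_file('Hello my name is Joanna.', 'joanna.mp3')
```

Keeping the parameter hash in its own method makes it easy to inspect or change the voice and output format without touching the network call.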

As of December 1, 2016, Polly supports the following 24 languages, most of them European:

Icelandic
Italian
Welsh
Dutch
Swedish
Spanish (Castilian)
Spanish (US)
Danish
Turkish
German
Norwegian
French
French (Canada)
Portuguese
Portuguese (Brazil)
Polish
Romanian
Russian
Japanese
English (India)
English (Welsh)
English (Australia)
English (US)
English (UK)

I think the Japanese speech sounds very natural. Sometimes the accent is strange, but by registering a word with a Lexicon, we can improve the quality ourselves. Here is a Japanese sample voice:

I often find interesting articles on Medium, but reading a long English article is a bit tough for a non-native English speaker like me. So I thought that if I converted an article to speech, I could listen to it easily. That is why I wrote Ruby code to convert articles to speech.

The API has some important restrictions:

Each API request accepts at most 1,500 characters
Audio longer than 5 minutes is truncated
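Because of the per-request character limit, a long article has to be split before it is sent to Polly. The sketch below shows one way to do that in plain Ruby; it is an illustration under stated assumptions, not the code used for this article. The `chunk_text` name and the sentence-splitting regular expression are hypothetical choices, and the 1,500 limit is the figure quoted above.

```ruby
MAX_CHARS = 1500 # per-request character limit quoted in this article

# Splits text into chunks of at most `limit` characters, preferring to break
# after sentence-ending punctuation and hard-splitting any overlong sentence.
def chunk_text(text, limit = MAX_CHARS)
  # A sentence ends with ., ! or ?; trailing text without punctuation is kept too.
  sentences = text.scan(/[^.!?]*[.!?]+\s*|[^.!?]+\z/)
  chunks = []
  current = +''
  sentences.each do |sentence|
    # A single sentence longer than the limit is cut into fixed-size pieces.
    pieces = sentence.length > limit ? sentence.scan(/.{1,#{limit}}/m) : [sentence]
    pieces.each do |piece|
      if current.length + piece.length > limit
        chunks << current.strip
        current = +''
      end
      current << piece
    end
  end
  chunks << current.strip
  chunks.reject(&:empty?)
end

article = 'Hello world. ' * 200 # 2,600 characters of sample text
chunk_text(article).each_with_index do |chunk, i|
  puts "chunk #{i}: #{chunk.length} characters"
end
```

Each chunk can then be synthesized with its own request and the resulting audio pieces joined afterwards. Smaller chunks also make it less likely that a single request runs into the 5-minute truncation, although this sketch does not measure audio length.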

Aki Ariga

Principal Software Engineer
