Building a Deep Learning-powered Tamil TTS


Just a log of our experiments in building a Tamil TTS using deep learning.

In this article, we share our experience building a deep learning text-to-speech model for the Tamil language using publicly available data and implementations.
If you want a good introductory course, we recommend these two: Intro from MIT and Neural Networks from Scratch.

TTS 🔗

TTS stands for Text-To-Speech. Originally, these systems were built by converting text to phonemes, then converting phonemes to speech, which was accomplished simply by concatenating pre-recorded audio clips for each phoneme. These systems were not very good because they could not capture the prosody of the language: the rhythm, stress, and intonation of speech. Modern TTS systems are built using deep learning and are able to capture natural prosody.
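To make the old pipeline concrete, here is a toy sketch (not a real system) of the concatenative approach described above: look up each word's phonemes in a lexicon, then stitch together per-phoneme audio. The lexicon and the phoneme clips are hypothetical placeholders, with numpy arrays standing in for recorded WAV data.

```python
import numpy as np

LEXICON = {"hello": ["HH", "AH", "L", "OW"]}  # hypothetical grapheme-to-phoneme table
# 0.1 s of silence at 16 kHz stands in for each recorded phoneme clip
PHONEME_CLIPS = {p: np.zeros(1600) for p in ["HH", "AH", "L", "OW"]}

def concatenative_tts(text: str) -> np.ndarray:
    phonemes = [p for word in text.lower().split() for p in LEXICON[word]]
    # Plain concatenation: no smoothing at the joins and no prosody model,
    # which is exactly why such systems sound robotic.
    return np.concatenate([PHONEME_CLIPS[p] for p in phonemes])

audio = concatenative_tts("hello")
print(audio.shape)  # samples ready to be written out as a WAV file
```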

Convolutional Neural Networks based TTS 🔗

For our experiment we took the implementation of Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention available at https://github.com/tugstugi/pytorch-dc-tts/. It provides dataloaders for both English and Mongolian. You can explore a demo in this Colab notebook.
We will use this implementation as the base and build our Tamil TTS on top of it.
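The "guided attention" in the paper's title refers to a penalty that nudges the text-to-audio attention matrix toward a roughly monotonic diagonal. Below is a hedged sketch written from the paper's description, not copied from the repository; shapes and the default g value are assumptions.

```python
import torch

def guided_attention_loss(A: torch.Tensor, g: float = 0.2) -> torch.Tensor:
    """A: attention weights of shape (N text positions, T mel frames)."""
    N, T = A.shape
    n = torch.arange(N, dtype=A.dtype, device=A.device).unsqueeze(1) / N  # (N, 1)
    t = torch.arange(T, dtype=A.dtype, device=A.device).unsqueeze(0) / T  # (1, T)
    # Penalty is near 0 on the diagonal and approaches 1 far away from it,
    # so off-diagonal attention mass is discouraged.
    W = 1.0 - torch.exp(-((n - t) ** 2) / (2.0 * g ** 2))                 # (N, T)
    return (A * W).mean()

# Example: a random attention matrix over 50 characters and 200 mel frames.
loss = guided_attention_loss(torch.softmax(torch.randn(50, 200), dim=0))
print(loss.item())
```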

Tamil Dataset 🔗

We couldn’t find a Tamil text-to-speech dataset with a single speaker and hours of recordings. Since this was purely an experiment for the purpose of learning, we took a Tamil audiobook from YouTube and annotated it ourselves, ending up with a few hours of audio and text. We may not be able to share this dataset, but we might once we build a better one.

Edit: It seems like this new dataset from Google and IISc (https://vaani.iisc.ac.in/) would let us release a version of the source code and dataset. We will be working on it.

Unlike English, which has just 26 characters, Tamil has 247 characters, including vowels, consonants, compound characters, and special characters. More characters means more classes, which means more parameters to train. We tried a simple approach to this problem: decomposing each compound character into its consonant and vowel pair. For example, கோ is decomposed to க + ோ. This reduced the number of classes to 36 plus special characters. It worked well because it is phonetically equivalent: a compound character is pronounced as its consonant + vowel pair.
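A minimal sketch of this idea, assuming preprocessing along the lines the post describes: Unicode already encodes Tamil compound letters as a consonant code point followed by a vowel-sign code point, so iterating over normalized code points splits கோ into க + ோ and keeps the class count small. The helper name and vocabulary-building step are illustrative, not the exact code used.

```python
import unicodedata

def tamil_symbols(text: str) -> list[str]:
    """Return the per-code-point symbol sequence used as model classes."""
    return list(unicodedata.normalize("NFC", text))

print(tamil_symbols("கோ"))  # ['க', 'ோ'] -> consonant + vowel sign

# Building the vocabulary this way over the whole corpus yields roughly the
# base letters and vowel signs plus punctuation and special characters.
corpus = ["தெய்வத்தான் ஆகா தெனினும்", "தீதும் நன்றும் பிறர் தர வாரா"]
vocab = {c: i for i, c in enumerate(sorted({s for line in corpus for s in tamil_symbols(line)}))}
print(len(vocab))
```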

Training results 🔗

Here are some of the results we got after training for 5 hours on 1 GPU.

தெய்வத்தான் ஆகா தெனினும் முயற்சிதன் மெய்வருத்தக் கூலி தரும்

தீதும் நன்றும் பிறர் தர வாரா

சிற்றம்பலத்துக்கு இரண்டு காததூரத்தில் அலை கடல் ஓர் ஏரி

The voice is not very natural, but it is able to capture the prosody of the language. We will explore this further and try to improve the results.