Last year I got the opportunity to dive into an art project and develop new skills that are far removed from our everyday business projects.
The animatronic is the doppelgänger of the German author Thomas Melle, and it (or can a robot be addressed as “he”?) “acts” for one hour on stage, speaking about himself and other meta-related topics.
An interesting part of the programming was automating the lip-sync. We wanted to take the voice of Thomas Melle, feed it into a piece of software as input, and get the positions of the animatronic’s mouth motors as output. To do that we needed to learn some theory.
The sound of a spoken sentence and the mouth position do not have a one-to-one correspondence. The sounds that we hear are described by phonemes, and the mouth shapes that we see are visemes. Several phonemes can share the same viseme. For example, the sentences “You have salad” and “You have talent” look the same when lip-read. The YouTube channel BadLipReading is a good example of that; it uses this property of language as a technique to create its videos.
So we needed a set of visemes as mouth positions, and a way to interpolate between those positions as Thomas Melle speaks. But which visemes do we need? How many visemes are there? Some academic works claim that there are 30 or more, but to create a believable animation we only need between 6 and 12. We used 12. The artist Kéké shows the mouth shapes nicely on his blog: https://k-eke.tumblr.com/post/170310845461/so-there-is-the-whole-tutorial-animated-with
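To make the interpolation idea concrete, here is a minimal sketch in Python. It is not the code from the project: the viseme names, the two motor channels (jaw, lip corner) and their values are made up for illustration, and I assume plain linear interpolation between viseme keyframes.

```python
from bisect import bisect_right

# Hypothetical motor poses: (jaw opening, lip corner position), both in 0..1.
# These names and numbers are placeholders, not the 12 visemes we used.
VISEME_POSES = {
    "rest": (0.0, 0.5),
    "AA": (1.0, 0.5),
    "OO": (0.4, 0.0),
}


def lerp(a, b, t):
    """Linear interpolation between a and b for t in 0..1."""
    return a + (b - a) * t


def pose_at(keyframes, time_s):
    """Return the motor pose at time_s.

    keyframes is a list of (time in seconds, viseme name), sorted by time.
    Before the first and after the last keyframe the pose is held.
    """
    times = [t for t, _ in keyframes]
    i = bisect_right(times, time_s)
    if i == 0:
        return VISEME_POSES[keyframes[0][1]]
    if i == len(keyframes):
        return VISEME_POSES[keyframes[-1][1]]
    (t0, v0), (t1, v1) = keyframes[i - 1], keyframes[i]
    w = (time_s - t0) / (t1 - t0)
    return tuple(
        lerp(a, b, w) for a, b in zip(VISEME_POSES[v0], VISEME_POSES[v1])
    )


keyframes = [(0.0, "rest"), (0.2, "AA"), (0.4, "OO")]
halfway = pose_at(keyframes, 0.1)  # halfway between "rest" and "AA"
```

In practice you would call something like `pose_at` at the update rate of the motor controller and probably smooth the result, since real motors cannot jump between positions instantly.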
First we need to extract the phonemes from the audio file, mapped to time. There are several tools that can do this: gentle, cmusphinx, kaldi, the built-in Windows Speech Recognition, …
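As a rough sketch of what "phonemes mapped to time" looks like in code, the following Python snippet turns an alignment result into a flat list of (start, end, phoneme). I am assuming a gentle-style JSON layout here (a `words` list where each aligned word has a `start` time and a list of `phones` with `phone` and `duration`); check the output of your own aligner, because the field names may differ. The sample data is invented.

```python
import json


def timed_phonemes(alignment_json):
    """Flatten a word-level alignment into (start, end, phoneme) tuples.

    Assumed layout (gentle-style, not verified against every version):
    {"words": [{"case": "success", "start": 0.5,
                "phones": [{"phone": "y_B", "duration": 0.1}, ...]}, ...]}
    """
    data = json.loads(alignment_json)
    result = []
    for word in data["words"]:
        if word.get("case") != "success":
            continue  # skip words the aligner could not find in the audio
        t = word["start"]
        for phone in word["phones"]:
            # "y_B" -> "y": drop the position suffix after the underscore
            name = phone["phone"].split("_")[0]
            end = t + phone["duration"]
            result.append((round(t, 3), round(end, 3), name))
            t = end
    return result


sample = json.dumps({
    "words": [
        {"case": "success", "start": 0.5, "phones": [
            {"phone": "y_B", "duration": 0.1},
            {"phone": "uw_E", "duration": 0.2},
        ]},
        {"case": "not-found-in-audio"},
    ]
})
phonemes = timed_phonemes(sample)
```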
Then we need rules to map the phonemes to the visemes. The simplest method is to create phoneme groups mapped to their respective visemes, and to ignore coarticulation rules.
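A phoneme-to-viseme grouping of this kind can be sketched as a simple lookup table. The groups and names below are only illustrative (a handful of ARPAbet-style phonemes, not the 12 visemes from the project), and merging neighbouring phonemes that share a viseme is my own simplification.

```python
# Illustrative groups only: viseme name -> phonemes that share that mouth shape.
PHONEME_GROUPS = {
    "closed": ["p", "b", "m"],
    "teeth_on_lip": ["f", "v"],
    "open": ["aa", "ae", "ah"],
    "round": ["uw", "ow", "w"],
}

# Invert the groups into a direct phoneme -> viseme lookup.
PHONEME_TO_VISEME = {
    phoneme: viseme
    for viseme, phonemes in PHONEME_GROUPS.items()
    for phoneme in phonemes
}


def to_visemes(timed_phonemes, default="rest"):
    """Map (start, end, phoneme) tuples to (start, end, viseme) tuples.

    Unknown phonemes fall back to `default`. Consecutive entries with the
    same viseme are merged into one longer entry.
    """
    visemes = []
    for start, end, phoneme in timed_phonemes:
        viseme = PHONEME_TO_VISEME.get(phoneme, default)
        if visemes and visemes[-1][2] == viseme:
            visemes[-1] = (visemes[-1][0], end, viseme)
        else:
            visemes.append((start, end, viseme))
    return visemes


track = to_visemes([(0.0, 0.1, "m"), (0.1, 0.2, "b"), (0.2, 0.4, "aa")])
```

No coarticulation is involved here: every phoneme always gets the same mouth shape, whatever comes before or after it.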
With such a grouping we already get pretty believable animations. Other phoneme groups, like the one the Hanna-Barbera studios invented, are described in the repo of the really nice tool rhubarb-lip-sync, created by Daniel Wolf. This tool can generate the visemes directly from an audio file; internally it uses cmusphinx to align the phonemes to time.
If we want more realistic animation, we can use coarticulation rules. A viseme can be influenced by its neighbours, so we need to analyse the viseme sequence and change some visemes accordingly. For example, the T of “this” looks more closed than the T of “that”, because the T in “this” is “preparing” itself for the I sound, while the T in “that” is “preparing” itself for the A: sound.
We didn’t want to write complex coarticulation rules, so we used a blending method: we blend consecutive visemes together by some percentage. The mouth shape for the T blended with the I is narrower than the one for the T blended with the A: sound.
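Here is a small Python sketch of the blending idea. It is a simplification under my own assumptions: each viseme is a tuple of motor values, and every viseme is pulled towards the one that follows it by a fixed weight. The pose values and the weight are invented, and the real blending percentages were tuned by eye.

```python
def blend(pose_a, pose_b, weight):
    """Mix two poses: weight 0.0 gives pose_a, weight 1.0 gives pose_b."""
    return tuple(a * (1 - weight) + b * weight for a, b in zip(pose_a, pose_b))


def blend_sequence(poses, weight=0.3):
    """Pull every pose towards the next one, so a viseme "prepares" for
    the sound that follows. The last pose has no successor and is kept."""
    blended = []
    for i, pose in enumerate(poses):
        if i + 1 < len(poses):
            blended.append(blend(pose, poses[i + 1], weight))
        else:
            blended.append(pose)
    return blended


# Hypothetical poses as (mouth opening, mouth width), both in 0..1.
T = (0.2, 0.5)
I = (0.3, 0.8)
A = (1.0, 0.5)

t_before_i = blend_sequence([T, I])[0]  # T as in "this"
t_before_a = blend_sequence([T, A])[0]  # T as in "that"
```

With these numbers the T before the I stays less open than the T before the A:, which is the effect described above, without writing any explicit rule for the pair.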
This method is still a work in progress. I will try different and more specific coarticulation rules, and also try to use something like timbre, volume and pitch to tweak the mouth shape strategies. Here is a paper that describes how to add some dynamics to the mouth shapes: http://www.dgp.toronto.edu/~elf/JALISIG16.pdf
And another paper with a broader overview of the topic: https://pdfs.semanticscholar.org/ac34/dea08577704e7f6aea6b878f62664ab708b2.pdf