Automatic lip synch.
I wasn’t originally going to talk about this until I had something to show, but my next AIR project was going to be something like my old e2animate application that I talked about in my previous post.
I was going to aim it at non-technical users. Make it fun. Easy to use. Allow them to access an online library of clip-art. Plus I was going to incorporate a very powerful and special feature. Automatic lip synchronisation. Just drop a mouth shape into an animation, and it would automatically synchronise to a sound track.
I am pleased that someone has possibly saved me all the hard work. But I was quite looking forward to the technical challenge of doing this myself.
I don’t know the details of Samir’s algorithm, but this is how I was going to do it…
The first thing is to determine the mouth shape for each segment of speech. I was going to look at two possible ways of doing this: prediction gain, and frequency bin matching.
Prediction gain is the power of the signal coming out of a filter, divided by the power of the signal going in. If the filter matches the characteristics of the signal, we get a high value (because we don’t lose much power), but if it doesn’t match, we get a low value, because the filter blocks out more of the signal.
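To make that concrete, here’s a small sketch in Python (rather than ActionScript), using a standard biquad band-pass filter as the fixed filter. The filter choice, the Q value, and the function names are all my assumptions for illustration, not details from the plan above:

```python
import math

def bandpass(samples, center_hz, sample_rate, q=5.0):
    """Run samples through a biquad band-pass filter and return the output.
    Coefficients follow the common 'constant 0 dB peak gain' band-pass form."""
    w0 = 2 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2 * q)
    b0, b1, b2 = alpha, 0.0, -alpha
    a0, a1, a2 = 1 + alpha, -2 * math.cos(w0), 1 - alpha
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in samples:
        y = (b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2) / a0
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def power(samples):
    """Mean squared amplitude of a signal."""
    return sum(s * s for s in samples) / len(samples)

def prediction_gain(samples, filt):
    """Power coming out of the filter divided by power going in:
    high when the filter matches the signal, low when it doesn't."""
    return power(filt(samples)) / power(samples)
```

A tone sitting inside the filter’s pass-band comes through with most of its power intact, so its gain is much higher than a tone far outside it.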
So, imagine a bank of fixed filters, each of which detects a particular sound (A, I/E/U/L/W, Q/M, fricatives, etc. – possibly also taking different kinds of voices into account). Depending on which filter gives us the best match, we select a particular mouth shape.
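A sketch of that selection step, again in Python. Here each “filter” in the bank is just a single characteristic frequency measured with the Goertzel algorithm, and the shape-to-frequency table is entirely made up for illustration – a real bank would use formant-shaped filters tuned to actual phoneme data:

```python
import math

def goertzel_power(samples, freq_hz, sample_rate):
    """Power of the signal at one frequency (Goertzel algorithm) --
    a cheap stand-in for one fixed filter in the bank."""
    w = 2 * math.pi * freq_hz / sample_rate
    coeff = 2 * math.cos(w)
    s1 = s2 = 0.0
    for x in samples:
        s0 = x + coeff * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

# Hypothetical mapping from mouth shapes to characteristic frequencies;
# these numbers are placeholders, not values from the original post.
MOUTH_FILTERS = {
    "A": 700.0,          # open vowel, lower formant region
    "E": 500.0,
    "O": 400.0,
    "fricative": 3000.0, # sibilant energy sits high in the spectrum
}

def pick_mouth_shape(samples, sample_rate):
    """Select the mouth shape whose 'filter' responds most strongly."""
    return max(MOUTH_FILTERS,
               key=lambda shape: goertzel_power(samples,
                                                MOUTH_FILTERS[shape],
                                                sample_rate))
```

Feed it a frame of audio per animation frame and it returns the best-matching shape for that frame.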
Frequency bin matching means comparing the power in parts of the frequency spectrum to known sets of values, each set again corresponding to a different kind of sound.
MP3 is a sub-band coder. An MP3 file actually contains these frequency bin values. Or they can be derived from an FFT (SoundMixer.computeSpectrum()).
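Here’s a sketch of bin matching in Python. The slow direct DFT below stands in for the bin values you’d get from SoundMixer.computeSpectrum() or an MP3 decoder, and the band edges and templates are invented for the example:

```python
import math

def band_powers(samples, sample_rate, bands):
    """Coarse spectrum: normalised power in each (lo_hz, hi_hz) band,
    via a direct DFT (a stand-in for computeSpectrum()/MP3 sub-bands)."""
    n = len(samples)
    powers = []
    for lo, hi in bands:
        p = 0.0
        for k in range(int(lo * n / sample_rate),
                       int(hi * n / sample_rate) + 1):
            re = sum(samples[t] * math.cos(2 * math.pi * k * t / n)
                     for t in range(n))
            im = -sum(samples[t] * math.sin(2 * math.pi * k * t / n)
                      for t in range(n))
            p += re * re + im * im
        powers.append(p)
    total = sum(powers) or 1.0
    return [p / total for p in powers]  # normalise so loudness cancels out

def closest_template(powers, templates):
    """Pick the sound class whose stored band profile is nearest
    (squared Euclidean distance)."""
    def dist(name):
        return sum((a - b) ** 2 for a, b in zip(powers, templates[name]))
    return min(templates, key=dist)
```

Normalising the band powers first means the match depends on the spectral shape of the sound, not on how loud it is.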
I also heard that Pixel Bender could be used to process sound, and I was going to investigate this, and anything else Cosmo had.
Anyway, it was probably going to take a little fiddling around, possibly some time-based signal processing too, but I’m sure something based on these methods could automatically generate a mouth shape. Then, the size of the mouth is just proportional to the amplitude/power of the speech signal.
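The amplitude-to-mouth-size part is the easy bit. A minimal Python sketch, where the full-scale calibration constant and the smoothing factor are my assumptions:

```python
import math

def mouth_openness(frame, full_scale=1.0):
    """Map a frame's RMS amplitude to a 0..1 mouth-opening factor.
    full_scale is the RMS treated as 'fully open' -- an assumed
    calibration constant you'd tune per recording."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return min(rms / full_scale, 1.0)

def smooth(values, alpha=0.3):
    """One-pole low-pass over successive openness values, so the
    mouth eases between sizes instead of fluttering every frame."""
    out, y = [], 0.0
    for v in values:
        y += alpha * (v - y)
        out.append(y)
    return out
```

Silence maps to a closed mouth, full-scale audio to a fully open one, and the smoothing pass keeps the animation from jittering.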
And that’s how I was going to do dynamic lip synch in ActionScript.