Posts in series:
- Introduction (this post)
- Architecture overview
I’m not dead, just busy. Sorry for the long break without posts.
Recently I created this little tool. It uses speech recognition to synchronize movie subtitles, and here I would like to write about how it works.
First, there is a speech recognition library: pocketsphinx from Carnegie Mellon University. It is used to produce a list of words with timestamps. It works pretty well, but not as well as YouTube’s auto-generated subtitles. It does fine with cleanly recorded voice, but for movies with a more complicated audio track the results are much worse: maybe 10% of the generated words are correct. Still, that is good enough for us. How? I will explain in further posts.
There is also an option to synchronize against another set of subtitles. Words generated in this mode will obviously be much more accurate.
The input subtitles (the ones being synchronized) are processed similarly, producing timestamped words. If they are in a different language, the words are translated using a simple dictionary lookup.
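To make this step a bit more concrete, here is a minimal sketch of turning one subtitle cue into timestamped words. It is only an illustration, not the tool's actual code: the function name `cue_to_words`, the even spreading of words across the cue's duration, and the one-word-to-one-word `dictionary` are all assumptions made for the example.

```python
import re


def cue_to_words(start, end, text, dictionary=None):
    """Split one subtitle cue into (timestamp_seconds, word) tuples.

    Assumption: words are spread evenly between the cue's start and end.
    If a dictionary is given, each word is translated by plain lookup and
    words without a translation are dropped.
    """
    words = re.findall(r"\w+", text.lower())
    if not words:
        return []
    step = (end - start) / len(words)
    result = []
    for i, word in enumerate(words):
        if dictionary is not None:
            word = dictionary.get(word)
            if word is None:
                continue  # no translation known, skip this word
        result.append((start + i * step, word))
    return result


# Same-language subtitles: no dictionary needed.
print(cue_to_words(10.0, 12.0, "Hello, world!"))

# Different language: a toy Spanish-to-English dictionary (hypothetical data).
es_en = {"hola": "hello", "mundo": "world"}
print(cue_to_words(0.0, 4.0, "Hola mundo cruel", es_en))
```

Dropping untranslated words is a deliberate simplification here: the later steps only need enough matching words, not all of them.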
The next step is to feed these two lists of words to the correlator. It pairs similar words from both lists, generating pairs of timestamps. These can be visualised as points on a two-dimensional chart with one time scale on each axis. The correlator then searches for a straight line passing through as many points as possible (+/- epsilon). Finally, the equation of that line is used to fix the subtitles.
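The idea can be sketched in a few lines of Python. This is a rough illustration under my own assumptions, not the real correlator: `pair_words` and `best_line` are hypothetical names, "similar" is reduced to "identical after lowercasing", and the line search is a brute-force try of every pair of points, which is only practical for small inputs.

```python
from itertools import combinations


def pair_words(sub_words, ref_words):
    """Pair each subtitle word with every identical reference word.

    Both inputs are lists of (timestamp_seconds, word) tuples; the result
    is a list of (sub_time, ref_time) points.
    """
    ref_times = {}
    for t, word in ref_words:
        ref_times.setdefault(word.lower(), []).append(t)
    return [(ts, tr)
            for ts, word in sub_words
            for tr in ref_times.get(word.lower(), [])]


def best_line(points, epsilon=0.5):
    """Return (inlier_count, a, b) for the line y = a*x + b that passes
    within epsilon of the most points, or None if no line can be built.

    Brute force: every pair of points defines one candidate line.
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # vertical line, not a valid time mapping
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        count = sum(1 for x, y in points if abs(a * x + b - y) <= epsilon)
        if best is None or count > best[0]:
            best = (count, a, b)
    return best


# Subtitles are 2 seconds early; "hello" appears twice, which creates
# two wrong pairs that the line search has to ignore.
subs = [(1.0, "hello"), (3.0, "world"), (5.0, "again"), (7.0, "hello")]
refs = [(3.0, "hello"), (5.0, "world"), (7.0, "again"), (9.0, "hello")]
print(best_line(pair_words(subs, refs)))
```

Wrong pairs (a word matched with the wrong occurrence, or a misrecognized word) land off the line and simply don't get counted, which is why even a low recognition rate can be enough.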
This approach will synchronize subtitles that are delayed and/or run at a different time rate (useful for frame-based subtitles with mismatched FPS). Obviously it won’t work when parts have been inserted or removed, e.g. synchronizing a video that contains ads with subtitles that don’t. Still, it covers many use cases.
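As a small illustration of why a single line covers both cases: if the line is written as new_time = a * old_time + b, then b is the delay and a is the rate. The sketch below assumes that form; the names `fix_timestamp` and `fix_subtitles` and the cue layout (start, end, text) are made up for the example.

```python
def fix_timestamp(t, a, b):
    """Map an old timestamp (seconds) onto the corrected time scale."""
    return a * t + b


def fix_subtitles(cues, a, b):
    """Apply the same line to the start and end of every (start, end, text) cue."""
    return [(fix_timestamp(start, a, b), fix_timestamp(end, a, b), text)
            for start, end, text in cues]


cues = [(10.0, 12.0, "Hi"), (60.0, 63.0, "Bye")]

# Pure delay: subtitles are 2 seconds early, rate is unchanged.
print(fix_subtitles(cues, a=1.0, b=2.0))

# Pure rate change: frame numbers were converted to time assuming 25 FPS,
# but the video actually runs at 23.976 FPS, so every time is stretched.
print(fix_subtitles(cues, a=25 / 23.976, b=0.0))
```

A cut or an inserted ad would need a different b on each side of the cut, which is exactly what one straight line cannot express.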
In subsequent posts I will try to explain some implementation details.