SubSync – architecture overview

This is the second post in this series. See the Table of Contents.

SubSync is written in Python with a custom native module compiled from C++ (named gizmo). During synchronization, a pipeline similar to this one is constructed.

Synchronization pipeline

While the majority of blocks are part of the gizmo module, the pipeline itself is constructed from Python code, in subsync/synchro.py. There are three main parts of the pipeline:

  • subs extractor (yellow blocks),
  • refs extractor (blue blocks),
  • correlator (pink blocks).

Subs and refs in this context mean words with timestamps, produced from your subtitle file (subs) or from the video you are synchronizing with (refs).

Subs and refs extraction is done using FFmpeg library components wrapped as C++ objects. Demux (gizmo/media/demux.cpp) reads the input file and extracts a single track. If it is an audio track, it is decoded by AudioDec (gizmo/media/audiodec.cpp) and converted to a format suitable for the speech recognition engine by Resampler (gizmo/media/resampler.cpp). Speech recognition is done via the PocketSphinx library wrapped in the SpeechRecognition class (gizmo/media/speechrec.cpp). It produces timestamped words annotated with a score (a floating point value between 0 and 1). This structure is named Word (gizmo/text/words.h#L8).
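For illustration, a minimal Python equivalent of that structure might look like this (the field names are my assumption; see gizmo/text/words.h for the actual definition):

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str     # recognized or decoded word
    time: float   # timestamp in seconds
    score: float  # confidence, between 0.0 and 1.0
```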

Similar Words are produced by SubDec (gizmo/media/subdec.cpp), which is used to decode the input subtitles. In this case, the score is always set to 1.

SubDec also outputs subtitles in the SSA/ASS format, which is FFmpeg's internal format for subtitles. They are collected by SubtitlesCollector (subsync/subtitle.py#L57).

Reference words are translated by the Translator (gizmo/text/translator.cpp). It simply looks up in its dictionary every input word, matching entries that are similar enough (according to a distance function). It outputs the corresponding translations from the dictionary as Words, with the score reduced according to the calculated distance. The Translator is used only when the language of the subtitles differs from the language of the references.
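As a rough sketch of the idea (not the actual gizmo code, and with a simplistic scoring rule of my own), the lookup could work like this, reusing the Word class sketched above:

```python
def edit_distance(a: str, b: str) -> int:
    # classic Levenshtein distance, computed row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def translate(word, dictionary, max_distance=1):
    # emit a translated Word for every dictionary entry close enough
    # to the input word, with the score reduced by the distance
    for entry, translations in dictionary.items():
        dist = edit_distance(word.text, entry)
        if dist <= max_distance:
            penalty = 1.0 - dist / (max_distance + 1)
            for text in translations:
                yield Word(text, word.time, word.score * penalty)
```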

Each extraction job is done in a separate thread controlled by an Extractor object (gizmo/extractor.cpp). It is a native thread (as opposed to a Python thread). In the example above there are three extraction threads, one for subs and two for refs. Words are pushed through a queue (gizmo/text/wordsqueue.h) to the Correlator (gizmo/correlator.cpp), which runs in its own native thread. The Correlator calculates two values, delay and speed change, which are applied to the subtitles gathered by SubtitlesCollector. The correlation algorithm will be described in the next post in this series.
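This is a classic producer–consumer setup. A toy Python model of the flow (the real threads are native C++ threads and the real queue is gizmo/text/wordsqueue.h, so this is only an illustration):

```python
import threading
import queue

words = queue.Queue()
END = object()  # sentinel marking the end of a stream

def extractor(name, items):
    # stands in for an Extractor thread pushing recognized Words
    for item in items:
        words.put((name, item))
    words.put((name, END))

def correlator(n_producers):
    # stands in for the Correlator thread consuming the shared queue
    done = 0
    while done < n_producers:
        name, item = words.get()
        if item is END:
            done += 1
        else:
            print(f'{name}: {item}')

producers = [
    threading.Thread(target=extractor, args=('subs', ['hello', 'world'])),
    threading.Thread(target=extractor, args=('refs', ['hello', 'word'])),
]
consumer = threading.Thread(target=correlator, args=(len(producers),))
for t in [*producers, consumer]:
    t.start()
for t in [*producers, consumer]:
    t.join()
```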

A single instance of the Translator is used by several threads. This is safe since its pushWord method has no side effects.

Refs are usually extracted with several Extractor threads, each processing a different range of timestamps. With this approach we get words from different locations almost immediately, which helps the Correlator produce the initial synchronization faster and more accurately. Performance-wise, it is also usually optimal to run one PocketSphinx instance per physical CPU core.
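Splitting the work could look roughly like this (a sketch under my own assumptions; note that os.cpu_count() reports logical cores, not physical ones):

```python
import os

def make_ranges(duration, workers=None):
    # split [0, duration) into equal timestamp ranges, one per worker
    workers = workers or os.cpu_count() or 1
    step = duration / workers
    return [(i * step, (i + 1) * step) for i in range(workers)]

# e.g. a two-hour movie on a 4-core machine:
print(make_ranges(2 * 3600, workers=4))
# [(0.0, 1800.0), (1800.0, 3600.0), (3600.0, 5400.0), (5400.0, 7200.0)]
```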

In the case when refs are generated from subtitles as well, the refs pipeline looks similar to the subs one.

In the next post I will discuss the correlation algorithm in detail.


SubSync – synchronize movie subtitles with audio track

Posts in this series:

  1. Introduction (this post)
  2. Architecture overview

I’m not dead, just busy. Sorry for the long break without posts.

And recently I've created this little tool. It uses speech recognition to synchronize movie subtitles. Here I would like to describe how it works.

So, there is a speech recognition library: PocketSphinx from Carnegie Mellon University. It is used to produce a list of words with timestamps. It works pretty well, but it is not YouTube-auto-captions good. It works well for cleanly recorded voice; in movies with a more complicated audio track it will yield inferior results, with maybe 10% of the generated words being correct. But that is good enough for us. How? I will explain it in further posts.
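For a taste of what that looks like, here is a rough sketch using the PocketSphinx Python bindings (the exact API varies between versions, and SubSync itself wraps the C library directly in gizmo/media/speechrec.cpp, so treat this as an approximation):

```python
from pocketsphinx import Decoder

decoder = Decoder(samprate=16000)  # default US English models

decoder.start_utt()
with open('audio.raw', 'rb') as f:  # 16 kHz, 16-bit mono PCM
    decoder.process_raw(f.read(), full_utt=True)
decoder.end_utt()

for seg in decoder.seg():
    # frames are 10 ms by default, so divide by 100 to get seconds
    print(seg.word, seg.start_frame / 100.0, seg.end_frame / 100.0)
```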

There is also an option to synchronize with other subtitles. Words generated in this mode will obviously be of much better quality.

The input subtitles being synchronized are processed similarly, producing timestamped words. If they are in a different language, they will be translated using a simple dictionary lookup.

The next step is to feed these two lists of words to the correlator. It pairs similar words from both lists, generating pairs of timestamps. This can be visualized as a two-dimensional chart with the two time scales on its axes. The correlator searches for a straight line crossing as many points as possible (+/- epsilon). Finally, the equation of that line is used to fix the subtitles.
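To illustrate the idea (the actual algorithm, covered later in this series, is smarter than this), a naive brute-force version could try a line through every pair of points and keep the one with the most points close to it:

```python
def count_hits(points, a, b, eps):
    # points within +/- eps of the line ref_time = a * sub_time + b
    return sum(abs(a * x + b - y) <= eps for x, y in points)

def best_line(points, eps=0.5):
    best, best_hits = (1.0, 0.0), 0
    for i, (x1, y1) in enumerate(points):
        for x2, y2 in points[i + 1:]:
            if x1 == x2:
                continue
            a = (y2 - y1) / (x2 - x1)   # speed change
            b = y1 - a * x1             # delay
            hits = count_hits(points, a, b, eps)
            if hits > best_hits:
                best, best_hits = (a, b), hits
    return best

# four points near ref = 2 * sub + 1, plus one outlier:
a, b = best_line([(1, 3.1), (2, 5.0), (3, 7.1), (4, 9.0), (10, 2.0)])
# fixed subtitle timestamp: t_fixed = a * t + b
```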

This approach will synchronize subtitles that are delayed and/or running at a different time rate (useful for frame-based subtitles with mismatched FPS). Obviously it won't work with anything that has different parts inserted or removed, e.g. synchronizing a video with ads against subtitles without them. But it still covers many use cases.

In subsequent posts I will try to explain some implementation details.