Complete our OCR subsystem

Useful skills/interests: Image processing, Text Localization and Binarization, Tesseract API

Subtitles come in all shapes and colors. Some are text-based (such as American closed captions, as specified in CEA-608 and CEA-708, or the old European teletext). Others are bitmap-based, such as the European DVB. Bitmap subtitles are a lot more flexible, but also a lot harder to transcribe.

For languages written in the Latin script, what we have for DVB works quite well. Note that while DVB is bitmap-based, at least those bitmaps are separate from the main image, so you only need to OCR the bitmap to get the text.

However, there are variants and cases that make things a lot harder (and more interesting):

- Burned-in subtitles, which are drawn over the actual TV image.
- Non-Latin languages, such as Chinese.
- Moving subtitles, such as the usual tickers that scroll from side to side on the screen.
- Subtitles with different colors, for example to distinguish between different speakers.

Believe it or not, some of these cases are already supported in CCExtractor, at least under “good” conditions. But the really hard ones are still a work in progress.

The heavy lifting (the OCR itself) is done by Tesseract. But selecting the area to process, prefiltering it so that Tesseract gets an input it likes, and so on is done by our own code.
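To give a feel for the kind of prefiltering involved, here is a minimal sketch of one common binarization technique, Otsu's global threshold, applied to an 8-bit grayscale buffer. It is not CCExtractor's actual code: the function names otsu_threshold() and binarize() are hypothetical, and the assumption that the region has already been cropped and converted to grayscale is ours.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of CCExtractor): choose a global threshold
 * for an 8-bit grayscale image with Otsu's method, i.e. the threshold that
 * maximizes the between-class variance of the two pixel groups. */
static int otsu_threshold(const uint8_t *gray, size_t n)
{
    assert(gray != NULL && n > 0);

    /* Histogram of pixel intensities. */
    size_t hist[256] = {0};
    for (size_t i = 0; i < n; i++)
        hist[gray[i]]++;

    /* Sum of all intensities, used to derive the foreground mean. */
    double sum_all = 0.0;
    for (int v = 0; v < 256; v++)
        sum_all += (double)v * (double)hist[v];

    double sum_bg = 0.0;
    double best_var = -1.0;
    size_t w_bg = 0;
    int best_t = 0;

    for (int t = 0; t < 256; t++) {
        w_bg += hist[t];                        /* pixels with value <= t */
        sum_bg += (double)t * (double)hist[t];
        if (w_bg == 0)
            continue;                           /* no background pixels yet */
        size_t w_fg = n - w_bg;                 /* pixels with value > t */
        if (w_fg == 0)
            break;                              /* nothing left above t */

        double mean_bg = sum_bg / (double)w_bg;
        double mean_fg = (sum_all - sum_bg) / (double)w_fg;
        double diff = mean_bg - mean_fg;
        /* Between-class variance, up to a constant factor of 1/n^2. */
        double var = (double)w_bg * (double)w_fg * diff * diff;
        if (var > best_var) {
            best_var = var;
            best_t = t;
        }
    }
    return best_t;
}

/* Hypothetical helper: pixels above the threshold become white (255),
 * everything else becomes black (0). */
static void binarize(const uint8_t *gray, uint8_t *out, size_t n, int threshold)
{
    assert(gray != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = (gray[i] > threshold) ? 255 : 0;
}
```

A single global threshold like this tends to work when the text and its background are well separated in brightness. Burned-in subtitles over a changing video image usually need something more local or color-aware, which is part of what makes this project hard.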

We need someone who likes challenges to make the whole thing work.

We will provide all the samples. Optionally, if the student does not have a fast internet connection, we can also give them access to a high-speed server that already hosts the samples, so they can work there.

Qualification tasks
Terrible OCR results with Channel 5 (UK)
This task is ideal for getting started, because you only need to deal with one function in one file: quantize_map() in src/lib_ccx/ocr.c

In addition to the samples that we already have, we would also like you to create a small dataset of hardsubbed videos (videos with burned-in subtitles) with accurate timed transcripts, so that we can evaluate the performance of our code on a wide variety of real-world samples. For the qualification task, this does not have to be huge. A good representative set will do fine.
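One possible way to score OCR output against such transcripts is the character error rate (CER): the edit distance between the reference line and the recognized line, divided by the length of the reference. The sketch below is only an illustration of that idea. The names edit_distance() and char_error_rate() are hypothetical, it compares raw bytes (so multi-byte UTF-8 characters, as in Chinese or Cyrillic text, would count as several units), and it ignores timing altogether.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: Levenshtein distance between two byte strings,
 * computed with a single rolling row of the dynamic-programming table.
 * Returns (size_t)-1 if memory allocation fails. */
static size_t edit_distance(const char *ref, const char *hyp)
{
    assert(ref != NULL && hyp != NULL);

    size_t n = strlen(ref);
    size_t m = strlen(hyp);
    size_t *row = malloc((m + 1) * sizeof *row);
    if (row == NULL)
        return (size_t)-1;

    /* Distance from the empty reference prefix: j insertions. */
    for (size_t j = 0; j <= m; j++)
        row[j] = j;

    for (size_t i = 1; i <= n; i++) {
        size_t diag = row[0];                    /* D[i-1][j-1] */
        row[0] = i;                              /* i deletions */
        for (size_t j = 1; j <= m; j++) {
            size_t up = row[j];                  /* D[i-1][j] */
            size_t cost = (ref[i - 1] == hyp[j - 1]) ? 0 : 1;
            size_t best = diag + cost;           /* match or substitution */
            if (up + 1 < best)
                best = up + 1;                   /* deletion */
            if (row[j - 1] + 1 < best)
                best = row[j - 1] + 1;           /* insertion */
            row[j] = best;
            diag = up;
        }
    }

    size_t d = row[m];
    free(row);
    return d;
}

/* Hypothetical helper: CER = edits / reference length.
 * Returns -1.0 for an empty reference or on allocation failure. */
static double char_error_rate(const char *ref, const char *hyp)
{
    assert(ref != NULL && hyp != NULL);

    size_t len = strlen(ref);
    if (len == 0)
        return -1.0;

    size_t d = edit_distance(ref, hyp);
    if (d == (size_t)-1)
        return -1.0;

    return (double)d / (double)len;
}
```

For example, comparing the reference "abcd" with the OCR output "abxd" gives one substitution over four characters, a CER of 0.25. A real evaluation would also need to align subtitle lines by timestamp before comparing their text.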

Related GitHub Issues
Extract cyrillic tickertape text in Russian from NTV
Extract subtitles in a Chinese newscast
GUI, Burned-in Subtitle Extraction not working
jumps based on uninitialised values
Process closed captions and burned-in subtitles in one pass
DVB subtitles from China
Corrupt or empty subtitles
Terrible OCR results with Channel 5 (UK)

Mentor
Abhinav Shukla (@abhinav95 on Slack), the former Summer of Code student who worked on this last year and did an incredible job.