Conference & Workshop on Assistive Technologies for People with Vision & Hearing Impairments
Assistive Technology for All Ages
CVHI 2007, M.A. Hersh (ed.)

OCR ALGORITHM FOR DETECTION OF SUBTITLES IN TELEVISION AND CINEMA

Morten Jønsson (1), Hans Heinrich Bothe (2)

(1) Informatics and Mathematical Modelling, Technical University of Denmark (DTU). Tel: (+45) 21204719. Email: s021724@student.dtu.dk
(2) Centre for Applied Hearing Research (CAHR), Oersted DTU, Technical University of Denmark (DTU). Tel: (+45) 45253954. Fax: (+45) 45880577. Email: hhb@oersted.dtu.dk

Abstract: The OCR (optical character recognition) algorithm described in this paper is a module in the assistive device SubPal, which is intended to read subtitles from television and cinema aloud. The SubPal system is described in detail in (Nielsen & Bothe, 2007). By sampling the television signal (PAL), a binary image is created. This binary image is analysed by the OCR algorithm to generate text strings that can be passed on to a speech synthesis unit. The requirements and the implementation of the OCR are discussed and some initial results are presented. The algorithm is developed with the purpose of later being implemented in hardware (FPGA).

Keywords: optical character recognition, subtitles, visually impaired people, dyslexia

1. Introduction

Interviews with both visually impaired and dyslexic people have revealed that a large group is cut off from a full understanding of visual media (television, DVD, VHS, cinema) when it is presented in a language they are not comfortable with. Even if a visually impaired person is able to grasp the overall content of the screen, they are unable to read the subtitles. A solution to this is presented in (Nielsen & Bothe, 2007), from which it is clear that a versatile, fast and robust OCR (optical character recognition) is necessary. For sufficient detection speed, this OCR should be implemented in hardware (FPGA). Since no commercial hardware OCR is on the market, this paper presents the initial steps towards such an OCR algorithm.

2. OCR Requirements

Before going into detail with the modules of the OCR, we assess the overall requirements that should be considered with respect to the application (described in (Nielsen & Bothe, 2007)).

• Robustness – each character in the subtitles should be detected with a high detection rate. When a character is not detected correctly, the word should still be recovered by looking it up in a dictionary and selecting the best match.
• Speed – the time needed to detect one word should be proportional to the response time of the speech synthesizer (assuming that the speech synthesizer itself meets the time requirements imposed on the system).
• Adaptivity – the font differs from channel to channel depending on the subtitling company, and the algorithm should be versatile enough to account for this. Noisy backgrounds (e.g. a white T-shirt behind the subtitles) should not reduce the detection rate significantly.
• Orientation – the spatial orientation of the image should be transparent to the algorithm.

Commercial OCR solutions have been surveyed, but the majority of the available solutions target the software market and depend on a specific operating system and hardware architecture, which imposes additional overhead with respect to performance.
The OCR discussed in this paper is intended for the portable device described in (Nielsen & Bothe, 2007) and must comply with the necessary response speed, which implies that special-purpose hardware such as an FPGA (Field Programmable Gate Array) is the feasible platform. Since we have not found such an on-chip OCR solution, we develop one from scratch, first investigating suitable character recognition algorithms, which is the objective of this paper. Implementing the algorithm on an FPGA is left for further studies. Although the existing commercial OCR solutions are not relevant for this device, we can use them as a benchmark against which the effectiveness of the developed algorithm is compared. In the following sections we look into the design of such an OCR.

3. Overview of OCR Modules

The task of recognising characters in television as well as cinema can be divided into the modules illustrated in figure 1. This division is consistent with the standard approach used in OCR systems (Trier et al., 1996). First, the raw composite video signal is sampled to create a binary image from which the subtitles can be extracted. When using the signal from the camera, some spatial adjustment is also necessary to ensure that the lines of text are horizontal in the image, which is a precondition for our OCR algorithm. Next, the binary image is prefiltered to remove noise and enhance characteristic features. After the filtering, the image can be divided into separate lines, words and letters. For each letter, characteristic features, statistical or semantic, are extracted and compared with an existing database (based on a training set). After choosing the most likely letters in a given word, the word is checked against a dictionary to verify that the letter combination is plausible. Each processing step is explained in more detail in the following sections.

Figure 1: Modules in the OCR system

4. Sampling Module

The images used for the character recognition are created by sampling the composite signal from the television/video camera using a Tektronix TDS1002 oscilloscope and applying a threshold. This is described in more detail in (Nielsen & Bothe, 2007). An example of the resulting binary image is shown in figure 2.

Figure 2: Binary image created by sampling and applying a binary threshold

5. Spatial Adjustments

In the case where the images come from the CCD video camera, it will often be necessary to carry out some minor spatial adjustments, since the feature extraction described later is not rotationally invariant. The subtitles consist of one or two lines of densely packed letters with the same orientation. This a priori knowledge can be used for finding the rotation of the image that maximizes the horizontal sum in the frame (corresponding to a horizontal orientation of the subtitles). Another approach would be to use a rotation-invariant feature extraction, such as Transformation Ring Projection (Tang, 1991).
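A minimal sketch of the first approach (the rotation search) is given below. It assumes the sampled binary frame is available as a NumPy array with 0/1 values and scores each candidate angle by the largest row sum, which is one possible reading of "maximizes the horizontal sum". The function and parameter names are illustrative only and are not part of the SubPal implementation.

import numpy as np
from scipy.ndimage import rotate

def estimate_text_rotation(binary_img, max_angle=10.0, step=0.5):
    # Try small candidate rotations and keep the one whose horizontal
    # projection (row sums) has the highest peak; when the subtitle lines
    # are horizontal, the foreground pixels concentrate in a few rows.
    best_angle, best_score = 0.0, -np.inf
    for angle in np.arange(-max_angle, max_angle + step, step):
        rotated = rotate(binary_img, angle, reshape=False, order=0)
        score = rotated.sum(axis=1).max()
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

# Usage (hypothetical): straighten the frame before line/word segmentation
# angle = estimate_text_rotation(frame)
# frame = rotate(frame, angle, reshape=False, order=0)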
6. Preprocessing

The success of the OCR algorithm depends on the initial filtering. The aim of this filtering and segmentation is to separate the text into individual letters and at the same time make each letter as characteristic as possible. To begin with, we lowpass-filter the image to remove high-frequency noise that was not removed in the sampling process. Furthermore, we use dilation (a binary morphology method) to avoid letters being split into several regions (see (Carstensen, 2002) and (Horn, 1986) for details, and figures 4 and 5 for illustration). The dilation is done with a structuring element consisting of two horizontal pixels. This ensures that errors that would split a letter into several regions are corrected, without considerably influencing the characteristic appearance of the letter (illustrated in figure 3). With these preconditions in place, the lines can be separated by simply detecting minima in the horizontal projection. By looking at the vertical projection instead, each line can be separated into words and letters (see figure 6). After the segmentation, each letter is mapped onto a fixed height and width, making it comparable with the letters in the database when extracting the features.

Figure 3: Principle behind dilation
Figure 4: Letter after lowpass filtering and dilation with a two-pixel structuring element
Figure 5: Letter before filtering
Figure 6: Example of how a line can be separated into words using the vertical projection

7. Feature Extraction

To compare each region with our database we need to extract relevant features. We use a combination of simple statistical and semantic features, which minimizes the amount of calculation. The statistical features are relations between area, width/height, background/foreground and the first-order moment. Furthermore, the horizontal and vertical projections (see figure 9) are used. A way to reduce the data from the horizontal and vertical projections is to Fourier-transform the row sums, as described in (Bourbakis & Gumahad, 1991), but with our low resolution it is sufficient to use the sums directly. Before calculating the projections (row and column sums), the letter is mapped onto a fixed size using bicubic interpolation (Carstensen, 2002). Our semantic method detects holes, feet, heads and arms in the letters, as illustrated in figure 8. An improvement could be to additionally use feature point extraction for the detection of intersections and corners (see (Brown, 1992) for details).

8. Classification

The horizontal and vertical projections are compared with the database using cross-correlation. When comparing the statistical features, such as the height/width ratio, foreground/background ratio, area and first-order moment, a normal distribution is assumed. For each letter the mean of this distribution is given in the database, while the variance is chosen for optimal detection. An extracted feature such as the pixel area can then be checked against the distribution of each letter and a probability can be calculated (see figure 7). The results from the projections and the other statistics are finally combined and the most likely letter is chosen. Afterwards the semantic features are used for correction of the most likely misclassifications.

Figure 7: Example of how the area feature of an unknown letter is compared with the letter "x" in the database
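The following sketch illustrates one way the classification step could combine the two sources of evidence: a Gaussian score for each statistical feature around the per-letter mean stored in the database, and a normalised cross-correlation between the projection profiles. The database layout, the variable names and the product used to combine the scores are assumptions made for illustration, not the exact SubPal implementation.

import numpy as np

def classify_letter(features, projections, database, sigma):
    # features: statistical features of the unknown letter (area, width/height
    #   ratio, foreground/background ratio, first-order moment)
    # projections: concatenated row and column sums of the size-normalised letter
    # database: maps each letter to its mean feature vector and reference projections
    # sigma: per-feature standard deviations chosen for optimal detection
    scores = {}
    for letter, entry in database.items():
        # Gaussian likelihood of the statistical features around the stored means
        diff = (features - entry["mean_features"]) / sigma
        stat_score = np.exp(-0.5 * diff ** 2).prod()
        # Normalised cross-correlation of the projection profiles
        p = projections - projections.mean()
        q = entry["projections"] - entry["projections"].mean()
        corr = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q) + 1e-9)
        # Combine both sources of evidence; the most likely letter wins
        scores[letter] = stat_score * max(corr, 0.0)
    return max(scores, key=scores.get), scores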
9. Verification

Even with a good detection rate for each letter, errors will occur from time to time. To cater for these, each word is checked against a dictionary containing the most common words. If the word is not present, the most likely match is chosen, e.g. by finding the dictionary word with the most letters in the correct position. It is then evaluated which of the two words is the most likely, based on the probabilities from the classification. The dictionary lookup is done using the large English database WordNet® (see (Anon, 2006) for licence information). It contains more than 200,000 nouns, verbs, adjectives and adverbs in all conjugations, e.g. run, ran, running, runs, house, houses, housing. We combine this with a database containing words not included in WordNet®, such as pronouns, prepositions etc.

Figure 8: Illustration of semantic features
Figure 9: Plot of the horizontal and vertical projections for the letter "t"

10. Conclusions and Future Work

With the binary images created by sampling the television signal with an oscilloscope, we obtain a letter detection rate of 95% when the same type of font is used in the training set and the test set. Using a commercial OCR (Abbyy FineReader 8.0 Professional Edition, http://buy.abbyy.com/content/frpro/default.aspx) on a resized version of the subtitles, we obtain a detection rate of 96%. With the present solution the database is created using only one type of subtitle font and is therefore only optimal when that font is presented, while the commercial approach produces similar results independently of the font. The aim of these considerations is to obtain an implementation in hardware for optimal response times and optimal power consumption. Further work is necessary to make the OCR more versatile and to fulfil the requirements laid out in section 2. One possibility for doing this could be to implement the classification using an associative neural network.

References

Anon (2006). WordNet 3.0, Princeton University, http://wordnet.princeton.edu/
Bourbakis, N.G. and A.T. Gumahad II (1991). Knowledge-based recognition of typed text characters, International Journal of Pattern Recognition, vol. 5(1-2), pp. 293-310.
Brown, E.W. (1992). Character recognition by feature point extraction, Northeastern University internal paper.
Carstensen, J.M. (2002). Image Analysis, Vision and Computer Graphics, Technical University of Denmark.
Horn, B.K.P. (1986). Robot Vision, The MIT Press.
Nielsen, S. and H.H. Bothe (2007). SubPal: A device for reading aloud subtitles from television and cinema, CVHI conference.
Tang, Y.Y. (1991). Transformation-ring-projection (TRP) algorithm and VLSI implementation, International Journal of Pattern Recognition, vol. 5(1-2), pp. 25-56.
Trier, Ø.D., A.K. Jain and T. Torfinn (1996). Feature extraction methods for character recognition – a survey, Pattern Recognition, vol. 29(4), pp. 641-662.