A Co-training based Framework for Writer Identification in Offline Handwriting

Utkarsh Porwal and Venu Govindaraju
Dept. of Computer Science and Engg., University at Buffalo - SUNY, Amherst, NY - 14228
utkarshp@buffalo.edu, govind@buffalo.edu

Abstract—Traditional forensic document analysis methods have focused on the feature-classification paradigm, where a machine learning based classifier is used to learn discrimination among multiple writers. However, the usage of such techniques is restricted by the availability of a large labeled dataset, which is not always feasible. In this paper, we propose a Co-training based approach that overcomes this limitation by exploiting the independence between multiple views (features) of the data. Two learners are initially trained on different views of a smaller labeled training set, and their initial hypotheses are used to predict labels on a larger unlabeled dataset. Confident predictions from each learner are added back to the training data with the predicted label as the ground truth label, thereby effectively increasing the size of the labeled dataset and improving the overall classification performance. We conduct experiments on the publicly available IAM dataset and illustrate the efficacy of the proposed approach.

Keywords-Writer Identification, Co-training, Classifier, Views, Labeled and Unlabeled data

I. INTRODUCTION

Writer identification is a well studied problem in forensic document analysis where the goal is to correctly label the writer of an unknown handwriting sample. Existing research in this area has sought to address this problem using machine learning techniques, where a large labeled dataset is used to learn a model (supervised learning) that efficiently discriminates between the different writer classes. The key advantage of such learning approaches is their ability to generalize well over unknown test data distributions. However, such generalization yields greater performance only when a large labeled dataset is used. In real-world scenarios, generating large labeled datasets requires manual annotation, which is not always practical. The absence of such datasets also leads to inefficient usage of the available unlabeled data, which could otherwise be exploited to provide greater classification performance. To address these issues, we propose a Co-training based learning framework that learns multiple classifiers on different views (features) of a smaller labeled dataset and uses them to predict labels for an unlabeled dataset, which are further bootstrapped into the labeled data to enhance prediction performance.

Existing literature on writer identification can be broadly classified into two categories. The first category is that of text dependent features, which capture properties of a writer based on the text written. Here, writer identification is done by modeling similar content written by different writers. This reliance on text dependent features poses challenges of scalability; in real world applications such data is seldom available, which limits the usability of these techniques for practical purposes. Said et al. [14] extracted text dependent features using Gabor filters, but the main limitation was the need for a full page of document written by different writers for identification. The second category is based on text independent features, which capture writer specific properties such as slant and loops that are independent of the text written. These techniques are better suited for real life scenarios as they directly model writers, as opposed to the previous category. Feature selection plays an important role in such techniques, and several features capturing different aspects of handwriting have been tried. Zois and Anastassopoulos [15] used morphological features and needed only a single word for identification, and Niels et al. [17] used allographic features compared using Dynamic Time Warping (DTW). All of this work was focused on better feature selection that would result in better accuracy. It did not lay stress on the techniques used, and assumed that a sufficient amount of such data is available for the system to learn.

Likewise, writer identification can also be divided under two major approaches. The first is statistical analysis of several features such as the edge hinge distribution, which captures the change in the direction of writing samples. The second approach is model based writer identification, in which predefined models of the strokes of handwriting are used. The prime focus of these techniques was on building a better identification system using different techniques for modeling and analysis. Techniques such as Latent Dirichlet Allocation (LDA) were proposed for higher identification accuracy [12], but they too were based on the assumption that sufficient training data is available.
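To make the first family concrete, the following is a minimal sketch of an edge-hinge style feature, assuming an ordered contour has already been extracted from a binarized writing sample; the function name, the leg length and the binning are our illustrative choices, not those of the systems cited above.

import numpy as np

def edge_hinge_histogram(contour, leg=5, bins=12):
    # contour: (N, 2) array of ordered contour coordinates.
    # For every hinge point, take the directions of the two contour
    # "legs" emerging from it and accumulate their joint histogram.
    hist = np.zeros((bins, bins))
    for i in range(leg, len(contour) - leg):
        v1 = contour[i - leg] - contour[i]
        v2 = contour[i + leg] - contour[i]
        a1 = np.arctan2(v1[1], v1[0]) % (2 * np.pi)
        a2 = np.arctan2(v2[1], v2[0]) % (2 * np.pi)
        hist[int(a1 * bins / (2 * np.pi)) % bins,
             int(a2 * bins / (2 * np.pi)) % bins] += 1
    total = hist.sum()
    # Normalize so the feature is a probability distribution.
    return (hist / total).ravel() if total > 0 else hist.ravel()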
Existing techniques and methods did not make use of unlabeled data for identification. The information contained in the unlabeled data can make a significant improvement in the performance of the system. This information can be extracted using different techniques such as transductive SVMs [11] or graph based methods using the EM algorithm, which are used to label unlabeled data in a semi supervised framework. Nigam and Ghani [7] later showed that Co-training performs better than these methods in the semi supervised framework. It uses a small snippet of labeled data and iteratively labels some part of the unlabeled data; the system retrains itself after every iteration, which results in better accuracy. Co-training has been successfully used for semi supervised learning in different areas, but, to the best of our knowledge, never for labeling data for writer identification. It has been used for web page classification [1], object detection [5] and visual tracking [4], and extensively in NLP for tasks like named entity recognition [6].

The organization of the paper is as follows. Section 2 provides an overview of the Co-training based framework. Multiple data views in the form of writer features are described in Section 3. Section 4 illustrates the proposed approach. Experimental results are described in Section 5. Section 6 outlines the conclusion.

[Figure 1. Schematic of Proposed Co-training Based Labeling Approach]

II. CO-TRAINING

Co-training is a semi supervised learning algorithm which needs only a small amount of training data to start. It iteratively labels some unlabeled data points and learns from them again. Blum and Mitchell [1] proposed co-training to classify web pages on the internet into faculty and non-faculty web pages. Initially they used a small number of faculty members' web pages to train a classifier, and were able to correctly classify most of the unlabeled pages in the end. Co-training requires two separate views of the data and two learners. Blum and Mitchell [1] proved that co-training works best if the two views are orthogonal to each other and each of them is capable of classification independently. They showed that if the two views are conditionally independent then the accuracy of the classifiers can be increased significantly, because the system is using more information to classify data points. Since both views are sufficient for classification, this brings redundancy, which in turn gives more information. Nigam and Ghani [8] later showed that completely independent views are not required for co-training; it works well even if the two views are not completely uncorrelated.

Co-training is an iterative bootstrapping method which increases the confidence of the learner in each round. It boosts confidence scores like the Expectation Maximization (EM) method, but it works better than EM [7]. In EM all the data points are labeled in each round, while in Co-training only a few of the data points are labeled each round and the classifiers are then retrained. This helps build a better learner in each iteration, which in turn makes better decisions, and hence the overall accuracy of the system increases.
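To make the loop concrete, below is a minimal Python sketch of this iterative scheme, assuming one feature matrix per view and a pluggable select function (selection rules are discussed next and in Section IV); the Random Forest learners anticipate Section IV, and all names are our illustrative choices rather than a definitive implementation.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def co_train(X1_l, X2_l, y_l, X1_u, X2_u, select, cache_size=100, rounds=50):
    # Two learners, one per view, trained on the initial labeled data.
    h1 = RandomForestClassifier().fit(X1_l, y_l)
    h2 = RandomForestClassifier().fit(X2_l, y_l)
    for _ in range(rounds):
        if len(X1_u) == 0:
            break
        # Extract a cache of unlabeled points from U.
        X1_c, X2_c = X1_u[:cache_size], X2_u[:cache_size]
        X1_u, X2_u = X1_u[cache_size:], X2_u[cache_size:]
        # Both learners label the cache; select keeps the confident points.
        keep, labels = select(h1, h2, X1_c, X2_c)
        # Bootstrap the selected points into the labeled set; the rest of
        # the cache is discarded, and both learners are retrained.
        X1_l = np.vstack([X1_l, X1_c[keep]])
        X2_l = np.vstack([X2_l, X2_c[keep]])
        y_l = np.concatenate([y_l, labels])
        h1 = RandomForestClassifier().fit(X1_l, y_l)
        h2 = RandomForestClassifier().fit(X2_l, y_l)
    return h1, h2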
A. Selection Algorithm

The selection of data points is crucial to the performance of the algorithm. New points added in each round should make the learner more confident in making decisions about the labels. Hence, several selection algorithms have been tried, as the system's performance can vary if the selection method is changed, and different methods outperform each other depending on the kind of data and application. One approach to selecting points was based on performance [2]. In this method, some points were selected randomly and added to the labeled set. The system was retrained and its performance was tested on the unlabeled data. This process was repeated for some iterations and the performance of every set of points was recorded. The set of points resulting in the best performance was selected to be added to the labeled set and the rest were discarded. This method was based on the degree of agreement of both learners over the unlabeled data in each round.

Other methods have also been employed, like choosing the top k elements from the newly labeled cache. This is an intuitive approach, as those points were labeled with the highest confidence by the learner. However, Hwa [9] showed that adding the samples with the best confidence scores does not necessarily result in better performance of the classifiers. So, Wang et al. [10] used a different approach in which some data points with the lowest scores were added along with the data points with the highest confidence scores. This method was called the max-t, min-s method, and t and s were optimized for the best performance. In short, several different selection methods have been employed, as the selection of data points in each round is key to the performance of Co-training; two of these heuristics are sketched below.
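The two heuristics just described can be written compactly as follows, where probs is a learner's class-probability matrix over the cache (one row per cache point); the function names are ours, and the sketch assumes confidence is the maximum class probability.

import numpy as np

def top_k(probs, k):
    # Indices of the k cache points labeled with the highest confidence.
    return np.argsort(probs.max(axis=1))[-k:]

def max_t_min_s(probs, t, s):
    # Keep the t most confident and the s least confident points, in the
    # spirit of the max-t, min-s heuristic of Wang et al. [10].
    order = np.argsort(probs.max(axis=1))
    return np.concatenate([order[-t:], order[:s]])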
III. FEATURE SELECTION

The selection of uncorrelated views is important to the working of Co-training. Blum and Mitchell [1] proposed that both views should be sufficient for classification: each learner trained on a view should be a low error classifier. They proved that the error rates of both classifiers decrease during Co-training because of the extra information added to the system, and this extra information directly depends on the degree of uncorrelation. Abney [3] later reformulated the explanation given by [1] for the working of Co-training in terms of a measure of agreement between the learners over unlabeled data, and gave an upper bound on the error rates of the learners based on the measure of their disagreement. Hence, the independence of the two views is crucial for the performance of the system. We chose contour angle features [13] as the first view, and combined structural and concavity (SC) features [18] as the second view. These features can be considered independent as they capture different properties of writing style.

IV. PROPOSED METHOD

Co-training fits naturally to the task of writer identification, as any piece of writing can have different views. Contour angle features and structural and concavity features are two such different views for any handwritten text. They can be considered uncorrelated enough to fit the task of writer identification into the Co-training framework. Co-training also needs two learners to learn over the two views. We used two different instances of Random Forest as learners to normalize the effect of the learner over the views.

Angle features were used to train the first classifier and SC features were used to train the other. In each round, a cache is extracted from the unlabeled data. The cache is labeled by both learners, and some data points are picked from the newly labeled cache by the selection algorithm. The selected data points are added to the training set and the learners are retrained, while the remaining data points in the cache are discarded. This process is repeated until the unlabeled set is empty. Below is the pseudo code for the Co-training algorithm.

Algorithm 1 Co-training Algorithm
Require:
    L1 ← Labeled View One
    L2 ← Labeled View Two
    U ← Unlabeled Data
    H1 ← First Classifier
    H2 ← Second Classifier
Train H1 with L1
Train H2 with L2
repeat
    Extract cache C from U
    U ← U − C
    Label C using H1 and H2
    d ← selection_algo(C), where d ⊂ C
    add_labels(d, H1, H2)
    L1 ← L1 ∪ view one of d
    L2 ← L2 ∪ view two of d
    Retrain H1 on L1
    Retrain H2 on L2
until U is empty

A. Selection Algorithm

The selection algorithm used for selecting data points was based on the agreement of both learners: points on which the confidence of both learners was above a certain threshold were selected. In the case of documents, the accuracy of the classifier will be high if two different views indicate the same label for a data point. The selection method based on randomly selecting data points and checking their performance, as used in [2], was not suitable, since random checking takes time; the approach is not scalable as there are several rounds of processing of subsets of the cache every time a new cache is retrieved. Below is the pseudo code for the selection algorithm. The score function in the algorithm gives the highest value of the confidence scores of the learner for one data point over all writers.

Algorithm 2 Selection Algorithm
Require:
    C ← cache
    t ← threshold
d ← ∅
for each data point c in C do
    if score(c, H1) > t and score(c, H2) > t then
        d ← d ∪ {c}
    end if
    C ← C − {c}
end for
return d
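A runnable rendering of Algorithm 2, under the assumption of scikit-learn style classifiers whose predict_proba output plays the role of the score function; it can be passed as the select argument of the loop sketched in Section II. Algorithm 2 does not specify how a label is resolved when the two learners disagree; taking the more confident learner's label is one plausible choice, ours rather than the paper's.

import numpy as np

def agreement_select(h1, h2, X1_c, X2_c, t=0.6):
    # score(c, H) = max of H's class-probability distribution for c.
    p1, p2 = h1.predict_proba(X1_c), h2.predict_proba(X2_c)
    s1, s2 = p1.max(axis=1), p2.max(axis=1)
    # Keep points where both learners clear the threshold t.
    keep = np.where((s1 > t) & (s2 > t))[0]
    y1 = h1.classes_[p1.argmax(axis=1)]
    y2 = h2.classes_[p2.argmax(axis=1)]
    # Resolve each kept point with the more confident learner's label.
    labels = np.where(s1[keep] >= s2[keep], y1[keep], y2[keep])
    return keep, labels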
Table I
ACCURACY (%) OF CLASSIFIERS WITH THE BASELINE SYSTEM AND CO-TRAINING

                            Full Data   Half Data   One Fourth Data   One Tenth Data
Experiment 1  Baseline        83.73       79.64          74.48            59.00
              Co-training     85.58       80.91          75.55            61.24
Experiment 2  Baseline        80.42       76.72          70.59            52.28
              Co-training     82.47       77.31          72.15            53.94

V. EXPERIMENTS

We used the IAM dataset, which has a total of 4075 line images written by 93 unique writers. We conducted two experiments to test the performance of Co-training against the baseline systems. In the first, we compared the accuracy of the classifiers after Co-training against the baseline methods by adding the scores of both learners: the class distributions of the two learners were added for each data point to produce a joint class distribution score, and the class label with the highest score was assigned to that data point. The second experiment was based on the maximum of the confidence scores of the labels assigned by each learner: each classifier assigns a class label to the data point based on the highest value of its confidence score distribution over all classes, and the class label with the higher score between the two is assigned to the data point.

Our goal is to show that Co-training can be used to label unlabeled data even if only a small amount of labeled data is present in the beginning. Therefore, the experiments were run on datasets of different sizes. We conducted experiments with four different settings of the data: the system was initially trained over the full, half, one fourth and one tenth of the total training data. In the one tenth setting only three samples per class were present. Table I shows that after Co-training the accuracy of the classifiers is better than the baseline system for all sizes of datasets in both experimental settings.
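The two fusion rules used in the experiments can be written compactly as below; p1 and p2 are the two learners' class-probability distributions for a single test point, and the sketch assumes both learners share the same class ordering.

import numpy as np

def sum_rule(p1, p2):
    # Experiment 1: add the class distributions and take the argmax.
    return int(np.argmax(p1 + p2))

def max_rule(p1, p2):
    # Experiment 2: each learner commits to its most confident class;
    # the final label comes from whichever learner is more confident.
    return int(np.argmax(p1)) if p1.max() >= p2.max() else int(np.argmax(p2))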
VI. CONCLUSION

In this paper we presented a Co-training based framework for labeling a large dataset of unlabeled documents with the correct writer identities. Previous work in writer identification focused either on developing better feature selection or on using different techniques for modeling the text of the document; all of it was based on the assumption that a sufficient amount of labeled data is available for training a system. In our work we address the problem of the limited amount of labeled data present in real life applications. Our method iteratively generates more labeled data from unlabeled data. Experimental studies show that the accuracy of the learners on the dataset labeled by Co-training was better than the baseline system. This demonstrates the effectiveness of Co-training for labeling a large dataset of unlabeled documents. In future work we would like to address this problem of limited data using other semi supervised learning methods.

REFERENCES

[1] A. Blum and T. Mitchell, Combining labeled and unlabeled data with co-training, in Proceedings of COLT '98, pp. 92-100, 1998.
[2] S. Clark, J. Curran, and M. Osborne, Bootstrapping POS taggers using unlabelled data, in Proceedings of CoNLL, Edmonton, Canada, pp. 49-55, 2003.
[3] S. Abney, Bootstrapping, in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.
[4] O. Javed, S. Ali and M. Shah, Online detection and classification of moving objects using progressively improving detectors, in Computer Vision and Pattern Recognition, pp. 696-701, 2005.
[5] A. Levin, P. Viola and Y. Freund, Unsupervised improvement of visual detectors using co-training, in Proceedings of the Ninth IEEE International Conference on Computer Vision, ICCV '03, 2003.
[6] M. Collins and Y. Singer, Unsupervised models for named entity classification, in Empirical Methods in Natural Language Processing (EMNLP), 1999.
[7] K. Nigam and R. Ghani, Understanding the behavior of Co-training, in Proceedings of the KDD Workshop on Text Mining, 2000.
[8] K. Nigam and R. Ghani, Analyzing the effectiveness and applicability of co-training, in Proceedings of the Ninth International Conference on Information and Knowledge Management, pp. 86-93, 2000.
[9] R. Hwa, Sample selection for statistical grammar induction, in Proceedings of the Joint SIGDAT Conference on EMNLP and VLC, Hong Kong, China, pp. 45-52, 2000.
[10] W. Wang, Z. Huang and M. Harper, Semi-supervised learning for part-of-speech tagging of Mandarin transcribed speech, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2007.
[11] T. Joachims, Transductive inference for text classification using support vector machines, in Proceedings of the Sixteenth International Conference on Machine Learning, pp. 200-209, 1999.
[12] A. Bhardwaj, M. Reddy, S. Setlur, V. Govindaraju and S. Ramachandrula, Latent Dirichlet allocation based writer identification in offline handwriting, in Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pp. 357-362, 2010.
[13] M. Bulacu and L. Schomaker, Text-independent writer identification and verification using textural and allographic features, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 701-717, 2007.
[14] H. E. S. Said, G. S. Peake, T. N. Tan and K. D. Baker, Personal identification based on handwriting, Pattern Recognition, 33, pp. 149-160, 2000.
[15] E. N. Zois and V. Anastassopoulos, Morphological waveform coding for writer identification, Pattern Recognition, 33(3), pp. 385-398, 2000.
[16] L. Schomaker and M. Bulacu, Automatic writer identification using connected-component contours and edge-based features of uppercase Western script, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 787-798, 2004.
[17] R. Niels, L. Vuurpijl and L. Schomaker, Introducing TRIGRAPH - Trimodal writer identification, in Proceedings of the European Network of Forensic Handwriting Experts, 2005.
[18] J. T. Favata, G. Srikantan and S. N. Srihari, Handprinted character/digit recognition using a multiple feature/resolution philosophy, in Proceedings of the Fourth International Workshop on Frontiers of Handwriting Recognition, 1994.