=Paper=
{{Paper
|id=None
|storemode=property
|title=A Co-training based Framework for Writer Identification in Offline Handwriting
|pdfUrl=https://ceur-ws.org/Vol-768/Paper8.pdf
|volume=Vol-768
|dblpUrl=https://dblp.org/rec/conf/icdar/PorwalG11
}}
==A Co-training based Framework for Writer Identification in Offline Handwriting==
Utkarsh Porwal and Venu Govindaraju
Dept. of Computer Science and Engg., University at Buffalo - SUNY, Amherst, NY - 14228
utkarshp@buffalo.edu, govind@buffalo.edu
Abstract—Traditional forensic document analysis methods have focused on the feature-classification paradigm, where a machine learning based classifier is used to learn discrimination among multiple writers. However, the usage of such techniques is restricted by the availability of a large labeled dataset, which is not always feasible. In this paper, we propose a Co-training based approach that overcomes this limitation by exploiting the independence between multiple views (features) of the data. Two learners are initially trained on different views of a smaller labeled training set, and their initial hypotheses are used to predict labels on a larger unlabeled dataset. Confident predictions from each learner are added back to the training data with the predicted label as the ground truth label, thereby effectively increasing the size of the labeled dataset and improving the overall classification performance. We conduct experiments on the publicly available IAM dataset and illustrate the efficacy of the proposed approach.

Keywords-Writer Identification, Co-training, Classifier, Views, Labeled and Unlabeled data

I. INTRODUCTION

Writer identification is a well studied problem in forensic document analysis where the goal is to correctly label the writer of an unknown handwriting sample. Existing research in this area has sought to address this problem using machine learning techniques, where a large labeled dataset is used to learn a model (supervised learning) that efficiently discriminates between the various writer classes. The key advantage of such learning approaches is their ability to generalize well over unknown test data distributions. However, such generalization yields greater performance only when a large amount of labeled data is available. In real-world scenarios, generating large labeled datasets requires manual annotation, which is not always practical. The absence of such datasets also leads to inefficient usage of the available unlabeled data, which could otherwise be exploited to provide greater classification performance. To address these issues, we propose a Co-training based learning framework that learns multiple classifiers on different views (features) of a smaller labeled dataset and uses them to predict labels for the unlabeled dataset, which are then bootstrapped into the labeled data to enhance prediction performance.

Existing literature on writer identification can be broadly classified into two categories. The first category is that of text dependent features, which capture properties of a writer based on the text written. Here, writer identification is done by modeling similar content written by different writers. This reliance on text dependent features poses challenges of scalability: in real world applications such data is seldom available, which limits the usability of these techniques for practical purposes. Said et al. [14] extracted text dependent features using Gabor filters, but the main limitation was the need for a full page of a document written by the different writers under consideration. The second category is based on text independent features, which capture writer specific properties, such as slant and loops, that are independent of the text written. These techniques are better suited to real life scenarios as they directly model writers, as opposed to the previous category. Feature selection plays an important role in such techniques, and several features capturing different aspects of handwriting have been tried. Zois et al. [15] used morphological features and needed only a single word for identification, and Niels et al. [17] used allographic features compared with Dynamic Time Warping (DTW). All of this work focused on better feature selection, which would result in better accuracy; it did not lay stress on the learning techniques used and assumed that a sufficient amount of such data is available for the system to learn.

Likewise, writer identification methods can also be divided under two major approaches. The first is statistical analysis of several features, such as the edge hinge distribution, which captures the change in direction along writing samples. The second approach is model based writer identification, in which predefined models of handwriting strokes are used. The prime focus of these techniques was on building a better identification system using different techniques for modeling and analysis. Various techniques such as Latent Dirichlet Allocation (LDA) were proposed for higher identification accuracy [12], but they too were based on the assumption that sufficient training data is available.

Existing techniques and methods did not make use of unlabeled data for identification. The information hidden in the unlabeled data can make a significant improvement in the performance of the system.
Figure 1. Schematic of Proposed Co-training Based Labeling Approach
This information can be extracted using different techniques, such as transductive SVMs [11] or graph based methods using the EM algorithm, which are used to label unlabeled data in a semi-supervised framework. Nigam et al. [7] later showed that Co-training performs better than these methods in the semi-supervised setting. Co-training uses a small snippet of labeled data and iteratively labels some part of the unlabeled data; the system retrains itself after every iteration, which results in better accuracy. Co-training has been successfully used for semi-supervised learning in different areas but, to the best of our knowledge, it has never been used for labeling data for writer identification. Co-training has been used for web page classification [1], object detection [5] and visual tracking [4]. It has also been used extensively in NLP for tasks like named entity recognition [6].

The organization of the paper is as follows. Section 2 provides an overview of the Co-training based framework. Multiple data views in the form of writer features are described in Section 3. Section 4 illustrates the proposed approach. Experimental results are described in Section 5. Section 6 outlines the conclusion.

II. CO-TRAINING

Co-training is a semi-supervised learning algorithm which needs only a small amount of training data to start. It iteratively labels some unlabeled data points and learns from them in turn. Blum et al. [1] proposed co-training to classify web pages on the internet into faculty and non-faculty web pages. Initially they used a small number of faculty member web pages to train a classifier, and in the end they were able to correctly classify most of the unlabeled pages. Co-training requires two separate views of the data and two learners. Blum et al. [1] proved that co-training works best if the two views are orthogonal to each other and each of them is capable of classification independently. They showed that if the two views are conditionally independent, then the accuracy of the classifiers can be increased significantly. This is because the system uses more information to classify data points: since both views are sufficient for classification, this brings redundancy, which in turn gives more information. Nigam et al. [8] later proved that completely independent views are not required for co-training; it works well even if the two views are not completely uncorrelated.

Co-training is an iterative bootstrapping method which increases the confidence of the learner in each round. It boosts the confidence scores like the Expectation Maximization (EM) method, but it works better than EM [7]. In EM all the data points are labeled in each round, while in Co-training only a few of the data points are labeled in each round before the classifiers are retrained. This helps build a better learner in each iteration, which in turn makes better decisions, and hence the overall accuracy of the system increases.

A. Selection Algorithm

The selection of data points is crucial to the performance of the algorithm. New points added in each round should make the learner more confident in making decisions about the labels. Hence, several selection algorithms have been tried, as the system's performance can vary if the selection method is changed, and different methods outperform each other depending on the kind of data and the application. One approach to selecting points was based on performance [2]. In this method, some points were selected randomly and added to the labeled set. The system was retrained and its performance was tested on the unlabeled data. This process was repeated for some iterations, and
the performance of every set of points was recorded. The set of points resulting in the best performance was selected to be added to the labeled set, and the rest were discarded. This method was based on the degree of agreement of both learners over the unlabeled data in each round.

Some other methods have also been employed, like choosing the top k elements from the newly labeled cache. This is an intuitive approach, as those points were labeled with the highest confidence by the learner. However, Hwa et al. [9] showed in their work that adding the samples with the best confidence scores does not necessarily result in better performance of the classifiers. So, Wang et al. [10] used a different approach in which some data points with the lowest scores were added along with the data points with the highest confidence scores. This method was called the max-t, min-s method, and t and s were optimized for the best performance. In short, several different selection methods have been employed, as selecting data points in each round is key to the performance of Co-training.
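As an illustration, the following is a minimal Python sketch of one plausible reading of the max-t, min-s rule of [10]: keep the t most confident and the s least confident newly labeled cache points. The function name and the (point, confidence) pair representation are our own assumptions, not taken from [10].

```python
def select_max_t_min_s(scored, t, s):
    """scored: list of (data_point, confidence) pairs for the current cache.
    Returns the t highest- and s lowest-scoring points; assumes
    t + s <= len(scored) so the two slices do not overlap."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return ranked[:t] + ranked[len(ranked) - s:]
```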
III. FEATURE SELECTION

The selection of uncorrelated views is important to the working of Co-training. Blum et al. [1] proposed that both views should be sufficient for classification: each learner trained on a view should be a low error classifier. They proved that the error rates of both classifiers decrease during Co-training because of the extra information added to the system, and this extra information directly depends on the degree of uncorrelation. However, Abney [3] later reformulated the explanation given by [1] for the working of Co-training in terms of a measure of agreement between the learners over unlabeled data, and gave an upper bound on the error rates of the learners based on the measure of their disagreement. Hence, the independence of the two views is crucial for the performance of the system. We chose contour angle features [13] as the first view, and we combined structural and concavity features (SC) [18] as the second view. These features can be considered independent, as they capture different properties of the style of writing.
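As a rough illustration of what such a text-independent view can look like, the sketch below computes a normalized histogram of local contour directions from a handwriting contour. This is a simplified stand-in loosely inspired by the contour angle features of [13], not the exact features used in this paper.

```python
import numpy as np

def contour_angle_view(contour, bins=16):
    """Toy contour-direction view: a normalized histogram of the angles
    between consecutive points of a handwriting contour. `contour` is an
    (N, 2) array of (x, y) points; the real features of [13] are richer."""
    diffs = np.diff(contour, axis=0)               # successive displacements
    angles = np.arctan2(diffs[:, 1], diffs[:, 0])  # direction of each step
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
    return hist / max(hist.sum(), 1)               # histogram as a feature view
```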
IV. PROPOSED METHOD

Co-training fits the task of writer identification naturally, as any piece of writing can have different views. Contour angle features and structural and concavity features are two such views for any handwritten text, and they can be considered uncorrelated enough to fit the task of writer identification into the Co-training framework. Co-training also needs two learners to learn over the two views. We used two different instances of Random Forest as learners to normalize the effect of the learner over the views.

Angle features were used to train the first classifier and SC features were used to train the other. Then, in each round, a cache is extracted from the unlabeled data. This cache is labeled by both learners, and some data points are picked from the newly labeled cache by the selection algorithm. The selected data points are added to the training set and the learners are retrained, while the remaining data points in the cache are discarded. This process is repeated until the unlabeled set is empty. Below is the pseudo code for the Co-training algorithm.

Algorithm 1 Co-trainingAlgo
Require:
  L1 ← Labeled View One
  L2 ← Labeled View Two
  U ← Unlabeled Data
  H1 ← First Classifier
  H2 ← Second Classifier
Train H1 with L1
Train H2 with L2
repeat
  Extract cache C from U
  U ← U − C
  Label C using H1 and H2
  d ← selection_algo(C) where d ⊂ C
  add_labels(d, H1, H2)
  L1 ← L1 ∪ view one of d
  L2 ← L2 ∪ view two of d
  Retrain H1 on L1
  Retrain H2 on L2
until U is empty
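To make the loop concrete, here is a minimal Python sketch of Algorithm 1, assuming scikit-learn style Random Forest learners and NumPy arrays for the two views. The function names (co_train, select), the cache handling, and the use of the joint score distribution for the add_labels step are illustrative assumptions, not taken verbatim from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def co_train(X1, X2, y, U1, U2, select, cache_size=100):
    """Sketch of Algorithm 1. (X1, X2): the two labeled views, y: labels;
    (U1, U2): the two views of the unlabeled pool; `select` implements
    Algorithm 2 and returns the indices of the cache points to keep."""
    h1, h2 = RandomForestClassifier(), RandomForestClassifier()
    h1.fit(X1, y)
    h2.fit(X2, y)
    while len(U1) > 0:
        # Extract a cache C from U and remove it from the pool.
        C1, C2 = U1[:cache_size], U2[:cache_size]
        U1, U2 = U1[cache_size:], U2[cache_size:]
        # Label the cache with both learners and keep confident points.
        p1, p2 = h1.predict_proba(C1), h2.predict_proba(C2)
        keep = select(p1, p2)
        # One plausible add_labels step: label via the joint distribution.
        labels = h1.classes_[np.argmax(p1[keep] + p2[keep], axis=1)]
        # Add both views of the selected points to the labeled sets, retrain.
        X1, X2 = np.vstack([X1, C1[keep]]), np.vstack([X2, C2[keep]])
        y = np.concatenate([y, labels])
        h1.fit(X1, y)
        h2.fit(X2, y)
    return h1, h2
```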
A. Selection Algorithm

The selection algorithm used for selecting data points was based on the agreement of both learners over the data points: points on which the confidence of both learners was above a certain threshold were selected. In the case of documents, the accuracy of the classifier will be high when the two different views indicate the same label for a data point. The selection method based on randomly selecting data points and checking their performance, as used in [2], was not suitable here, as random checking takes time; that approach does not scale because it requires several rounds of processing subsets of the cache every time a new cache is retrieved. Below is the pseudo code for the selection algorithm. The score function in the algorithm gives the highest value of the confidence scores of the learner for one data point over all writers.
Table I
ACCURACY OF CLASSIFIERS WITH BASELINE SYSTEM AND CO-TRAINING

Methods                      Full Data   Half Data   One Fourth Data   One Tenth Data
Experiment 1   Baseline         83.73       79.64             74.48            59.00
               Co-training      85.58       80.91             75.55            61.24
Experiment 2   Baseline         80.42       76.72             70.59            52.28
               Co-training      82.47       77.31             72.15            53.94
Algorithm 2 SelectionAlgo
Require:
  C ← cache
  t ← threshold
  d ← Φ
for each data point c in C do
  if score(c, H1) > t and score(c, H2) > t then
    d ← d ∪ c
  end if
  C ← C − c
end for
return d
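A minimal NumPy sketch of Algorithm 2 follows, assuming each learner outputs an (n_points x n_writers) matrix of confidence scores (e.g. the predict_proba outputs in the co_train sketch above); the threshold value is illustrative.

```python
import numpy as np

def select(p1, p2, t=0.8):
    """Sketch of Algorithm 2: keep the cache points whose highest
    confidence score over all writers exceeds the threshold t under
    BOTH learners; everything else in the cache is discarded."""
    s1 = p1.max(axis=1)   # score(c, H1): best score over all writers
    s2 = p2.max(axis=1)   # score(c, H2)
    return np.where((s1 > t) & (s2 > t))[0]
```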
V. EXPERIMENTS

We used the IAM dataset, which has a total of 4075 line images written by 93 unique writers. We conducted two experiments to test the performance of Co-training against baseline systems. In the first, we compared the accuracy of the classifiers after Co-training against the baseline methods by adding the scores of both learners: the class distribution scores of the two learners were added for each data point, a joint class distribution score was generated, and the class label with the highest score was assigned to that data point. The second experiment was based on the maximum of the confidence scores of the labels assigned by each learner: each classifier assigns a class label to the data point based on the highest value of its confidence score distribution over all classes, and the class label with the higher score between the two learners is assigned to the data point.
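In code, the two decision rules could be sketched as follows, again assuming (n_points x n_classes) score matrices p1 and p2 from the two learners; the function names are our own.

```python
import numpy as np

def sum_rule(p1, p2):
    """Experiment 1: add the class distributions of the two learners
    and assign the label with the highest joint score."""
    return np.argmax(p1 + p2, axis=1)

def max_rule(p1, p2):
    """Experiment 2: each learner commits to its most confident label;
    the label of the more confident learner is assigned."""
    use_h1 = p1.max(axis=1) >= p2.max(axis=1)
    return np.where(use_h1, np.argmax(p1, axis=1), np.argmax(p2, axis=1))
```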
Our goal is to show that Co-training can be used to label unlabeled data even if only a small amount of labeled data is present at the beginning. Therefore, the experiments were run on datasets of different sizes. We conducted experiments with four different settings of the data: the system was initially trained on the full, half, one fourth, and one tenth of the total training data. In the one tenth setting, only three training samples per class were present. Table I shows that after Co-training the accuracy of the classifiers is better than the baseline system for all dataset sizes in both experimental settings.

VI. CONCLUSION

In this paper we presented a Co-training based framework for labeling a large dataset of unlabeled documents with the correct writer identities. Previous work in writer identification focused either on developing better feature selection or on using different techniques for modeling the text of the document; all of it was based on the assumption that a sufficient amount of labeled data is available for training a system. In our work we address the problem of the limited amount of labeled data present in real life applications. Our method iteratively generates more labeled data from unlabeled data. Experimental studies show that the accuracy of learners on the dataset labeled by Co-training was better than that of the baseline system. This demonstrates the effectiveness of Co-training for labeling a large dataset of unlabeled documents. In the future we would like to address this problem of limited data using other semi-supervised learning methods.

REFERENCES

[1] A. Blum and T. Mitchell, Combining labeled and unlabeled data with co-training. In Proceedings of COLT '98, pp. 92-100. 1998.

[2] S. Clark, J. Curran, and M. Osborne, Bootstrapping POS taggers using unlabelled data. In Proceedings of CoNLL, Edmonton, Canada, pp. 49-55. 2003.

[3] S. Abney, Bootstrapping. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 2002.

[4] O. Javed, S. Ali and M. Shah, Online detection and classification of moving objects using progressively improving detectors. In Computer Vision and Pattern Recognition, pp. 696-701. 2005.

[5] A. Levin, P. Viola and Y. Freund, Unsupervised improvement of visual detectors using co-training. In Proceedings of the Ninth IEEE International Conference on Computer Vision, ICCV '03.

[6] M. Collins and Y. Singer, Unsupervised models for named entity classification. In Empirical Methods in Natural Language Processing (EMNLP). 1999.

[7] K. Nigam and R. Ghani, Understanding the behavior of co-training. In Proceedings of the KDD Workshop on Text Mining. 2000.
[8] K. Nigam and R. Ghani, Analyzing the effectiveness and applicability of co-training. In Proceedings of the Ninth International Conference on Information and Knowledge Management, pp. 86-93. 2000.

[9] R. Hwa, Sample selection for statistical grammar induction. In Proceedings of the Joint SIGDAT Conference on EMNLP and VLC, Hong Kong, China, pp. 45-52. 2000.

[10] W. Wang, Z. Huang and M. Harper, Semi-supervised learning for part-of-speech tagging of Mandarin transcribed speech. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2007.

[11] T. Joachims, Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning, pp. 200-209. 1999.

[12] A. Bhardwaj, M. Reddy, S. Setlur, V. Govindaraju and S. Ramachandrula, Latent Dirichlet allocation based writer identification in offline handwriting. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pp. 357-362. 2010.

[13] M. Bulacu and L. Schomaker, Text-independent writer identification and verification using textural and allographic features. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 701-717. 2007.

[14] H. E. S. Said, G. S. Peake, T. N. Tan and K. D. Baker, Personal identification based on handwriting. Pattern Recognition, 33, pp. 149-160. 2000.

[15] E. N. Zois and V. Anastassopoulos, Morphological waveform coding for writer identification. Pattern Recognition, 33(3), pp. 385-398. 2000.

[16] L. Schomaker and M. Bulacu, Automatic writer identification using connected-component contours and edge-based features of uppercase Western script. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 787-798. 2004.

[17] R. Niels, L. Vuurpijl and L. Schomaker, Introducing TRIGRAPH - trimodal writer identification. In Proceedings of the European Network of Forensic Handwriting Experts. 2005.

[18] J. T. Favata, G. Srikantan and S. N. Srihari, Handprinted character/digit recognition using a multiple feature/resolution philosophy. In Proceedings of the Fourth International Workshop on Frontiers in Handwriting Recognition. 1994.