A Co-training based Framework for Writer Identification in Offline Handwriting

Utkarsh Porwal and Venu Govindaraju
Dept. of Computer Science and Engg., University at Buffalo - SUNY, Amherst, NY - 14228
utkarshp@buffalo.edu, govind@buffalo.edu

Abstract—Traditional forensic document analysis methods have focused on the feature-classification paradigm, where a machine learning based classifier is used to learn discrimination among multiple writers. However, the usage of such techniques is restricted by the availability of a large labeled dataset, which is not always feasible. In this paper, we propose a Co-training based approach that overcomes this limitation by exploiting the independence between multiple views (features) of the data. Two learners are initially trained on different views of a smaller labeled training set, and their initial hypotheses are used to predict labels on a larger unlabeled dataset. Confident predictions from each learner are added back to the training data with the predicted label as the ground truth label, thereby effectively increasing the size of the labeled dataset and improving the overall classification performance. We conduct experiments on the publicly available IAM dataset and illustrate the efficacy of the proposed approach.

Keywords-Writer Identification, Co-training, Classifier, Views, Labeled and Unlabeled data

I. INTRODUCTION

Writer identification is a well studied problem in forensic document analysis where the goal is to correctly label the writer of an unknown handwriting sample. Existing research in this area has sought to address this problem using machine learning techniques, where a large labeled dataset is used to learn a model (supervised learning) that efficiently discriminates between the different writer classes. The key advantage of such learning approaches is their ability to generalize well over unknown test data distributions. However, such generalization yields greater performance only when a large labeled dataset is used. In real-world scenarios, generating large labeled datasets requires manual annotation, which is not always practical. The absence of such datasets also leads to inefficient usage of the available unlabeled data, which could otherwise be exploited to provide greater classification performance. To address these issues, we propose a Co-training based learning framework that learns multiple classifiers on different views (features) of a smaller labeled dataset and uses them to predict labels for an unlabeled dataset, which are further bootstrapped into the labeled data to enhance prediction performance.

Existing literature on writer identification can be broadly classified into two categories. The first category is that of text dependent features, which capture properties of a writer based on the text written. Here, writer identification is done by modeling similar content written by different writers. This reliance on text dependent features poses challenges of scalability; in real world applications such data is seldom available, which limits the usability of these techniques for practical purposes. Said et al. [14] extracted text dependent features using Gabor filters, but the main limitation was the need for a full page of document written by different writers for identification. The second category is based on text independent features, which capture writer specific properties such as slant and loops that are independent of the text written. These techniques are better suited for real life scenarios as they directly model writers, as opposed to the previous category. Feature selection plays an important role in such techniques, and several features capturing different aspects of handwriting have been tried. Zois and Anastassopoulos [15] used morphological features and needed only a single word for identification, and Niels et al. [17] used allographic features compared using Dynamic Time Warping (DTW). All of this work was focused on better feature selection that would result in better accuracy. It did not lay stress on the techniques used, and assumed that a sufficient amount of such data is available for the system to learn.

Likewise, writer identification can also be divided under two major approaches. The first is statistical analysis of several features such as the edge hinge distribution, which captures the change in the direction of writing samples. The second approach is model based writer identification, in which predefined models of the strokes of handwriting are used. The prime focus of these techniques was on building a better identification system using different techniques for modeling and analysis. Techniques such as Latent Dirichlet Allocation (LDA) were proposed for higher identification accuracy [12], but they too were based on the assumption that sufficient training data is available.
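To make the first family concrete, the following is a minimal sketch of an edge-hinge style feature, assuming an ordered contour has already been extracted from a binarized writing sample; the function name, the leg length and the binning are our illustrative choices, not those of the systems cited above.

import numpy as np

def edge_hinge_histogram(contour, leg=5, bins=12):
    # contour: (N, 2) array of ordered contour coordinates.
    # For every hinge point, take the directions of the two contour
    # "legs" emerging from it and accumulate their joint histogram.
    hist = np.zeros((bins, bins))
    for i in range(leg, len(contour) - leg):
        v1 = contour[i - leg] - contour[i]
        v2 = contour[i + leg] - contour[i]
        a1 = np.arctan2(v1[1], v1[0]) % (2 * np.pi)
        a2 = np.arctan2(v2[1], v2[0]) % (2 * np.pi)
        hist[int(a1 * bins / (2 * np.pi)) % bins,
             int(a2 * bins / (2 * np.pi)) % bins] += 1
    total = hist.sum()
    # Normalize so the feature is a probability distribution.
    return (hist / total).ravel() if total > 0 else hist.ravel()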
Existing techniques and methods did not make use of unlabeled data for identification. The information contained in the unlabeled data can make a significant improvement in the performance of the system. This information can be extracted using different techniques such as transductive SVMs [11] or graph based methods using the EM algorithm, which are used to label unlabeled data in a semi supervised framework. Nigam and Ghani [7] later showed that Co-training performs better than these methods in the semi supervised framework. It uses a small snippet of labeled data and iteratively labels some part of the unlabeled data; the system retrains itself after every iteration, which results in better accuracy. Co-training has been successfully used for semi supervised learning in different areas, but, to the best of our knowledge, never for labeling data for writer identification. It has been used for web page classification [1], object detection [5] and visual tracking [4], and extensively in NLP for tasks like named entity recognition [6].

The organization of the paper is as follows. Section 2 provides an overview of the Co-training based framework. Multiple data views in the form of writer features are described in Section 3. Section 4 illustrates the proposed approach. Experimental results are described in Section 5. Section 6 outlines the conclusion.

[Figure 1. Schematic of Proposed Co-training Based Labeling Approach]

II. CO-TRAINING

Co-training is a semi supervised learning algorithm which needs only a small amount of training data to start. It iteratively labels some unlabeled data points and learns from them again. Blum and Mitchell [1] proposed co-training to classify web pages on the internet into faculty and non-faculty web pages. Initially they used a small number of faculty members' web pages to train a classifier, and were able to correctly classify most of the unlabeled pages in the end. Co-training requires two separate views of the data and two learners. Blum and Mitchell [1] proved that co-training works best if the two views are orthogonal to each other and each of them is capable of classification independently. They showed that if the two views are conditionally independent then the accuracy of the classifiers can be increased significantly, because the system is using more information to classify data points. Since both views are sufficient for classification, this brings redundancy, which in turn gives more information. Nigam and Ghani [8] later showed that completely independent views are not required for co-training; it works well even if the two views are not completely uncorrelated.

Co-training is an iterative bootstrapping method which increases the confidence of the learner in each round. It boosts confidence scores like the Expectation Maximization (EM) method, but it works better than EM [7]. In EM all the data points are labeled in each round, while in Co-training only a few of the data points are labeled each round and the classifiers are then retrained. This helps build a better learner in each iteration, which in turn makes better decisions, and hence the overall accuracy of the system increases.
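To make the loop concrete, below is a minimal Python sketch of this iterative scheme, assuming one feature matrix per view and a pluggable select function (selection rules are discussed next and in Section IV); the Random Forest learners anticipate Section IV, and all names are our illustrative choices rather than a definitive implementation.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def co_train(X1_l, X2_l, y_l, X1_u, X2_u, select, cache_size=100, rounds=50):
    # Two learners, one per view, trained on the initial labeled data.
    h1 = RandomForestClassifier().fit(X1_l, y_l)
    h2 = RandomForestClassifier().fit(X2_l, y_l)
    for _ in range(rounds):
        if len(X1_u) == 0:
            break
        # Extract a cache of unlabeled points from U.
        X1_c, X2_c = X1_u[:cache_size], X2_u[:cache_size]
        X1_u, X2_u = X1_u[cache_size:], X2_u[cache_size:]
        # Both learners label the cache; select keeps the confident points.
        keep, labels = select(h1, h2, X1_c, X2_c)
        # Bootstrap the selected points into the labeled set; the rest of
        # the cache is discarded, and both learners are retrained.
        X1_l = np.vstack([X1_l, X1_c[keep]])
        X2_l = np.vstack([X2_l, X2_c[keep]])
        y_l = np.concatenate([y_l, labels])
        h1 = RandomForestClassifier().fit(X1_l, y_l)
        h2 = RandomForestClassifier().fit(X2_l, y_l)
    return h1, h2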
A. Selection Algorithm

The selection of data points is crucial to the performance of the algorithm. New points added in each round should make the learner more confident in making decisions about the labels. Hence, several selection algorithms have been tried, as the system's performance can vary if the selection method is changed, and different methods outperform each other depending on the kind of data and application. One approach to selecting points was based on performance [2]. In this method, some points were selected randomly and added to the labeled set. The system was retrained and its performance was tested on the unlabeled data. This process was repeated for some iterations and the performance of every set of points was recorded. The set of points resulting in the best performance was selected to be added to the labeled set and the rest were discarded. This method was based on the degree of agreement of both learners over the unlabeled data in each round.

Other methods have also been employed, like choosing the top k elements from the newly labeled cache. This is an intuitive approach, as those points were labeled with the highest confidence by the learner. However, Hwa [9] showed that adding the samples with the best confidence scores does not necessarily result in better performance of the classifiers. So, Wang et al. [10] used a different approach in which some data points with the lowest scores were added along with the data points with the highest confidence scores. This method was called the max-t, min-s method, and t and s were optimized for the best performance. In short, several different selection methods have been employed, as the selection of data points in each round is key to the performance of Co-training; two of these heuristics are sketched below.
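The two heuristics just described can be written compactly as follows, where probs is a learner's class-probability matrix over the cache (one row per cache point); the function names are ours, and the sketch assumes confidence is the maximum class probability.

import numpy as np

def top_k(probs, k):
    # Indices of the k cache points labeled with the highest confidence.
    return np.argsort(probs.max(axis=1))[-k:]

def max_t_min_s(probs, t, s):
    # Keep the t most confident and the s least confident points, in the
    # spirit of the max-t, min-s heuristic of Wang et al. [10].
    order = np.argsort(probs.max(axis=1))
    return np.concatenate([order[-t:], order[:s]])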
III. FEATURE SELECTION

The selection of uncorrelated views is important to the working of Co-training. Blum and Mitchell [1] proposed that both views should be sufficient for classification: each learner trained on a view should be a low error classifier. They proved that the error rates of both classifiers decrease during Co-training because of the extra information added to the system, and this extra information directly depends on the degree of uncorrelation. Abney [3] later reformulated the explanation given by [1] for the working of Co-training in terms of a measure of agreement between the learners over unlabeled data, and gave an upper bound on the error rates of the learners based on the measure of their disagreement. Hence, the independence of the two views is crucial for the performance of the system. We chose contour angle features [13] as the first view, and combined structural and concavity (SC) features [18] as the second view. These features can be considered independent as they capture different properties of writing style.

IV. PROPOSED METHOD

Co-training fits naturally to the task of writer identification, as any piece of writing can have different views. Contour angle features and structural and concavity features are two such different views for any handwritten text. They can be considered uncorrelated enough to fit the task of writer identification into the Co-training framework. Co-training also needs two learners to learn over the two views. We used two different instances of Random Forest as learners to normalize the effect of the learner over the views.

Angle features were used to train the first classifier and SC features were used to train the other. In each round, a cache is extracted from the unlabeled data. The cache is labeled by both learners, and some data points are picked from the newly labeled cache by the selection algorithm. The selected data points are added to the training set and the learners are retrained, while the remaining data points in the cache are discarded. This process is repeated until the unlabeled set is empty. Below is the pseudo code for the Co-training algorithm.

Algorithm 1 Co-training Algorithm
Require:
    L1 ← Labeled View One
    L2 ← Labeled View Two
    U ← Unlabeled Data
    H1 ← First Classifier
    H2 ← Second Classifier
Train H1 with L1
Train H2 with L2
repeat
    Extract cache C from U
    U ← U − C
    Label C using H1 and H2
    d ← selection_algo(C), where d ⊂ C
    add_labels(d, H1, H2)
    L1 ← L1 ∪ view one of d
    L2 ← L2 ∪ view two of d
    Retrain H1 on L1
    Retrain H2 on L2
until U is empty

A. Selection Algorithm

The selection algorithm used for selecting data points was based on the agreement of both learners: points on which the confidence of both learners was above a certain threshold were selected. In the case of documents, the accuracy of the classifier will be high if two different views indicate the same label for a data point. The selection method based on randomly selecting data points and checking their performance, as used in [2], was not suitable, since random checking takes time; the approach is not scalable as there are several rounds of processing of subsets of the cache every time a new cache is retrieved. Below is the pseudo code for the selection algorithm. The score function in the algorithm gives the highest value of the confidence scores of the learner for one data point over all writers.

Algorithm 2 Selection Algorithm
Require:
    C ← cache
    t ← threshold
d ← ∅
for each data point c in C do
    if score(c, H1) > t and score(c, H2) > t then
        d ← d ∪ {c}
    end if
    C ← C − {c}
end for
return d
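A runnable rendering of Algorithm 2, under the assumption of scikit-learn style classifiers whose predict_proba output plays the role of the score function; it can be passed as the select argument of the loop sketched in Section II. Algorithm 2 does not specify how a label is resolved when the two learners disagree; taking the more confident learner's label is one plausible choice, ours rather than the paper's.

import numpy as np

def agreement_select(h1, h2, X1_c, X2_c, t=0.6):
    # score(c, H) = max of H's class-probability distribution for c.
    p1, p2 = h1.predict_proba(X1_c), h2.predict_proba(X2_c)
    s1, s2 = p1.max(axis=1), p2.max(axis=1)
    # Keep points where both learners clear the threshold t.
    keep = np.where((s1 > t) & (s2 > t))[0]
    y1 = h1.classes_[p1.argmax(axis=1)]
    y2 = h2.classes_[p2.argmax(axis=1)]
    # Resolve each kept point with the more confident learner's label.
    labels = np.where(s1[keep] >= s2[keep], y1[keep], y2[keep])
    return keep, labels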
Table I
ACCURACY (%) OF CLASSIFIERS WITH THE BASELINE SYSTEM AND CO-TRAINING

                            Full Data   Half Data   One Fourth Data   One Tenth Data
Experiment 1  Baseline        83.73       79.64          74.48            59.00
              Co-training     85.58       80.91          75.55            61.24
Experiment 2  Baseline        80.42       76.72          70.59            52.28
              Co-training     82.47       77.31          72.15            53.94

V. EXPERIMENTS

We used the IAM dataset, which has a total of 4075 line images written by 93 unique writers. We conducted two experiments to test the performance of Co-training against the baseline systems. In the first, we compared the accuracy of the classifiers after Co-training against the baseline methods by adding the scores of both learners: the class distributions of the two learners were added for each data point to produce a joint class distribution score, and the class label with the highest score was assigned to that data point. The second experiment was based on the maximum of the confidence scores of the labels assigned by each learner: each classifier assigns a class label to the data point based on the highest value of its confidence score distribution over all classes, and the class label with the higher score between the two is assigned to the data point.

Our goal is to show that Co-training can be used to label unlabeled data even if only a small amount of labeled data is present in the beginning. Therefore, the experiments were run on datasets of different sizes. We conducted experiments with four different settings of the data: the system was initially trained over the full, half, one fourth and one tenth of the total training data. In the one tenth setting only three samples per class were present. Table I shows that after Co-training the accuracy of the classifiers is better than the baseline system for all sizes of datasets in both experimental settings.
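The two fusion rules used in the experiments can be written compactly as below; p1 and p2 are the two learners' class-probability distributions for a single test point, and the sketch assumes both learners share the same class ordering.

import numpy as np

def sum_rule(p1, p2):
    # Experiment 1: add the class distributions and take the argmax.
    return int(np.argmax(p1 + p2))

def max_rule(p1, p2):
    # Experiment 2: each learner commits to its most confident class;
    # the final label comes from whichever learner is more confident.
    return int(np.argmax(p1)) if p1.max() >= p2.max() else int(np.argmax(p2))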
VI. CONCLUSION

In this paper we presented a Co-training based framework for labeling a large dataset of unlabeled documents with the correct writer identities. Previous work in writer identification focused either on developing better feature selection or on using different techniques for modeling the text of the document; all of it was based on the assumption that a sufficient amount of labeled data is available for training a system. In our work we address the problem of the limited amount of labeled data present in real life applications. Our method iteratively generates more labeled data from unlabeled data. Experimental studies show that the accuracy of the learners on the dataset labeled by Co-training was better than the baseline system. This demonstrates the effectiveness of Co-training for labeling a large dataset of unlabeled documents. In future work we would like to address this problem of limited data using other semi supervised learning methods.

REFERENCES

[1] A. Blum and T. Mitchell, Combining labeled and unlabeled data with co-training, in Proceedings of COLT '98, pp. 92-100, 1998.
[2] S. Clark, J. Curran, and M. Osborne, Bootstrapping POS taggers using unlabelled data, in Proceedings of CoNLL, Edmonton, Canada, pp. 49-55, 2003.
[3] S. Abney, Bootstrapping, in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002.
[4] O. Javed, S. Ali and M. Shah, Online detection and classification of moving objects using progressively improving detectors, in Computer Vision and Pattern Recognition, pp. 696-701, 2005.
[5] A. Levin, P. Viola and Y. Freund, Unsupervised improvement of visual detectors using co-training, in Proceedings of the Ninth IEEE International Conference on Computer Vision, ICCV '03, 2003.
[6] M. Collins and Y. Singer, Unsupervised models for named entity classification, in Empirical Methods in Natural Language Processing (EMNLP), 1999.
[7] K. Nigam and R. Ghani, Understanding the behavior of Co-training, in Proceedings of the KDD Workshop on Text Mining, 2000.
[8] K. Nigam and R. Ghani, Analyzing the effectiveness and applicability of co-training, in Proceedings of the Ninth International Conference on Information and Knowledge Management, pp. 86-93, 2000.
[9] R. Hwa, Sample selection for statistical grammar induction, in Proceedings of the Joint SIGDAT Conference on EMNLP and VLC, Hong Kong, China, pp. 45-52, 2000.
[10] W. Wang, Z. Huang and M. Harper, Semi-supervised learning for part-of-speech tagging of Mandarin transcribed speech, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2007.
[11] T. Joachims, Transductive inference for text classification using support vector machines, in Proceedings of the Sixteenth International Conference on Machine Learning, pp. 200-209, 1999.
[12] A. Bhardwaj, M. Reddy, S. Setlur, V. Govindaraju and S. Ramachandrula, Latent Dirichlet allocation based writer identification in offline handwriting, in Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pp. 357-362, 2010.
[13] M. Bulacu and L. Schomaker, Text-independent writer identification and verification using textural and allographic features, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 701-717, 2007.
[14] H. E. S. Said, G. S. Peake, T. N. Tan and K. D. Baker, Personal identification based on handwriting, Pattern Recognition, 33, pp. 149-160, 2000.
[15] E. N. Zois and V. Anastassopoulos, Morphological waveform coding for writer identification, Pattern Recognition, 33(3), pp. 385-398, 2000.
[16] L. Schomaker and M. Bulacu, Automatic writer identification using connected-component contours and edge-based features of uppercase Western script, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 787-798, 2004.
[17] R. Niels, L. Vuurpijl and L. Schomaker, Introducing TRIGRAPH - Trimodal writer identification, in Proceedings of the European Network of Forensic Handwriting Experts, 2005.
[18] J. T. Favata, G. Srikantan and S. N. Srihari, Handprinted character/digit recognition using a multiple feature/resolution philosophy, in Proceedings of the Fourth International Workshop on Frontiers of Handwriting Recognition, 1994.