
Hybrid Language Segmentation for Historical Documents

Alfter

David

Bizzoni

University of Gothenburg

English. Language segmentation, i.e. the division of a multilingual text into monolingual fragments, has been addressed in the past, but its application to historical documents has been largely unexplored. We propose a method for language segmentation of multilingual historical documents. For documents that contain a mix of high- and low-resource languages, we leverage the high availability of high-resource language material and use unsupervised methods for the low-resource parts. We show that our method outperforms previous efforts in this field.

Italiano. La segmentazione del linguaggio, la divisione di un testo multilingue in frammenti monolingue, è stata affrontata nel passato, ma la sua applicazione a documenti storici è rimasta in gran parte inesplorata. Proponiamo un metodo per la segmentazione linguistica di documenti storici multilingue. Per documenti che contengono sia lingue ad alta disponibilità di risorse che lingue sottorappresentate, utilizziamo a nostro vantaggio l’elevata disponibilità delle lingue con un’ampia gamma di risorse e impieghiamo sistemi non supervisionati per le parti che dispongono di un minor numero di risorse. Mostriamo che il nostro metodo supera gli sforzi precedenti in questo settore.

1 Introduction

The computational processing of historical documents presents challenges that modern documents do not; often there is no standard orthography, and the documents may interleave multiple languages (Garrette et al., 2015). Furthermore, the languages used in the documents may by now be considered dead languages.

This work addresses the issue of language segmentation, i.e. segmenting a multilingual text into monolingual fragments for further processing. This task has been addressed in the past using supervised and weakly supervised methods such as trained language models (Řehůřek and Kolkus, 2009; King and Abney, 2013) and unsupervised methods (Biemann and Teresniak, 2005; Yamaguchi and Tanaka-Ishii, 2012; Alfter, 2015a), and it has been applied to short messages (Porta, 2014; Alfter, 2015b) and to historical documents with regard to OCR tasks (Garrette et al., 2015). Nevertheless, there is still room for improvement, especially concerning historical documents.

Due to the scarcity of multilingual corpora (Lui et al., 2014), a popular approach is to use monolingual training data. However, in the case of historical documents, the number of available texts in a given historical language might be too low to yield representative language models.

We propose a method that works on texts containing at least one high-resource language and at least one low-resource language. The intuition is to use supervised and weakly supervised methods for the high-resource languages and unsupervised methods for the low-resource languages to arrive at a better language segmentation: supervised methods derived from high-resource languages single out these languages, while unsupervised algorithms tackle the remaining unknown language(s) and cluster them by similarity.

The presented approach is extendable to more than one high-resource language, in which case a separate language model has to be trained for each language; the approach is also applicable to more than one low-resource language, where the unsupervised methods are expected to produce an accurate split of all languages present.

2 Hybrid language segmentation

Let $D = w_1 \ldots w_n$ be a document consisting of the words $w_1$ to $w_n$. Let $L_h$ be a character-level n-gram language model trained on data for a high-resource language which occurs in the document $D$. We first apply the language model $L_h$ to the document $D$ and assign each word $w_i$ the probability given by $L_h$ (1).

$$\forall w_i \in D : P(w_i) = L_h(w_i) \qquad (1)$$

The language model $L_h$ is implemented as a trigram language model with non-linear back-off. For testing purposes, we trained a language model on a dump of the English Wikipedia (3 GB of compressed data).
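As an illustration of how a character-level trigram model can assign a score to each word, the following is a minimal sketch and not the implementation used in this work. The class name `CharTrigramModel`, the multiplicative back-off penalty of 0.4, the add-one unigram fallback and the length normalisation are all assumptions made for this example; the non-linear back-off mentioned above is not reproduced.

```python
import math
from collections import Counter


class CharTrigramModel:
    """Character trigram model with a simple multiplicative back-off (assumed)."""

    def __init__(self, backoff=0.4):
        self.backoff = backoff  # hypothetical penalty applied at each back-off step
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()

    def train(self, words):
        for word in words:
            padded = "^^" + word + "$"  # '^' and '$' mark word boundaries
            for i in range(len(padded)):
                self.uni[padded[i]] += 1
                if i + 2 <= len(padded):
                    self.bi[padded[i:i + 2]] += 1
                if i + 3 <= len(padded):
                    self.tri[padded[i:i + 3]] += 1

    def _char_score(self, context, char):
        """Score `char` given the two preceding characters in `context`."""
        if self.tri[context + char] > 0:
            return self.tri[context + char] / self.bi[context]
        if self.bi[context[1] + char] > 0:
            return self.backoff * self.bi[context[1] + char] / self.uni[context[1]]
        # Last resort: add-one smoothed unigram estimate, penalised twice.
        total = sum(self.uni.values())
        return self.backoff ** 2 * (self.uni[char] + 1) / (total + len(self.uni) + 1)

    def score(self, word):
        """Length-normalised score of a word (geometric mean over characters)."""
        padded = "^^" + word + "$"
        log_sum = sum(
            math.log(self._char_score(padded[i - 2:i], padded[i]))
            for i in range(2, len(padded))
        )
        return math.exp(log_sum / (len(padded) - 2))


model = CharTrigramModel()
model.train(["the", "then", "there", "other", "this", "that"])
english_like = model.score("the")
foreign_like = model.score("xqzv")
```

With such a toy training set, a word built from frequent trigrams receives a higher score than a word made of unseen characters, which is the property the first step of the method relies on.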

Under the assumption that the text contains at least two languages with at least one word from each language, we determine the minimum probability $P_{min}$ for a split (2). This probability corresponds to the lowest probability assigned by the language model $L_h$ to any word in the text.

$$P_{min} = \min_{i=1 \ldots n} P(w_i) \qquad (2)$$

Next, we determine the maximum probability distance $P_a$ between adjacent words (3) and the global maximum probability distance $P_g$ between any two words (4).

$$P_a = \max_{i=2 \ldots n} \left( P(w_{i-1}) - P(w_i) \right) \qquad (3)$$

$$P_g = \max_{i=1 \ldots n,\, j=1 \ldots n} \left( P(w_i) - P(w_j) \right) \qquad (4)$$

We also calculate the mean probability $P_{mean}$ of the two adjacent words $w_i$ and $w_j$ which maximize $P_a$ (5).

$$P_{mean} = \frac{P(w_i) + P(w_j)}{2} \qquad (5)$$

Finally, we calculate the sharpest drop in probabilities and define $P_{mindrop}$ as the probability at the lowest point of the drop (6).

$$P_{mindrop} = \max_{i=3 \ldots n} \left( P(w_{i-2}) - P(w_{i-1}) + P(w_{i-1}) - P(w_i) \right) \qquad (6)$$
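The quantities of equations (2) to (6) can be computed directly from the list of word probabilities. The sketch below assumes that the differences in equations (3), (4) and (6) are signed differences (a drop from one word to a later one); the function name `split_statistics` and the dictionary keys are chosen for this example only.

```python
def split_statistics(probs):
    """Compute the quantities of equations (2)-(6) for a list of word probabilities."""
    n = len(probs)
    if n < 3:
        raise ValueError("equation (6) needs at least three words")

    p_min = min(probs)  # (2)

    # 0-based index of the word that ends the largest drop between neighbours.
    i_star = max(range(1, n), key=lambda i: probs[i - 1] - probs[i])
    p_a = probs[i_star - 1] - probs[i_star]  # (3)

    # (4): the largest difference between any two words is max minus min.
    p_g = max(probs) - min(probs)

    # (5): mean of the two adjacent words that maximise (3).
    p_mean = (probs[i_star - 1] + probs[i_star]) / 2

    # (6): largest drop accumulated over three consecutive words.
    p_mindrop = max(
        (probs[i - 2] - probs[i - 1]) + (probs[i - 1] - probs[i])
        for i in range(2, n)
    )

    return {
        "p_min": p_min,
        "p_a": p_a,
        "p_g": p_g,
        "p_mean": p_mean,
        "p_mindrop": p_mindrop,
    }


stats = split_statistics([0.9, 0.8, 0.1, 0.2])
```

For the toy input `[0.9, 0.8, 0.1, 0.2]`, the largest adjacent drop lies between the second and third word, so `p_mean` is the mean of 0.8 and 0.1.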

We then set a preliminary language split threshold $P_{split}$ based on $P_{min}$, $P_a$, $P_g$, $P_{mean}$ and $P_{mindrop}$ (7).

Psplit = Pa+Pg +Pmean + Pmindrop 3 2 2 (7)

In a first step, every word $w_i$ with a probability $P(w_i)$ above the split threshold $P_{split}$ is considered to belong to the high-resource language modeled by $L_h$ and is tagged as such, while every word $w_j$ with a probability below the split threshold is considered to belong to an unknown language and is left untagged.
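The first step can be sketched as a simple thresholding pass. In this sketch the threshold is passed in as a parameter rather than computed, and the function name, the label `"HIGH"` and the treatment of words whose probability equals the threshold (left untagged) are assumptions for illustration.

```python
def tag_high_resource(words, probs, p_split, label="HIGH"):
    """Tag words whose probability lies above the split threshold.

    Words at or below the threshold keep the tag None (untagged); how ties
    are handled is an assumption of this sketch.
    """
    if len(words) != len(probs):
        raise ValueError("words and probs must have the same length")
    return [(w, label if p > p_split else None) for w, p in zip(words, probs)]


def untagged_words(tagged):
    """Collect the untagged words, in order, as input for the second step."""
    return [w for w, tag in tagged if tag is None]


tagged = tag_high_resource(
    ["to", "cook", "pacati", "roast"], [0.7, 0.6, 0.1, 0.65], p_split=0.4
)
remaining = untagged_words(tagged)
```

The list returned by `untagged_words` corresponds to the untagged material that is treated as one text in the second step.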

In a second step, all untagged words are clustered by similarity. This is done using language model induction (Alfter, 2015a). All words left untagged by the previous step are regarded as one text. From the first word $w_1$, an initial language model $L_i$ is created. The next word $w_2$ is tested against the initial model. If the probability $P(w_2 \mid L_i)$ exceeds a certain threshold value, the model is updated with $w_2$; otherwise a new model is created. In this way, we iterate through the text, creating language models as necessary. The same procedure is repeated starting from the last word and moving towards the beginning of the text. From the two sets of induced language models (forward, backward), the most similar models according to their n-gram distribution are then merged. This process is repeated, keeping the previously merged models, until no more models are induced.

Each word is then tagged with the language model $L_m$ (i.e. cluster) which maximizes $P(w \mid L_m)$.
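The induction and tagging steps can be illustrated with a strongly simplified sketch. It is not the procedure of Alfter (2015a): the share of a word's character bigrams already seen by a model stands in for the model probability, the threshold of 0.5 is a hypothetical value, each word is tested against the most recently created model only, and the merging of forward and backward models is omitted. All function names are chosen for this example.

```python
from collections import Counter


def char_ngrams(word, n=2):
    """Character n-grams of a word padded with boundary markers."""
    padded = "^" + word + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def model_score(word, model):
    """Share of the word's bigrams known to the model (stand-in for a probability)."""
    grams = char_ngrams(word)
    return sum(1 for g in grams if g in model) / len(grams)


def induce_models(words, threshold=0.5):
    """One induction pass: update the current model or start a new one."""
    models = []
    current = None
    for word in words:
        if current is not None and model_score(word, current) > threshold:
            current.update(char_ngrams(word))
        else:
            current = Counter(char_ngrams(word))
            models.append(current)
    return models


def tag_words(words, models):
    """Tag each word with the index of the model that scores it highest."""
    return [
        max(range(len(models)), key=lambda k: model_score(w, models[k]))
        for w in words
    ]


text = ["pacati", "pacato", "xyz"]
forward = induce_models(text)
backward = induce_models(list(reversed(text)))  # same pass from the end of the text
tags = tag_words(["pacati", "xyz"], forward)
```

In this toy run the two Pali forms share enough bigrams to end up in one model, while the unrelated string starts a second model.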

Finally, all words are evaluated in a local context using variable-length Markov Models (VMM). This step aims at eliminating inconsistencies, detecting other-language inclusions and merging back together same-language fragments. Řehůřek and Kolkus (2009) use a similar technique, but they use a fixed-width sliding window while we use a variable window size based on context.

For each word $w_i$, we look at its tag $t_i$. We then consider all the words immediately to the left of $w_i$ and all the words immediately to the right of $w_i$ that have a tag different from $t_i$. From these words, we create the local context language models left ($L_l$) and right ($L_r$). We calculate the similarity between $L_l$ and $L_r$ as well as the similarity of $w_i$ to $L_l$ and $L_r$. There are different possible scenarios:

1. $L_l$ is similar to $L_r$
   (a) $w_i$ is similar to $L_l$ or $L_r$
   (b) $w_i$ is dissimilar to $L_l$ or $L_r$
2. $L_l$ is dissimilar to $L_r$
   (a) $w_i$ is similar to $L_l$
   (b) $w_i$ is similar to $L_r$
   (c) $w_i$ is dissimilar to $L_l$ and $L_r$

In case 1a, we assimilate the tag of $w_i$ to the tag of either $L_l$ or $L_r$; in that case, the labels for $L_l$ and $L_r$ are the same. In case 1b, $w_i$ is probably an other-language inclusion, since it is dissimilar to its context, while the left and right contexts are similar. In case 2a, we assimilate the tag of $w_i$ to the tag of $L_l$, and similarly in case 2b, we assimilate the tag of $w_i$ to $L_r$. In case 2c, $w_i$ is dissimilar to its context and the left and right contexts are also dissimilar. In this case, we leave the tag unchanged.
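The case analysis above can be written down as a small decision function. The sketch takes the similarity judgements as booleans, since the similarity measure itself is not specified here; the function name and the returned case labels are assumptions, and the situation where the contexts are dissimilar but the word resembles both is resolved in favour of the left context purely as a choice of this sketch.

```python
def smooth_tag(tag, left_tag, right_tag,
               contexts_similar, word_like_left, word_like_right):
    """Return (new_tag, case) for one word according to scenarios 1a-2c."""
    if contexts_similar:
        if word_like_left or word_like_right:
            # 1a: both contexts carry the same label, so either one can be used.
            return left_tag, "1a"
        # 1b: probable other-language inclusion; the tag is kept.
        return tag, "1b"
    if word_like_left:
        return left_tag, "2a"   # assimilate to the left context
    if word_like_right:
        return right_tag, "2b"  # assimilate to the right context
    return tag, "2c"            # dissimilar to everything: leave unchanged


case_1a = smooth_tag("X", "EN", "EN", True, True, False)
case_1b = smooth_tag("X", "EN", "EN", True, False, False)
case_2a = smooth_tag("X", "EN", "PI", False, True, False)
case_2b = smooth_tag("X", "EN", "PI", False, False, True)
case_2c = smooth_tag("X", "EN", "PI", False, False, False)
```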

The following sections describe the data used for evaluation as well as the results.

3 Data and Evaluation

Pacati, [Ved. pacati, Idg. *peqǔō, Av. pac-; Obulg. peka to fry, roast, Lith, kepū bake, Gr. pέssw cook, pέpwn ripe] to cook, boil, roast Vin. IV, 264; fig. torment in purgatory (trs. and intrs.): Niraye pacitvā after roasting in N.S.II, 225, PvA. 10, 14. – ppr. pacanto tormenting, Gen. pacato (+Caus. pācayato) D. I, 52 (expld at DA. I, 159, where read pacato for paccato, by pare daṇḍena pīḷentassa). – pp. pakka (q.v.). < >Caus. pacāpeti & pāceti (q. v.). – Pass. paccati to be roasted or tormented (q. v.). (Page 382)

In the absence of beer comparable data, we re-use the Pali dictionary data entries presented in Aler (2015a) and compare our calculated language segmentation to the segmentation presented in Aler (2015a).

The extract shown corresponds to the fifth Pali text used in the experiments. It illustrates, among other things, some of the languages used, the unclear boundaries between languages, abbreviations, symbols and references. Monolingual stretches tend to be short, with interspersed other-language inclusions.

Based on the finding in Alfter (2015a) that neither a high Rand Index nor a high F-score alone yields a good segmentation, but that a combination of a high Rand Index and a high F-score does, we have adopted a new measure of goodness-of-segmentation $G_s$, which is the arithmetic mean of the Rand Index and the $F_5$ score (8).

$$G_s = \frac{RI + F_5}{2} \qquad (8)$$

Due to how precision and recall are calculated in the context of cluster evaluation, setting $\beta > 1$, and thus placing more emphasis on recall, penalizes the algorithm for clustering together data points that are separated in the gold standard and lowers the impact of splitting data points which are clustered together in the gold standard. Indeed, it is preferable to have multiple clusters of a certain language than to have clusters of mixed languages. Thus, we use $F_5$ ($\beta = 5$) instead of $F_1$ scores.
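The compound score of equation (8) can be sketched as follows, using the standard $F_\beta$ formula. Precision, recall and the Rand Index are taken as inputs, because the way they are derived from a clustering is not spelled out here; the function names and the example values are made up for illustration and are not results.

```python
def f_beta(precision, recall, beta=5.0):
    """Standard F-beta score; beta > 1 weights recall more than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def goodness_of_segmentation(rand_index, precision, recall, beta=5.0):
    """Arithmetic mean of the Rand Index and the F-beta score, as in (8)."""
    return (rand_index + f_beta(precision, recall, beta)) / 2


# Illustrative values only: the same pair of numbers with the roles swapped.
recall_heavy = f_beta(precision=0.5, recall=1.0)
precision_heavy = f_beta(precision=1.0, recall=0.5)
g_s = goodness_of_segmentation(rand_index=0.8, precision=0.5, recall=0.5)
```

With $\beta = 5$, a system with perfect recall and mediocre precision scores far higher than one with perfect precision and mediocre recall, which shows how strongly the measure leans towards recall.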

We have found le context assimilation to be working beer than right context assimilation or both side context assimilation. We therefore use only le context assimilation and leave out the other two options. 4

Results

The following table shows our results (Hybrid Language Segmentation, HLS) compared to the results given in Alfter (2015a) (Language Model Induction, LMI). We converted the scores given in Alfter (2015a) to the new compound score $G_s$. The baselines from Alfter (2015a) are also indicated. AIO indicates the baseline where each word is thrown into the same cluster; there is only one cluster (all-in-one). AID indicates the baseline where each word is separated into its own cluster; there is one cluster per word (all-in-different).

Text    Pali 1    Pali 2    Pali 3    Pali 4    Pali 5
AIO     0.3174    0.3635    0.4996    0.4047    0.5848
AID     0.4643    0.5188    0.3071    n/a       0.2833
LMI     0.5296    0.7662    0.4700    n/a       0.4402
HLS     0.6665    0.5916    0.6056    0.4730    0.5863

As can be seen from the results, our approach outperforms the baselines as well as the purely unsupervised language model induction approach, except for one data point (Pali 2) where the language model induction produced an almost perfect clustering whereas the hybrid language segmentation method did not.

5 Discussion

A big problem with the dictionary data is that it is transcribed in a noisy manner. This is not immediately clear from looking at the data, but on closer inspection, it can be seen that some symbols like commas and full stops are rendered with non-standard Unicode characters (U+FF0C FULLWIDTH COMMA and U+FF0E FULLWIDTH FULL STOP), which break the chosen whitespace tokenization method. This results in chunks that are bigger than they should be, often containing multiple languages. We can also see that Greek characters were transcribed as characters that look alike but are not actually Greek characters (see the quote at the beginning of section 3).
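One possible preprocessing step, which was not part of the experiments reported here, is to map the two fullwidth punctuation marks to their ASCII counterparts followed by a space before tokenizing on whitespace. The mapping table and the function name below are assumptions for this sketch, and the example string is constructed for illustration.

```python
# Fullwidth punctuation mapped to ASCII punctuation plus a space, so that
# whitespace tokenization can separate the surrounding words.
FULLWIDTH_PUNCTUATION = {
    "\uFF0C": ", ",  # FULLWIDTH COMMA
    "\uFF0E": ". ",  # FULLWIDTH FULL STOP
}


def normalize_and_tokenize(text):
    """Replace fullwidth commas and full stops, then split on whitespace."""
    table = str.maketrans(FULLWIDTH_PUNCTUATION)
    return text.translate(table).split()


sample = "pacati\uFF0CIdg\uFF0E peqo"
raw_tokens = sample.split()
clean_tokens = normalize_and_tokenize(sample)
```

On the sample string, plain whitespace splitting yields two chunks, the first of which glues two items together, while the normalized version yields three tokens.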

If we look more closely at the results, we can see that our approach tends to be overly confident when assigning words to the high-resource language, which in this case is English. This includes words that clearly are not English, such as ‘°itar’ and ‘°ātar’¹. The following example (Pali 1) shows the full dictionary entry.

[n. ag. fr. abhijjhita in med. function] one who covets M i. 287 (T. abhijjhātar, v. l. °itar) = A v. 265 (T. °itar, v. l. °ātar).

The poor discriminatory power of the model is probably related to the training data. While the English Wikipedia offers a huge amount of training data, it also includes many non-English words, for example in explanations and on pages about non-English, non-translatable terms. Thus, the resulting language model is noisy.

It might be possible to increase accuracy by changing the split threshold $P_{split}$, but while choosing a higher $P_{split}$ will effectively reduce the number of erroneous English tags, it will also decrease the number of correctly tagged words. It is possible that the unsupervised approach followed by the local context smoothing might re-assign the English words to the English model or at least to a consistent, second model. However, this remains to be tested. We think that simply using more ‘pure’ English training data will improve the language model’s accuracy.

¹ Here, ° stands for the root of the head word of the entry, so °itar should be read ‘abhijjhitar’ and °ātar should be read ‘abhijjhātar’.

As for local context smoothing, we have not reached conclusive results. While in some cases it succeeds in re-assigning the correct tag to a previously incorrectly tagged word, it also induces errors by re-tagging words that were previously tagged correctly. This is most probably due to the short monolingual fragments in our data; longer monolingual fragments would yield more reliable language models. In connection with this, calculating similarity based on small contexts seems problematic. Another problem is the treatment of non-words. We have chosen not to cross non-word boundaries when calculating local context, but doing so might improve the results.

Finally, we have only tested the approach with one high-resource language and a multitude of low-resource languages. It would be interesting to test the method more extensively using more high-resource language models (which in turn might interfere with each other).

6 Conclusion

We have introduced a hybrid language segmentation method which leverages the presence of high-resource language content in mixed language historical documents and the availability of the necessary resources to build language models, coupled with an unsupervised language model induction approach which covers the low-resource parts. We have shown that our method outperforms the previously introduced unsupervised language model induction approach.

We have also found that our method seems to work both on longer texts and on shorter texts, whereas the approach described in Alfter (2015a) seems to work better on shorter texts such as Twitter messages.

The local context approach yields inconclusive results. This is most probably due to the similarity measure used and the small size of the context. We would need, if possible, a better similarity measure for small language models or another method of evaluating a word with respect to its context.

Aler , D. ( 2015a ). Language Segmentation. Master's thesis , Universität Trier.

Aler , D. ( 2015b ). Language segmentation of twitter tweets using weakly supervised language model induction . TweetMT @ SEPLN.

Biemann , C. and Teresniak , S. ( 2005 ). Disentangling from babylonian confusionunsupervised language identification . In Computational Linguistics and Intelligent Text Processing , pages 773 - 784 . Springer.

Garre e, D. , Alpert-Abrams , H. , BergKirkpatrick, T., and Klein , D. ( 2015 ). Unsupervised code-switching for multilingual historical document transcription . In Proceedings of NAACL.

King , B. and Abney , S. P. ( 2013 ). Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods . In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics and Human Language Technologies , pages 1110 - 1119 .

Lui , M. , Lau , J. H. , and Baldwin , T. ( 2014 ). Automatic detection and language identification of multilingual documents . Transactions of the Association for Computational Linguistics , 2 : 27 - 40 .

Porta , J. ( 2014 ). Twier Language Identification using Rational Kernels and its potential application to Sociolinguistics . TweetLID @ SEPLN.

Řehŭřek , R. and Kolkus , M. ( 2009 ). Language identification on the web: Extending the dictionary method . In Computational Linguistics and Intelligent Text Processing , pages 357 - 368 . Springer.

Yamaguchi , H. and Tanaka-Ishii , K. ( 2012 ). Text segmentation by language using minimum description length . In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics , pages 969 - 978 . Association for Computational Linguistics.