<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Hybrid Language Segmentation for Historical Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Alfter</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bizzoni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Gothenburg</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Language segmentation, i.e. the division of a multilingual text into monolingual fragments, has been addressed in the past, but its application to historical documents remains largely unexplored. We propose a method for language segmentation of multilingual historical documents. For documents that contain a mix of high- and low-resource languages, we leverage the wide availability of high-resource language material and use unsupervised methods for the low-resource parts. We show that our method outperforms previous efforts in this field.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The computational processing of historical
documents presents challenges that modern
documents do not; often there is no standard
orthography, and the documents may interleave multiple
languages
        <xref ref-type="bibr" rid="ref1 ref2 ref4">(Garrette et al., 2015)</xref>
        . Furthermore, the
languages used in the documents may by now be
considered dead languages.
      </p>
      <p>
        This work addresses the issue of language
segmentation, i.e. segmenting a multilingual text
into monolingual fragments for further
processing. While this task has been addressed in the past
using supervised and weakly supervised
methods such as trained language models
        <xref ref-type="bibr" rid="ref5 ref8">(Řehůřek
and Kolkus, 2009; King and Abney, 2013)</xref>
        ,
using unsupervised methods
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref9">(Biemann and Teresniak, 2005;
Yamaguchi and Tanaka-Ishii, 2012; Alfter, 2015a)</xref>
        ,
for short messages
        <xref ref-type="bibr" rid="ref1 ref2 ref7">(Porta, 2014;
Alfter, 2015b)</xref>
        and for historical
documents with regard to OCR tasks
        <xref ref-type="bibr" rid="ref1 ref2 ref4">(Garrette
et al., 2015)</xref>
        , there is still room for improvement,
especially concerning historical documents.
      </p>
      <p>
        Due to the scarcity of multilingual corpora
        <xref ref-type="bibr" rid="ref6">(Lui
et al., 2014)</xref>
        , a popular approach is to use
monolingual training data. However, in the case of
historical documents, the number of available texts
in a given historical language might be too low to
yield representative language models.
      </p>
      <p>We propose a method that works on texts
containing at least one high-resource language and
at least one low-resource language. The intuition
is to use supervised and weakly supervised
methods for the high-resource languages and
unsupervised methods for the low-resource languages to
arrive at a better language segmentation:
supervised methods derived from high-resource
languages single out these languages, while
unsupervised algorithms tackle the remaining unknown
language(s) and cluster them by similarity.</p>
      <p>The presented approach is extendable to more
than one high-resource language, in which case
a separate language model has to be trained for
each language; the approach is also applicable to
more than one low-resource language, where the
unsupervised methods are expected to produce an
accurate split of all languages present.</p>
    </sec>
    <sec id="sec-2">
      <title>Hybrid language segmentation</title>
      <p>Let D = w1 … wn be a document consisting of the
words w1 to wn. Let Lh be a character-level
n-gram language model trained on data for a
high-resource language which occurs in the document
D. We first apply the language model Lh to the
document D and assign each word wi the
probability given by Lh (1).</p>
      <p>
        <disp-formula><label>(1)</label><tex-math><![CDATA[\forall w_i \in D : P(w_i) = L_h(w_i)]]></tex-math></disp-formula>
      </p>
      <p>The language model Lh is implemented as a
trigram language model with non-linear back-off.
For testing purposes, we trained a language model
on a dump of the English Wikipedia (3 GB of
compressed data).</p>
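      <p>As a minimal illustration of how a character-level trigram model can assign a score to each word, the following Python sketch trains on a toy word list and falls back to shorter n-grams with a fixed weight. The class name CharTrigramModel, the padding symbols, the back-off weight of 0.4 and the length normalisation are assumptions made for this example; the fixed-weight scheme merely stands in for the non-linear back-off mentioned above and is not the implementation used in our experiments.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


class CharTrigramModel:
    """Toy character trigram model with fixed-weight back-off (illustrative only)."""

    def __init__(self, backoff=0.4):
        self.backoff = backoff  # assumed weight, not taken from the paper
        self.tri, self.tri_ctx = Counter(), Counter()
        self.bi, self.bi_ctx = Counter(), Counter()
        self.uni = Counter()
        self.total = 0

    @staticmethod
    def _pad(word):
        # "^" and "$" mark word boundaries; both are assumed absent from the data.
        return "^^" + word.lower() + "$"

    def train(self, words):
        for word in words:
            padded = self._pad(word)
            for i in range(2, len(padded)):
                self.tri[padded[i - 2:i + 1]] += 1
                self.tri_ctx[padded[i - 2:i]] += 1
                self.bi[padded[i - 1:i + 1]] += 1
                self.bi_ctx[padded[i - 1]] += 1
                self.uni[padded[i]] += 1
                self.total += 1

    def _char_score(self, c2, c1, c):
        # Use the longest n-gram seen in training, discounting each back-off step.
        if self.tri[c2 + c1 + c]:
            return self.tri[c2 + c1 + c] / self.tri_ctx[c2 + c1]
        if self.bi[c1 + c]:
            return self.backoff * self.bi[c1 + c] / self.bi_ctx[c1]
        # Add-one smoothing keeps unseen characters at a non-zero score.
        return self.backoff ** 2 * (self.uni[c] + 1) / (self.total + len(self.uni) + 1)

    def score(self, word):
        """Return a length-normalised score in (0, 1] for one word."""
        padded = self._pad(word)
        logp = sum(
            math.log(self._char_score(padded[i - 2], padded[i - 1], padded[i]))
            for i in range(2, len(padded))
        )
        return math.exp(logp / (len(padded) - 2))


model = CharTrigramModel()
model.train("the where there roasted cook boil tormented here other".split())
print(model.score("there"), model.score("pīḷentassa"))
```
]]></preformat>
      <p>With this toy training list, a word made of familiar English character sequences receives a higher score than a Pali word; a realistic model would of course be trained on far more data.</p>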
      <p>Under the assumption that the text contains at
least two languages with at least one word from
each language, we determine the minimum
probability Pmin for a split (2). This probability
corresponds to the lowest probability assigned by the
language model Lh to any word in the text.</p>
      <p>
        <disp-formula><label>(2)</label><tex-math><![CDATA[P_{min} = \min_{i=1,\dots,n} P(w_i)]]></tex-math></disp-formula>
      </p>
      <p>Next, we determine the maximum probability
distance Pa between adjacent words (3) and the
global maximum probability distance Pg between
any two words (4).</p>
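      <p>
        <disp-formula><label>(3)</label><tex-math><![CDATA[P_a = \max_{i=2,\dots,n} \left| P(w_{i-1}) - P(w_i) \right|]]></tex-math></disp-formula>
        <disp-formula><label>(4)</label><tex-math><![CDATA[P_g = \max_{i=1,\dots,n;\ j=1,\dots,n} \left| P(w_i) - P(w_j) \right|]]></tex-math></disp-formula>
      </p>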
      <p>We also calculate the mean probability Pmean
between the two adjacent words which maximize
Pa (5).</p>
      <p>
        <disp-formula><label>(5)</label><tex-math><![CDATA[P_{mean} = \frac{P(w_i) + P(w_j)}{2}]]></tex-math></disp-formula>
      </p>
      <p>Finally, we calculate the sharpest drop in
probabilities and define Pmindrop as the probability at
the lowest point of the drop (6).</p>
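      <p>
        <disp-formula><label>(6)</label><tex-math><![CDATA[P_{mindrop} = \max_{i=3,\dots,n} \left( \left| P(w_{i-2}) - P(w_{i-1}) \right| + \left| P(w_{i-1}) - P(w_i) \right| \right)]]></tex-math></disp-formula>
      </p>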
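      <p>The quantities in (2) to (5) can be computed in a single pass over the word probabilities. The following Python sketch is illustrative only: it assumes that the probability distance is the absolute difference between two probabilities, and the function name split_statistics and the dictionary keys are our own choices for this example.</p>
      <preformat><![CDATA[
```python
def split_statistics(probs):
    """Compute P_min, P_a, P_g and P_mean from a list of word probabilities."""
    if len(probs) < 2:
        raise ValueError("need at least two words")
    p_min = min(probs)
    # Gap between each word and its predecessor (assumed: absolute difference).
    gaps = [abs(probs[i - 1] - probs[i]) for i in range(1, len(probs))]
    p_a = max(gaps)
    i = gaps.index(p_a) + 1  # index of the right-hand word of the largest gap
    # The largest gap between any two words is simply max minus min.
    p_g = max(probs) - p_min
    # Mean of the two adjacent words that realise P_a.
    p_mean = (probs[i - 1] + probs[i]) / 2
    return {"p_min": p_min, "p_a": p_a, "p_g": p_g, "p_mean": p_mean}


stats = split_statistics([0.8, 0.7, 0.2, 0.3, 0.9])
print(stats)
```
]]></preformat>
      <p>If several adjacent pairs share the maximal gap, this sketch uses the first one; the text does not specify how such ties are resolved.</p>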
      <p>We then set a preliminary language split
threshold Psplit based on Pmin, Pa, Pg, Pmean and
Pmindrop (7).</p>
      <p>
        <disp-formula><label>(7)</label><tex-math><![CDATA[P_{split} = \frac{\frac{P_a + P_g + P_{mean}}{3} + \frac{P_{mindrop}}{2}}{2}]]></tex-math></disp-formula>
      </p>
      <p>In a first step, every word wi with a probability
P above the split threshold Psplit is considered to
belong to the high-resource language modeled by
Lh and is tagged as such, while every word wj
with a probability P below the split threshold is
considered as belonging to an unknown language
and is left untagged.</p>
      <p>
        In a second step, all untagged words are
clustered by similarity. This is done by using
language model induction
        <xref ref-type="bibr" rid="ref1 ref2">(Alfter, 2015a)</xref>
        . All words
left untagged by the previous step are regarded as
one text. From the first word w1, an initial
language model Li is created. The next word w2 is
tested against the initial model. If the
probability P(w2 | Li) exceeds a certain threshold value,
the model is updated with w2; otherwise a new
model is created. In this way, we iterate through
the text, creating language models as necessary.
The same procedure is carried out starting from the last
word and moving towards the beginning of the
text. From the two sets of induced language
models (forward, backward), the most similar
models according to their n-gram distribution are then
merged. This process is repeated, keeping the
previously merged models, until no more models are
induced.
      </p>
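      <p>One pass of the induction loop described above can be sketched as follows. This is an illustrative simplification in Python, not the procedure of Alfter (2015a): each model is a bag of character bigrams, the score of a word is the share of its bigrams already present in a model, and when several models exist the word is tested against the best-matching one. The names char_ngrams, overlap and induce_models and the threshold value are assumptions; the merging of forward and backward models is left out.</p>
      <preformat><![CDATA[
```python
from collections import Counter


def char_ngrams(word, n=2):
    """Character n-grams of a word padded with boundary markers."""
    padded = "^" + word.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def overlap(word, model):
    """Share of the word's character n-grams already present in the model."""
    grams = char_ngrams(word)
    return sum(1 for g in grams if g in model) / len(grams)


def induce_models(words, threshold=0.4):
    """One induction pass: update the best-matching model or start a new one."""
    models = []
    for word in words:
        best = max(models, key=lambda m: overlap(word, m), default=None)
        if best is not None and overlap(word, best) > threshold:
            best.update(char_ngrams(word))
        else:
            models.append(Counter(char_ngrams(word)))
    return models


words = ["cook", "cooking", "book", "pīḷentassa"]
forward = induce_models(words)
backward = induce_models(list(reversed(words)))
print(len(forward), len(backward))
```
]]></preformat>
      <p>The forward and backward passes can induce different sets of models because each decision depends on the words seen so far, which is why the two sets are subsequently compared and merged.</p>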
      <p>Each word is then tagged with the
language model Lm (i.e. cluster) which maximizes
P(w | Lm).</p>
      <p>Finally, all words are evaluated in a local
context using variable-length Markov Models
(VMM). This step aims at eliminating
inconsistencies, detecting other-language inclusions and
merging back together same-language fragments.
Řehůřek and Kolkus (2009) use a similar
technique, but they use a fixed-width sliding window
while we use a variable window size based on
context.</p>
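      <p>For each word wi, we look at its tag ti. We then
consider all the words immediately to the left of
wi and all the words immediately to the right of
wi that have a tag different from ti. From these
words, we create local context language models
left (Ll) and right (Lr). We calculate the
similarity between Ll and Lr as well as the similarity of
wi to Ll and Lr. There are different possible
scenarios:</p>
      <list list-type="order">
        <list-item><p>Ll is similar to Lr</p>
          <list list-type="alpha-lower">
            <list-item><p>wi is similar to Ll or Lr</p></list-item>
            <list-item><p>wi is dissimilar to Ll and Lr</p></list-item>
          </list>
        </list-item>
        <list-item><p>Ll is dissimilar to Lr</p>
          <list list-type="alpha-lower">
            <list-item><p>wi is similar to Ll</p></list-item>
            <list-item><p>wi is similar to Lr</p></list-item>
            <list-item><p>wi is dissimilar to Ll and Lr</p></list-item>
          </list>
        </list-item>
      </list>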
      <p>In case 1a, we assimilate the tag of wi to the tag
of either Ll or Lr; in that case, the labels for Ll
and Lr are the same. In case 1b, wi is probably
an other-language inclusion, since it is dissimilar
to its context, while the left and right contexts are
similar. In case 2a, we assimilate the tag of wi to
the tag of Ll, and similarly in case 2b, we
assimilate the tag of wi to Lr. In case 2c, wi is dissimilar
to its context and the left and right contexts are
also dissimilar. In this case, we leave the tag
unchanged.</p>
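      <p>The case analysis above can be written as a small decision function. The Python sketch below assumes that the three similarities have already been computed and lie between 0 and 1; the function name smooth_tag, the single similarity threshold, keeping the tag in case 1b and the tie-breaking rule when a word fits both dissimilar contexts are assumptions made for the example.</p>
      <preformat><![CDATA[
```python
def smooth_tag(sim_lr, sim_wl, sim_wr, tag, tag_left, tag_right, threshold=0.5):
    """Return the tag of a word after local context smoothing.

    sim_lr: similarity between the left and right context models
    sim_wl, sim_wr: similarity of the word to the left / right context model
    """
    fits_left = sim_wl >= threshold
    fits_right = sim_wr >= threshold
    if sim_lr >= threshold:
        if fits_left or fits_right:
            return tag_left   # case 1a: both contexts carry the same label
        return tag            # case 1b: probable inclusion, tag kept (assumed)
    if fits_left and (not fits_right or sim_wl >= sim_wr):
        return tag_left       # case 2a (also wins ties, an assumption)
    if fits_right:
        return tag_right      # case 2b
    return tag                # case 2c: leave the tag unchanged


print(smooth_tag(0.9, 0.8, 0.2, "unknown", "en", "en"))
print(smooth_tag(0.1, 0.2, 0.8, "unknown", "en", "pi"))
```
]]></preformat>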
      <p>e following sections describe the data used
for evaluation as well as the results.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Data and Evaluation</title>
      <p>Pacati, [Ved. pacati, Idg. *peqǔō, Av.
pac-; Obulg. peka to fry, roast, Lith,
kepū bake, Gr. pέssw cook, pέpwn ripe]
to cook, boil, roast Vin. IV, 264; fig.
torment in purgatory (trs. and intrs.):
Niraye pacitvā aer roasting in N.S.II, 225,
PvA. 10, 14. – ppr. pacanto
tormenting, Gen. pacato (+Caus. pācayato) D.
I, 52 (expld at DA. I, 159, where read
pacato for paccato, by pare daṇḍena
pīḷentassa). – pp. pakka (q.v.). &lt;
&gt;Caus. pacāpeti &amp; pāceti (q. v.). – Pass.
paccati to be roasted or tormented (q.
v.). (Page 382)</p>
      <p>In the absence of beer comparable data, we
re-use the Pali dictionary data entries presented
in Aler (2015a) and compare our calculated
language segmentation to the segmentation
presented in Aler (2015a).</p>
      <p>e extract shown corresponds to the fih
Pali text used in the experiments. It shows
among others some of the languages used, the
unclear boundaries between languages,
abbreviations, symbols and references. Monolingual
stretches tend to be short with interspersed
language inclusions.</p>
      <p>Based on the findings in Aler (2015a) that
neither a high Rand Index nor a high F-score
alone yield good segmentations, but a
combination of high Rand Index and F-score yield good
segmentations, we have adopted a new measure
of goodness-of-segmentation Gs, which is the
arithmetic mean of the Rand Index and F5 score
(8).</p>
      <p>Gs = RI +2 F 5 (8)</p>
      <p>Due to how precision and recall are calculated
in the context of cluster evaluation, setting β &gt; 1,
and thus placing more emphasis on recall,
penalizes the algorithm for clustering together data
points that are separated in the gold standard and
lowers the impact of splitting data points which
are clustered together in the gold standard.
Indeed, it is preferable to have multiple clusters of
a certain language than to have clusters of mixed
languages. Thus, we use F5 (β = 5) instead of F1
scores.</p>
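      <p>For concreteness, the compound score can be computed as in the following Python sketch. It takes the Rand Index, precision and recall as given, since their cluster-based definitions follow Alfter (2015a) and are not repeated here; the function names and the example values are ours and purely illustrative.</p>
      <preformat><![CDATA[
```python
def f_beta(precision, recall, beta=5.0):
    """Weighted harmonic mean of precision and recall (F-beta score)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def goodness_of_segmentation(rand_index, precision, recall, beta=5.0):
    """Arithmetic mean of the Rand Index and the F-beta score."""
    return (rand_index + f_beta(precision, recall, beta)) / 2


# Hypothetical input values, not taken from our experiments.
print(goodness_of_segmentation(0.8, 0.5, 1.0))
```
]]></preformat>
      <p>With β = 5 the score is dominated by recall: swapping the precision and recall values in the call above lowers F5 considerably, whereas F1 would be unaffected.</p>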
      <p>We have found left context assimilation to
work better than right context assimilation or
both-side context assimilation. We therefore use
only left context assimilation and leave out the
other two options.</p>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>e following table shows our results (Hybrid
Language Segmentation, HLS) compared to the
results given in Aler (2015a) (Language Model
Induction, LMI). We converted the scores given in
Aler (2015a) to the new compound score Gs. e
baselines from Aler (2015a) are also indicated.
AIO indicates the baseline where each word is
thrown into the same cluster; there is only one
cluster (all-in-one). AID indicates the baseline
where each word is separated into its own cluster;
there is one cluster per word (all-in-different).</p>
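      <table-wrap>
        <table>
          <thead>
            <tr><th>Text</th><th>AIO</th><th>AID</th><th>LMI</th><th>HLS</th></tr>
          </thead>
          <tbody>
            <tr><td>Pali 1</td><td>0.3174</td><td>0.4643</td><td>0.5296</td><td>0.6665</td></tr>
            <tr><td>Pali 2</td><td>0.3635</td><td>0.5188</td><td>0.7662</td><td>0.5916</td></tr>
            <tr><td>Pali 3</td><td>0.4996</td><td>0.3071</td><td>0.4700</td><td>0.6056</td></tr>
            <tr><td>Pali 4</td><td>0.4047</td><td>n/a</td><td>n/a</td><td>0.4730</td></tr>
            <tr><td>Pali 5</td><td>0.5848</td><td>0.2833</td><td>0.4402</td><td>0.5863</td></tr>
          </tbody>
        </table>
      </table-wrap>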
      <p>As can be seen from the results, our
approach outperforms the baselines as well as the
purely unsupervised language model induction
approach, except for one data point (Pali 2) where
language model induction produced an almost
perfect clustering whereas the hybrid language
segmentation method did not.</p>
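      <p>For reference, the two baselines are trivial to generate. The Python sketch below also computes the Rand Index by pair counting, as the share of word pairs on which two clusterings agree; the function names and the four-word gold standard are hypothetical and serve only to illustrate the two extremes.</p>
      <preformat><![CDATA[
```python
from itertools import combinations


def all_in_one(words):
    """AIO baseline: every word receives the same cluster id."""
    return [0] * len(words)


def all_in_different(words):
    """AID baseline: every word receives its own cluster id."""
    return list(range(len(words)))


def rand_index(gold, predicted):
    """Share of word pairs treated alike (same vs. different) by both clusterings."""
    pairs = list(combinations(range(len(gold)), 2))
    if not pairs:
        raise ValueError("need at least two words")
    agree = sum(
        (gold[i] == gold[j]) == (predicted[i] == predicted[j]) for i, j in pairs
    )
    return agree / len(pairs)


words = ["to", "cook", "pacati", "paccati"]
gold = ["en", "en", "pi", "pi"]  # hypothetical gold-standard labels
print(rand_index(gold, all_in_one(words)))
print(rand_index(gold, all_in_different(words)))
```
]]></preformat>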
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>A major problem with the dictionary data is that
it is transcribed in a noisy manner. This is not
immediately apparent, but
on closer inspection it can be seen that some
symbols such as commas and full stops are rendered
with non-standard Unicode characters (U+FF0C
FULLWIDTH COMMA and U+FF0E FULLWIDTH FULL
STOP), which break the chosen whitespace
tokenization method. This results in chunks that
are bigger than they should be, often
containing multiple languages. We can also see that
Greek characters were transcribed
as characters that look alike but are not actually
Greek characters (see the quote at the beginning
of section 3).</p>
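      <p>One possible remedy, which we have not evaluated, is to normalize the text before tokenization. The Python sketch below uses Unicode NFKC normalization, which maps the fullwidth comma and full stop to their ASCII counterparts, and then splits on punctuation as well as whitespace; it also reports the scripts found in a word, which would flag words that mix Latin and Greek letters. The function names and the tokenization pattern are assumptions made for the example.</p>
      <preformat><![CDATA[
```python
import re
import unicodedata


def normalize_and_tokenize(text):
    """Fold compatibility characters, then split into word and symbol tokens."""
    # NFKC maps U+FF0C and U+FF0E to the ASCII comma and full stop.
    folded = unicodedata.normalize("NFKC", text)
    # A token is a run of word characters or a single non-space symbol.
    return re.findall(r"\w+|[^\w\s]", folded)


def scripts(word):
    """Return the scripts used in a word, read off the Unicode character names."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0] for ch in word if ch.isalpha()
    }


print(normalize_and_tokenize("pacati\uff0cIdg\uff0e peqo"))
print(scripts("p\u03adssw"))
```
]]></preformat>
      <p>NFKC also rewrites other compatibility characters, so it would have to be checked against the transcription conventions of the dictionary before being applied to the data.</p>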
      <p>If we look more closely at the results, we can
see that our approach tends to be overly
confident when assigning words to the high-resource
language, which in this case is English. This
includes words that clearly are not English, such as
‘°itar’ and ‘°ātar’<fn><p>Here, ° stands for the root of the head word of the entry,
so °itar should be read ‘abhijjhitar’ and °ātar should be read
‘abhijjhātar’.</p></fn>. The following example (Pali 1)
shows the full dictionary entry.</p>
      <p>[n. ag. fr. abhijjhita in med.
function] one who covets M
&lt;smallcaps&gt;i.&lt;/smallcaps&gt; 287 (T.
abhijjhātar, v. l. °itar) = A
&lt;smallcaps&gt;v.&lt;/smallcaps&gt; 265 (T. °itar, v. l.
°ātar).</p>
      <p>The poor discriminatory power of the model is
probably related to the training data. While the
English Wikipedia offers a huge amount of
training data, it also includes many non-English words,
for example in explanations and on pages about
non-English, non-translatable terms. Thus, the
resulting language model is noisy.</p>
      <p>It might be possible to increase accuracy by
changing the split threshold Psplit, but while
choosing a higher Psplit will effectively reduce the
number of erroneous English tags, it will also
decrease the number of correctly tagged words. It is
possible that the unsupervised approach followed
by the local context smoothing might re-assign
the English words to the English model or at least
to a consistent, second model. However, this
remains to be tested. We think that simply using
more ‘pure’ English training data will improve the
language model’s accuracy.</p>
      <p>As for local context smoothing, we have not
reached conclusive results. While in some cases
it succeeds in re-assigning the correct tag to a
previously incorrectly tagged word, it also introduces
errors by erroneously re-tagging previously
correct tags. This is most probably due to the short
monolingual fragments in our data; longer
monolingual fragments would yield more reliable
language models. In connection with this, calculating
similarity based on small contexts seems
problematic. Another problem is the treatment of
non-words. We have chosen not to cross
non-word boundaries when calculating local context,
but doing so might improve the results.</p>
      <p>Finally, we have only tested the approach with
one high-resource language and a multitude of
low-resource languages. It would be interesting
to test the method more extensively using more
high-resource language models (which in turn
might interfere with each other).</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>We have introduced a hybrid language
segmentation method which leverages the presence of
high-resource language content in mixed
language historical documents and the availability of
the necessary resources to build language models,
coupled with an unsupervised language model
induction approach which covers the low-resource
parts. We have shown that our method
outperforms the previously introduced unsupervised
language model induction approach.</p>
      <p>We have also found that our method seems to
work both on longer texts and on shorter texts,
whereas the approach described in Alfter (2015a)
seems to work better on shorter texts such
as Twitter messages.</p>
      <p>The local context approach yields inconclusive
results. This is most probably due to the
similarity measure used and the small size of the
context. We would need, if possible, a better
similarity measure for small language models or another
method of evaluating a word with respect to its
context.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Aler</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2015a</year>
          ).
          <source>Language Segmentation. Master's thesis</source>
          , Universität Trier.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Aler</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2015b</year>
          ).
          <article-title>Language segmentation of Twitter tweets using weakly supervised language model induction</article-title>
          .
          <source>TweetMT @ SEPLN.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Biemann</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Teresniak</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2005</year>
          ).
          <article-title>Disentangling from Babylonian confusion - unsupervised language identification</article-title>
          .
          <source>In Computational Linguistics and Intelligent Text Processing</source>
          , pages
          <fpage>773</fpage>
          -
          <lpage>784</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Garrette</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alpert-Abrams</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          , Berg-Kirkpatrick, T., and
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2015</year>
          ).
          <article-title>Unsupervised code-switching for multilingual historical document transcription</article-title>
          .
          <source>In Proceedings of NAACL.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>King</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Abney</surname>
            ,
            <given-names>S. P.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Labeling the Languages of Words in Mixed-Language Documents using Weakly Supervised Methods</article-title>
          .
          <source>In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics and Human Language Technologies</source>
          , pages
          <fpage>1110</fpage>
          -
          <lpage>1119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Lui</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lau</surname>
            ,
            <given-names>J. H.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Baldwin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Automatic detection and language identification of multilingual documents</article-title>
          .
          <source>Transactions of the Association for Computational Linguistics</source>
          ,
          <volume>2</volume>
          :
          <fpage>27</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Porta</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Twier Language Identification using Rational Kernels and its potential application to Sociolinguistics</article-title>
          . TweetLID @ SEPLN.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Řehůřek</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kolkus</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Language identification on the web: Extending the dictionary method</article-title>
          .
          <source>In Computational Linguistics and Intelligent Text Processing</source>
          , pages
          <fpage>357</fpage>
          -
          <lpage>368</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Yamaguchi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Tanaka-Ishii</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>Text segmentation by language using minimum description length</article-title>
          .
          <source>In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics</source>
          , pages
          <fpage>969</fpage>
          -
          <lpage>978</lpage>
          . Association for Computational Linguistics.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>