<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Corpus-based Decompounding in Sanskrit</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Siba Sankar Sahu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Puneet Mamgain</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Science and Engineering</institution>
          ,
          <addr-line>IIT (BHU) Varanasi</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sikkim Manipal Institute of Technology</institution>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Unlike English, in highly inflected Indian languages like Bengali, Marathi, and Sanskrit, compound words are not multi-word expressions but are created by combining two or more simple words without any orthographic separation. A compound word with unmarked word boundaries creates a problem for many computational tasks. Splitting compound words improves performance in Machine Translation and Information Retrieval by reducing out-of-vocabulary words in the dictionary. So far, a number of decompounding techniques have been applied to European languages like German, Dutch, and the Scandinavian languages. In this work, we apply a corpus-based decompounding technique to Sanskrit and improve splitting accuracy by applying various ranking methods. We evaluate the performance of the different ranking methods against a gold standard in terms of Precision, Recall, and F-measure.</p>
      </abstract>
      <kwd-group>
        <kwd>Compound</kwd>
        <kwd>Machine Translation</kwd>
        <kwd>Information Retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>However, no such work has been proposed and evaluated for Sanskrit. In this work,
we apply a corpus-based decompounding technique to Sanskrit and use various
ranking methods to improve splitting accuracy. We built a gold standard and
evaluate the performance of the compound splitting technique, combined with
various ranking methods, in terms of Precision, Recall, and F-measure.
The paper is organised as follows. In Section 2 we review different kinds of
corpus-based decompounding techniques implemented and evaluated for different
languages. Section 3 describes the dataset. Section 4 describes the corpus-based
decompounding approach for Sanskrit and the various ranking methods used to
improve splitting accuracy, and shows the resulting improvements in splitting
performance. Section 5 presents the evaluation, and Section 6 concludes
with directions for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>A few decompounding techniques have been applied earlier to European
languages. In general, they split compounds in three ways: i. rule-based, ii.
corpus-based, and iii. supervised methods. Sometimes hybrid approaches are also used to
improve the performance of compound splitting techniques. In this section, we
present a brief overview of corpus-based decompounding methods implemented
for different languages.</p>
      <p>
        Koehn and Knight [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] proposed a compound splitting approach for German
compound words to improve Statistical Machine Translation. The splitting point of
a compound word is determined by the highest geometric mean of the word
frequencies of its parts:
      </p>
      <p>argmax_S ( ∏_{e_i ∈ S} count(e_i) )^{1/n}    (1)</p>
      <p>In this method, it is assumed that if a constituent part of a compound frequently
occurs as an independent word in a corpus, it is likely to be part of the compound.</p>
      <p>
        Marek [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and Alfonseca [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposed a compound splitting technique based on a
probability score. The splitting point score is determined by the sum of the negative
logarithms of the probabilities of the constituent elements, where the probability
of an element is its frequency count(e_i) divided by the corpus size N:
      </p>
      <p>argmin_S − ∑_{e_i ∈ S} log( count(e_i) / N )    (2)</p>
      <p>In this method it is assumed that if the probability of the constituent elements
of a compound is high in the corpus, then this has a positive effect on the prediction
of the compound constituents.</p>
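      <p>The two corpus-based criteria above can be sketched as follows. This is a minimal illustration: the words, frequency counts, and corpus size are invented toy values, not taken from the paper's corpus, and unseen parts are given a pseudo-count of 1 to avoid taking the logarithm of zero.</p>

```python
from math import log, prod

# Hypothetical frequency counts and corpus size, for illustration only.
count = {"rama": 120, "krishna": 95, "ramakrishna": 4}
N = 1000

def geo_mean(parts):
    """Eq. (1): geometric mean of part frequencies; higher is better."""
    return prod(count.get(p, 0) for p in parts) ** (1.0 / len(parts))

def neg_log_prob(parts):
    """Eq. (2): sum of negative log probabilities of the parts; lower is better."""
    return sum(-log(count.get(p, 1) / N) for p in parts)

candidates = [["ramakrishna"], ["rama", "krishna"]]
print(max(candidates, key=geo_mean))      # ['rama', 'krishna']
print(min(candidates, key=neg_log_prob))  # ['rama', 'krishna']
```

Both criteria prefer the binary split here because the parts occur far more often as independent words than the whole compound does.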
      <p>
        In the recent past, a corpus-based decompounding technique was implemented
for an Indian language, Bengali, for Information Retrieval [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The characteristics of the
Bengali language are different from those of European languages and hence, the authors
did not apply the frequency-based decompounding approach used for European
languages. First, they proposed a relaxed decompounding, where a compound
is not split into its constituent parts unless all the constituent parts of the compound
word are individually valid words. Secondly, they performed a selective
decompounding when a constituent of a compound co-occurs with the compound word
in the corpus above a certain threshold.
      </p>
      <p>In the light of these developments, the motivation of the present article is to
apply corpus-based decompounding to Sanskrit text, where no previous work exists
to the best of the authors' knowledge.</p>
    </sec>
    <sec id="sec-3">
      <title>Data Set</title>
      <p>We built a corpus by extracting documents from Wikipedia, All India Radio
Sanskrit news, and Samprativartah news. Since the data come from multiple
sources with different forms and formats, extracting text involves data
cleaning, removing formatting tags, handling images and advertisements, etc. The data
can be extremely noisy. Some standard text pre-processing techniques,
such as case-folding, punctuation removal, and stop-word removal, were applied. The
resulting raw corpus contains a total of 27,170,305 words, in root form as well as inflected
form. Moreover, we use a dictionary of 9568 words to generate candidate words.</p>
    </sec>
    <sec id="sec-4">
      <title>Decompounding Approach</title>
      <p>The algorithm is implemented in three main steps.</p>
      <p>Candidate generation. First, we scan the word from left to right and check
whether a character sequence of minimum length (L = 3) is a valid word in the
dictionary. If it is a valid word, we generate a binary split and repeat the process
till the end of the word. The splitter tries to generate all possible constituents of a
compound word. If a given word is not a compound, it is returned as it is.</p>
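      <p>The candidate-generation step can be sketched as a recursive left-to-right scan over the word; the dictionary entries below are hypothetical placeholders, not the paper's 9568-word dictionary.</p>

```python
def generate_candidates(word, dictionary, min_len=3):
    """Return all segmentations of `word` whose parts are dictionary words.
    The whole word itself is included when it is a dictionary entry."""
    results = []
    if word in dictionary:
        results.append([word])
    for i in range(min_len, len(word) - min_len + 1):
        prefix = word[:i]
        if prefix in dictionary:
            # Recurse on the remainder to enumerate multi-part splits.
            for rest in generate_candidates(word[i:], dictionary, min_len):
                results.append([prefix] + rest)
    return results

dictionary = {"gaja", "raja", "gajaraja"}  # hypothetical entries
print(generate_candidates("gajaraja", dictionary))
# [['gajaraja'], ['gaja', 'raja']]
```

A word that yields no dictionary segmentation returns an empty list here; the caller then keeps the word as it is, matching the behaviour described above.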
      <p>Cleaning. During compound splitting there is a possibility of wrong candidate
generation. To avoid wrong candidates we use the following cleaning methods.
Suffix parts. Sanskrit is a highly inflectional language and the root forms of
words are likely to be inflected. As pre-processing of the Sanskrit text we did
not apply any stemming technique for suffix removal; hence, suffixes
may be split off during compound splitting. We merge split-off suffix parts
shorter than 4 characters with the preceding constituent. The suffix length is somewhat
arbitrary and could be varied: increasing it discards actual candidate splits,
while decreasing it makes the cleaning method less effective.
Fragments. If the splitter generates one- or two-character parts, we discard
such splits.</p>
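      <p>The two cleaning rules can be sketched as below; min_keep_len = 4 encodes the "shorter than 4 characters" suffix threshold, and the example words are illustrative only.</p>

```python
def merge_suffixes(parts, min_keep_len=4):
    """Merge split-off parts shorter than 4 characters into the previous part."""
    cleaned = []
    for part in parts:
        if cleaned and len(part) >= min_keep_len:
            cleaned.append(part)
        elif cleaned:
            cleaned[-1] += part  # treat the short part as an inflectional suffix
        else:
            cleaned.append(part)
    return cleaned

def has_no_fragments(parts):
    """Reject splits that still contain one- or two-character fragments."""
    return all(len(p) >= 3 for p in parts)

print(merge_suffixes(["deva", "sya"]))  # ['devasya']
```

Increasing min_keep_len merges longer trailing parts (discarding more real splits), while decreasing it leaves more suffixes unmerged, mirroring the trade-off described above.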
      <p>Ranking. The cleaned list of splits is now available for ranking. We use the
following ranking methods to rank the splits.</p>
      <p>Most known. In this ranking method a score is assigned to a split based on
how many of its constituent parts are known. If all the constituents of a compound
are known, a higher value is assigned; otherwise a lesser value.</p>
      <p>Light and aggressive. Light and aggressive assign a score based on the number
of split parts: light prefers the split with fewer parts, whereas aggressive prefers
the split with more parts.</p>
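      <p>Under the assumption that "most known" scores the fraction of in-vocabulary parts, "light" prefers fewer parts, and "aggressive" prefers more parts, these rankers can be sketched as:</p>

```python
def most_known_score(parts, vocab):
    """Fraction of parts that are known vocabulary words; 1.0 if all are known."""
    return sum(1 for p in parts if p in vocab) / len(parts)

def light_rank(splits):
    """Light: prefer the split with the fewest parts."""
    return min(splits, key=len)

def aggressive_rank(splits):
    """Aggressive: prefer the split with the most parts."""
    return max(splits, key=len)

vocab = {"rama", "krishna"}  # hypothetical vocabulary
print(most_known_score(["rama", "krishna"], vocab))          # 1.0
print(light_rank([["ramakrishna"], ["rama", "krishna"]]))    # ['ramakrishna']
```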
      <p>Semantic Similarity. The basic idea of word embeddings is to represent
words in such a way that semantically similar words are close to each other
whereas semantically dissimilar words are farther apart. Word embeddings
can be implemented in three ways: i. the Word2vec model, ii. the fastText model, and
iii. the GloVe model. Here, we use the fastText model to generate word embeddings,
in which the skip-gram predicts the surrounding context window of size 5 of a given
current word. We use 300 dimensions, with 25 epochs and a learning rate of 1.
In the input layer words are broken into n-grams and fed to the neural network,
and the output layer contains the context words. We train the fastText model on
the dataset described in Section 3 and use the pre-trained model for ranking.
We assume that adjacent parts of a split are more similar than parts that are
further apart. For example, in "extremely difficult labour", "extremely" and
"difficult" are often found together, as are "difficult" and "labour", but
"extremely" and "labour" are not likely to be similar at all. We use the Cosine
Similarity measure to evaluate the distance between two word vectors.</p>
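      <p>The adjacent-part similarity idea can be sketched with a plain cosine measure; the 3-dimensional vectors below are invented stand-ins for the 300-dimensional fastText embeddings.</p>

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def adjacent_similarity(parts, vectors):
    """Average cosine similarity of adjacent parts of a split; higher is better."""
    pairs = list(zip(parts, parts[1:]))
    if not pairs:
        return 0.0  # a single-part "split" has no adjacent pairs to score
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)

vectors = {"rama": [1.0, 0.2, 0.0], "krishna": [0.9, 0.3, 0.1]}
print(round(adjacent_similarity(["rama", "krishna"], vectors), 3))
```

A split whose adjacent parts point in similar directions in embedding space scores higher, matching the assumption stated above.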
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>
        We built a test dataset by extracting documents from Samprativartah news
and chose 1190 unique words at random. The gold standard for the test data was
manually created by a Sanskrit expert. We evaluate the performance of the
compound splitting technique, combined with different ranking methods, against
the gold standard in terms of Precision, Recall, F-measure, and Accuracy. We
evaluate the correctness of splits using the categories of Koehn and Knight [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]:
correct split: words that should be split and were split correctly
correct non-split: words that should not be split and were not
wrong not split: words that should be split but were not
wrong faulty split: words that should be split, were split, but wrongly
wrong superfluous split: words that should not be split, but were
precision: (correct split) / (correct split + wrong faulty split + wrong superfluous split)
recall: (correct split) / (correct split + wrong faulty split + wrong not split)
accuracy: (correct) / (correct + wrong)
      </p>
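      <p>From the five counts above, the measures can be computed as follows; the counts in the example are invented for illustration, not the paper's results.</p>

```python
def evaluate(correct_split, correct_non, wrong_not, wrong_faulty, wrong_superfluous):
    """Precision, recall, F-measure, and accuracy from the five split categories."""
    precision = correct_split / (correct_split + wrong_faulty + wrong_superfluous)
    recall = correct_split / (correct_split + wrong_faulty + wrong_not)
    f_measure = 2 * precision * recall / (precision + recall)
    correct = correct_split + correct_non
    wrong = wrong_not + wrong_faulty + wrong_superfluous
    accuracy = correct / (correct + wrong)
    return precision, recall, f_measure, accuracy

p, r, f, a = evaluate(80, 100, 5, 10, 5)
print(round(p, 3), round(r, 3), round(a, 3))  # 0.842 0.842 0.9
```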
    </sec>
    <sec id="sec-6">
      <title>Conclusions and Future Work</title>
      <p>Decompounding is an important pre-processing step for compounding languages like
Bengali, Marathi, and Sanskrit. In this work, we investigate the effect of
corpus-based decompounding in Sanskrit and improve splitting accuracy with different
ranking methods. In Sanskrit, compound words occur in two ways: one is the direct
concatenation of words, and the other applies sandhi rules. It has been
observed that the compound splitting technique splits most directly concatenated
compound words effectively, but for sandhi-ed compounds the performance of the
compound splitter is quite poor, because the sandhi rules change the first character
of the second candidate, which therefore appears in a modified form in the compound. Secondly,
in sandhi-ed compounds the second candidate of the compound word may not be
a valid word in the dictionary. Among the different ranking methods, the shortest method
gives the highest splitting accuracy, i.e. 78.23%, as shown in Table 1.</p>
      <p>As part of future work, we plan to investigate the effect of decompounding as
well as the various ranking methods in a Sanskrit Information Retrieval system. We
would also like to investigate the effect of decompounding in highly inflected Indian
languages like Bengali and Marathi.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Adda-Decker</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A corpus-based decompounding algorithm for German lexical modeling in LVCSR</article-title>
          . In: Eighth European Conference on Speech Communication and Technology
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Alfonseca</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bilac</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pharies</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>German decompounding in a difficult corpus</article-title>
          .
          <source>In: International Conference on Intelligent Text Processing and Computational Linguistics</source>
          . pp.
          <volume>128</volume>
          –
          <fpage>139</fpage>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ganguly</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leveling</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>G.J.:</given-names>
          </string-name>
          <article-title>A case study in decompounding for Bengali information retrieval</article-title>
          .
          <source>In: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          . pp.
          <volume>108</volume>
          –
          <fpage>119</fpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Koehn</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Knight</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Empirical methods for compound splitting</article-title>
          .
          <source>arXiv preprint cs/0302032</source>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Marek</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Analysis of German compounds using weighted finite state transducers</article-title>
          .
          <source>Bachelor thesis</source>
          , University of Tubingen (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Rigouts</given-names>
            <surname>Terryn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Macken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Lefever</surname>
          </string-name>
          , E.:
          <article-title>Dutch hypernym detection: does decompounding help</article-title>
          ? In: Joint Second Workshop on Language and Ontology &amp; Terminology and Knowledge Structures (LangOnto2+ TermiKS)
          . pp.
          <volume>74</volume>
          –
          <fpage>78</fpage>
          .
          European Language Resources Association (ELRA)
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>