A Corpus-based Decompounding in Sanskrit

Siba Sankar Sahu¹ and Puneet Mamgain²

¹ Dept. of Computer Science and Engineering, IIT (BHU) Varanasi, India
² Sikkim Manipal Institute of Technology, India

Abstract. Unlike in English, compound words in highly inflected Indian languages such as Bengali, Marathi, and Sanskrit are not multi-word expressions but are created by combining two or more simple words without any orthographic separation. A compound word with unmarked word boundaries creates problems for many computational tasks. Splitting compound words improves performance in Machine Translation and Information Retrieval by reducing out-of-vocabulary words in the dictionary. So far, a number of decompounding techniques have been applied to European languages such as German, Dutch, and the Scandinavian languages. In this work, we apply a corpus-based decompounding technique to Sanskrit and improve splitting accuracy by applying various ranking methods. We evaluate the performance of the different ranking methods against a gold standard in terms of Precision, Recall, and F-measure.

Keywords: Compound, Machine Translation, Information Retrieval

1 Introduction

Compounding is the process of combining two or more simple words to form a new complex word. Two or more words are combined if there is a semantic relation between them. In Sanskrit, words may be combined by direct concatenation or by applying sandhi rules. In general, compound words are divided into three categories: 'open compounds', 'closed compounds', and 'hyphenated compounds'. In an open compound, two or more words are combined to create a new meaningful expression, but a space is kept between the constituent words. In a closed compound, two or more words are combined to create a new compound word with no space between them. In a hyphenated compound, two or more words are combined with hyphens in between. In Indian languages like Bengali, Marathi, and Sanskrit, the frequency of closed compounds is very high compared to the other types. Closed compound words reduce the effectiveness of Machine Translation, Speech Recognition, and Information Retrieval systems.

Many decompounding techniques have been proposed for European languages such as German, Dutch, and Czech, and have been shown to reduce OOV words in the dictionary and to improve performance in Machine Translation and hypernym detection [4, 6, 1]. However, no such work has been proposed and evaluated for Sanskrit. In this work, we apply a corpus-based decompounding technique to Sanskrit and use various ranking methods to improve splitting accuracy. We built a gold standard and evaluate the performance of the compound splitting technique combined with the various ranking methods in terms of Precision, Recall, and F-measure.

The paper is organised as follows. Section 2 reviews corpus-based decompounding techniques implemented and evaluated for different languages. Section 3 describes the dataset. Section 4 describes our corpus-based decompounding approach for Sanskrit and the ranking methods used to improve splitting accuracy. Section 5 presents the evaluation and shows the improvements in splitting performance obtained with the different ranking methods. Section 6 concludes with directions for future work.

2 Related Work

A few decompounding techniques have previously been applied to European languages. In general, they split compounds in one of three ways: (i) rule-based, (ii) corpus-based, and (iii) supervised methods. Hybrid approaches are also sometimes used to improve the performance of compound splitting. In this section, we present a brief overview of corpus-based decompounding methods implemented for different languages.

Koehn and Knight [4] proposed a compound splitting approach for German compound words to improve Statistical Machine Translation. The splitting point of a compound word is determined by the highest geometric mean of the word frequencies of its n constituents:

$$\arg\max_{S} \Bigl(\prod_{e_i \in S} \mathrm{count}(e_i)\Bigr)^{1/n} \qquad (1)$$

This method assumes that if a constituent part of a compound frequently occurs as an independent word in the corpus, it is likely to be a genuine part of the compound.

Marek [5] and Alfonseca et al. [2] proposed a compound splitting technique based on a probability score. The score of a splitting point is the sum of the negative logarithms of the probabilities of the constituent elements, where the probability of an element is its frequency count(e) divided by the corpus size N:

$$\arg\min_{S} \sum_{e_i \in S} -\log\Bigl(\frac{\mathrm{count}(e_i)}{N}\Bigr) \qquad (2)$$

This method assumes that constituents with high probability in the corpus are good predictors of the compound's true parts.
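As a concrete illustration, the following Python sketch scores a candidate split with both measures; Equation (1) is maximised, Equation (2) minimised. The frequency table, example words, and corpus size are hypothetical stand-ins for counts derived from a real corpus.

```python
import math
from typing import Dict, List

def geometric_mean_score(parts: List[str], count: Dict[str, int]) -> float:
    """Eq. (1): geometric mean of the constituents' corpus frequencies."""
    freqs = [count.get(p, 0) for p in parts]
    if 0 in freqs:
        return 0.0  # an unseen constituent makes the split worthless
    return math.prod(freqs) ** (1.0 / len(freqs))

def neg_log_prob_score(parts: List[str], count: Dict[str, int], n: int) -> float:
    """Eq. (2): sum of negative log probabilities; lower is better."""
    total = 0.0
    for p in parts:
        freq = count.get(p, 0)
        if freq == 0:
            return math.inf  # unseen constituent: worst possible score
        total += -math.log(freq / n)
    return total

# Hypothetical frequency table and toy corpus size, for illustration only.
counts = {"deva": 120, "datta": 45, "devadatta": 3}
splits = [["devadatta"], ["deva", "datta"]]
best_by_eq1 = max(splits, key=lambda s: geometric_mean_score(s, counts))
best_by_eq2 = min(splits, key=lambda s: neg_log_prob_score(s, counts, n=1_000_000))
```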
In the recent past, a corpus-based decompounding technique was implemented for Bengali, an Indian language, for Information Retrieval [3]. The characteristics of Bengali differ from those of European languages, and hence the authors did not apply the frequency-based decompounding approach used for European languages. First, they proposed a relaxed decompounding, in which a compound word is not split into its constituent parts unless all of those parts are individually valid words. Second, they perform a selective decompounding, in which a compound is split only when a constituent co-occurs with the compound word in the corpus above a certain threshold.

In the light of these developments, the motivation of the present article is to apply corpus-based decompounding to Sanskrit text, for which, to the best of the authors' knowledge, no previous work exists.

3 Data Set

We built a corpus by extracting documents from Wikipedia, All India Radio Sanskrit news, and Samprativartah news. Since the data come from multiple sources with different forms and formats, extracting the text involves data cleaning, removing formatting tags, handling images and advertisements, etc.; the raw data can be extremely noisy. Standard text pre-processing techniques such as case-folding, punctuation removal, and stop-word removal were applied. The resulting corpus contains a total of 27,170,305 words, in both root and inflected forms. In addition, we use a dictionary of 9,568 words to generate candidate words.

4 Decompounding Approach

The algorithm is implemented in three main steps.

4.1 Candidate generation

First, we scan the word from left to right and check whether a prefix of at least the minimum length (L = 3) is a valid word in the dictionary. If it is, we generate a binary split and repeat the process on the remainder until the end of the word. In this way the splitter tries to generate all possible constituents of a compound word. If a given word is not a compound, it is returned as it is.
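A minimal sketch of this candidate-generation step, assuming the dictionary is available as a Python set, is given below; the recursion mirrors the left-to-right scan with minimum constituent length L = 3. The example dictionary and compound are hypothetical, and the sketch covers only directly concatenated compounds, not sandhi-ed ones.

```python
from typing import List, Set

MIN_LEN = 3  # minimum constituent length (L = 3)

def generate_candidates(word: str, dictionary: Set[str]) -> List[List[str]]:
    """Return all ways of splitting `word` into dictionary words.

    Scans left to right; whenever a prefix of length >= MIN_LEN is a valid
    dictionary word, a binary split is made and the remainder is split
    recursively. A non-compound dictionary word is kept as a single-element
    split, so it can be returned as it is.
    """
    splits: List[List[str]] = []
    if word in dictionary:
        splits.append([word])  # the word itself may be a simple word
    for i in range(MIN_LEN, len(word) - MIN_LEN + 1):
        prefix, rest = word[:i], word[i:]
        if prefix in dictionary:
            for tail in generate_candidates(rest, dictionary):
                splits.append([prefix] + tail)
    return splits

# Hypothetical dictionary and compound, for illustration only.
dictionary = {"deva", "datta", "putra"}
print(generate_candidates("devadattaputra", dictionary))
# -> [['deva', 'datta', 'putra']]
```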
4.2 Cleaning

During compound splitting there is a possibility of generating wrong candidates. To avoid this, we use the following cleaning methods.

Suffix parts. Sanskrit is a highly inflectional language, and the root forms of words are likely to appear inflected. Since we did not apply any stemming technique for suffix removal when pre-processing the Sanskrit text, suffixes may be split off during compound splitting. We therefore merge any split-off suffix shorter than 4 characters back into the preceding part. This suffix length is somewhat arbitrary and could be varied: increasing it discards genuine candidate splits, whereas decreasing it weakens the effect of the cleaning method.

Fragments. If the splitter generates a part of only one or two characters, we discard that split.

4.3 Ranking

The cleaned list of candidate splits is then available for ranking. We use the following ranking methods.

Most known. In this method, a score is assigned to a split based on how many of its constituent parts are known: if all constituents of a compound are known words, a higher value is assigned, otherwise a lower one.

Light and aggressive. These two methods assign a score based on the number of parts in a split: light prefers the split with fewer parts, whereas aggressive prefers the split with more parts.

Semantic similarity. The basic idea of word embeddings is to represent words in such a way that semantically similar words are close to each other, whereas semantically dissimilar words are farther apart. Word embeddings can be obtained in three main ways: (i) the Word2vec model, (ii) the fastText model, and (iii) the GloVe model. Here, we use the fastText skip-gram model, which predicts the surrounding context window (of size 5) of a given word, to generate word embeddings. We use 300-dimensional vectors, 25 epochs, and a learning rate of 1. In the input layer, words are broken into character n-grams and fed to the neural network; the output layer contains the context words. We train the fastText model on the dataset described in Section 3 and use the trained model for ranking. We assume that adjacent parts of a split are more similar than parts that are farther apart. For example, in 'extremely difficult labour', 'extremely' and 'difficult' are often found together, as are 'difficult' and 'labour', but 'extremely' and 'labour' are not likely to be similar at all. We use the Cosine Similarity measure to evaluate the distance between two word vectors.
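The paper does not specify the implementation, but the sketch below shows one way such a semantic score could be computed with gensim's FastText. The toy corpus, the training call, and the averaging of cosine similarity over adjacent constituent pairs are our assumptions; the learning rate of 1 mentioned above is left at gensim's default here.

```python
from gensim.models import FastText

# Hypothetical tokenized corpus; in practice, the Section 3 data would be used.
corpus = [["ayam", "devadattah"], ["devadattah", "gacchati"]]

# Skip-gram fastText embeddings: 300 dimensions, context window 5, 25 epochs,
# matching the hyperparameters reported in Section 4.3.
model = FastText(corpus, vector_size=300, window=5, sg=1, epochs=25, min_count=1)

def semantic_score(parts):
    """Average cosine similarity between adjacent constituents of a split."""
    if len(parts) < 2:
        return 0.0  # an unsplit word gets a neutral score
    sims = [model.wv.similarity(a, b) for a, b in zip(parts, parts[1:])]
    return float(sum(sims) / len(sims))

# Rank the cleaned candidate splits, most semantically coherent first.
candidate_splits = [["deva", "dattah"], ["devadattah"]]
ranked = sorted(candidate_splits, key=semantic_score, reverse=True)
```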
5 Evaluation

We built a test dataset by extracting documents from Samprativartah News and randomly choosing 1,190 unique words. The gold standard for the test data was manually created by a Sanskrit expert. We evaluate the performance of the compound splitting technique combined with the different ranking methods against this gold standard in terms of Precision, Recall, F-measure, and Accuracy. The correctness of splits is evaluated using the categories of Koehn and Knight [4]:

correct split: words that should be split and were split correctly
correct non-split: words that should not be split and were not
wrong non-split: words that should be split but were not
wrong faulty split: words that should be split and were split, but wrongly
wrong superfluous split: words that should not be split, but were

precision: (correct split) / (correct split + wrong faulty split + wrong superfluous split)
recall: (correct split) / (correct split + wrong faulty split + wrong non-split)
accuracy: (correct) / (correct + wrong)

Table 1. Evaluation of the methods compared against a manually annotated gold standard of splits: the Light method gives the best accuracy (78.23%).

Method              | Correct split | Correct non | Wrong not | Wrong faulty | Wrong split | Prec. | Recall | Acc.
--------------------|---------------|-------------|-----------|--------------|-------------|-------|--------|-------
Raw                 | 0             | 802         | 388       | 0            | 0           | -     | 0.0    | 67.3%
Most Known          | 65            | 761         | 172       | 69           | 123         | 25.29 | 21.2   | 69.4%
Semantic Similarity | 81            | 765         | 94        | 116          | 134         | 24.4  | 27.8   | 71.09%
Aggressive          | 54            | 799         | 120       | 56           | 161         | 19.9  | 23.4   | 71.6%
Light               | 82            | 849         | 104       | 61           | 94          | 34.5  | 33.19  | 78.23%

6 Conclusions and Future Work

Decompounding is an important pre-processing step for compounding languages like Bengali, Marathi, and Sanskrit. In this work, we investigated the effect of corpus-based decompounding in Sanskrit and improved splitting accuracy with different ranking methods. In Sanskrit, compound words are formed in two ways: by direct concatenation of words, or by applying sandhi rules. We observed that the compound splitting technique splits most directly concatenated compounds effectively, but its performance on sandhi-ed compounds is quite poor, for two reasons. First, a sandhi rule changes the word boundary, so the first character of the second constituent appears in a modified form in the compound. Second, in a sandhi-ed compound the second constituent may not be a dictionary word at all. Among the different ranking methods, the Light method gives the highest splitting accuracy, 78.23%, as shown in Table 1. As future work, we plan to investigate the effect of decompounding and of the various ranking methods in a Sanskrit Information Retrieval system. We would also like to investigate the effect of decompounding in other highly inflected Indian languages such as Bengali and Marathi.

References

1. Adda-Decker, M.: A corpus-based decompounding algorithm for German lexical modeling in LVCSR. In: Eighth European Conference on Speech Communication and Technology (2003)
2. Alfonseca, E., Bilac, S., Pharies, S.: German decompounding in a difficult corpus. In: International Conference on Intelligent Text Processing and Computational Linguistics, pp. 128–139. Springer (2008)
3. Ganguly, D., Leveling, J., Jones, G.J.: A case study in decompounding for Bengali information retrieval. In: International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 108–119. Springer (2013)
4. Koehn, P., Knight, K.: Empirical methods for compound splitting. arXiv preprint cs/0302032 (2003)
5. Marek, T.: Analysis of German compounds using weighted finite state transducers. Bachelor thesis, University of Tübingen (2006)
6. Rigouts Terryn, A., Macken, L., Lefever, E.: Dutch hypernym detection: does decompounding help? In: Joint Second Workshop on Language and Ontology & Terminology and Knowledge Structures (LangOnto2 + TermiKS), pp. 74–78. European Language Resources Association (ELRA) (2016)