Proceedings of the 11th International Conference on Applied Informatics, Eger, Hungary, January 29–31, 2020, published at http://ceur-ws.org
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Feature Reduction for Dependency Graph Construction in Computational Linguistics

László Kovács, László Csépányi-Fürjes
University of Miskolc, Institute of Information Science, Hungary
kovacs@iit.uni-miskolc.hu, csepanyifl@iit.uni-miskolc.hu

Abstract

Dependency grammar is an important tool in the semantic analysis of text sources. In the transition-based approach of dependency graph construction, the engine detects special features of the source text and determines the next construction steps using a machine learning method. Traditionally, the feature set is constructed manually, and some of the features may be irrelevant or redundant. In this paper, we investigate the efficiency of two known feature reduction methods in a grammar induction problem. The first method uses a variance minimization algorithm and the other works with an approach based on mutual information content. We also propose a normalized feature similarity measure for an alternative cluster-based feature reduction approach. For the test evaluation, we use the sentence bank of the UD (Universal Dependencies) homepage, taking examples from two different languages: English and Hungarian.

Keywords: computational linguistics, dependency graph, feature reduction

MSC: 68R15

1. Introduction

The analysis of sentence structures is an important field of computational linguistics. In the literature, we can find different approaches to describe the grammar of human languages, as different languages may require different structure models to represent the very rich language specialties. The most widely used grammar model is the phrase-structure grammar or constituency grammar [1], where the sentence structure is based on the structural axis between the noun-phrase and the verb-phrase units. The phrase-structure grammar provides very good results, especially for languages with fixed word order. The other main approach of sentence-level grammar is the family of dependency grammars. The idea of dependency structures dates back to the work of Frege on algebraic logic [2]. According to Frege, the semantics of a sentence is based on the predicate (verb) phrase, where the exact meaning of the sentence is given by the arguments of the predicate. In this sense, the words related to the arguments depend on the words of the predicate. The first explicit application of this semantic dependency model to sentence structuring relates to the works of Tesnière [3].

In dependency grammars, the semantic structure of a sentence is represented with a dependency graph connecting the words of the sentence. In the graph, the edges may be assigned different semantic labels [4]. In a dependency relationship, one of the words is called the head word, the other is the dependent word. This dependency corresponds to a parent-child relationship; every word, as a dependent unit, can be bound to only one head word. The maximal number of related dependent words is called the valency of the head. In the sentence parsing module, first the POS (part of speech) tags and other main morphological attributes are determined for the words, then the full dependency hierarchy is constructed.
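To make the head-dependent relationship and the valency notion concrete, the following minimal Python sketch (our own illustration, not part of the cited parsers; the example sentence, indices and labels are assumptions) stores a dependency graph as one head index per word and derives the valency of each head:

```python
from collections import Counter

# A dependency graph can be stored as one head index per word
# (0 denotes the artificial root). Example sentence: "the dog chased a cat".
words = ["the", "dog", "chased", "a", "cat"]
heads = [2, 3, 0, 5, 3]          # 1-based head index for each word, 0 = root
labels = ["det", "nsubj", "root", "det", "obj"]

# Every word, as a dependent unit, is bound to exactly one head,
# so a plain head-index array encodes the whole graph.
assert len(heads) == len(words) == len(labels)

# Valency of a head = number of dependent words bound to it.
valency = Counter(h for h in heads if h != 0)
for idx, count in sorted(valency.items()):
    print(f"{words[idx - 1]!r} governs {count} dependent(s)")
```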
The grammar is called projective if the order of the words in the sentence is identical to the order of the corresponding leaf nodes in the dependency graph. In languages with flexible word order, like Hungarian, these sequences may be different; such languages contain non-projective grammar structures, too. According to the analysis of Sartorio [5], the ratio of sentences with some non-projective phenomena can reach 25%.

In the literature, there are two main approaches to construct the dependency graph. One solution is the transition-based method, where the words in the sentence are processed sequentially and the next elementary graph construction step is determined from the current processing context [6]. The method uses a predefined set of elementary transformation rules, and the appropriate rule for execution is predicted using some machine learning method. This construction module uses the following data elements: state descriptor variables, initial and final state descriptors. Considering the practical implementations, the most widely used variant is the arc-eager method [7], where the state description is given by three base lists: the buffer of words not tested yet, the stack of words under investigation, and the list of words already processed. The model contains the following transformation operators: Left-Arc, Right-Arc, Reduce and Shift. The Shift operator moves one word from the buffer of words not tested yet onto the stack. The Reduce operator removes the word from the top of this stack.

2. Feature vectors in graph construction

The winning elementary graph construction step is generally predicted from the current context parameters. These context parameters are called context features, and they relate to some grammatical parameters of the words, like the POS tag or the position. The applied classifier engine uses these feature vectors as input to determine the winning operation category. It can be shown that the efficiency of the classification process depends significantly on the feature set used to describe the context status. According to the experiences presented in [8], the application of a large, extended feature set yields improved classification accuracy. To provide good prediction results, in the classical approach, the parser engines use a large set of features in order to access rich information about the current context. In practice, there exists a default feature set which is dominantly applied in the different application systems. Only a few articles focus on the problem of optimality of the selected feature set. Among the related approaches, the dominant solution uses a greedy expansion algorithm. Starting from a minimal feature set as a subset of the global feature pool, the set is extended iteratively with new features providing the largest quality increase. The main quality factor is the classification accuracy of the prediction engine.

Although large feature sets can increase the classification accuracy, the large data set decreases the time cost efficiency of the system. From this point of view, it seems reasonable to reduce the applied feature set. At the same time, we can also see that the usual feature sets (like unigrams, bigrams and trigrams) are sporadically overlapping each other, which suggests that some of the features are either irrelevant or even misleading (see [8]).
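As an illustration of the arc-eager configuration and of the context features described above, here is a minimal sketch (our own, not the parser used in the experiments; it assumes words are simple dictionaries with "id" and "pos" entries, and the feature templates shown are only a small illustrative subset of the default feature set):

```python
from dataclasses import dataclass, field

@dataclass
class Configuration:
    """Parser state of a transition-based (arc-eager style) parser."""
    stack: list                               # words currently under investigation
    buffer: list                              # words not processed yet
    arcs: set = field(default_factory=set)    # (head_id, dependent_id) pairs built so far

def shift(c: Configuration) -> None:
    # Shift: move the first word of the buffer onto the stack
    c.stack.append(c.buffer.pop(0))

def reduce_top(c: Configuration) -> None:
    # Reduce: pop the word on top of the stack (it already has its head)
    c.stack.pop()

def left_arc(c: Configuration) -> None:
    # Left-Arc: the buffer front becomes the head of the stack top
    dep = c.stack.pop()
    c.arcs.add((c.buffer[0]["id"], dep["id"]))

def right_arc(c: Configuration) -> None:
    # Right-Arc: the stack top becomes the head of the buffer front,
    # and the new dependent is pushed onto the stack
    head = c.stack[-1]
    dep = c.buffer.pop(0)
    c.arcs.add((head["id"], dep["id"]))
    c.stack.append(dep)

def extract_features(c: Configuration) -> dict:
    """Context features fed to the transition classifier (illustrative templates)."""
    s0 = c.stack[-1] if c.stack else None
    b0 = c.buffer[0] if c.buffer else None
    b1 = c.buffer[1] if len(c.buffer) > 1 else None
    feats = {
        "S0_POS": s0["pos"] if s0 else "NONE",
        "B0_POS": b0["pos"] if b0 else "NONE",
        "B1_POS": b1["pos"] if b1 else "NONE",
    }
    feats["S0_POS+B0_POS"] = feats["S0_POS"] + "|" + feats["B0_POS"]  # bigram template
    return feats
```

In a full parser, a trained classifier maps the extracted feature dictionary to one of the four transitions; only the state manipulation and the feature extraction are sketched here.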
The reduction mechanism [6] for feature set optimization eliminates the redundant features while it tries to maximize the information content of the set. In the literature, there are only a few research works on this optimization problem. In our investigation, we focus on the reduction approach to provide an optimal reduced feature set for the investigated source language. In the evaluation tests we use Hungarian and English text sources; these two languages belong to different language categories regarding the projective status of the grammar. Our hypothesis is that the default feature sets can be reduced, providing a more efficient feature set selection. The main goal of our investigation is to construct a method to discover the features with low relevance.

3. Related work

The construction of a language dependency treebank is a labor-intensive process. Researchers at the University of Szeged developed the first Hungarian dependency corpus by transforming the existing expression-based Szeged Treebank. The converted dependency trees were manually checked and repaired (see [9]). The creation of the dependency treebank was motivated by the so-called CoNLL 2007 Shared Task, which required participants to train and test their own dependency parser systems using the same data set (see [10]). The development of the Hungarian treebank enabled the Hungarian researchers to appear in the shared task not only as the developers of a dependency parser, but also as the owners of the Hungarian data set.

The current website of the Universal Dependencies (UD) project contains the so-called Hungarian Szeged Universal Treebank, which is an extract from the Hungarian Dependency Treebank. This publicly available excerpt contains 910 train, 441 dev, and 449 test sentences and largely follows the UD 2.0 annotation principles (see [11]). The magyarlanc library is a tightly coupled tool chain that performs basic NLP operations on Hungarian text, such as text and sentence segmentation, tokenization, lemmatization, POS tagging and dependency tree parsing. It was developed in accordance with the already mentioned Szeged Universal Treebank. For dependency parsing, the graph-based MATE parser was integrated into the magyarlanc system (see [12]). The reason why they selected a graph-based parser was an experiment in which they compared the popular ArcEager transition-based algorithm with the graph-based MATE parser. The result showed that the graph-based parser outperforms its competitor.

One may wonder why we still focus on transition-based algorithms. We believe that it is worth considering the use of these algorithms for the Hungarian language for two reasons. Firstly, although the ArcEager variant of the algorithm is not capable of exploring the dependency tree of non-projective sentence structures, and is therefore unable to compete with a graph-based parser, other variants exist: for instance, the non-projective list-based variant is specifically designed to analyze non-projective sentence structures. Since the Hungarian language is rich in non-projective structures, this characteristic is an important factor when selecting the algorithm variant. Secondly, processing the text from left to right is similar to the way the human brain parses a written text. We think that experimenting with a transition-based algorithm, which is a classic left-to-right processor, can provide some interesting insights about the human parsing-learning process as well.
4. Feature reduction using maximum entropy

We can observe that a certain feature may be more valuable than the others from the viewpoint of classification accuracy if its information value regarding the category label is higher than the average. If a feature shows very similar or even the same values for all the different transitions, then that feature is weak in terms of classification. The above-mentioned conditions show similarities with the relevance of information in a probabilistic system. The transitions can be viewed as the possible signals in a communication channel: $\Omega = \{T_1, T_2, T_3, \ldots\}$. The probabilities $p(T_i)$, determined by the frequency values in the training set, can be viewed as a probability space: $P_\Omega = \{p(T_1), p(T_2), p(T_3), \ldots\}$. The information content of $P_\Omega$ can be expressed with the entropy value:
$$H(P_\Omega) = -\sum_i p(T_i) \log(p(T_i)).$$
The entropy takes its greatest value when the probabilities calculated for all transitions are the same, and its smallest value if the probability calculated for a certain transition is 1. This means that the lower the entropy, the more valuable the feature template is. To evaluate the information content of a feature $f$, we group the test cases by the feature value into disjoint subsets. The weighted entropy of $f$ is calculated with
$$H_f = \sum_v |\Omega_v| \cdot H(P_{\Omega_v}),$$
where the summation runs over the possible feature values ($v$) and $\Omega_v$ denotes the set of test cases where the value of $f$ is equal to $v$. Having an input training set, we calculate the frequency values and, using these values as approximations of the probabilities, we also get the entropy for every feature. Based on the entropy values, we can eliminate the features with high entropy.

5. Feature reduction using mutual information content

In the field of data analysis, attribute reduction is a widely used preprocessing step. A key factor in attribute reduction is the dependency measure among the different features. One of the most widely used base measures is the correlation coefficient, where the correlation of two variables with continuous values ($X, Y$) can be given [13] with
$$r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_j (X_j - \bar{X})^2}\,\sqrt{\sum_j (Y_j - \bar{Y})^2}}.$$
In the literature, we can find many extensions of the base correlation measure, like the Fast Correlation-based Filter method [14] using a symmetric uncertainty measure. As the features in the transition-based graph construction model are mainly categorical variables, the dependency between two variables is usually given with a contingency table, which displays the multivariate frequency distribution of the variables. To measure the strength of association between the two variables, we can use measures like the contingency coefficient or the phi coefficient.

In our investigation we selected an information theory oriented approach, using mutual information content to measure the mutual dependence between the features. Mutual information can be calculated with
$$I(X, Y) = \sum_{x \in X,\, y \in Y} p_{X,Y}(x, y) \log \frac{p_{X,Y}(x, y)}{p_X(x)} - \sum_{x \in X,\, y \in Y} p_{X,Y}(x, y) \log p_Y(y),$$
where the joint distribution is $P_{(X,Y)}$ and the marginal distributions are $P_X$ and $P_Y$. The summation runs over the possible value pairs in the contingency table. This formula can also be expressed as
$$I(X, Y) = H(X) + H(Y) - H(X, Y),$$
where $H(X)$ is the entropy of the variable $X$ and $H(X, Y)$ denotes the joint entropy of $X$ and $Y$.
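A minimal sketch of how these two quantities could be computed for categorical feature columns (our own illustration; probabilities are estimated from frequencies as described above, and the sample data is made up):

```python
import math
from collections import Counter

def entropy(values):
    """H = -sum p log p, with probabilities estimated from frequencies."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def weighted_feature_entropy(feature_values, transitions):
    """H_f from Section 4: group the training cases by the feature value and
    sum |Omega_v| * H(P_Omega_v) over the groups (higher = weaker feature)."""
    groups = {}
    for v, t in zip(feature_values, transitions):
        groups.setdefault(v, []).append(t)
    return sum(len(g) * entropy(g) for g in groups.values())

def mutual_information(x, y):
    """I(X, Y) = H(X) + H(Y) - H(X, Y) for two categorical variables."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

# Tiny illustrative example (values are made up):
pos_b0 = ["NOUN", "VERB", "NOUN", "DET", "VERB", "NOUN"]
trans  = ["SHIFT", "LEFT-ARC", "SHIFT", "SHIFT", "RIGHT-ARC", "REDUCE"]
print(weighted_feature_entropy(pos_b0, trans))
print(mutual_information(pos_b0, trans))
```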
5.1. Similarity measure using normalized mutual information content

The mutual information measure can take any value from the set of non-negative real numbers. In order to use a normalized similarity value, we propose the $Esim(X, Y)$ measure given by
$$Esim(X, Y) = \frac{I(X, Y)}{\min(H(X), H(Y))}.$$
It can be shown in the following way that $0 \le Esim(X, Y) \le 1$. As $H(X), H(Y) \ge 0$ and $I(X, Y) \ge 0$, we have $Esim(X, Y) \ge 0$. On the other hand, we know that $H(X, Y) \ge \max(H(X), H(Y))$. This means that
$$I(X, Y) = H(X) + H(Y) - H(X, Y) \le H(X) + H(Y) - \max(H(X), H(Y)),$$
thus
$$Esim(X, Y) = \frac{I(X, Y)}{\min(H(X), H(Y))} = \frac{H(X) + H(Y) - H(X, Y)}{H(X) + H(Y) - \max(H(X), H(Y))} \le 1.$$
Based on the $Esim(X, Y)$ similarity measure, we can perform a clustering of the features to determine the groups of similar feature elements. The discovered clusters can be used to eliminate features that are very close to other features.

6. Experiments

6.1. Evaluating the features with the Hungarian data set

In our experiment we implemented four variants of the transition-based algorithm, namely ArcEager-Stack, ArcStandard-Stack, Projective-List and NonProjective-List (see [6]). We incorporated the feature list from the literature. Then we applied the following iterative experiment: train ⇒ evaluate ⇒ entropy calculation ⇒ feature reduction. In the feature reduction phase we omitted the feature with the maximum entropy. The numbers on the horizontal axis of Figure 1 show how many features have been omitted. The vertical axis shows the mean UAS (unlabeled attachment score) of the test, validation, and train data sets.

Figure 1: Feature reduction based on maximum entropy described in Section 4

It is interesting to see that omitting the high entropy features does increase the accuracy until we reach a point where even the highest entropy features are too valuable to leave out of the calculation. These results suggest that the maximum entropy calculation can be used to evaluate the features and to find the ones that can be eliminated from the training. This elimination helps reduce the calculation cost of the algorithms and even increases the accuracy.

6.2. Comparing the Hungarian result to English

We compared the Hungarian result with an English training set downloaded from the UD website (https://github.com/UniversalDependencies/UD_English-EWT). To eliminate the difference that comes from the different training set sizes, we shortened the English set to the same size as the Hungarian one (1800 sentences). As Figure 2 shows, our feature reduction procedure appears to be independent of the examined language. The other three algorithm variants produced similar results.

6.3. Comparing the maximum entropy based reduction to randomized reduction

Finally, we wanted to investigate whether maximum-entropy based elimination is actually more beneficial than just leaving out a randomly selected feature. In this final test, the features to be discarded were randomly selected and the UAS values were compared with the aforementioned results. It can be seen that the accuracy achieved with the randomly dropped features is constantly decreasing and falls below that of the entropy-selected features (see Figure 2).

Figure 2: Feature reduction using the ArcStandard-Stack algorithm, based on maximum entropy in English and Hungarian samples, and random elimination

Table 1: The 5 worst performing features

    Stack              List
    β[2]pos            β[2]pos
    β[1]pos            λ2[0]pos
    β[0]pos            λ2[n]pos
    σ[0]pos            λ1[1]pos
    β[0]pos β[1]pos    β[1]pos
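Before turning to the clustering experiment of the next subsection, here is a minimal sketch of the Esim measure of Section 5.1 and of a pairwise feature-similarity matrix (our own illustration, reusing the entropy and mutual_information helpers from the previous sketch; the column layout is an assumption):

```python
def esim(x, y):
    """Normalized similarity from Section 5.1: Esim(X, Y) = I(X, Y) / min(H(X), H(Y)),
    which always lies in the interval [0, 1]."""
    denom = min(entropy(x), entropy(y))
    return mutual_information(x, y) / denom if denom > 0 else 0.0

def esim_matrix(feature_columns):
    """Pairwise Esim values between feature columns (a dict mapping a feature
    name to its list of values over the training cases). The derived distances
    1 - Esim can feed an MDS mapping or a k-means clustering as in Section 6.4."""
    names = list(feature_columns)
    return {(a, b): esim(feature_columns[a], feature_columns[b])
            for a in names for b in names}
```

Feature pairs with an Esim value close to 1 are near-duplicates and are therefore candidates for elimination.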
6.4. Feature clustering with the Esim measure

In the mutual information content similarity approach, we used the $Esim(C, T_i)$ value to measure the importance of the feature $T_i$. This means that the relevance of a feature is given by its mutual information content related to the category label (transition code). As Figure 3 shows, both measures provided the same importance order of the features.

Figure 3: Comparison of feature priority lists (bottom items are the best) using the Esim and maximum entropy methods

Thus both methods can be used for feature reduction, as they provide the same or similar priority orders. On the other hand, we can mention an additional benefit of the Esim method, namely that it can also be used to measure the general similarity among the different features, independently of the category label. Based on the generated distance matrix, a clustering of the features can also be constructed. We have selected the MDS (Multidimensional Scaling) method to map the elements of the feature set into points of a Euclidean space. In our test results (see Figure 4), the single outlier point with a thick border denotes the category variable. We can use clustering techniques, like the k-means algorithm, to determine the groups of similar features. The MDS result in Figure 4 shows, for example, that S0_FORM_B0_FORM_POS and S0_FORM_POS_B0_FORM are very similar to each other. The related cluster-based feature reduction and the entropy-based reduction provide consistent results.

Figure 4: MDS mapping and clustering of the features based on Esim similarity

7. Conclusion

In our study, we analyzed a maximum entropy based feature reduction mechanism for dependency parser algorithms. After evaluating our results, we can say that the maximum entropy-based evaluation is well suited for investigating the relevance of features. It can be seen that the accuracy achieved with the randomly dropped features is constantly decreasing and falls below that of the entropy-selected features. It also appears that this mechanism is language independent. By reducing the number of features in the transition-based algorithms, we can achieve cost savings without reducing the accuracy of the parser.

Acknowledgements. The described study was carried out as part of the EFOP-3.6.1-16-00011 “Younger and Renewing University – Innovative Knowledge City – institutional development of the University of Miskolc aiming at intelligent specialization” project implemented in the framework of the Széchenyi 2020 program. The realization of this project is supported by the European Union, co-financed by the European Social Fund.

References

[1] Prószéky, G., Koutny, I., Wacha, B., A dependency syntax of Hungarian, Metataxis in Practice, (1989), 151–181.
[2] Klement, K. C., Frege and the Logic of Sense and Reference, Routledge, (2017).
[3] Tesnière, L., Eléments de Syntaxe Structurale, Klincksieck, Paris, (1959).
[4] Ágel, V., Fischer, K., Dependency grammar and valency theory, in: The Oxford Handbook of Linguistic Analysis, (2009).
[5] Sartorio, F., Improvements in Transition Based Systems for Dependency Parsing, PhD Thesis, Università degli Studi di Padova, (2015).
[6] Nivre, J., Algorithms for deterministic incremental dependency parsing, Computational Linguistics, 34(4), (2008), 513–553.
[7] Nivre, J., Hall, J., Nilsson, J., Memory-based dependency parsing, Proceedings of CoNLL, (2004), 49–56.
[8] Zhang, Y., Nivre, J., Transition-based dependency parsing with rich non-local features, Proceedings of ACL-HLT, (2011), 188–193.
[9] Vincze, V.,
Szauter, D., Almási, A., Móra, Gy., Alexin, Z., Csirik, J., Hungarian Dependency Treebank, (2010).
[10] Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J., Riedel, S., Yuret, D., The CoNLL 2007 shared task on dependency parsing, Proceedings of EMNLP-CoNLL, (2007), 915–932.
[11] Vincze, V., Farkas, R., Simkó, K., Szántó, Z., Varga, V., Univerzális dependencia és morfológia magyar nyelvre [Universal dependencies and morphology for Hungarian], XII. Magyar Számítógépes Nyelvészeti Konferencia, (2016).
[12] Farkas, R., Szántó, Z., Vincze, V., Zsibrita, J., Módosított morfológiai egyértelműsítés és integrált konstituenselemzés a magyarlanc 3.0-ban [Modified morphological disambiguation and integrated constituency parsing in magyarlanc 3.0], XII. Magyar Számítógépes Nyelvészeti Konferencia, (2016).
[13] Blessie, E. C., Karthikeyan, E., Sigmis: A feature selection algorithm using correlation based method, Journal of Algorithms and Computational Technology, 6(3), (2012), 385–394.
[14] Yu, L., Liu, H., Feature selection for high-dimensional data: A fast correlation-based filter solution, Proceedings of the 20th International Conference on Machine Learning (ICML-03), (2003).