Linguistic processing in lattice-based taxonomy construction Anastasia Novokreshchenova, Maria Shabanova, Dmitry Zaytsev and Nina Belyaeva Higher School of Economics, Moscow, Russia ?? Abstract. Building a lattice-based taxonomy over a text corpus with formal concept analysis (FCA) methods requires preliminary text pro- cessing that would enable construction of a context. We consider several natural language processing methods aimed at automatic attribute ac- quisition from texts. In particular, we derive attributes of three types: frequent words, latent topics and named entities. Afterwards, we con- struct a context for each type taking documents in the corpus as a set of objects. Then the corresponding concept lattices are built and pruned with the help of stability index in order to improve the readability of the diagrams. The proposed technique is illustrated on a collection of 26 texts in English dealing with political domain. In this case, the technique serves as a tool for deeper understanding of the interests of different polit- ical actors producing political texts by clarifying the connections between notions they use in them. 1 Introduction Constructing a taxonomy over a text corpus requires preliminary text process- ing. In this paper we explore capabilities of several natural language processing methods for extracting a set of attributes from the documents each taken as an object. The first method is keyword extraction which is based on the Vec- tor Space Model [1]. The second technique is Latent Dirichlet Allocation, an extension of probabilistic latent semantic analysis (pLSA) [2], which allows one to extract latent topics from word distributions. These two methods are based on the “bag-of-words” assumption—that the order of the words in documents can be beglected. The third technique we use for attribute extraction is Named Entity Recognition (NER), which involves linguistic analysis and part-of-speech- tagging (POS-tagging). Here, we apply these three methods to a collection of 26 texts dealing with political domain. After constructing an object-attribute matrix and building a lattice, fitering based on the stability index is applied in order to improve the readability of the diagrams. ?? This work is supported by project 10-04-0017 of the Scientific Foundation of the State University-Higher School of Economics “Discrete mathematical models for political analysis of democratic institutions and human rights”. Linguistic processing in lattice-based taxonomy construction 351 Our primary data consists of 26 texts corresponding to speeches of European and American leaders and spokepersons during the 2007-2010 period addressing their relations with Russia. These documents were collected by experts in polit- ical science as part of an interdisciplinary research project. From the viewpoint of political studies, the aims were (1) to define the context in which Russia is addressed in the speeches of Western leaders and international organizations, (2) to analyze the role and importance of democracy and human rights agenda. However, this text corpus is used here primarily as the testing ground for differ- ent methods of text analysis, and will be expanded in further research in order to reach a higher level of validity for the conclusions. 2 Building a context with frequent words In The Vector Space Model (VSM) [3] documents are represented as vectors of features consisting of the weights of the terms that occur within the collection. The term weighing we perform here is based on word frequences. This factor is called term frequency of a term j within a document i: nij tfij = P , k nik where nij is the number of occurrences of the word j in document i; the denom- inator is the total number of words in the document i. Constructing the Context and Stability-based Pruning Formal Con- cept Analysis (FCA) methods have been proven useful for representing a mean- ingful structure of a given knowledge community (such as a scientific community [5]) in a form of a lattice-based taxonomy (see [4] for FCA preliminaries). While constructing the context we let the documents be a set of objects and words that are most frequently used within each document be a set of attributes. The con- text construction involved two common techniques: stemming (using the Porter stemmer [6]) and elimination of stop words. Using the database of 26 political texts we have built a context made of documents and terms mentioned in each document most frequently. In particular, we took 20 most popular terms for each document according to tf measure. The resulting context contains 26 documents and 249 terms which yields a lattice of 453 formal concepts. The number of concepts is too large to be shown in a diagram. In order to obtain intelligible diagrams, we apply the pruning technique based on the notion of concept stability [7]. For a formal context K = (G, M, I) and a formal concept (A, B) of the context K the stability index is defined as follows: |{C ⊆ A | C 0 = B}| σ(A, B) = 2|A| The basic stability-based pruning method is to remove all concepts with stability below a fixed threshold. Of course, stable concepts (i.e., satisfying the 352 A. Novokreshchenova et al. Fig. 1. The most stable concepts in the political dataset. Figures in squares show the sizes of concept extents. chosen stability threshold) do not always form a lattice. However, this fact does not influence interpretation of results. The reduced substructure featuring the 31 most stable concepts is presented in Fig. 1. The diagrams were produced with Concept Explorer.[8] From this diagram it is possible to provide the following description concerning the area of discourse by European countries and the US the situation with Russia. The term “russia” is obviously a central issue—this word is one of the most frequent in 20 documents out of 26. It is also a parent for several associated subtopics such as concepts with intents {“security”, “russia”} and {“european”, “russia”}. On the whole, it may be concluded that according to word frequencies security issues and the relationships between Russia and Europe, as well as some global problems, are the most discussed topics. 3 Building a context with latent topics In this section we address an issue of probabilistic modeling of text. Its basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Such probabilistic topic approach to document modeling was introduced as probabilistic Latent Semantic Analysis (pLSA) [2]. Here we apply an extended probabilistic model Latent Dirichlet Allocation (LDA) [9] to our dataset of 26 political texts taking the number of latent topics equal to 20.1 As previously, before running the model, all words were stemmed with Porter algorithm and stop-words were eliminated. 1 The experiments were done with Matlab Topic Modeling Toolbox: http://psiexp.ss.uci.edu/research. Linguistic processing in lattice-based taxonomy construction 353 Table 1 presents word distributions for several of the obtained topics. The words in the columns are arranged according to probability assignment of a word to a certain topic. Table 1. Several Topics Obtained with LDA Model Economics Democracy America Georgia Energy Global Security Europe crisi right nation georgia russia global secur europ presid human unit russian russian polici today european finance govern nuclear intern interest institut member union econom peopl america georgian energy issu challeng idea system democraci american territori medvedev effort afganistan point govern work interest south issu respons strateg thing reform women futur order rule achiev face bring propos democrat weapon process trust approach matter global After applying the LDA model to the text corpora we construct a context taking topics as attributes for the documents—a topic is assigned to a document if the total number of times that words in this document were assigned to a particular topic exceeds a fixed threshold. The resulting substructure of the corresponding pruned lattice is presented in Fig. 2. Lists of words in squares represent the first six words assigned to a topic with the highest probability. Fig. 2. The lattice of the 17 most stable concepts of a context built from 26 documents and 20 topics. 354 A. Novokreshchenova et al. According to this lattice the most actual topics are those connected with Eu- ropean Union (topic represented by terms “europ”, “european”, “union”, “idea” and assigned to 12 documents), global problems (“global”, “polici”, “institut”, “issu”, “effort”, 14 documents) and security issues (“secur”, “today”, “member”, “challeng”, “afganistan”, “starteg”, 14 documents), as well as energy resources (“russia”, “relat”, “russian”, “interest”, “energi”, “medvedev”, 12 documents). There is a concept corresponding to the topic of Russian-Georgian conflict which contains 5 documents. In addition there is an isolated concept related to eco- nomic development of China (“peopl”, “work”, “commit”, “econom”, “help”, “china”, 9 documents). 4 Building a Context with Named Entities Name Entity Recognition (NER) is the process of finding mentions of fixed types of information in human language. From the 26 texts, we extracted 38 paragraphs that touch issues related to Russia. The paragraphs were processed with GATE2 system and three types of named entities were extracted—names of persons, organizations and geographical objects. We construct a context taking paragraphs as objects and organizations, persons and locations mentioned in a certain paragraph more than a fixed number of times as attributes. The resulting pruned lattice is presented in Fig. 3. Fig. 3. The lattice of the 21 most stable concepts of a context built from paragraphs and named entities. From this lattice it can be noticed that Europe and European Union are the most discussed topics as it appeared in previous results. It is worth noticing that 2 GATE: General Architecture for Text Engineering: http://www.gate.ac.uk/ Linguistic processing in lattice-based taxonomy construction 355 the United Nations (UN) is mentioned only in the context of Afghanistan, which in its turn is mentioned solely in the context of NATO. Topics corresponding to China and Ukraine form isolated concepts. 5 Conclusion The first method based on frequent words allowed us to identify what questions are raised most frequently by European and American leaders while talking about Russia, whereas latent topic modeling allowed us to specify and describe these issues more thoroughly. The results obtained with named-entity lattice are rather similar. We concluded that it could be more useful to merge NER with the LDA model, for instance, by taking named entities as a set of objects and topics as a set of attributes. Expanding the corpus of the texts and combining NER with the LDA model can be useful in testing the hypotheses addressed in political studies. On the whole, building lattice-based taxonomies using various language pro- cessing techniques is promissing for obtaining additional knowledge regarding open and hidden intentions of political actors. What is important for social sci- ences is that the knowledge is obtained without collecting any additional infor- mation and by a “neutral” tool—through deriving deeper connections between notions used by the actors in their texts. References 1. Salton G., Wong A., and Yang C. A Vector Space Model for Automatic Indexing. Communications of the ACM, vol. 18, pp. 613–620 (1975) 2. Steyvers M., Griffits T. Probabilistic Topic Models. Latent Semantic Analysis: A Road to Meaning, Laurence Erlbaum (2005) 3. Jurafsky D. and Martin J. Speech and Language Processing. An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition, Prentice Hall (2000) 4. Ganter B. and Wille R. Formal Concept Analysis, Mathematical Foundations. Springer Verlag, Berlin (1999) 5. Roth, C., Obiedkov, S., Kourie, D.G. Towards concise representation for tax- onomies of epistemic communities. In: Yahia, S.B., Nguifo, E.M. (eds.) Proc. CLA 4th Intl. Conf. on Concept Lattices and their Applications. LNCS/LNAI, vol. 4923, pp. 240–255. Springer (2006) 6. Porter M. An algorithm for suffix stripping. Program, vol. 14, pp.130–137 (1980) 7. Kuznetsov, S., Obiedkov, S., Roth, C. Reducing the representation complexity of lattice-based taxonomies. In: Priss, U., Polovina, S., Hill, R. (eds.) Conceptual Structures: Knowledge Architectures for Smart Applications: 15th Intl Conf on Conceptual Structures, ICCS 2007, Sheffield, UK. LNCS/LNAI, vol. 4604, pp. 241–254. Springer (July 2007) 8. Yevtushenko, S. System of data analysis “Concept Explorer”. (In Russian). Pro- ceedings of the 7th national conference on Artificial Intelligence KII-2000, p. 127– 134, Russia, 2000 9. Blei D., Ng A., Jordan M. Latent Dirichlet allocation. Journal of Machine Learning Research, vol. 3: pp. 993–1022 (2003)