Multigraph Representation for Lexical Chaining

Natalia Loukachevitch
Research Computing Center of M.V. Lomonosov Moscow State University
Moscow, Russia
louk@mail.cir.ru

Abstract

In this paper lexical cohesion modeling is considered. We argue that to model lexical cohesion in connected texts it is not enough to find related words in a text; it is also important to take into account relations between the entities participating in the described situations. Consideration of this factor makes it possible to develop more flexible lexical chaining algorithms and to construct lexical chains that correspond better to the discourse structure. A multigraph representation appears to be an appropriate basis for lexical cohesion modeling.

1 Introduction

One property of coherent texts is the presence of multiple lexical repetitions and closely related words. This phenomenon is called 'lexical cohesion' [4] and is usually modeled by means of lexical chains – sets of related words revealed on the basis of thesaurus relations. A lexical chain is a chain of words in which the criterion for inclusion of a word is some kind of cohesive relationship to a word that is already in the chain [13]. For example, in text (*) we can see repetitions (war crimes), the full name of a corporation and its abbreviation (BBC), the use of derivative words such as Ossetia, Ossetian, and part-whole relations, as between South Ossetia and Tskhinvali¹:

(*) The British Broadcasting Corporation is the first foreign news agency to be granted unrestricted access to the breakaway Georgian republic of South Ossetia since Georgian forces attacked the capital Tskhinvali. Journalists working for the BBC have unearthed evidence of Georgian war crimes against South Ossetian civilians. The indiscriminate use of force is a clear and serious violation of the Geneva Convention and can constitute a war crime.
So there are several evident lexical chains:

1) British Broadcasting Corporation, news agency, journalists, BBC
2) South Ossetia, Tskhinvali, South Ossetian
3) Georgian, Georgian, Georgian
4) war crimes, Geneva Convention, war crime

¹ The text is taken from the World Socialist Web Site (http://www.wsws.org/articles/2008/nov2008/geor-n10.shtml) and is fully based on the BBC report located at http://news.bbc.co.uk/1/hi/world/7692751.stm. In using this text example we do not wish to discuss any political positions; we need it to demonstrate the dynamic nature of lexical chains, whose construction should depend not only on static thesaurus knowledge but also on the text content and structure.

Automatic construction of lexical chains is considered an important step towards constructing the discourse structure and better understanding the text content. The first lexical chaining algorithm, based on Roget's thesaurus, was proposed in [13]. Later approaches usually utilize lexical relations from WordNet [1, 7, 16]. Graph representations of thesaurus relations are often used as a basis for lexical chaining [1, 12, 16]. Initially it was assumed that lexical chains are intuitively clear to text readers. Later it was shown that lexical chaining by humans is a very subjective procedure [6, 8, 14]: two readers can propose different lexical chains even for a small text. So in text (*) it is not clear whether lexical chains 2) and 3) are separate chains or should be joined into a single chain, because at the moment of text creation South Ossetia was officially a part of Georgia, and therefore South Ossetia and Georgia were closely related entities. In this paper we describe the main techniques of automatic construction of lexical chains and consider experiments showing distinctions in lexical chain construction between several annotators.
We argue that the subjectivity of manual construction of lexical chains is due to the fact that an important factor is not taken into account, and that the more appropriate form of description of lexical cohesion in a given text is not a graph but a multigraph with two types of relations between vertices.

2 Methods of Lexical Chain Construction

Most approaches to lexical chaining are based on lexical relations described in WordNet [1, 7]. Hirst and St-Onge [7] divide lexical cohesion relations from WordNet into three categories: extra-strong (repetitions), strong (synonyms and symmetric relations) and medium-strong relations. Each successive category is weaker than the previous one. Medium-strong relations correspond to paths in the WordNet conceptual structure of at most five links and can have different weights depending on path length. The main stages of the proposed construction of lexical chains are as follows:

- the construction of lexical chains begins from the first words of a text;
- to insert the next word, its relations with members of the existing lexical chains are checked;
- if such a relation with any element of a chain exists, the new word is inserted into that chain. Only one lexical chain can be chosen: among several possible lexical chains, the chain with the maximal weight of a relation to the current word is chosen;
- if the current word is ambiguous, the choice of a lexical chain determines the choice of a sense;
- for strong and medium-strong relations there are restrictions on the distance between the current word and existing lexical chains.

Hirst and St-Onge [7] point out problems of the lexical chaining process such as extra cohesive relations and missing relations.
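The greedy insertion procedure outlined above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the relation_weight() oracle stands in for the WordNet-based weighting of extra-strong, strong and medium-strong relations, and distance restrictions are omitted.

```python
# Hedged sketch of greedy lexical chaining in the spirit of Hirst and
# St-Onge [7]; relation_weight() is a hypothetical placeholder.

def relation_weight(word, chain_word):
    """Placeholder: return a cohesion weight (0 = unrelated).
    A real system would query WordNet relations here."""
    return 1.0 if word == chain_word else 0.0

def build_chains(words):
    chains = []  # each chain is a list of words
    for word in words:
        best_chain, best_weight = None, 0.0
        for chain in chains:
            # weight of the strongest relation to any chain member
            w = max(relation_weight(word, m) for m in chain)
            if w > best_weight:
                best_chain, best_weight = chain, w
        if best_chain is not None:
            best_chain.append(word)   # a word joins exactly one chain
        else:
            chains.append([word])     # no related chain: start a new one
    return chains
```

Note that a word is assigned to exactly one chain, the property that later sections argue is too restrictive.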
The problems arise from several sources: (1) limitations in the set of relations in WordNet, or a missing connection, especially the lack of situational relations (school – child care, physician – hospital); (2) inconsistency of link proximity in WordNet (stew and steak were not considered related because of a distance of 6 synsets, while public and professionals were considered related at a distance of 4 synsets); (3) incorrect or incomplete disambiguation.

Barzilay and Elhadad [1] pointed out that previously built lexical chains do not give enough information for correct disambiguation. Therefore they proposed to take all possible alternatives for word senses in a text and try to assign them to existing lexical chains. Barzilay and Elhadad define the best interpretation as the one with the most connections (edges in a graph). They define the score of an interpretation as the sum of its chain scores, determined by the number and weights of the relations between chain members. When the number of possible interpretations becomes larger than a certain threshold, the weak interpretations are pruned.

To add situational relations to the lexical chaining process, Stokes et al. [16] propose to use statistical associative relations of words in a text corpus. Medelyan [12] considers relations described in an information-retrieval thesaurus as situational relations. She defines a lexical chain as a graph G = (V, E) with nodes vi ∈ V being terms and edges (vi, vj, wij) ∈ E representing semantic relations between them, where wij is a weight expressing the strength of the relation. She proposes to build lexical chain candidates for the whole text; the resultant graph is then divided into lexical chains under the condition of a minimum path length between nodes m < 4. An algorithm for graph clustering divides the sparsely connected graph into dense lexical chains.

Most algorithms of lexical chaining assign a current word to only a single lexical chain.
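Interpretation scoring in the spirit of Barzilay and Elhadad can be sketched as follows. The sense inventory, the related() predicate and the unit edge weights are illustrative assumptions, and the exhaustive enumeration ignores the pruning step their system applies when the number of interpretations exceeds a threshold.

```python
# Hedged sketch of scoring sense interpretations by edge count;
# related() is a stand-in for a WordNet-based relatedness test.
from itertools import product

def related(sense_a, sense_b):
    """Placeholder for a thesaurus-based relatedness test."""
    return sense_a == sense_b

def score_interpretation(senses):
    """Score = number of related pairs (graph edges) under this assignment."""
    return sum(1 for i in range(len(senses))
                 for j in range(i + 1, len(senses))
                 if related(senses[i], senses[j]))

def best_interpretation(candidate_senses):
    """candidate_senses: one list of candidate senses per word.
    Returns the assignment with the most connections."""
    return max(product(*candidate_senses), key=score_interpretation)
```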
We found only one approach [8] in which the lexical chaining algorithm allows a current word to be included in several lexical chains. In this case the overgeneration problem arises: too many lexical chains are generated [8].

3 Subjectivity of Lexical Chains

Recently, considerable subjectivity of lexical chaining was revealed in experiments with human annotators. Hirst and Morris [6] introduce an example text (**) and show that even for such a short text the authors of the paper had different opinions on the available lexical chains:

(**) How can we figure out what a text means. One could argue that the meaning is in the mind of the reader, but some people think that the meaning lies within the text itself.

So one author thinks that there exist two lexical chains: an "understanding" chain including the words figure out, means, meaning, mind, think, meaning, and a "text" chain including the words text, reader, text. The other author also distinguishes two lexical chains, but the words means, meaning were assigned to the "text" chain.

In [6] an experiment in manual lexical chaining is described. A study was conducted with five participants as readers of a general-interest article from the Reader's Digest on the topic of movie actors. Subjects were instructed to read the article and mark the word groups they perceived, using a different color of pencil for each group. The subjects' groups were compared in a pairwise manner: for each pair of participants, the number of words on which they agreed was divided by the total number of words they used. Averaged over all pairs of participants, the agreement was 63%.

Hollingsworth and Teufel [8] describe an experiment comparing lexical chains created by different annotators for a scientific paper on computational linguistics. The task was to collect sets of related terms mentioned in the paper. A term can comprise a single word or a combination of words, all taken directly from the text.
As a result of the experiment, considerable subjectivity of manual lexical chaining was demonstrated. The first annotator built 12 lexical chains; the second annotator constructed 22 lexical chains. The main elements of the lexical chains (the most frequent terms in a chain) coincided for only 4 lexical chains. In this experiment all annotators assigned at least one term to several lexical chains. Thus the principle of assigning a term to a single lexical chain, utilized in automatic lexical chaining procedures, seems to be too restrictive.

4 Cohesive Harmony, Discourse Structure and Lexical Chaining

Hasan [5] introduces the concept of cohesive harmony, which presents an attempt to formalize the internal and external structure of sentences in texts. Cohesive harmony is based on cohesion chains and semantic relations between members of the chains. Semantic intrasentence relations are similar to the case relations of Fillmore [3] or conceptual relations in conceptual graphs [15], such as agent, object, instrument and so on. The actual rule for chain formation is that elements of chains can be joined together if (at least) two instances of the same conceptual relation exist between them. Hasan explains that "the source of unity ... resides in the fact that similar 'things' are said about similar/same 'entities', 'events' etc." [4, p.212]. Texts with more chains participating in cohesive harmony, and fewer chains left isolated, were consistently judged as more coherent [6].

From this consideration we can draw the following conclusion:

(***) If some entities participate in some situations or events in different roles more than once, then these entities should not be represented as members of the same lexical chain.

So in text (*) Georgia is several times mentioned as the agent of the attack, and South Ossetia is presented as the object of the attack. Therefore Georgia and South Ossetia should not be joined into the same lexical chain.
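Rule (***) can be illustrated with a small, hypothetical sketch: given predicate-argument triples (agent, event, object), we collect pairs of entities that interact in different roles more than once and treat them as "do-not-merge" constraints for chaining. The triples below are hand-made for text (*); extracting them automatically would require a semantic parser, which is outside the scope of this sketch.

```python
# Hypothetical illustration of rule (***): entities that repeatedly
# interact in different roles should not share a lexical chain.
from collections import Counter

def chain_constraints(triples, min_count=1):
    """triples: (agent, event, object) tuples extracted from a text.
    Returns entity pairs seen together more than min_count times."""
    counts = Counter()
    for agent, _event, obj in triples:
        counts[frozenset((agent, obj))] += 1
    return {pair for pair, n in counts.items() if n > min_count}

triples = [("Georgia", "attack", "South Ossetia"),
           ("Georgia", "war crimes", "South Ossetia")]
# Georgia and South Ossetia interact more than once, so they must
# stay in different chains:
print(chain_constraints(triples))
```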
[Figure 1: Main theme and its elaboration in subthemes.]

A similar conclusion can be reached by other means, on the theoretical grounds of van Dijk's theory [2]. Van Dijk describes the topical structure of a text, the macrostructure, as a hierarchical structure, in the sense that the theme of the whole text can be identified and summed up in a single macroproposition. The theme of the whole text can usually be described in terms of less general themes, which in turn can be characterized in terms of even more specific themes, and so on (in Figure 1, Si denotes a situation discussed in the main theme or in a subtheme of the text, Ci is an entity participating in the situation, and Ri is a relation between the situation and a participant). Every sentence of a text corresponds to a subtheme of the text.

The macrostructure of a connected text defines its global coherence: "Without such a global coherence, there would be no overall control upon the local connections and continuations. Sentences must be connected appropriately according to the given local coherence criteria, but the sequence would go simply astray without some constraint on what it should be about globally" [2, p.115-116]. So lexical chains should also have some restrictions on their development, but the proposed techniques impose no such constraints. In our opinion, the needed restriction follows from the specific function of lexical chains in the topical structure. Lexical chains are responsible not only for local connections between sentences; they also provide references between levels of the hierarchical structure of the text content. To refer to the main theme of the whole text, a subtheme has to include a main concept (a concept of the text macroproposition) or a concept related to it; in the sentences of a text such references look like lexical cohesion relations.
Besides, the sentences of a text have to elaborate the relations between the entities of the macroproposition. This means that many sentences have to include more than one entity from the macroproposition. Therefore, the more often two entities occur in the same sentences of a text, the more probable it is that they represent different elements of the macroproposition, and hence they should be assigned to different lexical chains to represent the text macrostructure correctly. So again we can draw a conclusion similar to (***):

(****) The more often words co-occur in the same clauses (simple sentences) of a text, the less they should be included in the same lexical chain.

Co-occurrence of words in the same clauses can be treated as the generalized conceptual relation R(X, Y) [15]. Our consideration means that in lexical chaining experiments with human annotators, the task should restrict similar-term extraction [6, 8] with the rule that if two entities interact with each other in a situation described in the text, they should not be joined into the same lexical chain. Recalling the proximity inconsistency problem of WordNet relations described in [7] (see Section 2), it should be stressed that this problem is due not only to the specific linguistic resource: lexical chaining algorithms need to be more flexible and take the word co-occurrence factor into account.

5 Use of Multigraphs for Lexical Chaining Algorithms

Let us return to text (**) and its lexical chains (see Section 3). It will be remembered that the two readers of the text disagreed on the allocation of the words means, meaning to the lexical chain figure out, mind, think or to the lexical chain text, reader, text. We can analyze this text using information about co-occurrences of words in the text.
In such a small text the words means, meaning were mentioned three times in the same sentences as the words text, reader:

  what a text means
  the meaning is in the mind of the reader
  the meaning lies within the text itself

This indicates that text (**) is devoted to consideration of the relation "text – its meaning". Text and Meaning present different elements of the topical structure, and therefore the words text and meaning should belong to different lexical chains to correspond to the discourse structure of the text. At the same time, the words means, meaning should also not be assigned to the other lexical chain figure out, think, because these verbs govern clauses including the words means, meaning. So the relation between figure out, think and meaning is an important issue of the text fragment:

  figure out what a text means...
  think that the meaning lies within the text itself.

In our opinion, the words means, meaning, meaning should be included in neither of the two lexical chains but should form a separate lexical chain. So the lexical chains for text (**) should be as follows:

1) text, reader, text
2) figure out, think
3) means, meaning, meaning

[Figure 2: Fragment of the conceptual net for text (*****): the concepts country and region are linked by is_a relations to USA, Georgia, Russia and Ossetia, and by part-whole relations to Washington, Tbilisi, Saakashvili, North Ossetia, South Ossetia and Tskhinvali.]

Now we would like to address a fuller version of text (*):

(*****) The British Broadcasting Corporation is the first foreign news agency to be granted unrestricted access to the breakaway Georgian republic of South Ossetia since Georgian forces attacked the capital Tskhinvali. Journalists working for the BBC have unearthed evidence of Georgian war crimes against South Ossetian civilians. The indiscriminate use of force is a clear and serious violation of the Geneva Convention and can constitute a war crime. Consequently many people wrongly believe that Russia precipitated the conflict by invading Georgia.
The stubborn fact remains that it was Georgian military forces, no doubt following consultation with American military "advisers", who bombarded South Ossetia's small town capital Tskhinvali. After consulting with Washington, the Saakashvili administration in Tbilisi attempted to take back the region whilst the world's attention was focused upon the opening ceremony of the Beijing Olympics. At approximately 23.30 local time on August 7, the Georgian military mounted an exceptionally heavy artillery attack on Tskhinvali. Moscow responded rapidly by advancing through the Roki Tunnel, which links South Ossetia with North Ossetia in Russia, crushing Georgian forces within a couple of days. Since the outbreak of hostilities, accusations of war crimes have flown between the two sides. Saakashvili employed an American PR company to amplify his claims of Russian atrocities.

In this text several countries, cities and regions are mentioned. In the conceptual net (see Figure 2) the corresponding entities are situated very close to each other. If the lexical chaining rules described in [7] (see Section 2) are used, all the highlighted words tend to be included in the same lexical chain. But in fact there are four distinct participants (Russia, South Ossetia, Georgia and the USA) interacting with each other, and this has to be represented using four different lexical chains.
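Clause-level co-occurrence counts of the kind used in this analysis can be computed with a minimal sketch, assuming the text has already been split into clauses and each clause has been mapped to the set of participants it mentions (the three clauses below are hand-made for illustration).

```python
# Minimal sketch of clause-level co-occurrence counting; the clause
# segmentation and entity mapping are assumed to be done elsewhere.
from collections import Counter
from itertools import combinations

def cooccurrence_counts(clauses):
    """clauses: list of sets of participant labels, one set per clause.
    Returns a Counter over unordered participant pairs."""
    counts = Counter()
    for participants in clauses:
        for a, b in combinations(sorted(participants), 2):
            counts[(a, b)] += 1
    return counts

clauses = [{"Georgia", "South Ossetia"},
           {"Georgia", "South Ossetia"},
           {"Russia", "South Ossetia"}]
print(cooccurrence_counts(clauses))
```

Aggregating such counts over manually constructed chains rather than individual words yields a table of interactions between the main participants.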
Looking at the text, we can see the following co-occurrences in the same clauses of the text (Table 1):

Table 1: Clause co-occurrences of the countries and regions mentioned in the text

                   South Ossetia   Georgia*   Russia   USA
  South Ossetia          -            6          3      1
  Georgia*               6            -          3      3
  Russia                 3            3          -      1
  USA                    1            3          1      -

  * without South Ossetia

In Table 1 we presented not the initial concepts but manually constructed lexical chains, and the picture of interactions is practically evident:

  Georgia: Georgia, Georgian, Saakashvili, Tbilisi
  Russia: Russia, Russian, Moscow, North Ossetia
  South Ossetia: South Ossetia, Tskhinvali, Roki Tunnel
  USA: American, Washington

But to process a text automatically, a conceptual graph has to be generated from the text. The vertices of the graph are not words but concepts, because the representation of the text content has to be conceptual, not lexical: concepts can be expressed in a text by different word forms, derivative words and synonyms. The graph is a multigraph with two types of edges between vertices. One type of edge, Rconc, presents conceptual relations between entities from a thesaurus or an ontology; the other type of edge, Rclauses, depicts co-occurrences of entities in the same clauses of the text (Figure 3). Both types of relations, as well as the vertices, are labeled with frequencies. The labels of vertices are the frequencies of entities in the text. The labels of co-occurrence relations are the frequencies of the corresponding co-occurrences. The labels of conceptual relations are the frequencies of co-occurrences within several sentences but not in the same clauses of the text. So the conceptual multigraph MG of the thematic representation can be defined as the 6-tuple MG = (V, fv, Rconc, frconc, Rclauses, frclause).

To partition the multigraph into subgraphs that represent the different interacting entities of a text, we propose to carry out the following steps:

1. At the first step the most frequent vertex (concept C0) is chosen. We begin from the most frequent concepts because they tend to be more important and better reveal their behaviour in the text.
It should be stressed that the most frequent word of a lexical chain is often considered as a representative of the lexical chain [1, 8].

[Figure 3: Multigraph with two types of edges for the concepts GEORGIA (fv = 8), SOUTH OSSETIA (fv = 4) and TSKHINVALI (fv = 3).]

2. For the next frequent concept Ci we check whether it can be joined to an existing subgraph. Ci can be included in an existing subgraph with main vertex C0 if Ci is located on an allowable conceptual path from C0 and one of the following conditions is fulfilled:
   2a) frconc(C0, Ci) > frclause(C0, Ci), or
   2b) frclause(C0, Ci) = 1.
   These conditions ensure that the relation (C0, Ci) is really used for establishing lexical cohesion in the text. For several concepts with the same frequencies, the text order can be used.

3. The frequencies of vertices (and of the corresponding edges) joined to the same subgraph are summed up.

4. A vertex Ci can belong to two subgraphs G0 and G1 if:
   4a) Ci is not the main vertex of these subgraphs, or
   4b) Ci is the main vertex of subgraph G1 and fv(Ci) > frconc(C0, Ci), where C0 is the main vertex of G0.

In our text (*****) the most frequent concept is GEORGIA (fv = 8). The next concept is SOUTH OSSETIA (fv = 4). The edges between GEORGIA and SOUTH OSSETIA have frequencies frconc = 2 and frclause = 2, so the concept SOUTH OSSETIA begins its own subgraph. The next concept is TSKHINVALI (fv = 3). The TSKHINVALI vertex can potentially be joined to both subgraphs, but the edges between GEORGIA and TSKHINVALI have frequencies frconc = 1 and frclause = 2, while the edges between SOUTH OSSETIA and TSKHINVALI have frequencies frconc = 2 and frclause = 1. So TSKHINVALI is joined to the SOUTH OSSETIA subgraph. The next concept is RUSSIA (fv = 3). At this stage there is no allowable path to the main concepts of the existing subgraphs, so the vertex RUSSIA begins a new subgraph. The concept USA (American, fv = 2) also begins a new subgraph. The next vertex is the SAAKASHVILI vertex (fv = 2).
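The core of the partition procedure (steps 1 and 2) can be sketched as follows. This is a simplified illustration under stated assumptions: allowable_path() and the frequency tables stand in for the thesaurus-based machinery, and steps 3 and 4 (frequency summation and multiple membership) are omitted for brevity.

```python
# Sketch of the multigraph partition, steps 1-2 only; allowable_path()
# and the input frequency tables are illustrative stand-ins.

def partition(fv, fr_conc, fr_clause, allowable_path):
    """fv: concept -> frequency; fr_conc / fr_clause: (a, b) -> edge
    frequency; allowable_path(a, b): conceptual-path test."""
    def edge(table, a, b):
        return table.get((a, b), table.get((b, a), 0))

    subgraphs = []  # list of (main_vertex, members)
    # step 1: process concepts from the most frequent downwards
    for c in sorted(fv, key=fv.get, reverse=True):
        for main, members in subgraphs:
            conc, clause = edge(fr_conc, main, c), edge(fr_clause, main, c)
            # step 2: join only when the conceptual link dominates
            if allowable_path(main, c) and (conc > clause or clause == 1):
                members.append(c)
                break
        else:
            subgraphs.append((c, [c]))  # start a new subgraph
    return subgraphs

fv = {"GEORGIA": 8, "SOUTH OSSETIA": 4, "TSKHINVALI": 3}
fr_conc = {("GEORGIA", "SOUTH OSSETIA"): 2, ("GEORGIA", "TSKHINVALI"): 1,
           ("SOUTH OSSETIA", "TSKHINVALI"): 2}
fr_clause = {("GEORGIA", "SOUTH OSSETIA"): 2, ("GEORGIA", "TSKHINVALI"): 2,
             ("SOUTH OSSETIA", "TSKHINVALI"): 1}
# GEORGIA keeps its own subgraph; TSKHINVALI joins SOUTH OSSETIA,
# reproducing the walkthrough of text (*****):
print(partition(fv, fr_conc, fr_clause, lambda a, b: True))
```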
It can be joined to the GEORGIA subgraph (frconc = 2 and frclause = 0), and so on. Thus we obtain the four main participants of the situation described in the example text.

In [9] we have already described an algorithm for constructing specific types of lexical chains – thematic nodes. But in that approach the co-occurrence of concepts was used only for the selection of the most important thematic nodes for the text content. We used the thematic representation of documents as a basis for conceptual indexing [10], automatic text categorization [11] and summarization [9]. Now we propose to utilize co-occurrences of concepts for more correct and flexible lexical chaining.

Conclusion

In this paper we showed that to model lexical cohesion in connected texts it is not enough to find related words in a text. It is important to take into account relations between the entities participating in the described situations. A simple way to do this is to take into consideration co-occurrences of words in the same sentences of the text, because the more often words co-occur in the same clauses (simple sentences) of a text, the more probable it is that they relate to different interacting entities. The co-occurrence factor allows us to make lexical chaining algorithms more flexible and better corresponding to the discourse structure. A multigraph representation appears to be an appropriate basis for lexical cohesion modeling. The factor of conceptual relations between entities should also be considered in lexical chaining experiments with human annotators, to decrease the subjectivity of chaining.

Acknowledgements

Partial support for this research is provided by the Russian Foundation for Fundamental Research through grant # 09-06-00390-а.

References

[1] Barzilay R. & Elhadad M. (1998). Using Lexical Chains for Text Summarization. In: ACL/EACL Workshop on Intelligent Scalable Text Summarization, Madrid.

[2] Dijk van T. (1985). Semantic discourse analysis. In: Teun A. van Dijk (Ed.), Handbook of Discourse Analysis, vol. 2.
(pp. 103-136). London: Academic Press.

[3] Fillmore, Charles J. (1968). The Case for Case. In: Bach and Harms (Eds.), Universals in Linguistic Theory. New York: Holt, Rinehart, and Winston, pp. 1-88.

[4] Halliday M. & Hasan R. (1976). Cohesion in English. London: Longman.

[5] Hasan R. (1984). Coherence and Cohesive Harmony. In: J. Flood (Ed.), Understanding Reading Comprehension, pp. 181-219. Newark, DE: IRA.

[6] Hirst G. & Morris J. (2005). The Subjectivity of Lexical Cohesion in Text. In: James G. Shanahan, Yan Qu, and Janyce Wiebe (Eds.), Computing Attitude and Affect in Text. Dordrecht, The Netherlands: Springer, pp. 41-48.

[7] Hirst G. & St-Onge D. (1998). Lexical Chains as Representations of Context for the Detection and Correction of Malapropisms. In: C. Fellbaum (Ed.), WordNet: An Electronic Lexical Database and Some of Its Applications. Cambridge, MA: The MIT Press.

[8] Hollingsworth W. & Teufel S. (2005). Human Annotation of Lexical Chains: Coverage and Agreement Measures. In: Proceedings of the Workshop "ELECTRA: Methodologies and Evaluation of Lexical Cohesion Techniques in Real-world Applications", SIGIR 2005, Salvador, Brazil.

[9] Loukachevitch N. & Dobrov B. (2000). Thesaurus-Based Structural Thematic Summary in Multilingual Information Systems. Machine Translation Review, N 11, pp. 10-20.

[10] Loukachevitch N.V. & Dobrov B.V. (2002). Evaluation of Thesaurus on Sociopolitical Life as Information Retrieval Tool. In: M. Gonzalez Rodriguez & C. Paz Suarez Araujo (Eds.), Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), Vol. 1, Gran Canaria, Spain, pp. 115-121.

[11] Loukachevitch N.V. & Dobrov B.V. (2003). Knowledge-Based Text Categorization of Legislative Documents. In: F. Kiefer & J. Pajzs (Eds.), Proceedings of the 7th Conference on Computational Lexicography and Text Research (COMPLEX 2003), Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, pp. 57-66.

[12] Medelyan O. (2007).
Computing Lexical Chains with Graph Clustering. In: Proceedings of the ACL 2007 Student Research Workshop, pp. 85-90.

[13] Morris J. & Hirst G. (1991). Lexical Cohesion Computed by Thesaural Relations as an Indicator of the Structure of the Text. Computational Linguistics, 17(1), pp. 21-45.

[14] Morris J., Beghtol C. & Hirst G. (2003). Term Relationships and Their Contribution to Text Semantics and Information Literacy through Lexical Cohesion. In: Proceedings of the 31st Annual Conference of the Canadian Association for Information Science, Halifax, Canada.

[15] Sowa, John (1984). Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley.

[16] Stokes N., Hatch P. & Carthy J. (2000). Lexical Semantic Relatedness and Online News Event Detection. In: Proceedings of the 23rd Annual ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-00), pp. 324-325.