Modeling of Lexical Relations Between Topics Retrieved from DBLP Journals

Lukáš Hlaváček and Michal Výmola

Department of Computer Science, FEI, VSB - Technical University of Ostrava, 17. listopadu 15, 708 33, Ostrava-Poruba, Czech Republic
{lukas.hlavacek, michal.vymola}@vsb.cz

Abstract. In this paper we present a method for extracting topics from the aims and scopes of DBLP journals and for the subsequent construction of their hierarchical order. We focus on the semantic relations between topics. Our method is fully automatic, although manual cleaning of the topic database would lead to much better accuracy. Another purpose of our work is to provide a system similar to the ACM classification system; in contrast to ACM, we want to provide a better, newer, and fully automatic one.

Keywords: DBLP, ACM, lexical acquisition, text mining

1 Introduction

This paper describes a method for organizing semantically retrieved topics into a hierarchy, represented as a graph where the nodes are topics and the edges (also called links) represent relationships between them; the levels in our graph represent levels in the hierarchy. We first present how to retrieve the data from plain text, then the algorithm itself, and in the last part the testing and an analysis of the results.

Methods for automatic topic retrieval from plain text have not been well covered yet. Of course, our technique is applicable to many other problems requiring word categorization with semantic influence. Existing methods are primarily focused on the semantics of ordinary words; we mention some of them in the next chapter. In this paper, however, we focus especially on the semantics of topics, because topics have special semantics.

Our motivation was Widdows' method described in [1], but we apply his technique to topics retrieved from DBLP. For testing we could not use the WordNet database as Widdows did, because this database contains only information about single words (synonyms), while topics mostly consist of more words (2-3). That is why we chose the ACM classification (latest version, from 1998).

ACM is widely recognized as the premier membership organization for computing professionals, delivering resources that advance computing as a science and a profession, enable professional development, and promote policies and research that benefit society. ACM hosts the computing industry's leading Digital Library and Guide to Computing Literature, and serves its global members and the computing profession with journals and magazines, conferences, workshops, electronic forums, and online books and courses.

ACM's first classification system for the computing field was published in 1964. Then, in 1982, the ACM published an entirely new system. New versions based on the 1982 system followed in 1983, 1987, 1991, and 1998 [8]. The ACM classification is thus already more than 10 years old and does not contain many new topics, for instance social networks or wireless networks.

The DBLP data source, which is representative of conventional database applications, is maintained by a single source.
It is one of the best formatted and organized bibliography datasets. DBLP covers approximately 400,000 researchers who have publications in major Computer Science publication venues. Bibliographic datasets have been used for social network analysis, such as studying the structure and the spread of influence in scientific communities.

2 Previous work

In this paper we continue D. Widdows' work. He presented a method for assembling semantic knowledge from a part-of-speech tagged corpus using graph algorithms. His graph model is built by linking pairs of words which participate in particular syntactic relationships.

In this section we present some of the most important previous methods. Most work on automatic lexical acquisition has been based at some point on the notion of semantic similarity. Roark and Charniak described a generic algorithm for extracting lists of similar words using the notion of semantic similarity in [7], reporting accuracies of 17% and 35%. These early results were improved upon by Riloff and Jones in [6], where a mutual bootstrapping approach is used to extract words in particular semantic categories, together with expression patterns for recognizing relationships between these words, for the purposes of information extraction. The accuracy achieved in this experiment was about 78%.

Another way to obtain word senses directly from corpora is to use clustering algorithms on feature vectors. A general problem of such clustering techniques lies in the question of how many clusters one should have, i.e. how many senses are appropriate for a particular word in a given domain. Lin's approach to this problem in [10] is to build a similarity tree (using what is in effect a hierarchical clustering method) of words related to a target word. Another approach was described by T. P. Martin and M. Azmi-Murad in [8].

Widdows described a new incremental algorithm for extracting categories of similar words. He defined the following method for constructing a hierarchy of words, reflecting how they depend on each other.

Definition 1. Let A be a set of nodes and let N(A) be the neighbours of A, i.e. the nodes linked to any a ∈ A (so N(A) = ⋃_{a∈A} N(a)). The best new node is taken to be the node b ∈ N(A) \ A with the highest proportion of links to N(A). More precisely, for each u ∈ N(A) \ A, let the affinity between u and A be given by the ratio

    af(N(u), N(A)) = |N(u) ∩ N(A)| / |N(u)|.    (1)

If N(u) = ∅ then af(N(u), N(A)) = 0. The best new node b ∈ N(A) \ A is the node which maximises this affinity score [1].

Here it depends which seed topic is taken as the first one. This algorithm may find, for a particular topic A, some other node B as the best node; but it does not have to find topic A as the best node for topic B when B is a seed topic.
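To make Definition 1 concrete, the following C# sketch shows one way the affinity score can be computed, assuming topics are represented by integer indices and the graph by neighbour sets; the identifiers (AffinityExample, Neighbours, Affinity) are ours and do not come from [1].

using System.Collections.Generic;
using System.Linq;

static class AffinityExample
{
    // N(A): the union of the neighbour sets of all nodes in A.
    public static HashSet<int> Neighbours(Dictionary<int, HashSet<int>> graph, IEnumerable<int> a)
    {
        var result = new HashSet<int>();
        foreach (var node in a)
            if (graph.TryGetValue(node, out var n))
                result.UnionWith(n);
        return result;
    }

    // af(N(u), N(A)) = |N(u) ∩ N(A)| / |N(u)|, and 0 if N(u) is empty.
    public static double Affinity(HashSet<int> nU, HashSet<int> nA)
    {
        return nU.Count == 0 ? 0.0 : (double)nU.Intersect(nA).Count() / nU.Count;
    }
}

The best new node b is then simply the candidate u ∈ N(A) \ A that maximises Affinity(Neighbours(graph, new[] { u }), nA), where nA = Neighbours(graph, A).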
3 Methodology

This chapter describes our approach to retrieving topics from text and to constructing the topic hierarchy. The source of our data are manually extracted aims and scopes of DBLP journals, followed by the automatic extraction of particular topics from these texts.

3.1 The algorithm for creating the dictionary

Let R be the set of abstracts, where r_i ∈ R is the abstract of a journal. Then let N be the set of improper words (a negative dictionary), where n_m ∈ N is an improper word in the negative dictionary. We also define δ, which represents the maximal number of words in a topic; for example, if δ = 3 then a topic contains at most three words. The function split is a standard function of the C# programming language and splits a text by the given separator.

The algorithm is based on searching for specific important conjunctions in sentences (and, or, comma, semicolon) and for words within a short distance of these conjunctions. These conjunctions have different priorities, the most important being and. Other special characters were removed. In the case of the conjunction and it is necessary to analyze the topic correctly, which we do in the following way:

Fig. 1. Dividing of a topic containing and

Fig. 1 shows an example of a syntactically compound topic with and. The arrows show which word needs to be added and where, for instance software and hardware architecture, or, as another example, software architecture and hardware. Fig. 1 suggests an algorithm for topics consisting of two words; a similar algorithm exists for three-word topics. When creating such rules it is important to proceed from the lexical structure of multi-word conjunctions. It may happen that this method divides a topic into two new topics that were not supposed to be divided, e.g. Data terminals and printers. The other conjunctions usually do not have this issue, and for the semantics of topics they also share the same priority.

The next step is to save the parsed topics into an incidence matrix, which captures the relations between particular nodes. Then we can easily set their level of importance according to the frequency of their occurrence in the text.

Algorithm 1 The algorithm for creating the dictionary
Require: δ = 3, R = set of abstracts, N = set of improper words
  for all r_i ∈ R do
    S = split(r_i, '.')  {where S is the set of sentences of r_i}
    for all s_j ∈ S do
      W = split(s_j, conjunction)  {where W is the set of potential topics of s_j}
      for all w_k ∈ W do
        if |w_k| ≤ δ and w_k ∉ N then
          add w_k to T  {where T is the set of topics}
        end if
      end for
    end for
  end for

3.2 The algorithm for creating the graph of topics

Let T be the set of topics, where t_i ∈ T is a topic in the dictionary of topics. Let R be the set of abstracts, where r_i ∈ R is the abstract of a journal. We also use an incidence matrix I of size m × n that represents the links between words in the dictionary, and we define ε, which represents the required strength of the relationship between two words in the incidence matrix. In the first phase we analyze each journal separately and search for one-, two- and three-word topics.

Algorithm 2 The algorithm for searching for similar topics
Require: I (an m × n matrix, initially empty), m = n = |T|, ε = required strength of relationship, A = the set of seed topics, x = number of topics
  for all r_i ∈ R do
    S = split(r_i, '.')  {where S is the set of sentences of r_i}
    for all s_j ∈ S do
      W = split(s_j, conjunction)  {where W is the set of potential topics of s_j}
      compute Z = W ∩ T  {where Z is the set of topics in the sentence}
      for all z_m, z_n ∈ Z do
        create a link between z_m and z_n in I, where i_{z_m z_n} = i_{z_m z_n} + 1
      end for
    end for
  end for
  for all i_{mn} ∈ I do
    if i_{mn} < ε then
      i_{mn} = 0
    end if
  end for
  while |A| < x do
    compute N(A) from matrix I
    for all b_i ∈ N(A) \ A do
      compute N(b_i) from matrix I
      compute af(N(b_i), N(A))
    end for
    add the b_i with the highest affinity to A  (A = A ∪ {b_i})
  end while
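The growth phase of Algorithm 2 can be illustrated with a short C# sketch; this is a minimal reading of the pseudocode, assuming the incidence matrix I has already been filled and thresholded by ε, and the seed set A is extended one topic at a time until it contains x topics. The class and identifier names are illustrative only.

using System.Collections.Generic;
using System.Linq;

static class TopicGraph
{
    // N(t): all topics linked to topic t in the thresholded incidence matrix I.
    public static HashSet<int> Neighbours(int[,] I, int t)
    {
        var n = new HashSet<int>();
        for (int j = 0; j < I.GetLength(1); j++)
            if (I[t, j] > 0) n.Add(j);
        return n;
    }

    // The while-loop of Algorithm 2: extend the seed set A until it has x topics.
    public static void Grow(int[,] I, HashSet<int> A, int x)
    {
        while (A.Count < x)
        {
            var nA = new HashSet<int>();               // N(A)
            foreach (var a in A) nA.UnionWith(Neighbours(I, a));

            int best = -1;
            double bestAf = -1.0;
            foreach (var b in nA.Except(A))            // candidates b ∈ N(A) \ A
            {
                var nB = Neighbours(I, b);
                double af = nB.Count == 0 ? 0.0
                          : (double)nB.Intersect(nA).Count() / nB.Count;
                if (af > bestAf) { bestAf = af; best = b; }
            }
            if (best < 0) break;                       // no candidates left
            A.Add(best);
        }
    }
}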
4 Testing

We identified about 750 journals in DBLP. For testing we used 500 aims and scopes from these journals, containing about 90,000 words in total. This text was parsed by Algorithm 1 into about 4,000 topics. During testing we found that for our topic-database size we obtain the best results if we take the first 1,000 topics with the highest frequency of occurrence. About 3,000 topics are thus left out, but their rating is too low to affect the result significantly. We arrived at this setting in the following way: first we took 400 topics, but we then found only a few related topics. We therefore doubled the number of topics and continued in this way up to 2,000 topics, observing the generated groups. We also focused on the reliability of the topics in particular groups, so that we do not obtain topics unrelated to the seed topic of a group. We set a limit of up to 10 good topics in each group, where possible; for some topics, however, the algorithm finished with a lower number. From all the results we found that for our database it is optimal to take the first 1,000 topics with the highest number of occurrences. We compared our results to the ACM hierarchy.

5 Conclusion

Most of the previous works used the WordNet database for testing. Because topics are semantically different, we could not use it; instead we used the ACM hierarchy database, a database of hierarchically ordered topics. It can also be downloaded in XML, so it is very easy to work with. Even though we are not able to measure accuracy as a single number, we can assess conformity with ACM.

Fig. 2. Graph generated for the topic Artificial Intelligence

The result is strongly affected by the age of the last release of the ACM hierarchy (1998), when many topics, for instance social networks or business intelligence, did not yet exist. Our algorithm generated about 5,000 topics automatically from 500 journals; a manual inspection found that approximately 10% of them were not good topics. The latest ACM database consists of 1,200 topics. We chose some topics common to ACM and to our results and compared them. Since we found only about 20 percent of the topics to be the same in both databases, it does not make much sense to compare them directly. One difference between the two databases is that, for instance, ACM contains the single topic graphic, whereas our algorithm generates computer graphic and 8 other topics containing the word graphic. This is also a reason why our hierarchy is so much larger.

Figure 4 shows how the algorithm adds every new topic into the hierarchy. First, the topics directly connected to the seed topic are added as the first level; then the topics with the highest affinity to the first-level topics are added to level 2. The algorithm iterates in the same way for the next levels.

In Table 1 we present the results for a few randomly chosen seed topics. We can see that our method generates topics correctly related to the seed topic. However, there may be some topics not related to the seed word; these topics are marked in the table in italics. Compared to ACM we obtained groups of much better related topics, and the more aims and scopes we have, the better the results we get.

Fig. 3. ACM graph for the topic Artificial Intelligence
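As a minimal sketch of the level construction just described (cf. Fig. 4), the following C# fragment assigns each topic the level at which it enters the hierarchy. We simplify here by numbering levels breadth-first over an already selected topic graph, whereas the algorithm itself picks each next level by the highest affinity to the previous one; all identifiers are illustrative.

using System.Collections.Generic;

static class Hierarchy
{
    // Level of each topic: seed = 0, topics directly linked to the seed = 1, etc.
    public static Dictionary<int, int> Levels(Dictionary<int, HashSet<int>> edges, int seed)
    {
        var level = new Dictionary<int, int> { [seed] = 0 };
        var queue = new Queue<int>();
        queue.Enqueue(seed);
        while (queue.Count > 0)
        {
            int t = queue.Dequeue();
            if (!edges.TryGetValue(t, out var neighbours)) continue;
            foreach (var u in neighbours)
            {
                if (level.ContainsKey(u)) continue;
                level[u] = level[t] + 1;   // u enters one level below t
                queue.Enqueue(u);
            }
        }
        return level;
    }
}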
References

1. Widdows, D., Dorow, B.: A Graph Model for Unsupervised Lexical Acquisition. In COLING, 2002.
2. Widdows, D., Ferraro, K.: Semantic Vectors: A Scalable Open Source Package and Online Technology Management Application. In Proceedings of the Sixth International Conference on Language Resources and Evaluation, 2008.
3. Widdows, D.: Orthogonal Negation in Vector Spaces for Modelling Word-Meanings and Document Retrieval. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, 2003.
4. Rieffel, E.: Certainty and Uncertainty in Quantum Information Processing. In Bruza et al. (2007), 134-141.
5. Widdows, D., Bruza, P.: Quantum Information Dynamics and Open World Science. In Bruza et al. (2007), 126-133.
6. Riloff, E., Jones, R.: Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, 1999.
7. Roark, B., Charniak, E.: Noun-Phrase Co-occurrence Statistics for Semi-Automatic Semantic Lexicon Construction. In COLING-ACL, 1998.
8. Martin, T. P., Azmi-Murad, M.: An Incremental Algorithm to Find Asymmetric Word Similarities for Fuzzy Text Mining. In WSTST 2005, 838-847.
9. Communications of the ACM. ACM, New York, NY, USA.
10. Lin, D.: Automatic Retrieval and Clustering of Similar Words. In COLING-ACL, Montreal, 1998.

Fig. 4. Hierarchical graph generated for Wireless technology

Table 1. Groups of topics for chosen seed topics

Seed topic: artificial intelligence
Topics: information retrieval, digital library, mechanical engineering, data warehousing, logic programming, machine learning, evolutionary computation, quantum computing, artificial life, learning system, neural network

Seed topic: wireless
Topics: spread-spectrum system, cellular system, ip, adaptive antenna, network protocol, telecommunications network, computer communication, lan, man, satellite network, wireless computing

Seed topic: computer graphic
Topics: image processing, computer vision, speech recognition, data mining, machine learning, digital library, evolutionary computation, artificial life, learning system, data warehousing

Seed topic: mathematical method
Topics: computational method, numerical mathematic, computational mathematic, stochastic process, set theory, probability theory, matroid theory, computer scientist, coding theory, differential equation

Seed topic: process engineering
Topics: production engineering, electronics engineering, unmanned system, electrical engineering, chemical engineering, mechanical engineering, traffic engineering, applied mathematic

Seed topic: language research
Topics: language technology, human-computer interaction, decision support, computational linguistic, spatial reasoning, qualitative reasoning, global warming, semantic web, object-oriented programming

Seed topic: combinatorial optimization
Topics: randomized algorithm, mathematical programming, graph algorithm, indexing information, data analysis, computer vision, information fusion, database design, web mining, risk analysis