Relational-Situational Method for Intelligent Search and Analysis of Scientific Publications

Gennady Osipov, Ivan Smirnov, Ilya Tikhomirov, Artem Shelmanov
gos@isa.ru, ivs@isa.ru, tih@isa.ru, shelmanov@isa.ru
Institute for Systems Analysis of RAS, Moscow, Russia

Abstract

The paper presents a model of the semantic-syntactic structure of text and a method for semantic search and analysis of scientific publications. We focus mainly on the stages of semantic analysis of scientific publications and on the practical usage of the developed model and method. Experiments with scientific publications are briefly described.

1 Introduction

Scientific publications normally contain all the essential information about research, including problem statements, proposed solutions and achieved results. A large amount of information describing the state of the art of science and technology worldwide is now openly available on the web. This information can be found in electronic versions of scientific and popular journals, preprints, and reports on R&D work that present research results, their potential economic impact and recommendations on their usage and applications.

However, there is currently a paradox: the number of scientific papers grows constantly, yet the time needed to find and analyze papers grows as well. For example, PhD students spend more and more time searching digital libraries for papers and analyzing them. Understanding the state of the art in particular fields of research, or in science as a whole, is also a problem for scientific management. It is important to have a clear view of which research topics are developing at present, which are in decline and which are expected to advance in the near future.

This paper presents a model of the semantic-syntactic structure of text, the method of relational-situational analysis and its practical usage in the intelligent search and analytic engine EXACTUS EXPERT. The engine can be used for analytical support of scientific activities.
2 Related work

There are many systems that provide analytical support of scientific activities. However, most of them focus primarily on referential analysis of scientific papers. Far fewer provide capabilities for deep, comprehensive analysis of scientific trends and domains.

Scopus is a system owned by Elsevier. It is positioned as the biggest universal referential database of scientific information. It covers papers from more than 5,000 international publishers, including Russian journals. In addition to advanced search and referential analysis, the system provides tools for comparing journals by publication activity and by other metrics [MACRVQ+07].

Copyright © by the paper's authors. Copying permitted only for private and academic purposes. In: M. Lupu, M. Salampasis, N. Fuhr, A. Hanbury, B. Larsen, H. Strindberg (eds.): Proceedings of the Integrating IR technologies for Professional Search Workshop, Moscow, Russia, 24-March-2013, published at http://ceur-ws.org

SciVal, also offered by Elsevier, is a complex system for the analysis of branches of science. It allows evaluating the scientific results of different branches and provides graphical visualization of the effectiveness of organizations, countries and geographical regions for a given period of time and domain (map of science) [sci].

Web of Knowledge is a system offered by Thomson Reuters. It provides access to the best-known citation index, Web of Science, which contains a wide range of publications in almost every branch of science. It offers advanced customizable search and tools for referential analysis. The system provides comprehensive statistics about journals and publications for different domains. There are also some tools for monitoring scientific trends and research teams [LCR13].
Although every system has many capabilities and tools, none is universal; each suits particular cases better than others. The problem of scientometrics, i.e. determining trends, evaluating research teams and their results, and finding novel outstanding approaches, is still far from being solved. Another drawback of state-of-the-art systems is that they mostly deal with structured data that is prepared manually. There is a lack of systems that can be fed raw unstructured data such as natural-language texts. Although some systems, like Google Scholar, do process raw data, they provide fewer capabilities for the analytics of science. The system we propose, EXACTUS EXPERT, addresses the tasks of processing raw unstructured data such as natural-language texts and of providing tools for comprehensive analysis of scientific papers, domains and trends.

3 Model of semantic-syntactic structure of text

This section describes the model that was developed to represent the semantic-syntactic structure of text in our experiments. The model is organized into four levels of abstraction: graphematic, morphological, syntactic and semantic.

The graphematic model is the lowest level. In this model, text is represented as a hierarchy of elements: sentences, clauses and words. Words are chains of characters produced by the tokenization procedure. They consist of letters, numbers, punctuation marks and special characters. Every character in the text except whitespace characters belongs to exactly one word. Each word is assigned a mark indicating its type: abbreviation, punctuation composition or ordinary word. A clause is a nonempty ordered array of words. Clauses in natural language correspond to simple sentences within a compound sentence, participial phrases, adverbial participial phrases and some other special constructions. Clauses are projective: they cannot overlap partially, but a whole clause can be embedded in another.
Finally, sentences are nonempty ordered arrays of clauses.

The morphological model extends the graphematic model by adding to each word its lemma (the canonical form of the word) and some other morphological information. In English, morphological information is often represented as a POS (part-of-speech) tag because the number of word forms is limited. Russian, by contrast, is highly inflected and has so many word forms that hundreds of POS tags would be required. It is therefore common for Russian to treat the morphological information of a word as a set of morphological properties, with part of speech as one of its elements. The sets of morphological properties differ between parts of speech. For example, nouns have gender, case and number, whereas verbs have tense, person, number and gender but not case. In addition, we extend the noun property set with a categorial semantic class (CSC). Categorial semantics is a generalized meaning characterizing words that belong to the same categorial class (for instance, nouns may belong to the classes of people, things and attributes) [OSTZ08]. The categorial semantic class is necessary for semantic analysis, since it defines the syntactic features of the word and the ways it functions in a clause. The CSC of a word is determined not only by its morphological properties but also by special dictionaries.

The syntactic model describes the syntactic relations in a sentence. The underlying formalism is the dependency tree. There are two types of syntax trees: trees built within a clause connect words, and trees built within a sentence connect clauses. Syntactic relations are divided into these two groups because relations between clauses are more specific than relations between words. In addition, it is easier to build shorter trees than a full dependency structure. For a wide range of tasks it is not necessary to build a single complete tree; a number of shallow subtrees suffices.
Our algorithms that perform semantic analysis of text rely primarily on subtrees that represent NPs (noun phrases), PPs (prepositional phrases) and VPs (verb phrases).

The semantic level is represented by the relational-situational model [OST10], which is based on the theory of linguistic semantics [OSTZ08], [ZOS04]. The underlying formalism of the relational-situational model is the heterogeneous semantic network [ST09], [Osi92].

Nodes of the semantic network are represented by syntaxemes, the minimal indivisible semantic-syntactic structures of language [Osi92]. In a particular discourse, or in a particular sentence of a query, the word acts as a syntaxeme. Syntaxemes are detected according to: a) the categorial semantics of the word; b) its morphological form; c) its function in the sentence.

A syntaxeme that acts as the head of a VP (or possibly of an NP, in the case of a verbal noun) has a special function and is called a predicate word. In general it holds the central position in the semantic structure of the clause and influences the related NPs and PPs.

Syntaxemes that act as heads of NPs or as main nouns in PPs are called nominal syntaxemes. Nominal syntaxemes receive their semantic meanings from predicate words; in the model these meanings are represented as labeled relations between nominal syntaxemes and the predicate word. There are also semantic relations between nominal syntaxemes that reflect relationships of concepts in the conceptual system of the domain. Semantic relations between nominal syntaxemes express the compatibility of their meanings.

The model defines 65 different meanings of syntaxemes in total; some of them are listed below.
• Subject – component that performs action.
• Ablative – an initial point of motion.
• Directive – direction of motion, oriented action or orientation of object.
• Mediation – way (method) of action.
• Destinative – purpose of action.
• Locative – location of component.
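The notions introduced above (predicate word, nominal syntaxemes and meanings attached as labeled relations) can be illustrated with a minimal Python sketch. It is not the EXACTUS EXPERT implementation: the names `Meaning`, `Syntaxeme`, `RoleEdge` and `roles_of` are hypothetical and introduced only for illustration, and the enum covers only the six meanings listed above. The sample edges use the meanings that Section 4 assigns to the sentence "Oxygen arrives at tissues from lungs through blood".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class Meaning(Enum):
    """Only the six meanings listed above; the model defines 65 in total."""
    SUBJECT = "Subject"
    ABLATIVE = "Ablative"
    DIRECTIVE = "Directive"
    MEDIATION = "Mediation"
    DESTINATIVE = "Destinative"
    LOCATIVE = "Locative"


@dataclass(frozen=True)
class Syntaxeme:
    """Hypothetical, simplified node of the semantic network."""
    lemma: str
    is_predicate: bool = False  # True for a predicate word, False for a nominal syntaxeme


# A labeled relation: (predicate word, meaning, nominal syntaxeme).
RoleEdge = Tuple[Syntaxeme, Meaning, Syntaxeme]


def roles_of(predicate: Syntaxeme, edges: List[RoleEdge]) -> Dict[Meaning, str]:
    """Collect the meanings that one predicate word assigns to its nominal syntaxemes."""
    return {meaning: nominal.lemma
            for pred, meaning, nominal in edges
            if pred == predicate}


arrive = Syntaxeme("arrive", is_predicate=True)
edges: List[RoleEdge] = [
    (arrive, Meaning.SUBJECT, Syntaxeme("oxygen")),
    (arrive, Meaning.ABLATIVE, Syntaxeme("lung")),
    (arrive, Meaning.DIRECTIVE, Syntaxeme("tissue")),
    (arrive, Meaning.MEDIATION, Syntaxeme("blood")),
]

roles = roles_of(arrive, edges)
```

Keeping the meaning on the edge rather than on the node mirrors the description above: the same word can receive different meanings from different predicate words.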
The model defines 35 different types of semantic relations between nominal syntaxemes; some of them are listed below.
• ABL – relation where the first component is the initial point of motion in the direction of the second component's destination.
• TRA – transitive relation, where the first component denotes the route of the second component.
• DIR – directive relation, where one component denotes the way (direction) of the other component.
• MED – mediative relation, where one component denotes the mode or means of the other component's action.
• DES – destinative relation, where one component denotes the destination of the other component.
• LOC – relation where the first component names the location of the second component.

The separate semantic networks of clauses and sentences are linked into a semantic network of the whole text by co-referent and anaphoric relations. In addition, key concepts in the semantic network, represented by nominal syntaxemes, are enriched with a set of syntactically and semantically related concepts (synonyms, hyponyms, holonyms, etc.).

The designed linguistic semantics model allows many tasks to be solved more effectively than with the known approaches based on keyword extraction [OSTV12]. The semantic network captures the essential content of the text. It allows compiling facts about the same object that are mentioned throughout the discourse, even if the mentions are far apart or the object is expressed by different words. This helps to improve the relevance of retrieved documents in information search systems. Semantic networks allow finding documents that are close to the query in meaning. It is also possible to find inferred facts that are unavailable to a search engine that represents text as a vector of words.

A simplified model M of the semantic-syntactic structure of text can be described as follows.
$M = \langle S, T_s, R, I_s \rangle$, where $S = \{s_1, s_2, \dots, s_n\}$ is a set of syntaxemes and $s_i$ denotes a syntaxeme; $R$ denotes the family of relations on the set of syntaxemes, each a subset of $S \times S$; $T_s$ denotes the set of syntaxeme types, which are defined in linguistic dictionaries; and $I_s : S \to T_s$ assigns a type to each syntaxeme.

A syntaxeme is represented by a triple $s = \langle W, P, \tau \rangle$, $\tau \in T_s$, $T_s = \{p, n\}$. Here $W$ denotes the word; $P$ denotes the syntaxeme features, including categorial semantic class, prepositions and other morphological properties; and $\tau$ denotes the type of the syntaxeme ($p$ for a predicate word, $n$ for a nominal syntaxeme).

$R = \{(s_1, s_2)\}$ is a family of binary relations on the set of syntaxemes. $R$ consists of three subfamilies: $R_p$ denotes types of relations between predicate words and nominal syntaxemes; $R_n$ denotes types of relations between nominal syntaxemes in a single clause; $R_c$ denotes types of relations that express anaphora and co-reference.

4 Method of Relational-Situational analysis

There are four stages in the semantic processing of discourse: graphematic, morphological, syntactic and semantic analysis [Zol88], [Osi95]. Each stage is performed by a separate analyzer with its own input and output data and its own settings. As the first three stages are quite common, only the key aspects of the semantic analysis stage are discussed here.

Let us consider the text "Oxygen arrives at tissues from lungs through blood. There it is spent on oxidation of various substances" as an example. Let us assume that the graphematic and morphological structures and the syntax trees have already been built, so that everything is prepared for the next stage. Note that each sentence contains only one clause, so in the following discussion "sentence" and "clause" refer to the same unit.

The main task of semantic analysis is to reveal the semantic meanings of syntaxemes and the relations on the set of syntaxemes. Semantic analysis starts with the extraction of predicate words from the text.
Predicate words are mostly heads of VPs, so they are predominantly verbs and, more rarely, verbal nouns and participles. In the text above the predicate word of the first sentence is "arrive" and the predicate word of the second sentence is "spend". In terms of the model this is written as follows: $\langle arrive, verb, p \rangle$; $\langle spend, verb, p \rangle$.

Then nominal syntaxemes are extracted from NPs and PPs. There are four nominal syntaxemes in the first sentence: $\langle oxygen, objective, n \rangle$; $\langle lung, objective; from, n \rangle$; $\langle tissue, objective; at, n \rangle$; $\langle blood, objective; through, n \rangle$; and three in the second sentence: $\langle oxidation, objective; on, n \rangle$; $\langle there, location, n \rangle$; $\langle it, objective, n \rangle$.

When the predicate word and the nominal syntaxemes of a clause have been determined, the nominal syntaxemes are assigned meanings that correspond to the argument structure of the predicate word. The argument structures of predicate words are stored in a special linguistic dictionary. They determine which meanings can be taken by the syntactically related syntaxemes. The assignment of meanings is based mostly on such features as case (or position in the text), categorial semantic class and preposition. Only the meanings that correspond to the most completely filled argument structure of the predicate word are retained. The obtained meanings are stored as a set of relations between the predicate word and the nominal syntaxemes.

The resulting set of relations for the first sentence contains four elements: $R_{Subject} = \{(arrive, oxygen)\}$; $R_{Ablative} = \{(arrive, lung)\}$; $R_{Directive} = \{(arrive, tissue)\}$; $R_{Mediation} = \{(arrive, blood)\}$. The resulting set of relations for the second sentence contains three elements: $R_{Subject} = \{(spend, it)\}$; $R_{Destinative} = \{(spend, oxidation)\}$; $R_{Locative} = \{(spend, there)\}$.

If there is no predicate word in the sentence, or it is not found in the dictionary, special algorithms based on machine learning are applied to determine the meaning from context.

The next step is to set up relations on the set of nominal syntaxemes. These relations reflect stable relationships between the meanings of syntaxemes. Information about the compatibility of meanings is also stored in the linguistic dictionary together with the predicate word.
The resulting set of relations for the first sentence contains the following relations: $R_{ABL} = \{(oxygen, lung)\}$; $R_{TRA} = \{(lung, tissue)\}$; $R_{MED} = \{(oxygen, blood)\}$; $R_{DIR} = \{(oxygen, tissue)\}$. The resulting set of relations for the second sentence contains the following relations: $R_{DES} = \{(it, there)\}$; $R_{LOC} = \{(it, oxidation)\}$.

After the semantic network of each sentence has been constructed, it is linked into a single semantic network of the whole text by co-referent and anaphoric relations. These relations are established using lexical databases such as WordNet and features such as distance, morphological properties and syntactic role. There are two co-referent relations: $R_{COREF} = \{(it, oxygen); (there, tissue)\}$.

The resulting semantic-syntactic structure of the text is given below.

$M = \langle S, T_s, R, I_s \rangle$

$S = \{\langle arrive, verb, p \rangle; \langle spend, verb, p \rangle; \langle oxygen, objective, n \rangle; \langle lung, objective; from, n \rangle; \langle tissue, objective; at, n \rangle; \langle blood, objective; through, n \rangle; \langle oxidation, objective; on, n \rangle; \langle there, location, n \rangle; \langle it, objective, n \rangle\}$

$R_{Subject} = \{(arrive, oxygen); (spend, it)\}$; $R_{Ablative} = \{(arrive, lung)\}$; $R_{Directive} = \{(arrive, tissue)\}$; $R_{Mediation} = \{(arrive, blood)\}$; $R_{Destinative} = \{(spend, oxidation)\}$; $R_{Locative} = \{(spend, there)\}$

$R_{ABL} = \{(oxygen, lung)\}$; $R_{TRA} = \{(lung, tissue)\}$; $R_{MED} = \{(oxygen, blood)\}$; $R_{DIR} = \{(oxygen, tissue)\}$; $R_{DES} = \{(it, there)\}$; $R_{LOC} = \{(it, oxidation)\}$; $R_{COREF} = \{(it, oxygen); (there, tissue)\}$

Figure 1 shows a visual representation of the constructed semantic-syntactic structure of the text as a semantic network. Syntaxemes are represented by nodes of the semantic network. The solid edges denote relations between predicate words and nominal syntaxemes and are marked with the corresponding meanings. The dashed edges denote relations between nominal syntaxemes and are marked with the corresponding relation types.
Finally, the dotted edges denote syntactic relations, and the wide dotted edges with the COREF mark denote co-referent relations.

Figure 1: Visual representation of the semantic-syntactic structure of the text: "Oxygen arrives at tissues from lungs through blood. There it is spent on oxidation of various substances."

5 Practical usage

Semantic images of scientific texts, generated as the result of semantic analysis, are stored in semantic indexes and used for search and analysis. The model of the semantic-syntactic structure of text described above extends the possibilities of text processing and gives additional information about the content of scientific publications. For example, it allows extracting objects of research, methods and tools applied in research, results of research and other useful entities from publications. The solutions of the following tasks are based on the results of semantic analysis of scientific publications.

5.1 Semantic search

Semantic search of publications allows using queries formulated in natural language. The main idea of semantic search is semantic matching of a query with the documents stored in the search index. Semantic search involves the generation of semantic images of documents and queries. The semantic image, as described above, is a semantic network, so semantic matching consists in comparing the networks of the query and of the documents meaning by meaning and relation by relation. As a result, a semantic relevance score is calculated, which allows ranking documents by their semantic correspondence to the search query.

Semantic search involves both statistical approaches based on TF-IDF and semantic analysis. The semantic analysis substantially enhances search precision and recall and reduces the number of irrelevant documents returned by the search engine.
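The matching of semantic images "relation by relation" can be sketched in Python as follows. The paper does not give the relevance formula, so this sketch makes a loud assumption: relevance is taken to be the fraction of the query's labeled relations that also occur in the document's semantic image. The TF-IDF component mentioned above is omitted, and the names `Relation`, `semantic_relevance` and `rank`, as well as the three sample documents, are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# A labeled relation of a semantic image: (label, first lemma, second lemma).
Relation = Tuple[str, str, str]


def semantic_relevance(query: FrozenSet[Relation], doc: FrozenSet[Relation]) -> float:
    """Assumed score: share of the query's relations that the document also contains."""
    if not query:
        return 0.0
    return len(query & doc) / len(query)


def rank(query: FrozenSet[Relation],
         docs: Dict[str, FrozenSet[Relation]]) -> List[Tuple[str, float]]:
    """Order documents by descending relevance; ties are broken by document id."""
    scored = [(doc_id, semantic_relevance(query, image)) for doc_id, image in docs.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Query image: "oxygen arrives at tissues" (relations taken from the example in Section 4).
query: FrozenSet[Relation] = frozenset({
    ("Subject", "arrive", "oxygen"),
    ("Directive", "arrive", "tissue"),
})

# Hypothetical document images, invented for illustration only.
docs: Dict[str, FrozenSet[Relation]] = {
    "doc_a": frozenset({
        ("Subject", "arrive", "oxygen"),
        ("Directive", "arrive", "tissue"),
        ("Ablative", "arrive", "lung"),
    }),
    "doc_b": frozenset({("Subject", "arrive", "oxygen")}),
    "doc_c": frozenset({("Subject", "spend", "it")}),
}

ranked = rank(query, docs)
```

Under this assumption a document that mentions the same words in different roles, for example oxygen as Ablative rather than Subject, gets no credit for that relation, which is the behaviour that distinguishes semantic matching from a bag-of-words comparison.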
5.2 Search for similar documents

The same idea as for semantic search is used for searching for similar documents. In this case the semantic images of two documents are matched and the semantic distance between them is calculated. This allows detecting possible duplicates of scientific papers and plagiarism, and tracking continuity (or revealing its absence) in the results of research work across various types of scientific information sources (R&D reports, technical documentation, research papers, publications in mass media).

5.3 Extraction of definitions

Definitions of terms in scientific publications can be identified by their lexical, syntactic and semantic contexts. To find and extract term definitions we developed a method based on the analysis of these contexts [She12]. Our linguists identified more than 60 contexts of term definitions. They were refined and summarized in the course of experiments. As a result, we created a set of syntactic and semantic templates that covers the most frequent cases of term definitions.

The idea of the method is to search for all matches that satisfy the lexical, syntactic and semantic conditions of the stored templates. For example, the template POS(Noun) & SemValue(Estimative) + Lemma(called) matches the definition "This dividing line is called the bissectrice or bisection line". Fifteen templates of this kind were implemented. Templates also carry information about which part of the match should be extracted as the term and which part should be treated as the definition.

Once the list of terms has been constructed, it is filtered using heuristic rules. These rules exclude typical erroneous definitions from the result set and remove redundant words from the terms themselves.

The set of terms found in a document can extend the list of keywords, can be taken into account by the procedure of automatic annotation construction, and can also be a sign of the novelty of a scientific paper.
Terms and definitions can also receive a special mark in the search index that influences the relevance of the document to a query. Terms and definitions placed in the search index can help to trace relationships between documents, since it is possible to find texts with similar terminology and even to determine in which text a term was first introduced.

5.4 Retrieving results of research from papers

The results of research presented in a paper are formulated by means of specific phrases that correspond to special structures. We proposed to describe these structures as pairs <predicate word, meaning> consisting of a specific predicate word and the meaning of its argument (syntaxeme). To extract such structures, a corpus of scientific texts with marked-up phrases describing results was formed. Structures for extracting results were then obtained using a Bayesian classifier. For example, a result can be presented with the structure <develop, object>, so the sentence "Authors developed the method" is considered to describe a result. It was discovered that theoretical results are commonly presented with structures <predicate word, deliberative>, and applied results with structures <predicate word, object>.

Retrieving results allows evaluating the efficiency of a given piece of research or a given field of research and makes it possible to compare them by productivity.

5.5 Assessment of the quality of scientific publications

The problem of evaluating the quality of a scientific publication has two aspects. First, the publication should have a conventional format, i.e. contain sections such as problem statement, methods, solutions, results of the research, conclusions, references and so on (see, for example, IMRAD [Day89], [SP04]). Second, the publication should not contain quasi-scientific or prescientific lexis and phrases.

To check the paper's structure it is necessary to detect the presence of the sections mentioned above.
As in the retrieval of results, we assume that sections contain specific semantic structures of the form <predicate word, argument, meaning>. A corpus of scientific texts with marked-up sections was created. A Bayesian method extracted the structures specific to each section. Thus, the section "problem statement" frequently contains structures <is, research, object>, <attract, attention, subject>, etc.; the section "conclusion" contains structures <discover, opportunities, resultative>, <present, we, subject>, <let, research, causative>, etc.

To check publications for quasi-scientific or prescientific lexis and phrases, special dictionaries were developed.

6 Experiments results

The described principles, models and methods were implemented in the intelligent search and analytic engine EXACTUS EXPERT.

To evaluate the quality of the developed semantic analyzer, a small corpus of sentences was created. It contains two hundred single-clause sentences that represent search queries. The precision on this corpus is 0.83 and the recall is 0.97. This result is good for big-data approaches and is suitable for our search engine, since it deals with large-scale collections of text documents.

The search algorithm of EXACTUS EXPERT was tested at the Russian Information Retrieval Evaluation Seminar (ROMIP) [NN08], a competition of Russian search engines. In many respects ROMIP seminars are similar to other international information retrieval events such as TREC [tre], CLEF, NTCIR, etc. Like TREC, ROMIP runs in cycles and is overseen by a program committee consisting of representatives from academia and industry. Several tracks corresponding to different tasks are conducted. In several years ROMIP included a track on search in large-scale collections. For example, in 2008 there were two collections containing 1.6 and 3.0 million Russian documents.
The evaluation procedure changes from year to year. In general, competitors provide search results for a large set of queries of different types (e.g. about 30,000 queries in 2008), which are compared against the ground truth. The ground truth consists of a smaller set of queries (e.g. about 500 queries in 2008) randomly chosen from the large set and assessed by experts. Several widely known evaluation metrics are used in ROMIP: precision, recall, the 11-point TREC precision-recall graph, Bpref, etc.

In 2008 our search algorithm showed the highest precision/recall values; in 2009 the algorithm for searching for similar documents showed one of the best results. The experiments show the advantage of using linguistic methods together with statistical methods for improving search quality.

The experimental study of the developed methods for semantic analysis of scientific publications was carried out on material from scientific journals, conference proceedings and theses abstracts. We processed about 100 thousand publications in total, including journal publications in Russian and English, theses abstracts in Russian and conference papers in Russian and English.

The algorithm for retrieving results of research from papers showed a precision of 0.85 on the test data, with a precision of 0.90 for distinguishing theoretical from applied results. The precision of the definition and term extraction algorithm is 0.84 and the recall is 0.86 [She12].

An example of a report generated by the system for the quality evaluation of a paper is presented below.

Science index (from -5 to 5) equals 4.
Interpretation: Text contains 28% of scientific lexis and 1% of quasi-scientific or prescientific lexis.
References are present.
Problem statement is available with probability 0.83.
Methods described with probability 0.87.
Conclusions are available with probability 0.55.
Results: Methodological approach described in the paper ... was developed for accessing large distributed informational systems ...
and applied for disaster recovery....
Definitions: Disaster recovery capability to recover applications and data after disaster... The process of creation of such model is called as disaster recovery modeling...

In this report, a science index of 4 means that the analyzed paper is likely to be scientific, because it contains a large amount of scientific lexis and a much smaller amount of quasi-scientific lexis. There is also a high probability that the paper has a problem statement and a description of methods. The system is not sure about the presence of conclusions.

7 Conclusion and future work

The developed model of the semantic-syntactic structure of text and the corresponding method help to solve a set of tasks in the analytical support of scientific activities. The search and analytic engine EXACTUS EXPERT is in demand among experts who support decision making on the financing of research topics, among editors of scientific journals and among researchers themselves, especially PhD students.

As future work, we plan to develop methods for detecting logical defects in scientific texts. This feature is requested by editors of scientific journals. We also plan to improve the quality of English text analysis and to conduct more experiments.

Acknowledgements

The project is supported by the Ministry of Education and Science of the Russian Federation, grant 07.551.11.4003.

References

[Day89] R. A. Day. The origins of the scientific paper: The IMRAD format. American Medical Writers Association Journal, 4(2), 1989.
[LCR13] Loet Leydesdorff, Stephen Carley, and Ismael Rafols. Global maps of science based on the new Web-of-Science categories. Scientometrics, 94:589–593, 2013.
[MACRVQ+07] Felix Moya-Anegon, Zaida Chinchilla-Rodriguez, Benjamin Vargas-Quesada, Elena Corera-Alvarez, Francisco Jose Munoz-Fernandez, Antonio Gonzalez-Molina, and Victor Herrero-Solana. Coverage analysis of Scopus: A journal metric approach. Scientometrics, 73:53–78, 2007.
[NN08] Marina Nekrestyanova and Igor Nekrestyanov. ROMIP 2008 evaluation: Rules, methodology and ad hoc decisions. In Proceedings of ROMIP'2008, pages 5–26, 2008.
[Osi92] Gennady Osipov. Formulation of subject domain models: Part 1. Heterogeneous semantic nets. Journal of Computer and Systems Sciences International. Scripta Technica Inc., 1992.
[Osi95] G. Osipov. Methods for extracting semantic types of natural language statements from texts. In 10th IEEE International Symposium on Intelligent Control, Monterey, California, USA, August 1995.
[OST10] G. S. Osipov, I. V. Smirnov, and I. A. Tikhomirov. Relational-situational method for text search and analysis and its applications. Scientific and Technical Information Processing, 37(6):432–437, 2010.
[OSTV12] Gennady Osipov, Ivan Smirnov, Ilya Tikhomirov, and Olga Vybornova. Technologies for semantic analysis of scientific publications. In R. R. Yager, V. Sgurev, and M. Hadjiski, editors, Proceedings of 2012 IEEE 6th International Conference Intelligent Systems, volume 2, pages 58–62, 2012.
[OSTZ08] Gennady Osipov, Ivan Smirnov, Ilya Tikhomirov, and Olga Zavjalova. Application of linguistic knowledge to search precision improvement. In Proceedings of 4th International IEEE Conference on Intelligent Systems, volume 2, pages 17-2–17-5, 2008.
[sci] SciVal. http://info.scival.com/.
[She12] A. O. Shelmanov. Method for automatic extraction of multiword terms from texts of scientific publications. In Proceedings of the Thirteenth National Conference on Artificial Intelligence with International Participation CAI-2012, volume 1, pages 268–274, Belgorod, 2012. BGTU.
[SP04] L. B. Sollaci and M. G. Pereira. The introduction, methods, results, and discussion (IMRAD) structure: a fifty-year survey. J. Med. Libr. Assoc., 92(3), 2004.
[ST09] I. Smirnov and I. Tikhomirov. Heterogeneous semantic networks for text representation in intelligent search engine EXACTUS. In Proceedings of workshop SENSE'09 - Conceptual Structures for Extracting Natural language SEmantics, The 17th International Conference on Conceptual Structures (ICCS'09), pages 1–9, Moscow, Russia, July 2009.
[tre] TREC: Text Retrieval Conference. http://trec.nist.gov/.
[Zol88] G. A. Zolotova. Syntactic dictionary: Repertory of elementary units of Russian syntax. Nauka, Moscow, 1988.
[ZOS04] G. Zolotova, N. Onipenko, and M. Sidorova. Communicative grammar of Russian language. Moscow, 2004.