Using Part-of-Speech Tagging for Building Networks of Terms in Legal Sphere

Dmytro Lande a,b,c and Oleh Dmytrenko a

a IRR of NAS of Ukraine, 2, Mykoly Shpaka Street, Kyiv, 03113, Ukraine
b NTU «Igor Sikorsky KPI», 37, Prosp. Peremohy, Kyiv, 03056, Ukraine
c SRIIL of NALS of Ukraine, 110-v, Saksaganskogo Street, Kyiv, 01032, Ukraine

Abstract
This paper considers the important problem of formalizing problem subject domains and building their terminological ontologies from content-related text data. As an ontological model, we propose a linguistic network model of text representation, the so-called network of key terms. In this network, the nodes are keywords and phrases that appear in the text corpus, and the links between them are semantic-syntactic links between these terms in the text. The input sets of text data were prepared using systems that aggregate thematic information flows from freely available information resources distributed in global computer networks. In particular, this paper addresses the important and urgent problem of computerized processing of legal information. The task of computerized processing of natural language texts lies at the intersection of linguistic theory and the mathematical sciences. Therefore, broader natural language processing based on Part-of-Speech tagging was used to extract the key terms. After the extraction, statistical weighting of the formed words and phrases was performed. The horizontal visibility graph algorithm was used to build undirected links between key terms. This paper also considers a new method for determining the direction of links between terms and weighting these links in the undirected network of words and phrases. This method takes Part-of-Speech tagging into account and also obeys the principle of inclusion of a word or phrase in the corresponding extended phrases that contain more words.
The proposed method was tested on a freely available legal document, the «Universal Declaration of Human Rights». After extracting the key terms from this legal document and determining the direction and weight of links between words or phrases using the proposed methods, the directed weighted network of terms was built. The method for building terminological networks considered in this work can be used, in particular, in systems for automatic text structuring and summarizing of legal information, or in systems for detecting duplicates and contradictions in normative legal documents. It will promote the formation and improvement of the conceptual and terminological apparatus in the legal sphere and help to harmonize national and international law.

Keywords
Information space, unstructured data, ontological model, problem subject domain, legal information, text corpus, computerized text processing, Part-of-Speech tagging, network of terms, automatic summarization

COLINS-2021: 5th International Conference on Computational Linguistics and Intelligent Systems, April 22–23, 2021, Kharkiv, Ukraine
EMAIL: dwlande@gmail.com (D. Lande); dmytrenko.o@gmail.com (O. Dmytrenko)
ORCID: 0000-0003-3945-1178 (D. Lande); 0000-0001-8501-5313 (O. Dmytrenko)
©️ 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction
Modern information and communication technologies, and the information space in general, are developing faster than ever before. This process is characterized by a correspondingly rapid increase in data volumes [1]. These large data volumes are produced by elements of the information space, in particular, documents and a variety of data sources such as files, emails, web pages and other sources, regardless of the formats of their presentation.
Data is created, recorded, stored, processed and reproduced increasingly often in electronic form. Notably, the volume of such data doubles approximately every 18 months [2]. As a result, over the past five years, humanity has produced more information than during all previous history [3]. For example, the International Data Corporation (IDC) predicts that 175 zettabytes (in other words, 175 trillion gigabytes) of new data will be created around the world in 2025 [4]. But such an information surge, the so-called information explosion, is accompanied not only by an influx of new valuable knowledge. The majority of such data is unstructured, including unnecessary and noisy data, and constitutes 95% of big data [5]; only a very small part (about 5%) of all data is valuable information that can be used in decision-making. The information society is therefore facing a number of problems that no one has faced before. The main problem is the critical discrepancy between the development of modern information systems and the growth of dynamic information flows in global computer networks [6]. Namely, the problem is the lack of appropriate technological solutions and the inability of existing systems to process huge amounts of unstructured data, including text data, and to extract knowledge from them at the same rate at which the corresponding data is produced and accumulated. These problems lead to the accumulation of unstructured data [7]. In turn, the huge volumes of such messy data make it difficult to find the necessary and relevant information that Internet users try to get in response to their queries. Therefore, the huge volume of information flows and dynamic text data accumulated in global computer networks determines the relevance and importance of conceptualizing this data and further formalizing it in the form of a certain ontological model.
This makes it necessary to improve existing technological solutions and create new ones to ensure a sufficiently high speed of processing and analysis of unstructured data. This process of forming the global information space is important for transforming the unstructured data accumulated on information resources into knowledge. In turn, the obtained knowledge can provide valuable recommendations for rapid decision-making in various spheres of activity, in particular, telecommunications, cyber, financial, trade, military, political, diplomatic and other spheres.

2. Computerized processing of legal information
Obtaining brief and, at the same time, the most important and relevant information or informative statements from one or more text documents, the so-called summary, abstract or annotation, is an important task of computerized text processing [8]. Generating concise, information-rich reports based on short annotations or digests simplifies access to the main content of the text without the need to process a large text document or text corpus. Work on automatic text summarization dates back to the middle of the last century [9]. However, due to the globalization of the information space and the continual increase in the number of information flows, the task of automatic text summarization is more important than ever before. Automatic text summarization also filters out information noise, reduces the amount of information a person has to consume and provides rapid access to the main content of the document. As a result, it supports important management decisions. Since scientific and technological progress has also affected the legal sphere, the problem of computerized processing of legal information is relevant [10]. The number of normative legal documents submitted in electronic form, and hence the amount of information that an expert in this field has to deal with, is also constantly growing.
Although different systems of automatic summarization currently exist [11], improving existing systems or developing new ones that could process large volumes of legal documents with acceptable performance and quality is still an important task. The defining feature of legal information is that the related texts are not fully freely accessible and unstructured. It is important to consider this fact when choosing the appropriate method or approach to solve the problem of automatic text summarization in the legal sphere. In general, there are statistical, positional and indicative methods of automatic summarization. In this work, a statistical method was used to calculate the weight values of individual words and phrases. A new method has been proposed that combines the statistical method with the linguistic network model, where key terms are nodes and the links between them are semantic-syntactic links between terms in a sentence. This method can be used in automatic legal information summarization systems or in systems for detecting duplicates and contradictions in legal documents.

3. Text data formalization
An important stage in the comprehensive study of a problem subject domain thematically related to a flow of text data is the presentation of its knowledge in a form suitable for further automated processing, in other words, the formalization of this knowledge. Building the terminological ontology of the studied subject domain is one way to formalize its knowledge. In this work, it is proposed to use a linguistic network model as an ontological model of text data. This model was chosen because many of the problems that arise when working with information flows lie at the intersection of the mathematical sciences and linguistic theory.
Linguistic theory, as a branch of general linguistics, makes it possible to work with natural language texts, knowing their properties, functions and, most importantly, structure. The theory of graphs and complex networks is a powerful mathematical theory within which the problem of formalizing the subject domain can be solved. Let us consider the mathematical component of the conceptualization and further formalization of a certain problem subject domain to which the text corpora are related by content. This paper uses a network model for presenting text data. In other words, texts of a certain thematic orientation can be presented in the form of a network of words and phrases connected by a formal semantic connection. A special case of such a network model is a network built from key terms. In this network, the nodes correspond to the individual key concepts of the subject domain, and the edges are the links between concepts. From the point of view of linguistics, natural language poses a number of problems, which are connected first of all with the ambiguity, non-compositionality and self-application of language units. Therefore, when applying the basic techniques of natural language processing, it should be borne in mind that a text contains different forms of a word (word forms that have a common stem), words derived from other words, and linguistic phrases used to express different meanings. As a result, the meaning of a single word or phrase in a particular case depends on the context in which it occurs [12]. This problem is characteristic of so-called inflected languages [13]. Since some phrases can be interpreted in two ways when the context is unknown, even if the meaning of all other words in the statement is known, there is a problem in determining the exact meaning of a complex statement.
These linguistic phenomena significantly complicate the task of correctly mapping the semantic-syntactic structure of the text into its formal logical representation. When building terminological ontologies of subject domains on the basis of thematic text documents [14], it is important that the terms (words and phrases) used as the names of the concepts of the chosen subject domain obey the principle of unambiguity. That is, a word used as a name should be the name of only one object if it is a single name. If it is a common name, then it should be a common name for all objects in the same class. Therefore, the linguistic component of natural language text processing is one of the central problems of information technology intellectualization.

4. Basic techniques of natural language processing
In recent years, the tasks of computer processing of dynamic information flows have become increasingly important. In this work, some of the most common techniques are used for computerized natural language pre-processing. In particular, these techniques include text tokenization and removal of stop words. Tokenization, or lexical analysis, is the segmentation of a sequence of characters into a sequence of so-called tokens using a scanner or tokenizer that performs the function of lexical analysis. The term "token" should be understood as a certain form of a word. A token is an independent semantic unit, which is considered in the aggregate of all its possible forms and meanings. As the initial stage of computerized text processing, tokenization allows working with a word as an individual entity while knowing the context in which this word is used. To clear the text of words that are a source of noise and carry no important information, it is recommended to delete so-called stop words [15].
For example, stop words include determiners, prepositions, particles, exclamations, conjunctions, adverbs, pronouns, introductory words and single-digit numbers (0 to 9). Stop words also include sequences of characters often used on the Internet (for example, www, HTTP, com, etc.), other frequently used function words and independent parts of speech, symbols and punctuation marks. These words do not carry any additional semantic load. That is why stop words must be ignored while building terminological ontologies. It is also recommended to use a stop dictionary or stop word list formed by an expert in the considered subject domain. There are various software tools, in particular the modules of NLTK (Natural Language Toolkit), an open-source Python library for NLP (Natural Language Processing), which make it easy to apply the above pre-processing methods to different types of texts [16]. After tokenization, a technique such as Part-of-Speech tagging (PoS tagging), or simply tagging, is usually used [17]. This natural language processing step is one of the main and basic components of almost any NLP task and helps to infer the language syntax and text structure. Part-of-Speech tagging is based not only on the definition of the word but also on the context in which the word is used. That is, tagging takes into account the connection of the tagged word with neighbouring and related words in a phrase, sentence or paragraph. The main idea of text tagging is to relate each word in a text or corpus to a certain part of speech. Figure 1 shows the main idea of Part-of-Speech tagging in a simple example. Each word in the sentence «One day her mother said» is assigned a tag (label) that marks a certain part of speech.
For example, the word «one» is tagged as CD (where CD marks a cardinal number), the word «day» is tagged as NN (where NN marks a noun) and so on (PRP$ marks a possessive pronoun and VBD marks a verb in the past tense).

Figure 1: Example of Parts-of-Speech tagging [18]

To mark parts of speech, a collection of predefined tags is used, one of which is assigned to each word in the sentence. Figure 2 presents the Penn Treebank list of tags that is used for the Part-of-Speech tagging task [19]. PoS tagging can be used in search engines, in text corpus analysis tools and in algorithms for indexing words, and it has many other uses as well. PoS tagging is especially useful when there are words or tokens that can have multiple tags. Tagging helps to distinguish between the occurrences of a word when it is used as one part of speech or another. Most importantly, tagging simplifies the context related to a specific subject domain. The particular parts of speech are represented as word classes or lexical categories. These categories are based on the syntactic context of a word or phrase. Therefore, using Part-of-Speech tagging as the method for classifying words by parts of speech helps to mark up each word in a text (or corpus) according to its lexical category. E. Brill's PoS tagger [21] is one of the first and most widely used English taggers. Stochastic algorithms are also used in addition to the group of rule-based algorithms.

Figure 2: The Penn Treebank PoS Tagset (excluding punctuations) [20]

5. Statistical weighting and key terms extracting using Part-of-Speech tagging
The initial stage of formalizing knowledge about a certain subject domain is conceptualization or, in other words, the definition of basic objects (individuals, attributes, processes, etc.) and the relationships between them. When a terminological ontology is built as a network based on text corpora, an important task is to define key terms (key words and phrases).
In their symbolic form, these key terms actually denote objects, processes or phenomena of the real world or environment. To define these basic concepts (key terms), it is proposed to perform statistical weighting of the words and phrases that the text corpus contains, taking into account Part-of-Speech tagging. To extract key words and phrases from the text, it is necessary to assign them a certain numerical weight. A statistical indicator can be used as one of the weights representing important words. Term Frequency - Inverse Document Frequency (TF-IDF for short) [22] is commonly used as a statistical weight of terms, although it is not the only possible approach to identifying key terms. In [23] it was shown that GTF (Global Term Frequency) is more effective when working with thematically related text documents contained in a text corpus. This statistical indicator shows how important the term is in the global context and is determined by the ratio of the total number of occurrences of this term in all documents to the total number of all terms that the documents contain. It was shown that, in contrast to the common statistical indicator TF-IDF, the proposed indicator of term importance makes it possible to find the informationally important elements of the text more effectively when working with a thematically predefined text corpus in which an informationally important term occurs in almost every document.

6. Method
First of all, it should be noted that the networks of terms are built within each separate sentence of the text corpus. In this work, the NLTK (Natural Language Toolkit) modules developed in the Python programming language were used. For example, "word_tokenize" and "pos_tag" are used to automatically split the text into tokens and to assign part-of-speech tags to each word, respectively. For stop word removal, the freely available sets of stop words referenced in [24, 25] were applied.
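The stop word removal and the GTF weighting described above can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the paper: a simple regular-expression tokenizer stands in for NLTK's "word_tokenize", the stop list is a tiny hypothetical one rather than the sets from [24, 25], the two documents are invented, and GTF is computed over the terms that remain after stop word removal.

```python
import re
from collections import Counter

# Hypothetical stop list for illustration only; the paper uses the
# published stop word sets [24, 25] plus an expert-formed list.
STOP_WORDS = {"the", "of", "and", "to", "in", "a", "is", "has"}


def tokenize(text):
    """Regex stand-in for NLTK's word_tokenize (an assumption of this sketch)."""
    return re.findall(r"[a-z]+", text.lower())


def global_term_frequency(documents):
    """GTF of a term: its count in all documents divided by the
    total count of all (non-stop) terms in all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(t for t in tokenize(doc) if t not in STOP_WORDS)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


# Invented two-document corpus.
docs = [
    "Everyone has the right to life.",
    "Everyone has the right to education.",
]
gtf = global_term_frequency(docs)
for term, weight in sorted(gtf.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

Because the normalization uses the whole corpus rather than a single document, a term that occurs in almost every document keeps a high weight, which is the property the text attributes to GTF in contrast to TF-IDF.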
In addition to the standard sets of stop words, it is also proposed to use a list of stop words formed by experts. The proposed method for determining keywords and phrases and the direction of links between them is based on the results of classifying words by parts of speech (Part-of-Speech tagging). Practical research [16] shows that the most commonly used parts of speech in English text are determiners (their tag is DT), singular or mass nouns (NN), plural nouns (NNS), personal pronouns (PRP), verbs and all their forms (VB, VBD, VBG, VBN, VBP, VBZ), adjectives (JJ), including comparative adjectives (JJR) and superlative adjectives (JJS), and adverbs (RB), in particular comparative adverbs (RBR) and superlative adverbs (RBS). In general, individual nouns «NN*», which are usually related to people, places, things or concepts, and nouns coupled with adjectives (phrases like «JJ* NN*») are considered key terms. In this work, phrases of the form «NN*1 NN*2», «JJ*1 JJ*2», «JJ*1 JJ*2 NN*» and «JJ*1 JJ*2 NN*1 NN*2» are also considered important and key. As noted above, determiners, prepositions (IN), coordinating conjunctions (CC), individual verbs and their forms, adverbs and pronouns are stop words. However, in this work we also consider as key the phrases whose patterns look like «V*1 to V*2», «NN*1 IN/CC NN*2», «JJ*1 IN/CC JJ*2», «JJ* NN*1 IN/CC NN*2», «JJ*1 IN/CC JJ*2 NN*», «JJ*1 JJ*2 NN*1 IN/CC NN*2» and «JJ*1 IN/CC JJ*2 NN*1 IN/CC NN*2». After the phrases are formed according to the patterns described above and arranged in a certain order (a sequence in which phrases with more words are placed before the phrases and words that are part of them), the individual stop words are removed. The next step is the statistical weighting of the words and phrases included in the sequence formed at the previous stage. In this work, GTF (Global Term Frequency), the idea of which is described above, is used.
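The pattern-based formation of phrases described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the sentence is hand-tagged and hypothetical (in the (word, tag) form that "pos_tag" returns), only a subset of the listed patterns is included, and the names PATTERNS, tag_matches and extract_terms are introduced here for illustration.

```python
# Hypothetical hand-tagged sentence in the (word, tag) form returned by pos_tag.
TAGGED = [
    ("equal", "JJ"), ("rights", "NNS"), ("and", "CC"),
    ("freedoms", "NNS"), ("of", "IN"), ("every", "DT"), ("person", "NN"),
]

# A subset of the patterns from the text. "NN" and "JJ" match any tag with
# that prefix (NN, NNS, JJR, ...); "IN/CC" matches a preposition or a
# coordinating conjunction.
PATTERNS = [
    ("JJ", "NN", "IN/CC", "NN"),
    ("NN", "IN/CC", "NN"),
    ("JJ", "NN"),
    ("NN", "NN"),
    ("NN",),
]


def tag_matches(tag, slot):
    """Check one tag against one pattern slot."""
    if slot == "IN/CC":
        return tag in ("IN", "CC")
    return tag.startswith(slot)


def extract_terms(tagged):
    """Return (term, tags) pairs, with longer phrases placed first."""
    found = []
    for pattern in PATTERNS:
        n = len(pattern)
        for start in range(len(tagged) - n + 1):
            window = tagged[start:start + n]
            if all(tag_matches(tag, slot)
                   for (_, tag), slot in zip(window, pattern)):
                term = " ".join(word for word, _ in window)
                tags = " ".join(tag for _, tag in window)
                found.append((term, tags))
    # Stable sort: phrases with more words precede the phrases and words
    # that are part of them.
    found.sort(key=lambda item: -len(item[0].split()))
    return found


for term, tags in extract_terms(TAGGED):
    print(f"{term}  [{tags}]")
```

Words that match no pattern on their own, such as the determiner «every» or the preposition «of», never appear as individual terms, which corresponds to removing individual stop words after the phrases have been formed.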
A so-called tuple is formed for each formed phrase in the order of its occurrence in the text. Each tuple consists of three elements: the first element is the term (a word or formed phrase); the second is the tag or combination of tags (for formed phrases) assigned according to the part of speech to which this word or phrase belongs; the last element is the numeric value of GTF. The defining feature of the proposed technique is that GTF is calculated taking into account the first two elements of the tuple (the word or phrase and the part of speech to which it belongs). The number of such identical pairs, normalized by the total number of formed terms in the whole text, determines the value of the third element of the formed tuple. The next step is to determine the undirected relationships between the terms in the text. The Horizontal Visibility Graph algorithm (HVG) is used to transform the time series formed by the sequence of numerical GTF values into an undirected graph [26]. The idea of the algorithm is that two nodes ti and tj (in our case, two phrases ti and tj), which correspond to xi and xj in the formed time series, are in horizontal visibility if and only if xk