<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Using Part-of-Speech Tagging for Building Networks of Terms in Legal Sphere</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Dmytro</forename><surname>Lande</surname></persName>
							<email>dwlande@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">IRR of NAS of Ukraine</orgName>
								<address>
									<addrLine>2, Mykoly Shpaka Street</addrLine>
									<postCode>03113</postCode>
									<settlement>Kyiv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">NTU «Igor Sikorsky KPI»</orgName>
								<address>
									<addrLine>37, Prosp. Peremohy</addrLine>
									<postCode>03056</postCode>
									<settlement>Kyiv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">SRIIL of NALS of Ukraine</orgName>
								<address>
									<addrLine>110-v, Saksaganskogo Street</addrLine>
									<postCode>01032</postCode>
									<settlement>Kyiv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Oleh</forename><surname>Dmytrenko</surname></persName>
							<email>dmytrenko.o@gmail.com</email>
							<affiliation key="aff0">
								<orgName type="institution">IRR of NAS of Ukraine</orgName>
								<address>
									<addrLine>2, Mykoly Shpaka Street</addrLine>
									<postCode>03113</postCode>
									<settlement>Kyiv</settlement>
									<country key="UA">Ukraine</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Using Part-of-Speech Tagging for Building Networks of Terms in Legal Sphere</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">F156412CC7F781E300B121035DFBDBB8</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T11:51+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Information space</term>
					<term>unstructured data</term>
					<term>ontological model</term>
					<term>problem subject domain</term>
					<term>legal information</term>
					<term>text corpus</term>
					<term>computerized text processing</term>
					<term>Part-of-Speech tagging</term>
					<term>network of terms</term>
					<term>automatic summarization</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper considers the important problem of formalizing and building terminological ontologies of problem subject domains based on content-related text data. As an ontological model, we propose to use a linguistic network model of text representation, the so-called network of key terms. In this network, the nodes are keywords and phrases that appear in the text corpus, and the links between them are the semantic-syntactic links between these terms in the text. Input sets of text data were prepared using systems that aggregate thematic information flows from freely available information resources distributed in global computer networks. In particular, this paper addresses the important and urgent problem of computerized processing of legal information. The task of computerized processing of natural language texts lies at the intersection of linguistic theory and the mathematical sciences. Therefore, natural language processing based on Part-of-Speech tagging was used to extract the key terms. After the extraction, statistical weighting of the formed words and phrases was performed. The horizontal visibility graph algorithm was used to build undirected links between key terms. This paper also presents a new method for determining the direction of links between terms and weighting these links in the undirected network of words and phrases. The method takes Part-of-Speech tags into account and obeys the principle of inclusion of a word or phrase in its corresponding extended phrases with more words. The proposed method was tested on the freely available legal document «Universal Declaration of Human Rights». After extracting the key terms from this legal document and determining the direction and weight of the links between words and phrases using the proposed methods, a directed weighted network of terms was built.
The method for building terminological networks considered in this work can be used, in particular, in systems for automatic text structuring and summarization of legal information, or in systems for detecting duplicates and contradictions in normative legal documents. This will promote the formation and improvement of the conceptual and terminological apparatus in the legal sphere and the harmonization of national and international law.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Modern information and communication technologies, and the information space in general, are developing faster than ever before. This process is characterized by a correspondingly rapid increase in data volumes <ref type="bibr" target="#b0">[1]</ref>. These large data volumes are produced by elements of the information space, in particular documents and a variety of data sources such as files, emails, web pages and others, regardless of their presentation formats. Data is created, recorded, stored, processed and reproduced increasingly often in electronic form. It is important to note that the data described above double approximately every 18 months <ref type="bibr" target="#b1">[2]</ref>. As a result, over the past five years humanity has produced more information than during all previous history <ref type="bibr" target="#b2">[3]</ref>. For example, the International Data Corporation (IDC) predicts that 175 zettabytes (in other words, 175 trillion gigabytes) of new data will be created around the world in 2025 <ref type="bibr">[4]</ref>. But such an information surge, a so-called information explosion, is accompanied not only by an influx of new valuable knowledge. The majority of such data is unstructured, including unnecessary and noisy data, which constitutes 95% of big data <ref type="bibr" target="#b4">[5]</ref>; only a very small part (about 5%) of all data is valuable information that can be used in decision-making.</p><p>So now the information society is facing a number of problems that no one has faced before. The main problem is the critical discrepancy between the development of modern information systems and the increase of dynamic information flows in global computer networks <ref type="bibr" target="#b5">[6]</ref>.
Namely, the problem is the lack of appropriate technological solutions and the inability of existing systems to process huge amounts of unstructured data, including text data, and to extract knowledge from them at the same rate at which the corresponding data is produced and accumulated. The problems mentioned above lead to the accumulation of unstructured data <ref type="bibr" target="#b6">[7]</ref>. In turn, the huge volumes of such messy data make it difficult to find the necessary and relevant information that an Internet user tries to get in response to a request.</p><p>Therefore, the huge amount of information flows and dynamic text data accumulated in global computer networks determines the relevance and importance of conceptualizing this data and formalizing it further in the form of a certain ontological model.</p><p>This leads to the necessity to develop and improve existing technological solutions, and to create new ones, to ensure a sufficiently high speed of processing and analysis of unstructured data.</p><p>This process of global information space formation is important from the point of view of transforming the unstructured data accumulated on information resources into knowledge. In turn, the obtained knowledge can provide valuable recommendations for rapid decision-making in various spheres of activity, in particular the telecommunications, cyber, financial, trade, military, political and diplomatic spheres.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Computerized processing of legal information</head><p>Obtaining brief and at the same time the most important and relevant information, or informative statements, from one or more text documents (a so-called summary, abstract or annotation) is an important task of computerized text processing <ref type="bibr" target="#b7">[8]</ref>. Generating concise information-rich reports based on short annotations or digests simplifies access to the main content of a text without the need to process a large text document or text corpus.</p><p>Work related to automatic text summarization was first mentioned in the middle of the last century <ref type="bibr" target="#b8">[9]</ref>. However, due to the globalization of the information space and the continual increase in the number of information flows, the task of automatic text summarization is more important than ever before. Automatic text summarization also filters out information noise, reduces the volume of information consumed by humans and promotes rapid access to the main content of a document. As a result, it supports important management decisions.</p><p>Since scientific and technological progress has also affected the legal sphere, the problem of computerized processing of legal information is relevant <ref type="bibr" target="#b9">[10]</ref>. The number of normative legal documents submitted in electronic form, and hence the amount of information that an expert in this field has to deal with, is also constantly growing. Although different systems of automatic summarization currently exist <ref type="bibr" target="#b10">[11]</ref>, improving existing systems or developing new ones that could process large volumes of legal documents with acceptable performance and quality is still an important task.</p><p>The defining feature of legal information is that the related texts are not always freely accessible and are unstructured.
It is important to consider this fact when choosing an appropriate method or approach to the problem of automatic text summarization in the legal sphere. In general, there are statistical, positional and indicative methods of automatic summarization. In this work, a statistical method was used to calculate the weight values of individual words and phrases.</p><p>Based on the combination of the statistical method with the linguistic network model, where key terms are nodes and the links between them are the semantic-syntactic links between terms in a sentence, a new method has been proposed. This method can be used in systems for automatic summarization of legal information or in systems for detecting duplicates and contradictions in legal documents.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Text data formalization</head><p>An important stage in the complex research of a problem subject domain thematically related to a flow of text data is the presentation of its knowledge in a form suitable for further automated processing, in other words, the formalization of this knowledge. Building a terminological ontology of the studied subject domain is one form of such knowledge formalization.</p><p>In this work, it is proposed to use a linguistic network model as an ontological model of text data. This choice of model is motivated by the fact that many of the problems arising when working with information flows lie at the intersection of the mathematical sciences and linguistic theory. Linguistic theory, as a branch of general linguistics, makes it possible to work with natural language texts, knowing their properties, functions and, most importantly, structure. The theory of graphs and complex networks is a powerful mathematical theory within which the problem of formalizing the subject domain can be solved.</p><p>Let us consider the mathematical component of the conceptualization and further formalization of a certain problem subject domain with which text corpora are meaningfully connected. This paper uses a network model for presenting text data. In other words, texts of a certain thematic orientation can be presented in the form of a network of words and phrases connected by a formal semantic connection. A particular case of such a network model is a network built from key terms. In this network, the nodes correspond to the individual key concepts of the subject domain, and the edges are the links between concepts.</p><p>From the point of view of linguistics, natural language poses a number of problems, connected first of all with the ambiguity, non-compositionality and self-application of language units.
Therefore, when applying the basic techniques of natural language processing, it should be borne in mind that a language contains different forms of a word (word forms that share a common base), words derived from other words, and linguistic phrases used to express different meanings. As a result, the meaning of a single word or phrase in a particular case depends on the context in which it occurs <ref type="bibr" target="#b11">[12]</ref>. This is the problem posed by inflected languages <ref type="bibr" target="#b12">[13]</ref>. Since some phrases can be interpreted in two ways without knowing the context, even when the meaning of all other words in the statement is known, there is a problem in determining the exact meaning of a complex statement. These linguistic phenomena significantly complicate the task of correctly mapping the semantic-syntactic structure of a text into its formal logical representation.</p><p>When building terminological ontologies of subject domains on the basis of thematic text documents <ref type="bibr" target="#b13">[14]</ref>, it is important that the terms (words and phrases) used as the names of the concepts that accompany the chosen subject domain obey the principle of unambiguity. That is, if a word used as a name is a singular name, it should name only one object; if it is a common name, then the word or phrase should be a common name for all objects of the same class. The linguistic component of natural language text processing is therefore one of the central problems of information technology intellectualization.</p></div>
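The network model described in this section can be sketched as a plain data structure: a set of term nodes and a set of directed links between them. The following is a minimal illustration with hypothetical terms from the legal domain, not the authors' implementation:

```python
# Terminological network as plain structures:
# nodes are key terms, edges are (source, target) links between concepts.
nodes = {"rights", "freedom", "human rights", "equal"}
edges = {("rights", "human rights"), ("equal", "rights")}

def neighbours(term):
    # Terms directly linked from `term` (its outgoing links)
    return {target for (source, target) in edges if source == term}

print(neighbours("rights"))  # {'human rights'}
```

The dict/set representation is only for exposition; any graph library (e.g. networkx) would serve the same role.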
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Basic techniques of natural language processing</head><p>In recent years, the tasks of computer processing of dynamic information flows have become increasingly important. In this work, some of the most common techniques for computerized natural language pre-processing are used. In particular, these techniques include text tokenization and the removal of stop words.</p><p>Tokenization, or lexical analysis, is the segmentation of a sequence of characters into a sequence of so-called tokens using a scanner or tokenizer that performs the lexical analysis. The term "token" should be understood as a certain form of a word. A token is an independent semantic unit, considered in the aggregate of all its possible forms and meanings. As the initial stage of computerized text processing, tokenization allows working with a word as an individual entity while knowing the context in which this word is used.</p><p>To clear the text of words that are a source of noise and are informationally unimportant, it is recommended to delete so-called stop words <ref type="bibr" target="#b14">[15]</ref>. For example, stop words include determiners, prepositions, particles, exclamations, conjunctions, adverbs, pronouns, introductory words and the single digits 0 to 9. Stop words also include sequences of characters often used on the Internet (for example, www, HTTP, com, etc.) and other frequently used auxiliary or independent parts of speech, symbols and punctuation marks. These words carry no additional semantic load. That is why stop words must be ignored while building terminological ontologies.
It is also recommended to use a stop dictionary, or stop word list, formed by an expert in the considered subject domain.</p><p>There are various software tools, in particular modules of the NLTK (Natural Language Toolkit) open-source Python NLP (Natural Language Processing) library, which help to easily apply the above pre-processing methods to different types of texts <ref type="bibr" target="#b15">[16]</ref>.</p><p>After tokenization, a technique called Part-of-Speech tagging (PoS tagging), or simply tagging, is usually used <ref type="bibr" target="#b16">[17]</ref>. This natural language processing step is one of the main and basic components of almost any NLP task and helps to extrapolate the language syntax and text structure. Part-of-Speech tagging is based not only on the definition of the word but also on the context in which the word is used. That is, the tagging takes into account the connection of the tagged word with neighbouring and related words in a phrase, sentence or paragraph. The main idea of text tagging is to relate each word in a text or corpus to a certain part of speech. Figure <ref type="figure" target="#fig_0">1</ref> shows the main idea of Part-of-Speech tagging with a simple example. Each word in the sentence «One day her mother said» is assigned a tag (label) that marks a certain part of speech. For example, the word «one» is tagged CD (cardinal number), the word «day» is tagged NN (noun) and so on (PRP$ marks a possessive pronoun and VBD a verb in the past tense). To mark parts of speech, a collection of predefined tags assigned to each word in the sentence is used.
Figure <ref type="figure" target="#fig_1">2</ref> presents the Penn Treebank list of tags used for the Part-of-Speech tagging task <ref type="bibr" target="#b18">[19]</ref>.</p><p>PoS tagging can be used in search engines, in text corpus analysis tools and in algorithms for indexing words, and it has many other uses as well. PoS tagging can be especially useful when words or tokens can have multiple tags. The tagging helps to distinguish between occurrences of a word used as one part of speech or another. Most importantly, tagging simplifies the context related to a specific subject domain.</p><p>The particular parts of speech are represented as word classes or lexical categories. These categories are based on the syntactic context of a word or phrase. Therefore, using Part-of-Speech tagging as a method for classifying words by parts of speech helps to mark up each word in a text (or corpus) according to its lexical category.</p><p>E. Brill's PoS tagger <ref type="bibr" target="#b20">[21]</ref> is one of the first and most widely used English taggers. Stochastic algorithms are also used in addition to the group of rule-based algorithms. </p></div>
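The tagging step illustrated in Figure 1 can be sketched in a few lines. The toy lexicon below hard-codes the Penn Treebank tags from the example sentence; in practice, nltk.pos_tag assigns them from a trained model, and the fallback to NN for unknown words is purely a simplifying assumption:

```python
# Toy lexicon reproducing the tags from the Figure 1 example sentence;
# a real tagger such as nltk.pos_tag derives these from a trained model.
LEXICON = {"One": "CD", "day": "NN", "her": "PRP$", "mother": "NN", "said": "VBD"}

def pos_tag(tokens):
    # Unknown tokens default to NN (noun) -- a simplifying assumption
    return [(token, LEXICON.get(token, "NN")) for token in tokens]

tagged = pos_tag("One day her mother said".split())
print(tagged)
# [('One', 'CD'), ('day', 'NN'), ('her', 'PRP$'), ('mother', 'NN'), ('said', 'VBD')]
```

With NLTK installed and its tagger models downloaded, the equivalent call is `nltk.pos_tag(nltk.word_tokenize(sentence))`.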
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Statistical weighting and key terms extracting using Part-of-Speech tagging</head><p>The initial stage of formalizing knowledge about a certain subject domain is conceptualization or, in other words, the definition of the basic objects (individuals, attributes, processes, etc.) and the relationships between them. For building a terminological ontology as a network based on text corpora, an important task is to define the key terms (key words and phrases). In their symbolic form, these key terms denote objects, processes or phenomena of the real world or environment.</p><p>To define these basic concepts (key terms), it is proposed to perform statistical weighting of the words and phrases that the text corpus contains, taking Part-of-Speech tagging into account.</p><p>To extract key words and phrases from a text, it is necessary to assign them a certain numerical weight. A statistical indicator can be used as one of the weights representing important words. Term Frequency-Inverse Document Frequency (TF-IDF for short) <ref type="bibr" target="#b21">[22]</ref> is commonly used as the statistical weight of terms, although it is not the only possible approach to identifying key terms. In <ref type="bibr" target="#b22">[23]</ref> it was shown that the use of the GTF (Global Term Frequency) is more effective when working with thematically related text documents contained in a text corpus. This statistical indicator shows how important a term is in the global context and is determined by the ratio of the total number of occurrences of this term in all documents to the total number of all terms that the documents contain.
It was shown that, in contrast to the common statistical indicator TF-IDF, the proposed indicator of term importance makes it possible to find information-important elements of the text more effectively when working with a thematically predefined text corpus in which an information-important term occurs in almost every document.</p></div>
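Under the definition above, the GTF of a term is its occurrence count over all documents divided by the total number of terms in the corpus. A minimal sketch of this computation on a toy tagged corpus (the data is illustrative, not the paper's):

```python
from collections import Counter

def gtf(tagged_docs):
    """Global Term Frequency of each (term, tag) pair across all documents."""
    pairs = [pair for doc in tagged_docs for pair in doc]
    counts = Counter(pairs)
    total = len(pairs)
    return {pair: count / total for pair, count in counts.items()}

# Toy corpus of two tagged documents
docs = [
    [("rights", "NNS"), ("human", "JJ"), ("law", "NN")],
    [("rights", "NNS"), ("freedom", "NN")],
]
weights = gtf(docs)
print(weights[("rights", "NNS")])  # 0.4  (2 occurrences out of 5 terms)
```

Note that, as the paper specifies, the key is the (term, tag) pair, so the same spelling used as different parts of speech is counted separately.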
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Method</head><p>First of all, it should be noted that the building of the networks of terms is carried out within each separate sentence of the text corpus.</p><p>In this work, modules of the NLTK (Natural Language Toolkit) developed in the Python programming language were used. For example, "word_tokenize" and "pos_tag" are used to automatically split text into tokens and to assign a part-of-speech tag to each word, respectively.</p><p>For stop word removal, the sets of stop words freely available at <ref type="bibr" target="#b23">[24,</ref><ref type="bibr" target="#b24">25]</ref> were applied. In addition to the standard sets of stop words, it is also proposed to use a list of stop words formed by experts.</p><p>The proposed method for determining key words and phrases and the direction of links between them is based on the results of classifying words by parts of speech (Part-of-Speech tagging). Practical research <ref type="bibr" target="#b15">[16]</ref> shows that the most commonly used parts of speech in English texts are determiners (tag DT), singular or mass nouns (NN), plural nouns (NNS), personal pronouns (PRP), verbs and all their forms (VB, VBD, VBG, VBN, VBP, VBZ), adjectives (JJ), including comparative adjectives (JJR) and superlative adjectives (JJS), and adverbs (RB), in particular comparative adverbs (RBR) and superlative adverbs (RBS). In general, individual nouns «NN*», which are usually related to people, places, things or concepts, and nouns coupled with adjectives (phrases like «JJ* NN*») are considered key terms.
After forming the phrases according to the patterns described above and arranging them in a certain order (a sequence is formed where phrases with more words are placed before the phrases and words that are part of them), the individual stop words are removed.</p><p>The next step is the statistical weighting of the words and phrases included in the sequence formed at the previous stage. In this work, the GTF (Global Term Frequency), the idea of which is described above, is used.</p><p>A so-called tuple is formed for each term in the order of its occurrence in the text. Each tuple consists of three elements: the first element is the term (a word or formed phrase); the next is a tag, or a combination of tags for formed phrases, assigned to the word or phrase depending on the part of speech to which it belongs; the last element is the numeric value of GTF. The defining feature of the proposed technique is that the GTF is calculated taking into account the first two elements of the tuple (the word or phrase and the part of speech to which it belongs). The number of such identical pairs, normalized to the total number of formed terms in the whole text, determines the value of the third element of the tuple.</p><p>The next step is to determine the undirected relationships between the terms in the text. The Horizontal Visibility Graph (HVG) algorithm is used to transform the time series formed by the sequence of numerical GTF values into an undirected graph <ref type="bibr" target="#b25">[26]</ref>. The idea of the algorithm is that two nodes ti and tj (in our case, two terms ti and tj), which correspond to xi and xj in the formed time series, are in horizontal visibility if and only if xk &lt; min(xi; xj) for all tk where ti &lt; tk &lt; tj.
In our case, the sequence ti, i = 1,..,n is the sequence of words and phrases formed within the sentence after the above-described pre-processing (where n is the number of all formed terms). The HVG allows building network structures in which a numerical weight is assigned to individual words or phrases.</p><p>If an undirected link between the nodes ti and tj of the time series is determined using the above HVG algorithm, then it is suggested to establish the direction of this link for the pair of nodes ti and tj (where ti is the source node and tj is the target node) only if:</p><p> in the sentence, the word (not a phrase) that corresponds to the source node ti occurs earlier than the term (word or phrase) that corresponds to the target node tj;  in the sentence, the phrase (not a word) that corresponds to the target node tj occurs earlier than the term that corresponds to the source node ti (Figure <ref type="figure">3</ref>). According to the principle of forming a sequence of terms described above, and the proposed rules used to determine the links, the network of key terms (key words and phrases) consists of words and phrases included in their corresponding extended phrases with more words. In the built directed network of terms, the major part of the terms is a so-called extension of the corresponding phrases and words. The algorithm proposed in <ref type="bibr" target="#b26">[27]</ref> uses a similar principle for determining directions in the terminological network (the directions are built on the principle of including a term in its corresponding phrase with a larger number of words).</p></div>
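The HVG criterion and the orientation step can be sketched as follows. The `orient` function encodes our reading of the two orientation rules above (the earlier term becomes the source when it is a single word, and the target when it is a phrase); treating "any term containing a space" as a phrase is an assumption made for this illustration:

```python
def hvg_edges(series):
    # Horizontal Visibility Graph: positions i and j are linked iff every
    # value strictly between them lies below min(series[i], series[j]).
    edges = []
    for i in range(len(series)):
        for j in range(i + 1, len(series)):
            if all(min(series[i], series[j]) > series[k] for k in range(i + 1, j)):
                edges.append((i, j))
    return edges

def orient(edge, terms):
    # Earlier term is a single word  -> it is the source: keep (i, j).
    # Earlier term is a phrase       -> it becomes the target: flip to (j, i).
    i, j = edge
    return (j, i) if " " in terms[i] else (i, j)

# Toy sentence-level sequence of terms and their GTF values
terms = ["rights", "human rights", "freedom"]
gtf_values = [0.022, 0.012, 0.010]
directed = [orient(edge, terms) for edge in hvg_edges(gtf_values)]
print(directed)  # [(0, 1), (2, 1)]
```

The O(n^2) double loop is the naive formulation of HVG; linear-time constructions exist but are not needed for sentence-length sequences.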
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 3: Example of building the directed terminological network</head><p>Using the algorithm proposed in <ref type="bibr" target="#b13">[14]</ref>, the weight values of the links between pairs of nodes are determined. After combining ("merging") the nodes that correspond to the same terms (phrases) in the previously built directed network, the number of same-directed links merged into a single one determines the weight of the merged link between the corresponding nodes.</p></div>
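The merging step amounts to counting identical same-directed links. A minimal sketch with hypothetical terms (not the paper's data):

```python
from collections import Counter

def merge_links(directed_links):
    # Weight of each merged link = number of identical same-directed links
    # collected across all sentences of the corpus.
    return Counter(directed_links)

links = [
    ("human", "human rights"),
    ("human", "human rights"),
    ("rights", "human rights"),
]
weights = merge_links(links)
print(weights[("human", "human rights")])  # 2
```

The resulting mapping of (source, target) pairs to counts is exactly the edge list of the directed weighted network of terms.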
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Text data aggregation and corpora forming</head><p>In the complex research and development of new information systems able to process large-scale data, thematic information flows from social media and databases of scientific publications, which are a source of text data, play an important role. Today the Internet offers a dynamic database available for experiments of a volume that was difficult even to imagine before. It is also important to note that much of this data is publicly available on social networks and through freely accessible web search engines. In addition, there are various technological possibilities for aggregating this data and forming text corpora, which can then be used as input data sets. Having freely available dynamic text arrays and systems for aggregating data from global computer networks therefore opens wide opportunities for improving existing methods of analysing this data and for developing new ones.</p><p>In order to carry out objective research and develop new methods and approaches that can later be implemented in information systems, it is important that the input data sets contain the most objective and reliable information. The themes reflected in these text data sets must also be characterized by a sufficient degree of completeness.</p><p>The freely available English text of the «Universal Declaration of Human Rights» published by the United Nations on its website <ref type="bibr" target="#b27">[28]</ref> was used to test the methodology.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Results of research</head><p>Using the NLTK (Natural Language Toolkit) open-source library and software modules of the Python NLP (Natural Language Processing) ecosystem, the initial stages of text processing, including tokenization and Part-of-Speech tagging, were performed for the text document «Universal Declaration of Human Rights». According to the methodology described above, the patterns of phrases were formed and stop word removal was carried out. At the next stage, statistical weighting of the obtained terms was performed using the global indicator of term importance (GTF). As a result, for the considered text, the so-called tuple «Term (word or phrase); Tag (part of speech); Numerical value of GTF» was formed in the order of term occurrence. The tuples with the largest numerical values of GTF are presented in Table 1 (all tuples are sorted in descending order of their GTF).</p><p>In this work, the Gephi software <ref type="bibr" target="#b28">[29]</ref> was used to visualize the obtained directed weighted network of words and phrases (Figure <ref type="figure" target="#fig_3">4</ref>).  </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Example of Parts-of-Speech tagging<ref type="bibr" target="#b17">[18]</ref> </figDesc><graphic coords="4,141.12,412.92,326.88,123.60" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2:The Penn Treebank PoS Tagset (excluding punctuations)<ref type="bibr" target="#b19">[20]</ref> </figDesc><graphic coords="5,72.00,72.00,449.28,262.08" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head></head><label></label><figDesc>Also considered important and key in this work are phrases of the form «NN*1 NN*2», «JJ*1 JJ*2», «JJ*1 JJ*2 NN*», «JJ*1 JJ*2 NN*1 NN*2». As was noted above, determiners, prepositions (IN), coordinating conjunctions (CC), individual verbs and their forms, adverbs and pronouns are stop words. But in this work we also consider as key the phrases whose patterns look like «V*1 to V*2», «NN*1 IN/CC NN*2», «JJ*1 IN/CC JJ*2», «JJ* NN*1 IN/CC NN*2», «JJ*1 IN/CC JJ*2 NN*», «JJ1 JJ2 NN1 IN/CC NN2», «JJ1 IN/CC JJ2 NN1 IN/CC NN2».</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Network of terms built for the legal document «Universal Declaration of Human Rights» (node labels contain the term and its corresponding tag)</figDesc><graphic coords="8,72.00,135.24,451.56,424.44" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head></head><label></label><figDesc></figDesc><graphic coords="7,146.40,109.92,315.84,182.28" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Top 58 key terms extracted from the «Universal Declaration of Human Rights» and sorted in descending order of their GTF</figDesc><table><row><cell>Terms</cell><cell>Tag</cell><cell>GTF</cell></row><row><cell>rights</cell><cell>NNS</cell><cell>0.022105</cell></row><row><cell>human</cell><cell>JJ</cell><cell>0.012632</cell></row><row><cell>equal</cell><cell>JJ</cell><cell>0.010526</cell></row><row><cell>freedoms</cell><cell>NNS</cell><cell>0.010526</cell></row><row><cell>freedom</cell><cell>NN</cell><cell>0.010526</cell></row><row><cell>law</cell><cell>NN</cell><cell>0.009474</cell></row><row><cell>protection</cell><cell>NN</cell><cell>0.009474</cell></row><row><cell>entitled</cell><cell>VBN</cell><cell>0.009474</cell></row></table></figure>
		</body>
		<back>
			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.">Conclusion</head><p>In this paper, a new method for building networks of terms was considered. The key terms of the built network were extracted using broader natural language processing based on Part-of-Speech tagging. Analysis of the ontological models obtained by the method for building directed weighted networks of terms from a text corpus allows drawing constructive conclusions about the subject domain to which the texts are thematically related, and can serve as a basis for decision-making in this domain. The method was tested on the example of a sufficiently well-structured legal document, the «Universal Declaration of Human Rights», which is freely available on the Internet, and as a result a network of terms was built. The considered methodology can be used, in particular, in systems for automatic text structuring and summarization of legal information, or in systems for detecting duplicates and contradictions in normative legal documents. In general, it will promote the formation and improvement of the conceptual and terminological apparatus in the legal sphere and the harmonization of national and international law.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">V</forename><surname>Mayer-Schönberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cukier</surname></persName>
		</author>
		<title level="m">Big data: A revolution that will transform how we live, work, and think</title>
				<imprint>
			<publisher>Houghton Mifflin Harcourt</publisher>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<ptr target="https://www.fluxmagazine.com/data-creation-powerful-implications/" />
		<title level="m">Humanity Doubles Its Data Creation Every 18 Months, And It Has Powerful Implications</title>
				<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Big data: A review</title>
		<author>
			<persName><forename type="first">S</forename><surname>Sagiroglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Sinanc</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">2013 international conference on collaboration technologies and systems (CTS), IEEE</title>
				<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="42" to="47" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<ptr target="https://www.forbes.com/sites/gilpress/2020/01/06/6-predictions-about-data-in-2020-and-the-coming-decade/?sh=5d1634224fc3" />
		<title level="m">Predictions About Data In 2020 And The Coming Decade</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Beyond the hype: Big data concepts, methods, and analytics</title>
		<author>
			<persName><forename type="first">A</forename><surname>Gandomi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Haider</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">International journal of information management</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="137" to="144" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Analysis of information flows in global computer networks (based on the scientific report at the meeting of the Presidium of the NAS of Ukraine on January 25</title>
		<author>
			<persName><forename type="first">D</forename><surname>Lande</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bulletin of the National Academy of Sciences of Ukraine</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="45" to="53" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
	<note>in Ukrainian</note>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Feldman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sanger</surname></persName>
		</author>
		<title level="m">The text mining handbook: advanced approaches in analyzing unstructured data</title>
				<imprint>
			<publisher>Cambridge university press</publisher>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Maybury</surname></persName>
		</author>
		<title level="m">Advances in automatic text summarization</title>
				<imprint>
			<publisher>MIT press</publisher>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">The automatic creation of literature abstracts</title>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">P</forename><surname>Luhn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IBM Journal of research and development</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="159" to="165" />
			<date type="published" when="1958">1958</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Subject Domain Models of Jurisprudence According to Google Scholar Scientometrics Data</title>
		<author>
			<persName><forename type="first">D</forename><surname>Lande</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Dmytrenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Radziievska</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 4th International Conference on Computational Linguistics and Intelligent Systems (COLINS 2020). Volume I: Main Conference</title>
				<meeting>the 4th International Conference on Computational Linguistics and Intelligent Systems (COLINS 2020). Volume I: Main Conference<address><addrLine>Lviv, Ukraine</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2020">April 23-24, 2020</date>
			<biblScope unit="volume">2604</biblScope>
			<biblScope unit="page" from="32" to="43" />
		</imprint>
	</monogr>
	<note>CEUR Workshop Proceedings (ceur-ws.org)</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Best Text Summarizing Tool for Academic Writing</title>
		<ptr target="https://ivypanda.com/online-text-summarizer" />
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note>For Free</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Review of computer-linguistic methods of natural language texts processing</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">O</forename><surname>Nikonenko</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="174" to="181" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
	<note>in Russian</note>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<title level="m" type="main">Inflected Language: Toward a Hermeneutics of Nearness: Heidegger, Levinas</title>
		<author>
			<persName><forename type="first">K</forename><surname>Ziarek</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1994">1994</date>
			<publisher>SUNY Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Creating the Directed Weighted Network of Terms Based on Analysis of Text Corpora</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">V</forename><surname>Lande</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><forename type="middle">O</forename><surname>Dmytrenko</surname></persName>
		</author>
		<idno type="DOI">10.1109/SAIC51296.2020.9239182</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE 2nd International Conference on System Analysis &amp; Intelligent Computing (SAIC)</title>
				<meeting><address><addrLine>Kyiv, Ukraine</addrLine></address></meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="1" to="4" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">The automatic identification of stop words</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">J</forename><surname>Wilbur</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Sirotkin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of information science</title>
		<imprint>
			<biblScope unit="volume">18</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="45" to="55" />
			<date type="published" when="1992">1992</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<title level="m" type="main">Natural Language Processing with Python</title>
		<author>
			<persName><forename type="first">S</forename><surname>Bird</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Klein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Loper</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2009">2009</date>
			<publisher>O&apos;Reilly Media Inc</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging</title>
		<author>
			<persName><forename type="first">E</forename><surname>Brill</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational linguistics</title>
		<imprint>
			<biblScope unit="volume">21</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="543" to="565" />
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<title level="m" type="main">Extract Custom Keywords using NLTK POS tagger in python</title>
		<ptr target="https://thinkinfi.com/extract-custom-keywords-using-nltk-pos-tagger-in-python/" />
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Building a large annotated corpus of English: The Penn Treebank</title>
		<author>
			<persName><forename type="first">M</forename><surname>Marcus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Santorini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Marcinkiewicz</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="313" to="330" />
			<date type="published" when="1993">1993</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Part-of-speech tagging guidelines for the Penn Treebank Project</title>
		<author>
			<persName><forename type="first">B</forename><surname>Santorini</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1990">1990</date>
			<pubPlace>PA</pubPlace>
		</imprint>
		<respStmt>
			<orgName>Department of Computer and Information Science School of Engineering and Applied Science University of Pennsylvania Philadelphia</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">A simple rule-based part of speech tagger</title>
		<author>
			<persName><forename type="first">E</forename><surname>Brill</surname></persName>
		</author>
		<idno type="DOI">10.3115/974499.974526</idno>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the third conference on Applied natural language processing (ANLC &apos;92). Association for Computational Linguistics</title>
				<meeting>the third conference on Applied natural language processing (ANLC &apos;92). Association for Computational Linguistics<address><addrLine>Stroudsburg, PA, USA</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1992">1992</date>
			<biblScope unit="page" from="152" to="155" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Term-weighting approaches in automatic text retrieval</title>
		<author>
			<persName><forename type="first">G</forename><surname>Salton</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Buckley</surname></persName>
		</author>
		<idno type="DOI">10.1016/0306-4573(88)90021-0</idno>
	</analytic>
	<monogr>
		<title level="j">Information processing &amp; management</title>
		<imprint>
			<biblScope unit="volume">24</biblScope>
			<biblScope unit="issue">5</biblScope>
			<biblScope unit="page" from="513" to="523" />
			<date type="published" when="1988">1988</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Determining the Directions of Links in Undirected Networks of Terms</title>
		<author>
			<persName><forename type="first">D</forename><surname>Lande</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Dmytrenko</surname></persName>
		</author>
		<author>
			<persName><surname>Radziievska</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Selected Papers of the XIX International Scientific and Practical Conference «Information Technologies and Security» (ITS 2019)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">2577</biblScope>
			<biblScope unit="page" from="132" to="145" />
		</imprint>
	</monogr>
	<note>CEUR Workshop Proceedings (ceur-ws.org)</note>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<ptr target="http://xpo6.com/download-stop-word-list/" />
		<title level="m">XPO6: Download Stop Word List</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">Text Fixer: Common English Words List</title>
		<ptr target="http://www.textfixer.com/tutorials/commonenglishwords.php" />
		<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">A characterization of horizontal visibility graphs and combinatorics on words</title>
		<author>
			<persName><forename type="first">G</forename><surname>Gutin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mansour</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Severini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Physica A: Statistical Mechanics and its Applications</title>
		<imprint>
			<biblScope unit="volume">390</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page" from="2421" to="2428" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<title level="m" type="main">Building of networks of natural hierarchies of terms based on analysis of texts corpora</title>
		<author>
			<persName><forename type="first">D</forename><surname>Lande</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1405.6068</idno>
		<imprint/>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b27">
	<monogr>
		<ptr target="https://www.un.org/en/universal-declaration-human-rights/" />
		<title level="m">Universal Declaration of Human Rights</title>
				<imprint>
			<date type="published" when="2007">2007</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<title/>
		<author>
			<persName><surname>Gephi</surname></persName>
		</author>
		<ptr target="https://gephi.org" />
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
