Extracting Knowledge Tokens from Text Streams

Eugene Alferov1,2 and Vadim Ermolayev1

1 Department of IT, Zaporozhye National University, 66 Zhukovskogo st., 69063, Zaporozhye, Ukraine
alferov.evgeniy@gmail.com, vadim@ermolayev.com
2 Kherson State University, 27, 40 Rokiv Zhovnya ave., 73000, Ukraine
alferov_jk@ksu.ks.ua

Abstract. This problem analysis paper presents our position on how a solution could be sought to the problem of extracting semantically rich fragments from a stream of plain text posts. We first present our understanding of the problem context and explain the focus of our research. Further, in the problem statement section we elaborate the workflow for knowledge extraction from incoming information tokens. This workflow is then used as a key to structure our review of the literature on the relevant component techniques which may be exploited in combination to achieve the desired outcome. We finally outline our plan for conducting the experiments aimed at validating the workflow and finding a proper combination of the component techniques for all steps which may solve our specific research problem.

Keywords. Workflow, knowledge extraction, text streams, processing, ontology learning, component techniques

Key terms. Data, Process, Knowledge, Approach, Methodology

1 Introduction

The dramatic growth of data volumes we face today is accelerated by the spread of social networking applications that allow non-specialist users to create huge amounts of content easily and freely. Equipped with rapidly evolving mobile devices, a user is becoming a nomadic gateway boosting the generation of additional real-time sensor data. The emerging Internet of Things turns each and every thing into a source of data or content, adding billions of artificial and autonomous data sources to the overall landscape. Smart spaces, where people, devices, and their infrastructures are all loosely connected, also generate data of unprecedented volumes and with velocities rarely observed before. Noticeably, the major part of the new data comes in streams. The expectation is that valuable information will be extracted out of all these data to help improve the quality of life and make our world a better place for humans.

Humans, however, are left bewildered about how to use, analyze, and understand all these data while giving a proper account to their dynamics. A recent estimate puts the need for data-savvy managers in the United States at 1.5 million [1]. This manpower is needed to extract and use valuable information and knowledge for further decision making. The critical steps in this work are (i) extracting information and knowledge; and (ii) bringing the descriptions of the reflections of the world or domain into a refined state, accounting for the changes brought in by new data, at scale.

In this paper we focus on step (i), extraction. In Section 2 we present the problem statement by giving basic definitions and providing our view of what a processing workflow could look like. A plethora of approaches, techniques, technologies, and software tools already exists for solving different parts of the overall problem. Hence, in Section 3 we analyze the related work and structure this analysis using the workflow as the key. Finally, we conclude the paper and present our plans for future proof-of-concept experimental work in Section 4.

2 Problem Statement

An ontology is a complex artifact that comprises structural components of several types. In the following, the structural notation for an ontology used in Description Logics [2] is exploited: an ontology O comprises its schema S and the set of individuals I, i.e. O = (S, I). The ontology schema is also referred to as the terminological component (TBox). It contains the statements describing the concepts of O, the properties of those concepts, and the axioms over the schema constituents. Taking a finer grained look at an ontology schema, one may consider S as comprising the following interrelated constituents: S = {S_C, S_O, S_D, S_A}, where S_C is the set of statements describing concepts, S_O is the set of statements describing object properties, S_D is the set of statements describing datatype properties, and S_A is the set of axioms specifying constraints over S_C, S_O, and S_D (cf. [3]). One may notice that these constituents correspond to the types of the schema specification statements of an ontology representation language L which is used for specifying O. The set of individuals, also referred to as the assertional component (ABox), is the set of ground statements about the individuals and their attribution to the constituents of the schema.
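To make this decomposition concrete, the following minimal Python sketch mirrors it as a plain in-memory data structure; the class and field names are ours and purely illustrative, not part of any standard ontology API.

```python
from dataclasses import dataclass, field

# Illustrative containers mirroring O = (S, I) with S = {S_C, S_O, S_D, S_A}.
# A knowledge token extracted from a single information token has the same
# shape, holding only the ontology fragment that the token contributes.

@dataclass
class OntologySchema:                                            # S (TBox)
    concepts: set[str] = field(default_factory=set)              # S_C
    object_properties: set[str] = field(default_factory=set)     # S_O
    datatype_properties: set[str] = field(default_factory=set)   # S_D
    axioms: set[str] = field(default_factory=set)                # S_A

@dataclass
class Ontology:                                                  # O = (S, I)
    schema: OntologySchema = field(default_factory=OntologySchema)  # S
    individuals: set[str] = field(default_factory=set)              # I (ABox)

knowledge_token = Ontology()                 # an (initially empty) fragment
knowledge_token.schema.concepts.add("Workflow")
```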
Ontology learning is the process of extracting the abovementioned constituents of O from a text stream source. More specifically, the problem approached in this research work is twofold. For every individual plain text document (further referred to as an information token) arriving in the stream window, DO:

(i) extract the ontological fragment (further referred to as a knowledge token) specifying the semantics of the information token;
(ii) refine the ontology O by incorporating the changes brought in by the knowledge token.

The focus of this paper is the first part of the problem: the extraction of knowledge tokens from information tokens of plain text, in a particular professional domain, coming in a stream. The texts of ICTERI paper abstracts have been chosen as the domain and source text corpus for our initial experiments (see also Section 4).

As an ontology is a complex artifact, the extraction of knowledge tokens from texts is also a complex process. It comprises several steps and, possibly, iterations for extracting the different structural constituents of S and I. These steps produce several types of outputs in a particular sequence, sometimes referred to as the ontology learning layer cake (cf. [4]). These outputs are, in order: terms; concepts and their instances; datatype properties; taxonomic relationships and object properties; axioms. Based on [5], we present in Fig. 1 a workflow putting together the extraction steps, inputs, outputs, and required component technology types.

Fig. 1. A workflow for knowledge token extraction. (Figure: Phase 1, Text Pre-processing, comprises the task T1 Extract Terms; Phase 2, Ontology Extraction, comprises T2 Form Concepts and Concept Instances, T3 Extract Datatype Properties, T4 Extract / Discover Concept Hierarchies, T5 Extract Object Properties, and T6 Extract Axioms. Each task is annotated with its applicable component techniques, cf. Table 1, and with the domain resources it consumes or produces.)

The overall workflow contains two consecutive phases: Text Pre-processing and Ontology Extraction. The Text Pre-processing phase takes the information token as plain text input and produces its structured representation as a set of terms by applying several statistical and linguistic techniques. All the tasks of the Ontology Extraction phase use the output of Phase 1 as their input and incrementally build up the knowledge token by adding different ABox and TBox constituents. For that, statistical, linguistic, semantic, and logical techniques are employed in combinations. Fig. 1 lists the relevant component techniques per task. No implementation, however, uses all of them at once. Therefore our initial research objective is to find out which combination of component techniques works best for our specific data, i.e. copes well with (a) texts of small size that nevertheless belong to a particular domain; and (b) limited processing time constrained by a stream window lifetime parameter. Further, after this constellation of component techniques is chosen, the objective will be to refine those techniques which do not provide results of satisfactory quality in our problem setting.
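To show how the tasks of Fig. 1 could be chained in code, the sketch below gives a minimal pipeline skeleton; every function is a hypothetical placeholder stub that a real implementation would fill with one or more of the component techniques reviewed in Section 3.

```python
# Hypothetical skeleton of the two-phase workflow of Fig. 1. The stubs return
# trivial results; their bodies would be realized by the component techniques
# of Section 3 (the comments name some candidates per task).

def t1_extract_terms(information_token: str) -> list[str]:
    # Phase 1: de-noising, tokenization, POS tagging, lemmatization, chunking.
    return information_token.lower().split()          # naive stand-in

def t2_form_concepts(terms: list[str]) -> set[str]:
    return set(terms)                                  # co-occurrence, clustering, LSA, ...

def t3_extract_datatype_properties(terms, concepts) -> set[tuple]:
    return set()                                       # dependency analysis, ARM, LSPs, ...

def t4_extract_concept_hierarchy(concepts) -> set[tuple]:
    return set()                                       # term subsumption, clustering, ...

def t5_extract_object_properties(terms, concepts) -> set[tuple]:
    return set()                                       # LSPs, semantic templates, inference, ...

def t6_extract_axioms(hierarchy, object_properties) -> set[str]:
    return set()                                       # axiom templates, ILP

def extract_knowledge_token(information_token: str) -> dict:
    terms = t1_extract_terms(information_token)                         # T1
    concepts = t2_form_concepts(terms)                                  # T2
    hierarchy = t4_extract_concept_hierarchy(concepts)                  # T4
    object_properties = t5_extract_object_properties(terms, concepts)   # T5
    return {
        "concepts": concepts,                                                    # S_C
        "datatype_properties": t3_extract_datatype_properties(terms, concepts),  # S_D
        "hierarchy": hierarchy,
        "object_properties": object_properties,                                  # S_O
        "axioms": t6_extract_axioms(hierarchy, object_properties),               # S_A
    }

print(extract_knowledge_token("Ontology learning from short abstracts in a text stream."))
```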
3 Related Research and Available Component Techniques

In this section we describe the component techniques, outlined in Fig. 1, which we found relevant to our work. These component techniques can overall be categorized as linguistic, statistical, semantic, and logical (cf. [5]). As pictured in Fig. 1, they can be applied at different steps and for different purposes. Though not explicitly shown in Fig. 1, the steps may undergo iterations for refining their results. Therefore, the workflow proposed in this paper can be considered hybrid and iterative.

De-noising (statistical, linguistic). This is a method that extracts the de-noised text, comprising the content-rich sentences, from full texts [6]. Processing of noisy text is important because the quality of texts in the form of blogs, emails, and chat logs can be extremely poor. The sentences in dirty texts are typically full of spelling errors, ad-hoc abbreviations, and improper casing [7].

Tokenization. Tokenization is the splitting of a text into a set of tokens, usually words. This process is unsupervised and can be performed automatically by a parser program.

Part of speech detection/tagging (linguistic). Part of speech tagging (POST) is the process of assigning one of the parts of speech to a given word. POST provides the syntactic structure and dependency information required for further linguistic analysis in order to uncover terms and relations. POST is a semi-supervised or even unsupervised process.

Lemmatization (linguistic). Lemmatization is the reduction of morphological variants of the tokens to their base form and can be performed in an unsupervised way. To achieve this, the word form must be known, i.e. the part of speech of every word in the text document has to be assigned. This process takes time and may introduce errors.
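As a concrete illustration of chaining tokenization, POS tagging, and lemmatization, the following sketch uses the NLTK toolkit; NLTK is only our illustrative choice here, the workflow itself does not prescribe a particular library, and the tag-mapping helper is our own.

```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required NLTK resources (assumed to be available):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger"); nltk.download("wordnet")

_lemmatizer = WordNetLemmatizer()

def _wordnet_pos(treebank_tag: str) -> str:
    # Map Penn Treebank tags to the coarse POS codes the WordNet lemmatizer expects.
    return {"J": "a", "V": "v", "R": "r"}.get(treebank_tag[:1], "n")

def preprocess(information_token: str) -> list[tuple[str, str]]:
    tokens = nltk.word_tokenize(information_token)                # tokenization
    tagged = nltk.pos_tag(tokens)                                  # POS tagging
    return [(_lemmatizer.lemmatize(tok.lower(), _wordnet_pos(tag)), tag)
            for tok, tag in tagged]                                # lemmatization

print(preprocess("Knowledge tokens are extracted from abstracts arriving in a stream."))
```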
Chunking (linguistic). Chunking is the unsupervised splitting of a text into syntactically correlated parts.

Sentence parsing. Sentence parsing is identifying the syntactic structure of a sentence, for example in the form of a parse tree.

Syntactic structure analysis (linguistic). In syntactic structure analysis, words and modifiers in syntactic structures (e.g., noun phrases, verb phrases, and prepositional phrases) are analyzed to discover potential terms and relations. It can be done in an unsupervised way.

Relevance analysis (statistical). The extent of occurrence of terms in individual documents and in text corpora is employed for relevance analysis. This is a semi-supervised or even unsupervised technique.

Co-occurrence analysis (statistical). Co-occurrence analysis identifies lexical units that tend to occur together, for purposes ranging from extracting related terms to discovering implicit relations between concepts [5]. This technique is unsupervised.

Clustering (statistical). Grouping together variants of terms to form concepts and separating unrelated ones is known as term clustering. It is usually an unsupervised technique. In this approach some measure of similarity is employed to assign terms into groups for discovering concepts or constructing a hierarchy [8]. Major issues in clustering are working with high-dimensional data and extracting and preparing features for similarity measurement. This gave rise to a class of featureless similarity measures based solely on the co-occurrence of words in large text corpora. It is known that clustering results are of acceptable quality only if a statistically representative (i.e. large) text corpus is processed. This fact limits the applicability of the technique in our setting (texts of small size). However, used in combination with other techniques, clustering may add value to the result and thus needs to be tried.

Latent semantic analysis (statistical). Latent semantic analysis (LSA) is a theoretical approach and mathematical method for determining the meaning similarity of words and passages by the analysis of large text corpora. The main idea is that the aggregate of all the word contexts in which a given word does and does not appear provides a set of mutual constraints that largely determines the similarity of meaning of words and sets of words to each other [9]. LSA can be useful in our investigation because it is a fully automatic mathematical and statistical technique for extracting and inferring meaningful relations from the contextual usage of words in text.
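For illustration, relevance analysis and LSA can be prototyped in a few lines with scikit-learn (our library choice, not one mandated by the workflow): TF-IDF weights serve as a simple relevance measure, and a truncated SVD of the term-document matrix yields the latent semantic space in which document or term similarity is computed. The toy corpus below is an assumption standing in for real information tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for ICTERI abstracts; real inputs would be the
# information tokens from the stream window.
abstracts = [
    "Ontology learning extracts concepts and relations from text.",
    "Knowledge tokens are extracted from short plain text abstracts.",
    "Sensor data streams require real-time processing.",
]

tfidf = TfidfVectorizer(stop_words="english")       # relevance analysis (TF-IDF weights)
X = tfidf.fit_transform(abstracts)                   # term-document matrix

lsa = TruncatedSVD(n_components=2, random_state=0)   # LSA via truncated SVD
X_lsa = lsa.fit_transform(X)

# Similarity of the documents in the latent semantic space.
print(cosine_similarity(X_lsa))
```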
Sub-categorization (linguistic, semantic). Sub-categorization, or extracting sub-categorization frames, is an approach to extracting one type of lexical information of particular importance for Natural Language Processing (NLP). Access to an accurate and comprehensive sub-categorization lexicon is vital for the development of successful parsing technology, important for many NLP tasks (e.g. automatic verb classification), and useful for any application which can benefit from information about predicate-argument structure (e.g. Information Extraction) [10].

Using a semantic lexicon (linguistic, semantic). A semantic lexicon is a dictionary or thesaurus of words/terms labeled with semantic classes (e.g., "ongoing effort" is an Activity), so that associations can be drawn between words that have not previously been encountered [11]. Semantic lexicons are a popular resource in ontology learning and play an important role in many NLP tasks.

Dependency analysis (linguistic). Syntactic structure consists of lexical items linked by dependencies, which are binary asymmetric relations held between a head and its dependents. Dependency analysis examines dependency information to uncover relations at the sentence level. In this analysis, grammatical relations, such as subject, object, adjunct, and complement, are used for determining more complex relations. Dependency analysis is usually an unsupervised approach.

Association rule mining (statistical). Association rule mining aims to extract correlations, frequent patterns, associations, or causal structures among sets of items in data repositories [12]. It is an unsupervised component technique which works well for considerably large data corpora. Association rules highlight correlations between features in the texts, e.g. keywords. Association rules can be easily interpreted and are understandable for an analyst or even for an ordinary user.

Use of lexico-syntactic patterns (linguistic). Lexico-syntactic patterns (LSPs) are generalized linguistic structures or schemas that indicate semantic relationships among terms and can be applied to the identification of formalized concepts and conceptual relations in natural language text [13]. Lexico-syntactic patterns are suitable for automatic ontology building, since they model semantic relations. They display exactly the kind of relation between their parts that makes them easily translatable into an ontology representation.

Use of semantic templates (semantic, linguistic). Semantic templates are similar to lexico-syntactic patterns in terms of their purpose. However, semantic templates offer more detailed rules and conditions for extracting not only taxonomic relations but also complex non-taxonomic relations [5].

Logical inference (logical, semantic). In logical inference, implicit relations are derived from existing ones using rules such as transitivity and inheritance [5]. However, invalid or conflicting relations may also be introduced if the inference rule set is incomplete or underspecified, for example because the validity of transitivity or mutual disjointness axioms is not properly accounted for.

Term subsumption (statistical, semantic). In the subsumption method, a given term subsumes another term if the documents in which the latter term occurs are a subset of the documents in which the given term occurs [14]. A term subsumption measure is used to quantify the extent to which a term x is more general than another term y. This technique is semi-supervised or even unsupervised. The term subsumption technique is easy to implement and makes labeling concepts an easy task. However, with this method it is difficult to classify terms that do not co-occur frequently, and it requires a large data set to work reliably.
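To make the subsumption measure tangible, the sketch below computes, for every ordered term pair (x, y), the share of documents containing y that also contain x, and reports x as subsuming y when that share is high while the reverse share is not; the threshold of 0.8 is our illustrative assumption, not a value taken from [14].

```python
from itertools import permutations

def term_document_sets(docs: list[set[str]], terms: list[str]) -> dict[str, set[int]]:
    # Map each term to the set of document indices in which it occurs.
    return {t: {i for i, d in enumerate(docs) if t in d} for t in terms}

def subsumptions(docs, terms, threshold=0.8):
    """Yield (broader, narrower) pairs: x subsumes y if nearly all documents
    containing y also contain x, but not vice versa (cf. [14])."""
    occ = term_document_sets(docs, terms)
    for x, y in permutations(terms, 2):
        if not occ[y]:
            continue
        p_x_given_y = len(occ[x] & occ[y]) / len(occ[y])
        p_y_given_x = len(occ[x] & occ[y]) / len(occ[x]) if occ[x] else 0.0
        if p_x_given_y >= threshold and p_y_given_x < threshold:
            yield x, y

# Toy example: "learning" occurs in every document and so subsumes the narrower terms.
docs = [{"learning", "ontology"}, {"learning", "ontology"}, {"learning", "agent"}]
print(list(subsumptions(docs, ["learning", "ontology", "agent"])))
```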
Use of axiom templates (semantic, linguistic). Axioms are useful for describing the relationships between the concepts of an ontology. They can be written in different ways depending on the relations that exist among the concepts.

Inductive logic programming (logical, semantic). Inductive logic programming (ILP) is a research area at the intersection of inductive machine learning and logic programming. ILP generalizes the inductive and the deductive approaches by aiming to develop theories, techniques, and applications of inductive learning from observations and background knowledge represented in a first-order logical framework.

An overview of the applicability of the presented component techniques, and of their relationship to the tasks in our workflow, is given in Table 1.

Table 1. Relevance of component techniques to the tasks within the workflow for extracting knowledge tokens from information tokens

Component technique: applicable tasks (Fig. 1) and technique types
De-noising: T1 (st, li)
Part of speech detection/tagging: T1 (li)
Lemmatization: T1 (li)
Chunking: T1 (li)
Syntactic structure analysis: T1, T3, T4, T5 (li)
Relevance analysis: T1 (st)
Co-occurrence analysis: T1, T2 (st)
Clustering: T2, T4 (st)
Latent semantic analysis: T2 (st)
Sub-categorization: T2 (se, li)
Using a semantic lexicon: T2, T4 (se, li)
Dependency analysis: T3, T4, T5 (li)
Association rule mining: T3, T5 (st)
Use of lexico-syntactic patterns: T3, T4, T5 (li)
Use of semantic templates: T4, T5 (se, li)
Logical inference: T3, T4, T5 (lo, se)
Term subsumption: T4 (st, se)
Use of axiom templates: T6 (se, li)
Inductive logic programming: T6 (lo, se)
Legend: li – linguistic; lo – logical; se – semantic; st – statistical

4 Summary and Future Work

Our literature search has revealed that extracting knowledge, or more specifically learning ontologies, from plain text corpora is a well developed research field that continues to produce new results. However, to the best of our knowledge, extracting ontologies from text streams, with a constraint on the lifetime of an input information token, is a recently emerged research problem. The reasons for adding this specific problem to the research agenda are the phenomenon of Big Data, in particular its velocity dimension, as well as the need for better, more reliable, semantically rich solutions for automating Big Data analytics. One more complication introduced by our problem setting is the small size of an individual information token, which hinders obtaining good quality results with the majority of traditional statistical and linguistic techniques for ontology extraction from text corpora.

We argued in this paper that applying a combination of the relevant existing component techniques in a structured and iterative way may overall produce such a result, as an incremental collection of the ontology elements contributed to a knowledge token by individual techniques at different stages of our proposed workflow.

As this research is in an early phase, we do not yet have the proof for this hypothesis. However, a plan is in place for conducting an initial series of proof-of-concept experiments in which the component technologies will be exploited in a semi-supervised or supervised fashion. For that we plan to use a small but thoroughly semantically annotated corpus of the abstracts (information tokens) and full texts of ICTERI papers collected in the ICTERIWiki portal (http://isrg.kit.znu.edu.ua/icteriwiki/). This document corpus is incrementally extended by adding the papers and their semantic annotations for each new ICTERI conference instance. The annotations are done using the ICTERI Scope Ontology by Tatarintseva et al. [15]. These annotations will be used as a "golden standard" for evaluating the results of automated knowledge token extraction using the workflow proposed in this paper.
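One way such a gold-standard evaluation could be organized (a sketch under our own assumptions, not a protocol fixed in this paper) is to compare the concept set of an automatically extracted knowledge token with the manually annotated concepts of the same abstract using precision, recall, and F1; the concept names below are purely illustrative.

```python
def precision_recall_f1(extracted: set[str], gold: set[str]) -> tuple[float, float, float]:
    # Compare the concepts of an extracted knowledge token against the
    # gold-standard annotations of the same information token.
    tp = len(extracted & gold)
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical example: manual annotations vs. automatic extraction output.
gold = {"Ontology", "KnowledgeManagementProcess", "Methodology"}
extracted = {"Ontology", "Methodology", "Workflow"}
print(precision_recall_f1(extracted, gold))   # (0.666..., 0.666..., 0.666...)
```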
After the concept is proven and the constellation of the component techniques is circumscribed, we plan to test the approach on one of the professional news portals. Further, we plan to extend the proposed knowledge extraction procedure to the processing of sensor stream data.

References

1. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., Hung Byers, A.: Big Data: the Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute (2011), http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation
2. Nardi, D., Brachman, R.J.: An Introduction to Description Logics. In: Baader, F., Calvanese, D., McGuinness, D.L., Nardi, D., Patel-Schneider, P.F. (eds.) The Description Logic Handbook, Cambridge University Press, New York, NY, USA (2007)
3. Davidovsky, M., Ermolayev, V., Tolok, V.: Instance Migration between Ontologies Having Structural Differences. Int. J. on Artificial Intelligence Tools, 20(6), 1127–1156 (2011)
4. Buitelaar, P., Cimiano, P., Magnini, B.: Ontology Learning from Text: an Overview. In: Buitelaar, P., Cimiano, P., Magnini, B. (eds.) Ontology Learning from Text: Methods, Evaluation and Applications, IOS Press, Amsterdam (2005)
5. Wong, W., Liu, W., Bennamoun, M.: Ontology Learning from Text: a Look Back and into the Future. ACM Comput. Surv., 44(4), Article 20, 36 pages (2012), http://doi.acm.org/10.1145/2333112.2333115
6. Shams, R., Mercer, R.E.: Investigating Keyphrase Indexing with Text Denoising. In: Proc. of the 12th ACM/IEEE-CS Joint Conf. on Digital Libraries, pp. 263–266, ACM (2012)
7. Wong, W., Liu, W., Bennamoun, M.: Enhanced Integrated Scoring for Cleaning Dirty Texts. arXiv preprint arXiv:0810.0332 (2008)
8. Cimiano, P., Hotho, A., Staab, S.: Learning Concept Hierarchies from Text Corpora using Formal Concept Analysis. Journal of Artificial Intelligence Research, 24(1), 305–339 (2005)
9. Landauer, T.K., Foltz, P.W., Laham, D.: An Introduction to Latent Semantic Analysis. Discourse Processes, 25(2-3), 259–284 (1998)
10. Preiss, J., Briscoe, T., Korhonen, A.: A System for Large-Scale Acquisition of Verbal, Nominal and Adjectival Subcategorization Frames from Corpora. In: Annual Meeting of the Association for Computational Linguistics, 45(1), 912 (2007)
11. Thelen, M., Riloff, E.: A Bootstrapping Method for Learning Semantic Lexicons using Extraction Pattern Contexts. In: Proc. ACL-02 Conf. on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, vol. 10, pp. 214–221 (2002)
12. Kotsiantis, S., Kanellopoulos, D.: Association Rules Mining: a Recent Overview. GESTS International Transactions on Computer Science and Engineering, 32(1), 71–82 (2006)
13. Summary on Requirements on Lexico-Syntactic Patterns (Synthesis by PC), http://www.w3.org/community/ontolex/wiki/Specification_of_Requirements/Lexico-Syntactic_Patterns
14. De Knijff, J., Frasincar, F., Hogenboom, F.: Domain Taxonomy Learning from Text: the Subsumption Method versus Hierarchical Clustering. Data & Knowledge Engineering (2012)
15. Tatarintseva, O., Borue, Yu., Ermolayev, V.: Validating OntoElect Methodology in Refining ICTERI Scope Ontology. In: H.C. Mayr et al. (eds.) UNISCON 2012, LNBIP 137, pp. 128–139 (2013)