<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extracting Knowledge Tokens from Text Streams</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Eugene Alferov</string-name>
          <email>alferov.evgeniy@gmail.com</email>
          <email>alferov_jk@ksu.ks.ua</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vadim Ermolayev</string-name>
          <email>vadim@ermolayev.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of IT, Zaporozhye National University</institution>
          ,
          <addr-line>66 Zhukovskogo st., 69063, Zaporozhye</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Key terms. Data</institution>
          ,
          <addr-line>Process, Knowledge, Approach, Methodology</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Kherson State University</institution>
          ,
          <addr-line>27, 40 Rokiv Zhovnya ave., 73000</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>108</fpage>
      <lpage>116</lpage>
      <abstract>
        <p>This problem analysis paper presents our position on how a solution could be sought to the problem of extracting semantically rich fragments from a stream of plain text posts. We first present our understanding of the problem context and explain the focus of our research. Then, in the problem setting section, we elaborate the workflow for knowledge extraction from incoming information tokens. This workflow is then used as a key to structure our review of the literature on the relevant component techniques, which may be combined to achieve the desired outcome. Finally, we outline our plan for conducting experiments that aim to validate the workflow and to find a proper combination of component techniques, across all steps, that may solve our specific research problem.</p>
      </abstract>
      <kwd-group>
        <kwd>Workflow</kwd>
        <kwd>knowledge extraction</kwd>
        <kwd>text streams</kwd>
        <kwd>processing</kwd>
        <kwd>ontology learning</kwd>
        <kwd>component techniques</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The dramatic growth of data volumes we face today is accelerated by the rise of
social networking applications that allow non-specialist users to create huge amounts of
content easily and freely. Equipped with rapidly evolving mobile devices, a user is
becoming a nomadic gateway that boosts the generation of additional real-time sensor
data. The emerging Internet of Things turns each and every thing into a data or content source,
adding billions of additional artificial and autonomic sources of data to the overall
landscape. Smart spaces, where people, devices, and their infrastructures are all
loosely connected, also generate data in unprecedented volumes and at velocities
rarely observed before. Notably, the major part of the new data comes in streams.</p>
      <p>
        The expectation is that valuable information will be extracted from all these data to
help improve the quality of life and make our world a better place for humans.
Humans are, however, left bewildered about how to use, analyze, and understand all these
data while giving proper account to their dynamics. A recent estimate of the need for
data-savvy managers in the United States is 1.5 million [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This manpower is needed
to extract and use valuable information and knowledge for further decision making.
The critical steps in this work are (i) extracting information and knowledge; and (ii)
bringing the descriptions of the reflections of the world or domain into a refined state,
accounting for the changes brought in by new data, at scale.
      </p>
      <p>In this paper we focus on step (i), extraction. In Section 2 we present the
problem statement by giving basic definitions and our view on how a
processing workflow could look. A plethora of approaches, techniques, technologies,
and software tools already exists for solving different parts of the overall problem.
Hence in Section 3 we analyze the related work, using the workflow as
the key to structure this analysis. Finally, in Section 4 we conclude the paper and present our plans for
future proof-of-concept experimental work.</p>
    </sec>
    <sec id="sec-2">
      <title>Problem Statement</title>
      <p>
        An ontology is a complex artifact that comprises structural components of several types.
In what follows we use the structural notation of an ontology adopted in Description Logics [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]:
an ontology O comprises its schema S and the set of individuals I, that is,
O = (S, I). The ontology schema is also referred to as the terminological component
(TBox). It contains the statements describing the concepts of O, the properties of
those concepts, and the axioms over the schema constituents.
      </p>
      <p>
        If a finer-grained look at an ontology schema is taken, one may consider S as
comprising the following interrelated constituents: S = {S<sub>C</sub>, S<sub>O</sub>, S<sub>D</sub>, S<sub>A</sub>}, where S<sub>C</sub>
is the set of statements describing concepts, S<sub>O</sub> is the set of statements describing
object properties, S<sub>D</sub> is the set of statements describing datatype properties, and S<sub>A</sub>
is the set of axioms specifying constraints over S<sub>C</sub>, S<sub>O</sub>, and S<sub>D</sub> (cf. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). One may
notice that these constituents correspond to the types of the schema specification
statements of an ontology representation language L, which is used for specifying O.
      </p>
      <p>The set of individuals, also referred to as the assertional component (ABox), is the set
of ground statements about the individuals and their attribution to the constituents
of the schema.</p>
      <p>Ontology Learning is the process of extracting the above-mentioned constituents of
O from a text stream source. More specifically, the problem approached in
this research work is twofold:</p>
      <p>For every individual plain text document (further referred to as an information token)
arriving in the stream window DO:
(i) Extract an ontological fragment (further referred to as a knowledge token) specifying
the semantics of the information token.
(ii) Refine the ontology O by incorporating the changes brought in by the knowledge
token.</p>
      <p>The focus of this paper is the first part of the problem – the extraction of
knowledge tokens from information tokens of plain text in a particular professional domain
coming in a stream. The texts of ICTERI paper abstracts have been chosen as the
domain and source text corpus for our initial experiments – see also Section 4.</p>
      <p>
        As an ontology is a complex artifact, the extraction of knowledge tokens from texts
is also a complex process. It comprises several steps and, possibly, iterations for
extracting different structural constituents of S and I. These steps produce several
types of outputs in a particular sequence, sometimes referred to as the ontology
learning layer cake (cf. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]). Those outputs are, in order: terms; concepts and their instances;
datatype properties; taxonomic relationships and object properties; axioms. Based
on [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] we present in Fig. 1 a workflow putting together extraction steps, inputs,
outputs, and required component technology types.
      </p>
      <p>The overall workflow contains two consecutive phases: Text Pre-processing and
Ontology Extraction. The Text Pre-processing phase takes the information token as a plain
text input and produces its structured representation as a set of terms by applying
several statistical and linguistic techniques. All the tasks of the Ontology Extraction
phase use the output of Phase 1 as their input and incrementally build up the
knowledge token by adding different ABox and TBox constituents. For that, statistical,
linguistic, semantic, and logical techniques are employed in combinations. Fig. 1 lists all
relevant component techniques per task. No single implementation uses all of them.
Therefore our initial research objective is to find out which combination of
component techniques works best for our specific data, i.e. copes well with (a)
texts that are small but belong to a particular domain; and (b) limited processing
time constrained by a stream window lifetime parameter. Further, once this
constellation of component techniques is chosen, the objective will be to refine those that
do not provide results of satisfactory quality in our problem setting.</p>
    </sec>
    <sec>
      <title>Related Research and Available Component Techniques</title>
      <p>
        In this section we describe the component techniques outlined in Fig. 1 that
we found relevant to our work. Overall, these component techniques can be
categorized as linguistic, statistical, semantic, and logical (cf. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). As pictured in Fig. 1, they
can be applied at different steps and for different purposes. Though not explicitly
shown in Fig. 1, the steps may undergo iterations for refining their results. Therefore,
the workflow proposed in this paper can be considered hybrid and iterative.
      </p>
      <p>
        De-noising (statistical, linguistic). This is a method that extracts the de-noised text,
comprising the content-rich sentences, from full texts [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Processing of noisy text
becomes important because the quality of texts in the form of blogs, emails and chat
logs can be extremely poor. The sentences in dirty texts are typically full of spelling
errors, ad-hoc abbreviations and improper casing [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>Tokenization. Tokenization is the splitting of a text into a set of tokens, usually words.
This process is unsupervised and can be performed automatically by a parser program.</p>
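      <p>As a minimal sketch of this step, tokenization can be approximated with a regular expression. The function name, the lower-casing, and the decision to keep hyphenated words together are illustrative assumptions, not choices prescribed by the workflow.</p>
      <preformat>
```python
import re


def tokenize(text):
    """Split plain text into lower-case word tokens.

    Hyphenated words such as "co-occurrence" are kept as one token
    (an assumption made for this sketch).
    """
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


tokens = tokenize("Co-occurrence analysis identifies lexical units.")
print(tokens)
```
      </preformat>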
      <p>Part of speech detection/tagging (linguistic). Part of speech tagging (POST) is the
process of assigning one of the parts of speech to the given word. POST provides the
syntactic structures and dependency information required for further linguistic
analysis in order to uncover terms and relations. POST is a semi-supervised or even
unsupervised process.</p>
      <p>[Fig. 1: workflow diagram. Input: information token; output: knowledge token. Phase 1 (T1), Text Pre-processing: T1 Extract Terms. Phase 2 (T2 to T6), Ontology Extraction: T2 Form Concepts and Concept Instances; T3 Extract Datatype Properties; T4 Extract / Discover Concept Hierarchies; T6 Extract Axioms. Each task box lists the component techniques that apply to it.]</p>
      <p>Chunking (linguistic). Chunking is the unsupervised splitting of a text into syntactically
correlated parts. To achieve this, the word forms must be known, i.e. the part of speech of every word
in the text document has to be assigned. This process usually takes time and may
introduce errors.</p>
      <p>Sentence parsing. Sentence parsing identifies the syntactic structure of a
sentence, for example in the form of a parse tree.</p>
      <p>Syntactic structure analysis (linguistic). In syntactic structure analysis, words and
modifiers in syntactic structures (e.g., noun phrases, verb phrases, and prepositional
phrases) are analyzed to discover potential terms and relations. It can be done in
an unsupervised way.</p>
      <p>Relevance analysis (statistical). The extent of occurrence of terms in individual
documents and in text corpora is employed for relevance analysis. This is a
semi-supervised or even unsupervised technique.</p>
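      <p>One common way to combine occurrence within a document and across a corpus is TF-IDF weighting. The sketch below assumes this particular measure and a toy corpus invented for illustration; the workflow itself does not prescribe a specific relevance measure.</p>
      <preformat>
```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document (documents are lists of terms)."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Toy corpus of three tiny information tokens (invented for illustration).
corpus = [
    ["ontology", "learning", "from", "text"],
    ["text", "stream", "processing"],
    ["ontology", "refinement", "from", "text"],
]
weights = tf_idf(corpus)
```
      </preformat>
      <p>In this toy corpus a term that occurs in every document (such as "text") receives a weight of zero, while a term confined to one document scores highest there. With very small corpora the weights are coarse, which is consistent with the limitation noted for small texts.</p>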
      <p>
        Co-occurrence analysis (statistical). Co-occurrence analysis identifies lexical
units that tend to occur together for purposes ranging from extracting related terms to
discovering implicit relations between concepts [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This technique is unsupervised.
      </p>
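      <p>A minimal sketch of co-occurrence counting is given below. It assumes that the co-occurrence window is a single sentence and that sentences are already tokenized; both are illustrative assumptions, and the example sentences are invented.</p>
      <preformat>
```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count unordered term pairs that appear in the same sentence."""
    pairs = Counter()
    for sentence in sentences:
        # sorted(set(...)) makes each pair canonical and counted once per sentence.
        for pair in combinations(sorted(set(sentence)), 2):
            pairs[pair] += 1
    return pairs


sentences = [
    ["knowledge", "token", "extraction"],
    ["information", "token", "stream"],
    ["knowledge", "token", "refinement"],
]
counts = cooccurrence_counts(sentences)
```
      </preformat>
      <p>Pairs with high counts, here ("knowledge", "token"), are candidates for related terms; in practice the raw counts would usually be normalized by an association measure.</p>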
      <p>
        Clustering (statistical). Grouping together variants of terms to form concepts and
separating unrelated ones is known as term clustering. It is usually an unsupervised
technique. In this approach some measure of similarity is employed to assign terms to
groups for discovering concepts or constructing a hierarchy [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Some of the major
issues in clustering are working with high-dimensional data, and feature extraction and
preparation for similarity measurement. This gave rise to a class of featureless
similarity measures based solely on the co-occurrence of words in large text corpora. It is
known that clustering results are of acceptable quality only if a statistically
representative (i.e. large) text corpus is processed. This fact limits the applicability of this
technique in our setting (texts of small size). However, used in combination with
other techniques, clustering may yield a valuable addition to the result and thus
needs to be tried.
      </p>
      <p>
        Latent semantic analysis (statistical). Latent semantic analysis (LSA) is a
theoretical approach and mathematical method for determining the meaning similarity of
words and passages by analysis of large text corpora. The main idea is that the
aggregate of all the word contexts in which a given word does and does not appear provides
a set of mutual constraints that largely determines the similarity of meaning of words
and sets of words to each other [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. LSA can be useful in our investigation because it
is a fully automatic mathematical and statistical technique for extracting and inferring
meaningful relations from the contextual usage of words in text.
      </p>
      <p>
        Sub-categorization (linguistic, semantic). Sub-categorization, or extracting
sub-categorization frames, is an approach to extracting one type of lexical information of
particular importance for Natural Language Processing (NLP). Access to an accurate
and comprehensive sub-categorization lexicon is vital for the development of
successful parsing technology, which is important for many NLP tasks (e.g. automatic verb
classification) and useful for any application that can benefit from information about
predicate-argument structure (e.g. Information Extraction) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        Using semantic lexicon (linguistic, semantic). A semantic lexicon is a dictionary
or thesaurus of words/terms labeled with semantic classes (e.g., “ongoing effort” is an
Activity) so associations can be drawn between words that have not previously been
encountered [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Semantic lexicons are a popular resource in ontology learning and
play an important role in many NLP tasks.
      </p>
      <p>Dependency analysis (linguistic). Syntactic structure consists of lexical items
linked by dependencies. These are binary asymmetric relations that hold between a
head and its dependents. Dependency analysis examines dependency information to
uncover relations at the sentence level. In this analysis, grammatical relations, such as
subject, object, adjunct, and complement, are used for determining more complex
relations. Dependency analysis is usually an unsupervised approach.</p>
      <p>
        Association rule mining (statistical). Association rule mining aims to extract
correlations, frequent patterns, associations or causal structures among sets of items in
data repositories [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. It is an unsupervised component technique which works well
for considerably large data corpora. Association rules highlight correlations between
features in the texts, e.g. keywords. Association rules can be easily interpreted and are
understandable for an analyst or even for a non-expert user.
      </p>
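      <p>To illustrate how such rules are scored, the sketch below computes support and confidence for rules between single keywords. The restriction to one-item antecedents and consequents, the thresholds, and the keyword sets are simplifying assumptions made for this example; full algorithms such as Apriori handle larger itemsets.</p>
      <preformat>
```python
from itertools import permutations


def pair_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return rules (antecedent, consequent, support, confidence) over single items."""
    n = len(transactions)
    sets = [set(t) for t in transactions]
    items = sorted(set().union(*sets))
    rules = []
    for a, b in permutations(items, 2):
        count_a = sum(1 for s in sets if a in s)
        count_ab = sum(1 for s in sets if a in s and b in s)
        support = count_ab / n            # share of texts containing both items
        confidence = count_ab / count_a   # share of texts with a that also have b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


# Keyword sets of four hypothetical abstracts.
keywords = [
    ["ontology", "learning", "text"],
    ["ontology", "learning"],
    ["ontology", "stream"],
    ["stream", "text"],
]
rules = pair_rules(keywords)
```
      </preformat>
      <p>With these thresholds only the rule "learning implies ontology" survives, which shows why the technique needs a sufficiently large corpus: in a handful of short texts few item pairs reach a meaningful support.</p>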
      <p>
        Use of lexico-syntactic patterns (linguistic). Lexico-syntactic patterns (LSPs) are
generalized linguistic structures or schemas that indicate semantic relationships
among terms and can be applied to the identification of formalized concepts and
conceptual relations in natural language text [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Lexico-syntactic patterns are
suitable for automatic ontology building, since they model semantic relations. These
display exactly the kind of relation between their parts that makes them easily
translatable into an ontology representation.
      </p>
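      <p>A well-known family of such patterns are the "such as" constructions, which signal taxonomic (is-a) relations. The sketch below assumes one such pattern, a one-word hypernym, and a list of hyponyms that runs to the end of the clause; these are illustrative simplifications and not patterns taken from [13].</p>
      <preformat>
```python
import re

# "HYPERNYM such as A, B and C", with the list running to the end of the clause.
SUCH_AS = re.compile(r"(\w+) such as ([^.;:]+)")
# Separators inside the list: commas (optionally followed by and/or), or and/or alone.
SEPARATOR = re.compile(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")


def extract_is_a(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group(1)
        for hyponym in SEPARATOR.split(match.group(2).strip()):
            if hyponym:
                pairs.append((hyponym, hypernym))
    return pairs


pairs = extract_is_a(
    "Noisy text is common in sources such as blogs, emails and chat logs."
)
```
      </preformat>
      <p>Each extracted pair maps directly onto a candidate subclass or instance-of statement, which is what makes such patterns easy to translate into an ontology representation.</p>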
      <p>
        Use of semantic templates (semantic, linguistic). Semantic templates are similar
to lexico-syntactic patterns in terms of their purpose. However, semantic templates
offer more detailed rules and conditions for extracting not only taxonomic relations
but also complex non-taxonomic relations [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Logical inference (logical, semantic). In logical inference implicit relations are
derived from existing ones using rules such as transitivity and inheritance [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However,
invalid or conflicting relations may also be introduced if the inference rule set is
incomplete or underspecified, for example because the validity of transitivity or of
mutual disjointness axioms is not properly accounted for.
      </p>
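      <p>As an illustration of inference by transitivity, the sketch below derives implicit taxonomic pairs from explicit ones. The relation names are invented, and the code assumes that the relation really is transitive; applied to a non-transitive relation it would produce exactly the invalid relations mentioned above.</p>
      <preformat>
```python
def transitive_closure(relations):
    """Derive implicit (sub, super) pairs from explicit ones by transitivity."""
    closure = set(relations)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # (a is-a b) and (b is-a d) implies (a is-a d).
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Hypothetical explicit is-a statements.
explicit = {
    ("abstract", "paper part"),
    ("paper part", "document part"),
    ("document part", "text"),
}
inferred = transitive_closure(explicit) - explicit
```
      </preformat>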
      <p>
        Term subsumption (statistical, semantic). In the subsumption method, a given
term subsumes another term if the documents in which the latter term occurs are a
subset of the documents in which the given term occurs [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. A term subsumption
measure is used to quantify the extent to which a term x is more general than another
term y. This technique can be applied in a semi-supervised or an unsupervised way. The term
subsumption technique is easy to implement and it makes labeling concepts an easy task.
However, with this method it is difficult to classify terms that do not co-occur
frequently, and it requires a large data set to work reliably.
      </p>
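      <p>The subset condition above can be relaxed into a degree between 0 and 1: the share of the documents containing y that also contain x. The sketch below assumes this simple ratio as the subsumption measure and uses an invented toy corpus; it is one possible formulation, not necessarily the measure defined in [14].</p>
      <preformat>
```python
def document_sets(documents):
    """Map each term to the set of indices of the documents that contain it."""
    index = {}
    for doc_id, terms in enumerate(documents):
        for term in set(terms):
            index.setdefault(term, set()).add(doc_id)
    return index


def subsumption_degree(index, x, y):
    """Share of the documents containing y that also contain x (1.0: x subsumes y)."""
    docs_x, docs_y = index[x], index[y]
    return len(docs_x.intersection(docs_y)) / len(docs_y)


# Toy corpus of four tokenized documents (invented for illustration).
documents = [
    ["ontology", "learning"],
    ["ontology", "schema"],
    ["ontology", "learning", "text"],
    ["stream", "text"],
]
index = document_sets(documents)
general = subsumption_degree(index, "ontology", "learning")
reverse = subsumption_degree(index, "learning", "ontology")
```
      </preformat>
      <p>Here every document containing "learning" also contains "ontology", so "ontology" subsumes "learning", while the reverse degree stays below 1. With only a few short documents such degrees are unstable, which matches the data set size caveat above.</p>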
      <p>Use of axiom templates (semantic, linguistic). Axioms are useful for describing
the relationships between the concepts of an ontology. They can be written in
different ways depending on the relations that exist among the concepts.</p>
      <p>Inductive logic programming (logical, semantic). Inductive logic programming
(ILP) is a research area at the intersection of inductive machine learning and logic
programming. ILP generalizes the inductive and the deductive approaches by aiming
to develop theories, techniques and applications of inductive learning from
observations and background knowledge represented in a first-order logical framework.</p>
      <p>An overview of the applicability of the presented component techniques and their
interrelationships with respect to the tasks in our workflow is presented in Table 1.</p>
    </sec>
    <sec id="sec-3">
      <title>Summary and Future Work</title>
      <p>Our literature search has revealed that extracting knowledge, or more specifically
learning ontologies, from plain text corpora is a well-developed research field that
continues to produce new results. However, to the best of our knowledge,
extracting ontologies from text streams, with a constraint on the lifetime of an input
information token, is a recently emerged research problem. The reasons for adding this
specific problem to the research agenda are the phenomenon of Big Data, in particular
its velocity dimension, as well as the need for better, more reliable, semantically rich
solutions for automating Big Data analytics. One more complication introduced by
our problem setting is the small size of an individual information token, which makes it
hard to obtain good quality results with the majority of traditional statistical and linguistic
techniques for ontology extraction from text corpora.</p>
      <p>We argued in this paper that applying a combination of the relevant existing
component techniques in a structured and iterative way may nevertheless produce results
of good quality, built up as an incremental collection of ontology elements that
individual techniques contribute to a knowledge token at different stages of our proposed workflow.</p>
      <p>
        As this research is in an early phase, we do not yet have proof for this
hypothesis. However, there is a plan in place for conducting an initial series of
“proof-of-concept” experiments in which the component technologies will be exploited in a
semi-supervised or supervised fashion. For that we plan to use a small but semantically
well-annotated corpus of the abstracts (information tokens) and full texts of
ICTERI papers collected in the ICTERIWiki portal<sup>1</sup>. This document corpus is
incrementally extended by adding the papers and their semantic annotations for each new
ICTERI conference instance. The annotations are done using the ICTERI Scope
Ontology by Tatarintseva et al. [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. These annotations will be used as a “Gold
Standard” for evaluating the results of automated knowledge token extraction using the
workflow proposed in this paper.
      </p>
      <p>After the concept is proven and the constellation of the component techniques is
circumscribed, we plan to test the approach on one of the professional news portals.
Further, we plan to extend the proposed knowledge extraction procedure to
sensor stream data processing.</p>
      <p><sup>1</sup> http://isrg.kit.znu.edu.ua/icteriwiki/</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Manyika</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chui</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brown</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bughin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dobbs</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roxburgh</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hung Byers</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Big Data: the Next Frontier for Innovation, Competition, and Productivity</article-title>
          . McKinsey Global Institute (
          <year>2011</year>
          ), http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Nardi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brachman</surname>
            ,
            <given-names>R.J.</given-names>
          </string-name>
          :
          <article-title>An Introduction to Description Logics</article-title>
          . In:
          <string-name>
            <surname>Baader</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calvanese</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGuinness</surname>
            ,
            <given-names>D. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nardi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Patel-Schneider</surname>
            ,
            <given-names>P. F.</given-names>
          </string-name>
          (eds.)
          <source>The Description Logic Handbook</source>
          , Cambridge University Press, New York, NY, USA (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Davidovsky</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ermolayev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tolok</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Instance Migration between Ontologies Having Structural Differences</article-title>
          .
          <source>Int. J. on Artificial Intelligence Tools</source>
          , vol.
          <volume>20</volume>
          (
          <issue>6</issue>
          ), pp.
          <fpage>1127</fpage>
          -
          <lpage>1156</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Buitelaar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magnini</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Ontology Learning from Text: an Overview</article-title>
          . In:
          <string-name>
            <surname>Buitelaar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magnini</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (eds.)
          <source>Ontology Learning from Text: Methods, Evaluation and Applications</source>
          , IOS Press, Amsterdam (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bennamoun</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Ontology Learning from Text: a Look Back and into the Future</article-title>
          .
          <source>ACM Comput. Surv.</source>
          ,
          <volume>44</volume>
          (
          <issue>4</issue>
          ), Article 20, 36 pages. http://doi.acm.org/10.1145/2333112.2333115 (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Shams</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mercer</surname>
            ,
            <given-names>R. E.</given-names>
          </string-name>
          :
          <article-title>Investigating Keyphrase Indexing with Text Denoising</article-title>
          .
          <source>In: Proceedings of the 12th ACM/IEEE-CS Joint Conf. on Digital Libraries</source>
          , pp.
          <fpage>263</fpage>
          -
          <lpage>266</lpage>
          , ACM (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bennamoun</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Enhanced Integrated Scoring for Cleaning Dirty Texts</article-title>
          . arXiv preprint arXiv:0810.0332 (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hotho</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Staab</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Learning Concept Hierarchies from Text Corpora using Formal Concept Analysis</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          ,
          <volume>24</volume>
          (
          <issue>1</issue>
          ),
          <fpage>305</fpage>
          -
          <lpage>339</lpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Foltz</surname>
            ,
            <given-names>P.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Laham</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>An Introduction to Latent Semantic Analysis</article-title>
          .
          <source>Discourse Processes</source>
          ,
          <volume>25</volume>
          (
          <issue>2-3</issue>
          ),
          <fpage>259</fpage>
          -
          <lpage>284</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Preiss</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Briscoe</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Korhonen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A System for Large-Scale Acquisition of Verbal, Nominal and Adjectival Subcategorization Frames from Corpora</article-title>
          .
          <source>In: Annual Meeting, Association for Computational Linguistics</source>
          ,
          <volume>45</volume>
          (
          <issue>1</issue>
          ),
          <fpage>912</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Thelen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riloff</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>A Bootstrapping Method for Learning Semantic Lexicons using Extraction Pattern Contexts</article-title>
          .
          <source>In: Proc. ACL-02 Conf. on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</source>
          , vol.
          <volume>10</volume>
          , pp.
          <fpage>214</fpage>
          -
          <lpage>221</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kotsiantis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanellopoulos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Association Rules Mining: a Recent Overview</article-title>
          .
          <source>GESTS International Transactions on Computer Science and Engineering</source>
          ,
          <volume>32</volume>
          (
          <issue>1</issue>
          ),
          <fpage>71</fpage>
          -
          <lpage>82</lpage>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <article-title>Summary on Requirements on Lexico-Syntactic Patterns (Synthesis by PC)</article-title>
          , http://www.w3.org/community/ontolex/wiki/Specification_of_Requirements/LexicoSyntactic_Patterns
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>De Knijff</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frasincar</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hogenboom</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Domain Taxonomy Learning from Text: the Subsumption Method versus Hierarchical Clustering</article-title>
          .
          <source>Data &amp; Knowledge Engineering</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Tatarintseva</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Borue</surname>
            ,
            <given-names>Yu.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ermolayev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Validating OntoElect Methodology in Refining ICTERI Scope Ontology</article-title>
          . In:
          <string-name>
            <surname>Mayr</surname>
            ,
            <given-names>H. C.</given-names>
          </string-name>
          et al. (Eds.):
          <source>UNISCON 2012</source>
          , LNBIP 137, pp.
          <fpage>128</fpage>
          -
          <lpage>139</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>