<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Lviv, Ukraine, November</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Toward a Theoretical Framework of Terminological Saturation for Ontology Learning from Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Victoria Kosa</string-name>
          <email>victoriya1402.kosa@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Zaporizhzhia National University</institution>
          ,
          <addr-line>Zaporizhzhia</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>1</volume>
      <fpage>5</fpage>
      <lpage>16</lpage>
      <abstract>
        <p>In this position paper, we propose a detailed technical outline of what needs to be done, for example in a Master project, to bridge the research gap for the problem of the existence of terminological saturation. The problem is studied regarding a sequence of incrementally growing sub-collections of documents describing an arbitrary subject domain, using the OntoElect approach. After reviewing the related work, we present the formal basics of the approach and experimental evidence of the existence of terminological saturation. Consequently, we formulate the research hypotheses, and outline the methodology and plan for further research elaborating on this position.</p>
      </abstract>
      <kwd-group>
        <kwd>terminological saturation</kwd>
        <kwd>theoretical framework</kwd>
        <kwd>distance metric</kwd>
        <kwd>envelope function</kwd>
        <kwd>saturation conditions</kwd>
        <kwd>saturation existence theorem</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Extracting a set of terms from a document collection, describing a subject domain, is
an important initial step in figuring out a complete set of requirements for building an
ontology for the domain [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. The result of this step will only be of value if a source
collection of documents is sufficiently complete. Otherwise, important terms might
have been missed. A straightforward way to assemble a complete collection is to
retrieve all the available documents. Unfortunately, this is not realistic because the sources
vary in availability and are very numerous in real-world domains. A way to reduce
the size of the collection to be processed, while preserving the completeness of the term
set it contains, is to extract a terminologically saturated subset of documents, that is, a
sub-collection termed a terminological core.
      </p>
      <p>
        Within the OntoElect methodology [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we have proposed a domain-independent
technique for that, based on detecting terminological saturation in the sequence of
incrementally growing sub-collections. In our approach, the relevant [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] documents are
iteratively added to the sub-collection, terms are extracted from the previous and
current snapshots, and terminological difference [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] between the snapshots is measured.
Once the terminological difference falls below the individual term significance
threshold, the current sub-collection can be considered terminologically saturated
and regarded as a terminological core. Our prior work demonstrated experimentally
that, following this approach, it is possible to decrease substantially the quantity
of documents for term extraction and make the bags of significant terms more compact
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], while preserving the representativeness of a terminological core for the domain.
Several important aspects [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ] influencing the emergence of saturation have been
observed as well.
      </p>
      <p>
        The following research question has, however, not received proper attention
in our prior research: Is there a way to prove formally the existence of terminological
saturation for an arbitrary collection of textual documents? This question is important,
as terminology extraction from texts is computationally hard, even if optimized [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
Hence, it might be of value to know if terminological saturation is achievable, having
the set of documents at hand, before starting iterative computations using incrementally
growing textual datasets.
      </p>
      <p>In this position paper, we aim at blueprinting the formal framework to solve the
outlined problem. For that, the hypotheses about the conditions for the existence of
terminological saturation need to be formulated and subsequently proven. However, we leave the proof for a
separate research project leading to a degree in Computer Science. Therefore, this paper is
a proposal of a potential Master project.</p>
      <p>The remainder of the paper is structured as follows. In Sect. 2, we review the
existing related work on terminological saturation. In Sect. 3, we present our
background experimental evidence of the phenomenon of terminological saturation and
deliberate on relevant research questions. In Sect. 4, we present the structure of the formal
theoretical framework and sketch out some of its basic components, including the
hypothesis about the formal conditions for the existence of terminological saturation. In
Sect. 5, we present a vision of the plan of future research work towards developing
this theoretical framework. Finally, we make concluding remarks in Sect. 6.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        An ontology as an artifact [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], by definition, is “a formal, explicit specification of a
shared conceptualization” [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In Ontology Engineering, the mainstream interpretation
of this term for a domain ontology is that it is a formal descriptive theory of the subject
domain. A particular property of a domain ontology, which we focus on in this work,
is that it has to be a shared specification. A commonly accepted way to assess how well
an ontology is shared is the degree to which it supports the mental pictures,
interpretations of, or views on the subject domain held by the domain professionals, the
knowledge stakeholders. The more views, further termed requirements, are
supported by the ontology, the higher is its acceptance by the knowledge
stakeholders, and hence the better it is shared by them.
      </p>
      <p>
        To design a domain ontology that supports the requirements of the relevant
knowledge stakeholder community well, it is imperative to be sufficiently informed
about their views. This poses a challenge, as it is hard to elicit directly the
interpretations of the domain from the knowledge stakeholders in an explicit form [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Different
ontology engineering methodologies attack this challenge of requirements elicitation in
slightly different manners, based on organizing systematic interviews or brainstorming
sessions with the experts selected from the knowledge stakeholders ([
        <xref ref-type="bibr" rid="ref12 ref13 ref14 ref15 ref16">12, 13, 14, 15,
16</xref>
        ] to mention the few most frequently cited). However, there is always a risk, along
this way, that the selected group and their requirements under-represent the sentiment
of the specialist community. Furthermore, there is no guidance in Ontology
Engineering literature on how to measure objectively the representativeness of the expert group
and, therefore, the completeness of the requirements elicited from them. In fact, as the
experts are expensive, the tradeoff between the completeness and the price is made in
favor of lowering the price.
      </p>
      <p>
        To overcome the abovementioned difficulty in ensuring representativeness, it has
been proposed [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] to learn ontologies, or the requirements for ontology development
(also termed as features), not from the group of experts, but from the artifacts developed
by the knowledge stakeholders in the domain. The pragmatic reason was that extracting
features from the artifacts is less laborious and could be automated, at least in part.
Furthermore, collecting a representative sample of artifacts is more feasible than a
representative group of experts.
      </p>
      <p>
        One of the relevant types of these artifacts is professional textual documents.
Ontology learning from texts is now a noticeable subfield in ontology learning with
developed methodologies and processing pipelines [
        <xref ref-type="bibr" rid="ref1 ref18">18, 1</xref>
        ]. To learn the requirements for
engineering a domain ontology, a representative set of textual documents needs to be
collected. This document collection has to:
• Contain relevant texts of sufficiently good quality
• Be representative (sufficiently complete) in order to reflect community consensus
      </p>
      <p>
        While the literature offers a number of approaches to selecting relevant documents,
the problem of checking whether a document collection is sufficiently complete has not been
adequately resolved. One reason for under-estimating the importance of ensuring the
representativeness of a text collection is that text resources are abundant. Hence, one
may always expect to be able to have enough if the domain of her interest is well
circumscribed. For example, it is feasible to collect high-quality research papers on a
particular topic or within a field, as we did for Knowledge Management (KM) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This
KM collection contains circa 9 000 journal articles in full text, which might be
considered a sufficient volume according to the recommendations of linguistic corpora experts,
e.g. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. However, even if one collects what she thinks enough, there has to be a way
to measure the representativeness of this text corpus regarding ontology learning. It
might also be worth knowing if continuing collecting more documents will finally
result in a representative corpus.
      </p>
      <p>Furthermore, it might happen that only a small part of a big document collection is
sufficiently complete in terms of domain knowledge coverage. Therefore, having a
method to find this terminological core within the entire collection would help
substantially decrease the effort needed for ontology learning from these texts.</p>
      <p>
        The only relevant Computer Science publication we found in the context of ensuring
the completeness of the set of processed documents is [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] by Ferrari et al. This work
proposes two completeness metrics that take into account the relevant terms and
relationships among terms extracted from software system requirements specifications
written in natural language. This approach helps assess whether the set of specified
requirements is complete with respect to the available document or a small set of
documents. However, it does not allow finding out whether the set of documents used represents
the sentiment of the specialist community sufficiently fully. Despite that, the approach
of Ferrari et al. resembles our work, as both are based on terminology extraction.
      </p>
      <p>
        The saturation phenomenon has received little attention in the Computer
Science literature, in particular in Text Mining and Ontology Learning. Saturation of
clauses was used in Theorem Proving [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Term saturation was also used in document
clustering and query answering for building term proximity graphs [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. One more
example, related to clustering, is the use of hierarchical cluster analysis for building topic
taxonomy for a properly sampled subset of documents [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ]. Saturation measure
(together with ceiling) is used for patent text clustering and topic classification [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
      </p>
      <p>
        The only broadly exploited analogy to terminological saturation that is directly
relevant for our purposes is theoretical (or data) saturation, which we found in the
qualitative research methodologies for Sociology and Medical Sciences [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. Qualitative
research is applied in different domains for processing the interviews with the subjects
and with an aim to build a (descriptive) theory that supports a research hypothesis in
the given (social) context. The problem faced by qualitative analysts was that the
interviews were expensive. So, an indicator was desired of a representative
sample size of subjects, i.e. a sample that covers well the potential replies of the other subjects who
were not interviewed.
      </p>
      <p>
        In Qualitative Research, the phenomenon of data saturation finds its origin in the
Grounded Theory method by Glaser and Strauss [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] for conducting interviews and
processing the data collected in these interviews. They explained their method as “the
discovery of theory from data systematically obtained from social research” [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. In
their proposal, a theory becomes grounded exactly due to its systematic discovery – i.e.
every statement in the theory has to be supported by the data.
      </p>
      <p>
        Notably, a mainstream ontology engineering methodology could be described very similarly,
as the discovery of a descriptive domain theory based on the data obtained from
qualitative research, cf. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. By analogy, an ontology could be regarded as grounded
in evidence (data) if data saturation has been detected. For ontology learning from texts,
a set of terms used in a domain could be regarded as such evidence. Hence, if the set
of terms becomes saturated, an ontology devisable from these terms (as the features
pointing to the requirements) could be regarded as a grounded descriptive theory of the
domain. Consequently, the subset of texts, from which the saturated set of terms has
been extracted, could be regarded as the terminological core corpus for this domain.
      </p>
      <p>
        Numerous attempts to operationalize the detection of data saturation have been
mentioned in the Qualitative Analysis literature. However, it is still a “mysterious step” [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]
in the Grounded Theory method. Sociologists identify the factors that pertain to or
hinder data saturation and offer several methodological hints, informally. However, an
objective and proven formal measure for detecting data saturation is still not available in
the literature. To the best of our knowledge, the only reference related to terminological
saturation in textual data in the context of term extraction is our previous work [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Background and Experimental Evidence</title>
      <p>We now briefly present our background knowledge in the context of detecting
terminological saturation in document collections. This background has been developed in
the OntoElect project<sup>1</sup>. We first outline the formal basics of the technique proposed for
detecting terminological saturation in Sect. 3.1. In Sect. 3.2, we then summarize our
experimental evidence of the validity of this technique and emphasize the problem. This
problem represents the research gap, which needs to be further studied and narrowed.</p>
      <sec id="sec-3-1">
        <title>Formal Background in Detecting Terminological Saturation</title>
        <p>
          In OntoElect [
          <xref ref-type="bibr" rid="ref2 ref4">4, 2</xref>
          ], we seek a set of terms that statistically fully describes an arbitrary subject domain (<italic>D</italic>), for which a domain ontology needs to be developed or refined. Our supposition is that, if we have a sufficiently bounded <italic>D</italic>, the set of terms used to describe it is finite and not very large. These terms could be extracted from the documents (Doc) belonging to a document collection (DC = {Doc}) that describes <italic>D</italic>. Hypothetically, one may collect all the existing documents that describe <italic>D</italic>, i.e. a complete document collection.
        </p>
        <p>Definition 1: A Complete Document Collection for <italic>D</italic>. A DC containing all the documents describing <italic>D</italic> is a Complete DC (CDC).</p>
        <p>For any realistic <italic>D</italic>, its CDC may be very big in volume. Therefore, extracting a representative set of terms from it would be a tedious and resource-consuming task. This is why we look at the document sub-collections of a CDC.</p>
        <p>Definition 2: A Document Sub-collection. A document sub-collection (SDC) for <italic>D</italic> is a subset of the CDC:
          <disp-formula><tex-math>\mathit{SDC} \subseteq \mathit{CDC}.</tex-math></disp-formula>
        </p>
        <p>We are interested in finding an SDC that contains statistically the same set of terms as the CDC.</p>
        <p>Definition 3: A Terminological Basis. The finite set of terms <italic>t<sub>i</sub></italic>, identifying all the features that characterise <italic>D</italic> as reflected in the documents of the CDC, forms the terminological basis TB = {<italic>t<sub>i</sub></italic>}, <italic>i</italic> = 1, … , <italic>n</italic>.</p>
        <p>
          These <italic>t<sub>i</sub></italic> might not be equally important for describing the domain. Let a real positive value (score) be associated with every term used in the documents of the CDC. The more significant the term is for describing the domain, the higher is the score. Hence, a vector space model (VSM) [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] might be an appropriate formal representation of a document space for <italic>D</italic>. In this model, any document or SDC, including the CDC, is a point in a vector space with the basis TB = {<italic>t<sub>i</sub></italic>} having dimension <italic>n</italic>. It is not easy, however, to build a TB for any realistic domain. We have to have a technique to: extract the terms <italic>t</italic> from the documents describing <italic>D</italic>; make sure that an extracted <italic>t</italic> ∈ TB; and make sure that all significant <italic>t</italic> have been extracted.
        </p>
        <p>Extracted Terms. Let there be a mapping that transforms an SDC into a bag of terms (<italic>B</italic>) extracted from the documents of this SDC. Every element of <italic>B</italic> is a pair &lt;<italic>t</italic>, sc&gt;, where <italic>t</italic> is a candidate string and sc is the score, i.e. the estimate of the likelihood that <italic>t</italic> is a relevant term for <italic>D</italic>: the higher the sc, the more likely <italic>t</italic> ∈ TB. The term candidates having high scores are denoted as significant terms.</p>
        <p>Retained Significant Terms. Let a term significance threshold (eps) be rationally chosen (or estimated) for the scores of individual terms in a <italic>B</italic>. It indicates a boundary above which the terms are regarded as significant and hence belong to the TB. After having built a <italic>B</italic>, let us retain the terms with the score &gt; eps in the corresponding Bag of Significant Terms (<italic>T</italic>). In many automated term extraction (ATE) methods, eps is either chosen empirically or selected based on common sense considerations. In OntoElect, we offer a rationale for the estimation of an eps.</p>
        <p>A Simple Majority Vote on Terms. A subset of term candidates is considered the core, containing all significant terms, if the terms in this core reflect the sentiment of a simple majority of the knowledge stakeholders in the domain. Consequently, the score of a term might be interpreted as the sum of the votes for this term by the domain knowledge stakeholders.</p>
        <p>Definition 4: An Individual Term Significance Threshold. Let <italic>B</italic> be sorted in the descending order of the scores. Then the eps threshold for <italic>B</italic> is computed as follows:
          <disp-formula><label>(1)</label><tex-math>\mathit{eps} = \mathit{sc}_k : \sum_{m=1}^{k} \mathit{sc}_m &gt; \frac{1}{2} \sum_{m=1}^{\|B\|} \mathit{sc}_m ,</tex-math></disp-formula>
          where <italic>k</italic> is the minimal number such that the condition after the colon holds.
        </p>
        <p>Definition 4 specifies that, for computing the eps for a <italic>B</italic>, the minimal subset of terms at the top of the ordered list is found, whose voters are the simple majority.</p>
        <p>Definition 5: A Bag of Retained Significant Terms. Let <italic>T</italic> ⊂ <italic>B</italic> such that the following condition holds:
          <disp-formula><label>(2)</label><tex-math>\forall \langle t, \mathit{sc} \rangle \in T:\ \mathit{sc} &gt; \mathit{eps}.</tex-math></disp-formula>
          Then <italic>T</italic> denotes the bag of retained significant terms.
        </p>
        <p>Successive Approximation for building a TB. One (superficial) way for building a TB might be to extract and retain all the significant terms from the CDC. The terms in the bag of significant terms <italic>T</italic><sub>CDC</sub> will constitute the TB. This is, however, hardly tractable for any realistic <italic>D</italic> for at least two reasons: (i) one has to ensure that the collection of documents s/he possesses is a CDC, which is hard; and (ii) processing a DC that qualifies to be a CDC for a realistic <italic>D</italic> is computationally very expensive because of its volume. One feasible way might be to use a statistically representative SDC instead of the CDC. Then, the relevant and valid terms extracted from this SDC will form a basis that is very similar to TB, with statistically negligible difference. Hence, we need a way to figure out if an SDC is statistically representative. In this regard, a successive approximation technique is plausible.</p>
        <p>Terminological Saturation. Let: SDC<sub>1</sub>, SDC<sub>2</sub>, … , SDC<sub><italic>i</italic></sub>, … be the sequence of document sub-collections such that SDC<sub>1</sub> ⊂ SDC<sub>2</sub> ⊂ ⋯ ⊂ SDC<sub><italic>i</italic></sub> ⊂ ⋯ ⊂ CDC; <italic>B</italic><sub>1</sub>, <italic>B</italic><sub>2</sub>, … , <italic>B<sub>i</sub></italic>, … be the bags of terms extracted from SDC<sub>1</sub>, SDC<sub>2</sub>, … , SDC<sub><italic>i</italic></sub>, … ; and <italic>T</italic><sub>1</sub>, <italic>T</italic><sub>2</sub>, … , <italic>T<sub>i</sub></italic>, <italic>T</italic><sub><italic>i</italic>+1</sub>, … be the bags of retained significant terms. Let thd be a function comparing the bags of significant terms <italic>T<sub>i</sub></italic>, <italic>T</italic><sub><italic>i</italic>+1</sub> retained from the successive SDC<sub><italic>i</italic></sub> and returning the difference as a real positive value. If, at some <italic>i</italic>: (i) thd goes below the threshold of the statistical error ε; and (ii) there is convincing evidence that it will never go above this threshold; then SDC<sub><italic>i</italic></sub> is a saturated SDC for <italic>D</italic>. Such an SDC<sub><italic>i</italic></sub> is labelled further as SDC<sub>sat</sub>; <italic>T</italic><sub>sat</sub> is a saturated term set; <italic>B</italic><sub>sat</sub> is the bag of terms from which <italic>T</italic><sub>sat</sub> is retained; and the difference (distance) between <italic>T</italic><sub>sat</sub> and <italic>T</italic><sub>CDC</sub> is not higher than ε. The set of terms in such a <italic>T</italic><sub>sat</sub> could be used as an ε-approximation of TB. The difference (thd) between <italic>T</italic><sub>sat</sub> and any successive <italic>T<sub>j</sub></italic>, including <italic>T</italic><sub>CDC</sub>, is within the statistical error: thd(<italic>T</italic><sub>sat</sub>, <italic>T<sub>j</sub></italic>) &lt; ε.</p>
        <p>Terminological Saturation Threshold. The premise for a rational choice of the threshold (ε) for detecting terminological saturation is that a set of terms becomes saturated if it already contains all the terms from TB. Hence, whatever terms are added in the subsequent SDCs, these are not significant. Let us set ε = eps<sub>sat</sub>, the eps (1) computed for <italic>B</italic><sub>sat</sub>. Then <italic>T</italic><sub>sat</sub> (2) will contain statistically the same set of terms as TB.</p>
        <p>Terminological Difference Measure. Let us now define the function (thd) for measuring the distance between term sets. It takes in a pair of term sets and maps these arguments into a real positive value of the distance between them:
          <disp-formula><label>(3)</label><tex-math>\mathit{thd}: \{\langle T_i, T_j \rangle\} \longrightarrow \Re^{+}.</tex-math></disp-formula>
        </p>
        <p>Due to the incremental nature of the sub-collections SDC<sub><italic>i</italic></sub>, not only the number of extracted terms will grow in <italic>B</italic><sub><italic>i</italic>+1</sub> compared to <italic>B<sub>i</sub></italic>, but also the absolute values of the scores of the terms. Therefore, for making the scores comparable in the pairs, normalized score values have to be used. Let sc<sub>max</sub> be the score of the term, in a bag of terms <italic>B</italic>, having the maximal value. In a <italic>B</italic> sorted in the descending order of the scores, sc<sub>max</sub> is the score of the first element. Then a normalized score (nsc) could be computed as nsc = sc / sc<sub>max</sub>. To measure thd we need to: (i) extract <italic>B<sub>i</sub></italic> from SDC<sub><italic>i</italic></sub> and <italic>B<sub>j</sub></italic> from SDC<sub><italic>j</italic></sub>; (ii) compute eps<sub><italic>i</italic></sub> for <italic>B<sub>i</sub></italic> and eps<sub><italic>j</italic></sub> for <italic>B<sub>j</sub></italic> using (1); (iii) retain significant terms in <italic>T<sub>i</sub></italic> and <italic>T<sub>j</sub></italic> using (2); (iv) compute nsc for <italic>T<sub>i</sub></italic> and <italic>T<sub>j</sub></italic>; (v) compute thd(<italic>T<sub>i</sub></italic>, <italic>T<sub>j</sub></italic>) based on (i)-(iv).</p>
        <p><sup>1</sup> https://www.researchgate.net/project/OntoElect-a-Methodology-for-Domain-Ontology-Refinement</p>
      </sec>
      <sec id="sec-3-2">
        <title>Experimental Evidence of Terminological Saturation</title>
        <p>Experiments show that terminological saturation, measured using thd, exists in the collections, having appropriate volume, composed of carefully selected documents describing the same <italic>D</italic> (Fig. 1(a)). On the other hand, if the documents on arbitrary topics (different <italic>D</italic>s) are randomly taken into SDC<sub><italic>i</italic></sub>, then terminological saturation is not observed (Fig. 1(b)) and appears to be not reachable. Furthermore, in some collections thd measurements result in an unclear picture, as shown in Fig. 1(c). In the latter case, thd values are volatile and oscillate quite sharply around the curve of eps and hence cannot reliably indicate if terminological saturation is reachable. Fig. 1(c) shows that there is often a chance that a “might be” saturation observed in several measurement steps is further disproved by an additional measurement. In such cases, the collection might be not representative and, therefore, more documents have to be added to it.</p>
        <p>
          One more reason for high thd volatility might be that the collection is too noisy,
as it appeared to be in the case of DAC<sup>2</sup> [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] (Fig. 1(c)).
        </p>
        <p>Fig. 1. Panels: (a) Saturation in DMKD-300<sup>3</sup>; (b) Absence of saturation in RAW<sup>4</sup>; (c) Saturation is not clear in DAC. Legend: individual term significance threshold eps; terminological difference thd(<italic>T<sub>i</sub></italic>, <italic>T</italic><sub><italic>i</italic>+1</sub>).</p>
        <p>Hence, the open problem that needs to be further researched is finding the sufficient conditions for terminological saturation to exist after a necessary condition, its indication, has been observed in thd measurements.</p>
        <p>Our position is that the problem needs to be solved formally and the theoretical
framework for the solution has to be elaborated as a structured set of proven formal
statements.</p>
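        <p>The measurement steps (ii)-(v) of Sect. 3.1 can be illustrated by the following minimal Python sketch. It is not the OntoElect implementation. The representation of a bag as a dictionary from term to score, the function names, and the toy scores are hypothetical, and thd is computed here as a Manhattan distance in which a term absent from one bag counts with score 0, which is an assumption in the spirit of Hypothesis 1 rather than a confirmed formula.</p>
        <preformat>
```python
from typing import Dict

Bag = Dict[str, float]  # term string -> score (hypothetical representation)


def eps_threshold(bag: Bag) -> float:
    """Equation (1): score of the k-th term, with k minimal such that the
    top-k scores sum to more than half of the total score."""
    scores = sorted(bag.values(), reverse=True)
    half_total = sum(scores) / 2.0
    running = 0.0
    for score in scores:
        running += score
        if running > half_total:
            return score
    raise ValueError("bag must contain at least one positive score")


def retain_significant(bag: Bag, eps: float) -> Bag:
    """Condition (2): keep only the terms whose score is strictly above eps."""
    return {term: score for term, score in bag.items() if score > eps}


def normalize(bag: Bag) -> Bag:
    """Divide every score by the maximal score found in the bag."""
    top = max(bag.values())
    return {term: score / top for term, score in bag.items()}


def thd(t_i: Bag, t_j: Bag) -> float:
    """Manhattan distance over the union of terms.
    Assumption: a term missing from one bag contributes score 0."""
    terms = set(t_i) | set(t_j)
    return sum(abs(t_i.get(term, 0.0) - t_j.get(term, 0.0)) for term in terms)


# Toy scores, made up for illustration only.
b_i = {"ontology": 6.0, "term": 5.0, "saturation": 4.0, "corpus": 3.0,
       "domain": 2.0, "text": 2.0, "score": 1.0, "bag": 1.0}
b_j = {"ontology": 9.0, "term": 8.0, "saturation": 7.0, "corpus": 3.0,
       "domain": 2.0, "text": 2.0, "score": 1.0, "bag": 1.0, "vector": 1.0}

t_i = normalize(retain_significant(b_i, eps_threshold(b_i)))
t_j = normalize(retain_significant(b_j, eps_threshold(b_j)))
print(thd(t_i, t_j))
```
        </preformat>
        <p>In this sketch the scores are normalized after retention, following the order of steps (iii) and (iv). Because condition (2) is strict, the term whose score equals eps is itself not retained.</p>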
      </sec>
    </sec>
    <sec id="sec-4">
      <title>The Structure of the Formal Framework</title>
      <p>
        Based on the OntoElect method and the experimental evidence of its validity, presented
in Sect. 3, we now outline the open research questions and put forward the hypotheses
that need to be proven. The hypotheses are formulated in a way to resolve the issues
mentioned in the context of the volatility of thd measurements and, therefore,
instability in terminological saturation. These statements form the logic of the sought
theoretical framework.
      </p>
      <p>
        The thd function is the mapping of the vectors <italic>T<sub>i</sub></italic>, <italic>T<sub>j</sub></italic>, representing the respective partial
collections, into ℜ<sup>+</sup>. Several distance functions are known from the literature in this
context, cf. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>We put forward the following research hypotheses about this function.</p>
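      <p>Hypothesis 1: thd is the Manhattan distance function.</p>
      <p>
        The Manhattan (often also called taxicab) distance [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] is:
        <disp-formula><label>(4)</label><tex-math>d(\mathbf{p}, \mathbf{q}) = \sum_{k=1}^{n} |p_k - q_k| ,</tex-math></disp-formula>
        where <bold>p</bold> = (<italic>p</italic><sub>1</sub>, <italic>p</italic><sub>2</sub>, … , <italic>p<sub>n</sub></italic>) and <bold>q</bold> = (<italic>q</italic><sub>1</sub>, <italic>q</italic><sub>2</sub>, … , <italic>q<sub>n</sub></italic>) are the vectors in an <italic>n</italic>-dimensional ℜ<sup>+</sup> vector space with a fixed Cartesian coordinate system. Hence, it has to be proven that (4) is the formula for computing thd.
      </p>
      <p>Hypothesis 2: Let <italic>V</italic> be a vector space formed of the VSM representations of all possible document sub-collections of a CDC. Then thd(<italic>T<sub>i</sub></italic>, <italic>T<sub>j</sub></italic>) is a metric function in <italic>V</italic>, and <italic>V</italic> is a metric space.</p>
      <p>For proving this statement, the metric conditions for thd(<italic>T<sub>i</sub></italic>, <italic>T<sub>j</sub></italic>) need to be checked: (i) non-negativity; (ii) triangle inequality; (iii) symmetry; and (iv) identity of indiscernibles. Further, it has to be proven that <italic>V</italic> is a metric space with thd(<italic>T<sub>i</sub></italic>, <italic>T<sub>j</sub></italic>) as its distance metric.</p>
      <p><sup>2</sup> DAC is the collection containing 506 full text papers published in the proceedings of the
Design Automation Conference between 2004 and 2010.</p>
      <p><sup>3</sup> DMKD-300 is the collection containing 300 full text articles published in the Journal of Data
Mining and Knowledge Discovery between 1997 and 2010.</p>
      <p><sup>4</sup> The RAW collection was synthetically formed of 80 randomly selected articles from English Wikipedia
such that no two of them were about a similar topic and the size of an article was not too small.</p>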
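      <p>Before attempting the proof of Hypothesis 2, the four metric conditions can be spot-checked numerically on sample vectors. The sketch below is only such a sanity check, not a proof. The helper names, the tolerance, and the random sample are hypothetical, and the distance tested is the Manhattan distance of Hypothesis 1.</p>
      <preformat>
```python
import itertools
import random


def manhattan(p, q):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(p, q))


def check_metric_axioms(points, dist, tol=1e-9):
    """Return the sorted names of the metric conditions violated on the sample."""
    violated = set()
    for p, q in itertools.product(points, repeat=2):
        d = dist(p, q)
        if 0 > d:
            violated.add("non-negativity")
        if abs(d - dist(q, p)) > tol:
            violated.add("symmetry")
        if (d == 0) != (p == q):
            violated.add("identity of indiscernibles")
    for p, q, r in itertools.product(points, repeat=3):
        if dist(p, r) > dist(p, q) + dist(q, r) + tol:
            violated.add("triangle inequality")
    return sorted(violated)


# Random non-negative vectors standing in for normalized term-score vectors.
random.seed(0)
sample = [tuple(random.random() for _ in range(5)) for _ in range(12)]
print(check_metric_axioms(sample, manhattan))
```
      </preformat>
      <p>An empty result only means that no violation was found on the sample. A function that is not a metric, for example the squared Euclidean distance, is reported as violating the triangle inequality on the points (0), (1), (2).</p>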
    </sec>
    <sec id="sec-5">
      <title>Envelope Functions for   and</title>
      <p>approximation,  
decreasing function.
when ℎ 
For terminological saturation measurement, we are interested in computing ℎ  not for
arbitrary   ,   , but for successive pairs   ,   +1,  = 1, 2, … . Hence, the function
( ) = ℎ  (  ,   +1) is of our practical interest. In particular, we are interested
( ) goes below</p>
      <p>. To find this out, we have to analyse:
• The values of ℎ</p>
      <p>( ),  = 1, 2, … : {  ,   +1} → ℜ+
• The values of individual term significance thresholds    , used for retaining terms
in   , which could be regarded as a function   ( ) =    ,  = 1, 2, …:   → ℜ+</p>
      <sec id="sec-5-1">
        <title>As it is revealed in our experiments (Sect. 3.2),   ( ) is not necessarily a monoton</title>
        <p>ically non-decreasing function. However, in general it might be possible to build its
( ), as a lower envelope function, that is a monotonically
nonenvelope function, that is a monotonically non-increasing function.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Analogously to</title>
        <p>( ), ℎ</p>
        <p>( ) is not necessarily a monotonically non-increasing
function. It might also be possible to build its approximation, ℎ    ( ) as an upper</p>
        <p>Hypothesis 3: Terminological saturation exists if there exist: (i) a monotonically
non-decreasing
function  
ℎ    ( ); the intersection of these functions at some  .
monotonically
non-increasing function
4.3</p>
        <sec id="sec-5-2-1">
          <title>The Existence of Terminological Saturation</title>
          <p>Hypothesis 3 may be formulated as an existence theorem for terminological saturation
in a sequence of incrementally growing document sub-collections as follows.</p>
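          <p>Theorem 1 (sufficient conditions of terminological saturation). Let: (i) <inline-formula><tex-math>D_1 \subset D_2 \subset \dots \subset D_i \subset \dots</tex-math></inline-formula> be the sequence of document sub-collections, each describing an
arbitrary domain; (ii) <inline-formula><tex-math>B_1, B_2, \dots, B_i, \dots</tex-math></inline-formula> be the sequence of the bags of terms
extracted from <inline-formula><tex-math>D_1, D_2, \dots, D_i, \dots</tex-math></inline-formula>, and <inline-formula><tex-math>\mathit{eps}(i)</tex-math></inline-formula> be the function of individual term
significance thresholds for <inline-formula><tex-math>B_i</tex-math></inline-formula>, <inline-formula><tex-math>i = 1, 2, \dots</tex-math></inline-formula>; (iii) <inline-formula><tex-math>T_1, T_2, \dots, T_i, T_{i+1}, \dots</tex-math></inline-formula> be the sequence of the
bags of retained significant terms for which pairwise successive terminological
difference is computed using the <inline-formula><tex-math>\mathit{thd}(i)</tex-math></inline-formula> function.</p>
          <p>Then the sequence of <inline-formula><tex-math>D_i</tex-math></inline-formula> is terminologically saturated, <inline-formula><tex-math>i</tex-math></inline-formula> is the saturation point, and … = … = …, if:
(i) there exists a non-decreasing lower envelope function <inline-formula><tex-math>\mathit{eps}_{env}(i)</tex-math></inline-formula> for <inline-formula><tex-math>\mathit{eps}(i)</tex-math></inline-formula>;
(ii) there exists a non-increasing upper envelope function <inline-formula><tex-math>\mathit{thd}_{env}(i)</tex-math></inline-formula> for <inline-formula><tex-math>\mathit{thd}(i)</tex-math></inline-formula>; and
(iii) <inline-formula><tex-math>\mathit{eps}_{env}(i) \geq \mathit{thd}_{env}(i)</tex-math></inline-formula>.</p>
          <p>
            The hypotheses above have to be verified in the experiments using the instruments and datasets available in
the OntoElect Project [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] in the frame of the optimized processing pipeline [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. In
particular, the datasets generated from the DMKD-300 document collection will be used,
as our prior experiments demonstrated quick and stable terminological saturation in this
collection. It is planned that the experimental part will be organized as follows:
          </p>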
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>Planned Experiments</title>
        <p>(i) Envelope functions for <inline-formula><tex-math>\mathit{eps}(i)</tex-math></inline-formula> and <inline-formula><tex-math>\mathit{thd}(i)</tex-math></inline-formula> will be predicted based on the first
30 percent of measurements;
(ii) terminological saturation will be predicted based on these envelope functions
and the proven existence theorem (Theorem 1);
(iii) terminological saturation will be checked using the remaining 70 percent of
measurements.</p>
        <p>Further, the same experiments will be done using the datasets of the RAW and DAC
collections to verify whether the prediction of terminological saturation works reliably in
complex conditions and across domains.</p>
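        <p>The three planned steps might be sketched in Python as follows, under strong simplifying assumptions that are not prescribed by the plan itself: the envelopes are taken as suffix minima and maxima, each envelope of the first 30 percent of measurements is extrapolated with a least-squares straight line, indices are zero-based, and the measurements are synthetic. All function names are hypothetical. The predicted saturation index is the first index at which the extrapolated lower envelope reaches the extrapolated upper envelope; it is then compared with the index observed on the full series, which includes the remaining 70 percent of measurements.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from typing import List, Optional, Sequence, Tuple


def lower_envelope(values: Sequence[float]) -> List[float]:
    """Non-decreasing envelope on or below `values` (suffix minimum)."""
    out, running = [], float("inf")
    for v in reversed(values):
        running = min(running, v)
        out.append(running)
    return out[::-1]


def upper_envelope(values: Sequence[float]) -> List[float]:
    """Non-increasing envelope on or above `values` (suffix maximum)."""
    out, running = [], float("-inf")
    for v in reversed(values):
        running = max(running, v)
        out.append(running)
    return out[::-1]


def fit_line(ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares line y = a + b*x through (0, ys[0]), (1, ys[1]), ...; returns (a, b)."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_saturation(eps: Sequence[float], thd: Sequence[float],
                       percent: int = 30) -> Optional[int]:
    """Steps (i)-(ii): fit both envelopes on the first `percent` of the
    measurements and return the first index where the fitted lines meet."""
    k = max(2, len(eps) * percent // 100)
    a_e, b_e = fit_line(lower_envelope(eps[:k]))
    a_t, b_t = fit_line(upper_envelope(thd[:k]))
    if a_e >= a_t:   # already saturated at the first measurement
        return 0
    if b_e <= b_t:   # the fitted lines never meet going forward
        return None
    return math.ceil((a_t - a_e) / (b_e - b_t))


def observed_saturation(eps: Sequence[float], thd: Sequence[float]) -> Optional[int]:
    """Step (iii): first index where the eps envelope reaches the thd envelope."""
    pairs = zip(lower_envelope(eps), upper_envelope(thd))
    return next((i for i, (e, t) in enumerate(pairs) if e >= t), None)


# Synthetic, exactly linear measurements (illustration only, not experimental data).
n = 20
eps = [1.0 + 0.5 * i for i in range(n)]
thd = [max(20.0 - 1.5 * i, 0.5) for i in range(n)]

predicted = predict_saturation(eps, thd)  # uses measurements 0..5 only
observed = observed_saturation(eps, thd)  # uses all 20 measurements
print(predicted, observed)                # 10 10
```
]]></preformat>
        <p>A straight line is used here only because it keeps the sketch short; any other family of monotone functions could be fitted to the envelopes in the same way, and real measurements would generally not yield an exact match between the predicted and the observed index.</p>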
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusive Remarks</title>
      <p>The objective of this position paper is the proposal of a Master project aimed at
developing a rigorous theoretical framework proving the existence of terminological
saturation in the sequence of incrementally enlarged sub-collections of documents describing
an arbitrary subject domain. The proposal is based on the background knowledge of the
OntoElect project.</p>
      <p>
        In the literature study focused on this topic, we found that little attention has
been paid, to date, to the problem of terminological saturation in textual corpora. In
particular, the only measure of such saturation is the one proposed in our own prior work [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Furthermore, a
formal justification for the existence of terminological saturation in this context has not
been provided and the conditions for saturation to exist have not been studied.
Therefore, the development of a theoretical framework, proposed and outlined in this paper,
is timely and important for research and practice.
      </p>
      <p>The proposal of the framework is structured along the facets of: terminological
distance measure and its metric properties; the envelope functions to cope with
non-monotonicity of saturation measurement; and the existence theorem that states the required
conditions.</p>
      <p>
        Our plans for future work are to: (i) elaborate and experimentally validate the
formal proofs of Hypotheses 1-3; (ii) modify our processing pipeline using the knowledge
from the theoretical framework; and (iii) evaluate the instrumental software in the
experiments on industrial-scale textual collections, like Springer KM [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bennamoun</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Ontology learning from text: A look back and into the future</article-title>
          .
          <source>ACM Comput. Surv.</source>
          ,
          <volume>44</volume>
          (
          <issue>4</issue>
          ),
          Article 20, 36 p. (
          <year>2012</year>
          ). doi: 10.1145/2333112.2333115
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Ermolayev</surname>
          </string-name>
          , V.:
          <article-title>OntoElecting requirements for domain ontologies. The case of time domain</article-title>
          .
          <source>EMISA Int J of Conceptual Modeling</source>
          <volume>13</volume>
          (Sp. Issue),
          <fpage>86</fpage>
          -
          <lpage>109</lpage>
          (
          <year>2018</year>
          ). doi: 10.18417/emisa.si.hcm.9
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Dobrovolskyi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keberle</surname>
          </string-name>
          , N.:
          <article-title>Collecting seminal scientific abstracts with topic modelling, snowball sampling and citation analysis</article-title>
          .
          In: Ermolayev, V. et al. (eds.)
          <source>ICTERI 2018. Volume I: Main Conference</source>
          , CEUR-WS, vol.
          <volume>2105</volume>
          , pp.
          <fpage>179</fpage>
          -
          <lpage>192</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Tatarintseva</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ermolayev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matzke</surname>
            ,
            <given-names>W.-E.</given-names>
          </string-name>
          :
          <article-title>Quantifying ontology fitness in OntoElect using saturation- and vote-based metrics</article-title>
          . In: Ermolayev, V. et al. (eds.)
          <source>Revised Selected Papers of ICTERI 2013</source>
          , CCIS, vol.
          <volume>412</volume>
          , pp.
          <fpage>136</fpage>
          -
          <lpage>162</lpage>
          (
          <year>2013</year>
          ). doi: 10.1007/978-3-319-03998-5_8
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ermolayev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batsakis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keberle</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tatarintseva</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antoniou</surname>
          </string-name>
          , G.:
          <article-title>Ontologies of time: review and trends</article-title>
          .
          <source>Int J of Computer Science and Applications</source>
          <volume>11</volume>
          (
          <issue>3</issue>
          ),
          <fpage>57</fpage>
          -
          <lpage>115</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kosa</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chugunenko</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuschenko</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Badenes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ermolayev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birukou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Semantic saturation in retrospective text document collections</article-title>
          . In: Mallet, F., Zholtkevych, G. (eds.)
          <source>ICTERI 2017 PhD Symposium</source>
          , CEUR-WS, vol.
          <volume>1851</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kosa</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaves-Fraga</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumenko</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuschenko</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moiseenko</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dobrovolskyi</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vasileyko</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Badenes-Olmedo</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ermolayev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corcho</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birukou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The Influence of the Order of Adding Documents to Datasets on Terminological Saturation</article-title>
          .
          <source>Tech. Rep. TS-RTDC-TR-2018-2-v2</source>
          , Zaporizhzhia National University, Ukraine, 72 p. (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Kosa</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaves-Fraga</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keberle</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birukou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Similar terms grouping yields faster terminological saturation</article-title>
          . In: Ermolayev, V. et al. (eds.)
          <source>ICTERI 2018. Revised Selected Papers</source>
          . CCIS, vol.
          <volume>1007</volume>
          , pp.
          <fpage>43</fpage>
          -
          <lpage>70</lpage>
          (
          <year>2019</year>
          ). doi: 10.1007/978-3-030-13929-2_3
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kosa</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaves-Fraga</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dobrovolskiy</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fedorenko</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ermolayev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Optimizing automated term extraction for terminological saturation measurement</article-title>
          . In: Ermolayev, V. et al. (eds.)
          <source>ICTERI 2019. Volume I: Main Conference</source>
          , CEUR-WS, vol.
          <volume>2387</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Guarino</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oberle</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Staab</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>What is an ontology?</article-title>
          In: Staab, S., Studer, R. (eds.)
          <source>Handbook on Ontologies</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          , International Handbooks on Information Systems
          , Springer, Berlin, Heidelberg (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Studer</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benjamins</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fensel</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Knowledge engineering: principles and methods</article-title>
          .
          <source>Data &amp; Knowledge Engineering</source>
          <volume>25</volume>
          (
          <issue>1-2</issue>
          ),
          <fpage>161</fpage>
          -
          <lpage>198</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Gómez-Pérez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernández-López</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corcho</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          : Ontological Engineering. Springer, London (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Pinto</surname>
            ,
            <given-names>H. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tempich</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Staab</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sure</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>DILIGENT: towards a fine-grained methodology for distributed, loosely controlled and evolving engineering of ontologies</article-title>
          . In: de Mántaras, R.L., Saitta, L. (eds.)
          <source>16th European Conf. on Artificial Intelligence, ECAI</source>
          , pp.
          <fpage>393</fpage>
          -
          <lpage>397</lpage>
          , IOS Press, (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Schreiber</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          et al.:
          <article-title>Knowledge Engineering and Management: The CommonKADS Methodology</article-title>
          . MIT Press, Cambridge, Massachusetts (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Suárez-Figueroa</surname>
            ,
            <given-names>M. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-Pérez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gangemi</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . (Eds.): Ontology Engineering in a Networked World. Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Sure</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Staab</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Studer</surname>
          </string-name>
          , R.:
          <article-title>On-To-Knowledge methodology</article-title>
          . In: Staab, S., Studer, R. (eds.)
          <source>Handbook on Ontologies</source>
          , pp.
          <fpage>117</fpage>
          -
          <lpage>132</lpage>
          , Series on Handbooks in Information Systems
          , Springer, Berlin, Heidelberg (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Maedche</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Staab</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Ontology learning for the Semantic Web</article-title>
          .
          <source>IEEE Intell. Syst</source>
          <volume>16</volume>
          (
          <issue>2</issue>
          ),
          <fpage>72</fpage>
          -
          <lpage>79</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Buitelaar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cimiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Magnini</surname>
          </string-name>
          , B. (eds.):
          <article-title>Ontology Learning from Text: Methods, Evaluation and Applications</article-title>
          .
          <source>Frontiers in Artificial Intelligence and Applications</source>
          , vol.
          <volume>123</volume>
          , IOS Press (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Corpas Pastor</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seghiri Domínguez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Size matters: a quantitative approach to corpus representativeness</article-title>
          . In: Rabadán, R., Fernández López, M., Guzmán González, T. (eds.)
          <source>Lengua, traducción, recepción en honor de Julio César Santoyo</source>
          , pp.
          <fpage>111</fpage>
          -
          <lpage>145</lpage>
          , Universidad de León Área de Publicaciones, León
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Ferrari</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>dell'Orletta</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spagnolo</surname>
            <given-names>G.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gnesi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Measuring and improving the completeness of natural language requirements</article-title>
          . In: Salinesi, C., van de Weerd, I. (eds.)
          <source>REFSQ 2014</source>
          , LNCS, vol.
          <volume>8396</volume>
          , Springer, Cham (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Riazanov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Voronkov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Adaptive saturation-based reasoning</article-title>
          . In: Bjorner, D., Broy, M., Zamulin, A. (eds.)
          <source>Andrei Ershov 4th Int Conf on Perspectives of System Informatics (PSI'01)</source>
          , pp.
          <fpage>55</fpage>
          -
          <lpage>61</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Chernyak</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berenstein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Method and apparatus for informational processing based on creation of term-proximity graphs and their embeddings into informational units</article-title>
          .
          <source>US Patent Application Publication</source>
          , No US 2006/0031219 A1, Feb. 9 (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Doerre</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerstl</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeser</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mueller</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seiffert</surname>
          </string-name>
          , R.:
          <article-title>Taxonomy generation for document collections</article-title>
          .
          <source>US Patent</source>
          , No US 6 446 061 B1, Sep. 3 (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24. Han, H.,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Mining technical topic networks from Chinese patents</article-title>
          . In: Jung, H., Mandl, T., Womser-Hacker, C., Xu, S. (eds.)
          <source>1st Int W-shop on Patent Mining and Its Applications (IPAMIN 2014)</source>
          , CEUR-WS
          , vol.
          <volume>1292</volume>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Aldiabat</surname>
            ,
            <given-names>K. M.</given-names>
          </string-name>
          :
          <article-title>Data saturation: the mysterious step in Grounded Theory methodology</article-title>
          .
          <source>The Qualitative Report</source>
          <volume>23</volume>
          (
          <issue>1</issue>
          )
          ,
          <fpage>245</fpage>
          -
          <lpage>261</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Glaser</surname>
            ,
            <given-names>B. G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Strauss</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The Discovery of Grounded Theory: Strategies for Qualitative Research</article-title>
          . Aldine, Chicago, IL (
          <year>1967</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>C. S.</given-names>
          </string-name>
          :
          <article-title>A vector space model for automatic indexing</article-title>
          .
          <source>Communications of the ACM</source>
          <volume>18</volume>
          (
          <issue>11</issue>
          ),
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          (
          <year>1975</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Kosa</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaves-Fraga</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumenko</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuschenko</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Badenes-Olmedo</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ermolayev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birukou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Cross-evaluation of automated term extraction tools by measuring terminological saturation</article-title>
          . In:
          <string-name>
            <surname>Bassiliades</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , et al. (eds.)
          <article-title>ICTERI 2017</article-title>
          .
          <article-title>Revised Selected Papers</article-title>
          .
          <source>CCIS</source>
          , vol.
          <volume>826</volume>
          , pp.
          <fpage>135</fpage>
          -
          <lpage>163</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Gomaa</surname>
            ,
            <given-names>W. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fahmy</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          :
          <article-title>A survey of text similarity approaches</article-title>
          .
          <source>Int J Comp Appl</source>
          <volume>68</volume>
          (
          <issue>13</issue>
          ),
          <fpage>13</fpage>
          -
          <lpage>18</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>