<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Refining Terminological Saturation using String Similarity Measures</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alyon</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chugun</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rodion Popov</string-name>
          <email>rodeonpopov@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Chaves-Fraga</string-name>
          <email>dchaves@fi.upm.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vadim Ermolayev</string-name>
          <email>vadim@ermolayev.com</email>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Zaporizhzhya National University</institution>
          ,
          <addr-line>Zhukovskogo st. 66, 69600, Zaporizhzhya</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ontology Engineering Group, Universidad Politécnica de Madrid</institution>
          ,
          <addr-line>Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper reports on the refinement of the THD algorithm developed in the OntoElect framework. The baseline THD algorithm used exact string matches for key term comparison. It has been refined by introducing an appropriate string similarity metric for grouping the terms that have similar meanings and look similar as text strings. To choose the most appropriate metric, several existing metrics have been cross-evaluated on a purpose-built test set of multi-word terms in English. The rationale for creating this test set is also presented. Further, the refined algorithm for measuring terminological difference has been cross-evaluated against the baseline THD algorithm. For this cross-evaluation, the bags of terms extracted from the TIME collection of scientific papers were used. The experiment revealed that using the refined algorithm yielded better and quicker terminological saturation, compared to the baseline.</p>
      </abstract>
      <kwd-group>
        <kwd>Automated Term Extraction</kwd>
        <kwd>OntoElect</kwd>
        <kwd>Terminological Difference</kwd>
        <kwd>Key Term</kwd>
        <kwd>Linguistic Similarity Metric</kwd>
        <kwd>Bag of Terms</kwd>
        <kwd>Terminological Saturation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The research presented in this paper is part of the development of the
methodological and instrumental components for extracting representative (complete) sets of
significant terms from the representative sub-collections of textual documents having
minimal possible size. These terms are further interpreted as the required features for
engineering an ontology in a particular domain of interest. Therefore, it is assumed that the
documents in a collection cover a single and well circumscribed domain. The main
hypothesis, put forward in this work, is that a sub-collection can be considered as
representative to describe the domain, in terms of its terminological footprint, if any
additions of extra documents from the entire collection to this sub-collection do not
noticeably change this footprint. Such a sub-collection is further considered as complete and
therefore yields a representative bag of significant terms describing its domain. The
approach assesses this representativeness by evaluating terminological
saturation in a document (sub-)collection [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [31].
      </p>
      <p>
        Detecting saturation is done by measuring terminological difference (thd) among the
pairs of the consecutive incrementally enlarged datasets, as described in Section 4. This
set measure is of course based on measuring differences between individual terms.
A (baseline) THD algorithm [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] has been developed and implemented in the OntoElect
project1 (https://www.researchgate.net/project/OntoElect-a-Methodology-for-Domain-Ontology-Refinement). This THD algorithm, however, uses a simple string equivalence check for
detecting similar individual terms. The objective of the research presented in this paper
was to find out if it is possible to achieve better performance in measuring
terminological difference by using a proper string similarity measure to compare individual terms.
      </p>
      <p>The remainder of the paper is structured as follows. Section 2 reviews the related
work. Section 3 reports on the implementation of the chosen string similarity measures
and selecting the proper term similarity thresholds for their use. Section 4 sketches out
the approach of OntoElect for measuring thd and our refinement of the baseline THD
algorithm. Section 5 presents the set-up and results of our evaluation experiments. Our
conclusions and plans for the future work are given in Section 6.
</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The work reported in this paper aims at improving the measures of terminological
difference between the bags of terms extracted from textual documents. The improvement
is sought via the proper choice and use of existing string metrics for measuring
linguistic (dis)similarity between extracted terms, as opposed to the baseline THD algorithm
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] which uses text string equality measures for comparing terms. It is also the premise
in our approach that the bags of terms are multi-word, extracted from plain text files,
and accompanied by numeric significance (rank) values. The terms are also expected
to be in English. Therefore, the work related to the presented research is sought in
automated term extraction (ATE) from English texts and string similarity (distance)
measurement of the pairs of text strings containing one to several words.
      </p>
      <p>
        In the majority of approaches to ATE, e.g. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] or [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], processing is done in two
consecutive phases: Linguistic Processing and Statistical Processing. Linguistic
processors, like POS taggers or phrase chunkers, filter out stop words and restrict candidate
terms to n-gram sequences: nouns or noun phrases, adjective-noun and
noun-preposition-noun combinations. Statistical processing is then applied to measure the ranks of
the candidate terms. These measures are [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] either the measures of “unithood”, which
focus on the collocation strength of units that comprise a single term; or the measures
of “termhood” which point to the association strength of a term to domain concepts.
      </p>
      <p>
        For “unithood”, the metrics are used such as mutual information [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], log likelihood
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], t-test [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the notion of ‘modifiability’ and its variants [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The metrics for
“termhood” are either term frequency-based (unsupervised approaches) or reference
corpora-based (semi-supervised approaches). The most used frequency-based metrics
are TF/IDF (e.g. in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]), weirdness [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] which compares the frequency of a term in
the evaluated corpus with that in the reference corpus, domain pertinence [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. More
recently, hybrid approaches were proposed, that combine “unithood” and “termhood”
measurements in a single value. A representative metric is c/nc-value [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
C/nc-value-based approaches to ATE have received further evolution in many works, e.g. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to mention a few.
      </p>
      <p>
        Linguistic Processing is organized and implemented in a very similar fashion in all
ATE methods, except that some of them also include filtering out stop words. Stop
words could also be filtered out at a cut-off step after statistical processing. So, in our
review and selection we look only at the second phase, Statistical Processing.
Statistical Processing is sometimes further split into two consecutive sub-phases: term
candidate scoring and ranking. For term candidate scoring, which reflects the likelihood of a
candidate being a term, known methods could be distinguished by being based on (cf. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ])
measuring occurrence frequencies (including word association), assessing occurrence
contexts, or using reference corpora, e.g. Wikipedia [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], topic modelling [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], [29].
      </p>
      <p>
        Perhaps the most cited paper that compares string similarity (distance) metrics is
[
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. In their cross-evaluation aimed at finding the proper metric for approximate name
matching in databases, the authors of [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] used two metric functions based on edit
distance: the Levenshtein distance [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] and the Monge-Elkan distance [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Among the
metrics based on other principles, they also mentioned Jaro [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], Jaro-Winkler [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]
metrics; token-based Jaccard similarity index [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], TF/IDF-based cosine similarity, and
several other corpus-based metrics.
      </p>
      <p>
        The authors of [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] also acknowledge that there is a rich set of string similarity
measures available in the literature, including character n-gram similarity [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ],
Levenshtein distance [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], Jaro-Winkler measure [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], Jaccard similarity [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], tf-idf based
cosine similarity [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ], and Hidden Markov Model-based measure [26].
      </p>
      <p>
        To the best of our knowledge, none of the published techniques in ATE use text
string similarity (distance) measures to group linguistically similar terms. This is done
in the work presented in this paper. Furthermore, none of the techniques, except
OntoElect [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], use terminological saturation measures to minimize the sets of
documents necessary for extracting the bags of terms which represent a domain.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Implementation of String Similarity Measures and the Choice of Term Similarity Thresholds</title>
      <p>From the variety of metrics mentioned above, due to the specifics of our task, the
approximate comparison of short strings containing a few words, we filtered out those:
(i) that require long strings or sets of strings of a considerably big size; and (ii) that are
computationally hard. We also tried to keep representatives of all kinds of string
metrics in our short list as much as possible. As a result, we formed the following
list of measures to be considered for further use:
- Levenshtein distance, Hamming distance [27], Jaro similarity, and Jaro-Winkler
similarity – edit distance based syntactic measures
- Jaccard similarity index – a token based measure
- Sørensen-Dice coefficient [28] – a bi-gram comparison based measure
Among those, Levenshtein and Hamming distances appeared to be the least appropriate
in our context due to their limitations. Levenshtein returns an integer number of required
edits, while the rest of the measures return normalized reals. So, it has not been clear whether
normalizing Levenshtein would really make the result comparable to the other measures
in a way that allows using the same term similarity threshold. Hamming is applicable only to
strings of equal lengths, so padding the shorter string with spaces would really lower the
precision of measurement. Therefore, it has finally been decided to use Jaro,
Jaro-Winkler, Jaccard, and Sørensen-Dice for implementation and cross-evaluation in our work.
In the following, it is briefly explained how the selected measures are computed, with
references to their implementation code. After that, it is explained how the term similarity
thresholds have been chosen for these implemented measures.</p>
      <p>Jaro similarity simj between two strings S1 and S2 is computed, as shown in (1), from
the number of matching characters in, and the number of transpositions between, the two
compared strings.</p>
      <p>simj = 0, if m = 0;
simj = 1/3 * (m / |S1| + m / |S2| + (m - t) / m), otherwise,
(1)
where: |S1|, |S2| are the lengths of the compared strings; m is the number of the matching
characters; and t is half of the number of transposed characters. Two characters are
matching if they are the same and their distance from the beginning of the string differs
by no more than ⌊max(|S1|, |S2|)/2⌋ − 1. The number of matching symbols having a
different sequence order is the number of transposed characters.</p>
      <p>Jaro-Winkler similarity measure simj-w refines Jaro similarity measure simj by using
a prefix scale value p which assigns better ratings to the strings that match from their
beginnings for a prefix length l. Hence, for the two strings S1 and S2 it is computed as
shown in (2).</p>
      <p>
        simj-w = simj + l*p*(1 – simj),
(2)
where l is the length of a common prefix (up to a maximum of 4 characters); p is a
constant scaling factor for how much the similarity value is adjusted upwards for having
common prefixes (up to 0.25, otherwise the measure can become larger than 1; [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]
suggests that p=0.1).
      </p>
      <p>
        Sometimes Winkler’s prefix bonus l*p*(1 – simj) is given only to the pairs having
Jaro similarity higher than a particular threshold. This threshold is suggested [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] to be
equal to 0.7.
      </p>
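      <p>For illustration, formulas (1) and (2) can be sketched in Python as follows. This is a minimal illustrative sketch, not the project's published implementation; the function and parameter names are ours.</p>

```python
def jaro(s1, s2):
    """Jaro similarity per (1): built from the number of matching characters m
    and half the number of transposed characters t."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # two characters match if they are equal and their positions differ
    # by no more than floor(max(|S1|, |S2|) / 2) - 1
    window = max(len1, len2) // 2 - 1
    match1 = [False] * len1
    match2 = [False] * len2
    m = 0
    for i in range(len1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s1[i] == s2[j]:
                match1[i] = True
                match2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # count matching characters that appear in a different order; t is half of that
    t = 0
    k = 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t = t // 2
    return (m / len1 + m / len2 + (m - t) / m) / 3


def jaro_winkler(s1, s2, p=0.1, boost=0.7):
    """Jaro-Winkler per (2): adds the common-prefix bonus l*p*(1 - simj), here
    only when the Jaro similarity exceeds the 0.7 threshold suggested in [21]."""
    sim = jaro(s1, s2)
    if sim > boost:
        l = 0
        for a, b in zip(s1[:4], s2[:4]):  # common prefix of at most 4 characters
            if a != b:
                break
            l += 1
        sim = sim + l * p * (1 - sim)
    return sim
```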
      <p>Jaccard similarity index simja is a similarity measure for finite sets, sets of characters in our
case. It is computed, for the two strings S1 and S2, as the ratio between the cardinalities
of the intersection and union of the character sets in S1 and S2, as shown in (3).
simja = |S1 ∩ S2| / |S1 ∪ S2|
(3)</p>
      <p>Finally, Sørensen-Dice coefficient, regarded as a character string similarity measure,
is computed by counting identical character bi-grams in S1 and S2 and relating these to
the overall number of bi-grams in both strings – as shown in (4).</p>
      <p>simsd = 2n / (nS1 + nS2),
(4)
where: n is the number of bi-grams found in both S1 and S2; nS1, nS2 are the numbers of
all bi-grams in S1 and S2, respectively.</p>
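      <p>Formula (4) can be sketched in Python as follows; again an illustrative sketch with our own naming, counting character bi-grams as multisets.</p>

```python
from collections import Counter

def sorensen_dice(s1, s2):
    """Sørensen-Dice coefficient per (4): 2n / (nS1 + nS2), where n counts
    the character bi-grams found in both strings."""
    bg1 = Counter(s1[i:i + 2] for i in range(len(s1) - 1))
    bg2 = Counter(s2[i:i + 2] for i in range(len(s2) - 1))
    n_s1 = sum(bg1.values())
    n_s2 = sum(bg2.values())
    if n_s1 + n_s2 == 0:
        return 1.0 if s1 == s2 else 0.0  # strings shorter than one bi-gram
    n = sum(min(bg1[b], bg2[b]) for b in bg1)  # bi-grams present in both
    return 2 * n / (n_s1 + n_s2)
```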
      <p>
        The functions for all four string similarity measures have been implemented2 in
Python 3.0 and return real values within [0, 1].
      </p>
      <p>For the proper use of these functions it is, however, necessary to determine what
would be a reasonable threshold to distinguish between (semantically) similar and not
similar terms. For determining that, the following cases in string comparison need to
be taken into account:
- Character strings are fully the same – Full Positives (FP). This case clearly falls
into similar (the same) terms.
- Character strings are very different and the terms in these strings carry different
semantics – Full Negatives (FN). This case is also clear and is characterized by low
values of similarity measures.
- Character strings are partially the same and the terms in these strings carry the
same or similar semantics – Partial Positives (PP).</p>
      <p>The terms in such strings are similar, though it may not be fully clear. The following
are the different categories of terms that bring about this case: words in the terms have
different endings (e.g. plural/singular forms); different delimiters are used (e.g. “-”, or
“–”, or “ - ”); a symbol is missing, erroneously added, or misspelled (a typo); one term
is a sub-string of the other (e.g. subsuming the second); one of the strings contains
unnecessary extra characters (e.g. two or three spaces instead of one, or noise).
- Character strings are partially the same but the terms in these strings carry
different semantics – Partial Negatives (PN).</p>
      <p>The terms in such strings are different, though it may not be fully clear. The
following are the categories that bring about this case: the terms carried by the compared
strings differ by a few characters but have different meanings (e.g. “deprecate” versus
“depreciate”); the compared terms have common word(s) but fully differ in their
meanings (e.g. “affect them” versus “effect them”). These false positives are the hardest case
to detect.</p>
      <p>The test set of term pairs falling into the cases described above has been manually
developed3. For each pair of terms in this test set all four string similarity measures
have been computed.
2 These functions are publicly available at: https://github.com/EvaZsu/OntoElect
3 The test set and computed term similarity values are publicly available at
https://github.com/EvaZsu/OntoElect/blob/master/Test-Set.xls</p>
      <p>We have computed the average values of all four similarity measures for each
category using all the test set term pairs falling into this category. The results are given in
Table 1.</p>
      <p>Term similarity thresholds have to be chosen such that full and partial negatives are
regarded as not similar, while full and partial positives are regarded as similar. Hence, for
the case of partial positives, the thresholds have to be chosen as the minimum over all the case
categories, and for the partial negatives – as the maximum over all the case categories. The
values of the case thresholds are shown in bold in Table 1 and provide us with the margins
of the relevant threshold intervals in our experiments. These intervals have been evenly
split into four points, as presented in Table 2. The requirements for partial positives and
negatives unfortunately contradict each other. For example, if a threshold is chosen
to filter out partial negatives, some of the partial positives will also be filtered out.
Therefore, assuming that partial negatives are rare, it has been decided to use the
thresholds for partial positives.</p>
    </sec>
    <sec id="sec-4">
      <title>OntoElect and the Refinement of the THD Algorithm</title>
      <p>OntoElect, as a methodology, seeks to maximize the fitness of the developed
ontology with respect to what the domain knowledge stakeholders think about the domain. Fitness
is measured via the stakeholders’ “votes” – a measure that allows assessing the
stakeholders’ commitment to the ontology under development, reflecting how well their
sentiment about the requirements is met. The more votes are collected, the higher the
commitment is expected to be. If a critical mass of votes is acquired (say 50%+1, which
is a simple majority vote), the ontology is considered to satisfactorily meet the
requirements.</p>
      <p>Unfortunately, direct acquisition of requirements from domain experts is not very
realistic, as the experts are expensive and not really willing to do work that falls outside their
core activity. So, we focus on the indirect collection of the stakeholders’ votes by
extracting these from high quality and reasonably high impact documents authored by the
stakeholders.</p>
      <p>
        An important feature to be ensured for knowledge extraction from text collections is
that the dataset needs to be representative to cover the opinions of the domain
knowledge stakeholders satisfactorily fully. OntoElect suggests a method to measure
the terminological completeness of the document collection by analyzing the saturation
of terminological footprints of the incremental slices of the document collection [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
The full texts of the documents from a retrospective collection are grouped in datasets
in the order of their timestamps. As pictured in Fig. 1a, the first dataset D1 contains the
first portion (inc) of documents. The second dataset D2 contains the first dataset D1
plus the second incremental slice (inc) of documents. Finally, the last dataset Dn
contains all the documents from the collection.
      </p>
      <p>Fig. 1: (a) incremental construction of the datasets from the document collection; (b) an example of an extracted bag of terms.</p>
      <p>At the next step of the OntoElect workflow, the bags of multi-word terms
B1, B2, …, Bn are extracted from the datasets D1, D2, …, Dn, using the UPM Term
Extractor software [30], together with their significance (c-value) scores. An
example of an extracted bag of terms is given in Fig. 1b.</p>
      <p>At the subsequent step, every extracted bag of terms Bi, i = 1, …, n is processed as
follows:
- Normalized scores are computed for each individual term:
n-score = c-value / max(c-value)
- The individual term significance threshold (eps) is computed to cut off those terms
that are not within the majority vote. The n-scores having values above
eps form the majority vote if their sum is higher than ½ of the sum of all n-scores.
- The cut-off at n-score &lt; eps is done
- The result is saved in Ti</p>
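      <p>The scoring and cut-off steps above can be sketched as follows. This is one possible reading of the eps rule (accumulate the ranked n-scores until they exceed half of the total n-score mass); the function name and data layout are ours, not OntoElect's published code.</p>

```python
def retain_significant(bag):
    """Keep only the terms whose n-scores form the majority vote.

    bag: list of (term, c_value) pairs. Returns the retained
    (term, n_score) pairs, ranked by descending n-score.
    """
    max_c = max(c for _, c in bag)
    # n-score = c-value / max(c-value)
    scored = sorted(((t, c / max_c) for t, c in bag), key=lambda x: -x[1])
    total = sum(s for _, s in scored)
    retained = []
    acc = 0.0
    for term, n_score in scored:
        retained.append((term, n_score))
        acc += n_score
        if acc > total / 2:  # the retained n-scores now form the majority vote
            break
    return retained
```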
      <p>
        After this step only significant terms, whose n-scores represent the majority vote,
are retained in the bags of terms. Ti are then evaluated for saturation by measuring
pairwise terminological difference between the subsequent bags Ti and Ti+1,
i = 0, …, n-1. So far it has been done by applying the baseline THD algorithm4 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
presented in Fig. 2.
      </p>
      <p>Algorithm THD. Compute Terminological Difference between Bags of Terms
Input:
Ti, Ti+1 – the bags of terms with grouped similar terms.</p>
      <p>Each term Ti.term is accompanied with its T.n-score.</p>
      <p>Ti, Ti+1 are sorted in the descending order of T.n-score.</p>
      <p>
        M – the name of the string similarity measure function to compare terms
th – the value of the term similarity threshold from within [0, 1]
Output: thd(Ti+1, Ti), thdr(Ti+1, Ti)
1. sum := 0
2. thd := 0
3. for k := 1, │Ti+1│
4. sum := sum + Ti+1.n-score[k]
5. found := .F.
6. for m := 1, │Ti│
7. if (Ti+1.term[k] = Ti.term[m])  [refined: if (M(Ti+1.term[k], Ti.term[m], th))]
8. then
9. thd += │Ti+1.n-score[k] - Ti.n-score[m]│
10. found := .T.
11. end for
12. if (found = .F.) then thd += Ti+1.n-score[k]
13. end for
14. thdr := thd / sum
Fig. 2: THD algorithm [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for measuring terminological difference in a pair of bags of terms. It
uses string equalities for comparing terms and therefore needs to be refined as outlined by the
rounded rectangles. The refined THD has two more input parameters (M and th) and uses M for
comparing terms (line 7) instead of checking the equality of character strings.
      </p>
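      <p>In Python, the refined variant of Fig. 2 can be sketched as below. This is an illustrative re-implementation, not the published code; `measure` stands for M and `th` for the term similarity threshold.</p>

```python
def thd_refined(t_next, t_prev, measure, th):
    """Refined THD (cf. Fig. 2): accumulates n-score differences for terms of
    Ti+1 found similar in Ti, and the full n-scores of orphan terms.

    t_next, t_prev: lists of (term, n_score) for Ti+1 and Ti;
    measure: a string similarity function returning a value in [0, 1];
    th: the term similarity threshold. Returns (thd, thdr).
    """
    total = 0.0
    thd = 0.0
    for term_k, score_k in t_next:
        total += score_k
        found = False
        for term_m, score_m in t_prev:
            # refined line 7: a similarity check instead of string equality
            if measure(term_k, term_m) >= th:
                thd += abs(score_k - score_m)
                found = True
        if not found:
            thd += score_k  # orphan: no similar term in Ti
    thdr = thd / total
    return thd, thdr
```

Setting th = 1.0 with a measure that returns 1.0 only for identical strings reproduces the baseline THD behavior, which matches the verification check described in Section 5.1.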
      <p>In fact, THD accumulates, in the thd value for the bag Ti+1, the n-score differences
for the terms present in both Ti and Ti+1. If there is no matching term in Ti, it adds
the n-score of the orphan term to the thd value of Ti+1. After thd has been computed, the
relative terminological difference thdr receives its value as thd divided by the sum of
n-scores in Ti+1.</p>
      <p>Absolute (thd) and relative (thdr) terminological differences are computed for
further assessing if Ti+1 differs from Ti more than the individual term significance
threshold eps. If not, it implies that adding an increment of documents to Di for producing
Di+1 did not contribute any noticeable amount of new terminology. So, the subset Di+1
of the overall document collection may have become terminologically saturated.
However, to obtain more confidence about the saturation, OntoElect suggests that
some more subsequent pairs of Ti and Ti+1 are evaluated. If stable saturation is observed,
then the process of looking for a minimal saturated sub-collection could be stopped.
4 The baseline THD algorithm is implemented in Python and is publicly available at
https://github.com/bwtgroup/SSRTDC-modules/tree/master/THD</p>
      <p>Our task was to modify the THD algorithm so as to allow finding not exactly the
same but sufficiently similar terms by applying string similarity measures with
appropriate thresholds, as explained in Section 3. For that, a preparatory
similar term grouping step has been introduced to avoid duplicate similarity detection.</p>
      <p>For each of the compared bags of terms Ti and Ti+1, the similar term grouping (STG)
algorithm is applied at this preparatory step – see Fig. 3.
After term grouping is accomplished for both bags of terms, the refined THD algorithm
(Fig. 2 – rounded rectangles) is performed to compute the terminological difference
between Ti and Ti+1.</p>
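      <p>Since Fig. 3 is not reproduced here, the following is only a plausible sketch of what such a grouping step could look like, not the actual STG algorithm: it folds each term into the first existing group whose representative is similar enough, accumulating the n-scores; all names are ours.</p>

```python
def group_similar_terms(bag, measure, th):
    """A possible similar-term grouping step (cf. the STG algorithm, Fig. 3):
    merges terms whose similarity to a group representative reaches th,
    keeping the highest-ranked spelling as the representative.

    bag: list of (term, n_score) sorted by descending n-score.
    """
    groups = []  # each entry: [representative_term, accumulated_n_score]
    for term, score in bag:
        for group in groups:
            if measure(term, group[0]) >= th:
                group[1] += score  # fold the similar term into the group
                break
        else:
            groups.append([term, score])
    return [(t, s) for t, s in groups]
```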
    </sec>
    <sec id="sec-5">
      <title>Cross-Evaluation</title>
      <p>
        This section reports on our evaluation of the refined THD algorithm against the baseline
THD [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This evaluation is done following the workflow of OntoElect Requirements
Elicitation Phase [31] and using the TIME document collection.
      </p>
      <sec id="sec-5-1">
        <title>Set-up of the Experiment</title>
        <p>The objective of our experiment was to find out if using the refined THD algorithm
yields quicker and smoother terminological saturation compared to the use of the
baseline THD algorithm. We were also looking to find out which string similarity
measure fits best for measuring terminological saturation.</p>
        <p>For making the results comparable, the same datasets created from the TIME
document collection – as described in Section 5.2 – have been fed into both the refined and
the baseline THD algorithms. We applied:
(i) the refined THD – sixteen times: once per individual string similarity measure M
(Section 3) and per individual term similarity threshold th (Table 3); and
(ii) the baseline THD – one time.</p>
        <p>The values of: (i) the number of retained terms; (ii) the absolute terminological difference
(thd); and (iii) the time taken to perform term grouping by the STG algorithm (sec)
were measured.</p>
        <p>Finally, to verify if the refined THD is correct, we checked if it returns the same
results as the baseline THD when the term similarity threshold is set to 1.0.</p>
        <p>All the computations have been run on a Windows 10 64-bit PC with: Intel® Core™
2 Duo CPU, E7400 @ 2.80 GHz; 4.0 Gb on-board memory.
</p>
      </sec>
      <sec id="sec-5-2">
        <title>Experimental Data</title>
        <p>The TIME document collection contains the full text papers of the proceedings of the TIME
Symposia series5. The domain of the collection is Time Representation and Reasoning.
The publisher of these papers is IEEE. The collection contains all the papers published in the TIME
symposia proceedings between 1994 and 2013, which are 437 full text documents.
These papers have been processed manually, including their conversion to plain texts
and cleaning of these texts. So, the resulting datasets were not very noisy. We have
chosen the increment for generating the datasets to be 20 papers. So, based on the
available texts, we have generated 22 incrementally enlarged datasets D1, D2, …, D226
using our Dataset Generator7. The chronological order of adding documents has been
used.
</p>
      </sec>
      <sec id="sec-5-3">
        <title>Results and Discussion</title>
        <p>The results of our measurements of terminological saturation (thd) are pictured in a
diagrammatic form in Fig. 4. The diagrams showing the time spent by the STG
algorithm for detecting and grouping similar terms, based on the chosen term similarity
thresholds, are in Fig. 6. The diagrams in Fig. 4 and 6 have been built using the values
of the measurements from the four tables – one per term similarity threshold point (Min,
Ave-1, Ave-2, and Max)8.
5 http://time.di.unimi.it/TIME_Home.html
6 The TIME collection in plain text and the datasets generated of these texts are available at:
https://www.dropbox.com/sh/64pbodb2dmpndcy/AAAzVW7aEpgW-JrXHaCEqg2Sa/
TIME?dl=0
7 The dataset generator is available at: https://github.com/bwtgroup/SSRTDC-PDF2TXT</p>
        <p>Saturation (thd) measurements reveal that the refined THD algorithm detected
terminological saturation faster than the baseline THD algorithm, no matter which
term similarity measure (M) or similarity threshold (th) was chosen. If the results for
different measures are compared, it may be noted that the respective saturation
curves behave differently, depending on the similarity threshold point.</p>
        <p>Fig. 4: the thd saturation curves at the (a) Min, (b) Ave-1, (c) Ave-2, and (d) Max term similarity threshold points.</p>
        <p>Overall, as can be seen in Fig. 4 (a) – (d), the use of the Sørensen-Dice measure
demonstrated the least volatile behavior across the term similarity threshold points. This
measure, however, caused the refined THD algorithm to detect saturation more slowly than
the three other measures for Min, Ave-1, and Ave-2. For Max, it was as fast as Jaro and
slightly slower than Jaccard and Jaro-Winkler.</p>
        <p>One more observation was that, integrally, all the implemented term similarity
measures coped well with retaining important terms. These are indicated by the
terminology contribution peaks in diagrams (a)-(d) of Fig. 4. It can be clearly seen in Fig. 4(d), for
the Max threshold point, that all the string similarity measure curves follow the shape of
the baseline THD curve quite closely. Hence, their peaks occur at exactly the same
thd measurement points as the baseline's, pointing at more new significant terms.</p>
        <p>8 The tables are not presented in the paper due to the page limit, though they are publicly available
at: https://github.com/EvaZsu/OntoElect. File names are Results-Alltogether-{min, ave, ave2,
max}-th.xlsx</p>
        <p>At Min, Ave-1, and Ave-2, however, the measure that was most sensitive to
terminology peaks was Sørensen-Dice. This sensitivity is also confirmed by Fig. 5,
which pictures the proportions of retained to all extracted terms computed at the
different term similarity threshold points. It is clear from Fig. 5 that Sørensen-Dice
retains the largest number of terms at all the used term similarity thresholds.</p>
        <p>[Fig. 5: proportions of retained terms at the (a) Min, (b) Ave-1, (c) Ave-2, and (d) Max term similarity threshold points]</p>
        <p>Finally, it has to be noted that introducing string similarity measures into the
computation of terminological difference (the THD algorithm) increases the computational
complexity of the algorithm quite substantially. Fig. 6 pictures the times (in seconds)
taken by the pre-processor STG algorithm. As can be noticed in Fig. 6(a)-(d), the
times grow with the value of the term similarity threshold (th) and reach thousands of
seconds for the Max threshold values. It is interesting to note that Sørensen-Dice and
Jaccard are substantially more robust to the increase of th than Jaro and Jaro-Winkler.
Sørensen-Dice, however, takes roughly an order of magnitude more time than Jaccard.
On the other hand, Jaccard was not very sensitive to terminological peaks and
retained significantly fewer terms than Sørensen-Dice.</p>
        <p>To sum up, the findings are put in Table 3 to rank the evaluated string similarity
measures on a scale from 1 (the best) to 5 (the worst).</p>
        <p>Perhaps surprisingly, Jaccard, which is the most lightweight string similarity
measure (Fig. 6), demonstrated the best performance among all the measures, including the baseline
THD, as it was well balanced on all evaluation aspects. This balance was also good in
the case of Sørensen-Dice. However, Sørensen-Dice lost to Jaccard and the baseline THD
as it took too much time for term grouping. Jaro and Jaro-Winkler were clear negative
outliers. Therefore, at the expense of a slightly higher execution time, the THD algorithm refined
with the Jaccard string similarity measure is the preferred choice for measuring
terminological saturation in OntoElect.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusions and Future Work</title>
      <p>In this paper, we investigated whether the simple string equivalence measure used in the
baseline THD algorithm could be outperformed by a proper string similarity measure. To
find this out, we: (i) chose four candidate measures from the broader variety of
available measures, based on the specifics of term comparison; (ii) developed a test
set of specific term pairs to determine term similarity thresholds for the
chosen measures; (iii) implemented these measures, the algorithm for similar terms
grouping (STG), and the refinement of the baseline THD algorithm; (iv)
cross-evaluated the refined THD algorithm against the baseline, and also all individual measures
against each other; (v) gave our recommendation to use the refined THD
algorithm with the Jaccard measure, which demonstrated the most balanced performance in
our experiments.</p>
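      <p>As an illustration of step (iii) above, similar terms grouping can be sketched as a greedy pass: each extracted term joins the first group whose representative term is similar above a threshold, and otherwise opens a new group. The sketch below is an assumption about the grouping strategy rather than the actual STG implementation, and it uses the standard-library difflib ratio as a stand-in similarity measure:</p>

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in string similarity (difflib ratio); the actual STG
    # implementation uses the measures evaluated in the paper.
    return SequenceMatcher(None, a, b).ratio()


def group_terms(terms, threshold):
    """Greedy similar terms grouping sketch: a term joins the first
    group whose representative is at least `threshold`-similar,
    otherwise it starts a new group."""
    groups = []  # list of (representative, members) pairs
    for term in terms:
        for rep, members in groups:
            if similarity(rep, term) >= threshold:
                members.append(term)
                break
        else:
            groups.append((term, [term]))
    return groups


# Near-duplicate multiword terms collapse into one group at th = 0.7:
clusters = group_terms(
    ["temporal logic", "temporal logics", "model checking"], 0.7)
```

      <p>With a sensitive similarity measure and a well-chosen threshold, near-duplicate multiword terms, such as singular and plural variants, fall into one group, so their significance scores can be merged before the terminological difference is computed.</p>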
      <p>For the experiments, we used the datasets generated from the TIME document
collection using our instrumental software suite. This collection contains real scientific
papers acquired from the proceedings series of the Time Representation and Reasoning
Symposia.</p>
      <p>
        Our future work is planned based on the results of the presented experiments and
some additional observations we made. Firstly, we would like to explore ways to
improve the performance of the Sørensen-Dice measure implementation, since its higher
computational cost is its only drawback compared to the Jaccard measure implementation.
Secondly, we are interested in finding out whether a similar terms grouping algorithm, using a
sensitive similarity measure like Sørensen-Dice, would be suitable for grouping
features while building feature taxonomies. This task is on the agenda for the second
(Conceptualization) phase of OntoElect [32]. Thirdly, we are keen to check whether the evaluation
results on other document collections will be similar to those presented in this paper.
To find this out, we plan to repeat the same cross-evaluation experiments on the
datasets generated from the DMKD and DAC collections [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>The research leading to this publication has been done in part in cooperation with the
Ontology Engineering Group of the Universidad Politécnica de Madrid in the frame of the FP7
Marie Curie IRSES SemData project (http://www.semdata-project.eu/), grant
agreement No PIRSES-GA-2013-612551. While performing this research, the first author
was a master's student in the Computer Science and Information
Technologies program at Zaporizhzhia National University. The second author is funded by a PhD
grant provided by Zaporizhzhia National University and the Ministry of Education and
Science of Ukraine.
</p>
      <p>26. Qin, J., Wang, W., Lu, Y., Xiao, C., Lin, X.: Efficient exact edit similarity query processing with the asymmetric signature scheme. In: Proc. of the 2011 ACM SIGMOD Int Conf on Management of Data, pp. 1033--1044. ACM, New York, USA (2011)</p>
      <p>27. Hamming, R. W.: Error detecting and error correcting codes. Bell System Technical Journal 29(2), 147--160 (1950) DOI: 10.1002/j.1538-7305.1950.tb00463.x</p>
      <p>28. Dice, L. R.: Measures of the amount of ecologic association between species. Ecology 26(3), 297--302 (1945) DOI: 10.2307/1932409</p>
      <p>29. Badenes-Olmedo, C., Redondo-García, J. L., Corcho, O.: Efficient clustering from distributions over topics. In: Proc. K-CAP 2017, ACM, New York, NY, USA, Article 17, 8 p. (2017) DOI: 10.1145/3148011.3148019</p>
      <p>30. Corcho, O., Gonzalez, R., Badenes, C., Dong, F.: Repository of indexed ROs. Deliverable No. 5.4, Dr Inventor project (2015)</p>
      <p>31. Ermolayev, V.: OntoElecting requirements for domain ontologies. The case of time domain. EMISA Int J of Conceptual Modeling 13(Sp. Issue), 86--109 (2018) DOI: 10.18417/emisa.si.hcm.9</p>
      <p>32. Moiseenko, S., Ermolayev, V.: Conceptualizing and formalizing requirements for ontology engineering. In: Antoniou, G., Zholtkevych, G. (eds.) Proc. ICTERI 2018 PhD Symposium, Kyiv, Ukraine, May 14-17, CEUR-WS (2018) online – to appear</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Tatarintseva</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ermolayev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matzke</surname>
            ,
            <given-names>W.-E.</given-names>
          </string-name>
          :
          <article-title>Quantifying ontology fitness in OntoElect using saturation- and vote-based metrics</article-title>
          . In: Ermolayev,
          <string-name>
            <surname>V.</surname>
          </string-name>
          , et al. (eds.)
          <source>Revised Selected Papers of ICTERI</source>
          <year>2013</year>
          ,
          <article-title>CCIS</article-title>
          , vol.
          <volume>412</volume>
          , pp.
          <fpage>136</fpage>
          --
          <lpage>162</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Fahmi</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bouma</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          , van der Plas, L.:
          <article-title>Improving statistical method using known terms for automatic term extraction</article-title>
          .
          <source>In: Computational Linguistics in the Netherlands, CLIN</source>
          <volume>17</volume>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Wermter</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hahn</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Finding new terminology in very large corpora</article-title>
          . In: Clark,
          <string-name>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schreiber</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          (eds.)
          <source>Proc. 3rd Int Conf on Knowledge Capture, K-CAP</source>
          <year>2005</year>
          , pp.
          <fpage>137</fpage>
          --
          <lpage>144</lpage>
          , Banff, Alberta, Canada,
          ACM
          (
          <year>2005</year>
          ) DOI: 10.1145/1088622.1088648
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iria</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brewster</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ciravegna</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A comparative evaluation of term recognition algorithms</article-title>
          .
          <source>In: Proc. 6th Int Conf on Language Resources and Evaluation</source>
          ,
          LREC
          <year>2008</year>
          , Marrakech, Morocco (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Daille</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Study and implementation of combined techniques for automatic extraction of terminology</article-title>
          . In: Klavans,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Resnik</surname>
          </string-name>
          , P. (eds.)
          <source>The Balancing Act: Combining Symbolic and Statistical Approaches to Language</source>
          , pp.
          <fpage>49</fpage>
          --
          <lpage>66</lpage>
          . The MIT Press. Cambridge, Massachusetts (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>J. D.</given-names>
          </string-name>
          :
          <article-title>Highlights: Language- and domain-independent automatic indexing terms for abstracting</article-title>
          .
          <source>J. Am. Soc. Inf. Sci</source>
          .
          <volume>46</volume>
          (
          <issue>3</issue>
          ),
          <fpage>162</fpage>
          --
          <lpage>174</lpage>
          (
          <year>1995</year>
          ) DOI: 10.1002/(SICI)1097-4571(199504)46:3&lt;162::AID-ASI2&gt;3.0.CO;2-6
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Caraballo</surname>
            ,
            <given-names>S. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Charniak</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Determining the specificity of nouns from text</article-title>
          .
          <source>In: Proc. 1999 Joint SIGDAT Conf on Empirical Methods in Natural Language Processing and Very Large Corpora</source>
          , pp.
          <fpage>63</fpage>
          --
          <lpage>70</lpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Astrakhantsev</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>ATR4S: toolkit with state-of-the-art automatic terms recognition methods in scala</article-title>
          .
          <source>arXiv preprint arXiv:1611.07804</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Medelyan</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I. H.</given-names>
          </string-name>
          :
          <article-title>Thesaurus based automatic keyphrase indexing</article-title>
          . In: Marchionini,
          <string-name>
            <given-names>G.</given-names>
            ,
            <surname>Nelson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            ,
            <surname>Marshall</surname>
          </string-name>
          , C. C. (eds.)
          <source>Proc. ACM/IEEE Joint Conf on Digital Libraries, JCDL</source>
          <year>2006</year>
          , pp.
          <fpage>296</fpage>
          --
          <lpage>297</lpage>
          ,
          Chapel Hill
          ,
          NC
          , USA, ACM (
          <year>2006</year>
          ) DOI: 10.1145/1141753.1141819
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Ahmad</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gillam</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tostevin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>University of surrey participation in trec8: Weirdness indexing for logical document extrapolation and retrieval (wilder)</article-title>
          .
          <source>In: Proc. 8th Text REtrieval Conf, TREC-8</source>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Sclano</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Velardi</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>TermExtractor: A Web application to learn the common terminology of interest groups and research communities</article-title>
          .
          <source>In: Proc. 9th Conf on Terminology and Artificial Intelligence</source>
          ,
          <source>TIA</source>
          <year>2007</year>
          ,
          Sophia Antipolis
          , France (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Frantzi</surname>
            ,
            <given-names>K. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ananiadou</surname>
            ,
            <given-names>S.:</given-names>
          </string-name>
          <article-title>The c/nc value domain independent method for multi-word term extraction</article-title>
          .
          <source>J. Nat. Lang. Proc. 6</source>
          (
          <issue>3</issue>
          ),
          <fpage>145</fpage>
          --
          <lpage>180</lpage>
          (
          <year>1999</year>
          ) DOI: 10.5715/jnlp.6.3_145
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Kozakov</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fin</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Drissi</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doganata</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cofino</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Glossary extraction and utilization in the information search and delivery system for IBM Technical Support</article-title>
          .
          <source>IBM System Journal</source>
          <volume>43</volume>
          (
          <issue>3</issue>
          ),
          <fpage>546</fpage>
          --
          <lpage>563</lpage>
          (
          <year>2004</year>
          ) DOI: 10.1147/sj.433.0546
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Astrakhantsev</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Methods and software for terminology extraction from domain-specific text collection</article-title>
          .
          <source>PhD thesis</source>
          ,
          Institute for System Programming of the Russian Academy of Sciences (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Bordea</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Buitelaar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Polajnar</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Domain-independent term extraction through domain modelling</article-title>
          .
          <source>In: Proc. 10th Int Conf on Terminology and Artificial Intelligence</source>
          ,
          <source>TIA</source>
          <year>2013</year>
          , Paris, France (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Kosa</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chaves-Fraga</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Naumenko</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuschenko</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Badenes-Olmedo</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ermolayev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Birukou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Cross-evaluation of automated term extraction tools by measuring terminological saturation</article-title>
          . In: Bassiliades,
          <string-name>
            <surname>N.</surname>
          </string-name>
          , et al. (eds.)
          <article-title>ICTERI 2017</article-title>
          .
          <article-title>Revised Selected Papers</article-title>
          .
          <source>CCIS</source>
          , vol.
          <volume>826</volume>
          , pp.
          <fpage>135</fpage>
          --
          <lpage>163</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>W.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ravikumar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fienberg</surname>
            ,
            <given-names>S.E.</given-names>
          </string-name>
          :
          <article-title>A comparison of string distance metrics for name-matching tasks</article-title>
          .
          <source>In: Proc. 2003 Int. Conf. on Information Integration on the Web</source>
          , pp
          <fpage>73</fpage>
          --
          <lpage>78</lpage>
          , AAAI Press (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Levenshtein</surname>
            ,
            <given-names>V.I.</given-names>
          </string-name>
          :
          <article-title>Binary codes capable of correcting deletions, insertions, and reversals</article-title>
          .
          <source>Soviet Physics Doklady</source>
          <volume>10</volume>
          (
          <issue>8</issue>
          ),
          <fpage>707</fpage>
          --
          <lpage>710</lpage>
          (
          <year>1966</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Monge</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Elkan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>The field-matching problem: algorithm and applications</article-title>
          .
          <source>In: Proc. 2nd Int Conf on Knowledge Discovery and Data Mining</source>
          , pp.
          <fpage>267</fpage>
          --
          <lpage>270</lpage>
          , AAAI Press (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Jaro</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          :
          <article-title>Probabilistic linkage of large public health data files (disc: P687-689)</article-title>
          .
          <source>Statistics in Medicine 14</source>
          ,
          <fpage>491</fpage>
          --
          <lpage>498</lpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Winkler</surname>
            ,
            <given-names>W. E.</given-names>
          </string-name>
          :
          <article-title>String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage</article-title>
          .
          <source>In: Proc. Section on Survey Research Methods. ASA</source>
          , pp.
          <fpage>354</fpage>
          --
          <lpage>359</lpage>
          (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Jaccard</surname>
            ,
            <given-names>P.:</given-names>
          </string-name>
          <article-title>The distribution of the flora in the alpine zone</article-title>
          .
          <source>New Phytologist</source>
          <volume>11</volume>
          ,
          <fpage>37</fpage>
          --
          <lpage>50</lpage>
          (
          <year>1912</year>
          ) DOI: 10.1111/j.1469-8137.1912.tb05611.x
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>String similarity measures and joins with synonyms</article-title>
          .
          <source>In: Proc. 2013 ACM SIGMOD Int Conf on the Management of Data</source>
          , pp.
          <fpage>373</fpage>
          --
          <lpage>384</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>R. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shim</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Power-law based estimation of set similarity join size</article-title>
          .
          <source>Proc. of the VLDB Endowment</source>
          <volume>2</volume>
          (
          <issue>1</issue>
          ),
          <fpage>658</fpage>
          --
          <lpage>669</lpage>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Tsuruoka</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McNaught</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsujii</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ananiadou</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Learning string similarity measures for gene/protein name dictionary look-up using logistic regression</article-title>
          .
          <source>Bioinformatics</source>
          <volume>23</volume>
          (
          <issue>20</issue>
          ),
          <fpage>2768</fpage>
          --
          <lpage>2774</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>