Refining Terminological Saturation using String Similarity Measures

Alyona Chugunenko1[0000-0002-9760-3558], Victoria Kosa1[0000-0002-7300-8818], Rodion Popov1, David Chaves-Fraga2[0000-0003-3236-2789], and Vadim Ermolayev1[0000-0002-5159-254X]

1 Department of Computer Science, Zaporizhzhya National University, Zhukovskogo st. 66, 69600, Zaporizhzhya, Ukraine
aluonac@i.ua, victoriya1402.kosa@gmail.com, rodeonpopov@gmail.com, vadim@ermolayev.com
2 Ontology Engineering Group, Universidad Politécnica de Madrid, Madrid, Spain
dchaves@fi.upm.es

Abstract. This paper reports on the refinement of the THD algorithm developed in the OntoElect framework. The baseline THD algorithm used exact string matching for key term comparison. It has been refined by introducing an appropriate string similarity metric for grouping terms that have similar meanings and look similar as text strings. To choose the most appropriate metric, several existing metrics have been cross-evaluated on a purpose-built test set of multi-word terms in English. The rationale for creating this test set is also presented. Further, the refined algorithm for measuring terminological difference has been cross-evaluated against the baseline THD algorithm. For this cross-evaluation, the bags of terms extracted from the TIME collection of scientific papers were used. The experiment revealed that using the refined algorithm yielded better and quicker terminological saturation, compared to the baseline.

Keywords: Automated Term Extraction, OntoElect, Terminological Difference, Key Term, Linguistic Similarity Metric, Bag of Terms, Terminological Saturation.

1 Introduction

The research presented in this paper is part of the development of the methodological and instrumental components for extracting representative (complete) sets of significant terms from representative sub-collections of textual documents of the minimal possible size. These terms are further interpreted as the required features for engineering an ontology in a particular domain of interest. Therefore, it is assumed that the documents in a collection cover a single and well circumscribed domain. The main hypothesis put forward in this work is that a sub-collection can be considered representative for describing the domain, in terms of its terminological footprint, if any addition of extra documents from the entire collection to this sub-collection does not noticeably change this footprint. Such a sub-collection is further considered complete and therefore yields a representative bag of significant terms describing its domain. The approach assesses representativeness by evaluating terminological saturation in a document (sub-)collection [1], [31].

Saturation is detected by measuring the terminological difference (thd) between the pairs of consecutive, incrementally enlarged datasets, as described in Section 4. This set measure is, of course, based on measuring differences between individual terms. A (baseline) THD algorithm [1] has been developed and implemented in the OntoElect project (https://www.researchgate.net/project/OntoElect-a-Methodology-for-Domain-Ontology-Refinement). This THD algorithm, however, uses a simple string equivalence check for detecting similar individual terms. The objective of the research presented in this paper was to find out if it is possible to achieve better performance in measuring terminological difference by using a proper string similarity measure to compare individual terms.

The remainder of the paper is structured as follows. Section 2 reviews the related work.
Section 3 reports on the implementation of the chosen string similarity measures and the selection of proper term similarity thresholds for their use. Section 4 sketches the OntoElect approach to measuring thd and our refinement of the baseline THD algorithm. Section 5 presents the set-up and results of our evaluation experiments. Our conclusions and plans for future work are given in Section 6.

2 Related Work

The work reported in this paper aims at improving the measures of terminological difference between the bags of terms extracted from textual documents. The improvement is sought via the proper choice and use of existing string metrics for measuring linguistic (dis)similarity between extracted terms, as opposed to the baseline THD algorithm [1], which uses text string equality for comparing terms. It is also a premise of our approach that the bags of terms are multi-word, extracted from plain text files, and accompanied by numeric significance (rank) values. The terms are also expected to be in English. Therefore, the work related to the presented research is sought in automated term extraction (ATE) from English texts and in string similarity (distance) measurement for pairs of text strings containing one to several words.

In the majority of approaches to ATE, e.g. [2] or [3], processing is done in two consecutive phases: Linguistic Processing and Statistical Processing. Linguistic processors, like POS taggers or phrase chunkers, filter out stop words and restrict candidate terms to n-gram sequences: nouns or noun phrases, adjective-noun and noun-preposition-noun combinations. Statistical processing is then applied to measure the ranks of the candidate terms. These measures are [4] either measures of "unithood", which focus on the collocation strength of the units that comprise a single term, or measures of "termhood", which point to the association strength of a term to domain concepts. For "unithood", metrics such as mutual information [5], log likelihood [6], the t-test [2], [3], and the notion of 'modifiability' and its variants [7], [3] are used. The metrics for "termhood" are either term frequency-based (unsupervised approaches) or reference corpora-based (semi-supervised approaches). The most used frequency-based metrics are TF/IDF (e.g. in [8], [9]); weirdness [10], which compares the frequency of a term in the evaluated corpus with that in a reference corpus; and domain pertinence [11]. More recently, hybrid approaches have been proposed that combine "unithood" and "termhood" measurements in a single value. A representative metric is c/nc-value [12]. C/nc-value-based approaches to ATE have received further evolution in many works, e.g. [2], [11], [13], to mention a few.

Linguistic Processing is organized and implemented in a very similar fashion in all ATE methods, except that some of them also include filtering out stop words. Stop words could also be filtered out at a cut-off step after statistical processing. So, in our review and selection we look at the second phase, Statistical Processing, only. Statistical Processing is sometimes further split into two consecutive sub-phases: term candidate scoring and ranking. For term candidate scoring, which reflects the likelihood of a candidate being a term, known methods can be distinguished (c.f. [8]) by being based on:
measuring occurrence frequencies (including word association); assessing occurrence contexts; using reference corpora, e.g. Wikipedia [14]; or topic modelling [15], [29].

Perhaps the most cited paper comparing string similarity (distance) metrics is [17]. In their cross-evaluation, aimed at finding the proper metric for approximate name matching in databases, the authors of [17] used two metric functions based on edit distance: the Levenshtein distance [18] and the Monge-Elkan distance [19]. Among the metrics based on other principles, they also mentioned the Jaro [20] and Jaro-Winkler [21] metrics, the token-based Jaccard similarity index [22], TF/IDF based cosine similarity, and several other corpus-based metrics.

The authors of [23] also acknowledge that there is a rich set of string similarity measures available in the literature, including character n-gram similarity [24], Levenshtein distance [18], the Jaro-Winkler measure [21], Jaccard similarity [22], tf-idf based cosine similarity [25], and a Hidden Markov Model-based measure [26].

To the best of our knowledge, none of the published ATE techniques uses text string similarity (distance) measures to group linguistically similar terms. This is done in the work presented in this paper. Furthermore, none of the techniques, except OntoElect [1], [16], uses terminological saturation measures to minimize the sets of documents necessary for extracting the bags of terms which represent a domain.

3 Implementation of String Similarity Measures and the Choice of Term Similarity Thresholds

From the variety of metrics mentioned above, and due to the specifics of our task of approximately comparing short strings containing a few words, we filtered out those: (i) that require long strings or sets of strings of a considerably big size; or (ii) that are computationally hard. We also tried to keep representatives of all kinds of string metrics in our short list as much as possible. As a result, we formed the following list of measures to be considered for further use:

• Levenshtein distance, Hamming distance [27], Jaro similarity, and Jaro-Winkler similarity – edit distance based syntactic measures
• Jaccard similarity index – a token based measure
• Sørensen-Dice coefficient [28] – a bi-gram comparison based measure

Among those, the Levenshtein and Hamming distances appeared to be the least appropriate in our context due to their limitations. Levenshtein returns an integer number of required edits, while the rest of the measures return normalized reals. So, it has not been clear if normalizing the Levenshtein distance would really make the result comparable to the other measures in a way that allows using the same term similarity threshold. Hamming is applicable only to strings of equal lengths, and padding the shorter string with spaces would lower the precision of measurement. Therefore, it has finally been decided to use Jaro, Jaro-Winkler, Jaccard, and Sørensen-Dice for implementation and cross-evaluation in our work. In the following, it is briefly explained how the selected measures are computed, with references to their implementation code. After that, it is explained how the term similarity thresholds have been chosen for these implemented measures.

The Jaro similarity simj between two strings S1 and S2 is computed, as shown in (1), from the number of matching and transposed characters in the compared pair of strings:
simj = 0, if m = 0;
simj = 1/3 * (m/|S1| + m/|S2| + (m – t)/m), otherwise,   (1)

where |S1|, |S2| are the lengths of the compared strings; m is the number of matching characters; and t is half of the number of transposed characters. Two characters match if they are the same and their distance from the beginning of the string differs by no more than ⌊max(|S1|, |S2|)/2⌋ − 1. The number of matching characters that occur in a different sequence order is the number of transposed characters.

The Jaro-Winkler similarity measure simj-w refines the Jaro similarity measure simj by using a prefix scale value p, which assigns better ratings to the strings that match from their beginnings for a prefix length l. Hence, for the two strings S1 and S2, it is computed as shown in (2):

simj-w = simj + l*p*(1 – simj),   (2)

where l is the length of the common prefix (up to a maximum of 4 characters), and p is a constant scaling factor for how much the similarity value is adjusted upwards for having common prefixes (up to 0.25, otherwise the measure can become larger than 1; [21] suggests p = 0.1). Sometimes Winkler's prefix bonus l*p*(1 – simj) is given only to the pairs having a Jaro similarity higher than a particular threshold. This threshold is suggested [21] to be equal to 0.7.

The Jaccard similarity index simja is a similarity measure for finite sets – sets of characters in our case. It is computed, for the two strings S1 and S2, as the ratio between the cardinalities of the intersection and union of the character sets of S1 and S2, as shown in (3):

simja = |S1 ∩ S2| / |S1 ∪ S2|   (3)

Finally, the Sørensen-Dice coefficient, regarded as a character string similarity measure, is computed by counting the identical character bi-grams in S1 and S2 and relating these to the overall number of bi-grams in both strings, as shown in (4):

simsd = 2n / (nS1 + nS2),   (4)

where n is the number of bi-grams found both in S1 and in S2, and nS1, nS2 are the numbers of all bi-grams in S1 and S2, respectively.

The functions for all four string similarity measures have been implemented in Python 3.0 and return real values within [0, 1] (these functions are publicly available at: https://github.com/EvaZsu/OntoElect); an illustrative sketch of such functions is given after the case list below. For the proper use of those functions, it is however necessary to determine what would be a reasonable threshold to distinguish between (semantically) similar and dissimilar terms. For determining that, the following cases in string comparison need to be taken into account:

• Character strings are fully the same – Full Positives (FP). This case clearly falls into similar (the same) terms.
• Character strings are very different and the terms in these strings carry different semantics – Full Negatives (FN). This case is also clear and is characterized by low values of the similarity measures.
• Character strings are partially the same and the terms in these strings carry the same or similar semantics – Partial Positives (PP). The terms in such strings are similar, though it may not be fully clear. The following categories of terms bring about this case: words in the terms have different endings (e.g. plural/singular forms); different delimiters are used (e.g. "-", or "–", or " - "); a symbol is missing, erroneously added, or misspelled (a typo); one term is a sub-string of the other (e.g. subsuming the second); one of the strings contains unnecessary extra characters (e.g. two or three spaces instead of one, or noise).
• Character strings are partially the same but the terms in these strings carry different semantics – Partial Negatives (PN). The terms in such strings are different, though it may not be fully clear. The following categories bring about this case: the terms carried by the compared strings differ by a few characters but have different meanings (e.g. "deprecate" versus "depreciate"); the compared terms have common word(s) but fully differ in their meanings (e.g. "affect them" versus "effect them"). These false positives are the hardest case to detect.
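For concreteness, the following is a minimal, illustrative re-implementation of the four measures as defined in (1)–(4). It is a sketch written for this text, not the authors' published code at the repository above; each function returns a real value within [0, 1].

from collections import Counter

def jaro(s1: str, s2: str) -> float:
    # Jaro similarity, as defined in (1).
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1  # matching window from (1)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # t is half of the number of matched characters occurring in a different order
    j, transposed = 0, 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                transposed += 1
            j += 1
    t = transposed / 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1, boost_th: float = 0.7) -> float:
    # Jaro-Winkler similarity, as defined in (2); the prefix bonus is granted
    # only above the 0.7 boost threshold suggested in [21].
    sj = jaro(s1, s2)
    if sj <= boost_th:
        return sj
    l = 0  # length of the common prefix, up to 4 characters
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        l += 1
    return sj + l * p * (1 - sj)

def jaccard(s1: str, s2: str) -> float:
    # Jaccard index over character sets, as defined in (3).
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def sorensen_dice(s1: str, s2: str) -> float:
    # Sørensen-Dice coefficient over character bi-grams, as defined in (4).
    bg1 = Counter(s1[i:i + 2] for i in range(len(s1) - 1))
    bg2 = Counter(s2[i:i + 2] for i in range(len(s2) - 1))
    total = sum(bg1.values()) + sum(bg2.values())
    if total == 0:
        return 1.0 if s1 == s2 else 0.0
    return 2 * sum((bg1 & bg2).values()) / total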
The test set of term pairs falling into the cases described above has been manually developed (the test set and the computed term similarity values are publicly available at: https://github.com/EvaZsu/OntoElect/blob/master/Test-Set.xls). For each pair of terms in this test set, all four string similarity measures have been computed.

We have computed the average values of all four similarity measures for each category, using all the test set term pairs falling into this category. The results are given in Table 1.

Table 1: Average similarity measure values for different categories of term pairs from the test set

Case / Category                   | Items in Test Set | Sørensen-Dice | Jaccard | Jaro   | Jaro-Winkler
Different strings (FN)            | 6                 | 0.03          | 0.45    | 0.55   | 0.55
Identical strings (FP)            | 3                 | 1.00          | 1.00    | 1.00   | 1.00
Similar Semantics (PP)            | 32                | 0.71          | 0.72    | 0.63   | 0.70
- Unnecessary (extra) characters  | 7                 | 0.8401        | 0.8820  | 0.8714 | 0.8784
- Common parts (words)            | 6                 | 0.7122        | 0.7280  | 0.6375 | 0.7043
- Typos                           | 6                 | 0.7797        | 0.8637  | 0.8863 | 0.9220
- Different delimiters            | 6                 | 0.7860        | 0.8473  | 0.9125 | 0.9442
- Different endings               | 7                 | 0.8911        | 0.9135  | 0.9410 | 0.9590
Different Semantics (PN)          | 18                | 0.89          | 0.89    | 0.89   | 0.91
- Common parts (words)            | 11                | 0.4336        | 0.5221  | 0.6161 | 0.6408
- Very few character differences  | 7                 | 0.8826        | 0.8845  | 0.8914 | 0.9059
Total:                            | 59                |               |         |        |

Term similarity thresholds have to be chosen such that full and partial negatives are regarded as not similar, but full and partial positives are regarded as similar. Hence, for the case of partial positives, the thresholds have to be chosen as the minimum over all the case categories, and for the partial negatives, as the maximum over all the case categories. The values of these case thresholds are given in the Similar Semantics (PP) and Different Semantics (PN) rows of Table 1 and provide the margins of the relevant threshold intervals for our experiments. These intervals have been evenly split into four points, as presented in Table 2. The requirements for partial positives and negatives unfortunately contradict each other. For example, if a threshold is chosen to filter out partial negatives, some of the partial positives will also be filtered out. Therefore, assuming that partial negatives are rare, it has been decided to use the thresholds for partial positives.

Table 2: Term similarity thresholds chosen for experimental evaluation

Measure        | Min  | Ave-1 | Ave-2 | Max
Sørensen-Dice  | 0.71 | 0.76  | 0.83  | 0.89
Jaccard        | 0.72 | 0.77  | 0.83  | 0.89
Jaro           | 0.63 | 0.72  | 0.80  | 0.89
Jaro-Winkler   | 0.70 | 0.77  | 0.84  | 0.91
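As a usage illustration, assuming the measure functions sketched above, a hypothetical helper (the function name and the example pairs are ours, not taken from the test set) shows how a chosen measure M and threshold th decide term similarity, and why the PP/PN requirements contradict each other:

# Two terms are regarded as similar when the chosen measure M scores them
# at or above the threshold th (here: the Jaccard sketch, Ave-1 threshold).
def similar(term1: str, term2: str, measure=jaccard, th: float = 0.77) -> bool:
    return measure(term1, term2) >= th

print(similar("temporal logic", "temporal logics"))  # PP, different endings: True
print(similar("affect them", "effect them"))         # PN, yet also True - the contradiction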
4 OntoElect and the Refinement of the THD Algorithm

OntoElect, as a methodology, seeks to maximize the fitness of the developed ontology with respect to what the domain knowledge stakeholders think about the domain. Fitness is measured in terms of the stakeholders' "votes" – a measure that allows assessing the stakeholders' commitment to the ontology under development, reflecting how well their sentiment about the requirements is met. The more votes are collected, the higher the commitment is expected to be. If a critical mass of votes is acquired (say 50%+1, which is a simple majority vote), the ontology is considered to meet the requirements satisfactorily.

Unfortunately, the direct acquisition of requirements from domain experts is not very realistic, as they are expensive and not really willing to do work falling outside of their core activity. So, we focus on the indirect collection of the stakeholders' votes by extracting these from high quality and reasonably high impact documents authored by the stakeholders. An important feature to be ensured for knowledge extraction from text collections is that the dataset needs to be representative enough to cover the opinions of the domain knowledge stakeholders satisfactorily fully. OntoElect suggests a method to measure the terminological completeness of the document collection by analyzing the saturation of the terminological footprints of the incremental slices of the document collection [1].

The full texts of the documents from a retrospective collection are grouped into datasets in the order of their timestamps. As pictured in Fig. 1a, the first dataset D1 contains the first portion (inc) of documents. The second dataset D2 contains the first dataset D1 plus the second incremental slice (inc) of documents. Finally, the last dataset Dn contains all the documents from the collection.

Fig. 1: (a) Incrementally enlarged datasets in OntoElect; (b) an example of a bag of terms extracted by UPM Term Extractor [30].

At the next step of the OntoElect workflow, the bags of multi-word terms B1, B2, …, Bn are extracted from the datasets D1, D2, …, Dn using the UPM Term Extractor software [30], together with their significance (c-value) scores. An example of an extracted bag of terms is shown in Fig. 1b. At the subsequent step, every extracted bag of terms Bi, i = 1, …, n is processed as follows:

• Normalized scores are computed for each individual term: n-score = c-value / max(c-value)
• The individual term significance threshold (eps) is computed to cut off those terms that are not within the majority vote. The terms with n-scores above eps form the majority vote if the sum of these n-scores is higher than ½ of the sum of all n-scores.
• The cut-off at n-score < eps is done
• The result is saved in Ti

After this step, only significant terms, whose n-scores represent the majority vote, are retained in the bags of terms. The Ti are then evaluated for saturation by measuring the pair-wise terminological difference between the subsequent bags Ti and Ti+1, i = 1, …, n-1. So far, this has been done by applying the baseline THD algorithm [1] presented in Fig. 2 (the baseline THD algorithm is implemented in Python and is publicly available at: https://github.com/bwtgroup/SSRTDC-modules/tree/master/THD).

Algorithm THD. Compute Terminological Difference between Bags of Terms
Input: Ti, Ti+1 – the bags of terms with grouped similar terms. Each term Ti.term is accompanied with its T.n-score. Ti, Ti+1 are sorted in the descending order of T.n-score.
       M – the name of the string similarity measure function used to compare terms
       th – the value of the term similarity threshold from within [0,1]
Output: thd(Ti+1, Ti), thdr(Ti+1, Ti)
1.  sum := 0
2.  thd := 0
3.  for k := 1, │Ti+1│
4.    sum := sum + Ti+1.n-score[k]
5.    found := .F.
6.    for m := 1, │Ti│
7.      if (M(Ti+1.term[k], Ti.term[m], th))    -- baseline: if (Ti+1.term[k] = Ti.term[m])
8.      then
9.        thd += │Ti+1.n-score[k] - Ti.n-score[m]│
10.       found := .T.
11.   end for
12.   if (found = .F.) then thd += Ti+1.n-score[k]
13. end for
14. thdr := thd / sum

Fig. 2: THD algorithm [1] for measuring the terminological difference in a pair of bags of terms. The baseline version uses string equality for comparing terms and therefore needed to be refined, as outlined by the rounded rectangles in the original figure and by the annotation in line 7 above. The refined THD has two more input parameters (M and th) and uses M for comparing terms (line 7) instead of checking the equality of the character strings.
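The following is a minimal Python sketch, written for this text under stated assumptions rather than taken from the OntoElect repository, of the majority-vote cut-off described in the step list above and of the refined THD computation of Fig. 2. A bag of terms is assumed to be a list of (term, n-score) pairs sorted by descending n-score, and measure/th play the roles of M and th.

def significant_terms(bag):
    # Majority-vote cut-off (eps): keep the top-ranked terms whose n-scores
    # together exceed half of the total n-score mass of the bag.
    total = sum(score for _, score in bag)
    retained, acc = [], 0.0
    for term, score in bag:  # bag is sorted by descending n-score
        retained.append((term, score))
        acc += score
        if acc > total / 2:
            break
    return retained

def thd(bag_prev, bag_next, measure, th):
    # Refined THD (Fig. 2): accumulate n-score differences for similar terms
    # and the full n-score of every orphan term of bag_next.
    total, diff = 0.0, 0.0
    for term_k, score_k in bag_next:
        total += score_k
        found = False
        for term_m, score_m in bag_prev:
            if measure(term_k, term_m) >= th:  # line 7; baseline: term_k == term_m
                diff += abs(score_k - score_m)
                found = True
        if not found:
            diff += score_k  # orphan term (line 12)
    return diff, (diff / total if total else 0.0)  # thd and thdr (line 14)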
In fact, THD accumulates, in the thd value for the bag Ti+1, the n-score differences for the terms that occur in both Ti and Ti+1. If there is no such term in Ti, it adds the n-score of the orphan term to the thd value of Ti+1. After thd has been computed, the relative terminological difference thdr receives its value as thd divided by the sum of the n-scores in Ti+1.

The absolute (thd) and relative (thdr) terminological differences are computed for further assessing whether Ti+1 differs from Ti by more than the individual term significance threshold eps. If not, it implies that adding an increment of documents to Di for producing Di+1 did not contribute any noticeable amount of new terminology. So, the subset Di+1 of the overall document collection may have become terminologically saturated. However, to obtain more confidence about the saturation, OntoElect suggests that some more subsequent pairs of Ti and Ti+1 are evaluated. If stable saturation is observed, then the process of looking for a minimal saturated sub-collection can be stopped.

Our task was to modify the THD algorithm in a way that allows finding not exactly the same but sufficiently similar terms by applying string similarity measures with appropriate thresholds, as explained in Section 3. For that, a preparatory similar term grouping step has been introduced to avoid duplicate similarity detection. For each of the compared bags of terms Ti and Ti+1, the similar term grouping (STG) algorithm is applied at this preparatory step – see Fig. 3.

Algorithm STG. Group similar terms in the bag of terms
Input: T – a bag of terms. Each term T.term is accompanied with its T.n-score. T is sorted in the descending order of T.n-score.
       M – the name of the string similarity measure function used to compare terms
       th – the value of the term similarity threshold from within [0,1]
Output: T with grouped similar terms
1.  sum := 0
2.  for k = 1,│T│
3.    term := T.term[k]
4.    n-score := T.n-score[k]
5.    count := 1
6.    for m = k+1,│T│
7.      if M(term, T.term[m], th)
8.      then
9.        n-score += T.n-score[m]
10.       count += 1
11.       remove(T[m])
12.   end for
13.   T.n-score[k] := n-score / count
14. end for

Fig. 3: Similar Term Grouping (STG) algorithm

After term grouping is accomplished for both bags of terms, the refined THD algorithm (Fig. 2, rounded rectangles) is performed to compute the terminological difference between Ti and Ti+1; an illustrative sketch of STG follows below.
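Under the same bag-of-terms representation as the THD sketch above, the following is an illustrative Python rendering of STG (Fig. 3): each term absorbs all later, similar terms, and its n-score becomes the average n-score of the group.

def stg(bag, measure, th):
    # Similar Term Grouping (Fig. 3): scan the bag in descending n-score order;
    # merge every later term similar to the current one into its group.
    grouped = []
    remaining = list(bag)  # (term, n-score) pairs, descending n-score
    while remaining:
        term, score = remaining.pop(0)
        scores = [score]
        rest = []
        for other, other_score in remaining:
            if measure(term, other) >= th:
                scores.append(other_score)   # lines 9-11: absorb and remove
            else:
                rest.append((other, other_score))
        remaining = rest
        grouped.append((term, sum(scores) / len(scores)))  # line 13: average
    return grouped

In this sketch, a saturation check would chain the pieces introduced so far: significant_terms and stg for each bag, followed by thd on consecutive pairs, with the result compared against eps.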
5 Cross-Evaluation

This section reports on our evaluation of the refined THD algorithm against the baseline THD [1]. This evaluation is done following the workflow of the OntoElect Requirements Elicitation Phase [31] and using the TIME document collection.

5.1 Set-up of the Experiment

The objective of our experiment was to find out if using the refined THD algorithm yields quicker and smoother terminological saturation compared to the use of the baseline THD algorithm. We were also looking to find out which string similarity measure fits best for measuring terminological saturation.

To make the results comparable, the same datasets created from the TIME document collection – as described in Section 5.2 – have been fed into both the refined and the baseline THD algorithms. We applied: (i) the refined THD – sixteen times, one run per individual string similarity measure M (Section 3) and per individual term similarity threshold th (Table 2); and (ii) the baseline THD – one time. The values of: (i) the number of retained terms; (ii) the absolute terminological difference (thd); and (iii) the time taken to perform term grouping by the STG algorithm (sec) were measured. Finally, to verify that the refined THD is correct, we checked that it returns the same results as the baseline THD when the term similarity threshold is set to 1.0.

All the computations have been run on a Windows 10 64-bit PC with: Intel® Core™ 2 Duo CPU, E7400 @ 2.80 GHz; 4.0 Gb on-board memory.

5.2 Experimental Data

The TIME document collection contains the full text papers of the proceedings of the TIME Symposia series (http://time.di.unimi.it/TIME_Home.html). The domain of the collection is Time Representation and Reasoning. The publisher of these papers is IEEE. The collection contains all the papers published in the TIME symposia proceedings between 1994 and 2013, which are 437 full text documents. These papers have been processed manually, including their conversion to plain texts and the cleaning of these texts. So, the resulting datasets were not very noisy. We have chosen the increment for generating the datasets to be 20 papers. So, based on the available texts, we have generated 22 incrementally enlarged datasets D1, D2, …, D22 using our Dataset Generator (https://github.com/bwtgroup/SSRTDC-PDF2TXT). The TIME collection in plain text, and the datasets generated from these texts, are available at: https://www.dropbox.com/sh/64pbodb2dmpndcy/AAAzVW7aEpgW-JrXHaCEqg2Sa/TIME?dl=0. The chronological order of adding documents has been used; a small sketch of this slicing is given below.
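As an illustration of this slicing (ours; the Dataset Generator itself is the tool linked above), the following sketch builds the incrementally enlarged datasets from a chronologically ordered list of plain-text papers:

def incremental_datasets(papers, inc=20):
    # D_i contains the first i*inc documents; the last dataset holds them all.
    return [papers[:end] for end in range(inc, len(papers) + inc, inc)]

# 437 TIME papers with inc = 20 yield the 22 datasets D1, ..., D22.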
5.3 Results and Discussion

The results of our measurements of terminological saturation (thd) are pictured in diagrammatic form in Fig. 4. The diagrams showing the time spent by the STG algorithm for detecting and grouping similar terms, based on the chosen term similarity thresholds, are in Fig. 6. The diagrams in Fig. 4 and 6 have been built using the values of the measurements from four tables – one per term similarity threshold point (Min, Ave-1, Ave-2, and Max). These tables are not presented in the paper due to the page limit, but are publicly available at: https://github.com/EvaZsu/OntoElect (file names: Results-Alltogether-{min, ave, ave2, max}-th.xlsx).

The saturation (thd) measurements reveal that the refined THD algorithm detected terminological saturation faster than the baseline THD algorithm, no matter which term similarity measure (M) or similarity threshold (th) was chosen. If the results for the different measures are compared, it may be noted that the respective saturation curves behave differently, depending on the similarity threshold point.

Fig. 4: Terminological saturation measurements grouped in four different term similarity threshold (th) points: (a) Min; (b) Ave-1; (c) Ave-2; and (d) Max. The legend shows the colors of the different string similarity measures.

Overall, as can be seen in Fig. 4(a)–(d), the use of the Sørensen-Dice measure demonstrated the least volatile behavior across the term similarity threshold points. This measure made the refined THD algorithm detect saturation more slowly than the three other measures for Min, Ave-1, and Ave-2. For Max, it was as fast as Jaro and slightly slower than Jaccard and Jaro-Winkler.

One more observation was that, integrally, all the implemented term similarity measures coped well with retaining important terms. These are indicated by the terminology contribution peaks in diagrams (a)–(d) of Fig. 4. It is well seen in Fig. 4(d), for the Max threshold point, that all the string similarity curves follow the shape of the baseline THD curve quite closely. Hence, they have their peaks in exactly the same thd measurement points as the baseline, pointing at more new significant terms. At Min, Ave-1, and Ave-2, however, the measure that was the most sensitive to terminology peaks was Sørensen-Dice. This sensitivity is also confirmed by Fig. 5.

Fig. 5: Proportions of retained to all extracted terms for different term similarity measures

Fig. 5 pictures the proportions of retained to all extracted terms, computed at the different term similarity threshold points. It is clear from Fig. 5 that Sørensen-Dice retains the biggest number of terms at all the used term similarity thresholds.

Fig. 6: Time (sec) spent for finding similar terms, grouped similarly to Fig. 4

Finally, it has to be noted that the introduction of string similarity measures into the computation of terminological difference (the THD algorithm) increases the computational complexity of the algorithm quite substantially. Fig. 6 pictures the times (in seconds) taken by the pre-processing STG algorithm. As can be noticed in Fig. 6(a)–(d), the times grow with the value of the term similarity threshold (th) and reach thousands of seconds for the Max threshold values. It is interesting to notice that Sørensen-Dice and Jaccard are substantially more stable to the increase of th than Jaro and Jaro-Winkler. Sørensen-Dice takes, however, roughly an order of magnitude more time than Jaccard. On the other hand, Jaccard was not very sensitive to terminological peaks and retained significantly fewer terms than Sørensen-Dice.

To sum up, the findings are put in Table 3, which ranks the evaluated string similarity measures on a scale from 1 (the best) to 5 (the worst).

Table 3: The ranking of the evaluated string similarity measures

Evaluation aspect (rank 1-5)                   | Baseline THD | Sørensen-Dice | Jaccard | Jaro   | Jaro-Winkler
Faster detection of terminological saturation  | 5            | 3             | 1       | 4      | 2
More significant terms retained                | 1            | 2             | 3       | 5      | 4
Less time taken                                | 1            | 3             | 2       | 5      | 4
Total:                                         | 7 (2)        | 8 (3)         | 6 (1)   | 14 (5) | 10 (4)

Perhaps surprisingly, Jaccard, which is the most lightweight string similarity measure (Fig. 6), demonstrated the best overall performance, including against the baseline THD, as it was well balanced on all the evaluation aspects. This balance was also good in the case of Sørensen-Dice; however, Sørensen-Dice lost to Jaccard and the baseline THD as it took too much time for term grouping. Jaro and Jaro-Winkler were clear negative outliers.
Therefore, at the expense of a slightly higher execution time, the THD algorithm refined with the Jaccard string similarity measure is the preferred choice for measuring terminological saturation in OntoElect.

6 Conclusions and Future Work

In this paper, we investigated whether the simple string equivalence measure used in the baseline THD algorithm may be outperformed if a proper string similarity measure is used instead. To find this out, we: (i) chose four candidate measures from the broader variety of available ones, based on the specifics of term comparison; (ii) developed a test set of specific term pairs to decide about the term similarity thresholds for the chosen measures; (iii) implemented these measures, the algorithm for similar term grouping (STG), and the refinement of the baseline THD algorithm; (iv) cross-evaluated the refined THD algorithm against the baseline, and also all the individual measures against each other; and (v) gave our recommendation to use the refined THD algorithm with the Jaccard measure, which demonstrated the most balanced performance in our experiments.

For the experiments we used the datasets generated, using our instrumental software suite, from the TIME document collection. This collection contains real scientific papers acquired from the proceedings series of the Time Representation and Reasoning Symposia.

Our future work is planned based on the results of the presented experiments and some additional observations we made. Firstly, we would like to explore ways to improve the performance of the Sørensen-Dice measure implementation, as its higher computational complexity is its only flaw compared to the Jaccard measure implementation. Secondly, we are interested in finding out if a similar term grouping algorithm, using a sensitive similarity measure like Sørensen-Dice, would be plausible for grouping features while building feature taxonomies. This task is on the agenda for the second (Conceptualization) phase of OntoElect [32]. Thirdly, we are keen to check if the evaluation results on other document collections will be similar to those presented in this paper. To find this out, we plan to repeat the same cross-evaluation experiments on the datasets generated from the DMKD and DAC collections [16].

Acknowledgements

The research leading to this publication has been done in part in cooperation with the Ontology Engineering Group of the Universidad Politécnica de Madrid in the frame of the FP7 Marie Curie IRSES SemData project (http://www.semdata-project.eu/), grant agreement No PIRSES-GA-2013-612551. While performing this research, the first author has been a master student on the program on Computer Science and Information Technologies at Zaporizhzhia National University. The second author is funded by a PhD grant provided by Zaporizhzhia National University and the Ministry of Education and Science of Ukraine.

References

1. Tatarintseva, O., Ermolayev, V., Keller, B., Matzke, W.-E.: Quantifying ontology fitness in OntoElect using saturation- and vote-based metrics. In: Ermolayev, V., et al. (eds.) Revised Selected Papers of ICTERI 2013, CCIS, vol. 412, pp. 136--162 (2013)
2. Fahmi, I., Bouma, G., van der Plas, L.: Improving statistical method using known terms for automatic term extraction. In: Computational Linguistics in the Netherlands, CLIN 17 (2007)
3. Wermter, J., Hahn, U.: Finding new terminology in very large corpora. In: Clark, P., Schreiber, G. (eds.) Proc. 3rd Int Conf on Knowledge Capture, K-CAP 2005, pp.
137--144, Banff, Alberta, Canada, ACM (2005) DOI: 10.1145/1088622.1088648
4. Zhang, Z., Iria, J., Brewster, C., Ciravegna, F.: A comparative evaluation of term recognition algorithms. In: Proc. 6th Int Conf on Language Resources and Evaluation, LREC 2008, Marrakech, Morocco (2008)
5. Daille, B.: Study and implementation of combined techniques for automatic extraction of terminology. In: Klavans, J., Resnik, P. (eds.) The Balancing Act: Combining Symbolic and Statistical Approaches to Language, pp. 49--66. The MIT Press, Cambridge, Massachusetts (1996)
6. Cohen, J. D.: Highlights: Language- and domain-independent automatic indexing terms for abstracting. J. Am. Soc. Inf. Sci. 46(3), 162--174 (1995) DOI: 10.1002/(SICI)1097-4571(199504)46:3<162::AID-ASI2>3.0.CO;2-6
7. Caraballo, S. A., Charniak, E.: Determining the specificity of nouns from text. In: Proc. 1999 Joint SIGDAT Conf on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 63--70 (1999)
8. Astrakhantsev, N.: ATR4S: toolkit with state-of-the-art automatic terms recognition methods in Scala. arXiv preprint arXiv:1611.07804 (2016)
9. Medelyan, O., Witten, I. H.: Thesaurus based automatic keyphrase indexing. In: Marchionini, G., Nelson, M. L., Marshall, C. C. (eds.) Proc. ACM/IEEE Joint Conf on Digital Libraries, JCDL 2006, pp. 296--297, Chapel Hill, NC, USA, ACM (2006) DOI: 10.1145/1141753.1141819
10. Ahmad, K., Gillam, L., Tostevin, L.: University of Surrey participation in TREC8: weirdness indexing for logical document extrapolation and retrieval (WILDER). In: Proc. 8th Text REtrieval Conf, TREC-8 (1999)
11. Sclano, F., Velardi, P.: TermExtractor: a Web application to learn the common terminology of interest groups and research communities. In: Proc. 9th Conf on Terminology and Artificial Intelligence, TIA 2007, Sophia Antipolis, France (2007)
12. Frantzi, K. T., Ananiadou, S.: The c/nc value domain independent method for multi-word term extraction. J. Nat. Lang. Proc. 6(3), 145--180 (1999) DOI: 10.5715/jnlp.6.3_145
13. Kozakov, L., Park, Y., Fin, T., Drissi, Y., Doganata, Y., Cofino, T.: Glossary extraction and utilization in the information search and delivery system for IBM Technical Support. IBM Systems Journal 43(3), 546--563 (2004) DOI: 10.1147/sj.433.0546
14. Astrakhantsev, N.: Methods and software for terminology extraction from domain-specific text collection. PhD thesis, Institute for System Programming of the Russian Academy of Sciences (2015)
15. Bordea, G., Buitelaar, P., Polajnar, T.: Domain-independent term extraction through domain modelling. In: Proc. 10th Int Conf on Terminology and Artificial Intelligence, TIA 2013, Paris, France (2013)
16. Kosa, V., Chaves-Fraga, D., Naumenko, D., Yuschenko, E., Badenes-Olmedo, C., Ermolayev, V., Birukou, A.: Cross-evaluation of automated term extraction tools by measuring terminological saturation. In: Bassiliades, N., et al. (eds.) ICTERI 2017. Revised Selected Papers. CCIS, vol. 826, pp. 135--163 (2018)
17. Cohen, W.W., Ravikumar, P., Fienberg, S.E.: A comparison of string distance metrics for name-matching tasks. In: Proc. 2003 Int Conf on Information Integration on the Web, pp. 73--78, AAAI Press (2003)
18. Levenshtein, V.I.: Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8), 707--710 (1966)
19. Monge, A., Elkan, C.: The field-matching problem: algorithm and applications. In: Proc. 2nd Int Conf on Knowledge Discovery and Data Mining, pp. 267--270, AAAI Press (1996)
20. Jaro, M.
A.: Probabilistic linkage of large public health data files (disc: P687-689). Statistics in Medicine 14, 491--498 (1995)
21. Winkler, W. E.: String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage. In: Proc. Section on Survey Research Methods, ASA, pp. 354--359 (1990)
22. Jaccard, P.: The distribution of the flora in the alpine zone. New Phytologist 11, 37--50 (1912) DOI: 10.1111/j.1469-8137.1912.tb05611.x
23. Lu, J., Lin, C., Wang, W., Li, C., Wang, H.: String similarity measures and joins with synonyms. In: Proc. 2013 ACM SIGMOD Int Conf on the Management of Data, pp. 373--384 (2013)
24. Lee, H., Ng, R. T., Shim, K.: Power-law based estimation of set similarity join size. Proc. of the VLDB Endowment 2(1), 658--669 (2009)
25. Tsuruoka, Y., McNaught, J., Tsujii, J., Ananiadou, S.: Learning string similarity measures for gene/protein name dictionary look-up using logistic regression. Bioinformatics 23(20), 2768--2774 (2007)
26. Qin, J., Wang, W., Lu, Y., Xiao, C., Lin, X.: Efficient exact edit similarity query processing with the asymmetric signature scheme. In: Proc. 2011 ACM SIGMOD Int Conf on Management of Data, pp. 1033--1044, ACM, New York, USA (2011)
27. Hamming, R. W.: Error detecting and error correcting codes. Bell System Technical Journal 29(2), 147--160 (1950) DOI: 10.1002/j.1538-7305.1950.tb00463.x
28. Dice, L. R.: Measures of the amount of ecologic association between species. Ecology 26(3), 297--302 (1945) DOI: 10.2307/1932409
29. Badenes-Olmedo, C., Redondo-García, J. L., Corcho, O.: Efficient clustering from distributions over topics. In: Proc. K-CAP 2017, ACM, New York, NY, USA, Article 17, 8 p. (2017) DOI: 10.1145/3148011.3148019
30. Corcho, O., Gonzalez, R., Badenes, C., Dong, F.: Repository of indexed ROs. Deliverable No. 5.4, Dr Inventor project (2015)
31. Ermolayev, V.: OntoElecting requirements for domain ontologies. The case of the time domain. EMISA Int J of Conceptual Modeling 13(Sp. Issue), 86--109 (2018) DOI: 10.18417/emisa.si.hcm.9
32. Moiseenko, S., Ermolayev, V.: Conceptualizing and formalizing requirements for ontology engineering. In: Antoniou, G., Zholtkevych, G. (eds.) Proc. ICTERI 2018 PhD Symposium, Kyiv, Ukraine, May 14-17, CEUR-WS (2018), online – to appear