Optimizing Automated Term Extraction for Terminological Saturation Measurement

Victoria Kosa1 [0000-0002-7300-8818], David Chaves-Fraga2 [0000-0003-3236-2789], Hennadii Dobrovolskyi1 [0000-0001-5742-104X], Egor Fedorenko1,3 [0000-0001-6157-8111], and Vadim Ermolayev1 [0000-0002-5159-254X]

1 Department of Computer Science, Zaporizhzhia National University, Zaporizhzhia, Ukraine
victoriya1402.kosa@gmail.com, gen.dobr@gmail.com, vadim@ermolayev.com
2 Ontology Engineering Group, Universidad Politécnica de Madrid, Madrid, Spain
dchaves@fi.upm.es
3 BaDM, Dnipro, Ukraine
egorfedorencko@gmail.com

Abstract. Assessing the completeness of a document collection, within a domain of interest, is a complicated task that requires substantial effort. Even if an automated technique is used, for example terminological saturation measurement based on automated term extraction, run times grow quickly with the size of the input text. In this paper, we address this issue and propose an optimized approach based on partitioning the collection of documents into disjoint constituents and computing the required term candidate ranks (using the c-value method) independently, with a subsequent merge of the partial bags of extracted terms. It is proven in the paper that such an approach is formally correct: the total c-values can be represented as the sums of the partial c-values. The approach is also validated experimentally and yields encouraging results in terms of the decrease of the necessary run time and straightforward parallelization, without any loss in quality.

Keywords: Automated term extraction, terminological saturation, partial c-value, merged partial c-value, optimization

1 Introduction

Ontology learning from texts is a developing research field that aims to extract domain description theories from text corpora. It is increasingly acknowledged as a plausible alternative to ontology development based on interviews of domain knowledge stakeholders. One shortcoming of learning an ontology from texts is that the input corpus has to be quite big to be representative for the subject domain. Another shortcoming is that learning ontologies from texts is expensive, in terms of time, as it involves the use of several algorithms, in a pipeline [1], that are computationally hard.

Automated term extraction (ATE) is an essential step at the beginning of the pipeline for ontology learning [1, 2], and it is known to scale poorly: its run time grows quickly with the size of the input text corpus. Therefore, finding a way to reduce (i) the size of the processed text, (ii) the time spent for term extraction, or (iii) both is of importance.

In our prior work [2, 3, 4, 5], we developed an ATE-based approach (OntoElect) that helps circumscribe the minimal possible representative part of a document collection, which forms the corpus for further ontology learning. This technique is based on measuring terminological saturation in the collection of documents, which is computationally quite expensive in terms of run time.

In this paper, we present an approach, based on the partitioning of a document collection, which allows substantially reducing the ATE run time in the OntoElect processing pipeline.

The remainder of the paper is structured as follows. In Sect. 2, we outline our OntoElect approach to detecting terminological saturation in document collections describing a subject domain. In Sect. 3, we review the related work in ATE and argue for the choice of the c-value method as the most appropriate for measuring terminological saturation.
In Sect. 4, we explain our motives to optimize the c-value method based on partitioning a document collection and present a formal framework for that. Section 5 reports the setup and results of our experimental evaluation of the proposed optimization approach. Finally, we draw conclusions and outline our plans for future work in Sect. 6.

2 Background and Research Problem

OntoElect is a methodology for learning a domain ontology from a statistically representative sub-collection ($DSC_{sat}$) of the complete collection of documents ($DC = \{d_i\}$)^1 describing the subject domain. The representativeness of a sub-collection is decided using a successive approximation method, based on measuring terminological saturation. In this method, sub-collections are incrementally extended by adding several ($inc$) documents to the previous sub-collection in the sequence.

^1 In OntoElect, we do not require the availability of this complete collection. Instead, we require that a substantial part of it is available, which presumably contains all the significant terms describing the subject domain. If so, it further holds that $DSC_{sat} \subset DC$.

Let $DSC_1, DSC_2, \ldots, DSC_i, \ldots$ be the sequence of incrementally extended document sub-collections, such that $DSC_0 = \emptyset$ and $DSC_i = DSC_{i-1} \cup \{d_j\}_i$, where $d_j, j = 1, \ldots, inc$, are chosen from the remainder of $DC$ using one of the possible ordering criteria [6]: chronological, reversed-chronological, bi-directional, random, or descending citation frequency. Let also $T_1, T_2, \ldots, T_i, \ldots$ be the bags of retained significant terms extracted from $DSC_1, DSC_2, \ldots, DSC_i, \ldots$. In OntoElect, the measure of terminological difference ($thd$) is used for comparing the bags of terms $T_i, T_{i+1}$ retained from the successive $DSC_i, DSC_{i+1}$. It returns the difference as a positive real value.

If, at some $i$: (i) $thd$ goes below the threshold of the statistical error $\varepsilon$; and (ii) there is convincing evidence that it will never go above this threshold again; then the difference (distance) between $T_i$ and the hypothetical $T_{DC}$ is not higher than $\varepsilon$. Such a $T_i$ could be used as an $\varepsilon$-approximation of the representative set of significant terms describing the domain. This representative set of terms is denoted as the terminological basis ($TB$) of the subject domain. This $T_i$, labelled further as $T_{sat}$, is denoted as the saturated term set, and the corresponding $DSC_i$, labelled further as $DSC_{sat}$, is the saturated $DSC$.
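To make the successive approximation procedure concrete, the following minimal Python sketch outlines the saturation detection loop. Here extract_terms and thd are hypothetical stand-ins for the OntoElect ATE pipeline and the terminological difference measure of [4]; for brevity, the sketch stops at the first crossing of $\varepsilon$, whereas OntoElect additionally requires evidence that $thd$ stays below it.

def detect_saturation(ordered_docs, inc, eps, extract_terms, thd):
    """Extend the sub-collection by inc documents per step until the
    terminological difference between successive bags of terms drops
    below the statistical error threshold eps."""
    dsc = []         # DSC_i, the incrementally extended sub-collection
    t_prev = None    # T_{i-1}
    for i in range(0, len(ordered_docs), inc):
        dsc = dsc + ordered_docs[i:i + inc]   # DSC_i = DSC_{i-1} U {d_j}_i
        t_next = extract_terms(dsc)           # T_i
        if t_prev is not None and thd(t_prev, t_next) < eps:
            return dsc, t_next                # DSC_sat, T_sat
        t_prev = t_next
    return dsc, t_prev   # saturation not detected within the collection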
The difference ($thd$) between $T_{sat}$ and any successive $T_i$, including $T_{DC}$, is within the statistical error: $thd(T_{sat}, T_{DC}) < \varepsilon$.

In our prior work, it has been demonstrated that $thd$ is a measure that can be effectively used for comparing bags of terms, as vector representations, to assess the semantic similarity/distance between document collections. However, one substantial shortcoming of this approach is that it is computationally expensive. Indeed, given an approximately fixed length of an increment $\{d_j\}_i$ and the increasing size of $DSC_i$, the method re-processes more and more of the same part of the collection with the growth of $i$. Therefore, the computational cost (run time) of measuring $thd$ would be substantially lowered if there were a way to process only the increments of the successive sub-collections. This processing is, essentially, the ATE pipeline. Hence, the research problem is to prove that modifying the ATE processing pipeline for measuring terminological saturation to process:

- Only the disjoint parts $\{d_j\}_i$ of a document collection
- Instead of the sub-collections $DSC_i$

yields the same result and takes substantially less execution time.

3 Related Work in ATE

In the majority of approaches to ATE [7, 8], processing is done in two consecutive phases: linguistic and statistical. Linguistic processors, like POS taggers or phrase chunkers, filter out stop words and restrict candidate terms to n-gram sequences: nouns or noun phrases, adjective-noun and noun-preposition-noun combinations. Statistical processing is then applied to measure the ranks of the candidate terms. These measures are [9]: either measures of unithood, which focus on the collocation strength of the units that comprise a single term; or measures of termhood, which point to the association strength of a term to domain concepts. For unithood, measures such as mutual information [10], log likelihood [10], t-test [7, 8], and modifiability and its variants [11, 8] are used. The measures for termhood are either term frequency-based (unsupervised approaches) or reference corpora-based (semi-supervised approaches). The most used frequency-based metrics are TF/IDF [12, 13], weirdness [14], and domain pertinence [15]. More recently, hybrid approaches have been proposed that combine unithood and termhood measurements in a single value. A representative measure is c/nc-value [16]. C/nc-value-based approaches to ATE have received further evolution in many works, [7, 15, 17] to mention a few.

Linguistic processing is organized and implemented in a very similar fashion in all ATE methods, except that some of them also include filtering out stop words. Stop words could also be filtered out at a cut-off step after statistical processing. Statistical processing is sometimes further split into two consecutive sub-phases: term candidate scoring and ranking. For term candidate scoring, reflecting the likelihood of a candidate being a term, known methods could be distinguished (c.f. [12]) by being based on: measuring occurrence frequencies (including word association); assessing occurrence contexts; using reference corpora, e.g. Wikipedia [18]; or topic modelling [19, 20].
The cut-off procedure takes the top candidates, based on their scores, and thus distinguishes significant terms from insignificant (or non-) terms. Many cut-off methods rely upon the scores coming from one scoring algorithm and establish a threshold in one or another way. Some others, which collect the scores from several scoring algorithms, use (weighted) linear combinations [21], voting [9, 3], or (semi-)supervised learning [22]. In our set-up [3], we do cut-offs after term extraction based on retaining a simple majority vote. Therefore, the ATE solutions which perform cut-offs together with scoring are not relevant for our approach.

Based on the evaluations in [9, 12, 23], the most widely used ATE algorithms, for which performance assessments are published, are listed in Table 1. The table also provides the assessments based on the aspects we use for selection.

Table 1. Comparison of the most widely used ATE measures and algorithms (revision of the corresponding table in [24])

| Method [Source] | Domain-independence | Supervision (U/SS) | Measure(s) | Term Significance (+/-) | Cut-off (+/-) | Precision (GENIA; average) | Run Time (relative to the c-value method) | Tool |
|---|---|---|---|---|---|---|---|---|
| TTF [25] | + | U | Term (Total) Frequency | + | - | | | ATR4S |
| | | | | | | 0.70; 0.35 | 0.34 | JATE |
| ATF [23] | + | U | Average Term Frequency | + | - | 0.71; 0.33 | 0.37 | ATR4S |
| | | | | | | 0.75; 0.32 | 0.35 | JATE |
| TTF-IDF [26] | + | U | TTF + Inverse Document Frequency | + | - | | | ATR4S |
| | | | | | | 0.82; 0.51 | 0.35 | JATE |
| RIDF [27] | + | U | Residual IDF | + | - | 0.71; 0.32 | 0.53 | ATR4S |
| | | | | | | 0.80; 0.49 | 0.37 | JATE |
| C-value [16] | + | U | C-value, NC-value | + | - | 0.73; 0.53 | 1.00 | ATR4S |
| | | | | | | 0.77; 0.56 | 1.00 | JATE |
| Weirdness [14] | +/- | SS | Weirdness | + | - | 0.77; 0.47 | 0.41 | ATR4S |
| | | | | | | 0.82; 0.48 | 1.67 | JATE |
| GlossEx [21] | + | SS | Lexical (Term) Cohesion, Domain Specificity | + | - | | | ATR4S |
| | | | | | | 0.70; 0.41 | 0.42 | JATE |
| TermEx [15] | + | SS | Domain Pertinence, Domain Consensus, Lexical Cohesion, Structural Relevance | - | + | | | ATR4S |
| | | | | | | 0.87; 0.46 | 0.52 | JATE |
| PU-ATR [18] | - | SS | NC-value, Domain Specificity | - | + | 0.78; 0.57 | 809.21 | ATR4S |
| | | | | | | | | JATE |

Comments to Table 1:
Domain Independence: "+" stands for a domain-independent method; "-" marks that the method is either claimed to be domain-specific by its authors, or is evaluated only on one particular domain. We look for a domain-independent method.
Supervision: "U" – unsupervised; "SS" – semi-supervised. We look for an unsupervised method.
Term Significance: "+" – the method returns a value for each retained term, which could further be used as a measure of its significance compared to the other terms; "-" marks that such a measure is not returned or the method does the cut-off itself. We look for doing cut-offs later.
Cut-off: "+" – the method does cut-offs itself and returns only significant terms; "-" – the method does not do cut-offs. We look for "-".
Precision and Run Time: the values are based on the comparison of the two cross-evaluation experiments reported in [12] and [23]. Empty cells in the table mean that there was no data for this method in this experiment using this tool. Survey [12] used ATR4S [12], an open-source software tool for automated term recognition (ATR) written in Scala (4S). It evaluated 13 different methods, implemented in ATR4S, on five different datasets, including the GENIA dataset [28]. Survey [23] used JATE 2.0 [23], free software for automated term extraction (ATE) written in Java (J). It evaluated nine different methods, implemented in JATE, on two different datasets, including GENIA. Hence, the results on GENIA are the baseline for comparing precision.
Two values are given for each reference experiment: the precision on GENIA and the average precision. Both [12] and [23] experimented with the c-value method, which was the slowest on average for [23]. So, the execution times for c-value were used as the baseline to normalize the rest in the Run Time column.
Tool: the last column of the table names the tools used in the corresponding experiments.

The information in Table 1 supports the conclusion of [23] stating that c-value is the most reliable method. The c-value method obtains consistently good results, in terms of precision, on the two different mixes of datasets [23, 12]. It could also be noted that c-value is one of the slowest in the group of unsupervised and domain-independent methods, though its performance is comparable with the fastest ones. Still, c-value outperforms the domain-specific methods, sometimes significantly, as in the case of PU-ATR. Therefore, we have chosen c-value as the method for our experimental framework.

4 Motivation and Formal Framework

ATE is known to be computationally expensive in terms of run time versus the volume of input text. The c-value method that we have chosen for our terminological saturation measurement pipeline (Table 1) is more expensive than the other unsupervised and domain-neutral methods. Furthermore, ATE implementations are often constrained^2 in the volume of input text. Hence, reducing the volume of text to be processed by the method, and partitioning it into relatively small chunks, may substantially lower this expense and contribute to the better scalability of the solution.

^2 For example, the UPM Term Extractor software [29], which is based on the c-value method and used in our experiments, does not take in texts of more than 15 Mb in volume.

4.1 Motivation

The c-value method [16], as mentioned in Sect. 3, is hybrid and combines linguistic and statistical steps applied to the entire document collection (text corpus). The method starts with the linguistic pipeline, which outputs the list of term candidate strings. It then continues with the statistical part, which computes significance scores for these term candidates as c-values. The diagram of the measured run time versus the volume of input text for the conventional pipeline, provided in Sect. 5 (Fig. 4), illustrates, by its run time values, the computational complexity of the c-value method.

Let us now consider a document collection $D$ as a composition of its disjoint parts.

Definition 1 (A partial collection and a partition of a document collection). $D_i, i = 1, \ldots, n$ are the partial document collections of $D$ and $\{D_i\} = \{D_i\}_{i=1}^{n}$ is the partition of $D$ if the following conditions hold:

Condition 1: $D = \bigcup_{i=1}^{n} D_i$,    (1)
Condition 2: $D_i \cap D_j = \emptyset$ for $i \neq j$.

The linguistic part processes separate sentences. Therefore: (i) its computational complexity is a function of the number of sentences in a document collection; and (ii) the partial collections $D_i, i = 1, \ldots, n$ (Definition 1) could be processed independently and the outputs merged afterwards. Hence, applying the linguistic step to the partition of $D$ could at least be parallelized, which results in a run time gain of up to $n$ times, as sketched below.
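The following minimal Python sketch illustrates this parallelization under the stated assumptions; linguistic_step is a hypothetical stand-in for the linguistic part of an ATE implementation, and the partition is formed by simply slicing the ordered document list.

from multiprocessing import Pool

def make_partition(docs, n):
    """Split the document list into n disjoint partial collections;
    Conditions 1 and 2 of Definition 1 hold by construction."""
    size = (len(docs) + n - 1) // n
    return [docs[i:i + size] for i in range(0, len(docs), size)]

def extract_candidates(partition, linguistic_step, workers=4):
    """Apply the linguistic step to each partial collection independently
    and merge the outputs; the parts do not overlap, so no sentence is
    processed twice."""
    with Pool(workers) as pool:
        partial_outputs = pool.map(linguistic_step, partition)
    merged = []
    for output in partial_outputs:
        merged.extend(output)
    return merged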
In the case of the OntoElect pipeline, the linguistic step is iteratively applied to the incrementally enlarged datasets (see Sect. 2). Therefore, the same chunks of text are processed many times. Let us suppose that $D$ contains $k$ documents and $inc = k/n$ is the increment used to enlarge the datasets. Then, the number of documents to be processed is:

- In the case of the incrementally enlarged datasets: $inc + 2 \cdot inc + \dots + n \cdot inc = k \cdot \frac{1 + 2 + \dots + n}{n} = \frac{n+1}{2} \cdot k$, which is substantially more than $k$ if $n > 1$
- In the case of partial collections: $k$

For example, for the collection used in our experiments ($k = 300$, $inc = 20$, $n = 15$), the incremental pipeline processes $8 \cdot 300 = 2400$ documents, while the partitioned pipeline processes only 300. Hence, processing partial collections instead of incrementally enlarged datasets saves the processing of $\left(\frac{n+1}{2} - 1\right) \cdot k$ documents, a substantial additional gain in run time.

Similarly, it might be reasonable to apply the statistical step of the pipeline not to the incrementally enlarged datasets, but to the partial collections. However, it is not straightforward that:

- Computing c-values for the terms extracted from the partial collections; and
- Further merging these bags of terms with their significance scores

will give the same result as applying the statistical step to the incrementally enlarged datasets. In the remainder of this section, we prove that partitioning the c-value computation, with later merging of the results, gives correct results.

C-value [16], further labelled as $cv$ in formulae and equations, is built using several statistical characteristics of the corresponding term candidate string. These characteristics are:

- The total frequency (number) of occurrence(s) of the candidate string in the document corpus
- The frequency (number) of occurrence(s) of the candidate string as a part of other, longer candidate terms
- The number of these longer candidate terms
- The length of the candidate string (in the number of words)

Let: $s$ be a term candidate string; $|s|$ – the length of $s$ in words; $ls$ – a longer term candidate string in which $s$ is nested as a sub-string; $f(.)$ – the frequency (number) of occurrence(s) of a term candidate string in a collection of textual documents $D$; $T^s$ – the set of extracted term candidate strings $ls$ that nest $s$; and $P(T^s)$ – the number of these $ls$. Then the (complete) $cv$ of $s$ is defined [16] as follows:

$$cv(s) = \begin{cases} \log_2(|s|) \cdot f(s), & \text{if } s \text{ is not nested in any } ls \text{ extracted from } D \\ \log_2(|s|) \cdot \left(f(s) - \frac{1}{P(T^s)} \sum_{ls \in T^s} f(ls)\right), & \text{otherwise.} \end{cases} \quad (2)$$
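To make formula (2) concrete, the following Python sketch computes c-values from a table of candidate frequencies. It is an illustrative reading of [16] under the simplifying assumption that the linguistic step has already produced the candidates, represented as tuples of words, with their frequencies; it is not the code of the UPM Term Extractor used in our experiments.

import math
from collections import defaultdict

def c_values(freq):
    """Compute cv(s) according to formula (2); freq maps a candidate term,
    a tuple of words, to its frequency f(s) in the collection."""
    nesting = defaultdict(set)   # s -> T^s, the candidates that nest s
    for ls in freq:
        for n in range(1, len(ls)):              # proper sub-string lengths
            for start in range(len(ls) - n + 1):
                s = ls[start:start + n]
                if s in freq:                    # s is nested in ls
                    nesting[s].add(ls)
    cv = {}
    for s, f_s in freq.items():
        weight = math.log2(len(s))   # zero for one-word candidates, as [16]
                                     # targets multi-word terms
        t_s = nesting.get(s, set())
        if not t_s:                  # s is not nested in any ls
            cv[s] = weight * f_s
        else:                        # subtract the average nested frequency
            cv[s] = weight * (f_s - sum(freq[ls] for ls in t_s) / len(t_s))
    return cv

For instance, for freq = {('data',): 10, ('data', 'mining'): 6, ('data', 'mining', 'tool'): 2}, the sketch returns $cv = \log_2(2) \cdot (6 - 2/1) = 4$ for ('data', 'mining').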
4.2 Merged Partial C-values

Definition 2 (Partial c-value). The partial c-value of the term candidate string $s$ extracted from the partial document collection $D_i$ is computed as:

$$pcv_i(s) = \begin{cases} \log_2(|s|) \cdot f_i(s), & \text{if } s \text{ is not nested in any } ls \text{ extracted from } D_i \\ \log_2(|s|) \cdot \left(f_i(s) - \frac{1}{P(T_i^s)} \sum_{ls \in T_i^s} f_i(ls)\right), & \text{otherwise,} \end{cases} \quad (3)$$

where: $T_i^s$ is the set of term candidate strings $ls$, that nest $s$, extracted from $D_i$; and $f_i(.)$ is the number of occurrences of $s$ or $ls$ in $D_i$.

Lemma 1 (The total frequency of nested occurrences). The total value of the frequency of nested occurrences, in $D$, of a term candidate string $s$ in longer term candidate strings $ls$ is the sum of the total frequency values of nested occurrences in all partial collections $D_i$ of $D$:

$$total(s) = \sum_{ls \in T^s} f(ls) = \sum_{i=1}^{n} \left( \sum_{ls \in T_i^s} f_i(ls) \right) = \sum_{i=1}^{n} total_i(s). \quad (4)$$

Proof. It follows from Definition 2 (of the partial c-value) that $total_i(s)$ is the total number of occurrences of the term candidate string $s$ in all longer term candidate strings $ls$ extracted from the partial collection $D_i$. The number of these longer term candidate strings equals $P(T_i^s)$. Due to the disjointness of the partial collections $D_i$ (Condition 2 of Definition 1), $f(ls) = \sum_{i=1}^{n} f_i(ls)$. Therefore, and due to Condition 1 of Definition 1, the total number of occurrences of $s$ in all $ls$ extracted from $D$ is:

$$total(s) = \sum_{ls \in T^s} f(ls) = \sum_{ls \in T^s} \sum_{i=1}^{n} f_i(ls) = \sum_{ls \in \bigcup_{i=1}^{n} T_i^s} \left( \sum_{i=1}^{n} f_i(ls) \right) = \sum_{i=1}^{n} \left( \sum_{ls \in T_i^s} f_i(ls) \right) = \sum_{i=1}^{n} total_i(s). \; \Box$$

Definition 3 (Merged partial c-value). The merged partial c-value of the term candidate string $s$ is computed as:

$$mpcv(s) = \sum_{i=1}^{n} pcv_i(s). \quad (5)$$
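As a small worked example of these definitions (the numbers are illustrative, not from our experiments): let $D$ be split into $D_1, D_2$, and let $s$ be a two-word candidate, so $\log_2(|s|) = 1$. In $D_1$, $f_1(s) = 3$ and $s$ is nested in a single $ls$ with $f_1(ls) = 1$, so $pcv_1(s) = 3 - \frac{1}{1} \cdot 1 = 2$; in $D_2$, $f_2(s) = 2$ and $s$ is not nested, so $pcv_2(s) = 2$, giving $mpcv(s) = 4$. On the whole of $D$, $f(s) = 5$, $T^s = \{ls\}$ with $f(ls) = 1$ and $P(T^s) = 1$, so $cv(s) = 5 - \frac{1}{1} \cdot 1 = 4 = mpcv(s)$. The equality is exact here because $s$ is nested in the same single $ls$ in both views; this is the situation that Case 3 of Theorem 1 below generalizes.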
The following Theorem 1 allows computing $cv(s)$ for the whole collection $D$ based on merging the known partial c-values $pcv_i(s), i = 1, \ldots, n$ for the partial collections $D_i, i = 1, \ldots, n$ of $D$.

Theorem 1 (Equality of $cv$ and $mpcv$). If a document collection $D$ is partitioned as $\{D_i\} = \{D_i\}_{i=1}^{n}$, which means that Conditions 1 and 2 (1) hold, then

$$cv(s) = mpcv(s). \quad (6)$$

Proof. The proof is structured in three cases: (1) $s$ is never nested in an $ls$; (2) for every $D_i$, $s$ is nested at least once in at least one $ls$; and (3) $s$ is nested in an $ls$ for only some of the $D_i$.

Case 1: not nested. If, $\forall i = 1, \ldots, n$, $s$ extracted from $D_i$ is not nested in any $ls$ extracted from $D_i$, then $s$ is not nested in any $ls$ extracted from $D$. Therefore, for such an $s$:

$$mpcv(s) = \sum_{i=1}^{n} pcv_i(s) = \sum_{i=1}^{n} \log_2(|s|) \cdot f_i(s) = \log_2(|s|) \cdot \sum_{i=1}^{n} f_i(s) = \log_2(|s|) \cdot f(s) = cv(s), \quad (7)$$

due to Conditions 1 and 2 (1) and the definition of $f(.)$.

Case 2: nested everywhere. If, $\forall j = 1, \ldots, n$, $s$ extracted from $D_j$ is nested in an $ls$ extracted from $D_j$, then: (i) this $s$ (extracted from $D$) is nested in this $ls$ (extracted from $D$); and (ii) $ls \in T_j^s \subset T^s$, because $D_j \subset D$ due to Condition 1.
Therefore:

$$\begin{aligned} mpcv(s) &= \sum_{i=1}^{n} pcv_i(s) = \sum_{i=1}^{n} \log_2(|s|) \cdot \left( f_i(s) - \frac{1}{P(T_i^s)} \sum_{ls \in T_i^s} f_i(ls) \right) = \\ &= \log_2(|s|) \cdot \left( \sum_{i=1}^{n} f_i(s) - \sum_{i=1}^{n} \left( \frac{1}{P(T_i^s)} \sum_{ls \in T_i^s} f_i(ls) \right) \right) = \\ &= \log_2(|s|) \cdot \left( f(s) - \sum_{i=1}^{n} \left( \frac{1}{P(T_i^s)} \sum_{ls \in T_i^s} f_i(ls) \right) \right) \approx_{h1} \\ &\approx_{h1} \log_2(|s|) \cdot \left( f(s) - \frac{1}{P(T^s)} \sum_{i=1}^{n} \left( \sum_{ls \in T_i^s} f_i(ls) \right) \right) = \\ &= \log_2(|s|) \cdot \left( f(s) - \frac{1}{P(T^s)} \sum_{ls \in T^s} f(ls) \right) = cv(s). \end{aligned} \quad (8)$$

Here "$\approx_{h1}$" stands for "hypothetically, approximately equal". The hypothesis $h1$ is about the approximate equality used in (8): replacing the per-collection weights, $\sum_{i=1}^{n} \left( \frac{1}{P(T_i^s)} \, (\cdot) \right) \approx \frac{1}{P(T^s)} \sum_{i=1}^{n} (\cdot)$. Formally, $\sum_{i=1}^{n} \frac{1}{P(T_i^s)} > \frac{1}{P(T^s)}$. However, asymptotically, $P(T_i^s) = o(total_i(s))$ and $P(T^s) = o(total(s))$, due to: (i) the overlaps in $T_i^s$; and (ii) the possible nestings in several instances of $ls$. Therefore, the influence of these denominators in (8) becomes lower with the growth of the volume of $D$ and of its partial collections $D_i$. This is a promise that $h1$ might hold true.

Case 3: $s$ is sometimes nested in an $ls$. There exist several partial collections $D_j$ to which Case 2 applies; to the rest of the partial collections, $D_k$, Case 1 applies. In this situation, two partial sums, $mpcv_1(s)$ and $mpcv_2(s)$, are computed for these disjoint subsets of the partition of $D$. Similarly to Case 2, $cv(s) \approx_{h1} mpcv_1(s) + mpcv_2(s)$.

Hence, if the hypothesis $h1$ holds true, Cases 1-3 prove Theorem 1. □

For checking $h1$, complete ($cv$) and merged partial ($mpcv$) c-values are experimentally computed and compared, as presented in Sect. 5.
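Pending that experimental check, Theorem 1 can at least be sanity-checked on toy data by reusing the hypothetical c_values sketch from Sect. 4.1: compute $cv$ over the merged frequency table and compare it with the sum of the per-part c-values. The numbers below are those of the worked example after Definition 3.

from collections import Counter

part_freqs = [
    {('data', 'mining'): 3, ('data', 'mining', 'tool'): 1},  # f_1 over D_1
    {('data', 'mining'): 2},                                  # f_2 over D_2
]
merged = Counter()
for freq in part_freqs:
    merged.update(freq)                  # f(s) = sum_i f_i(s), as in Lemma 1

cv = c_values(dict(merged))              # complete c-values over D
mpcv = Counter()
for freq in part_freqs:
    for s, v in c_values(freq).items():  # pcv_i per partial collection
        mpcv[s] += v                     # mpcv(s) = sum_i pcv_i(s), per (5)

for s in sorted(cv):
    print(s, cv[s], mpcv[s])             # equal here; in general, equal up
                                         # to hypothesis h1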
A straightforward corollary of Theorem 1 is that c-values do not depend on the partitioning of a document collection.

Corollary 1 (Size of a partial collection). Let $\{D_i\}_{i=1}^{n}$ and $\{D_j\}_{j=1}^{m}$, $n \neq m$, be two different partitions of a document collection $D$. Then:

$$\forall s, \; mpcv(s)|_{\{D_i\}} = mpcv(s)|_{\{D_j\}} \approx_{h1} cv(s), \quad (9)$$

where: $s$ is a term candidate string extracted from the document collection $D$; $mpcv(s)|_{\{D_i\}}$ is the merged partial c-value of the term candidate string $s$ computed for the partition $\{D_i\}$ of $D$; and $mpcv(s)|_{\{D_j\}}$ is the merged partial c-value of $s$ computed for the partition $\{D_j\}$ of $D$.

Based on Corollary 1, the size of a partial collection $D_i \in \{D_i\}$ may be reasonably chosen based on the specifics of the problem and the available hardware resources, RAM in particular. One possible scenario might be extracting terms from a stream of textual documents, like blog posts or tweets. In this setting, the size of a partial collection has to be smaller than the size of the stream window.

Algorithm MPCV. Merge partial c-values from two bags of terms
Input: Ti, Ti+1 – the bags of retained significant terms. Each term Ti.term is accompanied
       by its Ti.pcv. Ti, Ti+1 are sorted in the descending order of Ti.pcv, Ti+1.pcv.
Output: the bag of terms Ti+1 with Ti.pcv merged into Ti+1.pcv for every term
1.  resort := .FALSE.
2.  for k := 1 to |Ti|
3.    match := .FALSE.
4.    for m := 1 to |Ti+1|
5.      if (Ti.term[k] = Ti+1.term[m])
6.        then begin Ti+1.pcv[m] += Ti.pcv[k]; match := .TRUE.; end
7.    end for
8.    if (.NOT. match)
9.      then begin append(Ti.term[k]+Ti.pcv[k], Ti+1); resort := .TRUE.; end
10. end for
11. if (resort) then sort(Ti+1, Ti+1.pcv, desc)

Fig. 1: The MPCV algorithm for merging partial c-values in two bags of retained significant terms

The MPCV algorithm (Fig. 1) is used for merging partial c-values in the bags of significant terms retained from the textual datasets representing the partial document collections of the complete document collection.
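A dictionary-based Python sketch of the same merge is given below. It is functionally equivalent to Fig. 1 rather than a transcription of the released MPCV module: a hash lookup replaces the inner scan, and one final sort replaces the conditional resort.

def merge_pcv(t_i, t_next):
    """Merge the bag of terms t_i into t_next (Fig. 1, dictionary form).
    Each bag maps a retained term to its (merged) partial c-value."""
    for term, pcv in t_i.items():
        # matching terms accumulate their pcv; unseen terms are appended
        t_next[term] = t_next.get(term, 0.0) + pcv
    # keep the bag sorted by descending merged pcv, as the algorithm requires
    return dict(sorted(t_next.items(), key=lambda kv: kv[1], reverse=True))

# Example: folding the per-partition bags T_1..T_n into T_i^m:
#   merged = {}
#   for bag in bags:          # bags produced by the ATE step per partition
#       merged = merge_pcv(bag, merged)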
5 Experimental Evaluation

The idea of the experimental evaluation is to compare the conventional and optimized processing pipelines by checking:

- Correctness. Are the merged partial c-values computed using the optimized pipeline practically the same as the c-values computed by the conventional pipeline?
- Execution time. What is the difference in the duration of the extraction of the same bags of retained significant terms between the conventional and optimized pipelines?

Checking correctness validates the hypothesis $h1$ (Sect. 4) to fully prove Theorem 1. If $h1$ holds true, the optimized processing pipeline could be used for extracting terms in the process of measuring terminological saturation in document collections. Comparing the execution times of the conventional and optimized processing pipelines allows assessing the efficiency of the optimized pipeline.

5.1 Experimental Data

The document collection used in our experiments is the DMKD-300 collection, which contains a subset of 300 full-text articles from the Springer journal on Data Mining and Knowledge Discovery^3, published between 1997 and 2010. These papers have been automatically pre-processed to plain texts [24] and have not been cleaned. Therefore, the resulting datasets, representing partial collections, were moderately noisy.

We have chosen the increment ($inc$) for generating the datasets to be 20 papers. Hence, based on the available texts, we have generated, using our Dataset Generator software (Sect. 5.2):

- 15 incrementally extended datasets $D_1 = \{d_j\}_{j=1}^{20}$ (20 papers), $D_2 = D_1 \cup \{d_j\}_{j=1}^{20}$ (40 papers), ..., $D_{15} = D_{14} \cup \{d_j\}_{j=1}^{20}$ (300 papers) for the conventional pipeline^4
- 15 datasets of $inc$ size forming the partition $\{D_i = \{d_j\}_{j=1}^{20}\}_{i=1}^{15}$ of the DMKD-300 collection for the optimized pipeline^5

^3 https://link.springer.com/journal/10618
^4 DMKD-300 collection in plain texts: http://dx.doi.org/10.17632/knb8fgyr8n.1#folder-637dc34c-fa29-4587-9f63-df0e602d6e86; incrementally enlarged datasets generated of these texts: http://dx.doi.org/10.17632/knb8fgyr8n.1#folder-b307088c-9479-43fb-8197-a12a66ff685b
^5 The partition of the DMKD-300 collection: https://github.com/OntoElect/Data/blob/master/DMKD-300-DCF-Part.zip

The descending citation frequency (DCF) order [6] of adding documents to partial collections has been used in both cases; this dataset generation step is sketched below.
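The sketch mimics, under stated assumptions, what our C# Dataset Generator does with the DCF-ordered file list; it is an illustration, not the Dataset Generator code itself.

def generate_datasets(ordered_files, inc=20):
    """Build both dataset families from a DCF-ordered list of plain-text
    files: the incrementally extended datasets D_1 c D_2 c ... for the
    conventional pipeline, and the disjoint partition {D_i} of the same
    collection for the optimized pipeline."""
    incremental = [ordered_files[:i + inc]
                   for i in range(0, len(ordered_files), inc)]
    partition = [ordered_files[i:i + inc]
                 for i in range(0, len(ordered_files), inc)]
    return incremental, partition

# For DMKD-300 with inc = 20, this yields 15 nested datasets of
# 20, 40, ..., 300 papers and 15 disjoint datasets of 20 papers each.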
5.2 Instrumental Software and Computational Environment

Our experimental workflow is supported by the instrumental software we developed and used. The toolset is concisely presented in Table 2.

Table 2: The modules of the instrumental software toolset used in the experiments

| Phase / Task | Tool | Input | Output | Implementation | Constraints |
|---|---|---|---|---|---|
| Pre-Processing Phase: Generate Datasets | Dataset Generator | the folder with plain text documents; the XLS file with citation frequencies and document file names | the folder with plain text datasets; the table with run time per dataset | C#, https://github.com/OntoElect/Code/tree/master/DataSetGen-cs | |
| Terms Extraction Phase: Extract Terms | UPM Term Extractor [29] | the folder with plain text datasets | the folder with the bags of terms; the table with run time per bag of terms | Java, https://github.com/ontologylearning-oeg/epnoi-legacy | English texts only; c-value method [16]; input data of at most 15 Mb |
| Terms Extraction Phase: Merge partial c-values | MPCV | the folder with the bags of terms; the list of files to be processed | the folder with the bags of terms with merged c-values; the table with run time per bag of terms | Python, https://github.com/OntoElect/Code/tree/master/MPCV | |
| Post-Processing Phase: Compute Terminological Differences | Baseline THD | the folder with the bags of terms; the list of files to be processed | the table containing eps and thd values for the consecutive pairs of the bags of terms | Python, https://github.com/OntoElect/Code/tree/master/THD | uses the baseline THD algorithm [4] |

5.3 Experimental Flow

The set-up of our experiments includes the configuration of the execution flow in two parallel processing pipelines, conventional and optimized, as pictured in Fig. 2.

The conventional pipeline implements the processing of incrementally extended document sub-collections, as explained in Sect. 2. It takes in the files of the document collection in the specified (DCF) order and generates the incrementally extended datasets (Sect. 5.1) using the Dataset Generator (Table 2). At the next step, the datasets are fed into the term extractor (Table 2), which outputs the bags of extracted terms $T_i$ and measures the run times $time_i$.

The optimized pipeline implements the processing of the partitioned document sub-collections, as explained in Sect. 4. It takes the files of the document collection in the same (DCF) order and generates the partition datasets (Sect. 5.1) using the Dataset Generator (Table 2). At the next step, the datasets are fed into the term extractor (Table 2), which outputs the bags of extracted terms $T_i$ and measures the run times. At the subsequent step, the extracted bags of terms are fed into the merger module (Table 2), which applies the MPCV algorithm (Sect. 4.2) consecutively to the pairs $\{T_i, T_{i+1}\}$, as pictured in Fig. 2. As a result, the merged bags of terms $T_i^m = \bigcup_{k=1}^{i} T_k$ are generated. The run times of the merge operation are also measured. The required total run times ($time_i^m$) are computed as the sums of the respective term extraction and merge run times.

Fig. 2: The execution flow of evaluation experiments (the document collection, in plain texts, feeds both pipelines; each pipeline generates its datasets and extracts and retains significant terms; the optimized pipeline additionally merges the pcv in the bags of terms; the resulting bags of terms are compared pairwise with $thd$)

After executing these two parallel branches, if $h1$ (Sect. 4) holds true, $T_i$ coming from the conventional pipeline and $T_i^m$ coming from the optimized pipeline have to contain very similar sets of terms with approximately the same c-values. This is checked by applying the THD algorithm [4] implemented in the Baseline THD module (Table 2). THD is applied: (i) to the pairs $\{T_i, T_{i+1}\}$ and $\{T_i^m, T_{i+1}^m\}$ for comparing the saturation curves of the conventional and optimized cases; and (ii) to the pairs $\{T_i, T_i^m\}$ for computing the terminological difference between the hypothetically identical sets of terms.

All the computations, except term extraction, have been run on a Windows 7 64-bit HP ZBook 17 G3 PC with: Intel® Core™ i7-6700HQ CPU @ 2.60 GHz; 8.0 GB on-board memory; NVIDIA Quadro M3000M GPU. Term extraction has been run on a server with an Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10 GHz, 64 cores, and 256 GB RAM.

5.4 The Results of the Experiments and Discussion

The results of our experiment are presented in tabular form in Table 3 and graphically in Fig. 3 and Fig. 4.
Table 3: thd measurements and run times for the conventional and optimized pipelines

| No (i) | eps | thd(T_{i-1}, T_i) | thd(T^m_{i-1}, T^m_i) | thd(T_i, T^m_i) | Run time for T_i (sec) | Run time for T^m_i (sec) |
|---|---|---|---|---|---|---|
| 1 | 12.00 | 62.89 | 62.89 | 0.00 | 30.41 | 31.94 |
| 2 | 15.50 | 31.51 | 32.22 | 3.64 | 60.47 | 32.79 |
| 3 | 18.00 | 23.38 | 22.17 | 2.85 | 85.34 | 31.64 |
| 4 | 19.65 | 18.04 | 17.63 | 3.08 | 111.72 | 31.85 |
| 5 | 23.22 | 15.80 | 15.57 | 3.31 | 147.30 | 37.11 |
| 6 | 24.00 | 10.07 | 9.50 | 3.86 | 153.78 | 32.25 |
| 7 | 24.00 | 10.67 | 10.14 | 3.40 | 195.27 | 41.49 |
| 8 | 26.00 | 9.14 | 9.19 | 5.28 | 217.62 | 43.28 |
| 9 | 28.00 | 8.68 | 8.54 | 6.95 | 268.59 | 47.16 |
| 10 | 28.53 | 7.56 | 7.42 | 5.21 | 296.49 | 41.32 |
| 11 | 28.53 | 7.30 | 6.44 | 6.13 | 324.39 | 41.58 |
| 12 | 28.53 | 6.53 | 6.56 | 5.88 | 363.05 | 45.20 |
| 13 | 30.00 | 6.43 | 5.42 | 10.07 | 401.94 | 40.55 |
| 14 | 38.00 | 15.47 | 3.13 | 9.09 | 401.19 | 39.81 |
| 15 | 38.00 | 3.53 | 15.37 | 6.14 | 459.75 | 50.12 |

Fig. 3(a) clearly shows that the results of the saturation measurements using the bags of terms coming from the optimized pipeline (the $thd(T^m_{i-1}, T^m_i)$ column in Table 3 and the Merged Partial curve in Fig. 3(a)) and from the conventional pipeline (the $thd(T_{i-1}, T_i)$ column in Table 3 and the Incremental curve in Fig. 3(a)) are practically the same, except for the last two measurements. The deviation at the tail could be explained by regular noise being accumulated differently in the two cases. A nice side result in this context is that the optimized pipeline, using merged partial c-values, accumulates less regular noise.

Fig. 3(b) clearly pictures the fact that the difference between the bags of terms $T_i$ and $T_i^m$ does not exceed approximately 1/3 of the individual term significance threshold $eps$ that is used to cut off insignificant terms. In combination, these two observations reliably validate^6 our hypothesis $h1$ (Sect. 4).

The comparison of the run times presented in Table 3 and Fig. 4 clearly demonstrates that the optimized pipeline, with near-constant run times, significantly outperforms the conventional pipeline.

^6 One may argue that the reported experiment is just an experiment with one document collection; hence, for a different document collection, the results might be different regarding the validity of $h1$. Our counter-argument is that the computation of c-values is collection- and domain-independent. Furthermore, the terms with the same c-value are randomly distributed in the documents of the collection.

Fig. 3: Merged partial c-values computed using the optimized pipeline are practically the same as the c-values computed using the conventional pipeline ((a) saturation measurements; (b) terminological differences)

Fig. 4: Run times of the conventional (Incremental) and optimized (Merged Partial) pipelines

6 Conclusions and Future Work

The contribution of this paper is the proposal of computing the significance scores (c-values) for term candidates extracted from a document collection using not the incrementally extended datasets, representing sub-collections, but the partitions of the collection. It has been proven formally, up to the validity of the $h1$ hypothesis, that this optimized approach is correct, i.e. gives practically the same results. The hypothesis has been validated experimentally, by comparing the outputs coming from the conventional and optimized processing pipelines.
The experiment also clearly showed that the proposed way of text processing very substantially outperforms the conventional approach. The run times measured while processing the partitions of the document collection remained almost constant across the consecutive steps. A tiny increase was observed due to the very small overhead of merging the bags of terms extracted from the partition datasets. Yet one more advantage of the proposed approach is that the partition datasets can be processed independently, as they do not overlap in data. Hence, the optimized pipeline is straightforwardly parallelizable. This fact opens the way to processing real-world document collections at industrial scale for finding the terminological cores within these collections. Choosing a proper partition size also removes the limitation of many software term extractors on the volume of input data.

Our plan for future work is to apply the optimized processing pipeline to detect terminological saturation in an industrial-size paper collection in the domain of Knowledge Management.

References

1. Wong, W., Liu, W., Bennamoun, M.: Ontology learning from text: a look back and into the future. ACM Comput. Surv. 44(4), Article 20, 36 pages (2012). http://doi.acm.org/10.1145/2333112.2333115
2. Ermolayev, V.: OntoElecting requirements for domain ontologies. The case of time domain. EMISA Int. J. of Conceptual Modeling 13(Sp. Issue), 86--109 (2018)
3. Tatarintseva, O., Ermolayev, V., Keller, B., Matzke, W.-E.: Quantifying ontology fitness in OntoElect using saturation- and vote-based metrics. In: Ermolayev, V., et al. (eds.) Revised Selected Papers of ICTERI 2013, CCIS, vol. 412, pp. 136--162 (2013)
4. Chugunenko, A., Kosa, V., Popov, R., Chaves-Fraga, D., Ermolayev, V.: Refining terminological saturation using string similarity measures. In: Ermolayev, V., et al. (eds.) Proc. ICTERI 2018. Volume I: Main Conference, Kyiv, Ukraine, May 14-17, 2018, CEUR-WS vol. 2105, pp. 3--18 (2018, online). http://ceur-ws.org/Vol-2105/10000003.pdf
5. Ermolayev, V., Batsakis, S., Keberle, N., Tatarintseva, O., Antoniou, G.: Ontologies of time: review and trends. International Journal of Computer Science and Applications 11(3), 57--115 (2014)
6. Kosa, V., Chaves-Fraga, D., Naumenko, D., Yuschenko, E., Moiseenko, S., Dobrovolskyi, H., Vasileyko, A., Badenes-Olmedo, C., Ermolayev, V., Corcho, O., Birukou, A.: The influence of the order of adding documents to datasets on terminological saturation. Technical Report TS-RTDC-TR-2018-2-v2, 21.11.2018, Dept. of Computer Science, Zaporizhzhia National University, Ukraine, 72 p. (2018)
7. Fahmi, I., Bouma, G., van der Plas, L.: Improving statistical method using known terms for automatic term extraction. In: Computational Linguistics in the Netherlands, CLIN 17 (2007)
8. Wermter, J., Hahn, U.: Finding new terminology in very large corpora. In: Clark, P., Schreiber, G. (eds.) Proc. 3rd Int. Conf. on Knowledge Capture, K-CAP 2005, pp. 137--144, Banff, Alberta, Canada, ACM (2005). http://doi.org/10.1145/1088622.1088648
9. Zhang, Z., Iria, J., Brewster, C., Ciravegna, F.: A comparative evaluation of term recognition algorithms. In: Proc. 6th Int. Conf. on Language Resources and Evaluation, LREC 2008, Marrakech, Morocco (2008)
10. Daille, B.: Study and implementation of combined techniques for automatic extraction of terminology. In: Klavans, J., Resnik, P. (eds.) The Balancing Act: Combining Symbolic and Statistical Approaches to Language, pp. 49--66. The MIT Press, Cambridge, Massachusetts (1996)
11. Caraballo, S. A., Charniak, E.: Determining the specificity of nouns from text. In: Proc. 1999 Joint SIGDAT Conf. on Empirical Methods in Natural Language Processing and Very Large Corpora, pp. 63--70 (1999)
12. Astrakhantsev, N.: ATR4S: toolkit with state-of-the-art automatic terms recognition methods in Scala. arXiv preprint arXiv:1611.07804 (2016)
13. Medelyan, O., Witten, I. H.: Thesaurus based automatic keyphrase indexing. In: Marchionini, G., Nelson, M. L., Marshall, C. C. (eds.) Proc. ACM/IEEE Joint Conf. on Digital Libraries, JCDL 2006, pp. 296--297, Chapel Hill, NC, USA, ACM (2006). http://doi.org/10.1145/1141753.1141819
14. Ahmad, K., Gillam, L., Tostevin, L.: University of Surrey participation in TREC8: weirdness indexing for logical document extrapolation and retrieval (WILDER). In: Proc. 8th Text REtrieval Conf., TREC-8 (1999)
15. Sclano, F., Velardi, P.: TermExtractor: a Web application to learn the common terminology of interest groups and research communities. In: Proc. 9th Conf. on Terminology and Artificial Intelligence, TIA 2007, Sophia Antipolis, France (2007)
16. Frantzi, K. T., Ananiadou, S.: The c/nc value domain independent method for multi-word term extraction. J. Nat. Lang. Proc. 6(3), 145--180 (1999). http://doi.org/10.5715/jnlp.6.3_145
17. Kozakov, L., Park, Y., Fin, T., Drissi, Y., Doganata, Y., Cofino, T.: Glossary extraction and utilization in the information search and delivery system for IBM Technical Support. IBM Systems Journal 43(3), 546--563 (2004). http://doi.org/10.1147/sj.433.0546
18. Astrakhantsev, N.: Methods and software for terminology extraction from domain-specific text collection. PhD thesis, Institute for System Programming of Russian Academy of Sciences (2015)
19. Bordea, G., Buitelaar, P., Polajnar, T.: Domain-independent term extraction through domain modelling. In: Proc. 10th Int. Conf. on Terminology and Artificial Intelligence, TIA 2013, Paris, France (2013)
20. Badenes-Olmedo, C., Redondo-García, J. L., Corcho, O.: Efficient clustering from distributions over topics. In: Proc. K-CAP 2017, ACM, New York, NY, USA, Article 17, 8 p. (2017)
21. Park, Y., Byrd, R. J., Boguraev, B.: Automatic glossary extraction: beyond terminology identification. In: Proc. 19th Int. Conf. on Computational Linguistics, pp. 1--7, Taipei, Taiwan (2002). http://doi.org/10.3115/1072228.1072370
22. Nokel, M., Loukachevitch, N.: An experimental study of term extraction for real information-retrieval thesauri. In: Proc. 10th Int. Conf. on Terminology and Artificial Intelligence, pp. 69--76 (2013)
23. Zhang, Z., Gao, J., Ciravegna, F.: JATE 2.0: Java automatic term extraction with Apache Solr. In: Proc. LREC 2016, pp. 2262--2269, Slovenia (2016)
24. Kosa, V., Chaves-Fraga, D., Naumenko, D., Yuschenko, E., Badenes-Olmedo, C., Ermolayev, V., Birukou, A.: Cross-evaluation of automated term extraction tools by measuring terminological saturation. In: Bassiliades, N., et al. (eds.) ICTERI 2017. Revised Selected Papers. CCIS, vol. 826, pp. 135--163 (2018)
25. Justeson, J., Katz, S. M.: Technical terminology: some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1(1), 9--27 (1995). http://doi.org/10.1017/S1351324900000048
26. Evans, D. A., Lefferts, R. G.: CLARIT-TREC experiments. Information Processing & Management 31(3), 385--395 (1995). http://doi.org/10.1016/0306-4573(94)00054-7
27. Church, K. W., Gale, W. A.: Inverse document frequency (IDF): a measure of deviations from Poisson. In: Proc. ACL 3rd Workshop on Very Large Corpora, pp. 121--130, Association for Computational Linguistics, Stroudsburg, PA, USA (1995). http://doi.org/10.1007/978-94-017-2390-9_18
28. Kim, J.-D., Ohta, T., Teteisi, Yu., Tsujii, J.: GENIA corpus – a semantically annotated corpus for bio-textmining. Bioinformatics 19(suppl. 1), i180--i182 (2003)
29. Corcho, O., Gonzalez, R., Badenes, C., Dong, F.: Repository of indexed ROs. Deliverable No. 5.4. Dr Inventor project (2015)