<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Discovering Distributional Thesauri Semantic Relations</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Bulgarian Language, Bulgarian Academy of Sciences</institution>
          ,
          <addr-line>52, Shipchensky proh. str., bl. 17, 1113 Sofia</addr-line>
          ,
          <country country="BG">Bulgaria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The paper presents a technique and an analysis for discovering distributional thesaurus relations by using the statistical similarity of word contexts. The application uses an educational electronic text corpus and the statistical search of the Sketch Engine software to extract and compare word collocations from that corpus. The semantic search is based on evaluating and comparing the common collocations of keywords by generating distributional thesauri and word sketch differences. The results of the search experiments on the British Academic Spoken English corpus are evaluated and presented.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Mining</kwd>
        <kwd>Big Data</kwd>
        <kwd>Hierarchical Categorization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Similarity search is a widely known technique for extracting semantically related
words. It is used not only to identify synonyms but also to extract semantic
relations between words in large electronic text corpora. Recent research in that
area [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] extends the search techniques by combining semantic and
information retrieval approaches, improving the search so that it can deal with more
complex semantic representations.
      </p>
      <p>
        Thus, the technique is applied to evaluate the semantic content of retrieved
electronic textual documents through systematic analysis of the structure of those documents.
Additionally, the traditional approaches have been improved by
linking text-based content to images using joint information sources [
        <xref ref-type="bibr" rid="ref2 ref3">2,3</xref>
        ] for
document classification.
      </p>
      <p>
        Generally, combined statistical similarity approaches have been successfully
applied to extract and compare words belonging to different thesauri by
comparing their contextual collocations. Existing applications
extend that approach to multilingual use and broaden its scope [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Below, we demonstrate the use of this technique by
presenting and analyzing search results for the generation and comparison of collocations
of related words in the British Academic Spoken English corpus using the specialized
Sketch Engine software.</p>
      <p>The research described presents results obtained during COST-STSMIC1302-36988
“Natural Language Processing Keyword Search for Related Languages” of COST
Action IC1302 “Semantic keyword-based search on structured data sources
(KEYSTONE)”.</p>
    </sec>
    <sec id="sec-2">
      <title>The Sketch Engine (SE)</title>
      <p>
        The SE software [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] offers several approaches for extracting semantic properties of words,
most of which are applicable to multiple languages. Keyword extraction is a
widely used technique for extracting the terms of a particular domain. Semantic
relations can also be extracted by generating word contexts through
concordances, which define context in quantitative terms. Further work
is then needed to extract semantic relations by searching for the co-occurrences
and collocations of a keyword.
      </p>
      <p>Co-occurrences and collocations are words that are most likely to be
found together with a keyword. They express the semantic relation between the
keyword and a particular collocate, which may be one of similarity or of
distance. We use the T-score, MI-score and MI3-score for
corpus processing and searching. The following terms are used for all three: N is the
corpus size, f<sub>A</sub> is the number of occurrences of the keyword in the whole corpus (the size
of the concordance), f<sub>B</sub> is the number of occurrences of the collocate in the whole
corpus, and f<sub>AB</sub> is the number of occurrences of the collocate in the concordance (the number
of co-occurrences). The formulas defining the T-score, MI-score and
MI3-score are as follows:
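        <disp-formula><tex-math><![CDATA[\text{MI-score} = \log_2 \frac{f_{AB}\, N}{f_A f_B}]]></tex-math></disp-formula>
        <disp-formula><tex-math><![CDATA[\text{T-score} = \frac{f_{AB} - \frac{f_A f_B}{N}}{\sqrt{f_{AB}}}]]></tex-math></disp-formula>
        <disp-formula><tex-math><![CDATA[\text{MI3-score} = \log_2 \frac{f_{AB}^{3}\, N}{f_A f_B}]]></tex-math></disp-formula>
      </p>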
      <p>
        The T-score, MI-score and MI3-score are also applicable to
multilingual parallel corpora. Collocations have been regarded as
statistically similar words [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] that can be extracted by using techniques for estimating
the strength of association between co-occurring words.
      </p>
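      <p>As an illustration, the three association scores can be computed directly from the four counts defined above. The following is a minimal sketch in Python, not the SE implementation: the function names are our own, the T-score is assumed to use the square root of the co-occurrence count in the denominator, and the counts are hypothetical values that are not taken from the BASE corpus.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def t_score(f_ab: int, f_a: int, f_b: int, n: int) -> float:
    """T-score: (f_AB - f_A * f_B / N) / sqrt(f_AB)."""
    return (f_ab - f_a * f_b / n) / math.sqrt(f_ab)


def mi_score(f_ab: int, f_a: int, f_b: int, n: int) -> float:
    """MI-score: log2(f_AB * N / (f_A * f_B))."""
    return math.log2(f_ab * n / (f_a * f_b))


def mi3_score(f_ab: int, f_a: int, f_b: int, n: int) -> float:
    """MI3-score: log2(f_AB**3 * N / (f_A * f_B))."""
    return math.log2(f_ab ** 3 * n / (f_a * f_b))


# Hypothetical counts, chosen only to keep the arithmetic easy to follow.
N = 1_000_000   # corpus size
F_A = 500       # occurrences of the keyword
F_B = 200       # occurrences of the collocate
F_AB = 20       # co-occurrences of keyword and collocate

scores = {
    "T-score": t_score(F_AB, F_A, F_B, N),
    "MI-score": mi_score(F_AB, F_A, F_B, N),
    "MI3-score": mi3_score(F_AB, F_A, F_B, N),
}

for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```
]]></preformat>
      <p>For any counts, the MI3-score exceeds the MI-score by twice the base-2 logarithm of the co-occurrence count, which is why it gives more weight to frequent pairs than the MI-score does.</p>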
      <p>
        The SE also offers further refinement of the extracted semantic relations by
evaluating the common collocations of words or their distributional thesaurus
semantic relations [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>In the following sections, we present and analyze search results for the extraction
and evaluation of common collocations from the British Academic Spoken English
corpus using the SE software.</p>
    </sec>
    <sec id="sec-3">
      <title>The British Academic Spoken English (BASE) corpus</title>
      <p>
        The British Academic Spoken English (BASE) corpus is a collection of
transcripts of lectures and seminars recorded at the University of Warwick and the University
of Reading in the UK during the period 1998-2005. It was created to analyze
English for Academic Purposes [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and it also allows the extraction of specific semantic
relations between terms and definitions in the subjects studied.
      </p>
      <p>The texts comprise 1 186 290 words and are distributed across four
broad domain areas: (i) Arts and Humanities, (ii) Life and Medical Sciences, (iii)
Physical Sciences and (iv) Social Studies and Sciences. The corpus is annotated
according to the Text Encoding Initiative Guidelines and was recently uploaded into
the SE, which allows the use of its built-in options for storing, sampling, searching
and filtering texts according to different criteria.</p>
    </sec>
    <sec id="sec-4">
      <title>Common Collocations Search Results</title>
      <p>
        We use the SE statistical options to extract the semantic
relations of words. The methodology includes generating a word's collocations [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and then
comparing them. To illustrate, we present search experiments and
their results for the word politics.
      </p>
      <p>
        For our research, we use the MI-score and apply a methodology already used to
extract specialized collocations in the mathematical domain [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Fig. 1 shows the
generated collocation candidates for the word politics. The results present
the most frequent words that are most likely to be found with the word politics,
such as electoral, international and gender.
      </p>
      <p>
        The results include specialized terms that can be part of a thesaurus, such as electoral
politics and international politics, but also attributive collocations such as
confrontational politics, which are based on a meaningful combination of a word
and its collocates [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>The SE word search options can not only extract statistically similar words
for building thesauri but also compare the collocations of words that belong to
more than one thesaurus, the so-called common collocations. Distributional
thesaurus search evaluates the common collocations of words that share semantic
relations.</p>
      <p>Generally, if two words have many collocations in common, they share
the semantic relations of a distributional thesaurus, and each will appear in the other's thesaurus.
The SE function for generating distributional thesauri can compare pairs of
words and show how they collocate.</p>
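      <p>The idea that words sharing many collocations end up in each other's thesaurus can be sketched as follows. This is a simplified illustration under stated assumptions: the scoring actually used by the SE is not reproduced here, plain Jaccard overlap of collocate sets stands in for it, and the function names and the small collocate sets are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, List, Set, Tuple


def collocate_overlap(collocates_a: Set[str], collocates_b: Set[str]) -> float:
    """Jaccard overlap of two collocate sets (0.0 = disjoint, 1.0 = identical)."""
    union = collocates_a | collocates_b
    if not union:
        return 0.0
    return len(collocates_a & collocates_b) / len(union)


def thesaurus(keyword: str, collocates: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Rank every other word by the share of collocates it has in common with keyword."""
    base = collocates[keyword]
    ranked = [
        (word, collocate_overlap(base, other))
        for word, other in collocates.items()
        if word != keyword
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


# Hypothetical toy data: a few collocates per word, grammatical relations ignored.
COLLOCATES = {
    "politics": {"bureaucracy", "affair", "opinion", "economics", "economy", "local"},
    "society": {"state", "speaker", "economics", "economy"},
    "enhancement": {"stimulus", "local"},
}

ranking = thesaurus("politics", COLLOCATES)
for word, score in ranking:
    print(f"{word}: {score:.3f}")
```
]]></preformat>
      <p>In this toy data, society shares two of eight distinct collocates with politics and enhancement shares one of seven, so society ranks first.</p>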
      <p>Fig. 2 shows the distributional thesaurus generated for the word politics and its
semantically related words, such as economy, society, production and organization.</p>
      <p>However, the results reveal two clusters of semantically similar words. The
words from the first cluster present the semantic relations of the word politics and relate
it to the words economy and society through common collocations that are
not presented directly. The results also include words that belong to the
same part-of-speech category and are semantically related.</p>
      <p>The hidden semantic relations between the words politics and society can be
evaluated by generating their word sketch differences and using them
to compare and contrast the two words by analyzing their collocations and
displaying their collocates. The results are presented in Fig. 3 and include
relational semantic properties of the words politics and society divided into categories
based on grammatical relations.</p>
      <p>Generally, the results contain relations such as and/or, subject-of,
etc. Within them, the displayed words share common collocations with the
keywords. Only the relation and/or returns words of the same part of speech,
ranked according to their statistical weight. It lists words such as bureaucracy,
affair and opinion, which are semantically related to the word politics. The same
relation connects the words state, speaker, etc., which are semantically related to the word
society.</p>
      <p>At the same time, the words economics and economy have similar
weight (they appear in the results for both politics and society) and are regarded as
common collocations of both keywords, sharing with them a hidden semantic
relation (and/or).</p>
      <p>The relation subject-of lists the words float, discount, lobby, etc.,
semantically related to the word politics, and the words provide, believe, depend, require, etc.,
semantically related to the word society. However, the words operate and become
have similar weight and are common collocations of both politics and society,
expressing their hidden semantic relation (subject-of).</p>
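      <p>The comparison performed by word sketch differences can be sketched as a per-relation set comparison. This is only an illustration, not the SE implementation: statistical weights are omitted, the names Sketch and sketch_difference are our own, and the word lists contain just the collocates named in the text rather than the complete results of Fig. 3.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Dict, Set

Sketch = Dict[str, Set[str]]  # grammatical relation -> set of collocates


def sketch_difference(a: Sketch, b: Sketch) -> Dict[str, Dict[str, Set[str]]]:
    """Split the collocates of two words, relation by relation."""
    result = {}
    for relation in sorted(set(a) | set(b)):
        left = a.get(relation, set())
        right = b.get(relation, set())
        result[relation] = {
            "only_a": left - right,   # collocates of the first word only
            "common": left & right,   # common collocations of both words
            "only_b": right - left,   # collocates of the second word only
        }
    return result


# Collocates named in the text for the pair politics/society.
POLITICS = {
    "and/or": {"bureaucracy", "affair", "opinion", "economics", "economy"},
    "subject_of": {"float", "discount", "lobby", "operate", "become"},
}
SOCIETY = {
    "and/or": {"state", "speaker", "economics", "economy"},
    "subject_of": {"provide", "believe", "depend", "require", "operate", "become"},
}

diff = sketch_difference(POLITICS, SOCIETY)
for relation, parts in diff.items():
    print(relation, "common:", sorted(parts["common"]))
```
]]></preformat>
      <p>The common part of each relation holds the words that the text describes as hidden semantic connections between the two keywords.</p>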
      <p>Additionally, the sketch differences generated for the word pair politics/economy
(Fig. 3) include words that have the same grammatical relations (and/or,
subject-of, etc.) with the keywords. The relation and/or lists more results
for the word politics than for the word pair politics/society, such as
class and care. For the word economy, that relation lists the words cost, element,
environment, creation, etc. However, the word society appears in the list of
results for both politics and economy under the same relation and can be regarded
as a hidden semantic connection between them.</p>
      <p>Consequently, the words politics and society relate to the word economy, and
together they form a cluster that shares a semantic relation of similarity, so these words
can be regarded as synonyms. The semantic relation was evaluated on the basis
of common collocation search by generating word sketch differences.</p>
      <p>The words from the second cluster present the semantic relation between the keyword
politics and the words enhancement, stakeholders and election, also through
common collocations that are not presented directly. The word sketch
differences generated for the word pair politics/enhancement are presented in Fig. 4.
They include the relations and/or, subject-of, modifier, object-of, etc.</p>
      <p>The results for the relation and/or list more words than the results for the word
pairs politics/society and politics/economy, among them policy, power,
society, history, people, etc. These words are also semantically related to the
keyword politics. That relation does not present common collocations between
the displayed words and the word pair.</p>
      <p>The results for the relation subject-of include more words than the results
for the same relation of the word pairs politics/society and politics/economy, among
them became, call and look, which are semantically related to the keyword politics.</p>
      <p>Under the same relation, the words learn and explain are connected
semantically to the word enhancement. However, the word get is a common collocation
of both politics and enhancement, connecting them by a hidden semantic relation
(subject-of).</p>
      <p>The results for the relation modifier display a rich list of words that are
semantically related to the keyword politics, among them international, confrontational,
electoral, etc. Generally, that relation expresses a connection between a term
(politics) and its hyponyms (usually multi-word terms such as international politics,
electoral politics, etc.). The results displayed for the keyword enhancement
give the word stimulus, presenting the combination stimulus enhancement. The
word local appears as a semantically related word in the list of results for both
politics and enhancement and is considered a common collocation that
connects the two words semantically. Thus, the resulting multi-word terms can be
local politics and local stimulus enhancement. The other relations for the word
pair politics/enhancement do not contain common collocations.</p>
      <p>The word sketch differences generated for the word pair politics/stakeholders
are also presented in Fig. 4 and include the same relations and/or, subject-of,
modifier, object-of, etc.</p>
      <p>The results for the relations and/or and subject-of do not contain any
common collocations for the word pair. However, the results for the relation
modifier are exactly the same as those generated for the word pair
politics/enhancement. Thus, the words that are semantically related to the keyword
politics give exactly the same combinations forming multi-word terms. The
result for the keyword stakeholders includes only the word local. The same word
is also semantically related to the word politics and is a common collocation of both
politics and stakeholders. The related multi-word terms are local politics and
local stakeholders.</p>
      <p>Thus, the word local connects not only the word pair politics/stakeholders
but also the word pair politics/enhancement, forming a triple of semantically
related words (politics/enhancement/stakeholders). The results for the other
generated relations do not contain any common collocations.</p>
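      <p>The way a single common collocation links several keywords can be sketched as an inverted index from collocates to keywords. The sketch below is illustrative only: the function name shared_collocates and the MODIFIERS dictionary are our own assumptions, and the word lists contain only the modifiers named in the text.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict
from typing import Dict, Set


def shared_collocates(modifiers: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each collocate to the keywords it occurs with, keeping shared ones only."""
    index = defaultdict(set)
    for keyword, collocates in modifiers.items():
        for collocate in collocates:
            index[collocate].add(keyword)
    # A collocate connects keywords only if it occurs with at least two of them.
    return {c: kws for c, kws in index.items() if len(kws) >= 2}


# Modifiers named in the text for the three keywords (relation: modifier).
MODIFIERS = {
    "politics": {"international", "confrontational", "electoral", "local"},
    "enhancement": {"stimulus", "local"},
    "stakeholders": {"local"},
}

shared = shared_collocates(MODIFIERS)
for collocate, keywords in shared.items():
    print(collocate, "->", sorted(keywords))
```
]]></preformat>
      <p>With these lists, local is the only modifier kept, and it groups politics, enhancement and stakeholders into the triple described above.</p>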
      <p>Part of the word sketch differences results for the word pair politics/election is
presented in Fig. 5 and includes the same relations and/or, subject-of, modifier,
object-of, etc. The results for the relations and/or, subject-of and modifier do
not include common collocations. The relation object-of contains almost the
same words as for the word pairs politics/enhancement and politics/stakeholders,
which are semantically related to the keyword politics. However, only the word organize
is listed among the words semantically related to both politics and election and is
considered a hidden connection. The other generated relations of the word pair
politics/election do not include any common collocations.</p>
      <p>Consequently, common collocation search reveals specific types of hidden
semantic relations and evaluates distributional thesauri by statistically ranking
and comparing the common collocations of two semantically related words.
Generating sketch differences enlarges the number of extracted semantically
related words, revealing their complex semantic structure and specific hierarchical
relations.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>The presented semantic search technique uses SE approaches for
common collocation search to evaluate distributional thesauri in an electronic
text corpus. It is also extended with the generation of word sketch differences, which
use grammatical relations to sort and filter the results. The results show
that, with this type of extended search, it is possible to reveal specific semantic
and grammatical features (part of speech) of a word by evaluating the underlying
relations between different word contexts.</p>
      <p>
        The technique can be used in multilingual applications [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] since it uses
statistical search and standard grammatical relations. It is also applicable to
terminology extraction and can be used to compile electronic or printed
dictionaries.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Azzopardi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          et al.,
          <article-title>Back to the Sketch-Board: Integrating Keyword Search, Semantics, and Information Retrieval</article-title>
          , In: Cali A., Gorgan D., Ugarte M. (eds)
          <source>Semantic Keyword-Based Search on Structured Data Sources, KEYSTONE 2016</source>
          , LNCS, vol.
          <volume>10151</volume>
          ,
          <year>2017</year>
          ,
          <fpage>49</fpage>
          -
          <lpage>61</lpage>
          , Springer.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cristani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tomazzoli</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <article-title>A multimodal approach to exploit similarity in documents</article-title>
          ,
          <source>In: Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science)</source>
          , vol.
          <volume>8481</volume>
          ,
          <year>2014</year>
          , Springer,
          <fpage>490</fpage>
          -
          <lpage>499</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Cristani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tomazzoli</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <article-title>A multimodal approach to relevance and pertinence of documents</article-title>
          ,
          <source>In: Lecture Notes in Computer Science (Subseries of Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          , vol.
          <volume>9799</volume>
          ,
          <year>2016</year>
          , Springer,
          <fpage>157</fpage>
          -
          <lpage>168</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Stoykova</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <article-title>Using Statistical Search to Discover Semantic Relations of Political Lexica - Evidences from Bulgarian-Slovak EUROPARL 7 Corpus</article-title>
          , In: I. Kotsireas, S. Rump and Ch. Yap (eds.),
          <source>Mathematical Aspects of Computer and Information Sciences, Lecture Notes in Computer Science</source>
          , vol.
          <volume>9582</volume>
          ,
          <year>2016</year>
          , Springer,
          <fpage>335</fpage>
          -
          <lpage>339</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kilgarriff</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          et al.,
          <article-title>The Sketch Engine: Ten Years On</article-title>
          , In:
          <source>Lexicography</source>
          ,
          <year>2014</year>
          ,
          <volume>1</volume>
          ,
          <fpage>17</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Sinclair</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <source>Corpus, Concordance, Collocation</source>
          ,
          <year>1991</year>
          , Oxford, OUP.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kilgarriff</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marcowitz</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thomas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <article-title>Corpora and Language Learning with the Sketch Engine and SkELL</article-title>
          , In:
          <source>Revue française de linguistique appliquée</source>
          ,
          <year>2015</year>
          , vol.
          <volume>XX</volume>
          (
          <issue>1</issue>
          ),
          <fpage>61</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Thompson</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <article-title>Changing the Bases for Academic Word Lists</article-title>
          , In: P. Thompson and G. Diani (eds.)
          <source>English for Academic Purposes: Approaches and Implications</source>
          ,
          <year>2015</year>
          ,
          Newcastle-upon-Tyne, Cambridge Scholars,
          <fpage>317</fpage>
          -
          <lpage>342</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Gledhill</surname>
            ,
            <given-names>Ch.</given-names>
          </string-name>
          ,
          <source>Collocations in Science Writing</source>
          ,
          <year>2000</year>
          , Tuebingen.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Stoykova</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitkova</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <article-title>Conceptual Semantic Relationships for Terms of Precalculus Study</article-title>
          ,
          <source>WSEAS Transaction on Advances in Engineering Education</source>
          ,
          <year>2011</year>
          , issue 1, vol.
          <volume>8</volume>
          ,
          <fpage>13</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Stoykova</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <article-title>Extracting Academic Subjects Semantic Relations Using Collocations</article-title>
          ,
          <source>EAI Endorsed Transactions on Energy Web and Information Technologies</source>
          , vol.
          <volume>4</volume>
          ,
          <issue>17</issue>
          (
          <issue>14</issue>
          ),
          <year>2017</year>
          , http://dx.doi.org/10.4108/eai.4-10-2017.153161
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Stankovic</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          et al.,
          <article-title>Keyword-Based Search on Bilingual Digital Libraries</article-title>
          , In: Cali A., Gorgan D., Ugarte M. (eds)
          <source>Semantic Keyword-Based Search on Structured Data Sources, KEYSTONE 2016</source>
          , LNCS, vol.
          <volume>10151</volume>
          ,
          <year>2017</year>
          ,
          <fpage>112</fpage>
          -
          <lpage>123</lpage>
          , Springer.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>