=Paper= {{Paper |id=Vol-1959/paper-11 |storemode=property |title=Discovering Distributional Thesauri Semantic Relations |pdfUrl=https://ceur-ws.org/Vol-1959/paper-11.pdf |volume=Vol-1959 |authors=Velislava Stoykova |dblpUrl=https://dblp.org/rec/conf/kdweb/Stoykova17 }} ==Discovering Distributional Thesauri Semantic Relations== https://ceur-ws.org/Vol-1959/paper-11.pdf
    Discovering Distributional Thesauri Semantic
                      Relations

                                  Velislava Stoykova

          Institute for Bulgarian Language, Bulgarian Academy of Sciences,
               52, Shipchensky proh. str., bl. 17, 1113 Sofia, Bulgaria
                                 vstoykova@yahoo.com



        Abstract. The paper¹ presents a technique and an analysis for
        discovering distributional thesauri relations by using the statistical
        similarity of different words' contexts. The application uses an
        educational electronic text corpus and the statistical search of the
        Sketch Engine software to extract and compare words' collocations from
        the corpus. The semantic search is based on the evaluation and
        comparison of common keyword collocations through the generation of
        distributional thesauri semantic relations and word sketch differences.
        The results of the related search experiments for the British Academic
        Spoken English corpus are evaluated and presented.

        Keywords: Data Mining, Big Data, Hierarchical Categorization.


1     Introduction
Similarity search is a widely known technique for extracting semantically
related words. It is used not only to evaluate synonyms but also to extract
semantic relations between words in large electronic text corpora. Recent
research in that area [1] extends the search techniques by combining semantic
and information retrieval approaches, improving the search so that it can deal
with more complex semantic representations.
    Thus, the technique is applied to evaluate the semantic content of
retrieved electronic textual documents through systematic analysis of the
structure of those documents. Additionally, the traditional approaches were
improved with the technique of linking text-based content to images using
joint information sources [2,3] for document classification.
    Generally, combined statistical similarity approaches have been
successfully applied to extract and compare words belonging to different
thesauri by comparing their related contextual collocations. Existing
applications improve the multilingual use and the universal scope of that
approach [4].
    Further, we demonstrate the use of this technique by presenting and
analyzing search results for the generation and comparison of collocations of
related words in the British Academic Spoken English corpus using the
specialized Sketch Engine software.

¹   The research described presents results obtained during COST-STSMIC1302-36988
    "Natural Language Processing Keyword Search for Related Languages" of COST
    Action IC1302 "Semantic keyword-based search on structured data sources
    (KEYSTONE)".


2   The Sketch Engine (SE)

The SE software [5] offers several approaches to extracting semantic properties
of words, and most of them have multilingual application. Keyword extraction is
a widely used technique for extracting the terms of a particular domain under
study. Semantic relations can also be extracted by generating related word
contexts through word concordances, which define context in quantitative terms;
further work is then needed to extract semantic relations by searching for
co-occurrences and collocations of the related keyword.
    Co-occurrences and collocations are words which are most likely to be
found with a related keyword. They indicate the semantic relations between the
keyword and a particular collocated word, which may be relations of similarity
or of distance. We use the T-score, MI-score and MI3-score techniques for
corpus processing and searching. For all of them, the following terms are used:
N – corpus size, f_A – number of occurrences of the keyword in the whole corpus
(the size of the concordance), f_B – number of occurrences of the collocated
word in the whole corpus, f_AB – number of occurrences of the collocate in the
concordance (number of co-occurrences). The formulas defining the MI-score,
T-score and MI3-score are as follows:


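\[ \text{MI-Score} = \log_2 \frac{f_{AB} \, N}{f_A \, f_B} \]

\[ \text{T-Score} = \frac{f_{AB} - \dfrac{f_A \, f_B}{N}}{\sqrt{f_{AB}}} \]

\[ \text{MI3-Score} = \log_2 \frac{f_{AB}^{3} \, N}{f_A \, f_B} \]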




    The T-score, MI-score and MI3-score are applicable to processing
multilingual parallel corpora as well. Collocations have been regarded as
statistically similar words [6] which can be extracted by using techniques for
estimating the strength of association between co-occurring words.
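
    The three association scores defined above can be sketched in a few lines
of Python. This is only an illustrative sketch: the function names and the
counts in the usage lines are hypothetical, chosen so that the arithmetic is
easy to follow, and are not taken from the BASE corpus or from the SE
implementation.

```python
import math


def mi_score(f_ab: int, f_a: int, f_b: int, n: int) -> float:
    """MI-score: log2(f_AB * N / (f_A * f_B))."""
    return math.log2(f_ab * n / (f_a * f_b))


def t_score(f_ab: int, f_a: int, f_b: int, n: int) -> float:
    """T-score: (f_AB - f_A * f_B / N) / sqrt(f_AB)."""
    return (f_ab - f_a * f_b / n) / math.sqrt(f_ab)


def mi3_score(f_ab: int, f_a: int, f_b: int, n: int) -> float:
    """MI3-score: log2(f_AB^3 * N / (f_A * f_B))."""
    return math.log2(f_ab ** 3 * n / (f_a * f_b))


# Hypothetical counts (not from the paper): corpus size N, keyword
# frequency f_A, collocate frequency f_B, co-occurrence frequency f_AB.
N, f_A, f_B, f_AB = 2 ** 20, 256, 64, 16

print(mi_score(f_AB, f_A, f_B, N))   # 10.0
print(t_score(f_AB, f_A, f_B, N))    # 3.99609375
print(mi3_score(f_AB, f_A, f_B, N))  # 18.0
```

    With these counts the MI3-score exceeds the MI-score by 2 log2(f_AB) = 8,
which shows how cubing the co-occurrence count gives more weight to frequent
co-occurrences.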
    The SE also offers further refinement of the extracted semantic relations
through evaluation of words' common collocations or evaluation of
distributional thesauri semantic relations [7].
    Further, we present and analyze search results for the extraction and
evaluation of common collocations from the British Academic Spoken English
corpus using the SE software.
3   The British Academic Spoken English (BASE) corpus

The British Academic Spoken English (BASE) corpus is a collection of
transcripts of lectures and seminars recorded at the University of Warwick and
the University of Reading in the UK during the period 1998-2005. It was created
to analyze English for Academic Purposes [8], and it also allows the extraction
of specific semantic relations between terms and definitions among the subjects
studied.
    The texts comprise 1 186 290 words and are distributed across four broad
domain areas: (i) Arts and Humanities, (ii) Life and Medical Sciences, (iii)
Physical Sciences and (iv) Social Studies and Sciences. The corpus is annotated
according to the Text Encoding Initiative Guidelines and was recently uploaded
into the SE, allowing the use of its built-in options for storing, sampling,
searching and filtering texts according to different criteria.


4   Common Collocations Search Results

We use the SE statistical options to extract words' semantic relations. The
methodology includes the generation of a word's collocations [9] and their
further comparison. To illustrate it, we present search experiments and related
results for the word politics.




      Fig. 1. The collocation candidates of the word politics from the BASE corpus.
    For our research, we use the MI-score and apply a methodology already used
to extract specialized collocations in the mathematical domain [10]. Fig. 1
shows the generated collocation candidates for the word politics. The results
present the most frequent words that are most likely to be found with the word
politics. They are: electoral, international, gender, etc.
    The results include specialized terms that can be part of thesauri, like
electoral politics and international politics, but also attributive
collocations like confrontational politics, which are based on a meaningful
combination of the word and its collocates [11].




Fig. 2. The generated distributional thesaurus and related semantic relations of
the word politics from the BASE corpus.


    The SE word search options can not only extract statistically similar words
for building thesauri but also compare the collocations of words which belong
to more than one thesaurus – the so-called common collocations. Distributional
thesauri search evaluates words' common collocations which share common
semantic relations.
    Generally, if two words have many collocations in common, they share
semantic relations of distributional thesauri and each will appear in the
other's thesaurus. The SE function for generating words' distributional
thesauri can compare pairs of words and show how they collocate.
    Fig. 2 shows the generated distributional thesaurus for the word politics
and its semantically related words like economy, society, production,
organization, etc.
    However, the results reveal two clusters of semantically similar words. The
words from the first cluster present semantic relations of the word politics
and relate it to the words economy and society through common collocations
which are not presented directly. The results also include words belonging to
one and the same part-of-speech category which are semantically related.
Fig. 3. The word sketch differences of the word pairs politics/society and
politics/economy from the BASE corpus.
    The hidden semantic relations between the words politics and society can be
evaluated by generating their word sketch differences and using them to compare
and contrast the two words by analyzing their collocations and displaying their
collocates. The results are presented in Fig. 3 and include relational semantic
properties of the words politics and society divided into categories based on
grammatical relations.
    Generally, the results contain relations like and/or, subject-of, etc.
Within them, the displayed words share common collocations with the related
keywords. Only the relation and/or returns words of the same part of speech,
ranked according to their statistical weight. It lists words like bureaucracy,
affair, opinion, etc., which are also semantically related to the word
politics. The same relation connects the words state, speaker, etc., which are
semantically related to the word society.
    At the same time, the word economics and the word economy have similar
weight (they appear in the results for both politics and society) and are
regarded as common collocations of both keywords, sharing with them hidden
semantic relations (and/or).
    The relation subject-of lists the words float, discount, lobby, etc.
semantically related to the word politics, and the words provide, believe,
depend, require, etc. semantically related to the word society. However, the
words operate and become have similar weight and are common collocations of
both politics and society, expressing their hidden semantic relations
(subject-of).
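
    The comparison described above can be sketched as a partition of each
grammatical relation's collocates into those found only with the first word,
those shared by both, and those found only with the second. The sketch below is
a simplification under stated assumptions: the function name sketch_difference
is hypothetical, the collocate lists are the partial lists quoted in the text
(the "etc." items are omitted), and the statistical weights that the SE uses
for ranking are ignored.

```python
def sketch_difference(first, second):
    """Split collocates per grammatical relation into three groups:
    only with the first word, common to both, only with the second word."""
    result = {}
    for relation in sorted(set(first) | set(second)):
        a = set(first.get(relation, ()))
        b = set(second.get(relation, ()))
        result[relation] = {
            "only_first": sorted(a - b),
            "common": sorted(a & b),
            "only_second": sorted(b - a),
        }
    return result


# Partial collocate lists as quoted in the text for politics and society.
politics = {
    "and/or": ["bureaucracy", "affair", "opinion", "economics", "economy"],
    "subject-of": ["float", "discount", "lobby", "operate", "become"],
}
society = {
    "and/or": ["state", "speaker", "economics", "economy"],
    "subject-of": ["provide", "believe", "depend", "require",
                   "operate", "become"],
}

diff = sketch_difference(politics, society)
print(diff["and/or"]["common"])      # ['economics', 'economy']
print(diff["subject-of"]["common"])  # ['become', 'operate']
```

    The "common" group of each relation corresponds to what the text calls
common collocations, that is, the hidden semantic connection between the two
keywords under that relation.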
    Additionally, the sketch differences generated for the word pair
politics/economy (Fig. 3) include words which have the same grammatical
relations (and/or, subject-of, etc.) with the related keywords. The relation
and/or lists more results for the word politics than were listed for the word
pair politics/society, such as class and care. For the word economy, that
relation lists the words cost, element, environment, creation, etc. However,
the word society appears in the list of results for both politics and economy
under the same relation and can be regarded as a hidden semantic connection
between them.
    Consequently, the words politics and society relate to the word economy,
and together they form a cluster which shares a semantic relation of
similarity, so that those words can be regarded as synonyms. The semantic
relation was evaluated on the basis of common collocation search through the
generation of word sketch differences.
    The words from the second cluster present a semantic relation between the
keyword politics and the words enhancement, stakeholders and election, also
through common collocations which are not presented directly. The generated
word sketch differences for the word pair politics/enhancement are presented in
Fig. 4. They include the relations and/or, subject-of, modifier, object-of,
etc.
    The results for the relation and/or list more words than the results for
the word pairs politics/society and politics/economy, among which are policy,
power, society, history, people, etc. These words are also semantically related
to the keyword politics. That relation does not present common collocations
between the displayed words and the word pair.
    The results for the relation subject-of include more words than the results
for the same relation of the word pairs politics/society and politics/economy,
among which are the words became, call, look, which are semantically related to
the keyword politics.
Fig. 4. The word sketch differences of the word pairs politics/enhancement and
politics/stakeholders from the BASE corpus.
    Under the same relation, the words learn and explain are semantically
connected to the word enhancement. However, the word get is a common
collocation of both politics and enhancement, connecting them by a hidden
semantic relation (subject-of).
    The results for the relation modifier display a rich list of words which
are semantically related to the keyword politics, among which are
international, confrontational, electoral, etc. Generally, that relation
expresses a connection between a term (politics) and its hyponyms (usually
multi-word terms – international politics, electoral politics, etc.). The
results displayed for the keyword enhancement give the word stimulus,
presenting the combination stimulus enhancement. The word local appears as a
semantically related word in the list of results for both politics and
enhancement and is considered a common collocation which semantically connects
both words. Thus, the resulting multi-word terms can be local politics and
local stimulus enhancement. The other relations for the word pair
politics/enhancement do not contain common collocations.




Fig. 5. The word sketch differences of the word pair politics/election from the
BASE corpus.
     The generated word sketch differences for the word pair
politics/stakeholders are also presented in Fig. 4 and include the same
relations and/or, subject-of, modifier, object-of, etc.
     The results for the relations and/or and subject-of do not contain any
common collocations relating to the word pair. However, the results for the
relation modifier are exactly the same as those generated for the word pair
politics/enhancement. Thus, the words which are semantically related to the
keyword politics give exactly the same combinations forming multi-word terms.
The result for the keyword stakeholders includes only the word local. The same
word is also semantically related to the word politics and is a common
collocation of both politics and stakeholders. The related multi-word terms are
local politics and local stakeholders.
     Thus, the word local connects not only the word pair politics/stakeholders
but also the word pair politics/enhancement, forming a triple of semantically
related words (politics/enhancement/stakeholders). The results of the other
generated relations do not contain any common collocations.
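
     The way a shared collocate links several word pairs into a larger group
can be sketched as follows. This is a minimal illustration, not the SE
procedure: the helper name linking_collocates is hypothetical, and the modifier
sets contain only the words quoted in the text for this relation.

```python
from collections import defaultdict

# Modifier collocates as quoted in the text (partial lists).
modifiers = {
    "politics": {"international", "confrontational", "electoral", "local"},
    "enhancement": {"stimulus", "local"},
    "stakeholders": {"local"},
}


def linking_collocates(keyword, others, collocates):
    """Map each collocate of `keyword` to the other words that share it."""
    links = defaultdict(list)
    for other in others:
        for word in sorted(collocates[keyword] & collocates[other]):
            links[word].append(other)
    return dict(links)


links = linking_collocates("politics", ["enhancement", "stakeholders"], modifiers)
print(links)  # {'local': ['enhancement', 'stakeholders']}
```

     A collocate that maps to more than one word, as local does here, marks the
triple politics/enhancement/stakeholders described above.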
     Part of the word sketch differences results for the word pair
politics/election is presented in Fig. 5 and includes the same relations
and/or, subject-of, modifier, object-of, etc. The results for the relations
and/or, subject-of and modifier do not include common collocations. The
relation object-of contains almost the same words as for the word pairs
politics/enhancement and politics/stakeholders, which are semantically related
to the keyword politics. However, only the word organize is listed among the
words semantically related to both politics and election and is considered a
hidden connection. The other generated relations of the word pair
politics/election do not include any common collocations.
     Consequently, common collocations search reveals specific types of hidden
semantic relations and evaluates distributional thesauri by statistically
ranking and comparing the common collocations of two semantically related
words. The generation of sketch differences enlarges the number of extracted
semantically related words, revealing their complex semantic structure and
specific hierarchical relations.


5   Conclusion

The presented semantic search technique uses the SE approaches for common
collocations word search to evaluate distributional thesauri in an electronic
text corpus. It is also extended with the generation of word sketch
differences, which use grammatical relations to sort and filter the results.
The results show that with this type of extended search it is possible to
reveal specific semantic and grammatical features of words (part of speech) by
evaluating the underlying relations between different word contexts.
    The technique can be used in multilingual applications [12] since it uses
statistical search and standard grammatical relations. It is also applicable to
terminology extraction and can be used for the compilation of electronic or
printed dictionaries as well.
References
 1. Azzopardi, J. et al., Back to the Sketch-Board: Integrating Keyword Search, Seman-
    tics, and Information Retrieval, In: Cali A., Gorgan D., Ugarte M. (eds) Semantic
    Keyword-Based Search on Structured Data Sources, KEYSTONE 2016, LNCS,
    vol. 10151, 2017, 49–61, Springer.
 2. Cristani, M., Tomazzoli, C., A multimodal approach to exploit similarity in docu-
    ments, In: Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in
    Computer Science), vol. 8481, 2014, Springer, 490–499.
 3. Cristani, M., Tomazzoli, C., A multimodal approach to relevance and pertinence
    of documents, In: Lecture Notes in Computer Science (Subseries of Lecture Notes
    in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9799, 2016,
    Springer, 157–168.
 4. Stoykova, V., Using Statistical Search to Discover Semantic Relations of Political
    Lexica – Evidences from Bulgarian–Slovak EUROPARL 7 Corpus, In: I. Kotsireas,
    S. Rump and Ch. Yap (eds.), Mathematical Aspects of Computer and Information
    Sciences, Lecture Notes in Computer Sciences, vol. 9582, 2016, Springer, 335–339.
 5. Kilgarriff, A. et al., The Sketch Engine: Ten Years On, In: Lexicography, 2014,
    1, 17–36.
 6. Sinclair, J., Corpus, Concordance, Collocations, 1991, Oxford, OUP.
 7. Kilgarriff, A., Markowitz, F., Smith, S., Thomas, J., Corpora and Language
    Learning with the Sketch Engine and SkELL, In: Revue francaise de linguistique,
    2015, 1, vol. XX, 61–80.
 8. Thompson, P., Changing the Bases for Academic Word Lists, In: P. Thompson
    and G. Diani (eds.) English for Academic Purposes: Approaches and Implications,
    2015, Newcastle-upon-Tyne, Cambridge Scholars, 317–342.
 9. Gledhill, Ch., Collocations in Science Writing, 2000, Tuebingen.
10. Stoykova, V., Mitkova, M., Conceptual Semantic Relationships for Terms of Pre-
    calculus Study, WSEAS Transaction on Advances in Engineering Education, 2011,
    issue 1, vol. 8, 13–22.
11. Stoykova, V., Extracting Academic Subjects Semantic Relations Using Colloca-
    tions, EAI Endorsed Transactions on Energy Web and Information Technologies,
    vol. 4, 17(14), 2017, http://dx.doi.org/10.4108/eai.4-10-2017.153161
12. Stankovic, R. et al., Keyword-Based Search on Bilingual Digital Libraries, In: Cali
    A., Gorgan D., Ugarte M. (eds) Semantic Keyword-Based Search on Structured
    Data Sources, KEYSTONE 2016, LNCS, vol. 10151, 2017, 112–123, Springer.