=Paper= {{Paper |id=Vol-2763/CPT2020_paper_s3-10 |storemode=property |title=Research and development of linguistic-statistical methods for forming a portrait of a subject area |pdfUrl=https://ceur-ws.org/Vol-2763/CPT2020_paper_s3-10.pdf |volume=Vol-2763 |authors=Oleg Zolotarev }} ==Research and development of linguistic-statistical methods for forming a portrait of a subject area== https://ceur-ws.org/Vol-2763/CPT2020_paper_s3-10.pdf
  Research and development of linguo-statistical methods for forming a
                      portrait of a subject area
                                                     Oleg V. Zolotarev
                                                    ol-zolot@yandex.ru
                                      ANO HE «Russian New University», Moscow, Russia

    The project aims to solve the fundamental scientific problem of semantic modeling. Within its framework, a methodology is developed for the automated identification of translation links (translation correspondences), as well as hierarchical, synonymous and associative links, from Internet texts, and for the construction of multilingual associative hierarchical portraits of a subject area (MAHPSA), in particular for autonomous uninhabited underwater vehicles (UUVs). Taking multilingual and heterogeneous resources into account makes it possible to obtain a more complete picture of what is happening in a subject area, to identify the sources of ideas and the speed and directions of their spread, and to single out significant documents and promising directions. The solution is based on an integrated approach that combines methods of statistics, corpus linguistics and distributional semantics, and is implemented in a technology that involves the development of linguo-statistical mechanisms for forming a multilingual associative hierarchical portrait of a subject area: a dictionary of significant terms of the subject area whose elements are organized into synonymous series (synsets) that include translation correspondences, as well as associative and hierarchical relationships.
    Keywords: Linguo-statistical methods, associative and hierarchical portrait of the subject area, multilingual integrated ontology, forecasting the spread of ideas, multilingual corpus of the subject area.

1. Introduction

    The growth of information volumes on the Internet significantly complicates the search for information. Semantic search and the comparison of multilingual documents make it possible to find new and interesting trends and ideas, which significantly reduces the cost of developing and popularizing new areas of science. Using a multilingual associative hierarchical portrait of a subject area when comparing documents allows texts to be compared not only by the phrases they share, but also by the objects and processes they describe. MAHPSA makes it possible to determine the semantic similarity of documents even when the documents have no words in common. MAHPSA also makes it possible to calculate integrated statistics over a multilingual collection and to determine significant documents and promising areas without translating the documents into a single language. This is important for the automatic processing of large numbers of documents (Big Data). The construction of MAHPSA will make it possible not only to compare documents and search for new ideas, but also to solve other problems associated with the rapid analysis of large amounts of information.

2. Technique of automatic formation of a multilingual associative-hierarchical portrait of a subject area

    The essence of the proposed method for the formation of a multilingual associative-hierarchical portrait of a subject domain consists in iteratively expanding an initial multilingual dictionary of significant phrases into a hierarchy of multilingual synonymous series (synsets). The method can be stated as the following algorithm:
1) Compiling a collection of multilingual texts by means of a directed keyword search in databases of scientific documents (for example, Dimensions);
2) Word processing by means of the PullEnti program: tokenization and metatokenization;
3) Automatic generation of glossaries of terms and megalemmas; expert quality control of the generated dictionaries;
4) Automatic selection of topics on the basis of thematic modeling methods; formation of a dictionary of subject areas; selection of sets of subject-area keywords; expert control and topic correction;
5) Formation of a dictionary of key terms mapped to topics;
6) Compilation of frequency dictionaries of domain terms (using statistical methods);
7) Compilation of frequency dictionaries of subject-domain megalemmas;
8) Building multilingual synsets by combining BabelNet resources and a megalemma dictionary;
9) Building SVPs using a neural network model (a combination of Word2Vec with multilingual recurrent neural networks, RNNs) for texts that have undergone preprocessing;
10) Performing hierarchical clustering using Word2Vec and RNNs, taking into account the hierarchical relationships of synsets;
11) Constructing an ordered list of candidates for hierarchical relationships from the associative connections of the neural network model; viewing and correction of hierarchical relations is implemented on the basis of the Keywen Knowledge Architect resource [1].

3. Methodology for calculating integral statistics based on MAHPSA

    MAHPSA is created automatically on the basis of statistical analysis of large volumes of texts from the Internet. The hierarchical connections that make up the MAHPSA form a hierarchy and classifier that facilitate search and navigation in the multilingual subject area of UUVs.
    The proposed methodology also includes the integration of various MAHPSAs with multilingual linguistic resources (WordNet, Wikipedia, BabelNet, etc.) to obtain the largest possible multilingual ontology with relevant knowledge and improved coverage of terminology in the subject areas under consideration. The combined (integral) ontology contains a hierarchy of synonymic series (synsets) of multilingual terms, including Russian, and

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY
4.0)
serves as the basis for constructing a single multilingual vector space that makes it possible to evaluate the semantic proximity of multilingual texts, synsets and terms, similarly to the NASARI and MAFFIN methods. The translation correspondences between the multilingual synsets of MAHPSA are built using Word2Vec technology. The integral ontology makes it possible to calculate integrated multilingual statistics and trends in the use of terms and ideas, which in turn allows the spread of ideas between languages to be predicted and promising directions to be determined. A measure of the semantic proximity of multilingual documents makes it possible to identify implicit links between documents and to determine significant documents, which is necessary for collecting high-quality information from the open Internet and building large, relevant multilingual text corpora for the subject area. Thus, increasing the size and quality of the integral ontology will allow a better similarity measure and subject text corpus to be built; extracting knowledge from these will, in turn, further increase the size and quality of the integral ontology.
    The methodology includes not only the identification of significant documents, but also the identification of trends and of promising areas for the development of science.
    To develop the first version of the integrated-statistics methodology based on MAHPSA, it is necessary to:
1) Conduct morphological, syntactic and partially semantic analysis of the text;
2) Select typed objects (named entities);
3) Identify formal elements for the representation of concepts;
4) Develop a structure and software for storing a multilingual collection of documents;
5) Create dictionaries for storing structured information;
6) Develop neural network algorithms for calculating integrated statistics based on MAHPSA.
    The first version of a program for highlighting interlingual implicit connections and assessing the semantic similarity of phrases in different languages has been developed.
    Text processing is carried out using the PullEnti program [2], a product that has won computational linguistics competitions held as part of the Dialogue conference.
    PullEnti is a linguistic processor developed at the Institute of Informatics Problems. It is constantly being refined and performs morphological, syntactic and partially semantic analysis of text, distinguishing typed objects (named entities).
    The PullEnti SDK includes the following main blocks:
1) Tokenization: breakdown of the text into words (tokens) (Fig. 1 [2-12]);
2) Morphological analysis: determination of parts of speech for tokens (a POS tagger that outputs all possible variants for a word form, regardless of its surrounding context). The supported languages are Russian, Ukrainian and English. Normalization (reduction of a word form to the desired case / gender / number) is available, as are the processing of unknown and new words and a mode for correcting errors (Fig. 2 [2-12]);
3) Named entity recognition (NER) [13]: a set of analyzers that find entities of the corresponding type (person, organization, geographical object, etc.) in sequences of tokens (Fig. 3 [2-12]);
4) A set of tools for working with numerical data, noun and verb groups, brackets and quotation marks, dictionaries of terms and abbreviations, various checks (for example, equivalence of strings in Latin and Cyrillic letters) and other useful features that appeared during the solution of practical problems (Fig. 4 [2-12]);
5) A derivational dictionary: a dictionary of so-called derivational groups (sets of same-root words of different parts of speech, where one group may contain words in different languages), a group management model (what can come after a group), synonymy, etc.;
6) Semantic representation: tokens are structured in the form of a graph with semantic connections to solve more complex problems related to meaning [14].
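    As a rough illustration of how Word2Vec-style vectors can propose translation correspondences between multilingual terms, consider the following minimal sketch. It is not the authors' implementation: the terms and vector values are invented, and a real system would use embeddings trained on the multilingual corpus described above.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def translation_candidate(term, src_vectors, tgt_vectors):
    """Pick the target-language term whose vector is closest
    (by cosine similarity) to the source term's vector."""
    u = src_vectors[term]
    return max(tgt_vectors, key=lambda t: cosine(u, tgt_vectors[t]))

# Toy vectors standing in for a shared multilingual embedding space.
en = {"vehicle": [0.9, 0.1, 0.2]}
ru = {"аппарат": [0.85, 0.15, 0.25], "краска": [0.1, 0.9, 0.3]}
translation_candidate("vehicle", en, ru)  # -> "аппарат"
```

    Nearest-neighbour search under cosine similarity is a standard way to rank such candidates; in the methodology described here, the resulting list would then be subject to the statistical and expert checks discussed in the paper.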
                                                      Fig. 1. Tokenization




Fig. 2. Morphological analysis
Fig. 3. Highlighting the named entities




        Fig. 4. Numeric Tools
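    The tokenization stage illustrated in Fig. 1 above can be sketched as follows: every token keeps its character offsets in the source text, in the spirit of the offset-carrying tokens described in this paper. The class and function names below are ours, not the real PullEnti API.

```python
import re

class TextToken:
    """A word token that remembers where it sits in the source text
    (begin/end character positions, end inclusive). Illustrative only."""
    def __init__(self, text, begin_char, end_char):
        self.text = text
        self.begin_char = begin_char  # index of the first character
        self.end_char = end_char      # index of the last character

def tokenize(source):
    """Split text into word tokens while keeping character offsets."""
    return [TextToken(m.group(), m.start(), m.end() - 1)
            for m in re.finditer(r"\w+", source)]

tokens = tokenize("Tokens keep offsets.")
# tokens[1].text == "keep", spanning characters 7..10
```

    Keeping offsets is what later allows merged structures to refer back to the exact fragment of the source text they cover.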
    Specially for this project, the linguistic processor has been modified so that implicit links in documents can be highlighted more accurately.
    The concept of a token (the Token base class) is at the heart of the PullEnti SDK model. Each token refers to a contiguous fragment of the source text (the BeginChar and EndChar positions). First, the text is divided into a sequence of text tokens (TextToken); during processing these are then converted, merging into metatokens (MetaToken). A metatoken is a token that has "absorbed" a fused sequence of other tokens. Metatokens represent, for example, the places where named entities occur in the text (ReferentToken). Metatokens can also represent various numerical data (such as numbers written out in words), noun phrases (for example, NounPhraseToken, a class inherited from MetaToken), etc. Most of the elements obtained and used during analysis are metatokens.
    The concept of PullEnti metatokens served as the basis for building dictionaries of megalemmas, each of which can consist of several tokens or metatokens. The megalemma is the basis for comparing meaningful phrases across different languages; the concept of a megalemma is broader than that of a metatoken, since it additionally covers the identification of connections between different languages.
    Megalemma dictionaries are constructed using the method for determining the proximity of terms [11]. It is this method that makes it possible to form megalemmas on the basis of statistical patterns of term occurrence within the framework of forming an associative-hierarchical portrait of a subject area.
    Thematic dictionaries of megalemmas are formed for individual subject areas and serve as the basis for the classification of texts. Megalemma dictionaries are also used to represent knowledge in ontologies and to automatically supplement ontologies with relevant vocabulary.
    The synset was chosen as the formal element for the representation of concepts. It is the basis of knowledge representation in systems such as WordNet and BabelNet, and it is a well-established and generally accepted concept [15]. Synsets can be chained together: megalemmas are represented as chains of synsets. The concept of a synset is inherently oriented toward multilingualism.
    The work was carried out in two subject areas: "computer graphics and visualization" and "autonomous uninhabited underwater vehicles".
    Algorithms for the semantic analysis of information have been developed [2-11, 15], along with prototypes of software components for the semantic analysis of textual information.
    Implicit links are searched for using the megalemma dictionary. First, the text is processed with the PullEnti program: the words in the text are normalized, named entities are selected (NER, named entity recognition), and dictionaries of tokens and metatokens are formed for the text. Next, a thematic analysis of the text is carried out using the megalemma dictionaries. In the megalemma dictionaries, as already mentioned, each megalemma is correlated with a specific document and a specific subject area. This allows texts to be classified by subject area and documents to be statistically analyzed for the presence of implicit references. According to the publication dates of the documents, the source document of a megalemma and the documents that link to it are determined.
    To control the quality of the automatic detection of implicit links, methods of collective intelligence and crowdsourcing were used [17]. It was proposed to check the quality of implicit-link detection using an expert approach.
    The probability of a correct group decision is given by the mathematical model

              K0 = Σ_{i=0}^{⌊M/2⌋} C(M, i) G_R^(M−i) (1 − G_R)^i

    In accordance with this formula, the probability K0 of a correct decision by a group of M experts is determined from the probability G_R of a correct decision by a single expert, where C(M, i) is the binomial coefficient and the sum runs over the outcomes in which at most ⌊M/2⌋ experts err. The analysis of expert estimates showed a rather high level of detection of implicit links and of determination of the semantic similarity of phrases and documents.
    Software for storing a multilingual collection of documents was developed, as was a software implementation of thematic modeling methods that uses dictionaries of megalemmas in the subject areas [18].
    As a result of processing collections of documents, dictionaries of terms and dictionaries of megalemmas are built, and statistics on the use of terms and megalemmas in articles are collected.
    BabelNet is an integration resource built on the following resources: WordNet, Wikipedia, OmegaWiki, Wiktionary, Wikidata, Wikiquote, VerbNet, Microsoft Terminology, GeoNames, ImageNet, FrameNet, WN-Map, Open Multilingual WordNet, WoNeF, Albanet, Arabic WordNet (AWN v2), BulTreeBank WordNet (BTB-WN), Chinese Open WordNet, Chinese WordNet (Taiwan), DanNet, Greek WordNet, Princeton WordNet, Persian WordNet, FinnWordNet, WOLF (WordNet Libre du Français), Hebrew WordNet, Croatian WordNet, IceWordNet, MultiWordNet, ItalWordNet, Japanese WordNet, Multilingual Central Repository, WordNet Bahasa, Open Dutch WordNet, Norwegian WordNet, plWordNet, OpenWN-PT, Romanian WordNet, Lithua.
    BabelNet is fully integrated with Babelfy, a multilingual word-sense disambiguation and entity-linking system. BabelNet is also integrated with the Wikipedia Bitaxonomy [20], which is built around two hierarchies: a page hierarchy and a category hierarchy [15].
    Integration with BabelNet will be carried out by analogy with the approach that BabelNet itself uses to integrate the resources described above, using automatic mapping and the filling of lexical gaps in resource-poor languages by means of statistical machine translation. The result is an "encyclopedic dictionary" that provides concepts and named entities lexicalized in many languages and connected by a large number of semantic relations [21]. Additional vocabulary and definitions are added by reference to free resources such as WordNet, OmegaWiki, the English Wiktionary, Wikidata, FrameNet, VerbNet and others. Like WordNet, BabelNet groups words in different languages into sets of synonyms called Babel synsets. For each Babel synset, BabelNet provides short definitions (called glosses) in many languages, taken from both WordNet and Wikipedia.
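    The group-decision model given above can be checked numerically. The sketch below is our own (the function name is not from the paper); it sums the binomial probabilities of the outcomes in which at most half of the experts err:

```python
from math import comb

def group_decision_probability(m, g_r):
    """Probability K0 that a group of m experts decides correctly when
    each expert is independently correct with probability g_r: the sum
    over the cases in which at most floor(m/2) experts err."""
    return sum(comb(m, i) * g_r ** (m - i) * (1 - g_r) ** i
               for i in range(m // 2 + 1))

# Three experts, each correct 80% of the time:
group_decision_probability(3, 0.8)  # probability ≈ 0.896
```

    For three experts who are each correct 80% of the time, the group is correct with probability of about 0.896, which illustrates why collective expert checking raises the reliability of the assessment.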
    In the future, it is planned to use the Babelscape products [22], which make it possible to analyze documents, perform semantic markup of texts, build semantic knowledge graphs in several languages, etc.; however, this issue requires additional careful study [15].
    The dictionaries of terms and megalemmas proposed within the framework of the project make it possible not only to classify texts, but also to identify implicit links between articles.
    The structure of the glossary of terms is represented by a tuple:
                  Dterm = <IDterm, Term>,                  (1)
where Dterm is the glossary of terms, IDterm is the term identifier in the dictionary, and Term is the term.
    The structure of the megalemma dictionary is represented by a tuple:
                  Dmeg = <IDmeg, MegL>,                    (2)
where Dmeg is the megalemma dictionary, IDmeg is the megalemma identifier in the dictionary, and MegL is the megalemma.
    The structure of the document dictionary is represented by a tuple:
        Ddoc = <IDdoc, NAMEdoc, SRCdoc, YEARdoc, NUMwrd>,  (3)
where Ddoc is the document dictionary, IDdoc is the document identifier in the dictionary, NAMEdoc is the document name, SRCdoc is the publication source, YEARdoc is the publication year, and NUMwrd is the total number of terms in the document.
    The structure of the domain dictionary is represented by a tuple:
                  Dsa = <IDsa, SA>,                        (4)
where Dsa is the domain dictionary, IDsa is the domain identifier in the dictionary, and SA is the domain name.
    While the Dterm dictionary is a general glossary of terms, the dictionaries of documents contain the terms of a document together with the frequency of occurrence of each term in the document. The same applies to the dictionary of megalemmas. These two dictionaries are associative tables in the database; an associative table implements a many-to-many relationship between entities.
    The structure of the dictionary of terms of a document is represented by a tuple:
              Dtd = <IDterm, IDdoc, Fterm>,                (5)
where Dtd is the dictionary of terms of the document and Fterm is the relative frequency of occurrence of the term in the document, calculated as follows: first, all insignificant words (stop words, rare words, etc.) are removed from the document so that only the terms remain; then the number of occurrences of the term is divided by the total number of terms in the document.
    The structure of the dictionary of megalemmas of a document is represented by a tuple:
              Dmd = <IDmeg, IDdoc, Fmeg>,                  (6)
where Dmd is the dictionary of megalemmas of the document and Fmeg is the relative frequency of the megalemma in the document, calculated by dividing the number of occurrences of the megalemma by the total number of megalemmas in the document.
    The structure of the keyword dictionary is represented by a tuple:
              Dkeywrd = <IDterm, IDsa>,                    (7)
Keywords are taken from the general vocabulary of terms and correlated with a subject area. This is also an associative table.
    The structure of the dictionary correlating documents with subject areas is presented below:
                  Ddsa = <IDdoc, IDsa>,                    (8)
where Ddsa is the dictionary of subject areas of a document. One document can belong to several subject areas.

4. Results

    A program implementing the topic-modeling methods and identifying implicit links between documents was developed [23]. The megalemma dictionary is used to determine implicit references; the task is to determine the source of a megalemma and the links to it. A storage structure and methods for constructing a multilingual collection of synsets (synonymous series) were also developed.
    A neural network algorithm was developed that uses token tagging (flagging) and the Word2vec method, modified by the team of authors as already described, to identify Russian-language terms in texts that are contextually similar in lexical meaning [24].
    The methodology for constructing forecasts of the development of new directions is based on the ratio of the relative frequencies of occurrence of the same megalemmas calculated over adjacent years. This approach eliminates the problem of retraining neural networks as new information accumulates.
    An analysis of clustering and topic-modeling methods for assessing the quality and significance of texts was carried out [25]. Various topic-modeling methods were considered, including the vector model, latent semantic analysis, latent Dirichlet allocation, and others. These methods are based on a probabilistic approach, i.e. a term or document is correlated with several topics, each with a certain probability. A disadvantage of this approach is the fully automatic formation of the list of topics.

5. Conclusion

    As a result of this scientific research, a number of results of high scientific and applied significance will be obtained:
1. An updated multilingual collection of scientific texts in various languages, containing more than 60 thousand scientific documents and more than 6 thousand internal bibliographic references. This collection will make it possible to accurately calculate the significance of documents using the scientific citation index (SCI), based on the number of bibliographic references, as well as the context scientific citation index (CSCI), calculated from the number of implicit references identified through the semantic similarity of texts.
2. The developed technique for the automatic formation of a multilingual associative-hierarchical portrait of a subject area (MAHPSA) containing a hierarchy of multilingual synonymous series (synsets). With the help of MAHPSA it is possible to solve a wide range of problems, including calculating the semantic similarity of texts, identifying multilingual plagiarism, and expanding queries in multilingual search.
3. The developed methodology and algorithms for calculating integrated multilingual statistics based on MAHPSA, including the identification of significant documents, trends and promising areas. By applying the technique to a multilingual collection, new concepts will be revealed, the dynamics of their
development over time will be considered, and promising areas for the development of the subject area will be identified. On this basis, it will be possible to build forecasts of promising areas of research.
4. The developed methodology for integrating MAHPSA with other ontologies and linguistic resources, including BabelNet, which contains millions of multilingual synsets. As a result, the shortcomings of BabelNet related to its low coverage of Russian terms will be overcome. For the integrated resources, updated ratings of document significance will be calculated and updated forecasts of promising research areas in the selected subject areas will be constructed.

Acknowledgment

    The reported study was funded by RFBR according to research projects № 18-07-00225, 18-07-00909, 18-07-01111, 19-07-00455 and 20-04-60185.

References

[1] J. Galbraith and R. Thayer. SECSH Public Key File Format. draft-ietf-secsh-publickeyfile-01.txt, March 2001, work in progress.
[2] Zolotarev O.V., Sharnin M.M., Klimenko S.V., Kuznetsov K.I. The PullEnti system: information extraction from natural language texts and automated building of information systems. In: Situational centers and information-analytical systems of class 4i for monitoring and security tasks (SCVRT2015-16). Proceedings of the International Scientific Conference: in 2 volumes. 2016. P. 28-35.
[3] Zolotarev O.V., Kozerenko E.B., Sharnin M.M. The principles of constructing models of business processes in the subject area based on natural language text processing. Bulletin of the Russian New University. Series: Complex systems: models, analysis and control. 2014. No. 4. P. 82-88.
[4] Zolotarev O.V. Methods and tools for domain modeling. In: The Civilization of Knowledge: Problems and Prospects of Social Communications. Proceedings of the XIII International Scientific Conference. 2012. P. 71-72.
[5] Zolotareva V.P., Yashkova N.V., Zolotarev O.V. Project management. Educational-methodical manual. Nizhny Novgorod, 2016.
[6] Zolotarev O.V. Formalization of knowledge about the subject area based on the analysis of natural language structures. In: The Civilization of Knowledge: the problem of man in science of the XXI century. Proceedings of the XII International Scientific Conference. 2011. P. 78-80.
[7] Zolotarev O.V., Sharnin M.M. Methods of extracting knowledge from natural language texts and building business process models based on the allocation of
[8] Sharnin M.M., Zolotarev O.V., Somin N.V. Extracting and processing knowledge from unstructured texts of the business sphere and social networks. In: Social computing: fundamentals, development technologies, social and humanitarian effects. Materials of the Fourth International Scientific and Practical Conference. 2015. P. 364-371.
[9] Zolotarev O.V., Kozerenko E.B., Sharnin M.M. Analytical intelligence based on the analysis of unstructured information from various sources, including the Internet and the media. Bulletin of the Russian New University. Series: Complex systems: models, analysis and control. 2015. No. 1. P. 49-54.
[10] Zolotarev O.V. New approaches in constructing the functional structure of the subject area. In: Twenty Years of Post-Soviet Russia: crisis phenomena and modernization mechanisms. Materials of the XIV All-Russian Scientific and Practical Conference of the Humanitarian University: in 2 volumes. Humanitarian University. Ekaterinburg, 2011. P. 639-643.
[11] Zolotarev O.V., Sharnin M.M., Klimenko S.V. A semantic approach to the analysis of terrorist activity on the Internet based on thematic modeling methods.
[12] Zolotarev O.V., Sharnin M.M., Klimenko S.V. Bulletin of the Russian New University. Series: Complex systems: models, analysis and control. 2016. No. 3. P. 64-71.
[13] Kozerenko E.B., Kuznetsov K.I., Romanov D.A. Semantic processing of unstructured textual data based on the linguistic processor PullEnti. Informatics and Applications. 2018. Vol. 12, issue 3. P. 91-98. DOI: 10.14357/19922264180313.
[14] Chiu J.P., Nichols E. Named entity recognition with bidirectional LSTM-CNNs. arXiv preprint arXiv:1511.08308. 2015.
[15] Peters M.E. et al. Deep contextualized word representations. arXiv preprint arXiv:1802.05365. 2018.
[16] Navigli R., Ponzetto S.P. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence. 2012. Vol. 193. P. 217-250.
[17] Hebeler J., Fisher M., Blace R., Perez-Lopez A. Semantic Web Programming. John Wiley & Sons, 2009. 648 p.
[18] Protasov V.I., Potapova Z.E., Mirakhmedov R.O., Sharnin M.M., Minasyan V.B. Methods for finding solutions by a group actor with a low probability of error. In: CPT2019. Materials of the international scientific conference of the Nizhny Novgorod State University of Architecture and Civil Engineering and the Scientific and Research Center for Information in Physics and Technique. Nizhny Novgorod, 2019. P. 284-291.
[19] Brickley D., Guha R.V. RDF vocabulary description
    processes, objects, their relationships and                   language 1.0: RDF schema W3C working draft. 2002.
    characteristics. In the collection: Proceedings of the        http://www.w3.org/TR/2002/WD-rdf-schema-
    International Scientific Conference CPT2014.                  20020430/.
    Institute of Computing for Physics and Technology.       [20] Ehrmann M., Cecconi F., Vannella D., McCrae J.P.,
    2015.P. 92-98.                                                Cimiano P., Navigli R. Representing Multilingual
     Data as Linked Data: the Case of BabelNet 2.0. -
     LREC         (2014).     -      2014.     -     URL:
     http://wwwusers.di.uniroma1.it/~navigli/pubs/
     LREC_2014_Ehrmannetal.pdf.
[21] T. Flati, D. Vannella, T. Pasini, R. Navigli. Two Is
     Bigger (and Better) Than One: the Wikipedia
     Bitaxonomy Project. Proc. of the 52nd Annual
     Meeting of the Association for Computational
     Linguistics (ACL 2014), Baltimore, USA, June 22-27,
     2014, pp. 945-955.
[22] Ustalov, D., & Panchenko, A. (2017). A tool for
     effective extraction of synsets and semantic relations
     from BabelNet. В Proceedings - 2017 Siberian
     Symposium on Data Science and Engineering,
     SSDSE 2017 (стр. 10-13). [8071954] Institute of
     Electrical     and   Electronics     Engineers    Inc.
     https://doi.org/10.1109/SSDSE.2017.8071954
[23] R. Navigli, S.P. Ponzetto, BabelNetXplorer: a
     platform for multilingual lexical knowledge base
     access and exploration, in: Companion Volume
     totheProceedings of the 21st World Wide Web
     Conference, Lyon, France, 16–20 April 2012, pp.
     393–396.
[24] Lau J.H., Newman D., Karimi S., Baldwin T. Best
     Topic Word Selection for Topic Labelling //
     COLING’10 Proceedings of the 23rd International
     Conference       on    Computational      Linguistics.
     Stroudsburg, PA: Association for Computational
     Linguistics, 2010. Pp. 605-613.
[25] Google Cloud Machine Learning [CD] -
     https://cloud.google.com/ml-
     engine/docs/tutorials/python-guide.
[26] Xie Pengtao, Xing Eric P. Integrating document
     clustering and topic modeling. arXiv preprint,
     arXiv:1309.6874. 2013.

About the authors
   Zolotarev Oleg V., Ph.D., Docent, ANO HE «Russian New
University» (Moscow, Russia). E-mail: ol-zolot@yandex.ru