=Paper=
{{Paper
|id=Vol-2763/CPT2020_paper_s3-10
|storemode=property
|title=Research and development of linguistic-statistical methods for forming a portrait of a subject area
|pdfUrl=https://ceur-ws.org/Vol-2763/CPT2020_paper_s3-10.pdf
|volume=Vol-2763
|authors=Oleg Zolotarev
}}
==Research and development of linguistic-statistical methods for forming a portrait of a subject area==
Oleg V. Zolotarev
ol-zolot@yandex.ru
ANO HE «Russian New University», Moscow, Russia
The project aims to solve the fundamental scientific problem of semantic modeling. Within its framework, a methodology is developed for the automated identification of translation links (translation correspondences), as well as hierarchical, synonymous and associative links, from Internet texts, and for the construction of multilingual associative hierarchical portraits of a subject area (MAHPSA), in particular for autonomous uninhabited underwater vehicles (UUV). Taking multilingual and heterogeneous resources into account gives a more complete picture of what is happening in the subject area and makes it possible to identify the sources of ideas, the speed and directions of their spread, significant documents and promising directions. The solution is based on an integrated approach that combines methods of statistics, corpus linguistics and distributional semantics, and is implemented as a technology built on linguo-statistical mechanisms for forming a multilingual associative hierarchical portrait of a subject area: a dictionary of significant terms of the subject area whose elements are organized into synonymous series (synsets), including translation correspondences, as well as associative and hierarchical relationships.
Keywords: linguo-statistical methods, associative and hierarchical portrait of the subject area, multilingual integrated ontology, forecasting the spread of ideas, multilingual corpus of the subject area.
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)

1. Introduction

The growth of volumes of information on the Internet significantly complicates the search for information. Semantic search and the comparison of multilingual documents make it possible to find new and interesting trends and ideas, which significantly reduces the cost of developing and popularizing new areas in science. Using a multilingual associative hierarchical portrait of a subject area when comparing documents allows texts to be compared not only by matching the phrases they contain, but also by matching the objects and processes they describe. MAHPSA makes it possible to determine the semantic similarity of documents even when the documents share no common words. MAHPSA also allows integrated statistics of a multilingual collection to be calculated, and significant documents and promising areas to be identified, without translating the documents into a single language. This is important for the automatic processing of large numbers of documents (Big Data). The construction of MAHPSA will make it possible not only to compare documents and search for new ideas, but also to solve other problems associated with the rapid analysis of large amounts of information.

2. Technique of automatic formation of a multilingual associative-hierarchical portrait of a subject area

The essence of the proposed method for forming a multilingual associative-hierarchical portrait of a subject domain is the iterative expansion of an initial multilingual dictionary of significant phrases into a hierarchy of multilingual synonymous series (synsets). The method can be stated as the following algorithm:
1) Compiling a collection of multilingual texts by a directed keyword search in databases of scientific documents (for example, Dimensions);
2) Word processing by means of the PullEnti program: tokenization and metatokenization;
3) Automatic generation of glossaries of terms and megalemmas; expert quality control of the generated dictionaries;
4) Automatic selection of topics on the basis of thematic modeling methods, formation of a dictionary of subject areas, selection of sets of keywords of subject areas, expert control and topic correction;
5) Formation of a dictionary of key terms mapped to topics;
6) Compilation of frequency dictionaries of domain terms (using statistical methods);
7) Compilation of frequency dictionaries of subject domain megalemmas;
8) Building multilingual synsets by combining BabelNet resources and a megalemma dictionary;
9) Building SVPs using a neural network model (a combination of Word2Vec with multilingual recurrent neural networks, RNN) for texts that have undergone preprocessing;
10) Performing hierarchical clustering using Word2Vec and RNN, taking into account the hierarchical relationships of synsets;
11) Construction of an ordered list of candidates for hierarchical relationships from the associative connections of the neural network model; viewing and correction of hierarchical relations is implemented on the basis of the Keywen Knowledge Architect resource [1].

3. Methodology for calculating integral statistics based on MAHPSA

MAHPSA is created automatically on the basis of statistical analysis of large volumes of texts from the Internet. The hierarchical connections that make up the MAHPSA form a hierarchy and classifier that facilitate search and navigation in the multilingual subject area of the UUV.

The proposed methodology also includes the integration of various MAHPSAs with multilingual linguistic resources (WordNet, Wikipedia, BabelNet, etc.) to obtain the largest multilingual ontology with relevant knowledge and improved coverage of terminology in the subject areas under consideration. The combined (integral) ontology contains a hierarchy of synonymic series (synsets) of multilingual terms, including Russian, and serves as the basis for constructing a single multilingual vector space that allows us to evaluate the semantic proximity of multilingual texts, synsets and terms, similar to the NASARI and MAFFIN methods. The translation correspondences between the multilingual synsets of MAHPSA are built using Word2Vec technology. The integral ontology makes it possible to calculate integrated multilingual statistics and trends in the use of terms and ideas, which allows the spread of ideas between languages to be predicted and promising directions to be determined. A measure of the semantic proximity of multilingual documents makes it possible to identify implicit links between documents and to determine significant documents, which is necessary for collecting high-quality information from the open Internet and building large, relevant multilingual corpora of texts for the subject area. Thus, increasing the size and quality of the integral ontology will allow a better similarity measure and subject corpus of texts to be built; extracting knowledge from that corpus will, in turn, further increase the size and quality of the integral ontology.

The methodology includes not only the identification of significant documents, but also the identification of trends and of promising areas for the development of science.

To develop the first version of the integrated statistics methodology based on MAHPSA, it is necessary to:
1) Conduct morphological, syntactic and partially semantic analysis of the text;
2) Select typed objects (named entities);
3) Identify formal elements for the representation of concepts;
4) Develop a structure and software for storing a multilingual collection of documents;
5) Create dictionaries for storing structured information;
6) Develop neural network algorithms for calculating integrated statistics based on MAHPSA.
The first version of the program has been developed for highlighting interlingual implicit connections and assessing the semantic similarity of phrases in different languages.

Text processing is carried out using the PullEnti program [2], a product that has won the computational linguistics competitions held as part of the Dialogue conference. PullEnti is a linguistic processor developed at the Institute of Informatics Problems; it is constantly being refined and performs morphological, syntactic and partially semantic analysis of text, distinguishing typed objects, i.e. named entities.

The PullEnti SDK includes the following main blocks:
1) Tokenization: breakdown into words (tokens) (Fig. 1 [2-12]);
2) Morphological analysis: determining the parts of speech of tokens (a POS tagger, which outputs all possible variants for a word form regardless of its surrounding context). The supported languages are Russian, Ukrainian and English. It performs normalization (reduction of a word form to the desired case / gender / number), handles unknown and new words, and has a mode for correcting errors (Fig. 2 [2-12]);
3) Selection of named entities [13] (NER, Named Entity Recognition): a set of analyzers that find entities of the corresponding type (person, organization, geographical object, etc.) in sequences of tokens (Fig. 3 [2-12]);
4) A set of tools for working with numerical data, noun and verb groups, brackets and quotation marks, dictionaries of terms and abbreviations, various checks (for example, equivalence of strings in Latin and Cyrillic letters) and other useful features that appeared during the solution of practical problems (Fig. 4 [2-12]);
5) Derivative dictionary: a dictionary of so-called derivative groups (sets of same-root words of different parts of speech, where one group contains words in different languages), a group government model (what can come after a group), synonymy, etc.;
6) Semantic representation: tokens are structured in the form of a graph with semantic connections to solve more complex problems related to meaning [14].
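A minimal sketch of the tokenization and metatoken merging performed by blocks 1 and 3 above, illustrating the token-with-character-positions model: the class names and merging logic here are simplified assumptions for illustration, not the actual PullEnti SDK API.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Token:
    # Every token refers to a fragment of the source text by character
    # positions, in the spirit of the BeginChar/EndChar positions.
    begin_char: int
    end_char: int
    text: str

@dataclass
class MetaToken(Token):
    # A metatoken "absorbs" a fused sequence of underlying tokens.
    children: Optional[List[Token]] = None

def tokenize(text: str) -> List[Token]:
    """Split text into word tokens, keeping character offsets."""
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        tokens.append(Token(start, start + len(word) - 1, word))
        pos = start + len(word)
    return tokens

def merge(tokens: List[Token]) -> MetaToken:
    """Merge a contiguous run of tokens into one metatoken
    (e.g. a named-entity span found by an analyzer)."""
    return MetaToken(tokens[0].begin_char, tokens[-1].end_char,
                     " ".join(t.text for t in tokens), children=list(tokens))

toks = tokenize("Oleg Zolotarev works in Moscow")
person = merge(toks[:2])   # a hypothetical person-entity span
```

Real analyzers would decide which runs to merge; here the span is chosen by hand purely to show how a metatoken keeps the positions of the fragment it covers.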
Fig. 1. Tokenization
Fig. 2. Morphological analysis
Fig. 3. Highlighting the named entities
Fig. 4. Numeric Tools
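As a toy illustration of the single multilingual vector space described in Section 3, the sketch below averages per-term vectors into document vectors and compares documents by cosine similarity, so that documents in different languages with no common words can still score as close. The miniature hand-made embeddings and helper names are assumptions for illustration only; in the project the vectors would come from the trained Word2Vec models.

```python
import math

# Hypothetical toy vectors standing in for a shared multilingual space;
# Russian and English terms for the same concept sit near each other.
EMBEDDINGS = {
    "underwater": [0.9, 0.1, 0.0],
    "vehicle":    [0.8, 0.3, 0.1],
    "подводный":  [0.9, 0.2, 0.0],
    "аппарат":    [0.7, 0.3, 0.2],
    "painting":   [0.0, 0.2, 0.9],
}

def doc_vector(terms):
    """Average the vectors of known terms into a document vector."""
    known = [EMBEDDINGS[t] for t in terms if t in EMBEDDINGS]
    if not known:
        return [0.0, 0.0, 0.0]
    return [sum(col) / len(known) for col in zip(*known)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

en_doc = ["underwater", "vehicle"]      # English UUV document
ru_doc = ["подводный", "аппарат"]       # Russian UUV document, no shared words
art_doc = ["painting"]                  # unrelated subject area

sim_cross = cosine(doc_vector(en_doc), doc_vector(ru_doc))
sim_far = cosine(doc_vector(en_doc), doc_vector(art_doc))
```

Despite having no words in common, the two UUV documents come out far closer to each other than either is to the unrelated document, which is the effect the shared vector space is built to achieve.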
Specially for this project, the linguistic processor has been modified so that implicit links in documents can be highlighted more accurately.

The concept of a token (the Token base class) is at the heart of the PullEnti SDK model. Each token refers to a fragment of the source text (the BeginChar and EndChar positions). First, the text is divided into a sequence of text tokens (TextToken), which are then converted during processing by merging into metatokens (MetaToken). A metatoken is a token that has "absorbed" a fused sequence of other tokens. Metatokens represent, for example, the places where named entities occur in the text (ReferentToken). Metatokens can also represent various numerical data (numbers spelled out in words), noun groups (in the example, NounPhraseToken is a class inherited from MetaToken), etc. Most of the elements obtained and used during the analysis are metatokens.

The PullEnti metatoken concept served as the basis for building dictionaries of megalemmas, each of which can consist of several tokens or metatokens. The megalemma is the basis for comparing meaningful phrases from different languages; the concept of a megalemma is broader than that of a metatoken, since it additionally covers the identification of connections between different languages.

Megalemma dictionaries are constructed using the method for determining the proximity of terms [11]. It is this method that allows megalemmas to be formed on the basis of statistical patterns of term occurrence within the formation of an associative-hierarchical portrait of a subject area.

Thematic dictionaries of megalemmas are formed by subject area and serve as the basis for the classification of texts. Megalemma dictionaries are also used to represent knowledge in ontologies and to automatically supplement them with relevant vocabulary.

The synset was chosen as the formal element for the representation of concepts. It is the basis of knowledge representation in systems such as WordNet, BabelNet and others, and is a well-established and generally accepted concept [15]. Synsets can be chained together (megalemmas include synsets); megalemmas are thus represented as chains of synsets. The synset concept is inherently oriented toward multilingualism.

The work was carried out in two subject areas: "computer graphics and visualization" and "autonomous uninhabited underwater vehicles".

Algorithms for the semantic analysis of information have been developed [2-11, 15], along with prototypes of software components for the semantic analysis of textual information.

Implicit links are searched for using the megalemma dictionary. First, the text is processed with the PullEnti program: words in the text are normalized, named entities are selected (NER, named entity recognition), and dictionaries of tokens and metatokens are formed for the text. Next, a thematic analysis of the text is carried out using the megalemma dictionaries. In the megalemma dictionaries, as already mentioned, each megalemma is correlated with a specific document and a specific subject area. This allows texts to be classified by subject area and documents to be statistically analyzed for the presence of implicit references. From the publication dates, the source document of a megalemma and the documents that link to it are determined.

To control the quality of automatic detection of implicit links, methods of collective intelligence and crowdsourcing were used [17]; it was proposed to check the quality of implicit link detection using an expert approach. The probability of a positive decision is determined by the mathematical model:

K0 = Σ_{i=0}^{(M-1)/2} C(M, i) · G_R^(M-i) · (1 - G_R)^i

In accordance with this formula, the probability K0 of a positive decision by a group of M experts is determined from the probability G_R of a correct decision by a single expert. The analysis of expert estimates showed a rather high level of revealing implicit links and determining the semantic similarity of phrases and documents.

Software for storing a multilingual collection of documents was developed, along with a software implementation of thematic modeling methods using dictionaries of megalemmas in the subject areas [18]. As a result of processing the collections of documents, dictionaries of terms and dictionaries of megalemmas are built, and statistics on the use of terms and megalemmas across articles are collected.

BabelNet is an integration resource built on the following resources: WordNet, Wikipedia, OmegaWiki, Wiktionary, Wikidata, Wikiquote, VerbNet, Microsoft Terminology, GeoNames, ImageNet, FrameNet, WN-Map, Open Multilingual WordNet, WoNeF, Albanet, Arabic WordNet (AWN v2), BulTreeBank WordNet (BTB-WN), Chinese Open WordNet, Chinese WordNet (Taiwan), DanNet, Greek WordNet, Princeton WordNet, Persian WordNet, FinnWordNet, WOLF (WordNet Libre du Français), Hebrew WordNet, Croatian WordNet, IceWordNet, MultiWordNet, ItalWordNet, Japanese WordNet, Multilingual Central Repository, WordNet Bahasa, Open Dutch WordNet, Norwegian WordNet, plWordNet, OpenWN-PT, Romanian WordNet, Lithua.

BabelNet is fully integrated with the Babelfy multilingual word sense disambiguation and entity linking system. BabelNet is also integrated with the Wikipedia Bitaxonomy [20], which is built around two hierarchies: the page hierarchy and the category hierarchy [15].

Integration with BabelNet will be carried out by analogy with the approach that BabelNet uses to integrate the resources described above: automatic mapping and the filling of lexical gaps in under-resourced languages by means of statistical machine translation. The result is an "encyclopedic dictionary" that provides concepts and named entities lexicalized in many languages and connected by a large number of semantic relations [21]. Additional vocabulary and definitions are added by reference to free networks such as WordNet, OmegaWiki, English Wiktionary, Wikidata, FrameNet, VerbNet and others. Like WordNet, BabelNet groups words in different languages into sets of synonyms called Babel synsets. For each Babel synset, BabelNet provides short definitions (called glosses) in many languages, taken from both WordNet and Wikipedia.
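Reading the expert-vote model above as a binomial majority vote (the interpretation adopted here is an assumption, since the printed formula is hard to make out), the group of M experts reaches the correct decision when at most (M - 1) // 2 of them err, each expert being correct independently with probability G_R. This can be computed directly:

```python
from math import comb

def group_decision_probability(m: int, g_r: float) -> float:
    """Probability K0 that a group of m experts reaches a correct
    majority decision, when each expert is independently correct
    with probability g_r: sum over i = 0 .. (m - 1) // 2 of
    C(m, i) * g_r**(m - i) * (1 - g_r)**i, where i counts the
    experts who err."""
    return sum(comb(m, i) * g_r ** (m - i) * (1 - g_r) ** i
               for i in range((m - 1) // 2 + 1))

# With three experts each correct 80% of the time, the majority
# decision is right more often than any single expert alone.
k0 = group_decision_probability(3, 0.8)
```

This is the usual Condorcet-style argument for why a group of moderately reliable experts outperforms a single expert, consistent with the "rather high level" of detection quality reported for the expert estimates.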
In the future, it is planned to use the Babelscape product [22], which makes it possible to analyze documents, perform semantic markup of texts, build semantic knowledge graphs in several languages, etc.; however, this issue requires additional careful study [15].

The dictionaries of terms and megalemmas proposed within the framework of the project make it possible not only to classify texts but also to define implicit links between articles.

The structure of the glossary of terms is represented by a tuple:
Dterm = < IDterm, Term >, (1)
where Dterm is a glossary of terms, IDterm is a term identifier in the dictionary, Term is a term.

The structure of the megalemma dictionary is represented by a tuple:
Dmeg = < IDmeg, MegL >, (2)
where Dmeg is the megalemma dictionary, IDmeg is the megalemma identifier in the dictionary, MegL is the megalemma.

The structure of the document dictionary is represented by a tuple:
Ddoc = < IDdoc, NAMEdoc, SRCdoc, YEARdoc, NUMwrd >, (3)
where Ddoc is the document dictionary, IDdoc is the document identifier in the dictionary, NAMEdoc is the document name, SRCdoc is the publication source, YEARdoc is the publication year, NUMwrd is the total number of terms in the document.

The structure of the domain dictionary is represented by a tuple:
Dsa = < IDsa, SA >, (4)
where Dsa is the domain dictionary, IDsa is the domain identifier in the dictionary, SA is the domain name.

While the Dterm dictionary is a general glossary of terms, the dictionaries of documents contain the terms of a document and the frequency of occurrence of each term in the document; the same applies to the dictionary of megalemmas. These two dictionaries are associative tables in the database; an associative table implements a many-to-many relationship between entities.

The structure of the dictionary of terms of a document is represented by a tuple:
Dtd = < IDterm, IDdoc, Fterm >, (5)
where Dtd is the dictionary of terms of the document and Fterm is the relative frequency of occurrence of the term in the document, calculated as follows: first, all insignificant words (stop words, rare words, etc.) are removed from the document so that only terms remain; then the frequency of occurrence of the term is divided by the total number of terms in the document.

The structure of the dictionary of megalemmas of a document is represented by a tuple:
Dmd = < IDmeg, IDdoc, Fmeg >, (6)
where Dmd is the dictionary of megalemmas of the document and Fmeg is the relative frequency of the megalemma in the document, calculated by dividing the frequency of the megalemma by the total number of megalemmas in the document.

The structure of the keyword dictionary is represented by a tuple:
Dkeywrd = < IDterm, IDsa >, (7)
Keywords are taken from the general vocabulary of terms and mapped to subject areas; this is also an associative table.

The structure of the dictionary correlating documents with subject areas is presented below:
Ddsa = < IDdoc, IDsa >, (8)
where Ddsa is a dictionary of the subject areas of a document. One document can belong to several subject areas.

4. Results

A program was developed to implement topic modeling methods and to identify implicit links between documents [23]. The megalemma dictionary is used to determine implicit references; the task is to determine the source of each megalemma and link to it. A storage structure and methods for constructing a multilingual collection of synsets (synonymous series) were developed.

A neural network algorithm was developed, using token tagging (flagging) and the Word2vec method modified by the team of authors, already described, to identify Russian-language terms in texts that are similar in the lexical meaning of their context [24].

The methodology for constructing forecasts of the development of new directions uses the ratio of the relative frequencies of occurrence of the same megalemmas calculated over adjacent years. This approach eliminates the problem of retraining neural networks as information accumulates.

An analysis of clustering and thematic modeling methods for assessing the quality / significance of texts was carried out [25]. Various thematic modeling methods were considered, including the vector model, latent semantic analysis, latent Dirichlet allocation, and others. These methods rest on a probabilistic approach, i.e. the correlation of a term or document with several topics with a certain degree of probability. The disadvantage of this approach is the automatic formation of the list of topics.

5. Conclusion

As a result of this scientific research, a number of results will be obtained that have high scientific and applied significance:
1. The updated multilingual collection of scientific texts in various languages, containing more than 60 thousand scientific documents and more than 6 thousand internal bibliographic references. This collection will allow the significance of documents to be accurately calculated using the scientific citation index (SCI), based on the number of bibliographic references, as well as the context scientific citation index (CSCI), calculated from the number of implicit references identified through the semantic similarity of texts.
2. The developed technique for the automatic formation of a multilingual associative-hierarchical portrait of a subject area (MAHPSA) containing a hierarchy of multilingual synonymous series (synsets). With the help of MAHPSA, it is possible to solve a wide range of problems, including calculating the semantic similarity of texts, identifying multilingual plagiarism, and expanding queries in multilingual search.
3. The developed methodology and algorithms for calculating integrated multilingual statistics based on MAHPSA, including the identification of significant documents, trends and promising areas. By applying the technique to a multilingual collection, new concepts will be revealed, the dynamics of their
development over time will be considered, and promising areas for the development of the subject area will be constructed. Based on this, it will be possible to build forecasts of promising areas of research.
4. The developed methodology for integrating MAHPSA with other ontologies and linguistic resources, including BabelNet, which contains millions of multilingual synsets. As a result, the shortcomings of BabelNet related to its low coverage of Russian terms will be overcome. For the integrated resources, updated ratings of the significance of documents will be calculated and updated forecasts of promising areas of research in the selected subject areas will be constructed.

Acknowledgment

The reported study was funded by RFBR according to the research projects № 18-07-00225, 18-07-00909, 18-07-01111, 19-07-00455 and 20-04-60185.

References

[1] J. Galbraith and R. Thayer. SECSH Public Key File Format, draft-ietf-secsh-publickeyfile-01.txt, March 2001, work in progress material.
[2] Zolotarev O.V., Sharnin M.M., Klimenko S.V., Kuznetsov K.I. PullEnty system - information extraction from natural language texts and automated building of information systems. In the collection: Situational centers and information-analytical systems of class 4i for monitoring and security tasks (SCVRT2015-16). Proceedings of the International Scientific Conference: in 2 volumes. 2016. P. 28-35.
[3] Zolotarev O.V., Kozerenko E.B., Sharnin M.M. The principles of constructing models of business processes in the subject area based on natural language text processing. Bulletin of the Russian New University. Series: Complex systems: models, analysis and control. 2014. No. 4. P. 82-88.
[4] Zolotarev O.V. Methods and tools for domain modeling. In the collection: The Civilization of Knowledge: Problems and Prospects of Social Communications. Proceedings of the XIII International Scientific Conference. 2012. P. 71-72.
[5] Zolotareva V.P., Yashkova N.V., Zolotarev O.V. Project management. Educational-methodical manual. Nizhny Novgorod, 2016.
[6] Zolotarev O.V. Formalization of knowledge about the subject area based on the analysis of natural language structures. In the collection: The civilization of knowledge: the problem of man in science of the XXI century. Proceedings of the XII International Scientific Conference. 2011. P. 78-80.
[7] Zolotarev O.V., Sharnin M.M. Methods of extracting knowledge from natural language texts and building business process models based on the allocation of processes, objects, their relationships and characteristics. In the collection: Proceedings of the International Scientific Conference CPT2014. Institute of Computing for Physics and Technology. 2015. P. 92-98.
[8] Sharnin M.M., Zolotarev O.V., Somin N.V. Extracting and processing knowledge from unstructured texts of the business sphere and social networks. In the collection: Social computing: fundamentals, development technologies, social and humanitarian effects. Materials of the Fourth International Scientific and Practical Conference. 2015. P. 364-371.
[9] Zolotarev O.V., Kozerenko E.B., Sharnin M.M. Analytical intelligence based on the analysis of unstructured information from various sources, including the Internet and the media. Bulletin of the Russian New University. Series: Complex systems: models, analysis and control. 2015. No. 1. P. 49-54.
[10] Zolotarev O.V. New approaches in constructing the functional structure of the subject area. In the collection: Twenty Years of Post-Soviet Russia: crisis phenomena and modernization mechanisms. Materials of the XIV All-Russian Scientific and Practical Conference of the Humanitarian University: in 2 volumes. Humanitarian University. Ekaterinburg, 2011. P. 639-643.
[11] Zolotarev O.V., Sharnin M.M., Klimenko S.V. A semantic approach to the analysis of terrorist activity on the Internet based on thematic modeling methods.
[12] Zolotarev O.V., Sharnin M.M., Klimenko S.V. Bulletin of the Russian New University. Series: Complex systems: models, analysis and control. 2016. No. 3. P. 64-71.
[13] Kozerenko E.B., Kuznetsov K.I., Romanov D.A. Semantic processing of unstructured textual data based on the linguistic processor PullEnti. Informatics and Applications. 2018. Volume 12, issue 3. DOI: 10.14357/19922264180313. P. 91-98.
[14] Chiu J.P. and Nichols E. (2015). Named entity recognition with bidirectional LSTM-CNNs. arXiv preprint arXiv:1511.08308.
[15] Peters M.E. et al. Deep contextualized word representations. arXiv preprint arXiv:1802.05365. 2018.
[16] Roberto Navigli and Simone Paolo Ponzetto. 2012. BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193:217-250.
[17] John Hebeler, Matthew Fisher, Ryan Blace, Andrew Perez-Lopez. Semantic Web Programming. John Wiley & Sons, 2009. 648 p.
[18] Protasov V.I., Potapova Z.E., Mirakhmedov R.O., Sharnin M.M., Minasyan V.B. Methods for finding solutions by a group actor with a low probability of error. In the collection of CPT2019. Materials of the international scientific conference of the Nizhny Novgorod State University of Architecture and Civil Engineering and the Scientific and Research Center for Information in Physics and Technique. 2019, Nizhny Novgorod. P. 284-291.
[19] Brickley D., Guha R.V. RDF vocabulary description language 1.0: RDF schema. W3C working draft. 2002. http://www.w3.org/TR/2002/WD-rdf-schema-20020430/.
[20] Ehrmann M., Cecconi F., Vannella D., McCrae J.P., Cimiano P., Navigli R. Representing Multilingual
Data as Linked Data: the Case of BabelNet 2.0. LREC 2014. URL: http://wwwusers.di.uniroma1.it/~navigli/pubs/LREC_2014_Ehrmannetal.pdf.
[21] T. Flati, D. Vannella, T. Pasini, R. Navigli. Two Is
Bigger (and Better) Than One: the Wikipedia
Bitaxonomy Project. Proc. of the 52nd Annual
Meeting of the Association for Computational
Linguistics (ACL 2014), Baltimore, USA, June 22-27,
2014, pp. 945-955.
[22] Ustalov D., Panchenko A. (2017). A tool for effective extraction of synsets and semantic relations from BabelNet. In Proceedings - 2017 Siberian Symposium on Data Science and Engineering, SSDSE 2017 (pp. 10-13). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/SSDSE.2017.8071954
[23] R. Navigli, S.P. Ponzetto. BabelNetXplorer: a platform for multilingual lexical knowledge base access and exploration. In: Companion Volume to the Proceedings of the 21st World Wide Web Conference, Lyon, France, 16-20 April 2012, pp. 393-396.
[24] Lau J.H., Newman D., Karimi S., Baldwin T. Best
Topic Word Selection for Topic Labelling //
COLING’10 Proceedings of the 23rd International
Conference on Computational Linguistics.
Stroudsburg, PA: Association for Computational
Linguistics, 2010. Pp. 605-613.
[25] Google Cloud Machine Learning. URL: https://cloud.google.com/ml-engine/docs/tutorials/python-guide.
[26] Xie Pengtao, Xing Eric P. Integrating document
clustering and topic modeling. arXiv preprint,
arXiv:1309.6874. 2013.
About the authors
Zolotarev Oleg V., Ph.D., Docent, ANO HE «Russian New
University» (Moscow, Russia), E-mail: ol-zolot@yandex.ru