<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Cross-genre Gender Identification in Russian Texts Using Topic Modeling. Working Note: Team DUBL</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriella Skitalinskaya</string-name>
          <email>gabriellasky@icloud.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liliya Akhtyamova</string-name>
          <email>liliya.akhtyamova@postgrad.ittdublin.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John Cardiff</string-name>
          <email>john.cardiff@it-tallaght.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute of Technology Tallaght</institution>
          ,
          <addr-line>Dublin</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>In this paper, we describe the gender identification results of Team DUBL. We used a topic modeling approach to identify an author's gender from his or her written texts. The model was trained on the RusProfiling PAN 2017 Twitter Corpus, which contains texts in Russian, and was evaluated on texts of other genres, including letters to a friend, online reviews, and Facebook posts. Our model obtained competitive results and outperformed more sophisticated algorithms on gender identification.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Computing methodologies → Information extraction;</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>Author profiling is a broad field that focuses on revealing an
author’s demographics, psychological characteristics, and mental
health attributes from his or her written texts. These attributes
include age, gender, social status, personality traits, native language,
and writing style. The resulting analysis can later be used for
plagiarism detection, authorship attribution, forensics, and
similar tasks.</p>
      <p>
        A series of shared tasks on digital text forensics called PAN<sup>1</sup>
(Plagiarism, Authorship and Social Software Misuse) has been
conducted since 2013. However, Slavic languages, and Russian in
particular, are less investigated from the author identification
standpoint and have never been presented at PAN. To address this
gap, the RusProfiling<sup>2</sup> shared task has been organized this year [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The focus of the RusProfiling shared task is
Cross-genre Gender Identification in Russian texts, that is,
investigating the effect of cross-genre evaluation. The models are
trained on one genre, in our case the Twitter corpus, and evaluated
on other genres, such as texts describing images, letters to a friend,
motivation letters, Facebook posts, and online reviews.
      </p>
      <p>
        The rest of this paper is organized as follows. We discuss relevant
literature in Section 2. Section 3 gives details on the training dataset
and describes the proposed approach. Section 4 provides the
experimental evaluation and important insights gained during our
work. We conclude in Section 5, outlining our contributions and
directions for future research.
      </p>
      <p><sup>1</sup>http://pan.webis.de/clef17/pan17-web/author-profiling.html</p>
      <p><sup>2</sup>http://en.rusprofilinglab.ru/rusprofiling-at-pan/</p>
      <p>
        One standard and quite successful technique for analyzing
texts is to examine the author’s writing style using
stylometric features. These writing style patterns are used to identify
different attributes of an author. For example, in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] over 1,000
stylometric features were proposed: word- and character-based
stylometric features, function words, profanities, punctuation, etc.
Many different approaches to analyzing such features
exist. For example, in the early work [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] the authors investigated
authorship gender and language background cohort
attribution from e-mail text documents. They used an SVM classifier to
analyze over 800 e-mails. The classifier was fed 222
stylometric, structural, and gender-specific features, obtaining
an F-score of about 80%.
      </p>
      <p>
        In the shared task organized by PAN, one of the subtasks is
author profiling on a corpus of Twitter posts. The latest track focused
on language variety identification with 4 languages and 19 varieties,
comprising 11400 tweets in total [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. In total,
22 teams participated in the track. Overall, the best
result was obtained using an SVM classifier with tf-idf n-grams,
outperforming more sophisticated methods [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        For the Russian language, research in this area
is quite limited. The following papers are worth mentioning. In
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] the authors use statistical methods to calculate the
correlation between the frequencies of part-of-speech (POS) bigrams and traits of
the author. The accuracy was evaluated using a simple regression
model. The training dataset consisted of students’
essays written in Russian. The results showed
accuracies of 65%, 79%, and 88% for gender, neuroticism,
and openness identification, respectively, demonstrating the usefulness of
POS bigrams. In another paper [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] the authors gathered a corpus
of essays from 60 respondents on the following topics: a letter to a
friend and a motivation letter to an employee. These texts were then
analyzed to predict the self-destructive behaviour of the authors.
Using a statistical approach based on the presence and frequency
of different stylometric features, the authors achieved an accuracy
of about 80% on this dataset.
      </p>
      <p>
        With the goal of gender identification, in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] the authors fed
different morphological and syntactic features to machine learning
algorithms and obtained an F-score of 74% using ReLU.
They used the RusPersonality dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], consisting of 1867 texts
in different genres (including descriptions of pictures, essays on
different topics, letters, etc.). In another paper [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], Sboev et al.
achieved even better results on the same dataset using deep learning
algorithms, with an F-score of 86%.
      </p>
    </sec>
    <sec id="sec-3">
      <title>APPROACH</title>
      <p>In this section, we give an overview of the proposed approach and
describe its main components.</p>
    </sec>
    <sec id="sec-4">
      <title>Dataset</title>
      <p>
        The training dataset is a Twitter corpus containing tweets in Russian
from 600 users. Each tweet is labeled with the gender of its author.
The task organizers provided five different
datasets for testing, each belonging to a particular genre:
      </p>
      <list list-type="order">
        <list-item><p>Offline texts (picture descriptions, letters to a friend,
motivation letters to employees) from the RusPersonality Corpus
(370 texts) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].</p></list-item>
        <list-item><p>Facebook (228 texts)</p></list-item>
        <list-item><p>Twitter (400 texts)</p></list-item>
        <list-item><p>Product and service online reviews (776 texts)</p></list-item>
        <list-item><p>Gender imitation corpus (women imitating men and the
other way around) (94 texts)</p></list-item>
      </list>
    </sec>
    <sec id="sec-5">
      <title>Data Preprocessing</title>
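      <p>Every text from the dataset went through the following
preprocessing procedures:</p>
      <list list-type="bullet">
        <list-item><p>removal of stopwords</p></list-item>
        <list-item><p>removal of short words (fewer than 2 characters)</p></list-item>
        <list-item><p>lemmatization</p></list-item>
        <list-item><p>removal of links, hashtags and mentions (optional)</p></list-item>
      </list>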
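      <p>As an illustration, the preprocessing steps above could be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code used for the submitted runs: the stop-word list is a tiny hypothetical sample, the function and parameter names are introduced here for illustration, and lemmatization is passed in as a callable (identity by default) so that a morphological analyzer such as pymorphy2 could be plugged in. The paper does not name the lemmatizer that was actually used.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical, deliberately tiny stop-word sample; a real run would use a full Russian list.
STOPWORDS = {"и", "в", "на", "не", "что", "я", "с", "это"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # links
TAG_RE = re.compile(r"[#@]\w+")                 # hashtags and mentions
TOKEN_RE = re.compile(r"[а-яёa-z]+")            # lowercase Cyrillic or Latin words


def preprocess(text, lemmatize=lambda word: word, min_len=2, strip_social=True):
    """Return the cleaned tokens of one text.

    lemmatize    -- callable mapping a word to its lemma (identity by default)
    min_len      -- words shorter than this are dropped
    strip_social -- the optional removal of links, hashtags and mentions
    """
    text = text.lower()
    if strip_social:
        text = URL_RE.sub(" ", text)
        text = TAG_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text)
    tokens = [t for t in tokens if len(t) >= min_len and t not in STOPWORDS]
    return [lemmatize(t) for t in tokens]


print(preprocess("Я смотрела фильм на https://t.co/abc #кино @user"))
```
]]></preformat>
      <p>Keeping the removal of links, hashtags and mentions behind a flag mirrors the fact that this step is optional and makes it easy to compare both settings.</p>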
    </sec>
    <sec id="sec-6">
      <title>Method Based on Topic Modeling</title>
      <p>
        Topic modeling is a rapidly developing technique capable of
revealing hidden topics in text collections. Originating in text
analysis, topic modeling has found applications in many other areas,
including signal, image and video processing and network
analysis [
        <xref ref-type="bibr" rid="ref11 ref2">2, 11</xref>
        ].
      </p>
      <p>
        Regarding text analysis, practical applications of topic modeling
include information retrieval, summarization, segmentation and
classification of texts, as well as regression analysis. However, most
topic-modeling-based approaches capable of solving these
problems are too difficult for practitioners to understand and
apply. These approaches are based on Bayesian learning and require
sophisticated tuning and good theoretical knowledge of Bayesian
algorithms [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. As a consequence, only basic models are in
common practice, such as Probabilistic Latent
Semantic Analysis (PLSA) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and Latent Dirichlet Allocation (LDA)
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which are ineffective in many cases.
      </p>
      <p>
        In this sense, Additive Regularization of Topic Models (ARTM),
proposed in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], is free of redundant probabilistic assumptions and
provides simple inference for many combined and multi-objective
topic models. This method is based on classical, non-Bayesian
regularization, using a semi-probabilistic approach. Besides
the simplicity of the approach, another advantage is the ability
to take into account different data or "modalities" accompanying
the texts when building a model, such as images, audio and video
attachments, user log data, and various metadata (for example, the user’s
age or gender).
      </p>
      <p>
        In ARTM, the topic model is constructed with an
iterative two-step expectation-maximization (EM) algorithm: in
the first step (E-step) the expected value of the likelihood function
is calculated, followed by maximum likelihood estimation
(M-step). The steps are repeated until convergence. Adding
regularizers (or constraints) at this point helps to address the
non-uniqueness and instability of the likelihood maximization problem.
A detailed explanation of the ARTM algorithm can be found in
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
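      <p>To make the two steps concrete, the following is a minimal pure-Python sketch of EM for a PLSA-style topic model with one additive regularizer. It is an illustration only and not BigARTM code: the function names are hypothetical, and we assume the simple smoothing regularizer, for which the regularized M-step described in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] reduces to adding constants (called beta and alpha here) to the expected counts before clipping at zero and normalizing.</p>
      <preformat><![CDATA[
```python
import random


def norm(values):
    """Clip negatives to zero and rescale to sum to 1; an all-zero input stays zero."""
    clipped = [max(v, 0.0) for v in values]
    total = sum(clipped)
    return [v / total for v in clipped] if total > 0 else clipped


def fit_topic_model(counts, num_topics, beta=0.0, alpha=0.0, iterations=30, seed=0):
    """Fit phi[t][w] = p(w|t) and theta[d][t] = p(t|d) to counts[d][w] with EM."""
    rng = random.Random(seed)
    num_docs, num_words = len(counts), len(counts[0])
    phi = [norm([rng.random() for _ in range(num_words)]) for _ in range(num_topics)]
    theta = [norm([rng.random() for _ in range(num_topics)]) for _ in range(num_docs)]
    for _ in range(iterations):
        n_wt = [[0.0] * num_words for _ in range(num_topics)]
        n_td = [[0.0] * num_topics for _ in range(num_docs)]
        # E-step: split every observed count over topics according to p(t | d, w).
        for d in range(num_docs):
            for w in range(num_words):
                if counts[d][w] == 0:
                    continue
                p_t = norm([phi[t][w] * theta[d][t] for t in range(num_topics)])
                for t in range(num_topics):
                    n_wt[t][w] += counts[d][w] * p_t[t]
                    n_td[d][t] += counts[d][w] * p_t[t]
        # M-step: re-estimate phi and theta from the expected counts,
        # shifted by the (assumed) smoothing regularizer constants.
        phi = [norm([n + beta for n in row]) for row in n_wt]
        theta = [norm([n + alpha for n in row]) for row in n_td]
    return phi, theta


counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 6]]
phi, theta = fit_topic_model(counts, num_topics=2, beta=0.1, alpha=0.1)
```
]]></preformat>
      <p>With beta and alpha set to zero the sketch reduces to plain PLSA; negative values would act as a sparsing regularizer, which is why negative entries are clipped before normalization.</p>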
      <p>All experiments were carried out in Python using BigARTM<sup>3</sup>, an
open-source implementation of the ARTM algorithm.</p>
      <p>In our initial experiments, we tried a number of different
methods to solve the author profiling problem considered here. Among them
were bag-of-words models and models based on
Russian grammar rules. In Russian, gender affects
the form of past-tense verbs, which makes it possible to identify
the gender of the verb’s subject. We tried looking for
sentences containing verbs in the first person singular past tense
and analyzing them.</p>
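      <p>The grammatical cue described above can be approximated with a crude surface heuristic, sketched below. The paper does not describe how the rule was actually implemented, so everything here is an assumption: the regular expressions and the function name are hypothetical, and the sketch only looks for the pronoun "я" immediately followed (optionally via "не") by a word ending in -л or -лся (masculine) or in -ла or -лась (feminine). It misses past-tense forms without these endings and verbs separated from the pronoun, so a morphological analyzer would be more reliable in practice.</p>
      <preformat><![CDATA[
```python
import re

# First person pronoun directly followed by a word with a past-tense singular ending.
FEMININE_RE = re.compile(r"\bя\s+(?:не\s+)?\w+(?:ла|лась)\b")
MASCULINE_RE = re.compile(r"\bя\s+(?:не\s+)?\w+(?:л|лся)\b")


def past_tense_gender_votes(text):
    """Count first-person past-tense forms that look feminine or masculine."""
    text = text.lower()
    return {
        "female": len(FEMININE_RE.findall(text)),
        "male": len(MASCULINE_RE.findall(text)),
    }


print(past_tense_gender_votes("Я купила новый телефон. Я не ожидала такого."))
print(past_tense_gender_votes("Я сделал это сам, я старался."))
```
]]></preformat>
      <p>The counts can be treated as votes: a text with no matches yields no evidence either way, which is one reason such a rule is hard to use on its own for short texts.</p>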
      <p>Moreover, we tried various classifiers: Random Forest,
Linear Regression, Naive Bayes, SVMs, and topic modeling based on
Non-negative Matrix Factorization, Latent Dirichlet Allocation, Latent
Semantic Analysis, etc. However, our experiments revealed that
these solutions performed the same as or worse than
the one proposed in this work; therefore, topic modeling
based on ARTM was chosen for submission and final evaluation.</p>
    </sec>
    <sec id="sec-7">
      <title>EXPERIMENTS AND RESULTS</title>
      <p>We submitted 4 runs, experimenting with different model
settings, including different text preprocessing, different numbers
of topics, and different regularizers. The differences
between the runs are detailed in Table 1<sup>4</sup>. Additionally, the
preprocessing of texts differs slightly between Runs 2 and 3 and
Run 4. The preprocessing for Run 4 includes removing stop words,
short words (&lt;3 characters), hashtags, links and mentions, whereas
for Runs 2 and 3 only conjunctions, special characters, numbers, short
words (&lt;3 characters) and stop words were removed.</p>
      <p>The results obtained by our runs for each dataset, as well as the
best result in the track, are presented in Table 2. It can be seen
that the addition of regularizers in Runs 3 and 4 improves
performance. Additionally, increasing the number of topics
leads to better performance on some test datasets. Our Run 4, with
an accuracy of 63%, placed third on the Test 1 dataset. This could
be because the dataset consisted of essays, such as
letters to a friend, motivation letters, and descriptions of pictures,
so the average length of texts was longer and more topics were
covered. Thus, using a higher number of topics captures
topics of finer granularity, resulting in higher accuracy. It should
be mentioned that the results obtained for the Test 3 dataset, with
an accuracy of 63%, are not far off from the best result achieved in
the task.</p>
      <p><sup>3</sup>http://bigartm.org</p>
      <p><sup>4</sup>The results of Run 1 are not presented, as the experimental setup is identical to the
setup of Run 2.</p>
      <p>Overall, it can be seen that the dataset containing online reviews
was the hardest for the gender identification task. The reason
may be the difference in the nature of the training and test datasets.
The Test 4 dataset with online reviews contains domain-specific vocabulary that
may not be covered adequately in the training dataset (Twitter). The
highest accuracy, 76%, was achieved on the Facebook dataset
(Test 2), which is significantly higher than the accuracy of 63%
obtained for the Twitter dataset (Test 3). In a way, this is surprising,
since the Test 3 dataset is of the same nature as the training
dataset. Such results may be explained by Facebook posts being
longer and richer in information, in addition to containing fewer
misspellings, syntactic errors, abbreviations, etc.</p>
      <p>The results obtained for the Gender imitation corpus (Test 5)
are not very high. In our opinion, this is mostly due to the
chosen approach. In the case of topic modeling, it would have been
more appropriate and interesting to train on a dataset such as a
Gender imitation corpus, and not only build a classifier to detect
people imitating the opposite gender, but also learn which topics
are more frequently discussed by such people.</p>
    </sec>
    <sec id="sec-8">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>In this paper, we reported the approach of the DUBL solution
submitted to the RusProfiling Shared Task. All four runs performed
competitively, with one of our runs achieving high results in
identifying the author’s gender based on offline texts.</p>
      <p>In the future, we plan to improve our topic-modeling-based model
by augmenting it with more features (stylometric,
morphological and syntactic). Moreover, we plan to apply more
elaborate preprocessing, e.g., adding word and character bigrams to
our model and taking into account counts of hashtags, links and
mentions found in texts. We believe that with more parameter tuning
we can achieve better results than those presented in this paper and
advance the state of the art in the task of author profiling
in general.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Angelo</given-names>
            <surname>Basile</surname>
          </string-name>
          , Gareth Dwyer, Maria Medvedeva, Josine Rawee, Hessel Haagsma, and
          <string-name>
            <given-names>Malvina</given-names>
            <surname>Nissim</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>N-GrAM: New Groningen Author-profiling Model. Notebook for PAN at CLEF 2017</article-title>
          .
          <source>CLEF 2017 Evaluation Labs and Workshop - Working Notes Papers</source>
          , 11-14 September, Dublin, Ireland (Sept.
          <year>2017</year>
          ). http://ceur-ws.org/Vol-1866/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jen-Tzung</given-names>
            <surname>Chien</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Topic Modeling for Speech and Language Processing</article-title>
          . Springer Japan, Tokyo,
          <fpage>87</fpage>
          -
          <lpage>111</lpage>
          . https://doi.org/10.1007/978-4-431-55339-7_4
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Olivier</given-names>
            <surname>De Vel</surname>
          </string-name>
          , Malcolm Corney, Alison Anderson, and
          <string-name>
            <given-names>George</given-names>
            <surname>Mohay</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Language and Gender Author Cohort Analysis of E-mail for Computer Forensics</article-title>
          .
          <source>The Digital Forensic Research Conference</source>
          (
          <year>2002</year>
          ). http://dfrws.org
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Hofmann</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Probabilistic latent semantic indexing</article-title>
          .
          <source>Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval</source>
          (
          <year>1999</year>
          ),
          <fpage>50</fpage>
          -
          <lpage>57</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Thomas K</given-names>
            <surname>Landauer</surname>
          </string-name>
          , Peter W Foltz, and
          <string-name>
            <given-names>Darrell</given-names>
            <surname>Laham</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>An introduction to latent semantic analysis</article-title>
          .
          <source>Discourse Processes</source>
          <volume>25</volume>
          ,
          <issue>2-3</issue>
          (
          <year>1998</year>
          ),
          <fpage>259</fpage>
          -
          <lpage>284</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Tatiana</given-names>
            <surname>Litvinova</surname>
          </string-name>
          , Olga Litvinova, Olga Zagorovskaya, Pavel Seredin, Aleksandr Sboev, and
          <string-name>
            <given-names>Olga</given-names>
            <surname>Romanchenko</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>"Ruspersonality": A Russian corpus for authorship profiling and deception detection</article-title>
          .
          <source>2016 International FRUCT Conference on Intelligence, Social Media and Web (ISMW FRUCT)</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          . https://doi.org/10.1109/FRUCT.2016.7584767
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Tatiana</given-names>
            <surname>Litvinova</surname>
          </string-name>
          , Francisco Rangel, Paolo Rosso, Pavel Seredin, and
          <string-name>
            <given-names>Olga</given-names>
            <surname>Litvinova</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Overview of the RUSProfiling PAN at FIRE Track on Cross-genre Gender Identification in Russian</article-title>
          .
          <source>Notebook Papers of FIRE 2017, FIRE-2017</source>
          (
          <year>2017</year>
          ). Bangalore, India, December 8-10, CEUR Workshop Proceedings. CEUR-WS.org.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Tatiana</given-names>
            <surname>Litvinova</surname>
          </string-name>
          , Pavel Seredin, Olga Litvinova, and
          <string-name>
            <given-names>Olga</given-names>
            <surname>Zagorovskaya</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Profiling a set of personality traits of text author: what our words reveal about us</article-title>
          .
          <source>Research in Language</source>
          <volume>14</volume>
          ,
          <issue>4</issue>
          (
          <year>2016</year>
          ). https://doi.org/10.1515/rela-2016-0019
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Tatiana</given-names>
            <surname>Litvinova</surname>
          </string-name>
          , Pavel Seredin, Olga Litvinova, Olga Zagorovskaya, Alexandr Sboev, Dmitry Gudovskih, Ivan Moloshnikov, and
          <string-name>
            <given-names>Roman</given-names>
            <surname>Rybka</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Gender Prediction for Authors of Russian Texts Using Regression And Classification Techniques</article-title>
          .
          <source>Proceedings of the Third International Workshop on Concept Discovery in Unstructured Data (CDUD 2016)</source>
          (
          <year>2016</year>
          ),
          <fpage>44</fpage>
          -
          <lpage>54</lpage>
          . http://ceur-ws.org/Vol-1625/paper5.pdf
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T A</given-names>
            <surname>Litvinova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P V</given-names>
            <surname>Seredin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O A</given-names>
            <surname>Litvinova</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Using Part-of-Speech Sequences Frequencies in a Text to Predict Author Personality: a Corpus Study</article-title>
          .
          <source>Indian Journal of Science and Technology</source>
          <volume>8</volume>
          ,
          <issue>S9</issue>
          (
          <year>2015</year>
          ),
          <fpage>93</fpage>
          -
          <lpage>97</lpage>
          . https://doi.org/10.17485/ijst/2015/v8iS9/51103
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Lin</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lin</given-names>
            <surname>Tang</surname>
          </string-name>
          , Wen Dong, Shaowen Yao, and
          <string-name>
            <given-names>Wei</given-names>
            <surname>Zhou</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>An overview of topic modeling and its current applications in bioinformatics</article-title>
          .
          <source>SpringerPlus</source>
          <volume>5</volume>
          ,
          <issue>1</issue>
          (
          <year>2016</year>
          ),
          <fpage>1608</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Francisco</given-names>
            <surname>Rangel</surname>
          </string-name>
          , Paolo Rosso,
          <string-name>
            <given-names>Martin</given-names>
            <surname>Potthast</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Benno</given-names>
            <surname>Stein</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Overview of the 5th Author Profiling Task at PAN 2017: Gender and Language Variety Identification in Twitter</article-title>
          .
          <source>Working Notes Papers of the CLEF</source>
          (
          <year>2017</year>
          ). http://pan.webis.de
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Aleksandr</given-names>
            <surname>Sboev</surname>
          </string-name>
          , Tatiana Litvinova, Irina Voronina, Dmitry Gudovskikh, and
          <string-name>
            <given-names>Roman</given-names>
            <surname>Rybka</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Deep Learning Network Models to Categorize Texts According to Author's Gender and to Identify Text Sentiment</article-title>
          .
          <source>2016 International Conference on Computational Science and Computational Intelligence (CSCI)</source>
          ,
          <fpage>1101</fpage>
          -
          <lpage>1106</lpage>
          . https://doi.org/10.1109/CSCI.2016.0210
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Evgeny</given-names>
            <surname>Sokolov</surname>
          </string-name>
          and
          <string-name>
            <given-names>Lev</given-names>
            <surname>Bogolubsky</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Topic Models Regularization and Initialization for Regression Problems</article-title>
          .
          <source>Proceedings of the 2015 Workshop on Topic Models: Post-Processing and Applications</source>
          (
          <year>2015</year>
          ),
          <fpage>21</fpage>
          -
          <lpage>27</lpage>
          . https://doi.org/10.1145/2809936.2809940
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Fiona J.</given-names>
            <surname>Tweedie</surname>
          </string-name>
          and
          <string-name>
            <given-names>R. Harald</given-names>
            <surname>Baayen</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>How Variable May a Constant be? Measures of Lexical Richness in Perspective</article-title>
          .
          <source>Computers and the Humanities</source>
          <volume>32</volume>
          ,
          <issue>5</issue>
          (
          <year>1998</year>
          ),
          <fpage>323</fpage>
          -
          <lpage>352</lpage>
          . https://doi.org/10.1023/A:1001749303137
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Konstantin</given-names>
            <surname>Vorontsov</surname>
          </string-name>
          and
          <string-name>
            <given-names>Anna</given-names>
            <surname>Potapenko</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Tutorial on probabilistic topic modeling: Additive regularization for stochastic matrix factorization</article-title>
          .
          <source>International Conference on Analysis of Images, Social Networks and Texts</source>
          (
          <year>2014</year>
          ),
          <fpage>29</fpage>
          -
          <lpage>46</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>