Using Topic Modeling to Improve the Quality of Age-Based Text
Classification
Anna Glazkova
University of Tyumen, 6, Volodarskogo street, Tyumen, 625003, Russia


                 Abstract
                 Predicting the age audience of a text plays a crucial role in selecting information
                 suitable for children, in book publishing, and in editing. In this paper, we evaluate the
                 impact of document topic distribution vectors on the quality of age-based text classification.
                 We formulated this problem as a binary classification task and developed a topic-informed
                 machine learning classifier to solve it. We compared three common topic
                 modeling techniques for obtaining document topic distribution vectors: Latent Dirichlet
                 Allocation, Gibbs Sampled Dirichlet Multinomial Mixture, and BERTopic. In most cases,
                 our topic-informed classifier achieved improvements on a dataset of Russian fiction abstracts
                 over baseline approaches.

                 Keywords
                 Topic model, text classification, LDA, GSDMM, BERTopic.

1. Introduction
    Text difficulty assessment is one of the main tasks in computational linguistics and natural
language processing. The difficulty of a text is determined by the combination of all text aspects that
affect the reader’s understanding, reading speed, and level of interest in the text [1]. There is
evidence that the tools for text difficulty assessment play a crucial role in regulating children's access
to suitable information, selecting relevant literature, or automating some aspects of editorial and
publishing activities.
    There is a large volume of published studies describing the role of various linguistic features in
determining the reading levels of text. The first serious discussions and analyses of text difficulty
emerged in the middle of the last century with the creation of readability indices [2-3]. In recent years,
researchers have explored the impact of lexical [4-6], morphological [6-7], semantic [7], syntactic [6, 8-9],
and psycholinguistic [10-11] features on the quality of text difficulty assessment. The study [12]
presented a comparison of Russian book abstracts assigned to different age ratings using unsupervised
topic modeling. Another recent study [13] explored the problem of assessing the complexity of
Russian educational texts.
    In this work, we evaluate the effectiveness of topic modeling features for age-based text
classification of Russian books. Age-based classification is a specific case of text difficulty
assessment. Its goal is to predict the age audience of the text. We use the corpus of abstracts
of fiction books [14]. Each abstract has a reader age label: adult or children’s. We use these labels as
indicators of text difficulty. Further, we obtain document topic distribution vectors for abstracts using
three common topic modeling approaches: a) Latent Dirichlet Allocation (LDA); b) Gibbs
Sampled Dirichlet Multinomial Mixture (GSDMM); c) BERTopic, an algorithm for generating topics
using state-of-the-art embeddings. We evaluate the impact of topic modeling features on several
machine learning methods, including Logistic Regression (LR), Linear Support Vector Classifier
(LSVC), and Multilayer Perceptron neural network (MLP). In most cases, our topic-informed
classifiers outperform the baselines.
   The rest of the paper is structured as follows. In Section 2, we describe our methodology. Section
3 provides evaluation results. Section 4 concludes this paper. Section 5 contains acknowledgments.

VI International Conference Information Technologies and High-Performance Computing (ITHPC-2021),
September 14–16, 2021, Khabarovsk, Russia
EMAIL: a.v.glazkova@utmn.ru
ORCID: 0000-0001-8409-6457
©️ 2020 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

2. Methods
   We apply three machine learning classifiers based on Logistic Regression, Linear Support Vector
Classifier, and Multilayer Perceptron with lexical features. Lexical features were obtained only from
the 5000 top words ordered by term frequency across the corpus. We produced a sparse representation
of the word counts (the bag-of-words model) and used it as an input for each classifier. The text
preprocessing for the bag-of-words model consisted of four steps: a) removing special
characters and digits; b) converting to lowercase; c) lemmatization using pymorphy2 [15]; d)
removing stopwords and short words containing fewer than 3 characters. The methods were
implemented with classes from the Scikit-learn library [16] using the following parameters:

            1. LR: “l2” penalization, tolerance for stopping criteria is 1e-5.
            2. LSVC: “l2” penalization, tolerance for stopping criteria is 1e-5.
            3. MLP: 2 hidden ReLU layers of 2000 and 1000 neurons respectively, the solver is the
               L-BFGS method [17]. We trained the model for 10 epochs.
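
   The four preprocessing steps and the top-word bag-of-words representation described above can be
sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the
author's code: the lemmatize function is an identity placeholder standing in for pymorphy2, the
stopword list and sample sentences are invented, and all function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy stopword list; the paper does not say which list was used.
STOPWORDS = {"and", "the", "for"}


def lemmatize(token):
    """Identity placeholder for a morphological analyzer such as pymorphy2."""
    return token


def preprocess(text):
    """Apply the four steps: clean, lowercase, lemmatize, filter."""
    # a) replace special characters, digits, and underscores with spaces
    text = re.sub(r"[\W\d_]+", " ", text)
    # b) convert to lowercase and split on whitespace
    tokens = text.lower().split()
    # c) lemmatize every token
    tokens = [lemmatize(t) for t in tokens]
    # d) drop stopwords and words shorter than 3 characters
    return [t for t in tokens if t not in STOPWORDS and len(t) >= 3]


def build_vocabulary(tokenized_docs, max_features=5000):
    """Keep the max_features most frequent terms across the corpus."""
    counts = Counter(t for doc in tokenized_docs for t in doc)
    # Sort by descending frequency; ties are broken alphabetically.
    top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:max_features]
    return {term: idx for idx, (term, _) in enumerate(top)}


def bag_of_words(tokens, vocabulary):
    """Count occurrences of each vocabulary term in one document."""
    vector = [0] * len(vocabulary)
    for t in tokens:
        idx = vocabulary.get(t)
        if idx is not None:
            vector[idx] += 1
    return vector


docs = ["The dragon and the 3 brave knights!", "A brave dragon; a kind dragon."]
tokenized = [preprocess(d) for d in docs]
vocab = build_vocabulary(tokenized, max_features=3)
vectors = [bag_of_words(t, vocab) for t in tokenized]
print(vocab)
print(vectors)
```

   In a Scikit-learn setup, the parameters listed above would presumably correspond to
LogisticRegression(penalty="l2", tol=1e-5), LinearSVC(penalty="l2", tol=1e-5), and
MLPClassifier(hidden_layer_sizes=(2000, 1000), activation="relu", solver="lbfgs", max_iter=10).
This mapping is an assumption, since the paper names only the library and the parameter values.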

   The classifiers described above were used as baselines. Further, we obtained topic distribution
vectors for each document in the corpus. The document topic distribution vector describes how
strongly each topic is represented in the text, as estimated from the words the text contains. We
concatenated the topic distribution vector with the corresponding lexical vector (Figure 1) and
evaluated the benefits of topic-informed models. Document topic distribution vectors were obtained
using three common topic modeling techniques:

             1. Latent Dirichlet Allocation [18]. LDA is a two-level Bayesian generative model, which
                assumes that topic distributions over words and document distributions over topics are
                generated from prior Dirichlet distributions [19]. In this work, the LDA topic model was
                implemented using Gensim [20].
             2. Gibbs Sampled Dirichlet Multinomial Mixture [21]. GSDMM is a short text clustering
                model. This technique is essentially a modified LDA that assumes each document
                covers only one topic, whereas LDA assumes that a document can
                have multiple topics.
             3. BERTopic [22], which is a topic modeling technique that leverages transformers and
                c-TF-IDF to create dense clusters. This approach performs three main steps: a) extracting
                document embeddings using state-of-the-art language models; b) clustering document
                embeddings to create groups of similar documents with UMAP [23] and HDBSCAN [24]
                algorithms; c) extracting topics by getting the most important words per cluster with
                class-based TF-IDF (c-TF-IDF).
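
   The concatenation of lexical and topic vectors described before the list can be illustrated with a
short sketch. The function name and the toy values are hypothetical; in the paper's setting the lexical
part would have up to 5000 dimensions and the topic part as many dimensions as the topic model
provides (for example, 5000 + 100 = 5100 inputs for a 100-topic model).

```python
def concatenate_features(lexical_vectors, topic_vectors):
    """Append each document's topic vector to its bag-of-words vector."""
    if len(lexical_vectors) != len(topic_vectors):
        raise ValueError("Both inputs must describe the same documents")
    return [list(lex) + list(top)
            for lex, top in zip(lexical_vectors, topic_vectors)]


# Toy data: two documents, a 5-word vocabulary, and 3 topics.
lexical = [[1, 0, 2, 0, 1],
           [0, 3, 0, 1, 0]]
topics = [[0.7, 0.2, 0.1],
          [0.1, 0.1, 0.8]]

features = concatenate_features(lexical, topics)
print(features[0])
```

   The resulting rows can be passed to any classifier that accepts fixed-length numeric input, so the
baselines and the topic-informed models differ only in the width of the feature matrix.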

   To preprocess texts for LDA and GSDMM, we first performed the four preprocessing steps
mentioned above and then built bigrams for collocated words with a total collected count of more than
5 and a threshold equal to 100. When applying the BERTopic technique, we used a multilingual
version of BERT (Bidirectional Encoder Representations from Transformers) [25] (see footnote 2) to
produce document embeddings.
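
   The bigram step can be sketched as follows. The paper does not name the tool or the scoring
formula, so this sketch assumes one common collocation score,
(count(a, b) - min_count) * vocabulary_size / (count(a) * count(b)), and keeps a bigram when its score
exceeds the threshold. The function names, the formula choice, and the toy corpus are assumptions
made for illustration only.

```python
from collections import Counter


def find_bigrams(tokenized_docs, min_count=5, threshold=100.0):
    """Score adjacent word pairs and keep those above the threshold."""
    unigrams = Counter()
    bigrams = Counter()
    for doc in tokenized_docs:
        unigrams.update(doc)
        bigrams.update(zip(doc, doc[1:]))
    vocabulary_size = len(unigrams)
    selected = {}
    for (a, b), pair_count in bigrams.items():
        # Assumed scoring formula; pairs seen min_count times or fewer
        # get a non-positive score and are never selected.
        score = ((pair_count - min_count) * vocabulary_size
                 / (unigrams[a] * unigrams[b]))
        if score > threshold:
            selected[(a, b)] = score
    return selected


def merge_bigrams(tokens, selected, delimiter="_"):
    """Replace each selected adjacent pair with a single joined token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in selected:
            merged.append(tokens[i] + delimiter + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy corpus; small parameter values are used so that one pair qualifies.
docs = [["harry", "potter", "book"],
        ["harry", "potter", "story"],
        ["harry", "potter", "tale"],
        ["old", "book", "story"]]
selected = find_bigrams(docs, min_count=1, threshold=1.0)
merged = merge_bigrams(docs[0], selected)
print(selected)
print(merged)
```

   With the values reported in the paper (a count above 5 and a threshold of 100), only frequent and
strongly associated pairs would be merged, so the toy corpus above yields no bigrams under those
settings.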


2 https://huggingface.co/bert-base-multilingual-cased

Figure 1: Scheme of topic-informed model

3. Experiments
      In this section, we describe our experiments with baseline classifiers and topic-informed models.

3.1. Evaluation dataset
   We conducted experiments on the corpus of abstracts of fiction books (see footnote 3), which is a
part of the Russian corpus for age-based text classification [14]. The corpus consists of annotated
fiction abstracts from online libraries. Table 1 presents the summary statistics for our data. The
numbers of tokens and sentences were computed using the NLTK tokenizer [26].

Table 1
Characteristics of data
Sample   Number of texts                          Avg length of texts (tokens)   Avg number of sentences
Train    4646 (Adult: 2688, Children’s: 1958)     106.38                         5.52
Test     800 (Adult: 189, Children’s: 611)        110.14                         5.66

3.2. Results
    We performed model training on the training sample and tested our models on the test sample. We
computed recall (R), precision (P), and F1-scores (F), weighted by the number of true instances for
each label (weighted recall, precision, and F1-score). The results are shown in Table 2. Values in
brackets show the change in F1-score of each topic-informed model relative to the corresponding
baseline. For each classifier, we evaluated LDA and GSDMM topic models with the number of
requested latent topics equal to 25, 50, 75, and 100. We also evaluated document topic vectors
obtained by BERTopic, varying the minimum topic size from 2 to 10 in increments of 2.
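
   The support-weighted metrics can be illustrated with a small worked example. The sketch below
assumes the usual convention in which precision, recall, and F1 are computed per label and then
averaged with weights proportional to the number of true instances of each label. The function name
and the toy labels are hypothetical.

```python
def weighted_scores(y_true, y_pred):
    """Return support-weighted precision, recall, and F1."""
    total = len(y_true)
    precision_w = recall_w = f1_w = 0.0
    for label in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        # The weight is the share of true instances carrying this label.
        weight = (tp + fn) / total
        precision_w += weight * precision
        recall_w += weight * recall
        f1_w += weight * f1
    return precision_w, recall_w, f1_w


# Toy test set: 2 adult and 6 children's texts; two children's texts
# are misclassified as adult.
y_true = ["adult"] * 2 + ["children"] * 6
y_pred = ["adult"] * 2 + ["children"] * 4 + ["adult"] * 2

p, r, f = weighted_scores(y_true, y_pred)
print(round(p, 4), round(r, 4), round(f, 4))
```

   Under this convention, weighted recall equals plain accuracy, and the weighted F1-score is an
average of per-label F1 values rather than the harmonic mean of weighted precision and weighted
recall. This is why F can be lower than both values suggest on an imbalanced test set.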
    As Table 2 shows, topic-informed machine learning classifiers outperform the corresponding
baselines in most configurations. The best result was obtained by the MLP classifier using
BERTopic vectors with minimum topic sizes equal to 8 and 10. Moreover, the topic-informed
Logistic Regression and MLP classifiers both achieved their best results using BERTopic document
topics. In most cases, the classifiers also benefit from GSDMM topics. The LSVC classifier showed
its best result using 100-dimensional GSDMM topic vectors. For our data, we did not identify a clear
benefit of LDA topics for the LR and LSVC classifiers.


3 https://www.kaggle.com/oldaandozerskaya/fiction-corpus-for-agebased-text-classification

Table 2
Results for our topic-informed models and baselines, % (n is the BERTopic minimum topic size;
the change in F relative to the baseline is given in brackets)
Method   Topic model          F               P       R
LR       -                    77.44           86.07   75.63
LR       LDA, 25 topics       77.33 (-0.11)   86.04   75.55
LR       LDA, 50 topics       76.75 (-0.69)   85.69   74.88
LR       LDA, 75 topics       77.33 (-0.11)   86.04   75.54
LR       LDA, 100 topics      77.68 (+0.24)   86.48   75.88
LR       GSDMM, 25 topics     77.67 (+0.23)   86.31   75.88
LR       GSDMM, 50 topics     78.57 (+1.13)   86.29   76.88
LR       GSDMM, 75 topics     77.43 (-0.01)   85.74   75.63
LR       GSDMM, 100 topics    78.12 (+0.68)   86.30   76.38
LR       BERTopic, n=2        78.24 (+0.80)   86.33   76.50
LR       BERTopic, n=4        78.01 (+0.57)   86.26   76.25
LR       BERTopic, n=6        78.58 (+1.14)   86.61   76.88
LR       BERTopic, n=8        79.37 (+1.93)   86.72   77.75
LR       BERTopic, n=10       79.59 (+2.15)   86.64   78.00
LSVC     -                    78.14           86.79   76.38
LSVC     LDA, 25 topics       78.82 (+0.68)   87.01   77.13
LSVC     LDA, 50 topics       78.02 (-0.12)   86.76   76.27
LSVC     LDA, 75 topics       78.93 (+0.79)   87.05   77.25
LSVC     LDA, 100 topics      77.91 (-0.23)   86.72   76.13
LSVC     GSDMM, 25 topics     79.49 (+1.35)   86.76   77.88
LSVC     GSDMM, 50 topics     79.26 (+1.12)   86.84   77.63
LSVC     GSDMM, 75 topics     78.92 (+0.78)   86.56   77.25
LSVC     GSDMM, 100 topics    79.61 (+1.47)   87.11   78.00
LSVC     BERTopic, n=2        78.36 (+0.22)   86.86   76.63
LSVC     BERTopic, n=4        78.25 (+0.11)   86.66   76.56
LSVC     BERTopic, n=6        78.82 (+0.68)   86.85   77.13
LSVC     BERTopic, n=8        78.82 (+0.68)   86.85   77.13
LSVC     BERTopic, n=10       78.82 (+0.68)   86.85   77.13
MLP      -                    79.05           87.08   77.38
MLP      LDA, 25 topics       79.61 (+0.56)   87.27   78.00
MLP      LDA, 50 topics       80.06 (+1.01)   87.11   78.50
MLP      LDA, 75 topics       80.17 (+1.12)   87.31   78.63
MLP      LDA, 100 topics      79.39 (+0.34)   87.20   77.75
MLP      GSDMM, 25 topics     80.50 (+1.45)   87.12   79.00
MLP      GSDMM, 50 topics     79.26 (+0.21)   86.84   77.63
MLP      GSDMM, 75 topics     79.60 (+0.55)   86.80   78.00
MLP      GSDMM, 100 topics    79.72 (+0.67)   87.15   78.13
MLP      BERTopic, n=2        79.71 (+0.66)   86.84   78.13
MLP      BERTopic, n=4        80.17 (+1.12)   87.31   78.63
MLP      BERTopic, n=6        79.95 (+0.90)   87.39   78.38
MLP      BERTopic, n=8        80.84 (+1.79)   87.25   79.38
MLP      BERTopic, n=10       80.84 (+1.79)   87.25   79.38

4. Conclusion
  In this paper, we have focused on the age-based classification task. We have explored Logistic
Regression, Linear Support Vector Classifier, and Multilayer Perceptron classifiers with a set of
document topic features obtained using LDA, GSDMM, and BERTopic topic modeling techniques.
We tested our baselines and topic-informed classifiers on the corpus of fiction abstracts to predict the
age audience of each text.
   In most configurations, topic-informed models outperformed the baselines. BERTopic and
GSDMM document topics gave the largest improvements for age-based classification. We also
showed that LDA topics do not clearly improve the results of the LR and LSVC
classifiers on our dataset. A possible explanation is that LDA topic models are aimed at working
with longer texts. Therefore, in further work, we plan to evaluate the impact of topic modeling
features on a corpus of fiction texts that are much longer and cover more themes than book abstracts.

5. Acknowledgements
   This study is supported by the grant of the President of the Russian Federation no.
MK-637.2020.9.

6. References
[1] Chen, Xiaobin, and Detmar Meurers. "Characterizing text difficulty with word frequencies."
     Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational
     Applications (2016): 84-94.
[2] Flesch, Rudolf. "A new readability yardstick." Journal of Applied Psychology 32.3 (1948): 221.
[3] Dale, Edgar, and Jeanne S. Chall. "The concept of readability." Elementary English 26.1 (1949):
     19-26.
[4] Heilman, Michael, et al. "Combining lexical and grammatical features to improve readability
     measures for first and second language texts." Human Language Technologies 2007: The
     Conference of the North American Chapter of the Association for Computational Linguistics;
     Proceedings of the Main Conference (2007): 460-467.
[5] Mukherjee, Partha, Gondy Leroy, and David Kauchak. "Using Lexical Chains to Identify Text
     Difficulty: A Corpus Statistics and Classification Study." IEEE journal of biomedical and health
     informatics 23.5 (2018): 2164-2173.
[6] Hancke, Julia, Sowmya Vajjala, and Detmar Meurers. "Readability classification for German
     using lexical, syntactic, and morphological features." Proceedings of COLING 2012 (2012):
     1063-1080.
[7] Salesky, Elizabeth, and Wade Shen. "Exploiting morphological, grammatical, and semantic
     correlates for improved text difficulty assessment." Proceedings of the Ninth Workshop on
     Innovative Use of NLP for Building Educational Applications (2014): 155-162.
[8] Sheehan, Kathleen M., Irene Kostin, and Yoko Futagi. "When do standard approaches for
     measuring vocabulary difficulty, syntactic complexity and referential cohesion yield biased
     estimates of text difficulty." Proceedings of the 30th Annual Conference of the Cognitive
     Science Society, Washington DC (2008): 1978-1983.
[9] Poulsen, Mads, and Amalie KD Gravgaard. "Who did what to whom? The relationship between
     syntactic aspects of sentence comprehension and text comprehension." Scientific Studies of
     Reading 20.4 (2016): 325-338.
[10] Crossley, Scott A., Hae Sung Yang, and Danielle S. McNamara. "What's so Simple about
     Simplified Texts? A Computational and Psycholinguistic Investigation of Text Comprehension
     and Text Processing." Reading in a Foreign Language 26.1 (2014): 92-113.
[11] Howcroft, David M., and Vera Demberg. "Psycholinguistic models of sentence processing
     improve sentence readability ranking." Proceedings of the 15th Conference of the European
     Chapter of the Association for Computational Linguistics: Volume 1, Long Papers (2017): 958-
     968.
[12] Glazkova, Anna. "Exploring Book Themes in the Russian Age Rating System: a Topic Modeling
     Approach." Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL
     2020) (2020): 304-314.


[13] Sakhovskiy, Andrey, Valery Solovyev, and Marina Solnyshkina. "Topic Modeling for
     Assessment of Text Complexity in Russian Textbooks." 2020 Ivannikov Ispras Open Conference
     (ISPRAS). IEEE (2020): 102-108.
[14] Glazkova, Anna, Yury Egorov, and Maksim Glazkov. "A Comparative Study of Feature Types
     for Age-Based Text Classification." Analysis of Images, Social Networks and Texts. Lecture
     Notes in Computer Science (2020): 120-134.
[15] Korobov, Mikhail. "Morphological analyzer and generator for Russian and Ukrainian
     languages." International Conference on Analysis of Images, Social Networks and Texts.
     Springer, Cham (2015): 320-332.
[16] Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." Journal of Machine
     Learning Research 12 (2011): 2825-2830.
[17] Zhu, Ciyou, et al. "Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-
     constrained optimization." ACM Transactions on Mathematical Software (TOMS) 23.4 (1997):
     550-560.
[18] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal
     of Machine Learning Research 3 (2003): 993-1022.
[19] Vorontsov, Konstantin, and Anna Potapenko. "Tutorial on probabilistic topic modeling: Additive
     regularization for stochastic matrix factorization." International Conference on Analysis of
     Images, Social Networks and Texts. Springer, Cham (2014): 29-46.
[20] Řehůřek, Radim, and Petr Sojka. "Gensim—statistical semantics in python." Retrieved from
     genism. org (2011).
[21] Yin, Jianhua, and Jianyong Wang. "A dirichlet multinomial mixture model-based approach for
     short text clustering." Proceedings of the 20th ACM SIGKDD international conference on
     Knowledge discovery and data mining (2014): 233-242.
[22] Grootendorst, Maarten. "BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable
     topics." (2020). URL: https://doi.org/10.5281/zenodo.4381785.
[23] McInnes, Leland, et al. "UMAP: Uniform Manifold Approximation and Projection." Journal of
     Open Source Software 3.29 (2018): 861.
[24] McInnes, Leland, John Healy, and Steve Astels. "hdbscan: Hierarchical density based
     clustering." Journal of Open Source Software 2.11 (2017): 205.
[25] Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language
     understanding." arXiv preprint arXiv:1810.04805 (2018).
[26] Loper, Edward, and Steven Bird. "NLTK: the Natural Language Toolkit." Proceedings of the
     ACL-02 Workshop on Effective tools and methodologies for teaching natural language
     processing and computational linguistics-Volume 1 (2002): 63-70.

