=Paper=
{{Paper
|id=Vol-2930/paper11
|storemode=property
|title=Using Topic Modeling to Improve the Quality of Age-Based Text Classification
|pdfUrl=https://ceur-ws.org/Vol-2930/paper11.pdf
|volume=Vol-2930
|authors=Anna Glazkova
}}
==Using Topic Modeling to Improve the Quality of Age-Based Text Classification==
Anna Glazkova
University of Tyumen, 6, Volodarskogo street, Tyumen, 625003, Russia
Abstract
The prediction of the age audience of the text plays a crucial role in selecting information
suitable for children, book publishing, and editing. In this paper, we evaluate the impact of
document topic distribution vectors on the quality of age-based text classification. We
formulated this problem as a binary classification task and developed a topic-informed
machine learning classifier for resolving this problem. We compared three common topic
modeling techniques to obtain document topic distribution vectors, including Latent Dirichlet
Allocation, Gibbs Sampled Dirichlet Multinomial Mixture, and BERTopic. In most cases,
our topic-informed classifier achieved improvements on a dataset of Russian fiction abstracts
over baseline approaches.
Keywords
Topic model, text classification, LDA, GSDMM, BERTopic.
1. Introduction
Text difficulty assessment is one of the main tasks in computational linguistics and natural
language processing. The difficulty of a text is determined by the combination of all text aspects that
affect the reader’s understanding, reading speed, and level of interest in the text [1]. There is
evidence that the tools for text difficulty assessment play a crucial role in regulating children's access
to suitable information, selecting relevant literature, or automating some aspects of editorial and
publishing activities.
There is a large volume of published studies describing the role of various linguistic features in
determining the reading levels of text. The first serious discussions and analyses of text difficulty
emerged in the middle of the last century with the creation of readability indices [2-3]. In recent years,
researchers explored the impact of lexical [4-6], morphological [6-7], semantic [7], syntactic [6, 8-9],
and psycholinguistic [10-11] features on the quality of text difficulty assessment. The study [12]
presented a comparison of Russian book abstracts assigned to different age ratings using unsupervised
topic modeling. Another recent study [13] explores the problem of assessing the complexity of
Russian educational texts.
In this work, we evaluate the effectiveness of topic modeling features for age-based text
classification of Russian books. The age-based classification is a specific task of determining the text
difficulty. Its goal is to predict the estimated age audience of the text. We use the corpus of abstracts
of fiction books [14]. Each abstract has a reader age label: adult or children’s. We use these labels as
indicators of text difficulty. Further, we obtain document topic distribution vectors for abstracts using
three common topic modeling approaches: a) Latent Dirichlet Allocation (LDA); b) Gibbs
Sampled Dirichlet Multinomial Mixture (GSDMM); c) BERTopic, an algorithm for generating topics
using state-of-the-art embeddings. We evaluate the impact of topic modeling features on several
machine learning methods, including Logistic Regression (LR), Linear Support Vector Classifier
(LSVC), and Multilayer Perceptron neural network (MLP). In most cases, our topic-informed
classifiers outperform the baselines.
VI International Conference Information Technologies and High-Performance Computing (ITHPC-2021),
September 14–16, 2021, Khabarovsk, Russia
EMAIL: a.v.glazkova@utmn.ru
ORCID: 0000-0001-8409-6457
©️ 2020 Copyright for this paper by its authors.
Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)
The rest of the paper is structured as follows. In Section 2, we describe our methodology. Section
3 provides evaluation results. Section 4 concludes this paper. Section 5 contains acknowledgments.
2. Methods
We apply three machine learning classifiers based on Logistic Regression, Linear Support Vector
Classifier, and Multilayer Perceptron with lexical features. Lexical features were obtained only from
the 5000 top words ordered by term frequency across the corpus. We produced a sparse representation
of the word counts (the bag-of-words model) and used it as an input for each classifier. The text
preprocessing for the bag-of-words model consisted of four steps: a) removing special
characters and digits; b) converting to lowercase; c) lemmatization using pymorphy2 [15]; d)
removing stopwords and short words containing fewer than 3 characters. The methods were
implemented with classes from the Scikit-learn library [16] using the following parameters:
1. LR: “l2” penalization, tolerance for stopping criteria is 1e-5.
2. LSVC: “l2” penalization, tolerance for stopping criteria is 1e-5.
3. MLP: 2 hidden ReLU layers of 2000 and 1000 neurons respectively, the solver is the L-BFGS
method [17]. We trained the model for 10 epochs.
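The three configurations above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the original experimental code: the names `BASELINE_PARAMS` and `build_baselines` are hypothetical, `max_iter=10` is an assumed reading of "10 epochs", the L2 penalty of Logistic Regression is left at its scikit-learn default rather than passed explicitly, and every parameter not listed above keeps its library default.

```python
# Hypothetical parameter sets mirroring the list above; all other
# arguments are left at scikit-learn defaults (an assumption).
BASELINE_PARAMS = {
    # L2 is the default penalty of LogisticRegression, so only tol is set.
    "LR": {"tol": 1e-5},
    "LSVC": {"penalty": "l2", "tol": 1e-5},
    "MLP": {
        "hidden_layer_sizes": (2000, 1000),
        "activation": "relu",
        "solver": "lbfgs",
        "max_iter": 10,  # assumed reading of "10 epochs"
    },
}


def build_baselines(max_features=5000):
    """Return one bag-of-words pipeline per baseline classifier."""
    # Imported lazily so the parameter table can be inspected
    # even where scikit-learn is not installed.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    estimators = {"LR": LogisticRegression, "LSVC": LinearSVC, "MLP": MLPClassifier}
    return {
        name: make_pipeline(
            # Keep only the max_features most frequent terms of the corpus.
            CountVectorizer(max_features=max_features),
            estimators[name](**params),
        )
        for name, params in BASELINE_PARAMS.items()
    }
```

With scikit-learn available, `build_baselines()["LR"].fit(texts, labels)` would train the Logistic Regression baseline on a list of preprocessed text strings.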
The classifiers described above were used as baselines. Further, we obtained topic distribution
vectors for each document in the corpus. The document topic distribution vector describes how
strongly each topic is represented in the text, as inferred from its word frequencies. We concatenated
the topic distribution vector with the corresponding lexical vector (Figure 1) and evaluated the
benefits of topic-informed models. Document topic distribution vectors were obtained using three
common types of probabilistic topic models:
1. Latent Dirichlet Allocation [18]. LDA is a two-level Bayesian generative model, which
assumes that topic distributions over words and document distributions over topics are
generated from prior Dirichlet distributions [19]. In this work, the LDA topic model was
implemented using Gensim [20].
2. Gibbs Sampled Dirichlet Multinomial Mixture [21]. GSDMM is a short text clustering
model. This technique is essentially a modified LDA that assumes each document
covers only one topic, whereas LDA allows a document to have multiple topics.
3. BERTopic [22], which is a topic modeling technique that leverages transformers and c-TF-IDF
to create dense clusters. This approach performs three main steps: a) extracting
document embeddings using state-of-the-art language models; b) reducing the dimensionality
of the document embeddings with UMAP [23] and clustering them with HDBSCAN [24] to create
groups of similar documents; c) extracting topics by getting the most important words per
cluster with class-based TF-IDF (c-TF-IDF).
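To make step c) of BERTopic more concrete, the sketch below computes a class-based TF-IDF score in plain Python. It follows the formulation given in the BERTopic documentation, tf(t, c) * log(1 + A / f(t)), where tf(t, c) is the normalized frequency of term t in cluster c, f(t) is the frequency of t over all clusters, and A is the average number of words per cluster. The BERTopic version used in this work may differ in detail, and the function names `c_tf_idf` and `top_terms` are illustrative only.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score terms per cluster with a class-based TF-IDF.

    clusters maps a cluster id to the list of tokens of all documents
    assigned to that cluster (each cluster is treated as one long document).
    """
    counts = {cid: Counter(tokens) for cid, tokens in clusters.items()}
    # f(t): frequency of each term over all clusters.
    total = Counter()
    for cluster_counts in counts.values():
        total.update(cluster_counts)
    # A: average number of words per cluster.
    avg_words = sum(total.values()) / len(clusters)

    scores = {}
    for cid, cluster_counts in counts.items():
        n_words = sum(cluster_counts.values())
        scores[cid] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in cluster_counts.items()
        }
    return scores


def top_terms(scores, cluster_id, k=3):
    """Return the k highest-scoring terms of one cluster."""
    ranked = sorted(scores[cluster_id].items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:k]]
```

Terms that are frequent inside one cluster but rare elsewhere receive the highest scores, which is why they serve as that cluster's topic words.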
To preprocess texts for LDA and GSDMM, we first performed the four preprocessing steps
mentioned above and then built bigrams for collocated words with a total collected count of more than
5 and a threshold equal to 100. When applying the BERTopic technique, we used a multilingual
version of BERT (Bidirectional Encoder Representations from Transformers) [25] (see footnote 2) to
produce document embeddings.
Footnote 2: https://huggingface.co/bert-base-multilingual-cased
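The four preprocessing steps and the bigram construction can be sketched in plain Python as follows. This is an illustration under several assumptions: the stopword list is hypothetical, the lemmatizer is passed in as a callable (identity by default, standing in for a morphological analyzer), and the collocation score is a simplified count-based score of the kind used by phrase-detection tools, with `min_count` and `threshold` playing the roles of the values 5 and 100 given above. The exact tool and scoring function used in the experiments are not stated here.

```python
import re
from collections import Counter

# Hypothetical stopword list; the list used in the paper is not specified.
STOPWORDS = {"это", "как", "для", "что"}


def preprocess(text, lemmatize=lambda w: w, stopwords=STOPWORDS):
    """Apply the four preprocessing steps to one text and return tokens."""
    # a) replace special characters, digits and underscores with spaces
    text = re.sub(r"[^\w\s]|\d|_", " ", text)
    # b) convert to lowercase and split on whitespace
    tokens = text.lower().split()
    # c) lemmatize; the identity default stands in for a morphological analyzer
    tokens = [lemmatize(t) for t in tokens]
    # d) drop stopwords and words shorter than 3 characters
    return [t for t in tokens if len(t) >= 3 and t not in stopwords]


def find_bigrams(docs, min_count=5, threshold=100.0):
    """Return the set of adjacent word pairs scored above threshold."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(unigrams)
    selected = set()
    for (a, b), n_ab in bigrams.items():
        # Simplified collocation score (an assumption): pairs that co-occur
        # often relative to the frequencies of their parts score high.
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            selected.add((a, b))
    return selected


def merge_bigrams(tokens, selected):
    """Join selected adjacent pairs into single tokens such as 'new_york'."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in selected:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged
```

With pymorphy2 installed, a lemmatizer could be supplied as `lambda w: morph.parse(w)[0].normal_form`, where `morph = pymorphy2.MorphAnalyzer()`.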
Figure 1: Scheme of topic-informed model
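The scheme in Figure 1, in which a lexical vector is concatenated with a document topic distribution vector, can be sketched as follows. The helper names are hypothetical, plain Python lists stand in for the sparse matrices a real implementation would use, and the topic distribution is assumed to be supplied by a separately trained topic model.

```python
from collections import Counter


def build_vocabulary(token_docs, max_features=5000):
    """Return the max_features most frequent words of the corpus."""
    counts = Counter(t for tokens in token_docs for t in tokens)
    return [word for word, _ in counts.most_common(max_features)]


def lexical_vector(tokens, vocabulary):
    """Bag-of-words counts of one document over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def topic_informed_vector(tokens, vocabulary, topic_distribution):
    """Concatenate the lexical vector with the document's topic distribution."""
    return lexical_vector(tokens, vocabulary) + list(topic_distribution)
```

The resulting vector has one entry per vocabulary word plus one entry per topic, so a 5000-word vocabulary combined with 25 topics would give 5025 input features for the classifier.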
3. Experiments
In this section, we describe our experiments with baseline classifiers and topic-informed models.
3.1. Evaluation dataset
We conducted experiments on the corpus of abstracts of fiction books (see footnote 3), which is a part
of the Russian corpus for age-based text classification [14]. The corpus consists of annotated fiction
abstracts from online libraries. Table 1 presents the summary statistics for our data. The numbers of
tokens and sentences were computed using the NLTK tokenizer [26].
Table 1
Characteristics of data
Sample   Number of texts                          Avg length of texts (tokens)   Avg number of sentences
Train    4646 (Adult: 2688; Children’s: 1958)     106,38                         5,52
Test     800 (Adult: 189; Children’s: 611)        110,14                         5,66
3.2. Results
We performed model training on the training sample and tested our models on the test sample. We
computed recall (R), precision (P), and F1-scores (F), weighted by the number of true instances for
each label (weighted recall, precision, and F1-score). The results are shown in Table 2. The values in
brackets give the change in F1-score of each topic-informed model relative to the corresponding baseline. For
each classifier, we evaluated LDA and GSDMM topic models with a number of requested latent
topics equal to 25, 50, 75, and 100. We also estimated document topic vectors obtained by BERTopic
varying the minimum topic size from 2 to 10 in increments of 2.
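The weighted metrics described above can be sketched in plain Python as follows. The function name `weighted_scores` is hypothetical; the computation mirrors the usual definition of support-weighted averaging (the behavior of `average="weighted"` in scikit-learn), which is assumed here to be the one used for the reported scores. Under this definition the weighted F1-score is the weighted mean of the per-label F1-scores, so it need not lie between the weighted precision and recall.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return support-weighted precision, recall and F1 over all labels."""
    support = Counter(y_true)  # number of true instances per label
    n = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_l = tp / predicted if predicted else 0.0
        r_l = tp / count
        f_l = 2 * p_l * r_l / (p_l + r_l) if p_l + r_l else 0.0
        # Each label contributes in proportion to its share of true instances.
        weight = count / n
        precision += weight * p_l
        recall += weight * r_l
        f1 += weight * f_l
    return precision, recall, f1
```

For example, `weighted_scores(["adult", "adult", "adult", "children"], ["adult", "adult", "children", "children"])` weights the "adult" label by 0.75 and the "children" label by 0.25.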
As can be seen from the table below, the classification results mainly indicate the advantage of
topic-informed machine learning classifiers. The best result was obtained by the MLP classifier using
BERTopic vectors with minimum topic sizes equal to 8 and 10. Moreover, the topic-informed
Logistic Regression and MLP classifiers both achieved their best results using BERTopic document
topics. In most cases, the classifiers also benefit from GSDMM topics. The LSVC classifier showed
its best result using 100-dimensional GSDMM topic vectors. For our data, we did not identify a clear
benefit of LDA topics for the LR and LSVC classifiers.
Footnote 3: https://www.kaggle.com/oldaandozerskaya/fiction-corpus-for-agebased-text-classification
Table 2
Results for our topic-informed models and baselines, %
Method Topic model (n = minimum topic size) F (change vs. baseline) P R
LR - 77,44 86,07 75,63
LR LDA, 25 topics 77,33 (-0,11) 86,04 75,55
LR LDA, 50 topics 76,75 (-0,69) 85,69 74,88
LR LDA, 75 topics 77,33 (-0,11) 86,04 75,54
LR LDA, 100 topics 77,68 (+0,24) 86,48 75,88
LR GSDMM, 25 topics 77,67 (+0,23) 86,31 75,88
LR GSDMM, 50 topics 78,57 (+1,13) 86,29 76,88
LR GSDMM, 75 topics 77,43 (-0,01) 85,74 75,63
LR GSDMM, 100 topics 78,12 (+0,68) 86,3 76,38
LR BERTopic, n=2 78,24 (+0,8) 86,33 76,5
LR BERTopic, n=4 78,01 (+0,57) 86,26 76,25
LR BERTopic, n=6 78,58 (+1,14) 86,61 76,88
LR BERTopic, n=8 79,37 (+1,93) 86,72 77,75
LR BERTopic, n=10 79,59 (+2,15) 86,64 78
LSVC - 78,14 86,79 76,38
LSVC LDA, 25 topics 78,82 (+0,68) 87,01 77,13
LSVC LDA, 50 topics 78,02 (-0,12) 86,76 76,27
LSVC LDA, 75 topics 78,93 (+0,79) 87,05 77,25
LSVC LDA, 100 topics 77,91 (-0,23) 86,72 76,13
LSVC GSDMM, 25 topics 79,49 (+1,35) 86,76 77,88
LSVC GSDMM, 50 topics 79,26 (+1,12) 86,84 77,63
LSVC GSDMM, 75 topics 78,92 (+0,78) 86,56 77,25
LSVC GSDMM, 100 topics 79,61 (+1,47) 87,11 78
LSVC BERTopic, n=2 78,36 (+0,22) 86,86 76,63
LSVC BERTopic, n=4 78,25 (+0,11) 86,66 76,56
LSVC BERTopic, n=6 78,82 (+0,68) 86,85 77,13
LSVC BERTopic, n=8 78,82 (+0,68) 86,85 77,13
LSVC BERTopic, n=10 78,82 (+0,68) 86,85 77,13
MLP - 79,05 87,08 77,38
MLP LDA, 25 topics 79,61 (+0,56) 87,27 78
MLP LDA, 50 topics 80,06 (+1,01) 87,11 78,5
MLP LDA, 75 topics 80,17 (+1,12) 87,31 78,63
MLP LDA, 100 topics 79,39 (+0,34) 87,2 77,75
MLP GSDMM, 25 topics 80,5 (+1,45) 87,12 79
MLP GSDMM, 50 topics 79,26 (+0,21) 86,84 77,63
MLP GSDMM, 75 topics 79,6 (+0,55) 86,8 78
MLP GSDMM, 100 topics 79,72 (+0,67) 87,15 78,13
MLP BERTopic, n=2 79,71 (+0,66) 86,84 78,13
MLP BERTopic, n=4 80,17 (+1,12) 87,31 78,63
MLP BERTopic, n=6 79,95 (+0,9) 87,39 78,38
MLP BERTopic, n=8 80,84 (+1,79) 87,25 79,38
MLP BERTopic, n=10 80,84 (+1,79) 87,25 79,38
4. Conclusion
In this paper, we have focused on the age-based classification task. We have explored Logistic
Regression, Linear Support Vector Classifier, and Multilayer Perceptron classifiers with a set of
document topic features obtained using LDA, GSDMM, and BERTopic topic modeling techniques.
We tested our baselines and topic-informed classifiers on the corpus of fiction abstracts to predict the
age of readers.
In most configurations, topic-informed models outperformed the corresponding baselines. BERTopic and
GSDMM document topics gave the largest improvements for age-based classification. We also
showed that LDA topics do not substantially improve the results of the LR and LSVC
classifiers on our dataset. A possible explanation is that LDA topic models are designed for
longer texts. Therefore, in further work, we plan to evaluate the impact of topic modeling
features on a corpus of fiction texts that are much longer and thematically more diverse than book abstracts.
5. Acknowledgements
This study is supported by the grant of the President of the Russian Federation no. MK-637.2020.9.
6. References
[1] Chen, Xiaobin, and Detmar Meurers. "Characterizing text difficulty with word frequencies."
Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational
Applications (2016): 84-94.
[2] R. Flesch, A new readability yardstick. Journal of applied psychology 32(3) (1948) 221.
[3] E. Dale, J. S. Chall, The concept of readability. Elementary English 26(1) (1949) 19-26.
[4] Heilman, Michael, et al. "Combining lexical and grammatical features to improve readability
measures for first and second language texts." Human Language Technologies 2007: The
Conference of the North American Chapter of the Association for Computational Linguistics;
Proceedings of the Main Conference (2007): 460-467.
[5] Mukherjee, Partha, Gondy Leroy, and David Kauchak. "Using Lexical Chains to Identify Text
Difficulty: A Corpus Statistics and Classification Study." IEEE journal of biomedical and health
informatics 23.5 (2018): 2164-2173.
[6] Hancke, Julia, Sowmya Vajjala, and Detmar Meurers. "Readability classification for German
using lexical, syntactic, and morphological features." Proceedings of COLING 2012 (2012):
1063-1080.
[7] Salesky, Elizabeth, and Wade Shen. "Exploiting morphological, grammatical, and semantic
correlates for improved text difficulty assessment." Proceedings of the Ninth Workshop on
Innovative Use of NLP for Building Educational Applications (2014): 155-162.
[8] Sheehan, Kathleen M., Irene Kostin, and Yoko Futagi. "When do standard approaches for
measuring vocabulary difficulty, syntactic complexity and referential cohesion yield biased
estimates of text difficulty." Proceedings of the 30th Annual Conference of the Cognitive
Science Society, Washington DC (2008): 1978-1983.
[9] Poulsen, Mads, and Amalie KD Gravgaard. "Who did what to whom? The relationship between
syntactic aspects of sentence comprehension and text comprehension." Scientific Studies of
Reading 20.4 (2016): 325-338.
[10] Crossley, Scott A., Hae Sung Yang, and Danielle S. McNamara. "What's so Simple about
Simplified Texts? A Computational and Psycholinguistic Investigation of Text Comprehension
and Text Processing." Reading in a Foreign Language 26.1 (2014): 92-113.
[11] Howcroft, David M., and Vera Demberg. "Psycholinguistic models of sentence processing
improve sentence readability ranking." Proceedings of the 15th Conference of the European
Chapter of the Association for Computational Linguistics: Volume 1, Long Papers (2017): 958-968.
[12] Glazkova, Anna. "Exploring Book Themes in the Russian Age Rating System: a Topic Modeling
Approach." Data Analytics and Management in Data Intensive Domains (DAMDID/RCDL
2020) (2020): 304-314.
[13] Sakhovskiy, Andrey, Valery Solovyev, and Marina Solnyshkina. "Topic Modeling for
Assessment of Text Complexity in Russian Textbooks." 2020 Ivannikov Ispras Open Conference
(ISPRAS). IEEE (2020): 102-108.
[14] Glazkova, Anna, Yury Egorov, and Maksim Glazkov. "A Comparative Study of Feature Types
for Age-Based Text Classification." Analysis of Images, Social Networks and Texts. Lecture
Notes in Computer Science (2020): 120-134.
[15] Korobov, Mikhail. "Morphological analyzer and generator for Russian and Ukrainian
languages." International Conference on Analysis of Images, Social Networks and Texts.
Springer, Cham (2015): 320-332.
[16] Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." the Journal of machine
Learning research 12 (2011): 2825-2830.
[17] Zhu, Ciyou, et al. "Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale
bound-constrained optimization." ACM Transactions on Mathematical Software (TOMS) 23.4 (1997):
550-560.
[18] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent dirichlet allocation." the Journal
of machine Learning research 3 (2003): 993-1022.
[19] Vorontsov, Konstantin, and Anna Potapenko. "Tutorial on probabilistic topic modeling: Additive
regularization for stochastic matrix factorization." International Conference on Analysis of
Images, Social Networks and Texts. Springer, Cham (2014): 29-46.
[20] Řehůřek, Radim, and Petr Sojka. "Gensim—statistical semantics in python." Retrieved from
gensim.org (2011).
[21] Yin, Jianhua, and Jianyong Wang. "A dirichlet multinomial mixture model-based approach for
short text clustering." Proceedings of the 20th ACM SIGKDD international conference on
Knowledge discovery and data mining (2014): 233-242.
[22] M. Grootendorst, BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable
topics, 2020. URL: https://doi.org/10.5281/zenodo.4381785.
[23] McInnes, Leland, et al. "UMAP: Uniform Manifold Approximation and Projection." Journal of
Open Source Software 3.29 (2018): 861.
[24] McInnes, Leland, John Healy, and Steve Astels. "hdbscan: Hierarchical density based
clustering." Journal of Open Source Software 2.11 (2017): 205.
[25] Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language
understanding." arXiv preprint arXiv:1810.04805 (2018).
[26] Loper, Edward, and Steven Bird. "NLTK: the Natural Language Toolkit." Proceedings of the
ACL-02 Workshop on Effective tools and methodologies for teaching natural language
processing and computational linguistics-Volume 1 (2002): 63-70.