=Paper=
{{Paper
|id=Vol-2790/paper27
|storemode=property
|title=
Exploring Book Themes in the Russian Age Rating System: a Topic Modeling Approach
|pdfUrl=https://ceur-ws.org/Vol-2790/paper27.pdf
|volume=Vol-2790
|authors=Anna Glazkova
|dblpUrl=https://dblp.org/rec/conf/rcdl/Glazkova20
}}
==
Exploring Book Themes in the Russian Age Rating System: a Topic Modeling Approach
==
Exploring Book Themes in the Russian Age
Rating System: a Topic Modeling Approach*
Anna Glazkova [0000-0001-8409-6457]
University of Tyumen, Tyumen 625003, Russia
a.v.glazkova@utmn.ru
Abstract. Age rating systems are created to indicate target ages of
potential content users based on information security and text semantics.
Age ratings are usually given as numbers, which tell us the youngest
age the content is suitable for. A book or film with a 12+ rating has
content which is suitable only for people aged 12 years and over, and a
book or film with an 18+ rating is suitable for adults only. Currently,
content assessment in terms of information security is carried out by
experts. In this paper, we empirically compare book abstracts assigned
to different age ratings using unsupervised topic modeling. We use an
LDA model to discover topics from a collection of book abstracts. We
then use statistical methods to study relations between the age rating
categories assigned to books by experts and the topics obtained. We
believe that our comparisons show interesting and useful findings for age
rating automation.
Keywords: Topic modeling • Age restrictions • Age rating • Text
classification • Statistical methods.
1 Introduction
Age-based ratings serve as a warning that the content may be unsuitable to
children. Moreover, age ratings are used to ensure that entertainment content,
such as books, but also films, games or mobile apps, is clearly labelled with a
minimum age recommendation.
While some books are suitable for readers of all ages, others are only
suitable for older children and young teenagers. A specific portion of books contain
information that is only appropriate for an adult audience.
Age-based rating systems in different countries differ. Whereas the
classification systems in Russia, Europe and Germany are based purely on age, the rating
systems in the USA and Australia might be interpreted with consideration of
factors other than age. For example, in Australia there are two different 18+
ratings applied to either adult content or pornographic materials [6].
The Russian Age Rating System (RARS) includes 5 categories of content:
for children under the age of six (0+);
for children over the age of six (6+);
for children over the age of twelve (12+);
for children over the age of sixteen (16+);
prohibited for children (18+).
* Supported by the grant of the President of the Russian Federation no.
MK-637.2020.9.
Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0).
The RARS was introduced in 2012 when the Federal law of the Russian
Federation no. 436-FZ of 2010-12-23 «On Protection of Children from Information
Harmful to Their Health and Development» was passed [3]. The law prohibits
the distribution of «harmful» material that depicts violence, unlawful activities,
substance abuse, or self-harm.
The aim of this article is to compare the topics of the texts assigned to
different age rating categories (according to the RARS). Our findings can potentially
benefit many text classification applications, such as recommender systems and
text filtering systems. First, through topic analysis, we can gain a deeper
understanding of the structure of age rating systems. Since the age rating of a book is
currently assessed empirically, topic analysis is an important step
towards formalizing this task. Based on the results of the analysis, it will become
clearer which books on which topics most often contain information that is
unsuitable for children or addressed to a particular age audience. In addition,
the results of topic modeling will help to highlight specific topics for different
age groups. For this reason, topic distributions can be used as additional
features for automatic age rating classifiers in our further research.
The paper is divided into four sections. The first section is the introduction. The
second section covers the data preprocessing used for this study and
the description of our topic model. The third section presents the empirical analysis
of topics. The last section is the conclusion.
2 Methodology
2.1 Data Preparation
We use a collection of abstracts for books in Russian. These abstracts were
collected from public online libraries.
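As an illustration only, the preprocessing steps listed in this subsection can be sketched in plain Python. This is not the pipeline used for the paper: the stop-word set below is a tiny hypothetical stand-in for the NLTK list, lemmatization is omitted, and the TF-IDF variant (relative term frequency multiplied by log(N/df)) is an assumption, since the exact weighting scheme is not specified. Only the 3-character minimum and the 0.15 threshold come from the text.

```python
import math
import re
from collections import Counter

# Hypothetical stand-in for the NLTK Russian stop-word list.
STOP_WORDS = {"и", "в", "на", "это", "для"}


def tokenize(text):
    """Lowercase, strip punctuation and digits, drop stop words and short tokens."""
    text = text.lower()
    # Replace punctuation, digits and underscores with spaces.
    text = re.sub(r"[^\w\s]|\d|_", " ", text)
    # split() also collapses extra white space.
    return [t for t in text.split() if len(t) >= 3 and t not in STOP_WORDS]


def tfidf_filter(docs, threshold=0.15):
    """Keep only tokens whose TF-IDF weight in their document reaches the threshold.

    Assumed weighting: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    filtered = []
    for doc in docs:
        counts = Counter(doc)
        kept = []
        for term in doc:
            tf = counts[term] / len(doc)
            idf = math.log(n_docs / doc_freq[term])
            if tf * idf >= threshold:
                kept.append(term)
        filtered.append(kept)
    return filtered


docs = [
    tokenize("Книга о приключениях, книга 2020!"),
    tokenize("Книга и сказка для детей"),
]
filtered = tfidf_filter(docs)
```

In this toy collection the word «книга» occurs in every document, so its inverse document frequency is zero and it is filtered out, which mirrors the stated goal of removing words typical of book abstracts in general.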
Text preprocessing included the following actions:
standard steps, such as conversion into lower case letters, removing
punctuation and digits, lemmatization, removing extra white spaces;
excluding all the stop words using the Natural Language Toolkit (NLTK) Python
library [14] and words with fewer than 3 characters;
removing words with TF-IDF weights less than 0.15. TF-IDF (term
frequency-inverse document frequency) is a statistical measure that shows how
important a word is to a document in a text collection. The TF-IDF value
increases proportionally to the number of times a word appears in the
document and is offset by the number of documents in the text collection that
contain the word [9]. This step allowed us to exclude words typical
of book abstracts (these are usually the words [...]).
0+
1 «Сказка», «волшебный», «сказочный», «серия», «японский»,
«украинский», «индийский», «арабский» («tale», «magic», «fairy tale»,
«series», «Japanese», «Ukrainian», «Indian», «Arabic»)
2 «Рассказ», «сказка», «тетрадь», «маленькие», «занятие», «пособие»,
«интеллект», «серия» («story», «tale», «notebook», «little», «lesson»,
«manual», «intelligence», «series»)
3 «Любовь», «рассказ», «стих», «город», «сборник», «сказка»,
«пересказ», «перевод» («love», «story», «verse», «city», «collection»,
«tale», «retelling», «translation»)
4 «Ребенок», «планета», «написать», «стихотворная_форма»,
«рассказывать», «сказка», «русский», «приключение» («child»,
«planet», «to write», «poetic form», «to tell», «tale», «Russian»,
«adventure»)
5 «Рассказ», «ребенок», «поучительный», «чтение», «церковный»,
«любить», «Богородица», «учебник» («story», «child», «instructive»,
«reading», «church», «to love», «Mother of God», «textbook»)
6+
1 «Школьный_возраст», «рассказ», «младший_школьник»,
«детский_писатель», «ребенок», «известный», «сборник», «повесть»
(«school age», «story», «younger school student», «children's writer»,
«child», «famous», «collection», «novella»)
2 «Сказка», «волшебный», «сказочный», «серия», «японский»,
«украинский», «индийский», «арабский» («tale», «magic», «fairy tale»,
«series», «Japanese», «Ukrainian», «Indian», «Arabic»)
3 «Упражнение», «язык», «закрепление», «брошюра», «английский»,
«самоучитель», «лингвистический», «испанский» («exercise», «language»,
«consolidation», «brochure», «English», «self-study guide», «linguistic»,
«Spanish»)
4 «Ребенок», «планета», «написать», «стихотворная_форма»,
«рассказывать», «сказка», «русский», «приключение» («child»,
«planet», «to write», «poetic form», «to tell», «tale», «Russian»,
«adventure»)
5 «Стихотворная_форма», «планета», «рассказывать», «растение»,
«алфавит», «иллюстрация», «стихотворение» («poetic form», «planet»,
«to tell», «plant», «alphabet», «illustration», «poem»)
12+
1 «Школьный_возраст», «рассказ», «младший_школьник»,
«детский_писатель», «ребенок», «известный», «сборник», «повесть»
(«school age», «story», «younger school student», «children's writer»,
«child», «famous», «collection», «novella»)
2 «Сказка», «волшебный», «сказочный», «серия», «японский»,
«украинский», «индийский», «арабский» («tale», «magic», «fairy tale»,
«series», «Japanese», «Ukrainian», «Indian», «Arabic»)
3 «Упражнение», «язык», «закрепление», «брошюра», «английский»,
«самоучитель», «лингвистический», «испанский» («exercise», «language»,
«consolidation», «brochure», «English», «self-study guide», «linguistic»,
«Spanish»)
4 «Ребенок», «планета», «написать», «стихотворная_форма»,
«рассказывать», «сказка», «русский», «приключение» («child»,
«planet», «to write», «poetic form», «to tell», «tale», «Russian»,
«adventure»)
5 «Стихотворная_форма», «планета», «рассказывать», «растение»,
«алфавит», «иллюстрация», «стихотворение» («poetic form», «planet»,
«to tell», «plant», «alphabet», «illustration», «poem»)
16+
1 «Жадный», «погон», «замучить», «милость», «заметить», «счастье»,
«свадебный», «глотать» («greedy», «shoulder strap», «to torture», «mercy»,
«to notice», «happiness», «wedding», «to swallow»)
2 [...], «гипотетический», «догнать», «былой», «предшественник»,
«достижение» ([...], «hypothetical», «to catch up», «former»,
«predecessor», «achievement»)
3 «Метод», «французский», «упрощение», «повторяемость»,
«заучивание», «уникальность», «текст», «лексический» («method»,
«French», «simplification», «repeatability», «memorization», «uniqueness»,
«text», «lexical»)
4 «Детский», «ребенок», «отношение», «педагог», «искусство»,
«воспитание», «зависимость», «женщина» («children's», «child»,
«relationship», «teacher», «art», «upbringing», «dependence», «woman»)
5 «Книга», «прошлое», «человек», «любовь», «горький», «терять»,
«пройти», «мир» («book», «past», «person», «love», «bitter», «to lose»,
«to pass», «peace»)
18+
1 «Чувство», «отношение», «новелла», «представлять», «реальность»,
«интрига», «удовольствие», «фантазия» («feeling», «relationship»,
«short story», «to present», «reality», «intrigue», «pleasure», «fantasy»)
2 «Гороскоп», «светить», «знак», «поведение», «тип»,
«предопределять», «сексуальный» («horoscope», «to shine», «sign»,
«behavior», «type», «to determine», «sexual»)
3 «Книга», «прошлое», «человек», «любовь», «горький», «терять»,
«пройти», «мир» («book», «past», «person», «love», «bitter», «to lose»,
«to pass», «peace»)
4 «История», «друг», «жанр», «смешной», «личность», «диалог»,
«герой», «весёлый» («story», «friend», «genre», «funny», «personality»,
«dialogue», «hero», «cheerful»)
5 [...], «отличие», «мужчина», «женщина», «мир», «здоровье»,
«действие», «тело» ([...], «difference», «man», «woman», «world»,
«health», «action», «body»)
[...] The «Main topic» column shows the percentage of texts for which this
topic is the main one, i.e. it has the largest proportion in the topic distribution.
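The definition of a main topic can be made concrete with a short sketch. It assumes that each document is represented by a list of topic proportions, as produced by an LDA model, and that the percentage is computed over the documents of one category. The function name and the toy proportions are hypothetical and are not taken from our data.

```python
from collections import Counter


def main_topic_share(doc_topic):
    """Map each topic index to the percentage of documents whose main topic it is.

    doc_topic: list of per-document topic proportion lists (one row per document).
    The main topic of a document is the index of its largest proportion.
    """
    mains = Counter(
        max(range(len(row)), key=row.__getitem__) for row in doc_topic
    )
    n_docs = len(doc_topic)
    return {topic: 100.0 * count / n_docs for topic, count in mains.items()}


# Hypothetical topic proportions for 4 documents and 3 topics.
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.5, 0.4, 0.1],
    [0.2, 0.2, 0.6],
]
shares = main_topic_share(doc_topic)
```

Here topic 0 is the main topic of two of the four documents, so its share is 50%, while topics 1 and 2 each account for 25%.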
Table 3: The most typical topics for categories.
Category Keywords Main topic
0+ «tale», «magic», «fairy tale», «series», «Japanese», «Ukrainian», «Indian», «Arabic» 7.55%
16+ [...] 1.53%
18+ [...] 1.78%
0+ [...], «intelligence», [...], «story», «younger school student» 2.45%
12+ [...] 2.48%
3.3 Age-specific Topics
In this subsection, we provide the topics that are typical mainly for one age
rating category using Dixon's Q-test. The tabulated Qcrit value is equal to
0.642 for confidence level 90% and m = 5.
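A minimal sketch of the test is given below. It uses the standard form of Dixon's Q statistic for a suspected largest value (the gap to its nearest neighbour divided by the range). Applying it to the shares of one topic across the five age categories is an assumption made for illustration; the function name and the numbers are hypothetical.

```python
Q_CRIT_90_N5 = 0.642  # tabulated critical value quoted in the text (90%, m = 5)


def dixon_q_max(values):
    """Dixon's Q statistic for the largest value in a small sample.

    Q = (largest - second largest) / (largest - smallest).
    Returns 0.0 when all values are equal, to avoid division by zero.
    """
    xs = sorted(values)
    spread = xs[-1] - xs[0]
    if spread == 0:
        return 0.0
    return (xs[-1] - xs[-2]) / spread


# Hypothetical shares of one topic in the categories 0+, 6+, 12+, 16+, 18+.
shares = [0.30, 0.05, 0.04, 0.03, 0.02]
q = dixon_q_max(shares)
is_age_specific = q > Q_CRIT_90_N5
```

With these toy values, Q = 0.25 / 0.28, which exceeds 0.642, so the topic would be treated as specific to the first category. For evenly spread shares such as 0.1, 0.2, 0.3, 0.4, 0.5 the statistic is 0.25 and the topic would not be flagged.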
We noticed that documents in category 0+ are largely mono-thematic. At the
same time, documents of other categories are usually mixtures of topics.
Therefore, our topic model has many specific topics for texts from the 0+ category. In
Table 4, we present the list of the most common age-specific topics in our data
set.
As one would expect, age-specific topics generally relate to
children's books, as well as to specific literature from the 18+ category (in our case,
literature on business and success).
Table 4. Age-specific categories.
Category Keywords Qcrit
18+ «success», «activity», «man», «business», «city», «collection», «European», [...] 0.94
[...], «manual», [...], «plunge», «fairytale atmosphere», «emotion», [...]
0+ [...]
0+ [...], «Ukrainian», [...] 0.76
0+ [...], 1 [...], «humor», [...], «capable» 0.75
4 Conclusion
In this paper, we empirically analyzed the topics of texts assigned to different age
rating categories. We presented the distribution of topics for age categories,
the list of the most common topics for categories, and the list of age-specific topics. These
lists of topics were obtained using statistical methods. Our analysis confirmed the
existing differences between the categories and demonstrated that topic models
can be a good source of features for age rating identification. In our future work,
we will try to develop a machine learning classifier for automatically determining
the age rating of a text.
1 This topic is probably related to «The Malachite Box».
This is a book of fairy tales and folk tales of the Ural region of Russia compiled
by Pavel Bazhov and published from 1936 to 1945. It is written in contemporary
language and blends elements of everyday life with fantastic creatures of mountains
and forests. This book significantly popularized the folklore of the Urals [8].
References
1. Blei, D. M., Ng, A. Y., Jordan, M. I.: Latent Dirichlet allocation. Journal of
Machine Learning Research 3(Jan), 993-1022 (2003).
2. Dixon, W. J.: Processing data for outliers. Biometrics 9(1), 74-89 (1953).
https://doi.org/10.2307/3001634
3. Federal Law of December 29, 2010 N 436-FZ (as amended on May 1, 2019)
«On Protection of Children from Information Harmful to Their Health and
Development» (as amended and supplemented, entered into force on October 29, 2019)
[Federal'nyj zakon ot 29.12.2010 N 436-FZ (red. ot 01.05.2019) «O zashchite detej ot
informacii, prichinyayushchej vred ih zdorov'yu i razvitiyu» (s izm. i dop., vstup. v silu
s 29.10.2019)], http://www.consultant.ru/document/cons_doc_LAW_108808/.
Last accessed 7 Apr 2020.
4. Glazkova, A., Kruzhinov, V., Sokova, Z.: Dynamic Topic Models for Retrospective
Event Detection: A Study on Soviet Opposition-Leaning Media. In: International
Conference on Analysis of Images, Social Networks and Texts, pp. 145-154, Springer,
Cham (2019). https://doi.org/10.1007/978-3-030-37334-4_13
5. Gong, H., You, F., Guan, X., Cao, Y., Lai, S.: Application of LDA Topic Model
in E-Mail Subject Classification. In: 2018 International Conference on
Transportation & Logistics, Information & Communication, Smart City. Atlantis Press (2018).
https://doi.org/10.2991/tlicsc-18.2018.24
6. How are age-based gaming ratings set?,
https://www.kaspersky.com/blog/gaming-age-ratings/11647/. Last accessed 7 Apr 2020.
7. Hu, X.: News hotspots detection and tracking based on LDA topic model. In: 2016
International Conference on Progress in Informatics and Computing (PIC). IEEE,
pp. 248-252 (2016). https://doi.org/10.1109/pic.2016.7949504
8. Ilyasova, R. S.: Dialectal lexis of P. P. Bazov's narrations <