<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring Book Themes in the Russian Age Rating System: A Topic Modeling Approach</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <contrib-id contrib-id-type="orcid">0000-0001-8409-6457</contrib-id>
          <string-name>Anna Glazkova</string-name>
          <aff>University of Tyumen, Tyumen, Russia</aff>
          <email>a.v.glazkova@utmn.ru</email>
        </contrib>
      </contrib-group>
      <fpage>304</fpage>
      <lpage>314</lpage>
      <abstract>
        <p>Age rating systems are created to indicate target ages of potential content users based on information security and text semantics. Age ratings are usually given as numbers, which tell us the youngest age the content is suitable for. A book or film with a 12+ rating has content which is suitable only for people aged 12 years and over, and a book or film with an 18+ rating is suitable for adults only. Currently, content assessment in terms of information security is carried out by experts. In this paper, we empirically compare book abstracts assigned to different age ratings using unsupervised topic modeling. We use an LDA model to discover topics from a collection of book abstracts. We then use statistical methods to study relations between the age rating categories assigned to books by experts and the topics obtained. We believe that our comparisons show interesting and useful findings for age rating automation.</p>
      </abstract>
      <kwd-group>
        <kwd>Topic modeling</kwd>
        <kwd>Age restrictions</kwd>
        <kwd>Age rating</kwd>
        <kwd>Text classification</kwd>
        <kwd>Statistical methods</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Age-based ratings serve as a warning that the content may be unsuitable for children. Moreover, age ratings are used to ensure that entertainment content, such as books, but also films, games or mobile apps, is clearly labelled with a minimum age recommendation.</p>
      <p>While some books are suitable for readers of all ages, others are only suitable for older children and young teenagers. A certain portion of books contains information that is only appropriate for an adult audience.</p>
      <p>Age-based rating systems differ between countries. Whereas the classification systems in Russia, Europe and Germany are based purely on age, the rating systems in the USA and Australia might be interpreted with consideration of factors other than age. For example, in Australia there are two different 18+ ratings, applied to either adult content or pornographic materials [6].</p>
      <p>Copyright by its authors. Use permitted under Creative Commons Attribution 4.0 International (CC BY 4.0).</p>
      <p>The Russian Age Rating System (RARS) includes 5 categories of content:
- for children under the age of six (0+);
- for children over the age of six (6+);
- for children over the age of twelve (12+);
- for children over the age of sixteen (16+);
- prohibited for children (18+).</p>
      <p>The RARS was introduced in 2012, when the Federal Law of the Russian Federation no. 436-FZ of 2010-12-23 «On Protection of Children from Information Harmful to Their Health and Development» was passed [3]. The law prohibits the distribution of «harmful» material that depicts violence, unlawful activities, substance abuse, or self-harm.</p>
    </sec>
    <sec id="sec-3">
      <p>The aim of this article is to compare the topics of the texts assigned to different age rating categories (according to the RARS). Our findings can potentially benefit many text classification applications, such as recommender systems and text filtering systems. First, through topic analysis, we can gain a deeper understanding of the structure of age rating systems. Since the age rating of a book is currently assessed empirically, topic analysis is an important step towards formalizing this task. Based on the results of the analysis, it will become clearer which books on which topics most often contain information that is unsuitable for children or addressed to a particular age audience. In addition, the results of topic modeling will help to highlight specific topics for different age groups. For this reason, topic distributions can be used as additional features for automatic age rating classifiers in our further research.</p>
      <p>The paper is divided into four sections. The first section is the introduction. The second section is concerned with the data preprocessing used for this study and the description of our topic model. The third section presents the empirical analysis of topics. The last section is the conclusion.</p>
      <p>2 Methodology</p>
      <p>2.1 Data Preparation</p>
      <p>We use a collection of abstracts for books in Russian. These abstracts were collected from public online libraries.</p>
      <p>Text preprocessing included the following actions:
- standard steps, such as conversion to lower case, removal of punctuation and digits, lemmatization, and removal of extra white space;
- excluding all stop words using the Natural Language Toolkit (NLTK) Python library [14], as well as words with fewer than 3 symbols;
- removing words with TF-IDF weights less than 0.15. TF-IDF (term frequency-inverse document frequency) is a statistical measure that shows how important a word is to a document in a text collection. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the text collection that contain the word [9]. This step allowed us to exclude words typical of book abstracts (usually words such as «book», «author», «reader», etc.). The threshold 0.15 was chosen empirically. During the study, we tested values in the range [0.1, 0.3] with an increment of 0.05 and compared the coherence of the resulting models;
- excluding personal names to delete the mentions of authors and characters. This allowed our topic model to form themes according to the semantic proximity of abstracts, and not according to books sharing an author or characters' names. To recognize named entities, we used the Natasha Python library [13];
- combining common phrases (with a co-occurrence frequency of more than 5) into bigrams using the Gensim library [12].</p>
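      <p>The TF-IDF filtering step above can be illustrated with a minimal Python sketch. It assumes one common TF-IDF variant (relative term frequency multiplied by the natural logarithm of the inverse document frequency); the text does not state which variant or normalization was used, so the absolute weights may differ from those in the study. The helper names <monospace>tfidf_weights</monospace> and <monospace>filter_by_tfidf</monospace> and the toy documents are hypothetical.</p>
      <preformat>
```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {word: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


def filter_by_tfidf(docs, threshold):
    """Keep only the words whose TF-IDF weight reaches the threshold."""
    return [
        [word for word in doc if w[word] >= threshold]
        for doc, w in zip(docs, tfidf_weights(docs))
    ]


# Toy lemmatized abstracts (illustrative data, not from the study's corpus).
docs = [
    ["book", "tale", "magic"],
    ["book", "horoscope", "zodiac"],
    ["book", "front", "tragedy"],
]
filtered = filter_by_tfidf(docs, threshold=0.15)
print(filtered)
```
      </preformat>
      <p>In this toy collection the word «book» occurs in every document, so its inverse document frequency and therefore its weight is zero, and it is filtered out. This mirrors the stated purpose of the step: removing words typical of book abstracts.</p>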
      <p>Some statistics of the data are summarized in Table 1.</p>
      <table-wrap>
        <label>Table 1.</label>
        <caption>
          <p>Some characteristics of the data.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Category</th><th>Number of texts</th><th>Avg. number of words per text</th></tr>
          </thead>
          <tbody>
            <tr><td>0+</td><td>53</td><td>45.33</td></tr>
            <tr><td>6+</td><td>3107</td><td>72.03</td></tr>
            <tr><td>12+</td><td>3110</td><td>72.03</td></tr>
            <tr><td>16+</td><td>3989</td><td>91.76</td></tr>
            <tr><td>18+</td><td>3986</td><td>76.36</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>2.2 LDA</p>
      <p>To discover topics from the collection of abstracts, we chose to apply standard Latent Dirichlet Allocation (LDA) [15]. Topic modeling is a type of statistical modeling for recognizing the main topics in a collection of documents. As a rule, topic modeling is based on LDA, a hierarchical Bayesian model that relates words and documents through latent topics [1]. Topics are characterized by different word frequencies. Each document is represented as a bag of words, and each topic appears as a set of words ranked in decreasing order of their probabilities. LDA topic models have been applied to analyze various subject areas, such as social media [17,18,19], emails [5], news [7,16], fiction texts [10], and others.</p>
      <p>We built a topic model with 100 topics, which reflect the main content of the collection of abstracts. To choose the number of topics, we built topic models with different numbers of topics and picked the number that gives the highest coherence value (Fig. 1).</p>
      <p>[Fig. 1. Coherence value as a function of the number of topics.]</p>
      <p>2.3 Topic Distribution Estimation</p>
      <p>We calculated the topic distribution for each document in the collection. It is a vector of length equal to the number of topics in the LDA model. The topic distribution vector shows how closely a document from the collection of abstracts corresponds to each topic. Each element of the topic distribution vector is a number from 0 to 1, where 0 is a complete non-match and 1 is an absolute match. Then, we averaged the topic distribution vectors within each age rating category to obtain the average topic distribution for a group of documents:
        <disp-formula><label>(1)</label><tex-math>a_i = (w_1, w_2, \ldots, w_k), \quad i \in [1, n],</tex-math></disp-formula>
        where <italic>n</italic> is the number of age categories, <italic>k</italic> is the number of topics, and <italic>w</italic><sub><italic>x</italic></sub> is the severity value of a particular topic for the age category, <italic>x</italic> ∈ [1, <italic>k</italic>].
      </p>
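      <p>The per-category averaging described above can be sketched as follows. This is a minimal illustration in plain Python, not the study's implementation; the function name <monospace>average_topic_vectors</monospace> and the toy values are hypothetical.</p>
      <preformat>
```python
from collections import defaultdict


def average_topic_vectors(doc_topics, categories):
    """Average per-document topic distributions within each category.

    doc_topics: list of equal-length lists, one topic distribution per document.
    categories: list of category labels, aligned with doc_topics.
    Returns {category: averaged topic distribution}.
    """
    grouped = defaultdict(list)
    for vector, category in zip(doc_topics, categories):
        grouped[category].append(vector)
    return {
        category: [sum(column) / len(vectors) for column in zip(*vectors)]
        for category, vectors in grouped.items()
    }


# Toy example with k = 3 topics (illustrative values only).
doc_topics = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
categories = ["0+", "0+", "18+"]
averaged = average_topic_vectors(doc_topics, categories)
print(averaged)
```
      </preformat>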
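      <p>For the obtained averaged topic distribution vectors <italic>a</italic><sub>1</sub>, <italic>a</italic><sub>2</sub>, ..., <italic>a</italic><sub><italic>n</italic></sub>, we determined:
- the most typical topics for each age category. We calculated the standard deviation for each vector <italic>a</italic><sub>1</sub>, <italic>a</italic><sub>2</sub>, ..., <italic>a</italic><sub><italic>n</italic></sub>. Next, we marked values that exceeded three standard deviations as corresponding to the most typical topics for the category (the three-sigma rule);
- age-specific topics that are typical mainly for one age rating category. For each topic, we have a vector
        <disp-formula><label>(2)</label><tex-math>t_i = (v_1, v_2, \ldots, v_m), \quad i \in [1, k],</tex-math></disp-formula>
        where <italic>v</italic><sub><italic>z</italic></sub> (<italic>z</italic> ∈ [1, <italic>m</italic>]) from the vector <italic>t</italic><sub><italic>i</italic></sub> corresponds to the value <italic>w</italic><sub><italic>i</italic></sub> from the vector <italic>a</italic><sub><italic>z</italic></sub>.</p>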
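      <p>The three-sigma selection of typical topics can be sketched as below. The text does not spell out the reference point, so this sketch assumes that a value «exceeds three standard deviations» when it lies more than three population standard deviations above the mean of the category's averaged vector. The function name <monospace>typical_topics</monospace> and the toy vector are hypothetical.</p>
      <preformat>
```python
import statistics


def typical_topics(avg_vector, n_sigmas=3.0):
    """Return indices of topics whose weight exceeds mean + n_sigmas * std.

    Assumption: the threshold is measured from the mean of the vector and
    uses the population standard deviation.
    """
    mean = statistics.fmean(avg_vector)
    std = statistics.pstdev(avg_vector)
    threshold = mean + n_sigmas * std
    return [i for i, weight in enumerate(avg_vector) if weight > threshold]


# Toy averaged vector over k = 100 topics: 99 minor topics, one dominant one.
avg_vector = [0.005] * 99 + [0.505]
print(typical_topics(avg_vector))
```
      </preformat>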
      <p>When the number of observations (in our case, the number of categories) is rather small, we cannot use the three-sigma rule to search for outliers. In this case, it is necessary to use other statistical techniques for small samples [4]. We therefore applied Dixon's Q-test [2] to determine age-specific topics. Dixon's Q-test is a simple test that allows us to examine whether one observation from a small set of replicate observations (typically more than 3 and no more than 10 observations) is an outlier or not [11].</p>
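      <p>First, we arrange the values <italic>v</italic><sub>1</sub>, <italic>v</italic><sub>2</sub>, ..., <italic>v</italic><sub><italic>m</italic></sub> of each vector <italic>t</italic><sub><italic>i</italic></sub> in ascending order (from the lowest to the highest value):
        <disp-formula><label>(3)</label><tex-math>v_1 \leq v_2 \leq \ldots \leq v_m.</tex-math></disp-formula>
        Then, we estimate the experimental Q-value for the largest value:
        <disp-formula><label>(4)</label><tex-math>Q_{exp} = \frac{v_m - v_{m-1}}{v_m - v_1}.</tex-math></disp-formula>
        In the next step, we compare the calculated <italic>Q</italic><sub>exp</sub> value to the tabulated critical <italic>Q</italic><sub>crit</sub> value for a chosen confidence level. If <italic>Q</italic><sub>exp</sub> exceeds <italic>Q</italic><sub>crit</sub>, the largest value is treated as an outlier.
      </p>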
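      <p>The Q-test procedure can be sketched in a few lines of Python. The sketch assumes the standard form of Dixon's Q statistic for the largest observation (gap to the nearest neighbour divided by the range) and uses the critical value 0.642 for five observations at the 90% confidence level quoted later in the text. The function name <monospace>dixon_q_largest</monospace> and the toy weights are hypothetical.</p>
      <preformat>
```python
def dixon_q_largest(values):
    """Experimental Q-value for testing the largest observation as an outlier.

    Q_exp = (v_m - v_{m-1}) / (v_m - v_1) for values sorted in ascending order.
    """
    ordered = sorted(values)
    value_range = ordered[-1] - ordered[0]
    if value_range == 0:
        return 0.0  # all values equal: no candidate outlier
    return (ordered[-1] - ordered[-2]) / value_range


# Critical value for m = 5 observations at the 90% confidence level.
Q_CRIT_90_M5 = 0.642

# Toy per-category weights of one topic for 0+, 6+, 12+, 16+, 18+
# (illustrative values only).
topic_weights = [0.050, 0.004, 0.006, 0.003, 0.005]
q_exp = dixon_q_largest(topic_weights)
is_age_specific = q_exp > Q_CRIT_90_M5
print(round(q_exp, 3), is_age_specific)
```
      </preformat>
      <p>Here the topic weight of the first category is far from the other four, so the experimental Q-value exceeds the critical value and the topic would be flagged as age-specific for that category.</p>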
      <p>3 Empirical Analysis of Topics</p>
      <p>As mentioned earlier, the purpose of this study is to compare the topics of texts assigned to different age rating categories (according to the RARS). Understanding the differences between categories will help us highlight topics specific to the categories. Thus, in the future, we will be able to use the empirical results of topic modeling as a set of additional features for automatic text classification based on the age of the addressee.</p>
      <p>3.1 Distribution of Topics</p>
      <p>According to Russian law, books of any genre in printed and electronic versions are subject to labeling in accordance with age restrictions. The law distinguishes five age categories: 0+, 6+, 12+, 16+ and 18+ (prohibited for children). The difference between the books of these age categories is determined by the presence of scenes of violence, cruelty, descriptions of antisocial actions, diseases, the mention of narcotic substances, alcoholic beverages, tobacco and swear words. Table 2 shows the most common topics for age categories based on our topic model.</p>
      <table-wrap>
        <label>Table 2.</label>
        <caption>
          <p>Top-5 topics per category.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Category</th><th>No.</th><th>Keywords</th></tr>
          </thead>
          <tbody>
            <tr><td rowspan="5">0+</td><td>1</td><td>«Сказка», «волшебный», «сказочный», «серия», «японский», «украинский», «индийский», «арабский» («tale», «magic», «fairy tale», «series», «Japanese», «Ukrainian», «Indian», «Arabic»)</td></tr>
            <tr><td>2</td><td>«Рассказ», «сказка», «тетрадь», «маленькие», «занятие», «пособие», «интеллект», «серия» («story», «fairy tale», «notebook», «small», «lesson», «manual», «intelligence», «series»)</td></tr>
            <tr><td>3</td><td>«Любовь», «рассказ», «стих», «город», «сборник», «сказка», «пересказ», «перевод» («love», «story», «verse», «city», «collection», «fairy tale», «retelling», «translation»)</td></tr>
            <tr><td>4</td><td>«Ребенок», «планета», «написать», «стихотворная_форма», «рассказывать», «сказка», «русский», «приключение» («child», «planet», «to write», «poetic form», «to tell», «fairy tale», «Russian», «adventure»)</td></tr>
            <tr><td>5</td><td>«Рассказ», «ребенок», «поучительный», «чтение», «церковный», «любить», «Богородица», «учебник» («story», «child», «instructive», «reading», «church», «to love», «Mother of God», «textbook»)</td></tr>
            <tr><td rowspan="5">6+</td><td>1</td><td>«Школьный_возраст», «рассказ», «младший_школьник», «детский_писатель», «ребенок», «известный», «сборник», «повесть» («school age», «story», «younger school student», «children's writer», «child», «famous», «collection», «story»)</td></tr>
            <tr><td>2</td><td>«Сказка», «волшебный», «сказочный», «серия», «японский», «украинский», «индийский», «арабский» («tale», «magic», «fairy tale», «series», «Japanese», «Ukrainian», «Indian», «Arabic»)</td></tr>
            <tr><td>3</td><td>«Упражнение», «язык», «закрепление», «брошюра», «английский», «самоучитель», «лингвистический», «испанский» («exercise», «language», «consolidation», «brochure», «English», «self-study guide», «linguistic», «Spanish»)</td></tr>
            <tr><td>4</td><td>«Ребенок», «планета», «написать», «стихотворная_форма», «рассказывать», «сказка», «русский», «приключение» («child», «planet», «to write», «poetic form», «to tell», «fairy tale», «Russian», «adventure»)</td></tr>
            <tr><td>5</td><td>«Стихотворная_форма», «планета», «рассказывать», «растение», «алфавит», «иллюстрация», «стихотворение» («poetic form», «planet», «to tell», «plant», «alphabet», «illustration», «poem»)</td></tr>
            <tr><td rowspan="5">12+</td><td>1</td><td>«Школьный_возраст», «рассказ», «младший_школьник», «детский_писатель», «ребенок», «известный», «сборник», «повесть» («school age», «story», «younger school student», «children's writer», «child», «famous», «collection», «story»)</td></tr>
            <tr><td>2</td><td>«Сказка», «волшебный», «сказочный», «серия», «японский», «украинский», «индийский», «арабский» («tale», «magic», «fairy tale», «series», «Japanese», «Ukrainian», «Indian», «Arabic»)</td></tr>
            <tr><td>3</td><td>«Упражнение», «язык», «закрепление», «брошюра», «английский», «самоучитель», «лингвистический», «испанский» («exercise», «language», «consolidation», «brochure», «English», «self-study guide», «linguistic», «Spanish»)</td></tr>
            <tr><td>4</td><td>«Ребенок», «планета», «написать», «стихотворная_форма», «рассказывать», «сказка», «русский», «приключение» («child», «planet», «to write», «poetic form», «to tell», «fairy tale», «Russian», «adventure»)</td></tr>
            <tr><td>5</td><td>«Стихотворная_форма», «планета», «рассказывать», «растение», «алфавит», «иллюстрация», «стихотворение» («poetic form», «planet», «to tell», «plant», «alphabet», «illustration», «poem»)</td></tr>
            <tr><td rowspan="5">16+</td><td>1</td><td>«Жадный», «погон», «замучить», «милость», «заметить», «счастье», «свадебный», «глотать» («greedy», «epaulet», «to torture», «mercy», «to notice», «happiness», «wedding», «to swallow»)</td></tr>
            <tr><td>2</td><td>«Фронт», «мужчина», «трагедия», «гипотетический», «догнать», «былой», «предшественник», «достижение» («forefront», «man», «tragedy», «hypothetical», «to catch up», «past», «predecessor», «achievement»)</td></tr>
            <tr><td>3</td><td>«Метод», «французский», «упрощение», «повторяемость», «заучивание», «уникальность», «текст», «лексический» («method», «French», «simplification», «repeatability», «memorization», «uniqueness», «text», «lexical»)</td></tr>
            <tr><td>4</td><td>«Детский», «ребенок», «отношение», «педагог», «искусство», «воспитание», «зависимость», «женщина» («children's», «child», «relation», «teacher», «art», «upbringing», «addiction», «woman»)</td></tr>
            <tr><td>5</td><td>«Книга», «прошлое», «человек», «любовь», «горький», «терять», «пройти», «мир» («book», «past», «man», «love», «bitter», «to lose», «to pass», «peace»)</td></tr>
            <tr><td rowspan="5">18+</td><td>1</td><td>«Чувство», «отношение», «новелла», «представлять», «реальность», «интрига», «удовольствие», «фантазия» («feeling», «relation», «short story», «to imagine», «reality», «intrigue», «pleasure», «fantasy»)</td></tr>
            <tr><td>2</td><td>«Зодиак», «гороскоп», «светить», «знак», «поведение», «тип», «предопределять», «сексуальный» («Zodiac», «horoscope», «to shine», «sign», «behavior», «type», «to determine», «sexual»)</td></tr>
            <tr><td>3</td><td>«Книга», «прошлое», «человек», «любовь», «горький», «терять», «пройти», «мир» («book», «past», «man», «love», «bitter», «to lose», «to pass», «peace»)</td></tr>
            <tr><td>4</td><td>«История», «друг», «жанр», «смешной», «личность», «диалог», «герой», «весёлый» («history», «friend», «genre», «funny», «personality», «dialogue», «hero», «funny»)</td></tr>
            <tr><td>5</td><td>«Знание», «отличие», «мужчина», «женщина», «мир», «здоровье», «действие», «тело» («knowledge», «difference», «man», «woman», «world», «health», «action», «body»)</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Books for children under 6 years old (0+) may contain episodic non-naturalistic images or descriptions of physical or psychological violence justified by the genre, provided that compassion for the victim is expressed and the story has a happy ending. The prevailing topics in the 0+ category are fairy tales of the world (topic 1), educational manuals for children (2), poems about the world around us (3-4) and Christian literary works for children (5).</p>
      <p>In books for children over the age of 6 (6+), non-naturalistic images or descriptions of human diseases, accidents, catastrophes or violent death without demonstrating their consequences are permissible. Books for children over 12 (12+) may contain scenes of violence or murder and descriptions of illnesses and disasters, but without details. Alcohol, tobacco and drug use may be present, but should be condemned. A schematic description of hugs and kisses between men and women may be present. These two categories are described by similar topics in our topic model. These are short stories and tales for primary and secondary school age (topic 1), fairy tales of the world (2), study guides (3) and poems for children (4-5).</p>
      <p>Books for children over 16 (16+) may contain scenes of illnesses and disasters without detailed descriptions. Violence, alcohol and drug use can be described, but should be condemned. Rough words may be present, with the exception of swear words. Scenes of sexual relations cannot be described with anatomical details. In our example, this category is represented by military (topics 1-2) and human condition (5) fiction, teaching aids (3), and psychological and pedagogical literature (4).</p>
      <p>A book should be marked with the 18+ label if it contains naturalistic descriptions of illnesses or disasters, non-condemned drug and alcohol use, naturalistic scenes of sexual relations, non-traditional relationships, obscene language, or scenes that encourage suicide. In our topic model, love stories (topic 1), horoscopes (2), human condition fiction (3-4) and possibly relationship psychology books (5) were detected in this category.</p>
      <p>3.2 Most Common Topics for Categories</p>
      <p>Table 3 shows the most common topics discovered using the three-sigma rule. These topics are the most widely represented in their category. The column «Main topic» shows the percentage of texts for which this topic is the main one, i.e. it has the largest proportion in the topic distribution.</p>
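      <p>The «Main topic» percentage can be computed along the following lines. This is a minimal sketch under the definition given above (the main topic of a text is the one with the largest proportion in its topic distribution); the function name <monospace>main_topic_shares</monospace> and the toy distributions are hypothetical.</p>
      <preformat>
```python
from collections import Counter


def main_topic_shares(doc_topics):
    """Return {topic index: percentage of documents where it is the main topic}.

    The main topic of a document is the one with the largest proportion in
    its topic distribution (ties go to the lowest topic index).
    """
    main_topics = [
        max(range(len(vector)), key=vector.__getitem__) for vector in doc_topics
    ]
    counts = Counter(main_topics)
    return {topic: 100.0 * n / len(doc_topics) for topic, n in counts.items()}


# Toy topic distributions for four documents of one category (illustrative).
doc_topics = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.1, 0.8],
    [0.5, 0.4, 0.1],
]
print(main_topic_shares(doc_topics))
```
      </preformat>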
      <table-wrap>
        <label>Table 3.</label>
        <caption>
          <p>The most typical topics for categories.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Category</th><th>Keywords</th><th>Main topic</th></tr>
          </thead>
          <tbody>
            <tr><td>0+</td><td>«tale», «magic», «fairy tale», «series», «Japanese», «Ukrainian», «Indian», «Arabic»</td><td>7.55%</td></tr>
            <tr><td>16+</td><td>«greedy», «epaulet», «to torture», «mercy», «to notice», «happiness», «wedding», «to swallow»</td><td>1.55%</td></tr>
            <tr><td>16+</td><td>«forefront», «man», «tragedy», «hypothetical», «to catch up», «past», «predecessor», «achievement»</td><td>1.53%</td></tr>
            <tr><td>18+</td><td>«feeling», «relation», «short story», «to imagine», «reality», «intrigue», «pleasure», «fantasy»</td><td>1.78%</td></tr>
            <tr><td>0+</td><td>«story», «fairy tale», «notebook», «small», «lesson», «manual», «intelligence», «series»</td><td>5.66%</td></tr>
            <tr><td>6+</td><td rowspan="2">«school age», «story», «younger school student», «children's writer», «child», «famous», «collection», «story»</td><td>2.45%</td></tr>
            <tr><td>12+</td><td>2.48%</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-3-2">
        <p>3.3 Age-specific Topics</p>
        <sec id="sec-3-2-3">
          <p>In this subsection, we present the topics that are typical mainly for one age rating category, identified using Dixon's Q-test. The tabulated <italic>Q</italic><sub>crit</sub> value is equal to 0.642 for a confidence level of 90% and <italic>m</italic> = 5.</p>
        </sec>
        <sec id="sec-3-2-4">
          <p>We noticed that documents in category 0+ are largely mono-thematic. At the same time, documents of other categories are usually mixtures of topics. Therefore, our topic model has many specific topics for texts from the 0+ category. In Table 4, we present the list of the most common age-specific topics in our data set.</p>
        </sec>
        <sec id="sec-3-2-5">
          <p>As would be logical to assume, age-specific topics generally relate to children's books, as well as to specific literature from the 18+ category (in our case, literature on business and success).</p>
        </sec>
        <sec id="sec-3-2-6">
          <table-wrap>
            <label>Table 4.</label>
            <caption>
              <p>Age-specific categories.</p>
            </caption>
            <table>
              <thead>
                <tr><th>Category</th><th>Keywords</th><th><italic>Q</italic><sub>exp</sub></th></tr>
              </thead>
              <tbody>
                <tr><td>18+</td><td>«success», «activity», «man», «business», «city», «collection», «European», «necessary»</td><td>0.94</td></tr>
                <tr><td>0+</td><td>«story», «fairy tale», «notebook», «small», «lesson», «manual», «intelligence», «series»</td><td>0.93</td></tr>
                <tr><td>0+</td><td>«fairy tale», «plunge», «fairytale atmosphere», «emotion», «character», «interesting», «exciting», «impression»</td><td>0.85</td></tr>
                <tr><td>0+</td><td>«child», «planet», «to write», «poetic form», «to tell», «fairy tale», «Russian», «adventure»</td><td>0.78</td></tr>
                <tr><td>0+</td><td>«tale», «magic», «fairy tale», «series», «Japanese», «Ukrainian», «Indian», «Arabic»</td><td>0.76</td></tr>
                <tr><td>0+</td><td>«malachite box», «tale», «storyteller», «humor», «funny», «forest», «hero», «capable» <sup>1</sup></td><td>0.75</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>4 Conclusion</p>
          <p>In this paper, we empirically analyzed the topics of texts assigned to different age rating categories. We presented the distribution of topics for age categories, the list of the most common topics for categories, and the age-specific topics. These lists of topics were obtained using statistical methods. Our analysis confirmed the existing differences between the categories and demonstrated that topic models can be a good source of features for age rating identification. In our future work, we will try to develop a machine learning classifier for automatically determining the age rating of a text.</p>
        </sec>
        <sec id="sec-3-2-7">
          <p>1 This topic is probably related to «The Malachite Box» («Малахитовая шкатулка»). This is a book of fairy tales and folk tales of the Ural region of Russia compiled by Pavel Bazhov and published from 1936 to 1945. It is written in contemporary language and blends elements of everyday life with fantastic creatures of mountains and forests. This book significantly popularized the folklore of the Urals [8].</p>
        </sec>
        <sec id="sec-3-2-8">
          <p>17. Yang, S., Zhang, H.: Text mining of Twitter data using a latent Dirichlet allocation topic model and sentiment analysis. Int. J. Comput. Inf. Eng. 12, 525-529 (2018).</p>
          <p>18. Zhao, F., Zhu, Y., Jin, H., Yang, L. T.: A personalized hashtag recommendation approach using LDA-based topic model in microblog environment. Future Generation Computer Systems 65, 196-206 (2016). https://doi.org/10.1016/j.future.2015.10.012</p>
          <p>19. Zhao, W. X., Jiang, J., Weng, J., He, J., Lim, E. P., Yan, H., Li, X.: Comparing Twitter and traditional media using topic models. In: European Conference on Information Retrieval, pp. 338-349. Springer, Berlin, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20161-5_34</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Blei</surname>, <given-names>D. M.</given-names></string-name>,
          <string-name><surname>Ng</surname>, <given-names>A. Y.</given-names></string-name>,
          <string-name><surname>Jordan</surname>, <given-names>M. I.</given-names></string-name>:
          <article-title>Latent Dirichlet allocation</article-title>.
          <source>Journal of Machine Learning Research</source>
          <volume>3</volume>(<issue>Jan</issue>),
          <fpage>993</fpage>-<lpage>1022</lpage>
          (<year>2003</year>).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><surname>Dixon</surname>, <given-names>W. J.</given-names></string-name>:
          <article-title>Processing data for outliers</article-title>.
          <source>Biometrics</source>
          <volume>9</volume>(<issue>1</issue>),
          <fpage>74</fpage>-<lpage>89</lpage>
          (<year>1953</year>). https://doi.org/10.2307/3001634
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <source>Federal Law of December 29, 2010 N 436-FZ (as amended on May 1, 2019) «On the Protection of Children from Information Harmful to Their Health and Development» (as amended and supplemented, entered into force on October 29, 2019)</source>
          [Federal'nyj zakon ot 29.12.2010 N 436-FZ (red. ot 01.05.2019) «O zashchite detej ot informacii, prichinyayushchej vred ih zdorov'yu i razvitiyu» (s izm. i dop., vstup. v silu s 29.10.2019)],
          http://www.consultant.ru/document/cons_doc_LAW_108808/.
          Last accessed 7 Apr 2020.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Glazkova</surname>, <given-names>A.</given-names></string-name>,
          <string-name><surname>Kruzhinov</surname>, <given-names>V.</given-names></string-name>,
          <string-name><surname>Sokova</surname>, <given-names>Z.</given-names></string-name>:
          <article-title>Dynamic Topic Models for Retrospective Event Detection: A Study on Soviet Opposition-Leaning Media</article-title>.
          In: <source>International Conference on Analysis of Images, Social Networks and Texts</source>,
          pp. <fpage>145</fpage>-<lpage>154</lpage>.
          Springer, Cham (<year>2019</year>). https://doi.org/10.1007/978-3-030-37334-4_13
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Gong</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>You</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guan</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lai</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Application of LDA Topic Model in E-Mail Subject Classification</article-title>
          .
          <source>In: 2018 International Conference on Transportation &amp; Logistics, Information &amp; Communication, Smart City</source>
          . Atlantis Press (
          <year>2018</year>
          ). https://doi.org/10.2991/tlicsc-18.2018.24
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <article-title>How are age-based gaming ratings set?</article-title>
          , https://www.kaspersky.com/blog/gaming-age-ratings/11647/.
          Last accessed 7 Apr 2020.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Hu</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>News hotspots detection and tracking based on LDA topic model</article-title>
          .
          <source>In: 2016 International Conference on Progress in Informatics and Computing (PIC)</source>
          . IEEE, pp.
          <fpage>248</fpage>
          -
          <lpage>252</lpage>
          (
          <year>2016</year>
          ). https://doi.org/10.1109/pic.2016.7949504
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ilyasova</surname>
            ,
            <given-names>R. S.</given-names>
          </string-name>
          :
          <article-title>Dialectal lexis of P. P. Bazov's narrations «Malachite casket»</article-title>
          .
          <source>Letters of the Chechen State University</source>
          <volume>3</volume>
          (
          <issue>11</issue>
          ),
          <fpage>103</fpage>
          -
          <lpage>107</lpage>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
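          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schütze</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Scoring, term weighting, and the vector space model</article-title>
          .
          <source>Introduction to Information Retrieval</source>
          , p.
          <fpage>100</fpage>
          (
          <year>2008</year>
          ). https://doi.org/10.1017/CBO9780511809071.007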
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Mitrofanova</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sedova</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Topic Modelling in Parallel and Comparable Fiction Texts (the case study of English and Russian prose)</article-title>
          .
          <source>In: Proceedings of the International Conference IMS-2017</source>
          , pp.
          <fpage>175</fpage>
          -
          <lpage>180</lpage>
          (
          <year>2017</year>
          ). https://doi.org/10.1145/3143699.3143734
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Raschka</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A questionable practice: Dixon's Q test for outlier identification</article-title>
          , https://sebastianraschka.com/Articles/2014_dixon_test.html.
          Last accessed 13 Apr 2020. https://doi.org/10.13140/2.1.3000.0004
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Rehurek</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sojka</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Gensim - statistical semantics in Python</article-title>
          . Retrieved from gensim.org (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <article-title>Natasha - high quality compact solution for extracting named entities from news articles in Russian</article-title>
          , https://natasha.github.io/ner/.
          Last accessed 26 Jul 2020.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Loper</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bird</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>NLTK: the natural language toolkit</article-title>
          .
          <source>arXiv preprint cs/0205028</source>
          (
          <year>2002</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Vorontsov</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Potapenko</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Tutorial on probabilistic topic modeling: Additive regularization for stochastic matrix factorization</article-title>
          .
          <source>In: International Conference on Analysis of Images, Social Networks and Texts</source>
          , pp.
          <fpage>29</fpage>
          -
          <lpage>46</lpage>
          . Springer, Cham (
          <year>2014</year>
          ). https://doi.org/10.1007/978-3-319-12580-0_3
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Optimization of Topic Recognition Model for News Texts Based on LDA</article-title>
          .
          <source>Journal of Digital Information Management</source>
          <volume>17</volume>
          (
          <issue>5</issue>
          ),
          <fpage>257</fpage>
          (
          <year>2019</year>
          ). https://doi.org/10.6025/jdim/2019/17/5/257-269
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>