=Paper= {{Paper |id=Vol-2790/paper27 |storemode=property |title= Exploring Book Themes in the Russian Age Rating System: a Topic Modeling Approach |pdfUrl=https://ceur-ws.org/Vol-2790/paper27.pdf |volume=Vol-2790 |authors=Anna Glazkova |dblpUrl=https://dblp.org/rec/conf/rcdl/Glazkova20 }} == Exploring Book Themes in the Russian Age Rating System: a Topic Modeling Approach == https://ceur-ws.org/Vol-2790/paper27.pdf
      Exploring Book Themes in the Russian Age
      Rating System: a Topic Modeling Approach*

                          Anna Glazkova[0000-0001-8409-6457]

                     University of Tyumen, Tyumen 625003, Russia
                                 a.v.glazkova@utmn.ru



         Abstract. Age rating systems are created to indicate target ages of
         potential content users based on information security and text semantics.
         Age ratings are usually given as numbers that indicate the youngest
         age the content is suitable for. A book or film with a 12+ rating has
         content that is suitable only for people aged 12 years and over, and a
         book or film with an 18+ rating is suitable for adults only. Currently,
         content assessment in terms of information security is carried out by
         experts. In this paper, we empirically compare book abstracts assigned
         to different age ratings using unsupervised topic modeling. We use an
         LDA model to discover topics from a collection of book abstracts. We
         then use statistical methods to study relations between the age rating
         categories assigned to books by experts and the topics obtained. We
         believe that our comparisons yield findings that are useful for age
         rating automation.

         Keywords: Topic modeling • Age restrictions • Age rating • Text classification •
         Statistical methods.


1      Introduction

Age-based ratings serve as a warning that the content may be unsuitable for
children. Moreover, age ratings are used to ensure that entertainment content,
such as books, films, games, or mobile apps, is clearly labelled with a
minimum age recommendation.
    While some books are suitable for readers of all ages, others are only
suitable for older children and young teenagers. A specific portion of books contains
information that is only appropriate for an adult audience.
    Age-based rating systems differ between countries. Whereas the classification
systems in Russia, Europe and Germany are based purely on age, the rating
systems in the USA and Australia might be interpreted with consideration of
factors other than age. For example, in Australia there are two different 18+
ratings applied to either adult content or pornographic materials [6].
    The Russian Age Rating System (RARS) includes 5 categories of content:

* Supported by the grant of the President of the Russian Federation no. MK-637.2020.9.


    Copyright © 2020 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).




      for children under the age of six (0+);
      for children over the age of six (6+);
      for children over the age of twelve (12+);
      for children over the age of sixteen (16+);
      prohibited for children (18+).
    The RARS was introduced in 2012, when Federal Law of the Russian Federation
no. 436-FZ of 2010-12-29 «On Protection of Children from Information
Harmful to Their Health and Development» came into force [3]. The law prohibits
the distribution of material that depicts violence, unlawful activities,
substance abuse, or self-harm.
    The aim of this article is to compare the topics of the texts assigned to
different age rating categories (according to the RARS). Our findings can potentially
benefit many text classification applications, such as recommender systems and
text filtering systems. First, through topic analysis, we can gain a deeper
understanding of the structure of age rating systems. Since the age rating of a book is
currently assessed empirically, topic analysis is an important step
towards formalizing this task. The results of the analysis make it
clearer which books on which topics most often contain information that is
unsuitable for children or addressed to a particular age audience. In addition,
the results of topic modeling help to highlight specific topics for different
age groups. For this reason, topic distributions can be used as additional
features for automatic age rating classifiers in our further research.
    The paper is divided into four sections. The first section is the introduction. The
second section describes the data preprocessing used for this study and
our topic model. The third section presents the empirical analysis
of topics. The last section is the conclusion.


2     Methodology

2.1     Data Preparation

We use a collection of abstracts for books in Russian. These abstracts were
collected from public online libraries.
    Text preprocessing included the following actions:
      standard steps, such as conversion to lower case, removal of punctuation
      and digits, lemmatization, and removal of extra white space;
      exclusion of all stop words using the Natural Language Toolkit (NLTK) Python
      library [14] and of words with fewer than 3 symbols;
      removal of words with TF-IDF weights less than 0.15. TF-IDF (term
      frequency-inverse document frequency) is a statistical measure that shows how
      important a word is to a document in a text collection. The TF-IDF value
      increases proportionally to the number of times a word appears in the
      document and is offset by the number of documents in the text collection that
      contain the word [9]. So, this action allowed us to exclude words typical




      of book abstracts (these are usually the words [...]

                                          0+
    1 «Сказка», «волшебный», «сказочный», «серия», «японский»,
      «украинский», «индийский», «арабский» («tale», «magic», «fairy
      tale», «series», «Japanese», «Ukrainian», «Indian», «Arabic»)
    2 «Рассказ», «сказка», «тетрадь», «маленькие», «занятие», «пособие»,
      «интеллект», «серия» («story», «tale», «notebook», «little», «lesson»,
      «manual», «intelligence», «series»)
    3 «Любовь», «рассказ», «стих», «город», «сборник», «сказка»,
      «пересказ», «перевод» («love», «story», «verse», «city», «collection»,
      «tale», «retelling», «translation»)
    4 «Ребенок», «планета», «написать», «стихотворная_форма»,
      «рассказывать», «сказка», «русский», «приключение» («child»,
      «planet», «to write», «poetic form», «to tell», «tale», «Russian»,
      «adventure»)
    5 «Рассказ», «ребенок», «поучительный», «чтение», «церковный»,
      «любить», «Богородица», «учебник» («story», «child», «instructive»,
      «reading», «church», «to love», «Mother of God», «textbook»)
                                          6+
    1 «Школьный_возраст», «рассказ», «младший_школьник»,
      «детский_писатель», «ребенок», «известный», «сборник», «повесть»
      («school age», «story», «younger school student», «children's writer»,
      «child», «famous», «collection», «short novel»)
    2 «Сказка», «волшебный», «сказочный», «серия», «японский»,
      «украинский», «индийский», «арабский» («tale», «magic», «fairy
      tale», «series», «Japanese», «Ukrainian», «Indian», «Arabic»)
    3 «Упражнение», «язык», «закрепление», «брошюра», «английский»,
      «самоучитель», «лингвистический», «испанский» («exercise»,
      «language», «reinforcement», «brochure», «English», «self-study guide»,
      «linguistic», «Spanish»)
    4 «Ребенок», «планета», «написать», «стихотворная_форма»,
      «рассказывать», «сказка», «русский», «приключение» («child»,
      «planet», «to write», «poetic form», «to tell», «tale», «Russian»,
      «adventure»)
    5 «Стихотворная_форма», «планета», «рассказывать», «растение»,
      «алфавит», «иллюстрация», «стихотворение» («poetic form», «planet»,
      «to tell», «plant», «alphabet», «illustration», «poem»)
                                         12+
    1 «Школьный_возраст», «рассказ», «младший_школьник»,
      «детский_писатель», «ребенок», «известный», «сборник», «повесть»
      («school age», «story», «younger school student», «children's writer»,
      «child», «famous», «collection», «short novel»)
    2 «Сказка», «волшебный», «сказочный», «серия», «японский»,
      «украинский», «индийский», «арабский» («tale», «magic», «fairy
      tale», «series», «Japanese», «Ukrainian», «Indian», «Arabic»)
    3 «Упражнение», «язык», «закрепление», «брошюра», «английский»,
      «самоучитель», «лингвистический», «испанский» («exercise»,
      «language», «reinforcement», «brochure», «English», «self-study guide»,
      «linguistic», «Spanish»)
    4 «Ребенок», «планета», «написать», «стихотворная_форма»,
      «рассказывать», «сказка», «русский», «приключение» («child»,
      «planet», «to write», «poetic form», «to tell», «tale», «Russian»,
      «adventure»)
    5 «Стихотворная_форма», «планета», «рассказывать», «растение»,
      «алфавит», «иллюстрация», «стихотворение» («poetic form», «planet»,
      «to tell», «plant», «alphabet», «illustration», «poem»)
                                         16+
    1 «Жадный», «погон», «замучить», «милость», «заметить», «счастье»,
      «свадебный», «глотать» («greedy», «shoulder strap», «to torture»,
      «mercy», «to notice», «happiness», «wedding», «to swallow»)
    2 [...], «гипотетический», «догнать», «былой», «предшественник»,
      «достижение» ([...], «hypothetical», «to catch up», «former»,
      «predecessor», «achievement»)
    3 «Метод», «французский», «упрощение», «повторяемость»,
      «заучивание», «уникальность», «текст», «лексический» («method»,
      «French», «simplification», «repeatability», «memorization»,
      «uniqueness», «text», «lexical»)
    4 «Детский», «ребенок», «отношение», «педагог», «искусство»,
      «воспитание», «зависимость», «женщина» («children's», «child»,
      «relation», «teacher», «art», «upbringing», «dependence», «woman»)
    5 «Книга», «прошлое», «человек», «любовь», «горький», «терять»,
      «пройти», «мир» («book», «past», «person», «love», «bitter», «to lose»,
      «to pass», «peace»)
                                         18+
    1 «Чувство», «отношение», «новелла», «представлять», «реальность»,
      «интрига», «удовольствие», «фантазия» («feeling», «relation»,
      «novella», «to represent», «reality», «intrigue», «pleasure», «fantasy»)
    2 [...], «гороскоп», «светить», «знак», «поведение», «тип»,
      «предопределять», «сексуальный» ([...], «horoscope», «to shine»,
      «sign», «behavior», «type», «to determine», «sexual»)
    3 «Книга», «прошлое», «человек», «любовь», «горький», «терять»,
      «пройти», «мир» («book», «past», «person», «love», «bitter», «to lose»,
      «to pass», «peace»)
    4 «История», «друг», «жанр», «смешной», «личность», «диалог»,
      «герой», «весёлый» («story», «friend», «genre», «funny», «personality»,
      «dialogue», «hero», «cheerful»)
    5 [...], «отличие», «мужчина», «женщина», «мир», «здоровье»,
      «действие», «тело» ([...], «difference», «man», «woman», «world»,
      «health», «action», «body»)

The «Main topic» column shows the percentage of texts for which the topic is the main one, i.e. it has
the largest proportion in the topic distribution.
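
The main-topic percentage described above can be illustrated with a minimal sketch. It assumes that each document is represented by a list of topic proportions, as an LDA model would produce. The function name `main_topic_shares` and the toy matrix `doc_topic` are hypothetical and are not taken from the study.

```python
from collections import Counter


def main_topic_shares(doc_topic):
    """Return {topic index: percentage of documents whose main topic it is}.

    doc_topic is a list of per-document topic distributions (lists of floats).
    The main topic of a document is the one with the largest proportion.
    """
    main_topics = [
        max(range(len(row)), key=row.__getitem__)  # index of the largest proportion
        for row in doc_topic
    ]
    counts = Counter(main_topics)
    return {topic: 100.0 * n / len(doc_topic) for topic, n in counts.items()}


# Toy data (illustrative only): 4 documents, 3 topics.
doc_topic = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.5, 0.4, 0.1],
    [0.2, 0.2, 0.6],
]
shares = main_topic_shares(doc_topic)
print(shares)
```

In this toy example, topic 0 is the main topic of two of the four documents, so its share is 50%.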

                 Table 3: The most typical topics for categories.

 Category   Keywords                                                      Main topic

 0+         «tale», «magic», «fairy tale», «series», «Japanese»,          7.55%
            «Ukrainian», «Indian», «Arabic»
 16+        [...]                                                         [...]
 [...]      [...]                                                         1.53%
 18+        [...]                                                         1.78%
 0+         [...], «intelligence», [...]                                  [...]
 [...]      [...], «story», «younger school student», [...]               2.45%
 12+        [...]                                                         2.48%




3.3    Age-specific Topics
In this subsection, we present the topics that are typical mainly of one age
rating category, identified using Dixon's Q-test. The tabulated Qcrit value is equal to
0.642 for a confidence level of 90% and m = 5.
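
As a minimal sketch of how such a test can be applied, the following code computes Dixon's Q statistic for the largest of m = 5 values and compares it with the tabulated threshold of 0.642. It assumes that the five values are the shares of one topic in the five age rating categories. This reading, the function names and the toy numbers are assumptions made for illustration, not the exact procedure of the study.

```python
Q_CRIT_90_M5 = 0.642  # tabulated critical value for 90% confidence and m = 5


def dixon_q_max(values):
    """Dixon's Q for the largest value: gap to its nearest neighbour / range."""
    ordered = sorted(values)
    data_range = ordered[-1] - ordered[0]
    if data_range == 0:
        return 0.0  # all values are equal, so there is no outlier
    return (ordered[-1] - ordered[-2]) / data_range


def is_age_specific(values, q_crit=Q_CRIT_90_M5):
    """True if the largest value is an outlier according to the Q-test."""
    return dixon_q_max(values) > q_crit


# Hypothetical shares of one topic in the five categories (not real data).
shares = {"0+": 0.30, "6+": 0.04, "12+": 0.03, "16+": 0.02, "18+": 0.02}
q = dixon_q_max(shares.values())
print(round(q, 2), is_age_specific(shares.values()))
```

Here the share for 0+ stands far apart from the other four values, so Q is well above 0.642 and the topic would be treated as specific to that category. A topic with roughly equal shares in all categories yields a small Q and is not flagged.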
     We noticed that documents in category 0+ are largely mono-thematic. At the
same time, documents of other categories are usually mixtures of topics.
Therefore, our topic model has many specific topics for texts from the 0+ category. In
Table 4, we present the list of the most common age-specific topics in our data
set.
     As one would expect, age-specific topics generally relate to
children's books, as well as to specific literature from the 18+ category (in our case,
literature on business and success).

                             Table 4. Age-specific categories.

 Category   Keywords                                                          Qcrit
 18+        «success», «activity», «man», «business», «city», «collection»,   0.94
            «European», [...]
 [...]      [...], «manual», [...], «plunge», «fairytale atmosphere»,         [...]
            «emotion», [...]
 0+         [...]                                                             [...]
 0+         [...], «Ukrainian», [...]                                         0.76
 0+         [...]¹, [...], «humor», [...], «capable»                          0.75



4     Conclusion
In this paper, we empirically analyzed the topics of texts assigned to different age
rating categories. We presented the distribution of topics across age categories,
the list of the most common topics for each category, and the age-specific topics. These
lists of topics were obtained using statistical methods. Our analysis confirmed the
existing differences between the categories and demonstrated that topic models
can be a good source of features for age rating identification. In our future work,
we will try to develop a machine learning classifier for automatically determining
the age rating of a text.
¹   This topic is probably related to «The Malachite Box» («Малахитовая шкатулка»).
    This is a book of fairy tales and folk tales of the Ural region of Russia compiled
    by Pavel Bazhov and published from 1936 to 1945. It is written in contemporary
    language and blends elements of everyday life with fantastic creatures of mountains
    and forests. This book significantly popularized the folklore of the Urals [8].




References
1. Blei, D. M., Ng, A. Y., Jordan, M. I.: Latent Dirichlet allocation. Journal of
   Machine Learning Research 3(Jan), 993-1022 (2003).
2. Dixon, W. J.: Processing data for outliers. Biometrics 9(1), 74-89 (1953).
   https://doi.org/10.2307/3001634
3. Federal Law of December 29, 2010 N 436-FZ (as amended on May 1, 2019) «On
   Protection of Children from Information Harmful to Their Health and Development»
   (as amended and supplemented, entered into force on October 29, 2019)
   [Federal'nyj zakon ot 29.12.2010 N 436-FZ (red. ot 01.05.2019) «O zashchite detej ot
   informacii, prichinyayushchej vred ih zdorov'yu i razvitiyu» (s izm. i dop., vstup. v silu
   s 29.10.2019).], http://www.consultant.ru/document/cons_doc_LAW_108808/.
   Last accessed 7 Apr 2020.
4. Glazkova, A., Kruzhinov, V., Sokova, Z.: Dynamic Topic Models for Retrospective
   Event Detection: A Study on Soviet Opposition-Leaning Media. In: International
   Conference on Analysis of Images, Social Networks and Texts, pp. 145-154, Springer,
   Cham (2019). https://doi.org/10.1007/978-3-030-37334-4_13
5. Gong, H., You, F., Guan, X., Cao, Y., Lai, S.: Application of LDA Topic Model
   in E-Mail Subject Classification. In: 2018 International Conference on Transportation
   & Logistics, Information & Communication, Smart City. Atlantis Press (2018).
   https://doi.org/10.2991/tlicsc-18.2018.24
6. How are age-based gaming ratings set?,
   https://www.kaspersky.com/blog/gaming-age-ratings/11647/. Last accessed 7 Apr 2020.
7. Hu, X.: News hotspots detection and tracking based on LDA topic model. In: 2016
   International Conference on Progress in Informatics and Computing (PIC). IEEE,
   pp. 248-252 (2016). https://doi.org/10.1109/pic.2016.7949504
8. Ilyasova, R. S.: Dialectal lexis of P. P. Bazov's narrations <