Computing Syntactic Parameters for Automated Text Complexity Assessment

Valery Solovyev^1, Marina Solnyshkina^1, Vladimir Ivanov^2, and Ivan Rygaev^3

1 Kazan Federal University, Kazan, Russia
2 Innopolis University, Innopolis, Russia
3 Institute for Information Transmission Problems, Moscow, Russia

Abstract. The article focuses on identifying, extracting and evaluating syntactic parameters that influence the complexity of Russian academic texts. Our ultimate goal is to select a set of text features that effectively measure text complexity and to build an automatic tool able to rank Russian academic texts by grade level, training models on the most promising features with machine learning methods. The innovative algorithm for designing a predictive model of text complexity is based on a training text corpus and a set of previously proposed and new syntactic features (average sentence length, average number of syllables per word, number of adjectives, average number of participial constructions, average number of coordinating chains, and path number, i.e. the average number of sub-trees). Our best model achieves an MSE of 1.15. Our experiments indicate that adding the syntactic features listed above, namely the average number of participial constructions, the average number of coordinating chains, and the average number of sub-trees, substantially improves the performance of a text complexity model.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: P. Sosnin, V. Maklaev, E. Sosnina (eds.): Proceedings of the IS-2019 Conference, Ulyanovsk, Russia, 24-27 September 2019, published at http://ceur-ws.org

1 Introduction

Effective reading comprehension implies that reading materials correspond to readers' cognitive and language abilities. The idea behind the existing practice in education is to ensure that students are exposed to age-appropriate materials which are neither too complicated nor too simple for the reader. "Age-appropriateness" has traditionally been measured by the grade level, which is viewed as "what all students need to know and be able to do at each grade level" to progress through their education^1.

Grade level descriptors identify the specific content (knowledge, skills, abilities) and the language of a particular course to which students at a particular education stage (or grade) are exposed^2. There are also a number of English text complexity analyzers available online for any educator selecting a text for students^3,4. To close the gap between the complexity of texts and students' abilities, scholars have been developing tools to profile texts that students would be able to, and would want to, read^5,6. The existing automatic analyzers use hundreds of parameters, ranging from quantitative (word length and sentence length only) to qualitative (levels of meaning or purpose; structure; language conventionality and clarity; knowledge demands), to match a particular reader to a text^7,8. T.E.R.A., for instance, is an engine developed in 2012 by SoLET Lab which analyzes five textual components: narrativity, syntactic simplicity, word concreteness, referential cohesion and deep cohesion^9. T.E.R.A. also predicts the grade level of a text using the Flesch-Kincaid Grade Level readability formula (FKGL) [3].

1 http://www.k12.wa.us/CurriculumInstruct/learningstandards.aspx
2 https://www.gov.uk/government/publications/grade-descriptors-for-gcses-graded-9-to-1/grade-descriptors-for-gcses-graded-9-to-1-english-language
3 http://www.readabilityformulas.com/free-readability-formula-tests.php
4 https://www.wikihow.com/Determine-the-Reading-Level-of-a-Book
5 http://readability.io/
6 https://netpeak.net/ru/blog/readability/
7 https://readable.io/text/
8 http://casemed.case.edu/cpcpold/students/module4/Word_Readability.pdf
9 http://129.219.222.66/Publish/tera.html
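For reference, the FKGL formula mentioned above combines the same two surface measures (average sentence length and average syllables per word) that recur throughout this paper; in its standard published form [3] it reads:

FKGL = 0.39 · ASL + 11.8 · ASW − 15.59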
Knowing the grade level of the texts in a corpus, educators can select the most suitable text for a target audience.

The situation in Russia is different. Significant gaps have been reported between the complexity levels of texts that students are asked to read, in high school as well as at the tertiary level, and students' abilities: the books are either too simple or too complicated for the students^10. Researchers also have evidence of students' unwillingness to read^11, which in many cases is caused by an educator's inappropriate selection of a book. Unfortunately, Russian text complexity analyzers so far apply no variables other than quantitative ones, i.e. word length and sentence length [5]. In this paper, we address the following research question: which syntactic text features correlate best with the complexity of Russian academic texts?

2 Background

Though the terms 'text complexity', 'text difficulty' and 'text readability' are still sometimes used interchangeably, these notions began to be separated in Russian academic literature as early as the 1970s [9]. I. Lerner defined complexity "as a category that characterizes the range of activities necessary to solve a cognitive task, regardless of who performs this activity" [4], while text difficulty is viewed by the researcher as "a category characterizing a person's readiness to overcome obstacles while comprehending a reading text" [4]. Accepting I. Lerner's point of view, V. Tcetlin specified that "complexity of any educational material is its objective characteristic, whereas difficulty is a subjective factor of students' preparedness to overcome complexity" [14]. The general approach to text readability once proposed by M. Vogel and C. Washburne was taken as a basis in all subsequent works. It relies on objective characteristics of the text that correlate highly with the quantitative results of a comprehension test, and implies fitting a regression equation between the success of text comprehension, on the one hand, and text parameters, on the other [6]. There are also a number of graphic parameters of a text that influence its comprehension, such as fonts, indentations, spaces, colors, etc.; these are beyond the authors' interest in this article.

The first attempts to assess the complexity of Russian texts were made in the late 1970s, about 50 years later than the corresponding English studies [2]. In 1985, Yu. Tomina [13] suggested that the lexical indicators of the linguistic difficulty of texts are the numbers of unfamiliar and abstract words, while the syntactic indicators are the number of participial constructions, the number of similar parts of a sentence, and the number of prepositional-nominal groups.

10 https://elibrary.ru/item.asp?id=20134801, http://www.miep.edu.ru/uploaded/zvezdova_oreshkin.pdf, http://www.mdk-arbat.ru/bookcard?book_id=934790
11 https://www.science-education.ru/ru/article/view?id=22229, https://cyberleninka.ru/article/v/chtenie-kak-sotsialnaya-problema, https://trv-science.ru/2015/11/17/pochemu-ne-chitayut-shkolniki/
The formulas proposed for predicting the readability of Russian texts in the late 1980s were likewise based on a small number of objective variables borrowed from the analogous English text complexity formulas, i.e. word length and sentence length. Assessments of text readability were initially carried out by hand; in the 2010s they began to be computed by software. These programs were all based on I. Oborneva's readability formula derived in 2006 [5]:

FRE = 206.836 − 1.52 · ASL − 65.14 · ASW

Here, ASL is the average sentence length, i.e., the number of words divided by the number of sentences; ASW is the average number of syllables per word, i.e., the number of syllables divided by the number of words in a text. The constants were calculated on the basis of the analogous English formulas as well as a comparison of 100 parallel English/Russian literary texts and the words in two academic dictionaries: the Explanatory Dictionary of the Russian Language by S. I. Ozhegov, with 39,174 registered words^12, and Muller's English-Russian dictionary, with 41,977 words^13.

12 https://slovarozhegova.ru/
13 https://slovar-vocab.com/english-russian/muller-vocab.html

3 Datasets

Two collections of texts were assembled for the research. The first collection, 7 texts from textbooks on Social Studies by L. N. Bogolubov, marked "BOG", was selected to train the predictive model and to define the independent variables of text variation across grade levels 5–11. The second collection, 7 texts from textbooks on Social Studies by A. F. Nikitin, marked "NIK", also covers grade levels 5–11. Further on we refer to the two collections as the Russian Readability Corpus (RRC). Both sets of textbooks are from the "Federal List of Textbooks Recommended by the Ministry of Education and Science of the Russian Federation to Use in Secondary and High Schools"^14.

14 http://www.fpu.edu.ru/fpu/

To ensure reproducibility of the results, we uploaded the corpus to a website, thus making it available online^15. Note, however, that the sentences of the published texts are given in shuffled order. The sizes of the BOG and NIK collections are presented in Table 1.

Table 1. Properties of the preprocessed corpus.

          Tokens            Sentences       Words per sentence   Syllables per word
Grade     BOG      NIK      BOG     NIK     BOG      NIK         BOG      NIK
5-th      –        17,221   –       1,499   –        11.49       –        2.35
6-th      16,467   16,475   1,273   1,197   12.94    13.76       2.56     2.71
7-th      23,069   22,924   1,671   1,675   13.81    13.69       2.84     2.70
8-th      49,796   40,053   3,181   2,889   15.65    13.86       2.96     2.88
9-th      42,305   43,404   2,584   2,792   16.37    15.55       3.04     3.00
10-th     75,182   39,183   4,468   2,468   16.83    15.88       3.07     3.12
10-th*    98,034   –        5,798   –       16.91    –           3.05     –
11-th     –        38,869   –       2,270   –        17.12       –        3.11
11-th*    100,800  –        6,004   –       16.79    –           3.19     –

4 Methods

4.1 Corpus preprocessing

For the sake of consistency, we preprocessed all texts in the corpus in the same way. Common preprocessing included tokenization, splitting the text into sentences, and part-of-speech tagging (using the TreeTagger for Russian^16). During the preprocessing step we excluded all extremely long sentences (longer than 120 words) as well as extremely short ones (shorter than 5 words), which we consider outliers. Such sentences may well not be outliers in another domain, but in school textbooks on Social Studies sentences shorter than 5 words are: they mostly appear as names of chapters and sections of the books or result from incorrect sentence splitting. We omit those sentences because average sentence length is a very important feature in text complexity assessment and hence should not be biased by splitting errors. At the same time, Russian sentences of five to seven words can still be viewed as short sentences.

15 https://kpfu.ru/slozhnost-tekstov-304364.html
16 http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
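The sentence-length filter above, the surface statistics ASL and ASW, and the Oborneva score from Section 2 are straightforward to compute. The following is a minimal Python sketch, assuming a naive regex tokenizer and counting Russian vowel letters as a proxy for syllables; it stands in for, and is not, the TreeTagger pipeline actually used:

    import re

    RU_VOWELS = set("аеёиоуыэюя")

    def split_sentences(text):
        # naive splitter on end punctuation (the corpus used a proper splitter)
        return [s for s in re.split(r"[.!?]+\s+", text) if s.strip()]

    def words(sentence):
        return re.findall(r"[а-яёА-ЯЁa-zA-Z-]+", sentence)

    def syllables(word):
        # proxy: one syllable per vowel letter, at least one per word
        return sum(ch in RU_VOWELS for ch in word.lower()) or 1

    def oborneva_fre(text):
        # keep sentences of 5 to 120 words, as in Section 4.1
        sents = [s for s in split_sentences(text) if 5 <= len(words(s)) <= 120]
        n_words = sum(len(words(s)) for s in sents)
        n_syll = sum(syllables(w) for s in sents for w in words(s))
        asl = n_words / len(sents)   # average sentence length
        asw = n_syll / n_words       # average syllables per word
        return 206.836 - 1.52 * asl - 65.14 * asw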
4.2 Lexical level features

We have explored an extended feature set for text complexity modeling:

– frequency of content words (FREQ),
– average words per sentence (ASL),
– average syllables per word (ASW), and
– features based on POS tags:
  • number of nouns per sentence (NOUNS),
  • number of verbs per sentence (VERBS),
  • number of adjectives per sentence (ADJ),
  • number of pronouns per sentence (PRONOUNS),
  • number of personal pronouns per sentence (PERS. PRONOUNS),
  • number of negations per sentence (NEG),
  • number of connectives per sentence (CONN).

We tested the features for their importance in a linear regression model. Two features performed better than the others: FREQ and ADJ. To assess the quality of the proposed models, the mean squared error (MSE) and the coefficient of determination R^2 were used. The results of fitting the parameter values on the corpus are given in Table 2.

Table 2. Results of fitting 3-parameter linear models on the RRC dataset.

Features          Formula for grade level                        R^2    MSE
ASL, ASW, FREQ    1.92 + 0.5 · ASL + 2.12 · ASW − 2.36 · FREQ    0.94   0.22
ASL, ASW, ADJ     1.59 + 0.23 · ASL − 1.48 · ASW + 3.97 · ADJ    0.95   0.19
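The fits in Table 2 are ordinary least-squares regressions of grade level on feature values. A minimal scikit-learn sketch of the procedure (the feature values below are invented placeholders, not the RRC data):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error, r2_score

    # rows: texts; columns: ASL, ASW, FREQ (placeholder values, one text per grade)
    X = np.array([[11.5, 2.35, 0.61], [12.9, 2.56, 0.60], [13.8, 2.84, 0.58],
                  [15.7, 2.96, 0.57], [16.4, 3.04, 0.55], [16.8, 3.07, 0.54]])
    y = np.array([5, 6, 7, 8, 9, 10])  # grade levels

    model = LinearRegression().fit(X, y)
    pred = model.predict(X)
    print(model.intercept_, model.coef_)  # constants as in Table 2
    print(r2_score(y, pred), mean_squared_error(y, pred))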
4.3 Syntactic level features

Modern tools for processing Russian texts can extract several syntactic features, which makes it possible to include a number of syntactic structures as features in readability formulas. To this end we use the syntactic analyzer ETAP-3, whose parser employs a very detailed description of Russian grammar. In linguistic terms, parsing results in a dependency tree whose nodes are the word tokens of the input sentence and whose edges are the established syntactic dependencies. Thus, every tree node corresponds to a word token in the processed sentence, whilst the directed arcs are labeled with the names of syntactic relations [1]. All syntactic dependencies are directed, so each dependency has an original node, the host, and a final node, the dependent. Each word token is represented by the initial form of the word and a set of its morphological characteristics [1]. All the texts in the collection were processed with the syntactic parser: each sentence was converted into a dependency tree. Then the following 15 numeric features were extracted (a code sketch illustrating the tree-structure features appears after Table 3):

– The 'average path' is the quotient of the number of nodes and the number of "leaves" in a sentence.
– The 'average sochin length', the average length of coordinating constructions, is the number of nodes in "branches" starting with coordinating constructions divided by the number of such branches; all types of nodes are counted, including conjunctions and modifiers.
– The 'deeprich rate', the average number of verbal adverb phrases, is the quotient of the number of verbal participles and the number of sentences; a verbal participle is defined here as a verbal adverb with at least one dependent modifier.
– The 'deeprich v', the average span of a verbal adverb phrase, is the number of nodes dependent on the verbal adverb (in all "branches") divided by the number of verbal adverb phrases.
– The 'leaves number' is the average number of "leaves" (terminal nodes, i.e., words that are not hosts of any other word) in a sentence: the number of all "leaves" in the text divided by the number of sentences.
– The 'longest path', the average length of the longest branch, is the sum of the lengths of the longest branches of the sentences divided by the number of sentences.
– The 'nouns dep' is the average number of modifiers in a nominal group, i.e. the sum of all the nodes depending on nouns divided by the number of nouns; coordinating and explanatory links were ignored.
– The 'podchin number' is the ratio of sentences containing at least one subordinate-conjunction or relational link: the number of sentences with at least one link of this type divided by the number of sentences.
– The 'podchin rate' is the average number of subordinate links: the number of subordinate-conjunction and relational links divided by the number of sentences.
– The 'prich rate', the average number of participial constructions, is the number of participial constructions divided by the number of sentences; a participial construction is defined as a participle with at least one dependent.
– The 'prich v', the average span of a participial construction, is the quotient of the number of nodes depending on the participle (in all "branches") and the number of participial constructions.
– The 'sentsoch number', the average number of compound sentences, is the quotient of the number of coordinating constructions and the number of sentences.
– The 'sochin number' is the average number of coordinating chains: the number of coordinating chains divided by the number of sentences; a chain is a sequence of nodes connected by "coordinating" links, so conjunctions "break" the chain.
– The 'path number' is the average number of sub-trees in a sentence, calculated with an external algorithm.
– The 'verbs dep' is the average number of dependents of a finite verb: the sum of nodes directly dependent on finite verbs divided by the number of finite verbs; coordinating and explanatory links were ignored.

The list of features extracted by the ETAP parser was then examined in order to group similar features. The results of the correlation analysis are given in Table 3. Some syntactic features are similar to others, and many correlate with the target variable (readability, measured as the grade number). However, all the syntactic features have a lower correlation coefficient with the target feature ('Grade Level') than the two 'classical' lexical features (ASL and ASW) do.

Table 3. Correlation between features and the 'Grade Level'.

Feature name                  Correlation coefficient with the 'Grade Level'
ASL                           0.94
ASW                           0.94
sochin number (SN)            0.93
prich rate (PR)               0.91
nouns dep (ND)                0.88
average sochin length (SL)    0.87
path number (PN)              0.87
longest path (LP)             0.84
leaves number (LN)            0.84
average path (AP)             0.84
podchin rate                  0.64
podchin number                0.62
deeprich v                    0.52
deeprich rate                 0.44
verbs dep                     0.43
prich v                       0.33
sentsoch number               0.03

Nevertheless, the strongest syntactic features correlate highly with the target variable. This information could be useful for readability and text complexity prediction in Russian.
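To make the tree-structure features concrete, the sketch below computes 'leaves number', 'average path' and 'longest path' for a single sentence from a plain head-index encoding (heads[i] is the index of token i's host, or -1 for the root). This toy encoding is an assumption for illustration and stands in for the actual ETAP-3 output format:

    def tree_features(heads):
        """heads[i] = index of the host of token i, or -1 for the root."""
        n = len(heads)
        hosts = {h for h in heads if h >= 0}
        leaves = [i for i in range(n) if i not in hosts]  # tokens hosting nothing

        def depth(i):
            # length of the path from token i up to the root
            d = 0
            while heads[i] >= 0:
                i = heads[i]
                d += 1
            return d

        leaves_number = len(leaves)
        average_path = n / leaves_number              # nodes per "leaf"
        longest_path = max(depth(i) for i in leaves)  # longest branch
        return leaves_number, average_path, longest_path

    # "Мама мыла раму": "мыла" is the root; "мама" and "раму" depend on it
    print(tree_features([1, -1, 1]))  # -> (2, 1.5, 1)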
Our goal in the next subsection is to evaluate the syntactic features with respect to their capability to serve as predictors in a linear regression model. As Table 3 shows, a few of the syntactic features have low correlation with the target variable ('Grade Level'). In the rest of the paper we therefore consider only the syntactic features with a correlation coefficient above 0.8 (see Table 3).

4.4 Evaluation of syntactic features

To evaluate the syntactic features we carried out two experiments. In the first experiment, we clustered the syntactic features with respect to their similarity to each other, where similarity is the correlation between feature values on the RRC. The derived groups of features are the following:

– Group 1 (G-1): features related to the structure of the syntax tree
  • leaves number (LN)
  • average path (AP)
  • longest path (LP)
  • path number (PN)
– Group 2 (G-2): features related to noun and participial linear sequences
  • prich rate (PR)
  • nouns dep (ND)
– Group 3 (G-3): features related to coordinating constructions
  • average sochin length (SL)
  • sochin number (SN)

The selected syntactic features could serve as better predictors in a linear regression model due to their high correlation with the target variable. We measured the performance of the resulting models in the following way: a linear regression model was trained on the 'BOG' collection and tested on the 'NIK' collection, and vice versa. In both cases we use the MSE as the measure of the model's performance (Table 4); a code sketch of this cross-collection protocol follows Table 5. First, we evaluate the three features with the highest correlation: SN, PR and ND. The next three rows of Table 4 correspond to the groups of syntactic features we found. Finally, we evaluate three features, one per group, picking from each group the feature with the highest correlation: PN from G-1, PR from G-2 and SN from G-3. It can be seen from Table 4 that the syntactic features from groups G-2 and G-3 are better than those from G-1.

Table 4. Results of syntactic feature evaluation in linear models.

Syntactic features              MSE on 'BOG'   MSE on 'NIK'
SN (G-3), PR (G-2), ND (G-2)    1.15           2.11
Group 1                         7.74           25.42
Group 2                         1.34           1.32
Group 3                         0.55           2.58
PN (G-1), PR (G-2), SN (G-3)    3.2            5.39

In the second experiment, we compared the syntactic features to the lexical level features (ASW, ASL, FREQ and ADJ) with respect to their performance. Finally, we built linear models using combinations of both syntactic and lexical features (Table 5). Using only syntactic features without any additional features leads to the worst MSE values, as shown in the 5th row of Table 5. However, combining the lexical level features (ASL and ASW) with syntactic features improves the performance of the linear model for readability assessment.

Table 5. Results of linear model performance.

row #   Feature set                      MSE on 'BOG'   MSE on 'NIK'
1       ASL, ASW                         0.44           0.76
2       ASL, ASW, FREQ                   1.97           0.39
3       ASL, ASW, ADJ                    0.26           1.02
4       ASL, ASW, ALL SYNTAX FEATURES    1.18           1.94
5       ALL SYNTAX FEATURES              2.77           6.97
6       G-1, G-2, G-3                    2.6            2.28
7       ASL, ASW, G-1, G-2, G-3          2.11           0.29
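The cross-collection protocol used in both experiments (train on one collection, report MSE on the other, in both directions) can be sketched as follows; the feature matrices are placeholders here, so the sketch shows the protocol only, not the published numbers:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    def cross_collection_mse(X_bog, y_bog, X_nik, y_nik):
        # "MSE on BOG": model trained on NIK, tested on BOG (and vice versa)
        m_nik = LinearRegression().fit(X_nik, y_nik)
        mse_on_bog = mean_squared_error(y_bog, m_nik.predict(X_bog))
        m_bog = LinearRegression().fit(X_bog, y_bog)
        mse_on_nik = mean_squared_error(y_nik, m_bog.predict(X_nik))
        return mse_on_bog, mse_on_nik

    # toy placeholder data: two syntactic features per text, grades 6-10
    rng = np.random.default_rng(0)
    X_bog, X_nik = rng.random((5, 2)), rng.random((5, 2))
    y = np.arange(6, 11)
    print(cross_collection_mse(X_bog, y, X_nik, y))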
5 Discussion and Future Work

Earlier, in [11], the authors considered text complexity formulas based only on lexical parameters (listed in Section 4.2). In [12], machine learning methods were applied to the same corpus of texts and the same set of parameters. The results obtained were superior to those previously known, but using only lexical parameters in modeling text complexity seems unnecessarily restrictive. Intuitively, syntactically complex sentences should cause difficulty in understanding a text; it was therefore natural to include syntactic features in the set of studied parameters. The choice of parser for Russian is also natural, because the parser of the ETAP system is the most powerful of the existing ones. For the study, 15 features that were expected to affect text complexity were selected on the basis of expert assessment. The main challenge to cope with is the selection of optimal values for the constants in the formulas. We show that syntactic features can be useful in readability assessment. However, the influence of the syntactic features was not as great as one might expect. This points to a future direction of research: the expansion of the set of syntactic properties. In future research, we also plan to apply semantic features, such as features based on syntactic n-grams [8, 7], and other types of information extracted from text [10].

6 Conclusion

The research indicates that syntactic features correlate highly with the target variable. The findings could be useful for readability and text complexity prediction for Russian texts. Since this is a preliminary study, there is ample room to improve all major stages of the syntactic feature extraction process. A promising direction, as the authors see it, is improving the accuracy of sentence and clause boundary detectors and of parsers. We will further experiment with syntactic complexity measures to balance the multiplicity of the text complexity construct against the model's simplicity. Furthermore, we can experiment with additional types of machine learning models and tune their parameters to derive text complexity assessment models with better performance.

Acknowledgements

This research was financially supported by the Russian Science Foundation, grant # 18-18-00436, the Russian Government Program of Competitive Growth of Kazan Federal University, and the state assignment of the Ministry of Education and Science, grant agreement # 34.5517.2017/6.7. The Russian Academic Corpus (Section 3 of the paper) was created without support from the Russian Science Foundation.

References

1. I. Boguslavsky, L. Iomdin, A. Lazursky, L. Mityushin, V. Sizov, L. Kreydlin, and A. Berdichevsky. Interactive resolution of intrinsic and translational ambiguity in a machine translation system. In A. Gelbukh, editor, Computational Linguistics and Intelligent Text Processing, pages 388–399. Springer, Berlin, Heidelberg, 2005.
2. W. H. DuBay. The principles of readability. Online Submission, 2004.
3. J. Kincaid, R. Fishburne Jr., R. Rogers, and B. Chissom. Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease formula) for Navy enlisted personnel. Technical report, Naval Technical Training Command, Millington, TN, Research Branch, 1975.
4. I. Ya. Lerner. Kriterii slozhnosti nekotorykh elementov uchebnika: Problemy shkolnogo uchebnika [The criteria for the complexity of some elements of the textbook: Problems of a school textbook], 1974.
5. I. Oborneva. Avtomatizirovannaya otsenka slozhnosti uchebnykh tekstov na osnove statisticheskikh parametrov [Automated evaluation of the complexity of academic texts on the basis of statistical parameters]. M.: RAS Institut soderzhaniya i metodov obucheniya [RAE Institute of Content and Methods of Teaching], 2006.
6. Y. Shpakovskiy. Otsenka trudnosti vospriyatiya i optimizatsiya slozhnosti uchebnogo teksta [Evaluation of the difficulty of perception and optimization of the complexity of an academic text]. PhD thesis, 2007.
7. G. Sidorov. Non-linear construction of n-grams in computational linguistics. México: Sociedad Mexicana de Inteligencia Artificial, 2013.
8. G. Sidorov, F. Velasquez, E. Stamatatos, A. Gelbukh, and L. Chanona-Hernández. Syntactic dependency-based n-grams as classification features. In Mexican International Conference on Artificial Intelligence, pages 1–11. Springer, 2012.
9. M. Solnyshkina and A. Kiselnikov. Slozhnost' teksta: Etapy izucheniya v otechestvennom prikladnom yazykoznanii [Text complexity: Study phases in Russian applied linguistics]. Vestnik Tomskogo gosudarstvennogo universiteta. Filologiya [Tomsk State University Journal of Philology], 6:38, 2015.
10. V. Solovyev and V. Ivanov. Knowledge-driven event extraction in Russian: corpus-based linguistic resources. Computational Intelligence and Neuroscience, 2016:16, 2016.
11. V. Solovyev, V. Ivanov, and M. Solnyshkina. Assessment of reading difficulty levels in Russian academic texts: Approaches and metrics. Journal of Intelligent & Fuzzy Systems, 34(5):3049–3058, 2018.
12. V. Solovyev, M. Solnyshkina, and V. Ivanov. Prediction of reading difficulty in Russian academic texts. Journal of Intelligent & Fuzzy Systems, 36:4553–4563, 2019.
13. Yu. A. Tomina. Ob"ektivnaya otsenka yazykovoj trudnosti tekstov (opisanie, povestvovanie, rassuzhdenie, dokazatel'stvo) [An objective evaluation of the linguistic difficulty of texts (description, narration, reasoning, proof)]. 1985.
14. V. S. Tsetlin. Didakticheskie trebovaniya k kriteriyam slozhnosti uchebnogo materiala [Didactic requirements for the criteria of complexity of educational material]. Novye issledovaniya v pedagogicheskikh naukakh, M.: Pedagogika, (1):30–33, 1980.