CEUR Workshop Proceedings Vol-2145, paper p17: Danguolė Kalinauskaitė, "Detecting Information-Dense Texts: Towards an Automated Analysis", https://ceur-ws.org/Vol-2145/p17.pdf
       Detecting Information-Dense Texts: Towards an
                     Automated Analysis
                                                  Danguolė Kalinauskaitė
                    Department of Lithuanian Studies, Vytautas Magnus University, Kaunas, Lithuania
                              Baltic Institute of Advanced Technology, Vilnius, Lithuania
                                              danguole.kalinauskaite@bpti.lt

    Abstract—Determining information density has become a central issue in natural language processing. While information density is seen as too complex to measure globally, a study of lexical and syntactic features allows a comparison of information density between different texts or different text genres. This paper presents part of a methodology proposed for the automatic analysis of information density based on the lexical and syntactic levels of language.

    Keywords—lexical density, information density, natural language processing, computational linguistics

                      I. INTRODUCTION

    Determining the information density of a text is a major challenge in natural language processing (NLP). Using more text fragments to train statistical NLP systems does not necessarily improve performance. Recent developments in this field have spawned a number of solutions for evaluating information density. Nevertheless, a shortfall of most of these solutions is their dependency on the genre and domain of the text. In addition, most of them are not efficient across NLP problem areas [1].

    It is worth noting that the notion of information is used only formally here: information is understood as semantic and pragmatic, and measurable only in relative terms. Information density is accordingly defined as informativity, a relative measure of semantic and pragmatic information, per clause (following [2]). In terms of semantics, then, information density is a measure of the extent to which the writer or speaker is making assertions (or asking questions) rather than just referring to entities [3]. Texts contain various elements, ranging from characters to sentences, that are assumed to have reasonable discriminating strength for evaluating information density in natural language text. Examples of such elements are simple words, complex words, function words, content words, syllables, and so on.

    The paper starts with a theoretical background on the measurement of information density, followed by a presentation of the research, and ends with conclusions and future work plans.

    The goal of this paper is to present part of a methodology proposed for the automatic analysis of information density based on the lexical and syntactic levels of language.

                II. THEORETICAL BACKGROUND

A. Information Density in Computational Linguistics

    Information-dense texts report important factual information in a direct, succinct manner. There have been various attempts to determine and evaluate the information density of texts. Earlier works did this manually; later, various programs began to appear, and the process can now be done automatically. However, these programs differ in nature, in their productivity, and in the principles by which they determine the information density of texts. A further issue is that there is considerable confusion about what the indicators of information-dense texts are, and therefore no unified methodology for measuring information density exists.

    In computational linguistics, one of the most common characteristics employed to detect information-dense texts is lexical density (see B. Lexical Density below). Some works even suggest it as the main indication of how informative a text is, and use it as a synonym for information density. However, lexical density, i.e. a single level of text that basically points to the vocabulary, is not sufficient to characterize the informativeness of the whole text. It therefore does not seem convincing to equate the two terms; more likely one is part of the other, since lexical density measures only one level of texts, namely vocabulary, whereas the informativeness of the whole text depends not only on the content but also on the structure.

    It is worth noting that the applicability of research results in this area is very broad. Numerous psychological experiments have related information density to readability [4], [5], memory, e.g., [6], quality of students' writing, e.g., [7], aging [8], [9], and prediction of Alzheimer's disease [10], [11], [12].

    High information density signals that complex interrelationships are being expressed. Low information density means relatively little information per sentence; therefore, low information density in speech or writing can indicate mental disorders, including Alzheimer's disease.
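One of the automatic approaches mentioned above, the measurement of propositional idea density from part-of-speech tags [3], can be sketched briefly: propositions are approximated by verbs, adjectives, adverbs, prepositions, and conjunctions, and density is the ratio of propositions to words. The sketch below is illustrative only, not the authors' method: the token/tag pairs are supplied by hand (a real pipeline would use a POS tagger), the Universal Dependencies style tag set is an assumption, and the adjustment rules of the published tool are ignored.

```python
# Sketch of propositional idea density in the spirit of [3]:
# propositions ~ verbs, adjectives, adverbs, prepositions, conjunctions;
# density = propositions / words. Tags follow Universal Dependencies
# conventions (an assumption for this illustration).

PROPOSITION_TAGS = {"VERB", "ADJ", "ADV", "ADP", "CCONJ", "SCONJ"}

def idea_density(tagged_tokens):
    """Proportion of proposition-bearing words among all words."""
    if not tagged_tokens:
        return 0.0
    propositions = sum(1 for _, tag in tagged_tokens
                       if tag in PROPOSITION_TAGS)
    return propositions / len(tagged_tokens)

# Hand-tagged example: "The quick brown fox jumped swiftly over the lazy dog."
tagged = [
    ("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("jumped", "VERB"), ("swiftly", "ADV"), ("over", "ADP"),
    ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"),
]
print(idea_density(tagged))  # 0.6 -> 6 propositions out of 10 words
```

A higher ratio signals more assertions packed into the same number of words, which is the clause-level intuition behind the definition above.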

           Copyright held by the author(s).




B. Lexical Density

    Lexical density is the term most often used to describe the proportion of content words to the total number of words [13], [14]. The result is a percentage for each text in the corpus. Content words give a text its meaning and provide information about what the text is about. More precisely, content words are nouns, verbs, adjectives, and adverbs: nouns tell us the subject, adjectives tell us more about the subject, verbs tell us what they do, and adverbs tell us how they do it [13].

    Other kinds of words, such as articles (a, the), prepositions (on, at, in), conjunctions (and, or, but), and so forth, are more grammatical in nature and, by themselves, give little or no information about what a text is about [15]. These non-lexical words are also called function words. Auxiliary verbs, such as to be (am, are, is, was, were, being), do (did, does, doing), have (had, has, having), and so forth, are also considered non-lexical, as they do not provide additional meaning.

    It is worth first determining the lexical density of an ideal example:

  (1) The quick brown fox jumped swiftly over the lazy dog.

The lexical words in (1), i.e. the nouns, adjectives, verbs, and adverbs, are quick, brown, fox, jumped, swiftly, lazy, and dog. There are precisely 7 lexical words out of 10 total words; the lexical density of the sentence is therefore 70%.

    Another simple example:

  (2) She told him that she loved him.

Sentence (2) contains 2 lexical words (told, loved) out of 7 total words, for a lexical density of 28.57%.

    The meaning of the first sentence is quite clear: it is not difficult to imagine what happened when "the quick brown fox jumped swiftly over the lazy dog". On the other hand, it is not so easy to imagine what the second sentence means. Due to the use of vague personal pronouns (she and him), it has multiple interpretations and is therefore quite vague.

    Lexical density reflects these observations: sentence (1) has a rather high lexical density (70%), whereas sentence (2) has quite a low one (28.57%).

    Sentence (1) has a high lexical density because it explicitly names both the subject (fox) and the object (dog), gives us more information about each of them (the fox being quick and brown, the dog being lazy), and tells us how the subject performed the action of jumping (swiftly).

    Sentence (2) has such a low lexical density because it does none of the things the first sentence does: we do not know who the subject (she) and the object (him) really are; we do not know how she told him or how she loves him; we do not even know whether the first she and him refer to the same people as the second she and him. This sentence tells us almost nothing, and its low lexical density is an indicator of that, in contrast to the first sentence, which is packed with information and whose high lexical density reflects that.

    However, the information captured here lies only on the lexical level of the text. The lexical level is related to the syntactic one, i.e. how words behave within a text, how they are connected with each other in a sentence, etc. Finally, the vocabulary and the form of a text depend highly on its genre.

                        III. RESEARCH

    The research was conducted to investigate which features of text best characterize information-dense texts. Lexical density, discussed above, is one of them; however, the analysis was also performed on the basis of the form of the texts. Lexical density has the advantage of being easy to operationalise and practical to apply in computer analyses of large data corpora.

    The research sought to compare journal abstracts with their research papers in terms of their linguistic features and the specificity of the genre, and in this way to identify textual features of abstracts based on their similarities to and differences from research papers. The hypothesis was raised that abstracts are characterized by a higher information density than their research papers. The comparison was performed on the basis of two corpora, compiled from the research papers and their abstracts in the journal "Pragmatics"¹ from the period 2000-2017; they were collected specifically for the purposes of this research. Both corpora will be available in the CLARIN-LT Repository².

    The research consisted of qualitative and quantitative analysis: the contents and the form of the corpus of abstracts (containing 85 616 running words) and the corpus of research papers (containing 3 479 442 running words) were analysed with the corpus management and analysis tools Sketch Engine³ and WordSmith Tools version 6⁴. The focus of the research was on the abstracts, and the full-length papers were compared with their abstracts.

A. Qualitative Analysis

    The following are the components of the qualitative analysis:
    • keywords for each corpus (frequency lists normalized per 1000 text words);
    • terms for each corpus;
    • contents of each corpus by parts of speech: the proportion of content words and the proportion of function words.

    Both keyword and term lists revealed more similarities than differences between abstracts and research papers; therefore, further analysis of the most frequent terms from both corpora together was used to show the overall dynamics of the topics of the journal over time.

        ¹ https://benjamins.com/#catalog/journals/prag/main
        ² https://clarin.vdu.lt/xmlui
        ³ https://www.sketchengine.co.uk/
        ⁴ http://www.lexically.net/wordsmith/version6/
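The lexical density calculation behind examples (1) and (2) can be sketched in a few lines. This is only an illustration: the part-of-speech tags are supplied by hand (a corpus tool or POS tagger would provide them in practice), using Universal Dependencies style tag names as an assumption; content words follow the definition above (nouns, verbs, adjectives, adverbs).

```python
# Lexical density = content words / total words, expressed as a percentage.
# Content words are nouns, verbs, adjectives, and adverbs; everything else
# (determiners, pronouns, prepositions, auxiliaries, ...) is a function word.

CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}

def lexical_density(tagged_tokens):
    """Percentage of content words among all words."""
    if not tagged_tokens:
        return 0.0
    content = sum(1 for _, tag in tagged_tokens if tag in CONTENT_TAGS)
    return 100.0 * content / len(tagged_tokens)

# (1) The quick brown fox jumped swiftly over the lazy dog.
ex1 = [("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
       ("jumped", "VERB"), ("swiftly", "ADV"), ("over", "ADP"),
       ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN")]

# (2) She told him that she loved him.
ex2 = [("She", "PRON"), ("told", "VERB"), ("him", "PRON"),
       ("that", "SCONJ"), ("she", "PRON"), ("loved", "VERB"),
       ("him", "PRON")]

print(round(lexical_density(ex1), 2))  # 70.0  (7 content words of 10)
print(round(lexical_density(ex2), 2))  # 28.57 (2 content words of 7)
```

The same ratio scales directly from single sentences to whole corpora, which is what makes the measure practical for large-scale comparison.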




    The top 10 notional words from the keyword lists for each corpus (see Fig. 1) were the basis for the analyses of their context, i.e. grammatical constructions and lexical collocations.

    Fig. 1. The top keywords from each corpus (two panels: ABSTRACTS and RESEARCH PAPERS)

    In this way the formal features of research papers and their abstracts were observed: the contextual analyses revealed linguistic means of condensing information in the abstracts that were absent in their full-length counterparts. One of them is nominalisation (the use of nominal phrases instead of verbal phrases), which allows several sentences to be merged into one. Nominalisation, in turn, is associated with higher lexical density in abstracts than in their research papers (see Fig. 2), i.e. it decreases the number of function words and in this way increases lexical density in general.

    Fig. 2. The relation between formal and content features of texts (two panels: ABSTRACTS and RESEARCH PAPERS)

B. Quantitative Analysis

    The following are the components of the quantitative analysis:
    • overall statistics (see the summary in Table I);
    • lexical density: the proportion of content words to the total number of words (computed separately for the corpus of abstracts and the corpus of research papers).

                 TABLE I. OVERALL STATISTICS OF CORPORA

                                       Abstracts    Research papers
    Tokens (running words)                85 616          3 479 442
    Types (distinct words)                 8 295             71 038
    Type/token ratio (TTR)                  9.90               2.13
    Standardized TTR                       36.50              38.99
    Standardized TTR std. dev.             58.35              60.85
    Sentences                              3 305            116 735
    Average sentence length (words)        25.37              28.62

          IV. CONCLUDING REMARKS AND FUTURE WORK

    The qualitative analysis showed that the contents of abstracts and their research papers are similar.

    The quantitative analysis revealed that abstracts and their research papers are more different than similar in terms of formal features: the formal features of the two corpora manifested tangible differences. Thus the research suggests that lexical density depends strongly on the form of texts.

    Lexical density is a useful and applicable measurement for different text genres; however, the lexical level alone is not sufficient to measure the information density of texts, while lexical and syntactic features together appear to be particularly well suited for the task.

    With the above in mind, future work is to develop the methodology for measuring information density by analysing the syntactic level of texts, and later to combine lexical and syntactic features so as to implement the results in automated text analysis.

                            REFERENCES
[1]  R. Shams, "Identification of informativeness in text using natural language stylometry", Doctoral thesis, The University of Western Ontario, 2014.
[2]  C. R. Mills, "Information density in French and Dagara folktales: a corpus-based analysis of linguistic marking and cognitive processing", Doctoral thesis, Queen's University Belfast, 2014.
[3]  C. Brown, T. Snodgrass, S. J. Kemper, R. Herman, M. A. Covington, "Automatic measurement of propositional idea density from part-of-speech tagging", Behavior Research Methods 40(2), pp. 540-545, 2008.
[4]  W. Kintsch, J. Keenan, "Reading rate and retention as a function of the number of propositions in the base structure of sentences", Cognitive Psychology 5, pp. 257-274, 1973.
[5]  W. Kintsch, "Comprehension: A paradigm for cognition", Cambridge, UK: Cambridge University Press, 1998.
[6]  E. Thorson, R. Snyder, "Viewer recall of television commercials: Prediction from the propositional structure of commercial scripts", Journal of Marketing Research 21, pp. 127-136, 1984.




[7]  A. Y. Takao, W. A. Prothero, G. J. Kelly, "Applying argumentation analysis to assess the quality of university oceanography students' scientific writing", Journal of Geoscience Education 50, pp. 40-48, 2002.
[8]  S. Kemper, J. Marquis, M. Thompson, "Longitudinal change in language production: Effect of aging and dementia on grammatical complexity and propositional content", Psychology and Aging 16, pp. 600-614, 2001.
[9]  S. Kemper, A. Sumner, "The structure of verbal abilities in young and older adults", Psychology and Aging 16, pp. 312-322, 2001.
[10] D. A. Snowdon, S. J. Kemper, J. A. Mortimer, L. H. Greiner, D. R. Wekstein, W. R. Markesbery, "Linguistic ability in early life and cognitive function and Alzheimer's disease in late life: Findings from the Nun Study", JAMA 275, pp. 528-532, 1996.
[11] D. A. Snowdon, L. H. Greiner, S. J. Kemper, N. Nanayakkara, J. A. Mortimer, "Linguistic ability in early life and longevity: Findings from the Nun Study", Berlin, Germany: Springer-Verlag, 1999.
[12] D. A. Snowdon, L. H. Greiner, W. R. Markesbery, "Linguistic ability in early life and the neuropathology of Alzheimer's disease and cerebrovascular disease: Findings from the Nun Study", Annals of the New York Academy of Sciences 903, pp. 34-38, 2000.
[13] D. Didau, "Black space: improving writing by increasing lexical density", The Learning Spy: Brain Food for the Thinking Teacher, 2013.
[14] V. Johansson, "Lexical diversity and lexical density in speech and writing: a developmental perspective", Working Papers 53, pp. 61-79, 2008.
[15] J. Ure, "Lexical density and register differentiation", in G. E. Perren & J. L. M. Trim (eds.), Applications of Linguistics: Selected Papers of the Second International Congress of Applied Linguistics, Cambridge 1969, pp. 443-452. Cambridge: Cambridge University Press, 1971.



