<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detecting Information-Dense Texts: Towards an Automated Analysis</article-title>
      </title-group>
      <contrib-group>
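        <contrib contrib-type="author">
          <string-name>
            <given-names>Danguolė</given-names>
            <surname>Kalinauskaitė</surname>
          </string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Lithuanian Studies, Vytautas Magnus University</institution>
          ,
          <addr-line>Kaunas</addr-line>
          ,
          <country country="LT">Lithuania</country>
          ;
          <institution>Baltic Institute of Advanced Technology</institution>
          ,
          <addr-line>Vilnius</addr-line>
          ,
          <country country="LT">Lithuania</country>
        </aff>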
      </contrib-group>
      <fpage>95</fpage>
      <lpage>98</lpage>
      <abstract>
        <p>Determining information density has become a central issue in natural language processing. While information density is seen as too complex to measure globally, a study of lexical and syntactic features allows a comparison of information density between different texts or different text genres. This paper presents part of a methodology proposed for the automatic analysis of information density based on the lexical and syntactic levels of language.</p>
      </abstract>
      <kwd-group>
        <kwd>lexical density</kwd>
        <kwd>information language processing</kwd>
        <kwd>computational linguistics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        Determining the information density of a text is a major challenge
in natural language processing (NLP). Using more
fragments of text to train statistical NLP systems does not
necessarily lead to improved performance. Recent
developments in this field have produced a number of solutions
for evaluating information density. Nevertheless, a shortcoming of
most of these solutions is their dependence on the genre and
domain of the text. In addition, most of them do not perform
efficiently across different NLP problem areas [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        It is worth noting that the notion of information is only
formal here, i.e. information is defined as semantic and pragmatic,
and as measurable only in relative terms. Information density is
defined here as informativity (a
relative measure of semantic and pragmatic information) per
clause (following [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]). In terms of semantics, information
density is thus a measure of the extent to which the writer or speaker
is making assertions (or asking questions) rather than just
referring to entities [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Texts contain various elements, ranging
from characters to sentences, that are assumed to have
reasonable discriminating power in evaluating the information
density of natural language text. Examples of such features are
the use of simple words, complex words, function words,
content words, syllables, and so on.
      </p>
      <p>The paper starts with a theoretical background on the
measurement of information density, followed by a
presentation of the research, and ends with a conclusion and
future work plans.</p>
      <p>The goal of this paper is to present part of a methodology
proposed for the automatic analysis of information density based
on the lexical and syntactic levels of language.</p>
      <p>Copyright held by the author(s).</p>
    </sec>
    <sec id="sec-3">
      <title>II. THEORETICAL BACKGROUND</title>
      <sec id="sec-3-1">
        <title>A. Information Density in Computational Linguistics</title>
        <p>Information-dense texts report important factual
information in a direct, succinct manner. There have been various
attempts to determine and evaluate the information density of texts.
Earlier works did so manually. Later, various programs began to
appear, and the process can now be carried out automatically.
However, these programs differ in nature, in
their productivity, and in the principles by which the information
density of texts is determined. It is worth mentioning
that there is considerable confusion about what the
indicators of information-dense texts are, and there is therefore no
unified methodology for measuring information density.</p>
        <p>In computational linguistics, one of the most common
characteristics employed to detect information-dense texts is
lexical density (see B. Lexical Density below). In some works it
is even suggested as the main indication of how informative a
text is, and is used as a synonym for information density.
However, lexical density captures only one level of a text,
namely its vocabulary, and is not sufficient to account for
the informativeness of the whole text. It is therefore not
convincing to equate the two terms. It is more likely that one is
a part of the other, because the informativeness of a whole text
depends not only on its content but also on its structure.</p>
        <p>
          It is worth noting that research results in
this area have a very wide range of application. Numerous psychological
experiments have related information density to readability [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ],
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], memory, e.g., [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], quality of students’ writing, e.g., [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ],
aging [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], and prediction of Alzheimer’s disease [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ],
[
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>High information density signals that complex interrelationships
are being expressed. Low information density means relatively little
information per sentence; low information density in
speech or writing can therefore indicate mental disorders, including
Alzheimer’s disease.</p>
      </sec>
      <sec id="sec-3-2">
        <title>B. Lexical Density</title>
        <p>
          Lexical density is the term most often used for describing
the proportion of content words to the total number of words
[
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. The result is a percentage for each text in the corpus.
Content words give a text its meaning and provide information
regarding what the text is about. More precisely, content words
are simply nouns, verbs, adjectives, and adverbs. Nouns tell us
the subject, adjectives tell us more about the subject, verbs tell
us what they do, and adverbs tell us how they do it [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>
          Other kinds of words such as articles (a, the), prepositions
(on, at, in), conjunctions (and, or, but) and so forth, are more
grammatical in nature and, by themselves, give little or no
information about what a text is about [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. These non-lexical
words are also called function words. Auxiliary verbs, such as
to be (am, are, is, was, were, being), do (did, does, doing), have
(had, has, having) and so forth, are also considered non-lexical
as they do not provide additional meaning.
        </p>
        <p>It is useful first to determine the lexical density of an idealised
example:</p>
        <p>(1) The <bold>quick</bold> <bold>brown</bold> <bold>fox</bold> <bold>jumped</bold> <bold>swiftly</bold> over the <bold>lazy</bold> <bold>dog</bold>.</p>
        <p>The lexical words (nouns, adjectives, verbs, and adverbs) are
shown in bold. There are precisely 7 lexical words out of 10 total words. The
lexical density of the above sentence is therefore 70%.</p>
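        <p>The calculation for sentence (1) can be written as a general formula. The symbols <italic>LD</italic> (lexical density), <italic>N</italic><sub>lex</sub> (number of lexical words) and <italic>N</italic> (total number of words) are notation assumed here for illustration and are not taken from the cited sources:</p>
        <disp-formula><tex-math><![CDATA[
\mathrm{LD} = \frac{N_{\mathrm{lex}}}{N} \times 100\%,
\qquad
\mathrm{LD}_{(1)} = \frac{7}{10} \times 100\% = 70\%
]]></tex-math></disp-formula>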
      </sec>
    </sec>
    <sec id="sec-4">
      <p>Another simple example:</p>
      <p>(2) She told him that she loved him.</p>
      <p>The above sentence contains 2 lexical words out
of 7 total words, which gives a lexical density of 28.57%.</p>
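      <p>The figures for sentences (1) and (2) can be reproduced with a few lines of code. The following is a minimal sketch in Python, not the tooling used in this research. It assumes that the part-of-speech tags are assigned by hand, and the names <monospace>CONTENT_TAGS</monospace>, <monospace>lexical_density</monospace>, <monospace>SENTENCE_1</monospace> and <monospace>SENTENCE_2</monospace> are hypothetical.</p>
      <preformat><![CDATA[
```python
# Content-word classes as defined in the text: nouns, verbs, adjectives, adverbs.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def lexical_density(tagged_tokens):
    """Return the percentage of content words among all (word, tag) pairs."""
    if not tagged_tokens:
        raise ValueError("cannot compute lexical density of an empty text")
    content = sum(1 for _, tag in tagged_tokens if tag in CONTENT_TAGS)
    return 100.0 * content / len(tagged_tokens)


# Tags below are assigned by hand (an assumption of this sketch),
# not produced by a part-of-speech tagger.
SENTENCE_1 = [
    ("The", "DET"), ("quick", "ADJ"), ("brown", "ADJ"), ("fox", "NOUN"),
    ("jumped", "VERB"), ("swiftly", "ADV"), ("over", "ADP"),
    ("the", "DET"), ("lazy", "ADJ"), ("dog", "NOUN"),
]
SENTENCE_2 = [
    ("She", "PRON"), ("told", "VERB"), ("him", "PRON"), ("that", "SCONJ"),
    ("she", "PRON"), ("loved", "VERB"), ("him", "PRON"),
]

print(round(lexical_density(SENTENCE_1), 2))  # 70.0
print(round(lexical_density(SENTENCE_2), 2))  # 28.57
```
]]></preformat>
      <p>In a real analysis the tags would presumably come from an automatic part-of-speech tagger, and the result would then depend on the tagger's tag set and accuracy.</p>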
      <p>The meaning of the first sentence is quite clear. It is not
difficult to imagine what happened when “the quick brown fox
jumped swiftly over the lazy dog”. It is much harder
to imagine what the second sentence means: because of the
personal pronouns (she and him), whose referents are unknown, this sentence has
multiple interpretations and is, therefore, quite vague.</p>
      <p>Lexical density reflects the above observations.
Sentence (1) has a rather high lexical density (70%),
whereas sentence (2) has a lexical density that is quite
low (28.57%).</p>
      <p>Sentence (1) has a high lexical density because
it explicitly names both the subject (fox) and the object
(dog), gives us more information about each one (the fox being
quick and brown, and the dog being lazy), and tells us how the
subject performed the action of jumping (swiftly).</p>
      <p>Sentence (2) has such a low lexical
density because it does none of the things that the first
sentence does: we do not know who the subject (she) and the
object (him) really are; we do not know how she told him or
how she loved him; we do not even know whether the first she and him
refer to the same people as the second she and him. This sentence
tells us almost nothing, and its low lexical density is an
indicator of that, whereas the first sentence is packed
with information, and its high lexical density reflects
that.</p>
      <p>However, the information considered here lies only on the lexical level
of the text. The lexical level is related to the syntactic one, i.e. to how
words behave within a text, how they are connected with each
other in a sentence, etc. Finally, the vocabulary and the form of
a text depend heavily on its genre.</p>
    </sec>
    <sec id="sec-5">
      <title>III. RESEARCH</title>
      <p>The research was conducted to investigate which textual features
best characterize information-dense texts. Lexical
density, discussed above, is one of them; however, the
analysis also took the form of the texts into account.</p>
      <p>Lexical density has the advantage of being easy to
operationalise, and also practical to apply in computer analyses
of large data corpora.</p>
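      <p>To illustrate why the measure is easy to operationalise, the following Python sketch computes lexical density over a whole collection of texts without a part-of-speech tagger. It is an illustration under stated assumptions, not the procedure used in this research: the closed-class list <monospace>FUNCTION_WORDS</monospace> is a small hand-made sample, the tokenizer is a simple regular expression for English, and every token outside the list is counted as a content word.</p>
      <preformat><![CDATA[
```python
import re

# Small illustrative closed-class list (an assumption of this sketch);
# a real study would rely on a tagger or a full function-word lexicon.
FUNCTION_WORDS = frozenset("""
a an the on at in of to over and or but that
i you he she it we they me him her us them
am are is was were be been being do did does doing have had has having
""".split())

# Alphabetic tokens, optionally with one internal apostrophe (e.g. "don't").
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def tokenize(text):
    """Split a text into lower-cased word tokens."""
    return [match.group(0).lower() for match in TOKEN_RE.finditer(text)]


def corpus_lexical_density(texts):
    """Return content words as a percentage of all tokens in the texts."""
    total = 0
    content = 0
    for text in texts:
        for token in tokenize(text):
            total += 1
            if token not in FUNCTION_WORDS:
                content += 1
    if total == 0:
        raise ValueError("corpus contains no tokens")
    return 100.0 * content / total


sentence_1 = "The quick brown fox jumped swiftly over the lazy dog."
sentence_2 = "She told him that she loved him."
print(round(corpus_lexical_density([sentence_1]), 2))  # 70.0
print(round(corpus_lexical_density([sentence_2]), 2))  # 28.57
```
]]></preformat>
      <p>Because the function accepts any iterable of texts, the same call could be applied separately to a corpus of abstracts and a corpus of research papers. The result is only as good as the word list: contracted forms and function words missing from the list are counted as content words.</p>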
      <p>The research sought to compare journal abstracts and their
research papers in terms of their linguistic
features and genre specificity, and in this way to identify
the textual features of abstracts on the basis of their similarities to and
differences from research papers. The
hypothesis was that abstracts are characterized by a higher
information density than their research papers. The comparison
was performed on the basis of two corpora, compiled from the
research papers and their abstracts published in the journal
“Pragmatics”<sup>1</sup> in the period 2000-2017. The corpora were
collected specifically for the purposes of this research. Both
corpora will be available in the CLARIN-LT Repository<sup>2</sup>.</p>
      <p>The research consisted of qualitative and quantitative
analysis. The contents and the form of the
corpus of abstracts (containing 85 616 running words) and the
corpus of research papers (containing 3 479 442 running
words) were analysed with the help of the corpus management and
analysis tool Sketch Engine<sup>3</sup> and WordSmith Tools version
6<sup>4</sup>. The focus of the research was on the abstracts, and the full-length
papers were compared with their abstracts.</p>
      <sec id="sec-5-1">
        <title>A. Qualitative Analysis</title>
        <p>The following are the components of the qualitative analysis:</p>
        <list list-type="bullet">
          <list-item><p>keywords for each corpus (frequency lists normalized per 1000 text words);</p></list-item>
          <list-item><p>terms for each corpus;</p></list-item>
          <list-item><p>contents of each corpus by parts of speech: the proportion of content words; the proportion of function words.</p></list-item>
        </list>
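        <p>Normalizing frequencies per 1000 words makes counts from corpora of very different sizes comparable. The sketch below shows only this normalization step in Python. It is not the keyword extraction performed by the corpus tools used in this research, the function names <monospace>frequencies_per_thousand</monospace> and <monospace>top_words</monospace> are hypothetical, and the token list in the usage lines is invented for illustration.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def frequencies_per_thousand(tokens):
    """Map each word to its frequency per 1000 running words."""
    total = len(tokens)
    if total == 0:
        return {}
    counts = Counter(tokens)
    return {word: 1000.0 * n / total for word, n in counts.items()}


def top_words(tokens, k=10):
    """Return the k most frequent words with their normalized frequencies.

    Ties are broken alphabetically so that the output is deterministic.
    """
    freqs = frequencies_per_thousand(tokens)
    ranked = sorted(freqs.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]


# Invented toy token list, used only to show the call.
toy_tokens = ["the"] * 6 + ["speech"] * 3 + ["act"]
print(top_words(toy_tokens, k=2))  # [('the', 600.0), ('speech', 300.0)]
```
]]></preformat>
        <p>A keyword list in the corpus-linguistic sense would additionally compare these normalized frequencies with those of a reference corpus, a step that is omitted here.</p>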
        <p>Both keyword and term lists revealed more similarities than
differences between abstracts and research papers. Therefore,
further analysis of the most frequent terms from both corpora
together was used to show the overall dynamics of the topics of the
journal over time.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <p><sup>1</sup> https://benjamins.com/#catalog/journals/prag/main.</p>
      <p><sup>2</sup> https://clarin.vdu.lt/xmlui.</p>
      <p><sup>3</sup> https://www.sketchengine.co.uk/.</p>
      <p><sup>4</sup> http://www.lexically.net/wordsmith/version6/.</p>
      <p>The top 10 notional words from the keyword lists for each
corpus (see Figure 1) were the basis for the analyses of their
context, i.e. grammatical constructions and lexical collocations.</p>
      <p>In this way the formal features of research papers and their
abstracts were observed: such contextual analyses revealed
linguistic means of condensing information in the abstracts that
were absent from their full-length counterparts. One of them is
nominalisation (the use of nominal phrases instead of verbal
phrases), which allows several sentences to be merged into one.
Nominalisation, in turn, is associated with higher lexical
density in abstracts than in their research papers (see Figure 2),
i.e. it decreases the number of function words and in this way
increases lexical density in general.</p>
      <sec id="sec-6-1">
        <title>B. Quantitative Analysis</title>
        <p>The following are the components of the quantitative analysis:</p>
        <list list-type="bullet">
          <list-item><p>overall statistics (see Table 1 summary);</p></list-item>
          <list-item><p>lexical density: the proportion of content words to the total number of words (for the corpus of abstracts and the corpus of research papers separately).</p></list-item>
        </list>
      </sec>
      <sec id="sec-7">
        <title>IV. CONCLUDING REMARKS AND FUTURE WORK</title>
        <p>The qualitative analysis showed that the contents of
abstracts and their research papers are similar.</p>
        <p>The quantitative analysis revealed that abstracts and their
research papers are more different than similar in terms of
formal features: the formal features of the two corpora
differ tangibly. The research thus suggests
that lexical density depends strongly on the form of texts.</p>
        <p>Lexical density is a useful and applicable measure for
different text genres. However, the lexical level alone is not
sufficient to measure the information density of texts, whereas lexical
and syntactic features together appear to be particularly well
suited to the task.</p>
        <p>With the above in mind, future work will develop the
methodology for measuring information density by analysing
the syntactic level of texts and, later, by combining lexical and
syntactic features so that the results can be implemented in
automated text analysis.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Shams</surname>
          </string-name>
          , “
          <article-title>Identification of informativeness in text using natural language stylometry”</article-title>
          ,
          <source>Doctoral thesis</source>
          , The University of Western Ontario,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C. R.</given-names>
            <surname>Mills</surname>
          </string-name>
          , “
          <article-title>Information density in French and Dagara folktales: a corpus-based analysis of linguistic marking and cognitive processing”</article-title>
          ,
          <source>Doctoral thesis</source>
          , Queen's University Belfast,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Brown</surname>
          </string-name>
          , T. Snodgrass,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Kemper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Herman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Covington</surname>
          </string-name>
          , “
          <article-title>Automatic measurement of propositional idea density from part-of-speech tagging”</article-title>
          ,
          <source>Behavior Research Methods</source>
          <volume>40</volume>
          (
          <issue>2</issue>
          ), pp.
          <fpage>540</fpage>
          -
          <lpage>545</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kintsch</surname>
          </string-name>
          , J. Keenan, “
          <article-title>Reading rate and retention as a function of the number of propositions in the base structure of sentences”</article-title>
          ,
          <source>Cognitive Psychology 5</source>
          , pp.
          <fpage>257</fpage>
          -
          <lpage>274</lpage>
          ,
          <year>1973</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kintsch</surname>
          </string-name>
          , “
          <article-title>Comprehension: A paradigm for cognition”</article-title>
          , Cambridge, UK: Cambridge University Press,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Thorson</surname>
          </string-name>
          , R. Snyder, “
          <article-title>Viewer recall of television commercials: Prediction from the propositional structure of commercial scripts”</article-title>
          ,
          <source>Journal of Marketing Research 21</source>
          , pp.
          <fpage>127</fpage>
          -
          <lpage>136</lpage>
          ,
          <year>1984</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Takao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Prothero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Kelly</surname>
          </string-name>
          , “
          <article-title>Applying argumentation analysis to assess the quality of university oceanography students' scientific writing”</article-title>
          ,
          <source>Journal of Geoscience Education 50</source>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>48</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kemper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Marquis</surname>
          </string-name>
          , M. Thompson, “
          <article-title>Longitudinal change in language production: Effect of aging and dementia on grammatical complexity and propositional content”</article-title>
          ,
          <source>Psychology and Aging 16</source>
          , pp.
          <fpage>600</fpage>
          -
          <lpage>614</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kemper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sumner</surname>
          </string-name>
          , “
          <article-title>The structure of verbal abilities in young and older adults”</article-title>
          ,
          <source>Psychology and Aging 16</source>
          , pp.
          <fpage>312</fpage>
          -
          <lpage>322</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Snowdon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Kemper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Mortimer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Greiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. R.</given-names>
            <surname>Wekstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. R.</given-names>
            <surname>Markesbery</surname>
          </string-name>
          , “
          <article-title>Linguistic ability in early life and cognitive function and Alzheimer's disease in late life: Findings from the Nun Study”</article-title>
          ,
          <source>JAMA 275</source>
          , pp.
          <fpage>528</fpage>
          -
          <lpage>532</lpage>
          ,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Snowdon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Greiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Kemper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nanayakkara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Mortimer</surname>
          </string-name>
          , “
          <article-title>Linguistic ability in early life and longevity: Findings from the Nun Study”</article-title>
          , Berlin, Germany: Springer-Verlag,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Snowdon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. H.</given-names>
            <surname>Greiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. R.</given-names>
            <surname>Markesbery</surname>
          </string-name>
          , “
          <article-title>Linguistic ability in early life and the neuropathology of Alzheimer's disease and cerebrovascular disease: Findings from the Nun Study”</article-title>
          ,
          <source>Annals of the New York Academy of Sciences 903</source>
          , pp.
          <fpage>34</fpage>
          -
          <lpage>38</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Didau</surname>
          </string-name>
          , “
          <article-title>Black space: improving writing by increasing lexical density”</article-title>
          ,
          <source>The Learning Spy: Brain Food for the Thinking Teacher</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>V.</given-names>
            <surname>Johansson</surname>
          </string-name>
          , “
          <article-title>Lexical diversity and lexical density in speech and writing: a developmental perspective”</article-title>
          ,
          <source>Working Papers 53</source>
          , pp.
          <fpage>61</fpage>
          -
          <lpage>79</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ure</surname>
          </string-name>
          , “
          <article-title>Lexical density and register differentiation”</article-title>
          . In G. E. Perren &amp; J. L. M. Trim (eds.).
          <source>Applications of linguistics. Selected papers of the Second International Congress of Applied Linguistics</source>
          , Cambridge 1969, pp.
          <fpage>443</fpage>
          -
          <lpage>452</lpage>
          . Cambridge: Cambridge University Press,
          <year>1971</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>