<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Genre distinctions and discourse modes: Text types differ in their situation type distributions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexis Palmer</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Annemarie Friedrich</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computational Linguistics, Saarland University</institution>
          ,
          <addr-line>Saarbrücken</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we explore the relationship between the genre of a text and the types of situations introduced by the clauses of the text, working from the perspective of the theory of discourse modes (Smith, 2003). The typology of situation types distinguishes between, for example, events, states, generic statements, and speech acts. We analyze texts of different genres from two English text corpora, the Penn Discourse TreeBank (PDTB) and the Manually Annotated SubCorpus (MASC) of the Open American National Corpus. Texts of different types - genres in the PDTB and subcorpora in MASC - are segmented into clauses, and each clause is labeled with the type of situation it introduces to the discourse. We then compare the distribution of situation types across different text types, finding systematic differences across genres. Our findings support predictions of the discourse modes theory and offer new insights into the relationship between text types and situation type distributions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Language is not a unitary phenomenon, and
patterns of language use change according to the type
of text under investigation. In natural language
processing, furthermore, it has been shown that
there are strong effects from both the domain and
the genre of texts on the performance of systems
performing automatic analysis. These effects are
relevant at nearly all levels of analysis, from
part-of-speech tagging to discourse parsing, yet they
are in some ways poorly understood. For
example, there is no single agreed-upon set of text types
that suits all levels of analysis, nor are we aware of
systematic guidelines for sorting texts into genre
categories; this process often relies on human
intuition and the claim that “I know [a document of
type X] when I see one.”</p>
      <p>Rather than conceptualizing text type purely as
a document-level characteristic, in this study we
take inspiration from a theory which targets text
passages as an intermediate level of
representation. The idea is that most texts are in fact a mix
of passages of different types. For example, a
news story may begin with a short narrative
passage which focuses on one individual’s reaction
to the newsworthy event and then proceed with a
more informative discussion of the topic at hand.
Smith (2003) identifies five different types of text
passages, or discourse modes, each of which is
associated with certain linguistic characteristics of
the text passage. (See Sec. 2 for more on the
modes and the linguistic characteristics.) This
study investigates how closely the predicted
linguistic characteristics of certain text types are
reflected in a body of naturally occurring texts.</p>
      <p>We focus on genre differences at the level of
the clause, considering the types of situations
introduced to the discourse by clauses of text.
According to Smith, the situation (or situation
entity) types presented in a text are an important
characteristic for distinguishing between the
different types of text passages. Using two sets of
documents (see Sec. 3) with genre labels, we
investigate the distributions of situation types (see
Sec. 2.1 for the inventory of situation types) for
the different text types. We find systematic
differences between news/jokes texts on the one hand
and essay/persuasive texts on the other, as the
theory predicts. In the final section of the paper, we
briefly discuss potential applications of these
findings to argumentation mining.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Discourse modes: a theory of text passages and their types</title>
      <p>Smith (2003) proposes to analyze discourse at the
level of the text passage, viewing each
individual text as a mixture of text passages. These
passages are contiguous regions of text,
generally one or more paragraphs, with particular
discourse functions. Each passage belongs to
one of five discourse modes: NARRATIVE,
REPORT, DESCRIPTION, INFORMATION, and
ARGUMENT/COMMENTARY. Importantly, the modes
can be characterized according to two broad
classes of linguistic correlates: the mode of
progression through the text passage (roughly
temporal or atemporal), and the distribution of situation
entity types. The modes and their correlates
appear in Table 1.</p>
      <sec id="sec-2-1">
        <title>2.1 Situation entities</title>
        <p>In this work we are directly concerned with the
second type of linguistic correlate: the situation
entities. A situation entity (SE) can be thought of
as the abstract object introduced to the discourse
by a clause of text. The type of the SE introduced
by a clause depends on, among other things, the
internal temporal properties of the verb and its
arguments. The interpretation of the verb
constellation may of course be influenced by adverbials
and other linguistic factors. We are primarily
interested in finite clauses, for the most part
assuming that each clause introduces one SE.1</p>
        <p>1 For a more detailed discussion of situation entities,
please see Friedrich and Palmer (2014b). For even more
information, see our project page (http://sitent.coli.uni-saarland.de)
and the references cited there,
including a detailed annotation manual.</p>
        <p>The SE types fall into four broad categories.
Eventualities describe particular situations such
as Events (1) or States (2).</p>
        <p>(1) The tour guide pointed to the mosaic. (EVENT)</p>
        <p>(2) The view from the castle is spectacular. (STATE)</p>
        <p>The class of General Statives includes
Generalizing Sentences (3), which report regularities,
and Generic Sentences (4), which make statements
about kinds or classes.</p>
        <p>(3) Silke often feeds my cats. (GENERALIZING SENTENCE)</p>
        <p>(4) The male cardinal has a black beak. (GENERIC SENTENCE)</p>
        <p>The third class of SE types comprises Abstract
Entities, which differ from the other SE types in how
they relate to the world: Eventualities and
General Statives are located spatially and temporally
in the world, but Abstract Entities are not. Facts
(5) are objects of knowledge, and Propositions (6)
are objects of belief. In the following examples,
the underlined clauses introduce Abstract Entities
to the discourse.</p>
        <p>(5) I know <underline>that his plane arrived at 11:00</underline>. (FACT)</p>
        <p>(6) I believe <underline>that his plane arrived at 11:00</underline>. (PROPOSITION)</p>
        <p>Finally, we introduce the category Speech Acts
for clauses whose main function is performative:
namely, Questions (7) and Imperatives (8).</p>
        <p>The broad aim of this study is to compare
the predictions of the theory to evidence from
text corpora, in particular with respect to the
distributions of SEs across different text types.
We focus on two modes: REPORT and
ARGUMENT/COMMENTARY. For the REPORT mode,
the expectation is that text passages should be
made up primarily of Eventualities (Events and
States) with some General Statives. The most
frequent SE types in the ARG/COMM mode, on the
other hand, should be primarily Abstract Entities
(Facts and Propositions) and General Statives.</p>
        <p>
          To date there is no large body of data annotated
with discourse modes. Therefore, we instead look
directly at the distributions of SEs within text
passages for which we have annotated data
          <xref ref-type="bibr" rid="ref1 ref10 ref2 ref4">(Friedrich
and Palmer, 2014b)</xref>
          , taking the genre category
assigned within our text corpora as a proxy for
discourse mode. We do this under the assumption
that some genres are associated with a certain
predominant discourse mode. From that assumption,
we consider the average SE distributions per text
type to reflect the distributions expected from the
predominant mode. Specifically, we map texts
from the genres news and jokes to the REPORT
mode, and essays and fundraising letters to the
ARG/COMM mode.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Data for corpus study</title>
      <p>We test the predictions of the theory on sets of
texts extracted from two different corpora,
described below. These corpora were chosen in large
part because they both group their texts according
to genre. Although the two corpora use different
sets of genre labels, both cover the two broad
categories we are interested in. Annotation and
analysis of the two data sets are described in Sec. 4.</p>
      <sec id="sec-3-1">
        <title>3.1 Penn Discourse TreeBank</title>
        <p>
          The Penn Discourse TreeBank (PDTB)
          <xref ref-type="bibr" rid="ref6">(Prasad et
al., 2008)</xref>
          provides annotations of discourse
structure over a collection of texts from the Wall Street
Journal; these texts are from the Penn TreeBank
          <xref ref-type="bibr" rid="ref5">(Marcus et al., 1993)</xref>
          , one of the most
widely used annotated corpora in natural language
processing. In addition to discourse structure
annotations, PDTB texts are hand-labeled with
part-of-speech tags, syntactic structure, and, as of
relatively recently, genre designations. Webber (2009)
found that the texts in PDTB belong to a number of
different categories and, further, that the discourse
relations marked in the texts pattern according to
the genre of the text. In fact, Webber (2009)
inspired the current study, raising the question of
whether the SE type distributions found in texts
similarly reflect the genre of the text.
        </p>
        <p>
          The PDTB texts are predominantly from the
news genre (roughly 1900 texts), with much
smaller numbers of texts from four other
genres: essays (roughly 170 texts), letters (roughly
60 texts), highlights (roughly 40 texts), and errata
(25 texts). From these, we extract 20 news texts
and 20 essay texts to be used in our study.
        </p>
        <p>
          The second corpus used in this study is MASC
          <xref ref-type="bibr" rid="ref3">(Ide et al., 2008)</xref>
          , the Manually Annotated
SubCorpus of the Open American National
Corpus.2 Overall, MASC contains roughly 500,000
words of text (both written text and transcribed
speech), balanced over 19 text types. In addition
to manually-checked annotations of sentence and
word boundaries, part-of-speech tags, named
entities, and both shallow and deeper syntactic
structure, some portions of MASC have been annotated
for a number of semantic and pragmatic
phenomena. For this study, though, we use only the genre
labels and our own SE annotations (see Sec. 4).
        </p>
        <p>For our study, we extract texts from the
written part of MASC. We use the texts from four of
the genres: news, jokes, essays, and letters. The
letters fall into two sub-categories
(philanthropic-fundraising and solicitation-brochures), though all
of the letters have the same general goal of
soliciting donations, whether of money, time, or goods.</p>
        <p>2 http://www.anc.org/data/masc</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Corpus study</title>
      <p>In this section we describe the segmentation and
annotation of the data, the situation type
inventories reflected in the analysis, and the methodology
used for computing results. We then present and
discuss our findings.3</p>
      <sec id="sec-4-1">
        <title>4.1 Segmentation and annotation</title>
        <p>
          Having selected texts for analysis, we next
segmented them into clauses, again following the
assumption of one SE per clause (with a few
exceptional cases). The PDTB texts were segmented
manually by the annotator, and the MASC texts
using SPADE
          <xref ref-type="bibr" rid="ref9">(Soricut and Marcu, 2003)</xref>
          with
some heuristic post-processing. Each clause was
then manually labeled with its SE type.
        </p>
        <p>The PDTB annotations were performed by one
paid annotator with extensive background in
linguistics, with ample training time but only a
minimal annotation manual.</p>
        <p>
          The MASC annotations are part of a large
ongoing annotation project with multiple paid
annotators, an extensive manual, and a structured
training phase. In the latter, we take a feature-driven
approach to annotation which improves the
quality of the annotations, leading to substantial
inter-annotator agreement (see Table 3). In addition to
the SE type label, annotators mark each clause
with three relevant linguistic features, which are
not used in the current study, but which guide
the annotators to find the best-fitting SE type
label. These are the inherent lexical aspect of the verb
          <xref ref-type="bibr" rid="ref1 ref10 ref2 ref4">(Friedrich and Palmer, 2014a)</xref>
          , genericity of the
main referent, and habituality of the event
described. Details regarding the annotation scheme
and the benefits of feature-driven annotation
appear in Friedrich and Palmer (2014b).
        </p>
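        <p>Agreement scores of the kind reported in Table 3 can be computed along the following lines. This is a minimal sketch that assumes Cohen's kappa over two annotators who each assign exactly one SE type per clause; the variant of kappa is our assumption, and the function name cohen_kappa and the toy labels are hypothetical.</p>
        <preformat>
```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same clauses.

    Assumption: exactly one SE type label per clause and annotator.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label sequences")
    n = len(labels_a)
    # Observed agreement: share of clauses with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels, invented for illustration only.
annotator_1 = ["EVENT", "STATE", "EVENT", "GENERIC", "STATE", "EVENT"]
annotator_2 = ["EVENT", "STATE", "STATE", "GENERIC", "STATE", "EVENT"]
kappa = cohen_kappa(annotator_1, annotator_2)
```
        </preformat>
        <p>Because annotators may in practice mark a segment with more than one situation type, a multi-label agreement measure might be more appropriate; the sketch covers only the single-label case.</p>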
      </sec>
      <sec id="sec-4-2">
        <title>4.2 SE inventories</title>
        <p>Each of the two analyses uses a slightly different
set of SE types. The main difference between the
two is that for the PDTB data annotations were
done mostly at a coarse-grained level, and the
MASC annotations are more fine-grained.</p>
        <p>The PDTB analysis remains close to the
inventory of SE types presented in Sec. 2.1, with the
modification that three of the four coarse-grained
categories (i.e. General Statives, Abstract Entities,
and Speech Acts) are treated as SE types. In other
words, for each of these categories, we conflate
its subtypes into a single higher-level type. States
and Events are treated as separate categories. The
coarse-grained analysis still captures the relevant
distinctions yet allows us to make useful
generalizations over the relatively small amount of data.</p>
        <p>3 Results from the PDTB portion of the analysis were first
presented at the 2009 Texas Linguistics Society conference
in Austin, Texas.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Kappa</title>
        <table-wrap>
          <label>Table 3</label>
          <table>
            <thead>
              <tr><th>genre</th><th>Kappa</th></tr>
            </thead>
            <tbody>
              <tr><td>news</td><td>0.667</td></tr>
              <tr><td>jokes</td><td>0.756</td></tr>
              <tr><td>essays</td><td>0.493</td></tr>
              <tr><td>letters</td><td>0.612</td></tr>
            </tbody>
          </table>
        </table-wrap>
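        <p>The conflation of subtypes described above for the PDTB analysis can be sketched as a simple lookup. The label spellings, the dictionary COARSE_TYPE, and the helper to_coarse are hypothetical names chosen for illustration; only the grouping itself follows the description in the text.</p>
        <preformat>
```python
# Coarse-grained categories as described in the text: subtypes of General
# Statives, Abstract Entities and Speech Acts are conflated, while State and
# Event stay separate. The label spellings are assumptions for illustration.
COARSE_TYPE = {
    "STATE": "STATE",
    "EVENT": "EVENT",
    "GENERALIZING SENTENCE": "GENERAL STATIVE",
    "GENERIC SENTENCE": "GENERAL STATIVE",
    "FACT": "ABSTRACT ENTITY",
    "PROPOSITION": "ABSTRACT ENTITY",
    "QUESTION": "SPEECH ACT",
    "IMPERATIVE": "SPEECH ACT",
}


def to_coarse(fine_labels):
    """Map fine-grained SE labels to the five coarse-grained types."""
    return [COARSE_TYPE[label] for label in fine_labels]


coarse = to_coarse(["EVENT", "GENERIC SENTENCE", "FACT", "IMPERATIVE"])
```
        </preformat>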
        <p>For MASC, we return to a fine-grained
analysis. General Statives and Speech Acts are counted
at the fine-grained level, and Abstract Entities do
not appear in the analysis at all. We add the
REPORT type of situation entity, which is a subtype
of EVENTS, designed to capture cases like (9).</p>
        <p>(9) ..., said the President of the Squash Association. (REPORT)</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.3 Method</title>
        <p>For both data sets, we compute the distributions
of SE types per genre. For each genre, we collect
the counts of situation entity types assigned and
then compute the corresponding percentages. For
the PDTB data (Figure 2), this is a straightforward
analysis, as there was only one annotator.</p>
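        <p>The per-genre computation just described (collect counts, then convert to percentages) can be sketched as follows. The function name se_distribution and the input format of (genre, SE type) pairs are assumptions made for this illustration, and the toy data are invented rather than drawn from either corpus.</p>
        <preformat>
```python
from collections import Counter, defaultdict


def se_distribution(labeled_clauses):
    """Return {genre: {se_type: percentage}} from (genre, se_type) pairs."""
    counts = defaultdict(Counter)
    for genre, se_type in labeled_clauses:
        counts[genre][se_type] += 1
    distribution = {}
    for genre, type_counts in counts.items():
        total = sum(type_counts.values())
        distribution[genre] = {
            se_type: 100.0 * count / total
            for se_type, count in type_counts.items()
        }
    return distribution


# Toy data, invented for illustration; not taken from either corpus.
clauses = [
    ("news", "EVENT"), ("news", "EVENT"), ("news", "STATE"),
    ("news", "GENERAL STATIVE"),
    ("essays", "GENERAL STATIVE"), ("essays", "GENERAL STATIVE"),
    ("essays", "STATE"), ("essays", "EVENT"),
]
percentages = se_distribution(clauses)
```
        </preformat>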
        <p>For MASC (Figure 1), we use the annotations of
two annotators to compute the distributions.
Annotators are allowed to mark a segment with
multiple situation types; we simply use all markings of
types to compute the percentages. When
annotators disagree, we do not adjudicate but rather count
both annotations; when they agree, we count
two instances of the agreed-upon label. Hence, the
statistics presented in Figure 1 represent an average
over the two annotators’ assignments. The
distributions shown in Figure 1 all differ significantly
(p &lt; 0.01) from each other according to a
χ<sup>2</sup> test: text types differ in their situation type
distributions.</p>
        <p>[Figure 1: distribution of SE types per genre in MASC (news, jokes, letters, essays).]</p>
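        <p>The pooling of two annotators' markings and a comparison of two genres can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the study: we assume a pairwise Pearson chi-square test on pooled counts, one set of SE types per segment and annotator, and Counter objects as input; the names pooled_counts and chi_square_statistic and all toy data are hypothetical.</p>
        <preformat>
```python
from collections import Counter


def pooled_counts(annotations_a, annotations_b):
    """Pool two annotators' labels; each argument holds one set of SE types
    per segment.

    Every marking is counted, so an agreed-upon label contributes two
    instances and each side of a disagreement contributes one.
    """
    counts = Counter()
    for labels_a, labels_b in zip(annotations_a, annotations_b):
        counts.update(labels_a)
        counts.update(labels_b)
    return counts


def chi_square_statistic(counts_x, counts_y):
    """Pearson chi-square statistic and degrees of freedom for a 2 x k table.

    Both arguments are expected to be non-empty Counter objects.
    """
    types = sorted(set(counts_x) | set(counts_y))
    total_x = sum(counts_x.values())
    total_y = sum(counts_y.values())
    grand_total = total_x + total_y
    statistic = 0.0
    for se_type in types:
        column_total = counts_x[se_type] + counts_y[se_type]
        for observed, row_total in ((counts_x[se_type], total_x),
                                    (counts_y[se_type], total_y)):
            expected = row_total * column_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(types) - 1


# Toy annotations, invented for illustration only.
news_a = [{"EVENT"}, {"EVENT"}, {"STATE"}, {"EVENT"}]
news_b = [{"EVENT"}, {"STATE"}, {"STATE"}, {"EVENT"}]
essays_a = [{"STATE"}, {"GENERIC SENTENCE"}, {"STATE"}, {"GENERIC SENTENCE"}]
essays_b = [{"STATE"}, {"GENERIC SENTENCE"}, {"EVENT"}, {"GENERIC SENTENCE"}]

news = pooled_counts(news_a, news_b)
essays = pooled_counts(essays_a, essays_b)
statistic, dof = chi_square_statistic(news, essays)
```
        </preformat>
        <p>The resulting statistic would then be compared against the chi-square distribution with the returned degrees of freedom to obtain a p-value; that step is omitted here to keep the sketch free of external libraries.</p>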
      </sec>
      <sec id="sec-4-5">
        <title>4.4 Findings</title>
        <p>The broad finding is that General Statives play a
predominant role for texts associated with the
ARGUMENT/COMMENTARY mode, and Events and
States for texts associated with the REPORT mode.
With these results, we begin to replace the vague
distributional statements in Table 1 with more
precise characterizations of SE type distributions.</p>
        <p>We first compare the two genres shared across
both data sets: news and essays. For both data
sets, we see that the proportion of Eventualities is
highest for the news genre, and that within
Eventualities, Events are more frequent than States.4
This supports the theoretical claim that passages
in REPORT mode predominantly consist of Events
and States. Smith (2005) also predicts a significant
number of General Statives for REPORT passages;
in our study we observe these types in the news
texts, but less frequently than Eventualities.5</p>
        <p>We see more General Statives in essays than in
news. The predominance of General Statives is
not surprising, given that arguments are frequently
built from generalizations and statements about
classes or kinds. An interesting result that is not
predicted by the theory is that in essays, States are
much more frequent than Events. Together with
the higher prevalence of General Statives, this
suggests that essays rely heavily on describing and
discussing states of affairs rather than particular
actions or events.</p>
        <p>4 For MASC this second result comes from conflating the
categories of Event and Report.</p>
        <p>5 It would be interesting to compare this distribution to
texts from another mode (e.g. NARRATIVE) for which Smith
(2005) does not predict many General Statives in order to
determine the relative importance of General Statives in the
REPORT mode.</p>
        <p>[Figure 2: distribution of SE types per genre in the PDTB data (news, essays); vertical axis in percent, 0 to 50.]</p>
        <p>Now we turn to the two additional genres in
MASC: jokes and letters. First it should be noted
that it is not clear whether a distinction should be
made between (persuasive) essays and the
persuasive letters that appear in MASC. Second, we can
see that the predominance of State-type SEs is
even stronger for letters than it is for essays. In
addition, we see that letters use more generalizing
statements and fewer generics, and a rather high
proportion of Imperatives. The expected
distribution of Imperatives is not explicitly treated by the
theory, but one can easily imagine the sorts of
Imperative statements that would appear in
fundraising and solicitation letters: e.g. “Send a check
now! Don’t delay! Save the whales!”</p>
        <p>Jokes are interesting in that they pattern quite
similarly to news texts, but with a higher
proportion of Speech Act types. The latter can be
attributed to the fact that jokes contain more direct
and reported speech than news.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Discussion and conclusion</title>
      <p>The corpus study described above investigates,
across two different datasets of written English
text, the relationship between situation entities and
text type on the basis of the available data. In
both cases, and taking genre as a proxy for
discourse mode, we find support for Smith’s
theoretical prediction that different types of text show
different characteristic distributions of the types of
SEs introduced by the clauses of the text. We
find this specifically for two broad text types:
news/jokes (mapped to the REPORT mode of
discourse) and essays/persuasive texts (mapped to the
ARGUMENT/COMMENTARY mode of discourse).
The current study analyzes SE distributions over
collections of texts; a logical next step is to do
this analysis in a more fine-grained fashion,
associating SE distributions with text passages
labeled with discourse modes. This would remove
the need for the genre-as-proxy assumption and
move us even further toward a clearer
understanding of how discourse modes and situation entity
types pattern together.</p>
      <p>In future work, we plan to create automatic
methods to label clauses with their SE type, which
could then be used to automatically identify the
types of text passages present in documents.</p>
      <sec id="sec-5-1">
        <title>5.1 Relevance for argumentation mining</title>
        <p>Some current research in argumentation mining
investigates the question of whether performance
for automatically extracting argument components
from text improves when a system can first
narrow down the search space to the argumentative
regions of the document. (For example, see Stab
and Gurevych (2014) and Levy et al. (2014).) Our
finding that essays and persuasive texts show a
different distribution of SE types than news texts
suggests one way to approach the challenge of finding
the argumentative portions of texts.</p>
        <p>So far work in argumentation mining has
focused predominantly on finding arguments in
argumentative texts: opinion pieces, argumentative
essays, editorials, and the like. This is to some
extent a limiting assumption, as texts from a wide
range of genres can in fact contain
argumentative passages. A method for finding argumentative
passages could extend the range of texts available
for argumentation mining.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>For the PDTB case study, we gratefully
acknowledge Caroline Sporleder for much interesting and
insightful discussion, as well as Todd Shore both
for his annotation work and for discussions
arising from that work. This study has also benefitted
from discussions with Bonnie Webber and
Manfred Pinkal. Finally, huge thanks to the
participants of the Bertinoro symposium on the
intersection of Argumentation Theory and Natural
Language Processing for a highly engaging and
intellectually stimulating week. This research was
supported in part by the MMCI Cluster of Excellence,
and the second author is supported by an IBM PhD
Fellowship.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Annemarie</given-names>
            <surname>Friedrich</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Palmer</surname>
          </string-name>
          .
          <year>2014a</year>
          .
          <article-title>Automatic prediction of aspectual class of verbs in context</article-title>
          .
          <source>In Proceedings of ACL</source>
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Annemarie</given-names>
            <surname>Friedrich</surname>
          </string-name>
          and
          <string-name>
            <given-names>Alexis</given-names>
            <surname>Palmer</surname>
          </string-name>
          .
          <year>2014b</year>
          .
          <article-title>Situation entity annotation</article-title>
          .
          <source>In Proceedings of The Linguistic Annotation Workshop.</source>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Nancy</given-names>
            <surname>Ide</surname>
          </string-name>
          , Collin Baker, Christiane Fellbaum, and
          <string-name>
            <given-names>Charles</given-names>
            <surname>Fillmore</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>MASC: The manually annotated sub-corpus of American English</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Ran</given-names>
            <surname>Levy</surname>
          </string-name>
          , Yonatan Bilu, Ehud Aharoni, and
          <string-name>
            <given-names>Noam</given-names>
            <surname>Slonim</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Context dependent claim detection</article-title>
          .
          <source>In Proceedings of COLING</source>
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Mitchell P.</given-names>
            <surname>Marcus</surname>
          </string-name>
          , Beatrice Santorini, and Mary Ann Marcinkiewicz.
          <year>1993</year>
          .
          <article-title>Building a Large Annotated Corpus of English: The Penn Treebank</article-title>
          .
          <source>Computational Linguistics</source>
          ,
          <volume>19</volume>
          (
          <issue>2</issue>
          ):
          <fpage>313</fpage>
          -
          <lpage>330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Rashmi</given-names>
            <surname>Prasad</surname>
          </string-name>
          , Nikhil Dinesh,
          <string-name>
            <given-names>Alan</given-names>
            <surname>Lee</surname>
          </string-name>
          , Eleni Miltsakaki, Livio Robaldo, Aravind K Joshi, and Bonnie L Webber.
          <year>2008</year>
          .
          <article-title>The Penn Discourse TreeBank 2.0</article-title>
          .
          <source>In Proceedings of LREC</source>
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Carlota S</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Modes of discourse: The local structure of texts</article-title>
          . Cambridge University Press.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Carlota S</given-names>
            <surname>Smith</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Aspectual entities and tense in discourse</article-title>
          .
          <source>In Aspectual Inquiries</source>
          , pages
          <fpage>223</fpage>
          -
          <lpage>237</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Radu</given-names>
            <surname>Soricut</surname>
          </string-name>
          and
          <string-name>
            <given-names>Daniel</given-names>
            <surname>Marcu</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Sentence level discourse parsing using syntactic and lexical information</article-title>
          .
          <source>In Proceedings ACL-HLT</source>
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Christian</given-names>
            <surname>Stab</surname>
          </string-name>
          and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Annotating argument components and relations in persuasive essays</article-title>
          .
          <source>In Proceedings of COLING</source>
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Bonnie</given-names>
            <surname>Webber</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Genre distinctions for discourse in the Penn TreeBank</article-title>
          .
          <source>In Proceedings of ACL</source>
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>