<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>IberLEF</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Factuality Classification Using the Pre-trained Language Representation Model BERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jihang Mao</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wanli Liu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>1214</institution>
          ,
          <addr-line>Bethesda, MD 20814</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>TAJ Technologies, Inc.</institution>
          ,
          <addr-line>7910 Woodmont Ave</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University Blvd E</institution>
          ,
          <addr-line>Silver Spring, MD 20901</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>24</volume>
      <fpage>126</fpage>
      <lpage>131</lpage>
      <abstract>
        <p>In this paper we report our participation in the 2019 FACT (Factuality Analysis and Classification Task) challenge, in which a corpus containing texts with verbal events is provided and systems must automatically propose a factual tag for each event. In this task facts are not verified against the real world; they are only assessed with respect to how they are presented by the source. It is therefore important to find indications in the linguistic context surrounding the events. Our approach utilizes BERT, a multi-layer bidirectional transformer encoder that can learn deep bidirectional representations of texts, and the pre-trained model is fine-tuned on the FACT training data. The representations of an event and its sentence are fed into an output layer for classification. Our approach achieves encouraging results in the evaluation, which demonstrates that it is competitive and applicable to multilingual text categorization tasks.</p>
      </abstract>
      <kwd-group>
        <kwd>BERT</kwd>
        <kwd>Factuality Detection</kwd>
        <kwd>Text categorization</kwd>
        <kwd>Multilingual Model</kwd>
        <kwd>Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        With the exponential growth of user-generated content, rumors on social media
platforms have become widely noticed. In a Pew Research Center poll, 64% of US adults said that
“made-up news” has caused a “great deal of confusion” about the facts of current events
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, identifying the factual status of events early is hard without
sufficient evidence such as user responses and fact-checking sites. Automating the fact-checking
pipeline is rather challenging, despite recent progress in natural language
processing, databases and information retrieval [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Many prior studies began by manually
inspecting tweet messages in the training dataset to come up with an initial
human-curated list of word features. It was found that these words could be categorized into
meaningful groups. Such “cue words” have been reported to be useful in identifying an
author’s certainty in journalism, determining veracity of rumors and detecting
disagreement in online dialogue [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3-5</xref>
        ].
It is crucial to determine whether event references are presented as having taken place
or as potential or unaccomplished events. Despite its centrality for Natural Language
Understanding, this task has been under-researched, with [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ] as a reference for
English and [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for Spanish. Besides its inherent difficulty, the bottleneck to advance on
this task has usually been the lack of annotated resources. Following Saurí [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ],
factuality is understood as the category that determines the factual status of events, i.e.,
whether events are presented as certain or not. Adopting the Saurí model with some
changes, Wonsever et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] create an annotated corpus with factuality information
and an automatic annotation tool based on automatic supervised learning. Alonso et al.
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] create a tool for the annotation of factuality expressed in texts in Spanish through
automatic processing, which is carried out from three different axes: multilevel,
multidimensional and multitextual.
      </p>
      <p>
        FACT (Factuality Analysis and Classification Task) is a shared task on classifying events in
Spanish texts (from Spanish and Uruguayan newspapers) according to their factuality status.
The goal of FACT is to determine the status of verbal events with respect to
factuality in Spanish texts. Participating teams are given a text with its events
already identified, and are required to automatically assign a factuality category to each
of the events. Current and past situations that are presented as real
are categorized as Fact, while situations that the writer presents as not having
happened in the real world are categorized as Counterfact. Situations presented as
uncertain are categorized into a class that covers a number of other values, such as
different kinds of Future, Potential or Undefined [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The corresponding tags are F (Fact), CF
(CounterFact) and U (Undefined).
      </p>
      <p>A brief description of our method for the FACT task is presented in Section 2. In Section
3 we show the results of our method on the official FACT test datasets. In Section 4 we
present a discussion of the results and conclusions of our participation in this challenge.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <p>
        For the FACT task, our method builds on BERT, which has obtained state-of-the-art
performance on most NLP tasks [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. More specifically, given a sentence, our method first
obtains its token representation from the pre-trained BERT model using a
case-preserving WordPiece model, including the maximal document context provided by the data.
Next, we formulate this as a sentence-pair classification task by feeding the
representations of the event and its sentence into an output layer: a multiclass classifier over the
factual tags. Finally, we combine the outputs of the models for the Spanish and Uruguayan
texts to generate the result.
      </p>
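      <p>The sentence-pair formulation described above can be sketched as follows. This is an illustrative stand-in, not our actual implementation: naive whitespace tokenization replaces WordPiece, and the function name is hypothetical.</p>

```python
# Sketch of the BERT sentence-pair input: the event mention is segment A,
# its full sentence is segment B, in the "[CLS] A [SEP] B [SEP]" convention.
def build_pair_input(event, sentence, max_len=256):
    """Return (tokens, segment_ids) for a BERT-style sentence pair.

    Whitespace tokenization stands in for the real WordPiece tokenizer.
    """
    a = event.split()
    b = sentence.split()
    # Truncate segment B so the pair fits in max_len (3 special tokens).
    b = b[: max(0, max_len - len(a) - 3)]
    tokens = ["[CLS]"] + a + ["[SEP]"] + b + ["[SEP]"]
    segment_ids = [0] * (len(a) + 2) + [1] * (len(b) + 1)
    return tokens, segment_ids

tokens, segs = build_pair_input("anunció", "El presidente anunció nuevas medidas .")
```

      <p>The pre-trained encoder then maps these token and segment sequences to contextual representations; only the input-packing convention is shown here.</p>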
      <p>
        BERT utilizes a multi-layer bidirectional transformer encoder which can learn deep
bidirectional representations and can be later fine-tuned for a variety of tasks such as text
classification. Before BERT, deep learning models, such as convolutional neural
network (CNN) and Bi-directional Long Short-Term Memory (Bi-LSTM) have greatly
improved the performance in text classification over the last few years [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. OpenAI
GPT [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] has proved the effectiveness of the generative pre-trained language model.
The pre-trained BERT models are trained on a large corpus (Wikipedia + BookCorpus).
Several pre-trained models have been released. In FACT, we chose the BERT-Base
multilingual cased model for the following reasons. First, the multilingual model is better suited to the
Spanish documents in FACT, because the English-only model splits tokens not available in
its vocabulary into sub-tokens, which would hurt the accuracy of the classification task.
Second, although BERT-Large generally outperforms BERT-Base on English NLP
tasks, BERT-Large versions of the multilingual models have not been released. Third, the
multilingual cased model fixes normalization issues in many languages, so it is
recommended for languages with non-Latin alphabets (and is often better for most languages
with Latin alphabets as well). In FACT, we use the final hidden state corresponding to the special
token ([CLS]) as the aggregate sequence representation, and feed it into an output layer
for classification (Figure 1).
      </p>
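      <p>The classification step can be illustrated with a minimal sketch (random, untrained weights standing in for the fine-tuned model): the final [CLS] hidden state is projected by one dense layer onto the three factual tags and normalized with a softmax.</p>

```python
import numpy as np

TAGS = ["F", "CF", "U"]          # factual tags used in FACT
HIDDEN = 768                     # hidden size of BERT-Base

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(HIDDEN, len(TAGS)))  # classifier weights
b = np.zeros(len(TAGS))

def classify(cls_hidden):
    """Map a [CLS] representation to a distribution over F/CF/U."""
    logits = cls_hidden @ W + b
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    return TAGS[int(np.argmax(probs))], probs

tag, probs = classify(rng.normal(size=HIDDEN))
```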
      <p>[Figure 1. BERT for factuality classification: the event and its sentence are packed into one input text ([CLS] Tok1 ... TokN [SEP] Tok ...), embedded (E[CLS], E1, ..., EN) and encoded by BERT, and the final hidden states (C, T1, ..., TN, T[SEP]) are fed into the output layer that produces the class label.]</p>
      <p>In addition, in order to address local multilinguality, i.e. the differences between the texts from the Spanish and Uruguayan newspapers, we build models for the Spanish and the Uruguayan texts respectively. We train the two models and predict the factual tags with the corresponding training and testing texts. We then combine the outputs of the two models to generate the final results.</p>
    </sec>
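      <p>The per-source scheme described above can be sketched as follows; the interfaces are hypothetical, and toy constant models stand in for the two fine-tuned BERT classifiers.</p>

```python
def combine_predictions(events, predict_es, predict_uy):
    """Tag each event with the model matching its newspaper source.

    `events` is a list of (source, event_id) pairs, where source is
    "es" (Spanish newspaper) or "uy" (Uruguayan newspaper).
    """
    results = {}
    for source, event_id in events:
        model = predict_es if source == "es" else predict_uy
        results[event_id] = model(event_id)
    return results

# Toy stand-in models that always answer with a fixed tag.
tags = combine_predictions(
    [("es", "e1"), ("uy", "e2")],
    predict_es=lambda e: "F",
    predict_uy=lambda e: "U",
)
```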
    <sec id="sec-4">
      <title>Results</title>
      <p>
        The FACT corpus contains Spanish texts with approximately 5,000 verbal events
classified as F (Fact), CF (Counterfact), and U (Undefined). It has been divided into two
subsets: a training corpus with 4,000 events and a testing corpus with 1,000 events.
In FACT, performance is measured against the evaluation corpus using the
following metrics: Precision, Recall and F1 score for each category, Macro-F1, and
global accuracy, with Macro-F1 as the main measure for this task. Here we present the
results on the test set. In our best submission, the model was fine-tuned using the
hyperparameter values suggested in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]: learning rate (Adam) = 2e-5, number of
epochs = 3, max sequence length = 256, and batch size = 16. When fine-tuning the model
for Spanish texts, we divided the training set into two subsets: 1,671 events from 20
articles for training and 336 events from 6 articles for development. To fine-tune the
model for Uruguayan texts, 1,679 events from 22 articles were used for training and 657 events
from 8 articles for development.
      </p>
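      <p>Macro-F1, the main FACT measure, is the unweighted mean of the per-class F1 scores over the three tags. A self-contained sketch with toy predictions (not the official scorer code):</p>

```python
TAGS = ("F", "CF", "U")

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over the three factual tags."""
    scores = []
    for t in TAGS:
        tp = sum(1 for g, p in zip(gold, pred) if g == t and p == t)
        fp = sum(1 for g, p in zip(gold, pred) if g != t and p == t)
        fn = sum(1 for g, p in zip(gold, pred) if g == t and p != t)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(TAGS)

gold = ["F", "F", "CF", "U", "F", "U"]
pred = ["F", "CF", "CF", "U", "F", "F"]
score = macro_f1(gold, pred)
```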
      <p>As shown in Table 1, our best submission significantly outperformed the baseline "fact"
in both Macro-F1 and accuracy, and its Macro-F1 score is not
far from the highest score (-0.072). We are in third place among all participants, which
demonstrates the good performance of our system in automatically classifying events in
Spanish texts according to their factuality status.</p>
      <p>However, although the accuracy of our system is reasonable compared
to the other systems (0.099 and 0.013 behind the top two systems respectively), it is far
from the accuracy we achieved on the development set (0.622 vs. 0.825). Table 2 shows
the accuracy of the models for the Spanish, Uruguayan and mixed texts on the
corresponding development sets. The performance gap might be caused by
differences between the training and testing sets or by over-fitting of the models. We will
conduct a further error analysis after the Gold Standard classifications of the test set are
released.</p>
      <p>We described our approach to the FACT: Factuality Analysis and
Classification Task at IberLEF 2019. Compared to previous methods, our approach has
several significant differences, from system architecture to the actual implementation. It is
a general and robust framework and showed competitive performance among all
participating systems during the FACT evaluations. In future work, we will use a new set
of random seeds each time to prevent over-fitting, and we plan to explore its use in practical
applications such as fact-checking and fake-news detection.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>The authors would like to thank Dr. Yutao Zhang for providing Jihang Mao the intern
opportunity at George Mason University and valuable suggestions and comments on
the manuscript. The authors would also like to thank the FACT task organizers for
providing the data of the task.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. Pew Research Center:
          <article-title>Many Americans Believe Fake News Is Sowing Confusion</article-title>
          . https://www.journalism.org/2016/12/15/many-americans-believe-fakenews-is-sowing-confusion (retrieved on June 21,
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Vlachos</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Riedel</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Fact checking: Task definition and dataset construction. Association for Computational Linguistics</article-title>
          , page
          <volume>18</volume>
          , (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Soni</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitra</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gilbert</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Eisenstein</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>Modeling factuality judgments in social media text</article-title>
          .
          <source>In ACL (2)</source>
          . pages
          <fpage>415</fpage>
          -
          <lpage>420</lpage>
          . (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Reichel</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lendvai</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Veracity Computing from Lexical Cues and Perceived Certainty Trends</article-title>
          .
          <source>Proceedings of the 2nd Workshop on Noisy User-generated Text</source>
          ,
          <fpage>4</fpage>
          -
          <lpage>13</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Misra</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Walker</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          <article-title>Topic independent identification of agreement and disagreement in social media dialogue</article-title>
          .
          <source>In Conference of the Special Interest Group on Discourse and Dialogue. page 920</source>
          . (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Saurí</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Pustejovsky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <article-title>FactBank: a corpus annotated with event factuality</article-title>
          .
          <source>Language resources and evaluation</source>
          ,
          <volume>43</volume>
          (
          <issue>3</issue>
          ),
          <fpage>227</fpage>
          . (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gorrell</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Aker</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bontcheva</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Derczynski</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kochkina</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liakata</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Zubiaga</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . SemEval
          <article-title>-2019 Task 7: RumourEval, Determining Rumour Veracity and Support for Rumours</article-title>
          .
          <source>In Proceedings of the 13th International Workshop on Semantic Evaluation</source>
          (pp.
          <fpage>845</fpage>
          -
          <lpage>854</lpage>
          ). (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Wonsever</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Malcuori</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Rosá Furman</surname>
          </string-name>
          , A. Factividad de los eventos referidos en textos.
          <source>Reportes Técnicos 09-12</source>
          , Pedeciba. (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Saurí</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <article-title>A Factuality Profiler for Eventualities in Text</article-title>
          .
          <source>Ph.D. Thesis</source>
          . Brandeis University. (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Wonsever</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosá</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Malcuori</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <article-title>Factuality Annotation and Learning in Spanish Texts</article-title>
          . In LREC. (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Alonso</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Castellón</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Curell</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fernández-Montraveta</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliver</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vázquez</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <article-title>"Proyecto TAGFACT: Del texto al conocimiento. Factualidad y grados de certeza en español"</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          ,
          <volume>61</volume>
          , p.
          <fpage>151</fpage>
          -
          <lpage>154</lpage>
          . ISSN: 1135-5948. (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M. W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In Proceedings of NAACL-HLT</source>
          <year>2019</year>
          , pages
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          . Minneapolis, Minnesota, USA. (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <article-title>Learning structured representation for text classification via reinforcement learning</article-title>
          .
          <source>In Thirty-Second AAAI Conference on Artificial Intelligence</source>
          . (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Narasimhan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salimans</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <article-title>Improving language understanding with unsupervised learning</article-title>
          .
          <source>Technical report, OpenAI</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>