Modeling the Fake News Challenge as a Cross-Level Stance Detection Task

Costanza Conforti          Mohammad Taher Pilehvar          Nigel Collier
Language Technology Lab, University of Cambridge
cc918@cam.ac.uk            mp792@cam.ac.uk            nhc30@cam.ac.uk

Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The 2017 Fake News Challenge Stage 1, a shared task for stance detection of news-article and claim pairs, has received a lot of attention in recent years [ea18]. The provided dataset is highly unbalanced, with a distribution heavily skewed towards unrelated samples, that is, randomly generated pairs of news articles and claims belonging to different topics. This imbalance favored systems which performed particularly well in classifying those noisy samples, something which does not require deep semantic understanding. In this paper, we propose a simple architecture based on conditional encoding, carefully designed to model the internal structure of a news article and its relations with a claim. We demonstrate that our model, which only leverages information from word embeddings, can outperform a system based on a large number of hand-engineered features, which replicates one of the winning systems at the Fake News Challenge [HASC17], in the stance detection of the related samples.

1 Introduction

Stance classification has been identified as a key subtask in rumor resolution [ZAB+18]. Recently, a similar approach has been proposed to address fake news detection: as a first step towards a comprehensive model for news veracity classification, a corpus of news articles, stance-annotated with respect to claims, has been released for the Fake News Challenge (FNC-1)1.

Characteristics of the corpus - The FNC-1 corpus is based on the Emergent dataset [FV16], a collection of 300 claims and 2,595 articles discussing the claims. Each article is labeled with the stance it expresses toward the claim and summarized into a headline by accredited journalists, in the framework of a project for rumor debunking [Sil15].

For creating the FNC-1 corpus, the headlines and the articles were paired and labeled with the corresponding stance, distinguishing between agreeing (AGR), disagreeing (DSG) and discussing (DSC) samples. An additional 266 labeled samples were added to avoid cheating [ea18]. Moreover, a number of unrelated (UNR) samples were obtained by randomly matching headlines with articles discussing a different claim. As shown in Table 2, the final class distribution was highly skewed in favor of the UNR class, which amounted to almost three quarters of the samples.

Characteristics of the FNC-1 winning models - As a consequence of being randomly generated, UNR samples are relatively easy to classify. Moreover, given that the UNR samples constitute the large majority of the corpus, most competing systems were designed to perform well on this easy-to-discriminate class. In fact, the three FNC-1 winning teams proposed relatively standard architectures (mainly based on multilayer perceptrons, MLPs) leveraging a large number of classic, hand-engineered NLP features. While those systems performed very well on the UNR class - reaching an F1 score higher than .99 - they were not as effective in classifying the AGR, DSG and DSC samples [ea18].

1 http://www.fakenewschallenge.org/
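The random cross-claim pairing used to build the UNR class can be sketched in a few lines; the data and field names below are toy illustrations, not the actual FNC-1 release:

```python
import random

# Toy corpus: each entry pairs a claim with a journalist-written headline
# and an article discussing that claim (hypothetical data).
corpus = [
    {"claim_id": 1, "headline": "Spider crawled under man's skin", "article": "..."},
    {"claim_id": 2, "headline": "Giant crab photographed in Kent", "article": "..."},
    {"claim_id": 3, "headline": "Apple to remove headphone jack", "article": "..."},
]

def make_unrelated(corpus, n, seed=0):
    """Randomly pair headlines with articles from a *different* claim,
    labeling the resulting samples as UNR."""
    rng = random.Random(seed)
    unrelated = []
    while len(unrelated) < n:
        h, a = rng.choice(corpus), rng.choice(corpus)
        if h["claim_id"] != a["claim_id"]:  # keep only cross-claim pairs
            unrelated.append({"headline": h["headline"],
                              "article": a["article"],
                              "stance": "UNR"})
    return unrelated

print(len(make_unrelated(corpus, 4)))  # prints 4
```

Because such pairs share essentially no topical overlap, shallow lexical features already separate them well, which is why this class dominates what the systems learn.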
Table 1: Example of an agreeing headline from the FNC-1 training set, with its related document divided into paragraphs (doc 1880). Each paragraph may express a different stance with respect to the claim, as indicated in the first column.

Headline. No, a spider (probably) didn't crawl through a man's body for several days
Article.
AGR      "Fear not arachnophobes, the story of Bunbury's "spiderman" might not be all it seemed.
DSC      [...] scientists have cast doubt over claims that a spider burrowed into a man's body [...] The story went global [...]
DSG      Earlier this month, Dylan Thomas [...] sought medical help [...] he had a spider crawl underneath his skin.
DSG      Mr Thomas said a [...] dermatologist later used tweezers to remove what was believed to be a "tropical spider".
(noise)  [image via Shutterstock]
AGR      But it seems we may have all been caught in a web... of misinformation.
DSC/AGR  Arachnologist Dr Framenau said [...] it was "almost impossible" [...] to have been a spider [...]
(noise)  Dr Harvey said: "We hear about people going on holidays and having spiders lay eggs under the skin". [...]
(noise)  Something which is true, [...] is that certain arachnids do live on humans. We all have mites living on our faces [...]
(noise)  Dylan Thomas has been contacted for comment."

FNC-1 as a Cross-Level Stance Detection task - As shown in Table 3, one specific characteristic of the FNC-1 corpus is the clear asymmetry in length between the headlines and the articles. While a headline consists of a single sentence, the structure of an article is better described as a sequence of paragraphs, where each paragraph plays a different role in telling a story, and single paragraphs usually express different views of a topic. Following the terminology introduced by [JPN14], we propose to call this variant of the classic Stance Detection task Cross-Level Stance Detection.

As shown in Table 1, an article consists of passages presenting a news story, reporting on interviews, giving general background information and discussing similar events that happened in the past. In contemporary news-writing prose, the most salient information is usually condensed in the very first paragraphs, following the Inverted Pyramid style. This allows the reader to make rapid decisions about an article's relevance [Sca00].

For these reasons, we believe that detecting the stance of an article with respect to a headline requires a deep understanding not only of the position taken in each paragraph with respect to the headline, but also of the complex interactions between the article's paragraphs, as illustrated by the example in Table 1. On the contrary, compressing both the headline's and the article's content into fixed-size vectors, as in the feature-based systems described in the previous paragraph, fails to detect those fine-grained relationships and results in sub-optimal performance on the stance detection of AGR, DSG and DSC samples.

To test this assumption, we propose a simple architecture based on conditional encoding, which is designed to model the complex interactions between headlines and articles described above, and we compare it with one of the feature-based systems which won the FNC-1 [HASC17].

In order to assess the ability of the systems to model the complex headline-article interplay described above, we filter out the noisy UNR samples and consider only the related samples (AGR, DSG and DSC). Those samples were manually collected and labeled by professional journalists and require deep semantic understanding in order to be classified, constituting a difficult task even for humans. This is evident when looking at the inter-annotator agreement of human raters, which drops from Fleiss' κ = .686 to .218 when the UNR samples are excluded, as reported in [ea18]. The final label distribution is reported in Table 2.

2 Models

2.1 Feature-based approach

We implemented the model proposed by team Athene, which was ranked second at FNC-12. The model consists of a 7-layer MLP with ReLU activations. On top of the architecture, a softmax layer is used for prediction (Figure 1). Input is given in the form of a large matrix of hand-engineered features.
The considered set includes the concatenation of feature vectors which separately consider the headline and the article - like the presence of refuting or polarity words (taken from a hand-selected list of words such as 'hoax' or 'debunk') and tf-idf weighted Bag of Words vectors - and features which combine the headline and the article (joint features in Figure 1) - like word/n-gram overlap between the headline and the article, and cosine similarity of the embeddings of nouns and verbs between the headline and the article. Moreover, topic-based features based on non-negative matrix factorization, latent Dirichlet allocation and latent semantic indexing were used. For a detailed description of the features, refer to [HASC17].

2 We used Athene as the baseline as the FNC-1 winning model was an ensemble [BSP17].

Table 2: Label distribution for the FNC-1 dataset, with and without the UNR samples.

            instances   agr     dsg    dsc     unr
fnc-1       75,385      7.4%    2.0%   17.7%   72.8%
fnc-1-rel   20,491      27.2%   7.5%   65.2%   -

Table 3: Asymmetry in length in the FNC-1 corpus.

                  headline   article   paragraph
avg #tokens       12.40      417.69    30.88
avg #par/article  -          11.97     -

Figure 1: The feature-based model proposed by [HASC17]. [Diagram: headline features, joint features and body features are concatenated into the input vector, fed through 7 dense ReLU layers with different numbers of units, followed by a dense softmax layer producing the prediction ŷ.]
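Two of the simplest feature families above, a refuting-word indicator and headline-article token overlap, can be sketched as follows; the word list and tokenizer are simplified illustrations, not the actual Athene implementation:

```python
# Sketch of two classic hand-engineered features of the kind used by the
# feature-based FNC-1 systems. The refuting-word list below is a toy
# subset for illustration only.
REFUTING = {"fake", "hoax", "debunk", "fraud", "deny", "doubt", "not"}

def tokenize(text):
    """Naive whitespace tokenizer with edge-punctuation stripping."""
    return [w.strip(".,!?\"'()").lower() for w in text.split()]

def refuting_features(headline):
    """One binary indicator per refuting word found in the headline."""
    tokens = set(tokenize(headline))
    return [int(w in tokens) for w in sorted(REFUTING)]

def token_overlap(headline, article):
    """Jaccard overlap between headline and article token sets."""
    h, a = set(tokenize(headline)), set(tokenize(article))
    return len(h & a) / len(h | a) if h | a else 0.0

print(sum(refuting_features("Spider story a hoax, experts say")))   # prints 1
print(round(token_overlap("spider crawled under skin",
                          "A spider crawl under the skin."), 2))    # prints 0.43
```

Note that such features compress the whole article into a single fixed-size vector, which is precisely the limitation the conditional model below is designed to avoid.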
2.2 Conditional approach

In order to model the headline-article interactions described in Section 1, we adapt the bidirectional conditional encoding architecture first proposed by [ARVB16] for stance detection of tweets.

First, the article is split into n paragraphs. Both the headline and the paragraphs are converted into their embedding representations. The headline is then processed by a Bi-LSTMh (Eq. 1). Each paragraph is then encoded by a further Bi-LSTMS1 (Eq. 2), whose initial cell states are initialized with the last states of, respectively, the forward and backward LSTMs which compose Bi-LSTMh (see Figure 2 for a representation of the forward part of the architecture). As pointed out in [ARVB16], this allows Bi-LSTMS1 to read the paragraph in a headline-specific manner.

    Hh  = Bi-LSTMh(Eh)                          (1)
    Hsi = Bi-LSTMS1(Esi)   ∀i ∈ {1, ..., n}     (2)

where Eh ∈ R^(e×H) and Esi ∈ R^(e×Si) are respectively the embedding matrices of the headline and of the ith paragraph, H and Si are respectively the headline and ith paragraph lengths, e is the embedding size, l is the hidden size, Hh ∈ R^(l×H) and Hsi ∈ R^(l×Si).

Then, each paragraph representation, conditionally encoded on the headline, is processed by another Bi-LSTMS2, conditioned on the previous paragraph. We start the paragraph-conditioned reading of the article from the bottom, as we assume the most salient information to be concentrated in the beginning (see Section 1).

    Hsi = Bi-LSTMS2(Hsi)   ∀i ∈ {1, ..., n}     (3)

resulting in a matrix Hsi ∈ R^(l×Si). We employ a self-attention mechanism similar to that of [ea16] in order to soft-select the most relevant elements of the sentence. Given the sequence of vectors {h1, ..., hS} which compose Hsi, the final representation si of the ith paragraph is obtained as follows:

    uit = tanh(Ws hit + bs)                             (4)
    αit = exp(uit^T us) / Σt exp(uit^T us)              (5)
    si  = Σt αit hit                                    (6)

where the hidden representation of the word at position t, uit, is obtained through a one-layer MLP (Eq. 4). The normalized attention weights αit are then obtained through a softmax operation (Eq. 5). Finally, si is computed as a weighted sum of all hidden states hit with the attention weights αit (Eq. 6). The sentence representations {s1, ..., sn} are aggregated using a backward LSTM, as in Figure 2. The final prediction ŷ is obtained with a softmax operation over the tagset.

Figure 2: Model based on conditional encoding (best seen in color). Networks represented with the same color share the weights. Dotted arrows represent conditional encoding. Due to lack of space, we represent only the forward part of the encoder. However, headline and paragraph encoders (the red, green and blue networks in the figure) are Bi-LSTMs.
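The self-attention of Eqs. 4-6 can be sketched in a few lines of numpy; this is a minimal re-implementation under assumed shapes, not the paper's keras code, with Ws, bs and us randomly initialized in place of learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
l, S = 8, 5                      # hidden size l and paragraph length S (assumed)
H = rng.normal(size=(S, l))      # stand-in for Bi-LSTM hidden states h_t

Ws = rng.normal(size=(l, l))     # one-layer MLP weights (Eq. 4)
bs = rng.normal(size=l)
us = rng.normal(size=l)          # learned word-level context vector

U = np.tanh(H @ Ws + bs)                       # Eq. 4: u_t = tanh(Ws h_t + bs)
scores = U @ us                                # u_t^T u_s for every position t
alpha = np.exp(scores) / np.exp(scores).sum()  # Eq. 5: softmax over positions
s = alpha @ H                                  # Eq. 6: weighted sum of h_t

print(alpha.shape, s.shape)   # prints (5,) (8,)
```

Each paragraph thus collapses to a single l-dimensional vector s, but unlike the fixed-size feature vectors of Section 2.1, the pooling weights are computed per token and conditioned on the learned context.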
3 Experiments

3.1 (Hyper-)Parameters

For the feature-based model, we downloaded the feature matrices used by [HASC17] for their best FNC-1 submission3 and selected the columns corresponding to the related samples. For the conditional model, we initialized the embedding matrix with word2vec embeddings4. Only words which occurred more than 7 times were included in the embedding matrix. Words not included in word2vec were zero-initialized. In order to avoid overfitting, we did not fine-tune the embeddings during training. The main structures of the models were implemented in keras, using Tensorflow for implementing customized layers. Refer to the appendix for the complete list of hyperparameters used to train both architectures.

3.2 Evaluation Metrics

In the FNC-1 context, a so-called FNC score was proposed for evaluation: this hierarchical evaluation metric gives 0.25 points for a correct REL/UNR classification, incremented by 0.75 points in the case of a correct AGR/DSG/DSC classification5. This was motivated by the high imbalance in favor of the UNR class. However, as our experiments only consider REL samples, the FNC score does not constitute a useful evaluation metric. Following [ea18], we use macro-averaged precision, recall and F1 score, which are less affected by the high class imbalance (Table 2).

3.3 Results and Discussion

Results of the experiments are reported in Table 4. The proposed conditional model clearly outperforms the feature-based baseline on all considered metrics, despite having considerably fewer trainable parameters.

Table 4: Macro-averaged precision, recall and F1 scores on the development and test set.

                    dev set                 test set
                    Pm     Rm     F1m      Pm     Rm     F1m
Feat-based model    .359   .350   .350     .388   .361   .367
Cond model          .685   .716   .699     .505   .503   .486

Interestingly, the feature-based model seems to generalize better to the test set, while the gap between development and test set performance for the conditional model seems to indicate overfitting.

Detailed performance on the single classes is shown in Figure 3. Thanks to the presence of features specifically designed to target refuting words, the baseline model is able to reach a Precision of 15.2% in classifying the very infrequent DSG class (7.5% of occurrences). The conditional model, which did not receive any explicit signal of the presence of negation, suffers more from this data imbalance, and reaches a Precision of 7.3% on DSG samples. On the other hand, by flattening the entire article into a fixed-size vector, the feature-based system loses the nuances in the argumentative structure of the news story. As a consequence, this system struggles to distinguish between AGR and DSC samples and tends to favor the most frequent DSC class, which receives the highest Precision and Recall scores. On the contrary, the conditional model is able to spot the subtle differences between AGR and DSC samples, reaching high Precision and satisfactory Recall for both classes despite the large class imbalance - 27.2% AGR vs. 65.2% DSC samples.

Figure 3: Confusion Matrices of the predictions of both the feature-based and the conditional model on the test set.

4 Conclusions

Given the results discussed in the previous Section, we believe the strategy of modeling FNC-1 as a Cross-Level Stance Detection problem is promising. In future work, we will carry out a detailed qualitative analysis to test the extent to which our conditional model is able to capture the narrative structures of articles and their interactions with the headlines. The generalizability of such an architecture to other domains can be tested on other publicly available corpora, such as the recently released ARC dataset by [ea18].

Acknowledgments

The first author (CC) would like to thank the Siemens Machine Intelligence Group (CT RDA BAM MIC-DE, Munich) and the NERC DREAM CDT (grant no. 1945246) for partially funding this work. The third author (NC) is grateful for support from the UK EPSRC (grant no. EP/MOO5089/1).

3 https://drive.google.com/open?id=0B0-muIdcdTp7UWVyU0duSDRUd3c
4 https://code.google.com/archive/p/word2vec/
5 https://github.com/FakeNewsChallenge/fnc-1-baseline/blob/master/utils/score.py

References

[ARVB16] Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, and Kalina Bontcheva. Stance detection with bidirectional conditional encoding. In Proceedings of EMNLP 2016, pages 876-885, 2016.

[BSP17] Sean Baird, Doug Sibley, and Yuxi Pan. Talos targets disinformation with fake news challenge victory. https://blog.talosintelligence.com/2017/06/, 2017.

[ea16] Zichao Yang et al. Hierarchical attention networks for document classification. In Proceedings of NAACL-HLT 2016, pages 1480-1489, 2016.

[ea18] Andreas Hanselowski et al. A retrospective analysis of the fake news challenge stance-detection task. In Proceedings of COLING 2018, pages 1859-1874, 2018.

[FV16] William Ferreira and Andreas Vlachos. Emergent: a novel data-set for stance classification. In Proceedings of NAACL-HLT 2016, pages 1163-1168, 2016.

[HASC17] Andreas Hanselowski, PVS Avinesh, Benjamin Schiller, and Felix Caspelherr. Description of the system developed by team Athene in the FNC-1. Technical report, 2017.

[JPN14] David Jurgens, Mohammad Taher Pilehvar, and Roberto Navigli. SemEval-2014 task 3: Cross-level semantic similarity. In Proceedings of SemEval 2014, pages 17-26, 2014.

[Sca00] Christopher Scanlan. Reporting and writing: Basics for the 21st century. Harcourt College Publishers, 2000.

[Sil15] Craig Silverman. Lies, damn lies and viral content. Columbia University, 2015.

[ZAB+18] Arkaitz Zubiaga, Ahmet Aker, Kalina Bontcheva, Maria Liakata, and Rob Procter. Detection and resolution of rumours in social media: A survey. ACM Computing Surveys (CSUR), 51(2):32, 2018.