Modeling the Fake News Challenge as a Cross-Level Stance Detection Task

Costanza Conforti          Mohammad Taher Pilehvar          Nigel Collier
Language Technology Lab, University of Cambridge
cc918@cam.ac.uk            mp792@cam.ac.uk            nhc30@cam.ac.uk

Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The 2017 Fake News Challenge Stage 1, a shared task for stance detection of news-article and claim pairs, has received a lot of attention in recent years [ea18]. The provided dataset is highly unbalanced, with a distribution heavily skewed towards unrelated samples, that is, randomly generated pairs of news articles and claims belonging to different topics. This imbalance favored systems which performed particularly well in classifying those noisy samples, something which does not require deep semantic understanding. In this paper, we propose a simple architecture based on conditional encoding, carefully designed to model the internal structure of a news article and its relations with a claim. We demonstrate that our model, which only leverages information from word embeddings, can outperform a system based on a large number of hand-engineered features, which replicates one of the winning systems at the Fake News Challenge [HASC17], in the stance detection of the related samples.

1 Introduction

Stance classification has been identified as a key subtask in rumor resolution [ZAB+18]. Recently, a similar approach has been proposed to address fake news detection: as a first step towards a comprehensive model for news veracity classification, a corpus of news articles, stance-annotated with respect to claims, has been released for the Fake News Challenge (FNC-1)1.

Characteristics of the corpus - The FNC-1 corpus is based on the Emergent dataset [FV16], a collection of 300 claims and 2,595 articles discussing the claims. Each article is labeled with the stance it expresses toward the claim and summarized into a headline by accredited journalists, in the framework of a project for rumor debunking [Sil15].

For creating the FNC-1 corpus, the headlines and the articles were paired and labeled with the corresponding stance, distinguishing between agreeing (AGR), disagreeing (DSG) and discussing (DSC) samples. An additional 266 labeled samples were added to avoid cheating [ea18]. Moreover, a number of unrelated (UNR) samples were obtained by randomly matching headlines with articles discussing a different claim. As shown in Table 2, the final class distribution was highly skewed in favor of the UNR class, which amounted to almost three quarters of the samples.

Characteristics of the FNC-1 winning models - As a consequence of being randomly generated, UNR samples are relatively easy to classify. Moreover, given that the UNR samples constitute the large majority of the corpus, most competing systems were designed to perform well on this easy-to-discriminate class. In fact, the three FNC-1 winning teams proposed relatively standard architectures (mainly based on multilayer perceptrons, MLPs) leveraging a large number of classic, hand-engineered NLP features. While those systems performed very well on the UNR class - reaching an F1 score higher than .99 - they were not as effective in classifying the AGR, DSG and DSC samples [ea18].

1 http://www.fakenewschallenge.org/
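The random cross-claim pairing used to build the UNR class can be sketched in a few lines; the data and field names below are toy illustrations, not the actual FNC-1 release:

```python
import random

# Toy corpus: each entry pairs a claim with a journalist-written headline
# and an article discussing that claim (hypothetical data).
corpus = [
    {"claim_id": 1, "headline": "Spider crawled under man's skin", "article": "..."},
    {"claim_id": 2, "headline": "Giant crab photographed in Kent", "article": "..."},
    {"claim_id": 3, "headline": "Apple to remove headphone jack", "article": "..."},
]

def make_unrelated(corpus, n, seed=0):
    """Randomly pair headlines with articles from a *different* claim,
    labeling the resulting samples as UNR."""
    rng = random.Random(seed)
    unrelated = []
    while len(unrelated) < n:
        h, a = rng.choice(corpus), rng.choice(corpus)
        if h["claim_id"] != a["claim_id"]:  # keep only cross-claim pairs
            unrelated.append({"headline": h["headline"],
                              "article": a["article"],
                              "stance": "UNR"})
    return unrelated

print(len(make_unrelated(corpus, 4)))  # prints 4
```

Because such pairs share essentially no topical overlap, shallow lexical features already separate them well, which is why this class dominates what the systems learn.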
Table 1: Example of an agreeing headline from the FNC-1 training set, with its related document divided into paragraphs (doc 1880). Each paragraph may express a different stance with respect to the claim, as indicated in the first column.

Headline. No, a spider (probably) didn't crawl through a man's body for several days
Article.
AGR      "Fear not arachnophobes, the story of Bunbury's "spiderman" might not be all it seemed.
DSC      [...] scientists have cast doubt over claims that a spider burrowed into a man's body [...] The story went global [...]
DSG      Earlier this month, Dylan Thomas [...] sought medical help [...] he had a spider crawl underneath his skin.
DSG      Mr Thomas said a [...] dermatologist later used tweezers to remove what was believed to be a "tropical spider".
(noise)  [image via Shutterstock]
AGR      But it seems we may have all been caught in a web... of misinformation.
DSC/AGR  Arachnologist Dr Framenau said [...] it was "almost impossible" [...] to have been a spider [...]
(noise)  Dr Harvey said: "We hear about people going on holidays and having spiders lay eggs under the skin". [...]
(noise)  Something which is true, [...] is that certain arachnids do live on humans. We all have mites living on our faces [...]
(noise)  Dylan Thomas has been contacted for comment."

FNC-1 as a Cross-Level Stance Detection task - As shown in Table 3, one specific characteristic of the FNC-1 corpus is the clear asymmetry in length between the headlines and the articles. While a headline consists of a single sentence, the structure of an article is better described as a sequence of paragraphs, where each paragraph plays a different role in telling a story, and single paragraphs usually express different views of a topic. Following the terminology introduced by [JPN14], we propose to call this variant of the classic Stance Detection task Cross-Level Stance Detection.

As shown in Table 1, an article consists of passages presenting a news story, reporting on interviews, giving general background information and discussing similar events that happened in the past. In contemporary news-writing prose, the most salient information is usually condensed in the very first paragraphs, following the Inverted Pyramid style. This allows the reader to make rapid decisions about an article's relevance [Sca00].

For these reasons, we believe that detecting the stance of an article with respect to a headline requires a deep understanding not only of the position taken in each paragraph with respect to the headline, but also of the complex interactions between the article's paragraphs, as illustrated by the example in Table 1. On the contrary, compressing both the headline's and the article's content into fixed-size vectors, as in the feature-based systems described in the previous paragraph, fails to detect those fine-grained relationships and results in sub-optimal performance on the stance detection of AGR, DSG and DSC samples.

To test this assumption, we propose a simple architecture based on conditional encoding, which is designed to model the complex interactions between headlines and articles described above, and we compare it with one of the feature-based systems which won the FNC-1 [HASC17].

In order to assess the ability of the systems to model the complex headline-article interplay described above, we filter out the noisy UNR samples and consider only the related samples (AGR, DSG and DSC). Those samples were manually collected and labeled by professional journalists and require deep semantic understanding in order to be classified, constituting a difficult task even for humans. This is evident when looking at the inter-annotator agreement of human raters, which drops from Fleiss' κ = .686 to .218 when the UNR samples are excluded, as reported in [ea18]. The final label distribution is reported in Table 2.

2 Models

2.1 Feature-based approach

We implemented the model proposed by team Athene, which was ranked second at FNC-12. The model consists of a 7-layer MLP with ReLU activations. On top of the architecture, a softmax layer is used for prediction (Figure 1). Input is given in the form of a large matrix of hand-engineered features.
The considered set includes the concatenation of feature vectors which separately consider the headline and the article - like the presence of refuting or polarity words (taken from a hand-selected list of words such as 'hoax' or 'debunk') and tf-idf weighted Bag of Words vectors - and features which combine the headline and the article (joint features in Figure 1) - like word/n-gram overlap between the headline and the article, and cosine similarity of the embeddings of nouns and verbs between the headline and the article. Moreover, topic-based features based on non-negative matrix factorization, latent Dirichlet allocation and latent semantic indexing were used. For a detailed description of the features, refer to [HASC17].

2 We used Athene as the baseline as the FNC-1 winning model was an ensemble [BSP17].

Table 2: Label distribution for the FNC-1 dataset, with and without the UNR samples.

            instances   agr     dsg    dsc     unr
fnc-1       75,385      7.4%    2.0%   17.7%   72.8%
fnc-1-rel   20,491      27.2%   7.5%   65.2%   -

Table 3: Asymmetry in length in the FNC-1 corpus.

                  headline   article   paragraph
avg #tokens       12.40      417.69    30.88
avg #par/article  -          11.97     -

Figure 1: The feature-based model proposed by [HASC17]. [Diagram: headline features, joint features and body features are concatenated into the input vector, fed through 7 dense ReLU layers with different numbers of units, followed by a dense softmax layer producing the prediction ŷ.]
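Two of the simplest feature families above, a refuting-word indicator and headline-article token overlap, can be sketched as follows; the word list and tokenizer are simplified illustrations, not the actual Athene implementation:

```python
# Sketch of two classic hand-engineered features of the kind used by the
# feature-based FNC-1 systems. The refuting-word list below is a toy
# subset for illustration only.
REFUTING = {"fake", "hoax", "debunk", "fraud", "deny", "doubt", "not"}

def tokenize(text):
    """Naive whitespace tokenizer with edge-punctuation stripping."""
    return [w.strip(".,!?\"'()").lower() for w in text.split()]

def refuting_features(headline):
    """One binary indicator per refuting word found in the headline."""
    tokens = set(tokenize(headline))
    return [int(w in tokens) for w in sorted(REFUTING)]

def token_overlap(headline, article):
    """Jaccard overlap between headline and article token sets."""
    h, a = set(tokenize(headline)), set(tokenize(article))
    return len(h & a) / len(h | a) if h | a else 0.0

print(sum(refuting_features("Spider story a hoax, experts say")))   # prints 1
print(round(token_overlap("spider crawled under skin",
                          "A spider crawl under the skin."), 2))    # prints 0.43
```

Note that such features compress the whole article into a single fixed-size vector, which is precisely the limitation the conditional model below is designed to avoid.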
2.2 Conditional approach

In order to model the headline-article interactions described in Section 1, we adapt the bidirectional conditional encoding architecture first proposed by [ARVB16] for stance detection of tweets.

First, the article is split into n paragraphs. Both the headline and the paragraphs are converted into their embedding representations. The headline is then processed by a Bi-LSTMh (Eq. 1). Each paragraph is then encoded by a further Bi-LSTMS1 (Eq. 2), whose initial cell states are initialized with the last states of, respectively, the forward and backward LSTMs which compose Bi-LSTMh (see Figure 2 for a representation of the forward part of the architecture). As pointed out in [ARVB16], this allows Bi-LSTMS1 to read the paragraph in a headline-specific manner.

    Hh  = Bi-LSTMh(Eh)                          (1)
    Hsi = Bi-LSTMS1(Esi)   ∀i ∈ {1, ..., n}     (2)

where Eh ∈ R^(e×H) and Esi ∈ R^(e×Si) are respectively the embedding matrices of the headline and of the ith paragraph, H and Si are respectively the headline and ith paragraph lengths, e is the embedding size, l is the hidden size, Hh ∈ R^(l×H) and Hsi ∈ R^(l×Si).

Then, each paragraph representation, conditionally encoded on the headline, is processed by another Bi-LSTMS2, conditioned on the previous paragraph. We start the paragraph-conditioned reading of the article from the bottom, as we assume the most salient information to be concentrated in the beginning (see Section 1).

    Hsi = Bi-LSTMS2(Hsi)   ∀i ∈ {1, ..., n}     (3)

resulting in a matrix Hsi ∈ R^(l×Si). We employ a self-attention mechanism similar to that of [ea16] in order to soft-select the most relevant elements of the sentence. Given the sequence of vectors {h1, ..., hS} which compose Hsi, the final representation si of the ith paragraph is obtained as follows:

    uit = tanh(Ws hit + bs)                             (4)
    αit = exp(uit^T us) / Σt exp(uit^T us)              (5)
    si  = Σt αit hit                                    (6)

where the hidden representation of the word at position t, uit, is obtained through a one-layer MLP (Eq. 4). The normalized attention weights αit are then obtained through a softmax operation (Eq. 5). Finally, si is computed as a weighted sum of all hidden states hit with the attention weights αit (Eq. 6). The sentence representations {s1, ..., sn} are aggregated using a backward LSTM, as in Figure 2. The final prediction ŷ is obtained with a softmax operation over the tagset.

Figure 2: Model based on conditional encoding (best seen in color). Networks represented with the same color share the weights. Dotted arrows represent conditional encoding. Due to lack of space, we represent only the forward part of the encoder. However, headline and paragraph encoders (the red, green and blue networks in the figure) are Bi-LSTMs.
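The self-attention of Eqs. 4-6 can be sketched in a few lines of numpy; this is a minimal re-implementation under assumed shapes, not the paper's keras code, with Ws, bs and us randomly initialized in place of learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
l, S = 8, 5                      # hidden size l and paragraph length S (assumed)
H = rng.normal(size=(S, l))      # stand-in for Bi-LSTM hidden states h_t

Ws = rng.normal(size=(l, l))     # one-layer MLP weights (Eq. 4)
bs = rng.normal(size=l)
us = rng.normal(size=l)          # learned word-level context vector

U = np.tanh(H @ Ws + bs)                       # Eq. 4: u_t = tanh(Ws h_t + bs)
scores = U @ us                                # u_t^T u_s for every position t
alpha = np.exp(scores) / np.exp(scores).sum()  # Eq. 5: softmax over positions
s = alpha @ H                                  # Eq. 6: weighted sum of h_t

print(alpha.shape, s.shape)   # prints (5,) (8,)
```

Each paragraph thus collapses to a single l-dimensional vector s, but unlike the fixed-size feature vectors of Section 2.1, the pooling weights are computed per token and conditioned on the learned context.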
3 Experiments

3.1 (Hyper-)Parameters

For the feature-based model, we downloaded the feature matrices used by [HASC17] for their best FNC-1 submission3 and selected the columns corresponding to the related samples. For the conditional model, we initialized the embedding matrix with word2vec embeddings4. Only words which occurred more than 7 times were included in the embedding matrix. Words not included in word2vec were zero-initialized. In order to avoid overfitting, we did not fine-tune the embeddings during training. The main structures of the models were implemented in keras, using Tensorflow for implementing customized layers. Refer to the appendix for the complete list of hyperparameters used to train both architectures.

3.2 Evaluation Metrics

In the FNC-1 context, a so-called FNC score was proposed for evaluation: this hierarchical evaluation metric gives 0.25 points for a correct REL/UNR classification, incremented by 0.75 points in the case of a correct AGR/DSG/DSC classification5. This was motivated by the high imbalance in favor of the UNR class. However, as our experiments only consider REL samples, the FNC score does not constitute a useful evaluation metric. Following [ea18], we use macro-averaged precision, recall and F1 score, which are less affected by the high class imbalance (Table 2).

3.3 Results and Discussion

Results of the experiments are reported in Table 4. The proposed conditional model clearly outperforms the feature-based baseline on all considered metrics, despite having considerably fewer trainable parameters.

Table 4: Macro-averaged precision, recall and F1 scores on the development and test set.

                    dev set                 test set
                    Pm     Rm     F1m      Pm     Rm     F1m
Feat-based model    .359   .350   .350     .388   .361   .367
Cond model          .685   .716   .699     .505   .503   .486

Interestingly, the feature-based model seems to generalize better to the test set, while the gap between development and test set performance for the conditional model seems to indicate overfitting.

Detailed performance on the single classes is shown in Figure 3. Thanks to the presence of features specifically designed to target refuting words, the baseline model is able to reach a Precision of 15.2% in classifying the very infrequent DSG class (7.5% of occurrences). The conditional model, which did not receive any explicit signal of the presence of negation, suffers more from this data imbalance, and reaches a Precision of 7.3% on DSG samples. On the other hand, by flattening the entire article into a fixed-size vector, the feature-based system loses the nuances in the argumentative structure of the news story. As a consequence, this system struggles to distinguish between AGR and DSC samples and tends to favor the most frequent DSC class, which receives the highest Precision and Recall scores. On the contrary, the conditional model is able to spot the subtle differences between AGR and DSC samples, reaching high Precision and satisfactory Recall for both classes despite the large class imbalance - 27.2% AGR vs. 65.2% DSC samples.

Figure 3: Confusion Matrices of the predictions of both the feature-based and the conditional model on the test set.

4 Conclusions

Given the results discussed in the previous Section, we believe the strategy of modeling FNC-1 as a Cross-Level Stance Detection problem is promising. In future work, we will carry out a detailed qualitative analysis to test the extent to which our conditional model is able to capture the narrative structures of articles and their interactions with the headlines. The generalizability of such an architecture to other domains can be tested on other publicly available corpora, such as the recently released ARC dataset by [ea18].

Acknowledgments

The first author (CC) would like to thank the Siemens Machine Intelligence Group (CT RDA BAM MIC-DE, Munich) and the NERC DREAM CDT (grant no. 1945246) for partially funding this work. The third author (NC) is grateful for support from the UK EPSRC (grant no. EP/MOO5089/1).

3 https://drive.google.com/open?id=0B0-muIdcdTp7UWVyU0duSDRUd3c
4 https://code.google.com/archive/p/word2vec/
5 https://github.com/FakeNewsChallenge/fnc-1-baseline/blob/master/utils/score.py

References

[ARVB16] Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, and Kalina Bontcheva. Stance detection with bidirectional conditional encoding. In Proceedings of EMNLP 2016, pages 876-885, 2016.

[BSP17] Sean Baird, Doug Sibley, and Yuxi Pan. Talos targets disinformation with fake news challenge victory. https://blog.talosintelligence.com/2017/06/, 2017.

[ea16] Zichao Yang et al. Hierarchical attention networks for document classification. In Proceedings of NAACL-HLT 2016, pages 1480-1489, 2016.

[ea18] Andreas Hanselowski et al. A retrospective analysis of the fake news challenge stance-detection task. In Proceedings of COLING 2018, pages 1859-1874, 2018.

[FV16] William Ferreira and Andreas Vlachos. Emergent: a novel data-set for stance classification. In Proceedings of NAACL-HLT 2016, pages 1163-1168, 2016.

[HASC17] Andreas Hanselowski, PVS Avinesh, Benjamin Schiller, and Felix Caspelherr. Description of the system developed by team Athene in the FNC-1. Technical report, 2017.

[JPN14] David Jurgens, Mohammad Taher Pilehvar, and Roberto Navigli. SemEval-2014 task 3: Cross-level semantic similarity. In Proceedings of SemEval 2014, pages 17-26, 2014.

[Sca00] Christopher Scanlan. Reporting and writing: Basics for the 21st century. Harcourt College Publishers, 2000.

[Sil15] Craig Silverman. Lies, damn lies and viral content. Columbia University, 2015.

[ZAB+18] Arkaitz Zubiaga, Ahmet Aker, Kalina Bontcheva, Maria Liakata, and Rob Procter. Detection and resolution of rumours in social media: A survey. ACM Computing Surveys (CSUR), 51(2):32, 2018.