<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Hybrid Recognition System for Check-worthy Claims Using Heuristics and Supervised Learning</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Chaoyuan</forename><surname>Zuo</surname></persName>
							<email>chzuo@cs.stonybrook.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ayla</forename><forename type="middle">Ida</forename><surname>Karakas</surname></persName>
							<email>ayla.karakas@stonybrook.edu</email>
							<affiliation key="aff1">
								<orgName type="department">Department of Linguistics</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Ritwik</forename><surname>Banerjee</surname></persName>
							<email>rbanerjee@cs.stonybrook.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff2">
								<orgName type="institution">Stony Brook University</orgName>
								<address>
									<postCode>11794</postCode>
<settlement>Stony Brook</settlement>
									<region>New York</region>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Hybrid Recognition System for Check-worthy Claims Using Heuristics and Supervised Learning</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">B9010D894C2760DD4874DF6E1B15E88E</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T02:35+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Check-worthiness</term>
					<term>Multi-layer Perceptron</term>
					<term>Heuristics</term>
					<term>Feature Selection</term>
					<term>Stylometry</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In recent years, the speed at which information disseminates has received an alarming boost from the pervasive usage of social media. To the detriment of political and social stability, this has also made it easier to quickly spread false claims. Due to the sheer volume of information, manual fact-checking seems infeasible, and as a result, computational approaches have been recently explored for automated fact-checking. In spite of the recent advancements in this direction, the critical step of recognizing and prioritizing statements worth fact-checking has received little attention. In this paper, we propose a hybrid approach that combines simple heuristics with supervised machine learning to identify claims made in political debates and speeches, and provide a mechanism to rank them in terms of their "check-worthiness". The viability of our method is demonstrated by evaluations on the English language dataset as part of the Check-worthiness task of the CLEF-2018 Fact Checking Lab.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>It is no secret that we live in an age of ubiquitous web and social media. For the most part, any Internet user readily acquires the latent power of civilian commentary and journalism <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b9">10]</ref>. Consequently, information available on the web now carries the potential to propagate through the public domain with unprecedented speed and reach. The ordinary Internet user, however, contends with an overwhelming amount of information, which makes the task of determining the accuracy and integrity of claims all the more onerous. Moreover, users tend to seek out information that confirms their existing beliefs <ref type="bibr" target="#b17">[18,</ref><ref type="bibr" target="#b33">34]</ref>. The confluence of vast amounts of information and such confirmation bias can thus create a society where unverified information runs amok masquerading as fact. While correcting confirmation biases at a social scale may be extremely challenging and even controversial, the spread of misinformation can be mitigated by focusing only on curating the claims.</p><p>Comprehensive manual fact-checking is highly tedious and, in light of the sheer volume of information, infeasible. To overcome this hurdle, several approaches to automated fact-checking have been proposed in the nascent field of computational journalism <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b7">8]</ref>. Some prior work computed the semantic similarity between claims <ref type="bibr" target="#b3">[4,</ref><ref type="bibr" target="#b12">13]</ref>, while others framed fact-checking as a question-answering task <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b32">33,</ref><ref type="bibr" target="#b35">36]</ref>. Both approaches need to extract statements to be fact-checked before the actual verification process can begin. 
ClaimBuster <ref type="bibr" target="#b11">[12]</ref> was the first fact-checking system that assigned to each sentence a check-worthiness score between 0 and 1. Subsequently, a multi-class classification approach with fewer features was explored to specifically identify check-worthy claims, but it suffered from comparatively lower precision <ref type="bibr" target="#b27">[28]</ref>. Outside of this small body of work, the preliminary step of identifying check-worthy claims has received little attention. Gencheva et al. <ref type="bibr" target="#b8">[9]</ref> were the first to develop a publicly available dataset for this task. Their annotations were obtained from nine fact-checking websites. They also used a significantly richer feature set. In line with the observations made by prior work regarding the extent of overlap in lexical and shallow syntactic features <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b19">20]</ref>, we use an even richer set of features derived from word embeddings and deep syntactic structures.</p><p>In this work, our focus is on recognizing "check-worthy" statements. Accurate identification of such statements will benefit the fact-checking and verification processes that follow, independent of the specific techniques used therein. We use the task formulation, data, and evaluation framework provided by the CLEF-2018 Lab on Automatic Identification and Verification of Claims in Political Debates <ref type="bibr" target="#b23">[24]</ref> as part of their first task -Check-Worthiness <ref type="bibr" target="#b0">[1]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Task, Data, and Evaluation Framework</head><p>The CLEF 2018 Fact Checking Lab designed two tasks that, when put together, form the complete fact-checking pipeline. In this work, however, we focus exclusively on the first.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">The Task: Check-Worthiness</head><p>The first task -check-worthiness -was defined by the CLEF 2018 Fact Checking Lab as follows:</p><p>Predict which claim in a political debate should be prioritized for fact-checking. In particular, given a debate, the goal is to produce a ranked list of its sentences based on their worthiness for fact checking <ref type="bibr" target="#b8">[9]</ref>.</p><p>The goal of this task is to automatically recognize claims worth checking, and present them in order of priority (i.e., as a ranked list of claims) to journalists or even ordinary Internet and social media users. The ranking is attained in terms of a check-worthiness score. This approach helps the recipient tackle the problem of information overload and instead focus directly on the most important statements. The output, therefore, can be fed to an automated fact-checker or be used in a manual pursuit of verification. Either way, it can raise the awareness of individual users and stymie the dissemination of false claims in social media.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Data</head><p>Given the alleged impact of disinformation and 'fake news' on the 2016 US presidential election, and the controversy surrounding it, any data pertaining to this election cycle is extremely relevant in terms of fact-checking endeavors having a positive social and political impact in the future. As such, a political debate dataset was provided in English and Arabic. Since our methodology involves heuristics that rely on linguistic insight, we used the English language dataset. The training data comprised three political debates. Each debate was split into sentences, and each sentence was associated with its speaker and annotated by experts as check-worthy or not (labeled 1 and 0, respectively). This data contained a total of 3,989 sentences, of which only 94 were labeled as check-worthy -a staggering imbalance with only 2.36% of the dataset bearing the label of the target class. A few sample sentences from this training data, along with their speakers and labels, are presented in Table <ref type="table" target="#tab_0">1</ref>.</p><p>The test data was a collection of two political debates and five political speeches.<ref type="foot" target="#foot_0">3</ref> The total numbers of sentences in these two categories (Debate and Speech) were 2,815 and 2,064, respectively.</p><p>In this work, we did not employ any external knowledge other than domain-independent language resources such as parsers and lexicons. Instead, we focused on extracting linguistic features indicative of check-worthiness.</p></div>
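The imbalance figure quoted above follows directly from the sentence counts; as a trivial illustration:

```python
total_sentences = 3989   # sentences in the training data
check_worthy = 94        # sentences labeled as check-worthy

# Fraction of the training data bearing the target-class label.
ratio = check_worthy / total_sentences
print(f"{ratio:.2%}")  # 2.36%
```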
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Evaluation Framework</head><p>The evaluation was done on the test data provided as part of the task. This data was released to the participants much later, with the gold standard labels for its sentences withheld. Once we selected the models, we ran them on the entire test data, and used average precision to measure the quality of the output ranking. Average precision is defined as</p><formula xml:id="formula_0">AP = (1 / n_chk) ∑_{k=1}^{n} Prec(k) · δ(k)</formula><p>where n_chk is the number of check-worthy sentences, n is the total number of sentences, Prec(k) is the precision at cut-off k in the list of sentences ranked by check-worthiness, and δ(k) is the indicator function equaling 1 if the sentence at rank k is check-worthy, and 0 otherwise. The primary metric used by the Fact Checking Lab <ref type="bibr" target="#b23">[24]</ref> for the check-worthiness task was mean average precision (MAP), defined simply as the mean of the average precisions over all queries.</p></div>
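For concreteness, average precision can be computed from a list of binary labels (1 = check-worthy) ordered by decreasing check-worthiness score. The following Python sketch is our own illustration, not the organizers' evaluation script:

```python
def average_precision(ranked_labels):
    """AP over labels (1 = check-worthy, 0 = not), ordered by
    decreasing check-worthiness score."""
    n_chk = sum(ranked_labels)
    if n_chk == 0:
        return 0.0
    hits, ap = 0, 0.0
    for k, label in enumerate(ranked_labels, start=1):
        if label == 1:          # delta(k) = 1
            hits += 1
            ap += hits / k      # Prec(k) at a relevant rank
    return ap / n_chk

# Check-worthy sentences ranked 1st and 3rd out of four:
print(average_precision([1, 0, 1, 0]))  # (1/1 + 2/3) / 2 ≈ 0.833
```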
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methodology</head><p>Our methodology is a hybrid of rule-based heuristics and supervised classification. The motivation for this approach was to test the extent to which check-worthiness can be determined based on language constructs without relying on encyclopedic knowledge. Moreover, our aim was to develop an approach that was not specific to the domain of politics. In this section, we describe the data processing, feature selection, and heuristics involved in building our classification models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Data Processing</head><p>The first step of our processing involved normalizing the speaker names. We did this by adding speaker-specific rules in order to correctly match the speakers extracted from various sentences to the actual speakers associated with the sentences. For example, speakers in the test data included "Hillary Clinton (D-NY)", "Former Secretary of State, Presidential Candidate", and simply "Clinton". These, of course, all refer to the same speaker.</p><p>Next, we noted that the training data consisted only of political debates where multiple entities (two political candidates, a moderator, and the occasional audience reaction) engage in a conversation. Due to the very nature of debates, their rhetorical structure is different from that of speeches delivered by a single speaker. The test data, however, also included political speeches. Therefore, we extracted all sentences attributed to a speaker to create sub-datasets. This formed a new training sample, which we then used to train models to identify check-worthy sentences from speeches<ref type="foot" target="#foot_1">4</ref>. To identify check-worthy sentences from political debates, we used the original training data to train the models.</p></div>
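Such normalization can be implemented as a small list of hand-written rules. The patterns below are a hypothetical sketch of the idea, not our exact rules:

```python
import re

# Hypothetical canonicalization rules: any raw name matching the
# pattern is mapped to one canonical speaker identifier.
SPEAKER_RULES = [
    (re.compile(r"\bclinton\b", re.I), "CLINTON"),
    (re.compile(r"\btrump\b", re.I), "TRUMP"),
    (re.compile(r"former secretary of state", re.I), "CLINTON"),
]

def normalize_speaker(raw_name):
    for pattern, canonical in SPEAKER_RULES:
        if pattern.search(raw_name):
            return canonical
    return raw_name.strip().upper()  # fall back to an upper-cased name

print(normalize_speaker("Hillary Clinton (D-NY)"))  # CLINTON
```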
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Feature Design and Selection</head><p>For both speeches and debates, we extracted a set of syntactic and semantic features to obtain a consistent knowledge representation, and converted every sentence into a vector in an abstract semantic space. The details of these features and the resultant feature vector are discussed below.</p><p>Sentence Embedding: Traditional supervised learning in natural language processing tasks has used vector spaces where dimensions correspond to words (or other linguistic units). This, however, is not in accordance with the well-known distributional hypothesis in linguistics: words that occur in similar contexts tend to have similar meanings <ref type="bibr" target="#b10">[11]</ref>. This necessitates the representation of sentences in a low-dimensional semantic space where similar meanings are closer together.</p><p>Modeling sentence meanings in a low-dimensional space is a topic of extensive research by itself, and beyond the scope of this work. Instead, we adopted a simple method that leverages word embeddings. We used the 300-dimensional pretrained Google News word embeddings<ref type="foot" target="#foot_2">5</ref> to represent each word as a vector <ref type="bibr" target="#b22">[23]</ref>, and took the arithmetic mean of all such vectors corresponding to the words in a sentence to obtain an abstract sentence embedding.</p><p>Lexical Features: From the training data, we removed stopwords and stemmed the remaining terms using the Snowball stemmer <ref type="bibr" target="#b29">[30]</ref>.</p><p>Stylometric Features: Stylometry, the statistical analysis of variations in linguistic constructs, has been used with great success in distinguishing deceptive from truthful language <ref type="bibr" target="#b5">[6,</ref><ref type="bibr" target="#b25">26]</ref>, and objective from subjective remarks <ref type="bibr" target="#b18">[19,</ref><ref type="bibr" target="#b20">21]</ref>. 
Accordingly, we surmised that capturing stylistic variation will aid in the identification of check-worthy sentences as well, especially since they are typically expected to appear factual and objective.</p><p>In order to obtain shallow syntactic features from each sentence, we extracted the part-of-speech (POS) tags, the total number of tokens, and the number of tokens in past, present, and future tenses. We were able to infer the tense from the POS tags (e.g., both vbd and vbz are verb tags, but they indicate past and present tense, respectively). We also extracted the number of negations in each sentence. More complex structural patterns of language, however, can only be captured by deep syntactic features. For that, we generated the constituency parse trees of all sentences, and selected clause-level and phrase-level tags. The number of words within the scope of each tag was included as the corresponding feature value. These tags, as defined in the Penn Treebank <ref type="bibr" target="#b1">[2]</ref>, are shown in Table <ref type="table" target="#tab_1">2</ref>. In addition to stylometry, the motivation behind using the number of words was to obtain a representation of the amount of information available under specific syntactic structures. Fig. <ref type="figure" target="#fig_0">1</ref> illustrates this point with the parse tree of a sentence from the training data that was labeled as check-worthy.</p><p>Semantic Features: We used the Stanford named entity recognizer (NER) <ref type="bibr" target="#b6">[7]</ref> to extract the number of named entities in a sentence. Additionally, we appended an extra feature for named entities of the type person.</p><p>Affective Features: We used the TextBlob <ref type="bibr" target="#b21">[22]</ref> library to train a naïve Bayes classifier on the pioneering movie review corpus for sentiment analysis <ref type="bibr" target="#b26">[27]</ref>, and thereby obtained a sentiment score for each sentence. 
In addition to overt sentiment, we also used the connotation of words in a sentence as features. For this, we employed Connotation WordNet <ref type="bibr" target="#b15">[16]</ref>, which assigns a (positive or negative) connotation score to each word. For every sentence, we queried this lexicon and retrieved the connotation score of its words. Finally, the overall connotation of the sentence was taken to be simply the mean of these scores.</p><p>We also utilized lexicons that contain information about the subjective or objective nature of words <ref type="bibr" target="#b34">[35]</ref>, whether they directly indicate or are typically associated with language that indicates bias <ref type="bibr" target="#b30">[31]</ref>, and whether they are typically used to voice positive or negative opinions <ref type="bibr" target="#b14">[15]</ref>. For every sentence, we extracted the number of words in these categories (as defined by their scores in these lexicons), thus forming four new features: (i) subjectivity, (ii) direct bias, (iii) associated bias, and (iv) opinion.</p><p>Metadata Features: In addition to the syntactic and semantic features described above, we also included three binary non-linguistic features extracted from the training sample, indicating whether or not (i) the speaker's opponent is mentioned, (ii) the speaker is the anchor/moderator, or (iii) the sentence is immediately followed by an intense reaction. The third feature is encoded in the training data as a 'system' reaction, as shown by the last sentence in Table <ref type="table" target="#tab_0">1</ref>.</p><p>Discourse Features: All the above features were extracted without regard to the category (i.e., Debate and Speech). Since debates involve an interactive discourse structure where sentences are often formed as an immediate response to statements made by others, we also include segment-based features from the debates. Following Gencheva et al. <ref type="bibr" target="#b8">[9]</ref>, we regard a "segment" as the maximal set of consecutive sentences by the same speaker. As features, we include the relative position of a sentence within its segment, and the number of sentences in the previous, current, and subsequent segments.</p></div>
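Among the features above, the averaged sentence embedding is simply the arithmetic mean of the word vectors. The sketch below uses toy 4-dimensional vectors standing in for the 300-dimensional Google News embeddings:

```python
import numpy as np

# Toy embedding table standing in for the 300-dimensional
# Google News word vectors.
word_vectors = {
    "jobs":    np.array([0.1, 0.3, -0.2, 0.5]),
    "fleeing": np.array([0.4, -0.1, 0.0, 0.2]),
    "country": np.array([0.0, 0.2, 0.1, -0.3]),
}

def sentence_embedding(tokens, vectors, dim=4):
    """Arithmetic mean of the vectors of in-vocabulary tokens."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)  # fallback for fully out-of-vocabulary input
    return np.mean(known, axis=0)

emb = sentence_embedding("our jobs are fleeing the country".split(), word_vectors)
print(emb.shape)  # (4,)
```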
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Feature Selection</head><p>The feature extraction processes described above yielded a very high-dimensional feature space. High-dimensional spaces, however, quickly lead to a decrease in the predictive power of models <ref type="bibr" target="#b31">[32]</ref>. Moreover, given the extreme class imbalance, classification in such a space is likely to ignore important features indicative of the minority class (in this case, the 'check-worthy' sentences).</p><p>To reduce the dimensionality, we applied a feature selection module using the scikit-learn library <ref type="bibr" target="#b28">[29]</ref>. As the first step, univariate feature selection was performed, and the 2,000 best features were selected based on the χ²-test. Next, armed with the observation that linear predictive models with L1 loss yield sparse solutions and encourage vanishing coefficients for weakly correlated features <ref type="bibr" target="#b24">[25]</ref>, we used a support vector machine (SVM) model with a linear kernel and L1 regularization to further remove the relatively unimportant features. This step was first done on the entire training data, and then combined with repeated undersampling (without replacement) for the majority class. Each iteration of this undersampling process resulted in a small but balanced training sample. An L1-regularized SVM learner was trained on every sample generated in this manner, and features with vanishing coefficients were discarded. The cumulative effect of these feature selection steps was a reduction of the feature space to 2,655 and 2,404 dimensions for identification of check-worthy claims from debates and speeches, respectively.</p></div>
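The two selection steps can be sketched with scikit-learn on synthetic data. The parameter values here (k = 100, C = 0.5) are illustrative only, and the repeated-undersampling loop is omitted:

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel, SelectKBest, chi2
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.random((200, 500))                 # stand-in for the real feature matrix
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # stand-in labels with some signal

# Step 1: univariate selection, keeping the k best features by
# chi-squared score (the actual pipeline keeps 2,000; k=100 here).
X_uni = SelectKBest(chi2, k=100).fit_transform(X, y)

# Step 2: an L1-regularized linear SVM; features whose coefficients
# vanish are discarded.
svm = LinearSVC(C=0.5, penalty="l1", dual=False, max_iter=10000).fit(X_uni, y)
X_sel = SelectFromModel(svm, prefit=True).transform(X_uni)
print(X_sel.shape)
```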
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Heuristics</head><p>Certain heuristics were introduced to override the scores assigned by the classification models. These rules differed slightly based on (i) the category, i.e., speech or debate, and (ii) whether or not the 'strict' heuristics were deployed. The strictness flag was introduced to control the threshold sentence size. When active, it would tend to discard more sentences.</p><p>These rules are specified in Algorithm 1. One particular rule required the identification of subjects in a sentence. To extract this information, we generated dependency parse trees of the sentences and counted the number of times any of the following dependency labels appeared: nsubj, csubj, nsubjpass, csubjpass, or xsubj. The first two indicate nominal and clausal subjects, respectively. The next two indicate nominal and clausal subjects in a passive clause, and the last label denotes a controlling subject, which relates an open clausal complement to its external clause.</p></div>
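These heuristics can be rendered in Python as follows; the sentence representation (a plain dict) and its field names are our own, and the dependency labels in the demo are a hypothetical parse:

```python
# Dependency labels that mark a subject, as listed above.
SUBJECT_LABELS = {"nsubj", "csubj", "nsubjpass", "csubjpass", "xsubj"}

EPSILON = 1e-8  # a score this low effectively drops the sentence from the ranking

def count_subjects(dep_labels):
    """Count tokens whose dependency label marks a subject."""
    return sum(1 for label in dep_labels if label in SUBJECT_LABELS)

def apply_heuristics(score, category, strict, sentence):
    """Override a classifier's check-worthiness score per Algorithm 1."""
    if category == "speech":
        min_tokens = 10 if strict else 8
    else:  # debate
        min_tokens = 7 if strict else 5

    if sentence["speaker"] == "system":
        return EPSILON
    if len(sentence["tokens"]) < min_tokens:
        return EPSILON
    if "thank you" in sentence["text"].lower():
        return EPSILON
    if sentence["n_subjects"] < 1:
        if category == "speech" or "?" in sentence["text"]:
            return EPSILON
    return score

# Hypothetical dependency labels for "Our jobs are fleeing the country."
deps = ["poss", "nsubj", "aux", "ROOT", "det", "dobj", "punct"]
s = {"speaker": "trump", "text": "Our jobs are fleeing the country.",
     "tokens": "Our jobs are fleeing the country .".split(),
     "n_subjects": count_subjects(deps)}
print(apply_heuristics(0.9, "debate", False, s))  # 0.9 (no rule fires)
```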
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Models</head><p>Our experiments comprised two supervised learning algorithms: support vector machines (SVM) and multilayer perceptrons (MLP). Additionally, we built an ensemble model combining the two. In this section, we provide a description of these three models, along with their training processes.</p><p>For reasons described in Sec. 3.2, the SVM utilized a linear kernel with L1 regularization for feature selection. However, due to the propensity of the L1 loss function to miss optimal solutions, we used L2 loss in building the final model after completing feature selection. Our second model was the MLP. Here, we used two hidden layers with 100 and 8 units, respectively. We used the hyperbolic tangent (tanh) as our activation function since it achieved better results than rectified linear units (ReLU). Stochastic optimization was done with Adam <ref type="bibr" target="#b16">[17]</ref>. To avoid overfitting, we used L2 regularization in both SVM and MLP. Third, we built an ensemble model that combines SVM and MLP (without the strict heuristics). In this model, the final output score was obtained by normalizing the SVM and MLP scores (by standard deviation) and then averaging them.</p><p>For all three models, class imbalance was a hindrance during the training process. To overcome it, we used ADASYN <ref type="bibr" target="#b13">[14]</ref>, an adaptive synthetic sampling algorithm for imbalanced learning. For model selection, we used 3-fold cross-validation for debates, using two files for training and the remaining one for testing, to evaluate model performance and tune parameters. For speeches, we split the training sample into two halves (one file in each) for 2-fold cross-validation. The evaluation script was provided by the task organizers, with mean average precision (MAP) being the primary evaluation metric.</p><p>MLP without the strict heuristics demonstrated the best results during the training process, so this was submitted for the primary run. For the two contrastive runs, we submitted (i) MLP with strict heuristics, and (ii) the ensemble model without the strict heuristics.</p></div>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 3.</head><label>3</label><figDesc>Results for the Check-Worthiness task of our submitted models: MLP was the primary submission, along with two contrastive runs, MLPstr and ENS (MLP with strict heuristics and the ensemble model, respectively). MLPnone shows the results of the MLP without any heuristics being applied. The primary evaluation metric was mean avg. precision (MAP). The mean reciprocal rank (MRR), mean R-precision (MRP), and mean precision at k (MP@k) are also shown.</figDesc><table><row><cell>MAP</cell><cell>MRR</cell><cell>MRP</cell><cell>MP@1</cell><cell>MP@3</cell><cell>MP@5</cell><cell>MP@10</cell><cell>MP@20</cell><cell>MP@50</cell></row></table></figure>
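The MLP configuration above maps directly onto scikit-learn. The sketch below is illustrative, on toy data; the ADASYN resampling step (available as ADASYN in the imbalanced-learn package) is omitted here:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 20))              # stand-in feature vectors
y = (X[:, 0] + X[:, 1] > 1.5).astype(int)   # imbalanced toy labels

# Two hidden layers with 100 and 8 units, tanh activation, Adam
# optimizer, and L2 regularization via `alpha`.
clf = MLPClassifier(hidden_layer_sizes=(100, 8), activation="tanh",
                    solver="adam", alpha=1e-3, max_iter=500,
                    random_state=0).fit(X, y)

# Probabilities of the positive class serve as check-worthiness
# scores for ranking the sentences.
scores = clf.predict_proba(X)[:, 1]
print(scores.shape)  # (300,)
```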
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Results and Analysis</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Empirical Results</head><p>The detailed performance of all three submissions we made is shown in Table <ref type="table">3</ref>. Even though MLP yielded the best training results without the strict heuristics, MLPstr performed demonstrably better across multiple metrics on the test data. Our third model, the ensemble classifier, performed poorly in general compared to both MLP models. It did, however, achieve slightly better mean R-precision and mean precision at higher cutoffs (k = 10 and 50).</p><p>Without the inclusion of any heuristics, the performance of MLP dropped significantly. This was expected, since the heuristics were designed to address the flaws of the classifiers. This model was not among the submissions, but we include it here for comparison. The difference between MLP and MLPnone quantifies the extent to which the rules help the supervised learners. Next, in Table <ref type="table" target="#tab_4">4</ref>, we present the comparison between the results obtained by all participants. This comparison was done only on the primary submission from each team. Our MLP model without the strict heuristics achieved the best MAP, MRR, and MRP scores. Further, it also outperformed the others in terms of correctly placing the check-worthy sentences at the very top of the ranked output list, as demonstrated by the mean precision at low values (k = 1 and 3).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Qualitative Analysis</head><p>Identifying check-worthy sentences is a difficult and novel task, and even the best model suffered from misclassification errors. Upon analyzing such mistakes made by the MLP models, we were able to discern a few reasons.</p><p>First, tense plays a logical role in check-worthiness, since future actions cannot be verified. However, the part-of-speech tagging often confuses the future tense with the present continuous (e.g., "We're cutting taxes."). Second, we observed that anecdotal stories are often ranked high as check-worthy, even though they are not. These sentences are usually complex, with a lot of content, which makes it easy for the model to conflate them with other complex sentences pertaining to real events deemed check-worthy. Third, the presence of duplicate sentences in the data means that a misclassification gets amplified, while the presence of very similar sentences with different labels likely makes the feature selection stage discard potentially useful features.</p><p>At a more abstract level, rhetorical figures of speech play a critical role. They often break the structures associated with standard sentence formation. Several sentences that were misclassified exhibited constructs such as scesis onomaton, where words or phrases with nearly equivalent meaning are repeated. We conjecture that this makes the model falsely believe that there is more informational content in the sentence. Such figures of speech become even harder to handle when they occur across multiple speakers in debates. The conversational aspect of debates also causes another problem: quite a few sentences are short, and in isolation, would perhaps not be check-worthy. However, as a response to things mentioned earlier in the debate, they are.</p><p>Another complex issue leading to misclassification is the use of sentence fragments. 
This is used sparingly for dramatic effect in literature, but was seen with alarming frequency in the political debates due to the prevalence of ill-formed or partly-formed sentences stopping and then giving way to another sentence. In some cases, the fragments are portions of the sentence that the speaker repeats. An example of such a fragment is the sentence "Ambassador Stevens -Ambassador Stevens sent 600 requests for help.", where the phrase "Ambassador Stevens" is repeated.</p><p>A proper approach to deal with these hurdles is a complex matter in and of itself. We believe that our features are better suited for written language than speech or debate transcripts. In the presence of significantly more labeled data for check-worthiness, ablation studies that remove such sentences could provide empirical evidence for this intuition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion and Future Work</head><p>We developed a hybrid system that combines a few rules with supervised learning to detect check-worthy sentences in political debates and speeches. To tackle the severity of class imbalance, our development also included a sophisticated feature selection process and special sampling methods. Our primary model achieved the best results among all participants over multiple performance metrics.</p><p>This work opens up several intriguing possibilities for future research in the field of fact-checking. First, we intend to study in greater detail the linguistic forms of informational content. Shallow syntax has been explored to understand this aspect of language in sociolinguistics, and some work has even looked into deep syntactic features. This approach has, however, not yet been applied to identifying check-worthy sentences. Furthermore, more complex neural network structures need to be thoroughly investigated. Along this line, we will be investigating deep learning models with feedback control. Stringent and focused work on these issues will empower journalists and citizens alike to be better informed and more cognizant of false claims permeating news and social media. To that end, we also need complementary advances in related areas like natural language querying, crowdsourcing, source identification, and social network analysis.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Fig. 1.</head><label>1</label><figDesc>Fig. 1. The constituency parse tree of a check-worthy sentence from the training data: "President Bush said we would leave Iraq at the end of 2011." The size of the subtree under the subordinate clause (sbar) is representative of the amount of information provided about the action 'said' undertaken by the entity 'President Bush'.</figDesc><graphic coords="6,193.75,115.84,224.78,173.38" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Labeled sentence examples from political debates provided as training data. Check-worthy sentences are labeled 1, and others are labeled 0. Audience reaction and other background noise is encoded as "SYSTEM"-generated.</figDesc><table><row><cell>Speaker</cell><cell>Sentence</cell><cell>Label</cell></row><row><cell>HOLT</cell><cell>I'm Lester Holt, anchor of "NBC Nightly News."</cell><cell>0</cell></row><row><cell>HOLT</cell><cell>I want to welcome you to the first presidential debate.</cell><cell>0</cell></row><row><cell>TRUMP</cell><cell>Our jobs are fleeing the country.</cell><cell>0</cell></row><row><cell>TRUMP</cell><cell>Thousands of jobs leaving Michigan, leaving Ohio.</cell><cell>1</cell></row><row><cell>CLINTON</cell><cell>Donald thinks that climate change is a hoax</cell><cell>1</cell></row><row><cell></cell><cell>perpetrated by the Chinese.</cell><cell></cell></row><row><cell>SYSTEM</cell><cell>(applause)</cell><cell>0</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Constituent tags from the Penn Treebank.</figDesc><table><row><cell>Clause-Level</cell><cell>SBAR, SBARQ, SINV, SQ, S</cell></row><row><cell>Phrase-Level</cell><cell>ADJP, ADVP, CONJP, FRAG, INTJ, LST, NAC, NP, NX, PP, PRN, PRT, QP, RRC, UCP, VP, WHADJP, WHADVP, WHNP, WHPP, X</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head></head><label></label><figDesc>Algorithm 1 Heuristics for assigning the check-worthiness score w(•) to sentences.</figDesc><table><row><cell>Require: category ∈ {speech, debate}, strict_mode ∈ {true, false}, sentence S.</cell></row><row><cell>min_token_count ← 0</cell></row><row><cell>if category is speech then</cell></row><row><cell>    if strict_mode then min_token_count ← 10 else min_token_count ← 8 end if</cell></row><row><cell>else</cell></row><row><cell>    if strict_mode then min_token_count ← 7 else min_token_count ← 5 end if</cell></row><row><cell>end if</cell></row><row><cell>if speaker(S) is system then w(S) ← 10^-8 end if</cell></row><row><cell>if number_of_tokens(S) &lt; min_token_count then w(S) ← 10^-8 end if</cell></row><row><cell>if S contains "thank you" then w(S) ← 10^-8 end if</cell></row><row><cell>if number_of_subjects(S) &lt; 1 then</cell></row><row><cell>    if category is speech then w(S) ← 10^-8</cell></row><row><cell>    else if S contains "?" then w(S) ← 10^-8 end if</cell></row><row><cell>end if</cell></row></table></figure>
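The heuristics of Algorithm 1 translate directly into a short scoring function. The sketch below assumes a sentence is represented as a dict with `speaker`, `tokens`, `text`, and `num_subjects` fields; these names, the default score of 1.0, and the dict representation are illustrative choices, not the paper's actual data structures.

```python
EPSILON = 1e-8  # near-zero score for sentences ruled out by the heuristics

def heuristic_score(sentence, category, strict_mode, default_score=1.0):
    """Assign a check-worthiness score w(S) following Algorithm 1.

    `sentence` is assumed to be a dict with keys 'speaker', 'tokens',
    'text', and 'num_subjects' (illustrative field names).
    """
    assert category in ("speech", "debate")

    # Minimum sentence length depends on the category and strictness.
    if category == "speech":
        min_token_count = 10 if strict_mode else 8
    else:
        min_token_count = 7 if strict_mode else 5

    w = default_score
    # Moderator noise / audience reactions are never check-worthy.
    if sentence["speaker"].lower() == "system":
        w = EPSILON
    # Very short sentences rarely carry verifiable claims.
    if len(sentence["tokens"]) < min_token_count:
        w = EPSILON
    # Pleasantries are not claims.
    if "thank you" in sentence["text"].lower():
        w = EPSILON
    # Subject-less sentences: always ruled out in speeches,
    # ruled out in debates only if they are questions.
    if sentence["num_subjects"] < 1:
        if category == "speech":
            w = EPSILON
        elif "?" in sentence["text"]:
            w = EPSILON
    return w
```

For example, an audience "(applause)" line attributed to SYSTEM scores 10^-8, while a substantive debate claim of sufficient length keeps the default score and is passed on to the supervised model.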
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 4 .</head><label>4</label><figDesc>Results from the primary submissions of all participants. We participated under the name Prise de Fer. The best results for each metric are shown in bold.</figDesc><table><row><cell>TEAM</cell><cell>MAP</cell><cell>MRR</cell><cell>MRP</cell><cell>MP@1</cell><cell>MP@3</cell><cell>MP@5</cell><cell>MP@10</cell><cell>MP@20</cell><cell>MP@50</cell></row><row><cell>Prise de Fer</cell><cell>0.1332</cell><cell>0.4965</cell><cell>0.1352</cell><cell>0.4286</cell><cell>0.2857</cell><cell>0.2000</cell><cell>0.1429</cell><cell>0.1571</cell><cell>0.1200</cell></row><row><cell>Copenhagen</cell><cell>0.1152</cell><cell>0.3159</cell><cell>0.1100</cell><cell>0.1429</cell><cell>0.1429</cell><cell>0.1143</cell><cell>0.1286</cell><cell>0.1286</cell><cell>0.1257</cell></row><row><cell>UPV-INAOE</cell><cell>0.1130</cell><cell>0.4615</cell><cell>0.1315</cell><cell>0.2857</cell><cell>0.2381</cell><cell>0.3143</cell><cell>0.2286</cell><cell>0.1214</cell><cell>0.0866</cell></row><row><cell>bigIR</cell><cell>0.1120</cell><cell>0.2621</cell><cell>0.1165</cell><cell>0.0000</cell><cell>0.1429</cell><cell>0.1143</cell><cell>0.1143</cell><cell>0.1000</cell><cell>0.1114</cell></row><row><cell>fragarach</cell><cell>0.0812</cell><cell>0.4477</cell><cell>0.1217</cell><cell>0.2857</cell><cell>0.1905</cell><cell>0.2000</cell><cell>0.1571</cell><cell>0.1071</cell><cell>0.0743</cell></row><row><cell>blue</cell><cell>0.0801</cell><cell>0.2459</cell><cell>0.0576</cell><cell>0.1429</cell><cell>0.0952</cell><cell>0.0571</cell><cell>0.0571</cell><cell>0.0857</cell><cell>0.0600</cell></row><row><cell>RNCC</cell><cell>0.0632</cell><cell>0.3755</cell><cell>0.0639</cell><cell>0.2857</cell><cell>0.1429</cell><cell>0.1143</cell><cell>0.0571</cell><cell>0.0571</cell><cell>0.0486</cell></row></table></figure>
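The ranking metrics reported in Table 4 (MAP, MRR, and precision at cutoff k) have standard definitions that can be sketched compactly. This is an illustrative implementation of the generic metrics over a single ranked list of binary relevance labels; the lab's official scorer may differ in details such as averaging across files.

```python
def average_precision(ranked_labels):
    """AP over a ranked list of 0/1 relevance labels:
    mean of precision@i taken at each relevant rank i."""
    hits, score = 0, 0.0
    for i, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            score += hits / i
    return score / hits if hits else 0.0

def reciprocal_rank(ranked_labels):
    """1 / rank of the first relevant item (0.0 if none)."""
    for i, rel in enumerate(ranked_labels, start=1):
        if rel:
            return 1.0 / i
    return 0.0

def precision_at_k(ranked_labels, k):
    """Fraction of the top-k items that are relevant."""
    return sum(ranked_labels[:k]) / k
```

MAP and MRR as reported in the table are then the means of these per-file values over the seven test files.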
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0">The lab task provided all seven files together, without this categorization into speeches and debates. We, however, chose to treat these differently since language use is very different in these two scenarios: debates consist of the interactive statements made by the candidates and the moderator, while speeches only have a single speaker, and there is no two-sided conversational structure.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1">The provided training sample included two speeches, and both were by Donald Trump. As a result, for the purpose of this task, a single sub-dataset was created. The approach itself, however, is independent of the speaker and the number of speakers.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2">Available at https://code.google.com/archive/p/word2vec/.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Acknowledgment: This work was supported in part by the U.S. National Science Foundation (NSF) under the award SES-1834597.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Overview of the CLEF-2018 CheckThat! Lab on Automatic Identification and Verification of Political Claims, Task 1: Check-Worthiness</title>
		<author>
			<persName><forename type="first">P</forename><surname>Atanasova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Màrquez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Elsayed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Suwaileh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zaghouani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kyuchukov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Da San Martino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2018 - Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS</title>
				<editor>
			<persName><forename type="first">L</forename><surname>Cappellato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Ferro</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><forename type="middle">Y</forename><surname>Nie</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Soulier</surname></persName>
		</editor>
		<meeting><address><addrLine>Avignon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-09">September 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Bracketing Guidelines for Treebank II Style Penn Treebank Project</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bies</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ferguson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Katz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>MacIntyre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Tredinnick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">A</forename><surname>Marcinkiewicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Schasberger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">University of Pennsylvania</title>
		<imprint>
			<biblScope unit="volume">97</biblScope>
			<biblScope unit="page">100</biblScope>
			<date type="published" when="1995">1995</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Blogs, Twitter, and breaking news: the produsage of citizen journalism</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bruns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Highfield</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Produsing Theory in a Digital World: The Intersection of Audiences and Production in Contemporary Theory</title>
				<imprint>
			<publisher>Peter Lang Publishing Inc</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">80</biblScope>
			<biblScope unit="page" from="15" to="32" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Cazalens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lamarre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Leblay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Manolescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tannier</surname></persName>
		</author>
		<title level="m">&quot;Journalism, Misinformation and Fact Checking&quot; alternate paper track of &quot;The Web Conference&quot;</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note>A content management perspective on fact-checking</note>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Computational Journalism: A Call to Arms to Database Researchers</title>
		<author>
			<persName><forename type="first">S</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Conference on Innovative Data Systems Research. CIDR &apos;11</title>
				<meeting><address><addrLine>Asilomar, California, USA</addrLine></address></meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Syntactic Stylometry for Deception Detection</title>
		<author>
			<persName><forename type="first">S</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Banerjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers</title>
				<meeting>the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2012">2012</date>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="171" to="175" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Incorporating non-local information into information extraction systems by Gibbs sampling</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">R</forename><surname>Finkel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Grenager</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 43rd Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="363" to="370" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">The promise of computational journalism</title>
		<author>
			<persName><forename type="first">T</forename><surname>Flew</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Spurgeon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Daniel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Swift</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journalism Practice</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="157" to="171" />
			<date type="published" when="2012">2012</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A context-aware approach for detecting worth-checking claims in political debates</title>
		<author>
			<persName><forename type="first">P</forename><surname>Gencheva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Màrquez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Koychev</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference Recent Advances in Natural Language Processing</title>
				<meeting>the International Conference Recent Advances in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="267" to="276" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Social news, citizen journalism and democracy</title>
		<author>
			<persName><forename type="first">L</forename><surname>Goode</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">New media &amp; society</title>
		<imprint>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page" from="1287" to="1305" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Distributional Structure</title>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">S</forename><surname>Harris</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Word</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">2-3</biblScope>
			<biblScope unit="page" from="146" to="162" />
			<date type="published" when="1954">1954</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Detecting check-worthy factual claims in presidential debates</title>
		<author>
			<persName><forename type="first">N</forename><surname>Hassan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Tremayne</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 24th ACM International Conference on Information and Knowledge Management</title>
				<meeting>the 24th ACM International Conference on Information and Knowledge Management</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="1835" to="1838" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">ClaimBuster: The First-ever End-to-end Fact-checking System</title>
		<author>
			<persName><forename type="first">N</forename><surname>Hassan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Arslan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Caraballo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jimenez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gawsane</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Hasan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Joseph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Kulkarni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Nayak</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the VLDB Endowment</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">12</biblScope>
			<biblScope unit="page" from="1945" to="1948" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning</title>
		<author>
			<persName><forename type="first">H</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">A</forename><surname>Garcia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Li</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN)</title>
				<meeting>the IEEE International Joint Conference on Neural Networks (IJCNN)</meeting>
		<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="1322" to="1328" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Mining and Summarizing Customer Reviews</title>
		<author>
			<persName><forename type="first">M</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</title>
				<meeting>the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="168" to="177" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">ConnotationWordNet: Learning Connotation over the Word+Sense Network</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">S</forename><surname>Kang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Akoglu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 52nd Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2014-06">June 2014</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1544" to="1554" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<title level="m" type="main">Adam: A method for stochastic optimization</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ba</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1412.6980</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">Varieties of Confirmation Bias</title>
		<author>
			<persName><forename type="first">J</forename><surname>Klayman</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Psychology of learning and motivation</title>
		<imprint>
			<biblScope unit="volume">32</biblScope>
			<biblScope unit="page" from="385" to="418" />
			<date type="published" when="1995">1995</date>
			<publisher>Elsevier</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Separating Fact from Fear: Tracking Flu Infections on Twitter</title>
		<author>
			<persName><forename type="first">A</forename><surname>Lamb</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Paul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dredze</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="page" from="789" to="795" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Towards a text analysis system for political debates</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">T</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">T</forename><surname>Vu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Blessing</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities</title>
				<meeting>the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities</meeting>
		<imprint>
			<date type="published" when="2016">2016</date>
			<biblScope unit="page" from="134" to="139" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Objectivity Classification in Online Media</title>
		<author>
			<persName><forename type="first">E</forename><surname>Lex</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Juffinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Granitzer</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 21st ACM Conference on Hypertext and Hypermedia</title>
				<meeting>the 21st ACM Conference on Hypertext and Hypermedia</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="293" to="294" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<monogr>
		<title level="m" type="main">TextBlob: Simplified Text Processing</title>
		<author>
			<persName><forename type="first">S</forename><surname>Loria</surname></persName>
		</author>
		<ptr target="http://textblob.readthedocs.org/en/dev/" />
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">Efficient Estimation of Word Representations in Vector Space</title>
		<author>
			<persName><forename type="first">T</forename><surname>Mikolov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1301.3781</idno>
		<imprint>
			<date type="published" when="2013">2013</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Overview of the CLEF-2018 Lab on Automatic Identification and Verification of Claims in Political Debates</title>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Barrón-Cedeño</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Elsayed</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Suwaileh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Màrquez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Zaghouani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gencheva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kyuchukov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Da San Martino</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Working Notes of CLEF 2018 -Conference and Labs of the Evaluation Forum. CLEF &apos;18</title>
				<meeting><address><addrLine>Avignon, France</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018-09">September 2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Feature selection, L1 vs. L2 regularization, and rotational invariance</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">Y</forename><surname>Ng</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the twenty-first international conference on Machine learning</title>
				<meeting>the twenty-first international conference on Machine learning</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page">78</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Finding deceptive opinion spam by any stretch of the imagination</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Cardie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">T</forename><surname>Hancock</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2011">2011</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="309" to="319" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Thumbs up?: sentiment classification using machine learning techniques</title>
		<author>
			<persName><forename type="first">B</forename><surname>Pang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Vaithyanathan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10</title>
				<meeting>the ACL-02 Conference on Empirical Methods in Natural Language Processing - Volume 10</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="79" to="86" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">TATHYA: A Multi-Classifier System for Detecting Check-Worthy Statements in Political Debates</title>
		<author>
			<persName><forename type="first">A</forename><surname>Patwari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Goldwasser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bagchi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 26th ACM International Conference on Information and Knowledge Management</title>
				<meeting>the 26th ACM International Conference on Information and Knowledge Management</meeting>
		<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1" to="4" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Scikit-learn: Machine learning in Python</title>
		<author>
			<persName><forename type="first">F</forename><surname>Pedregosa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Varoquaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Gramfort</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Michel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Thirion</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Grisel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Blondel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Prettenhofer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Weiss</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Dubourg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Vanderplas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Passos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cournapeau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Brucher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Perrot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Duchesnay</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="2825" to="2830" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<title level="m" type="main">Snowball: A Language for Stemming Algorithms</title>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">F</forename><surname>Porter</surname></persName>
		</author>
		<ptr target="http://snowball.tartarus.org/texts/introduction.html" />
		<imprint>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Linguistic Models for Analyzing and Detecting Biased Language</title>
		<author>
			<persName><forename type="first">M</forename><surname>Recasens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Danescu-Niculescu-Mizil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics</title>
				<meeting>the 51st Annual Meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2013">2013</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="1650" to="1659" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">A Problem of Dimensionality: A Simple Example</title>
		<author>
			<persName><forename type="first">G</forename><forename type="middle">V</forename><surname>Trunk</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Transactions on Pattern Analysis and Machine Intelligence</title>
		<imprint>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="306" to="307" />
			<date type="published" when="1979">1979</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Fact Checking: Task definition and dataset construction</title>
		<author>
			<persName><forename type="first">A</forename><surname>Vlachos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science</title>
				<meeting>the ACL 2014 Workshop on Language Technologies and Computational Social Science</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="18" to="22" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<monogr>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">J</forename><surname>Severin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Tankard</surname><genName>Jr</genName></persName>
		</author>
		<title level="m">Communication theories: Origins, methods and uses in the mass media</title>
				<imprint>
			<date type="published" when="1992">1992</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis</title>
		<author>
			<persName><forename type="first">T</forename><surname>Wilson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wiebe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Hoffmann</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing</title>
				<meeting>the Conference on Human Language Technology and Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2005">2005</date>
			<biblScope unit="page" from="347" to="354" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Toward computational fact-checking</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">K</forename><surname>Agarwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Proceedings of the VLDB Endowment</title>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="volume">7</biblScope>
			<biblScope unit="page" from="589" to="600" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
