A Hybrid Recognition System for Check-worthy Claims Using Heuristics and Supervised Learning

Chaoyuan Zuo¹, Ayla Ida Karakas², and Ritwik Banerjee¹
¹ Department of Computer Science {chzuo,rbanerjee}@cs.stonybrook.edu
² Department of Linguistics ayla.karakas@stonybrook.edu
Stony Brook University, Stony Brook, New York 11794, USA

Abstract. In recent years, the speed at which information disseminates has received an alarming boost from the pervasive usage of social media. To the detriment of political and social stability, this has also made it easier to quickly spread false claims. Due to the sheer volume of information, manual fact-checking seems infeasible, and as a result, computational approaches have recently been explored for automated fact-checking. In spite of the recent advancements in this direction, the critical step of recognizing and prioritizing statements worth fact-checking has received little attention. In this paper, we propose a hybrid approach that combines simple heuristics with supervised machine learning to identify claims made in political debates and speeches, and provide a mechanism to rank them in terms of their "check-worthiness". The viability of our method is demonstrated by evaluations on the English language dataset as part of the Check-worthiness task of the CLEF-2018 Fact Checking Lab.

Keywords: Check-worthiness · Multi-layer Perceptron · Heuristics · Feature Selection · Stylometry

1 Introduction

It is no secret that we live in an age of ubiquitous web and social media. For the most part, any Internet user readily acquires the latent power of civilian commentary and journalism [3,10]. Consequently, information available on the web now carries the potential to propagate through the public domain with unprecedented speed and reach. The ordinary Internet user, however, contends with an overwhelming amount of information, which makes the task of determining the accuracy and integrity of claims all the more onerous.
Additionally, users usually want their beliefs to be confirmed by information [18,34]. The confluence of vast amounts of information and such confirmation bias can thus create a society where unverified information runs amok masquerading as fact. While correcting confirmation biases at a social scale may be extremely challenging and even controversial, the spread of misinformation can be mitigated by focusing only on curating the claims. Comprehensive manual fact-checking is highly tedious and, in light of the sheer volume of information, infeasible. To overcome this hurdle, several approaches to automated fact-checking have been proposed in the nascent field of computational journalism [5,8]. Some prior work took to computing the semantic similarity between claims [4,13], while others framed fact-checking as a question-answering task [5,33,36]. Both approaches need to extract statements to be fact-checked before the actual verification process can begin.

ClaimBuster [12] was the first fact-checking system to assign each sentence a check-worthiness score between 0 and 1. Subsequently, a multi-class classification approach with fewer features was explored to specifically identify check-worthy claims, but it suffered from comparatively lower precision [28]. Outside of this small body of work, the preliminary step of identifying check-worthy claims has received little attention. Gencheva et al. [9] were the first to develop a publicly available dataset for this task, with annotations obtained from nine fact-checking websites, and they employed a considerably richer feature set than earlier systems. Keeping in line with the observations made by prior work regarding the extent of overlap in lexical and shallow syntactic features [9,20], we use a significantly richer set of features derived from word embeddings and deep syntactic structures.

In this work, our focus is on recognizing "check-worthy" statements.
Accurate identification of such statements will benefit the fact-checking and verification processes that follow, independent of the specific techniques used therein. We use the task formulation, data, and evaluation framework provided by the CLEF-2018 Lab on Automatic Identification and Verification of Claims in Political Debates [24] as part of their first task – Check-Worthiness [1].

2 Task, Data, and Evaluation Framework

The CLEF 2018 Fact Checking Lab designed two tasks that, when put together, form the complete fact-checking pipeline. In this work, however, we focus exclusively on the first.

2.1 The Task: Check-Worthiness

The first task – check-worthiness – was defined by the CLEF 2018 Fact Checking Lab as follows:

Predict which claim in a political debate should be prioritized for fact-checking. In particular, given a debate, the goal is to produce a ranked list of its sentences based on their worthiness for fact checking [9].

The goal of this task is to automatically recognize claims worth checking, and present them in order of priority (i.e., as a ranked list of claims) to journalists or even ordinary Internet and social media users. The ranking is attained in terms of a check-worthiness score. This approach helps the recipient tackle the problem of information overload and instead, directly focus on the most important statements.

Table 1. Labeled sentence examples from political debates provided as training data. Check-worthy sentences are labeled 1, and others are labeled 0. Audience reaction and other background noise is encoded as "SYSTEM"-generated.

Speaker  Sentence                                                        Label
HOLT     I'm Lester Holt, anchor of "NBC Nightly News."                    0
HOLT     I want to welcome you to the first presidential debate.           0
TRUMP    Our jobs are fleeing the country.                                 0
TRUMP    Thousands of jobs leaving Michigan, leaving Ohio.                 1
CLINTON  Donald thinks that climate change is a hoax perpetrated
         by the Chinese.                                                   1
SYSTEM   (applause)                                                        0
The output, therefore, can be fed to an automated fact-checker or be used in a manual pursuit of verification. Either way, it can raise the awareness of individual users and stymie the dissemination of false claims in social media.

2.2 Data

Given the alleged impact of disinformation and 'fake news' on the 2016 US presidential election, and the controversy surrounding it, any data pertaining to this election cycle is extremely relevant in terms of fact-checking endeavors having a positive social and political impact in the future. As such, a political debate dataset was provided in English and Arabic. Since our methodology involves heuristics that rely on linguistic insight, we used the English language dataset.

The training data comprised three political debates. Each debate was split into sentences, and each sentence was associated with its speaker and annotated by experts as check-worthy or not (labeled 1 and 0, respectively). This data contained a total of 3,989 sentences, of which only 94 were labeled as check-worthy – a staggering imbalance with only 2.36% of the dataset bearing the label of the target class. A few simple sentences from this training data, along with their speakers and labels, are presented in Table 1.

The test data was a collection of two political debates and five political speeches.³ The total number of sentences in these two categories (Debate and Speech) were 2,815 and 2,064, respectively. In this work, we did not employ any external knowledge other than domain-independent language resources such as parsers and lexicons. Instead, we focused on extracting linguistic features indicative of check-worthiness.

³ The lab task provided all seven files together, without this categorization into speeches and debates.
We, however, chose to treat these differently since language use is very different in these two scenarios: debates consist of the interactive statements made by the candidates and the moderator, while speeches only have a single speaker, and there is no two-sided conversational structure.

2.3 Evaluation Framework

The evaluation was done on the test data provided as part of the task. This data was released much later to the participants, with the gold standard labels for the sentences in the test data withheld. Once we selected the models, we ran them on the entire test data, and used average precision to measure the quality of the output ranking. Average precision is defined as

    AP = (1 / n_chk) · Σ_{k=1}^{n} Prec(k) · δ(k)

where n_chk is the number of check-worthy sentences, n is the total number of sentences, Prec(k) is the precision at cut-off k in the list of sentences ranked by check-worthiness, and δ(k) is the indicator function equaling 1 if the sentence at rank k is check-worthy, and 0 otherwise. The primary metric used by the Fact Checking Lab [24] for the check-worthiness task was mean average precision (MAP), defined simply as the mean of the average precisions over all queries.

3 Methodology

Our methodology is a hybrid of rule-based heuristics and supervised classification. The motivation for this approach was to test the extent to which check-worthiness can be determined based on language constructs without relying on encyclopedic knowledge. Moreover, our aim was to develop an approach that was not specific to the domain of politics. In this section, we describe the data processing, feature selection, and heuristics involved in building our classification models.

3.1 Data Processing

The first step of our processing involved normalizing the speaker names. We did this by adding speaker-specific rules in order to correctly match the speakers extracted from various sentences to the actual speakers associated with the sentences.
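Such speaker-specific rules can be viewed as an ordered list of patterns mapping raw speaker strings to canonical labels. The following is a minimal sketch; the patterns and fallback behavior below are illustrative, not the exact rules we used:

```python
import re

# Ordered, hand-crafted rules mapping raw speaker strings to canonical labels.
# These three patterns are illustrative examples only.
SPEAKER_RULES = [
    (re.compile(r"\bclinton\b|secretary of state", re.I), "CLINTON"),
    (re.compile(r"\btrump\b", re.I), "TRUMP"),
    (re.compile(r"\bholt\b", re.I), "HOLT"),
]

def normalize_speaker(raw_name: str) -> str:
    """Map a raw speaker string to its canonical speaker label."""
    for pattern, canonical in SPEAKER_RULES:
        if pattern.search(raw_name):
            return canonical
    # Fall back to a cleaned-up form when no rule matches.
    return raw_name.strip().upper()
```

With rules of this form, strings such as "Hillary Clinton (D-NY)" and "Former Secretary of State, Presidential Candidate" both normalize to the same canonical speaker.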
For example, speakers in the test data included "Hillary Clinton (D-NY)", "Former Secretary of State, Presidential Candidate", and simply "Clinton". These are, of course, all referring to the same speaker.

Next, we noted that the training data consisted only of political debates where multiple entities (two political candidates, a moderator, and the occasional audience reaction) engage in a conversation. Due to the very nature of debates, the rhetorical structure is different from speeches delivered by a single speaker. The test data, however, also included political speeches. Therefore, we extracted all sentences attributed to a speaker to create sub-datasets. This formed a new training sample, which we then used to train models to identify check-worthy sentences from speeches.⁴ To identify check-worthy sentences from political debates, we used the original training data to train the models.

⁴ The provided training sample included two speeches, and both were by Donald Trump. As a result, for the purpose of this task, a single sub-dataset was created. The approach is independent of the speaker and the number of speakers, however.

Table 2. Constituent tags from the Penn Treebank.

Clause-Level  SBAR, SBARQ, SINV, SQ, S
Phrase-Level  ADJP, ADVP, CONJP, FRAG, INTJ, LST, NAC, NP, NX, PP, PRN, PRT, QP, RRC, UCP, VP, WHADJP, WHADVP, WHNP, WHPP, X

3.2 Feature Design and Selection

For both speeches and debates, we extracted a set of syntactic and semantic features to obtain a consistent knowledge representation, and converted every sentence into a vector in an abstract semantic space. The details of these features and the resultant feature vector are discussed below.

Sentence Embedding: Traditional supervised learning in natural language processing tasks has used vector spaces where dimensions correspond to words (or other linguistic units).
This, however, is not in accordance with the well-known distributional hypothesis in linguistics: words that occur in similar contexts tend to have similar meanings [11]. This necessitates the representation of sentences in a low-dimensional semantic space where similar meanings are closer together. Modeling sentence meanings in a low-dimensional space is a topic of extensive research by itself, and beyond the scope of this work. Instead, we adopted a simple method that leverages word embeddings. We used the 300-dimensional pre-trained Google News word embeddings⁵ to represent each word as a vector [23], and took the arithmetic mean of all such vectors corresponding to the words in a sentence to obtain an abstract sentence embedding.

Lexical Features: From the training data, we removed stopwords and stemmed the remaining terms using the Snowball stemmer [30].

Stylometric Features: Stylometry, the statistical analysis of variations in linguistic constructs, has been used with great success in distinguishing deceptive from truthful language [6,26], and objective from subjective remarks [19,21]. Accordingly, we surmised that capturing stylistic variation would aid in the identification of check-worthy sentences as well, especially since they are typically expected to appear factual and objective. In order to obtain shallow syntactic features from each sentence, we extracted the part-of-speech (POS) tags, the total number of tokens, and the number of tokens in past, present, and future tenses. We were able to infer the tense from the POS tags (e.g., both VBD and VBZ are verb tags, but they indicate past and present tense, respectively). Additionally, we also extracted the number of negations in each sentence.

⁵ Available at https://code.google.com/archive/p/word2vec/.

Fig. 1.
The constituency parse tree of a check-worthy sentence from the training data: "President Bush said we would leave Iraq at the end of 2011." The size of the subtree under the subordinate clause (SBAR) is representative of the amount of information provided about the action 'said' undertaken by the entity 'President Bush'.

More complex structural patterns of language, however, can only be captured by deep syntactic features. For that, we generated the constituency parse trees of all sentences, and selected clause-level and phrase-level tags. The number of words within the scope of each tag was included as the corresponding feature value. These tags, as defined in the Penn Treebank [2], are shown in Table 2. In addition to stylometry, the motivation behind using the number of words was to obtain a representation of the amount of information available under specific syntactic structures. Fig. 1 illustrates this point with the parse tree of a sentence from the training data that was labeled as check-worthy.

Semantic Features: We used the Stanford named entity recognizer (NER) [7] to extract the number of named entities in a sentence. Additionally, we appended an extra feature for named entities of the type PERSON.

Affective Features: We used the TextBlob [22] library to train a naïve Bayes classifier on the pioneering movie review corpus for sentiment analysis [27], and thereby obtained a sentiment score for each sentence. In addition to overt sentiment, we also used the connotation of words in a sentence as features. For this, we employed Connotation WordNet [16], which assigns a (positive or negative) connotation score to each word. For every sentence, we queried this lexicon and retrieved the connotation scores of its words. Finally, the overall connotation of the sentence was simply the mean of these scores.
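This sentence-level score reduces per-word lexicon values to their arithmetic mean, analogous to the mean-of-word-vectors sentence embedding described earlier. A minimal sketch, with a toy dictionary standing in for Connotation WordNet (the scores below are illustrative, and the handling of out-of-lexicon words, skipped here, is an assumption):

```python
# Toy connotation lexicon; the real feature uses Connotation WordNet [16].
CONNOTATION = {"jobs": 0.4, "fleeing": -0.6, "hoax": -0.8, "welcome": 0.7}

def sentence_connotation(tokens):
    """Mean connotation score over the tokens found in the lexicon.

    Tokens absent from the lexicon are skipped; a sentence with no
    in-lexicon tokens is assigned a neutral score of 0.0.
    """
    scores = [CONNOTATION[t.lower()] for t in tokens if t.lower() in CONNOTATION]
    return sum(scores) / len(scores) if scores else 0.0
```

For instance, a tokenized sentence containing "jobs" and "fleeing" would receive the mean of their two scores.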
Additionally, we also utilized lexicons that contain information about the subjective or objective nature of words [35], whether they directly indicate or are typically associated with language that indicates bias [31], and whether they are typically used to voice positive or negative opinions [15]. For every sentence, we extracted the number of words in these categories (as defined by their scores in these lexicons), thus forming four new features: (i) subjectivity, (ii) direct bias, (iii) associated bias, and (iv) opinion.

Metadata Features: In addition to the syntactic and semantic features described above, we also included three binary non-linguistic features extracted from the training sample, indicating whether or not (i) the speaker's opponent is mentioned, (ii) the speaker is the anchor/moderator, or (iii) the sentence is immediately followed by an intense reaction. The third feature is encoded in the training data as a 'SYSTEM' reaction, as shown by the last sentence in Table 1.

Discourse Features: All the above features were extracted without regard to the category (i.e., Debate and Speech). Since debates involve an interactive discourse structure where sentences are often formed as an immediate response to statements made by others, we also extracted features from debate segments. We adopt the approach taken by Gencheva et al. [9] and regard a "segment" to be the maximal set of consecutive sentences by the same speaker. As features, we include the relative position of a sentence within its segment, and the number of sentences in the previous, current, and subsequent segments.

Feature Selection: The feature extraction processes described above yielded a very high-dimensional feature space. High-dimensional spaces, however, quickly lead to a decrease in the predictive power of models [32].
Moreover, given the extreme class imbalance, classification in such a space is likely to ignore important features indicative of the minority class (in this case, the 'check-worthy' sentences). To reduce the dimensionality, we applied a feature selection module using the scikit-learn library [29]. As the first step, univariate feature selection was performed, and the 2,000 best features were selected based on the χ² test. Next, armed with the observation that linear predictive models with L1 loss yield sparse solutions and encourage vanishing coefficients for weakly correlated features [25], we used a support vector machine (SVM) model with a linear kernel and L1 regularization to further remove the relatively unimportant features. This step was first done on the entire training data, and then combined with repeated undersampling (without replacement) of the majority class. Each iteration of this undersampling process resulted in a small but balanced training sample. An L1-regularized SVM learner was trained on every sample generated in this manner, and features with vanishing coefficients were discarded. The cumulative effect of these feature selection steps was a reduction of the feature space to 2,655 and 2,404 dimensions for the identification of check-worthy claims from debates and speeches, respectively.

3.3 Heuristics

Certain heuristics were introduced to override the scores assigned by the classification models. These rules differed slightly based on (i) the category, i.e., speech or debate, and (ii) whether or not the 'strict' heuristics were deployed. The strictness flag was introduced to control the threshold sentence size. When active, it would tend to discard more sentences. These rules are specified in Algorithm 1.

Algorithm 1 Heuristics for assigning the check-worthiness score w(·) to sentences.
Require: category ∈ {speech, debate}, strict_mode ∈ {true, false}, sentence S
  min_token_count ← 0
  if category is speech then
      if strict_mode then
          min_token_count ← 10
      else
          min_token_count ← 8
      end if
  else
      if strict_mode then
          min_token_count ← 7
      else
          min_token_count ← 5
      end if
  end if
  if the speaker of S is SYSTEM then
      w(S) ← 10⁻⁸
  end if
  if the number of tokens in S < min_token_count then
      w(S) ← 10⁻⁸
  end if
  if S contains "thank you" then
      w(S) ← 10⁻⁸
  end if
  if the number of subjects in S < 1 then
      if category is speech then
          w(S) ← 10⁻⁸
      else if S contains "?" then
          w(S) ← 10⁻⁸
      end if
  end if

One particular rule required the identification of subjects in a sentence. To extract this information, we generated dependency parse trees of the sentences and counted the number of times any of the following dependency labels appeared: nsubj, csubj, nsubjpass, csubjpass, or xsubj. The first two indicate nominal and clausal subjects, respectively. The next two indicate nominal and clausal subjects in a passive clause, and the last label denotes a controlling subject, which relates an open clausal complement to its external clause.

4 Models

Our experiments comprised two supervised learning algorithms: support vector machines (SVM) and multilayer perceptrons (MLP). Additionally, we also built an ensemble model combining the two. In this section, we provide a description of these three models, along with their training processes.

For reasons described in Sec. 3.2, the SVM utilized a linear kernel with L1 regularization for feature selection. However, due to the propensity of the L1 loss function to miss optimal solutions, we used L2 loss in building the final model after completing feature selection. Our second model was the MLP. Here, we used two hidden layers with 100 units and 8 units in them, respectively.
We used the hyperbolic tangent (tanh) as our activation function since it achieved better results when compared to rectified linear units (ReLU). Stochastic optimization was done with Adam [17]. To avoid overfitting, we used L2 regularization in both SVM and MLP.

Table 3. Results for the Check-Worthiness task of our submitted models: MLP* was the primary submission, along with two contrastive runs, MLPstr and ENS (MLP with strict heuristics and the ensemble model, respectively). MLPnone shows the results of the MLP without any heuristics being applied. The primary evaluation metric was mean avg. precision (MAP). The mean reciprocal rank (MRR), mean R-precision (MRP), and mean precision at k (MP@k) are also shown.

         MAP    MRR    MRP    MP@1   MP@3   MP@5   MP@10  MP@20  MP@50
MLP*     0.1332 0.4965 0.1352 0.4286 0.2857 0.2000 0.1429 0.1571 0.1200
MLPstr   0.1366 0.5246 0.1475 0.4286 0.2857 0.2286 0.1571 0.1714 0.1229
ENS      0.1317 0.4139 0.1523 0.2857 0.1905 0.1714 0.1571 0.1571 0.1429
MLPnone  0.1086 0.4767 0.1037 0.2857 0.2857 0.2000 0.1286 0.1071 0.1000

Third, we built an ensemble model that combines SVM and MLP (without the strict heuristics). In this model, the final output score was obtained by normalizing the results of SVM and MLP (by their standard deviations) and then computing the average.

For all three models, class imbalance was a hindrance during the training process. To overcome that, we used ADASYN [14], an adaptive synthetic sampling algorithm for imbalanced learning. For model selection, we used 3-fold cross-validation for debates, using two files for training and the remaining one for testing, to evaluate model performances and tune parameters. For speeches, we split the training sample into two halves (one file in each) for 2-fold cross-validation. The evaluation script was provided by the task organizers, with the mean average precision (MAP) being the primary evaluation metric.
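For reference, the average precision defined in Sec. 2.3 can be computed directly from a ranked list of gold labels; the following is a minimal sketch (not the organizers' evaluation script):

```python
def average_precision(ranked_labels):
    """Average precision for a list of gold labels (1 = check-worthy,
    0 = not), ordered by decreasing predicted check-worthiness:
    AP = (1 / n_chk) * sum_k Prec(k) * delta(k)."""
    n_chk = sum(ranked_labels)
    if n_chk == 0:
        return 0.0
    hits, total = 0, 0.0
    for k, label in enumerate(ranked_labels, start=1):
        if label == 1:
            hits += 1
            total += hits / k  # Prec(k) at each check-worthy rank
    return total / n_chk
```

MAP is then simply the mean of this value over all test files (debates or speeches).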
MLP without the strict heuristics demonstrated the best results during the training process, so this was submitted for the primary run. For the two contrastive runs, we submitted (i) MLP with strict heuristics, and (ii) the ensemble model without the strict heuristics.

5 Results and Analysis

5.1 Empirical Results

The detailed performance of all three submissions we made is shown in Table 3. Even though MLP yielded the best training results without the strict heuristics, MLPstr performed demonstrably better across multiple metrics on the test data. Our third model, the ensemble classifier, performed poorly in general compared to both MLP models. It did, however, achieve slightly better mean R-precision and mean precision at higher cutoffs (k = 10 and 50). Without the inclusion of any heuristics, the performance of MLP dropped significantly. This was expected, since the heuristics were designed to address the flaws of the classifiers. This model was not among the submissions, but we include it here for comparison. The difference between MLP and MLPnone quantifies the extent to which the rules help the supervised learners.

Table 4. Results from the primary submissions from all participants. We participated under the name Prise de Fer. The best result for each metric is shown in bold.

TEAM           MAP    MRR    MRP    MP@1   MP@3   MP@5   MP@10  MP@20  MP@50
Prise de Fer*  0.1332 0.4965 0.1352 0.4286 0.2857 0.2000 0.1429 0.1571 0.1200
Copenhagen     0.1152 0.3159 0.1100 0.1429 0.1429 0.1143 0.1286 0.1286 0.1257
UPV-INAOE      0.1130 0.4615 0.1315 0.2857 0.2381 0.3143 0.2286 0.1214 0.0866
bigIR          0.1120 0.2621 0.1165 0.0000 0.1429 0.1143 0.1143 0.1000 0.1114
fragarach      0.0812 0.4477 0.1217 0.2857 0.1905 0.2000 0.1571 0.1071 0.0743
blue           0.0801 0.2459 0.0576 0.1429 0.0952 0.0571 0.0571 0.0857 0.0600
RNCC           0.0632 0.3755 0.0639 0.2857 0.1429 0.1143 0.0571 0.0571 0.0486

Next, in Table 4, we present the comparison between the results obtained by all participants.
This comparison was done only on the primary submission from each team. Our MLP model without the strict heuristics achieved the best MAP, MRR, and MRP scores. Further, it also outperformed the others in terms of correctly placing the check-worthy sentences at the very top of the ranked output list, as demonstrated by the mean precision at low values of k (k = 1 and 3).

5.2 Qualitative Analysis

Identifying check-worthy sentences is a difficult and novel task, and even the best model suffered from misclassification errors. Upon analyzing such mistakes made by the MLP models, we were able to discern a few reasons. First, tense plays a logical role in check-worthiness, since future actions cannot be verified. However, part-of-speech tagging often confuses the future tense with the present continuous (e.g., "We're cutting taxes."). Second, we observed that anecdotal stories are often highly prioritized as check-worthy, even though they are not. These sentences are usually complex, with a lot of content, which makes it easy for the model to conflate them with other complex sentences pertaining to real events deemed check-worthy. Third, the presence of duplicate sentences in the data means that a misclassification gets amplified, while the presence of very similar sentences with different labels likely makes the feature selection stage discard potentially useful features.

At a more abstract level, rhetorical figures of speech play a critical role. They often break the structures associated with standard sentence formation. Several sentences that were misclassified exhibited constructs such as scesis onomaton, where words or phrases with nearly equivalent meaning are repeated. We conjecture that this makes the model falsely believe that there is more informational content in the sentence. Such figures of speech become even harder to handle when they occur across multiple speakers in debates.
The conversational aspect of debates also causes another problem: quite a few sentences are short, and in isolation, would perhaps not be check-worthy. However, as a response to things mentioned earlier in the debate, they are. Another complex issue leading to misclassification is the use of sentence fragments. Fragments are used sparingly for dramatic effect in literature, but were seen with alarming frequency in the political debates due to the prevalence of ill-formed or partly-formed sentences stopping and then giving way to another sentence. In some cases, the fragments are portions of the sentence that the speaker repeats. An example of such a fragment is the sentence "Ambassador Stevens – Ambassador Stevens sent 600 requests for help.", where the phrase "Ambassador Stevens" is repeated. A proper approach to deal with these hurdles is a complex matter in and of itself. We believe that our features are better suited to written language than to speech or debate transcripts. In the presence of significantly more labeled data for check-worthiness, ablation studies that remove such sentences could provide empirical evidence for this intuition.

6 Conclusion and Future Work

We developed a hybrid system that combines a few rules with supervised learning to detect check-worthy sentences in political debates and speeches. To tackle the severity of the class imbalance, our development also included a sophisticated feature selection process and special sampling methods. Our primary model achieved the best results among all participants over multiple performance metrics.

This work opens up several intriguing possibilities for future research in the field of fact-checking. First, we intend to study in greater detail the linguistic forms of informational content. Shallow syntax has been explored to understand this aspect of language in sociolinguistics, and some work has even looked into deep syntactic features.
This approach has, however, not yet been applied to identifying check-worthy sentences. Furthermore, more complex neural network structures need to be thoroughly investigated. Along this line, we will be investigating deep learning models with feedback control. Stringent and focused work on these issues will empower journalists and citizens alike to be better informed and more cognizant of false claims permeating news and social media. To that end, we also need complementary advances in related areas such as natural language querying, crowdsourcing, source identification, and social network analysis.

Acknowledgment: This work was supported in part by the U.S. National Science Foundation (NSF) under the award SES-1834597.

References

1. Atanasova, P., Màrquez, L., Barrón-Cedeño, A., Elsayed, T., Suwaileh, R., Zaghouani, W., Kyuchukov, S., Da San Martino, G., Nakov, P.: Overview of the CLEF-2018 CheckThat! Lab on Automatic Identification and Verification of Political Claims, Task 1: Check-Worthiness. In: Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.) CLEF 2018 Working Notes. Working Notes of CLEF 2018 – Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org, Avignon, France (September 2018)
2. Bies, A., Ferguson, M., Katz, K., MacIntyre, R., Tredinnick, V., Kim, G., Marcinkiewicz, M.A., Schasberger, B.: Bracketing Guidelines for Treebank II Style Penn Treebank Project. University of Pennsylvania 97, 100 (1995)
3. Bruns, A., Highfield, T.: Blogs, Twitter, and breaking news: the produsage of citizen journalism. In: Produsing Theory in a Digital World: The Intersection of Audiences and Production in Contemporary Theory, vol. 80, pp. 15–32. Peter Lang Publishing Inc. (2012)
4. Cazalens, S., Lamarre, P., Leblay, J., Manolescu, I., Tannier, X.: A content management perspective on fact-checking. In: "Journalism, Misinformation and Fact Checking" alternate paper track of The Web Conference (2018)
5.
Cohen, S., Li, C., Yang, J., Yu, C.: Computational Journalism: A Call to Arms to Database Researchers. In: Conference on Innovative Data Systems Research. CIDR '11, ACM, Asilomar, California, USA (2011)
6. Feng, S., Banerjee, R., Choi, Y.: Syntactic Stylometry for Deception Detection. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers – Volume 2. pp. 171–175. Association for Computational Linguistics (2012)
7. Finkel, J.R., Grenager, T., Manning, C.: Incorporating non-local information into information extraction systems by Gibbs sampling. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics. pp. 363–370. Association for Computational Linguistics (2005)
8. Flew, T., Spurgeon, C., Daniel, A., Swift, A.: The promise of computational journalism. Journalism Practice 6(2), 157–171 (2012)
9. Gencheva, P., Nakov, P., Màrquez, L., Barrón-Cedeño, A., Koychev, I.: A context-aware approach for detecting worth-checking claims in political debates. In: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017. pp. 267–276 (2017)
10. Goode, L.: Social news, citizen journalism and democracy. New Media & Society 11(8), 1287–1305 (2009)
11. Harris, Z.S.: Distributional Structure. Word 10(2-3), 146–162 (1954)
12. Hassan, N., Li, C., Tremayne, M.: Detecting check-worthy factual claims in presidential debates. In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management. pp. 1835–1838. CIKM (2015)
13. Hassan, N., Zhang, G., Arslan, F., Caraballo, J., Jimenez, D., Gawsane, S., Hasan, S., Joseph, M., Kulkarni, A., Nayak, A.K., et al.: ClaimBuster: The First-ever End-to-end Fact-checking System. Proceedings of the VLDB Endowment 10(12), 1945–1948 (2017)
14. He, H., Bai, Y., Garcia, E.A., Li, S.: ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning.
In: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN). pp. 1322–1328. IEEE (2008)
15. Hu, M., Liu, B.: Mining and Summarizing Customer Reviews. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 168–177. ACM (2004)
16. Kang, J.S., Feng, S., Akoglu, L., Choi, Y.: ConnotationWordNet: Learning Connotation over the Word+Sense Network. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1544–1554. Association for Computational Linguistics (June 2014)
17. Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980 (2014)
18. Klayman, J.: Varieties of Confirmation Bias. In: Psychology of Learning and Motivation, vol. 32, pp. 385–418. Elsevier (1995)
19. Lamb, A., Paul, M.J., Dredze, M.: Separating Fact from Fear: Tracking Flu Infections on Twitter. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 789–795 (2013)
20. Le, D.T., Vu, N.T., Blessing, A.: Towards a text analysis system for political debates. In: Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities. pp. 134–139 (2016)
21. Lex, E., Juffinger, A., Granitzer, M.: Objectivity Classification in Online Media. In: Proceedings of the 21st ACM Conference on Hypertext and Hypermedia. pp. 293–294. ACM (2010)
22. Loria, S.: TextBlob: Simplified Text Processing. http://textblob.readthedocs.org/en/dev/ (2014)
23. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781 (2013)
24.
Nakov, P., Barrón-Cedeño, A., Elsayed, T., Suwaileh, R., Màrquez, L., Zaghouani, W., Gencheva, P., Kyuchukov, S., Da San Martino, G.: Overview of the CLEF-2018 Lab on Automatic Identification and Verification of Claims in Political Debates. In: Working Notes of CLEF 2018 – Conference and Labs of the Evaluation Forum. CLEF '18, Avignon, France (September 2018)
25. Ng, A.Y.: Feature selection, L1 vs. L2 regularization, and rotational invariance. In: Proceedings of the Twenty-first International Conference on Machine Learning. p. 78. ACM (2004)
26. Ott, M., Choi, Y., Cardie, C., Hancock, J.T.: Finding deceptive opinion spam by any stretch of the imagination. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies – Volume 1. pp. 309–319. Association for Computational Linguistics (2011)
27. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up? Sentiment classification using machine learning techniques. In: Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing – Volume 10. pp. 79–86. Association for Computational Linguistics (2002)
28. Patwari, A., Goldwasser, D., Bagchi, S.: TATHYA: A Multi-Classifier System for Detecting Check-Worthy Statements in Political Debates. In: Proceedings of the 26th ACM International Conference on Information and Knowledge Management. pp. 1–4. CIKM (2017)
29. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
30. Porter, M.F.: Snowball: A Language for Stemming Algorithms. http://snowball.tartarus.org/texts/introduction.html (2001)
31. Recasens, M., Danescu-Niculescu-Mizil, C., Jurafsky, D.: Linguistic Models for Analyzing and Detecting Biased Language.
In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1650–1659 (2013)
32. Trunk, G.V.: A Problem of Dimensionality: A Simple Example. IEEE Transactions on Pattern Analysis and Machine Intelligence 1(3), 306–307 (1979)
33. Vlachos, A., Riedel, S.: Fact Checking: Task definition and dataset construction. In: Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science. pp. 18–22 (2014)
34. Severin, W.J., Tankard Jr., J.W.: Communication Theories: Origins, Methods and Uses in the Mass Media (1992)
35. Wilson, T., Wiebe, J., Hoffmann, P.: Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis. In: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. pp. 347–354. Association for Computational Linguistics (2005)
36. Wu, Y., Agarwal, P.K., Li, C., Yang, J., Yu, C.: Toward computational fact-checking. Proceedings of the VLDB Endowment 7(7), 589–600 (2014)