bigIR at CLEF 2019: Automatic Verification of Arabic Claims over the Web

Fatima Haouari, Zien Sheikh Ali, and Tamer Elsayed
Qatar University, Doha, Qatar
{200159617,zs1407404,telsayed}@qu.edu.qa

Abstract. With the proliferation of fake news and its prevalent impact on democracy, journalism, and public opinion, manual fact-checking cannot scale to the volume and speed of fake news propagation. Automatic fact-checkers are therefore needed to counter the negative impact of fake news in a fast and effective way. In this paper, we present our participation in Task 2 of the CLEF-2019 CheckThat! Lab, which addresses the problem of finding evidence over the Web for verifying Arabic claims. We participated in all four subtasks and adopted a machine learning approach in each, with a different set of features extracted from both the claim and the corresponding retrieved Web search result pages. Our models, trained solely on the provided training data, exhibited relatively good performance across the subtasks. Our official results on the testing data show that our best performing runs achieved the best overall performance in subtasks A and B among 7 and 8 participating runs respectively. For subtasks C and D, our best performing runs achieved the median overall performance among 6 and 9 participating runs respectively.

Keywords: Fact Checking · Arabic Retrieval · Learning to Rank · Web Classification

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2019, 9-12 September 2019, Lugano, Switzerland.

1 Introduction

Fake news has witnessed an explosion recently and is considered one of the biggest threats to democracy, journalism, and public trust in governments. In combating fake news, the number of manual fact-checking organizations increased by 239% over a period of four years, reaching 149 fact-checkers in 2018 as opposed to only 44 in 2014 (https://reporterslab.org/fact-checking-triples-over-four-years/).

One of the main challenges is that manual fact-checking does not scale with the volume of daily fake news. This mismatch can be attributed to the gap between the time a claim is made and the time it is checked and published, as it is very time-consuming for journalists to find check-worthy claims and verify them. Another challenge is that fact-checking requires advanced writing skills in order to convince readers whether the claim is true or false [6]. In fact, it is estimated that assessing the check-worthiness of a claim and writing an article about it can take up to one day [7]. Moreover, manual fact-checking platforms are outdated [6]: most fact-checking frameworks adopt old content management systems designed for traditional blogs and newspapers rather than for modern journalism. A new approach is therefore needed for automated fake news detection and verification.

Industry and academia have shown an overwhelming interest in fake news, aiming to address the challenges of its detection and verification. Many pioneering ideas were proposed to address different aspects of fact-checking systems, with focus varying among detecting check-worthy claims [8, 10, 7], checking the factuality of claims [11, 15, 16, 20], checking the factuality of news media [2], and proposing fully automatic fact-checking systems [9, 14, 22].
There are also shared tasks open to the research community interested in the problem, such as the FEVER 2018 shared task on fact extraction and verification [18] and the CheckThat! 2018 lab at CLEF on automatic identification and verification of claims in political debates [13]. This year, the CLEF-2019 CheckThat! Lab [4] introduced two tasks to tackle two main problems of automated fact-checking systems. The main objective of the first is to detect check-worthy claims to be prioritized for fact-checking [1], while the second focuses on evidence extraction to support fact-checking a claim [5].

In this paper, we present the approach adopted by our bigIR group at Qatar University to address the second task. Task 2 (Evidence and Factuality) addresses the problem of finding evidence over the Web for verifying Arabic claims. It assumes the system is given an Arabic claim (as a short sentence) and a corresponding ranked list of Web pages that were retrieved by a Web search engine for that claim. The system then needs to address four sub-problems, each defined as a subtask as follows:

1. Subtask A: Rank the retrieved pages based on how useful they are for verifying the claim.
2. Subtask B: Classify the Web pages as “very useful” for verification, “useful”, “not useful”, or “not relevant”.
3. Subtask C: Within each useful page, identify which passages are useful for claim verification.
4. Subtask D: Determine the factuality of the claim, i.e., whether it is “True” or “False”.

We participated in all four subtasks. Since this is the first year of the task (and thus our first attempt), we generally adopted a simple machine learning approach, where learning models were trained only on the given training data over hand-crafted features. We applied feature ablation to assess the impact of each feature on the performance of our models.

For subtask A, to re-rank the pages based on their usefulness, we adopted a pairwise learning-to-rank approach with features extracted either from the page as a whole (such as source popularity, URL links, and number of quotes), from the relevant segments in the page (such as the similarity score of the most relevant sentence), or from the search results (such as the original rank of the page). Additionally, we extracted claim-dependent features such as the similarity between the claim and the title and the snippet of the page.

For subtask B, we adopted a multi-class classification approach to classify the Web pages. We considered several features including word embeddings, named entities, similarity scores, the number of relevant sentences in the page, and URL-based features (such as URL length, URL scheme, and URL domain).

For subtask C, we adopted a binary classification approach to classify the passages within a useful page. Features included Bag-of-Words (BOW), named entities, the number of quotes, the score of the most relevant sentence in each passage, and the similarity score between the claim and the passage.

For subtask D, we also adopted a binary classification approach to determine the claim’s factuality given the retrieved Web pages. To classify the claim, we first identify the pages most similar to the claim for feature extraction. For the selected pages, we consider their similarity scores, source popularity, and the sentiment of the page.

Our contribution in this work is two-fold:

1. We participated in all four subtasks, adopting a machine learning approach with a relatively different set of features in each. The features are extracted from both the claims and the retrieved Web pages.
2. Our best performing runs exhibited the best performance in both subtasks A and B among the submitted runs.

The remainder of this paper is organized as follows. Section 2 describes how we processed and extracted features from the claims and retrieved pages. Sections 3, 4, 5, and 6 outline our approach and discuss our experimental evaluation in detail for subtasks A, B, C, and D respectively. Finally, Section 7 concludes and discusses possible future work.

2 Preprocessing & Feature Extraction

In our work, we apply common preprocessing for all subtasks to parse documents, identify relevant segments, and extract features; however, we include or exclude some features in each subtask. In this section, we describe the preprocessing steps in detail and introduce and motivate the features we extracted at all levels. For each page, we extract two types of features: features that depend on the claim/page relationship (claim-dependent) and features that depend solely on the page (page-dependent). In what follows, a text segment in a page is centered on one sentence but also includes the sentence that precedes it and the sentence that follows it, as defined by Yasser et al. [21], to capture the context of the sentence.

2.1 HTML Parsing

As the Web pages are provided in raw HTML format, we parse each page by extracting only a clean version of the textual body, discarding images, videos, and scripts, using the newspaper (https://pypi.org/project/newspaper3k/) and BeautifulSoup (https://pypi.org/project/bs4/) Python libraries. We remove stopwords using the Arabic stopword list of the Python NLTK library (https://pypi.org/project/nltk/). We also discard sentences containing fewer than 3 words, motivated by the empirical study of Zhi et al. [22].
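To make the cleaning step concrete, the following is a minimal sketch of body extraction and filtering, assuming only BeautifulSoup and the NLTK Arabic stopword list; the naive split on '.' as a sentence boundary is an illustrative assumption, and the newspaper3k extraction used in our actual pipeline is omitted.

```python
# A minimal sketch of the HTML cleaning step, assuming BeautifulSoup and NLTK
# (with the 'stopwords' corpus downloaded); the '.'-based sentence split is an
# illustration, not necessarily what the newspaper3k-based pipeline does.
from bs4 import BeautifulSoup
from nltk.corpus import stopwords

AR_STOPWORDS = set(stopwords.words('arabic'))

def clean_sentences(html, min_words=3):
    """Extract the textual body of a page, drop stopwords and short sentences."""
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup(['script', 'style', 'img', 'video']):
        tag.decompose()                      # discard scripts, styles, and media
    text = soup.get_text(separator=' ')
    sentences = []
    for raw in text.split('.'):              # naive sentence segmentation
        tokens = [t for t in raw.split() if t not in AR_STOPWORDS]
        if len(tokens) >= min_words:         # discard sentences with < 3 words
            sentences.append(' '.join(tokens))
    return sentences
```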
2.2 Text Vector Representations

In extracting our features, we consider two text vector representations:

– Bag-of-Words (BOW): We use a BOW representation to represent full passages (mainly for subtask C). We consider only the terms that appear at least 7 times in the training data, based on preliminary experiments.
– Distributed Representation (W2V): We use word2vec embeddings [12] to represent the claim and the segments of a page; each is represented as the average of the embedding vectors of its terms. We use the pre-trained AraVec embedding models proposed by Soliman et al. [17].

2.3 Relevant Segments Identification

To identify the segments in a page that are relevant to a given claim, we represent the claim and each sentence in the page by the average of their term W2V vectors. We then compute the cosine similarity between the vector of the claim and the vector of each segment. A segment is considered relevant if its similarity score is higher than a threshold.
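The segment-scoring step can be sketched as follows. The model path and loading call are assumptions (the released AraVec models come in several gensim-compatible formats); the thresholding and the one-sentence-of-context windowing follow the description above.

```python
# A sketch of relevant-segment identification, assuming a gensim-compatible
# AraVec model; 'aravec.bin' is a hypothetical path, and the released AraVec
# models may instead require gensim's Word2Vec loader.
import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format('aravec.bin', binary=True)

def avg_vector(text, model):
    """Average the embeddings of in-vocabulary terms (zero vector if none)."""
    vecs = [model[w] for w in text.split() if w in model]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def relevant_segments(claim, sentences, model, threshold=0.4):
    """Score each sentence against the claim; a segment is the sentence plus its
    preceding and following sentences, kept if the score passes the threshold."""
    claim_vec = avg_vector(claim, model)
    segments = []
    for i, sent in enumerate(sentences):
        score = cosine(claim_vec, avg_vector(sent, model))
        if score >= threshold:
            segment = ' '.join(sentences[max(0, i - 1): i + 2])
            segments.append((segment, score))
    return segments
```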
2.4 Page-Dependent Features

We extract two types of page-dependent features: credibility features and content features.

Credibility Features To indicate the credibility of the page, we consider the following features:

– Source Popularity (SrcPop): This feature may indicate trustworthiness, as it captures how popular a particular website is. We use the Amazon Alexa rank (https://www.alexa.com/), motivated by Baly et al. [2], who used this feature to estimate the reliability of media sources. We treat it as a categorical feature by binning the ranking values into 10 categories and then converting them into a one-hot vector of 10 binary features.
– URL Features: These features were used by Baly et al. [2] to detect the reliability of web sources. We use the Python URL handling library urllib (https://pypi.org/project/urllib3/) to parse the URL and extract the following orthographic features (a sketch of these features is given at the end of this section):
  • Length (URLLen) and Number of Sections (URLSecs): The length of the URL path and the number of sections separated by ‘/’ help indicate whether the website is legitimate, irregular, or a phishing website.
  • Scheme (URLScheme): The URL protocol (https or http) indicates the trustworthiness of the website. We extract the URL scheme and use the scikit-learn label encoder (scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html) to encode the string values of schemes as integers.
  • Domain Suffix (URLSfx): The suffix of a URL domain hints at the source and credibility of the website. For example, a website with the domain suffix .gov is a federal government site and is more credible than a commercial website with a .com suffix. We use the label encoder to encode the string values as integers.

Content Features From the page body, we extract the following linguistic and similarity features:

– Number of Quotes (NQts): For each page, we count the number of quotes in all relevant segments. This feature may be very useful for ranking Web pages and deciding how useful they are for claim verification, as quoting sources may indicate the credibility of the page. In our work, we consider only quotes of five words or more.
– Number of URL Links (NLinks): This feature represents the number of URL links in the retrieved page. It may indicate the credibility of the source through giving references.
– Named Entities (NEs): Mentions of named entities may indicate the truthfulness of the page. We use the Python polyglot NLP tool (https://github.com/aboSamoor/polyglot) to recognize location, organization, and person entities in the most relevant segment of the page. We form a vector of 3 integer values representing the number of occurrences of each entity type in the segment.

2.5 Claim-Dependent Features

We extract the following features based on the claim-page interaction:

– Original Rank (Rank): This feature is available from the search results and represents how relevant the page potentially is to the claim according to the search engine.
– Similarity: This includes the cosine similarity between the claim and the title (ClmTtlSim), the claim and the snippet (ClmSnptSim), and the claim and a passage (ClmPsgSim).
– Number of Relevant Sentences (NRelSent): For every page, we compute the similarity between the claim and each sentence and count the number of relevant sentences, as this might indicate the relevance of the page.
– Number of Relevant Webpages (NRelPages): For every claim, we count the number of webpages whose similarity score between the claim and their most relevant sentence is higher than a certain threshold.
– Score of the Most Relevant Segment (MostRelSeg): This feature indicates how similar the most relevant segment is to the claim.
– Sentiment (SntCnt): Sentiment analysis can help identify whether the stance of the page is positive, negative, or neutral, which may help in identifying whether the page agrees with the claim or not. We use polyglot’s sentiment model (https://polyglot.readthedocs.io/en/latest/Sentiment.html) to extract sentiments. From the most relevant segment, we obtain two values: the number of words with positive polarity and the number of words with negative polarity.
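As referenced above, a small sketch of the URL-based credibility features follows, using urllib.parse and scikit-learn's LabelEncoder; the example URLs and the suffix heuristic (taking the last dot-separated token of the host) are illustrative assumptions.

```python
# A sketch of the orthographic URL features (Section 2.4); the suffix heuristic
# and the example URLs are assumptions for illustration.
from urllib.parse import urlparse
from sklearn.preprocessing import LabelEncoder

def url_features(url):
    parsed = urlparse(url)
    return {
        'URLLen': len(parsed.path),                                # length of the URL path
        'URLSecs': len([s for s in parsed.path.split('/') if s]),  # '/'-separated sections
        'URLScheme': parsed.scheme,                                # 'http' or 'https'
        'URLSfx': parsed.netloc.rsplit('.', 1)[-1],                # e.g. 'gov', 'com'
    }

urls = ['https://example.gov/news/2019/claim-check', 'http://example.com/story']
feats = [url_features(u) for u in urls]

# Encode the categorical scheme and suffix strings as integers, as described above.
scheme_ids = LabelEncoder().fit_transform([f['URLScheme'] for f in feats])
suffix_ids = LabelEncoder().fit_transform([f['URLSfx'] for f in feats])
```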
3 Subtask A: Reranking Retrieved Pages

In this subtask [3], the goal is to rerank the retrieved pages based on their usefulness for verifying a specific claim. In this section, we present our proposed approach, our experimental setup and results, our selected runs for the CLEF submissions, and finally the official CLEF results.

3.1 Approach

Our approach is based on learning-to-rank (L2R). We propose a pairwise L2R model considering three different classifiers, namely SVM C-Support Vector Classification (SVC), which is implemented on top of libsvm (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), Gaussian Naïve Bayes (Gaussian NB), and the ensemble classifier Random Forest (RF), using the scikit-learn Python library (https://scikit-learn.org/stable/index.html). One possible pairwise construction is sketched after the feature list below. We consider the following features (discussed in Section 2):

– Basic features: Rank, SrcPop, and MostRelSeg.
– Similarity features: ClmTtlSim and ClmSnptSim.
– NLinks.
– NQts.
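Since the exact pairwise transformation is a modelling choice not spelled out above, the following is one common construction, given hypothetical per-page feature vectors X and graded usefulness labels y for a single claim: pairs of pages with different labels become difference vectors, and at ranking time each page is scored by the number of pairwise comparisons it wins.

```python
# A sketch of one possible pairwise L2R construction; the exact pairing and
# scoring scheme is an assumption for illustration.
import numpy as np
from itertools import combinations
from sklearn.naive_bayes import GaussianNB

def to_pairwise(X, y):
    """Build difference vectors x_i - x_j labelled +1 if page i is more useful."""
    Xp, yp = [], []
    for i, j in combinations(range(len(y)), 2):
        if y[i] == y[j]:
            continue
        label = 1 if y[i] > y[j] else -1
        Xp.append(X[i] - X[j]); yp.append(label)
        Xp.append(X[j] - X[i]); yp.append(-label)
    return np.array(Xp), np.array(yp)

def rank_pages(model, X_pages):
    """Order pages by the number of pairwise comparisons they win."""
    wins = np.zeros(len(X_pages))
    for i, j in combinations(range(len(X_pages)), 2):
        pred = model.predict((X_pages[i] - X_pages[j]).reshape(1, -1))[0]
        wins[i if pred > 0 else j] += 1
    return np.argsort(-wins)          # page indices from most to least useful

# Toy data: 8 pages with 6 features each and graded usefulness labels.
X = np.random.rand(8, 6)
y = np.array([2, 1, 0, -1, 2, 0, 1, -1])
ranker = GaussianNB().fit(*to_pairwise(X, y))
print(rank_pages(ranker, X))
```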
3.2 Experimental Setup

Parameters We experimented with the three classifiers mentioned in Section 3.1. We set the kernel for SVC to linear and the number of estimators for the RF models to 100 (based on preliminary experiments). For the NB models, we did not tune any hyper-parameters and used the default settings.

Baselines We compare our models against a baseline that returns the pages in their original order (i.e., based on the relevance scores of the search engine, not on usefulness for fact-checking).

3.3 Evaluation on Training

As we were constrained by the size of the training data, which contains only 10 claims, we adopted leave-one-claim-out (LOO) cross-validation to evaluate the trained models. We optimized our models using the graded relevance measure NDCG@20. We first experimented with different values of the cosine similarity threshold (0.4, 0.5, 0.6, and 0.7) when extracting relevant segments. In our unreported preliminary experiments, we observed that the best performing models were those trained with features extracted using similarity thresholds of 0.4 and 0.7, presented in Fig. 1 and Fig. 2 respectively. We also tried different combinations of features, as shown in both figures. The results show that our models could not beat the baseline with only the basic features. However, the NB models outperformed the baseline when other features were introduced. We also notice that adding ClmTtlSim and ClmSnptSim to the basic features improved the performance of our models, and that excluding the SrcPop feature improved the performance further. Moreover, our proposed NLinks and NQts features did not have a noticeable impact on the performance of the models.

Fig. 1. Subtask A: Performance of L2R models on training data with combinations of features (cosine similarity threshold set to 0.4).

Fig. 2. Subtask A: Performance of L2R models on training data with combinations of features (cosine similarity threshold set to 0.7).

3.4 CLEF Evaluation

Runs As shown in Fig. 1, the NB models outperform the other L2R models on the training data; we therefore picked the 3 best NB models to submit to CLEF:

1. NB trained with Basic, ClmTtlSim, and ClmSnptSim features, excluding SrcPop.
2. NB trained with Basic, ClmTtlSim, ClmSnptSim, and NQts features, excluding SrcPop.
3. NB trained with Basic, ClmTtlSim, ClmSnptSim, NQts, and NLinks features, excluding SrcPop.

Moreover, when the cosine similarity threshold was set to 0.7, RF outperformed the other models, as shown in Fig. 2, so we also picked its best performing model:

4. RF trained with Basic, ClmTtlSim, and ClmSnptSim features, excluding SrcPop.

Results As shown in Table 1, the official CLEF evaluation shows that our best performing model on the test data was the NB model trained with Basic, ClmTtlSim, and ClmSnptSim features (excluding SrcPop), which achieved an NDCG@20 of 0.55. This was the maximum score achieved among the 7 runs submitted for this subtask. We observed that the performance of our models on the training data was better than on the testing data; this can be attributed to the small size of the training dataset, containing only 395 pages from 10 claims, which may be insufficient and not representative enough to train the models.

Table 1. Subtask A: Performance of CLEF submitted runs.

  Features                          Classifier   NDCG@20 on train   NDCG@20 on test
  {Basic+Sim}-SrcPop                RF           0.704              0.47
  {Basic+Sim+NQts}-SrcPop           NB           0.693              0.52
  {Basic+Sim}-SrcPop                NB           0.692              0.55
  {Basic+Sim+NQts+NLinks}-SrcPop    NB           0.688              0.51

4 Subtask B: Classifying Retrieved Pages

The main goal of this subtask [3] is to classify all retrieved Web pages based on how useful they are in detecting the claim’s veracity. A webpage is useful if it has enough evidence to verify the claim and if its source is trustworthy. In this section, we present our approach, experimental setup, training results, and the CLEF results for our submitted runs.

4.1 Approach

For this subtask, we use different machine learning algorithms to perform multi-class classification. We consider SVC as it has been shown to learn well from small datasets. We also include Gradient Boosting (GB) and RF as ensemble models. As mentioned in Section 3.1, we use the scikit-learn Python library for our implementation. We consider the following features:

– Basic features: Rank, SrcPop, and MostRelSeg.
– NEs in the relevant segment.
– NQts.
– URL features.
– W2V representations of both the claim and the relevant segment.

4.2 Experimental Setup

Parameters For SVC, we used an RBF kernel with regularization parameter C = 15 and an L2 penalty, and we set γ to 0.01 to avoid over-fitting. For the GB and RF models, we set the number of estimators to 100 and 150 respectively (based on preliminary experiments).
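These settings map directly to scikit-learn; the sketch below uses a toy feature matrix in place of the real page features, and the 70/30 split and macro-averaged F1 of Section 4.3 are used for evaluation (the averaging mode is an assumption, as it is not stated explicitly).

```python
# A sketch of the subtask B classifiers with the reported hyper-parameters;
# X and y are toy stand-ins for the page feature matrix and the four-class
# usefulness labels, and macro-F1 averaging is an assumption.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X = np.random.rand(200, 10)
y = np.random.choice([-1, 0, 1, 2], size=200)      # the four usefulness classes

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    'SVC': SVC(kernel='rbf', C=15, gamma=0.01),
    'GB': GradientBoostingClassifier(n_estimators=100),
    'RF': RandomForestClassifier(n_estimators=150),
}
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    print(name, f1_score(y_te, clf.predict(X_te), average='macro'))
```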
Baselines As a baseline, we adopted the method of Wang et al. [19] for feature extraction and classification. Their dataset consists of short passages classified into five different categories. This baseline was selected because their feature extraction methods are applied to short passages similar in size to our extracted relevant segments, and because they also target fine-grained classification. Since our training data is highly imbalanced, we also used the Zero Rule algorithm as a baseline for this subtask. The Zero Rule algorithm predicts the majority class in the dataset; in our training data, class -1 (non-relevant) is the majority class with 65% of the labels.

4.3 Evaluation on Training

We conducted multiple experiments to find which feature combination results in the best F1 score. We split our dataset into 70% for training and 30% for testing. From our experiments, we noticed that varying the similarity threshold when extracting relevant segments had a significant impact on the overall score. We concluded that our best performing models were the ones trained with features extracted using similarity thresholds of 0.4 and 0.7. Fig. 3 and Fig. 4 show the results obtained using similarity thresholds of 0.4 and 0.7 respectively. We observed that training the classifiers with the basic features and NEs improved the performance. On the other hand, incorporating some content features, such as the URL features and W2V vectors, had a negative impact on the performance of the classifiers. We also note that the ensemble classifiers (GB and RF) consistently outperformed the baselines and the other classifiers.

Fig. 3. Subtask B: Performance of classifiers on training data with combinations of features (cosine similarity threshold set to 0.4).

Fig. 4. Subtask B: Performance of classifiers on training data with combinations of features (cosine similarity threshold set to 0.7).

4.4 CLEF Evaluation

Runs As concluded in Section 4.3, the ensemble classifiers outperformed the SVC classifiers, so we picked GB and RF models for our runs. We selected the following models with a cosine similarity threshold of 0.7:

1. GB classifier trained with basic features.
2. GB classifier trained with basic features and NEs.

We also picked the following models with a cosine similarity threshold of 0.4:

3. GB classifier trained with basic features and NQts.
4. RF classifier trained with basic features and NQts.

Results Table 2 shows our training results compared to the official CLEF testing results. We notice that our best validation model, which combines the basic features with NEs and achieved an F1 score of 0.52, obtained a lower testing score. Meanwhile, our model that combines the basic features with NQts achieved a testing F1 score of 0.31. The inconsistency between the training and testing F1 scores can be attributed to the small training dataset of only 395 webpages; in addition, the class imbalance in the dataset could have caused the models to overfit. Our best model, with an F1 score of 0.31, achieved the highest score among all runs submitted for this subtask.

Table 2. Subtask B: Performance of CLEF submitted runs.

  Features                 Classifier   F1 on train   F1 on test
  Basic features           GB           0.48          0.16
  Basic features + NEs     GB           0.52          0.22
  Basic features + NQts    GB           0.47          0.31
  Basic features + NQts    RF           0.45          0.30

5 Subtask C: Classifying Passages

In this subtask [3], the goal is to identify, within the useful retrieved pages, the passages that are useful for claim verification. In this section, we present our proposed methodology, experimental evaluation, selected runs for this subtask, and the CLEF results.

5.1 Approach

Deciding whether a passage within a useful page is useful or not is a classification problem. Therefore, our methodology is based on using different machine learning classifiers, namely SVC, NB, and RF. We consider the following features for this subtask (a sketch assembling them into a single vector follows the list):

– BOW of the passage.
– MostRelSeg in the passage.
– ClmPsgSim.
– NQts in the passage.
– NEs in the passage.
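In the sketch below, the BOW vocabulary applies the "at least 7 occurrences in the training data" filter from Section 2.2, and the remaining features are concatenated into one vector; the whitespace tokenization and the helper and argument names are illustrative assumptions.

```python
# A sketch of the subtask C passage representation; tokenization by whitespace
# and the helper/argument names are assumptions for illustration.
import numpy as np
from collections import Counter

def bow_vocab(train_passages, min_count=7):
    """Keep only terms appearing at least `min_count` times in the training data."""
    counts = Counter(t for p in train_passages for t in p.split())
    return sorted(t for t, c in counts.items() if c >= min_count)

def bow_vector(passage, vocab):
    counts = Counter(passage.split())
    return np.array([counts[t] for t in vocab], dtype=float)

def passage_features(passage, vocab, clm_psg_sim, most_rel_seg, n_quotes, ne_counts):
    """Concatenate BOW with ClmPsgSim, MostRelSeg, NQts, and the 3 NE counts."""
    extra = np.array([clm_psg_sim, most_rel_seg, n_quotes, *ne_counts], dtype=float)
    return np.concatenate([bow_vector(passage, vocab), extra])
```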
5.2 Experimental Setup

Parameters The three classifiers mentioned in Section 5.1 were used in our experiments. We set the kernel for SVC to linear and the number of estimators for the RF models to 100 in all experiments. For Gaussian NB, we did not tune any hyper-parameters and used the default settings.

Baselines We compare our models against the majority-class baseline.

5.3 Evaluation on Training

Since we have only 6 claims in the dataset provided for subtask C, containing only 167 passages from 31 different pages, we considered LOO cross-validation in our experiments and used the F1 score as our evaluation metric. As shown in Fig. 5, SVC outperformed all other models with all groups of features; however, when the BOW features were excluded, Gaussian NB achieved the best performance of all. We also observed that the two best performing models were the SVC model with the NEs features excluded and the SVC model with the NQts feature excluded, achieving F1 scores of 0.444 and 0.43 respectively. We also noticed that the SVC model trained with all features performed better than when trained with BOW features only, achieving an F1 score of 0.427 as opposed to 0.387.

Fig. 5. Subtask C: Performance of classifiers on training data with combinations of features.

5.4 CLEF Evaluation

Runs As shown in Fig. 5, the SVC models outperformed the other classifiers except when the BOW features were excluded, in which case the NB model achieved the best F1 score. Therefore, we picked the 3 best SVC models and the best NB model to submit:

1. SVC trained with all features.
2. SVC trained with all features excluding the NQts feature.
3. SVC trained with all features excluding the NEs features.
4. NB trained with all features excluding the BOW features.

Results As shown in Table 3, in the official CLEF evaluation, our best performing model in the test phase was the SVC model trained with all features excluding the NQts feature, which achieved an F1 score of 0.4. The low F1 scores of our models can be attributed to the big difference between the training and testing data, which include passages from 6 claims and 59 claims respectively. Our highest scoring model is ranked 3rd out of the six runs submitted to the lab, and the maximum score achieved among all runs submitted for this subtask was 0.56.

Table 3. Subtask C: Performance of CLEF submitted runs.

  Features    Classifier   F1 on train   F1 on test
  All         SVC          0.423         0.39
  All-NQts    SVC          0.43          0.4
  All-NEs     SVC          0.44          0.19
  All-BOW     NB           0.38          0.37

6 Subtask D: Verifying Claims

The goal of this subtask is to identify whether the claim is “True” or “False”. For a claim to be true, it should have supporting evidence that verifies its factuality. In this section, we present our approach, experimental setup, and training results for verifying the claims. Then, we discuss the CLEF results for our submitted runs.

6.1 Approach

Deciding the factuality of a claim is a binary classification problem. Therefore, we propose a supervised learning approach using different classifiers: GB, RF, and Linear Discriminant Analysis (LDA). For this subtask, we select the most significant features from the webpages to classify the claim. Unlike the previous subtasks, we consider the SntCnt features to find the polarity of the webpage. In addition, we consider the usefulness of the article by using the most relevant segment, extracted as explained in Section 2, to represent the webpage. In our experiments, we consider the following features for our binary classifiers (a sketch of this claim-level aggregation follows the list):

– Similarity scores: out of all webpages associated with a claim, we consider only three scores: the maximum ClmTtlSim, ClmSnptSim, and MostRelSeg.
– NRelPages.
– For every claim, we select the webpage with the maximum MostRelSeg value and extract the following features from it: SrcPop and SntCnt.
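The sketch below aggregates per-page features into one claim-level vector; the per-page dictionary keys are hypothetical, and SrcPop is simplified to a single value rather than the 10-bin one-hot encoding of Section 2.4.

```python
# A sketch of the claim-level feature aggregation for subtask D; the per-page
# dictionary keys are hypothetical, and SrcPop is simplified to one value.
import numpy as np

def claim_features(pages, sim_threshold=0.6):
    max_title_sim = max(p['ClmTtlSim'] for p in pages)
    max_snippet_sim = max(p['ClmSnptSim'] for p in pages)
    max_seg_sim = max(p['MostRelSeg'] for p in pages)
    n_rel_pages = sum(p['MostRelSeg'] >= sim_threshold for p in pages)   # NRelPages
    best = max(pages, key=lambda p: p['MostRelSeg'])   # page most similar to the claim
    return np.array([max_title_sim, max_snippet_sim, max_seg_sim, n_rel_pages,
                     best['SrcPop'], best['PosWords'], best['NegWords']], dtype=float)
```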
6.2 Experimental Setup

Parameters For the GB and RF classifiers, we found that the default parameters work best (based on preliminary experiments). For the LDA classifier, we found that using 5 components for linear discrimination is most effective in terms of accuracy.

Baseline As a baseline for this subtask, we implemented the method of Karadzhov et al. [11], which classifies claims as “True” or “False” based on the top search results returned by several engines. They used an SVC classifier with an RBF kernel in their experiments. The inputs to the classifier are word embeddings of the most relevant segment in the webpage, the webpage snippet, and the claim. In addition to the word embeddings, the average and maximum similarity scores of the segments and snippets are included as features. We also adopt their method of segment extraction to compare against our approach.

6.3 Evaluation on Training

We conducted multiple experiments to find which feature combination results in the best factuality classification. Due to the limited size of the training dataset, we used 8-fold cross-validation for all our models in this subtask. We first experimented with different values of the cosine similarity threshold (0.4, 0.5, 0.6, and 0.7) when extracting relevant segments. In our unreported preliminary experiments, we observed that the best performing models were the ones trained with features extracted using a similarity threshold of 0.6, presented in Fig. 6. We noticed that the GB model trained with all features outperformed all other models. We also observed that our models outperformed the baseline most of the time, except when NRelPages was excluded from the features. We therefore conclude that the NRelPages and SntCnt features are useful for claim classification.

Fig. 6. Subtask D: Performance of classification models on training data with combinations of features (cosine similarity threshold set to 0.6).
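For reference, the 8-fold evaluation described above could be run as in the sketch below; the toy data, the choice of binary F1 scoring, and the default GB parameters are assumptions.

```python
# A sketch of the 8-fold cross-validation used in this subtask; the toy data
# and the binary F1 scorer are assumptions for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(32, 7)              # toy claim-level feature matrix
y = np.array([0, 1] * 16)              # toy True/False labels

scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=8, scoring='f1')
print(scores.mean())
```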
6.4 CLEF Evaluation

Runs Based on our training results presented in Section 6.3, we decided to use the models trained on all features to classify the claims’ factuality on the testing data. We selected the best ensemble classifiers with different similarity thresholds:

1. GB classifier, with similarity threshold 0.7.
2. GB classifier, with similarity threshold 0.4.
3. RF classifier, with similarity threshold 0.4.
4. RF classifier, with similarity threshold 0.6.

Results Table 4 shows our training results compared to the official CLEF testing results. Runs for subtask D were submitted over two cycles: in the first cycle, we classify the claims’ factuality using all webpages provided, while in the second cycle we use only the useful webpages. We present the results for the second cycle in this section. As presented in Table 4, all models achieved very similar F1 scores on the test data. However, our GB model trained with all features had the highest training and testing scores, achieving F1 scores of 0.91 and 0.53 for training and testing respectively. Our highest scoring model is ranked 4th out of the nine runs submitted to the lab, and the maximum score achieved among all runs submitted for this subtask was 0.62.

Table 4. Subtask D: Performance of CLEF submitted runs.

  Features   Classifier   F1 on train   F1 on test
  All        GB           0.91          0.53
  All        GB           0.83          0.51
  All        RF           0.80          0.53
  All        RF           0.66          0.51

7 Conclusion

In this paper, we presented our approach for Task 2 of the CLEF-2019 CheckThat! Lab. For subtask A, we proposed a pairwise learning-to-rank approach using different learning models to rank the retrieved pages based on their usefulness. Our best performing model, trained using the basic and similarity features (excluding source popularity), achieved an NDCG@20 of 0.55, which is the highest score among the 7 runs submitted for this subtask. For subtask B, we proposed a classification model incorporating the source popularity feature along with named entities. Our best performing model achieved an F1 score of 0.31, which is the highest score among the 8 runs submitted for this subtask. For subtask C, we proposed a classification model considering BOW, named entities, and the number of quotes extracted from passages. Our best performing model, trained with all features excluding the number of quotes, achieved an F1 score of 0.4 and was ranked 3rd. For subtask D, we proposed a classification model using sentiment features to find the polarity of the page, in addition to the number of potentially-relevant pages. Our best model, trained with all features, achieved an F1 score of 0.53 and was ranked 4th.

This was our first attempt, using the very small training data provided by the track organizers. With larger datasets, we plan to improve our classification models with more features, including word embeddings trained specifically for this task, and probably with deep learning models as well.

References

1. Atanasova, P., Nakov, P., Karadzhov, G., Mohtarami, M., Da San Martino, G.: Overview of the CLEF-2019 CheckThat! Lab on Automatic Identification and Verification of Claims. Task 1: Check-Worthiness
2. Baly, R., Karadzhov, G., Alexandrov, D., Glass, J.R., Nakov, P.: Predicting Factuality of Reporting and Bias of News Media Sources. CoRR abs/1810.01765 (2018), http://arxiv.org/abs/1810.01765
3. Elsayed, T., Nakov, P., Barrón-Cedeño, A., Hasanain, M., Suwaileh, R., Da San Martino, G., Atanasova, P.: CheckThat! at CLEF 2019: Automatic identification and verification of claims. In: European Conference on Information Retrieval. pp. 309–315. Springer (2019)
4. Elsayed, T., Nakov, P., Barrón-Cedeño, A., Hasanain, M., Suwaileh, R., Da San Martino, G., Atanasova, P.: Overview of the CLEF-2019 CheckThat!: Automatic Identification and Verification of Claims. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. LNCS, Lugano, Switzerland (September 2019)
5. Hasanain, M., Suwaileh, R., Elsayed, T., Barrón-Cedeño, A., Nakov, P.: Overview of the CLEF-2019 CheckThat! Lab on Automatic Identification and Verification of Claims. Task 2: Evidence and Factuality
6. Hassan, N., Adair, B., Hamilton, J.T., Li, C., Tremayne, M., Yang, J., Yu, C.: The quest to automate fact-checking. In: Proceedings of the 2015 Computation+Journalism Symposium (2015)
7. Hassan, N., Arslan, F., Li, C., Tremayne, M.: Toward automated fact-checking: Detecting check-worthy factual claims by ClaimBuster. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1803–1812. ACM (2017)
8. Hassan, N., Li, C., Tremayne, M.: Detecting check-worthy factual claims in presidential debates. In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management. pp. 1835–1838. ACM (2015)
9. Hassan, N., Zhang, G., Arslan, F., Caraballo, J., Jimenez, D., Gawsane, S., Hasan, S., Joseph, M., Kulkarni, A., Nayak, A.K., et al.: ClaimBuster: The first-ever end-to-end fact-checking system. Proceedings of the VLDB Endowment 10(12), 1945–1948 (2017)
10. Jaradat, I., Gencheva, P., Barrón-Cedeño, A., Màrquez, L., Nakov, P.: ClaimRank: Detecting check-worthy claims in Arabic and English. arXiv preprint arXiv:1804.07587 (2018)
11. Karadzhov, G., Nakov, P., Màrquez, L., Barrón-Cedeño, A., Koychev, I.: Fully automated fact checking using external sources. arXiv preprint arXiv:1710.00341 (2017)
12. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems. pp. 3111–3119 (2013)
13. Nakov, P., Barrón-Cedeño, A., Elsayed, T., Suwaileh, R., Màrquez, L., Zaghouani, W., Atanasova, P., Kyuchukov, S., Da San Martino, G.: Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. In: International Conference of the Cross-Language Evaluation Forum for European Languages. pp. 372–387. Springer (2018)
14. Popat, K., Mukherjee, S., Strötgen, J., Weikum, G.: CredEye: A credibility lens for analyzing and explaining misinformation. In: Companion Proceedings of The Web Conference 2018. pp. 155–158. International World Wide Web Conferences Steering Committee (2018)
15. Rashkin, H., Choi, E., Jang, J.Y., Volkova, S., Choi, Y.: Truth of varying shades: Analyzing language in fake news and political fact-checking. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 2931–2937 (2017)
16. Ruchansky, N., Seo, S., Liu, Y.: CSI: A hybrid deep model for fake news detection. In: Proceedings of the 2017 ACM Conference on Information and Knowledge Management. pp. 797–806. ACM (2017)
17. Soliman, A.B., Eissa, K., El-Beltagy, S.R.: AraVec: A set of Arabic word embedding models for use in Arabic NLP. Procedia Computer Science 117, 256–265 (2017)
18. Thorne, J., Vlachos, A., Christodoulopoulos, C., Mittal, A.: FEVER: a large-scale dataset for fact extraction and verification. arXiv preprint arXiv:1803.05355 (2018)
19. Wang, L., Wang, Y., de Melo, G., Weikum, G.: Five shades of untruth: Finer-grained classification of fake news. In: 2018 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). pp. 593–594. IEEE (2018)
20. Wang, W.Y.: “Liar, liar pants on fire”: A new benchmark dataset for fake news detection. arXiv preprint arXiv:1705.00648 (2017)
21. Yasser, K., Kutlu, M., Elsayed, T.: Re-ranking Web search results for better fact-checking: A preliminary study. In: Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM). pp. 1783–1786. ACM, Turin, Italy (2018)
22. Zhi, S., Sun, Y., Liu, J., Zhang, C., Han, J.: ClaimVerif: A real-time claim verification system using the Web and fact databases. In: Proceedings of the 2017 ACM Conference on Information and Knowledge Management. pp. 2555–2558. ACM (2017)