
Overview of CheckThat! 2020 Arabic: Automatic Identification and Verification of Claims in Social Media

Maram Hasanain

Fatima Haouari

Reem Suwaileh

Zien Sheikh Ali

Bayan Hamdan

bayan.hamdan995@gmail.com 3

Tamer Elsayed

telsayed@qu.edu.qa 0

Alberto Barrón-Cedeño

Giovanni Da San Martino

gmartino@hbku.edu.qa 2

Preslav Nakov 2

0 Computer Science and Engineering Department, Qatar University, Doha, Qatar

1 DIT, Università di Bologna, Forlì, Italy

2 Qatar Computing Research Institute, HBKU, Doha, Qatar

3 Research Consultant

In this paper, we present an overview of the Arabic tasks of the third edition of the CheckThat! lab at CLEF 2020. The lab featured three Arabic tasks over social media (and the Web): Task 1 on check-worthiness estimation, Task 3 on evidence retrieval, and Task 4 on claim verification. For evaluation, we collected a dataset of Arabic tweets and Web pages consisting of 7.5K tweets and 14,742 Web pages. The systems in the ranking tasks (Task 1 and Task 3) were evaluated using precision at 30 (P@30) and precision at 10 (P@10), respectively. F1 was the official evaluation measure for Task 4. Eight teams submitted runs to the Arabic tasks, which is double the number of teams participating in the Arabic tasks of the CheckThat! lab at CLEF 2019. The most successful approach to Task 1 used an Arabic pre-trained language model, while text similarity measures and linguistic features were used in the other tasks. We release to the research community all datasets from the lab, which should enable further research on automatic claim verification in Arabic social media.

With the rapid growth of social media such as Twitter, large amounts of fake and unverified claims have emerged and propagated, affecting online social media users as well as offline society. Thus, the automatic detection and verification of fake claims could help mitigate this negative development and benefit not only ordinary users, but also journalists and news agencies.

A plethora of studies have addressed the problem of claim identification [14, 15, 18, 28, 30, 34] and verification [9, 26, 27, 29, 38] in social media, but these tasks remain severely under-explored for Arabic [2, 3]. Similarly, check-worthiness estimation is under-explored in social media [1]. A considerable body of literature on check-worthiness estimation exists, but the focus has been mainly on political debates and speeches [21, 22, 24, 33].

In its third edition, the CheckThat! lab [7]⁵ focused on social media, specifically Twitter, with the aim of enabling the automatic verification of claims. This paper focuses on the three Arabic tasks offered by the CheckThat! lab in 2020:⁶

Task 1 Check-worthiness estimation for tweets: predict which tweet from a stream of tweets on a topic should be prioritized for fact-checking.

Task 3 Evidence retrieval: given a check-worthy claim in a tweet on a specific topic and a set of text snippets extracted from potentially relevant Web pages, return a ranked list of evidence snippets for the claim.

Task 4 Claim verification: given a check-worthy claim in a tweet and a set of potentially relevant Web pages, estimate the veracity of the claim.

The Arabic tasks attracted 8 teams, which submitted a total of 30 runs, and the most successful approaches fine-tuned existing pre-trained models, namely AraBERT [4] and multilingual BERT [16]. The datasets for the three tasks were created from scratch, as the goal this year was to focus on social media for the first time, as opposed to previous editions of the lab, which featured automatic identification and verification of political claims [32, 5, 8], and evidence-based claim verification [17, 6, 20]. We make the datasets available to the research community to support further research on the three tasks.⁷

For each of the Arabic tasks, we describe below the evaluation dataset created to support that task, we present a summary of the approaches used by the participating systems, and we discuss the evaluation results.

2 Task 1ar: Check-Worthiness on Tweets

Since check-worthiness estimation for tweets in general, and for Arabic tweets in particular, is a relatively new task, we constructed a new dataset specifically designed for training and testing systems for this task. We identified the need for a "context" that affects the check-worthiness of tweets, and we used "topics" to represent that context. Given a topic, we define a check-worthy tweet as a tweet that is relevant to the topic, contains one main claim that can be fact-checked by consulting reliable sources, and is important enough to be worthy of verification. More on the annotation criteria is presented later in this section.

5 https://sites.google.com/view/clef2020-checkthat/

6 Refer to [35] to read about the English tasks of the CheckThat! 2020 lab.

7 https://sites.google.com/view/clef2020-checkthat/

Figure 1. An example topic from the training dataset (English translation).

Topic title: Intervention of Turkey in Syria

Topic description: After 9 years of war in Syria since the eruption of the revolution in 2011, Turkey decided to intervene in Syria with the declared aim of protecting civilians and deterring Syrian and foreign forces from aggravating the situation, and killing and displacing civilians, which ignited public opinion in the world. This topic talks about developments related to the Turkish military intervention in Syria on all aspects.

In order to construct the dataset for this task, we first manually created fifteen topics over a period of several months. We selected topics that were trending at the time among Arab social media users. Each topic was represented using a short title and a much longer text description. Figure 1 shows an example topic from the training dataset.

Examples of other topic titles include "Coronavirus in the Arab World", "Sudan and normalization", and "Deal of the century". We augmented each topic with a set of keywords, hashtags, and usernames to track in Twitter. Once we had created a topic, we immediately crawled a one-week stream of tweets using the constructed search terms, where we searched Twitter (via the Twitter search API) using each term at the end of each day. We limited the search to original Arabic tweets (i.e., we excluded retweets). We then de-duplicated the tweets and applied a qualification filter that drops tweets containing terms from a blacklist of explicit terms, as well as tweets that contain more than four hashtags or more than two URLs. Afterwards, we ranked the tweets by popularity (defined as the sum of their retweets and likes), and we selected the top 500 to be annotated.
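The selection steps above (excluding retweets, de-duplication, the qualification filter, and popularity ranking) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to build the dataset: the `Tweet` record, the function names, whitespace tokenization, and de-duplication by whitespace-normalized text are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:
    # Hypothetical record; field names are assumptions for this sketch.
    tweet_id: str
    text: str
    retweets: int
    likes: int
    is_retweet: bool = False


def passes_filter(tweet, blacklist, max_hashtags=4, max_urls=2):
    """Qualification filter: no blacklisted term, at most four hashtags, at most two URLs."""
    tokens = tweet.text.split()
    hashtags = sum(1 for tok in tokens if tok.startswith("#"))
    urls = sum(1 for tok in tokens if tok.startswith(("http://", "https://")))
    has_blacklisted = any(term in tweet.text for term in blacklist)
    return not has_blacklisted and hashtags <= max_hashtags and urls <= max_urls


def select_for_annotation(tweets, blacklist, top_n=500):
    """Drop retweets and duplicates, apply the filter, keep the top_n most popular tweets."""
    seen_texts = set()
    candidates = []
    for tweet in tweets:
        if tweet.is_retweet:
            continue
        key = " ".join(tweet.text.split())  # assumed de-duplication key
        if key in seen_texts:
            continue
        seen_texts.add(key)
        if passes_filter(tweet, blacklist):
            candidates.append(tweet)
    # Popularity is the sum of retweets and likes, as defined in the text.
    candidates.sort(key=lambda t: t.retweets + t.likes, reverse=True)
    return candidates[:top_n]
```

Sorting after filtering keeps the popularity cut-off from being spent on tweets that would be discarded anyway.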

The annotation process was performed in two steps: we first identified the tweets that are relevant to the topic and contain factual claims, then we identified the check-worthy tweets among those relevant tweets.

We first recruited one annotator to annotate each tweet for its relevance with respect to the target topic. In this step, we labeled each tweet as one of three categories:

- Non-relevant tweet for the target topic.

- Relevant tweet but with no factual claims, such as a tweet expressing an opinion about the topic, a reference, or speculation about the future.

- Relevant tweet that contains a factual claim and that can be fact-checked by consulting reliable sources.

Only relevant tweets with factual claims were labelled for check-worthiness. Two annotators annotated those tweets first. A third expert annotator performed disagreement resolution whenever needed. Due to the subjective nature of check-worthiness, we chose to represent the check-worthiness criteria by several questions, to help the annotators think about different aspects of check-worthiness. The annotators were asked to answer the following three questions for each tweet, using a Likert scale between 1 and 5:

1. Do you think the claim in the tweet is of interest to the public?

2. To what extent do you think the claim can negatively affect the reputation of an entity, country, etc.?

3. Do you think journalists will be interested in covering the spread of the claim or the information discussed by the claim?

Once the annotator has answered the above questions, s/he is required to answer the following fourth question considering all the ratings given previously: Do you think the claim in the tweet is check-worthy? This is a yes/no question, and the resulting answer is the label we use to represent check-worthiness in this dataset. Figure 2 shows an example of a tweet making a check-worthy claim.

For the final set, all tweets but those labelled as check-worthy were considered not check-worthy. Given 500 tweets annotated for each of the fifteen topics, the annotated set contained 2,062 check-worthy claims (27.5%). Three topics constituted the training set, and the remaining twelve topics were used to evaluate the participating systems.
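A small sketch of how the two annotation steps could be collapsed into the final binary label, together with a check of the reported proportion. The category name, the function names, and the way disagreements are passed in are assumptions for illustration. Only the counts (15 topics, 500 tweets each, 2,062 check-worthy claims) come from the text.

```python
from typing import Optional

RELEVANT_WITH_CLAIM = "relevant_claim"  # hypothetical name for the third relevance category


def resolve_check_worthiness(first: bool, second: bool, expert: Optional[bool] = None) -> bool:
    """Two annotators label the tweet; a third expert decides only when they disagree."""
    if first == second:
        return first
    if expert is None:
        raise ValueError("annotators disagree and no expert decision was given")
    return expert


def final_label(relevance: str, check_worthy: Optional[bool]) -> int:
    """Every tweet not explicitly labelled check-worthy counts as not check-worthy (0)."""
    if relevance != RELEVANT_WITH_CLAIM:
        return 0
    return 1 if check_worthy else 0


# Sanity check of the reported statistics.
TOPICS, TWEETS_PER_TOPIC, CHECK_WORTHY = 15, 500, 2062
total_tweets = TOPICS * TWEETS_PER_TOPIC          # 7,500 tweets overall
check_worthy_share = CHECK_WORTHY / total_tweets  # about 0.275, i.e. 27.5%
```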

2.2 Overview of the Approaches

Eight teams participated in this task, submitting a total of 26 runs. Table 1 shows an overview of the approaches. The most successful runs fine-tuned existing pre-trained models, namely AraBERT and multilingual BERT. Other approaches relied on pre-trained models such as GloVe, word2vec, and Language-Agnostic SEntence Representations (LASER) to obtain embeddings for the tweets, which were fed either into a neural network or into other machine learning models, such as SVM. In addition to text representations, some teams used other features, namely morphological and syntactic features, part-of-speech (POS) tags, named entities, and sentiment features.

2.3 Evaluation

We treated Task 1 as a ranking problem, where we expect check-worthy tweets to be ranked at the top. We evaluated the runs using precision at k (P@k) and Mean Average Precision (MAP). We considered P@30 as the official evaluation measure, as we anticipated that the user would check a maximum of 30 claims per week. We also developed two simple baselines: baseline 1, which ranks tweets in descending order of their popularity score (the sum of likes and retweets a tweet has received), and baseline 2, which ranks tweets in reverse chronological order, i.e., the most recent ones first. Table 2 shows the performance of the best run per team in addition to the two baselines, ranked by the official measure. We can see that most teams managed to improve over the two baselines by a large margin.
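The ranking measures and the two baselines described above can be sketched as follows. This is a minimal illustration rather than the lab's official scorer: the function names, the 0/1 label lists, and the tweet dictionary keys (`retweets`, `likes`, `created_at`) are assumptions made for the example. P@k divides by k, and average precision is normalized by the number of check-worthy tweets that appear in the ranking, which matches the usual definition when a run ranks every tweet of a topic.

```python
from statistics import mean


def precision_at_k(ranked_labels, k):
    """Fraction of check-worthy tweets (label 1) among the top k of a ranked list."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(ranked_labels[:k]) / k


def average_precision(ranked_labels):
    """Mean of the precision values at each rank where a check-worthy tweet occurs."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings_per_topic):
    """MAP: average precision averaged over topics."""
    return mean(average_precision(labels) for labels in rankings_per_topic)


def popularity_baseline(tweets):
    """Baseline 1: descending popularity (likes plus retweets)."""
    return sorted(tweets, key=lambda t: t["retweets"] + t["likes"], reverse=True)


def recency_baseline(tweets):
    """Baseline 2: reverse chronological order, most recent first."""
    return sorted(tweets, key=lambda t: t["created_at"], reverse=True)
```

For a topic, the official score would then be `precision_at_k(labels, 30)`, where `labels` holds the gold labels in the order produced by a run.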

3 Task 3ar: Evidence Retrieval

Evidence retrieval represents the second major step in an automatic fact-checking system, where evidence is collected to be used for claim verification. Potentially, systems can extract evidence from any source. However, in order to unify the evaluation setup and to ensure that all systems have access to the same source of evidence, this was defined as a ranking task over a set of text snippets provided along with check-worthy claims. We define an evidence snippet as a text snippet from a Web page that constitutes evidence supporting or refuting the claim.

For this task, we needed a set of claims and a set of potentially relevant Web pages, from which evidence snippets would be extracted. We first collected the set of Web pages using the topics for Task 1. While developing the topics, we represented each one by a set of search phrases. We used these phrases as queries to Google Web search daily, and over a week we collected a set of Web pages, which we then used to construct a dataset for the task.

As for the set of claims, we drew a random sample from the check-worthy tweets identified for each topic in Task 1. Since the data from Task 2, Subtask C in last year's edition of the lab could be used for training [20], we only released test claims and Web pages for the twelve test topics used in Task 1. The dataset for this task contains a total of 200 claims and 14,742 corresponding Web pages.

Since we sought a controlled method to allow systems to return snippets, which in turn would allow us to label a consistent set of potential evidence snippets, we automatically pre-split these pages into snippets, which we eventually released for each page. To extract snippets, we first de-duplicated the crawled Web pages using the URL. Then, we extracted the textual content from the HTML document after removing any markup and scripts. Finally, we detected the Arabic text and we split it into snippets, using full stops, question marks, or exclamation marks as delimiters. Overall, we extracted 169,902 snippets.
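The snippet extraction steps above can be sketched with the Python standard library. This is an illustration under stated assumptions, not the pipeline used for the dataset: the function names are hypothetical, Arabic text is detected with a simple Unicode-range test (a snippet is kept if it contains at least one character from the basic Arabic block), and the Arabic question mark is added to the delimiter set.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes while skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")  # assumed Arabic detection rule
DELIMITERS = re.compile(r"[.!?\u061F]+")      # full stop, exclamation mark, Latin and Arabic question marks


def html_to_text(html):
    """Strip markup and scripts, then normalize whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def split_into_snippets(text):
    """Split on sentence delimiters and keep only pieces that contain Arabic characters."""
    pieces = (piece.strip() for piece in DELIMITERS.split(text))
    return [piece for piece in pieces if piece and ARABIC_CHAR.search(piece)]


def extract_snippets(pages):
    """pages: iterable of (url, html) pairs; the first page seen for each URL is kept."""
    snippets_by_url = {}
    for url, html in pages:
        if url in snippets_by_url:
            continue  # de-duplicate crawled pages by URL
        snippets_by_url[url] = split_into_snippets(html_to_text(html))
    return snippets_by_url
```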

Figure 3. A check-worthy tweet with (a) an evidence snippet and (b) a non-evidence snippet (English translations).

Tweet: In its daily report about novel Coronavirus development, Rafiq AlHariri University Hospital announced that 4 new cases were reported, and thus the number of positive cases in Lebanon increased to 32: 29 of them are stable, and 3 are in a critical situation.

(a) Evidence snippet: "The statement indicated that 4 new infections were recorded, and thus the number of positive cases in Lebanon increased to 32, stressing that 29 of them are stable, and 3 are in a critical situation"

(b) Non-evidence snippet: "In Lebanon, the Rafic Hariri Hospital (governmental) in the capital, Beirut, announced that Corona infections had risen to 22, after 6 new infections were recorded."

Due to the large number of snippets collected for the claims, annotating all pairs of claims and snippets was infeasible given the limited amount of time we had. Therefore, we followed a pooling method: we annotated pooled evidence snippets returned from the submitted runs by the participating systems. Since the official evaluation measure for the task was P@10, we first extracted the top 10 evidence snippets returned by each run for each claim. We then created a pool of unique snippets per claim (considering both snippet IDs and content for de-duplication). Finally, a single person annotated each snippet for a claim. The annotators were asked to decide whether a snippet contained evidence that would be useful for verifying the input claim. This evidence can be statistics, quotes, facts extracted from verified sources, etc.
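The pooling procedure can be sketched as below. The data layout (each run as a dictionary from claim ID to a ranked list of snippet IDs, plus a dictionary from snippet ID to text) and the whitespace normalization used to compare snippet content are assumptions for this example. The pool depth of 10 follows the official measure.

```python
def build_pool(runs, snippet_text, depth=10):
    """Collect, per claim, the unique snippets among the top `depth` results of every run.

    runs: list of dicts mapping claim_id -> ranked list of snippet IDs (assumed layout).
    snippet_text: dict mapping snippet_id -> snippet content.
    Returns a dict mapping claim_id -> list of snippet IDs to annotate.
    """
    pool = {}
    seen_ids = {}
    seen_texts = {}
    for run in runs:
        for claim_id, ranking in run.items():
            ids = seen_ids.setdefault(claim_id, set())
            texts = seen_texts.setdefault(claim_id, set())
            for snippet_id in ranking[:depth]:
                content = " ".join(snippet_text[snippet_id].split())
                # A snippet is a duplicate if either its ID or its content was already pooled.
                if snippet_id in ids or content in texts:
                    continue
                ids.add(snippet_id)
                texts.add(content)
                pool.setdefault(claim_id, []).append(snippet_id)
    return pool
```

Pooling keeps the annotation effort proportional to the number of runs rather than to the full set of snippets, at the cost of leaving unreturned snippets unjudged.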

Figure 3 shows an example of a check-worthy tweet. We observe that the example evidence snippet (Fig. 3a) repeats the same information from the tweet, referring to a report as the source of the information. While the non-evidence snippet (Fig. 3b) is also closely related to the tweet, it states a smaller number of infections, since the snippet was extracted from a Web page posted a day before the tweet posting time.

Overall, we annotated 3,380 snippets. After label propagation, we had 3,720 annotated snippets, of which only 95 were evidence snippets. Our annotation volume was limited due to the very small number of runs participating in the task (two runs submitted by one team).

Only one team, EvolutionTeam [36], participated in the task, and they submitted two runs. They used the cosine similarity between the claim and the snippet as their ranking score to rank the candidate evidence snippets. In a second run, the similarity was weighted by the intersection between the snippet and a lexicon of sentiment words.

This task was modeled as a ranking problem, where the system is expected to rank the evidence snippets at the top of the list. In order to evaluate the submitted runs, we computed P@k at different cutoffs (k = 1, 5, and 10). The official measure was P@10.
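To make the similarity-based ranking concrete, here is a minimal sketch of cosine similarity over term-count vectors. It is not EvolutionTeam's implementation: whitespace tokenization, raw term counts, and in particular the weighting formula `1 + overlap` for the sentiment lexicon are hypothetical choices, since the text only says that the similarity was weighted by the intersection with a lexicon.

```python
import math
from collections import Counter


def cosine_similarity(a_tokens, b_tokens):
    """Cosine of the angle between two bag-of-words term-count vectors."""
    a, b = Counter(a_tokens), Counter(b_tokens)
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_snippets(claim, snippets, sentiment_lexicon=None):
    """Rank snippet IDs by similarity to the claim, best first.

    snippets: dict mapping snippet_id -> text.
    sentiment_lexicon: optional set of words; if given, the score is multiplied by
    (1 + number of distinct lexicon words in the snippet), a hypothetical weighting.
    """
    claim_tokens = claim.split()
    scored = []
    for snippet_id, text in snippets.items():
        tokens = text.split()
        score = cosine_similarity(claim_tokens, tokens)
        if sentiment_lexicon is not None:
            score *= 1.0 + len(set(tokens) & sentiment_lexicon)
        scored.append((score, snippet_id))
    # Sort by descending score, breaking ties by snippet ID for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [snippet_id for _, snippet_id in scored]
```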

The participating team's best-performing run achieved an average P@10 of 0.0456 over the claims.

4 Task 4ar: Claim Verification

Starting with the same 200 claims used in Task 3, one expert fact-checker verified each claim's veracity. We limited the annotation categories to two, True and False, excluding partially-true claims. A True claim is a claim that is supported by a reliable source that confirms the authenticity of the information published in the tweet. A False claim is a claim that mentions information contradicting that in a reliable source or that has been explicitly refuted by a reliable source.

4.1 Dataset

The claims in the tweets were annotated considering two main factors: the content of the tweet (claim) and the date of the tweet publication. For the annotation, we considered supporting or refuting information that was reported before, on, or a few days after the time of the claim. We consulted several reliable sources to verify the claims. These sources differed depending on the topic of the claim. For example, for health-related claims, we consulted refereed studies or articles published in reliable medical journals or websites, such as APA.

Out of the initial 200 claims, we ended up with 165 claims for which we managed to find a definite label. Only six claims among these 165 were found to be False. Since data from Task 2, Subtask D in last year's edition of the lab can be used for training [20], the final set of 165 annotated claims was used to evaluate the submitted runs.

Figure 4. Example tweets with (a) a True claim and (b) a False claim (English translations).

Translation (a). Ministry of Health announced the return of 10 Saudi students from China. The students were placed in precautionary isolation for a period of two weeks in appropriate housing, accompanied by specialized medical teams. This comes as part of the precautionary measures and continuous monitoring of the Novel Coronavirus.

Translation (b). The number of new deaths due to Coronavirus in China has reached 212, and 8,900 are infected.

There were two runs submitted by EvolutionTeam [36]. They used a scoring function that computes the degree of concordance and negation (using a manual list) between a claim and all input text snippets for that claim.

We treated the task as a classification problem and we used typical evaluation measures for such tasks in the case of class imbalance: F1 measure (official), Precision, and Recall. The best run achieved a macro-averaged F1 score of 0.5524.
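A short sketch of macro-averaged F1 shows why it suits this heavily imbalanced test set (159 True and 6 False claims, from the counts above). The function names and the string labels are assumptions for the example. The macro average is taken as the unweighted mean of the per-class F1 scores, which is the standard definition.

```python
def precision_recall_f1(gold, predicted, label):
    """Precision, recall, and F1 for one class, treating `label` as the positive class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(gold, predicted, labels=("True", "False")):
    """Unweighted mean of per-class F1, so the rare class counts as much as the frequent one."""
    return sum(precision_recall_f1(gold, predicted, label)[2] for label in labels) / len(labels)


# Illustration: a system that always answers "True" on the 165 test claims.
gold = ["True"] * 159 + ["False"] * 6
always_true = ["True"] * 165
majority_macro_f1 = macro_f1(gold, always_true)  # F1 is 318/324 for True and 0 for False
```

Under these counts, always predicting True is correct on 159 of 165 claims, yet its macro-averaged F1 stays just below 0.5 because the False class contributes an F1 of zero.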

Conclusion and Future Work

In this overview paper, we presented a description of the three Arabic tasks that were offered as part of the third edition of the CheckThat! lab at CLEF 2020. Unlike previous editions of the lab, this time we focused on false information propagated on Arabic social media (specifically, on Twitter). Task 1 on check-worthiness ranking of tweets attracted the highest number of participating teams. Generally, the best approaches for that task relied on pre-trained language models such as multilingual BERT and AraBERT. In contrast, only one team participated in Tasks 3 and 4. We suspect that the low number of participants in these two tasks was due to the lack of new training data provided for this edition of the lab.

For future editions of the lab, we plan to focus on Task 1, since it is a very critical step in the process of automatic verification over social media, where a huge stream of tweets needs to be processed in order to identify claims that are worth fact-checking.

Acknowledgments

This work was made possible in part by NPRP grant# NPRP11S-1204-170060 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the authors. The work of Reem Suwaileh was supported by GSRA grant# GSRA5-1-0527-18082 from the Qatar National Research Fund and the work of Fatima Haouari was supported by GSRA grant# GSRA6-1-0611-19074 from the Qatar National Research Fund.

This research is also part of the Tanbih project, which aims to limit the effect of disinformation, "fake news", propaganda, and media bias.

References

1. Alam, F., Shaar, S., Nikolov, A., Mubarak, H., Martino, G.D.S., Abdelali, A., Dalvi, F., Durrani, N., Sajjad, H., Darwish, K., et al.: Fighting the COVID-19 infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society. arXiv preprint arXiv:2005.00033 (2020)

2. Alkhair, M., Meftouh, K., Smaïli, K., Othman, N.: An Arabic corpus of fake news: Collection, analysis and classification. In: Proceedings of the International Conference on Arabic Language Processing. pp. 292–302. ICALP '19, Springer, Nancy, France (2019)

3. Alzanin, S.M., Azmi, A.M.: Rumor detection in Arabic tweets using semi-supervised and unsupervised expectation–maximization. Knowledge-Based Systems 185, 104945 (2019)

4. Antoun, W., Baly, F., Hajj, H.: AraBERT: Transformer-based model for Arabic language understanding. In: Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools. pp. 9–15. OSAC '20, Marseille, France (2020)

5. Atanasova, P., Màrquez, L., Barrón-Cedeño, A., Elsayed, T., Suwaileh, R., Zaghouani, W., Kyuchukov, S., Da San Martino, G., Nakov, P.: Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. Task 1: Check-worthiness. In: Cappellato et al. [12]

6. Atanasova, P., Nakov, P., Karadzhov, G., Mohtarami, M., Da San Martino, G.: Overview of the CLEF-2019 CheckThat! lab on automatic identification and verification of claims. Task 1: Check-worthiness. In: Cappellato et al. [11]

7. Barrón-Cedeño, A., Elsayed, T., Nakov, P., Da San Martino, G., Hasanain, M., Suwaileh, R., Haouari, F., Babulkov, N., Hamdan, B., Nikolov, A., Shaar, S., Sheikh Ali, Z.: Overview of CheckThat! 2020: Automatic identification and verification of claims in social media. In: Arampatzis, A., Kanoulas, E., Tsikrika, T., Vrochidis, S., Joho, H., Lioma, C., Eickhoff, C., Névéol, A., Cappellato, L., Ferro, N. (eds.) Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020). LNCS (12260), Springer (2020)

8. Barrón-Cedeño, A., Elsayed, T., Suwaileh, R., Màrquez, L., Atanasova, P., Zaghouani, W., Kyuchukov, S., Da San Martino, G., Nakov, P.: Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims. Task 2: Factuality. In: Cappellato et al. [12]

9. Bian, T., Xiao, X., Xu, T., Zhao, P., Huang, W., Rong, Y., Huang, J.: Rumor detection on social media with bi-directional graph convolutional networks. In: Proceedings of the AAAI Conference on Artificial Intelligence. AAAI '20, vol. 34, pp. 549–556. New York, NY, USA (2020)

10. Cappellato, L., Eickhoff, C., Ferro, N., Névéol, A. (eds.): CLEF 2020 Working Notes. CEUR Workshop Proceedings, CEUR-WS.org (2020)

11. Cappellato, L., Ferro, N., Losada, D., Müller, H. (eds.): Working Notes of CLEF 2019 – Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org (2019)

12. Cappellato, L., Ferro, N., Nie, J.Y., Soulier, L. (eds.): Working Notes of CLEF 2018 – Conference and Labs of the Evaluation Forum. CEUR Workshop Proceedings, CEUR-WS.org (2018)

13. Cheema, G.S., Hakimov, S., Ewerth, R.: Check square at CheckThat! 2020: Claim detection in social media via fusion of transformer and syntactic features. In: Cappellato et al. [10]

14. Chen, T., Li, X., Yin, H., Zhang, J.: Call attention to rumors: Deep attention based recurrent neural networks for early rumor detection. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining. pp. 40–52. PAKDD '18, Springer, Melbourne, Australia (2018)

15. Chen, Y., Sui, J., Hu, L., Gong, W.: Attention-residual network with CNN for rumor detection. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management. pp. 1121–1130. CIKM '19, Beijing, China (2019)

16. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 4171–4186. NAACL-HLT '19, Minneapolis, Minnesota, USA (2019)

17. Elsayed, T., Nakov, P., Barrón-Cedeño, A., Hasanain, M., Suwaileh, R., Da San Martino, G., Atanasova, P.: Overview of the CLEF-2019 CheckThat!: Automatic identification and verification of claims. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. pp. 301–321. LNCS, Springer (2019)

18. Gao, J., Han, S., Song, X., Ciravegna, F.: RP-DNN: A tweet level propagation context based deep neural networks for early rumor detection in social media. In: Proceedings of the 12th Language Resources and Evaluation Conference. pp. 6094–6105. LREC '20, Marseille, France (2020)

19. Hasanain, M., Elsayed, T.: bigIR at CheckThat! 2020: Multilingual BERT for ranking Arabic tweets by check-worthiness. In: Cappellato et al. [10]

20. Hasanain, M., Suwaileh, R., Elsayed, T., Barrón-Cedeño, A., Nakov, P.: Overview of the CLEF-2019 CheckThat! lab on automatic identification and verification of claims. Task 2: Evidence and factuality. In: Cappellato et al. [11]

21. Hassan, N., Arslan, F., Li, C., Tremayne, M.: Toward automated fact-checking: Detecting check-worthy factual claims by ClaimBuster. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 1803–1812. Halifax, NS (2017)

22. Hassan, N., Li, C., Tremayne, M.: Detecting check-worthy factual claims in presidential debates. In: Proceedings of the 24th ACM International Conference on Information and Knowledge Management. pp. 1835–1838. CIKM '15, Melbourne, Australia (2015)

23. Hussein, A., Hussein, A., Ghneim, N., Joukhadar, A.: DamascusTeam at CheckThat! 2020: Check worthiness on Twitter with hybrid CNN and RNN models. In: Cappellato et al. [10]

24. Jaradat, I., Gencheva, P., Barrón-Cedeño, A., Màrquez, L., Nakov, P.: ClaimRank: Detecting check-worthy claims in Arabic and English. In: Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics. pp. 26–30. NAACL-HLT '18, New Orleans, Louisiana, USA (2018)

25. Kartal, Y.S., Kutlu, M.: TOBB ETU at CheckThat! 2020: Prioritizing English and Arabic claims based on check-worthiness. In: Cappellato et al. [10]

26. Khoo, L.M.S., Chieu, H.L., Qian, Z., Jiang, J.: Interpretable rumor detection in microblogs by attending to user interactions. In: Proceedings of the AAAI Conference on Artificial Intelligence. pp. 8783–8790. AAAI '20, New York, NY, USA (2020)

27. Liu, Y., Wu, Y.F.B.: Early detection of fake news on social media through propagation path classification with recurrent and convolutional networks. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence. pp. 354–361. AAAI '18, New Orleans, Louisiana, USA (2018)

28. Ma, J., Gao, W., Mitra, P., Kwon, S., Jansen, B.J., Wong, K.F., Cha, M.: Detecting rumors from microblogs with recurrent neural networks. In: Proceedings of the 25th International Joint Conference on Artificial Intelligence. pp. 3818–3824. IJCAI '16, New York, NY, USA (2016)

29. Ma, J., Gao, W., Wong, K.F.: Detect rumors in microblog posts using propagation structure via kernel learning. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. pp. 708–717. ACL '17, Vancouver, Canada (2017)

30. Ma, J., Gao, W., Wong, K.F.: Detect rumors on Twitter by promoting information campaigns with generative adversarial learning. In: Proceedings of the World Wide Web Conference. pp. 3049–3055. WWW '19, San Francisco, CA, USA (2019)

31. Martinez-Rico, J., Araujo, L., Martinez-Romo, J.: NLP&IR@UNED at CheckThat! 2020: A preliminary approach for check-worthiness and claim retrieval tasks using neural networks and graphs. In: Cappellato et al. [10]

32. Nakov, P., Barrón-Cedeño, A., Elsayed, T., Suwaileh, R., Màrquez, L., Zaghouani, W., Gencheva, P., Kyuchukov, S., Da San Martino, G.: Overview of the CLEF-2018 lab on automatic identification and verification of claims in political debates. In: Working Notes of CLEF 2018 – Conference and Labs of the Evaluation Forum. CLEF '18, Avignon, France (2018)

33. Patwari, A., Goldwasser, D., Bagchi, S.: TATHYA: a multi-classifier system for detecting check-worthy statements in political debates. In: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. pp. 2259–2262. CIKM '17, Singapore (2017)

34. Santhoshkumar, S., Babu, L.D.: Earlier detection of rumors in online social networks using certainty-factor-based convolutional neural networks. Social Network Analysis and Mining 10(1), 1–17 (2020)

35. Shaar, S., Nikolov, A., Babulkov, N., Alam, F., Barrón-Cedeño, A., Elsayed, T., Hasanain, M., Suwaileh, R., Haouari, F., Da San Martino, G., Nakov, P.: Overview of CheckThat! 2020 English: Automatic identification and verification of claims in social media. In: Cappellato et al. [10]

36. Touahri, I., Mazroui, A.: EvolutionTeam at CheckThat! 2020: Integration of linguistic and sentimental features in a fake news detection approach. In: Cappellato et al. [10]

37. Williams, E., Rodrigues, P., Novak, V.: Accenture at CheckThat! 2020: If you say so: Post-hoc fact-checking of claims using transformer-based models. In: Cappellato et al. [10]

38. Zhang, Q., Lipani, A., Liang, S., Yilmaz, E.: Reply-aided detection of misinformation via Bayesian deep learning. In: Proceedings of the World Wide Web Conference. pp. 2333–2343. WWW '19, San Francisco, CA, USA (2019)