<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">DL-TXST FakeNews: Enhancing Tweet Content Classification with Adapted Language Models</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Muhieddine</forename><surname>Shebaro</surname></persName>
							<email>m.shebaro@txstate.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Computer Science</orgName>
								<orgName type="institution">Texas State University</orgName>
								<address>
									<settlement>San Marcos TX</settlement>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jason</forename><surname>Oliver</surname></persName>
							<email>jasonoliver@txstate.edu</email>
							<affiliation key="aff0">
								<orgName type="department">Computer Science</orgName>
								<orgName type="institution">Texas State University</orgName>
								<address>
									<settlement>San Marcos TX</settlement>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Tomiwa</forename><surname>Olarewaju</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Computer Science</orgName>
								<orgName type="institution">Texas State University</orgName>
								<address>
									<settlement>San Marcos TX</settlement>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jelena</forename><surname>Tešić</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Computer Science</orgName>
								<orgName type="institution">Texas State University</orgName>
								<address>
									<settlement>San Marcos TX</settlement>
									<country key="US">USA</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">DL-TXST FakeNews: Enhancing Tweet Content Classification with Adapted Language Models</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">F4DEE0B5F507CC9EAD1627869232A271</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T19:59+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The DL-TXST team's runs submitted to this year's MediaEval Fake News task focused on improving the baseline benchmark's pre-processing and modeling. We introduced features learned from large, adapted language models. The predictive power of our pipeline was strongest when we included a BERT model tuned to Tweet content; our best Subtask 1 run reached an MCC of 0.106 on the test set.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>With today's technology, breaking news, from the latest celebrity gossip to updates on unprecedented events like the COVID-19 pandemic, is now available with just a few taps on a smartphone. As the volume of readily available information has grown, so has misinformation. Fake news is specifically designed to plant a seed of mistrust and exacerbate existing social and cultural dynamics by misusing political, regional, and religious undercurrents <ref type="bibr">[1]</ref>. "In 2019, 8 percent of engagement with the 100 top-performing news sources on social media was dubious. In 2020, that number more than doubled to 17 percent" <ref type="bibr" target="#b3">[3]</ref>. Twitter's purpose has been advertised to the public as a platform that "uniquely provides its users the opportunity to discover what's happening in the world" <ref type="bibr" target="#b4">[4]</ref>. Unique includes fake, so the Twitter platform has become an easy target for the rapid dissemination of skewed facts, as seen in the attribution of the COVID-19 pandemic to novel 5G technology. Topical automated classification systems with strong predictive power across innumerable conspiracies are urgently needed to curb the spread of inaccurate news. In this paper, we focus on content-based fake news detection strategies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORKS</head><p>The problem of misinformation on social media confronts every user of a social media site. These users, as well as the private companies that run the sites, have a vested interest in ensuring that the information on the platform benefits its consumers, the users. For most users, this means that information is accurate and can be trusted as valid. For example, rumors have surfaced in the past about McDonald's use of worm filler in its food, causing tremendous boycott threats <ref type="bibr" target="#b1">[2]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">DATA MANAGEMENT</head><p>The most recent data is collected from MediaEval's FakeNews: Coronavirus and 5G Conspiracy benchmark <ref type="bibr" target="#b6">[6]</ref> and is integrated with the data of our previous analysis, retrieved using the Twitter API. We applied several pre-processing methods to the data. First, we used the baseline pre-processing <ref type="bibr" target="#b5">[5]</ref>, which included converting to lowercase, removing punctuation, preserving URLs, removing stop words, and normalizing terms ("u.k" to "UK"). Our pre-processing enhancements to the pipeline this year include removing usernames (Twitter handles), removing all special characters, removing hashtags, expanding contractions (e.g., converting "won't" to "will" and "not"), removing non-English Tweets if present, removing links (not only "https://t.co/" but also "http" and "www" variants), and removing emojis. In the datasets for Subtasks 2 and 3, each Tweet was divided into several parts, each in a separate column; we merged them into one column of the data frame, separated by a space. The validation size was set to 0.2 to partition our dataset for the sake of evaluating our model's predictive power according to a set of predefined metrics.</p></div>
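The normalization steps above can be sketched as follows. This is a minimal illustration rather than the team's actual code: the regex patterns and the contraction map are our own assumptions.

```python
import re

# Minimal, illustrative contraction map (the full list used is an assumption)
CONTRACTIONS = {"won't": "will not", "can't": "can not", "n't": " not"}

def clean_tweet(text: str) -> str:
    # Lowercase first so the contraction map only needs lowercase keys
    text = text.lower()
    # Expand contractions, e.g. "won't" -> "will not"
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    # Remove links: "https://t.co/...", plain "http...", and "www..." variants
    text = re.sub(r"(https?://\S+|www\.\S+)", " ", text)
    # Remove username handles and hashtags
    text = re.sub(r"[@#]\w+", " ", text)
    # Drop emojis and other non-ASCII characters
    text = text.encode("ascii", "ignore").decode()
    # Remove remaining punctuation / special characters
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse whitespace
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_tweet("Check @user https://t.co/abc #5G won't hurt!")` yields `"check will not hurt"`.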
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">EXPERIMENTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Subtask 1</head><p>The objective is to build a multi-class classifier that can flag whether a Tweet promotes, supports, or discusses at least one (or several) of the conspiracy theories. Pre-Processing. Links that contain or start with "https://t.co/" are removed, but links such as those beginning with "http" and "www" are still present even after applying the control's normalization; username handles are also not filtered out. Data Integration. Combining two datasets requires them to have the same dimensions as well as consistent, meaningful class labels. We observed discrepancies between the two datasets that would impede integration. For this reason, before integrating, we carefully selected class labels from the fine-grained classification that would make sense in the new dataset: we swapped class labels 1 and 3, and, since label 2 is irrelevant in our new context, we excluded all tuples carrying it. To form a uniform dataset with a uniform number of dimensions, we extracted only the "Tweet" and "Label" dimensions from the old dataset, rendering it integrable with the new one (no missing Tweets detected). Before fusion, the old dataset had 5,946 rows; after integration and removal of rows labeled 2, we had a total of 6,769 tuples.</p><p>Modeling. We chose Logistic Regression as the baseline model because it performed best in the control experiment <ref type="bibr" target="#b5">[5]</ref>. However, we applied some hyperparameter tuning to adjust it to our new, fully integrated dataset: we set the class weights to 1: 0.1, 2: 0.7, 3: 0.2, and increased the maximum number of iterations from 2000 to 4000 because the model sometimes failed to converge. We kept the same feature-extraction technique (CountVectorizer) and used spaCy to tokenize the text. The test size was set to 0.2 to partition our dataset for evaluating the model's predictive power according to a set of predefined metrics. We also used a voting classifier to combine several models, selected on the basis of similar related work on Tweets <ref type="bibr" target="#b7">[7]</ref>: SVC, Multinomial NB, Logistic Regression, and a Random Forest classifier, with the voting type set to "hard."</p><p>BERT for Tweets. "BERT-large was trained on 64 TPU chips for four days at an estimated cost of $7,000" <ref type="bibr" target="#b8">[8]</ref>. Selecting a pretrained model is a crucial step when fitting BERT. We initially used the pretrained model offered by Google (BERT-Large, Uncased) <ref type="bibr" target="#b9">[9]</ref> and obtained dismal results: BERT-Large is trained on conventional English text, while the structure and nature of Tweets differ markedly from any other text. We therefore searched for a pretrained model for Tweets and found BERTweet <ref type="bibr" target="#b10">[10]</ref>. We based our code on similar work already done on Kaggle for disaster Tweets, which uses the BERTweet pretrained model from VinAI <ref type="bibr" target="#b11">[11]</ref>, with some modifications; for example, shifting our class labels (1 to 0, 2 to 1, and 3 to 2) was required for BERTweet to work. We kept the same hyperparameters (5 epochs, batch size 8) and changed the num_classes parameter.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Subtask 2 &amp; 3</head><p>Encoding &amp; Decoding. Since our data contains multiple target variables and the models cannot infer more than one dependent variable at once, we encoded every occurring combination of binary target variables into a single target variable. For example, the combination 0,0,0,0,0,0,0,0,0 has 754 occurrences, and we encoded it as 0. To simultaneously reduce the number of class labels and improve generalization, we applied a threshold that removes any rare combination of binaries occurring fewer times than the threshold. We found that a threshold of 20 captures the most frequent occurrences; a total of 10 encodings (class labels) were produced after applying it. When the model outputs, say, label 3, the decoding process translates it back into 0,0,0,0,0,0,1,0,0. The same experiments were applied to Subtasks 2 and 3, except for data integration, as the class labels of the old dataset are irrelevant in this context.</p></div>
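The encode/decode scheme above can be sketched as follows. The threshold of 20 comes from the text; the helper names and the sorted codebook ordering are our own assumptions.

```python
from collections import Counter

def build_codebook(rows, threshold=20):
    """One class id per combination of binary targets occurring at
    least `threshold` times; rarer combinations are dropped."""
    counts = Counter(rows)
    frequent = sorted(c for c, n in counts.items() if n >= threshold)
    return {combo: idx for idx, combo in enumerate(frequent)}

def encode(rows, codebook):
    # Keep only rows whose combination survived the threshold
    return [codebook[r] for r in rows if r in codebook]

def decode(label, codebook):
    # Translate a predicted class id back into its binary combination
    id_to_combo = {idx: combo for combo, idx in codebook.items()}
    return id_to_combo[label]
```

Fitting a single multi-class model on the encoded labels and decoding its predictions back into binary vectors sidesteps the need for a true multi-output classifier.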
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 2 Validation set results for Subtask 1</head><p>BERTweet outperforms all models on multiple metrics on the validation set for Subtask 1, as illustrated in Figure <ref type="figure">2</ref>.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 3 Validation set results for Subtask 2. Figure 4 Validation set results for Subtask 3</head><label>3, 4</label><figDesc>Figure 3 Validation set results for Subtask 2; Figure 4 Validation set results for Subtask 3</figDesc><graphic coords="2,333.50,440.70,208.98,139.25" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1. Baseline performance on old development set.</head><label>1</label><figDesc></figDesc><table><row><cell>Accuracy</cell><cell>Precision</cell><cell>Recall</cell><cell>F1</cell><cell>MCC</cell></row><row><cell>73.92%</cell><cell>56.85%</cell><cell>54.46%</cell><cell>54.69%</cell><cell></cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2. MCC for Official Test Runs</head><label>2</label><figDesc></figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head></head><label></label><figDesc>Table 2 summarizes the returned results on the test set.</figDesc><table><row><cell>Subtask 1</cell><cell>MCC Score</cell><cell>Subtask 2</cell><cell>MCC Score</cell><cell>Subtask 3</cell><cell>MCC Score</cell></row><row><cell>001</cell><cell>0.106</cell><cell>101</cell><cell>0.0807</cell><cell>201</cell><cell>0.08926</cell></row><row><cell>002</cell><cell>0.0784</cell><cell>102</cell><cell>0.0775</cell><cell>202</cell><cell>0.03060</cell></row><row><cell>003</cell><cell>0.0995</cell><cell>103</cell><cell>0.0724</cell><cell>203</cell><cell>0.0676</cell></row></table></figure>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">CONCLUSION</head><p>Tweet content normalization techniques improve the predictive power of the pipeline. BERTweet was significantly better at predicting the Subtask 1 data, with an MCC of 0.106. The new normalizations combined with Logistic Regression performed best in both Subtasks 2 and 3.</p></div>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<author>
			<persName><forename type="first">Claire</forename><surname>Wardle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hossein</forename><surname>Derakhshan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">INFORMATION DISORDER: Toward an interdisciplinary framework for research and policy making</title>
				<meeting><address><addrLine>Avenue de l&apos;Europe F -</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">09</biblScope>
			<biblScope unit="page">67075</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">Kate</forename><surname>Taylor</surname></persName>
		</author>
		<ptr target="https://www.businessinsider.com/debunked-mcdonalds-uses-worm-filler-2016-" />
		<title level="m">A viral rumor that McDonald&apos;s uses ground worm filler in burgers has been debunked</title>
				<imprint/>
	</monogr>
	<note type="report_type">Business Insider</note>
</biblStruct>


<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">Emily</forename><surname>Stewart</surname></persName>
		</author>
		<ptr target="https://www.vox.com/policy-and-politics/2020/12/22/22195488/fake-news-social-media-2020" />
		<title level="m">America&apos;s growing fake news problem, in one chart</title>
				<imprint>
			<date type="published" when="2021-09-19">September 19, 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Defining what makes Twitter&apos;s audience unique</title>
		<author>
			<persName><forename type="first">Cartier</forename><surname>Stennis</surname></persName>
		</author>
		<ptr target="https://blog.twitter.com/en_us/topics/insights/2018/defining-what-makes-twitters-audience-unique" />
	</analytic>
	<monogr>
		<title level="m">Twitter Blog</title>
				<imprint>
			<date type="published" when="2021-09-19">September 19, 2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">Enriching Content Analysis of Tweets using Community Discovery Graph Analysis</title>
		<author>
			<persName><forename type="first">Andrew</forename><surname>Magill</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Lia</forename><surname>Nogueira de Moura</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Maria</forename><surname>Tomasso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Mirna</forename><surname>Elizondo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jelena</forename><surname>Tešić</surname></persName>
		</author>
		<imprint/>
	</monogr>
	<note>MediaEval 2020 workshop paper</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">FakeNews: Corona Virus and Conspiracies Multimedia Analysis Subtask at MediaEval</title>
		<author>
			<persName><forename type="first">Konstantin</forename><surname>Pogorelov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><forename type="middle">Thilo</forename><surname>Schroeder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefan</forename><surname>Brenner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Johannes</forename><surname>Langguth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the MediaEval 2021 Workshop</title>
				<meeting>of the MediaEval 2021 Workshop</meeting>
		<imprint>
			<date type="published" when="2021-12">2021. December 2021</date>
			<biblScope unit="page" from="13" to="15" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">An Ensemble Classification System for Twitter Sentiment Analysis</title>
		<author>
			<persName><surname>Ankit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Nabizath</forename><surname>Saleena</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.procs.2018.05.109</idno>
	</analytic>
	<monogr>
		<title level="j">Procedia Computer Science</title>
		<imprint>
			<biblScope unit="volume">132</biblScope>
			<biblScope unit="page" from="937" to="946" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">Green AI</title>
		<author>
			<persName><forename type="first">Roy</forename><surname>Schwartz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jesse</forename><surname>Dodge</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Noah</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Oren</forename><surname>Etzioni</surname></persName>
		</author>
		<idno type="DOI">10.1145/3381831</idno>
		<ptr target="https://dl.acm.org/doi/fullHtml/10.1145/3381831" />
		<imprint/>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<monogr>
		<author>
			<persName><forename type="first">Jacob</forename><surname>Devlin</surname></persName>
		</author>
		<ptr target="https://github.com/google-research/bert" />
		<title level="m">Google Research / BERT</title>
				<imprint>
			<date type="published" when="2021-10-20">October 20, 2021</date>
		</imprint>
	</monogr>
	<note type="report_type">Github</note>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">BERTweet: A pre-trained language model for English Tweets</title>
		<author>
			<persName><forename type="first">Dat</forename><forename type="middle">Quoc</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Thanh</forename><surname>Vu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Anh</forename><forename type="middle">Tuan</forename><surname>Nguyen</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2020.emnlp-demos.2.pdf" />
		<imprint>
			<date type="published" when="2021-10-20">October 20, 2021</date>
		</imprint>
	</monogr>
	<note type="report_type">Aclanthology</note>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<title level="m" type="main">Disaster Tweets -BERTweet</title>
		<author>
			<persName><forename type="first">Matthias</forename><surname>Bachfischer</surname></persName>
		</author>
		<ptr target="https://www.kaggle.com/matthiasbachfischer/disaster-tweets-bertweet" />
		<imprint>
			<date type="published" when="2021-10-20">October 20, 2021</date>
		</imprint>
	</monogr>
	<note type="report_type">Kaggle</note>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">WICO Text: A Labeled Dataset of Conspiracy Theory and 5G-Corona Misinformation Tweets</title>
		<author>
			<persName><forename type="first">Konstantin</forename><surname>Pogorelov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><forename type="middle">Thilo</forename><surname>Schroeder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Petra</forename><surname>Filkuková</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefan</forename><surname>Brenner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Johannes</forename><surname>Langguth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proc. of the 2021 Workshop on Open Challenges in Online Social Networks</title>
				<meeting>of the 2021 Workshop on Open Challenges in Online Social Networks</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="21" to="25" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
