<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Detecting Conspiracy Tweets Using Support Vector Machines</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Manfred</forename><surname>Moosleitner</surname></persName>
							<email>manfred.moosleitner@uibk.ac.at</email>
							<affiliation key="aff0">
								<orgName type="institution">Universität Innsbruck</orgName>
								<address>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Benjamin</forename><surname>Murauer</surname></persName>
							<email>b.murauer@posteo.de</email>
							<affiliation key="aff0">
								<orgName type="institution">Universität Innsbruck</orgName>
								<address>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Günther</forename><surname>Specht</surname></persName>
							<email>guenther.specht@uibk.ac.at</email>
							<affiliation key="aff0">
								<orgName type="institution">Universität Innsbruck</orgName>
								<address>
									<country key="AT">Austria</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Detecting Conspiracy Tweets Using Support Vector Machines</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">5370FB6C1E411BDB76D0940509DC81B9</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T07:13+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>This paper summarizes the contribution of our team UIBK-DBIS-FAKENEWS to the task "FakeNews: Corona virus and 5G conspiracy" as part of MediaEval 2020. The goal for this task is to classify tweets as "5G corona virus conspiracy", "other conspiracy", or "non conspiracy", based on text analysis and based on the retweet graphs. We achieved our best results using a calibrated linear SVM with word and character n-grams for the text classification task and a non-calibrated linear SVM with graph statistics for the graph classification task.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>The main objective of the task is to classify tweets as either (1) contributing to a conspiracy suggesting that the 5G network technology caused the SARS-CoV-2 virus epidemic, (2) contributing to a different conspiracy, or (3) not contributing to a conspiracy. For the first subtask, this classification is based on the text content of the tweets. The second subtask focuses on the retweet and follower graph of the tweets. A detailed description and the results of the challenge can be found in <ref type="bibr" target="#b7">[8]</ref>; the collection of the data is described in <ref type="bibr" target="#b8">[9]</ref>.</p><p>In the remainder of this overview, we present our solutions for the two subtasks in Section 2 and discuss the results in Section 3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">METHODOLOGY</head><p>In both subtasks, participants may submit 5 different solutions. The first 2 solutions of each subtask are restricted to using only part of the available information, whereas the remaining 3 submissions may also use external data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Subtask 1: Twitter Messages</head><p>We extract character- and word-based 𝑛-grams from the text of the tweets and use them as features for our classification models. This approach has been shown to be effective and versatile in different text classification tasks, ranging from stance detection <ref type="bibr" target="#b1">[2]</ref> to classifying hacked Twitter accounts <ref type="bibr" target="#b3">[4]</ref>. We tested different parameters in a grid search; the tested values are listed in Table <ref type="table" target="#tab_0">1</ref>.</p><p>Submission 2 may include additional information, so we added all features that were included in the JSON structure, which correspond to the fields available from Twitter's API. We transformed all textual features to tf-idf-normalized frequencies of 𝑛-grams, as listed in Table <ref type="table" target="#tab_0">1</ref>, left the numeric features as-is, and mapped all categorical features to one-hot vectors.</p><p>We included two additional features that were not directly available in the JSON files. Firstly, we crawled all URLs included in the messages and extracted the content of each site's &lt;title&gt; tag, hoping that it would contain a distinctive vocabulary. Secondly, we used the free OCR software tesseract (<ref target="https://tesseract-ocr.github.io/">https://tesseract-ocr.github.io/</ref>) to find any text within the images included in the messages.</p><p>We tested linear support vector machines (SVMs) and extra random trees as classifiers, and also added the option of calibrating the SVM using Platt's method <ref type="bibr" target="#b6">[7]</ref>. These classifiers are well studied and perform well in diverse text classification tasks <ref type="bibr" target="#b9">[10]</ref>, and can compete with neural-network-based approaches in many fields, such as spam detection <ref type="bibr" target="#b4">[5]</ref>.</p></div>
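<div xmlns="http://www.tei-c.org/ns/1.0"><p>The word and character 𝑛-gram extraction used for subtask 1 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used for the submissions: the function names, the whitespace tokenization, the lowercasing, and the key prefixes (such as w1: and c3:) are hypothetical choices made for this example, and the sketch returns raw counts only, without the tf-idf normalization.</p>
<p><code lang="python"><![CDATA[
```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return all overlapping word n-grams (assumption: whitespace tokens)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(text, word_sizes=(1,), char_sizes=(3, 4)):
    """Count word and character n-grams in one shared feature dictionary.

    The key prefixes keep word and character features apart; they are an
    assumption of this sketch, not something prescribed by the paper.
    """
    counts = Counter()
    for n in word_sizes:
        counts.update(f"w{n}:{gram}" for gram in word_ngrams(text, n))
    for n in char_sizes:
        counts.update(f"c{n}:{gram}" for gram in char_ngrams(text.lower(), n))
    return counts


# Made-up example message, not taken from the task data.
features = ngram_counts("5G causes corona")
```
]]></code></p>
<p>The default sizes (word unigrams plus character 3- and 4-grams) mirror the best configuration reported in Table 3. In a complete pipeline, these counts would still have to be weighted and turned into a fixed vocabulary before training a classifier.</p></div>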
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Subtask 2: Retweet-Follower-Graphs</head><p>Standard graph statistics, such as the number of nodes or the node degrees, are known to capture characteristics of the retweet graph that help in classification <ref type="bibr" target="#b0">[1]</ref>. Also, algorithms like HITS <ref type="bibr" target="#b2">[3]</ref> and PageRank <ref type="bibr" target="#b5">[6]</ref> could produce discriminating features, as Yang et al. <ref type="bibr" target="#b10">[11]</ref> used them on retweet graphs to distinguish between tweets that are interesting only to a small group of people and tweets that interest a broader audience. Thus, we used the Python network analysis package NetworkX (<ref target="https://networkx.org/">https://networkx.org/</ref>) to extract statistics describing the retweet-follower graphs. For the first run of the second subtask, we calculated order, size, degree, in-degree, out-degree, number of connected components, density, transitivity, PageRank, HITS (hubs, authorities), number of partitions, planarity, and number of cycles, and combined them into a single feature vector. Some of the NetworkX functions that calculate these graph statistics return lists of variable length, as the number of values depends on the number of nodes and edges. To create fixed-length feature vectors, we computed the arithmetic mean, standard deviation, and five-number summary of the values in the individual lists, and used these as features. For the second run in subtask 2, we additionally used the data from the nodes files, from which we calculated the minimum, maximum, mean, and standard deviation of the number of friends and followers, and added these to the feature vectors calculated for the first run.</p></div>
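<div xmlns="http://www.tei-c.org/ns/1.0"><p>Reducing a variable-length list of per-node values to a fixed-length feature vector can be sketched as follows. This is an illustrative sketch only: the function name summarize, the use of the population standard deviation, the inclusive quartile method, and the all-zero vector for empty lists are assumptions, since the text does not specify these details.</p>
<p><code lang="python"><![CDATA[
```python
import statistics


def summarize(values):
    """Map a variable-length list of numbers to 7 fixed features:
    mean, standard deviation, minimum, Q1, median, Q3, maximum."""
    if not values:
        return [0.0] * 7  # assumption: an empty list becomes all zeros
    data = sorted(float(v) for v in values)
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)  # assumption: population standard deviation
    if len(data) == 1:
        # Quartiles of a single value are that value itself.
        q1 = median = q3 = data[0]
    else:
        # assumption: "inclusive" quartiles; other definitions differ slightly
        q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return [mean, std, data[0], q1, median, q3, data[-1]]


# Made-up degree list of a four-node graph.
degrees = [1, 1, 2, 4]
vector = summarize(degrees)
```
]]></code></p>
<p>Applying such a function to each list-valued statistic (for example, the degree or PageRank values of all nodes) and concatenating the results yields vectors of equal length for graphs of any size.</p></div>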
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Since we extracted significantly fewer features in the second subtask, we added polynomial feature generation, and added a Gaussian naïve Bayes classifier and a k-nearest neighbor (KNN) classifier to the models from the first subtask. Both are well-studied algorithms, and we were interested in how well they would perform for this task. We tested several parameters in a grid search; the tested values are listed in Table <ref type="table" target="#tab_0">1</ref>.</p></div>
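<div xmlns="http://www.tei-c.org/ns/1.0"><p>Polynomial feature generation expands a short feature vector into all monomials of its entries up to a chosen degree, optionally preceded by a constant bias term. The following plain-Python sketch illustrates the idea; the function name polynomial_features and its signature are hypothetical and are not claimed to match the implementation used for the submissions.</p>
<p><code lang="python"><![CDATA[
```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_features(row, degree=2, include_bias=True):
    """Expand one feature row into all monomials up to the given degree."""
    expanded = [1.0] if include_bias else []
    for d in range(1, degree + 1):
        # Each combination picks d entries (with repetition) to multiply.
        for combo in combinations_with_replacement(row, d):
            expanded.append(float(prod(combo)))
    return expanded


# Made-up two-feature row: [a, b] becomes [1, a, b, a*a, a*b, b*b].
expanded_row = polynomial_features([2, 3], degree=2, include_bias=True)
```
]]></code></p>
<p>The number of generated features grows quickly with the degree, which is why this kind of expansion is mainly practical for small feature sets such as the graph statistics.</p></div>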
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">RESULTS AND DISCUSSION</head><p>For each subtask, we ran preliminary experiments and selected the setup with the highest Matthews correlation coefficient (MCC) in 10-fold cross-validation as the model used to predict our submission results.</p></div>
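<div xmlns="http://www.tei-c.org/ns/1.0"><p>For reference, the MCC for two or more classes can be computed directly from the true and predicted labels. The sketch below uses the common multiclass generalization of the coefficient and returns 0 when the score is undefined because all true or all predicted labels are identical; the function name multiclass_mcc, that convention, and the example labels are assumptions of this illustration.</p>
<p><code lang="python"><![CDATA[
```python
from collections import Counter
from math import sqrt


def multiclass_mcc(y_true, y_pred):
    """Matthews correlation coefficient for two or more classes."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    total = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)   # how often each class truly occurs
    pred_counts = Counter(y_pred)   # how often each class is predicted
    labels = set(true_counts) | set(pred_counts)
    cov_tp = correct * total - sum(true_counts[k] * pred_counts[k] for k in labels)
    cov_tt = total * total - sum(n * n for n in true_counts.values())
    cov_pp = total * total - sum(n * n for n in pred_counts.values())
    if cov_tt == 0 or cov_pp == 0:
        return 0.0  # assumption: undefined cases are reported as 0
    return cov_tp / sqrt(cov_tt * cov_pp)


# Made-up labels for the three classes of the task.
truth = ["5g", "other", "non", "non"]
perfect_score = multiclass_mcc(truth, truth)
constant_score = multiclass_mcc(truth, ["non"] * 4)
```
]]></code></p>
<p>A perfect prediction yields a score of 1, whereas always predicting the same class yields 0 regardless of how frequent that class is, which makes the measure suitable for imbalanced label distributions.</p></div>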
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Subtask 1</head><p>The scores displayed in Table <ref type="table" target="#tab_1">2a</ref> show that the SVM model clearly outperforms the extra random trees approach in the first subtask. Calibrating the SVM increased the performance slightly.</p><p>Interestingly, the performance of the classifiers dropped when taking more features into account for the second submission. This indicates either that too many features were extracted from the text or that the additional meta-information was not informative for the problem. Nevertheless, we submitted the two results in this state, being aware that we could possibly have increased the performance of the second submission by ignoring the meta-features. The evaluation results, on the other hand, do not show a performance decrease between the two submissions: the two runs achieved scores of 0.440 and 0.441, respectively. As shown in Table <ref type="table" target="#tab_2">3</ref>, the best results were obtained by combining word unigrams with character 3- and 4-grams and a strong regularization setting of C=0.1.</p><p>Using a linear SVM as a model allows an easy interpretation of the importance of words by looking at the respective coefficients. For each output class, Figure <ref type="figure">1</ref> shows the terms with the three highest and lowest coefficients. The high value for the term 5g suggests that not many topics within the other conspiracies discuss the telecommunication standard. This relationship could be examined in more detail using topic modeling.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Subtask 2</head><p>Similar to subtask 1, we used grid search to find the best-performing classifier and parameters. The scores of the classifiers were rather similar, with the linear SVM with C=10 producing the best score. Using polynomial features at all increased the result in both submissions by 0.05, whereas the specific parameters (degree=[2, 3], include bias=[True, False]) did not have a great influence (&lt; 0.01 MCC). The best parameters are shown in Table <ref type="table" target="#tab_2">3</ref>. The scores in both training and evaluation for subtask 2 were quite low, as displayed in Table <ref type="table" target="#tab_1">2b</ref>. Interestingly, our MCC evaluation scores for subtask 2 were lower than the training scores, which is in contrast to subtask 1, where the evaluation scores were slightly better than our training scores.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">CONCLUSION</head><p>Our simple text-based approaches were able to classify the tweets reliably, and the coefficients of the model give insights into the most important terms. We suggest that more preprocessing might further improve these results.</p><p>The simple graph statistics, on the other hand, were not expressive enough for this task. Here, incorporating more metadata like the time between the retweets might improve the classification results.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>2 https://tesseract-ocr.github.io/ 3 https://networkx.org/</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 :</head><label>1</label><figDesc>Hyperparameters tested in grid search.</figDesc><table><row><cell>Parameter</cell><cell>Tested values</cell></row><row><cell>Word &amp; character 𝑛-gram size<hi rend="superscript">1</hi></cell><cell>[1, 2, 3, 4]</cell></row><row><cell>SVM: C</cell><cell>[0.1, 1, 10]</cell></row><row><cell>Extra Trees: number of trees</cell><cell>[1, 2, 3, 4] × 10<hi rend="superscript">3</hi></cell></row><row><cell>Poly. degree</cell><cell>[2, 3]</cell></row><row><cell>Poly. include bias</cell><cell>[True, False]</cell></row><row><cell>KNN: number of neighbors</cell><cell>[3, 4, 5, 10, 20, 50]</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 1 :</head><label>1</label><figDesc>Top 3 positive and negative SVM coefficients for each class after fitting the message bodies of the training data.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 :</head><label>2</label><figDesc>Evaluation results measured with the Matthews correlation coefficient (MCC).</figDesc><table><row><cell>Phase</cell><cell>Model</cell><cell>Run 1</cell><cell>Run 2</cell></row><row><cell>Training</cell><cell>Linear SVM (calibrated)</cell><cell>0.432</cell><cell>0.412</cell></row><row><cell>Training</cell><cell>Linear SVM</cell><cell>0.428</cell><cell>0.404</cell></row><row><cell>Training</cell><cell>Extra Random Trees</cell><cell>0.274</cell><cell>0.253</cell></row><row><cell>Evaluation</cell><cell>Linear SVM (calibrated)</cell><cell>0.440</cell><cell>0.441</cell></row><row><cell cols="4">(a) Results of Subtask 1</cell></row><row><cell>Phase</cell><cell>Model</cell><cell>Run 1</cell><cell>Run 2</cell></row><row><cell>Training</cell><cell>Linear SVM (calibrated)</cell><cell>0.003</cell><cell>0.054</cell></row><row><cell>Training</cell><cell>Linear SVM</cell><cell>0.127</cell><cell>0.197</cell></row><row><cell>Training</cell><cell>KNN</cell><cell>0.118</cell><cell>0.135</cell></row><row><cell>Training</cell><cell>Extra Random Trees</cell><cell>0.089</cell><cell>0.091</cell></row><row><cell>Training</cell><cell>Gaussian Naive Bayes</cell><cell>0.092</cell><cell>0.101</cell></row><row><cell>Evaluation</cell><cell>Linear SVM</cell><cell>0.090</cell><cell>0.092</cell></row><row><cell cols="4">(b) Results of Subtask 2</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 :</head><label>3</label><figDesc>Best parameters for the four submissions.</figDesc><table><row><cell>Subm.</cell><cell>Parameters</cell></row><row><cell>Text 1</cell><cell>word-1-grams + character-3+4-grams, calibrated SVM, C=0.1</cell></row><row><cell>Text 2</cell><cell>word-1-grams + character-3+4-grams, calibrated SVM, C=0.1</cell></row><row><cell>Graph 1</cell><cell>linear SVM, C=10, Poly. deg=2, Poly. include bias = True</cell></row><row><cell>Graph 2</cell><cell>linear SVM, C=10, Poly. deg=3, Poly. include bias = False</cell></row></table></figure>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Aggregate characterization of user behavior in twitter and analysis of the retweet graph</title>
		<author>
			<persName><forename type="first">David</forename><forename type="middle">R</forename><surname>Bild</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Yue</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Robert</forename><forename type="middle">P</forename><surname>Dick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">Morley</forename><surname>Mao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Dan</forename><forename type="middle">S</forename><surname>Wallach</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Internet Technology (TOIT)</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="24" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">From clickbait to fake news detection: an approach based on detecting the stance of headlines to articles</title>
		<author>
			<persName><forename type="first">Peter</forename><surname>Bourgonje</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Julian</forename><forename type="middle">Moreno</forename><surname>Schneider</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Georg</forename><surname>Rehm</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 EMNLP Workshop: Natural Language Processing meets Journalism</title>
				<meeting>the 2017 EMNLP Workshop: Natural Language Processing meets Journalism</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="84" to="89" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Hubs, authorities, and communities</title>
		<author>
			<persName><forename type="first">Jon</forename><forename type="middle">M</forename><surname>Kleinberg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM computing surveys (CSUR)</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="page">5</biblScope>
			<date type="published" when="1999">1999</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">A peer-based approach on analyzing hacked twitter accounts</title>
		<author>
			<persName><forename type="first">Benjamin</forename><surname>Murauer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Eva</forename><surname>Zangerle</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Günther</forename><surname>Specht</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 50th Hawaii International Conference on System Sciences</title>
				<meeting>the 50th Hawaii International Conference on System Sciences</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Comparison of multinomial naïve bayes classifier, support vector machine, and recurrent neural network to classify email spams</title>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">L</forename><surname>Octaviani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">Hari</forename><surname>Rachmawanto</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">A</forename><surname>Sari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">Ignatius Moses</forename><surname>Setiadi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Seminar on Application for Technology of Information and Communication (iSemantic)</title>
				<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="17" to="21" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<title level="m" type="main">The pagerank citation ranking: Bringing order to the web</title>
		<author>
			<persName><forename type="first">Lawrence</forename><surname>Page</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Sergey</forename><surname>Brin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Rajeev</forename><surname>Motwani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Terry</forename><surname>Winograd</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1999">1999</date>
		</imprint>
		<respStmt>
			<orgName>Stanford InfoLab</orgName>
		</respStmt>
	</monogr>
	<note type="report_type">Technical report</note>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods</title>
		<author>
			<persName><forename type="first">John</forename><surname>Platt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in Large Margin Classifiers</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<date type="published" when="2000-06">June 2000</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Fakenews: Corona virus and 5g conspiracy task at mediaeval 2020</title>
		<author>
			<persName><forename type="first">Konstantin</forename><surname>Pogorelov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daniel</forename><forename type="middle">Thilo</forename><surname>Schroeder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Luk</forename><surname>Burchard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Johannes</forename><surname>Moe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Stefan</forename><surname>Brenner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Petra</forename><surname>Filkukova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Johannes</forename><surname>Langguth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">MediaEval 2020 Workshop</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Fact: a framework for analysis and capture of twitter graphs</title>
		<author>
			<persName><forename type="first">Daniel</forename><forename type="middle">Thilo</forename><surname>Schroeder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Konstantin</forename><surname>Pogorelov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Johannes</forename><surname>Langguth</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="134" to="141" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Support vector machine active learning with applications to text classification</title>
		<author>
			<persName><forename type="first">Simon</forename><surname>Tong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Daphne</forename><surname>Koller</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of machine learning research</title>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="page" from="45" to="66" />
			<date type="published" when="2001-11">Nov. 2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">Finding interesting posts in twitter based on retweet graph analysis</title>
		<author>
			<persName><forename type="first">Min-Chul</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Jung-Tae</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Seung-Wook</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Hae-Chang</forename><surname>Rim</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval</title>
				<meeting>the 35th international ACM SIGIR conference on Research and development in information retrieval</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="1073" to="1074" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
