<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A Linguistic Approach to Misinformation in Chinese</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Charles</forename><surname>Lam</surname></persName>
							<email>charleslam@hsu.edu.hk</email>
							<affiliation key="aff0">
								<orgName type="department">Department of English</orgName>
								<orgName type="institution">The Hang Seng University of Hong Kong</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Brian</forename><surname>Leung</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">F-STEM Solution Limited</orgName>
								<address>
									<settlement>Hong Kong</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Cora</forename><surname>Yip</surname></persName>
							<affiliation key="aff1">
								<orgName type="department">F-STEM Solution Limited</orgName>
								<address>
									<settlement>Hong Kong</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jason</forename><surname>Yung</surname></persName>
							<email>jason.wl.yung@gmail.com</email>
							<affiliation key="aff1">
								<orgName type="department">F-STEM Solution Limited</orgName>
								<address>
									<settlement>Hong Kong</settlement>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A Linguistic Approach to Misinformation in Chinese</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">C2E6EBA324F3662E19EEFC238295E199</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-23T22:10+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Misinformation</term>
					<term>Fake news</term>
					<term>Linguistics</term>
					<term>Chinese</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Identifying useful information is increasingly important and difficult. Correct information is crucial when we make decisions, whether in finance, health or politics. Yet the amount of misinformation has been rising in all these areas. Existing works primarily focus on the truthfulness of information using data in English, and either ignore unverifiable claims or lump them together with misinformation (also known as 'fake news'). However, this approach often disregards misleading information and conspiracy theories, which can be as dangerous as verifiably wrong information. From a linguistic perspective, the present study analyzes the headlines of 69,170 extracted articles in Chinese and identifies their linguistic features. Results show that misinformation in Chinese uses emotive language and hyperbole to get readers' attention, which echoes previous studies on clickbait and shows that these tactics are shared across languages. We further argue that these tactics become particularly obvious when the articles are categorized by topic. Through an analysis of commonly used phrases and keywords, we discuss how the resulting word list can be further developed into an identification system for misinformation.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The spread of misinformation has become a serious problem across the world. Misinformation and similar text types are problematic because they confuse readers and perpetuate false information. This can be a matter of life and death for many. A prime example is misinformation related to the coronavirus pandemic. Some conspiracy theories have even claimed that the pandemic is a biological weapon, that it is a creation of pharmaceutical companies, or that the virus or disease does not exist at all. Europol called misinformation around COVID-19 a "sneaky threat" in a blog post and urged users to beware of its spread 1 .</p><p>The present study belongs to a larger project that aims to identify misinformation and fake news with NLP/NLU (natural language processing / understanding). For this study, we do not focus on the fine distinctions between these text types. Rather, we aim to identify common features in the language used by these misleading articles. While we assume that the various misleading or untruthful text types (such as misinformation, disinformation, fake news, content farms and satire) have different impacts on readers and can be further categorized from 'untruthful texts' <ref type="bibr" target="#b9">[10]</ref>, there might still be common features among them that can separate misinformation from regular and truthful news.</p><p>Content-based automatic fact checking is difficult, because it relies on both common sense and expert knowledge. For instance, it is provably false to claim that the wire in a surgical mask is secretly an antenna for the 5G network <ref type="foot" target="#foot_0">2</ref> . However, it is unlikely that any system would already contain the knowledge that the mask wire and the antenna cannot be the same entity. The falsehood of the claim relies on expert knowledge (e.g. 
the structure of surgical masks and the materials suitable for a 5G network antenna). In addition, misinformation and fake news often use faulty logic to deceive readers. For computer systems that rely primarily on a "bag of words" approach without considering causal relations between clauses, it is difficult to identify faulty logic that misrepresents unrelated facts as related. This is particularly clear in conspiracy theories, where unverifiable claims are made.</p><p>To tackle misinformation and fake news, human users often have to fact-check against their general knowledge and apply their skills to read new information critically. In some cases, the knowledge required to verify the information is beyond any individual's knowledge base. It is therefore useful for AI systems to identify or pre-screen truthfulness and veracity in this era of information overflow.</p><p>Given the limitations of content- or knowledge-based fact-checking, we advocate the use of language features in identifying misinformation. This linguistic approach can work in parallel with the use of real-world knowledge, potentially through human annotation. Before knowledge representation and ontologies are made more accessible for the purpose of news verification (e.g., as has been done for path planning <ref type="bibr" target="#b2">[3]</ref>), language features may serve as a proxy for flagging suspicious news articles. To this end, the objective of this study is to explore the features of misinformation. Given the paucity of previous studies on misinformation in the Chinese-speaking world, and given the large number of Chinese speakers and their growing influence, the present study explores misinformation in Chinese. The study also brings empirical language data from a non-English language, thereby contributing linguistic and cultural diversity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Having acknowledged that there is a need to identify misinformation, the next question is how. Given the difficulties of content-based automatic fact-checking tools, many studies resort to more tangible proxies, such as the sources of the information or the propagation dynamics of the posts in question. Most previous studies approach the identification of misinformation via more tangible cues (web links, source identification) or meta-analysis (survey papers, detection methods, propagation dynamics) <ref type="bibr" target="#b5">[6]</ref>. One may also use a bundle of measurements that includes structural, temporal and linguistic cues for misinformation detection <ref type="bibr" target="#b11">[12]</ref>.</p><p>Until recently, it has been rare to find research that focuses on the language use of misinformation <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b13">14,</ref><ref type="bibr" target="#b12">13]</ref>. Rashkin et al. use a variety of language features to characterize how a story is dramatized or sensationalized <ref type="bibr" target="#b8">[9]</ref>. These features include lexical resources from Linguistic Inquiry and Word Count (LIWC) <ref type="bibr" target="#b7">[8]</ref>, language that signals vagueness (hedging and qualifying / degree adverbs), superlatives and subjective adjectives. Some of the cues from LIWC, e.g., swearing, are highly correlated with misinformation in English. However, the same cues do not seem to be effective in Chinese texts. The use of first- and second-person pronouns (I and you) also appears to be common in English data. In addition to these text-based measurements, sentiment analysis has also been reported to be useful for the identification of misinformation <ref type="bibr" target="#b1">[2]</ref>. 
An alternative approach is to utilize user comments as a cue to gauge veracity <ref type="bibr" target="#b4">[5]</ref>. Instead of looking at the original posts alone, Jiang and Wilson analyzed the use of language in user responses to over 5,000 original posts. Specifically, they found that user responses generally contain more signals indicating awareness of misinformation and show less trust when the original posts contain misinformation. Moreover, there are more emojis and swear words in replies to misinformation. The intuition behind this linguistic approach is that journalists are trained to write in a particular style that caters to their audience. Similarly, writers and creators of misinformation and the like also have to attract their readers' attention. As a result, the style of the texts from these writers becomes distinct. Style can be seen as a conglomerate of language features that include lexical choice, syntactic complexity, organization and flow of information. Some of these features (e.g., lexical choice) can be captured more easily with computers than others (e.g., organization of the text).</p><p>The vast majority of the literature on misinformation detection focuses on data in English. For example, the frequently cited datasets LIAR <ref type="bibr" target="#b15">[16]</ref> and the more recent FakeNewsNet <ref type="bibr" target="#b10">[11]</ref> are based on English. We recognize that the focus on English is largely related to the availability of social media data and fact-checking sites, and to the existing NLP resources for English (e.g., tokenization and lexical resources for sentiment analysis). However, misinformation in other languages remains understudied. This gives rise to another challenge in curbing the influence of misinformation: researchers are not certain whether the misinformation cues found in English would work in other languages. 
The global pandemic in 2020 has clearly shown that communities across the globe are interconnected, despite their linguistic differences. It is therefore necessary to explore how misinformation is manifested in Chinese, assuming that linguistic cues are an effective tool to detect misinformation. The present study adds to a small but emerging group of works that tackle misinformation in languages other than English.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methodology</head><p>In this section, we describe the process of data collection, preprocessing and feature extraction for the dataset.</p><p>The dataset was extracted from the Kaggle competition "WSDM -Fake News Classification"<ref type="foot" target="#foot_1">3</ref> . We included only the titles that were considered misinformation, yielding 320,767 titles of misinformation articles. Most of the titles come from Mainland China and some come from Hong Kong and Taiwan. All texts were converted to traditional Chinese using OpenCC<ref type="foot" target="#foot_2">4</ref> so that identical texts and characters could be accurately recognized. Because many titles were exact duplicates, the final dataset contains 69,170 unique titles.</p><p>Before feeding the raw texts into the model, we first performed data cleaning, eliminating strings that carry no information, such as URL addresses, hashtags and emojis. We then conducted word segmentation and removed stopwords and punctuation. Lastly, we joined the word tokens, separated by single spaces, to form the clean text used for the extraction of several linguistic features.</p></div>
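The preprocessing steps above can be sketched as follows. This is a minimal, dependency-free illustration: the paper's actual pipeline converts scripts with OpenCC ('s2t') and segments words with CkipTagger, both of which are stubbed out here, and the sample titles and function names are hypothetical.

```python
import re

# Strings that "carry no information" per the paper: URLs, hashtags, emojis.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\S+")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def clean_title(title: str) -> str:
    """Strip URLs, hashtags and emojis; collapse whitespace, since Chinese
    titles are written without spaces.  (In the full pipeline, OpenCC 's2t'
    conversion to traditional Chinese would happen before this step.)"""
    for pattern in (URL_RE, HASHTAG_RE, EMOJI_RE):
        title = pattern.sub("", title)
    return re.sub(r"\s+", "", title)

def preprocess(titles):
    """Clean each title and drop exact duplicates (the paper's dataset went
    from 320,767 raw titles to 69,170 unique ones at this stage)."""
    seen, result = set(), []
    for t in titles:
        t = clean_title(t)
        if t and t not in seen:
            seen.add(t)
            result.append(t)
    return result

raw_titles = ["震驚！https://example.com 真相曝光", "震驚！真相曝光", "健康 #tips 必看"]
print(preprocess(raw_titles))  # → ['震驚！真相曝光', '健康必看']
```

In the full pipeline, word segmentation and stopword removal would follow, with the resulting tokens re-joined by single spaces.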
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Topic Extraction</head><p>Different types of articles have different expressions and styles. To extract the topics, we applied supervised learning to classify the texts. The distribution of topics is shown in table <ref type="table" target="#tab_0">1</ref>.</p><p>Our model was trained to identify three major categories in news disseminated on the Internet (Economy, Health and Politics). None of the stories (or titles) appears to be satirical, so we excluded satire as a category in our analysis of this dataset. Titles that cannot be categorized are included in 'Others'. Typical examples in this category include "(5 毛錢的特 效)2014 浙江手機實拍 UFO 不明飛行物！" (50 cents special effect) UFO spotted by cell phone in Zhejiang province in 2014! and "1000 人犯罪團伙來德州偷孩子取器官" Gang of 1,000 members coming to Texas to steal children for their organs. These titles are often unverifiable urban legends or celebrity gossip, and do not pertain to any of our three main themes.</p></div>
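As a hypothetical illustration of the topic-assignment step (the actual supervised classifier is not specified here), the following sketch votes on cue words drawn from the frequent-word lists in table 2 and falls back to 'Others' when no cue matches; the cue lists and function names are ours, not the study's.

```python
# Hand-picked cue words per topic, taken from the frequent words in table 2.
# A trained classifier would learn these associations from labeled data.
TOPIC_CUES = {
    "Economy": ["農村", "補貼", "農民"],
    "Health": ["減肥", "食物", "健康", "中醫"],
    "Politics": ["宣佈", "事件"],
}

def assign_topic(tokens):
    """Vote on cue-word matches; titles with no match go to 'Others'."""
    scores = {topic: sum(t in cues for t in tokens)
              for topic, cues in TOPIC_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Others"

print(assign_topic(["農村", "補貼", "方法"]))  # → Economy
print(assign_topic(["UFO", "手機", "實拍"]))   # → Others
```

The 'Others' fallback mirrors how the paper's unclassifiable urban legends and gossip end up in that catch-all category.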
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Keyword and n-gram extraction</head><p>Keyword extraction allows us to surface the important words in the raw texts. Given that the original dataset only consists of the titles of the articles, we use the extracted nouns and named entities as the keywords of each title, after performing part-of-speech tagging with CkipTagger <ref type="bibr" target="#b14">[15]</ref>. In the data, there are 43,193 unique word types and 475,457 tokens after word segmentation. Table <ref type="table" target="#tab_1">2</ref> shows the token counts of the 10 most frequent content words, i.e., stop words are not included. Word-based n-grams are useful for discovering features such as keywords and common word combinations. To extract the top n-gram tokens, we used CountVectorizer from the Python Scikit-learn library <ref type="bibr" target="#b6">[7]</ref>. Figure <ref type="figure" target="#fig_0">1</ref> shows the numbers of types and tokens. The overall statistics of n-grams help us gauge the scale of the corpus. From the 69,170 data points, there are 240,681 unique bigrams and 270,650 unique trigrams. Among these unique bigrams and trigrams (i.e., combinations of two or three words), we list the most frequent ones in tables 3 and 4. Across the bigrams and trigrams, we observe similar keywords and topics.</p></div>
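The n-gram counting described above can be sketched with the standard library as follows; it is equivalent in spirit to a CountVectorizer call over space-joined tokens, though the sample titles here are hypothetical stand-ins for the segmented dataset.

```python
from collections import Counter

def ngrams(tokens, n):
    """Contiguous word-based n-grams over one segmented title."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(segmented_titles, n, k=5):
    """Count n-grams across all titles and return the k most frequent,
    mirroring CountVectorizer(ngram_range=(n, n)) followed by column sums."""
    counts = Counter()
    for tokens in segmented_titles:
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)

segmented = [
    ["微信", "聊天", "記錄", "恢復"],
    ["微信", "聊天", "記錄", "刪除"],
    ["等於", "慢性", "自殺"],
]
print(top_ngrams(segmented, 2, k=2))
# → [(('微信', '聊天'), 2), (('聊天', '記錄'), 2)]
print(top_ngrams(segmented, 3, k=1))
# → [(('微信', '聊天', '記錄'), 2)]
```

Unique bigram and trigram counts (types) are simply `len(counts)` after the same pass, which is how corpus-scale figures like those in figure 1 can be derived.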
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Sentiment Analysis</head><p>In addition to the general distribution and frequency of keywords, we use sentiment analysis to gauge the language style of these news titles <ref type="foot" target="#foot_3">5</ref> . The results show that a much higher proportion of the misinformation titles was rated with stronger emotions. Figure <ref type="figure" target="#fig_1">2</ref> shows that as much as 40% of the misinformation titles were rated "0" or "1". To provide a benchmark, a sample of 900 titles was collected from traditional newspapers. The distribution of sentiment scores in the misinformation dataset is clearly different from that of the traditional news titles, which is more even. On the two sides of figure <ref type="figure" target="#fig_1">2</ref>, it can be seen that misinformation articles have a greater tendency toward extreme emotions in their titles. In the middle of the figure, traditional news shows a larger proportion of titles with neutral sentiment, compared to misinformation titles.</p></div>
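The comparison behind figure 2 can be sketched as a simple binning of sentiment scores. In the study the per-title scores come from SnowNLP, which returns a value in [0, 1]; the scores and the ten-bucket scheme below are illustrative placeholders, not the actual data.

```python
def sentiment_histogram(scores, bins=10):
    """Fraction of titles per sentiment bucket; bucket 0 is most negative,
    bucket 9 most positive, matching the "0"-"9" ratings in figure 2."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1
    total = len(scores)
    return [c / total for c in counts]

# Hypothetical scores standing in for SnowNLP(title).sentiments output.
misinfo_scores = [0.02, 0.05, 0.1, 0.95, 0.98, 0.5]
hist = sentiment_histogram(misinfo_scores)

# Share of titles in the extreme buckets (0, 1, 8, 9) on either side.
extreme_share = hist[0] + hist[1] + hist[-2] + hist[-1]
print(f"share of extreme-sentiment titles: {extreme_share:.2f}")  # → 0.83
```

Computing the same histogram for a benchmark sample of traditional news titles, as the paper does, would expose the flatter, more neutral distribution described above.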
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Discussion</head><p>The data show that misinformation articles tend to carry stronger emotions, echoing previous studies on English misinformation <ref type="bibr" target="#b13">[14]</ref>. Both quantitative and qualitative measures show this tendency. Compared to articles from traditional news outlets (figure <ref type="figure" target="#fig_1">2</ref>), titles in our dataset tend to demonstrate stronger emotions, and fewer of them display neutral sentiment.</p><p>Based on the frequent keywords and n-grams, the dataset displays a general tendency in misinformation articles to be informal and casual. This is likely a click-bait strategy that aims to attract readers' attention. Specifically, the frequent keywords and n-grams reflect how these titles promise casual topics and easy reads to boost site traffic. Another feature that sets misinformation articles apart from traditional news is the high frequency of particular celebrities (e.g., Fan Bingbing (n=663), Nicholas Tse (n=504), Cecilia Cheung (n=501) and Yang Mi (n=475), among several others), often in relation to their divorces or romantic lives. While gossip is also part of traditional news, it is the repetition in the misinformation dataset that makes it different: traditional news agencies typically need to cover current events and do not dwell on only a few celebrities.</p><p>It is also common to see scare tactics used as a means to convince readers of the relevance of the articles. The top three trigrams (WeChat -chat -record (n=210); equal -chronic -suicide (n=130); farmer -friend -note (n=91)) are related to warnings about privacy (instant messenger records), health (alleged bad habits causing chronic health issues) and economy (in the context of loan credits for farmers). The same strategy has been seen in conspiracy theories and other sources of misinformation. 
By creating a sense of urgency and danger, these titles have a better chance of tricking readers into clicking on the articles or believing the stories.</p><p>Another common strategy is the promise of secrets. The few verbs on the list of frequent words include 'exposed' (n=1841) and 'know' (n=1799), which are relevant in that they attract readers' attention. The strategy appears to be equally applicable to the different topics in the dataset, as evidenced by the frequencies in the subcategories (see details in table <ref type="table" target="#tab_1">2</ref>). Another interesting word is 'really' (n=122 in politics and n=578 in others). This can be explained through the Gricean Cooperative Principle <ref type="bibr" target="#b3">[4]</ref>: the maxims of relevance and quality suggest that explicit reassurance of authenticity is only called for in communication when that authenticity might be in question. From the co-occurrence of the frequent words in the 'others' category, such as exposed (n=975), divorce (n=969), pregnancy (n=784), romantic relationship (n=710) and Fan Bingbing (n=663), one can see that celebrity gossip is a common topic, similar to tabloids in print media.</p><p>It is crucial to note that the use of linguistic features in this study is not intended to replace expert knowledge or journalistic fact-checking. Rather, we consider the linguistic approach a cost-effective proxy for flagging suspicious content. All the measurements used in this study can be computed without human annotation or knowledge bases. While the results from the Chinese dataset show a similar pattern to English, it is also important to note that the difference in language poses additional challenges. Relating the keywords to the topics requires some background knowledge of the social environment. For example, the occurrences of "farmers" are primarily linked to financial services in the Chinese rural credit system. 
The names of celebrities cannot be automatically linked to gossip, as they also appear in political rumors about movie stars' tax evasion and the authorities' reactions. Part of the task can be done with NER (named entity recognition) tools, but the interpretation will require a more in-depth understanding of the text, potentially aided by some form of knowledge representation.</p><p>The dataset shows that the linguistic features described here can help identify suspicious sources and flag them as less reliable for users. Given that content farms may change their domain names often, identifying them in a dynamic manner is a useful step towards curbing the spread of misinformation. In particular, the co-occurrence of various signals at the post level (i.e., metrics of individual texts) and the corpus level (e.g., distribution of sentiments) is especially illustrative for content farms and similar harmful sites. While the categorization in this study is limited in scope, it captures the use of emotive language together with some of the common tactics in misinformation. For future research, a more fine-grained distinction of topics (e.g., "celebrity gossip" or "alternative medicine") will reduce miscategorization, since the classifier will no longer be forced to place such titles in "others" or the existing categories. The present dataset can be seen as a proof of concept for this linguistic approach. The results in this study are based on the titles of the articles, so future studies on entire articles will capture more details from the body texts, which will be illustrative of the linguistic style of articles containing misinformation.</p><p>These findings on scare tactics and gossip can be connected to deeper psychological mechanisms <ref type="bibr" target="#b0">[1]</ref>. From a cognitive anthropology perspective, Acerbi proposed that certain types of negative contents can attract readers / listeners more easily. 
These negative contents appear to be related to disgust, threats or sex. Acerbi's proposal is consistent with the present study's results on celebrity gossip and infidelity. While the present study alone cannot support any claim of universality, we believe that it contributes towards the investigation of the attractiveness and contagiousness of misinformation across languages and cultures.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In the present study, we have contributed an analysis of Chinese data, using text-based analytics to explore the linguistic features of the titles of misinformation articles. Emotive language is found to be a prominent feature in the dataset, indicating that misinformation in Chinese uses similar tactics to misinformation in English. Quantitatively, the misinformation dataset shows a stronger tendency to use emotive language, compared to regular and traditional news articles. This helps identify the dataset as a whole as suspicious or less reliable. Qualitatively, the occurrence of emotive keywords and their collocations helps identify titles with emotive language at the level of individual articles. Specifically, we identify the casual style of the prose and the mention of secrets as prominent markers in these misinformation titles. The same strategy can be found across the three topics of economy/finance, health and politics. We recognize celebrity gossip / entertainment as another common theme in misinformation sources, and these articles should be categorized separately in future studies.</p><p>Future research can expand the scope to analyze the entire text with a greater variety of methods. Collocation of keywords is another useful tool. This study used n-grams, which are limited to contiguous collocates. More sophisticated collocation analytics will cover non-contiguous cases (e.g., words separated by articles and other function words) and take ordering into account, and in turn better represent the linguistic features in misinformation articles.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Types and tokens of unigrams to 7-grams</figDesc><graphic coords="5,74.40,317.64,446.48,243.77" type="bitmap" /></figure>
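The non-contiguous collocation analysis suggested for future work can be sketched as a windowed co-occurrence count; this is a hypothetical illustration of the technique, not an implementation used in the study, and the sample titles are invented.

```python
from collections import Counter

def windowed_collocates(segmented_titles, window=3):
    """Count ordered word pairs co-occurring within `window` tokens of each
    other, capturing collocates that contiguous n-grams miss (e.g., pairs
    separated by function words)."""
    pairs = Counter()
    for tokens in segmented_titles:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + 1 + window]:
                pairs[(w, v)] += 1  # (w, v) preserves left-to-right order
    return pairs

segmented = [
    ["范冰冰", "被", "曝光", "離婚"],
    ["范冰冰", "離婚", "內幕"],
]
c = windowed_collocates(segmented)
# ('范冰冰', '離婚') is counted in both titles even though the two words
# are adjacent in only one of them.
print(c[("范冰冰", "離婚")])  # → 2
```

Normalizing such counts by individual word frequencies (e.g., with pointwise mutual information) would be a natural next step for ranking collocates.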
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Comparison of sentiment scores of our dataset with regular news</figDesc><graphic coords="7,74.40,70.15,446.48,287.02" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Distribution of Topics</figDesc><table><row><cell>Topic</cell><cell>Count Percentage</cell></row><row><cell cols="2">Economy 20,155 29.14%</cell></row><row><cell>Health</cell><cell>15,137 21.88%</cell></row><row><cell>Politics</cell><cell>3252 4.70%</cell></row><row><cell>Others</cell><cell>30,626 44.28%</cell></row><row><cell>Total</cell><cell>69,170 100%</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2</head><label>2</label><figDesc>Most frequent words by topic</figDesc><table><row><cell>Topic</cell><cell>Word (Tokens)</cell></row><row><cell>All topics</cell><cell></cell></row><row><cell>combined</cell><cell></cell></row></table><note>農村 farming village (3147); 網友 netizen (2551); 減肥 lose weight (2362); 中國 China (2013); 曝光 exposed (1841); 手機 cell phone (1801); 知道 know (1799); 農民 farmer (1722) Economy 農村 farm village (2591); 中國 China (1291); 補貼 subsidy (1268); 農民 farmer (1161); 網友 netizen (1046); 2018 年 year 2018 (884); 減肥 lose weight (605); 方法 method (575); 知道 know (557) Health 食物 food (1220); 減肥 lose weight (1068); 手機 cellular phone (901); 健康 health (749); 10 10 (668); 中醫 Chinese medicine (483); 輕鬆 relaxed (473); 方法 method (460); 身體 body (442); 治療 treatment (410) Politics 知道 know (286); 網友 netizen (208); 曝光 exposed (151); 女人 woman (132); 真的 really (122); 不用 no need to (120); 女友 girlfriend (119); 宣佈 announce (112); 孩子 child (109); 事件 event (108) Others 網友 netizen (1128); 曝光 exposed (975); 離婚 divorce (969); 懷孕 pregnancy (784); 戀情 romantic relationship (710); 減肥 lose weight (672); 范冰冰 Fan Bingbing (a movie star) (663); 知道 know (643); 孩子 child (612); 真的 really (578)</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>Most frequent bigrams by topic</figDesc><table><row><cell>Topic</cell><cell>Bigram (Tokens)</cell></row><row><cell>All topics</cell><cell>腰間盤 -突出 lumbar disc -protrusion (456); 聊天 -記錄 chat -record (345); 退</cell></row><row><cell>combined</cell><cell>出 -娛樂圈 leave -entertainment industry (344); 戀情 -曝光 romantic relationship -</cell></row><row><cell></cell><cell>exposed (237); 快速 -減肥 fast -lose weight (236)</cell></row><row><cell cols="2">Economy 2018 年 -農村 year 2018 -farm village (235); 農村 -補貼 farm village -subsidy (150);</cell></row><row><cell></cell><cell>腰間盤 -突出 lumbar disc -protrusion (141); 農民 -朋友 farmer -friend (139); 第一</cell></row><row><cell></cell><cell>-龍頭 the first -leader (138)</cell></row><row><cell>Health</cell><cell>腰間盤 -突出 lumbar disc -protrusion (154); 聊天 -記錄 chat -record (134); 快速 -</cell></row><row><cell></cell><cell>減肥 fast -lose weight (104); 微信 -聊天 WeChat -chat (84); 慢性 -自殺 chronic -</cell></row><row><cell></cell><cell>suicide (81)</cell></row><row><cell>Politics</cell><cell>退出 -娛樂圈 leave -entertainment industry (46); 繼承 -父母 inherit -parents (28);</cell></row><row><cell></cell><cell>宣佈 -退出 announce -retirement (27); 父母 -房產 parents -estate (23); 無法 -繼</cell></row><row><cell></cell><cell>承 unable -inherit (20)</cell></row><row><cell>Others</cell><cell>退出 -娛樂圈 leave -entertainment industry (190); 腰間盤 -突出 lumbar disc -</cell></row><row><cell></cell><cell>protrusion (153); 聊天 -記錄 chat -record (149); 戀情 -曝光 romantic relationship -</cell></row><row><cell></cell><cell>exposed (147); 公佈 -戀情 announce -romantic relationship (129)</cell></row><row><cell>Table 4</cell><cell></cell></row><row><cell cols="2">Most frequent trigrams by topic</cell></row><row><cell>Topic</cell><cell>Trigram (Tokens)</cell></row><row><cell>All topics</cell><cell></cell></row><row><cell>combined</cell><cell></cell></row></table><note>微信 -聊天 -記錄 WeChat -chat -record 
(210); 等於 -慢性 -自殺 equal -chronic -suicide (130); 農民 -朋友 -注意 farmer -friend -note (91); 宣佈 -退出 -娛樂圈 announce -leave -entertainment industry (86); 第一 -龍頭 -沉睡 the first -leader -slumber (77) Economy 第一 -龍頭 -沉睡 the first -leader -slumber (73); 農民 -朋友 -注意 farmer -friend -note (68); 芯片 -第一 -龍頭 chip -the first -leader (57); 4 月 -趕超科 -大訊 April -section catch -Ablecom (42); 農村 -退伍 -軍人 farm village -retired -soldier (36) Health 微信 -聊天 -記錄 WeChat -chat -record (79); 等於 -慢性 -自殺 equal -chronic -suicide (64); 手機 -輸入 -數字 cellular phone -enter -digits (44); 治療 -腰間盤 -突出 treatment -lumbar disc -protrusion (39); 聊天 -記錄 -恢復 chat -record -restore (28) Politics 繼承 -父母 -房產 inherit -parents -estate (23); 手機號 -發財 -數字 phone number -make a fortune -digits (19); 發財 -數字 -命運 make a fortune -digits -fate (19); 獨生子女 -無法 -繼承 only child -unable -inherit (17); 無法 -繼承 -父母 unable -inherit -parents (17) Others 微信 -聊天 -記錄 WeChat -chat -record (94); 等於 -慢性 -自殺 equal -chronic -suicide (63); 宣佈 -退出 -娛樂圈 announce -leave -entertainment industry (47); 4 月 -1 日 -駕考 April -1 -driving test (43); 聊天 -記錄 -刪除 chat -record -delete (38)</note></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">For more details, see the reports from Forbes https://www.forbes.com/sites/brucelee/2020/07/11 /face-masks-with-5g-antennas-the-latest-covid-19-coronavirus-conspiracy-theory/ and Reuters https: //www.reuters.com/article/uk-factcheck-metal-strip-medical-masks-5/fact-check-metal-strip-in-medical-mas ks-is-not-a-5g-antenna-idUSKBN24A2O1.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">WSDM -Fake News Classification: https://www.kaggle.com/wsdmcup/wsdm-fake-news-classification</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">Open Chinese Convert: https://pypi.org/project/OpenCC/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">SnowNLP https://github.com/isnowfy/snownlp.</note>
		</body>
		<back>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Cognitive attraction and online misinformation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Acerbi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Palgrave Communications</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="1" to="7" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Fake news detection using sentiment analysis</title>
		<author>
			<persName><forename type="first">B</forename><surname>Bhutani</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Twelfth International Conference on Contemporary Computing (IC3)</title>
				<imprint>
			<publisher>IEEE</publisher>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1" to="5" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Ontology based knowledge representation technique, domain modeling languages and planners for robotic path planning: A survey</title>
		<author>
			<persName><forename type="first">R</forename><surname>Gayathri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Uma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ICT Express</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page" from="69" to="74" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m" type="main">Studies in the Way of Words</title>
		<author>
			<persName><forename type="first">P</forename><surname>Grice</surname></persName>
		</author>
		<imprint>
			<date type="published" when="1989">1989</date>
			<publisher>Harvard University Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Linguistic signals under misinformation and fact-checking: Evidence from user comments on social media</title>
		<author>
			<persName><forename type="first">S</forename><surname>Jiang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wilson</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the ACM on Human-Computer Interaction</title>
				<meeting>the ACM on Human-Computer Interaction</meeting>
		<imprint>
			<biblScope unit="volume">2</biblScope>
			<biblScope unit="issue">CSCW</biblScope>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="1" to="23" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Fake news, rumor, information pollution in social media and web: A contemporary survey of state-of-the-arts, challenges and opportunities</title>
		<author>
			<persName><forename type="first">P</forename><surname>Meel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">K</forename><surname>Vishwakarma</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Expert Systems with Applications</title>
		<imprint>
			<biblScope unit="page">112986</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Scikit-learn: Machine learning in Python</title>
		<author>
			<persName><forename type="first">F</forename><surname>Pedregosa</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Machine Learning Research</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page" from="2825" to="2830" />
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<monogr>
		<title level="m" type="main">Linguistic Inquiry and Word Count: LIWC2015</title>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">W</forename><surname>Pennebaker</surname></persName>
		</author>
		<ptr target="https://www.liwc.net" />
		<imprint>
			<date type="published" when="2015">2015</date>
			<publisher>Pennebaker Conglomerates</publisher>
			<pubPlace>Austin, TX</pubPlace>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Truth of Varying Shades: Analyzing Language in Fake News and Political Fact-Checking</title>
		<author>
			<persName><forename type="first">H</forename><surname>Rashkin</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D17-1317</idno>
		<ptr target="https://www.aclweb.org/anthology/D17-1317" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2017 Conference on Empirical Methods in Natural Language Processing<address><addrLine>Copenhagen, Denmark</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017-09">Sept. 2017</date>
			<biblScope unit="page" from="2931" to="2937" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">FakeNewsTracker: a tool for fake news collection, detection, and visualization</title>
		<author>
			<persName><forename type="first">K</forename><surname>Shu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Mahudeswaran</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computational and Mathematical Organization Theory</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="60" to="71" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">FakeNewsNet: A Data Repository with News Content, Social Context and Spatialtemporal Information for Studying Fake News on Social Media</title>
		<author>
			<persName><forename type="first">K</forename><surname>Shu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1809.01286</idno>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note>cs.SI</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Hierarchical propagation networks for fake news detection: Investigation and exploitation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Shu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International AAAI Conference on Web and Social Media</title>
				<meeting>the International AAAI Conference on Web and Social Media</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="626" to="637" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Motivations, Methods and Metrics of Misinformation Detection: An NLP Perspective</title>
		<author>
			<persName><forename type="first">Q</forename><surname>Su</surname></persName>
		</author>
		<ptr target="https://www.atlantis-press.com/journals/nlpr/125941255/view#sec-s2_1" />
	</analytic>
	<monogr>
		<title level="m">Natural Language Processing Research</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Big Data and quality data for fake news and misinformation detection</title>
		<author>
			<persName><forename type="first">F</forename><surname>Torabi Asr</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Taboada</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Big Data &amp; Society</title>
		<imprint>
			<biblScope unit="volume">6</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page">2053951719843310</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Reliable and Cost-Effective Pos-Tagging</title>
		<author>
			<persName><forename type="first">Y.-F</forename><surname>Tsai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K.-J</forename><surname>Chen</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/O04-2005" />
	</analytic>
	<monogr>
		<title level="j">International Journal of Computational Linguistics &amp; Chinese Language Processing</title>
		<imprint>
			<biblScope unit="volume">9</biblScope>
			<biblScope unit="issue">1</biblScope>
			<biblScope unit="page" from="83" to="96" />
			<date type="published" when="2004-02">February 2004</date>
		</imprint>
	</monogr>
	<note>Special Issue on Selected Papers from ROCLING XV</note>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">&quot;Liar, Liar Pants on Fire&quot;: A New Benchmark Dataset for Fake News Detection</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">Y</forename><surname>Wang</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/P17-2067</idno>
		<ptr target="https://www.aclweb.org/anthology/P17-2067" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</title>
				<meeting>the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)<address><addrLine>Vancouver, Canada</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017-07">July 2017</date>
			<biblScope unit="page" from="422" to="426" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
