On the Readability of Misinformation in Comparison to the Truth

Mohammadali Tavakoli¹, Harith Alani¹ and Grégoire Burel¹
¹ Knowledge Media Institute, The Open University, Walton Hall, Milton Keynes, MK7 6AA

Abstract
Psychological studies have demonstrated that much misinformation circulating on the Web tends to be more believable and memorable due to its ease of processing. The readability of a passage is a crucial factor in its ease of processing, as it indicates how easy or difficult the passage is to read and understand. According to some qualitative research, if online misinformation is easier to read, it becomes stickier and more memorable. In contrast, other studies have shown that people are more likely to trust and believe misinformation when it appears more complex. As a result of such conflicting findings, it remains unclear how readability is associated with true or false content on the Web. This paper aims to gain a deeper understanding of readability through quantitative analysis by applying six readability formulas to four datasets containing both true and false content, and by comparing the results across these datasets. Our research shows that false claims are generally harder to read than true claims.

Keywords
Ease of processing, Readability, Misinformation, False claims

1. Introduction
Papers from psychology have demonstrated through a range of qualitative studies that misinformation tends to be easier to process in general, and thus easier to believe and remember [1, 2]. Ease of processing, also called processing fluency, refers to the ease with which a piece of information can be processed by its readers. Understanding what makes misinformation easier to process is key to producing more effective methods to curb its spread. In textual content, one of the features that influences its ease of processing is readability [3]. Currently, research is conflicting with respect to how readability is associated with online misinformation.
On the one hand, easy-to-read misinformation is found to stick more in readers' minds [1]; on the other hand, people are found to be more likely to trust and believe more complex information [4]. This raises the need for analysing information that is known to be false and comparing its readability with that of information known to be true, to help determine how high or low readability is associated with true and false information online. To understand how readability relates to these categories, we analysed the readability of true and false information collected from the Web. To this end, the research question addressed in this paper is: How does the readability of misinformation compare to that of true information?

In: R. Campos, A. Jorge, A. Jatowt, S. Bhatia, M. Litvak (eds.): Proceedings of the Text2Story'23 Workshop, Dublin (Republic of Ireland), 2-April-2023
ali.tavakoli@open.ac.uk (M. Tavakoli); harith.alani@open.ac.uk (H. Alani); gregoire.burel@open.ac.uk (G. Burel)
ORCID: 0000-0003-3005-4539 (M. Tavakoli); 0000-0003-2784-349X (H. Alani); 0000-0003-0029-5219 (G. Burel)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

To address this question, we collect news articles and claims containing false and true content items (i.e., claims and articles) and analyse them in terms of readability. The main contributions of this paper are: (1) analysing four datasets of true and false information from the Web; (2) measuring and comparing the readability of the datasets using six different readability measures; and (3) demonstrating that misinformation appears to be harder to read than true information.

2. Related work
The mechanism by which humans assess truth often consists of two phases: intuitive and analytic assessment.
Through the intuitive phase, we decide whether to accept the received information or to begin the analytic assessment process [5, 6]. The simpler and more intuitive the information is to us, the less likely we are to kick-start the analytic process [7]. The ease of processing of (mis)information is, therefore, an influential factor in how quickly and intuitively we are prone to accepting such information without proper scrutiny [1]. Various parameters have been found to be associated with increased ease of processing, such as familiarity [8, 9, 10], compatibility with prior beliefs [11, 12], perceived credibility of the source [13], and social consensus [14, 15, 16]. Readability is another key feature for assessing the ease of processing of textual content, reflecting how difficult a text is to read and understand [17]. Some readability studies focused on cosmetic features such as colour contrast [18] and font type and size [19]. In [19], the authors found that 35% more participants were misled by information presented in easier-to-read fonts. In a study of over 92K false and true news articles, misinformation was found to be 3% easier to read than true information [20], where readability was measured using the Flesch-Kincaid (FK) method [21], which takes into account the number of words, sentences, and syllables to calculate the readability level of a given text. In some scenarios, readability was found to play a rather surprising role. For example, in [4], the authors found that when participants were presented with text containing either false or true information, they trusted the harder-to-read text regardless of its veracity. The authors concluded that reading difficulty gave a stronger perception of truthfulness [22]. Other researchers found that readers tend to invest less cognitive effort in judging the truthfulness of news when the news is more difficult to read, i.e., they believe the information at face value [23].
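As an illustration of how FK-style measures such as those referenced above operate, the sketch below computes the Flesch-Kincaid Grade Level from sentence, word, and syllable counts. The syllable counter is a naive vowel-group heuristic chosen for illustration (an assumption, not the counter used in the cited studies, which typically rely on dictionary- or rule-based counting):

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: one syllable per run of consecutive vowels.
    # Real readability tools use dictionary- or rule-based counters.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # Split into sentences on terminal punctuation, drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    asl = len(words) / len(sentences)                     # avg sentence length
    asw = sum(map(count_syllables, words)) / len(words)   # avg syllables per word
    return 0.39 * asl + 11.8 * asw - 15.59

# Shorter sentences with fewer syllables yield a lower (easier) grade level.
easy = flesch_kincaid_grade("The cat sat on the mat. It was warm.")
hard = flesch_kincaid_grade(
    "Considerable epistemological controversies notwithstanding, "
    "verification methodologies remain indispensable."
)
```

A higher grade level indicates harder text; the other measures discussed below combine similar word- and sentence-level statistics with different coefficients.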
Some readability measures have been used as classification features to distinguish between true and false information. FK and GFI (see Section 3.3), for example, have been used along with several other lexical, stylistic, and grammatical features by Horne and Adali [24] in an SVM-based model to classify news articles into true, false, and satire. The authors concluded that the style and complexity of fake content differ significantly from those of real news, and that fake content is more closely related to satire than to real news. They found that the readability-related features improve the classification of news articles into the target classes. A similar model was built in [25] to classify Portuguese news articles into true and false. The authors used 165 textual features, including some readability measures adapted to the target language. Although it is yet unknown whether their findings on Portuguese data generalise to English and other languages, they show that classifiers with readability-related features, such as DCI and GFI (see Section 3.3), achieve higher accuracy. These studies, however, lack a proper analysis of how each of these features is associated with true and false information and to what extent these associations differ from each other. From the above, it is clear that readability can be measured in different ways and can have different impacts on misinformation. Our work differs from the state of the art in that we apply multiple computational methods for calculating readability, and we perform this analysis on several datasets of true and false information. Expanding the analysis to more readability methods and datasets increases the chances of establishing more concrete and representative evidence on how readability differs between true and false information.

3.
Readability of True and False Information
The aim of this paper is to measure and compare the readability of online misinformation and true information, to gain a better understanding of how readability differs between the two categories of content. To achieve this systematically, the readability score of each content item is calculated using six different readability measures (Section 3.3). In addition to three datasets of short claims, a dataset of full news articles is also processed in our experiment. The workflow of our experiments is as follows: (1) collect datasets consisting of true as well as false claims found on the Web, written in varying lengths (full news articles, short messages); (2) pre-process the datasets; (3) calculate the readability of each content item and aggregate the values within our four datasets using six readability measures; (4) evaluate the readability difference for each of the datasets depending on their true/false labels.

3.1. Datasets
In our experiments, two different types of data are used for readability measurement and comparison: a dataset of full news articles and three datasets of short texts. Each dataset contains both true and false items. The first dataset used in this study is a collection of 5K full news articles named the Fake News Detection Challenge Dataset1 (KDD2020), gathered from a variety of news websites in 2020. The veracity of each article is manually labelled with 0 or 1, indicating true and false respectively. The average length of the articles is 27.84 sentences. The second dataset is a manufactured collection of 67,366 claims named FEVEROUS2 (Fact Extraction and VERification Over Unstructured and Structured information) [26]. This dataset was manually generated in 2021. Each claim is verified against relevant Wikipedia pages by trained annotators and labelled SUPPORTED, REFUTED, or NOT ENOUGH EVIDENCE. For our experiments, we only consider the claims that were either SUPPORTED or REFUTED.
PubHealth3 [27] is another dataset of claims. The dataset was constructed in 2020 and consists of 11k claims collected from fact-checking websites (i.e., Politifact, FactCheck, Snopes, TruthorFiction, and FullFact) and online news sources (i.e., Associated Press, Reuters News, and Health News Review). In this experiment, an equal number of claims from each source is selected to avoid bias. The veracity labels provided with the dataset are true, false, mixture, and unproven. To meet the needs of our experiments, only the true and false labels are used.

1 Fake News Detection Challenge, https://www.kaggle.com/c/fakenewskdd2020/data.
2 FEVEROUS, https://fever.ai/dataset/feverous.html.
3 PubHealth, https://github.com/neemakot/Health-Fact-Checking.

Table 1
Distribution of content items in datasets and pre-processing statistics.

Dataset                                        KDD2020   FEVEROUS   PubHealth   Liar
Number of Content Items (NCI)                  4,280     69,058     7,496       4,516
NCI after removing non-English samples         4,280     68,728     7,432       4,483
NCI after removing samples with ≤ 2 words      4,141     68,661     7,421       4,459
NCI after balancing true & false samples       3,300     54,000     5,422       3,334

The last dataset of claims is LIAR4 [28], with 12.8k claims collected from Politifact.com. The labels used in coding the data are pants-fire, false, barely-true, half-true, mostly-true, and true. Our focus is on claims that are untrue (pants-fire and false labels) and those that are true.

3.2. Pre-processing
The pre-processing tasks aim to clean and prepare the data for our experiment. The pre-processing phase consists of the following tasks: discarding duplicates, non-English content items, and short items consisting of fewer than 3 words; removing punctuation marks apart from full stops, which indicate sentence boundaries; and discarding irrelevant or excessively repeated symbols and characters such as emoji, asterisks, hashes, etc. The number of articles in each dataset is not balanced.
Therefore, to avoid bias, we selected the same number of items from each set (false, true) after cleaning the data and removing noise. Except for the full articles, for which no source information is available, we also balance the number of claims with regard to the source (e.g., BBC, CBS) in all other datasets, to minimise bias that could emerge from a particular source (e.g., a specific writing style or more complex text). The final size of the datasets used in our study, along with some statistics about the pre-processing steps, is shown in Table 1.

3.3. Readability Measures
The readability tests used in this work for measuring the readability of false and true content items are listed in Table 2. For each readability metric, we apply min-max normalisation, so that the scores from each readability measure are normalised between 0 (very easy to read) and 100 (very hard to read) for comparative purposes.

4. Readability Comparison Results
In this section, we describe various comparisons of readability between the true and false sets in our four datasets, to reach a better understanding of the similarities and differences in the overall results as well as the results between the different datasets.

4 LIAR, https://www.kaggle.com/code/hendrixwilsonj/liar-data-analysis.

Table 2
The readability measures (Parameters: ASL: avg sentence length; ASW: avg word length in syllables; complex words: words with ≥ 3 syllables; DW: words with ≥ 7 characters).
Flesch Reading Ease Score (FRES): 206.835 − (1.015 × ASL) − (84.6 × ASW) [29]
Flesch-Kincaid Grade Level (FKGL): (0.39 × ASL) + (11.8 × ASW) − 15.59 [21]
Gunning's Fog Index (GFI): 0.4 × [ASL + 100 × (ComplexWords / Words)] [30]
Automated Readability Index (ARI): 4.71 × (Characters / Words) + (0.5 × ASL) − 21.43 [31]
Dale-Chall Readability Formula (DCRF): 0.1579 × (DWs / Words × 100) + (0.0496 × ASL) [32]
Spache Readability Formula (SRF): (0.121 × ASL) + (0.082 × PDW) + 0.659 [33]

Figure 1: Distribution of content items by mean of readability scores (false vs. true). Panels: (a) KDD2020, (b) FEVEROUS, (c) PubHealth, (d) Liar.

4.1. Statistical Comparison of Readability Scores
To investigate whether false and true content items differ in terms of readability scores, we first compare the means of these scores in all four datasets. Figure 1 shows the distribution of these readability means across the datasets for both true and false sets. These results suggest that although readability differs across the datasets, the scores are more comparable between the true and false sets within each individual dataset. Overall, we observe that the KDD2020 dataset has a lower readability score than the other datasets. This may be due to the difference in item length between this dataset and the other analysed datasets. To understand these readability values and the significance of the similarities or differences between false and true content items, we obtain the scores from the readability measures and apply the Mann-Whitney U (MWU) test. For this experiment, the significance level (α) is set to 0.05, meaning that any calculated p-value ≤ α indicates that a significant difference exists between readability scores.

Table 3
Comparison of the avg readability of false and true content items (α = 0.05): p-values of the MWU test.

Measure   KDD2020     FEVEROUS   PubHealth   Liar
FRES      2.3E−4      1.00       0.46        1.6E−6
FKGL      1.63E−10    1.00       6.66E−26    1.4E−2
GFI       5.57E−17    1.00       1.77E−11    0.95
ARI       1.55E−14    1.00       3.06E−19    0.20
DCRF      0.40        1.00       1.00        2.3E−9
SRF       0.40        1.00       1.17E−20    4.7E−4
All       2.46E−10    0.99       6.56E−11    5.2E−4

Table 3 presents the results of the MWU test, showing that the content items in the false set are generally harder to read than those in the true set, and that these distributional differences are statistically significant. The only exception is the FEVEROUS dataset, which shows a different pattern. However, as mentioned earlier, this dataset is lab-manufactured and hence more likely to differ from the other three, more naturally generated, datasets. What we can conclude from the statistical analysis above is that false content is generally harder to read than true content in all our datasets except the manufactured one. This provides computational evidence in support of the common view and most qualitative studies from psychologists, which argue that falsified information tends to be written in a more complex fashion to give the perception of depth and truthfulness (see Section 2). What remains unknown is how the individual readability parameters differ from one set to another, which is the focus of the next part of the experiment.

4.2. Comparison of Readability Parameters
As discussed in Section 3.3, each readability formula has several influencing parameters. To compare the influence of the different readability parameters between the datasets, we use the Pearson Correlation Coefficient (PCC). Correlations between each parameter and the readability of true and false content items across the datasets are represented in Figure 2.
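The statistical machinery used in this section, the MWU test (Section 4.1) and the PCC (Section 4.2), can be sketched with SciPy as follows. The arrays and effect sizes below are illustrative assumptions standing in for per-item readability scores, not our data:

```python
import numpy as np
from scipy.stats import mannwhitneyu, pearsonr

rng = np.random.default_rng(42)

# Stand-ins for min-max-normalised readability scores (0 = easy, 100 = hard);
# false items are simulated as slightly harder to read on average.
false_scores = np.clip(rng.normal(55, 10, 1000), 0, 100)
true_scores = np.clip(rng.normal(50, 10, 1000), 0, 100)

# MWU test: is the false-set distribution shifted towards harder scores?
_, p_value = mannwhitneyu(false_scores, true_scores, alternative="greater")
significant = p_value <= 0.05  # alpha = 0.05, as in Table 3

# PCC: correlation between a readability parameter (here ASL)
# and a score that the parameter partly determines.
asl = rng.normal(20, 5, 1000)                  # avg sentence length
scores = 0.8 * asl + rng.normal(0, 3, 1000)    # score driven mainly by ASL
r, _ = pearsonr(asl, scores)                   # strong positive correlation
```

The MWU test is used here, rather than a t-test, because it makes no normality assumption about the score distributions.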
It can be seen that the correlation between the parameters and the readability scores of the formulas is positive in almost all cases. In general, there is a strong correlation between ASL and the mean value of the readability scores. The figures also show that Char-Wrds has a slightly stronger than moderate correlation with the mean value. Such findings enhance our understanding of why readability differs between true and false content in our datasets (more on this in Section 5).

Figure 2: Correlation between readability measures and readability parameters, including measure-measure, parameter-parameter, and measure-parameter correlations for true and false items (Char-Wrds: number of characters / number of words; CmpWrds-Wrds: number of complex words / number of words). Panels: (a) KDD2020, (b) FEVEROUS, (c) PubHealth, (d) Liar (top: true / bottom: false).

5. Discussion
The results of the analysis, illustrated in Figure 1, reveal that false content items are in general slightly more difficult to read than true ones. This finding contradicts [20] (see Section 2); however, only one dataset was used in [20]. This indicates the need for further quantitative research to better understand the reasons behind such variation in results. The analysis of the datasets showed an inconsistency between the FEVEROUS dataset and the other datasets in the difference between the readability of false and true content items. Analysing the FEVEROUS content shows that true claims are more difficult to read than false ones, which contradicts our results from the other datasets (Figure 1).
Looking into the collection/creation process of these datasets, we can infer that the synthetic FEVEROUS dataset is not representative of the real-world true/false content distributions observed in the other datasets, since the claims in FEVEROUS were written artificially by a limited number of experts from the misinformation domain rather than naturally authored and published on the Web. Regarding the parameters used in the readability formulas, Figure 2 shows that, excluding FEVEROUS for the deviation discussed above, in the remaining claims datasets (i.e., PubHealth and Liar) Char-Wrds and ASW show a slightly stronger than moderate correlation with the mean value of the readability scores. However, this is not the case in the dataset of full articles (i.e., KDD2020), which suggests that these parameters could have more impact when experimenting with short texts. Their impact, however, is minor for the GFI measure, possibly because GFI relies on complex words (words with ≥ 3 syllables), which diminishes the correlation of these parameters with the measure. On the other hand, ASL shows the opposite pattern, appearing most influential for long documents, where it has a strong relationship with the mean value. Lengthier sentences are used in false news articles, with an average of 29 words per sentence, whereas the average sentence length in true content items is 25 words. This indicates that these parameters should be considered when building models for identifying misinformation on the Web. The disparity in sentence length between true and false content suggests that brevity and conciseness may be a key differentiating factor between misinformation and true information, with misinforming content being more convoluted than true content.
Such variation in the correlation between parameters and measures across different types of content items (i.e., claims and full news articles) enables future research to select features more wisely when classifying content items of different types.

6. Limitations and Future Work
In this experiment, we looked into readability and its association with misinformation. Beyond readability, the concept of ease of processing has other aspects, such as social consensus and source credibility (see Section 2). Analytically investigating their association with misinformation and discovering relevant features correlated with them would be an interesting angle for future work. In this experiment, the only language considered was English. Although the readability measures might need modifications to work properly with different languages, experimenting with other languages might yield different findings that highlight the cultural and structural differences between languages when dealing with true and false information. As discussed in Section 3.1, our focus was only on content items with true and false labels, while some datasets have additional fine-grained annotations, such as not enough evidence, unproven, mixture, etc. Although including such fine-grained labels in the analysis would make the experiment more comprehensive, matching labels across various datasets annotated with different guidelines is not straightforward and may produce inconsistent results. It is also of great importance to investigate how the association of readability with misinformation differs across topics. Discovering topic-specific readability patterns and considering them when building models for detecting misinformation is another research direction.

7. Conclusion
Our analysis of four distinct datasets showed that readability scores are generally higher (i.e., texts are more difficult to read) for false information compared to true information.
We found a marked difference in the average sentence length and the number of characters per word between false and true content, which could be exploited in misinformation detection models. We also found that when measuring the readability of long documents, the average sentence length is the most indicative parameter, while the average number of syllables per word and the average number of characters per word work best for short documents. Our analysis also showed that the lab-manufactured FEVEROUS dataset produced readability patterns that were inconsistent with the real-world Web data present in the other datasets. This shows the importance of using real-world datasets when studying misinformation.

Acknowledgments
This work has been partially supported by the European CHIST-ERA program via the UK Engineering and Physical Sciences Research Council (UKRI - EP/V062662/1) within the CIMPLE project (grant agreement CHIST-ERA-19-XAI-003).

References
[1] N. Schwarz, M. Jalbert, When (fake) news feels true: Intuitions of truth and the acceptance and correction of misinformation, in: The Psychology of Fake News, Routledge, 2020, pp. 73–89.
[2] R. Reber, R. Greifeneder, Processing fluency in education: How metacognitive feelings shape learning, belief formation, and affect, Educational Psychologist 52 (2017) 84–103.
[3] K. Rennekamp, Processing fluency and investors' reactions to disclosure readability, Journal of Accounting Research 50 (2012) 1319–1354.
[4] A. Withall, E. Sagi, The impact of readability on trust in information, in: Proceedings of the Annual Meeting of the Cognitive Science Society, volume 43, 2021.
[5] K. E. Stanovich, Who is rational?: Studies of individual differences in reasoning, Psychology Press, 1999.
[6] R. E. Petty, J. T. Cacioppo, The elaboration likelihood model of persuasion, in: Communication and Persuasion, Springer, 1986, pp. 1–24.
[7] D. Kahneman, Thinking, Fast and Slow, Farrar, Straus and Giroux, New York, 2011.
[8] L. E. Boehm, The validity effect: A search for mediating variables, Personality and Social Psychology Bulletin 20 (1994) 285–293.
[9] D. Gefen, E-commerce: the role of familiarity and trust, Omega 28 (2000) 725–737.
[10] E. J. Newman, M. Sanson, E. K. Miller, A. Quigley-McBride, J. L. Foster, D. M. Bernstein, M. Garry, People with easier to pronounce names promote truthiness of claims, PLoS ONE 9 (2014) e88671.
[11] W. Kintsch, Comprehension: A paradigm for cognition, Cambridge University Press, 1998.
[12] E. Aronson, The theory of cognitive dissonance: A current perspective, in: Advances in Experimental Social Psychology, volume 4, Elsevier, 1969, pp. 1–34.
[13] A. H. Eagly, S. Chaiken, The Psychology of Attitudes, Harcourt Brace Jovanovich College Publishers, 1993.
[14] R. B. Cialdini, Influence: Science and Practice, volume 4, Pearson Education, Boston, 2009.
[15] L. Festinger, A theory of social comparison processes, Human Relations 7 (1954) 117–140.
[16] P. S. Visser, R. R. Mirabile, Attitudes in the social context: the impact of social network composition on individual-level attitude strength, Journal of Personality and Social Psychology 87 (2004) 779.
[17] C. Tekfi, Readability formulas: An overview, Journal of Documentation (1987).
[18] H. Geoffrey, R. Rolf, Forming judgments of attitude certainty, importance, and intensity: The role of subjective experiences, Personality and Social Psychology Bulletin (1999) 771–782.
[19] H. Song, N. Schwarz, Fluency and the detection of misleading questions: Low processing fluency attenuates the Moses illusion, Social Cognition 26 (2008) 791.
[20] C. Carrasco-Farré, The fingerprints of misinformation: how deceptive content differs from reliable sources in terms of cognitive effort and appeal to emotions, Humanities and Social Sciences Communications 9 (2022).
[21] J. P. Kincaid, R. P. Fishburne Jr, R. L. Rogers, B. S. Chissom, Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy enlisted personnel, Technical Report, Naval Technical Training Command, Millington, TN, Research Branch, 1975.
[22] B. Lutz, M. T. Adam, S. Feuerriegel, N. Pröllochs, D. Neumann, Identifying linguistic cues of fake news associated with cognitive and affective processing: Evidence from NeuroIS, in: NeuroIS Retreat, Springer, 2020, pp. 16–23.
[23] H. A. Simon, Motivational and emotional controls of cognition, Psychological Review 74 (1967) 29.
[24] B. Horne, S. Adali, This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news, in: Proceedings of the International AAAI Conference on Web and Social Media, volume 11, 2017, pp. 759–766.
[25] R. Santos, G. Pedro, S. Leal, O. Vale, T. Pardo, K. Bontcheva, C. Scarton, Measuring the impact of readability features in fake news detection, in: Proceedings of the 12th Language Resources and Evaluation Conference, 2020.
[26] R. Aly, Z. Guo, M. Schlichtkrull, J. Thorne, A. Vlachos, C. Christodoulopoulos, O. Cocarascu, A. Mittal, FEVEROUS: Fact extraction and verification over unstructured and structured information, arXiv preprint arXiv:2106.05707 (2021).
[27] N. Kotonya, F. Toni, Explainable automated fact-checking for public health claims, arXiv preprint arXiv:2010.09926 (2020).
[28] W. Y. Wang, "Liar, liar pants on fire": A new benchmark dataset for fake news detection, arXiv preprint arXiv:1705.00648 (2017).
[29] R. F. Flesch, et al., Art of Readable Writing (1949).
[30] R. Gunning, The fog index after twenty years, Journal of Business Communication 6 (1969) 3–13.
[31] R. Senter, E. A. Smith, Automated Readability Index, Technical Report, Cincinnati Univ OH, 1967.
[32] J. S. Chall, E. Dale, Readability Revisited: The New Dale-Chall Readability Formula, Brookline Books, 1995.
[33] G. Spache, A new readability formula for primary-grade reading materials, The Elementary School Journal 53 (1953) 410–413.