Using heterogenous information networks for integrative discourse mapping
The Covid19 example

Alexander Brand
Institute of Social Sciences, University of Hildesheim, Hildesheim, Germany
alexander.brand@uni-hildesheim.de

Tim König
Institute of Social Sciences, University of Hildesheim, Hildesheim, Germany
tim.koenig@uni-hildesheim.de

Wolf J. Schünemann
Institute of Social Sciences, University of Hildesheim, Hildesheim, Germany
wolf.schuenemann@uni-hildesheim.de

ABSTRACT

This short paper presents a novel way of mapping knowledge communities in discourse by utilizing heterogenous information networks (HINs) and a two-stage grouping procedure. After laying out the theoretical foundations of a discourse analytical framework grounded in the sociology of knowledge, it demonstrates the applicability of the framework on the platform Twitter. In exploratively analysing a sample of 6,317,324 tweets on the Covid19 pandemic, we show how clustered HINs can make visible the social embeddedness of knowledge production in digital environments.

CCS CONCEPTS

• Applied computing~Law, social and behavioral sciences~Sociology • Information systems~Information systems applications~Data mining~Clustering

KEYWORDS

Network Analysis, Covid19, Heterogenous Information Networks, Discourse Analysis, Sociology of Knowledge

KnOD’21, April, 2021, Ljubljana, Slovenia
KnOD'21 Workshop - April 14, 2021. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction: Scalable and adaptive mapping tools for discourse analysis

In this short paper, we present ongoing work that seeks to combine the theoretical foundations of knowledge-oriented discourse analysis with the application of heterogenous information network (HIN) analysis. Given its long tradition of investigating the politics of knowledge and meaning-making, social science discourse research can help to avoid the common pitfalls of studying isolated information elements without reflecting the relevant “social relations of knowledge and knowing” [1, p. 18]. With its toolboxes for the inquiry of the social construction of reality, the discourse analytical approach can help to go beyond facts, especially when studying online communication. These toolboxes, however, need renewal and extension in order to live up to the newly accessible data worlds of digital societies. One of the generic operations of discourse analysis, and quite often the best way to reduce the complexity of results, is discourse mapping. Discourse mapping is to be understood as an umbrella term for visualisation techniques that allow for synoptical integration of relational or knowledge structures dissected through discourse analysis. This short paper puts emphasis on this particular task.

Qualitative discourse research and other approaches within the interpretive paradigm of social sciences have developed a rich variety of mapping strategies for illustrative and instructive syntheses of empirical findings [2, 3, 4, 5]. Such maps can be regarded as a type of small-sized, non-standardised, interpretation-loaded knowledge graph, making visible the heterogenous knowledge communities that shape discourses. Established discourse mapping strategies, though highly flexible, cannot easily cope with the large-scale data of online communication and their need for standardisation and automation. Because they require interpretive work, they are not scalable and thus neither extendable to wider contexts of meaning-making nor transferable to other subjects. In this paper, we argue that the flaws in established mapping strategies can be - at least partly - overcome by using HINs for the integrative and adaptive mapping of discourses. A HIN is defined as a directed graph consisting of multiple types of objects or multiple types of relations between objects [6]. This mirrors the assumptions of discourse maps, which relate different actors to different kinds of information in order to make visible their specific knowledge communities.

In the next section, we present the theoretical foundations of our work, informed by a sociology of knowledge perspective on discourse. This is followed, in section 3, by a description of our methodology. In section 4, we present an exemplary online discourse analysis of a publicly available, large-scale dataset of Covid19 tweets. We chose the pandemic as context, as we expect the respective dataset to indeed represent a complex network of discursive formations and structures of knowledge production.

2 Theoretical Foundations

Knowledge is an ambiguous concept. Even in empirical social sciences, it is frequently understood as a measurable resource of individual people. This conception is especially prevalent in the fields of political studies, including political psychology and political sociology, wherein - somewhat surprisingly - knowledge is mostly conceived as an individual instead of a collective resource [7, 8]. Theoretically and methodologically, this goes along with a widely shared pre-occupation with the micro-level foundations of social action in political studies at the expense of the relational dimension [9, 10] essential for knowledge production. In contrast to such prevalent conceptions, we root ourselves in a sociology of knowledge tradition [11, 12, 13, 14]. Moreover, the accompanying methodological re-orientation towards a social science research tradition of discourse analysis helps to avoid individualist misconceptions of knowledge and provides research methods that allow us to go beyond facts in the empirical inquiry of the social construction of knowledge. This seems particularly helpful these days, as so-called disinformation is increasingly gathering academic interest and the scholars involved are running the risk of neglecting the social dimension in the production of knowledge [15].

Delving into processes of collective meaning-making by applying discourse analytical methods is essential, as “information by itself usually has no value: it is a raw material that gains value if further processed in specific ways and if meaning and a certain quality are attached to it” [16, p. 15]. Thus, knowledge cannot belong to the features of an individual (a user of digital media in our case) but is produced, processed in and obtained from discourses. Information or facts are ‘consumed’ by users only through these collectively built filters of perception. While emanating from a Foucauldian, post-structuralist tradition, discourse analysis does not necessarily mean neglecting the crucial relevance of agency. Instead, it is our social-constructivist conception that makes us attribute a hub-like role to actors (here: users of online media) in the basic design of our complex networks (see analysis section below). As discourses “are performed through social actors’ (often competing or conflictual) discursive practices” [17, p. 3], it is actors that performatively produce the linkages that we can map in a network, be it links to other entities that are identifiable in online discourses such as URLs, hashtags, named entities, or other users. Taking such entities not only as linkages in communicative networks but as constitutive elements of issue publics or even communities of discourse and knowledge, we can rely on theoretical and analytical assumptions developed in the field of digital communication studies [18, 19, 20].

These relational patterns can be made visible by studying digital trace data at a large scale. The social media platform Twitter provides a particularly well-suited test case for our methodology. Twitter, with its characteristics of both a social network and an information network, makes visible the formation of knowledge communities through the curation of information flows by its users. While user influence on the platform follows a power-law distribution, all users are free to distribute, share and comment on information with their own followers, effectively providing the tools to collectively shape information environments in a network-based manner [21]. By looking at Twitter, we can make visible the processes which filter and curate the information environments of its users without neglecting the role of these very users and their networks.

3 Methodology

Methodologically, we use a two-stage grouping procedure. First, we obtain a mesoscopic representation of the network. Following Bar-Hen et al. [22], such a representation of the network is obtained by grouping together nodes of the same entity type and the same cluster and displaying them as blocks. This representation is very similar to a general block model, with the notable difference of additional separation by node type. The choice to use a block model method in the first step was made with regard to the good performance of such models for large networks. Additionally, it allowed us to draw on applied research on combined clustering of multiple types of entities, such as documents and text in the case of Gerlach et al. [23]. In the second step, a simpler clustering procedure can be carried out, which takes the edge weights into account. In the following sections we refer to the clustered mesoscopic view of the network as the macroscopic view and to its clusters as macroclusters.

4 Exemplary Analysis

For our analysis we used the TweetsCOV19 dataset [24]. We chose the pandemic as context not just due to its current relevance, but because we expect the respective dataset to indeed represent a complex network of discursive formations and knowledge communities. This would include, among others, various special discourses of scientific experts, governmental communicative discourses, as well as general public discourse co-constituted by mainstream and social media. TweetsCOV19 is an annotated, publicly available Twitter corpus of more than 8 million tweets on Covid19, including data from October 2019 to April 2020. For our analysis, we used a restricted version starting with the first public appearance of a Covid19 case in the general media on 31 December 2019, preventing false positive matches. After this restriction, our sample consists of 6,317,324 tweets. A timeline of the number of tweets can be found in Appendix 1.

We proceeded as follows: In the first step, we constructed a poly-partite network from the tweets with the username, mentions, hashtags, URLs and named entities as node types and edges of one type which symbolize references (e.g., User X uses Hashtag Y in a tweet). Named entities were extracted using scores from the Fast Entity Linker Core library and URLs were expanded when necessary [24]. Furthermore, we removed stop words from user mentions, hashtags and named entities. This led to a poly-partite network with the following properties:

Table 1: Basic metrics of the constructed poly-partite network

Metric                          Value
Total sum of nodes              708,352
Sum of unique usernames         130,997
Sum of unique user mentions     176,991
Sum of unique hashtags          158,438
Sum of unique URLs              145,041
Sum of unique named entities    96,885
Total sum of edges              111,399,912

In the next step, an agglomerative collapsing algorithm [25, 26] was used to block the nodes in the network. Following our agent-centric theoretical assumption that knowledge is produced by a community of users (see above), interblock connections consist of user-user, user-hashtag, user-URL, and user-named entity relations. Due to the large number of edges, an agglomerative heuristic was applied, which iteratively tries to find a better configuration of blocks by progressively merging blocks together [25]. The final model, selected via the lowest-entropy criterion, consists of 16 named entity blocks, 28 hashtag blocks, 19 user mention blocks, 23 URL blocks and 14 username blocks. An overview of the number of nodes in each block can be found in Appendix 2. In the next step, the macroclusters were computed via a simple greedy clustering algorithm, clustering the blocks in the mesoscopic network. In the last step, a qualitative coding of the blocks and clusters was performed. To ensure the interpretability of the results, we calculated the PageRank of each node in the original poly-partite network and considered the top 10 nodes per block for the coding, similar to the evaluation of structural topic models and in line with Twitter’s power-law distribution. For the coding of the usernames, we further included the account description in the coding step. URLs were coded via clues in the URL title. Generally, we used simple heuristics for content coding. For example, clusters containing actors from the fields of music, art and film were coded as "Cultural", while URLs coded with "Protection from Covid19" contain reports on different levels of protection in relation to aspects like ethnicity.

5 Results

Our results indicate a heterogenous discursive space. For the evaluation of the results, we present two novel visualizations: a full macroscopic view of the poly-partite network and a qualitatively annotated visualization of each macrocluster. The macroscopic representation makes it possible to visualize the general structure of the network in a form similar to the discourse maps commonly used in social science discourse research. As can be seen in Figure 1, four of the five groups are rather strongly separated from each other, while a fifth (grey) is more torn apart. However, we also observe some outliers. For example, three blocks of the third macrocluster are located relatively far apart from the rest of their cluster.

Figure 1: Full macroscopic view of the network with a stress-based layout. The background colour symbolizes the assignment of the nodes to the respective macrocluster (Dirichlet tessellation). The node size represents the number of nodes in the blocks of the poly-partite network. The visibility of the edges symbolizes the number of connections between blocks. Blocks with less than 0.1% of all out-going edges were cut from the representation to support the visualization.

The annotated version of the network (Figure 2) enables a more qualitative look at the different areas of discourse. Consistent with Figure 1, it is observable that macrocluster one is a mixed compilation with no clear identity. The second cluster, “Organizational Aspect and Early Response”, contains blocks associated with aspects like the role of organizations in the pandemic and early response actions like the proposed usage of Hydroxychloroquine for the treatment of Covid19 patients. The third cluster, “Technology and Daily Life”, takes a deeper dive into media, culture, and technological aspects. There is a certain proximity to macrocluster two, which is also notable in Figure 1. The aforementioned three outliers are more related to topics like weapons and US politics. The fourth cluster, “Culture and Safety”, deals more with aspects like mental health and media, while the fifth cluster, “Uncertainty”, focusses on the uncertainties of living through the pandemic. Following this differentiation, we can see that the chosen representation suggests a description of the Covid19 corpus along the lines of organizational aspects, reaction compulsion, cultural and technical adaptations, media use and general uncertainty. These aspects do not appear in isolation, but within the framework of a complex web of different emphases and affiliations.
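As a rough illustration of the network construction and coding steps described in section 4, the following Python sketch builds a small poly-partite edge list from toy tweet records, computes a weighted PageRank by power iteration, and lists the top-ranked nodes per block. It is not the pipeline used for the analysis: the record fields, the toy data, the function names and the use of node type as a stand-in for the inferred blocks are all assumptions made for this example, and the block model inference itself is not attempted.

```python
from collections import defaultdict

# Hypothetical toy records; the field names are assumptions for this sketch,
# not the TweetsCOV19 schema.
TWEETS = [
    {"user": "alice", "mentions": ["who"], "hashtags": ["covid19", "lockdown"],
     "urls": ["example.org/a"], "entities": ["Wuhan"]},
    {"user": "bob", "mentions": ["alice"], "hashtags": ["covid19"],
     "urls": [], "entities": ["WHO"]},
    {"user": "carol", "mentions": [], "hashtags": ["covid19"],
     "urls": ["example.org/a"], "entities": []},
]

# Tweet field -> node type of the referenced entity.
NODE_FIELDS = {"mentions": "mention", "hashtags": "hashtag",
               "urls": "url", "entities": "entity"}


def build_edges(tweets):
    """Count references from username nodes to mention/hashtag/URL/entity nodes."""
    weights = defaultdict(int)
    for tweet in tweets:
        source = ("username", tweet["user"])
        for field, node_type in NODE_FIELDS.items():
            for value in tweet.get(field, []):
                weights[(source, (node_type, value))] += 1
    return dict(weights)


def pagerank(edges, damping=0.85, iterations=100):
    """Weighted PageRank by power iteration; dangling mass is spread uniformly."""
    nodes = sorted({node for edge in edges for node in edge})
    out_weight = {node: 0.0 for node in nodes}
    for (source, _), weight in edges.items():
        out_weight[source] += weight
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for (source, target), weight in edges.items():
            new_rank[target] += damping * rank[source] * weight / out_weight[source]
        rank = new_rank
    return rank


def top_nodes_per_block(rank, blocks, k=10):
    """Return the k highest-ranked nodes of every block (ties broken by name)."""
    grouped = defaultdict(list)
    for node, block in blocks.items():
        grouped[block].append(node)
    return {block: sorted(members, key=lambda n: (-rank[n], n))[:k]
            for block, members in grouped.items()}


edges = build_edges(TWEETS)
rank = pagerank(edges)
# Placeholder: one block per node type. In the paper, blocks are inferred
# with a stochastic block model, which this sketch does not attempt.
blocks = {node: node[0] for node in rank}
top = top_nodes_per_block(rank, blocks, k=10)
for block, members in sorted(top.items()):
    print(block, [name for _, name in members])
```

Keeping nodes as (type, name) pairs means a username and a mention of the same account stay distinct nodes, which matches the separate node types listed in Table 1. With real block assignments, only the `blocks` dictionary would need to change.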
Figure 2: Subdivision of the macroscopic perspective of the network by macrocluster. Colours are consistent with the background colours in Figure 1. The node size represents the number of nodes in the blocks of the poly-partite network. The visibility of the edges symbolizes the number of connections between the different blocks. Blocks with less than 0.1% of all out-going edges were cut from the representation to support the visualization. Node types are username, mentions, hashtags, URLs and named entities (here shortened to “entities”).

6 Conclusion

This paper aimed to showcase a methodology that combines established discourse analytical assumptions about the social embeddedness of knowledge production with a scalable framework for heterogenous information network analysis. We showed that on Twitter, a platform affording user-based knowledge production and sharing, HINs can make these processes visible. We thereby demonstrated that HINs provide a powerful tool for social science discourse research to map large-scale online discourses. As such, they can help unearth the complex discourse formations in which knowledge is produced, especially in digital contexts where the amount of data often makes a qualitative approach unfeasible. Possible applications range from the mapping of issue-centred discourses to the identification of (mis-)information hubs on social media and the large-scale analysis of policy networks. In the explorative analysis of the Covid19 pandemic on Twitter, five macroclusters with differing users, URLs, hashtags, named entities and mentions became visible. We can identify these clusters as knowledge communities, collectively shaping heterogenous information environments through their intra- and intercluster relations.

In order to exhaust the possibilities of this approach, future analyses should consider utilizing even more diverse types of data for clustering. The framework is highly flexible and able to incorporate multiple data sources and types of nodes. This flexibility can stretch to different types of data, such as textual or visual analyses, and even heterogenous data sources, such as different platforms. Furthermore, HINs allow for the specification of different edge types for an even more sophisticated model. This allows researchers to tailor their analysis around specific subjects without compromising either theoretical foundations or scalability. However, the selection of nodes should be theory-driven in order to avoid arbitrariness and remain economical with regard to computational resources. Our next step to improve the information richness of the macroclusters would therefore be the integration of quantitative text analysis into the model, giving a more in-depth look, beyond facts, into the knowledge communities surrounding the Covid19 pandemic on Twitter.

REFERENCES

[1] R. Keller. 2018. The sociology of knowledge approach to discourse. An introduction. In The sociology of knowledge approach to discourse, R. Keller, A.-K. Hornidge, and W. Schünemann, Eds. Abingdon, Oxon and New York, NY: Routledge, 16–47.
[2] A. E. Clarke, C. Friese, and R. Washburn. 2018. Situational analysis: Grounded theory after the interpretive turn (Second edition). Los Angeles, London, New Delhi, Singapore: Sage.
[3] A. Luther. 2017. The Entity Mapper: A Data Visualization Tool for Qualitative Research Methods. Leonardo, vol. 50, no. 3, 268–271. DOI: 10.1162/LEON_a_01148.
[4] R. Keller. 2013. Doing discourse research: An introduction for social scientists. London: SAGE Publications.
[5] A. Luther and W. J. Schünemann. 2018. From analysis to visualisation: Synoptical tools from SKAD studies and the entity mapper. In The sociology of knowledge approach to discourse, R. Keller, A.-K. Hornidge, and W. J. Schünemann, Eds. Abingdon, Oxon and New York, NY: Routledge, 274–299.
[6] C. Shi, Y. Li, J. Zhang, Y. Sun, and P. S. Yu. 2015. A Survey of Heterogeneous Information Network Analysis. arXiv:1511.04854 [physics]. Available: http://arxiv.org/abs/1511.04854.
[7] A. Downs. 1968. Ökonomische Theorie der Demokratie, vol. 8. Tübingen: Mohr.
[8] A. Lupia and M. D. McCubbins. 1998. The democratic dilemma. Can citizens learn what they need to know? Cambridge: Cambridge Univ. Press.
[9] D. Lazer and S. Wojcik. 2018. Political Networks and Computational Social Science. In The Oxford handbook of political networks, J. N. Victor, A. H. Montgomery, and M. Lubell, Eds. New York, NY: Oxford University Press, 115–130.
[10] J. N. Victor, A. H. Montgomery, and M. Lubell. 2018. Introduction: The Emergence of the Study of Networks in Politics. In The Oxford handbook of political networks, J. N. Victor, A. H. Montgomery, and M. Lubell, Eds. New York, NY: Oxford University Press, 3–57.
[11] P. L. Berger and T. Luckmann. 1969. Die gesellschaftliche Konstruktion der Wirklichkeit. Eine Theorie der Wissenssoziologie. Frankfurt am Main: Fischer.
[12] R. Keller. 2005. Analysing Discourse. An Approach From the Sociology of Knowledge. Forum: Qualitative Social Research (FQS), vol. 6, no. 3, Art. 32.
[13] K. Mannheim. 1964. Wissenssoziologie. Auswahl aus dem Werk, vol. 28. Berlin Neuwied: Luchterhand.
[14] S. Maasen. 2009. Wissenssoziologie (2., komplett überarb. Aufl.). Bielefeld: Transcript-Verlag.
[15] W. L. Bennett and S. Livingston. 2018. The disinformation order: Disruptive communication and the decline of democratic institutions. European Journal of Communication, vol. 33, no. 2, 122–139. DOI: 10.1177/0267323118760317.
[16] M. Dunn Cavelty. 2008. Cyber-security and threat politics: US efforts to secure the information age. London: Routledge.
[17] A.-K. Hornidge, R. Keller, and W. Schünemann. 2018. Introduction. The sociology of knowledge approach to discourse in an interdependent world. In The sociology of knowledge approach to discourse, R. Keller, A.-K. Hornidge, and W. Schünemann, Eds. Abingdon, Oxon and New York, NY: Routledge, 1–15.
[18] L. A. Adamic and N. Glance. 2005. The political blogosphere and the 2004 US election: Divided they blog. In Proceedings of the 3rd international workshop on Link discovery, 26–43. DOI: https://doi.org/10.1145/1134271.1134277.
[19] A. Bruns and J. Burgess. 2011. The use of Twitter hashtags in the formation of ad hoc publics. In 6th European Consortium for Political Research General Conference. University of Iceland, Reykjavik.
[20] M. Eriksson Krutrök and S. Lindgren. 2018. Continued Contexts of Terror: Analyzing Temporal Patterns of Hashtag Co-Occurrence as Discursive Articulations. Social Media + Society, vol. 4, no. 4. DOI: 10.1177/2056305118813649.
[21] S. A. Myers, A. Sharma, P. Gupta, and J. Lin. 2014. Information network or social network? The structure of the Twitter follow graph. In Proceedings of the 23rd International Conference on World Wide Web, Seoul, Korea, 493–498. DOI: 10.1145/2567948.2576939.
[22] A. Bar-Hen, P. Barbillon, and S. Donnet. 2020. Block models for multipartite networks. Applications in ecology and ethnobiology. arXiv:1807.10138 [stat]. Available: http://arxiv.org/abs/1807.10138.
[23] M. Gerlach, T. P. Peixoto, and E. G. Altmann. 2018. A network approach to topic models. Science Advances, vol. 4, no. 7, eaaq1360.
[24] D. Dimitrov et al. 2020. TweetsCOV19 -- A Knowledge Base of Semantically Annotated Tweets about the COVID-19 Pandemic. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2991–2998. DOI: 10.1145/3340531.3412765.
[25] T. P. Peixoto. 2014. Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models. Phys. Rev. E, vol. 89, no. 1, 012804. DOI: 10.1103/PhysRevE.89.012804.
[26] T. P. Peixoto. 2019. Bayesian stochastic blockmodeling. arXiv:1705.10225 [cond-mat, physics:physics, stat], 289–332. DOI: 10.1002/9781119483298.ch11.

APPENDIX

Appendix 1: Number of Tweets over Time

Appendix 2: Number of Nodes in each Block