Research and Teaching Public Communication of Science and Technology on Digital Data

Emanuele Di Buccio1,2,3, Federico Neresini3
1 Department of Information Engineering, University of Padova
2 Department of Statistical Sciences, University of Padova
3 Department of Philosophy, Sociology, Education and Applied Psychology, University of Padova

Abstract
In recent decades, there has been a growing interest among Social Science researchers in computational approaches; Computational Social Science and Digital Sociology are examples of these research directions. An interdisciplinary research field that can be framed within Social Science is Public Communication of Science and Technology (PCST), which examines how science and technology can affect contemporary society and how society can affect science and technology. The digitization of traditional media and the proliferation of other information channels, such as Social Media, provide new opportunities for PCST. This paper discusses the issues that need to be addressed to support PCST scholars, possible solutions to address them, and the integration of these solutions into a single platform that is being used to support research and teaching. Concerning teaching, the paper presents an example of how the platform can be used in the context of a university course.

Keywords
Digital Social Science, Research Platform, Public Communication of Science and Technology

IRCDL 2024: 20th Conference on Information and Research Science Connecting to Digital and Library Science (formerly the Italian Research Conference on Digital Libraries), Bressanone-Brixen, Italy, 22-23 February 2024
emanuele.dibuccio@unipd.it (E. Di Buccio); federico.neresini@unipd.it (F. Neresini)
ORCID: 0000-0002-6506-617X (E. Di Buccio); 0000-0003-3918-2588 (F. Neresini)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

1. Introduction

Social Science is a broad field of research that includes multiple disciplines, such as Sociology and Political Science. In recent decades, there has been a growing interest in adopting computational approaches to deal with the increasing amount of digitized content available [1, 2]. There is a large body of work that involves the adoption of Machine Learning (ML), and more generally AI-based or AI-inspired methodologies, to support the research tasks of social scientists [2], to propose new interdisciplinary methodologies, or to rethink and possibly automate previous theories and approaches from a new perspective [3, 4].

This work focuses on an interdisciplinary field called Public Communication of Science and Technology (PCST) [5]. This field includes the "practice to make specialized knowledge available for the public" [6]; science communication "is used to inform, engage, persuade, change behaviors, and support better decision making [...] aims to lift the social, environmental and economic standing of a nation's people [...] It may also support the participation of citizens in setting the agenda for scientific research." [7]. Communication plays a crucial role in scientific research, e.g., to attract attention and to enhance its legitimacy in the eyes of its stakeholders and potential supporters [8, 9].
The public debate on science-related issues may begin when the debate among the experts (scientists) is still "ongoing" – see, for example, the case of cloning. Moreover, even when there is a consensus among scientists, scientific issues can become controversial once they become the subject of public debate. The mass media – both "old" media, such as newspapers, and "new" media, such as social networks – play an essential role in the public perception of scientific and technological issues and innovations. For example, they constantly propose different (interpretative) frames that may influence how the public perceives a particular issue. If PCST activities take place mainly in the media arena, then they are characterized by the dual role of the mass media: on the one hand, the media constitute a privileged space within which relationships between science and society occur; on the other hand, they contribute to fueling and shaping these relationships. For this reason, the analysis of media communication on science and technology represents an excellent opportunity for the social sciences.

This paper considers research and teaching on PCST. We will revisit the analyses of user needs conducted in the context of Digital Humanities [10, 11] and Social Science Research [3] and report on issues that need to be addressed to support these activities, possible directions to address them, and our past and current efforts in pursuing these directions. To show how some of the proposed solutions can be adopted for teaching, we will discuss a recent activity we carried out during a course for the Master's Degree in Communication Strategies.

2. Issues in PCST using Digital Data

Public Communication of Science and Technology encompasses several activities. In this paper, we will focus on media monitoring, which aims to follow the discourse – in the case of PCST, the discourse on science and technology – on one or more media channels. One of the channels traditionally followed is newspapers — in this sense, we can think of newspapers as "old" media compared to more recent channels such as social media platforms. Data digitization allows PCST scholars to test their research hypotheses both on "new" articles and on historical newspaper data. Several research projects have focused on designing and developing digital libraries to preserve and provide access to historical newspaper archives; recent projects include NewsEye [12] and Impresso [13]. Research in PCST through newspapers is still relevant today, for instance, because it offers the possibility of investigating research hypotheses that require longitudinal studies. For example, suppose we wanted to follow the discourse on computing technologies or AI over several decades, starting in the 1960s. Social media are too recent to be a data source for such research questions. However, social media are an invaluable and necessary source for other research questions, such as studying the discourse on COVID-19 or Generative AI, newer viewpoints, or interactions specific to new platforms. Other media channels, such as vlogs or podcasts, are now available and can be used as data sources. Therefore, PCST can primarily benefit from working with digitized data and, as we will discuss later, from using computational approaches to uncover how the media present and "frame" issues, such as those related to science and technology. To achieve this goal, however, several issues must be addressed; they are discussed in the remainder of this section.

2.1. Datasets and continuous media monitoring
Content analysis is a well-established practice in social science research. However, most previous studies have been based on samples, e.g., a subset of articles on a particular topic. While this is acceptable for some research questions, other questions require analysis of or comparison with the "entire population." For example, when studying the presence of articles on science and technology over time, working only with articles relevant to science and technology and looking at absolute frequencies would lead to an incorrect conclusion: that the media pay growing attention to science and technology. This is not the case, because the relative frequency has remained almost constant over the last decade [14]. Another reason for not working with samples is the definition of the object of interest. If we are interested in following the discourse on science and technology in newspapers, why not focus only on the "Science" and "Technology" sections? The reason is that we would miss part of the discourse: what about the debate on these issues within articles mainly focused on sport or business? In the case of newspapers, studies are rarely conducted on all the newspapers available online: a subset is selected to include those that are representative of different political orientations. Even when a subset of newspapers (and sources in general) is selected, the size of the data, which is longitudinal in nature, may require a scalable platform to handle it.

Media monitoring and approaches dedicated to handling news content are not new to IAR and NLP: the topic has been the focus of evaluation campaigns, workshops,1 and projects; dedicated datasets2 have been created. Therefore, PCST can benefit from these methodologies and resources. Several platforms allow experts in other disciplines with limited programming skills to work with data; examples include Knime,3 Orange Data Mining,4 and CorText [15]. Knime and Orange allow workflows to be implemented via "visual programming", specifically by connecting blocks — nodes in Knime and widgets in Orange; different types of blocks exist, e.g., for data collection, preprocessing, and analysis. CorText allows workflows to be implemented using several existing functionalities for data collection, subsetting, preprocessing, and analysis. While continuous data monitoring can be implemented using more advanced or custom functionality in Knime or by augmenting CorText with other libraries, these solutions may not be easily implemented by PCST scholars, who can benefit from an integrated environment, complex search functionalities, and (potentially very large) dataset export capabilities.

1 See, for example, https://research.signal-ai.com/newsir18/ and https://research.idi.ntnu.no/NewsTech/INRA/
2 See, for example, https://catalog.ldc.upenn.edu/LDC2008T19 and https://trec-core.github.io/2018/
3 https://www.knime.com/
4 https://orangedatamining.com/

A different approach is taken by systems such as NOAM [16], the Europe Media Monitor [17], NewsEye, or Impresso. These systems aim to provide an integrated and ready-to-use environment. However, none of them meets all the requirements when it comes to continuous and ongoing monitoring (and not just archives), IAR, and support for the strategies PCST scholars use to conduct research (see Section 2.3) or within teaching activities (see Section 3.2).
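Returning to the absolute-versus-relative frequency point at the beginning of this section, a minimal sketch of that computation is shown below; the yearly counts are hypothetical placeholders, not TIPS data.

    import pandas as pd

    # Hypothetical yearly counts: articles classified as science & technology
    # (S&T) versus all articles published by the monitored newspapers.
    counts = pd.DataFrame({
        "year": [2012, 2014, 2016, 2018, 2020],
        "st_articles": [8_000, 9_500, 11_000, 12_500, 14_000],
        "all_articles": [200_000, 237_500, 275_000, 312_500, 350_000],
    })

    # The absolute number of S&T articles grows steadily, but their share is
    # flat at 4%: with these (made-up) numbers the media are not paying growing
    # attention to S&T; the newspapers are simply publishing more overall.
    counts["st_share"] = counts["st_articles"] / counts["all_articles"]
    print(counts)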
2.2. Heterogeneous sources access and processing

A common methodology in PCST, given a research hypothesis, is to conduct a comparative study between different sources/channels. For example, we can study the same phenomenon, e.g., the public debate on AI, in different countries and use media channels such as newspapers and social media as proxies for public opinion. This requires the construction of corpora that are aligned in terms of the temporal dimension, in different languages, from different channels, and possibly in different modalities. Podcasts, for example, can be a relevant source to study nowadays, since many of them focus on news; besides the content presented in the episodes, one could also look at the way the content is delivered, e.g., as done in [18], by considering vocal and conversational properties when predicting seriousness and energy.

Even when libraries are available to collect and preprocess different types of data, such as those mentioned above, access to the channels can be a problem. Newspapers such as The Guardian5 or The New York Times6 provide Web APIs. However, services for collecting data from (some) social media that used to be freely available for research purposes are no longer available or, if available, require substantial fees to download large amounts of data. This can severely limit research on these channels.

5 https://open-platform.theguardian.com
6 https://developer.nytimes.com
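For channels that do expose open APIs, collection is comparatively simple. The sketch below queries The Guardian's Content API as an illustration; the endpoint and field names follow its public documentation, while the API key and query are placeholders.

    import requests

    API_KEY = "YOUR-GUARDIAN-API-KEY"  # placeholder: obtained from open-platform.theguardian.com

    # Ask the Content API for articles matching a phrase in a given period.
    resp = requests.get(
        "https://content.guardianapis.com/search",
        params={
            "q": '"artificial intelligence"',
            "from-date": "2016-01-01",
            "to-date": "2016-12-31",
            "page-size": 50,
            "api-key": API_KEY,
        },
        timeout=30,
    )
    resp.raise_for_status()

    # Each result carries metadata such as publication date, title, and URL.
    for item in resp.json()["response"]["results"]:
        print(item["webPublicationDate"], item["webTitle"], item["webUrl"])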
In addition to access to sources, another aspect to consider is content heterogeneity. Even when considering the same modality, e.g., text, different channels may require different methods. A well-known example is the case of topic extraction methods, such as Topic Modeling (TM) algorithms, whose effectiveness may be affected by document length [19], which may vary in datasets consisting of microblog or forum posts, such as those from Reddit, or when different sources are considered simultaneously [20].

2.3. Workflow support

When dealing with experts in other fields, such as the Humanities or Social Sciences, one central issue is supporting their workflows. Those experts alternate quantitative and qualitative approaches to investigate their research questions or, more generally, to accomplish a task. A discussion of this aspect is reported in [11]. Even if that contribution is set in the context of the NewsEye project and the study of historical newspapers, the authors provided an abstraction of the problem and proposed a workflow that can be adopted as a conceptual tool. The authors discuss how an interdisciplinary digital hermeneutics workflow is necessary to pursue a direction of interdisciplinary research that does not consider only the distinct points of view – that of humanists and that of computer scientists – but a joined view. The goal is to move away from "supporting their workflow" and towards an interdisciplinary approach. For instance, it is not always possible to frame a task as "simpler" subtasks and then later translate them into pipelines: this is not how humanists (and, in our case, PCST scholars) proceed. For this reason, [11] proposed a workflow that considers three main aspects: (a) data, (b) iterative qualitative analytical steps over the data, and (c) critical reflection on data, algorithms, and tools.

As for (a), besides the need to focus on specific subcorpora, another critical aspect is the curation of the data. In the case of historical newspapers, as in [11], tasks related to this point included extracting content and metadata from scans or images via OCR technologies. Based on our experience with the more "recent" online newspapers, we can add that automatic techniques robust to the diverse templates and structures of the pages are required when scaling up the number of sources. Another point stressed in [11] is the importance of metadata to get the context of the themes under investigation, which is crucial for deriving meaning from the data.

As for (b), a key aspect is the iterative approach and the high level of interaction required with the data. In this case, search can be a specific tactic within a more complex strategy to accomplish a task. In [10], the authors discuss some relevant tasks when working with historical newspapers and useful digital interface functionalities to accomplish such tasks:
1. filtering and searching by full-text queries and metadata such as time or newspaper;
2. identification and disambiguation of named entities;
3. identification of the first occurrence of words or expressions;
4. study of the change in meaning of words over time;
5. extraction of themes (topics in TM), interaction with theme descriptions (labels) for their interpretation and possible refinement, access to a representation of a document based on themes, visualization of the prominence of themes over time;
6. advanced search features, such as relying on Boolean queries or regular expressions.
In addition, [11] mentioned the importance of improving search beyond keywords, since some concepts are complex to express by a set of keywords, even if the suggestion/extraction of new relevant keywords – such as named entities extracted from the subcorpora – could help.

The last point (c) is crucial for computer scientists because it requires "openness and transparency of methods and tools" [11]; this is important both for reproducibility and to make explicit the assumptions underlying methods and algorithms and the role these assumptions play in investigating the experts' research questions.

We observed analogous needs when interacting with PCST scientists [21, 22]; they require:
• better support for IAR, going beyond keyword-based search;
• ways to easily incorporate new and possibly heterogeneous sources for new perspectives on the public perception of science and technology issues;
• to switch from one (quantitative) strategy to another, to return to a more qualitative analysis, and then to perform further interactions;
• to compute a set of consolidated or new indicators, e.g., the "risk indicator" [23], on the subcorpora identified after several iterations.
Digital platforms to support research and teaching in these areas should provide these functionalities in an integrated environment or at least facilitate the "implementation" of complex workflows that are not necessarily linear, but that can support different strategies and the alternation between quantitative and qualitative approaches to data analysis.

3. Towards Supporting Research and Teaching in PCST

3.1. Research

In Section 2, we identified three main issues: (i) datasets and continuous monitoring; (ii) heterogeneous sources access and processing; (iii) lack of support for the workflows of PCST scholars. In this section, we describe how we are currently addressing them in an interdisciplinary project called TIPS7 (Technoscientific Issues in the Public Sphere).

7 https://www.tipsproject.eu/tips/
The first issue, i.e., not limiting the research to samples and carrying out continuous monitoring, has been addressed by designing and developing a modular software platform, presented in [21, 22], that stores and provides access to articles collected from fifteen newspapers in different languages; for three of them, archives are available that go back to the 1980s, obtained by complementing test collections, such as The New York Times Annotated Corpus, or collected through the available Web APIs, as for The Guardian. All newspapers are still being monitored, so that we can work on more recent issues. Continuous monitoring required engineering effort to design and implement a robust architecture. We have integrated search, PoS tagging, named entity extraction, and topic modeling into a single platform that meets all of the requirements discussed in Section 2.1 for continuous and ongoing monitoring, IAR, and support for the strategies used by PCST researchers to conduct their research. Access to the platform requires authentication; credentials may be requested for research purposes. To support reproducibility, users can download the metadata of the documents used for their analysis; the metadata includes the article's URL, which allows access to the full content via the original source.

Regarding the second issue, i.e., heterogeneous source access and processing, we designed the platform to work with arbitrary documents and different languages. The platform already stores and provides access to aligned corpora in different languages, which has allowed us to perform comparative studies between different countries. Even if we do not monitor social media platforms, existing datasets can easily be included and processed by the existing pipeline. How to replace channels like the Twitter API remains an open question for which a solution has not yet been found. As for the robustness of the algorithms when working with heterogeneous data, as in [20], we have built temporally aligned test collections for different topics — e.g., DNA and AI — from 2010 to 2022 using data collected from social media and the news; they will be used to conduct experimental evaluations. The involvement of PCST scholars provides a unique opportunity to gain insights into the effectiveness of these approaches on "real" tasks through qualitative evaluation.

The third issue, i.e., supporting the workflows of PCST scholars, is the central aspect we are working on. The current functionalities already support filtering and search, named entity extraction, identification of the first occurrence of words and expressions, and advanced search features like regular expressions or Boolean constraints. A dedicated functionality in the platform allows the extraction of the top named entities and nouns for a given query, thus helping users express their needs better. Additionally, users can work on specific subcorpora, obtained by Boolean queries, directly within the platform, thus utilizing all the functionalities and indicators available without relying on other platforms or libraries. Some workflows are already supported, e.g.,

search → identify a subcorpus → topic extraction on the subcorpus → topic description and top docs per topic analysis → analysis of the evolution of topics over time → identify a subcorpus using a subset of the topics and re-extract them

Even if the platform supports multiple iterations, the above workflow is limited to search and TM.
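A minimal sketch of one iteration of this loop is shown below. It uses gensim's LDA and a naive keyword filter purely for illustration (the platform implements these steps natively); the documents and helper names are hypothetical.

    from gensim import corpora, models

    def search(docs, terms):
        # Toy keyword filter standing in for the platform's Boolean search.
        return [d for d in docs if any(t in d["text"].lower() for t in terms)]

    def extract_topics(docs, num_topics):
        # Tokenize naively and fit an LDA model on the subcorpus.
        tokenized = [d["text"].lower().split() for d in docs]
        dictionary = corpora.Dictionary(tokenized)
        bows = [dictionary.doc2bow(toks) for toks in tokenized]
        return models.LdaModel(bows, num_topics=num_topics, id2word=dictionary)

    # Hypothetical subcorpus export; in TIPS this would come from the platform.
    docs = [
        {"text": "CRISPR genome editing raises hopes and concerns", "year": 2018},
        {"text": "Gene therapy trial reports encouraging results", "year": 2019},
        {"text": "Transfer news dominates the football back pages", "year": 2019},
    ]

    # One iteration: search -> subcorpus -> topic extraction -> inspection.
    sub = search(docs, ["genome editing", "gene therapy"])
    lda = extract_topics(sub, num_topics=2)
    for topic_id in range(lda.num_topics):
        print(topic_id, lda.print_topic(topic_id, topn=5))
    # A scholar would now read the topic descriptions and top documents, keep a
    # subset of topics, restrict the subcorpus accordingly, and re-extract;
    # per-year topic shares would support the "topics over time" step.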
More articulated workflows can be integrated into the platform, e.g., the one proposed in [24] to study the case of the energy transition in Italian newspapers; that workflow involves additional techniques, such as named entity recognition to identify prominent actors and graph-based representations obtained from the articles' content to study the relationships between actors. Section 3.2 will present another possible workflow for PCST scholars and students.

Dealing with large amounts of data in terms of information access and automatic extraction of valuable and usable representations is not the only reason to introduce experts in PCST to computational approaches. Another reason is the increasing attention some of these Computer Science disciplines are receiving in the public sphere today. AI is becoming a prominent topic in the media and for political institutions, thanks to the progress achieved through new (computer) architectures and models and their widespread potential applications. Emerging technologies, such as AI-based chatbots, are becoming controversial for their potential future impact on society, and institutional organizations have proposed specific regulations.8

8 https://www.europarl.europa.eu/news/en/headlines/society/20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence

Sociologists and communication experts may be directly affected by these emerging technologies, and their role may be critical in discussing or communicating the implications for society. For example, when considering Generative AI, one concern is whether automatically generated content will become dominant. How might that affect society? In addition to the impact on political orientations or public perception, could other aspects be affected? These types of questions are relevant to PCST research activities that focus on the impact of digital technology on society and social interactions. These questions can also be the subject of teaching activities for future communication and sociology professionals, as we will discuss in the next section.

3.2. Teaching

The last remark of Section 3.1 allows us to introduce another aspect, i.e., the introduction of students of the social and communication sciences to topics in IAR, NLP, and ML. This need has been the rationale for interdisciplinary courses, such as those in Digital Humanities and Computational Social Science, which have been offered for several years. As discussed in [11] in the context of research on historical newspapers, "historians need to acquire new skills, especially in the practice of (digital) hermeneutics, which refers to the interpretation and understanding of large, digitized or digitally born data sources." The same is true for students (and scholars) in PCST. To promote an appropriate level of understanding of IAR, NLP, and ML topics, and to foster debate among the "future" communication professionals, it is imperative to introduce students to these concepts and to some of their current and possible future implications. Following this direction, in this section we present a teaching activity we carried out on text classification to identify articles on emerging technologies; moreover, we discuss other activities we plan to carry out to complement the first one.

As an example of teaching activities that might benefit students in Sociology and Communication, we report on a recent experience in the Digital Sociology course of the Master's Degree in Communication Strategies at the University of Padova.
The objective of the course is to introduce students to epistemological and methodological issues concerning Digital Social Research: the digitalization of traditional methods (e.g., web surveys), data-driven social research, and social research on and through Social Media and digitized newspapers. As part of the course, students were shown how to represent unstructured data, such as newspaper articles, and ML techniques to analyze them — more specifically, supervised text classification applied to a specific object of study: emerging technologies. The task was framed as a binary classification problem where the goal was to determine whether a document was about emerging technologies.

In the first lecture on this activity, the students were introduced to the demarcation problem, i.e., the problem of determining the object of interest for the study; the goal was to identify a set of criteria to be used to classify a document concerning its relevance to emerging technologies. Even if the lecturers had determined some criteria before the beginning of the activity, those criteria were not presented to the students, who were asked to come up with their own criteria and agree on those to be adopted. This activity resulted in the following criteria: (i) a technology framed as "new" is mentioned; (ii) some impacts, i.e., changes, on society are described; (iii) a technology related to scientific research is mentioned.

After determining the criteria, we built a dataset using articles from six English-language newspapers9 published from January 1, 2016, to November 7, 2023. The articles were retrieved by merging the results of the following queries:
• "emerging technology" OR "emerging technologies"
• chatgpt
• "fusion energy"
• "genome editing"
• "neuralink"
• "self driving car" OR "autonomous car" OR "driverless car" OR "robotic car" OR "google car" (and the corresponding versions with "cars")
An expression between double quotes is interpreted as a phrase query (the constituent words must occur near each other, as in the string). We considered as candidate non-relevant documents those not returned as results for those queries. As an additional filter, articles had to be relevant to Science and Technology issues according to a previously developed classifier — see [21] for details on the manually labeled dataset and [22] for the approach used. Then, we extracted a random sample of 658 documents that covered the diverse queries and the (possibly) non-relevant documents; the sample was explicitly created maintaining a balance between the two classes.

9 Mirror, The Guardian, The Telegraph, The New York Times, The Times of London, and The Financial Times

The sample was then delivered to the students, each of whom manually labeled 14 documents using the criteria. A total of 46 students were involved in the activity. Along with labeling relevant and non-relevant documents, students were asked to perform a quality check to identify duplicates or documents with incomplete text due to our extraction procedure. After labeling the assigned 14 documents, each student was asked to label the documents assigned to another student. After the labeling was performed, the students were divided into groups. They were asked to discuss the given labels and to agree on a label for each document, paying particular attention to problematic cases. The results were then provided to the lecturer.
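To make the phrase-and-OR semantics of the queries above concrete, the toy matcher below evaluates OR-ed phrase queries with regular expressions; it is a simplified stand-in for the platform's index-based search and treats "near each other" as strict adjacency.

    import re

    def matches(text, phrases):
        # True if any of the OR-ed phrases occurs in the text; a phrase
        # matches when its words appear consecutively, a simplification of
        # the proximity semantics described above.
        text = text.lower()
        for phrase in phrases:
            pattern = r"\b" + r"\W+".join(map(re.escape, phrase.lower().split())) + r"\b"
            if re.search(pattern, text):
                return True
        return False

    query = ["self driving car", "autonomous car", "driverless car",
             "robotic car", "google car"]
    doc = "Regulators still debate how a driverless car should be insured."
    print(matches(doc, query))  # True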
Some documents were removed as a result of the quality check; the resulting dataset consists of 642 documents.

In the following lecture, the students were introduced to fundamental notions of Computer Science, such as the notion of "algorithm", and of ML, focusing on supervised text classification. The labeling experience was instrumental in this subsequent lecture. Part of the lecture was devoted to presenting the effectiveness of some classifiers trained on the labeled dataset produced by the students. Even if the resulting dataset was small, we trained several classifiers using 5-fold cross-validation as a proof of concept. We used the JSAT Library [25], the one currently adopted "in production" in TIPS. The tested classifiers were Multinomial Naive Bayes (MNB), Logistic Regression with Coordinate Descent Methods (LRDCD) [26], and Stacking [27] of these two classifiers; the focus on these approaches was motivated by the previous encouraging results observed for classifying Science and Technology articles [22].

Table 1
Effectiveness of text classifiers for identifying articles on emerging technologies

Classifier   AUC     F1      Precision   Recall
MNB          0.837   0.771   0.832       0.721
LRDCD        0.809   0.766   0.770       0.764
Stacking     0.829   0.770   0.808       0.738

The subsequent lecture was devoted to the current and possible future implications for society of ML approaches and technologies, e.g., approaches and technologies using behavioral data and Large Language Models. The participation of the students and the constructive interactions were perceived as indicators of a positive and valuable experience, and the main methodological aspects related to IAR, NLP, and ML were acquired by the students. The students were then given a new sample to label and, later, the predictions of the most effective classifier (the one using Stacking) in order to check the predictions' correctness. The labeling procedure followed the same approach adopted in the previous phase: labeling 14 documents and then discussing the labels within the group. In this second phase, we obtained fewer labeled documents (383). For completeness, Table 1 reports the results of the three classifiers on the full dataset; the article URLs and labels of the adopted dataset are available in [28]. The obtained results suggest that additional work should be done, e.g., increasing the size of the labeled set, before using the classifier for research purposes.
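A minimal sketch of this training and evaluation setup is shown below, using scikit-learn for illustration, whereas the course relied on the Java JSAT library; the data loader is hypothetical, and liblinear's (dual) coordinate descent solver stands in for LRDCD.

    from sklearn.ensemble import StackingClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_validate
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Hypothetical loader: article texts and binary (0/1) relevance labels.
    texts, labels = load_labeled_articles()

    # liblinear implements a (dual) coordinate descent solver, a rough
    # analogue of the LRDCD classifier used in the course.
    classifiers = {
        "MNB": MultinomialNB(),
        "LRDCD": LogisticRegression(solver="liblinear"),
        "Stacking": StackingClassifier(
            estimators=[("mnb", MultinomialNB()),
                        ("lr", LogisticRegression(solver="liblinear"))],
            final_estimator=LogisticRegression(solver="liblinear"),
        ),
    }

    # 5-fold cross-validation with the metrics reported in Table 1.
    for name, clf in classifiers.items():
        pipeline = make_pipeline(TfidfVectorizer(), clf)
        scores = cross_validate(pipeline, texts, labels, cv=5,
                                scoring=["roc_auc", "f1", "precision", "recall"])
        print(name, {m: round(scores["test_" + m].mean(), 3)
                     for m in ["roc_auc", "f1", "precision", "recall"]})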
Text classification is one of many topics that can benefit teaching activities. The availability of classified and indexed longitudinal corpora allows us to present methodological aspects, such as the importance of working with the entire population to monitor some phenomena — see the example of the erroneous interpretation related to absolute frequency reported in Section 2. Another activity might be to show some representations obtained from the classified data. For example, we might ask whether the perception of risk when discussing emerging technologies has changed over time. Can NLP and ML help? We might extract temporal word embedding representations and compare the distance between the embedding representations of the terms "risk" and "emerging technology" over time.

As a proof of concept, and as a basis for a possible activity to carry out in the next edition of the course, we considered all the documents answering the queries reported above and merged the obtained results; the resulting subcorpus consists of 80,494 documents. We preprocessed them, replacing all the query terms constituted by more than one token, e.g., "emerging technology", with a single string denoting the entire expression; we then transformed the text using lemmas instead of the original words — we used Stanza [29] for the extraction of the lemmas. Then, we trained a model using 100 dimensions to represent each word, 5 static iterations, and 5 dynamic iterations, as suggested in [30]. We then computed the similarity (cosine) between "risk" and "emerging technology" over time; the trend is reported in Figure 1, specifically by the line connecting the points depicted as circles.

[Figure 1: Similarity between the terms "risk" and "emerging technology" over time (2016-2023); two series: ""emerging technology" only" and "emerging technologies collapsed", the latter referring to the case where all the terms reported in the queries were replaced with "emerging technology" before training.]

The other line (points depicted as squares) refers to the similarity between the term "risk" and the term "emerging technology" when, in the preprocessing step, all the terms used in the queries – "chatgpt", "fusion energy", "genome editing", "neuralink", and the different variants of "autonomous car" – were replaced by "emerging technology"; the basic idea underlying this second approach was to obtain a measure of the relationship between risk and emerging technologies when considering all the technologies of the case study. Both cases show a peak in 2023. One can then look at the words closest to "emerging technology" to interpret the results; the top 20 per year, when not "collapsing" the diverse queries, are reported in Table 2. In the case of "emerging technologies collapsed", we observed words such as "AI", "generative", "vehicle", "robot", "automation", and "autopilot"; those words, along with "cybersecurity", might suggest a possible interpretation of the reasons for a peak in the closeness to risk. A more fine-grained analysis based on the actual documents from that year must then be adopted to confirm the result, and that requires advanced search functionalities to retrieve the documents relevant to the task. This is an example of the workflows mentioned in Section 2 that we aim to support.

Table 2
Top 20 words closest to "emerging technology", per year (most similar first)

2016: innovation, advance, discipline, advancement, opportunity, entrepreneurial, evolution, meaningful, insightful, robotics, organizational, frontier, avenue, cosmos, disruptive, potential, era, emergence, strategy, immense
2017: robotics, AI, innovation, Mana, groundbreaking, Hide, consortia, intelligence, levitation, artificial, nanotechnology, DeepMind, bioengineering, computing, startup, cloud, usher, field, talent, nationality
2018: innovation, AI, automation, expertise, tool, disruptive, frontier, blockchain, dynamic, creative, solution, robotics, talent, skill, hardware, idea, workload, complexity, revolution, industry
2019: AI, blockchain, robotics, tool, automation, domain, computing, intelligence, disruptive, innovation, algorithm, ML, artificial, innovate, futuristic, Machine, cyberspace, cloud, societal, advancement
2020: AI, automation, disruptive, innovation, tool, blockchain, robotics, nanotechnology, intelligence, transformative, computing, storytelling, digital, innovative, futuristic, quantum, artificial, skill, Automation, IoT
2021: innovation, AI, automation, niche, expertise, digitisation, robotics, entrepreneurship, skill, domain, ecosystem, enhancement, disruptive, enabler, skilling, capability, computing, knowledge, blockchain, advancement
2022: innovation, digitization, domain, robotics, entrepreneurship, tool, advancement, solution, sustainability, cybersecurity, skilling, cyberspace, skill, computing, ICT, AI, innovative, creation, learning, blockchain
2023: innovation, diversification, domain, enabler, infrastructure, entrepreneurship, industrial, cybersecurity, learnings, multilateralism, solution, organisational, reform, advancement, mobilisation, skill, upskilling, complementary, automation, biofuel

Another example, relying on TM algorithms, might help to introduce the discussion on controversial issues rooted in or related to research on IAR, ML, or AI in general. These indications by PCST scholars might lead to novel research problems to address and result in novel algorithms and paradigms. For example, we considered two newspapers: The New York Times and The Guardian. We used the articles available in TIPS from 1999 to 2022, which amount to 4,556,415 documents.
Then we extracted all the articles answering the query ("search engine" OR "information retrieval" OR "machine learning" OR "artificial intelligence") in the time interval 1999-2022; 1999 was selected as the starting year because the number of articles in The Guardian seems to be small before that date and we wanted to align the two corpora. We then extracted 30 topics using LDA10 with 500 iterations and the stop-word list provided by the library. Table 3 reports a subset of the topics. For instance, the top documents from topic 8 suggest several concerns related to social media platforms and content, such as misinformation. Topic 14 is focused on preoccupations associated with climate change and the environment; still, its top documents also include how these issues can benefit from AI research results and advancements in the field. Topic 25 concerns recent advances in large language models, and the discussion includes the impact they might have on society. Topic 29 includes a discussion of what AI can bring to Art, but also concerns about the problem of copyright infringement due to some IR and AI technologies; these concerns, for instance, have resulted in publishers asking governments to protect their work, which is "ingested" by AI-based technologies. Those mentioned in the last paragraphs are only a few examples of the relationship between science, technology, and society.

10 https://mimno.infosci.cornell.edu/jsLDA/

Table 3
Topics extracted from NYT and The Guardian on IAR, AI, and ML.

ID  Top words
1   apple mobile android gt> app phone iphone apps google phones
2   students university science education school professor research universities computer online
3   human intelligence computer artificial machine machines humans computers language learning
4   game games video virtual players play player real reality film
5   internet software your users use information web data system computer
6   company said companies percent market business billion investors chief money
7   your home like technology devices voice into smart amazon assistant
8   facebook social users media twitter news content tech youtube company
9   data how used such says could research information about use
11  said online advertising internet ads service companies business web site
12  health patients medical cancer care nhs said doctors patient disease
13  robots robot space human robotics weapons artificial intelligence military robotic
14  energy climate water species said food carbon could change global
15  said data privacy about information public government law had use
16  brain science human his life scientists scientific mind book consciousness
17  google google's microsoft said companies company amazon search european tech
18  said technology software like research computer company a.i companies data
20  jobs workers work job economy automation economic report skills employees
21  cars car vehicles self-driving autonomous uber travel tesla vehicle driving
22  search web site sites information engine online pages your find
23  said covid vaccine health coronavirus were pandemic virus cases had
24  google search engine yahoo google's microsoft users company results internet
25  chatgpt said technology about artificial intelligence google musk use privacy
26  says digital technology business media social guardian such director marketing
27  china chinese government said united states american companies technology china's
28  even about future technology power way might just too much
29  music books art book library digital copyright work artists into

4. Final remarks

This paper discussed how PCST can benefit from working on Digital Data. We explicitly discussed some issues that need to be addressed, relying on our experience in an ongoing interdisciplinary research project called TIPS and on previous work on Historical Newspaper Archives. The solutions to some of these issues are already integrated into the TIPS platform. Section 3 discussed how our effort is helpful for research and teaching activities.
Besides the specific activity considered in the paper, we should also point out that the platform has been actively used for several years by Bachelor's, Master's, and Ph.D. students for their theses, which generally focus on specific issues concerning technologies and their impact on society. A large body of work still needs to be done to provide better support, e.g., implementing other workflows, like the one described in Section 3.2 or that proposed in [24], directly into the platform. Moreover, some of the platforms discussed in the paper are useful for research activities; are they also effective for educational activities? How can we improve them to better support teaching and to allow practice on some methodologies/research issues? Should we provide users with novel search or analysis primitives for better support? The active participation of experts is a unique opportunity, especially for the evaluation of the effectiveness of IAR, NLP, and ML. For this reason, we plan to extend the platform to gather more feedback from users, e.g., allowing them to specify additional annotations during the labeling procedure, such as notes on why a document was perceived as relevant.

Acknowledgments
The authors would like to thank all the students of the Digital Sociology course in the Master of Communication Strategies at the University of Padova.
Without their commitment and participation, the teaching activity reported in this paper would not have been possible.

References
[1] N. Marres, Digital Sociology: The Reinvention of Social Research, Polity Press, Cambridge/Malden, 2017.
[2] G. A. Veltri, Digital Social Research, Polity, Cambridge (UK)/Medford (MA, USA), 2020.
[3] P. DiMaggio, M. Nag, D. Blei, Exploiting affinities between topic modeling and the sociological perspective on culture: Application to newspaper coverage of U.S. government arts funding, Poetics 41 (2013) 570–606.
[4] D. Odijk, B. Burscher, R. Vliegenthart, M. de Rijke, Automatic thematic content analysis: Finding frames in news, in: A. Jatowt, E.-P. Lim, Y. Ding, A. Miura, T. Tezuka, G. Dias, K. Tanaka, A. Flanagin, B. T. Dai (Eds.), Social Informatics, Springer International Publishing, Cham, 2013, pp. 333–345.
[5] M. Bucchi, B. Trench, Science communication research: themes and challenges, Routledge, New York, 2014, pp. 1–14.
[6] P. Catapano, P. Fayard, B. V. Lewenstein, The Public Communication of Science and Technology and International Networking, Springer Netherlands, Dordrecht, 2003, pp. 31–42. doi:10.1007/978-94-017-0801-2_3.
[7] P. Broks, T. Gascoigne, J. Leach, B. V. Lewenstein, L. Massarani, M. Riedlinger, B. Schiele, Communicating science: a global perspective, ANU Press, 2020.
[8] M. Bauer, P. Pansegrau, R. Shukla, The Cultural Authority of Science: Comparing across Europe, Asia, Africa and the Americas, Routledge Studies in Science, Technology and Society, Taylor & Francis, 2019.
[9] P. Magaudda, F. Neresini, Gli studi sociali sulla scienza e la tecnologia, Manuali. Sociologia, Il Mulino, 2020.
[10] E. Pfanzelter, S. Oberbichler, J. Marjanen, P.-C. Langlais, S. Hechl, Digital interfaces of historical newspapers: opportunities, restrictions and recommendations, Journal of Data Mining & Digital Humanities HistoInformatics (2021). URL: https://jdmdh.episciences.org/6121. doi:10.46298/jdmdh.6121.
[11] S. Oberbichler, E. Boroş, A. Doucet, J. Marjanen, E. Pfanzelter, J. Rautiainen, H. Toivonen, M. Tolonen, Integrated interdisciplinary workflows for research on historical newspapers: Perspectives from humanities scholars, computer scientists, and librarians, Journal of the Association for Information Science and Technology 73 (2022) 225–239. doi:10.1002/asi.24565.
[12] A. Doucet, M. Gasteiner, M. Granroth-Wilding, M. Kaiser, M. Kaukonen, R. Labahn, J.-P. Moreux, G. Muehlberger, E. Pfanzelter, M.-E. Therenty, et al., NewsEye: A digital investigator for historical newspapers, in: 15th Annual International Conference of the Alliance of Digital Humanities Organizations, DH 2020, 2020.
[13] Impresso Project, 2023. URL: https://impresso-project.ch/.
[14] F. Neresini, Old media and new opportunities for a computational social science on PCST, Journal of Communication 16 (2017).
[15] P. Breucker, J.-P. Cointet, A. Hannud Abdo, G. Orsal, C. de Quatrebarbes, T.-K. Duong, C. Martinez, J. P. Ospina Delgado, L. D. Medina Zuluaga, D. F. Gómez Peña, T. A. Sánchez Castaño, J. Marques da Costa, H. Laglil, L. Villard, M. Barbier, CorText Manager, 2016. URL: https://docs.cortext.net.
[16] I. Flaounas, O. Ali, M. Turchi, T. Snowsill, F. Nicart, T. De Bie, N. Cristianini, NOAM: News outlets analysis and monitoring system, in: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 1275–1278. doi:10.1145/1989323.1989474.
[17] R. Steinberger, B. Pouliquen, E. van der Goot, An introduction to the Europe Media Monitor family of applications, in: Proceedings of the SIGIR 2009 Workshop on Information Access in a Multilingual World, volume 43, 2009. arXiv:1309.5290.
[18] L. Yang, Y. Wang, D. Dunne, M. Sobolev, M. Naaman, D. Estrin, More than just words, ACM, 2019, pp. 276–284. URL: https://dl.acm.org/doi/10.1145/3289600.3290993. doi:10.1145/3289600.3290993.
[19] J. Qiang, Z. Qian, Y. Li, Y. Yuan, X. Wu, Short Text Topic Modeling Techniques, Applications, and Performance: A Survey, IEEE Transactions on Knowledge and Data Engineering 34 (2022) 1427–1445. doi:10.1109/TKDE.2020.2992485. arXiv:1904.07695.
[20] J. Qiang, P. Chen, W. Ding, T. Wang, F. Xie, X. Wu, Heterogeneous-Length Text Topic Modeling for Reader-Aware Multi-Document Summarization, ACM Transactions on Knowledge Discovery from Data 13 (2019) 1–21. doi:10.1145/3333030.
[21] A. Cammozzo, E. Di Buccio, F. Neresini, Monitoring technoscientific issues in the news, in: ECML PKDD 2020 Workshops - Workshops of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2020): SoGood 2020, PDFL 2020, MLCS 2020, NFMCP 2020, DINA 2020, EDML 2020, XKDD 2020 and INRA 2020, Ghent, Belgium, September 14-18, 2020, Proceedings, volume 1323 of Communications in Computer and Information Science, Springer, 2020, pp. 536–553. doi:10.1007/978-3-030-65965-3_37.
[22] E. Di Buccio, A. Cammozzo, F. Neresini, A. Zanatta, TIPS: search and analytics for social science research, in: L. Tamine, E. Amigó, J. Mothe (Eds.), Proceedings of the 2nd Joint Conference of the Information Retrieval Communities in Europe (CIRCLE 2022), Samatan, Gers, France, July 4-7, 2022, volume 3178 of CEUR Workshop Proceedings, CEUR-WS.org, 2022. URL: https://ceur-ws.org/Vol-3178/CIRCLE_2022_paper_33.pdf.
[23] E. Di Buccio, A. Lorenzet, M. Melucci, F. Neresini, Unveiling latent states behind social indicators, in: R. Gavaldà, I. Zliobaite, J. Gama (Eds.), Proceedings of the First Workshop on Data Science for Social Good co-located with European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases, SoGood@ECML-PKDD 2016, Riva del Garda, Italy, September 19, 2016, volume 1831 of CEUR Workshop Proceedings, CEUR-WS.org, 2016. URL: https://ceur-ws.org/Vol-1831/paper_6.pdf.
[24] F. Neresini, P. Giardullo, E. Di Buccio, A. Cammozzo, Exploring socio-technical future scenarios in the media: the energy transition case in Italian daily newspapers, Quality and Quantity 54 (2020) 147–168. doi:10.1007/s11135-019-00947-w.
[25] E. Raff, JSAT: Java Statistical Analysis Tool, a library for machine learning, Journal of Machine Learning Research 18 (2017) 1–5. URL: http://jmlr.org/papers/v18/16-131.html.
[26] H.-F. Yu, F.-L. Huang, C.-J. Lin, Dual coordinate descent methods for logistic regression and maximum entropy models, Machine Learning 85 (2011) 41–75. URL: https://doi.org/10.1007/s10994-010-5221-8. doi:10.1007/s10994-010-5221-8.
[27] D. H. Wolpert, Stacked generalization, Neural Networks 5 (1992) 241–259. doi:10.1016/S0893-6080(05)80023-1.
[28] E. Di Buccio, F. Neresini, Data from: Research and Teaching Public Communication of Science and Technology on Digital Data, 2024. doi:10.5281/zenodo.10616684.
[29] P. Qi, Y. Zhang, Y. Zhang, J. Bolton, C. D. Manning, Stanza: A Python natural language processing toolkit for many human languages, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2020. URL: https://nlp.stanford.edu/pubs/qi2020stanza.pdf.
[30] V. Di Carlo, F. Bianchi, M. Palmonari, Training temporal word embeddings with a compass, in: 33rd AAAI Conference on Artificial Intelligence, AAAI 2019, 31st Innovative Applications of Artificial Intelligence Conference, IAAI 2019, and the 9th AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, 2019, pp. 6326–6334. doi:10.1609/aaai.v33i01.33016326. arXiv:1906.02376.