=Paper=
{{Paper
|id=Vol-3178/CIRCLE_2022_paper_33
|storemode=property
|title=TIPS: Search and Analytics for Social Science Research
|pdfUrl=https://ceur-ws.org/Vol-3178/CIRCLE_2022_paper_33.pdf
|volume=Vol-3178
|authors=Emanuele Di Buccio,Alberto Cammozzo,Federico Neresini,Alberto Zanatta
|dblpUrl=https://dblp.org/rec/conf/circle/BuccioCNZ22
}}
==TIPS: Search and Analytics for Social Science Research==
TIPS: Search and Analytics for Social Science Research

Emanuele Di Buccio1,2,*, Alberto Cammozzo3, Federico Neresini3 and Alberto Zanatta3

1 Department of Information Engineering, University of Padova, Via G. Gradenigo 6/b, 35131 Padova, Italy
2 Department of Statistical Sciences, University of Padova, Via C. Battisti 241, 35121 Padova, Italy
3 Department of Philosophy, Sociology, Education and Applied Psychology, University of Padova, Via M. Cesarotti 10/12, 35123 Padova, Italy

Abstract: The vast amount of digital data available online, which includes digitized traditional media, offers new opportunities for social science researchers. This is, for instance, the case of social processes where the temporal dimension is crucial and longitudinal data are necessary. This paper presents a system called TIPS, designed to support social science researchers in their investigations. We present the main modules of the system and how it can support diverse research tasks on longitudinal data such as archives of digitized newspapers.

Keywords: News Search and Analytics, Information Retrieval, Expert Users

1. Introduction

The vast amount of digital data available nowadays provides new opportunities for many disciplines, e.g., the Humanities and the Social Sciences. Social Science is "any branch of academic study or science that deals with human behavior in its social and cultural aspects" (https://www.britannica.com/topic/social-science). The Social Sciences include several disciplines, e.g., sociology, psychology, or political science. A first opportunity for disciplines such as sociology is the digitization of traditional media. This is, for instance, the case of newspapers: the analysis of newspapers is traditionally used, along with other methods such as surveys, to carry out research on the public perception of issues. Another opportunity is related to the investigation of social processes where the temporal dimension is crucial and longitudinal data are necessary.

The main contribution of this paper is the description of a system called Technoscientific Issues in the Public Sphere (TIPS). The TIPS system is the result of an interdisciplinary research project which involves researchers in diverse areas, including computer science and engineering, sociology, statistics, psychology, and linguistics. The objective of the project is the analysis of the presence of technoscience (i.e., "science and technologies") in the mass media, which constitutes a relevant part of the public sphere. Supporting the analysis through automated techniques is crucial because it allows expert users – researchers, journalists, policy makers – to monitor public opinion relying on very large amounts of data.

CIRCLE 2022: Joint Conference of the Information Retrieval Communities in Europe, July 4-7, 2022, Samatan, Gers, France. * Corresponding author.
emanuele.dibuccio@unipd.it (E. Di Buccio); alberto.cammozzo@unipd.it (A. Cammozzo); federico.neresini@unipd.it (F. Neresini); albertozanatta95@hotmail.it (A. Zanatta)
ORCID: 0000-0002-6506-617X (E. Di Buccio); 0000-0003-1551-0022 (A. Cammozzo); 0000-0003-3918-2588 (F. Neresini)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). Published in CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
Among the diverse challenges, one is handling heterogeneous data, i.e., diverse source types (blogs, research papers, newspapers, tweet streams). Even though the TIPS system was designed to achieve this goal, in this paper we focus on how to support expert users in implementing the methodologies of their research investigations. Social Science researchers are expert users that often need many interactions with the information space, at different levels. Multiple searches involving both full-text and metadata may be needed to determine the corpus or the corpora that will be adopted for their research. Indicators based on social science theories might help validate a research hypothesis, and a system able to offer ways to build a methodology through "high level" interaction may be helpful to researchers with limited knowledge of computer science topics and may speed up the implementation of the methodology.

A first version of the system was documented in [1]. This paper extends that work by focusing on how the system can support expert users such as sociologists, on the extended architecture, and on how the methodology proposed and exploited in previous works [2] has been integrated into the system to provide information access beyond the traditional query-response paradigm. The system is available at: https://www.tipsproject.eu/tips

2. Background and System Requirements

TIPS was explicitly designed in collaboration with the members of the Pa.S.T.I.S. Research Unit of the University of Padova in order to support their research methodologies. One of the unit's research topics is Science and Technology Studies (STS), particularly understanding the role of Technoscience in the Media and how Technoscience is represented. Indeed, the representation of technoscientific issues in the Media may greatly differ from that of the specialists; unveiling and monitoring the evolution of that representation could be useful, e.g., to identify critical aspects or to communicate innovations and issues to the public more effectively.

Research at the intersection of Social Science and Computer Science is not new [3]; Computational Social Science [4], Social Informatics, and Digital Social Research [5] are examples of these interdisciplinary research lines. Works relevant to that reported in this paper are systems explicitly designed to support Social Science researchers, e.g., NOAM [6, 7], and to monitor the media, such as the European Media Monitor [8]. A number of libraries and tools have been made available and are used by social scientists to carry out their investigations; [5] provides an overview of both methodologies and tools. Although TIPS shares objectives and functionalities with many of these systems, we explicitly designed it to support a complete workflow, from data collection and feature extraction to full-text, metadata, and theme-based access; moreover, TIPS is equipped with a number of indicators explicitly designed in collaboration with a team of Social Science researchers.

In order to design the system, we relied on the experience gathered during our prior involvement in interdisciplinary research activities, e.g., [2]; we carried out a number of seminars with Social Science researchers; and we exploited feedback from preliminary versions of the system – at the time essentially a full-text and metadata-based search engine – to understand the typical workflow of a user when carrying out activities to investigate a research hypothesis.
The main activities performed by the researchers are listed below:

1. defining the specific thematic domain and the corpus of interest for the investigation of the research hypothesis;
2. accessing the corpus of interest both by full-text and by metadata-based search in order to explore the thematic space;
3. extracting the main (sub)themes in the corpus of interest, accessing and searching the corpus by theme representations, and analyzing their presence, usually over time;
4. computing, visualizing, and analyzing trends of indicators which are meant to provide a measure of the degree to which certain properties characterize a set of documents;
5. extracting word, sentence, document, and document-set properties to perform fine-grained analysis and uncover possible relationships among these properties, in order to verify a research hypothesis or to formulate a new one;
6. keeping track of all the choices made in the diverse steps of the research methodology in order to foster reproducibility.

The TIPS system aims at automating some of the above activities or at supporting users in performing them through diverse forms of user-system interaction.

3. The TIPS System

3.1. Document Collectors and Repositories

The first versions of the system were focused on the collection, enrichment, and content-based search of documents from diverse types of sources, e.g., blogs, online newspapers, or tweet streams. Many of the research activities in the TIPS project are carried out on newspapers, since they still constitute a large portion of the Media Sphere and allow longitudinal research studies which can span many decades, e.g., from the 1980s, thanks to the digitization of articles. In TIPS, a document collector is responsible for gathering documents from multiple sources, e.g., RSS feeds of online newspapers, where sources are homogeneous in terms of type and language; for instance, there is a collector for Italian Newspapers, one for Italian Blogs, and one for English Newspapers. A database is used for each collector; currently we are relying on MongoDB (https://www.mongodb.com/). Each document goes through a number of steps:

(a) Harvesting: articles are collected through RSS feeds; when feeds are not available or not working, HTML website traversal is used, focusing only on the relevant sections.

(b) Scraping: the relevant parts of the page are extracted mainly through the library Newspaper3k (https://newspaper.readthedocs.io/en/latest/); some additional handcrafted rules, resulting from the analysis of samples, are used to remove additional strings, e.g., those peculiar to a specific newspaper.

(c) De-duplication: the same article can be published in different news feeds, or updated versions can be published at different times. In the former case, we store only one article but keep track of all the feeds where it was published, especially if it was published on the homepage. In the latter case, we keep only the newest version of the article, but we keep track of all publication timestamps. The identification of duplicates relies on MD5 hashing of content and metadata, more specifically the RSS feed, the URL, and the publication date. If an article with the same content and metadata hash is already present in the collector database, the article is dropped as a duplicate. If an article is present with the same content hash but a different metadata hash, the original article metadata is enriched.
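As an illustration of the de-duplication step, the minimal sketch below shows how the two MD5 hashes described above could drive the decision; the field names and the use of a pymongo-style collection indexed on both hashes are our assumptions, not the actual TIPS implementation.

```python
import hashlib

def md5_hex(text: str) -> str:
    """MD5 digest of a UTF-8 string, as a hexadecimal string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def dedup_decision(article: dict, collection) -> str:
    """Decide how to handle an incoming article (illustrative field names).

    `collection` is any store queryable by the two hashes, e.g. a pymongo
    collection with indexes on `content_hash` and `metadata_hash`.
    """
    content_hash = md5_hex(article["content"])
    metadata_hash = md5_hex(article["feed"] + article["url"] + article["published"])

    if collection.find_one({"content_hash": content_hash,
                            "metadata_hash": metadata_hash}):
        return "drop"      # exact duplicate: same content, same feed/URL/date
    if collection.find_one({"content_hash": content_hash}):
        return "enrich"    # same content seen before: add the new feed and
                           # publication timestamp to the stored article
    return "store"         # new article
```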
(d) Near-duplicate detection: previous research activities using early versions of the system revealed a significant number of articles that overlap in most of their content, i.e., near-duplicates. In order to detect them, the current version of the system relies on Locality Sensitive Hashing for MinHash signatures [9, 10], specifically the implementation made available in the datasketch library (http://ekzhu.com/datasketch/index.html). Each document is represented using k-shingles with k = 4, where shingles are computed on tokens, not on characters. We set the similarity threshold used to decide whether two articles should be considered near-duplicates to 0.4. The threshold was determined by considering a week at random for each year between 2010 and 2018, taking all the articles published in that week in the same newspaper, and manually inspecting them; the most effective threshold for detecting near-duplicates was selected. The procedure for near-duplicate detection is now integrated in TIPS, processes all the incoming articles, and relies on Redis (https://redis.io) to store and search MinHash signatures; near-duplicates are searched only among articles published in the same newspaper.
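A minimal sketch of this step with the datasketch library is given below; the 4-token shingling and the 0.4 threshold follow the description above, while the number of permutations and the in-memory (rather than Redis-backed) LSH index are simplifying assumptions.

```python
from datasketch import MinHash, MinHashLSH

def shingles(text: str, k: int = 4) -> set:
    """k-shingles over tokens (not characters), as in the TIPS setting."""
    tokens = text.split()
    return {" ".join(tokens[i:i + k]) for i in range(max(len(tokens) - k + 1, 1))}

def signature(text: str, num_perm: int = 128) -> MinHash:
    """MinHash signature of a document built from its token shingles."""
    m = MinHash(num_perm=num_perm)
    for sh in shingles(text):
        m.update(sh.encode("utf-8"))
    return m

# Index the articles of one newspaper, then query with an incoming article;
# TIPS persists the signatures in Redis instead of keeping them in memory.
lsh = MinHashLSH(threshold=0.4, num_perm=128)
lsh.insert("article-1", signature("full text of an already stored article ..."))
near_duplicates = lsh.query(signature("full text of an incoming article ..."))
```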
(e) Named Entity extraction: many research activities require a named-entity search to identify names of places, people, and organizations, and several software libraries are available for this task. On the one hand, it is desirable to have a very specific and accurate recognition of entities; on the other hand, the correct attribution of a name to an entity is often difficult and inaccurate, besides being resource consuming: "Galileo" may denote the famous scientist (a person), a company or a school with the same name, the name of a place (e.g., "Galileo Galilei square"), and so on, heavily depending on the context. Moreover, newspaper articles may often contain new, country-specific entities and misspelled names. We therefore opted for a layered, concentric approach to named entity recognition. The first layer is the recognition of "Named Entity Candidates" (NEC) at download time, with a simple and fast regular expression pinning capitalized words and uppercase acronyms. A second layer uses the spaCy library (https://spacy.io), which provides many language-specific pre-built models. We store both results, having found that combining the first two layers gives more accurate results than using a single approach. While these first two layers are run on the whole corpus, further analysis with more resource-intensive approaches is restricted to sub-corpora according to research needs, providing a third layer. So far we have used Stanford NER [11] and several context-specific APIs focused on specific archives of entities (e.g., scientists' names are investigated with the Elsevier Scopus and ORCID APIs). The layered approach allows researchers to spot entities at the corpus level and to conveniently select articles where context-specific entities are recognized when inspecting the articles.

(f) Extraction of other document properties and indicators: besides named entities, a number of document properties are extracted, e.g., document length, number of characters, and non-text prevalence. For the Italian document collectors, the distinct nouns and adjectives in the article and the article readability are obtained through The Italian NLP Tool (Tint) [12]. Moreover, for each document a number of indicators based on controlled vocabularies and on the frequency of vocabulary terms are computed. Each indicator is meant to provide a measure of the degree to which certain properties characterize a set of documents. For instance, the risk indicator [13] provides a measure of the extent to which a document or a set of documents may evoke risk, conflict, worry, or controversy in the reader. The procedure for computing the risk indicator is described in [14]. Other indicators have been proposed and are available in TIPS, e.g., molecularisation and individualization [2].

(g) Classification: researchers can be involved in multiple research projects. In TIPS, each project has at least one classifier to identify documents pertinent to its thematic domain. For instance, the project on "Food Risk Monitoring" has its own classifier; the project on Technoscience has multiple classifiers relying on diverse approaches. One is a Knowledge Engineering approach [1] used since the early versions of the system. In the new version of the system, the default classifier is based on Supervised Machine Learning techniques, more specifically on Stacking [15] of Regularized Logistic Regression trained with Dual Coordinate Descent [16] and Multinomial Naive Bayes; the implementation relies on the JSAT library [17]. The model was selected using both Holdout (random split) and 5-fold Cross Validation on a dataset manually labeled by the sociologists and described in [1]; we examined the effectiveness looking at both the value and the variance of AUC, F1, Precision, and Recall. In addition to the labeled set described in [1], we built a new labeled set used as an additional, separate test set.

Steps (a)-(f) have been implemented in a library called hactar, which has been published under the AGPL open source license (https://gitlab.com/mmzz/hactar). Part of step (f) and step (g) have been implemented in a Java module called tips-data. Besides classification, indicator computation, and metadata extraction and enrichment, the module is responsible for updating the document repositories with all the "indexable" articles. An article is indexable if it has a title, content, and date, and is not marked as a duplicate or near-duplicate. All the indexable documents are indexed via elasticsearch (https://www.elastic.co/elasticsearch/), a distributed search and analytics engine. A document repository in TIPS corresponds to an elasticsearch index. The index update is performed on a daily basis.

3.2. Repository Search and Exploration

Research activities in the TIPS project require different forms of interaction with document repositories; search is one of them. There are two ways to search a repository. The first is to use the REST Search APIs, which are built on top of the elasticsearch APIs. Indeed, we designed and implemented the system in a modular way and exposed all the functionalities via REST APIs through the "TIPS Web Server". The REST Search APIs allow users with programming skills to build scripts for interacting with the repositories and retrieving all the documents relevant to their information need. The second way is through a Web User Interface (UI) built on top of the REST APIs. We equipped the latest version of the system with a module to monitor the diverse interactions between the users and the system; logs are stored in a dedicated index and are anonymized. We analyzed about 3000 log entries gathered since March 2019 and observed that most of the queries were compound queries, where both full-text and metadata-based search are used. The metadata usually adopted in the query include a filter for a specific newspaper or a set of newspapers, or for a particular set of documents – e.g., only the documents predicted in the "technoscience" category according to the default classifier. The Search UI also allows the ranking criterion to be specified: possible criteria are by date (most recent first), by classifier score (highest score first), or by query score (BM25 [18] score computed using the query terms). The most adopted ranking criterion is the one based on the classifier score, followed by date and query score.
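To make the notion of compound query concrete, the sketch below shows how such a request might look when issued directly against elasticsearch with the official Python client (elasticsearch-py 8.x style); the index name and the field names (content, newspaper, predicted_category, classifier_score) are placeholders, since the actual TIPS mapping is not described in this paper.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Full-text match combined with metadata filters, sorted by classifier score.
resp = es.search(
    index="tips-it",  # placeholder index name
    query={
        "bool": {
            "must": [{"match": {"content": "nucleare"}}],
            "filter": [
                {"term": {"newspaper": "la-repubblica"}},
                {"term": {"predicted_category": "technoscience"}},
            ],
        }
    },
    sort=[{"classifier_score": {"order": "desc"}}],
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"].get("title"), hit["_source"].get("date"))
```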
Most of the interactions aim at determining the corpus later adopted for detailed analysis: queries can be very complex, relying on Boolean or other operators – e.g., prefix match – to retrieve as many documents topically relevant to an issue in the thematic domain as possible, for instance nuclear-related articles in the technoscience domain. A common task that emerges from the query logs is identifying useful terms to expand the original, more succinct query. For this reason, we equipped the system with functionalities to get the top terms, named entities, or nouns and adjectives in a document set specified by the user through search parameters; term ranking can be based on document frequency or on other measures, e.g., Chi Square.
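A sketch of Chi Square-based term ranking over a user-specified document set versus the rest of the corpus is given below; the exact statistic and any smoothing used in TIPS are not specified in the paper, so this is only one reasonable reading based on document frequencies.

```python
from collections import Counter

def chi_square_terms(subset_docs, other_docs, top_k=20):
    """Rank terms by chi-square association with the selected document set.

    Each document is a list of tokens; for every term a 2x2 contingency
    table of document frequencies (in the subset vs. in the rest) is built.
    """
    n1, n2 = len(subset_docs), len(other_docs)
    df_in = Counter(t for d in subset_docs for t in set(d))
    df_out = Counter(t for d in other_docs for t in set(d))
    n = n1 + n2
    scores = {}
    for term, a in df_in.items():      # a: subset documents containing the term
        b = df_out.get(term, 0)        # b: other documents containing the term
        c, d = n1 - a, n2 - b          # documents not containing the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        if denom:
            scores[term] = n * (a * d - b * c) ** 2 / denom
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Example: terms most associated with the result set of a "nuclear" query
# expansion_candidates = chi_square_terms(nuclear_docs, rest_of_corpus_docs)
```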
Once the query to determine the corpus has been identified, research activities usually require a "compact" representation of a large number of documents in order to get an idea of the thematic structure. A well-known class of algorithms to achieve this objective is Topic Modeling [19, 20]. For these analyses, in past works [2] we relied on the Mallet library (https://github.com/mimno/Mallet), in particular on LDA [21] with an asymmetric Dirichlet prior [22]. Also in this case we observed that multiple interactions were needed, for example in order to "interpret" the topics, to build different topic models on refined versions of the corpus, or to compare the thematic structure of an issue-specific and a general corpus. For this reason we designed and implemented novel system components to perform theme-based analysis, visualize the results, and interact with them, implementing some of the visualizations and forms of interaction proposed in previous works [23, 24]. The new components allow the user to submit a query and extract topics from all the documents in the repository satisfying the query; in other words, the query serves as a filter to specify the corpus. Along with the query, the user can specify some of the LDA settings available in Mallet: the number of topics, the number of top words, and the number of sampling iterations. Moreover, she can specify a threshold on the minimum probability a topic should have in a document to be considered "representative" for that document; note that the threshold is not used in the inference process, but only to access documents where the topic is present, after the inference procedure. The query is then submitted to the TIPS Web Server, which forwards the request to a dedicated Topic Modeling and Word Embedding (TM&WE) Web Server that relies on Mallet.

The TM&WE Web Server is responsible for:

• gathering all the documents satisfying the query from the document repository;
• building the topic model according to the specified settings;
• storing the query and the settings in a dedicated MongoDB database;
• storing the model and all the necessary outputs in a MongoDB database.

Figure 1: Overview of the TIPS architecture responsible for Topic Modeling and Word Embedding (TIPS Web UI, TIPS Web Server, TM&WE Web Server, Search Server, and DBMS).

The last two points are necessary to avoid building the same topic model multiple times and to allow a user to access the list of queries previously submitted to the system and their corresponding results; therefore, if the user has already performed a request with the same query and the same LDA settings, the results are retrieved from the dedicated MongoDB database, where they were stored after the first request. This is also crucial to support reproducibility, since the system automatically keeps track of the history of all the user requests and of the settings used for those requests. Figure 1 depicts the entire process and the relevant components of TIPS.

The output generated by the TM&WE component can be accessed through the Web UI and includes:

• the list of topics, whose labels are by default "Topic n", where n is the identifier provided by Mallet; the user can edit the label and save the updated version;
• the top words per topic;
• the top documents for each topic, i.e., those where the topic probability is above the specified threshold;
• a chart with the topic trends over time, where the time granularity (yearly, monthly, weekly) can be specified by the user and the importance of a topic in a given time interval is computed as in [25].

In addition, the request submitted to the TM&WE server includes a request for training Word Embeddings through the Mallet functionalities. The user can specify the settings, which include embedding size, window size, sample size, and number of iterations. The learned embeddings are then used to visualize the words "closest" to a user-specified keyword in the embedding space; if the query consists of a single term, that term is used by default as the keyword.
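TIPS trains the embeddings through Mallet; purely as an illustration of the "closest words" functionality, the sketch below substitutes gensim's Word2Vec (with illustrative settings), since the underlying idea is the same: train on the corpus selected by the query and look up the nearest neighbours of a keyword.

```python
from gensim.models import Word2Vec

# `corpus` is the query-selected document set, tokenized: a list of token lists.
corpus = [
    ["nuclear", "power", "plant", "energy"],
    ["nuclear", "research", "reactor", "physics"],
    # ... all documents matching the issue query
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,   # embedding size (illustrative value)
    window=5,          # context window size
    min_count=1,
    epochs=10,
)

# Words "closest" to the keyword in the embedding space,
# as visualized by the TIPS UI.
print(model.wv.most_similar("nuclear", topn=10))
```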
4. A Use Case on the Nuclear Issue

In this section we rely on a use case to illustrate a typical TIPS user workflow; a summary of the steps and the relevant UI components is depicted in Fig. 2. Let us consider a researcher who wants to perform a longitudinal study on how nuclear-related issues are discussed in a specific newspaper, e.g., La Repubblica. The reason for focusing on this newspaper might be that an online archive is available and spans several decades. In TIPS, each user is associated with at least one project, which determines the repositories, charts, classifiers, and predefined issues (standing queries) she can access. In this case, the researcher should have access to the TIPS-IT project, which includes the repository La Repubblica.

Figure 2: Overview of the typical steps performed by a TIPS user and the relevant components: (i) login, (ii) dashboard configuration, (iii) project selection, (iv) charts, (v) search, (vi) topic analysis.

After authentication, the user can access the Dashboard, which allows the configuration of all the "parameters" of her research activity. A default project is associated with each user and is adopted to populate the dashboard after the login. The user can change the default project via a dedicated page – (iii) in Fig. 2 – where all her projects are listed; this is necessary, for instance, if the research activity is focused on another newspaper archive, e.g., El País, The New York Times, or The Guardian, or if the activity consists in a comparative study among different countries as in [2]. In the dashboard, the user can specify the document repository – in our case La Repubblica – and the preferred classifier – e.g., the "Science and Technology" classifier – which will be used by default to generate some of the charts and for document ranking in the search requests when sorting is based on the classifier score. The user can also specify an issue, which consists in a query used to determine the corpus for the research activity. Besides a set of predefined issues, the user can define her own issue using a custom query; the history of all the user issues is stored in a MongoDB database to keep track of the research activities. In the considered use case, the user will specify a custom issue to determine nuclear-related articles, e.g., by the prefix query "nuclear*".

After the dashboard configuration, the user can access documents via full-text or metadata-based search using the basic or advanced search page (v). Multiple search interactions can be adopted, both to gain additional knowledge required for the research activity and to build the query for the custom issue. For instance, the researcher might be interested only in "nuclear power" and not in "nuclear weapons", and she might modify the issue query accordingly.

The "Charts" page (iv) reports the list of all the available charts. In the considered use case, the researcher might use the Charts per Issue, where different charts are dynamically computed on the corpus determined by the issue. Examples are the distribution of the articles about the issue among the newspaper sections, the trend of the risk indicator over time, or the readability index. For instance, the researcher can check whether the value of the readability index in articles about the issue is lower than in the whole corpus and comparable with the values observed in technoscientific articles. She can study the trend over time of the indicators, e.g., risk, and investigate whether the observed peaks are due to specific events like nuclear accidents, or whether the trend is in line with the results obtained with traditional surveys [13]. Finally, the distribution among sections can help to provide insight into the different "thematic areas" where the nuclear issue is discussed.

The researcher can then perform fine-grained analysis by retrieving and examining documents through the Search UI (v) or by extracting features from sentences, documents, or document sets; the latter is the case of the syntactic features used to study the Communication of Science and Technology in Online Newspapers using Multidimensional Analysis [26]. The user can also explore themes in the corpus using the Topic Modeling functionalities (vi). Using the query built in the previous interactions, the user submits a request to the TM&WE Web Server; parameters are set by default to 10 topics, 50 top words, 1000 iterations, and 0.25 for the probability threshold. Along with Topic Modeling, Word Embeddings will also be trained.
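An equivalent topic model could be reproduced outside TIPS with Mallet's standard command-line interface, as sketched below; the TM&WE server invokes Mallet programmatically, so the file names and the corpus export step are assumptions, while the settings mirror the defaults just mentioned (the --optimize-interval option enables the hyperparameter optimization that yields the asymmetric Dirichlet prior).

```python
import subprocess

# Export of the query-selected corpus to Mallet's one-document-per-line
# format ("doc_id<TAB>label<TAB>text") is assumed to have produced corpus.tsv.
subprocess.run([
    "mallet", "import-file",
    "--input", "corpus.tsv",
    "--output", "corpus.mallet",
    "--keep-sequence",
    "--remove-stopwords",
], check=True)

subprocess.run([
    "mallet", "train-topics",
    "--input", "corpus.mallet",
    "--num-topics", "10",              # default number of topics in the use case
    "--num-iterations", "1000",        # default number of sampling iterations
    "--num-top-words", "50",           # default number of top words per topic
    "--optimize-interval", "10",       # asymmetric Dirichlet prior via optimization
    "--output-topic-keys", "topic_keys.txt",
    "--output-doc-topics", "doc_topics.txt",
    "--doc-topics-threshold", "0.25",  # report only topics above the threshold
], check=True)
```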
Figure 3: The TIPS component to explore Topic Modeling results when considering documents published since 1984 in an Italian newspaper and pertinent to the query "nuclear*".

Figure 3 reports the page to explore the results. The user can select which topics to visualize; in Figure 3, topics 8 and 9 were selected, and their presence over time is visualized. The top words suggest that topic 9 is about nuclear research. Access to the top documents can help the interpretation of the topic. If only a subset of the topics is pertinent to the research activity, those topics can later be used to determine a subset of the corpus constituted only by documents where those topics are the most prominent; this is the approach used in [2].

5. Final Remarks and Future Directions

In this paper, we described the TIPS system and how it can support Social Science researchers such as sociologists in their research activities. We reported on a specific use case in order to present some of the system functionalities and how they are exploited.

Our research group is currently working on text-mining-based methodologies that will later be included in TIPS. We are currently pursuing two research directions. One concerns the use of heterogeneous sources, e.g., news, tweets, blog and forum posts. Previous works show how "standard" techniques, e.g., LDA, cannot be straightforwardly applied to short texts such as tweets [27] or to heterogeneous collections [28]. The other research direction concerns techniques to support longitudinal studies. We are planning to include existing Dynamic Word Embedding (DWE) approaches in the system. Moreover, we are investigating other time-aware representations of words or of groups of related words, thus complementing methods such as Time-aware Topic Modeling or DWE techniques.

Another line of work is focused on the "optimization" of the system. The current version exploits BM25 for ranking, using default values for the free parameters (b, k1). As mentioned in [18], optimization can be costly both in terms of human evaluation and computing power; however, specific collections, such as those available in TIPS, are worth the cost. Since our collections have multiple fields, we are planning to move from BM25 to BM25F and to optimize the parameters using techniques such as those described in [18, 29].

Another aspect is scalability. The indexing and classification module, tips-data, relies on (Java) multi-threading to index the diverse repositories in parallel. A possible issue might be the request load. The current number of users is limited and peaks when TIPS is used for teaching activities, where students are usually divided into groups and each group has its own TIPS account. However, we are planning to evaluate the robustness of the current system in terms of request load. We opted for MongoDB and elasticsearch because they are scalable and can be distributed over a number of machines. Possible bottlenecks might be the TIPS Web Server or the TM&WE Web Server; a possible strategy might be to run multiple instances of these servers and use a load balancer.

Acknowledgments

The authors would like to thank the Pa.S.T.I.S. Research Unit of the Department of Philosophy, Sociology, Education and Applied Psychology (FISPPA), University of Padova, for the fruitful discussions and the feedback in the design of the TIPS System. The contribution by A. Zanatta was provided when he was working at the Department FISPPA of the University of Padova.

References
[1] A. Cammozzo, E. Di Buccio, F. Neresini, Monitoring technoscientific issues in the news, in: ECML PKDD 2020 Workshops - Workshops of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2020), Ghent, Belgium, Sept. 14-18, 2020, Proceedings, volume 1323 of Communications in Computer and Information Science, Springer, 2020, pp. 536–553. doi:10.1007/978-3-030-65965-3_37.
[2] F. Neresini, S. Crabu, E. Di Buccio, Tracking biomedicalization in the media: Public discourses on health and medicine in the UK and Italy, 1984–2017, Social Science & Medicine 243 (2019) 112621. doi:10.1016/j.socscimed.2019.112621.
[3] G. Sadowsky, Future developments in social science computing, in: Proceedings of the Spring Joint Computer Conference, 1972, pp. 875–883.
[4] R. M. Alvarez (Ed.), Computational Social Science: Discovery and Prediction, Analytical Methods for Social Research, Cambridge University Press, New York, 2016.
[5] G. A. Veltri, Digital Social Research, John Wiley & Sons, 2019.
[6] I. Flaounas, O. Ali, M. Turchi, T. Snowsill, F. Nicart, T. De Bie, N. Cristianini, NOAM: News outlets analysis and monitoring system, in: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, SIGMOD '11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 1275–1278. doi:10.1145/1989323.1989474.
[7] I. Flaounas, T. Lansdall-Welfare, P. Antonakaki, N. Cristianini, The Anatomy of a Modular System for Media Content Analysis, arXiv preprint (2014). arXiv:1402.6208.
[8] R. Steinberger, B. Pouliquen, E. van der Goot, An introduction to the Europe Media Monitor family of applications, in: Proceedings of the SIGIR 2009 Workshop on Information Access in a Multilingual World, volume 43, 2009. arXiv:1309.5290.
[9] A. Z. Broder, Identifying and filtering near-duplicate documents, in: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 1848, 2000, pp. 1–10. doi:10.1007/3-540-45123-4_1.
[10] M. Henzinger, Finding near-duplicate web pages, in: Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006, p. 284. doi:10.1145/1148170.1148222.
[11] J. R. Finkel, T. Grenager, C. Manning, Incorporating non-local information into information extraction systems by Gibbs sampling, in: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), Association for Computational Linguistics, Ann Arbor, Michigan, 2005, pp. 363–370. doi:10.3115/1219840.1219885.
[12] A. Palmero Aprosio, G. Moretti, Tint 2.0: an all-inclusive suite for NLP in Italian, 2018.
[13] F. Neresini, A. Lorenzet, Can media monitoring be a proxy for public opinion about technoscientific controversies? The case of the Italian public debate on nuclear power, Public Understanding of Science (2014).
[14] E. Di Buccio, A. Lorenzet, M. Melucci, F. Neresini, Unveiling Latent States Behind Social Indicators, in: R. Gavaldà, I. Zliobaite, J. Gama (Eds.), Proceedings of SoGood@ECML-PKDD 2016, Riva del Garda, Italy, September 19, 2016, volume 1831 of CEUR Workshop Proceedings, CEUR-WS.org, 2016.
[15] D. H. Wolpert, Stacked generalization, Neural Networks 5 (1992) 241–259. doi:10.1016/S0893-6080(05)80023-1.
[16] H.-F. Yu, F.-L. Huang, C.-J. Lin, Dual coordinate descent methods for logistic regression and maximum entropy models, Machine Learning 85 (2011) 41–75. doi:10.1007/s10994-010-5221-8.
[17] E. Raff, JSAT: Java Statistical Analysis Tool, a library for machine learning, Journal of Machine Learning Research 18 (2017) 1–5. URL: http://jmlr.org/papers/v18/16-131.html.
[18] S. Robertson, The Probabilistic Relevance Framework: BM25 and Beyond, Foundations and Trends® in Information Retrieval 3 (2009) 333–389. doi:10.1561/1500000019.
[19] D. M. Blei, J. D. Lafferty, Topic Models, in: A. Srivastava, M. Sahami (Eds.), Text Mining: Classification, Clustering, and Applications, New York, NY, USA, 2009.
[20] J. Boyd-Graber, Y. Hu, D. Mimno, Applications of Topic Models, Foundations and Trends® in Information Retrieval 11 (2017) 143–296. doi:10.1561/1500000030.
[21] D. M. Blei, A. Y. Ng, M. I. Jordan, Latent Dirichlet allocation, The Journal of Machine Learning Research 3 (2003) 993–1022.
[22] H. M. Wallach, D. Mimno, A. McCallum, Rethinking LDA: Why priors matter, in: Proceedings of the 22nd International Conference on Neural Information Processing Systems, NIPS'09, Curran Associates Inc., USA, 2009, pp. 1973–1981.
[23] J. Chuang, C. D. Manning, J. Heer, Termite: Visualization techniques for assessing textual topic models, in: Proceedings of the Workshop on Advanced Visual Interfaces (AVI), ACM Press, New York, NY, USA, 2012, pp. 74–77. doi:10.1145/2254556.2254572.
[24] S. Liu, M. X. Zhou, S. Pan, Y. Song, W. Qian, W. Cai, X. Lian, TIARA: Interactive, topic-based visual text summarization and analysis, ACM Transactions on Intelligent Systems and Technology 3 (2012) 1–28. doi:10.1145/2089094.2089101.
[25] D. Mimno, Computational historiography, Journal on Computing and Cultural Heritage 5 (2012) 1–19.
[26] V. Zorzi, The Communication of Science and Technology in Online Newspapers: a Multidimensional Perspective, Ph.D. thesis, University of Padova, Italy, 2018.
[27] J. Qiang, Z. Qian, Y. Li, Y. Yuan, X. Wu, Short Text Topic Modeling Techniques, Applications, and Performance: A Survey, IEEE Transactions on Knowledge and Data Engineering 34 (2022) 1427–1445. doi:10.1109/TKDE.2020.2992485. arXiv:1904.07695.
[28] J. Qiang, P. Chen, W. Ding, T. Wang, F. Xie, X. Wu, Heterogeneous-Length Text Topic Modeling for Reader-Aware Multi-Document Summarization, ACM Transactions on Knowledge Discovery from Data 13 (2019) 1–21. doi:10.1145/3333030.
[29] A. Costa, E. Di Buccio, M. Melucci, G. Nannicini, Efficient parameter estimation for information retrieval using black-box optimization, IEEE Transactions on Knowledge and Data Engineering 30 (2018) 1240–1253. doi:10.1109/TKDE.2017.2761749.