<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Towards a Methodology for Technoscientific Objects Extraction (Short Paper)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alberto Cammozzo</string-name>
          <email>alberto.cammozzo@unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Emanuele Di Buccio</string-name>
          <email>emanuele.dibuccio@unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Giardullo</string-name>
          <email>paolo.giardullo@unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Neresini</string-name>
          <email>federico.neresini@unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Sciandra</string-name>
          <email>andrea.sciandra@unipd.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <kwd-group>
          <kwd>Terminology Extraction</kwd>
          <kwd>Expert Users</kwd>
          <kwd>Digital Sociology</kwd>
        </kwd-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Engineering, University of Padova</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Philosophy</institution>
          ,
          <addr-line>Sociology, Education and Applied Psychology</addr-line>
          ,
          <institution>University of Padova</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Statistical Sciences, University of Padova</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Media monitoring is one of the activities carried out in the research field of Public Communication of Science and Technology (PCST). This interdisciplinary research field investigates how science and technology can affect contemporary society and how society can affect science and technology. Monitoring the media discourse can be beneficial to understanding the narrative that, when carried on by non-experts, might affect society's perception of an issue. One of the necessary tasks when following the discussion in the media is the automatic extraction of the actors involved. Besides people, companies, or institutions, a crucial task is the extraction of other non-human actors that play a leading role in the science narrative, such as relevant scientific terms or technologies. This paper documents our ongoing effort in extracting those terms and how they can be helpful for researchers in PCST.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Science and technology have an important role in society, as shown, for instance, by the recent
COVID-19 pandemic or by the increasing attention to controversial issues related to Artificial
Intelligence. Indeed, science and technology, hereafter denoted as “technoscience” [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], can affect
society or be affected by society, e.g., because of the public discourse on technoscientific issues.
Public Communication of Science and Technology (PCST) is an interdisciplinary research field
that studies the relationship between science and society and includes several activities, such as
Media Monitoring, that can help to understand and follow the narrative on some issues and
how that narrative might affect the public perception. The research activities performed by
PCST scholars can involve studying actors involved in public debate and the relationships among
those actors. When working on the media, automatic approaches can be adopted to follow the
discourse both on more traditional outlets, such as newspapers, and on recent ones, like Social Networks.
Once actors have been “extracted,” they can be used as terms to formulate queries for accessing
documents relevant to some specific technoscientific issues or to analyze the prominence of the
actors over time or the context where they occur. Therefore, PCST scholars need methods to
extract actors from possibly vast amounts of data that are infeasible to process manually.
      </p>
      <p>
        Automatic extraction of terms from text [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Named Entity Recognition (NER) are therefore
helpful to support PCST scholars' tasks. However, NER methods are usually adopted to extract
entities such as people (names), locations, and time. There are scenarios where those approaches
must be tailored to extract entities specific to a particular domain. This is the case, for instance, of the
medical domain [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or the technology domain [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] in patent corpora. In this work we are
interested in non-human actors, specifically those playing a leading role in the technoscience
narrative, such as relevant scientific terms or technologies.
      </p>
      <p>
        The work reported in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] is particularly relevant to ours since it is focused on technology
extraction. In that paper, the authors investigated diverse Natural Language Processing (NLP)
approaches for extracting technologies from patent data. They categorized the three approaches
as gazetteer-based, rule-based, and distributional-based. As for the first approach,
gazetteer-based, two different sources were used: Wikipedia (the list of emerging technologies) and O*NET,
a free online text database containing job definitions and technologies related to the occupational
domain. The adopted rule-based approaches included one based on lexico-syntactic patterns
for hyponymy extraction from data [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. As for the third approach, they used distributional-based
approaches relying on BERT embeddings, a 2-layer bi-directional LSTM on top of the
embeddings, and conditional random fields, with an architecture similar to that used in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Models were trained on data gathered through the proposed rule-based approaches.
      </p>
      <p>
        In this work, we are interested in terms related to technoscience, including but not limited
to technologies. We will propose a methodology for single-word and Multiword Expression (MWE) extraction that,
as in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], relies on multiple approaches and multiple sources of evidence. The methodology
is intended both as a first approach to extracting technoscientific objects and to support the
creation of a labeled set that will be used in future work, e.g., for fine-tuning models. This paper
describes the methodology and some preliminary results.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. Methodology</title>
      <p>
        This work considers online newspaper articles as a source of terms. Although we are working
on instantiations of the methodology for diverse languages, such as English and Italian, this section
will refer to a corpus of 260,627 articles published in eight Italian newspapers in 2022. Those
articles have been gathered through a Media Monitoring platform called TIPS (Technoscientific
Issues in the Public Sphere).1 The platform, originally described in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], has been extended with
a number of additional modules and services [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], a description of hactar,2 the platform
for collecting, extracting, cleaning, and processing online web pages from newspapers, blogs,
and websites, was provided. Hactar, released under the AGPL, is equipped with functionalities
for NER through the adoption of open-source libraries. The corpus considered in this paper
consists of articles from the following newspapers: “Avvenire”, “Corriere della Sera”, “il
Giornale”, “Il Mattino”, “Il Messaggero”, “la Repubblica”, “Il Sole 24 Ore”, and “La Stampa”. The
URLs of the articles have been made available through Zenodo [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
1http://www.tipsproject.eu/tips
2https://gitlab.com/mmzz/hactar
      </p>
      <p>
        Our methodology for technoscientific object extraction focuses on two types of terms:
single-word and multi-word expressions. For both term types, we use multiple strategies, as
done in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which include gazetteers, rule-based named-entity candidate identification, and
Supervised Machine Learning (ML) techniques. Since most of the available models have been
trained for the well-known classes of entities, such as Person, Organization, and Place, our
approach has two goals:
• obtaining a set of technoscientific objects without fine-tuned ML-based NER;
• building a labeled set that can later be adopted to train and evaluate fine-tuned ML-based
NER approaches that extract technoscientific objects directly.
      </p>
      <p>Scientific instruments, laboratory equipment, measuring instruments, medical devices,
pharmaceuticals, methodologies, and disciplines are examples of “objects” of interest.</p>
      <p>
        Our current approach involves common methodology steps for both types of terms:
1. Corpus cleaning, which involved the removal of documents in other languages,
duplicates, and near-duplicates; the last step was performed using the min-wise independent
permutations locality sensitive hashing scheme (MinHash) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], relying on the
implementation available in the datasketch library.3
2. Extraction of candidate terms by Part of Speech (PoS) Taggers (for single-words), also
in conjunction with an approach for Multiword expression (MWE) extraction.
3. Filtering of Hapax Legomena and known terms from other “classes” using gazetteers or
lists of terms obtained via ML-based NER approaches, Web APIs (e.g., Scopus), or by looking
at the categories of DBPedia entries.
4. Technoscientific content classification via Supervised Machine Learning (ML)
techniques.
5. Candidate terms ranking by the approach proposed in [12], which requires information
on terms (statistics) on documents relevant and non-relevant to technoscience.
6. Extraction of the final term list by manual inspection or by adopting a threshold based
on term statistics or the score from the previous step.
      </p>
      <p>In this paper, we will focus on steps 2–5.</p>
      <sec id="sec-3-1">
        <title>2.1. Candidate Term Extraction and Filtering</title>
        <p>As for the extraction of candidate terms (step 2), we adopted two approaches: one to extract
single words and one to extract MWEs.</p>
        <p>As for single words, we relied on PoS taggers to extract all the nouns occurring in the
documents. The extraction of nouns was performed by Tint [13], which is based on the Stanford
CoreNLP library [14] and was explicitly developed for Italian. We considered all the terms
labeled by the PoS tagger with the tags S (common noun) and SP (proper noun).
3https://github.com/ekzhu/datasketch</p>
        <p>As for MWEs, we relied on NPFST [15] and its implementation available in the phrasemachine
library.4 The required input is, for each sentence, the set of constituting tokens and the
corresponding PoS tags. To extract PoS tags, we relied on the spaCy Library because it is the
tool currently adopted in the TIPS pipeline to process all the collected articles. The output
of this step is a set of candidate MWEs, where some may overlap or be nested – e.g., one is a
sub-string of the other – or might be closely related. Examples in Italian are:
• “telescopio spaziale James”,
• “telescopio spaziale James Webb”,
• “super telescopio spaziale James”,
• “super telescopio spaziale James Webb”.</p>
        <p>These MWEs can be combined into a single MWE “telescopio spaziale James Webb” (“James
Webb Space Telescope”).</p>
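        <p>To make the candidate generation concrete, the following sketch applies a simplified version of the noun-phrase pattern underlying NPFST [15], namely (ADJ|NOUN)* NOUN over PoS tags; the full NPFST grammar is richer, and in the TIPS pipeline the tags would come from spaCy rather than being supplied by hand as here.</p>
```python
# Simplified sketch of NP-pattern candidate MWE extraction: emit every token
# span (within a length window) whose PoS sequence matches (ADJ|NOUN)* NOUN.
# This is a reduced version of the NPFST pattern, for illustration only.
import re


def candidate_mwes(tagged_sentence, minlen=2, maxlen=6):
    """tagged_sentence: list of (token, coarse_pos) pairs, e.g. ('telescopio', 'NOUN')."""
    # Map PoS tags to single letters so the NP pattern is a plain regex.
    letters = {"ADJ": "A", "NOUN": "N", "PROPN": "N"}
    tags = "".join(letters.get(pos, "x") for _, pos in tagged_sentence)
    tokens = [tok for tok, _ in tagged_sentence]
    spans = set()
    for i in range(len(tags)):
        for j in range(i + minlen, min(i + maxlen, len(tags)) + 1):
            if re.fullmatch(r"[AN]*N", tags[i:j]):
                spans.add(" ".join(tokens[i:j]))
    return spans


sentence = [("il", "DET"), ("super", "ADJ"), ("telescopio", "NOUN"),
            ("spaziale", "ADJ"), ("James", "PROPN"), ("Webb", "PROPN"),
            ("osserva", "VERB")]
print(sorted(candidate_mwes(sentence)))
```
        <p>On this sentence the sketch produces overlapping and nested candidates such as “telescopio spaziale James”, “telescopio spaziale James Webb”, and “super telescopio spaziale James Webb”, mirroring the examples above.</p>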
        <p>After extracting single-word and multi-word technoscientific object candidates, we filter out
Hapax Legomena and terms present in available gazetteers or previously manually labeled sets,
thus restricting the number of nouns and MWEs to process. The resources adopted were:
• a “standard” Italian stoplist;5
• a list of all countries and over eleven million placenames made available in the GeoNames
geographical database;6
• OpenStreetMap7 nodes, relations and ways (in Italy);
• a list of persons, organizations, and places identified by the spaCy NER for Italian;
• a list of person surnames (used to filter single-word entries);
• a list of persons which includes scientists extracted through the Scopus Web API, which
were manually checked for a previous study [16];
• a list of journals gathered from Scopus;
• a list of pharmaceutical corporations and a list of drugs and drug active ingredients gathered
from the website of the Agenzia Italiana del Farmaco.8</p>
        <p>Note that the drugs and active ingredients mentioned in the last item are objects of interest
that can be added to the final list. When processing single words and MWEs, we also checked
for corresponding entries in DBPedia through the DBPedia Lookup service,9 looking for the
presence of the term in the label of the returned entities; this approach helped to identify
some persons not present in our gazetteers and the category of some technoscientific terms, e.g.,
proteins or chemical compounds.</p>
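        <p>A minimal sketch of this lookup-based check follows. The endpoint URL and the response fields (a JSON object with a "docs" list whose entries carry "label" values) are assumptions based on the public DBpedia Lookup service, not details taken from our pipeline; labels returned by the live service may additionally contain highlighting markup.</p>
```python
# Hedged sketch of the DBpedia Lookup filtering step: query the public
# Lookup service and keep the labels of returned entities that contain the
# candidate term. Endpoint and response schema are assumptions.
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

LOOKUP_URL = "https://lookup.dbpedia.org/api/search?query={}"


def lookup_docs(term):
    """Query the public DBpedia Lookup service (network access required)."""
    req = Request(LOOKUP_URL.format(quote(term)),
                  headers={"Accept": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp).get("docs", [])


def matching_labels(term, docs):
    """Labels of returned entities that contain the candidate term."""
    hits = []
    for doc in docs:
        for label in doc.get("label", []):
            if term.lower() in label.lower():
                hits.append(label)
    return hits
```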
        <p>After filtering, the remaining MWEs required a subsequent step because of related terms.
Following the suggestion reported in [15] on merging related terms, we added a subsequent
step for grouping terms. In that work, a high-level algorithm is illustrated; we opted for a
different approach that uses both LSH with MinHash and Morrone’s Index [17]. LSH with
MinHash was adopted to group related MWEs. We adopted the implementation available in the
datasketch library, using 128 permutations and MWE representations based on 4-character-long
n-grams, with a threshold of 0.5. The nearest MWEs by MinHash were then compared
using Morrone’s Index. The index provides a measure of the extent to which a MWE is
significant in a corpus, and it is defined as follows:

I_{\mathrm{MWE}} = n \cdot \sum_{i=1}^{s} \frac{f_{\mathrm{MWE}}}{f_i} \qquad (1)

where n is the number of tokens in the MWE, f_MWE is the frequency of the MWE in the
corpus computed as the number of documents where it occurs (document frequency), f_i is the
(document) frequency of the token t_i, and s is the number of non-stopwords in the MWE. We
used the standardized version of the index obtained by dividing Eq. 1 by n^2. For instance, for
the previously mentioned MWEs, the values of the index are:
• “telescopio spaziale James”: 0.039
• “telescopio spaziale James Webb”: 0.069
• “super telescopio spaziale James”: 0.0033
• “super telescopio spaziale James Webb”: 0.006
Therefore, among the MWEs in this group, we will opt for the one with the highest score, i.e.,
“telescopio spaziale James Webb”. Note that “James Webb” will be kept as an instance of a Person
entity and “telescopio” (telescope) as an instance of a single-word technoscientific object.
4https://github.com/slanglab/phrasemachine
5https://github.com/stopwords-iso/stopwords-it
6https://www.geonames.org
7https://www.openstreetmap.org/
8https://www.aifa.gov.it/liste-dei-farmaci
9https://www.dbpedia.org/resources/lookup/
</p>
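        <p>Under our reading of the index definition (in particular, that the sum runs over the non-stopword tokens and that the standardization divides by the squared number of tokens), the computation can be sketched as follows; the document frequencies used here are toy numbers, not corpus statistics.</p>
```python
# Sketch of the standardized Morrone's Index under the reading above:
# n * sum over non-stopword tokens of df_mwe / df_token, divided by n**2.
# The summation scope and the n**2 standardization are our interpretation.
def morrone_index(tokens, df_mwe, df, stopwords=frozenset()):
    """tokens: MWE tokens; df_mwe: document frequency of the MWE;
    df: dict token -> document frequency; stopwords: set of stopwords."""
    n = len(tokens)                                  # tokens in the MWE
    content = [t for t in tokens if t not in stopwords]
    raw = n * sum(df_mwe / df[t] for t in content)   # Eq. 1
    return raw / n ** 2                              # standardized version


# Toy document frequencies, for illustration only.
df = {"telescopio": 100, "spaziale": 200, "james": 50, "webb": 40}
print(morrone_index(["telescopio", "spaziale", "james", "webb"], 20, df))
```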
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Candidate Terms Ranking</title>
        <p>
          As for step 4, since our goal is to follow the technoscientific discourse, we need a definition of
the object of analysis. The work reported in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] describes and uses a “pragmatic” solution that
assumes the point of view of a hypothetical “typical newspaper reader” and what this person
might recognize as “technoscientific”. This solution suggests several features, including the
occurrence of scientists/engineers, a discovery, a scientific instrument, or a general reference
to research processes and technological innovations; these features were used to define the
criteria for manually labeling documents according to their relevance to technoscience. The
initial labeled set described in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] was then extended by labeling additional documents; the final
labeled set was then used to train a classifier using supervised ML. The most effective technique
among those studied in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] is based on the stacking [18] of a Multinomial Naive Bayes classifier
and Logistic Regression with Coordinate Descent Methods [19]. This result was consistent for
both Italian and English. Therefore, this is the approach adopted in this work to determine
articles relevant and not relevant to technoscience.
        </p>
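        <p>A minimal sketch of such a classifier using scikit-learn follows: a Multinomial Naive Bayes model and a Logistic Regression trained with the liblinear (dual coordinate descent) solver, combined by stacking. The toy corpus, the TF-IDF features, and the cross-validation setting are illustrative assumptions; the exact features and training setup of the deployed classifier may differ.</p>
```python
# Sketch of stacking [18] a Multinomial Naive Bayes classifier and a
# Logistic Regression with a coordinate descent solver [19], via scikit-learn.
# The toy corpus and feature extraction are illustrative only.
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["nuovo farmaco contro il tumore", "la squadra vince la partita",
         "scoperta una nuova proteina", "il governo approva la legge",
         "vaccino a mrna in sperimentazione", "concerto in piazza stasera"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = relevant to technoscience

clf = make_pipeline(
    TfidfVectorizer(),
    StackingClassifier(
        estimators=[("mnb", MultinomialNB()),
                    ("lr", LogisticRegression(solver="liblinear"))],
        final_estimator=LogisticRegression(solver="liblinear"),
        cv=2,  # small cv only because the toy corpus is tiny
    ),
)
clf.fit(texts, labels)
```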
        <p>Once all the articles have been classified, we ranked all the terms by the following score [12]:

s_t = w_t \, (p_t - q_t) = \log \frac{p_t \, (1 - q_t)}{q_t \, (1 - p_t)} \, (p_t - q_t) \qquad (2)

where p_t (q_t) is the probability that a given relevant (non-relevant) document is assigned the term
t. For instance, the top 10 single words extracted from articles published in 2022 and ranked by
this score are:
• “biomarcatore” (biomarker)
• “citochina” (Cytokine)
• “neurology”
• “her2”
• “esmo”
• “nivolumab”
• “interferone” (interferon)
• “statine” (Statins)
• “monoterapia”
• “trastuzumab”
As a proof of concept, we examined the top 1000 single words extracted, and 32 of them were
not correct; errors included surnames of scientists, acronyms of scientific journals such as
PNAS or PLOS, Twitter accounts (of scientists or personalities related to technoscience), and
abbreviations of organizations; some of those terms will be added to the gazetteers. We
are currently evaluating a large sample of the extracted terms.</p>
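        <p>The score can be computed from simple counts of term occurrences in the two classes of documents. The following sketch estimates p_t and q_t with add-0.5 smoothing, which is a common choice for this family of weights but our assumption here, not a detail taken from [12].</p>
```python
# Sketch of the term-ranking score above: the relevance weight
# log(p(1-q) / (q(1-p))) multiplied by (p - q). Probabilities are estimated
# from document counts with add-0.5 smoothing (our assumption).
from math import log


def robertson_score(r, R, n, N):
    """r of R relevant and n of N non-relevant documents contain the term."""
    p = (r + 0.5) / (R + 1.0)  # P(term | relevant)
    q = (n + 0.5) / (N + 1.0)  # P(term | non-relevant)
    return log(p * (1 - q) / (q * (1 - p))) * (p - q)
```
        <p>A term frequent in relevant documents and rare in non-relevant ones (e.g., 80 of 100 relevant vs. 10 of 1000 non-relevant) scores far higher than a term spread evenly across both classes.</p>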
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Final Remarks and Future Work</title>
      <p>In this paper, we proposed a methodology for technoscientific object extraction relying on
multiple strategies, which include the use of gazetteers and ML-based NER to filter out
non-relevant objects from those identified by PoS taggers or by MWE
extraction approaches. We provided a method for ranking the remaining terms that can be used
both to carry out analyses by social science researchers, such as PCST scholars, and to build a
labeled set to investigate methods for a fully automated process.</p>
      <p>
        To provide access to the gazetteers and the extracted terms, both for the extraction procedure
and to support analysis, we used elasticsearch,10 thus allowing terms and MWEs (and their statistics)
to be retrieved using full-text and fuzzy search both on the constituting tokens and the PoS tags.
In addition, we stored the occurrence of these terms in the newspaper articles, so that we could
search for technoscientific objects and monitor their presence and their relationship over time.
Moreover, the study of the occurrence of technoscientific objects in conjunction with indicators
such as the one on risk [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] might help us to gain some insights into the perception of some of
these objects or how they are discussed in the media sphere.
      </p>
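        <p>As an illustration of this storage design, a term entry and a fuzzy query over the constituting tokens could be laid out as below; the field names are illustrative assumptions, not the actual TIPS index schema.</p>
```python
# Illustrative layout for storing a term/MWE with its statistics in
# elasticsearch, and a fuzzy full-text query over the surface form.
# Field names are assumptions, not the actual TIPS mapping.
def term_document(surface, pos_tags, doc_freq, category=None):
    """Build an indexable entry for a term/MWE and its statistics."""
    return {
        "surface": surface,       # full-text searchable form
        "tokens": surface.split(),
        "pos": pos_tags,          # searchable PoS-tag sequence
        "doc_freq": doc_freq,     # per-corpus statistic
        "category": category,     # e.g., from DBPedia lookup
    }


def fuzzy_term_query(text):
    """Elasticsearch query body for fuzzy search on the surface form."""
    return {"query": {"match": {"surface": {"query": text,
                                            "fuzziness": "AUTO"}}}}
```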
      <p>Our current effort is devoted to a more extensive evaluation using expert annotators, i.e.,
researchers in PCST and Science and Technology Studies. Moreover, we will complement the
methodology with ML-based approaches to NER, such as those devised to extract entities from
specific domains.</p>
      <p>[11] A. Z. Broder, M. Charikar, A. M. Frieze, M. Mitzenmacher, Min-wise independent
permutations, Journal of Computer and System Sciences 60 (2000). doi:10.1006/jcss.1999.1690.
[12] S. Robertson, On term selection for query expansion, Journal of Documentation 46 (1990)
359–364. URL: https://www.emerald.com/insight/content/doi/10.1108/eb026866/full/html.
doi:10.1108/eb026866.
[13] A. Palmero Aprosio, G. Moretti, Tint 2.0: an All-inclusive Suite for NLP in Italian 10 (2018) 12.
[14] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, D. McClosky, The
Stanford CoreNLP natural language processing toolkit, in: Association for Computational
Linguistics (ACL) System Demonstrations, 2014, pp. 55–60. URL: http://www.aclweb.org/anthology/P/P14/P14-5010.
[15] A. Handler, M. Denny, H. Wallach, B. O'Connor, Bag of what? simple noun phrase
extraction for text analysis, Association for Computational Linguistics, 2016, pp. 114–124.
URL: http://aclweb.org/anthology/W16-5615. doi:10.18653/v1/W16-5615.
[16] F. Neresini, P. Giardullo, E. Di Buccio, B. Morsello, A. Cammozzo, A. Sciandra, M. Boscolo,
When scientific experts come to be media stars: An evolutionary model tested by analysing
coronavirus media coverage across Italian newspapers, PLoS ONE 18 (2023). doi:10.1371/journal.pone.0284841.
[17] A. Morrone, Temi generali e temi specifici dei programmi di governo attraverso le sequenze
di discorso, in: L'attività dei governi della Repubblica italiana (1948–1994), Il Mulino,
Bologna, 1996, pp. 351–369.
[18] D. H. Wolpert, Stacked generalization, Neural Networks 5 (1992) 241–259. doi:10.1016/S0893-6080(05)80023-1.
[19] H.-F. Yu, F.-L. Huang, C.-J. Lin, Dual coordinate descent methods for logistic regression
and maximum entropy models, Machine Learning 85 (2011) 41–75. URL: https://doi.org/10.1007/s10994-010-5221-8.
doi:10.1007/s10994-010-5221-8.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Latour</surname>
          </string-name>
          ,
          <article-title>Science in action: How to follow scientists and engineers through society</article-title>
          , Harvard university press,
          <year>1987</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          ,
          <article-title>A systematic review of automatic term extraction: What happened in 2022?, Digital Scholarship in the Humanities 38 (</article-title>
          <year>2023</year>
          )
          <fpage>I41</fpage>
          -
          <lpage>I47</lpage>
          . doi:
          <volume>10</volume>
          . 1093/llc/fqad030.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marchesin</surname>
          </string-name>
          , G. Silvello,
          <article-title>Terminology extraction in electronic health records. the examode project (poster)</article-title>
          , in: G.
          <string-name>
            <surname>M. D. Nunzio</surname>
            ,
            <given-names>G. M.</given-names>
          </string-name>
          <string-name>
            <surname>Henrot</surname>
            ,
            <given-names>M. T.</given-names>
          </string-name>
          <string-name>
            <surname>Musacchio</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Vezzani</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 1st International Conference on Multilingual Digital Terminology Today</source>
          , Padua, Italy, June 16-17,
          <year>2022</year>
          <article-title>(hybrid event)</article-title>
          , volume
          <volume>3161</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2022</year>
          . URL: http://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>3161</volume>
          /poster1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Puccetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Giordano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Spada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Chiarello</surname>
          </string-name>
          , G. Fantoni,
          <article-title>Technology identification from patent texts: A novel named entity recognition method</article-title>
          ,
          <source>Technological Forecasting and Social Change</source>
          <volume>186</volume>
          (
          <year>2023</year>
          ). doi:
          <volume>10</volume>
          .1016/j.techfore.
          <year>2022</year>
          .
          <volume>122160</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hearst</surname>
          </string-name>
          ,
          <article-title>Automatic acquisition of hyponyms from large text corpora</article-title>
          , volume
          <volume>2</volume>
          , Association for Computational Linguistics,
          <year>1992</year>
          , p.
          <fpage>539</fpage>
          . URL: http://portal.acm.org/citation. cfm?doid=
          <volume>992133</volume>
          .992154. doi:
          <volume>10</volume>
          .3115/992133.992154.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Neumann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Iyyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>Deep contextualized word representations</article-title>
          ,
          <source>Association for Computational Linguistics</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>2227</fpage>
          -
          <lpage>2237</lpage>
          . URL: http://aclweb.org/anthology/N18-1202. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>N18</fpage>
          - 1202.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Di Buccio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lorenzet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Melucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Neresini</surname>
          </string-name>
          ,
          <article-title>Unveiling latent states behind social indicators</article-title>
          , in: R.
          <string-name>
            <surname>Gavaldà</surname>
            ,
            <given-names>I. Zliobaite</given-names>
          </string-name>
          , J. Gama (Eds.),
          <source>Proceedings of the First Workshop on Data Science for Social Good co-located with European Conference on Machine Learning and Principles and Practice of Knowledge Dicovery in Databases, SoGood@ECML-PKDD 2016, Riva del Garda, Italy, September</source>
          <volume>19</volume>
          ,
          <year>2016</year>
          , volume
          <volume>1831</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2016</year>
          . URL: https://ceur-ws.
          <source>org/</source>
          Vol-1831/paper_6.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cammozzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. Di</given-names>
            <surname>Buccio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Neresini</surname>
          </string-name>
          ,
          <article-title>Monitoring technoscientific issues in the news</article-title>
          ,
          in:
          <source>ECML PKDD 2020 Workshops - Workshops of the European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2020): SoGood 2020, PDFL 2020, MLCS 2020, NFMCP 2020, DINA 2020, EDML 2020, XKDD 2020 and INRA 2020, Ghent, Belgium, September 14-18, 2020, Proceedings</source>
          , volume
          <volume>1323</volume>
          of
          <source>Communications in Computer and Information Science</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>536</fpage>
          -
          <lpage>553</lpage>
          . doi:10.1007/978-3-030-65965-3_37.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Di Buccio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cammozzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Neresini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zanatta</surname>
          </string-name>
          ,
          <article-title>TIPS: search and analytics for social science research</article-title>
          , in:
          <string-name>
            <given-names>L.</given-names>
            <surname>Tamine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Amigó</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2nd Joint Conference of the Information Retrieval Communities in Europe (CIRCLE 2022), Samatan, Gers, France, July 4-7, 2022</source>
          , volume
          <volume>3178</volume>
          of
          <source>CEUR Workshop Proceedings</source>
          , CEUR-WS.org,
          <year>2022</year>
          . URL: https://ceur-ws.org/Vol-3178/CIRCLE_2022_paper_33.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A.</given-names>
            <surname>Cammozzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Di Buccio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Giardullo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Neresini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sciandra</surname>
          </string-name>
          ,
          <article-title>Data from: Towards a Methodology for Technoscientific Objects Extraction (Short Paper)</article-title>
          ,
          <year>2024</year>
          . doi:10.5281/zenodo.10869937.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A. Z.</given-names>
            <surname>Broder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Charikar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Frieze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitzenmacher</surname>
          </string-name>
          ,
          <article-title>Min-wise independent permutations</article-title>
          ,
          <source>Journal of Computer and System Sciences</source>
          <volume>60</volume>
          (
          <year>2000</year>
          )
          <fpage>630</fpage>
          -
          <lpage>659</lpage>
          . doi:10.1006/jcss.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>