=Paper=
{{Paper
|id=Vol-1749/paper38
|storemode=property
|title=KD Strikes Back: from Keyphrases to Labelled Domains Using External Knowledge Sources
|pdfUrl=https://ceur-ws.org/Vol-1749/paper38.pdf
|volume=Vol-1749
|authors=Giovanni Moretti,Rachele Sprugnoli,Sara Tonelli
|dblpUrl=https://dblp.org/rec/conf/clic-it/MorettiST16
}}
==KD Strikes Back: from Keyphrases to Labelled Domains Using External Knowledge Sources==
<pdf width="1500px">https://ceur-ws.org/Vol-1749/paper38.pdf</pdf>
<pre>
KD Strikes Back: from Keyphrases to Labelled Domains Using External Knowledge Sources

Giovanni Moretti (1), Rachele Sprugnoli (1, 2), Sara Tonelli (1)
(1) Fondazione Bruno Kessler, Trento
(2) Università di Trento
{moretti,sprugnoli,satonelli}@fbk.eu


Abstract

    English. This paper presents L-KD, a tool that relies on available linguistic and
    knowledge resources to perform keyphrase clustering and labelling. The aim of L-KD is
    to help find and trace themes in English and Italian text data, represented by groups
    of keyphrases and associated domains. We evaluate the top-ranked domains on the
    20 Newsgroup dataset and show that 8 domains out of 10 match the manually assigned
    labels. This confirms the good accuracy of the approach, which does not require
    supervision.

    Italiano. In questo lavoro descriviamo L-KD, un sistema che utilizza risorse
    linguistiche e basate su conoscenza per raggruppare concetti-chiave e categorizzarli.
    L’obiettivo di L-KD è quello di supportare gli utenti nel rilevare la presenza di
    specifici temi in documenti italiani e inglesi, rappresentandoli attraverso gruppi di
    concetti-chiave e relativi domini. Abbiamo valutato l’affidabilità del sistema
    analizzando i domini più rilevanti nel 20 Newsgroup dataset, e dimostrando che 8 su 10
    domini nel gold standard sono assegnati correttamente anche dal sistema. Questa
    valutazione conferma le buone performance di L-KD, senza il bisogno di supervisione.

1   Introduction

With the increasing availability of large document collections in digital format, companies,
organizations and also non-expert users face every day the need to efficiently extract and
categorize relevant information from large corpora. The possibility to extract key-concepts
and assign them to a domain without the need for supervision would allow them to
systematically track the flow of information and retain only relevant content at two
granularity levels: key-concepts, and the domains to which these key-concepts can be
ascribed. Although topic models (Blei et al., 2003) can be used for this purpose, they have
two main drawbacks: the number of topics for a corpus is arbitrary and topics are often not
labelled.

In this work, we address this problem by presenting L-KD (Labelled-KD), a tool that performs
keyphrase clustering and labelling through the exploitation of external linguistic and
knowledge resources. The tool takes advantage of the availability of Keyphrase Digger [1]
(KD), a multilingual rule-based system that detects a weighted list of n-grams representing
the most important concepts in a text (Moretti et al., 2015). These key-concepts are then
linked to WordNet Domains (Magnini and Cavaglia, 2000) in order to create clusters of
key-concepts labelled by domain. The problem of ambiguous concepts, i.e. concepts possibly
belonging to more than one WordNet domain, is tackled by using ConceptNet 5 (Speer and
Havasi, 2013), a multilingual knowledge source containing single and multi-word concepts
linked to each other by a broad set of relations covering different types of associations.
The outcome of this study is the L-KD tool, supporting both English and Italian, which we
make available to the research community [2]. L-KD takes as input a document in plain text
format and outputs the ranked list of semantic domains discussed in the document, each
associated with a set of keyphrases.

[1] http://dh.fbk.eu/technologies/kd
[2] https://dh.fbk.eu/technologies/l-kd

2   Related Works

In recent years, a number of works dealing with the unsupervised clustering of keyphrases
have been presented (Hasan and Ng, 2014). Liu et al. (2009) use Wikipedia and
co-occurrence-based statistics to semantically cluster similar keyphrases in a set of
unweighted topics. In order to improve this approach by weighting topics, Liu et al. (2010)
and Grineva et al. (2009) propose a topic-decomposed PageRank and a network analysis
algorithm respectively to perform hierarchical clustering. Our method is simpler than the
previously mentioned studies, and relies on available resources to label the clusters.
Indeed, the lists of terms included in the topics are not always easy to interpret (Aletras
et al., 2015), and adding a label that captures the meaning of each cluster is a way to
enhance its understanding. The problem of interpretation also affects the output of topic
modelling algorithms, i.e. unsupervised statistical methods such as Latent Dirichlet
Allocation (Blei et al., 2003). Many techniques have been developed to automatically label
topics, for example by using probabilistic approaches (Mei et al., 2007), Wikipedia links
(Xu and Oard, 2011) and DBpedia structured data (Hulpus et al., 2013). As for the automatic
labelling of keyphrase clusters, Carmel et al. (2009) adopt Wikipedia as an external
resource to extract candidate labels. To the best of our knowledge, no available system
performs this task by combining WordNet Domains and ConceptNet 5.

3   System Overview

Figure 1: General workflow underlying L-KD for English documents with the steps involving
the use of Stanford CoreNLP, KD, WordNet Domains (WND) and ConceptNet 5 (CN5).

L-KD performs several steps (see Fig. 1) to semantically cluster keyphrases and label each
cluster:

   1) Text preprocessing: Stanford CoreNLP (Manning et al., 2014) is used to split
sentences, tokenize, lemmatize and tag the part-of-speech of the input English text. For
Italian texts, we rely on Tint [3], a suite of NLP tools (Aprosio and Moretti, 2016) based
on the Stanford CoreNLP pipeline.

   2) Keyphrase extraction and ranking: L-KD integrates KD, a keyphrase extraction tool
that combines statistical and linguistic knowledge, given by recurrent relevant PoS
patterns, to extract single words and multi-token expressions encoding the main concepts of
a document. A detailed description of KD functionalities is given in Moretti et al. (2015).
The output of this step is a weighted and ranked list of keyphrases.

   3) Domain mapping: L-KD maps the lemma forms of keyphrases to the lemmas in WND aligned
to WordNet 3.0 [4]. For Italian we rely on the data available through the Open Multilingual
WordNet project (Bond and Paik, 2012) as a bridge between lemmas and WND. In case of
multi-token expressions (e.g. “federal government”), the system looks for a perfect match.
If no match is found, the tokens are split and only the nouns are searched in WND (e.g.
“government”). A list of domain-keyphrases associations is created, as well as a list of
ambiguous keyphrases. The latter comprises those that are assigned to the Factotum domain
and those that could belong to several domains, if none of them contains > 3 keyphrases.
This threshold was manually set in order to identify domains that are likely to be of
little relevance.

   4) Expansion of ambiguous keyphrases: The lemmas of ambiguous keyphrases are aligned
with the lemmas in ConceptNet 5 and are expanded by retrieving all the connected concepts
following ConceptNet 5 relations. L-KD relies on a subset of relations including
hierarchical (HasA, PartOf, MadeOf, IsA, DerivedFrom) and synonymous (Synonym, RelatedTo)
ones (Mukherjee and Joshi, 2013). Functional relations such as CapableOf and UsedFor are
not taken into consideration because the concepts evoked by these relations may be too far
from the original meaning of the key-concept. The upper part of Fig. 2 shows how “nature”,
an ambiguous keyphrase, is expanded following this procedure. Examples of the relations
that lead to this expansion are the following:
- nature ⇒ RelatedTo ⇒ flora
- nature ⇒ IsA ⇒ great place
- nature ⇒ HasA ⇒ many wonder

Figure 2: Excerpt of the expansion of an ambiguous keyphrase using ConceptNet 5 (top) and
top domains assigned to this expansion (bottom).

   5) Domain mapping of expanded keyphrases: All the lemmas included in the expansion
created in the previous step are mapped to domains using WND. The lower part of Fig. 2
reports the top domains related to the expansion of “nature” together with the number of
lemmas associated with them, e.g. 19 lemmas are mapped to the Biology domain. A relevance
score (i.e. number of keyphrases associated with a domain) is computed for the domains
retrieved for each expanded keyphrase. Domains are then compared with the ones found in
Step (3) starting from the domain with the highest score. If it is already present in the
domain-keyphrases list compiled in Step (3), then the keyphrase is associated with this
domain, otherwise the other domains are checked. If the domain is not present in the list,
it is added to the list with its associated keyphrase. The final relevance score of the
domains is recalculated at the end of this step. Four sub-domains of Factotum, i.e. Time
Period, Person, Metrology and Numbers, which are very generic, usually have a high
relevance because they tend to include many keywords. Therefore, we introduce a final
re-weighting step to deboost them.

   6) Final ranking: L-KD creates a final ranked list of domains associated with clusters
of keyphrases. The ranking is based on the relevance score of the domains as described in
the previous step and on the rank of keyphrases as given by KD in Step (2).

[3] http://tint.fbk.eu/
[4] Courtesy of Carlo Strapparava.

4   Evaluation

We evaluated L-KD using the 20 Newsgroup dataset (Joachims, 1996), a corpus of 20,000
documents extracted from UseNet discussion groups. This dataset is freely available
online [5] and has often been employed to train and test text categorization algorithms
(Moschitti and Basili, 2004). Specifically, each of its documents was manually assigned to
one out of twenty different categories, which can be easily mapped to WND labels. Although
L-KD can assign a ranked list of domains to one or more documents, thus providing a richer
representation of the document(s) content, we did not find a suitable gold standard to
evaluate the rank. Therefore, we limit our evaluation to the top-ranked domain extracted by
the tool. We also decided to group Newsgroup categories that are strictly related to each
other: e.g. documents in talk.religion.misc, alt.atheism, and soc.religion.christian all
discuss religious issues and for this reason their texts are collapsed into a single
category.

Table 1 reports the results of L-KD on the documents included in each category or group of
categories. The second column shows the top two domains retrieved by the system and the
third column presents some of the extracted keyphrases. In only 2 cases out of 10 does the
first ranked domain not perfectly match the original category: Law is the top domain of
sci.electronics and of the documents related to political themes (talk.politics.misc,
talk.politics.guns, talk.politics.mideast). Law is a very frequent domain because it
contains generic and recurring words such as “article”, “opinion” and “information”. In the
rest of the cases (8 out of 10), the match between the first ranked domain and the original
category is perfect: for example, the domain with the highest rank for documents discussing
computer technologies is Computer Science. In many cases the second domain is also
extremely relevant. For instance, misc.forsale contains messages from people searching for
or selling goods, with a focus on computer devices and components: the first retrieved
domain is Commerce and the second one is Computer Science. Each domain is associated with
pertinent keyphrases such as “best offer” for the first domain and “floppy drive” for the
second.

[5] http://qwone.com/~jason/20Newsgroups/

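Steps 3 to 5 of the System Overview (Section 3) can be illustrated with a minimal sketch.
It is not the authors' implementation: the WND and CN5 dictionaries are tiny hypothetical
stand-ins for WordNet Domains and ConceptNet 5, the Keyphrase class and all function names
are invented for illustration, and the handling of unmatched keyphrases and of ties is an
assumption where the description leaves the behaviour open. The de-boosting of generic
Factotum sub-domains and the final ranking of Step 6 are left out.

```python
from collections import Counter
from dataclasses import dataclass

# Toy stand-ins for the real resources (hypothetical data, for illustration only).
WND = {  # lemma -> WordNet Domains labels
    "doctor": {"Medicine"},
    "infectious disease": {"Medicine"},
    "government": {"Politics"},
    "plant": {"Biology"},
    "nature": {"Factotum"},
    "flora": {"Biology"},
    "tree": {"Biology"},
    "relaxation": {"Psychology"},
}
CN5 = {  # lemma -> (relation, connected concept) edges
    "nature": [("RelatedTo", "flora"), ("RelatedTo", "tree"),
               ("IsA", "great place"), ("UsedFor", "relaxation")],
}
# Relations listed in Step 4; functional ones (CapableOf, UsedFor) are left out.
ALLOWED = {"HasA", "PartOf", "MadeOf", "IsA", "DerivedFrom", "Synonym", "RelatedTo"}
THRESHOLD = 3  # a candidate domain counts as relevant if it holds > 3 keyphrases


@dataclass(frozen=True)
class Keyphrase:
    lemma: str
    nouns: tuple = ()  # noun lemmas of a multi-token expression


def candidate_domains(kp):
    """Step 3 lookup: exact match first, then fall back to the nouns only."""
    if kp.lemma in WND:
        return set(WND[kp.lemma])
    found = set()
    for noun in kp.nouns:
        found |= WND.get(noun, set())
    return found


def map_domains(keyphrases):
    """Step 3: build the domain-keyphrases list and the ambiguous list."""
    candidates = {kp.lemma: candidate_domains(kp) for kp in keyphrases}
    support = Counter(d for doms in candidates.values() for d in doms)
    clusters, ambiguous = {}, []
    for kp in keyphrases:
        doms = candidates[kp.lemma]
        if not doms:
            continue  # assumption: keyphrases with no WND match are dropped
        relevant = [d for d in doms if support[d] > THRESHOLD]
        if "Factotum" in doms or (len(doms) > 1 and not relevant):
            ambiguous.append(kp.lemma)
            continue
        # assumption: among several candidates, prefer the best-supported one
        best = max(relevant or doms, key=lambda d: (support[d], d))
        clusters.setdefault(best, []).append(kp.lemma)
    return clusters, ambiguous


def expand(lemma):
    """Step 4: follow only the allowed ConceptNet relations."""
    return [concept for rel, concept in CN5.get(lemma, []) if rel in ALLOWED]


def resolve(lemma, clusters):
    """Step 5: map the expansion to domains and attach the keyphrase to one."""
    scores = Counter(d for concept in expand(lemma) for d in WND.get(concept, ()))
    ranked = [d for d, _ in scores.most_common()]
    if not ranked:
        return None  # no expanded lemma is in WND; the keyphrase stays unassigned
    # prefer the highest-scoring domain already found in Step 3, else add the top one
    chosen = next((d for d in ranked if d in clusters), ranked[0])
    clusters.setdefault(chosen, []).append(lemma)
    return chosen


keyphrases = [
    Keyphrase("doctor"),
    Keyphrase("infectious disease"),
    Keyphrase("federal government", nouns=("government",)),
    Keyphrase("plant"),
    Keyphrase("nature"),
]
clusters, ambiguous = map_domains(keyphrases)
for lemma in ambiguous:
    resolve(lemma, clusters)
print(clusters)
```

In this toy run “federal government” has no exact match and is mapped through its noun,
while “nature” is first set aside as ambiguous (Factotum) and then joins the Biology
cluster through its RelatedTo expansions; the UsedFor edge is ignored.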
ORIGINAL CATEGORIES                        TOP DOMAINS        KEYPHRASES
sci.med                                    Medicine           doctor, infectious disease, side effect
                                           School             course, science, study
sci.space                                  Astronomy          solar system, physical universe, satellite
                                           Transport          spacecraft, shuttle, high-speed collision
sci.crypt                                  Computer Science   internet, e-mail, bit
                                           Law                security, second amendment, criminal
sci.electronics                            Law                article, opinion, information
                                           Electricity        amateur radio, voltage, wire
talk.religion.misc - alt.atheism -         Religion           christian, atheist, objective morality
soc.religion.christian                     Law                law, evidence, private activities
rec.sport.baseball - rec.sport.hockey      Sport              game, playoff, second period
                                           Play               player, baseball
rec.autos - rec.motorcycles                Transport          car, mph, front wheel
                                           Law                article, opinion
comp.graphics - comp.os.mswindows.misc -   Computer Science   software, hard drive, anonymous ftp
comp.sys.ibm.pc.hardware -                 Publishing         article, opinion
comp.windows.x - comp.sys.mac.hardware
talk.politics.misc - talk.politics.guns -  Law                opinion, second amendment
talk.politics.mideast                      Transport          road, ways of escape
misc.forsale                               Commerce           best offer, price, excellent condition
                                           Computer Science   hard drive, floppy drive, email

Table 1: Results of L-KD on the 20 Newsgroup dataset. The original categories are compared
with the top domains extracted by the system. Examples of keyphrases are provided for each
domain. Perfect matches between the main theme of the original classification and L-KD top
domains are in bold.
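
The top-1 figure reported for Table 1 amounts to a simple tally, sketched below. The
predicted domains are the first-ranked domains copied from Table 1. The gold labels are an
assumption: the paper states that the Newsgroup categories can be easily mapped to WND
labels but does not list the mapping, so the labels here (in particular Electricity for
sci.electronics and Politics for the talk.politics group) are hypothetical and chosen only
to be consistent with the two mismatches named in Section 4.

```python
# (category group, top-ranked domain from Table 1, ASSUMED gold WND label)
ROWS = [
    ("sci.med", "Medicine", "Medicine"),
    ("sci.space", "Astronomy", "Astronomy"),
    ("sci.crypt", "Computer Science", "Computer Science"),
    ("sci.electronics", "Law", "Electricity"),      # gold label is hypothetical
    ("religion group", "Religion", "Religion"),
    ("rec.sport group", "Sport", "Sport"),
    ("rec.autos - rec.motorcycles", "Transport", "Transport"),
    ("comp group", "Computer Science", "Computer Science"),
    ("talk.politics group", "Law", "Politics"),     # gold label is hypothetical
    ("misc.forsale", "Commerce", "Commerce"),
]


def top1_matches(rows):
    """Return the category groups whose top-ranked domain equals the gold label."""
    return [group for group, predicted, gold in rows if predicted == gold]


matched = top1_matches(ROWS)
print(f"{len(matched)} of {len(ROWS)} category groups match on the top-ranked domain")
```

Because only the first-ranked domain is compared, a relevant second domain (such as
Electricity for sci.electronics) does not count towards the score.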


5   Use Case: the De Gasperi Project

L-KD has recently been applied to the analysis of the complete corpus of public writings by
Alcide De Gasperi (De Gasperi, 2006) in the context of a research project whose goal is to
give insight into De Gasperi’s communication strategy with the help of innovative tools for
text analysis. We processed the 2,762 documents (around 3,000,000 tokens) in the corpus,
published between 1901 and 1954, to analyse which domains appeared in the collection and
how they changed over time. The advantage of L-KD is that it can provide both a distant
view, by computing aggregated information on the domains, and a close reading of the
documents, showing which key-concepts are mapped to which domain. As an example, we report
in Fig. 3 the analysis related to two documents, entitled “Rene de la Tour du Pin” and
“I cattolici nell’evoluzione sociale”. For each of them, the dendrogram shows the three top
domains and the associated key-concepts. The proposed analysis was validated at different
granularities by two history scholars, who confirmed the consistency of the L-KD analysis
and found correspondences between the top domains and relevant events in De Gasperi’s life.

Figure 3: Dendrogram related to two documents from De Gasperi’s corpus

6   Tool Availability

L-KD is available as a web application [6] through which users can copy and paste a
document and run the tool to process it on the fly. This application makes L-KD easily
accessible also to users without a technical background.

[6] http://dhlab.fbk.eu:8080/L_KD/

In the application some parameters are fixed, while others can be changed by the user
according to his/her needs. As for the fixed parameters, proper names are always discarded
so as to exclude them from the list of keyphrases: this setting is justified by the fact
that WordNet, and consequently WND, contains few proper nouns [7], while we want to
maximize the mapping. For the same reason, short keyphrases, i.e. single words and
multi-token expressions with a maximum length of 4 words [8], are preferred. On the
contrary, the minimum number of occurrences for a word or expression to be considered as a
candidate keyphrase and the number of keyphrases to be extracted can be customized by the
user. For example, in case of short documents, a low number of keyphrases (e.g. up to 20)
can be set together with a minimum frequency of 1 or 2 (in a short text repetitions are
less likely to occur). For long documents more keyphrases can be extracted: in this way it
would be easy to find clusters covering multiple themes.

[7] Only 9.4% of synsets are tagged as being instances, i.e. proper nouns, in WordNet 3.0
    (Abrate et al., 2012).
[8] In WordNet 3.0 only 0.2% of noun synsets have a length greater than 4 words.

7   Conclusions and Future Works

This paper presents L-KD, a tool that extracts keyphrases from text data, clusters them
according to the domain and assigns a label to each cluster. The process underlying L-KD is
based on the exploitation of external linguistic and knowledge resources, i.e. WordNet
Domains and ConceptNet 5. Our tool can process both English and Italian texts of different
length and content, from a single news article to an entire book, from single-theme to
multi-theme documents.

   In the future we will explore different research directions. First of all we want to
evaluate the tool on Italian data, even if we have not found a suitable gold standard so
far. Resorting to crowd-sourcing may be a viable solution. We expect lower performance than
that obtained for English, given that the current mapping between Open Multilingual WordNet
and WordNet 3.0 covers only 32.5% of the English synsets: this consequently affects the
mapping on the domains of WND. Moreover, the coverage of Italian in ConceptNet 5 is
limited. As for the availability of L-KD, we plan to release the tool as a stand-alone
module. It will also be integrated in the ALCIDE platform (Moretti et al., 2016) that
supports the analysis of large document collections for humanities studies.

Acknowledgments

The research leading to this paper was partially supported by the EU Horizon 2020 Programme
via the SIMPATICO Project (H2020-EURO-6-2015, n. 692819). We thank Alessio Palmero Aprosio
for his help in the evaluation process.

References

Matteo Abrate, Clara Bacciu, Andrea Marchetti, and Maurizio Tesconi. 2012. WordNet atlas: a
  web application for visualizing WordNet as a zoomable map. In GWC 2012 6th International
  Global Wordnet Conference, page 23.

Nikolaos Aletras, Timothy Baldwin, Jey Han Lau, and Mark Stevenson. 2015. Evaluating topic
  representations for exploring document collections. Journal of the Association for
  Information Science and Technology.

Alessio Palmero Aprosio and Giovanni Moretti. 2016. Italy goes to Stanford: a collection of
  CoreNLP modules for Italian. arXiv preprint arXiv:1609.06204.

David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. Journal
  of machine Learning research, 3(Jan):993–1022.

Francis Bond and Kyonghee Paik. 2012. A survey of WordNets and their licenses. Small,
  8(4):5.

David Carmel, Haggai Roitman, and Naama Zwerdling. 2009. Enhancing cluster labeling using
  wikipedia. In Proceedings of the 32nd international ACM SIGIR conference on Research and
  development in information retrieval, pages 139–146. ACM.

A. De Gasperi. 2006. Scritti e discorsi politici. In E. Tonezzer, M. Bigaran, and
  M. Guiotto, editors, Scritti e discorsi politici, volume 1. Il Mulino.

Maria Grineva, Maxim Grinev, and Dmitry Lizorkin. 2009. Extracting key terms from noisy and
  multitheme documents. In Proceedings of the 18th international conference on World wide
  web, pages 661–670. ACM.

Kazi Saidul Hasan and Vincent Ng. 2014. Automatic keyphrase extraction: A survey of the
  state of the art. In Proceedings of the 52nd Annual Meeting of the Association for
  Computational Linguistics (Volume 1: Long Papers), pages 1262–1273. Association for
  Computational Linguistics.

Ioana Hulpus, Conor Hayes, Marcel Karnstedt, and Derek Greene. 2013. Unsupervised
  graph-based topic labelling using dbpedia. In Proceedings of the sixth ACM international
  conference on Web search and data mining, pages 465–474. ACM.

Thorsten Joachims. 1996. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for
  Text Categorization. Technical report, DTIC Document.

Zhiyuan Liu, Peng Li, Yabin Zheng, and Maosong Sun. 2009. Clustering to find exemplar terms
  for keyphrase extraction. In Proceedings of the 2009 Conference on Empirical Methods in
  Natural Language Processing: Volume 1-Volume 1, pages 257–266. Association for
  Computational Linguistics.

Zhiyuan Liu, Wenyi Huang, Yabin Zheng, and Maosong Sun. 2010. Automatic keyphrase
  extraction via topic decomposition. In Proceedings of the 2010 conference on empirical
  methods in natural language processing, pages 366–376. Association for Computational
  Linguistics.

Bernardo Magnini and Gabriela Cavaglia. 2000. Integrating Subject Field Codes into WordNet.
  In Proceedings of the Second International Conference on Language Resources and
  Evaluation (LREC-2000).

Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and
  David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In ACL
  (System Demonstrations), pages 55–60.

Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai. 2007. Automatic labeling of multinomial
  topic models. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge
  discovery and data mining, pages 490–499. ACM.

Giovanni Moretti, Rachele Sprugnoli, and Sara Tonelli. 2015. Digging in the dirt:
  Extracting keyphrases from texts with kd. In Proceedings of CLiC-it 2016, page 198.

Giovanni Moretti, Rachele Sprugnoli, Stefano Menini, and Sara Tonelli. 2016. ALCIDE:
  Extracting and visualising content from large document collections to support Humanities
  studies. Knowledge-Based Systems, 111:100–112.

Alessandro Moschitti and Roberto Basili. 2004. Complex linguistic features for text
  classification: A comprehensive study. In European Conference on Information Retrieval,
  pages 181–196. Springer.

Subhabrata Mukherjee and Sachindra Joshi. 2013. Sentiment Aggregation using ConceptNet
  Ontology. In Proceedings of the Sixth International Joint Conference on Natural Language
  Processing (IJCNLP), pages 570–578.

Robert Speer and Catherine Havasi. 2013. ConceptNet 5: A large semantic network for
  relational knowledge. In The Peoples Web Meets NLP, pages 161–176. Springer.

Tan Xu and Douglas W Oard. 2011. Wikipedia-based topic clustering for microblogs.
  Proceedings of the American Society for Information Science and Technology, 48(1):1–10.

</pre>