=Paper=
{{Paper
|id=Vol-1749/paper38
|storemode=property
|title=KD Strikes Back: from Keyphrases to Labelled Domains Using External Knowledge Sources
|pdfUrl=https://ceur-ws.org/Vol-1749/paper38.pdf
|volume=Vol-1749
|authors=Giovanni Moretti,Rachele Sprugnoli,Sara Tonelli
|dblpUrl=https://dblp.org/rec/conf/clic-it/MorettiST16
}}
==KD Strikes Back: from Keyphrases to Labelled Domains Using External Knowledge Sources==
Giovanni Moretti<sup>1</sup>, Rachele Sprugnoli<sup>1,2</sup>, Sara Tonelli<sup>1</sup>
<sup>1</sup> Fondazione Bruno Kessler, Trento
<sup>2</sup> Università di Trento
{moretti,sprugnoli,satonelli}@fbk.eu
Abstract

English. This paper presents L-KD, a tool that relies on available linguistic and knowledge resources to perform keyphrase clustering and labelling. The aim of L-KD is to help find and trace themes in English and Italian text data, represented by groups of keyphrases and associated domains. We evaluate the top-ranked domains on the 20 Newsgroup dataset and show that 8 domains out of 10 match the manually assigned labels. This confirms the good accuracy of the approach, which does not require supervision.

Italiano (translated). In this work we describe L-KD, a system that uses linguistic and knowledge-based resources to group key-concepts and categorise them. The aim of L-KD is to support users in detecting the presence of specific themes in Italian and English documents, representing them through groups of key-concepts and their associated domains. We assessed the reliability of the system by analysing the most relevant domains in the 20 Newsgroup dataset, showing that 8 out of 10 domains in the gold standard are also correctly assigned by the system. This evaluation confirms the good performance of L-KD, without the need for supervision.

1 Introduction

With the increasing availability of large document collections in digital format, companies, organizations and also non-expert users face every day the need to efficiently extract and categorize relevant information from large corpora. The possibility to extract key-concepts and assign them to a domain without the need for supervision would allow them to systematically track the flow of information and retain only relevant content at two granularity levels: key-concepts, and the domains to which these key-concepts can be ascribed. Although topic models (Blei et al., 2003) can be used for this purpose, they have two main drawbacks: the number of topics for a corpus is arbitrary, and topics are often not labelled.

In this work, we address this problem by presenting L-KD (Labelled-KD), a tool that performs keyphrase clustering and labelling by exploiting external linguistic and knowledge resources. The tool builds on Keyphrase Digger<sup>1</sup> (KD), a multilingual rule-based system that detects a weighted list of n-grams representing the most important concepts in a text (Moretti et al., 2015). These key-concepts are then linked to WordNet Domains (Magnini and Cavaglia, 2000) in order to create clusters of key-concepts labelled by domain. The problem of ambiguous concepts, i.e. those possibly belonging to more than one WordNet domain, is tackled by using ConceptNet 5 (Speer and Havasi, 2013), a multilingual knowledge source containing single and multi-word concepts linked to each other by a broad set of relations covering different types of associations. The outcome of this study is the L-KD tool, supporting both English and Italian, which we make available to the research community<sup>2</sup>. L-KD takes as input a document in plain text format, and outputs the ranked list of semantic domains discussed in the document, each associated with a set of keyphrases.

<sup>1</sup> http://dh.fbk.eu/technologies/kd
<sup>2</sup> https://dh.fbk.eu/technologies/l-kd

2 Related Works

In recent years, a number of works dealing with the unsupervised clustering of keyphrases have been presented (Hasan and Ng, 2014). Liu et al. (2009) use Wikipedia and co-occurrence-based statistics to semantically cluster similar keyphrases in a set of unweighted topics. In order to improve this approach by weighting topics, Liu et al. (2010) and Grineva et al. (2009) propose a topic-decomposed PageRank and a network analysis algorithm respectively to perform hierarchical clustering. Our method is simpler than the previously mentioned studies, and relies on available resources to label the clusters. Indeed, the lists of terms in the topics are not always easy to interpret (Aletras et al., 2015), and adding a label that captures the meaning of each cluster is a way to enhance its understanding. The problem of interpretation also affects the output of topic modelling algorithms, i.e. unsupervised statistical methods such as Latent Dirichlet Allocation (Blei et al., 2003). Many techniques have been developed to automatically label topics, for example by using probabilistic approaches (Mei et al., 2007), Wikipedia links (Xu and Oard, 2011) and DBpedia structured data (Hulpus et al., 2013). As for the automatic labelling of keyphrase clusters, Carmel et al. (2009) adopt Wikipedia as an external resource to extract candidate labels. To the best of our knowledge, no available system performs this task by combining WordNet Domains and ConceptNet 5.

3 System Overview

Figure 1: General workflow underlying L-KD for English documents with the steps involving the use of Stanford CoreNLP, KD, WordNet Domains (WND) and ConceptNet 5 (CN5).

Figure 2: Excerpt of the expansion of an ambiguous keyphrase using ConceptNet 5 (top) and top domains assigned to this expansion (bottom).

L-KD performs several steps (see Fig. 1) to semantically cluster keyphrases and label each cluster:

1) Text preprocessing: Stanford CoreNLP (Manning et al., 2014) is used to split sentences, tokenize, lemmatize and tag the part-of-speech of the input English text. For Italian texts, we rely on Tint<sup>3</sup>, a suite of NLP tools (Aprosio and Moretti, 2016) based on the Stanford CoreNLP pipeline.

2) Keyphrase extraction and ranking: L-KD integrates KD, a keyphrase extraction tool that combines statistical and linguistic knowledge, given by recurrent relevant PoS patterns, to extract single words and multi-token expressions encoding the main concepts of a document. A detailed description of KD functionalities is given in Moretti et al. (2015). The output of this step is a weighted and ranked list of keyphrases.

3) Domain mapping: L-KD maps the lemma forms of keyphrases to the lemmas in WND aligned to WordNet 3.0<sup>4</sup>. For Italian we rely on the data available through the Open Multilingual WordNet project (Bond and Paik, 2012) as a bridge between lemmas and WND. In case of multi-token expressions (e.g. “federal government”), the system looks for a perfect match. If no match is found, the tokens are split and only the nouns are searched in WND (e.g. “government”). A list of domain-keyphrases associations is created, as well as a list of ambiguous keyphrases. The latter comprises those that are assigned to the Factotum domain and those that could belong to several domains, if none of them contains > 3 keyphrases. This threshold was manually set in order to identify domains that are likely to be of little relevance.

4) Expansion of ambiguous keyphrases: The lemmas of ambiguous keyphrases are aligned with the lemmas in ConceptNet 5 and are expanded by retrieving all the connected concepts following ConceptNet 5 relations. L-KD relies on a subset of relations including hierarchical (HasA, PartOf, MadeOf, IsA, DerivedFrom) and synonymous (Synonym, RelatedTo) ones (Mukherjee and Joshi, 2013). Functional relations such as CapableOf and UsedFor are not taken into consideration because the concepts evoked by these relations may be too far from the original meaning of the key-concept. The upper part of Fig. 2 shows how “nature”, an ambiguous keyphrase, is expanded following this procedure. Examples of the relations that lead to this expansion are the following:
- nature ⇒ RelatedTo ⇒ flora
- nature ⇒ IsA ⇒ great place
- nature ⇒ HasA ⇒ many wonder

5) Domain mapping of expanded keyphrases: All the lemmas included in the expansion created in the previous step are mapped to domains using WND. The lower part of Fig. 2 reports the top domains related to the expansion of “nature” together with the number of lemmas associated with them, e.g. 19 lemmas are mapped to the Biology domain. A relevance score (i.e. number of keyphrases associated with a domain) is computed for the domains retrieved for each expanded keyphrase. Domains are then compared with the ones found in Step (3), starting from the domain with the highest score. If it is already present in the domain-keyphrases list compiled in Step (3), then the keyphrase is associated with this domain, otherwise the other domains are checked. If the domain is not present in the list, it is added to the list with its associated keyphrase. The final relevance score of the domains is recalculated at the end of this step. Four sub-domains of Factotum, i.e. Time Period, Person, Metrology and Numbers, which are very generic, usually have a high relevance because they tend to include many keywords. Therefore, we introduce a final re-weighting step to deboost them.

6) Final ranking: L-KD creates a final ranked list of domains associated with clusters of keyphrases. The ranking is based on the relevance score of the domains as described in the previous step and on the rank of keyphrases as given by KD in Step (2).

<sup>3</sup> http://tint.fbk.eu/
<sup>4</sup> Courtesy of Carlo Strapparava.

4 Evaluation

We evaluated L-KD using the 20 Newsgroup dataset (Joachims, 1996), a corpus of 20,000 documents extracted from UseNet discussion groups. This dataset is freely available online<sup>5</sup> and has often been employed to train and test text categorization algorithms (Moschitti and Basili, 2004). Specifically, each of its documents was manually assigned to one out of twenty different categories, which can be easily mapped to WND labels. Although L-KD can assign a ranked list of domains to one or more documents, thus providing a richer representation of the document content, we did not find a suitable gold standard to evaluate the rank. Therefore, we limit our evaluation to the top-ranked domain extracted by the tool. We also decided to group Newsgroup categories that are strictly related to each other: e.g. documents in talk.religion.misc, alt.atheism, and soc.religion.christian all discuss religious issues and for this reason their texts are collapsed in a single category.

<sup>5</sup> http://qwone.com/~jason/20Newsgroups/

Table 1 reports the results of L-KD on the documents included in each category or group of categories. The second column shows the top two domains retrieved by the system and the third column presents some of the extracted keyphrases. Only in 2 cases out of 10 does the first ranked domain not perfectly match the original category: Law is the top domain of sci.electronics and of the documents related to political themes (talk.politics.misc, talk.politics.guns, talk.politics.mideast). We can notice that Law is a very frequent domain because it contains generic and recurring words such as “article”, “opinion” and “information”. In the rest of the cases (8 out of 10), the match between the first ranked domain and the original category is perfect: for example, the domain with the highest rank for documents discussing computer technologies is Computer Science. In many cases the second domain is also extremely relevant. For instance, misc.forsale contains messages of people searching for or selling goods with a focus on computer devices and components: the first retrieved domain is Commerce and the second one is Computer Science. Each domain is associated with pertinent keyphrases such as “best offer” for the first domain and “floppy drive” for the second.
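
Because the evaluation hinges on which domain ends up ranked first, it may help to see how Steps (3) to (5) of Section 3 could produce that ranking. The following is a minimal sketch under stated assumptions, not the L-KD implementation: the `WND` and `CN5` dictionaries are tiny hypothetical stand-ins for WordNet Domains and ConceptNet 5, the function names are invented for illustration, the `DEBOOST` factor is an arbitrary value (the paper does not report one), and the tie-breaking choices are our own reading of the description. The sketch also ignores the KD keyphrase rank used in Step (6).

```python
from collections import Counter, defaultdict

# Hypothetical toy lexicons standing in for WordNet Domains and ConceptNet 5.
WND = {
    "government": {"Politics"}, "election": {"Politics"},
    "parliament": {"Politics"}, "minister": {"Politics"},
    "flora": {"Biology"}, "tree": {"Biology"}, "animal": {"Biology"},
    "money": {"Economy"}, "loan": {"Economy"},
    "nature": {"Factotum"},
    "bank": {"Economy", "Geography"},
}
# Neighbours reached through hierarchical/synonymous relations only.
CN5 = {"nature": ["flora", "tree", "animal"], "bank": ["money", "loan"]}

GENERIC = {"Time Period", "Person", "Metrology", "Numbers"}
DEBOOST = 0.5   # assumed re-weighting factor, not taken from the paper
THRESHOLD = 3   # a domain must contain more than 3 keyphrases to be trusted


def lookup(keyphrase, nouns):
    """Exact match first; otherwise fall back to the nouns of the expression."""
    if keyphrase in WND:
        return set(WND[keyphrase])
    found = set()
    for noun in nouns:
        found |= WND.get(noun, set())
    return found


def first_pass(keyphrases):
    """Step 3: build domain-keyphrase clusters and the ambiguous list."""
    candidates = {kp: lookup(kp, nouns) for kp, nouns in keyphrases}
    support = Counter(d for doms in candidates.values() for d in doms)
    clusters, ambiguous = defaultdict(list), []
    for kp, doms in candidates.items():
        if not doms:
            continue  # keyphrase not covered by the lexicon
        weak = len(doms) > 1 and all(support[d] <= THRESHOLD for d in doms)
        if "Factotum" in doms or weak:
            ambiguous.append(kp)
        else:
            # Assumption: pick the candidate domain with the most keyphrases.
            best = max(sorted(doms), key=lambda d: support[d])
            clusters[best].append(kp)
    return clusters, ambiguous


def resolve(clusters, ambiguous):
    """Steps 4-5: expand each ambiguous keyphrase and attach it to a domain."""
    for kp in ambiguous:
        scores = Counter()
        for lemma in CN5.get(kp, []):
            for domain in WND.get(lemma, set()):
                scores[domain] += 1
        if not scores:
            continue
        ranked = sorted(scores, key=lambda d: (-scores[d], d))
        # Prefer the best-scoring domain already found in Step 3;
        # otherwise add the best-scoring domain as a new cluster.
        chosen = next((d for d in ranked if d in clusters), ranked[0])
        clusters[chosen].append(kp)
    return clusters


def rank(clusters):
    """Relevance = cluster size, with generic Factotum sub-domains deboosted."""
    scores = {d: len(kps) * (DEBOOST if d in GENERIC else 1.0)
              for d, kps in clusters.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))


keyphrases = [
    ("federal government", ["government"]), ("election", ["election"]),
    ("parliament", ["parliament"]), ("minister", ["minister"]),
    ("nature", ["nature"]), ("bank", ["bank"]),
]
clusters, ambiguous = first_pass(keyphrases)
print(rank(resolve(clusters, ambiguous)))
```

In this toy run, “federal government” has no exact entry and is mapped through its noun, while “nature” (Factotum) and “bank” (two weakly supported domains) are deferred to the expansion step, mirroring the fallback and ambiguity rules described in Section 3.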
{| class="wikitable"
! Original categories !! Top domains !! Keyphrases
|-
| rowspan="2" | sci.med
| Medicine || doctor, infectious disease, side effect
|-
| School || course, science, study
|-
| rowspan="2" | sci.space
| Astronomy || solar system, physical universe, satellite
|-
| Transport || spacecraft, shuttle, high-speed collision
|-
| rowspan="2" | sci.crypt
| Computer Science || internet, e-mail, bit
|-
| Law || security, second amendment, criminal
|-
| rowspan="2" | sci.electronics
| Law || article, opinion, information
|-
| Electricity || amateur radio, voltage, wire
|-
| rowspan="2" | talk.religion.misc - alt.atheism - soc.religion.christian
| Religion || christian, atheist, objective morality
|-
| Law || law, evidence, private activities
|-
| rowspan="2" | rec.sport.baseball - rec.sport.hockey
| Sport || game, playoff, second period
|-
| Play || player, baseball
|-
| rowspan="2" | rec.autos - rec.motorcycles
| Transport || car, mph, front wheel
|-
| Law || article, opinion
|-
| rowspan="2" | comp.graphics - comp.os.ms-windows.misc - comp.sys.ibm.pc.hardware - comp.windows.x - comp.sys.mac.hardware
| Computer Science || software, hard drive, anonymous ftp
|-
| Publishing || article, opinion
|-
| rowspan="2" | talk.politics.misc - talk.politics.guns - talk.politics.mideast
| Law || opinion, second amendment
|-
| Transport || road, ways of escape
|-
| rowspan="2" | misc.forsale
| Commerce || best offer, price, excellent condition
|-
| Computer Science || hard drive, floppy drive, email
|}
Table 1: Results of L-KD on the 20 Newsgroup dataset. The original categories are compared with the top domains extracted by the system. Examples of keyphrases are provided for each domain. Perfect matches between the main theme of the original classification and L-KD top domains are in bold.
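
The top-domain comparison summarised in Table 1 can be reproduced with a few lines of code. The sketch below is illustrative only: the `top_domain` values are the first-ranked domains listed in Table 1, whereas the `gold` labels and the short group names are our own assumptions, since the paper does not list the WND label expected for each category group. In particular, `Electronics` and `Politics` are hypothetical expected labels for the two groups that the paper reports as mismatches.

```python
# First-ranked domain per category group, copied from Table 1.
top_domain = {
    "sci.med": "Medicine",
    "sci.space": "Astronomy",
    "sci.crypt": "Computer Science",
    "sci.electronics": "Law",
    "religion": "Religion",
    "rec.sport": "Sport",
    "rec.autos+motorcycles": "Transport",
    "comp": "Computer Science",
    "talk.politics": "Law",
    "misc.forsale": "Commerce",
}

# Assumed gold labels (hypothetical mapping from category groups to WND).
gold = {
    "sci.med": "Medicine",
    "sci.space": "Astronomy",
    "sci.crypt": "Computer Science",
    "sci.electronics": "Electronics",
    "religion": "Religion",
    "rec.sport": "Sport",
    "rec.autos+motorcycles": "Transport",
    "comp": "Computer Science",
    "talk.politics": "Politics",
    "misc.forsale": "Commerce",
}

# A group counts as correct when its top-ranked domain equals the gold label.
matches = [group for group in top_domain if top_domain[group] == gold[group]]
mismatches = sorted(set(top_domain) - set(matches))

print(f"{len(matches)}/{len(top_domain)} groups match")
print("mismatches:", mismatches)
```

Only the first-ranked domain is compared, in line with the decision to restrict the evaluation to the top-ranked domain in the absence of a gold standard for the full ranking.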
5 Use Case: the De Gasperi Project

L-KD has recently been applied to the analysis of the complete corpus of public writings by Alcide De Gasperi (De Gasperi, 2006) in the context of a research project whose goal is to give insight into De Gasperi’s communication strategy with the help of innovative tools for text analysis. We processed the 2,762 documents (around 3,000,000 tokens) in the corpus, published between 1901 and 1954, to analyse which domains appeared in the collection and how they changed over time. The advantage of L-KD is that it can provide both a distant view, by computing aggregated information on the domains, and a close reading of the documents, showing which key-concepts are mapped to which domain. As an example, we report in Fig. 3 the analysis related to two documents, entitled “Rene de la Tour du Pin” and “I cattolici nell’evoluzione sociale”. For each of them, the dendrogram shows the three top domains and the associated key-concepts. The proposed analysis was validated at different granularities by two history scholars, who confirmed the consistency of the L-KD analysis and found correspondences between the top domains and relevant events in De Gasperi’s life.

Figure 3: Dendrogram related to two documents from De Gasperi’s corpus

6 Tool Availability

L-KD is available as a web application<sup>6</sup> through which users can copy and paste a document and run the tool to process it on the fly. This application makes L-KD easily accessible also to users without a technical background.

<sup>6</sup> http://dhlab.fbk.eu:8080/L_KD/

In the application some parameters are fixed, while others can be changed by the user according to his/her needs. As for the fixed parameters, proper names are always discarded so as to exclude them from the list of keyphrases: this setting is justified by the fact that WordNet, and consequently WND, contains few proper nouns<sup>7</sup>, while we want to maximize the mapping. For the same reason, short keyphrases, i.e. single words and multi-token expressions with a maximum length of 4 words<sup>8</sup>, are preferred. On the contrary, the minimum number of occurrences for a word or expression to be considered as a candidate keyphrase and the number of keyphrases to be extracted can be customized by the user. For example, in case of short documents, a low number of keyphrases (e.g. up to 20) can be set together with a minimum frequency of 1 or 2 (in a short text repetitions are less likely to occur). For long documents more keyphrases can be extracted: in this way it would be easy to find clusters covering multiple themes.

<sup>7</sup> Only 9.4% of synsets are tagged as being instances, i.e. proper nouns, in WordNet 3.0 (Abrate et al., 2012).
<sup>8</sup> In WordNet 3.0 only 0.2% of noun synsets have a length greater than 4 words.

7 Conclusions and Future Works

This paper presents L-KD, a tool that extracts keyphrases from text data, clusters them according to the domain and assigns a label to each cluster. The process underlying L-KD is based on the exploitation of external linguistic and knowledge resources, i.e. WordNet Domains and ConceptNet 5. Our tool can process both English and Italian texts of different length and content, from a single news article to an entire book, from single-theme to multi-theme documents.

In the future we will explore different research directions. First of all we want to evaluate the tool on Italian data, even if we have not found a suitable gold standard so far. Resorting to crowd-sourcing may be a viable solution. We expect lower performance than that obtained for English, given that the current mapping between Open Multilingual WordNet and WordNet 3.0 covers only 32.5% of the English synsets: this consequently affects the mapping on the domains of WND. Moreover, the coverage of Italian in ConceptNet 5 is limited. As for the availability of L-KD, we plan to release the tool as a stand-alone module. It will also be integrated in the ALCIDE platform (Moretti et al., 2016) that supports the analysis of large document collections for humanities studies.

Acknowledgments

The research leading to this paper was partially supported by the EU Horizon 2020 Programme via the SIMPATICO Project (H2020-EURO-6-2015, n. 692819). We thank Alessio Palmero Aprosio for his help in the evaluation process.

References

Matteo Abrate, Clara Bacciu, Andrea Marchetti, and Maurizio Tesconi. 2012. WordNet atlas: a web application for visualizing WordNet as a zoomable map. In GWC 2012 6th International Global Wordnet Conference, page 23.
Nikolaos Aletras, Timothy Baldwin, Jey Han Lau, and Mark Stevenson. 2015. Evaluating topic representations for exploring document collections. Journal of the Association for Information Science and Technology.
Alessio Palmero Aprosio and Giovanni Moretti. 2016. Italy goes to Stanford: a collection of CoreNLP modules for Italian. arXiv preprint arXiv:1609.06204.
David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
Francis Bond and Kyonghee Paik. 2012. A survey of WordNets and their licenses. Small, 8(4):5.
David Carmel, Haggai Roitman, and Naama Zwerdling. 2009. Enhancing cluster labeling using Wikipedia. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pages 139–146. ACM.
A. De Gasperi. 2006. Scritti e discorsi politici. In E. Tonezzer, M. Bigaran, and M. Guiotto, editors, Scritti e discorsi politici, volume 1. Il Mulino.
Maria Grineva, Maxim Grinev, and Dmitry Lizorkin. 2009. Extracting key terms from noisy and multitheme documents. In Proceedings of the 18th international conference on World wide web, pages 661–670. ACM.
Kazi Saidul Hasan and Vincent Ng. 2014. Automatic keyphrase extraction: A survey of the state of the art. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1262–1273. Association for Computational Linguistics.
Ioana Hulpus, Conor Hayes, Marcel Karnstedt, and Derek Greene. 2013. Unsupervised graph-based topic labelling using DBpedia. In Proceedings of the sixth ACM international conference on Web search and data mining, pages 465–474. ACM.
Thorsten Joachims. 1996. A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization. Technical report, DTIC Document.
Zhiyuan Liu, Peng Li, Yabin Zheng, and Maosong Sun. 2009. Clustering to find exemplar terms for keyphrase extraction. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, pages 257–266. Association for Computational Linguistics.
Zhiyuan Liu, Wenyi Huang, Yabin Zheng, and Maosong Sun. 2010. Automatic keyphrase extraction via topic decomposition. In Proceedings of the 2010 conference on empirical methods in natural language processing, pages 366–376. Association for Computational Linguistics.
Bernardo Magnini and Gabriela Cavaglia. 2000. Integrating Subject Field Codes into WordNet. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC-2000).
Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, and David McClosky. 2014. The Stanford CoreNLP Natural Language Processing Toolkit. In ACL (System Demonstrations), pages 55–60.
Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai. 2007. Automatic labeling of multinomial topic models. In Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 490–499. ACM.
Giovanni Moretti, Rachele Sprugnoli, and Sara Tonelli. 2015. Digging in the dirt: Extracting keyphrases from texts with KD. In Proceedings of CLiC-it 2016, page 198.
Giovanni Moretti, Rachele Sprugnoli, Stefano Menini, and Sara Tonelli. 2016. ALCIDE: Extracting and visualising content from large document collections to support Humanities studies. Knowledge-Based Systems, 111:100–112.
Alessandro Moschitti and Roberto Basili. 2004. Complex linguistic features for text classification: A comprehensive study. In European Conference on Information Retrieval, pages 181–196. Springer.
Subhabrata Mukherjee and Sachindra Joshi. 2013. Sentiment Aggregation using ConceptNet Ontology. In Proceedings of the Sixth International Joint Conference on Natural Language Processing (IJCNLP), pages 570–578.
Robert Speer and Catherine Havasi. 2013. ConceptNet 5: A large semantic network for relational knowledge. In The People’s Web Meets NLP, pages 161–176. Springer.
Tan Xu and Douglas W Oard. 2011. Wikipedia-based topic clustering for microblogs. Proceedings of the American Society for Information Science and Technology, 48(1):1–10.