      Language Agnostic Dictionary Extraction

               Alfredo Alba1, Anni Coden2, Anna Lisa Gentile1,
               Daniel Gruhl1, Petar Ristoski1, and Steve Welch1

                      1 IBM Research Almaden, CA, US
                     2 IBM Watson Research Lab, NY, US

       aalba@us.ibm.com, anni@us.ibm.com, annalisa.gentile@ibm.com,
       dgruhl@us.ibm.com, petar.ristoski@ibm.com, welchs@us.ibm.com


      Abstract. Ontologies are dynamic artifacts that evolve both in structure
      and content. Keeping them up-to-date is a very expensive and critical
      operation for any application relying on Semantic Web technologies.
      In this paper we focus on evolving the content of an ontology by
      extracting relevant instances of ontological concepts from text. The
      novelty of this work is that we propose a technique which (i) is
      completely language independent, (ii) combines statistical methods with
      human-in-the-loop validation and (iii) exploits Linked Data as a
      bootstrapping source. Experiments on a publicly available parallel
      medical corpus show comparable performance regardless of the chosen
      language.


1   Introduction
In this paper we focus on a computer/human partnership to more rapidly evolve
the content of an ontology through extraction of new relevant concepts from
text. The atomic operation behind this population step is the discovery of all in-
stances that belong to each concept. A plethora of solutions have been proposed
to populate ontologies or extract domain dictionaries from both unstructured
text [4,6,7,10] and semi-structured content [5,12], but the majority of extraction
techniques are language dependent, i.e., they rely on Natural Language
Processing (NLP) operations and tools that are language specific, such as
parsing, part-of-speech tagging, etc. We propose glimpseLD, a novel solution
that builds upon our previous work [2] and revolves around three main aspects:
(i) it is a statistical method which extracts dictionary items based on
context patterns; (ii) it relies on human feedback to automatically tune
scores and thresholds for the extraction patterns; (iii) it uses Linked Data
(when available, even in small quantities) to bootstrap the process. We
demonstrate that our approach (i) performs similarly across all languages and
(ii) that exploiting Linked Data to bootstrap the method maintains this
comparable performance while reducing the number of required
human-in-the-loop iterations by at least half.

2   State of the art
There is a vast amount of literature devoted to ontology population from text,
with a number of established initiatives to foster research on the topic, such
as the Knowledge Base Population task at TAC,3 the TREC Knowledge Base
Acceleration track,4 and the Open Knowledge Extraction (OKE) Challenge,5 to
name a few. In these initiatives, systems are compared on the basis of recog-
nizing individuals belonging to a few selected ontology classes, spanning from
the common Person, Place and Organization,6 to more specific classes such as
Facility, Weapon, Vehicle7 or Drug,8 among others. The evaluation focus is usu-
ally on the specific sub-tasks involved in the process, such as Entity Recognition,
Linking and Typing. Several solutions have been proposed in the literature, span-
ning from general purpose comprehensive approaches [4] to more domain-specific
ones [6,7,10]. The majority of available methods operate (and are assessed)
on the English language, and although specific initiatives aim at encouraging
replicable studies in other languages,9 we argue that truly
language-independent methods for this task are not yet widespread and are
often limited to portability from one language to another [11,9,1]. These
methods often exploit linguistic features whose extraction relies on
language-specific NLP tools, and thus out-of-the-box portability to other
languages is not guaranteed. We propose a human-in-the-loop approach where
the human works in partnership with the statistical method to drive the
semantics of the task effectively and efficiently. Moreover, we use Linked
Data to bootstrap the process. While its use has been widely explored for
many Information Extraction tasks, and specifically for dictionary extraction
[3,8,5], the applicability of such models to multiple languages has not been
extensively investigated.

3    Extracting Dictionaries with glimpse and glimpseLD
Glimpse is a statistical algorithm for dictionary extraction based on SPOT [2].
The input is a large text corpus whose content is relevant to the domain of the
dictionary to be extracted. Besides the corpus, glimpse needs one or more
examples (seeds) of the dictionary items to extract. Starting from these, it
evaluates the contexts (the sets of words surrounding an item) in which the
seeds occur and identifies “good” contexts with which to discover further
terms or phrases in the corpus; these are presented to a human to be accepted
or rejected (full details of the method can be found in [2]). In this work we
empirically show that glimpse is language independent and demonstrate that
using Linked Data to seed the method (glimpseLD) can significantly improve
its performance, allowing it to extract a higher number of terms in fewer
human iterations.
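    To make the context-pattern idea concrete, the following minimal Python
sketch mines fixed-size context windows around known seeds and proposes
tokens that occur inside the same windows. It is an illustration only: the
function names, toy corpus and support threshold are ours, and the actual
scoring and threshold tuning of glimpse are described in [2].

    from collections import Counter

    def extract_contexts(tokens, seeds, window=1):
        # Collect (left, right) context windows around known dictionary items.
        contexts = Counter()
        seed_set = {s.lower() for s in seeds}
        for i, tok in enumerate(tokens):
            if tok.lower() in seed_set:
                left = tuple(tokens[max(0, i - window):i])
                right = tuple(tokens[i + 1:i + 1 + window])
                contexts[(left, right)] += 1
        return contexts

    def propose_candidates(tokens, contexts, min_support=1):
        # Tokens seen inside sufficiently frequent seed contexts become
        # candidates; a human then accepts or rejects each one.
        candidates = Counter()
        for (left, right), support in contexts.items():
            if support < min_support:
                continue
            n = len(left)
            for i in range(n, len(tokens) - len(right)):
                if (tuple(tokens[i - n:i]) == left
                        and tuple(tokens[i + 1:i + 1 + len(right)]) == right):
                    candidates[tokens[i]] += support
        return candidates

    # Toy run: one seed rediscovers itself and proposes "amlodipine".
    corpus = ("patients received irbesartan daily while others "
              "received amlodipine daily in the trial").split()
    ctx = extract_contexts(corpus, seeds=["irbesartan"])
    print(propose_candidates(corpus, ctx).most_common(3))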
    As our dataset we use EMEA10 (European Medicines Agency documents), a
parallel corpus comprising PDF documents from the European Medicines Agency,
related to medical products, and their translations into 22 official
languages of the European Union. The documents have been sentence-aligned
within the OPUS project [13]. We select the English, Spanish, Italian and
German portions of the dataset and use them for the task of constructing a
dictionary of drugs in the various languages.
 3 http://www.nist.gov/tac/2015/KBP
 4 http://trec-kba.org/
 5 https://2016.eswc-conferences.org/eswc-16-open-knowledge-extraction-oke-challenge
 6 http://www.cnts.ua.ac.be/conll2003/ner/
 7 https://www.ldc.upenn.edu/collaborations/past-projects/ace
 8 In SemEval-2013, Task 9: https://www.cs.york.ac.uk/semeval-2013/task9/
 9 Named Entity Recognition and Linking in Italian Tweets: http://www.evalita.it/2016/tasks/neel-it
10 http://opus.lingfil.uu.se/EMEA.php
    From the EMEA corpus, and using a standard drug dataset (RxNorm), we
select drugs that appear in all the corpora with the same name. Despite the
target terms being the same in all languages, their contexts are highly
language dependent. The selected drugs (363 in total) are considered the Gold
Standard (GS), and we start with only one seed in each language (specifically,
the drug irbesartan) to automatically build a drug dictionary in every target
language. The behavior of glimpse is homogeneous in every language, with
similar discovery growth at each iteration. Within 20 iterations glimpse
discovered more than 300 drugs in each language (out of the 363 in the GS,
with average accuracy above 85%11). The average Pearson correlation amongst
the results in all languages is above 0.99. The discovery growth, i.e. the
rate at which new correct terms are added, is a useful indication of
performance in a real scenario, where no gold standard is available but
correctness of extraction is assured by the human-in-the-loop.
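    The measurement just described can be summarized in a few lines of
Python: track the cumulative number of accepted (correct) terms per
human-in-the-loop iteration, then compare the growth curves of two languages
with Pearson correlation. This sketches the evaluation only, not glimpse
itself, and the per-iteration counts are illustrative placeholders rather
than the numbers reported in this paper.

    from itertools import accumulate
    from statistics import mean

    def pearson(x, y):
        # Pearson correlation coefficient between two equal-length series.
        mx, my = mean(x), mean(y)
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    # Hypothetical counts of newly accepted correct terms per iteration;
    # accumulate() turns them into cumulative discovery-growth curves.
    growth_en = list(accumulate([40, 55, 60, 45, 30]))
    growth_es = list(accumulate([38, 57, 58, 47, 28]))
    print(pearson(growth_en, growth_es))  # near 1.0 when curves track closely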
    We repeat the experiment with glimpseLD. We build a truly multi-language
GS by crawling drug instances from Linked Data, making sure to cover the same
drugs in all languages. In particular, we use two of the biggest cross-domain
LOD datasets, DBpedia12 and Wikidata.13 We select all the entities of type
dbo:Drug14 from DBpedia and all the entities of type wikidata:Q11173 from
Wikidata. For all of the selected entities, we retrieve the corresponding
labels in English, German, Spanish and Italian and consider this our gold
standard dictionary. We then select 20% as seeds and measure the performance
of recreating the remaining 80% using glimpseLD. We perform 5-fold cross
validation without repetition, randomly selecting the 20% of seeds in each
fold (making sure that the seeds represent the same drugs for all 4
languages), to test whether the choice of initial seeds impacts the results.
Fig. 1b and 1c show that the discovery growth is comparable for all
languages, with correlation always above 0.98.
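    The paper does not spell out the retrieval pipeline for this gold
standard; as one possible illustration, the DBpedia side could be gathered
with a single SPARQL query, sketched below using the SPARQLWrapper Python
library against the public DBpedia endpoint. The Wikidata case is analogous,
selecting items that are instances of Q11173.

    from SPARQLWrapper import SPARQLWrapper, JSON

    QUERY = """
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?drug ?label (LANG(?label) AS ?lang) WHERE {
      ?drug a dbo:Drug ;
            rdfs:label ?label .
      FILTER (LANG(?label) IN ("en", "de", "es", "it"))
    }
    """

    def fetch_drug_labels(endpoint="https://dbpedia.org/sparql"):
        # Return a mapping: language code -> set of drug surface forms.
        sparql = SPARQLWrapper(endpoint)
        sparql.setQuery(QUERY)
        sparql.setReturnFormat(JSON)
        bindings = sparql.query().convert()["results"]["bindings"]
        labels = {}
        for row in bindings:
            labels.setdefault(row["lang"]["value"], set()).add(
                row["label"]["value"])
        return labels

    labels = fetch_drug_labels()
    print({lang: len(terms) for lang, terms in labels.items()})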
4    Conclusions and future work
This paper proposes a language-independent solution to discover new instances
for populating ontology concepts. Our algorithm is iterative and purely
statistical, hence does not require any feature extraction, which can be
difficult and expensive across different languages and text genres. It
leverages Linked Data to seed the process, and integrates human feedback to
improve accuracy and control concept drift at every iteration cycle. We show
extremely similar discovery growth when extracting drug names in four
languages over parallel corpora of medical text.
11 Note that, as irrelevant terms are manually rejected in a human-in-the-loop approach, it does
   not make sense to calculate Precision: retained terms are all correct by design.
12 http://dbpedia.org
13 https://www.wikidata.org/
14 dbo: http://dbpedia.org/ontology/, wikidata: http://www.wikidata.org/entity/
Fig. 1: Comparison of discovery growth for glimpse and glimpseLD across different languages on
the EMEA dataset, using one manual seed (Fig. 1a, r = 0.998), seeds from DBpedia (Fig. 1b,
r = 0.997) and seeds from Wikidata (Fig. 1c, r = 0.985). Pearson correlation (r) amongst results
from different languages is reported.



References
 1. A. Ben Abacha, M. F. M. Chowdhury, A. Karanasiou, Y. Mrabet, A. Lavelli,
    and P. Zweigenbaum. Text mining for pharmacovigilance: Using machine learning
    for drug name recognition and drug-drug interaction extraction and classification.
    Journal of Biomedical Informatics, 58:122–132, 2015.
 2. A. Coden, D. Gruhl, N. Lewis, M. Tanenblatt, and J. Terdiman. SPOT the drug!
    An unsupervised pattern matching method to extract drug names from very large
    clinical corpora. HISB 2012, pages 33–39, 2012.
 3. J. Dolby, A. Fokoue, A. Kalyanpur, E. Schonberg, and K. Srinivas. Extracting
    enterprise vocabularies using linked open data. ISWC 2009, pages 779–794, 2009.
 4. A. Gangemi, V. Presutti, D. Reforgiato Recupero, A. G. Nuzzolese, F. Draicchio,
    and M. Mongiovì. Semantic Web machine reading with FRED. Semantic Web,
    (Preprint):1–21, 2016.
 5. A. L. Gentile, Z. Zhang, I. Augenstein, and F. Ciravegna. Unsupervised wrapper
    induction using linked data. In K-CAP’13, pages 41–48. ACM, 2013.
 6. K. Lee, A. Qadir, S. A. Hasan, V. Datla, A. Prakash, J. Liu, and O. Farri. Adverse
    drug event detection in tweets with semi-supervised convolutional neural networks.
    In WWW’17, pages 705–714, 2017.
 7. S. Liu, B. Tang, Q. Chen, and X. Wang. Effects of semantic features on machine
    learning-based drug name recognition systems: Word embeddings vs. Manually
    constructed dictionaries. Information (Switzerland), 6(4):848–865, 2015.
 8. P. Mitzias, M. Riga, E. Kontopoulos, T. G. Stavropoulos, S. Andreadis, G. Med-
    itskos, and I. Kompatsiaris. User-driven ontology population from linked data
    sources. In KESW 2016, pages 31–41. Springer, 2016.
 9. A. Pappu, R. Blanco, Y. Mehdad, A. Stent, and K. Thadani. Lightweight multi-
    lingual entity extraction and linking. In WSDM ’17, pages 365–374. ACM, 2017.
10. N. Pröllochs, S. Feuerriegel, and D. Neumann. Generating domain-specific
    dictionaries using Bayesian learning. In ECIS 2015, 2015.
11. M. Sahlgren and J. Karlgren. Automatic bilingual lexicon acquisition using random
    indexing of parallel corpora. Nat. Lang. Eng., 11(3):327–341, Sept. 2005.
12. H.-J. Song, S.-B. Park, and S.-Y. Park. An automatic ontology population with
    a machine learning technique from semi-structured documents. In ICIA’09, pages
    534–539. IEEE, 2009.
13. J. Tiedemann. News from OPUS - a collection of multilingual parallel corpora
    with tools and interfaces. In RANLP 2009, 2009.