=Paper=
{{Paper
|id=Vol-1690/paper58
|storemode=property
|title=A Web Application to Search a Large Repository of Taxonomic Relations from the Web
|pdfUrl=https://ceur-ws.org/Vol-1690/paper58.pdf
|volume=Vol-1690
|authors=Stefano Faralli,Christian Bizer,Kai Eckert,Robert Meusel,Simone Paolo Ponzetto
|dblpUrl=https://dblp.org/rec/conf/semweb/FaralliB0MP16
}}
==A Web Application to Search a Large Repository of Taxonomic Relations from the Web==
Stefano Faralli¹, Christian Bizer¹, Kai Eckert², Robert Meusel¹, Simone Paolo Ponzetto¹

¹ University of Mannheim, Germany, {stefano,chris,robert,simone}@informatik.uni-mannheim.de
² Stuttgart Media University, Germany, eckert@hdm-stuttgart.de

Abstract. Taxonomic relations (also known as isa or hypernymy relations) represent one of the key building blocks of knowledge bases and foundational ontologies, and provide a fundamental piece of information for many text understanding applications. Despite the availability of very large knowledge bases, however, some Natural Language Processing and Semantic Web applications (e.g., Ontology Learning) still require automatic isa relation harvesting techniques to cope with the coverage of domain-specific and long-tail terms. In this paper, we present a web application to directly query a very large repository of isa relations automatically extracted from the Common Crawl (the largest publicly available crawl of the Web). Our resource can also be downloaded for research purposes and accessed programmatically (we additionally release a Java application programming interface for this purpose).

Keywords: Hearst patterns, hypernym extraction, information extraction and Natural Language Processing techniques for the Semantic Web

===1 Introduction===

Taxonomic relations play an important role when interpreting data and text that are not already semantically annotated. In fact, inferring the types of entities (be they named entities in text, or entities in semi-structured data) is a crucial step towards understanding the data. Paulheim and Fürnkranz [5] have shown that adding precise types of instances can lead to significantly improved performance in many data mining tasks. Moreover, when performing data integration, for instance of a large collection of tabular datasets into a knowledge base, understanding whether the entities in a table are, for example, cities, states, or mountains is a very important step towards a high-quality result [7].

While there are quite a few named entity recognition and disambiguation tools that serve this purpose and exploit knowledge resources such as Wikipedia, DBpedia, or Freebase, a common problem is dealing with the long tail of entities that are not contained in such knowledge bases. These resources have no problem covering, for instance, major cities (“New York is a city”) or celebrities (“Madonna is a singer”), but show limitations with respect to small villages and lesser-known people. Moreover, many common benchmarks for entity linking are also tailored towards popular entities [2]. However, the potential of real-world semantic applications can only be unlocked if they are capable of dealing with the most prominent entities as well as the long tail. Hence the need to extend existing knowledge bases with hypernymy relations that also cover long-tail entities.

In this paper, we present a web application to query an open knowledge repository consisting of around 400 million isa relations, in the form of tuples, which have been automatically extracted from the Common Crawl (https://commoncrawl.org), the largest publicly available crawl of the Web. Our resource is built by combining traditional Hearst-like lexico-syntactic patterns [3] with filtering, duplicate removal and tuple normalization techniques. These methods are applied at web scale using the extraction framework of the WebDataCommons project (http://webdatacommons.org/framework/). Each tuple comes with a rich set of attributes, such as the set of patterns matching the pair, the pay-level domains on which the patterns were matched, and the number of occurrences.
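To give a concrete flavor of this kind of extraction, the following is a minimal, self-contained Java sketch of Hearst-style pattern matching over plain sentences. It is an illustrative simplification and not the actual extraction framework: the production pipeline described in [8] additionally performs noun-phrase chunking, filtering, normalization and provenance tracking at web scale, whereas here noun phrases are crudely approximated by a regular expression and only two classic patterns are implemented.

<syntaxhighlight lang="java">
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy Hearst-style extractor: a naive stand-in for the real pipeline [8].
public class HearstSketch {

    // Crude noun-phrase approximation: at most two modifier words plus a head.
    private static final String NP = "((?:\\w+ ){0,2}\\w+)";

    private static final Pattern ISA =
        Pattern.compile(NP + " is an? " + NP);   // "X is a/an Y"
    private static final Pattern SUCH_AS =
        Pattern.compile(NP + " such as " + NP);  // "Y such as X" (roles reversed)

    public static void main(String[] args) {
        List<String> sentences = List.of(
            "Darth Vader is a Star Wars character.",
            "San Sisto is a beautiful borgo.",
            "villains such as Darth Vader.");

        // Count how often each (hyponym, hypernym) pair is observed.
        Map<String, Integer> counts = new HashMap<>();
        for (String s : sentences) {
            Matcher m = ISA.matcher(s);
            while (m.find()) {                   // group 1 isa group 2
                record(counts, m.group(1), m.group(2));
            }
            m = SUCH_AS.matcher(s);
            while (m.find()) {                   // group 2 isa group 1
                record(counts, m.group(2), m.group(1));
            }
        }
        counts.forEach((pair, n) -> System.out.println(pair + "  x" + n));
    }

    private static void record(Map<String, Integer> counts,
                               String hyponym, String hypernym) {
        counts.merge(hyponym.toLowerCase() + " isa " + hypernym.toLowerCase(),
                     1, Integer::sum);
    }
}
</syntaxhighlight>

Running the sketch on the three example strings yields tuples such as “darth vader” isa “star wars character”, which mirrors the kind of entries served by the web application, albeit without the attribute set described above.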
In this paper, we present the web application which lets users easily interact with the knowledge repository. A detailed description of our resource construction method and of programmatic access can instead be found in [8].

===2 A Web Application for the WebIsA Database===

The application is designed as a typical client-server web application. The server-side implementation includes our Java API [8] to query a MongoDB (https://www.mongodb.com) server, which provides access to an instance of our repository. On the client side, the user is guided by a form-based web page for formulating queries (Figure 1). After submitting a query, the results can be browsed in a tabular format (Figure 2).

Fig. 1: The main user interface used to query our WebIsA database; for instance, all definitions of the term “darth vader” occurring at least twice.

Fig. 2: The interface to browse the result tuples retrieved for a user query.

For each tuple in the result set, the table provides the syntactic decomposition of the two noun phrases involved in the isa relation into pre-modifiers, head and post-modifiers [6], as well as the frequency of occurrence of the relation in the Common Crawl. The user can also access a detailed view with additional meta-data, namely the patterns matching the relation, the textual contexts of the matches, and the pay-level domains indicating the provenance in the corpus (Figure 3). Finally, in order to showcase our application and provide users with some usage examples, we include a few pre-compiled queries, i.e., relations involving instances like “Katy Perry” or “Darth Vader”, as well as classes like “Animals”, “Plants”, and so on (Figure 4).

Fig. 3: The interface to browse the additional meta-data of a selected tuple; e.g., “darth vader” isa “star wars character”.

Fig. 4: The set of pre-compiled query examples.
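Besides the form-based interface, a local copy of the repository can be queried programmatically. The hypothetical sketch below shows how such a query could look using the plain MongoDB Java driver; the database, collection and field names (webisadb, tuples, instance, class, frequency) are illustrative assumptions only, and the actual schema and the released Java API are documented in [8] and on the download page below.

<syntaxhighlight lang="java">
import static com.mongodb.client.model.Filters.and;
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Filters.gte;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoCursor;
import org.bson.Document;

// Hypothetical query against a local copy of the WebIsA repository.
// All database, collection and field names below are assumptions made for
// illustration; see [8] and the official download page for the real schema
// and the released Java API.
public class WebIsAQuerySketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> tuples =
                client.getDatabase("webisadb").getCollection("tuples"); // assumed names

            // All hypernyms of "darth vader" observed at least twice,
            // mirroring the example query of Figure 1.
            try (MongoCursor<Document> cursor = tuples
                    .find(and(eq("instance", "darth vader"), gte("frequency", 2)))
                    .iterator()) {
                while (cursor.hasNext()) {
                    Document t = cursor.next();
                    System.out.printf("%s isa %s (%d occurrences)%n",
                        t.getString("instance"),
                        t.getString("class"),
                        t.getInteger("frequency"));
                }
            }
        }
    }
}
</syntaxhighlight>

The filter in this sketch corresponds to the form-based query shown in Figure 1: the web application issues analogous queries through its server-side Java API.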
Note that our system is able to retrieve dozens of tuples even for less popular concepts. As an example, consider the small Italian town of San Sisto, located near the more famous Todi. This small medieval town is not even found within large knowledge bases like Wikipedia (and, accordingly, YAGO or DBpedia). Thanks to our knowledge base, however, we are able to provide the user with useful definitional information nuggets such as the fact that “San Sisto” isa “beautiful borgo” (from the pay-level domain usfreeads.com, as extracted from the sentence “Sisto is a beautiful borgo very close to the fascinating medieval town of Todi.”).

===3 Conclusions===

In this paper, we have presented a web application to directly query a publicly available knowledge repository containing millions of isa relations automatically extracted from the Common Crawl. Our resource is, to the best of our knowledge, the largest collection of hypernymy relations created from textual resources: accordingly, our web application is meant to provide an easy-to-use interface for rapid exploration of, and facilitated access to, this very large knowledge resource.

We make our resource freely available to share the wealth of knowledge contained therein, as well as to foster the development of novel knowledge-rich applications that work with the largest and richest textual resource of our time, namely the Web. We believe that our WebIsA database represents a first step towards more complex semantic resources, such as full-fledged web-scale taxonomies. In fact, our resource was already successfully used as part of a SemEval competition on taxonomy induction [1], and helped us achieve competitive performance [4] on challenging benchmarks.

===Online Web Application and Downloads===

Our web interface is available at http://webisadb.webdatacommons.org/. All the resources described in this paper are freely available under a CC BY-NC-SA 3.0 license at http://webdatacommons.org/isadb/, where we additionally provide a Java application programming interface for programmatic access from client applications, as well as the source code of the extraction framework.

===Acknowledgements===

This work was partially funded by the Junior-professor funding programme of the Ministry of Science, Research and the Arts of the state of Baden-Württemberg, Germany (project “Deep semantic models for high-end NLP applications”). Part of the computational resources were provided by an Amazon AWS in Education Grant award.

===References===

1. Bordea, G., Lefever, E., Buitelaar, P.: SemEval-2016 task 13: Taxonomy extraction evaluation (TExEval-2). In: Proc. SemEval, pp. 1081–1091 (2016)
2. van Erp, M., Mendes, P., Paulheim, H., Ilievski, F., Plu, J., Rizzo, G., Waitelonis, J.: Evaluating entity linking: An analysis of current benchmark datasets and a roadmap for doing a better job. In: Proc. LREC (2016)
3. Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Proc. COLING, pp. 539–545 (1992)
4. Panchenko, A., Faralli, S., Ruppert, E., Remus, S., Naets, H., Fairon, C., Ponzetto, S.P., Biemann, C.: TAXI at SemEval-2016 Task 13: A taxonomy induction method based on lexico-syntactic patterns, substrings and focused crawling. In: Proc. SemEval, pp. 1320–1327 (2016)
5. Paulheim, H., Fürnkranz, J.: Unsupervised generation of data mining features from Linked Open Data. In: Proc. WIMS, pp. 31:1–31:12 (2012)
6. Ponzetto, S.P., Strube, M.: Taxonomy induction based on a collaboratively built knowledge repository. Artificial Intelligence 175(9-10), 1737–1756 (2011)
7. Ritze, D., Lehmberg, O., Bizer, C.: Matching HTML tables to DBpedia. In: Proc. WIMS, pp. 10:1–10:6 (2015)
8. Seitner, J., Bizer, C., Eckert, K., Faralli, S., Meusel, R., Paulheim, H., Ponzetto, S.P.: A large database of hypernymy relations extracted from the web. In: Proc. LREC (2016)