
NLP & DBpedia: An Upward Knowledge Acquisition Spiral

Sebastian Hellmann

Agata Filipowska

1 3

Caroline Barriere

Pablo N. Mendes

Dimitris Kontokostas

0 Centre de Recherche Informatique de Montreal, Montreal, Canada
1 Instytut Informatyki Gospodarczej Sp. z o.o., ul. Rubiez 12G/6, 61-612 Poznan, Poland
2 Kno.e.sis Center, Wright State University, USA
3 Poznan University of Economics, Faculty of Informatics and Electronic Economy, Department of Information Systems, Al. Niepodleglosci 10, 61-875 Poznan, Poland
4 University of Leipzig, Institute of Computer Science, AKSW Group, Augustusplatz 10, D-04009 Leipzig, Germany

Recently, the DBpedia community has experienced an immense increase in activity, and we believe that the time has come to explore the connection between DBpedia and Natural Language Processing (NLP) in unprecedented depth. DBpedia has a long-standing tradition of providing useful data as well as a commitment to reliable Semantic Web technologies and living best practices. As the extraction of Wikipedia's infoboxes by DBpedia matures, we can shift our focus to new challenges, such as extracting information from unstructured article text and becoming a testing ground for multilingual NLP methods. DBpedia has the potential to create an upward knowledge acquisition spiral, as it provides a small amount of general knowledge that makes it possible to process text, derive more knowledge, validate this knowledge and improve text processing methods. The goal of this workshop was to present existing research, systems and resources, but also to allow discussion about different points of convergence and divergence of the NLP and DBpedia communities, with a special focus on the challenges that lie ahead. We would like to take part in the debate on how to use DBpedia for NLP and NLP for DBpedia.

Keywords: DBpedia, Natural Language Processing, RDF

Introduction

Communities interested in Natural Language Processing (NLP) and in the Semantic Web, in particular DBpedia, come together to explore different ways of collaborating and helping each other towards a common goal of understanding and representing information.

Resources such as DBpedia are a step towards a solution to the knowledge acquisition bottleneck, so often mentioned in the earlier days of NLP [ 10 ]. A prerequisite of text processing and understanding is the availability of knowledge about words, concepts and ways of expressing information. But to acquire such knowledge, we must either process text automatically or engage in costly and error-prone manual knowledge engineering.

Where there was formerly a chicken-and-egg problem with a serious bootstrapping issue, we now have structured data in DBpedia, which is readily available to turn the bottleneck into an upward knowledge acquisition spiral: a small amount of general knowledge makes it possible to process text, create more knowledge, validate this knowledge and improve text processing for more acquisition (and so on).

Recent years have seen a major change, mostly through crowd-sourcing for the construction of the largest encyclopaedic resource, Wikipedia. Although Wikipedia was at first mainly made of unstructured data (paragraphs), the addition of infoboxes and the expansion of interest towards the Semantic Web have led to DBpedia, one of the largest openly shared structured resources available today.

However, any resource that is neither curated nor scrutinized by experts will be prone to noise, and that becomes a new and different challenge for NLP. Also, any resource, even one as large as DBpedia, is incomplete. So far, mainly the infoboxes, which are already semi-structured, are used to build the RDF repository. But even then, Aprosio et al. [ 1 ] (this volume) mention that more than 50% of Wikipedia articles do not include an infobox. So if the article text is analysed, the spiral can turn further, using DBpedia as input for the NLP process and then creating more RDF triples to add and integrate into DBpedia [ 12 ].
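One turn of the spiral can be illustrated with a minimal sketch, assuming a toy setting in which known facts are plain (subject, object) string pairs and sentences are short English strings. The functions `learn_patterns`, `apply_patterns` and `spiral`, as well as all names in the example data, are hypothetical and are not taken from any of the workshop contributions. The validation step mentioned in the text is deliberately left out, so the sketch can drift if a learned pattern is too general.

```python
import re

# Approximation of an entity mention: a run of capitalised words.
NAME = r"([A-Z]\w*(?: [A-Z]\w*)*)"


def learn_patterns(known_facts, sentences):
    """Collect the text found between a known subject and its object."""
    patterns = set()
    for subject, obj in known_facts:
        regex = re.compile(re.escape(subject) + r"(.+?)" + re.escape(obj))
        for sentence in sentences:
            match = regex.search(sentence)
            if match:
                patterns.add(match.group(1))
    return patterns


def apply_patterns(patterns, sentences):
    """Use the learned connecting text to propose new (subject, object) pairs."""
    found = set()
    for middle in patterns:
        regex = re.compile(NAME + re.escape(middle) + NAME)
        for sentence in sentences:
            for match in regex.finditer(sentence):
                found.add((match.group(1), match.group(2)))
    return found


def spiral(seed_facts, sentences, rounds=2):
    """Alternate between learning patterns and extracting facts."""
    known = set(seed_facts)
    for _ in range(rounds):
        known |= apply_patterns(learn_patterns(known, sentences), sentences)
    return known


# Invented example data: one seed fact and four sentences.
SEED = {("Alice Ames", "Springfield")}
SENTENCES = [
    "Alice Ames was born in Springfield.",
    "Bob Burns was born in Shelbyville.",
    "Bob Burns, a native of Shelbyville, moved away.",
    "Dana Diaz, a native of Ogdenville, stayed.",
]

if __name__ == "__main__":
    for fact in sorted(spiral(SEED, SENTENCES, rounds=2)):
        print(fact)
```

In this toy run, the first round learns the phrase " was born in " from the seed and finds the second fact; the second round learns ", a native of " from that new fact and finds a third one. Each acquired fact thus helps to process more text.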

This workshop's aim lies right in the knowledge acquisition spiral: bringing together researchers in both areas to see how NLP can benefit DBpedia and how DBpedia can benefit NLP. The contributions to the workshop highlight multiple facets of this duality. In the remainder of this article, we discuss the contributions to the NLP&DBpedia workshop. Our main interest, however, is in the challenges that readers can expect to remain unresolved, that is, the many interesting underlying issues brought forward by these articles. Another goal of this workshop was to present existing research, systems and resources to allow discussion about different points of convergence and divergence of the NLP and DBpedia communities. It is also interesting to illustrate cases where both communities actually tackle very similar problems with different approaches.

Knowledge acquisition and structuring

As we look at NLP and DBpedia, we see that NLP requires knowledge about words, not only about concepts. Obviously, the notion of labels exists in DBpedia, but there is more to language than labels. Should this lexical information be represented in the same way as conceptual information?

The separation between lexical, conceptual, terminological, encyclopaedic, and other kinds of knowledge has been debated for years. Can a single schema accommodate all types of knowledge? Lexical approaches usually start from words, going from a word to all its senses, whereas terminological approaches sometimes start from concepts and define all the words that express each concept. If DBpedia is more concept-based, we can then wonder how lexical information would be attached to it or, more generally, what place lexical knowledge has within the Semantic Web.

6 http://lemurproject.org/clueweb09/

[ 26 ] (this volume) present a lemon lexicon for DBpedia and discuss different issues in the lexicalization of conceptual structures.

The BabelNet [ 15 ] resource, resulting from a merge of WordNet [ 9 ] (a widely used lexical resource in NLP) and Wikipedia, is an example of a mixed-level representation in which lexical, conceptual and encyclopaedic knowledge is combined. BabelNet is used in the work of [ 8 ] (this volume) for the task of QALD (Question Answering over Linked Data), as we will see in the next section. Also, [ 27 ] (this volume) describe their own representation, SAR-Graphs (Semantically Associated Relations Graphs), which expresses not only lexical knowledge but also sentence-based knowledge that is useful for verbalizing simple predicates as well as combined predicates (child of child, for example). These three contributions stimulate a debate on the granularity of the representation of any language resource. Such a debate is present in corpus studies, where experts study the value of not only terms but also phrases (phraseology) in the understanding of language use [ 24 ].

NLP tasks and applications

Although different tasks are mentioned in our workshop's contributions, three of them are more prominent: Named Entity Recognition (NER), Relation Extraction, and Question Answering over Linked Data (QALD).

Named Entity Recognition

Named Entity Recognition is defined as the task of assigning a class to entities found in a text, such as person, location, organization, date, etc. NER has been a well-recognized task in the NLP community since the beginning of the Message Understanding Conferences (MUC) in 1987 (see [ 11 ] for a good overview of information extraction and the early MUC conferences). Although not called as such at the time, early work on information extraction looked at text to find Who did What, When and How, discovering entities such as places, people and dates. Extracted entities were not necessarily typed or classified, but as information extraction templates were used, such types were implicitly given by the roles the entities filled (Agent, Place, Date).

Later on, researchers such as Sekine [ 21 ] defined a hierarchical schema of classes for the NER task. The more fine-grained the classes are, however, the more difficult it is to obtain (or even measure) classification results. Obviously, integration and comparison of these hierarchies can be highly complex if no reference hierarchy is agreed upon. One such reference hierarchy is the recently created NERD ontology [ 20 ], which, however, contains only 84 types⁷ and is coarse-grained compared to the over 500 DBpedia Ontology classes⁸ that are used in [ 6 ] (this volume).

As mentioned in [ 23 ] (this volume), Named Entity Disambiguation (NED) is a further step towards identifying not only that an entity is a Person, but who this person actually is, by establishing a link to a more specific reference id or URI in a knowledge base. New names are given to the NED or NERD task, such as entity linking and "wikifiers" [ 6 ] (this volume), and the list of emerging tools that belong to this class of wikifiers is long and growing steadily: Zemanta, OpenCalais, Ontos, Evri, Extractiv, Alchemy API and many more⁹.
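The difference between assigning a class (NER) and linking to a URI (NED) can be made concrete with a minimal sketch. It assumes a tiny hand-made candidate index and a very simple disambiguation heuristic (overlap between the words of the text and a few context words per candidate). The index `CANDIDATES`, the function `link_entities` and the context words are all hypothetical and do not describe how any of the tools listed above work.

```python
import re

# Hypothetical toy index: surface form -> candidate entities. Each candidate
# has a DBpedia-style URI, a coarse type, and a few hand-picked context words.
CANDIDATES = {
    "paris": [
        {"uri": "http://dbpedia.org/resource/Paris", "type": "Location",
         "context": {"france", "city", "capital", "seine"}},
        {"uri": "http://dbpedia.org/resource/Paris_Hilton", "type": "Person",
         "context": {"hotel", "heiress", "celebrity", "television"}},
    ],
    "leipzig": [
        {"uri": "http://dbpedia.org/resource/Leipzig", "type": "Location",
         "context": {"germany", "city", "saxony", "university"}},
    ],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def link_entities(text):
    """Return (surface form, type, URI) for every known mention in the text.

    Recognition: a mention is any token that appears in CANDIDATES.
    Disambiguation: choose the candidate whose context words overlap most
    with the tokens of the text (ties go to the first candidate listed).
    """
    tokens = tokenize(text)
    bag = set(tokens)
    links = []
    for token in tokens:
        if token not in CANDIDATES:
            continue
        best = max(CANDIDATES[token],
                   key=lambda candidate: len(candidate["context"] & bag))
        links.append((token, best["type"], best["uri"]))
    return links


if __name__ == "__main__":
    print(link_entities("Paris is the capital of France."))
    print(link_entities("Paris starred in a television show about a hotel."))
```

The type alone would answer the NER question, while the URI answers the NED question of which entity is meant. Real systems replace both the index and the overlap heuristic with much richer resources and models.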

Wikipedia (and therefore DBpedia) is limited to encyclopaedic knowledge, but often terminological knowledge (how different terms describe different domain-specific concepts) as well as lexical knowledge (common words) are available for interlinking with text. The task then resembles Word-Sense Disambiguation (WSD), i.e. taking any word in a text and connecting it to the appropriate URI. In [ 8 ] (this volume), both tasks (NED and WSD) are tackled using BabelNet.

Relation extraction

The task of relation extraction is sometimes seen as a step following that of NER. After entities are extracted, it would be interesting to see how they are related. But sometimes a more "template-like" strategy, as suggested in early information extraction, is used. For example, a system would look for "merger" relations between companies to find out which companies merged. In such a case, the relation is known in advance, and we look in the text for both the relation and the participants in that relation.
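A template-like strategy for the "merger" example can be sketched as follows, assuming that company names can be approximated as runs of capitalised words and that a handful of fixed English phrasings signal the relation. The patterns, the predicate label `mergedWith` and the company names are invented for illustration. A capitalised word that happens to precede a company name (for instance at the start of a sentence) would be wrongly included in the name, which is one reason why real systems rely on NER or parsing rather than on capitalisation.

```python
import re

# Rough approximation of a company name: one or more capitalised words.
NAME = r"([A-Z][\w&]*(?: [A-Z][\w&]*)*)"

# Hypothetical surface patterns for a relation that is known in advance.
MERGER_PATTERNS = [
    re.compile(NAME + r" (?:has merged|merged|will merge) with " + NAME),
    re.compile(r"[Tt]he merger of " + NAME + r" and " + NAME),
    re.compile(r"[Tt]he merger between " + NAME + r" and " + NAME),
]


def extract_mergers(text):
    """Return (company, 'mergedWith', company) triples found in the text."""
    facts = []
    for pattern in MERGER_PATTERNS:
        for match in pattern.finditer(text):
            facts.append((match.group(1), "mergedWith", match.group(2)))
    return facts


if __name__ == "__main__":
    print(extract_mergers("Acme Corp merged with Globex last year."))
    print(extract_mergers("The merger of Initech and Umbrella Holdings was announced."))
```

Because the relation is fixed, each pattern fills the same two slots of a template, which mirrors the role-based templates of early information extraction.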

Different types of relations have been investigated over the years, and as NLP and DBpedia come closer, relations found in DBpedia tend to be used. [ 16 ] (this volume) focuses on ten different relations found in DBpedia and identifies such relations in text through lexical extraction rules developed for this purpose. The work of [ 1 ] (this volume) focuses on seven different properties found in DBpedia. By properties, they mean relations for which the subject is most likely a named entity, but the object could be a literal, such as the property populationTotal. The line between properties and relations is fuzzy (for example, both contributions mentioned above use birthDate as a relation to extract from text), which could give rise to an interesting discussion. The work of [ 27 ] (this volume) does not target any specific relation and is mostly about the development of a representational schema (as mentioned before) for the English expression of relations.

The explicit expression of relations in text has been a topic of interest in the NLP community for a while. Different methods, either statistical [ 25 ] or pattern-based [ 2 ], have been developed and experimented with. This is an interesting place for NLP and the Semantic Web to meet, as both communities are interested in finding links between concepts and extracting facts.

7 accessed Oct. 10th, 2013 http://nerd.eurecom.fr/ontology

8 An up-to-date version can be downloaded from http://mappings.dbpedia.org/server/ontology/dbpedia.owl

9 http://en.wikipedia.org/wiki/Knowledge_extraction#Tools contains an up-to-date overview

Question Answering over Linked Data

The tasks of Information Retrieval and Question Answering, within the NLP community, provided some of the early attempts at a more systematic approach to making the field of NLP grow. Those tasks encouraged the development of challenges and competitions with common data (TREC, [ 22 ]), which we discuss in the next section. The more recent task of Question Answering over Linked Data¹⁰ is a very interesting one, promoting communication and shared interest between the NLP and Semantic Web communities, and also providing some early attempts within the Semantic Web community at sharing data and evaluation standards.

Three contributions look into QALD. The work of [ 8 ] (this volume) addresses the task of QALD with a particular strategy that involves NED and word-sense disambiguation, as mentioned above. The authors of [ 3 ] (this volume) not only tackle the QALD task but go further into the study of inconsistency detection when gathering knowledge to answer questions. They look into the English, German, French and Italian chapters of DBpedia and try to detect inconsistencies and supporting evidence among the different answers. In [ 26 ] (this volume), the task of QALD is not performed in itself, but it is used as an extrinsic evaluation of the coverage of the lemon lexicon, with the authors reporting that the verbalizations found in the lexicon cover many of the questions.

Resources

As most workshop contributions combine some techniques from NLP with the Semantic Web, they discuss different resources that would be useful to the community. We don't want to reinvent the wheel. Even though alternative Semantic Web resources exist, such as Yago (http://www.mpi-inf.mpg.de/yago-naga/yago/) and Freebase (http://www.freebase.com), this workshop focuses on DBpedia, which is therefore the Semantic Web resource most referred to in the different contributions.

On the NLP side, many frameworks and typical resources exist as well. WordNet (http://wordnet.princeton.edu/), for example, has been a much-used resource in the community for English. More recently, BabelNet (http://babelnet.org), mentioned earlier, has been developed to merge Wikipedia and WordNet. Also, GATE, an open source development framework (http://gate.ac.uk), is used in [ 6 ] (this volume).

We can think that the primary resource for NLP is text, but which text? There has been work in NLP on different types of texts, from news articles to scientific articles, to blogs, to web data. In the present day, textual content is abundant, and which text should be analysed for which purpose is a pertinent question. In fact, if we see NLP for DBpedia as being at the service of expanding DBpedia, then the chosen text should be informative, factual and accurate. As we saw above, mining Wikipedia for more information is an interesting direction, but it is not the only one. We also saw (with NELL) that a large crawled Web corpus is a possibility, as it brings large coverage, but it can also bring noise.

10 The first challenge started in 2011, and information can be found at http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/

Different ways of filtering noise exist, either by trying to evaluate the source of information (trust), or by looking at how consistent or inconsistent different pieces of information are, considering redundancy and conflicts. In [ 3 ] (this volume), the general problem of inconsistent information is tackled.
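The idea of using redundancy and conflicts as a noise filter can be sketched in a few lines, assuming that each observation is a (source, subject, property, value) tuple and that every property is expected to have a single value per subject. The function `assess_facts`, the support threshold and the example data are hypothetical; this is a simple vote count and not the argumentation-based approach of [ 3 ].

```python
from collections import Counter, defaultdict


def assess_facts(observations, min_support=2):
    """Summarise agreement between sources for each (subject, property).

    A value is accepted only if at least `min_support` sources state it and
    no other value has the same number of supporters. Any disagreement is
    flagged as a conflict, even when one value is accepted.
    """
    votes = defaultdict(Counter)
    seen = set()
    for observation in observations:
        if observation in seen:
            continue  # a source repeating itself adds no evidence
        seen.add(observation)
        _source, subject, prop, value = observation
        votes[(subject, prop)][value] += 1

    report = {}
    for key, counter in votes.items():
        ranked = counter.most_common()
        top_value, support = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == support
        accepted = support >= min_support and not tied
        report[key] = {
            "value": top_value if accepted else None,
            "support": support,
            "conflict": len(ranked) > 1,
        }
    return report


# Invented observations from four hypothetical sources.
OBSERVATIONS = [
    ("en", "Exampleville", "populationTotal", 1000),
    ("de", "Exampleville", "populationTotal", 1000),
    ("fr", "Exampleville", "populationTotal", 1200),
    ("it", "Exampleville", "populationTotal", 1000),
    ("en", "Exampleville", "country", "Freedonia"),
    ("de", "Exampleville", "country", "Freedonia"),
    ("fr", "Exampleville", "mayor", "A. Example"),
]

if __name__ == "__main__":
    for key, summary in assess_facts(OBSERVATIONS).items():
        print(key, summary)
```

A trust-based filter would instead weight each source differently; here all sources count equally, which is the simplest possible design choice.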

If we reverse our point of view and see DBpedia at the service of NLP, then the text on which NLP techniques are used is quite arbitrary and depends on the intended purposes and applications. For example, in [ 6 ] (this volume), both news articles and tweets are explored, which are two very different types of text.

The question of language is valid whether we are looking at "NLP for DBpedia" or "DBpedia for NLP". In [ 16 ] (this volume), French text is analysed, and in [ 3 ] (this volume), four different language chapters of DBpedia are used. Only a minority of contributions explore languages other than English. As always, work on English is more prominent than work on other languages, which is a reminder that it would be valuable for both communities to work on different languages.

Gold and silver standards

The topic of evaluation is both important and much debated. In NLP, there has been a tendency in the past 15 years to perform experiments for which there are well-defined gold standards and datasets. There has been an increase in the number of competitions and challenges in many sub-fields of NLP, such as automatic summarization [ 17 ], word-sense disambiguation [ 14 ], textual entailment [ 5 ], etc.

In the Semantic Web community, there is less of such rigid evaluation, as the field is younger than NLP and is still pushing forward with different ideas and concepts without imposing rigid evaluations. Certainly, one of the purposes of this workshop was to start a discussion towards bringing more gold standards and evaluation datasets into the community. There are, however, some competitions in other areas, such as the OAEI (Ontology Alignment Evaluation Initiative¹¹), which has been running for a few years now, as well as QALD (see above) and the plethora of benchmarks for triplestores such as the DBPSB (DBpedia SPARQL Benchmark [ 13 ]). In the field of NER/NED, there are not many datasets or gold standards and only a few challenges. The work of [ 6 ] (this volume) paves the way towards the standardization of NER and NED benchmarking in an implemented benchmarking system.

11 http://oaei.ontologymatching.org/

As a first important step towards developing such a gold standard, it is also good to review and question existing work. The work of [ 23 ] (this volume) is an extensive comparison of NED benchmarks and characterizes them to see whether they could be biased towards particular types of algorithms or types of test data. The contribution therefore opens the debate as to how we should develop such benchmarks and provides a solid foundation to build upon.

When gold standards are hard (costly, time-consuming) to develop, it can be interesting to develop silver standards that are the results of well-known methods, or the combined results of different methods. Such standards do not replace gold standards, but they at least give an indication of the direction of progress for particular algorithms. One possibility when two communities come together is to take the results of one to become the "silver standard" of the other. [ 18 ] (this volume) describes such a silver standard and discusses its benefits as well as its limitations.
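Combining the results of different methods into a silver standard can be sketched with a simple agreement vote, assuming that each system labels the same set of item identifiers (for example, mention ids mapped to URIs). The function `build_silver_standard`, the agreement threshold and the system outputs are hypothetical, and this is not the construction used in [ 18 ].

```python
from collections import Counter


def build_silver_standard(system_outputs, min_agreement=2):
    """Keep a label for an item only where enough systems agree on it.

    system_outputs maps a system name to a dict of item id -> label.
    An item enters the silver standard if its most frequent label is given
    by at least `min_agreement` systems and is strictly ahead of the
    runner-up; all other items are left out rather than guessed.
    """
    votes = {}
    for labels in system_outputs.values():
        for item, label in labels.items():
            votes.setdefault(item, Counter())[label] += 1

    silver = {}
    for item, counter in votes.items():
        (label, count), *rest = counter.most_common(2)
        if count >= min_agreement and (not rest or rest[0][1] < count):
            silver[item] = label
    return silver


# Invented outputs of three hypothetical entity linkers.
SYSTEM_OUTPUTS = {
    "system_a": {"m1": "dbr:Paris", "m2": "dbr:Leipzig", "m3": "dbr:Sydney"},
    "system_b": {"m1": "dbr:Paris", "m2": "dbr:Leipzig_University"},
    "system_c": {"m1": "dbr:Paris_Hilton", "m3": "dbr:Sydney"},
}

if __name__ == "__main__":
    print(build_silver_standard(SYSTEM_OUTPUTS))
```

Items on which the systems disagree evenly are dropped, so the resulting standard is smaller but less noisy. Any bias shared by the contributing systems is inherited, which is one of the limitations of silver standards.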

In some work, such as [ 16 ] (this volume) and [ 1 ] (this volume), DBpedia's network of relations is used as a gold standard in relation extraction. Also, Wikipedia/DBpedia entities have become the predominant link targets in NED. [ 20 ] reports that 7 out of 10 tools attach Wikipedia/DBpedia URLs as annotations (3 out of 10 for the DBpedia Ontology). Although this is an interesting way to proceed, we can debate whether we are using gold or silver standards and how to unify benchmarks for comparison.

Summary

We conclude by highlighting a few issues brought forward by the contributions in this workshop. First, the selected papers discuss many problems that have been recognized within the NLP community for a long time, but have only recently been introduced to Semantic Web researchers. The main challenges here concern:
- consensus on annotation guidelines,
- development of extraction rules and agreed-upon hierarchies that may be used to unify semantic enrichment and benchmarks,
- identification of well-defined tasks and problem classes,
- transferability of NLP tasks, resources and tools to other research communities (e.g. library and life sciences) as well as other languages and application areas,
- building practical resources and infrastructures that do not target one single research question, but can be exploited in a more universal manner by NLP tools,
- unlocking higher layers of semantic annotation to enable state-of-the-art OWL-based reasoning on a combination of noisy NLP data and LOD- and DBpedia-based knowledge structures.

Second, and perhaps more importantly, new possibilities emerge from the combination of the communities, and we hope to further push such possibilities to have more NLP for DBpedia and more DBpedia for NLP, continuing the knowledge spiral, and fighting together to open the knowledge acquisition bottleneck. We hope that the readers of this volume will find all papers interesting. We invite you to join our community and attend future workshop editions.

Acknowledgments.

We especially thank all contributors to DBpedia and the DBpedia Internationalisation committee¹². This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943) and GeoKnow (GA no. 318159).

Programme Committee

We would like to thank all reviewers who have helped us, and especially the authors, with their comments and feedback.

12 http://wiki.dbpedia.org/Internationalization

References

1. A. P. Aprosio, Giuliano, and L. Alberto Lavelli. Extending the Coverage of DBpedia Properties using Distant Supervision over Wikipedia. In Proceedings of 1st International Workshop on NLP and DBpedia, October 21-25, Sydney, Australia, volume 1064 of NLP & DBpedia 2013, Sydney, Australia, October 2013. CEUR Workshop Proceedings.

2. Auger and Barriere. Probing Semantic Relations: Exploration and identification in specialized texts. John Benjamins, benjamins edition, 2010.

3. Cabrio, Cojan, Villata, and Gandon. Argumentation-based Inconsistencies Detection for Question-Answering over DBpedia. In Proceedings of 1st International Workshop on NLP and DBpedia, October 21-25, Sydney, Australia, volume 1064 of NLP & DBpedia 2013, Sydney, Australia, October 2013. CEUR Workshop Proceedings.

4. Carlson, Betteridge, Kisiel, Settles, E. R. H. Jr., and T. M. Mitchell. Toward an architecture for never-ending language learning. In M. Fox and D. Poole, editors, AAAI. AAAI Press, 2010.

5. Cristea. Textual entailment. Computational Linguistics, (June):1140–1143, 2009.

6. Dojchinovski and Kliegr. Datasets and GATE Evaluation Framework for Benchmarking Wikipedia-Based NER Systems. In Proceedings of 1st International Workshop on NLP and DBpedia, October 21-25, Sydney, Australia, volume 1064 of NLP & DBpedia 2013, Sydney, Australia, October 2013. CEUR Workshop Proceedings.

7. Dutta, Meilicke, Niepert, and Ponzetto. Integrating Open and Closed Information Extraction: Challenges and First Steps. In Proceedings of 1st International Workshop on NLP and DBpedia, October 21-25, Sydney, Australia, volume 1064 of NLP & DBpedia 2013, Sydney, Australia, October 2013. CEUR Workshop Proceedings.

8. Elbedweihy, Wrigley, and Ciravegna. Using BabelNet in Bridging the Gap Between Natural Language Queries and Linked Data Concepts. In Proceedings of 1st International Workshop on NLP and DBpedia, October 21-25, Sydney, Australia, volume 1064 of NLP & DBpedia 2013, Sydney, Australia, October 2013. CEUR Workshop Proceedings.

9. C. Fellbaum, editor. WordNet: an electronic lexical database. MIT Press, 1998.

10. W. A. Gale, K. W. Church, and Yarowsky. A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26(5-6):415–439, 1992.

11. Grishman. Information Extraction: Techniques and Challenges. New York, i(4):10–27, 1997.

12. Heder and P. N. Mendes. Round-trip semantics with sztakipedia and dbpedia spotlight. In A. Mille, F. L. Gandon, J. Misselis, M. Rabinovich, and S. Staab, editors, WWW (Companion Volume), pages 357–360. ACM, 2012.

13. M. Morsey, J. Lehmann, S. Auer, and A.-C. Ngonga Ngomo. DBpedia SPARQL Benchmark – Performance Assessment with Real Queries on Real Data. In ISWC 2011, 2011.

14. Navigli, Jurgens, and Vannella. SemEval-2013 Task 12: Multilingual Word Sense Disambiguation. In Proceedings of the 7th International Workshop on Semantic Evaluation SemEval 2013 in conjunction with the Second Joint Conference on Lexical and Computational Semantics SEM 2013, 2013.

15. Navigli and S. P. Ponzetto. Babelnet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artif. Intell., 193:217–250, 2012.

16. Nebhi. A Rule-Based Relation Extraction System using DBpedia and Syntactic Parsing. In Proceedings of 1st International Workshop on NLP and DBpedia, October 21-25, Sydney, Australia, volume 1064 of NLP & DBpedia 2013, Sydney, Australia, October 2013. CEUR Workshop Proceedings.

17. T. Okumura, Manabu Fukusima, and Nanba. Text Summarization Challenge 2 - Text summarization evaluation at NTCIR Workshop 3. In Proceedings of the HLT-NAACL 03 Text Summarization Workshop, pages 49–56, 2003.

18. Paulheim. DBpediaNYD, A Silver Standard Benchmark Dataset for Semantic Relatedness in DBpedia. In Proceedings of 1st International Workshop on NLP and DBpedia, October 21-25, Sydney, Australia, volume 1064 of NLP & DBpedia 2013, Sydney, Australia, October 2013. CEUR Workshop Proceedings.

19. Paulheim and S. P. Ponzetto. Extending DBpedia with Wikipedia List Pages. In Proceedings of 1st International Workshop on NLP and DBpedia, October 21-25, Sydney, Australia, volume 1064 of NLP & DBpedia 2013, Sydney, Australia, October 2013. CEUR Workshop Proceedings.

20. G. Rizzo, Troncy, Hellmann, and Bruemmer. NERD meets NIF: Lifting NLP extraction results to the linked data cloud. In LDOW, 2012.

21. Sekine and Nobata. Definition, dictionaries and tagger for extended named entity hierarchy. In A. Zampolli and M. T. Lino, editors, Proceedings of the Language Resources and Evaluation Conference LREC, pages 1977–1980. European Language Resources Association, 2004.

22. K. Sparck Jones. Further reflections on TREC. Information Processing & Management, 36(1):37–85, 2000.

23. Steinmetz, Knuth, and Sack. Statistical Analyses of Named Entity Disambiguation Benchmarks. In Proceedings of 1st International Workshop on NLP and DBpedia, October 21-25, Sydney, Australia, volume 1064 of NLP & DBpedia 2013, Sydney, Australia, October 2013. CEUR Workshop Proceedings.

24. Stubbs. An example of frequent English phraseology: Distribution, structures and functions. In R. Facchinetti, editor, Corpus linguistics 25 years on, number 62, pages 89–105 385. Rodopi, 2007.

25. P. D. Turney and M. L. Littman. Corpus-based Learning of Analogies and Semantic Relations. Machine Learning, 60(1-3):1–3, 2005.

26. C. Unger, J. Mccrae, S. Walter, S. Winter, and P. Cimiano. A lemon lexicon for DBpedia. In Proceedings of 1st International Workshop on NLP and DBpedia, October 21-25, Sydney, Australia, volume 1064 of NLP & DBpedia 2013, Sydney, Australia, October 2013. CEUR Workshop Proceedings.

27. Uszkoreit and Xu. From Strings to Things SAR-Graphs: A New Type of Resource for Connecting Knowledge and Language. In Proceedings of 1st International Workshop on NLP and DBpedia, October 21-25, Sydney, Australia, volume 1064 of NLP & DBpedia 2013, Sydney, Australia, October 2013. CEUR Workshop Proceedings.