=Paper=
{{Paper
|id=None
|storemode=property
|title=Do it your own (DIY) Jeopardy Question Answering System
|pdfUrl=https://ceur-ws.org/Vol-1035/iswc2013_demo_37.pdf
|volume=Vol-1035
|dblpUrl=https://dblp.org/rec/conf/semweb/FreitasC13
}}
==Do it your own (DIY) Jeopardy Question Answering System==
André Freitas and Edward Curry

Digital Enterprise Research Institute (DERI), National University of Ireland, Galway

===1 Motivation===
The evolution and maturity of semantic technologies, techniques and frameworks are bringing functionalities which were once considered academic or prototypical into real-life applications. Products such as IBM Watson [1] and Siri are examples of applications which heavily leverage state-of-the-art semantic technologies. These systems provide a synthesis of the functionalities which are available for general applications today, such as natural language search and queries over large-scale data, semantic flexibility, and integration between structured and unstructured resources. The success of these projects in demonstrating the potential of existing technologies lies in the fact that they bring into a single system approaches from Natural Language Processing (NLP), Semantic Web (SW), Information Retrieval (IR) and Databases.

This work demonstrates Treo, a framework which converges elements from NLP, IR, SW and Databases to create a semantic search engine and question answering (QA) system for heterogeneous data. Jeopardy and Question Answering queries over open-domain structured and unstructured data are used to demonstrate the approach. In this work, Treo is extended to cope with unstructured text in addition to structured data. The setup of the framework is done in three steps and can be adapted to other datasets in a simple DIY process.

===2 Treo: Querying structured & unstructured data===
Treo supports free natural language queries over both structured and unstructured data. To enable semantic flexibility and vocabulary independence in the query process, a principled distributional-compositional semantic model is used to build a distributional structured vector space model (τ-Space) [2].
Distributional semantics focuses on the automatic construction of a semantic model based on the statistical distribution of co-occurring words in large-scale corpora. The distributional semantics component of the model supports a semantic approximation between query and dataset terms: operations in the τ-Space are mapped to semantic relatedness operations using the distributional model as a commonsense knowledge base [2]. The automatic creation of distributional semantic models supports the transportability of the approach to other datasets and languages, without requiring the manual effort of creating ontologies (Treo does not rely on ontology-based reasoning for semantic approximation).

In addition to queries over structured data, this work extends the query mechanism to search for entities in unstructured text. Both structured and unstructured data are linked in an entity-centric semantic index (Figure 1 (B)). The elements of the query processing approach are depicted in Figure 1 (A). Two different query processing strategies are used:

- Query processing over structured data: In the query pre-processing phase, the natural language query is analyzed by the Interpreter component, which detects a set of query triple patterns and features in the user query. The second phase is the vocabulary-independent query processing approach, which defines a sequence of search and data transformation operations over the structured data graph embedded in the τ-Space [2], aiming to maximize the semantic matching with the query. The Query Planner generates the sequence of semantic search, navigation and transformation operations over the graph data. This sequence defines the query processing plan and is based on the set of query features determined in the pre-processing phase. The third phase consists of the execution of the query processing plan operations over the τ-Space index.

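The semantic approximation used by the structured-data strategy above can be illustrated with a small sketch. It is not Treo's implementation: the concept vectors are hand-made toy values, and the names CONCEPT_VECTORS, cosine and rank_predicates are hypothetical. The sketch only shows the general idea of ranking dataset vocabulary terms by their relatedness to a query term, here using cosine similarity over sparse concept vectors.

```python
import math

# Toy concept vectors (hypothetical values, not taken from the paper).
# In an ESA-style model each dimension would be a Wikipedia concept and
# each weight would be derived from the reference corpus.
CONCEPT_VECTORS = {
    "daughter":   {"Family": 0.9, "Child": 0.8, "Marriage": 0.2},
    "child":      {"Family": 0.8, "Child": 0.9},
    "spouse":     {"Family": 0.5, "Marriage": 0.9},
    "birthPlace": {"Geography": 0.9, "Birth": 0.7},
}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(concept, 0.0) for concept, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_predicates(query_term, predicates, threshold=0.0):
    """Rank dataset predicates by relatedness to a query term.

    Predicates whose score does not exceed the threshold are dropped.
    """
    query_vector = CONCEPT_VECTORS[query_term]
    scored = [(p, cosine(query_vector, CONCEPT_VECTORS[p])) for p in predicates]
    scored = [(p, s) for p, s in scored if s > threshold]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# The query term "daughter" does not appear in the dataset vocabulary,
# but it can still be matched to the most related predicate.
ranking = rank_predicates("daughter", ["child", "spouse", "birthPlace"])
for predicate, score in ranking:
    print(f"{predicate}: {score:.3f}")
```

In this toy setup the query word and the predicate share no characters in common, yet the predicate is still selected because the two terms are close in the concept space. This is the sense in which the matching is vocabulary independent.
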
- Query processing over structured & unstructured data: If the query is not addressed by the available structured data, it can be processed against both structured data and unstructured text in the entity-centric index. The query pre-processing approach for this query type consists of detecting the query focus by applying POS-tag-based rules and of detecting and resolving named entities in the query. The query plan is a composition of keyword-search operations over the text segments associated with entities, distributional search operations over structured data, and keyword search over associated entities. A ranking function weights the results of all operations, also taking into account the cardinality of each entity (number of associated entities, facts and text segments). The initial top-20 entity results are re-ranked based on the distributional semantic relatedness scores between the query focus phrase and the associated entity types.

Fig. 1: (A) Semantic indexing and query processing architecture. (B) Entity-centric representation of structured and unstructured data.

===3 DIY Setup Process===
The setup of the Treo platform for a new dataset consists of the creation of a semantic index for both structured and unstructured data, which requires three steps:

1. Construction of the distributional semantic model: Uses a large-scale reference corpus to build the distributional semantic reference model [2]. In this demonstration, Wikipedia 2006 is used as the reference corpus and Explicit Semantic Analysis (ESA) is the distributional semantic model.

2. Semantic indexing of structured data: Indexes the structured data using the distributional semantic reference model [2]. The framework takes as input any dataset following an Entity-Attribute-Value (EAV) format. DBpedia 3.7 and YAGO are used as the demonstration datasets.

3. Unstructured data entity-centric indexing: This step takes as input a text collection, recognizes the named entities based on the structured data previously indexed, and aligns the text with the indexed structured data. The demonstration uses Wikipedia 2013 as the test collection.

The steps are executed by calling one script, which takes as input the three types of resources (reference corpora, structured datasets and unstructured texts). After the setup, natural language queries can be executed against the structured and unstructured data indexes.

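As a rough illustration of steps 2 and 3, the sketch below builds a tiny entity-centric index that stores EAV facts and text sentences under the same entity and then answers a Jeopardy-style clue by keyword overlap. It rests on several assumptions: the class EntityCentricIndex and its methods are hypothetical names, plain substring matching stands in for named entity recognition, and a simple token-overlap count stands in for Treo's ranking function. The distributional search and re-ranking stages are left out.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class EntityCentricIndex:
    """Toy index: each entity owns its facts and its text sentences."""

    def __init__(self):
        self.facts = defaultdict(list)      # entity -> [(attribute, value)]
        self.sentences = defaultdict(list)  # entity -> [(sentence, mentions)]

    def add_triple(self, entity, attribute, value):
        """Step 2 (simplified): index one Entity-Attribute-Value fact."""
        self.facts[entity].append((attribute, value))

    def add_text(self, entity, text):
        """Step 3 (simplified): attach sentences to an entity and record
        which already-indexed entities each sentence mentions."""
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if not sentence:
                continue
            mentions = [
                other for other in self.facts
                if other != entity and other.lower() in sentence.lower()
            ]
            self.sentences[entity].append((sentence, mentions))

    def search(self, query, top_k=3):
        """Score entities by token overlap with their text and fact values."""
        query_tokens = tokenize(query)
        scores = {}
        for entity in set(self.facts) | set(self.sentences):
            text_tokens = set()
            for sentence, _ in self.sentences.get(entity, []):
                text_tokens |= tokenize(sentence)
            fact_tokens = set()
            for _, value in self.facts.get(entity, []):
                fact_tokens |= tokenize(value)
            score = len(query_tokens & text_tokens) + len(query_tokens & fact_tokens)
            if score > 0:
                scores[entity] = score
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Illustrative data only, loosely following the example in Figure 1 (B).
index = EntityCentricIndex()
index.add_triple("Alias (TV series)", "creator", "J. J. Abrams")
index.add_triple("Alias (TV series)", "starring", "Jennifer Garner")
index.add_triple("Lost (TV series)", "creator", "J. J. Abrams")
index.add_triple("Jennifer Garner", "occupation", "Actress")
index.add_text(
    "Alias (TV series)",
    "Jack Bristow (Victor Garber) is Sydney's father and also works for "
    "SD-6 as a double agent for the CIA. "
    "It stars Jennifer Garner as Sydney Bristow, a CIA agent.",
)

clue = ("Sydney's dad, Jack, was a CIA double agent working against SD-6 "
        "on this Jennifer Garner show")
results = index.search(clue)
print(results[0][0])
```

Keeping facts and sentences under one entity key is what lets a single query combine evidence from both sources, which is the point of the entity-centric representation.
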
Figure 1 shows the components of the Treo architecture (A) and an example of the entity-centric linking between structured and unstructured data (B).

===4 Demonstration===
The system is demonstrated over the open-domain DBpedia 3.7/YAGO RDF datasets and Wikipedia 2013 text data. The RDF datasets consist of 128,071,259 triples (17GB) loaded into the Treo index for structured data. A set of natural language queries from the Jeopardy challenge (http://j-archive.com/) and from the Question Answering over Linked Data challenge (QALD-1, http://www.sc.cit-ec.uni-bielefeld.de/qald-1, 2011) are used to demonstrate the system. In the demonstration, users input free natural language queries and the system returns two types of results: (i) a list of highly related triples or (ii) post-processed results, depending on the query type.

Fig. 2: Example queries: (1,2) Queries over structured data. (3,4) Jeopardy queries over structured and unstructured data.

Figure 2 (2) shows the output of a query over the structured data index for the query 'Was Margaret Thatcher a chemist?'. In addition to the post-processed answer, which provides a direct (QA-style) answer for the query, the mechanism shows the justification for the answer with the supporting triples. Figure 2 (1) shows a query over structured data with a complex query plan ('Which cities in New Jersey have more than 10000 inhabitants?'). Figure 2 (3) and (4) show examples of Jeopardy queries, which typically provide a natural language description of a named entity or concept (for example: 'Sydney's dad, Jack, was a CIA double agent working against SD-6 on this Jennifer Garner show'). Further examples can be found online (http://treo.deri.ie/ISWC2013Demo).

Acknowledgments. This work was funded by SFI Ireland (SFI/08/CE/I1380).

===References===
1. D. Ferrucci et al., Building Watson: An Overview of the DeepQA Project, AI Magazine, 2010.

2. A. Freitas, E. Curry, J. G. Oliveira, S. O'Riain, A Distributional Structured Semantic Space for Querying RDF Graph Data, International Journal of Semantic Computing (IJSC), 2012.