XB: A Large-scale Korean Knowledge Base for Question Answering Systems

Jongmin Lee1, Youngkyoung Ham1, Tony Lee1

1 Saltlux Inc. Daewoong Bldg. 689-4, Yeoksam 1 dong, Gangnam-gu, Seoul, South Korea
{jmlee, ykham, tony}@saltlux.com

Abstract. Many studies address question answering systems that answer natural language questions. Building such a system requires diverse techniques, but it cannot be implemented without well-structured knowledge data. For this reason, we construct a large-scale knowledge base in Korean, with the goal of creating a uniquely Korean question answering system.

1 Introduction

Recently, a variety of Question Answering (QA) systems have been developed, such as IBM Watson and Apple Siri. In these systems, a user inputs a query in natural language, and the QA system searches for the corresponding answer, often using inferences from other related search queries, and provides the user with accurate and relevant information. Most QA systems use a knowledge base to store knowledge extracted from a multitude of data. Extremely large knowledge bases, such as YAGO[1] and Wikidata[2], have been constructed from documents written in English, whose contents are well known worldwide. However, individual countries require QA systems tailored to their own knowledge. For example, even though the Eulmi Incident is very significant in Korean history, no knowledge of it is found in the English version of Wikipedia. If a question asks when the Eulmi Incident happened, most existing knowledge resources cannot answer it: Korean DBpedia has no structured knowledge about it, and Korean Wikipedia holds that information only in its text. For this reason, it was necessary to construct a large-scale knowledge base in Korean from various knowledge resources, with the goal of creating a uniquely Korean QA system.
The resulting XB was constructed using the dual-spiral method[3], which allows automatic conversion and manual construction to proceed simultaneously. In addition, the XB incorporates existing knowledge bases such as GeoNames[4], OpenStreetMap[5], DBpedia[6] and Wikidata. Knowledge in the XB is represented as triples (subject/predicate/object). So far, approximately 200 million triples have been constructed. Through OWL axiom inference (rdfs:subClassOf, rdfs:subPropertyOf, owl:TransitiveProperty, owl:inverseOf, owl:disjointWith, etc.), the number of triples increases by 0.4 billion.

2 Development

The XB is a large-scale, common-sense-level knowledge base for Korean QA systems that uses an ontological approach to express knowledge. Figure 2 shows a simple process of our question answering scenario. A user inputs a question in natural language form, and it is converted into a SPARQL query using various conversion techniques. The resulting SPARQL query retrieves answers from the knowledge base.

Figure 1 Part of SPARQL results

The XB is built by the following procedure for the QA scenario. To define classes, we used the hierarchical structure of Korlex[7], the Korean WordNet. Korlex is a lexical database in which linguistic relations such as synonymy, hypernymy and hyponymy are structured. Classes are chosen according to how frequently each Korlex keyword is searched, and superclass/subclass relations are assigned between the chosen classes. Key properties are defined with reference to YAGO and DBpedia, based on how frequently each property is used. In addition, a property is added when it is requested or identified from competency questions while the knowledge base is being constructed. To build entities, the necessary knowledge is extracted from diverse knowledge resources through rule-based automatic conversion and manual curation by domain experts, following the dual-spiral methodology.
Default entities come from Wikipedia pages and are extended when other resources contain unmapped entities. The rule-based automatic conversion is a process by which the machine distinguishes between classes and properties, using mapping rules between a predefined schema and a knowledge resource, to build knowledge. The curation is a process in which humans additionally verify the automatically converted knowledge or build new knowledge. For example, the main text of a Wiki page, written in natural language, is not easily converted automatically. The results of the rule-based automatic conversion and of the curation are each verified, with a trade-off between the two. Because the core part of the knowledge is constructed from the Korean Wikipedia, domains that are highly likely to be used in the QA system are selected so that the knowledge related to them can be built first. Moreover, the knowledge base has been enlarged with existing knowledge resources such as DBpedia, Wikidata and GeoNames.

Table 1 Knowledge base statistics

              Class                                     Property
Domain        URI                         #Instance     URI                 #Instance
People        xbc:person_00006026         2,467,831     rdfs:label          19,588,253
Organization  xbc:organization_07523126   972,788       xbp:nation          11,113,066
Event         xbc:event_00025950          407,272       xbp:relatedTerm     7,526,036
Term          xbc:term_05916288           31,339        xbp:description     4,875,147
Theory        xbc:theory_05637633         1,737         xbp:gender          2,171,672
Literature    xbc:writing_05967883,       579,891       xbp:job             1,974,747
              xbc:book_06013091
Music         xbc:music_06591368          270,201       xbp:scientificName  1,939,233
Art           xbc:graphic_art_03327573,   90,930        xbp:bornOn          1,768,723
              xbc:work_of_art_04423283

Table 1 shows part of the statistics of the knowledge base constructed through the above-mentioned processes. Domain refers to the field of knowledge. There are approximately 6,000 classes and approximately 1,000 properties. In addition, there are about 20 million instances, focused mainly on people, locations, organizations, events, and works.
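The rule-based automatic conversion described in this section can be illustrated with a minimal Python sketch. It is not the XB implementation. The class URI and property names are taken from Table 1, while the source field names, the xbr: instance prefix, the example record, and the function convert_record are hypothetical and introduced only for illustration. Fields that no mapping rule covers are returned separately, standing in for the material that would be left to manual curation.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical mapping rules: source field name -> XB property.
# The property names appear in Table 1; the field names are invented.
PERSON_RULES: Dict[str, str] = {
    "name": "rdfs:label",
    "occupation": "xbp:job",
    "birth_date": "xbp:bornOn",
    "sex": "xbp:gender",
}

PERSON_CLASS = "xbc:person_00006026"  # class URI listed in Table 1


def convert_record(
    subject: str,
    record: Dict[str, str],
    rules: Dict[str, str],
    class_uri: str,
) -> Tuple[List[Triple], List[str]]:
    """Apply mapping rules to one source record.

    Returns (triples, unmapped). `unmapped` lists the fields that no rule
    covers; in this sketch they stand for content left to human curators.
    """
    triples: List[Triple] = [(subject, "rdf:type", class_uri)]
    unmapped: List[str] = []
    for field, value in record.items():
        prop = rules.get(field)
        if prop is None:
            unmapped.append(field)
        elif value:  # ignore empty values
            triples.append((subject, prop, value))
    return triples, unmapped


# Hypothetical source record; "xbr:" is an assumed instance prefix.
record = {
    "name": "Example Person",
    "occupation": "poet",
    "birth_date": "1900-01-01",
    "summary": "Free text that a rule cannot convert.",
}
triples, unmapped = convert_record(
    "xbr:example_person", record, PERSON_RULES, PERSON_CLASS
)

for t in triples:
    print(t)
print("left for curation:", unmapped)
```

Keeping the rules in a plain dictionary separates the schema mapping from the conversion logic, so that a new knowledge resource would only need a new rule table under the same assumptions.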
3 APIs

Generally, an ontology-based knowledge base is queried with SPARQL, a standard query language for RDF data. However, it is very difficult for a user who is not familiar with ontologies to understand a schema correctly and to implement services that use a QA system or a knowledge base through SPARQL. In addition to a SPARQL endpoint, this study therefore provides a variety of APIs so that more users can easily access the XB. Table 2 lists the APIs supplied by the XB.

Table 2 List of APIs

API                 Description
/api/class          Search classes by keywords
/api/classInfo      Get information about a class by its URI
/api/property       Search properties by keywords
/api/propertyInfo   Get information about a property by its URI
/api/instance       Search instances by keywords
/api/instanceInfo   Get information about an instance by its URI
/api/instanceTime   Get temporal information about an instance by its URI
/api/instanceSpace  Get spatial information about an instance by its URI
/api/checkType      Check whether an input instance belongs to an input class (true or false)
/api/typeRelation   Infer the relationship between two input classes
/api/timeRelation   Infer the temporal relationship between two input instances
/api/spaceRelation  Infer the spatial relationship between two input instances
/api/shortestPath   Find a shortest path between two input instances

4 Future works

In the near future, additional tools to enhance quality and quantity are expected to be developed. Knowledge that goes through the curation work is fully verified, but the approach is limited in that a finite number of human curators cannot verify all knowledge in the system. To solve that problem, a crowdsourcing service is being developed to construct and verify knowledge.
There is also debate as to whether to build massive amounts of knowledge by automatically mapping into the knowledge base large-scale triples that are generated, with the help of machine learning, through language processing of knowledge or sentences aggregated from different knowledge resources. In addition, inference rules are being defined that analyze relations between pieces of knowledge in order to generate new knowledge that does not appear explicitly in the knowledge base. As it stands today, the XB has been built mainly from Korean-language knowledge resources. However, as most instances are given labels and types in English and are based on Wikipedia, we believe that it might be relatively easy to extend the XB to other languages if the multi-language links of Wikipedia were used. The XB will be extended and is expected to be available to public users soon, with a variety of practical applications.

Acknowledgement

This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. R0101-16-0054, WiseKB: Big data based self-evolving knowledge base and reasoning platform).

References

1. Hoffart, J., Suchanek, F. M., Berberich, K., Weikum, G.: YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence, Vol 194 (2013) 28-61
2. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Communications of the ACM (2014) 78-85
3. Kyosung, J., Youngkyoung, H., Kyungil, L.: Dual-Spiral methodology for knowledgebase constructions. International Conference on Big Data and Smart Computing (2016) 477-480
4. Wick, M., Vatant, B.: The geonames geographical database. Available from World Wide Web: http://geonames.org (2012)
5. Haklay, M., Weber, P.: Openstreetmap: User-generated street maps. IEEE Pervasive Computing (2008) 12-18
6. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J. et al.: DBpedia: A nucleus for a web of open data. Springer Berlin Heidelberg (2007) 722-735
7. Yoon, A.-S., et al.: Construction of Korean Wordnet. Journal of KIISE: Software and Applications, Vol 36(1) (2009) 92-108