=Paper=
{{Paper
|id=Vol-2512/paper7
|storemode=property
|title=A Scalable and Plug-in Based System to Construct A Production-Level Knowledge Base
|pdfUrl=https://ceur-ws.org/Vol-2512/paper7.pdf
|volume=Vol-2512
|authors=Tomoya Yamazaki,Kentaro Nishi,Takuya Makabe,Mei Sasaki,Chihiro Nishimoto,Hiroki Iwasawa,Masaki Noguchi,Yukihiro Tagami
|dblpUrl=https://dblp.org/rec/conf/kdd/YamazakiNMSNINT19
}}
==A Scalable and Plug-in Based System to Construct A Production-Level Knowledge Base==
A Scalable and Plug-in Based System to Construct a Production-Level Knowledge Base

Tomoya Yamazaki, Kentaro Nishi, Takuya Makabe, Mei Sasaki, Chihiro Nishimoto, Hiroki Iwasawa, Masaki Noguchi, and Yukihiro Tagami
{tomoyama, kentnish, tmakabe, mesasaki, cnishimo, hiwasawa, manoguch, yutagami}@yahoo-corp.jp
Yahoo Japan Corporation

Abstract. Knowledge bases play crucial roles in a wide variety of information systems, such as web search engines and intelligent personal assistants. To respond to constantly fluctuating user information demands, we aim to construct a large-scale, well-structured, and comprehensive knowledge base from the world's evolving data. To maintain the quality of our large knowledge base at the production level, we carefully design the system not only to match entities but also to incorporate various automatic and manual validation methods, because it is difficult to filter out all incorrect facts automatically in practice. In this paper, we propose a novel plugin-based system architecture that satisfies both the ability to rapidly identify mistakes and the need for system extensibility. Our constructed knowledge base is already utilized in Japanese Web services, and the number of entities in it keeps growing steadily.

Keywords: knowledge base · entity matching · data integration

DI2KG 2019, August 5, 2019, Anchorage, Alaska. Copyright held by the author(s). Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Knowledge Bases (KBs) have supported many user activities on the Web, such as browsing, searching, and buying. Well-known search engines, such as Google, Bing, and Yahoo! Inc., show not only the search results, called the 10 blue links, but also information in entity panels provided by their own KBs, i.e., the Google KG (https://googleblog.blogspot.com/2012/05/introducing-knowledge-graph-things-not.html), Satori (https://blogs.bing.com/search/2013/03/21/understand-your-world-with-bing/), and the Yahoo! KG [2], respectively. Our Japanese KB (JKB) is one of the largest KBs in Japanese and serves the same purposes.

KB construction systems differ depending on the application; for example, NOUS [4] is a domain-specific KB construction system, and many machine-learning-based approaches, such as Fonduer [9], have been proposed to maximize the F1 score of the output knowledge base. P. Suganthan et al. [5] pointed out that industrial systems differ considerably from those in academia regarding objectives; therefore, we cannot merely adopt such one-shot KB construction systems. The main objective of our KB construction system is not to maximize F1, but to maintain high precision at all times while trying to improve recall over time, because displaying incorrect information leads to losing users' trust. Since the quality of a KB involves various types of measurement, such as the precision and recall of the entity matching, the ratio of invalid or incorrect data, the ratio of missing relations, etc., we design our KB construction system so that it can control these measurements and evaluate each algorithm or software component one by one. Therefore, a plugin-based system architecture is critical to satisfying many system requirements, such as the ability to fix incorrect data immediately and the interchangeability of each method. This plugin-based design allows us to implement various algorithms and also satisfy business requirements, while the system focuses on ensuring our SLAs and scaling computation.
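The paper describes the plug-in architecture only at the design level. The following minimal Python sketch illustrates the kind of interchangeable component contract such an architecture implies; all names here (Component, CertaintyFilter, Pipeline) are hypothetical and are not taken from the system.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterable, List

# One record stands in for an entity or triple in the pipeline's common format.
Record = Dict[str, Any]


class Component(ABC):
    """Hypothetical contract for one interchangeable pipeline plug-in."""

    @abstractmethod
    def run(self, records: Iterable[Record]) -> Iterable[Record]:
        """Consume records in the common format and emit records in the same format."""


class CertaintyFilter(Component):
    """Example plug-in: drop records whose certainty score is below a threshold,
    so suspect data can be removed quickly without touching other stages."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def run(self, records: Iterable[Record]) -> Iterable[Record]:
        return [r for r in records if r.get("certainty", 0.0) >= self.threshold]


class Pipeline:
    """Chains plug-ins; any stage can be swapped, reordered, or evaluated in isolation."""

    def __init__(self, components: List[Component]) -> None:
        self.components = components

    def run(self, records: Iterable[Record]) -> Iterable[Record]:
        for component in self.components:
            records = component.run(records)
        return records
```

Because every stage consumes and produces the same record format, an algorithm can be replaced or re-evaluated one component at a time, which is the property emphasized above.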
WOO [1] is one of the plugin-based KB construction systems at Yahoo! Inc. and is designed to enable various types of products to synthesize KBs. The procedure by which the WOO produces a KB from input data is as follows: (1) import data in various schemes and normalize the input data to a common format, (2) match entities with entity matching algorithms, (3) assign persistent IDs, and (4) export the KB in any output format. In this paper, we extend the WOO to maintain high precision from multiple measurement perspectives and explore a loosely coupled, plugin-based architecture. In addition to the above four primary functions of the WOO, our KB construction system validates incorrect data and completes missing relations and facts. We carefully design the system architecture such that it is easily extensible.

2 Our KB Construction System Architecture

Our KB construction system is designed to handle hundreds of millions of entities covering our wide-ranging domain services (e.g., books, movies, companies, and landmarks) and constructs a large production-level KB every day using Apache Spark. Our system is mainly composed of the twelve components shown in Figure 1, and we roughly divide their roles into two groups according to whether the input data are Web-crawled or not. Every day, we collect and update structured or semi-structured data such as linked open data (LOD, e.g., Wikidata, Wikipedia, DBpedia, or Freebase), content provider (CP) data (e.g., landmark, movie, or book data), and Web-crawled data gathered by our Web crawler. These three types of data have different characteristics regarding their accuracy and the complexity of extracting key-value information, as shown in Table 1.

Our system imports such structured and semi-structured data, other than Web-crawled data, from the Importer to the Exporter through the other ten primary components. As shown in Table 1, extracting key-value information from Web-crawled data is a separate challenge due to the combination of various input documents and the variation in extraction targets [3]; thus, we design our information extraction (IE) method separately from the main components and retrieve the IE data as additional data.

Fig. 1. Overview of our KB construction system architecture. Solid arrows, dotted arrows, squared frames, rounded-corner frames, and cylindrical objects represent data flows between our system and external data, data flows within the system, primary software components, input/output data, and preserved data, respectively. Mapping files define mappings of entity types and relations from the input data to the types of our ontology.

2.1 The JKB Scheme

We introduce a new scheme for KBs and explain its advantages. The JKB is a set of entities. An entity is composed of a unique id, types (e.g., Person and Written Work), and a set of Triples. A Triple is similar to an RDF triple (https://www.w3.org/TR/rdf11-concepts/) in that it is composed of a subject, a predicate, and an object. Moreover, each triple in the JKB scheme additionally carries a certainty score, as in YAGO [8], and a data type. We can maintain the quality of the JKB by filtering out triples whose certainty score is below a threshold or whose data type is inconsistent with our ontology. Since the JKB scheme allows adding arbitrary meta-information such as the data source, we can easily debug by tracing the data source.
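The JKB scheme is not given as code in the paper; the following is a minimal sketch, assuming a Python in-memory representation, of an entity with typed, scored triples and of the certainty- and data-type-based filtering described above. The field names and the filter_triples helper are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Triple:
    subject: str                     # entity id
    predicate: str                   # relation name from the ontology
    obj: str                         # literal value or entity id
    certainty: float                 # certainty score, as in YAGO
    data_type: str                   # data type of the object, e.g. "date" or "entity"
    meta: Dict[str, str] = field(default_factory=dict)  # e.g. {"source": "Wikidata"}


@dataclass
class Entity:
    entity_id: str                   # unique, persistent JKB id
    types: List[str]                 # e.g. ["Person"] or ["Written Work"]
    triples: List[Triple] = field(default_factory=list)


def filter_triples(entity: Entity, threshold: float,
                   expected_types: Dict[str, str]) -> Entity:
    """Keep only triples whose certainty reaches the threshold and whose data
    type agrees with a (hypothetical) predicate -> expected data type mapping
    derived from the ontology."""
    kept = [t for t in entity.triples
            if t.certainty >= threshold
            and expected_types.get(t.predicate, t.data_type) == t.data_type]
    return Entity(entity.entity_id, entity.types, kept)
```

The per-triple meta field corresponds to the arbitrary meta-information mentioned above: every surviving fact can be traced back to its data source when debugging.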
2.2 Primary Functions of Each Component

Importer is a data feed component. It supports arbitrary input data schemes, except for Web-crawled data, as shown in Figure 1, and unifies the input into the JKB scheme together with the certainty score. We manually assign the certainty score for every data source or predicate, taking into account its update frequency or estimated precision.

Table 1. Comparison of the input data regarding the primary information. We calculate the data size from JKB-formatted data.

Data source | Data size | Data accuracy | Complexity to use | Update frequency
Wikidata | 300 GB | Fair | Low (JSON) | About every two weeks
DBpedia (Japanese) | 300 GB | Fair | Low (N-Triples) | About once a year
Freebase | 700 GB | Fair | Low (N-Triples) | Ended on June 30, 2015
Wikipedia (Japanese) | 5.6 GB | Good | Medium | About every two weeks
CP data (Japanese) | ≈ 500 GB | Excellent | Low (JSON, TSV, XML, etc.) | Every day
Crawled data | ≈ 1000 GB | Bad | High | Every day

Attribute Converter converts the types and predicates of the input data to our ontology by using the mapping file. For example, we provide a mapping from the person type in Wikidata (https://www.wikidata.org/wiki/Q215627) to the Person type in our ontology. We traverse the type hierarchy of each LOD and semi-automatically construct type mappings from the LODs to the JKB. The semi-automatic steps for constructing person mappings from Wikidata to the JKB are as follows:

1. We judge whether the mapping from the type in Wikidata (https://www.wikidata.org/wiki/Q215627) to the type in our ontology is correct.
2. If the previous step passes, we automatically collect candidate mappings between all subclasses of the type in Wikidata and the type in our ontology.
3. We sample the mapping results and manually check the correctness of the above candidate mappings.

Entity Matcher outputs entity clusters, which group the same entities. The Entity Matcher is mainly composed of three steps: rule-based matching, graph-based matching, and filtering of unnecessary entity clusters and their attributes. First, accurate matches are produced by rule-based matching. Second, we create blocks of candidate entity clusters with weak matching methods (e.g., name matching) to reduce the computation time. We then connect edges between related entity clusters within each block using matching methods stricter than the blocking (e.g., name and birthdate) and extract cliques from the resulting entity-cluster graph. Third, the Entity Matcher removes entity clusters whose attribute certainties are zero, as well as attributes left unmapped by the Attribute Converter.

ID Assigner assigns a unique ID to each entity cluster based on the set of data sources of the entities in the cluster. For example, if the ID of an entity cluster composed of two entities derived from "Wikidata ID: 100" and "DBpedia ID: 2000" is "JKB ID: 300", we save the two relations "Wikidata ID: 100" → "JKB ID: 300" and "DBpedia ID: 2000" → "JKB ID: 300". We preserve the relations between the ID and the set of data sources on Apache HBase (https://hbase.apache.org/), the Hadoop database, as the ID Table. The ID Assigner ensures that entity clusters that match past ones inherit the past ID, using the following steps:

1. Get the past JKB IDs of the entities in the given entity cluster by cross-referencing the data sources of the entities in the entity cluster with the past ID Table. These past JKB IDs are candidate IDs for the given entity cluster.
2. Get all past data sources by cross-referencing the past JKB IDs with the past ID Table.
3. Calculate the ratio of the number of data sources in the intersection of the past and current data sources to the number of past data sources.
4. If the ratio is greater than 0.5, assign the past ID to the entity cluster; otherwise, assign a new ID and update the ID Table.

The ID is thus persistent across time with the ID Assigner.
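As an illustration of the four steps above, here is a minimal sketch of the inheritance rule, assuming the ID Table is represented as a plain dictionary from data-source keys to past JKB IDs (in production it is stored on Apache HBase). The function name, the new-ID format, and the tie-breaking order among candidates are assumptions, since the paper does not specify them.

```python
import uuid
from typing import Dict, Set

# Past ID Table: data-source key (e.g. "Wikidata ID: 100") -> past JKB ID.
IdTable = Dict[str, str]


def assign_id(current_sources: Set[str], id_table: IdTable) -> str:
    """Assign a persistent ID to an entity cluster following the four steps above."""
    # Step 1: candidate past IDs, found via the cluster's current data sources.
    candidates = {id_table[s] for s in current_sources if s in id_table}

    for past_id in candidates:  # tie-breaking among candidates is not specified in the paper
        # Step 2: all past data sources that mapped to this candidate ID.
        past_sources = {s for s, i in id_table.items() if i == past_id}
        # Step 3: ratio of shared data sources to past data sources.
        ratio = len(past_sources & current_sources) / len(past_sources)
        # Step 4: inherit the past ID if more than half of the past sources remain.
        if ratio > 0.5:
            return past_id

    # Otherwise mint a new ID and record it in the ID Table (the ID format is illustrative).
    new_id = f"JKB ID: {uuid.uuid4().hex[:8]}"
    for source in current_sources:
        id_table[source] = new_id
    return new_id
```

With the example above, a cluster built from "Wikidata ID: 100" and "DBpedia ID: 2000", both previously mapped to "JKB ID: 300", has an overlap ratio of 2/2 = 1.0 and therefore inherits "JKB ID: 300".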
Additional Data Combiner combines entities from the additional data, whose scheme is the same as the JKB's; the Additional Data Combiner incorporates these entities into the entity clusters from the ID Assigner by using their IDs.

Entity Merger merges the entities in each entity cluster into one entity. We can define the certainty score of each Triple derived from a data source. The Entity Merger merges identical Triples of the entities in an entity cluster into one Triple whose certainty score is combined by a monotone increasing function (one possible such function is sketched at the end of this section). Since the main objective of our KB construction system is to maintain high precision at all times, the Entity Merger is a simple but powerful component, because the Validator can filter unreliable attributes accurately thanks to the certainty score.

Object Converter converts the object of each Triple to the corresponding entity id by referring to the ID Table. If the Triple is derived from linked data and the object is described by a specific identifier, it is easy for the Object Converter to convert the object to the corresponding entity id by using the ID Table. However, since many objects of triples are represented as literals, linking such literals to the corresponding entities is similar to an entity-disambiguation problem. Therefore, we resolve the entity ambiguity from two viewpoints: (1) the type consistency between the predicate and our ontology, and (2) distinguishing different objects with the same name and the same type, such as a person's name. We solve the first problem by comparing our ontology with the range type of the predicate, and the second by converting the object only when the entity with that pair of object type and object name is uniquely defined in the JKB.

Attribute Completer completes attributes of entities based on our ontology and their URL attributes. First, it completes attributes by using symmetric properties defined in our ontology, such as the owl:inverseOf property (https://www.w3.org/TR/owl-ref/#inverseOf-def). Second, it extracts useful information from entity-related URLs, for example, OGP (http://ogp.me/) im-
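The paper does not name the monotone increasing function used by the Entity Merger. The sketch below, referenced from the Entity Merger description above, combines certainty scores with a noisy-or purely as an illustrative assumption; any function that grows with each input score would fit the description.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Within one entity cluster, a triple is reduced here to (predicate, object, certainty).
RawTriple = Tuple[str, str, float]


def merge_triples(triples: Iterable[RawTriple]) -> List[RawTriple]:
    """Merge identical (predicate, object) triples coming from different entities
    of the same cluster into one triple, combining certainties with a noisy-or:
    1 - prod(1 - c_i), which is monotone increasing in every input score."""
    grouped: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for predicate, obj, certainty in triples:
        grouped[(predicate, obj)].append(certainty)

    merged = []
    for (predicate, obj), scores in grouped.items():
        remaining_doubt = 1.0
        for score in scores:
            remaining_doubt *= (1.0 - score)
        merged.append((predicate, obj, 1.0 - remaining_doubt))
    return merged
```

Under this illustrative choice, two independent sources asserting the same fact with certainties 0.6 and 0.7 yield a merged certainty of 0.88, so agreement across sources raises the score that the Validator later uses to filter unreliable attributes.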