Universal Schemas Using Shortest Dependency Paths for Free Word Order Languages Jiho Kim, Sangha Nam, and Key-Sun Choi Korea Advanced Institute of Science and Technology, Republic of Korea {hogajiho, sangha.nam, kschoi}@kaist.ac.kr Abstract. Universal schemas are a remarkable approach for solving re- lation extraction, in which new facts are extracted by jointly learning latent feature vectors of entity pairs and relation types through matrix factorization models. However, in free word order languages where sur- face form predicates do not fully reveal the characteristics of a sentence, universal schemas cannot be constructed in the same manner. There- fore, in this study, we introduce a novel expansion of universal schemas, dependency-path-based universal schemas. Our model uses shortest de- pendency paths and entity types instead of surface form predicates. For verification of our model, we constructed and evaluated a universal schema in Korean with a combination of Korean DBpedia and Korean Wikipedia distant supervision data. Keywords: Universal Schemas · Relation Extraction · Shortest Depen- dency Paths 1 Introduction Relation extraction (RE) is a core step in various natural language processing applications, and can be defined as the task of finding ontological relations be- tween two target entities in a sentence. The concept of universal schemas [5] was proposed as a considerable approach in RE. A universal schema can be con- structed using a combination of all usable knowledge bases (KBs) and natural language text, then expressed through a huge matrix with entity tuples as rows and relations as columns. The novelty of universal schemas comes from avoiding pre-labeled datasets by using surface form predicates as a source for relation types, and mutually supporting both unstructured and structured data. In English, surface form predicates between two entities can usually serve as relations [1]. However, in free word order languages, the extraction of surface pattern relations between entities is difficult owing to flexible word orders. With- out the presence of language-specific surface form predicate extractors, universal schemas cannot be built in these languages. Even though several algorithmic ex- tensions have been studied [4, 6], none have solved the dependency on surface form predicates. In this study, we propose a novel expansion of universal schemas: dependency path based universal schemas (DPUSs). Shortest dependency paths (SDPs) illus- trate the syntactic structure between entities according to a sequence of directed binary grammatical relations. Unlike surface form predicates, grammatical de- pendency is a common feature in all languages. We also injected entity types into the matrix, expecting the effect of narrowing of the range of relations that the model should predict. To verify that our model shows a decent performance, we constructed two DPUSs (baseline/extended) on a Korean dataset and measured the average precision (AP) for each ontological relation. 2 Methods We first built a matrix with entity tuples as rows and a combination of KB rela- tions, SDPs, and entity types as columns. For all observed and unobserved facts, we filled in 1 and 0, respectively in the corresponding slot. We extracted SDPs between the target entities, including directions of each dependency. Figure 1 illustrates the filling of the DPUS matrix based on raw text. Our model mainly employs the nfe matrix factorization model from [5], which showed the best per- formance among the proposed models. The embedding vector dimensions were optimized using the leave-one-out cross-validation. To ensure consistency of slot values, we normalized each row on a scale of 0 to 1. 3 Experiments 3.1 Dataset and Evaluation For the baseline, we built a DPUS by using Korean DBpedia [2] as the KB and Korean Wikipedia distant supervision data as raw text. The distant supervi- sion data is obtained from Korean Wikipedia documents, which were filtered by Korean DBpedia triples. Specifically, we extracted sentences, each of which contained two entities with a relation in Korean DBpedia. We rejected sentences with SDP of length longer than three because these SDPs are rarely observed in our dataset. For our target relation set, we selected the most frequent 35 predefined KB relations, each with more than 200 entity tuples in the distant supervision data. Then, we chose entity tuples with two or more SDPs in the dis- tant supervision data. We finally obtained 12,360 entity tuples, 35 KB relations, and 348 frequent SDPs. The size of the baseline DPUS matrix was 12360 × 383. For the extended model, we added coarse-grained entity types extracted by Named Entity Recognition (NER) of each entity pair to the baseline DPUS matrix. We used the following entity type categorization: artifacts (e.g., TV programs, books, and movies), data/time, locations (e.g., countries, cities, and towns), organizations (e.g., governments, public corporations, and companies), persons, and others. For each entity tuple, we added a new slot indicating the entity types in the tuple. A total of 15 new columns were added to the baseline matrix, extending the size of the extended matrix to 12360 × 398. For evaluation, we performed a ten-fold experiment to measure APs for each KB relation over all the entity tuples in the test set. Then we measured MAP and WMAP for the performance of the whole model. MAP is simply the average of nsubjpass nmod:at The White House is located at Pennsylvania Avenue in Washington, D.C.. Entity 1 Entity 2 [nsubjpass, Entity tuple < White House, Pennsylvania Avenue > location founder -nmod:at] SDP [nsubjpass, -nmod:at] Entity Types < LC, LC > Ontological Relations SDPs Entity Types Parsing result DPUS matrix Fig. 1. Example of building DPUS matrix. The two entities with an ontological relation location are connected with the SDP [nsubjpass, -nmod:at], and both the entities are of type location (LC ). We convert these parsing results into the DPUS matrix by filling the corresponding slots. Note that the ”-” symbol in SDP implies inverse dependency. APs, and WMAP is the weighted version of MAP in which each AP is weighted by the number of entity tuples in each relation. MAP and WMAP are shown by a previous work [3] to be robust and stable metrics for evaluating classification models. 3.2 Results and Discussion The experimental results of all KB relations are listed in Table 1. Winners of each relation are marked in bold. The baseline DPUS showed performances of 0.62 (MAP) and 0.64 (WMAP), while the extended version of DPUS showed performances of 0.72 (MAP) and 0.75 (WMAP), respectively. Thus, the per- formance shows great improvement after the injection of entity types. This is because for each entity tuple, the model can consider a significantly fewer set of relations when the entity types are decided. Even though the extended DPUS shows better overall performance, the base- line method performs better in some relations, such as birthPlace and writer. As most of these relations are represented by unique SDPs, the injection of entity types seems to confuse the model. 4 Conclusion In this study we introduced DPUSs, which are a novel expansion of universal schemas. DPUSs are built by using SDPs and entity tuple types instead of sur- face form predicates. SDPs contain information about the syntactic structure of a sentence, which makes SDPs a suitable alternative of surface form predi- cates. Entity tuple types greatly improve performance by reducing the relation candidates for the model. Our model has higher potential in terms of future de- velopment and versatility because DPUS is based on globally consistent features rather than language specific features. Table 1. AP and WMAP of of two DPUS models for 35 KB relations. The ”#” column indicates the number of triples in the test set with the corresponding relation. The ”+NER” column indicates the results of the extended DPUS with entity types injected. Acknowledgement This work was supported by Institute for Information & communications Tech- nology Promotion(IITP) grant funded by the Korea government(MSIT) (2017- 0-01780, The technology development for event recognition/relational reasoning and learning knowledge based system for video understanding) References 1. Etzioni, O., Banko, M., Soderland, S., Weld, D.S.: Open information extraction from the web. Communications of the ACM 51(12), 68–74 (2008) 2. Kim, E.k., Weidl, M., Choi, K.S., Auer, S.: Towards a korean dbpedia and an ap- proach for complementing the korean wikipedia based on dbpedia. OKCon 575, 12–21 (2010) 3. Manning, C.D., Raghavan, P., Schütze, H., et al.: Introduction to information re- trieval, vol. 1. Cambridge university press Cambridge (2008) 4. Neelakantan, A., Roth, B., Mc-Callum, A.: Compositional vector space models for knowledge base inference. In: 2015 AAAI spring symposium series (2015) 5. Riedel, S., Yao, L., McCallum, A., Marlin, B.M.: Relation extraction with matrix factorization and universal schemas. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 74–84 (2013) 6. Yao, L., Riedel, S., McCallum, A.: Universal schema for entity type prediction. In: Proceedings of the 2013 workshop on Automated knowledge base construction. pp. 79–84. ACM (2013)