Introduction

Universal Schemas Using Shortest Dependency Paths for Free Word Order Languages

Jiho Kim

Sangha Nam

Key-Sun Choi

kschoig@kaist.ac.kr 0 0 Korea Advanced Institute of Science and Technology , Republic of Korea

Universal schemas are a remarkable approach for solving relation extraction, in which new facts are extracted by jointly learning latent feature vectors of entity pairs and relation types through matrix factorization models. However, in free word order languages where surface form predicates do not fully reveal the characteristics of a sentence, universal schemas cannot be constructed in the same manner. Therefore, in this study, we introduce a novel expansion of universal schemas, dependency-path-based universal schemas. Our model uses shortest dependency paths and entity types instead of surface form predicates. For veri cation of our model, we constructed and evaluated a universal schema in Korean with a combination of Korean DBpedia and Korean Wikipedia distant supervision data.

Universal Schemas Relation Extraction Shortest Depen- dency Paths

Introduction

Relation extraction (RE) is a core step in various natural language processing applications, and can be de ned as the task of nding ontological relations between two target entities in a sentence. The concept of universal schemas [ 5 ] was proposed as a considerable approach in RE. A universal schema can be constructed using a combination of all usable knowledge bases (KBs) and natural language text, then expressed through a huge matrix with entity tuples as rows and relations as columns. The novelty of universal schemas comes from avoiding pre-labeled datasets by using surface form predicates as a source for relation types, and mutually supporting both unstructured and structured data.

In English, surface form predicates between two entities can usually serve as relations [ 1 ]. However, in free word order languages, the extraction of surface pattern relations between entities is di cult owing to exible word orders. Without the presence of language-speci c surface form predicate extractors, universal schemas cannot be built in these languages. Even though several algorithmic extensions have been studied [ 4, 6 ], none have solved the dependency on surface form predicates.

In this study, we propose a novel expansion of universal schemas: dependency path based universal schemas (DPUSs). Shortest dependency paths (SDPs) illustrate the syntactic structure between entities according to a sequence of directed binary grammatical relations. Unlike surface form predicates, grammatical dependency is a common feature in all languages. We also injected entity types into the matrix, expecting the e ect of narrowing of the range of relations that the model should predict. To verify that our model shows a decent performance, we constructed two DPUSs (baseline/extended) on a Korean dataset and measured the average precision (AP) for each ontological relation. 2

Methods

We rst built a matrix with entity tuples as rows and a combination of KB relations, SDPs, and entity types as columns. For all observed and unobserved facts, we lled in 1 and 0, respectively in the corresponding slot. We extracted SDPs between the target entities, including directions of each dependency. Figure 1 illustrates the lling of the DPUS matrix based on raw text. Our model mainly employs the nfe matrix factorization model from [ 5 ], which showed the best performance among the proposed models. The embedding vector dimensions were optimized using the leave-one-out cross-validation. To ensure consistency of slot values, we normalized each row on a scale of 0 to 1. 3 3.1

Experiments

Dataset and Evaluation For the baseline, we built a DPUS by using Korean DBpedia [ 2 ] as the KB and Korean Wikipedia distant supervision data as raw text. The distant supervision data is obtained from Korean Wikipedia documents, which were ltered by Korean DBpedia triples. Speci cally, we extracted sentences, each of which contained two entities with a relation in Korean DBpedia. We rejected sentences with SDP of length longer than three because these SDPs are rarely observed in our dataset. For our target relation set, we selected the most frequent 35 prede ned KB relations, each with more than 200 entity tuples in the distant supervision data. Then, we chose entity tuples with two or more SDPs in the distant supervision data. We nally obtained 12,360 entity tuples, 35 KB relations, and 348 frequent SDPs. The size of the baseline DPUS matrix was 12360 383.

For the extended model, we added coarse-grained entity types extracted by Named Entity Recognition (NER) of each entity pair to the baseline DPUS matrix. We used the following entity type categorization: artifacts (e.g., TV programs, books, and movies), data/time, locations (e.g., countries, cities, and towns), organizations (e.g., governments, public corporations, and companies), persons, and others. For each entity tuple, we added a new slot indicating the entity types in the tuple. A total of 15 new columns were added to the baseline matrix, extending the size of the extended matrix to 12360 398.

For evaluation, we performed a ten-fold experiment to measure APs for each KB relation over all the entity tuples in the test set. Then we measured MAP and WMAP for the performance of the whole model. MAP is simply the average of nsubjpass

nmod:at The White House is located at Pennsylvania Avenue in Washington, D.C..

Entity 1 Entity 2 Entity tuple Relation SDP Entity Types < White House, Pennsylvania Avenue >

location [nsubjpass, -nmod:at]

< LC, LC > Parsing result

<White House, Pennsylvania Avenue> location founder -[nnsmuobdjp:aats]s, <LC, LC> 1 0 1

1 Ontological Relations SDPs Entity Types

DPUS matrix APs, and WMAP is the weighted version of MAP in which each AP is weighted by the number of entity tuples in each relation. MAP and WMAP are shown by a previous work [ 3 ] to be robust and stable metrics for evaluating classi cation models. The experimental results of all KB relations are listed in Table 1. Winners of each relation are marked in bold. The baseline DPUS showed performances of 0.62 (MAP) and 0.64 (WMAP), while the extended version of DPUS showed performances of 0.72 (MAP) and 0.75 (WMAP), respectively. Thus, the performance shows great improvement after the injection of entity types. This is because for each entity tuple, the model can consider a signi cantly fewer set of relations when the entity types are decided.

Even though the extended DPUS shows better overall performance, the baseline method performs better in some relations, such as birthPlace and writer. As most of these relations are represented by unique SDPs, the injection of entity types seems to confuse the model. 4

Conclusion

In this study we introduced DPUSs, which are a novel expansion of universal schemas. DPUSs are built by using SDPs and entity tuple types instead of surface form predicates. SDPs contain information about the syntactic structure of a sentence, which makes SDPs a suitable alternative of surface form predicates. Entity tuple types greatly improve performance by reducing the relation candidates for the model. Our model has higher potential in terms of future development and versatility because DPUS is based on globally consistent features rather than language speci c features.

Acknowledgement

This work was supported by Institute for Information & communications Technology Promotion(IITP) grant funded by the Korea government(MSIT) (20170-01780, The technology development for event recognition/relational reasoning and learning knowledge based system for video understanding)

1. Etzioni , O. , Banko , M. , Soderland , S. , Weld , D.S. : Open information extraction from the web . Communications of the ACM 51 ( 12 ), 68 { 74 ( 2008 )

2. Kim , E.k. , Weidl , M. , Choi , K.S. , Auer , S. : Towards a korean dbpedia and an approach for complementing the korean wikipedia based on dbpedia . OKCon 575 , 12{ 21 ( 2010 )

3. Manning , C.D. , Raghavan , P. , Schutze, H., et al.: Introduction to information retrieval , vol. 1 . Cambridge university press Cambridge ( 2008 )

4. Neelakantan , A. , Roth , B. , Mc-Callum , A. : Compositional vector space models for knowledge base inference . In: 2015 AAAI spring symposium series ( 2015 )

5. Riedel , S. , Yao , L. , McCallum , A. , Marlin , B.M.: Relation extraction with matrix factorization and universal schemas . In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies . pp. 74 { 84 ( 2013 )

6. Yao , L. , Riedel , S. , McCallum , A. : Universal schema for entity type prediction . In: Proceedings of the 2013 workshop on Automated knowledge base construction . pp. 79 { 84 . ACM ( 2013 )