Analysis of Distant Supervision for Relation Extraction Dataset

Kijong Han1, Sangha Nam1, YoungGyun Hahm1, Jiseong Kim1, Jiho Kim1, Jin-Dong Kim1,2, Key-Sun Choi1
1 Korea Advanced Institute of Science and Technology, South Korea
2 Database Center for Life Science, Japan
{han0ah, nam.sangha, hahmyg, jiseong, hogajiho}@kaist.ac.kr, jdkim@dbcls.rois.ac.jp, kschoi@kaist.ac.kr

Abstract. Deep learning techniques have been applied to the relation extraction task and have demonstrated remarkable performance. However, the results of these approaches are difficult to interpret and are sometimes counter-intuitive. In this paper, we analyze the ontological and linguistic features of a relation extraction dataset and the pros and cons of existing methods for each feature type. This analysis could help in designing improved relation extraction methods by providing more insight into the dataset and the models.

Keywords: information extraction, relation extraction, dataset analysis

1 Introduction

Relation Extraction (RE) is the task of extracting, from unstructured natural language text, semantic triples consisting of an entity pair and the relation between the entities. Supervised learning approaches for RE require a large amount of labelled training data, which takes considerable human effort to produce. To address this problem, distant supervision (DS) [4] is widely used these days. Despite its usefulness, DS for RE has a problem: because the labelled data is generated automatically, it contains wrongly labelled instances, which introduce noise.

Statistical machine learning and deep learning have been applied to mitigate this problem and have demonstrated remarkable performance improvements [4,7]. However, the results of these approaches are difficult to interpret and are sometimes counter-intuitive. Thus, to interpret the results of existing RE methods, we analyze the ontological and linguistic features of a RE dataset and the pros and cons of existing methods for each feature type. This analysis provides more insight into the dataset and the characteristics of existing RE methods, and so could help in designing an improved RE method. A convolutional neural network (CNN)-based method [7] and a Markov logic network (MLN)-based method [1] were selected for analysis. Our implementations are available at http://github.com/machinereading/re-cnn (CNN) and http://github.com/machinereading/re-mln (MLN).

2 Background

We were inspired by a study that constructed and analyzed a dataset for the recognizing textual entailment (RTE) task [3]. That study categorized an RTE dataset according to linguistic phenomena and analyzed it by applying previous RTE methods. We conducted an analogous analysis adapted to the RE task.

We selected two existing RE methods for analysis: a CNN-based method [7] and an MLN-based method [1]. We chose CNN as a representative of deep learning-based methods, and MLN as a representative of methods that utilize logic rules and hand-crafted features. An MLN is a model that combines a Markov random field with weighted logic rules [6]. It represents information as first-order logic predicates and formulas, e.g., HasFea(Di, write) ⇒ Label(Di, author): if data instance Di has the feature word 'write', then the relation label of Di is 'author'. Each formula has a weight representing its confidence, and the weights are trained statistically from the dataset. Using these weights, the model can calculate the probability that a ground predicate is true. Because the model exposes its weighted logic rules, this information is very useful for analyzing the dataset.
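To make the weighted-rule mechanism concrete, the following is a minimal sketch of the log-linear scoring that underlies this kind of inference. It is not the implementation of [1]; the rule set, weights, and function names are illustrative assumptions.

```python
import math

# Illustrative weighted rules of the form HasFea(d, feature) => Label(d, label).
# The weights here are made up; in a real MLN they are learned from data.
RULES = [
    ("lemma:write",   "author",  2.1),   # dependency-lemma feature
    ("type:Person",   "author",  0.6),   # entity-type feature
    ("lemma:capital", "capital", 2.8),
]

def score(features, label):
    """Sum the weights of every rule that fires for this label."""
    return sum(w for feat, lab, w in RULES if lab == label and feat in features)

def label_probabilities(features, labels):
    """Softmax over the summed formula weights, mirroring the log-linear
    form by which an MLN turns fired weighted formulas into a probability."""
    z = sum(math.exp(score(features, l)) for l in labels)
    return {l: math.exp(score(features, l)) / z for l in labels}

print(label_probabilities({"lemma:write", "type:Person"},
                          ["author", "capital"]))
# 'author' dominates because both of its rules fire for this instance
```

Because every prediction can be traced back to the formulas that fired and their weights, this representation is what makes the per-feature analysis in Section 4 possible.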
3 Analysis Setup

3.1 Dataset

We used the Korean DS-for-RE dataset [5], which was constructed from Korean Wikipedia sentences (July 2017 dump) and K-Box triples. K-Box is a knowledge base extended from the Korean DBpedia. We randomly sampled from this dataset: a total of 13,489 DS training instances, 4,096 gold test instances, and the 30 most frequent relations were used for this study. The gold test data was constructed by removing wrongly labelled instances from the DS output, a task carried out by 14 part-time students hired by our research team. In this paper, we refer to a single sentence with a designated entity pair as 'a data instance'.

3.2 Method

First, we analyzed the overall performance of the two methods. Second, we selected and classified four features by manually analyzing the data, and we investigated how important each feature type is for predicting the relation of a data instance. These features are also used as the MLN features. The features are as follows:

(1) Entity type: fine-grained entity types defined by K-Box, which originate from DBpedia ontology classes (e.g., the entity type of Lionel Messi is Athlete).
(2) Entity modifier: a modifier is a sentence component that modifies another component. For example, in the sentence 'John, who is the author of ...', 'author' is a clausal modifier of the entity 'John'.
(3) Lemmas in a dependency path: lemmas on the dependency path between the entity pair; many previous studies have also leveraged this feature [1,4].
(4) Context lemmas: lemmas of the remaining words in the sentence, i.e., words that are neither on the dependency path nor modifiers.

Fig. 1. Precision per weight of each feature type

4 Analysis Results

Overall Performance. The best F1-score was 0.616 (CNN) and 0.611 (MLN); accuracy was 0.584 (CNN) and 0.594 (MLN). The F1-score measures how accurately a method extracts triples and was calculated as in previous studies [1,4]. Accuracy was measured by considering only the best prediction for each data instance. Overall, both models showed similar performance.

Importance of each Feature Type. We investigated the importance of each feature type by measuring precision as a function of the formula weights in the MLN. In the MLN method, each prediction of a relation label for a data instance comes with the list of weighted formulas affecting that prediction, as described in Section 2. The higher the weight, the more important the feature in the formula, according to the training-data statistics. Each graph in Figure 1 is drawn considering only the weights of one feature type. On the X axis, the X% point represents the portion of the data whose weights for that feature type fall in the top (X-10%, X%] interval; the Y axis shows the precision on that portion of the data, again considering only the best prediction per data instance (a sketch of this bucketing procedure is given at the end of this section). In all graphs of Figure 1, the MLN curve is below CNN in the low-weight range and above CNN in the high-weight range; that is, the MLN curve correlates with the weights more strongly than the CNN curve does. Thus, all four features are meaningful to some degree. For entity type (a), entity modifier (b), and dependency lemmas (c), the MLN curve reaches close to 1.0 precision on the top 0-20% highest-weight data, which means that these three features are crucial for certain RE sentences.

Fig. 2. Examples of a simple N of N pattern

Method   N-of-N pattern   All data
MLN      0.718            0.594
CNN      0.493            0.584

Fig. 3. Accuracy per data type

Simple N of N Pattern. We found a data pattern that is very intuitive for determining a relation but that CNN handles poorly. The pattern simply consists of entity1 appearing inside an 'N of N' phrase and entity2 being modified by that phrase; examples are shown in Figure 2. There are 71 data instances of this pattern in total. In this pattern, the clue word (e.g., 'capital' in Figure 2) is strong evidence for inferring the relation. MLN exploits this clue word as strong evidence because the word acts as both an entity-modifier and a dependency-lemma feature. Thus, as Figure 3 shows, MLN achieves higher accuracy on this pattern than its overall accuracy, while CNN falls well below its own (a minimal detector for the pattern is sketched below).
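As an illustration, here is a minimal, hypothetical detector for the simple N-of-N pattern. Since the actual data is Korean, the sentence, entity spans, and helper name below are English stand-ins and purely assumptions, not part of our pipeline.

```python
import re

def n_of_n_clue(sentence, entity1, entity2):
    """Heuristic check: entity1 sits inside an 'N of N' phrase that modifies
    entity2, as in '<e2>Paris</e2>, the capital of <e1>France</e1>'.
    Returns the clue noun ('capital') if the pattern matches, else None."""
    # 'the <clue> of <entity1>' used as an appositive right after entity2
    pattern = re.escape(entity2) + r",? the (\w+) of " + re.escape(entity1)
    m = re.search(pattern, sentence)
    return m.group(1) if m else None

# Hypothetical English stand-in for a Korean data instance
clue = n_of_n_clue("Paris, the capital of France, hosted the summit.",
                   entity1="France", entity2="Paris")
print(clue)  # -> 'capital': strong evidence for a capital-of relation
```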
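And here is a minimal sketch of the bucketing procedure behind Figure 1, referenced under 'Importance of each Feature Type' above. The input layout is an assumption: each prediction is reduced to the top formula weight of one feature type plus a correctness flag.

```python
def precision_per_weight_bucket(predictions, n_buckets=10):
    """predictions: list of (weight, is_correct) pairs, where `weight` is the
    highest formula weight of the given feature type used in the prediction.
    Sorts by weight (descending) and reports precision for each top slice:
    bucket 0 covers the top 0-10% weights, bucket 1 the 10-20% slice, etc."""
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    size = max(1, len(ranked) // n_buckets)
    precisions = []
    for i in range(n_buckets):
        bucket = ranked[i * size:(i + 1) * size] or ranked[-size:]
        precisions.append(sum(correct for _, correct in bucket) / len(bucket))
    return precisions

# Toy data: higher-weight predictions tend to be correct, as in the MLN curves
preds = [(w / 100.0, w > 40) for w in range(100)]
print(precision_per_weight_bucket(preds))  # close to 1.0 for the top buckets
```

Computing these per-bucket precisions separately for the MLN and CNN predictions yields curves of the kind shown in Figure 1.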
5 Conclusion

We analyzed the ontological and linguistic features of a RE dataset, as well as the pros and cons of existing methods for each feature type. We expect that the insights into the RE dataset gained in this study can help in designing improved RE methods. For example, important features (e.g., entity type) could be fed to the network as additional discrete feature vectors, or high-precision rules derived from patterns (e.g., the simple N-of-N pattern in Section 4) could be combined into the neural network architecture using a model such as [2].

Acknowledgement. This work was supported by the Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (2013-0-00109, WiseKB: Big data based self-evolving knowledge base and reasoning platform).

References

1. Han, X., Sun, L.: Global distant supervision for relation extraction. In: AAAI (2016)
2. Hu, Z., Ma, X., Liu, Z., Hovy, E., Xing, E.: Harnessing deep neural networks with logic rules. In: ACL (2016)
3. Kaneko, K., Miyao, Y., Bekki, D.: Building Japanese textual entailment specialized data sets for inference of basic sentence relations. In: ACL (2013)
4. Mintz, M., Bills, S., Snow, R., Jurafsky, D.: Distant supervision for relation extraction without labeled data. In: ACL-IJCNLP (2009)
5. Nam, S., Han, K., Kim, E.k., Choi, K.S.: Distant supervision for relation extraction with multi-sense word embedding. In: GWC Workshop on Wordnets and Word Embeddings (2018)
6. Richardson, M., Domingos, P.: Markov logic networks. Machine Learning 62 (2006)
7. Zeng, D., Liu, K., Lai, S., Zhou, G., Zhao, J.: Relation classification via convolutional deep neural network. In: COLING (2014)