Making Efficient Use of a Domain Expert's Time in Relation Extraction

Linara Adilova, Sven Giesselbach, and Stefan Rüping
Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS
Schloss Birlinghoven, 53757 Sankt Augustin, Germany
{linara.adilova,sven.giesselbach,stefan.rueping}@iais.fraunhofer.de

Abstract. Scarcity of labeled data is one of the most frequent problems faced in machine learning. This is particularly true for relation extraction in text mining, where large corpora of texts exist in many application domains, while labeling text data requires an expert to invest much time in reading the documents. Overall, state-of-the-art models, like the convolutional neural network used in this paper, achieve great results when trained on large enough amounts of labeled data. However, from a practical point of view the question arises whether this is the most efficient approach when one takes the manual effort of the expert into account. In this paper, we report on an alternative approach where we first construct a relation extraction model using distant supervision, and only later make use of a domain expert to refine the results. Distant supervision provides a means of labeling data given known relations in a knowledge base, but it suffers from noisy labeling. We introduce an active learning based extension that allows our neural network to incorporate expert feedback, and report on first results on a complex data set.

Keywords: relation extraction, convolutional neural networks, distant supervision, multi-instance learning, interpretability, expert feedback

1 Introduction

Nowadays, huge collections of textual data exist that do not only include interesting documents for humans to read, but can also be mined for interesting knowledge, which can be further stored in structured form. Examples include extracting general world knowledge from Wikipedia [Vrandečić and Krötzsch, 2014], extracting knowledge about interactions of drugs, genes, and diseases from PubMed [Craven et al., 1999, Herrero-Zazo et al., 2013], or definitions from arbitrary scientific publications [Augenstein et al., 2017]. Currently, machine learning methods based on deep neural networks play an important role in the extraction of knowledge from texts, achieving top results on many benchmark data sets [dos Santos et al., 2015, Lee et al., 2017]. However, experience shows that deep learning works best in a supervised setting with a massive amount of labeled data. In practical applications, the effort to manually curate a large enough labeled data set is often prohibitively high, in particular in more specialized domains where highly trained domain experts are required. Even in cases where one is willing to invest a high manual effort, it may make more sense to extract the required knowledge completely manually because of the unfavorable ratio between effort and precision/recall for a supervised machine learning approach. In this paper, we address the problem of extracting relations from a large collection of documents with only negligible manual effort on the side of a domain expert. We target situations where knowledge extraction currently cannot be applied economically with respect to the manual effort required.
Our approach is based on the idea of integrating the expert into the knowledge extraction process not merely as a labeling device before a deep learning method is trained, but by enabling the expert to understand the extracted model and to give high-level feedback on the results, which is then used to optimize the model. A way of making use of knowledge - in the form of knowledge graphs - in relation extraction is distant supervision. In distant supervision, knowledge that is stored in a knowledge graph is aligned with textual corpora. This yields a cheap way of automatically labeling new training data.

This paper is organized as follows: in the next section, we discuss background and related work on text mining methods with a focus on deep learning. Section 3 introduces our approach, both giving a detailed description of the interactive knowledge extraction process and of the distantly supervised deep network that is applied in the intermediate steps. Section 4 gives first empirical results on the proposed approach. Section 5 concludes and gives an outlook to future research.

2 Background and Related Work

In this section we shortly describe the theoretical background behind this paper and set it into the context of related work.

2.1 Relation Extraction

The task of relation extraction is about extracting semantic meaning from sentences and texts that contain mentions of two entities. This semantic meaning is then aligned with one of the pre-defined relations (a so-called "fixed schema"), or it can also be taken in its natural form as a new relation [Riedel et al., 2013]. The classical and most widely used method for relation extraction is to compute all possible linguistic characteristics of the raw textual data and then apply different kinds of classifiers on top of these constructed feature vectors [Zhang et al., 2006, Culotta and Sorensen, 2004]. With the development of deep learning, this approach was also applied to relation extraction [Zeng et al., 2014, Nguyen and Grishman, 2015, dos Santos et al., 2015]. Also, the question of finding ways to perform relation extraction without costly construction of training datasets has always received a lot of attention. For example, Open Information Extraction [Banko et al., 2007] performs the extraction without any human input. Furthermore, distant supervision was introduced [Mintz et al., 2009] as a way of utilizing existing structured data to obtain a training dataset without manual labeling of the examples.

2.2 Ranking Convolutional Neural Networks for Relation Extraction

In [dos Santos et al., 2015], a convolutional neural network for relation extraction is introduced. The model consists of multiple layers, which we quickly describe in this subsection:

1. Word embedding layer: transforms the words of the input sentence into embeddings. Every word w_i of the sentence is mapped to an embedding vector r_wi, a row of the embedding matrix W_wrd for some fixed-size vocabulary.
2. Distance embedding layer: transforms the distances between the words in the sentence and the two marked named entities into the embedding vectors wp1 and wp2. This approach was introduced by [Zeng et al., 2014].
3. Embedding merge layer: concatenates the word embedding r_w and the corresponding distance embeddings (to the first and to the second named entity) wp1 and wp2 for every word w in the input sentence into one vector.
4. Convolutional layer: a convolution is applied to windows of three embedding vectors with zero padding, so the size of the input is not changed by the layer. The number of filters dc is 1000, where each value in one output vector is a feature value for a specific triplet of words.
5. Global max pooling: the maximal value is found for each filter.
6. Scoring dense layer: in order to classify relations, the closeness of the sentence representation to real-valued vectors representing each of the relations, which are learned during the training process, is estimated. The scoring procedure is implemented as a dense layer without bias whose weight matrix W_classes consists of the relation embeddings.

The objective function uses two of the resulting scores: the score that was obtained for the correct relation according to the label of the example, and the score of one of the wrong relations. Thus, the objective function is calculated as follows:

– get the score for the correct relation;
– get the maximal score among the remaining wrong relations;
– calculate the value of the loss according to the formula

L = log(1 + exp(γ(m+ - s_θ(x)_{y+}))) + log(1 + exp(γ(m- + s_θ(x)_{c-})))

where m+ (m-) is a margin for the right (wrong) answer, γ is a scaling factor, s_θ(x)_{y+} is the score for the right class and s_θ(x)_{c-} is the score for the wrong class.
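To make the architecture and the ranking loss concrete, the following minimal PyTorch sketch shows one possible implementation of the layers listed above. It is our own illustration rather than the authors' code; the class and function names are ours, the dimensions are placeholders, and the default margins and scaling factor are the values commonly reported for this loss.

```python
# Minimal PyTorch sketch of the ranking CNN described above (our illustration,
# not the original implementation). Dimensions are placeholders.
import torch
import torch.nn as nn

class RankingCNN(nn.Module):
    def __init__(self, vocab_size, n_relations, word_dim=300,
                 pos_dim=70, max_dist=60, n_filters=1000):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)       # 1. word embeddings
        self.pos_emb1 = nn.Embedding(2 * max_dist + 1, pos_dim)  # 2. distance embeddings
        self.pos_emb2 = nn.Embedding(2 * max_dist + 1, pos_dim)
        merged_dim = word_dim + 2 * pos_dim                      # 3. merged embedding size
        # 4. convolution over windows of three merged embeddings, zero-padded
        self.conv = nn.Conv1d(merged_dim, n_filters, kernel_size=3, padding=1)
        # 6. bias-free scoring layer whose rows play the role of W_classes
        self.rel_emb = nn.Linear(n_filters, n_relations, bias=False)

    def forward(self, words, dist1, dist2):
        # words, dist1, dist2: LongTensors of shape (batch, seq_len)
        x = torch.cat([self.word_emb(words),
                       self.pos_emb1(dist1),
                       self.pos_emb2(dist2)], dim=-1)       # (batch, seq_len, merged_dim)
        x = torch.tanh(self.conv(x.transpose(1, 2)))        # (batch, n_filters, seq_len)
        x = x.max(dim=2).values                              # 5. global max pooling
        return self.rel_emb(x)                               # one score per relation

def ranking_loss(scores, labels, m_pos=2.5, m_neg=0.5, gamma=2.0):
    """Pairwise ranking loss: push the correct relation's score above m_pos and
    the best-scoring wrong relation's score below -m_neg."""
    correct = scores.gather(1, labels.unsqueeze(1)).squeeze(1)
    masked = scores.scatter(1, labels.unsqueeze(1), float('-inf'))
    wrong = masked.max(dim=1).values
    return (torch.log1p(torch.exp(gamma * (m_pos - correct))) +
            torch.log1p(torch.exp(gamma * (m_neg + wrong)))).mean()
```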
2.3 Distant Supervision

While deep learning architectures, such as the one discussed in the previous section, show excellent performance given enough labeled examples, in practice the problem arises that, although abundant sentences can usually be found, labeling enough of them is hard. Labeling is usually done either by crowdsourcing [Angeli et al., 2014] with non-experts - which negatively influences the quality of the labels - or by experts, which in practice very much limits the amount of labeled examples one can generate. In order to alleviate this problem, the approach of distant supervision [Mintz et al., 2009] has been proposed. To apply this concept, a structured knowledge base that contains examples of the desired relation is needed in addition to the text corpus. The knowledge base is used to automatically generate examples of the relation by aligning the entities from the knowledge base with the text; for this alignment, simple string matching or more complex entity recognition solutions can be used. Hence, distant supervision relies on the following two assumptions:

1. For every triple (e1, e2, r) in a knowledge base, every sentence containing mentions of e1 and e2 expresses the relation r.
2. Every triple that is not in the knowledge base is assumed to be a false example for a relation (even though the reason might be the incompleteness of the knowledge base).

Evidently, the better the knowledge base and the text corpus fulfill these assumptions, the better one can expect the approach of distant supervision to work. In practice, it must be assumed that in addition to correct example sentences for the relation, additional noise is introduced.

2.4 Multi-instance Learning

For coping with the noise introduced by distant supervision, we apply multi-instance learning as described in [Zeng et al., 2015]. Multi-instance learning was first introduced for drug classification [Dietterich et al., 1997]. Applied to relation extraction, multi-instance learning means that we assume the existence of at least one sentence containing a description of the relation from the knowledge base. The set of all sentences that mention the same entity pair is therefore considered as one bag, and the bag carries the label of the relation from the knowledge base. In order to apply neural networks to this bag-based training, the example with the maximal score is chosen from the bag every time to fit the model, while all bags are shuffled from epoch to epoch. This approach still loses a lot of the possibly useful information obtained by distant supervision, but it serves as an initial step for possible improvement of the approach.
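The two building blocks just described can be illustrated as follows. The sketch below is a simplification under our own naming, not the pipeline used in the experiments: it builds bags by naive string matching of knowledge-base triples against a sentence corpus and then performs one epoch of multi-instance training that uses only the highest-scoring sentence of each bag. RankingCNN and ranking_loss refer to the previous sketch, and encode is a hypothetical helper that turns a sentence and its entity pair into the (words, dist1, dist2) index tensors.

```python
# Sketch of distant supervision plus multi-instance learning, building on the
# RankingCNN / ranking_loss sketch above. build_bags uses naive string matching;
# encode is a hypothetical helper producing the (words, dist1, dist2) tensors.
import random
from collections import defaultdict
import torch

def build_bags(kb_triples, sentences):
    """kb_triples: iterable of (e1, e2, relation); sentences: list of strings.
    Every sentence mentioning both entities of a triple goes into the bag of
    that entity pair, labeled with the relation from the knowledge base."""
    bags = defaultdict(list)
    for e1, e2, rel in kb_triples:
        for sent in sentences:
            if e1 in sent and e2 in sent:      # simple string matching alignment
                bags[(e1, e2, rel)].append(sent)
    return {key: val for key, val in bags.items() if val}

def train_epoch(model, optimizer, bags, rel2id, encode):
    bag_items = list(bags.items())
    random.shuffle(bag_items)                  # bags are re-shuffled every epoch
    for (e1, e2, rel), bag in bag_items:
        label = torch.tensor([rel2id[rel]])
        # pick the sentence the current model scores highest for the bag's label
        with torch.no_grad():
            scores = torch.stack([model(*encode(s, e1, e2))[0] for s in bag])
        best = scores[:, rel2id[rel]].argmax().item()
        optimizer.zero_grad()
        loss = ranking_loss(model(*encode(bag[best], e1, e2)), label)
        loss.backward()
        optimizer.step()
```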
2.5 Interpretability of a Deep Neural Network

The interpretability of machine learning models, and in particular of deep networks, is currently receiving much attention. Several different directions for making a deep neural network model understandable to a domain expert have been proposed in the literature:

Rule extraction: early approaches often focused on the extraction of rules or other understandable representations from neural networks, e.g. [Thrun, 1995]. However, for complex data such as texts, and complex models, there is rarely a concise understandable model that summarizes the whole network, hence these approaches have fallen out of interest.

Relevance propagation and feature weights: for each individual prediction, it is possible to trace which part of the model in each layer was how relevant for taking the final decision [Bach et al., 2015, Binder et al., 2016]. Taking the relevance propagation back to the level of the input features, this gives an importance for each input feature.

Local Approximations: each prediction of the classifier can be approximated locally by a simpler model [Ribeiro et al., 2016]. The understandability of the approximating model can be guaranteed by using a simple model class, e.g. a linear model.

Joint Models: in certain cases, it is also possible to construct models that give both a highly accurate prediction and a reason for the prediction. E.g. in [Lei et al., 2016], an explanatory phrase is extracted together with the prediction.

Instance-based models: these methods explain a classifier by means of representative instances, such as prototypes (typical well-classified instances) or critics (typical mis-classified instances) [Kim et al., 2016].

In general, generic methods for classifier interpretation are hard to adapt to models working on text data, since text is a complex data type of low structure. In the case of the approach of [dos Santos et al., 2015], which is used in this paper, the authors suggest to extract representative trigrams from the text. The idea is based on relevance propagation, but makes use of the property that in this model the convolutional layer is applied to three embeddings of consecutive words at a time. Therefore, for each trigram in the original sentence, its relevance for the predicted relation can easily be traced back. More precisely, representative trigrams can be obtained from the sentences of the dataset by measuring the value that each trigram in a sentence contributes to the correct class score. The value is simply the sum of all score positions that are traced back to that specific trigram. This method is very similar to the one mentioned in [Craven et al., 1999], where the most valuable words were extracted in order to gain insight into the concept learned by the model.

3 Model Description

For our experiments we first implemented the ranking convolutional neural network as described in [dos Santos et al., 2015]. We added multi-instance learning to cope with noise from distant supervision, as well as a feedback loop for domain experts that lets them improve the performance by evaluating the most representative trigrams for each relation type.

3.1 Expert Feedback for Training Data Curation

An obvious problem with data set creation via distant supervision is that it adds training sentences that do not represent the relation they were sampled for. Instead of letting experts review all of those samples, we propose an approach in which the expert does not review each sentence but rather the concepts that the neural network learned for each relation. The representative trigrams that our model learns for each relation class can be regarded as its concepts for that relation. If the concepts make sense, our model has most likely achieved a good understanding of the relation from the distantly supervised data. If not, assuming that our model is appropriate, the training data is probably not representative for the relations. We propose the following workflow for dataset creation and model improvement, displayed in Figure 1:

1. Acquire/construct a knowledge base with representative facts for the relations of interest and text corpora that contain information about the entities and relations of interest.
2. Align the knowledge of the knowledge base with the text corpora and train a Ranking CNN with multi-instance learning on it.
3. Extract representative trigrams for each relation class.
4. Show the representative trigrams to experts. Use the trigrams for performance evaluation of the model. Let the experts analyze what mistakes happened and let them filter out non-representative trigrams.
5. Filter out the training sentences for the relations which contain non-representative trigrams and start the process again with the redefined training set (a sketch of this filtering step is given at the end of this section).

Fig. 1. Active learning approach diagram.

Our assumption here is that the sentences that contain non-representative trigrams are the ones that confuse the network the most. Removing them should at least lead to better precision of the models. The analysis of the trigrams that the network deems most important can yield many insights into the causes of bad accuracies.

Fig. 2. Concept of knowledge about a specific relation contained in supervised and distantly supervised datasets.

If we imagine the concept of a relation as a set of knowledge, then ideally a supervised dataset captures the whole knowledge. In practice this is hardly feasible. An expert would have to label as many relevant sentences as possible to capture the variance of the whole knowledge about the relation. Real supervised datasets rather represent a subset of the knowledge about a relation plus some noise, e.g. because of wrong labeling. If we add a distantly supervised dataset, we will most likely capture three different subsets of the overall knowledge: (1) knowledge or noise that is already included in the supervised dataset, (2) new knowledge and (3) new noise. With our proposed workflow we hope to reduce the size of the third subset, namely the newly introduced noise. This idea is reflected in Figure 2. It is important to note that knowledge that is not reflected in the supervised training set will most likely also not be reflected in the supervised testing set. This leads to the assumption that we will underestimate the performance of our distantly supervised models when evaluating on test sets of supervised data sets.
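The filtering referenced in step 5 of the workflow can be expressed compactly. The sketch below is our own simplification (substring matching on lower-cased sentences, illustrative function names), not the exact procedure used in the experiments.

```python
# Sketch of the filtering step of the workflow (steps 4-5): training sentences
# that contain trigrams rejected by the expert are removed before retraining.
# Helper names and the simple substring test are illustrative.
def filter_training_set(bags, rejected_trigrams):
    """bags: {(e1, e2, relation): [sentence, ...]} as produced by build_bags;
    rejected_trigrams: {relation: ["trigram one", ...]} selected by the expert."""
    filtered = {}
    for (e1, e2, rel), sentences in bags.items():
        bad = rejected_trigrams.get(rel, [])
        kept = [s for s in sentences if not any(t in s.lower() for t in bad)]
        if kept:
            filtered[(e1, e2, rel)] = kept
    return filtered

# One refinement round could then look like:
#   trigrams = extract representative trigrams and show them to the expert
#   rejected = trigrams the expert marked as non-representative
#   bags = filter_training_set(bags, rejected)
#   retrain the Ranking CNN on the filtered bags
```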
4 Evaluation

We now evaluate the model architecture. We compare supervised training against distantly supervised training and highlight benefits and downsides of both approaches. We investigate the influence of multi-instance learning and of joint supervised and distantly supervised learning. Lastly, we evaluate the effect of the expert feedback on the quality of the resulting model.

4.1 Data Sets

SemEval Task 8 We first evaluate our model on the SemEval 2010 Task 8 dataset. This dataset was originally used by [dos Santos et al., 2015] and we use it for model validation and comparison. The dataset contains nine bidirectional relation types and the "Other" class, which includes different relations not covered by the main ones. Hence there are 19 different relation classes. The sample sentences were manually collected from the web and annotated in three rounds, ensuring that all annotators agree on the label given to a sentence.

The KBP37 Dataset The KBP37 dataset (https://github.com/zhangdongxu/kbp37), as it was called in the paper [Zhang and Wang, 2015], is a revision of the MIML-RE annotation dataset from [Angeli et al., 2014], which was built from a subset of Wikipedia articles by manual annotation. The benefit of KBP37 is that it is alignable with Wikidata (https://query.wikidata.org/) and the KBP slot-filling datasets (https://nlp.stanford.edu/software/mimlre.shtml). The following changes were made to the dataset by the authors of [Zhang and Wang, 2015] to adapt it to the setup of SemEval Task 8:

– Direction was added to the relations, i.e. 'per:employee-of(e1,e2)' and 'per:employee-of(e2,e1)' instead of simply 'per:employee-of'. This is done for all relations except for 'no-relation'.
– The dataset was balanced by excluding the relations that have less than 100 examples for each of the directions. Also, 80% of the 'no-relation' examples were discarded.
– After that, the examples were shuffled and split into three parts: 70% for training, 10% for development and the rest for testing.

After all modifications the dataset consists of 18 directional relations and one "no-relation" class, which results in 37 classes for recognition. The dataset is more complex than the SemEval Task 8 dataset. It contains longer sentences (almost twice as long as the longest in SemEval) and it also has multi-relational pairs, making it closer to the real-world problem of relation extraction but also more difficult to solve. It can also be observed that the relations and entities in this dataset are more specific. Most of the entities in the dataset are either names of persons or companies. The relations are very specific, e.g. there are three different classes for the placement of a company's headquarters: one each for city, state and country. One more important aspect of the dataset is that human labeling is error-prone. Thus there are also very imprecise examples. Here are two examples for the alternate-names class:

It was because of <e1>Abu Talib</e1> 's ( a.s. ) good fortune that apart from <e2>his</e2> ancestral services and prestige he also inherited from sons of Ismail ( a.s. ) high status and courage.
per:alternate-names(e2,e1)

The discography of <e1>Billie Piper</e1> ( as known as <e2>Billie</e2> ) an English pop music singer consists of two studio albums two compilation albums and nine singles.

per:alternate-names(e2,e1)

In the second sentence we have a well-labeled example for the class. The first sentence, though, hardly expresses an alternate name; it is rather an example for an anaphora resolution task. Such ambiguous labeling makes the classification task even more difficult, as it is not obvious even for human annotators why both examples should belong to the same class.

Knowledge Bases for Distant Supervision As knowledge bases for distant supervision we used both relational pairs from MIML-RE (https://nlp.stanford.edu/software/mimlre.shtml), i.e. from TAC KBP, and Wikidata (https://query.wikidata.org/). Wikidata [Vrandečić and Krötzsch, 2014] is a crowd-sourced knowledge base. Its users collaborate on filling it with facts, but they also collaborate on validating the data and updating the scheme of the knowledge base. The TAC KBP data is from a knowledge base population task by the Text Analysis Conference, with the goal of discovering information about entities and incorporating it into a knowledge base. For the alignment of relational facts, the knowledge base of the Stanford Natural Language Processing group was used. The knowledge base relations were aligned with the New York Times corpus (https://catalog.ldc.upenn.edu/ldc2008t19). The number of entity pairs per relation varies a lot - from less than 1000 to more than 50000. In order to create an artificial "Other" class we chose the relations "per:religion", "per:children" and "org:political/religious-affiliation". When investigating the entity pairs from MIML-RE we found them to be not very accurate, an example being an entity of the type "person-name" that contains only a single letter. In order to minimize the noise effect of these pairs, entity pairs from Wikidata were added to the knowledge base. Wikidata contains less matching data for the corresponding relations, but the relations are more precise. We additionally cleaned the entity pairs by removing the ones containing one-letter entities or names consisting only of capital letters with dots. To align the knowledge bases with our textual corpus, we simply matched the strings of the entity names with the texts. If a sentence includes both entities of a relation, we used it as a sample for the relation.

4.2 Supervised Training Evaluation

To validate the correctness of our implementation of the ranking convolutional neural network described in Section 2.2, it was tested on the test sets of the SemEval 2010 Task 8 dataset and the KBP37 dataset. The scores we achieved are compared to other scores in Table 1. We can conclude that the model achieves comparable quality to the reference model and that our implementation seems to be correct. We also notice that the results achieved with our CNN are higher than those of the recurrent neural network from [Zhang and Wang, 2015].

Classifier                          SemEval2010   KBP37
CR-CNN [dos Santos et al., 2015]    84.1          -
RNN [Zhang and Wang, 2015]          79.6          58.8
Supervised Ranking CNN              84.39         61.26

Table 1. F1-scores for the testing datasets.
4.3 Distant Supervision Evaluation

The results of training the network in various ways with distant supervision can be seen in Table 2. For comparison we also add the results of supervised training.

Experiment                       P       R       F1      Manual Effort
Supervised training              67.74   57.88   61.26   17638
Distantly supervised training    50.71   45.24   43.81   0
Distantly supervised + MIL       51.82   46.61   45.40   0

Table 2. Precision, recall, F1-scores and manual effort (number of sentences the expert has to label) for the distant supervision evaluation.

While distant supervision performs worse than supervised training - which was to be expected - the results are usable in practice. In particular, they are significantly higher than random assignment (with 37 classes, the F1-score for random assignment would be around 0.2%). A publicly available knowledge base and appropriate text corpora can hence serve for the automatic creation of a training set for a neural network tackling the task of relation extraction. Moreover, in the context of the task to continuously extract new knowledge from newly published texts under a constrained budget of manual intervention, this approach is more appealing than both manual extraction and supervised training. We can quantify the savings on the side of the expert by evaluating how many sentences an expert would have to read in each of the settings: for the KBP37 dataset, the number of sentences in the testing set is 3403, and in order to get relations from them the experts would have to fully comprehend all the information. Moreover, with the manual approach this has to be done again for every new text. The manually supervised approach would require full comprehension for creating the training dataset, that is 17638 sentences, and later the experts would check the obtained results (1969 sentences). For distant supervision, on the other hand, all that is required is a result check of around 1586 sentences, and it can be repeated continuously to get all the relations.

The second observation we can make is that multi-instance learning has a slight positive effect on the performance. Multi-instance learning improved the results in every experiment by almost 2%, and it improved precision and recall simultaneously.

It is also important to notice that the supervised training and testing datasets are tightly coupled and will have common context and common biases. Thus, evaluating the distantly supervised model on the existing testing dataset might not be an objective choice. There exist other ways to evaluate the results of distant supervision, for example as done in [Mintz et al., 2009], but they would not show a realistic comparison to the supervised results.

Furthermore, we investigate the dependency between the performance of the approach and the complexity of the sentences. The dependency can be seen in Figure 3. Spikes around large values of the length are not representative, as the number of examples there is much smaller (3-5 sentences). For all other values, with higher length the number of errors grows and the number of right answers drops. Any distantly supervised dataset will always be characterized by longer sentences on average, so this aspect should be taken into account when the dataset is constructed. For example, sentences longer than some limit can simply be excluded from the final set of training examples.

Fig. 3. Correlation of the amount of correct and wrong answers with sentence length. The number of correct answers (green) and wrong answers (red) is normalized by the overall amount of examples of the specific length.
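Such a cutoff amounts to a simple filter over the bags built earlier; the token limit in the sketch below is an arbitrary placeholder, not a value tuned in our experiments.

```python
# Illustrative length-based pruning of the distantly supervised bags; the
# token limit is an arbitrary placeholder, not a tuned value.
MAX_TOKENS = 60

def prune_long_sentences(bags, max_tokens=MAX_TOKENS):
    pruned = {}
    for key, sentences in bags.items():
        kept = [s for s in sentences if len(s.split()) <= max_tokens]
        if kept:
            pruned[key] = kept
    return pruned
```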
To inspect the model in more detail, we extracted the representative trigrams for each class, see Table 3. A first immediate finding from looking at the trigrams was that many of them make sense, but they tend to include the names of entities and might hence even overfit to the names in the training set. For example, for the relation org:founded it is obvious that the concrete years should be replaced by a placeholder.

org:founded-by: founder of the; open society institute; fox broadcasting company; ethical treatment of
per:alternate-names: known as dwight; known as dj; known as milli; known as matthew; real name is; real name was; , mimi smith; name was selena
org:members: soccer league milwaukee; american league boston; national league colorado; football league saskatchewan; midwest league burlington; hockey league .; football league and; basketball league ,; football league 's; soccer league ,
org:top-members/employees: said gene russianoff; chief operating officer; , managing director; , chief executive; chief executive of; sony pictures entertainment; executive vice president; , vice president
per:countries-of-residence: england .; france .; states .; australia .; united states ,; philharmonic .; ) italy :; the like ,
org:founded: , 2000 .; the 1980 's; , 2001 ,; in 1997 ,; in 1996 ,; , 2001 ,; , 2000 ,; in 1997 ,; in 1999 ,; in 1998 ,
org:subsidiaries: , a subsidiary; high school in; the walt disney; a division iii; the university of; high school ,; department stores company; general motors corporation
per:employee-of: ( columbia ); secretary of state; senator daniel inouye; senator sam brownback; ( columbia ); ( interscope ); ( atlantic ); blue note label
per:country-of-birth: england .; states .; france .; africa .; united states ,; united states in; , england ,; united states attorney
per:cities-of-residence: los angeles ,; los angeles band; revved-up vancouver outfit; in london ,; city .; paris .; los angeles ,; angeles .
org:alternate-names: states department of; california , los; and municipal employees; the university of; known as dwight; known as dj; known as milli; known as matthew
org:country-of-headquarters: the university of; states .; japan .; germany .; in london ,; york city ,; arbor , mich; cambridge , mass
org:stateorprovince-of-headquarters: university school of; the university of; , ohio ,; university .; the university of; university in tokyo; life insurance company; institute of technology
per:spouse: benazir bhutto ,; brad pitt and; david lynch; starring david arquette; and her husband; by richard gere; director herbert ross; starring michael douglas
org:city-of-headquarters: in london ,; york city ,; arbor , mich; cambridge , mass; the university of; hill , n; arlington , va; city .
per:stateorprovinces-of-residence: california .; york .; of california at; florida .; new york ,; new york city; in california ,; new york times
per:title: ) film review; director of; ) television review; the actor who; the director of; this film is; director of; prime minister ,; the director ,
per:origin: of american art; the american artist; the american painter; 20th-century american art; american art ,; american art .; american academy of; american art at; french mathematician ,

Table 3. Representative trigrams.
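For reference, the trigram extraction described in Section 2.5 can be sketched as follows, building on the RankingCNN sketch from Section 2.2. The helper names and the per-relation aggregation are our own illustration of the idea, not the exact implementation used to produce Table 3; encode is the same hypothetical helper as before.

```python
# Sketch of the representative-trigram extraction from Section 2.5: each
# filter's max-pooled activation is traced back to the window position where
# the maximum occurred, and its contribution to the correct class score is
# credited to the trigram centred at that position. Names are illustrative.
import torch
from collections import Counter, defaultdict

def trigram_scores(model, words, dist1, dist2, tokens, rel_id):
    with torch.no_grad():
        x = torch.cat([model.word_emb(words),
                       model.pos_emb1(dist1),
                       model.pos_emb2(dist2)], dim=-1)
        conv = torch.tanh(model.conv(x.transpose(1, 2)))[0]   # (n_filters, seq_len)
    pooled, argmax = conv.max(dim=1)            # winning window position per filter
    weights = model.rel_emb.weight[rel_id]      # embedding of the correct relation
    contribution = Counter()
    for f in range(conv.size(0)):
        pos = argmax[f].item()
        trigram = " ".join(tokens[max(0, pos - 1):pos + 2])
        contribution[trigram] += (weights[f] * pooled[f]).item()
    return contribution

def representative_trigrams(model, labeled_sentences, encode, top_k=10):
    """labeled_sentences: iterable of (sentence, e1, e2, rel_id, tokens) tuples;
    returns the top-k highest-contributing trigrams per relation id."""
    per_relation = defaultdict(Counter)
    for sent, e1, e2, rel_id, tokens in labeled_sentences:
        scores = trigram_scores(model, *encode(sent, e1, e2), tokens, rel_id)
        per_relation[rel_id].update(scores)
    return {rel: counts.most_common(top_k) for rel, counts in per_relation.items()}
```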
4.4 Using Expert Feedback

We have seen that we can construct a useful data set for relation extraction using distant supervision and multi-instance learning. Now we want to evaluate whether feedback from experts about the concepts learned by our model can be used to improve the quality of our dataset and the model. For this experiment we used the model trained on the distantly supervised data set with the relations from KBP37 and sentences from the New York Times corpus.

To evaluate the approach of integrating expert feedback to improve the model, we conduct the following experiment: from the representative trigrams of Table 3, we select nonsensical trigrams plus trigrams that are too specific, e.g. ones that overfit on specific names. Sentences matching those trigrams are removed from the training set, as they are suspected to introduce too much noise, and the model is trained again on the filtered data set. Table 4 shows the results; classes where the F1-score changed by less than 1.00 between the initial and the filtered run are excluded because of space constraints.

Class                                F1-score initial   F1-score filtered   New sensible trigrams
org:founded-by                       19.30              13.33               2
org:members                          14.86              13.64               0
org:top-members/employees            44.77              40.49               1
per:alternate-names                  24.72              31.17               1
per:cities-of-residence              52.82              54.73               0
per:countries-of-residence           9.33               13.20               0
per:country-of-birth                 25.83              26.86               0
per:employee-of                      43.39              39.91               3
per:spouse                           43.56              36.00               1
per:stateorprovinces-of-residence    43.95              45.49               1
per:title                            87.45              87.55               2

Table 4. Changes in F1-score after filtering out examples with non-representative trigrams in them. Training is performed without multi-instance learning. Classes with a difference in F1 smaller than 1.00 are excluded due to space constraints.

In detail, the following effects can be seen in the trigrams:

"per:cities-of-residence": all the trigrams contained names of cities. Even after filtering, all the new trigrams contain only city names.
"per:countries-of-residence": a lot of non-representative trigrams were filtered. As a result, the network started to concentrate more on person names in constructions of the form "johan anderson of", "vanessa gusmeroli of".
"org:founded-by": most of the trigrams included company names, so they were filtered out. This allowed other trigrams such as "clifford noble opened" or "dick clark productions" to be obtained, but it worsened the overall score, as the previously learned company names were not taken into account anymore.
"org:members": the data for this relation boils down to information about sport leagues, and the trigrams contained only league names both before and after filtering.
"per:stateorprovince-of-residence": performance improved as the network also started to learn constructions like "pete domenici of".
"org:top-members/employees": had a lot of person names in its trigrams. Filtering them out again counteracted the overfitting of the network, but it allowed trigrams such as "editor of the" to be obtained.
"per:alternate-names": filtering out trigrams with names allowed obtaining "real name was", for example, without losing "real name is" and "known as dwight". In this case the network started to pick up really good constructions.
"per:country-of-birth": after filtering, the network started learning person names more, which helped it to give better results.
"per:employee-of": overfits to company names.
Filtering the trigrams allowed "of state" and "former defense secretary" to be obtained, but worsened the result because the network no longer bases its decision on the company name.
"per:spouse": the relation has a lot of training examples with celebrity names. All such trigrams were filtered out. This allowed the network to learn at least "her husband ,", again leaving out all the other names.

At first glance, the results may look unconvincing: results improve for 5 relations, but get worse for 5 relations. However, looking at the trigrams before and after the filtering, the following two observations can be made:

1. Performance is mostly influenced by overfitting on entities: it is clear from looking at the trigrams that very often concrete names of cities, persons, or organizations are learned, which is not a desired behaviour. Because of the random training and test split, these entities very often occur in both training and test data, such that good results are still obtained. Removing these trigrams has a negative effect in most cases, as no more general patterns for the relations are learned in their place. Interestingly, in some cases removing non-sensical trigrams allows the network to identify even more concrete entities, which improves the results, e.g. in the case of "per:country-of-birth".
2. More sensible trigrams improve the results: some examples, e.g. the relations "per:alternate-names" or "per:stateorprovince-of-residence", show improved results with more sensible trigrams.

In summary, it might be meaningful for the expert to make the decision on which trigrams to include based on a comparison of the trigrams both before and after the filtering: in the case where no more meaningful trigrams are found, it might make sense to conclude that no general model can be found and not to filter the overfitted trigrams after all.

5 Conclusion and Outlook

Despite the many successes of deep learning in relation extraction, for many practical problems the availability of labeled data is the main limiting factor. Due to the complexity of the knowledge that is to be extracted from the texts, supervised approaches need many more examples than what is usually available in practical applications. In this paper we explored possibilities to make use of a domain expert's knowledge in a more efficient way than using them as a mere labeling device. It has been shown that distant supervision, in combination with multi-instance learning, is a meaningful method for relation extraction and well surpasses both manual information extraction and state-of-the-art supervised approaches when performance in relation to manual effort is concerned. The necessary effort by the domain experts can in this case be constrained to the identification of a meaningful structured database for generating distantly supervised examples. An analysis of distant supervision and multi-instance learning in the specific case of the KBP37 dataset showed that the quality of the attainable results can be limited by effects of overfitting on specific entities. We have shown that in this case the domain expert can contribute by inspecting the predictions made by the deep model on the level of representative trigrams. With the insight gained, the expert can help improve the quality of the model by removing examples that were wrongly labeled by distant supervision, or by giving input on pre-processing steps that may help the generalization ability of the model.
Future work will aim at a more in-depth evaluation of the approach. Our hypothesis is that the presented approach will be more effective in the case of more specialized relations and in-depth knowledge, for example in the case of medical texts. Finally, representative trigrams are obviously only a very coarse tool for making the model more understandable.

Acknowledgements: The work upon which this paper is based was supported by the Bundesministerium für Bildung und Forschung (Förderkennzeichen 031L0025C).

References

[Angeli et al., 2014] Angeli, G., Tibshirani, J., Wu, J., and Manning, C. D. (2014). Combining distant and partial supervision for relation extraction. In EMNLP, pages 1556-1567.
[Augenstein et al., 2017] Augenstein, I., Das, M., Riedel, S., Vikraman, L., and McCallum, A. (2017). SemEval 2017 Task 10: ScienceIE - extracting keyphrases and relations from scientific publications. arXiv preprint arXiv:1704.02853.
[Bach et al., 2015] Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140.
[Banko et al., 2007] Banko, M., Cafarella, M. J., Soderland, S., Broadhead, M., and Etzioni, O. (2007). Open information extraction from the web. In IJCAI, volume 7, pages 2670-2676.
[Binder et al., 2016] Binder, A., Bach, S., Montavon, G., Müller, K.-R., and Samek, W. (2016). Layer-wise relevance propagation for deep neural network architectures. In Information Science and Applications (ICISA) 2016, pages 913-922. Springer.
[Craven et al., 1999] Craven, M., Kumlien, J., et al. (1999). Constructing biological knowledge bases by extracting information from text sources. In ISMB, volume 1999, pages 77-86.
[Culotta and Sorensen, 2004] Culotta, A. and Sorensen, J. (2004). Dependency tree kernels for relation extraction. In Proc. ACL'04, page 423. Association for Computational Linguistics.
[Dietterich et al., 1997] Dietterich, T. G., Lathrop, R. H., and Lozano-Pérez, T. (1997). Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1):31-71.
[dos Santos et al., 2015] dos Santos, C. N., Xiang, B., and Zhou, B. (2015). Classifying relations by ranking with convolutional neural networks. CoRR, abs/1504.06580.
[Herrero-Zazo et al., 2013] Herrero-Zazo, M., Segura-Bedmar, I., Martínez, P., and Declerck, T. (2013). The DDI corpus: An annotated corpus with pharmacological substances and drug-drug interactions. Journal of Biomedical Informatics, 46(5):914-920.
[Kim et al., 2016] Kim, B., Khanna, R., and Koyejo, O. O. (2016). Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems, pages 2280-2288.
[Lee et al., 2017] Lee, J. Y., Dernoncourt, F., and Szolovits, P. (2017). MIT at SemEval-2017 Task 10: Relation extraction with convolutional neural networks. arXiv preprint arXiv:1704.01523.
[Lei et al., 2016] Lei, T., Barzilay, R., and Jaakkola, T. (2016). Rationalizing neural predictions. arXiv preprint arXiv:1606.04155.
[Mintz et al., 2009] Mintz, M., Bills, S., Snow, R., and Jurafsky, D. (2009). Distant supervision for relation extraction without labeled data. In Proc. ACL'09, pages 1003-1011, Stroudsburg, PA, USA. Association for Computational Linguistics.
[Nguyen and Grishman, 2015] Nguyen, T. and Grishman, R. (2015). Relation extraction: Perspective from convolutional neural networks.
[Ribeiro et al., 2016] Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). Why should I trust you?: Explaining the predictions of any classifier. In Proc. KDD 2016, pages 1135-1144. ACM.
[Riedel et al., 2013] Riedel, S., Yao, L., McCallum, A., and Marlin, B. M. (2013). Relation extraction with matrix factorization and universal schemas.
[Thrun, 1995] Thrun, S. (1995). Extracting rules from artificial neural networks with distributed representations. In Proc. NIPS'95, pages 505-512.
[Vrandečić and Krötzsch, 2014] Vrandečić, D. and Krötzsch, M. (2014). Wikidata: a free collaborative knowledgebase. Communications of the ACM, 57(10):78-85.
[Zeng et al., 2015] Zeng, D., Liu, K., Chen, Y., and Zhao, J. (2015). Distant supervision for relation extraction via piecewise convolutional neural networks. In Proc. EMNLP 2015, Lisbon, Portugal, pages 17-21.
[Zeng et al., 2014] Zeng, D., Liu, K., Lai, S., Zhou, G., Zhao, J., et al. (2014). Relation classification via convolutional deep neural network. In COLING, pages 2335-2344.
[Zhang and Wang, 2015] Zhang, D. and Wang, D. (2015). Relation classification via recurrent neural network. CoRR, abs/1508.01006.
[Zhang et al., 2006] Zhang, M., Zhang, J., Su, J., and Zhou, G. (2006). A composite kernel to extract relations between entities with both flat and structured features. In Proc. ACL'06, pages 825-832. Association for Computational Linguistics.