
Combining Syntactic and Sequential Patterns for Unsupervised Semantic Relation Extraction

Nadège Lechevrel

Kata Gábor

Isabelle Tellier

Thierry Charnois

Haïfa Zargayouna

Davide Buscaldi

0 0 LIPN, CNRS (UMR ), Université Paris 1 LaTTiCe, CNRS (UMR ), ENS Paris, Université Sorbonne Nouvelle - Paris PSL Research University, Université Sorbonne Paris Cité 2 Université Paris-Ouest Nanterre La Défense


In: P. Cellier, T. Charnois, A. Hotho, S. Matwin, M.-F. Moens, Y. Toussaint (Eds.): Proceedings of DMNLP, Workshop at ECML/PKDD, Skopje, Macedonia, 2017. Copyright © by the paper's authors. Copying only for private and academic purposes.

This work investigates the impact of syntactic features in a completely unsupervised semantic relation extraction experiment. Automated relation extraction consists in identifying semantic relation instances in a text and classifying them according to the type of relation. This task is essential in information and knowledge extraction and in knowledge base population. Supervised relation extraction systems rely on annotated examples [ , – , ] and extract different kinds of features from the training data, and possibly from external knowledge sources. The types of extracted relations are necessarily limited to a pre-defined list.

In Open Information Extraction (OpenIE) [ , ], relation types are inferred directly from the data: concept pairs representing the same relation are grouped together, and relation labels can be generated from context segments or assigned by domain experts [ , , ]. A commonly used method [ , ] is to represent entity pairs in a pair-pattern matrix and to cluster relation instances according to the similarity of their distributions over patterns. Pattern-based approaches [ , , , , ] typically use lexical context patterns, assuming that the semantic relation between two entities is explicitly mentioned in the text. Patterns can be defined manually [ ], obtained by Latent Relational Analysis [ ], or mined from a corpus by sequential pattern mining [ , , ].

Previous work, especially in the biomedical domain, has shown that syntactic dependency trees, in addition to lexical patterns, can be beneficial in supervised and semi-supervised relation extraction [ , , – ]. Early experiments on combining lexical patterns with different types of distributional information in unsupervised relation clustering did not bring significant improvement [ ]. The underlying difficulty is that while supervised classifiers can learn to weight attributes from different sources, it is not trivial to combine different types of features in a single clustering feature space.
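
To make the pair-pattern representation concrete, here is a minimal sketch in Python. The observations, the pattern strings, and the use of cosine similarity over raw counts are assumptions made for illustration; the cited approaches may weight or normalize the matrix differently.

```python
from collections import Counter
from math import sqrt

def build_pair_pattern_matrix(observations):
    """Map each concept pair to a Counter of the patterns it occurs with."""
    matrix = {}
    for pair, pattern in observations:
        matrix.setdefault(pair, Counter())[pattern] += 1
    return matrix

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(count * v[p] for p, count in u.items() if p in v)
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy data: (concept pair, lexical pattern linking the two concepts)
observations = [
    (("parser", "treebank"), "X trained on Y"),
    (("parser", "treebank"), "X evaluated on Y"),
    (("tagger", "corpus"), "X trained on Y"),
    (("wordnet", "synset"), "X consists of Y"),
]
matrix = build_pair_pattern_matrix(observations)

# Pairs that share patterns are closer than pairs that share none.
sim_related = cosine(matrix[("parser", "treebank")], matrix[("tagger", "corpus")])
sim_unrelated = cosine(matrix[("parser", "treebank")], matrix[("wordnet", "synset")])
```

In this toy setting the first two pairs share one pattern and get a non-zero similarity, while the third pair shares none and gets zero; a clustering algorithm would operate on exactly this kind of pairwise similarity.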

In our experiments, we propose to combine syntactic features with sequential lexical patterns for the unsupervised clustering of semantic relation instances in (NLP-related) scientific texts. We replicate the experiments of [ ] and augment them with dependency-based syntactic features. We adopt a pair-pattern matrix for clustering relation instances. The task can be described as follows: if a1, a2, b1, b2 are pre-annotated domain concepts extracted from a corpus, we would like to classify the concept pairs a = (a1, a2) and b = (b1, b2) into homogeneous groups according to their semantic relation. We need an efficient vector-space representation of (a1, a2) and (b1, b2) that makes it possible to compute the relational similarity sim(a, b) and to cluster concept pairs. Concept pairs were extracted from the ACL-RelAcs corpus [ ] based on their frequency of co-occurrence in the same sentence. In [ , ], a set of entities was manually categorized into semantic relations; we use this dataset for clustering and for evaluation. Both sequential patterns and syntactic features are extracted automatically from the same corpus.

The input pairs were first represented using the best-performing pattern representation in [ ], i.e. sequential patterns. A sequence, in this context, is a list of literals (items), where an item is a word in the corpus. Only closed sequential patterns were considered, i.e. patterns that are not sub-sequences of another sequential pattern with the same support. This feature space is then augmented using syntactic patterns. A dependency parsing approach was adopted: the structure of a sentence is described in terms of binary relations between words, where each word depends on another one. A dependency structure consists of a set of triplets in which two lexical items are linked by a typed arc indicating the nature of their grammatical relationship.
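
The notion of a closed sequential pattern can be illustrated with a small Python sketch. This is not the mining algorithm used in the experiments: it only filters a given list of candidate patterns, it assumes that gaps between items are allowed, and the toy sequences are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if the items of `pattern` occur in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def support(pattern, sequences):
    """Number of sequences that contain `pattern` as a sub-sequence."""
    return sum(1 for s in sequences if is_subsequence(pattern, s))

def closed_patterns(candidates, sequences):
    """Keep candidates that are not sub-sequences of another candidate
    with the same support (closedness relative to the candidate list)."""
    supports = {p: support(p, sequences) for p in candidates}
    closed = []
    for p in candidates:
        absorbed = any(
            q != p and supports[q] == supports[p] and is_subsequence(p, q)
            for q in candidates
        )
        if not absorbed:
            closed.append(p)
    return closed

# Hypothetical word sequences found between two concepts
sequences = [
    ("is", "trained", "on"),
    ("was", "trained", "on"),
    ("trained", "with"),
]
candidates = [("trained",), ("on",), ("trained", "on")]
closed = closed_patterns(candidates, sequences)
```

Here `("on",)` is dropped because `("trained", "on")` contains it and has the same support, whereas `("trained",)` is kept because it has a strictly higher support than any pattern that contains it.
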
The dependency structures are thus labeled directed graphs G = (V, A), where V is the set of vertices (the words and punctuation marks of a given sentence) and A is the set of typed arcs between pairs of vertices (the grammatical relationships between the elements of V). In our experiments, we used the Stanford dependency scheme [ ], a semantically-oriented dependency representation. The parser is Stanford Parser version . . , trained on the Penn Treebank [ ]. The basic typed dependencies representation (see [ ] for a description of the labels) was chosen. Following [ ], who have shown that the shortest path between two entities in the dependency tree can be used to improve relation extraction, the shortest path between two eligible entities is extracted along with the grammatical information contained in the arc types. The dependency label sequences of the shortest paths were transformed into attributes.

The experiments aimed at comparing clustering results based on sequential patterns alone, syntactic information alone, and a mixed representation with both types of information. Three feature spaces were thus constructed: 1) the sequential pattern attributes of [ ], 2) the dependency path attributes and 3) a combined feature space using both types of attributes. Clustering was done using hierarchical agglomerative clustering with a bisective initialization [ ], as implemented in CLUTO [ ]. The initialization by repeated bisections yields a number of centroids that serve to augment the original feature space; the values of these new dimensions are given by the distance of each object (here, a concept pair) from the centroids. The clusters were evaluated against the manually categorized sample of [ ].
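
The extraction of shortest-path label sequences described above can be sketched as follows. The triples are a hand-written toy parse in the style of basic Stanford dependencies, not actual parser output, and the token names, the undirected traversal, and the `/` separator are assumptions made for this example.

```python
from collections import deque

def shortest_path_labels(triples, source, target):
    """Return the dependency labels along the shortest path between two tokens.

    `triples` holds (head, label, dependent) arcs. Arcs are followed in both
    directions, so the dependency tree is treated as an undirected graph.
    Returns None if the two tokens are not connected.
    """
    neighbours = {}
    for head, label, dep in triples:
        neighbours.setdefault(head, []).append((dep, label))
        neighbours.setdefault(dep, []).append((head, label))
    queue = deque([(source, [])])
    visited = {source}
    while queue:  # breadth-first search yields a shortest path
        node, labels = queue.popleft()
        if node == target:
            return labels
        for nxt, label in neighbours.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, labels + [label]))
    return None

# Toy parse of "The parser is trained on the treebank" (hand-written)
triples = [
    ("trained-4", "nsubjpass", "parser-2"),
    ("parser-2", "det", "The-1"),
    ("trained-4", "auxpass", "is-3"),
    ("trained-4", "prep", "on-5"),
    ("on-5", "pobj", "treebank-7"),
    ("treebank-7", "det", "the-6"),
]
path = shortest_path_labels(triples, "parser-2", "treebank-7")
attribute = "/".join(path)  # one attribute name per label sequence
```

Each distinct label sequence then becomes one column of the syntactic feature space. This sketch discards arc direction; keeping it (for example by marking each label as traversed upward or downward) is a design choice the source does not specify.
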
Results show that combining both types of information is beneficial: the combined feature space 3) provides the best clusters.
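
As an illustration of the combined feature space and of the centroid-based augmentation, here is a minimal Python sketch. The feature names, the prefixes, the given centroids, and the choice of Euclidean distance are all assumptions; the source does not state how the two attribute sets were merged, and CLUTO computes its centroids and distances internally.

```python
from math import sqrt

def combine(sequential, syntactic):
    """Union of two sparse feature dicts; prefixes keep the two sources apart."""
    combined = {f"seq:{name}": value for name, value in sequential.items()}
    combined.update({f"dep:{name}": value for name, value in syntactic.items()})
    return combined

def to_dense(features, vocabulary):
    """Turn a sparse feature dict into a dense vector over `vocabulary`."""
    return [float(features.get(name, 0)) for name in vocabulary]

def euclidean(u, v):
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def augment_with_centroid_distances(vector, centroids):
    """Append one new dimension per centroid: the distance to that centroid."""
    return vector + [euclidean(vector, c) for c in centroids]

# Hypothetical features of one concept pair
pair_features = combine({"trained on": 3.0}, {"nsubjpass/prep/pobj": 4.0})
vocabulary = sorted(pair_features)
vector = to_dense(pair_features, vocabulary)

# Hypothetical centroids, standing in for those produced by repeated bisections
centroids = [[0.0, 0.0], [4.0, 3.0]]
augmented = augment_with_centroid_distances(vector, centroids)
```

The augmented vector keeps the original dimensions and adds one distance per centroid, so objects that lie close to the same centroids become more similar in the extended space.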

Acknowledgments This work is part of the program "Investissements d’Avenir" overseen by the French National Research Agency, ANR- -LABX- (Labex EFL).

R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In EDBT, pages – , .
P. D. Turney. Measuring semantic similarity by latent relational analysis. In IJCAI- , .
P. D. Turney. Similarity of semantic relations. CoRR, abs/cs/ , .
P. D. Turney. Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research, , .
P. D. Turney and S. M. Mohammad. Experiments with three approaches to recognizing lexical entailment. Natural Language Engineering, .
J. Weeds, D. Clarke, J. Ren, D. Weir, and B. Keller. Learning to distinguish hypernyms and co-hyponyms. In COLING ’ , .
R. Yangarber, W. Lin, and R. Grishman. Unsupervised learning of generalized names. In COLING ’ , .
Y. Zhao and G. Karypis. Evaluation of hierarchical clustering algorithms for document datasets. In CIKM, .
Y. Zhao, G. Karypis, and U. Fayyad. Hierarchical clustering algorithms for document datasets. Data Mining for Knowledge Discovery, , March .
G. Zhou, J. Su, J. Zhang, and M. Zhang. Exploring various knowledge in relation extraction. In ACL ’ , .