Combining Syntactic and Sequential Patterns for Unsupervised Semantic Relation Extraction

Nadège Lechevrel1, Kata Gábor2, Isabelle Tellier3, Thierry Charnois2, Haïfa Zargayouna2, Davide Buscaldi2

1 Université Paris-Ouest Nanterre La Défense
2 LIPN, CNRS (UMR ), Université Paris 13
3 LaTTiCe, CNRS (UMR ), ENS Paris, Université Sorbonne Nouvelle - Paris 3, PSL Research University, Université Sorbonne Paris Cité

This work investigates the impact of syntactic features in a completely unsupervised semantic relation extraction experiment. Automated relation extraction deals with identifying semantic relation instances in a text and classifying them according to the type of relation. This task is essential in information and knowledge extraction and in knowledge base population. Supervised relation extraction systems rely on annotated examples [ , – , ] and extract different kinds of features from the training data, and possibly from external knowledge sources. The types of extracted relations are necessarily limited to a pre-defined list. In Open Information Extraction (OpenIE) [ , ], relation types are inferred directly from the data: concept pairs representing the same relation are grouped together, and relation labels can be generated from context segments or assigned by domain experts [ , , ]. A commonly used method [ , ] is to represent entity pairs in a pair-pattern matrix and to cluster relation instances according to the similarity of their distribution over patterns. Pattern-based approaches [ , , , , ] typically use lexical context patterns, assuming that the semantic relation between two entities is explicitly mentioned in the text. Patterns can be defined manually [ ], obtained by Latent Relational Analysis [ ], or mined from a corpus by sequential pattern mining [ , , ].
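As an illustration of the sequential pattern mining mentioned above, the following is a minimal, brute-force sketch, not the miner used in the cited work: it enumerates gappy subsequences of toy context sentences, keeps the frequent ones, and retains only the closed patterns, i.e. those with no super-pattern of identical support. The toy sentences and all names are invented for the example.

```python
from itertools import combinations

def subsequences(seq, max_len):
    """All (possibly non-contiguous) subsequences of seq, up to max_len items."""
    subs = set()
    for n in range(1, max_len + 1):
        for idx in combinations(range(len(seq)), n):
            subs.add(tuple(seq[i] for i in idx))
    return subs

def is_subseq(short, long_):
    """True if short is a (gappy) subsequence of long_."""
    it = iter(long_)
    return all(tok in it for tok in short)

def closed_sequential_patterns(sequences, min_support=2, max_len=3):
    """Mine frequent subsequences, then keep only the closed ones:
    patterns with no strict super-pattern of identical support."""
    support = {}
    for seq in sequences:
        for pat in subsequences(seq, max_len):   # count each pattern once per sequence
            support[pat] = support.get(pat, 0) + 1
    frequent = {p: s for p, s in support.items() if s >= min_support}
    closed = {}
    for pat, sup in frequent.items():
        if not any(sup == s2 and len(q) > len(pat) and is_subseq(pat, q)
                   for q, s2 in frequent.items()):
            closed[pat] = sup
    return closed

# Invented entity contexts, with X and Y standing for the two concepts:
contexts = [
    ["X", "is", "based", "on", "Y"],
    ["X", "is", "built", "on", "Y"],
    ["X", "extends", "Y"],
]
closed = closed_sequential_patterns(contexts, min_support=2, max_len=3)
# ("X", "Y") is closed with support 3; ("is",) is absorbed by its super-patterns.
```

Real miners such as the ones cited avoid this exhaustive enumeration through prefix-growth strategies, but the closedness criterion they apply is the same.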
In: P. Cellier, T. Charnois, A. Hotho, S. Matwin, M.-F. Moens, Y. Toussaint (Eds.): Proceedings of DMNLP, Workshop at ECML/PKDD, Skopje, Macedonia, 2017. Copyright © by the paper's authors. Copying only for private and academic purposes.

Previous works, especially in the biomedical domain, have shown that not only lexical patterns but also syntactic dependency trees can be beneficial in supervised and semi-supervised relation extraction [ , , – ]. Early experiments on combining lexical patterns with different types of distributional information in unsupervised relation clustering did not bring significant improvements [ ]. The underlying difficulty is that while supervised classifiers can learn to weight attributes from different sources, it is not trivial to combine different types of features in a single clustering feature space.

In our experiments, we propose to combine syntactic features with sequential lexical patterns for the unsupervised clustering of semantic relation instances in (NLP-related) scientific texts. We replicate the experiments of [ ] and augment them with dependency-based syntactic features. We adopt a pair-pattern matrix for clustering relation instances. The task can be described as follows: if a1, a2, b1, b2 are pre-annotated domain concepts extracted from a corpus, we would like to classify the concept pairs a = (a1, a2) and b = (b1, b2) into homogeneous groups according to their semantic relation. We need an efficient representation of (a1, a2) and (b1, b2) in a vector space which allows us to calculate the relational similarity sim(a, b) and to cluster concept pairs. Concept pairs were extracted from the ACL-RelAcs corpus [ ] based on their frequency of co-occurrence in the same sentence. In [ , ], a set of entities was manually categorized into semantic relations; we use this dataset for clustering and for evaluation.
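The pair-pattern representation described above can be sketched as follows: each concept pair is a row whose dimensions are context patterns, and the relational similarity sim(a, b) is the cosine of two rows. This is a minimal illustration with invented data, not the authors' actual pipeline.

```python
import math
from collections import defaultdict

def build_pair_pattern_matrix(instances):
    """instances: iterable of ((e1, e2), pattern) observations.
    Returns a sparse matrix {pair: {pattern: count}} — one row per concept pair."""
    matrix = defaultdict(lambda: defaultdict(int))
    for pair, pattern in instances:
        matrix[pair][pattern] += 1
    return matrix

def cosine(u, v):
    """Cosine similarity of two sparse rows represented as dicts."""
    shared = set(u) & set(v)
    dot = sum(u[k] * v[k] for k in shared)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented observations: (concept pair, lexical context pattern)
observations = [
    (("HMM", "tagging"), "X is used for Y"),
    (("HMM", "tagging"), "X applied to Y"),
    (("CRF", "parsing"), "X is used for Y"),
    (("corpus", "sentences"), "X consists of Y"),
]
m = build_pair_pattern_matrix(observations)
# Pairs sharing patterns come out similar and would land in the same cluster:
sim = cosine(m[("HMM", "tagging")], m[("CRF", "parsing")])
```

Clustering the rows of such a matrix (e.g. agglomeratively over these cosine values) groups concept pairs whose contexts express the same relation.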
Both sequential patterns and syntactic features are extracted automatically from the same corpus. The input pairs were first represented using the best-performing pattern representation of [ ], i.e. sequential patterns. A sequence, in this context, is a list of literals (items), where an item is a word in the corpus. Only closed sequential patterns were considered, i.e. patterns which are not sub-sequences of another pattern with the same support.

This feature space is then augmented with syntactic patterns. A dependency parsing approach was adopted: the structure of a sentence is described in terms of binary relations between words, where each word depends on another one. A dependency structure is a set of triplets in which two lexical items are linked by a typed arc indicating the nature of their grammatical relationship. Dependency structures are thus labeled directed graphs G = (V, A), where V is the set of vertices (the words and punctuation marks of a given sentence) and A is the set of arcs with their types (corresponding to the grammatical relationships between the elements of V).

In our experiments, we used the Stanford dependency scheme [ ], a semantically oriented dependency representation. The parser is the Stanford Parser version . . , trained on the Penn Treebank [ ]. The basic typed dependencies representation (see [ ] for a description of the labels) was chosen. Following [ ], who showed that the shortest path between two entities in the dependency tree can be used to improve relation extraction, the shortest path between two eligible entities is extracted along with the grammatical information carried by the arc types. The dependency label sequences of the shortest paths were then transformed into attributes.
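A shortest-dependency-path feature of the kind described here can be sketched as a breadth-first search over the parse triplets, collecting the dependency labels along the way. The toy parse and all names below are illustrative assumptions, not the paper's implementation.

```python
from collections import deque

def shortest_dep_path(arcs, src, tgt):
    """arcs: list of (head, dep_label, dependent) triplets from a dependency parse.
    Returns the sequence of dependency labels on the shortest path between
    src and tgt (arcs traversed in either direction), or None if unconnected."""
    adj = {}
    for head, label, dep in arcs:
        adj.setdefault(head, []).append((dep, label))
        adj.setdefault(dep, []).append((head, label))   # allow upward traversal
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        node, labels = queue.popleft()
        if node == tgt:
            return labels
        for nxt, label in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, labels + [label]))
    return None

# Toy parse of "The parser uses a grammar":
arcs = [("uses", "nsubj", "parser"), ("uses", "dobj", "grammar"),
        ("parser", "det", "The"), ("grammar", "det", "a")]
path = shortest_dep_path(arcs, "parser", "grammar")  # ['nsubj', 'dobj']
```

Joining such a label sequence into a single string (e.g. "nsubj-dobj") yields one attribute per entity pair, which is how the path information can enter the clustering feature space.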
The experiments aimed at comparing clustering results based on sequential patterns alone, on syntactic information alone, and on a mixed representation combining both types of information. Three feature spaces were thus constructed: 1) the sequential pattern attributes of [ ], 2) the dependency path attributes, and 3) a combined feature space using both types of attributes. Clustering was done with hierarchical agglomerative clustering using a bisective initialization [ ], as implemented in Cluto [ ]. The initialization by repeated bisections yields a number of centroids that serve to augment the original feature space; the values of these new dimensions are given by the distance of each object (here: a concept pair) from the centroids. The clusters were evaluated against the manually categorized sample of [ ]. The results show that combining both types of information is beneficial: the combined feature space 3) provides the best clusters.

Acknowledgments

This work is part of the program "Investissements d'Avenir" overseen by the French National Research Agency, ANR- -LABX- (Labex EFL).

References

1. M. Banko, J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open information extraction from the web. In IJCAI.
2. N. Béchet, P. Cellier, T. Charnois, and B. Crémilleux. Discovering linguistic patterns using sequence mining. In CICLing.
3. R. C. Bunescu and R. J. Mooney. A shortest path dependency kernel for relation extraction. In HLT-EMNLP.
4. L. Del Corro and R. Gemulla. ClausIE: Clause-based open information extraction. In International Conference on World Wide Web (WWW).
5. A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction. In EMNLP.
6. O. Ferret. Typing relations in distributional thesauri. In Language Production, Cognition, and the Lexicon. Springer International Publishing.
7. K. Fundel, R. Küffner, and R. Zimmer.
RelEx: Relation extraction using dependency parse trees. Bioinformatics.
8. K. Gábor, H. Zargayouna, D. Buscaldi, I. Tellier, and T. Charnois. Semantic annotation of the ACL Anthology corpus for the automatic analysis of scientific literature. In LREC.
9. K. Gábor, H. Zargayouna, D. Buscaldi, I. Tellier, and T. Charnois. Unsupervised relation extraction in specialized corpora using sequence mining. In Advances in Intelligent Data Analysis XV (IDA), LNCS.
10. K. Gábor, H. Zargayouna, I. Tellier, D. Buscaldi, and T. Charnois. A typology of semantic relations dedicated to scientific literature analysis. In SAVE-SD Workshop at the World Wide Web Conference, LNCS.
11. M. Hearst. Automatic acquisition of hyponyms from large text corpora. In COLING.
12. A. Herdaǧdelen and M. Baroni. BagPack: A general framework to represent semantic relations. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics (GEMS).
13. J. R. Hobbs and E. Riloff. Information extraction. In N. Indurkhya and F. J. Damerau, editors, Handbook of Natural Language Processing, Second Edition. CRC Press, Taylor and Francis Group, Boca Raton, FL.
14. M.-C. de Marneffe, B. MacCartney, and C. D. Manning. Generating typed dependency parses from phrase structure parses. In LREC.
15. M.-C. de Marneffe and C. D. Manning. Stanford typed dependencies manual. The Stanford NLP Group; revised for the Stanford Parser.
16. M.-C. de Marneffe and C. D. Manning. The Stanford typed dependencies representation. In COLING Workshop on Cross-framework and Cross-domain Parser Evaluation.
17. R. J. Mooney and R. Bunescu. Mining knowledge from text using information extraction. SIGKDD Explorations Newsletter.
18. Y. Nakamura-Delloye and E. de la Clergerie. Exploitation de résultats d'analyse syntaxique pour extraction semi-supervisée des chemins de relations.
In Conférence sur le Traitement Automatique des Langues Naturelles (TALN).
19. M. Porumb, I. Barbantan, C. Lemnaru, and R. Potolea. REMed: Automatic relation extraction from medical documents. In Proceedings of the International Conference on Information Integration and Web-based Applications & Services (iiWAS). ACM.
20. R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In EDBT.
21. P. D. Turney. Measuring semantic similarity by latent relational analysis. In IJCAI.
22. P. D. Turney. Similarity of semantic relations. CoRR, abs/cs/.
23. P. D. Turney. Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research.
24. P. D. Turney and S. M. Mohammad. Experiments with three approaches to recognizing lexical entailment. Natural Language Engineering.
25. J. Weeds, D. Clarke, J. Reffin, D. Weir, and B. Keller. Learning to distinguish hypernyms and co-hyponyms. In COLING.
26. R. Yangarber, W. Lin, and R. Grishman. Unsupervised learning of generalized names. In COLING.
27. Y. Zhao and G. Karypis. Evaluation of hierarchical clustering algorithms for document datasets. In CIKM.
28. Y. Zhao, G. Karypis, and U. Fayyad. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery.
29. G. Zhou, J. Su, J. Zhang, and M. Zhang. Exploring various knowledge in relation extraction. In ACL.