Combining Syntactic and Sequential Patterns for Unsupervised Semantic Relation Extraction

Nadège Lechevrel1, Kata Gábor2, Isabelle Tellier3, Thierry Charnois2, Haïfa Zargayouna2, Davide Buscaldi2

1 Université Paris-Ouest Nanterre La Défense
2 LIPN, CNRS (UMR ), Université Paris 13
3 LaTTiCe, CNRS (UMR ), ENS Paris, Université Sorbonne Nouvelle - Paris 3, PSL Research University, Université Sorbonne Paris Cité

This work investigates the impact of syntactic features in a completely unsupervised semantic relation extraction experiment. Automated relation extraction deals with identifying semantic relation instances in a text and classifying them according to the type of relation. This task is essential in information and knowledge extraction and in knowledge base population. Supervised relation extraction systems rely on annotated examples [ , – , ] and extract different kinds of features from the training data, and possibly from external knowledge sources. The types of extracted relations are necessarily limited to a pre-defined list. In Open Information Extraction (OpenIE) [ , ], relation types are inferred directly from the data: concept pairs representing the same relation are grouped together, and relation labels can be generated from context segments or assigned by domain experts [ , , ]. A commonly used method [ , ] is to represent entity pairs in a pair-pattern matrix and to cluster relation instances according to the similarity of their distribution over patterns. Pattern-based approaches [ , , , , ] typically use lexical context patterns, assuming that the semantic relation between two entities is explicitly mentioned in the text. Patterns can be defined manually [ ], obtained by Latent Relational Analysis [ ], or mined from a corpus by sequential pattern mining [ , , ].
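As an illustration of the sequential pattern mining mentioned above, the following is a minimal, brute-force sketch, not the miner used in the cited work: it enumerates gappy subsequences of toy context sentences, keeps the frequent ones, and retains only the closed patterns, i.e. those with no super-pattern of identical support. The toy sentences and all names are invented for the example.

```python
from itertools import combinations

def subsequences(seq, max_len):
    """All (possibly non-contiguous) subsequences of seq, up to max_len items."""
    subs = set()
    for n in range(1, max_len + 1):
        for idx in combinations(range(len(seq)), n):
            subs.add(tuple(seq[i] for i in idx))
    return subs

def is_subseq(short, long_):
    """True if short is a (gappy) subsequence of long_."""
    it = iter(long_)
    return all(tok in it for tok in short)

def closed_sequential_patterns(sequences, min_support=2, max_len=3):
    """Mine frequent subsequences, then keep only the closed ones:
    patterns with no strict super-pattern of identical support."""
    support = {}
    for seq in sequences:
        for pat in subsequences(seq, max_len):   # count each pattern once per sequence
            support[pat] = support.get(pat, 0) + 1
    frequent = {p: s for p, s in support.items() if s >= min_support}
    closed = {}
    for pat, sup in frequent.items():
        if not any(sup == s2 and len(q) > len(pat) and is_subseq(pat, q)
                   for q, s2 in frequent.items()):
            closed[pat] = sup
    return closed

# Invented entity contexts, with X and Y standing for the two concepts:
contexts = [
    ["X", "is", "based", "on", "Y"],
    ["X", "is", "built", "on", "Y"],
    ["X", "extends", "Y"],
]
closed = closed_sequential_patterns(contexts, min_support=2, max_len=3)
# ("X", "Y") is closed with support 3; ("is",) is absorbed by its super-patterns.
```

Real miners such as the ones cited avoid this exhaustive enumeration through prefix-growth strategies, but the closedness criterion they apply is the same.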
In: P. Cellier, T. Charnois, A. Hotho, S. Matwin, M.-F. Moens, Y. Toussaint (Eds.): Proceedings of DMNLP, Workshop at ECML/PKDD, Skopje, Macedonia, 2017. Copyright © by the paper's authors. Copying only for private and academic purposes.

Previous works, especially in the biomedical domain, have shown that not only lexical patterns but also syntactic dependency trees can be beneficial in supervised and semi-supervised relation extraction [ , , – ]. Early experiments on combining lexical patterns with different types of distributional information in unsupervised relation clustering did not bring significant improvements [ ]. The underlying difficulty is that while supervised classifiers can learn to weight attributes from different sources, it is not trivial to combine different types of features in a single clustering feature space.

In our experiments, we propose to combine syntactic features with sequential lexical patterns for the unsupervised clustering of semantic relation instances in (NLP-related) scientific texts. We replicate the experiments of [ ] and augment them with dependency-based syntactic features. We adopt a pair-pattern matrix for clustering relation instances. The task can be described as follows: if a1, a2, b1, b2 are pre-annotated domain concepts extracted from a corpus, we would like to classify the concept pairs a = (a1, a2) and b = (b1, b2) into homogeneous groups according to their semantic relation. We need an efficient representation of (a1, a2) and (b1, b2) in a vector space which allows us to calculate the relational similarity sim(a, b) and to cluster concept pairs. Concept pairs were extracted from the ACL-RelAcs corpus [ ] based on their frequency of co-occurrence in the same sentence. In [ , ], a set of entities was manually categorized into semantic relations; we use this dataset for clustering and for evaluation.
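The pair-pattern representation described above can be sketched as follows: each concept pair is a row whose dimensions are context patterns, and the relational similarity sim(a, b) is the cosine of two rows. This is a minimal illustration with invented data, not the authors' actual pipeline.

```python
import math
from collections import defaultdict

def build_pair_pattern_matrix(instances):
    """instances: iterable of ((e1, e2), pattern) observations.
    Returns a sparse matrix {pair: {pattern: count}} — one row per concept pair."""
    matrix = defaultdict(lambda: defaultdict(int))
    for pair, pattern in instances:
        matrix[pair][pattern] += 1
    return matrix

def cosine(u, v):
    """Cosine similarity of two sparse rows represented as dicts."""
    shared = set(u) & set(v)
    dot = sum(u[k] * v[k] for k in shared)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented observations: (concept pair, lexical context pattern)
observations = [
    (("HMM", "tagging"), "X is used for Y"),
    (("HMM", "tagging"), "X applied to Y"),
    (("CRF", "parsing"), "X is used for Y"),
    (("corpus", "sentences"), "X consists of Y"),
]
m = build_pair_pattern_matrix(observations)
# Pairs sharing patterns come out similar and would land in the same cluster:
sim = cosine(m[("HMM", "tagging")], m[("CRF", "parsing")])
```

Clustering the rows of such a matrix (e.g. agglomeratively over these cosine values) groups concept pairs whose contexts express the same relation.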
Both sequential patterns and syntactic features are extracted automatically from the same corpus. The input pairs were first represented using the best-performing pattern representation of [ ], i.e. sequential patterns. A sequence, in this context, is a list of literals (items), where an item is a word in the corpus. Only closed sequential patterns were considered, i.e. patterns which are not sub-sequences of another pattern with the same support.

This feature space is then augmented with syntactic patterns. A dependency parsing approach was adopted: the structure of a sentence is described in terms of binary relations between words, where each word depends on another one. A dependency structure is a set of triplets in which two lexical items are linked by a typed arc indicating the nature of their grammatical relationship. Dependency structures are thus labeled directed graphs G = (V, A), where V is the set of vertices (the words and punctuation marks of a given sentence) and A is the set of arcs with their types (corresponding to the grammatical relationships between the elements of V).

In our experiments, we used the Stanford dependency scheme [ ], a semantically oriented dependency representation. The parser is the Stanford Parser version . . , trained on the Penn Treebank [ ]. The basic typed dependencies representation (see [ ] for a description of the labels) was chosen. Following [ ], who showed that the shortest path between two entities in the dependency tree can be used to improve relation extraction, the shortest path between two eligible entities is extracted along with the grammatical information carried by the arc types. The dependency label sequences of the shortest paths were then transformed into attributes.
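A shortest-dependency-path feature of the kind described here can be sketched as a breadth-first search over the parse triplets, collecting the dependency labels along the way. The toy parse and all names below are illustrative assumptions, not the paper's implementation.

```python
from collections import deque

def shortest_dep_path(arcs, src, tgt):
    """arcs: list of (head, dep_label, dependent) triplets from a dependency parse.
    Returns the sequence of dependency labels on the shortest path between
    src and tgt (arcs traversed in either direction), or None if unconnected."""
    adj = {}
    for head, label, dep in arcs:
        adj.setdefault(head, []).append((dep, label))
        adj.setdefault(dep, []).append((head, label))   # allow upward traversal
    queue = deque([(src, [])])
    seen = {src}
    while queue:
        node, labels = queue.popleft()
        if node == tgt:
            return labels
        for nxt, label in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, labels + [label]))
    return None

# Toy parse of "The parser uses a grammar":
arcs = [("uses", "nsubj", "parser"), ("uses", "dobj", "grammar"),
        ("parser", "det", "The"), ("grammar", "det", "a")]
path = shortest_dep_path(arcs, "parser", "grammar")  # ['nsubj', 'dobj']
```

Joining such a label sequence into a single string (e.g. "nsubj-dobj") yields one attribute per entity pair, which is how the path information can enter the clustering feature space.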
The experiments aimed at comparing clustering results based on sequential patterns alone, on syntactic information alone, and on a mixed representation combining both types of information. Three feature spaces were thus constructed: 1) the sequential pattern attributes of [ ], 2) the dependency path attributes, and 3) a combined feature space using both types of attributes. Clustering was done with hierarchical agglomerative clustering using a bisective initialization [ ], as implemented in Cluto [ ]. The initialization by repeated bisections yields a number of centroids that serve to augment the original feature space; the values of these new dimensions are given by the distance of each object (here: a concept pair) from the centroids. The clusters were evaluated against the manually categorized sample of [ ]. The results show that combining both types of information is beneficial: the combined feature space 3) provides the best clusters.

Acknowledgments

This work is part of the program "Investissements d'Avenir" overseen by the French National Research Agency, ANR- -LABX- (Labex EFL).

References

1. M. Banko, J. Cafarella, S. Soderland, M. Broadhead, and O. Etzioni. Open information extraction from the web. In IJCAI.
2. N. Béchet, P. Cellier, T. Charnois, and B. Crémilleux. Discovering linguistic patterns using sequence mining. In CICLing.
3. R. C. Bunescu and R. J. Mooney. A shortest path dependency kernel for relation extraction. In HLT-EMNLP.
4. L. Del Corro and R. Gemulla. ClausIE: Clause-based open information extraction. In International Conference on World Wide Web (WWW).
5. A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction. In EMNLP.
6. O. Ferret. Typing relations in distributional thesauri. In Language Production, Cognition, and the Lexicon. Springer International Publishing.
7. K. Fundel, R. Küffner, and R. Zimmer.
RelEx: Relation extraction using dependency parse trees. Bioinformatics.
8. K. Gábor, H. Zargayouna, D. Buscaldi, I. Tellier, and T. Charnois. Semantic annotation of the ACL Anthology corpus for the automatic analysis of scientific literature. In LREC.
9. K. Gábor, H. Zargayouna, D. Buscaldi, I. Tellier, and T. Charnois. Unsupervised relation extraction in specialized corpora using sequence mining. In Advances in Intelligent Data Analysis XV (IDA), LNCS.
10. K. Gábor, H. Zargayouna, I. Tellier, D. Buscaldi, and T. Charnois. A typology of semantic relations dedicated to scientific literature analysis. In SAVE-SD Workshop at the World Wide Web Conference, LNCS.
11. M. Hearst. Automatic acquisition of hyponyms from large text corpora. In COLING.
12. A. Herdaǧdelen and M. Baroni. BagPack: A general framework to represent semantic relations. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics (GEMS).
13. J. R. Hobbs and E. Riloff. Information extraction. In N. Indurkhya and F. J. Damerau, editors, Handbook of Natural Language Processing, Second Edition. CRC Press, Taylor and Francis Group, Boca Raton, FL.
14. M.-C. de Marneffe, B. MacCartney, and C. D. Manning. Generating typed dependency parses from phrase structure parses. In LREC.
15. M.-C. de Marneffe and C. D. Manning. Stanford typed dependencies manual. The Stanford NLP Group; revised for the Stanford Parser.
16. M.-C. de Marneffe and C. D. Manning. The Stanford typed dependencies representation. In COLING Workshop on Cross-framework and Cross-domain Parser Evaluation.
17. R. J. Mooney and R. Bunescu. Mining knowledge from text using information extraction. SIGKDD Explorations Newsletter.
18. Y. Nakamura-Delloye and E. de la Clergerie. Exploitation de résultats d'analyse syntaxique pour extraction semi-supervisée des chemins de relations.
In Conférence sur le Traitement Automatique des Langues Naturelles (TALN).
19. M. Porumb, I. Barbantan, C. Lemnaru, and R. Potolea. REMed: Automatic relation extraction from medical documents. In Proceedings of the International Conference on Information Integration and Web-based Applications & Services (iiWAS). ACM.
20. R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In EDBT.
21. P. D. Turney. Measuring semantic similarity by latent relational analysis. In IJCAI.
22. P. D. Turney. Similarity of semantic relations. CoRR, abs/cs/.
23. P. D. Turney. Domain and function: A dual-space model of semantic relations and compositions. Journal of Artificial Intelligence Research.
24. P. D. Turney and S. M. Mohammad. Experiments with three approaches to recognizing lexical entailment. Natural Language Engineering.
25. J. Weeds, D. Clarke, J. Reffin, D. Weir, and B. Keller. Learning to distinguish hypernyms and co-hyponyms. In COLING.
26. R. Yangarber, W. Lin, and R. Grishman. Unsupervised learning of generalized names. In COLING.
27. Y. Zhao and G. Karypis. Evaluation of hierarchical clustering algorithms for document datasets. In CIKM.
28. Y. Zhao, G. Karypis, and U. Fayyad. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery.
29. G. Zhou, J. Su, J. Zhang, and M. Zhang. Exploring various knowledge in relation extraction. In ACL.