<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Combining Syntactic and Sequential Patterns for Unsupervised Semantic Relation Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nadège Lechevrel</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kata Gábor</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Isabelle Tellier</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thierry Charnois</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haïfa Zargayouna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Buscaldi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LIPN, CNRS (UMR ), Université Paris</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LaTTiCe, CNRS (UMR ), ENS Paris, Université Sorbonne Nouvelle - Paris PSL Research University, Université Sorbonne Paris Cité</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Université Paris-Ouest Nanterre La Défense</institution>
        </aff>
      </contrib-group>
      <fpage>81</fpage>
      <lpage>84</lpage>
      <abstract>
        <p>In: P. Cellier, T. Charnois, A. Hotho, S. Matwin, M.-F. Moens, Y. Toussaint (Eds.): Proceedings of DMNLP, Workshop at ECML/PKDD, Skopje, Macedonia, 2017. Copyright © by the paper's authors. Copying only for private and academic purposes.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>This work investigates the impact of syntactic features in a completely
unsupervised semantic relation extraction experiment. Automated relation extraction
deals with identifying semantic relation instances in a text and classifying them
according to the type of relation. This task is essential in information and
knowledge extraction and in knowledge base population. Supervised relation extraction
systems rely on annotated examples [ , – , ] and extract different kinds of
features from the training data, and possibly from external knowledge sources.
The types of extracted relations are necessarily limited to a pre-defined list.
In Open Information Extraction (OpenIE) [ , ] relation types are inferred
directly from the data: concept pairs representing the same relation are grouped
together and relation labels can be generated from context segments or through
labeling by domain experts [ , , ]. A commonly used method [ , ] is to
represent entity couples by a pair-pattern matrix, and cluster relation instances
according to the similarity of their distribution over patterns. Pattern-based
approaches [ , , , , ] typically use lexical context patterns, assuming that the
semantic relation between two entities is explicitly mentioned in the text. Patterns
can be defined manually [ ], obtained by Latent Relational Analysis [ ], or from
a corpus by sequential pattern mining [ , , ]. Previous works, especially in the
biomedical domain, have shown that not only lexical patterns, but also syntactic
dependency trees can be beneficial in supervised and semi-supervised relation
extraction [ , , – ]. Early experiments on combining lexical patterns with
different types of distributional information in unsupervised relation clustering
did not bring significant improvement [ ]. The underlying difficulty is that while
supervised classifiers can learn to weight attributes from different sources, it is
not trivial to combine different types of features in a single clustering feature
space.</p>
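      <p>As a rough illustration of the pair-pattern representation mentioned above, the following minimal Python sketch builds a sparse pair-pattern matrix and compares concept pairs by the similarity of their distribution over patterns. The function names, the toy occurrences and the choice of cosine as the similarity measure are assumptions made for illustration only; they are not taken from the cited systems.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from math import sqrt


def pair_pattern_matrix(occurrences):
    """Map each concept pair to a Counter of pattern frequencies.

    `occurrences` is an iterable of (pair, pattern) tuples, one per
    sentence in which the pair was seen with that context pattern.
    """
    rows = {}
    for pair, pattern in occurrences:
        rows.setdefault(pair, Counter())[pattern] += 1
    return rows


def cosine(u, v):
    """Cosine similarity between two sparse rows of the matrix."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy data: X and Y stand for the two concepts of a pair.
occurrences = [
    (("parser", "treebank"), "X trained on Y"),
    (("parser", "treebank"), "X trained on Y"),
    (("parser", "treebank"), "X evaluated on Y"),
    (("tagger", "corpus"), "X trained on Y"),
    (("tagger", "corpus"), "X evaluated on Y"),
    (("wordnet", "synset"), "X consists of Y"),
]

matrix = pair_pattern_matrix(occurrences)
sim_close = cosine(matrix[("parser", "treebank")], matrix[("tagger", "corpus")])
sim_far = cosine(matrix[("parser", "treebank")], matrix[("wordnet", "synset")])
print(round(sim_close, 3), sim_far)  # 0.949 0.0
```
]]></preformat>
      <p>In this toy example the first two pairs share their patterns and obtain a high similarity, while the third pair shares none and obtains zero. A clustering algorithm operating on such rows would therefore tend to group the first two pairs together.</p>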
      <p>In our experiments, we propose to combine syntactic features with sequential
lexical patterns for unsupervised clustering of semantic relation instances in the
context of (NLP-related) scientific texts. We replicate the experiments of [ ]
and augment them with dependency-based syntactic features. We adopt a
pair-pattern matrix for clustering relation instances. The task can be described as
follows: if a1, a2, b1, b2 are pre-annotated domain concepts extracted from a
corpus, we would like to classify concept pairs a = (a1, a2) and b = (b1, b2) into
homogeneous groups according to their semantic relation. We need an efficient
representation of (a1, a2) and (b1, b2) in a vector space that allows us to calculate
the relational similarity sim(a, b) and to cluster concept pairs. Concept pairs were
extracted from the ACL-RelAcs corpus [ ] based on their frequency of co-occurrence in
the same sentence. In [ , ] a set of entities was manually categorized into
semantic relations; we use this dataset in the clustering and for evaluation. Both
sequential patterns and syntactic features are extracted automatically from the
same corpus. The input pairs were first represented using the best performing
pattern representation in [ ], i.e. sequential patterns. A sequence, in this context,
is a list of literals (items) where an item is a word in the corpus. Only closed
sequential patterns were considered, i.e. patterns which are not sub-sequences of
another sequence with the same support.</p>
      <p>This feature space is then augmented
using syntactic patterns. A dependency parsing approach was adopted: the
structure of sentences is described in terms of binary relations between words,
where each word depends on another one. A dependency structure consists of a set
of triplets in which two lexical items are linked by a typed arc indicating the nature
of their grammatical relationship. Dependency structures are thus labeled
directed graphs G = (V, A), consisting of a set of vertices V (the words and
punctuation marks of a given sentence) and a set of arcs A (typed pairs of vertices,
where the type gives the grammatical relationship between the two elements of V). In
our experiments, we used the Stanford dependency scheme [ ], a semantically
oriented dependency representation. The parser is Stanford Parser version . . ,
trained on the Penn Treebank [ ]. The basic typed dependencies representation
(see [ ] for a description of the labels) was chosen. Following [ ], who have
shown that the shortest path between two entities in the dependency tree can
be used to improve relation extraction, the shortest path between two eligible
entities was extracted along with the grammatical information contained in the
dependency types. The dependency label sequences of the shortest path were transformed
into attributes.</p>
      <p>The experiments aimed at comparing clustering results based on
sequential patterns alone, syntactic information alone, and a mixed representation
with both types of information. Three feature spaces were thus constructed: 1)
the sequential pattern attributes of [ ], 2) the dependency path attributes and
3) a combined feature space using both types of attributes. Clustering was done
using hierarchical agglomerative clustering with a bisective initialization [ ]
implemented in CLUTO [ ]. The initialization by repeated bisections yields a
number of centroids that serve to augment the original feature space; the values
of these new dimensions are given by the distance of each object (here: concept
pairs) from the centroids. The clusters were evaluated against the manually
categorized sample of [ ]. Results show that combining both types of information
is beneficial: the combined feature space provides the best clusters.</p>
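      <p>The dependency path attributes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the experiments: the triplets are written by hand for one example sentence rather than produced by a parser, the "up:" and "down:" direction prefixes, the word-index identifiers and the "seq=" and "dep=" attribute prefixes are hypothetical conventions, and the path is found by a plain breadth-first search over the undirected version of the dependency graph.</p>
      <preformat><![CDATA[
```python
from collections import deque


def shortest_dependency_path(triplets, source, target):
    """Return the dependency labels on the shortest path between two words.

    `triplets` is an iterable of (head, label, dependent). Arcs may be
    traversed in both directions; the prefix records the direction taken.
    Returns None if the two words are not connected.
    """
    graph = {}
    for head, label, dep in triplets:
        graph.setdefault(head, []).append((dep, "down:" + label))
        graph.setdefault(dep, []).append((head, "up:" + label))
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        node, labels = queue.popleft()
        if node == target:
            return labels
        for neighbour, label in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append((neighbour, labels + [label]))
    return None


def combine(sequential, syntactic):
    """Merge two attribute dictionaries into one prefixed feature space."""
    combined = {"seq=" + name: value for name, value in sequential.items()}
    combined.update({"dep=" + name: value for name, value in syntactic.items()})
    return combined


# Hand-written triplets for "The parser is trained on the treebank"
# (illustrative only, not actual parser output).
triplets = [
    ("parser-2", "det", "The-1"),
    ("trained-4", "nsubjpass", "parser-2"),
    ("trained-4", "auxpass", "is-3"),
    ("trained-4", "prep", "on-5"),
    ("on-5", "pobj", "treebank-7"),
    ("treebank-7", "det", "the-6"),
]

path = shortest_dependency_path(triplets, "parser-2", "treebank-7")
print(path)  # ['up:nsubjpass', 'down:prep', 'down:pobj']

# One attribute per label sequence, merged with a toy sequential pattern count.
syntactic = {" ".join(path): 1}
sequential = {"X trained on Y": 1}
features = combine(sequential, syntactic)
print(sorted(features))
```
]]></preformat>
      <p>Prefixing the attribute names keeps the two sources apart in the merged dictionary, so that a sequential pattern and a dependency path never share a column. How the two kinds of attributes should be weighted against each other is left open in this sketch.</p>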
      <p>Acknowledgments This work is part of the program "Investissements d’Avenir"
overseen by the French National Research Agency, ANR- -LABX- (Labex
EFL).</p>
      <p>. R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and
performance improvements. In EDBT, pages – , .
. P. D. Turney. Measuring semantic similarity by latent relational analysis. In
IJCAI- , .
. P. D. Turney. Similarity of semantic relations. CoRR, abs/cs/ , .
. P. D. Turney. Domain and function: A dual-space model of semantic relations and
compositions. Journal of Artificial Intelligence Research, , .
. P. D. Turney and S. M. Mohammad. Experiments with three approaches to
recognizing lexical entailment. Natural Language Engineering, .
. J. Weeds, D. Clarke, J. Ren, D. Weir, and B. Keller. Learning to distinguish
hypernyms and co-hyponyms. In COLING ’ , .
. R. Yangarber, W. Lin, and R. Grishman. Unsupervised learning of generalized
names. In COLING ’ , .
. Y. Zhao and G. Karypis. Evaluation of hierarchical clustering algorithms for
document datasets. In CIKM, .
. Y. Zhao, G. Karypis, and U. Fayyad. Hierarchical clustering algorithms for
document datasets. Data Mining for Knowledge Discovery, , March .
. G. Zhou, J. Su, J. Zhang, and M. Zhang. Exploring various knowledge in relation
extraction. In ACL ’ , .</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>