<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LIP6@CLEF2017: Multi-Modal Spatial Role Labeling using Word Embeddings</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Éloi Zablocki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick Bordes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laure Soulier</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benjamin Piwowarski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick Gallinari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Sorbonne Universités, UPMC Univ Paris 06, UMR 7606</institution>
          ,
          <addr-line>CNRS, LIP6, F-75005, Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We report on our participation in the multi-modal Spatial Role Labeling (mSpRL) lab at CLEF 2017. The task consists of extracting and classifying spatial relationships from textual data and associated images. Our approach focuses on the classification part, as we use a baseline system for the extraction of the relations: we train a linear Support Vector Machine (SVM) model to classify hand-crafted vectors representing spatial relations. We present our experiments and also discuss the effect of model parameters. Finally, we conclude the paper and introduce ideas for future developments.</p>
      </abstract>
      <kwd-group>
        <kwd>multi-modal spatial role labeling</kwd>
        <kwd>linear SVM</kwd>
        <kwd>multi-label classification</kwd>
        <kwd>spatial indicator</kwd>
        <kwd>landmark</kwd>
        <kwd>trajector</kwd>
        <kwd>word embedding</kwd>
        <kwd>RCC8 regions</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In this paper, we report on our participation in the multi-modal Spatial Role
Labeling (mSpRL) lab [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] at CLEF 2017 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The task consists of extracting and classifying
spatial relationships from textual data and associated images.
      </p>
      <p>The mSpRL goal is composed of three successive sub-tasks. The first one
aims at extracting spatially-related entities and annotating the text with the
following labels: Trajector, Spatial Indicator and Landmark. The second one
consists of associating the previously found entities into spatial relation triplets
r = (trajector, spatial indicator, landmark). The goal of the third sub-task is to
classify those relations. While the first two sub-tasks can be seen as a linguistic
conceptual representation (spatial role labeling), the third sub-task rather refers
to a formal semantic representation of relations (spatial qualitative labeling).
Possible labels for the relation classification are divided into three general types:
– Region RCC8 [10] (8 possible values): disconnected (DC), externally
connected (EC), equal (EQ), partially overlapping (PO), tangential proper part
(TPP), tangential proper part inverse (TPPi), non-tangential proper part
(NTPP), non-tangential proper part inverse (NTPPi).
– Direction (6 possible values): left, right, above, below, behind, front.
– Distance (5 possible values): middle, fast, close, far, near.</p>
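Written out as plain Python, the three label families above sum to 8 + 6 + 5 = 19 specific values, which is exactly the class count used later by the One-vs-Rest classifier (a minimal sketch; the dictionary keys are naming choices of ours, not from the dataset):

```python
# The three general types and their specific values, as listed above.
LABELS = {
    "region_rcc8": {"DC", "EC", "EQ", "PO", "TPP", "TPPi", "NTPP", "NTPPi"},
    "direction": {"left", "right", "above", "below", "behind", "front"},
    "distance": {"middle", "fast", "close", "far", "near"},
}

# 8 + 6 + 5 = 19 specific values in total: the number of classes of the
# One-vs-Rest multi-label strategy described later.
total = sum(len(values) for values in LABELS.values())
print(total)  # 19
```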
      <p>
        The mSpRL task of CLEF 2017 is built upon SemEval 2012 Task 3 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which
was proposed several years ago. That task has been augmented with images
paired with the text as additional input, and with sub-task 3 (relation classification).
      </p>
      <p>The training set consists of 275 images, 600 associated sentences (several
sentences can be linked to a single image), and a total of 761 relations. Figure 1
shows an example of the task.</p>
      <p>In qualitative spatial representation and reasoning, spatial relations can be
classified precisely (for example, RCC8 is a set of topological relations between
regions). Identifying spatial relations using text only is a difficult task, due to the
variety of meanings and interpretations that words and sentences can have.
Exploiting visual data could be paramount to recognizing the spatial objects and their
relations, but it gives rise to multi-modal alignment issues. In the dataset, images
are segmented into annotated bounding boxes, and the spatial relations between
these boxes are given. This spatial information enables us to enrich the textual
data.</p>
      <p>
        In our contribution, the extraction of spatial roles and relationships
(sub-tasks 1 and 2) was done using the winning system of previous years [11], an
implementation of which is available in the Saul framework [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. It considers
text only as input (ignoring images) and returns sets of trajectors, spatial
indicators and landmarks grouped into relations. Our contribution focuses on two
aspects: using the provided images as a complementary source of information, and
the relation classification sub-task. To do so, we hand-craft a representation of
spatial relations as a vector built from multi-modal inputs: the textual triplet
and features from the associated image. We then train an SVM to classify the
spatial relation. The SVM is trained to predict both general types (region,
direction and distance labels) and specific values (EC, front, close, ...). Note that
multiple labels can be associated with a single relation (as is the case in
the example of Figure 1).
      </p>
      <p>The rest of this document is organized as follows: we first describe our
contribution, including the classification pipeline and the design of the relation
embeddings. We then present our experiments and the associated results.</p>
    </sec>
    <sec id="sec-2">
      <title>Model</title>
      <p>
        Our contribution mainly focuses on sub-task 3, since we used the previous
state-of-the-art model of [11], re-implemented in Saul [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], to perform sub-tasks 1
and 2. Sub-task 3 is a supervised classification problem in which the available
input data is composed of three elements: the relation triplet, the original sentence
from which the triplet was extracted, and an associated image. We convert this
input data into a multi-modal embedding e_relation, described in Section 2.1.
We then use a linear SVM to classify the general types and specific values
of the relations; the classification is described in Section 2.2.
      </p>
      <sec id="sec-2-1">
        <title>Relation Embedding</title>
        <p>A relation is defined by visual data (an image) and textual data (a triplet and the
sentence from which it was extracted). We build our embedding by concatenating
a textual embedding e_text and an image embedding e_image.</p>
        <p>e_relation = e_text ⊕ e_image</p>
        <p>Textual embedding. In our model, the text embedding contains information from
the triplet only, as we drop the original sentence. Indeed, we assume that the
information useful for the classification sub-task is contained in the extracted
triplet, and that using the surrounding sentence context would lead to
over-fitting of the model, given the small size of the training data.</p>
        <p>
          We construct e_text as follows:
e_text = u_trajector ⊕ u_landmark ⊕ 1_spatial indicator
where u_? is the average of the pre-trained embeddings of the words that compose
?. In our experiments, we consider both GloVe embeddings [9] and the multi-modal
word embeddings described in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. 1_? is a one-hot encoding of the spatial indicator
? of the relation; we use a fixed lexicon of 77 spatial indicators, as they are
limited in number. ⊕ denotes the concatenation operator. Given the small amount
of training data, e_text should ideally have a small dimension to prevent over-fitting. With
that objective in mind, we project the word embeddings into a space of reduced
dimension; we consider both random projections and Principal Component Analysis
(PCA), conducted on the training data.
        </p>
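The construction of e_text can be sketched as follows. This is a toy illustration: the 4-dimensional vectors and the 3-indicator lexicon stand in for the real GloVe embeddings and the fixed lexicon of 77 spatial indicators, and the PCA/random-projection step is omitted.

```python
import numpy as np

# Hypothetical pre-trained word embeddings (4-d stand-ins for GloVe vectors).
emb = {
    "white":  np.array([0.1, 0.3, -0.2, 0.5]),
    "houses": np.array([0.4, -0.1, 0.2, 0.0]),
    "hill":   np.array([-0.3, 0.2, 0.1, 0.4]),
}

# Fixed lexicon of spatial indicators (77 in the paper; 3 here for brevity).
INDICATORS = ["on", "in", "under"]

def phrase_embedding(words):
    """u_? : average of the pre-trained embeddings of the phrase's words."""
    return np.mean([emb[w] for w in words], axis=0)

def one_hot(indicator):
    """1_? : one-hot encoding of the spatial indicator over the lexicon."""
    v = np.zeros(len(INDICATORS))
    v[INDICATORS.index(indicator)] = 1.0
    return v

def e_text(trajector, indicator, landmark):
    """e_text = u_trajector (+) u_landmark (+) 1_indicator (concatenation)."""
    return np.concatenate([phrase_embedding(trajector),
                           phrase_embedding(landmark),
                           one_hot(indicator)])

vec = e_text(["white", "houses"], "on", ["hill"])
print(vec.shape)  # (11,) = 4 + 4 + 3
```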
        <p>
          Visual embedding. Segmented images and pre-computed visual features are
provided in the dataset. A label is provided for each region of the segmented image.
The given visual feature of a region is a 27-dimensional vector containing
low-level information such as the region area, the width and height of the region, the mean
and standard deviation of height and width along the x and y axes respectively,
the boundary/area ratio, convexity, and the average, standard deviation and skewness
in the RGB and CIE-Lab color spaces. The images were segmented manually,
and the visual features of all regions were computed in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] and are included in
the dataset. We construct e_image as follows:
e_image = r_trajector ⊕ r_landmark ⊕ r_spatial
where r_? is the visual embedding of the region ?; we find the matching
region of a landmark or a trajector by taking the region annotated with the
most similar word (i.e., we compute cosine similarity scores on word
embeddings). In r_spatial, we encode in a one-hot vector the connectivity relations
between the landmark and trajector regions: adjacent/disjoint, beside/x-aligned,
above/below/y-aligned. This information is also provided as input data. As with
the textual embedding, to avoid over-fitting of the SVM classifier, we project the r_?
vectors into a space of smaller dimension.
        </p>
        <p>Note that the relation embeddings are hand-crafted and remain fixed
during training. The main reason for this is to reduce over-fitting of the model,
given the small size of the training dataset.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Classification</title>
        <p>
          Once the embeddings of the spatial relations are built (as explained in Section 2.1),
we use linear Support Vector Machines for classification [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], according to three
strategies, as shown in Figure 2:
– Mono-label: predicts a single label corresponding to the specific values.
Multi-labels are considered as distinct classes, giving a total of 28 classes (we
remove multi-label classes that do not occur in the training set). The general
type is simply deduced from the specific values.
– Multi-label: predicts multiple labels corresponding to the specific values.
We use the One-vs-Rest (OvR) strategy, which gives a total of 19
classes. The general type is simply deduced from the specific values.
– Hierarchical Multi-label: first predicts multiple labels corresponding to
the general types, then uses appropriate classifiers (each one trained on a
particular general type) to predict multiple labels corresponding to the
specific values for each of the predicted general types. This gives a total of
4 classifiers (one to determine the general types and one for each possible
general type).
</p>
        <p>Sub-task 1: identification of spatial entities. A sparse perceptron classifier is
trained for each role: Trajector, Spatial Indicator, and Landmark. The features
are designed using a set of lexical, syntactic, and contextual features (lexical
surface of the phrases, phrase headwords, POS-tags, dependency relations,
subcategorization, etc.). Results are presented in Table 1.</p>
        <p>Sub-task 2: identification of spatial relations. In order to classify spatial
relations, two binary classifiers are trained on pairs of phrases: one takes as
input Trajector-Spatial Indicator pairs, while the other considers Spatial
Indicator-Landmark pairs. With the perceptron assigned to Spatial Indicators, the
indicator candidates are found, and all possible role-indicator pairs are
candidates for the binary classifiers trained earlier. In the end, pairs with a
common indicator form the final triplets. Results are presented in Table 2.
</p>
        <p>Sub-task 3: classification of spatial relations. This sub-task is the main
focus of our contribution, since we were interested in using multi-modal
embeddings for classifying spatial relations. For this purpose,
we run two different scenarios:
– Our submitted best model (no image) uses the mono-label
classification strategy. We have e_relation = e_text, as e_image is ignored. Word embeddings
are projected into a space of dimension 25 with PCA (outperforming random
projection). GloVe embeddings are used, as they outperformed the multi-modal
embeddings of [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
– Our submitted model with image is the same as the model without image,
but the relation embedding additionally includes r_spatial: e_relation = e_text ⊕ r_spatial.
</p>
        <p>Overall results. Table 3 presents the results obtained for sub-task 3 in our
scenarios, with respect to two baselines:
– Organizer’s baseline: the features for the type classifiers are the
concatenation of the features from sub-task 1 for each argument of the triplet.
– Best model, which is the same model as the one submitted without image,
but the word embeddings are not projected into a lower-dimensional space and
stay unchanged.
Note that all hyper-parameters are chosen with 5-fold cross-validation
on the training set. Our model, in its different settings, reaches
better scores (precision, recall and F1) than the baseline by a large margin. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]
reports comparable and slightly better results with 10-fold cross-validation.
For more detailed results, we refer the reader to Appendix A.
        </p>
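The One-vs-Rest multi-label strategy above can be sketched as follows. This is a dependency-free toy: a perceptron-style linear classifier stands in for the linear SVM, and the label subset and relation embeddings are illustrative, not from the dataset.

```python
import numpy as np

# One-vs-Rest multi-label sketch: one linear classifier per specific value,
# each trained to separate "has this label" from "does not have it".
LABELS = ["EC", "front", "close"]   # small subset of the 19 specific values

def train_ovr(X, Y, epochs=20):
    """X: relation embeddings; Y[i]: the set of labels of X[i]."""
    W = np.zeros((len(LABELS), X.shape[1] + 1))        # weights + bias term
    Xb = np.hstack([X, np.ones((len(X), 1))])
    for _ in range(epochs):
        for x, labels in zip(Xb, Y):
            for k, lab in enumerate(LABELS):
                target = 1.0 if lab in labels else -1.0
                if target * (W[k] @ x) <= 0:           # misclassified: update
                    W[k] += target * x
    return W

def predict(W, x):
    """Return every label whose binary classifier fires (multi-label output)."""
    xb = np.append(x, 1.0)
    return {lab for k, lab in enumerate(LABELS) if W[k] @ xb > 0}

# Toy relation embeddings; note a relation can carry several labels at once.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
Y = [{"EC", "close"}, {"EC", "close"}, {"front"}, {"front"}]
W = train_ovr(X, Y)
print(predict(W, np.array([0.95, 0.05])))
```

The mono-label strategy would instead treat each observed label combination (e.g. the pair {EC, close}) as one of 28 atomic classes and pick exactly one.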
        <p>Classification strategies. To refine these results, we also compare the different
classification strategies, as shown in Table 4. Mono-label classification appears
to work better than the other strategies. Interestingly, the hierarchical strategy gives
higher precision but worse recall (and an overall worse F1 score). A joint model,
coupling the mono-label and hierarchical models, might lead to even better
performance by taking advantage of both (best precision for Hierarchical, best
recall for Mono-label).</p>
        <p>Influence of the components of the embeddings. At a finer level, we also measure
the influence of each component of the global relation embedding e_relation
with an ablation study. Instead of using full relation embeddings with all of their
components, we remove one or several parts and report in Table 5 the
results of classification models trained on these partial relation
embeddings. Each line of the table contains the results of 5-fold cross-validation
training on e_relation without the ablated part, namely the image, text, spatial
indicator, or visual region embeddings. This experimentally highlights the importance
of the spatial indicator and of the textual embeddings, for the trajector first and
then the landmark. The visual parts of the relation embedding are useless or
even harmful to the overall performance.</p>
        <p>While it is unclear why using the visual embeddings slightly degrades the
overall performance, we note that many images do not contain regions for the
entities found by sub-task 1 in the sentence (for example, "bench" is found as
a trajector in the sentence "a bench in a park", but the associated image might
not contain any region labeled "bench"). Moreover, our algorithm sometimes
misses the correct region: for example, sub-task 1 finds the entity "head" in the
text, but there is no "head" region in the image, only "face-of-person".
Despite some handcrafted rules that we added to account for this problem,
many regions are not considered. Also, even though high-level features from
the images are provided, we assume that there are not enough images in the
training set to learn something complementary to the text. Finally, since
the classification labels of sub-task 3 are gold labels prone to annotation
subjectivity, even a human annotator would not reach a 100% F1-score. It would
therefore be interesting to know the human performance on this sub-task for
comparison with our results.</p>
        <p>Word embedding dimension influence. Since the textual embeddings proved to
be major components of the spatial relation representation, we evaluate the
impact of the dimension of the space into which word embeddings are projected
with PCA. We vary this parameter while keeping the others fixed, in the
experiment reported in Table 3. We can see that increasing the word
embedding dimension improves the effectiveness of our approach, and our best
performing model does not project the word embeddings at all. Intuitively,
high-dimensional embeddings lead to more parameters and a higher risk of
over-fitting the model. For a good trade-off between performance
and a limited size for the relation embeddings, 50 is also a suitable choice.</p>
        <p>In this work, we focused on sub-task 3 of the mSpRL lab of CLEF 2017:
predicting general types and specific values for relations. Our system relies on a
baseline to extract spatial roles and relations from raw textual data. We build fixed
embeddings for spatial triplets, and a linear SVM classifies the relations. Unfortunately,
we were not able to use the provided visual inputs in a profitable way, as our best
model ignores images. These results highlight that exploiting multi-modal data
to enhance natural language processing is a difficult task and requires more
effort in terms of model design.</p>
        <p>As future work, we have two objectives. First, we want to use the image
data for sub-tasks 1 and 2 in an end-to-end fashion, as visual information might
be useful to disambiguate between several candidate relations. Our other goal
aims at addressing the problem of the limited quantity of training data: we
wish to explore transfer learning techniques to train spatial word embeddings on
auxiliary tasks.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgements</title>
      <p>This work is partially supported by the CHIST-ERA EU project MUSTER
(http://www.chistera.eu/projects/muster) and the Labex SMART. We
additionally thank the task organizers for their help in using the baseline for sub-tasks
1 and 2.</p>
    </sec>
    <sec id="sec-4">
      <title>Appendix A: Detailed results of the best performing model</title>
      <p>9. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word
representation. In: EMNLP. vol. 14, pp. 1532-1543 (2014)</p>
      <p>10. Randell, D.A., Cui, Z., Cohn, A.G.: A spatial logic based on regions and
connection. In: KR 92, pp. 165-176 (1992)</p>
      <p>11. Roberts, K., Harabagiu, S.M.: UTD-SpRL: A joint approach to spatial role
labeling. In: Proceedings of the First Joint Conference on Lexical and Computational
Semantics-Volume 1: Proceedings of the main conference and the shared task, and
Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation.
pp. 419-424. Association for Computational Linguistics (2012)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Collell</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moens</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          :
          <article-title>Imagined visual representations as multimodal embeddings</article-title>
          .
          <source>In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)</source>
          .
          <source>AAAI</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Cortes</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          :
          <article-title>Support-vector networks</article-title>
          .
          <source>Machine learning 20(3)</source>
          ,
          <fpage>273</fpage>
          -
          <lpage>297</lpage>
          (
          <year>1995</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Escalante</surname>
            ,
            <given-names>H.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hernández</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>López-López</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Montes</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morales</surname>
            ,
            <given-names>E.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sucar</surname>
            ,
            <given-names>L.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Villaseñor</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grubinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>The segmented and annotated iapr tc-12 benchmark</article-title>
          .
          <source>Computer Vision and Image Understanding</source>
          <volume>114</volume>
          (
          <issue>4</issue>
          ),
          <fpage>419</fpage>
          -
          <lpage>428</lpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>G.J.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lawless</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , et al. (eds.):
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. 8th International Conference of the CLEF Association, CLEF 2017, Dublin, Ireland, September 11-14, 2017, Proceedings</source>
          , vol.
          <volume>10456</volume>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kordjamshidi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moens</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          :
          <article-title>Semeval-2012 task 3: Spatial role labeling</article-title>
          .
          <source>In: Proceedings of the First Joint Conference on Lexical and Computational Semantics-Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation</source>
          . pp.
          <fpage>365</fpage>
          -
          <lpage>373</lpage>
          . Association for Computational Linguistics (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kordjamshidi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moens</surname>
            ,
            <given-names>M.F.</given-names>
          </string-name>
          :
          <article-title>Global machine learning for spatial ontology population</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web 30</source>
          ,
          <fpage>3</fpage>
          -
          <lpage>21</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kordjamshidi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Saul: Towards declarative learning based programming</article-title>
          .
          <source>In: IJCAI: Proceedings of the International Joint Conference on Artificial Intelligence</source>
          , vol.
          <volume>2015</volume>
          , p.
          <fpage>1844</fpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Cappellato</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferro</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goeuriot</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (eds.):
          <source>CLEF 2017 Labs Working Notes</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>