<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Genetic Algorithm for Combining Visual and Textual Embeddings Evaluated on Attribute Recognition</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ruiqi Li</string-name>
          <email>ruiqi.li1993@outlook.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guillem Collell</string-name>
          <email>gcollell@kuleuven.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marie-Francine Moens</string-name>
          <email>sien.moens@cs.kuleuven.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, KU Leuven</institution>
          ,
          <addr-line>3001 Heverlee</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>In: Mark Cieliebak, Don Tuggener and Fernando Benites (eds.): Proceedings of the 3rd Swiss Text Analytics Conference (Swiss- Text 2018)</institution>
          ,
          <addr-line>Winterthur</addr-line>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We propose a genetic-based algorithm for combining visual and textual embeddings into a compact representation that captures fine-grain semantic knowledge (or attributes) of concepts. The genetic algorithm is able to select the most relevant representation components from the individual visual and textual embeddings when learning the representations, thus combining complementary visual and linguistic knowledge. We evaluate the proposed model on an attribute recognition task and compare the results with a model that concatenates the two embeddings and with models that only use monomodal embeddings.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Distributed representations of words
        <xref ref-type="bibr" rid="ref17 ref19 ref22 ref6">(Collobert et al.,
2011; Mikolov et al., 2013; Pennington et al., 2014;
LeCun et al., 2015)</xref>
        in a vector space that capture
the textual contexts in which words occur have
become ubiquitous and been used effectively for many
downstream natural language processing tasks such as
sentiment analysis and sentence classification
        <xref ref-type="bibr" rid="ref13 ref2">(Kim,
2014; Bansal et al., 2014)</xref>
        . In computer vision,
convolutional neural network (CNN) based image
representations have become mainstream in object and scene
recognition tasks
        <xref ref-type="bibr" rid="ref11 ref14 ref15">(Krizhevsky et al., 2012a; Karpathy
et al., 2014)</xref>
        . Vision and language capture
complementary information that humans automatically
integrate in order to build mental representations of
concepts
        <xref ref-type="bibr" rid="ref4">(Collell and Moens, 2016)</xref>
        . Certain concepts
or properties of objects cannot be explicitly visually
represented while, at the same time, not all the
properties are easily expressible with language. Here, we
assume that many properties of objects are learned by
humans both by visual perception and through the use
of words in a verbal context. For example, a cat has
fur, which is visually observed, but this
property can also be learned from language, e.g., when speaking of the fur
of this animal or of its hair shaking as the animal moves.
When building meaning representations of an object’s
attribute, combining visual representations or
embeddings with textual representations seems beneficial.
      </p>
      <p>
        In this paper we investigate how to integrate
visual and textual embeddings that have been trained
on large image and text databases respectively in
order to capture knowledge about the attributes of the
objects. We rely on the assumption that fine-grain
semantic knowledge of attributes (e.g., shape, function,
sound, etc.) is encoded in each modality
        <xref ref-type="bibr" rid="ref4">(Collell and
Moens, 2016)</xref>
        . The results shed light on the potential
benefit of combining vision and language data when
creating better meaning representations of content. A
first baseline model just concatenates the visual and
textual vectors, while a second model keeps a
compact vector representation, but selects relevant vector
components to make up the representation based on
a genetic algorithm, which allows capturing a
mixture of the most relevant visual and linguistic features
that encode object attributes. We additionally
compare our model with vision-only and text-only
baselines. Our contribution in this paper is as follows:
To the best of our knowledge, we are the first to
disentangle and recombine embeddings based on a
genetic algorithm. We show that the genetic algorithm
most successfully combines complementary
information of the visual and textual embeddings when
evaluated in an attribute recognition task. Moreover, with
this genetic algorithm we learn compact and targeted
embeddings, where we assume that compact
meaning representations are preferred over longer vectors
in many realistic applications that make use of large
sets of representations. Ultimately, this work provides
insight on building better representations of concepts,
which is essential towards improving automatic
language understanding.
      </p>
      <p>The rest of the paper is organized as follows. In the
next section we review and discuss related work. In
Section 3 we describe the proposed genetic algorithm
that combines visual and textual embeddings for
encoding and classifying attributes, as well as a baseline
method that concatenates the visual and textual
embeddings and baseline vision-only and text-only
models. Next, we present and discuss our experimental
results. Finally, in the conclusions and future work, we
summarize our findings and suggest future lines of
research.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>Representations of concepts are often task-specific
and have mostly been used in word
similarity tasks. In this context, integration of the visual
and linguistic representations was realized by Collell
et al. (2017); Lazaridou et al. (2015); Kiela and
Bottou (2014); Silberer and Lapata (2014). Kiela and
Bottou (2014) proposed the concatenation of visual
and text representations, while Lazaridou et al. (2015)
extend the skip-gram model to the multimodal
domain, but none of these works regard attribute
recognition. Silberer and Lapata (2014) obtain multimodal
representations by implementing a stacked
autoencoder with the visual and word vectors as input in an
attribute recognition task. These vectors were
separately trained with a classifier. In this work, we
start from general pre-trained embeddings.
Rubinstein et al. (2015) research attribute recognition by
relying only on linguistic embeddings. Bruni et al.
(2012) showed that the color attribute is better
captured by visual representations than by linguistic
representations. Farnadi et al. (2018) train a deep neural
network for multimodal fusion of user’s attributes as
found in social media. They use a power-set
combination of representation components in an attempt
to better model shared and non-shared representations
among the data sources, which are composed of
the images, the texts, and the relationships of social media
users.</p>
      <p>The closest work to ours is that of Collell and
Moens (2016), who compare the performance of visual
and linguistic embeddings, pre-trained on
a large image and a large text dataset respectively, for a large
number of visual attributes, as well as for other non-visual
attributes such as taxonomic, function or
encyclopedic. In contrast to their work, we propose a model that
integrates visual and linguistic embeddings,
leveraging their findings in which they show that visual and
linguistic embeddings encode complementary
knowledge.</p>
      <p>
        Genetic algorithms have been used for feature
selection in text classification and clustering tasks (e.g.,
        <xref ref-type="bibr" rid="ref1">Abualigah et al. (2016)</xref>
        ; Gomez et al. (2017); Onan
et al. (2017)), where the goal is to reduce the
number of features. In this paper we continue this line of
thinking for learning better multimodal embeddings.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Methodology</title>
      <p>Given visual and textual embeddings of the same
concept word but with different dimensionality, our goal
is to combine the two embeddings so that the new
embedding can capture both visual and textual
semantic knowledge but with a more compact form than the
concatenation representation. This section describes
why and how we achieve this goal under the genetic
algorithm (GA) framework.</p>
      <sec id="sec-3-1">
        <title>3.1 Why Genetic Algorithms</title>
        <p>When combining two embeddings, the idea that
naturally comes to mind is to check the meaning of
each dimension in order to “pick” the dimensions that
are really useful for a certain task. However, what
exactly each dimension in a learned
embedding means to the whole representation has
been a long-standing and highly debated issue in NLP. This is
work that requires devoted observation, and up till now
there has been no final judgment on this topic. Returning
to the original goal, our final task is not to investigate
the exact meaning of each dimension but to choose the
dimensions that really help. This motivates us to use a
genetic algorithm, which can provide numerous
solutions and select among them based on the natural-selection
principle. Specifically, genetic operators such as
crossover can be used to vary
the composition of the embeddings from one generation
to the next.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2 Genetic Algorithms</title>
        <p>Belonging to the larger class of evolutionary
algorithms, genetic algorithms (GA) are meta-heuristics
inspired by the process of natural selection. In a given
environment, a population of individuals competes for
survival and, more importantly, reproduction. The
ability of each individual to achieve certain goals
determines its chance of producing the next
generation. In a GA setting, an individual is a solution
to the problem, and the quality of the
solution determines its fitness. The fittest individuals tend
to survive and have children. By searching the
solution space through the use of simulated evolution,
i.e., following the survival of the fittest strategy, a GA
achieves continuous improvement over the successive
generations.</p>
        <p>GAs have been shown to generate high-quality
solutions to linear and non-linear problems through
biologically inspired operators such as mutation,
crossover, and selection. A more complete
discussion can be found in the book of Davis (1991).
Algorithm 1 summarizes the procedure of a basic genetic
algorithm.</p>
        <p>Algorithm 1 Framework of a Genetic Algorithm.
1: initialize population;
2: evaluate population;
3: while (!StopCondition) do
4:    select the fittest individuals;
5:    breed new individuals;
6:    evaluate the fitness of new individuals;
7:    replace the least fitted population;
8: end while</p>
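        <p>As an illustration, the steps of Algorithm 1 can be sketched in Python as follows; the helper functions are hypothetical placeholders for the problem-specific choices described in the remainder of this section, and all names are illustrative.</p>
        <preformat>
# A minimal sketch of Algorithm 1, assuming the problem-specific helpers are
# supplied by the caller; this is not the paper's implementation.
def genetic_algorithm(init_population, fitness, select, breed,
                      replace_weakest, max_generations=1000):
    population = init_population()                     # 1: initialize population
    scores = [fitness(ind) for ind in population]      # 2: evaluate population
    for _ in range(max_generations):                   # 3: while not StopCondition
        parents = select(population, scores)           # 4: select the fittest
        children = breed(parents)                      # 5: breed new individuals
        child_scores = [fitness(c) for c in children]  # 6: evaluate new individuals
        population, scores = replace_weakest(          # 7: replace the least fit
            population, scores, children, child_scores)
    return population[scores.index(max(scores))]       # fittest individual found
        </preformat>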
        <p>There are six fundamental issues to be determined
to use a genetic algorithm: chromosome
representation, initialization, the selection function, the genetic
operators of reproduction, evaluation function, and
termination criteria. The rest of this section describes
these issues in detail for creating a compact
representation that captures fine-grain semantic visual and
textual knowledge.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3 Chromosome Representation, Initialization, and Selection</title>
        <p>
          The chromosome representation determines the
problem structure and the genetic operators in a GA. The
floating point representation of the chromosomes has
been shown to be natural to evolution strategies and
evolutionary programming
          <xref ref-type="bibr" rid="ref23">(Periaux et al., 2015)</xref>
          . One
may point out that the pre-trained visual and textual
embeddings can naturally be used as the original
chromosomes since they consist of floating-point numbers. But
recall that our goal is to form a compact embedding,
and by “compact” we mean that the dimension of the
final embedding should be smaller than the
concatenation of the visual and textual embeddings. For
this reason, we first concatenate the visual and
textual embeddings, then shuffle the dimensions in
the concatenation, and divide the concatenation into
two embeddings with the same dimension. Those two
embeddings are used as the original chromosomes.
Specifically, each real number in an embedding
vector, representing a feature of the target concept, can
be seen as a gene. In this way, the chromosome
(embedding) is made up of a sequence of shuffled real
numbers (floating points), each of which comes from the
original visual or the original textual embedding. Thus the two
embeddings can be seen as mixtures of visual and
textual knowledge to different degrees. For
clarity, we henceforth use the term embedding instead of
chromosome.
        </p>
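        <p>A minimal sketch of this construction, assuming a 4096-dimensional visual vector and a 300-dimensional textual vector as in Section 4; the function name and the fixed seed are illustrative, and we assume the same permutation is shared by all concepts so that the crossover positions introduced below align across concepts.</p>
        <preformat>
# Sketch: concatenate the visual and textual embeddings, shuffle the
# dimensions, and split the result into two equal-length chromosomes.
import numpy as np

def initial_chromosomes(visual, textual, perm):
    concat = np.concatenate([visual, textual])  # e.g., 4096 + 300 = 4396 dims
    shuffled = concat[perm]                     # same permutation for every concept
    half = shuffled.shape[0] // 2
    return shuffled[:half], shuffled[half:]     # two mixed-modality chromosomes

perm = np.random.default_rng(0).permutation(4096 + 300)
x, y = initial_chromosomes(np.zeros(4096), np.ones(300), perm)
        </preformat>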
        <p>In a standard GA, the initial population is often
generated randomly and the selection function is
usually based on the fitness of an individual. However
in our case, as explained previously, the initial
population is formed by the original embeddings.
Consequently, we make a change in the target of the
selection function. Instead of trying to select the most fitted
individuals to reproduce, the selection function first
makes sure that every pair of visual and textual
embeddings having the same target concept reproduce a
group of candidates of the next generation, by
repeating the reproduction method several times. The
reproduction method involves randomly initialized
parameters and will produce different children each time.
Once a certain group of child candidates has been
generated, they compete against each other to
survive, and only the fittest one wins the opportunity of
becoming the next generation. In this way, the fitness
of the child generation is assured to be better than that of
the parent generation, and the fitness is guaranteed to
improve over generations.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4 The Genetic Operators of Reproduction</title>
        <p>Genetic operators determine the basic search
mechanism and create new solutions based on existing ones
in the population. Normally there are two types of
operators: crossover and mutation. Crossover takes two
individuals and produces two new ones, while
mutation alters one individual and produces one new one. Since the
embeddings used in our problem represent mappings
from spaces with one dimension per concept word to
continuous vector spaces with lower dimension, the
value in each dimension of the embeddings
characterizes the target concept in the vector spaces and should
not be recklessly changed. Due to this reason, we only
use the crossover operator to reproduce the next
generation.</p>
        <p>Recall that we now have two embeddings and each
one can be seen as a mixture that combines visual and
textual knowledge to different degrees. Our goal
is to find all dimensions that help to achieve a
given task. To test whether a certain dimension is
relevant, the crossover operator is defined as follows:
Let X = (x1, · · · , xn) and Y = (y1, · · · , yn) be
two n-dimensional embeddings. The crossover
operator generates two random integers k, t from a uniform
distribution over 1 to n, and creates two new
embeddings X′ = (x′1, · · · , x′n) and Y′ = (y′1, · · · , y′n)
according to:

x′i = xi if i ≠ k, and x′i = yi otherwise;    (1)

y′i = yi if i ≠ t, and y′i = xi otherwise.    (2)</p>
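        <p>A small sketch of this operator, using 0-based indices instead of the 1..n notation above; the names are illustrative.</p>
        <preformat>
# Sketch of Eqs. (1)-(2): copy X and Y, then overwrite position k of X'
# with the gene of Y, and position t of Y' with the gene of X.
import numpy as np

def crossover(x, y, rng):
    n = x.shape[0]
    k = rng.integers(n)   # random position, drawn uniformly over 0..n-1
    t = rng.integers(n)
    x_new, y_new = x.copy(), y.copy()
    x_new[k] = y[k]       # Eq. (1): x'_i = x_i if i != k, y_i otherwise
    y_new[t] = x[t]       # Eq. (2): y'_i = y_i if i != t, x_i otherwise
    return x_new, y_new
        </preformat>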
        <p>As mentioned for the selection function, in one round
of reproduction the same crossover operator is applied
to all embeddings, producing one candidate for the next
generation. By repeating this a certain number of times,
a group of different candidates is produced. We call
such a repetition a “reproduction trial”.</p>
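        <p>A sketch of one reproduction trial under these definitions; here `pairs` is a hypothetical mapping from each concept to its current chromosome pair, and the draw of (k, t) is shared by all concepts within one application of the operator.</p>
        <preformat>
# Sketch: apply the same randomly drawn crossover positions (k, t) to the
# chromosome pair of every concept; repeating this `repeats` times yields a
# group of candidates for the next generation. Names are illustrative.
import numpy as np

def reproduction_trial(pairs, n_dims, repeats=1000, seed=0):
    rng = np.random.default_rng(seed)
    candidates = []
    for _ in range(repeats):
        k, t = rng.integers(n_dims), rng.integers(n_dims)
        candidate = {}
        for concept, (x, y) in pairs.items():
            x_new, y_new = x.copy(), y.copy()
            x_new[k], y_new[t] = y[k], x[t]   # Eqs. (1)-(2), same k, t for all
            candidate[concept] = (x_new, y_new)
        candidates.append(candidate)
    return candidates
        </preformat>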
      </sec>
      <sec id="sec-3-5">
        <title>3.5 Evaluation and Termination</title>
        <p>Diverse evaluation functions can be used, depending
on the specific tasks. For instance, for classification
tasks the evaluation function can be any classification
metric such as precision or Jaccard similarity score, as
long as it can map the population into a partially
ordered set. In regression, correlation is typically used
as evaluation function. In our experiment, the F1
measure is used as the evaluation function. Generally, we
use two types of F1 measure as the evaluation of
fitness to avoid bias: one with respect to the positive labels
and the other with respect to the negative labels. Section 4 details
how we use the F1 measure as the evaluation
function.</p>
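        <p>As a sketch, the two fitness measures can be computed from a classifier's predictions with scikit-learn's f1_score; the function name is illustrative.</p>
        <preformat>
# Sketch: F1 on the positive labels for one embedding and F1 on the
# negative labels for the other, from the same set of predictions.
from sklearn.metrics import f1_score

def fitness_pair(y_true, y_pred):
    f1_pos = f1_score(y_true, y_pred, pos_label=1)  # fitness, first measure
    f1_neg = f1_score(y_true, y_pred, pos_label=0)  # fitness, second measure
    return f1_pos, f1_neg
        </preformat>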
        <p>A GA moves through generations, selecting and
reproducing, until a specific termination criterion is met.
From the point of view of reproduction, the stopping
criterion can be set as a maximum number of
generations reproduced; for example, the algorithm
stops once it has reproduced 1000 generations. This first
termination criterion is the most frequently used. A
second termination strategy is a population
convergence criterion that evaluates the sum of deviations
among individuals. Third, the algorithm can also be
terminated when there is a lack of improvement over a
certain number of generations or, alternatively,
when the value of the evaluation measure meets a
target acceptability threshold. For instance, one can
terminate the algorithm if there is no improvement over a series
of 10 reproduction trials, or if the fitness of the
current generation is larger than a target threshold. Usually, several
strategies are used in conjunction with each other. In
the experiments described below, we use a conjunction of a
maximum number of generations reproduced (the
first termination criterion) and a maximum number
of generations without improvement (the
third termination criterion). Please note
that the maximum number of generations
reproduced and the maximum number of generations without
improvement are two different concepts. For example, if the
former is set to 1000 while the latter is set to 10, the algorithm
terminates when either 1) it has reproduced
1000 generations, or 2) there has been
no improvement in fitness over 10 consecutive
reproduction trials.</p>
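        <p>A sketch of the combined stopping rule, with the bounds used in our experiments (Section 4.3) as illustrative defaults:</p>
        <preformat>
# Sketch: stop after a maximum number of generations, or once a number of
# consecutive reproduction trials has passed without fitness improvement.
def should_stop(generation, stale_trials, max_generations=10**6, patience=10):
    return generation >= max_generations or stale_trials >= patience
        </preformat>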
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Experiments and Results</title>
      <sec id="sec-4-1">
        <title>4.1 Experimental Setup</title>
      </sec>
      <sec id="sec-4-2">
        <title>4.1.1 Pre-trained Visual Embeddings</title>
        <p>
          Following
          <xref ref-type="bibr" rid="ref4">Collell and Moens (2016)</xref>
          , we use
ImageNet
          <xref ref-type="bibr" rid="ref25">(Russakovsky et al., 2015)</xref>
          as our source of
visual data. ImageNet is the largest labeled image
dataset, and covers 21,841 WordNet synsets or
meanings
          <xref ref-type="bibr" rid="ref9">(Fellbaum, 1998)</xref>
          and over 14M images. We only
preserve synsets with more than 50 images, and we
set an upper bound of 500 images per synset to limit
computation time. After this, 11,928 synsets are kept. We
extract a 4096-dimensional vector of features for each
image as the output of the last layer of a pre-trained
AlexNet CNN in Krizhevsky et al. (2012b). For each
concept, we combine the representations from its
individual images into a single vector by averaging the
CNN feature vectors of individual images
componentwise, which is equivalent to the cluster center of the
individual representations.
        </p>
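        <p>A sketch of this averaging step; `image_features` stands for a hypothetical array holding the CNN vectors of one concept's images.</p>
        <preformat>
# Sketch: one visual embedding per concept, as the componentwise mean of the
# 4096-d CNN feature vectors of its images (the cluster center).
import numpy as np

def concept_visual_embedding(image_features):
    # image_features: (num_images, 4096) array for a single concept
    return image_features.mean(axis=0)
        </preformat>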
      </sec>
      <sec id="sec-4-3">
        <title>4.1.2 Pre-trained Word Embeddings</title>
        <p>
          Following
          <xref ref-type="bibr" rid="ref4">Collell and Moens (2016)</xref>
          , we employ
300-dimensional GloVe vectors
          <xref ref-type="bibr" rid="ref22">(Pennington et al., 2014)</xref>
          trained on the largest available corpus (840B tokens
and a 2.2M-word vocabulary from the Common Crawl
corpus), obtained from the GloVe website
(http://nlp.stanford.edu/projects/glove).
        </p>
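        <p>A sketch of loading these vectors for our concept words, assuming the publicly released glove.840B.300d.txt file; the function name is illustrative.</p>
        <preformat>
# Sketch: read the GloVe text file and keep only the concept vocabulary.
import numpy as np

def load_glove(path, vocabulary):
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if parts[0] in vocabulary:    # word, then 300 float components
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors
        </preformat>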
      </sec>
      <sec id="sec-4-4">
        <title>4.1.3 Dataset</title>
        <p>The data set collected by McRae et al. (2005)
consists of data gathered from 30 human participants who
were asked to list properties—attributes—of concrete
nouns. The data contains 541 concepts, 2,526
different attributes, and 10 attribute types.</p>
        <p>
          To evaluate the composed embeddings, we assess how
well the attributes from McRae et al. (2005) can be
recognized by using the embeddings as input. For
each attribute a, we build a data set with the concepts
to which this attribute applies as the positive class
instances, while the rest of the concepts form the negative
class. For example, a “beetle” is a negative instance
and an “airplane” a positive instance for the attribute a
= is large. And an “ant” is a negative instance and
a “bear” is a positive instance for the attribute a =
has 4 legs. We consider that an attribute applies to a
noun concept if a minimum of 5 people have listed it,
a threshold set by McRae et al. (2005).
We treat attribute recognition as a binary classification
problem: for each attribute a we learn a predictor
fa : X → Y,
where X ⊂ Rd is the input space of
(d-dimensional) concept representations and Y = {0, 1} is
the binary output space. We report results with a linear
SVM classifier, implemented with the scikit-learn machine
learning toolkit
          <xref ref-type="bibr" rid="ref21">from Pedregosa et al. (2011)</xref>
          .
        </p>
        <p>[Table 1: per attribute type (encyclopedic, function, sound, tactile, taste, taxonomic, color, form and surface, motion), the number of attributes and the average number of concepts.]</p>
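        <p>A sketch of one such predictor with scikit-learn's LinearSVC; the arrays are random placeholders for one attribute's concept embeddings and binary labels.</p>
        <preformat>
# Sketch: train a linear SVM f_a for a single attribute a over d-dimensional
# concept embeddings with labels in {0, 1}. Data here is a placeholder.
import numpy as np
from sklearn.svm import LinearSVC

X = np.random.randn(400, 300).astype(np.float32)  # 400 concepts, d = 300
y = np.random.randint(0, 2, size=400)             # 1 if attribute a applies
f_a = LinearSVC().fit(X, y)
print(f_a.predict(X[:5]))                         # binary predictions
        </preformat>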
        <p>To guarantee sufficient positive instances, only
attributes with at least 25 positive instances in the above
dataset are kept. This leads to a total of 42 attributes,
covering 9 attribute types, and their corresponding
instance sets. The concept selection in ImageNet
described in Sect. 4.1.1 results in a visual coverage of
400 concepts (out of 541 from McRae et al. (2005)
data), and, for a fair vision-language comparison, only
the word embeddings (from GloVe) of these nouns are
employed. Hence, our training data {(xi, yi)}, i = 1, . . . , 400,
consists of 400 instances. Table 1 shows the details of each
attribute type.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.3 Parameter Setting</title>
        <p>Notice that each reproduction operation produces
two embeddings. To avoid potential bias,
we evaluate one embedding with the F1 measure on the
positive labels and the other on the negative labels. The
average of these two F1 measures could be an option for
evaluating fitness. However, in practice, the negative
labels are more numerous than the positive ones: an
increase of the F1 measure on the negative labels combined with
a decrease on the positive ones can still result in an
increase of the average F1 measure. Thus, we use the
F1 measure on the positive labels as the first measure
of fitness and the F1 measure on the negative labels as
the second. Only the candidate with the largest increase in the
first F1 measure and an increase, or at least no
decrease, in the second F1 measure is chosen as
the next generation.</p>
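        <p>A sketch of this selection rule; `evaluate` is a hypothetical function returning the pair of F1 measures for a candidate.</p>
        <preformat>
# Sketch: among a trial's candidates, keep the one with the largest gain in
# F1 on the positive labels whose F1 on the negative labels does not decrease.
def select_next_generation(candidates, evaluate, f1_pos_parent, f1_neg_parent):
    best, best_gain = None, 0.0
    for cand in candidates:
        f1_pos, f1_neg = evaluate(cand)
        gain = f1_pos - f1_pos_parent
        if gain > best_gain and f1_neg >= f1_neg_parent:
            best, best_gain = cand, gain
    return best  # None signals a trial without improvement
        </preformat>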
        <p>The maximum number of generations reproduced
is set to 10^6. The maximum number of
reproduction trials allowed without improvement among the
children candidates is 10. The number of repetitions of the
crossover operation in one reproduction trial is 10^3.
And in the final evaluation of each embedding on the
attribute recognition task, we perform 5 runs of 5-fold
cross validation.</p>
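        <p>A sketch of this evaluation protocol with scikit-learn, assuming X and y hold one attribute's embeddings and labels:</p>
        <preformat>
# Sketch: 5 runs of 5-fold cross-validation of the linear SVM on one
# attribute's dataset, averaging the F1 scores.
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.svm import LinearSVC

def evaluate_embedding(X, y, seed=0):
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=seed)
    scores = cross_val_score(LinearSVC(), X, y, scoring="f1", cv=cv)
    return scores.mean()
        </preformat>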
        <p>We evaluate four different embeddings as input
of the attribute recognition task: 1) Embeddings of
the concept obtained with the GA described above
(GeMix); 2) Embeddings obtained by
concatenating the visual and textual embedding vectors (CON);
3) Monomodal visual embeddings (CNN); and 4)
Monomodal text embeddings (GloVe). Table 2 shows
the number of dimensions of each embedding.</p>
        <p>[Table 2: number of dimensions of the CNN, GloVe, CON, and GeMix embeddings.]</p>
      </sec>
      <sec id="sec-4-6">
        <title>4.4 Results and Discussion</title>
      </sec>
      <sec id="sec-4-7">
        <title>4.4.1 Performance per Attribute Type</title>
        <p>We first evaluate how the proposed method performs
on each attribute type. There are 9 attribute types, and
we evaluate the four embeddings for each type with
the average F1 measure.</p>
        <p>From Table 4 one can see that the GeMix
embeddings outperform the other three embedding methods
in 7 attribute types, i.e., encyclopedic, function,
tactile, taste, taxonomic, color and form and surface.
Especially in encyclopedic, GeMix increases the average
F1 measure by more than 0.02 and in function and
taxonomic, it increases by nearly 0.02. We perform a
Wilcoxon signed-rank test for each pair of methods on
the different feature sets and find that the differences are
significant at p ≤ 0.05.</p>
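        <p>A sketch of this test using scipy; the inputs are the paired per-attribute F1 scores of two methods.</p>
        <preformat>
# Sketch: Wilcoxon signed-rank test over paired F1 scores of two methods.
from scipy.stats import wilcoxon

def compare_methods(f1_scores_a, f1_scores_b):
    # f1_scores_*: sequences of F1 values, one entry per attribute
    statistic, p_value = wilcoxon(f1_scores_a, f1_scores_b)
    return p_value  # significance claimed when p_value is at most 0.05
        </preformat>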
        <p>Another interesting finding is that the performance
of the concatenated embedding (CON) is not always
better than the performance of the monomodal
embeddings, CNN or GloVe. For instance, in tactile,
color and motion, the F1 measure of CNN or GloVe
is higher than that of concatenated embeddings. This
indicates that there are certain attributes in which the
performance of combined visual and textual
knowledge is not necessarily better than unimodal visual or
textual knowledge. This will further be discussed in
Section 4.4.2.</p>
      </sec>
      <sec id="sec-4-8">
        <title>4.4.2 Performance per Attribute and Overall</title>
        <p>Table 5 provides a more detailed answer to our
question, showing that GeMix outperforms the other three
embeddings in 20 attributes while CNN performs best
in 6 attributes, GloVe in 7 attributes and the
concatenated embedding (CON) in 9 attributes.
Specifically, GeMix outperforms the second-best method
by more than 0.04 in attributes 04 (lays eggs), 09
(is soft), 12 (a vegetable), and 14 (a mammal), and by
0.10 in 30 (has a beak) and 37 (made of wood).</p>
        <p>
          According to
          <xref ref-type="bibr" rid="ref4">Collell and Moens (2016)</xref>
          ,
visual embeddings perform better than textual ones
when recognizing three main attribute types: motion,
form and surface, and color, while textual
embeddings (GloVe) outperform the visual CNN
embeddings in recognizing encyclopedic and function
attributes. A closer look at Table 5 further reveals that
for attribute types where vision or language
embeddings show better performance than the other, it
is highly likely that adding language or
vision information, respectively, lowers the performance, e.g., attributes
05 (hunted by people) and 07 (used for transportation)
in function, and 20 (is fast) and 21 (eats) in motion.
Because GeMix tends to set aside the “noisy” dimensions
of the embeddings, it performs better than the
concatenated embedding.
        </p>
        <p>Let us take a look at the overall average F1
measure increase. We evaluate the F1 measure with
respect to two different aspects. First, the overall
average F1 measure per attribute, i.e., F1_attr = (1/|L|) ∑_L F1_L,
where |L| is the number of different attributes (42 in
our case) and F1_L is the F1 measure of a specific
attribute. Second, the overall average F1 measure per
sample, i.e., F1_samp = (1/|S|) ∑_S F1_S, where |S| is the
number of samples (400 in our case) and F1_S is the F1
measure of each sample. Table 3 shows that in both
cases, GeMix achieves the highest F1 measure.</p>
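        <p>A sketch of the two overall scores, assuming the per-attribute and per-sample F1 values have already been computed:</p>
        <preformat>
# Sketch: F1_attr averages over the 42 attributes; F1_samp averages over the
# 400 samples. Inputs are illustrative arrays of precomputed F1 values.
import numpy as np

def overall_f1(f1_per_attribute, f1_per_sample):
    f1_attr = np.mean(f1_per_attribute)  # (1/|L|) * sum of F1_L over attributes
    f1_samp = np.mean(f1_per_sample)     # (1/|S|) * sum of F1_S over samples
    return f1_attr, f1_samp
        </preformat>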
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusions and Future Work</title>
      <p>In this paper, we propose a genetic-based algorithm
which learns a compact representation that combines
visual and textual embeddings. Two embeddings,
obtained by randomly and evenly dividing the shuffled
concatenation of the visual and textual embeddings, are
used as the initial chromosomes in the genetic
algorithm. A variant of the one-point crossover method is
used to move the most relevant components of the
representation to one embedding, and the non-relevant
ones to the other. To avoid bias, we use two measures
as the evaluation of fitness: one with respect to the positive
labels and the other with respect to the negative labels. The learned
embeddings can be seen as a combination of both visual
and textual knowledge. In an attribute recognition task,
the genetic-based representation outperformed a
baseline composed of the concatenation of the visual and
textual embeddings, as well as the monomodal visual
and textual embeddings.</p>
      <p>
        Another interesting finding in this paper is that for
a small group of attributes for which either vision or
language generally dominates, adding the other
modality may lower the final performance. For example, for
the attribute eats in the motion type, for which
vision tends to perform better than language
        <xref ref-type="bibr" rid="ref4">(Collell
and Moens, 2016)</xref>
        , the performance of the mixture
of the visual and textual representations is lower than
that of the monomodal visual representation. Ultimately, our
findings provide insights that can help build better
multimodal representations by taking into account to
what degree the visual and textual knowledge
should be mixed for different tasks.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>L. M. Abualigah</surname>
            ,
            <given-names>A. T.</given-names>
          </string-name>
          <string-name>
            <surname>Khader</surname>
            , and
            <given-names>M. A.</given-names>
          </string-name>
          <string-name>
            <surname>Al-Betar</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Unsupervised feature selection technique based on genetic algorithm for improving the text clustering</article-title>
          .
          <source>In 2016 7th International Conference on Computer Science and Information Technology (CSIT)</source>
          . volume
          <volume>00</volume>
          , pages
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Mohit</given-names>
            <surname>Bansal</surname>
          </string-name>
          , Kevin Gimpel, and
          <string-name>
            <given-names>Karen</given-names>
            <surname>Livescu</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Tailoring continuous word representations for dependency parsing</article-title>
          .
          <source>In ACL (2)</source>
          . pages
          <fpage>809</fpage>
          -
          <lpage>815</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Elia</given-names>
            <surname>Bruni</surname>
          </string-name>
          , Gemma Boleda, Marco Baroni, and
          <string-name>
            <given-names>NamKhanh</given-names>
            <surname>Tran</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Distributional semantics in technicolor</article-title>
          .
          <source>In ACL. ACL</source>
          , pages
          <fpage>136</fpage>
          -
          <lpage>145</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Guillem</given-names>
            <surname>Collell</surname>
          </string-name>
          and
          <string-name>
            <surname>Marie-Francine Moens</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Is an image worth more than a thousand words? On the finegrain semantic differences between visual and linguistic representations</article-title>
          .
          <source>In COLING. ACL</source>
          , pages
          <fpage>2807</fpage>
          -
          <lpage>2817</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Guillem</given-names>
            <surname>Collell</surname>
          </string-name>
          , Ted Zhang, and
          <string-name>
            <surname>Marie-Francine Moens</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Imagined visual representations as multimodal embeddings</article-title>
          .
          <source>In AAAI</source>
          . pages
          <fpage>4378</fpage>
          -
          <lpage>4384</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Ronan</given-names>
            <surname>Collobert</surname>
          </string-name>
          , Jason Weston, Le´on Bottou, Michael Karlen, Koray Kavukcuoglu, and
          <string-name>
            <given-names>Pavel</given-names>
            <surname>Kuksa</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Natural language processing (almost) from scratch</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          (Aug):
          <fpage>2493</fpage>
          -
          <lpage>2537</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Lawrence</given-names>
            <surname>Davis</surname>
          </string-name>
          .
          <year>1991</year>
          .
          <article-title>Handbook of genetic algorithms</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Golnoosh</given-names>
            <surname>Farnadi</surname>
          </string-name>
          , Jie Tang, Martine De Cock, and
          <string-name>
            <given-names>MarieFrancine</given-names>
            <surname>Moens</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>User profiling through deep multimodal fusion</article-title>
          .
          <source>In WSDM</source>
          . pages
          <fpage>171</fpage>
          -
          <lpage>179</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Christiane</given-names>
            <surname>Fellbaum</surname>
          </string-name>
          .
          <year>1998</year>
          . WordNet. Wiley Online Library.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Juan</given-names>
            <surname>Carlos</surname>
          </string-name>
          <string-name>
            <given-names>Gomez</given-names>
            , Stijn Hoskens, and
            <surname>Marie-Francine Moens</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Evolutionary learning of meta-rules for text classification</article-title>
          .
          <source>GECCO.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Andrej</given-names>
            <surname>Karpathy</surname>
          </string-name>
          , George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and
          <string-name>
            <surname>Li</surname>
          </string-name>
          Fei-Fei.
          <year>2014</year>
          .
          <article-title>Large-scale video classification with convolutional neural networks</article-title>
          .
          <source>In CVPR. IEEE</source>
          , pages
          <fpage>1725</fpage>
          -
          <lpage>1732</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Douwe</given-names>
            <surname>Kiela</surname>
          </string-name>
          and Le´on Bottou.
          <year>2014</year>
          .
          <article-title>Learning image embeddings using convolutional neural networks for improved multi-modal semantics</article-title>
          .
          <source>In EMNLP</source>
          . pages
          <fpage>36</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Yoon</given-names>
            <surname>Kim</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Convolutional neural networks for sentence classification</article-title>
          .
          <source>arXiv preprint arXiv:1408.5882</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Alex</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , Ilya Sutskever, and
          <string-name>
            <surname>Geoffrey E. Hinton.</surname>
          </string-name>
          2012a.
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          .
          <source>In NIPS. USA</source>
          , pages
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Alex</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , Ilya Sutskever, and
          <string-name>
            <given-names>Geoffrey E</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2012b</year>
          .
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          . pages
          <fpage>1097</fpage>
          -
          <lpage>1105</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Angeliki</given-names>
            <surname>Lazaridou</surname>
          </string-name>
          , Nghia The Pham, and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Baroni</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Combining language and vision with a multimodal skip-gram model</article-title>
          .
          <source>arXiv preprint arXiv:1501.02598</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Yann</given-names>
            <surname>LeCun</surname>
          </string-name>
          , Yoshua Bengio, and
          <string-name>
            <given-names>Geoffrey</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Deep learning</article-title>
          .
          <source>Nature</source>
          <volume>521</volume>
          (
          <issue>7553</issue>
          ):
          <fpage>436</fpage>
          -
          <lpage>444</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Ken</given-names>
            <surname>McRae</surname>
          </string-name>
          ,
          George S Cree, Mark S Seidenberg,
          and
          <string-name>
            <surname>Chris McNorgan</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Semantic feature production norms for a large set of living and nonliving things</article-title>
          .
          <source>Behavior research methods 37</source>
          <volume>(4)</volume>
          :
          <fpage>547</fpage>
          -
          <lpage>559</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Mikolov</surname>
          </string-name>
          , Kai Chen, Greg Corrado, and
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Dean</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Efficient estimation of word representations in vector space</article-title>
          .
          <source>CoRR abs/1301.3781</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Aytug</given-names>
            <surname>Onan</surname>
          </string-name>
          , Serdar Korukoglu, and
          <string-name>
            <given-names>Hasan</given-names>
            <surname>Bulut</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>A hybrid ensemble pruning approach based on consensus clustering and multi-objective evolutionary algorithm for sentiment classification</article-title>
          .
          <source>Inf. Process. Manage</source>
          .
          <volume>53</volume>
          (
          <issue>4</issue>
          ):
          <fpage>814</fpage>
          -
          <lpage>833</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>F.</given-names>
            <surname>Pedregosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Varoquaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gramfort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Michel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thirion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Grisel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Blondel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prettenhofer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Dubourg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanderplas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Passos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cournapeau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brucher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Perrot</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Duchesnay</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Scikit-learn: Machine learning in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          :
          <fpage>2825</fpage>
          -
          <lpage>2830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Jeffrey</given-names>
            <surname>Pennington</surname>
          </string-name>
          , Richard Socher, and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Glove: Global vectors for word representation</article-title>
          .
          <source>In EMNLP</source>
          . volume
          <volume>14</volume>
          , pages
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Jacques</given-names>
            <surname>Periaux</surname>
          </string-name>
          , Felipe Gonzalez, and Dong Seop Chris Lee.
          <year>2015</year>
          .
          <article-title>Evolutionary optimization and game strategies for advanced multi-disciplinary design</article-title>
          . Springer Netherlands.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Dana</given-names>
            <surname>Rubinstein</surname>
          </string-name>
          , Effi Levi,
          <string-name>
            <given-names>Roy</given-names>
            <surname>Schwartz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ari</given-names>
            <surname>Rappoport</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>How well do distributional models capture different types of semantic knowledge?</article-title>
          .
          <source>In ACL</source>
          . volume
          <volume>2</volume>
          , pages
          <fpage>726</fpage>
          -
          <lpage>730</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Olga</given-names>
            <surname>Russakovsky</surname>
          </string-name>
          , Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla,
          <string-name>
            <given-names>Michael</given-names>
            <surname>Bernstein</surname>
          </string-name>
          , et al.
          <year>2015</year>
          .
          <article-title>Imagenet large scale visual recognition challenge</article-title>
          .
          <source>IJCV</source>
          <volume>115</volume>
          (
          <issue>3</issue>
          ):
          <fpage>211</fpage>
          -
          <lpage>252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>Carina</given-names>
            <surname>Silberer</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mirella</given-names>
            <surname>Lapata</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Learning grounded meaning representations with autoencoders</article-title>
          .
          <source>In ACL</source>
          . pages
          <fpage>721</fpage>
          -
          <lpage>732</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>