A Genetic Algorithm for Combining Visual and Textual Embeddings Evaluated on Attribute Recognition

Ruiqi Li, Guillem Collell, Marie-Francine Moens
Computer Science Department, KU Leuven, 3001 Heverlee, Belgium
ruiqi.li1993@outlook.com, gcollell@kuleuven.be, sien.moens@cs.kuleuven.be

In: Mark Cieliebak, Don Tuggener and Fernando Benites (eds.): Proceedings of the 3rd Swiss Text Analytics Conference (SwissText 2018), Winterthur, Switzerland, June 2018.

Abstract

We propose a genetic-based algorithm for combining visual and textual embeddings into a compact representation that captures fine-grain semantic knowledge (attributes) of concepts. The genetic algorithm is able to select the most relevant representation components from the individual visual and textual embeddings when learning the representations, thus combining complementary visual and linguistic knowledge. We evaluate the proposed model in an attribute recognition task and compare the results with a model that concatenates the two embeddings and with models that only use monomodal embeddings.

1 Introduction

Distributed representations of words (Collobert et al., 2011; Mikolov et al., 2013; Pennington et al., 2014; LeCun et al., 2015) in a vector space that capture the textual contexts in which words occur have become ubiquitous and have been used effectively for many downstream natural language processing tasks such as sentiment analysis and sentence classification (Kim, 2014; Bansal et al., 2014). In computer vision, convolutional neural network (CNN) based image representations have become mainstream in object and scene recognition tasks (Krizhevsky et al., 2012a; Karpathy et al., 2014). Vision and language capture complementary information that humans automatically integrate in order to build mental representations of concepts (Collell and Moens, 2016). Certain concepts or properties of objects cannot be explicitly represented visually while, at the same time, not all properties are easily expressible with language. Here, we assume that many properties of objects are learned by humans both by visual perception and through the use of words in a verbal context. For example, a cat has fur, which is visually observed, but this property can also be learned from language, when speaking of the fur of this animal or of the hairs that shake when it moves. When building meaning representations of an object's attributes, combining visual representations or embeddings with textual representations therefore seems beneficial.

In this paper we investigate how to integrate visual and textual embeddings that have been trained on large image and text databases, respectively, in order to capture knowledge about the attributes of objects. We rely on the assumption that fine-grain semantic knowledge of attributes (e.g., shape, function, sound, etc.) is encoded in each modality (Collell and Moens, 2016). The results shed light on the potential benefit of combining vision and language data when creating better meaning representations of content. A first baseline model simply concatenates the visual and textual vectors, while a second model keeps a compact vector representation but selects the relevant vector components that make up the representation based on a genetic algorithm, which allows capturing a mixture of the most relevant visual and linguistic features that encode object attributes. We additionally compare our model with vision-only and text-only baselines. Our contributions in this paper are as follows. To the best of our knowledge, we are the first to disentangle and recombine embeddings based on a genetic algorithm.
We show that the genetic algorithm most successfully combines the complementary information of the visual and textual embeddings when evaluated in an attribute recognition task. Moreover, with this genetic algorithm we learn compact and targeted embeddings, where we assume that compact meaning representations are preferred over longer vectors in many realistic applications that make use of large sets of representations. Ultimately, this work provides insight into building better representations of concepts, which is essential towards improving automatic language understanding.

The rest of the paper is organized as follows. In the next section we review and discuss related work. In Section 3 we describe the proposed genetic algorithm that combines visual and textual embeddings in encoding and classifying attributes, as well as a baseline method that concatenates the visual and textual embeddings and baseline vision-only and text-only models. Next, we present and discuss our experimental results. Finally, in the conclusions and future work, we summarize our findings and suggest future lines of research.

2 Related Work

Representations of concepts are often task specific, and they have mostly been used in word similarity tasks. In this context, integration of visual and linguistic representations was realized by Collell et al. (2017); Lazaridou et al. (2015); Kiela and Bottou (2014); Silberer and Lapata (2014). Kiela and Bottou (2014) proposed the concatenation of visual and text representations, while Lazaridou et al. (2015) extend the skip-gram model to the multimodal domain, but none of these works regard attribute recognition. Silberer and Lapata (2014) obtain multimodal representations by implementing a stacked autoencoder with the visual and word vectors as input in an attribute recognition task; these vectors were separately trained with a classifier. In this work, we start from general pre-trained embeddings. Rubinstein et al. (2015) research attribute recognition by relying only on linguistic embeddings. Bruni et al. (2012) showed that the color attribute is better captured by visual representations than by linguistic representations. Farnadi et al. (2018) train a deep neural network for multimodal fusion of users' attributes as found in social media. They use a power-set combination of representation components in an attempt to better model shared and non-shared representations among the data sources, which are composed of images, texts of, and relationships between social media users. The closest work to ours is that of Collell and Moens (2016), who compare the performance of visual and linguistic embeddings, each pre-trained on a large image and a large text dataset respectively, for a large number of visual attributes, as well as for other non-visual attributes such as taxonomic, function or encyclopedic ones. In contrast to their work, we propose a model that integrates visual and linguistic embeddings, leveraging their finding that visual and linguistic embeddings encode complementary knowledge.

Genetic algorithms have been used for feature selection in text classification and clustering tasks (e.g., Abualigah et al. (2016); Gomez et al. (2017); Onan et al. (2017)), where the goal is to reduce the number of features. In this paper we continue this line of thinking for learning better multimodal embeddings.

3 Methodology

Given visual and textual embeddings of the same concept word but with different dimensionality, our goal is to combine the two embeddings so that the new embedding captures both visual and textual semantic knowledge, but in a more compact form than the concatenated representation. This section describes why and how we achieve this goal within the genetic algorithm (GA) framework.

3.1 Why Genetic Algorithms

When combining two embeddings, the idea that naturally comes to mind is to inspect the meaning of each dimension in order to "pick" the dimensions that are really useful for a certain task. However, what exactly each dimension of a learned embedding contributes to the whole representation has long been a debated issue in NLP: answering it requires dedicated analysis, and so far no final judgment has been reached. Returning to our goal, our task is not to investigate the exact meaning of each dimension but to choose the dimensions that really help. This motivates our use of a genetic algorithm, which can generate numerous candidate solutions and select among them based on the principle of natural selection. Specifically, genetic operators such as crossover can be used to vary the composition of the embeddings from one generation to the next.

3.2 Genetic Algorithm Basics

Belonging to the larger class of evolutionary algorithms, genetic algorithms (GA) are meta-heuristics inspired by the process of natural selection. In a given environment, a population of individuals competes for survival and, more importantly, reproduction. The ability of each individual to achieve certain goals determines his or her chance of producing the next generation. In a GA setting, an individual is a solution to the problem at hand, and the quality of the solution determines its fitness. The fittest individuals tend to survive and have children. By searching the solution space through simulated evolution, i.e., following a survival-of-the-fittest strategy, a GA achieves continuous improvement over successive generations.

GAs have been shown to generate high-quality solutions to linear and non-linear problems through biologically inspired operators such as mutation, crossover, and selection. A more complete discussion can be found in the book of Davis (1991). Algorithm 1 summarizes the procedure of a basic genetic algorithm.

Algorithm 1 Framework of a Genetic Algorithm.
1: initialize population;
2: evaluate population;
3: while (!StopCondition) do
4:     select the fittest individuals;
5:     breed new individuals;
6:     evaluate the fitness of the new individuals;
7:     replace the least fit individuals;
8: end while

There are six fundamental issues to be settled in order to use a genetic algorithm: the chromosome representation, initialization, the selection function, the genetic operators of reproduction, the evaluation function, and the termination criteria. The rest of this section describes these issues in detail for creating a compact representation that captures fine-grain semantic visual and textual knowledge.
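For concreteness, the loop of Algorithm 1 can be sketched in Python as follows. This is a minimal, generic illustration rather than the authors' implementation: the `fitness` and `crossover` callables are placeholders (crossover is assumed here to return a single child), and the population handling shown is the textbook variant; the following subsections describe how our setting departs from it.

```python
import random

def run_ga(init_population, fitness, crossover, max_generations=1000):
    """Minimal sketch of Algorithm 1: evolve a population until the stop condition."""
    population = list(init_population)                  # 1: initialize population
    scores = [fitness(ind) for ind in population]       # 2: evaluate population
    for _ in range(max_generations):                    # 3: while not StopCondition
        ranked = sorted(zip(scores, population),
                        key=lambda p: p[0], reverse=True)
        parents = [ind for _, ind in ranked[: max(2, len(ranked) // 2)]]   # 4: select the fittest
        children = [crossover(random.choice(parents),                      # 5: breed new individuals
                              random.choice(parents))
                    for _ in range(len(population))]
        child_scores = [fitness(c) for c in children]                      # 6: evaluate the new individuals
        pool = sorted(zip(scores + child_scores, population + children),
                      key=lambda p: p[0], reverse=True)[: len(population)]  # 7: replace the least fit
        scores = [s for s, _ in pool]
        population = [ind for _, ind in pool]
    return max(zip(scores, population), key=lambda p: p[0])[1]   # fittest individual found
```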
3.3 Chromosome Representation, Initialization, and Selection

The chromosome representation determines the problem structure and the genetic operators in a GA. The floating point representation of chromosomes has been shown to be natural for evolution strategies and evolutionary programming (Periaux et al., 2015). One may point out that the pre-trained visual and textual embeddings can naturally be used as the original chromosomes, since they consist of floating point numbers. But recall that our goal is to form a compact embedding, and by "compact" we mean that the dimensionality of the final embedding should be smaller than that of the concatenation of the visual and textual embeddings. For this reason, we first concatenate the visual and textual embeddings, then shuffle the dimensions of the concatenation, and divide the concatenation into two embeddings of the same dimensionality. These two embeddings are used as the original chromosomes. Specifically, each real number in an embedding vector, representing a feature of the target concept, can be seen as a gene. In this way, each chromosome (embedding) is made up of a sequence of shuffled real numbers (floating point values) that come either from the original visual or from the original textual embedding. The two embeddings can thus be seen as mixtures of visual and textual knowledge in different degrees. For clarity, we henceforth use the term embedding instead of chromosome.
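As an illustration of this construction, the two initial chromosomes for one concept could be built as follows. This is a sketch based on our reading of the text; the NumPy formulation and the variable names are ours, and we assume one fixed permutation shared by all concepts so that a given dimension refers to the same original feature everywhere.

```python
import numpy as np

def initial_chromosomes(visual_vec, textual_vec, perm):
    """Concatenate the visual and textual embeddings of a concept, shuffle the
    dimensions with a fixed permutation, and split the result into two equally
    sized chromosome embeddings (e.g. 4096 + 300 = 4396 -> 2 x 2198)."""
    concat = np.concatenate([visual_vec, textual_vec])
    shuffled = concat[perm]            # perm: permutation of range(len(concat))
    half = shuffled.shape[0] // 2
    return shuffled[:half], shuffled[half:]

# Example: one shared permutation for all concepts (an assumption on our part).
rng = np.random.default_rng(0)
perm = rng.permutation(4096 + 300)
chrom_a, chrom_b = initial_chromosomes(np.zeros(4096), np.zeros(300), perm)
```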
In a standard GA, the initial population is often generated randomly, and the selection function is usually based on the fitness of an individual. In our case, however, as explained previously, the initial population is formed by the original embeddings. Consequently, we change the target of the selection function. Instead of trying to select the fittest individuals to reproduce, the selection function first makes sure that every pair of visual and textual embeddings with the same target concept reproduces a group of candidates for the next generation, by repeating the reproduction method several times. The reproduction method involves randomly initialized parameters and produces different children each time. Once a group of child candidates has been generated, they compete against each other for survival, and only the fittest one wins the opportunity of becoming the next generation. In this way, the fitness of the children's generation is assured to be better than that of their parents' generation, and the fitness is guaranteed to improve over the generations.

3.4 The Genetic Operators of Reproduction

Genetic operators determine the basic search mechanism and create new solutions based on existing ones in the population. Normally there are two types of operators: crossover and mutation. Crossover takes two individuals and produces two new individuals, while mutation alters one individual and produces one new individual. Since the embeddings used in our problem represent mappings from spaces with one dimension per concept word to continuous vector spaces of lower dimension, the value in each dimension of an embedding characterizes the target concept in the vector space and should not be recklessly changed. For this reason, we only use the crossover operator to reproduce the next generation.

Recall that we now have two embeddings, each of which can be seen as a mixture that combines visual and textual knowledge in different degrees. Our goal is to find all dimensions that help to achieve a certain goal. To test whether a certain dimension is relevant, the crossover operator is defined as follows. Let $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$ be two $n$-dimensional embeddings. The crossover operator generates two random integers $k, t$ drawn uniformly from $1$ to $n$ and creates two new embeddings $X' = (x'_1, \dots, x'_n)$ and $Y' = (y'_1, \dots, y'_n)$ according to:

$$x'_i = \begin{cases} x_i & \text{if } i \neq k \\ y_i & \text{otherwise} \end{cases} \qquad (1)$$

$$y'_i = \begin{cases} y_i & \text{if } i \neq t \\ x_i & \text{otherwise} \end{cases} \qquad (2)$$

As mentioned for the selection function, in one reproduction the same crossover operator is applied to all embeddings, producing one candidate for the next generation. By repeating this a certain number of times, a group of different candidates is produced. We call such a repetition a "reproduction trial".
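A minimal sketch of the crossover of Eqs. (1) and (2); the NumPy formulation and the 0-based indexing are ours:

```python
import numpy as np

def crossover(x, y, rng):
    """Swap one randomly chosen dimension of Y into X' (position k) and,
    independently, one dimension of X into Y' (position t), as in Eqs. (1)-(2)."""
    n = x.shape[0]
    k, t = rng.integers(n), rng.integers(n)   # uniform positions in [0, n)
    x_new, y_new = x.copy(), y.copy()
    x_new[k] = y[k]                           # Eq. (1): only component k changes
    y_new[t] = x[t]                           # Eq. (2): only component t changes
    return x_new, y_new
```

Within one reproduction the same (k, t) pair would be applied to the embedding pair of every concept, and repeating this a number of times yields the group of candidates of one reproduction trial.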
gence criteria that evaluates the sum of deviations Recall that we now have two embeddings and each among individuals. Third, the algorithm can also be one can be seen as a mixture that combined visual and terminated when a lack of improvement over a cer- textual knowledge with different degrees. Our goal tain number of generations happens or, alternatively, is to find all dimensions that help to achieve a cer- when the value for the evaluation measure meets a tar- tain goal. To test whether a certain dimension is rel- get acceptability threshold. For instance, one can set evant, the crossover operator is defined as follows: as threshold if there is no improvement over a series Let X = (x1 , · · · , xn ) and Y = (y1 , · · · , yn ) be of 10 times of reproduction, or if the fitness of the two n-dimensional embeddings. The crossover opera- current generation is larger than the target threshold, tor generates two random integers k, t from a uniform then the algorithm terminates. Usually, several strate- distribution from 1 to n, and creates two new embed- gies can be used in conjunction with each other. In ding X 0 = (x01 , · · · , x0n ), Y 0 = (y10 , · · · , yn0 ) accord- the experiments described below, a conjunction of the ing to: maximum number of generations reproduced in the  xi if i 6= k first termination criterion and the maximum number x0i = (1) of generations that allows a lack of improvement in yi otherwise the third termination criterion is used. Please noted  here that a maximum number of generations repro- yi if i 6= t yi0 = (2) duced in the first termination criterion and a certain xi otherwise number of generations that allows a lack of improve- As mentioned in the selection function, in one time ment are two different concepts. For example, if the of reproduction the same crossover operator is applied former is set to 1000 while the latter 10, the algorithm to all embeddings, producing one candidate of next will terminate when either 1) the algorithms reproduce generation. By repeating it a certain number of times, 1000 generations; or 2) during the algorithm, there is a group of different candidates is produced. We call no improvement in fitness 10 consecutive times of re- such repetition a “reproduction trial”. production trials. 3.5 Evaluation and Termination 4 Experiments and Results Diverse evaluation functions can be used, depending 4.1 Experimental Setup on the specific tasks. For instance, for classification 4.1.1 Pre-trained Visual Embeddings tasks the evaluation function can be any classification metric such as precision or Jaccard similarity score, as Following Collell and Moens (2016), we use Ima- long as it can map the population into a partially or- geNet (Russakovsky et al., 2015) as our source of dered set. In regression, correlation is typically used visual data. ImageNet is the largest labeled image as evaluation function. In our experiment, the F1 mea- dataset, and covers 21,841 WordNet synsets or mean- sure is used as the evaluation function. Generally, we ings (Fellbaum, 1998) and over 14M images. We only use two types of F1 measure as the evaluation of fit- preserve synsets with more than 50 images, and we ness to avoid bias, one with respect to positive labels set an upper bound of 500 images per synset for com- 4 61 putation time. After this, 11,928 synsets are kept. We = is large. 
4.1.2 Pre-trained Word Embeddings

Following Collell and Moens (2016), we employ 300-dimensional GloVe vectors (Pennington et al., 2014) trained on the largest available corpus (840B tokens and a 2.2M-word vocabulary from the Common Crawl corpus), obtained from the GloVe website (http://nlp.stanford.edu/projects/glove).

4.1.3 Dataset

The dataset collected by McRae et al. (2005) consists of data gathered from 30 human participants who were asked to list properties (attributes) of concrete nouns. The data contains 541 concepts, 2,526 different attributes, and 10 attribute types.

Table 1: Attribute types, number of attributes in each type (# Attr.), and average number of concepts in each type (Avg. # concepts) with their respective standard deviations (SD).

Attribute type      # Attr.   Avg. # concepts   SD
encyclopedic        4         32.7              1.5
function            3         46                27.9
sound               1         34                -
tactile             1         26                -
taste               1         33                -
taxonomic           7         42                24.8
color               7         42.4              12.0
form and surface    14        63.7              29.9
motion              4         37.5              5.7

4.2 Attribute Recognition

To evaluate the composed embeddings, we assess how well the attributes from McRae et al. (2005) can be recognized by using the embeddings as input. For each attribute a, we build a dataset with the concepts to which this attribute applies as the positive class instances, while the rest of the concepts form the negative class. For example, a "beetle" is a negative instance and an "airplane" a positive instance for the attribute a = "is large", and an "ant" is a negative instance and a "bear" a positive instance for the attribute a = "has 4 legs". We consider that an attribute applies to a noun concept if a minimum of 5 people have listed it (this threshold was set by McRae et al. (2005)). We treat attribute recognition as a binary classification problem: for each attribute a we learn a predictor

f_a : X -> Y

where X ⊂ R^d is the input space of (d-dimensional) concept representations and Y = {0, 1} is the binary output space. We report results with a linear SVM classifier, implemented with the scikit-learn machine learning toolkit of Pedregosa et al. (2011).

To guarantee sufficient positive instances, only attributes with at least 25 positive instances in the above dataset are kept. This leads to a total of 42 attributes, covering 9 attribute types, and their corresponding instance sets. The concept selection in ImageNet described in Sect. 4.1.1 results in a visual coverage of 400 concepts (out of the 541 from the McRae et al. (2005) data), and, for a fair vision-language comparison, only the word embeddings (from GloVe) of these nouns are employed. Hence, our training data $\{(\vec{x}_i, y_i)\}_{i=1}^{400}$ consists of 400 instances. Table 1 shows the details of each attribute type.
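To make the classification setup concrete, the per-attribute predictor could be trained and scored as follows with scikit-learn. This sketch follows the protocol of Sections 4.2 and 4.3 (linear SVM, F1 score, 5 runs of 5-fold cross validation); the exact SVM variant (`LinearSVC`) and the fold handling are our assumptions, not the authors' released code.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate_attribute(X, y, runs=5, folds=5):
    """Binary attribute recognition f_a : X -> {0, 1} with a linear SVM,
    scored by F1 on the positive class over several runs of k-fold CV.
    X: (400, d) concept embeddings (CNN, GloVe, CON or GeMix); y: 0/1 labels."""
    scores = []
    for seed in range(runs):
        cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
        scores.extend(cross_val_score(LinearSVC(), X, y, cv=cv, scoring="f1"))
    return float(np.mean(scores))
```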
4.3 Parameter Setting

Notice that each reproduction operation gives birth to two new embeddings. To avoid potential bias, we evaluate one embedding with the F1 measure on the positive labels and the other with the F1 measure on the negative labels. The average of these two F1 measures could be used to evaluate fitness. However, in practice the negative labels are more numerous than the positive ones: an increase of the F1 measure on the negative labels combined with a decrease on the positive ones can still result in an increase of the average F1 measure. Thus, we use the F1 measure on the positive labels as the first measure of fitness and the F1 measure on the negative labels as the second. Only the candidate with the largest increase in the first F1 measure and an increase, or at least no decrease, in the second F1 measure is chosen as the next generation.
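The survival rule just described might be sketched as follows. How the predictions for the two candidate embeddings are obtained (a classifier trained on each) is left implicit, the helper names are ours, and the exact tie handling is our assumption.

```python
from sklearn.metrics import f1_score

def pair_fitness(y_true, pred_first, pred_second):
    """Fitness of one candidate pair: F1 on the positive labels for the first
    embedding, F1 on the negative labels for the second (Section 4.3)."""
    return (f1_score(y_true, pred_first, pos_label=1),
            f1_score(y_true, pred_second, pos_label=0))

def select_survivor(parent_fitness, candidate_fitnesses):
    """Return the index of the candidate with the largest gain in the first F1
    measure whose second F1 measure does not decrease; -1 keeps the parent."""
    best_idx, best_gain = -1, 0.0
    for i, (f1_pos, f1_neg) in enumerate(candidate_fitnesses):
        gain = f1_pos - parent_fitness[0]
        if gain > best_gain and f1_neg >= parent_fitness[1]:
            best_idx, best_gain = i, gain
    return best_idx
```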
The maximum number of generations is set to 10^6. The maximum number of reproduction trials without improvement among the child candidates is 10. The crossover operation is repeated 10^3 times within one reproduction trial. In the final evaluation of each embedding on the attribute recognition task, we perform 5 runs of 5-fold cross validation.

We evaluate four different embeddings as input to the attribute recognition task: 1) embeddings of the concept obtained with the GA described above (GeMix); 2) embeddings obtained by concatenating the visual and textual embedding vectors (CON); 3) monomodal visual embeddings (CNN); and 4) monomodal text embeddings (GloVe). Table 2 shows the number of dimensions of each embedding.

Table 2: Dimensionality of each embedding type.

         # dimensions
CNN      4096
GloVe    300
CON      4396
GeMix    2198

4.4 Results and Discussion

4.4.1 Performance per Attribute Type

We first evaluate how the proposed method performs on each attribute type. There are 9 attribute types, and we evaluate the four embeddings for each type using the average F1 measure.

From Table 4 one can see that the GeMix embeddings outperform the other three embedding methods in 7 attribute types, i.e., encyclopedic, function, tactile, taste, taxonomic, color, and form and surface. In particular, in encyclopedic GeMix increases the average F1 measure by more than 0.02, and in function and taxonomic it increases by nearly 0.02. We perform a Wilcoxon signed-rank test between each pair of methods on the different feature sets and find that the difference is significant at p ≤ 0.05.

Another interesting finding is that the performance of the concatenated embedding (CON) is not always better than the performance of the monomodal embeddings, CNN or GloVe. For instance, in tactile, color and motion, the F1 measure of CNN or GloVe is higher than that of the concatenated embedding. This indicates that there are certain attributes for which combined visual and textual knowledge does not necessarily perform better than unimodal visual or textual knowledge. This is further discussed in Section 4.4.2.

Table 4: Performance per attribute type: averages of F1 measures per attribute type (i.e., averaged over the individual attributes) for CNN, GloVe, CON and GeMix.

         Encyc   Funct   Sound   Tactile   Taste   Taxon   Color   Form&Surf   Motion
CNN      0.429   0.738   0.513   0.470     0.421   0.486   0.676   0.567       0.628
GloVe    0.422   0.743   0.747   0.517     0.341   0.495   0.630   0.563       0.595
CON      0.457   0.760   0.762   0.477     0.433   0.512   0.663   0.582       0.623
GeMix    0.471   0.786   0.758   0.520     0.438   0.528   0.671   0.588       0.620

4.4.2 Performance per Attribute and Overall

Table 5 provides a more detailed answer to our question, showing that GeMix outperforms the other three embeddings for 20 attributes, while CNN performs best for 6 attributes, GloVe for 7 attributes, and the concatenated embedding (CON) for 9 attributes. Specifically, GeMix outperforms the second best method by more than 0.04 for attributes 04 (lays eggs), 09 (is soft), 12 (a vegetable), and 14 (a mammal), and by more than 0.10 for 30 (has a beak) and 37 (made of wood).

Table 5: Performance on the attribute classification task per attribute in terms of F1 measure for each embedding method. Attributes 01-04 belong to encyclopedic, 05-07 to function, 08 to sound, 09 to tactile, 10 to taste, 11-17 to taxonomic, 18-21 to motion, 22-28 to color and 29-42 to form and surface.

         01      02      03      04      05      06      07      08      09      10      11
CNN      0.484   0.580   0.355   0.591   0.410   0.682   0.794   0.663   0.261   0.418   0.591
GloVe    0.463   0.611   0.521   0.591   0.486   0.602   0.971   0.701   0.551   0.311   0.668
CON      0.462   0.673   0.576   0.626   0.456   0.723   0.944   0.765   0.547   0.437   0.683
GeMix    0.502   0.661   0.570   0.672   0.466   0.734   0.944   0.717   0.605   0.439   0.690

         12      13      14      15      16      17      18      19      20      21      22
CNN      0.460   0.284   0.632   0.405   0.431   0.422   0.439   0.581   0.915   0.527   0.812
GloVe    0.292   0.233   0.628   0.491   0.484   0.530   0.320   0.617   0.822   0.510   0.622
CON      0.522   0.321   0.641   0.443   0.471   0.522   0.466   0.603   0.846   0.510   0.773
GeMix    0.565   0.228   0.702   0.475   0.437   0.476   0.475   0.651   0.863   0.524   0.822

         23      24      25      26      27      28      29      30      31      32      33
CNN      0.513   0.643   0.884   0.699   0.647   0.544   0.727   0.325   0.489   0.649   0.738
GloVe    0.347   0.595   0.743   0.668   0.379   0.448   0.437   0.313   0.495   0.767   0.672
CON      0.476   0.668   0.852   0.728   0.640   0.558   0.433   0.298   0.503   0.722   0.651
GeMix    0.546   0.660   0.829   0.704   0.639   0.548   0.414   0.476   0.512   0.744   0.734

         34      35      36      37      38      39      40      41      42
CNN      0.580   0.421   0.372   0.532   0.906   0.506   0.748   0.421   0.418
GloVe    0.444   0.440   0.368   0.345   0.970   0.522   0.543   0.415   0.291
CON      0.622   0.548   0.377   0.547   0.888   0.570   0.784   0.483   0.539
GeMix    0.622   0.562   0.387   0.670   0.900   0.573   0.791   0.468   0.500

According to Collell and Moens (2016), visual embeddings perform better than textual ones when recognizing three main attribute types: motion, form and surface, and color, while textual embeddings (GloVe) outperform the visual CNN embeddings in recognizing encyclopedic and function attributes. A closer look at Table 5 further reveals that for attribute types where the vision or the language embeddings perform better than the other modality, it is highly likely that adding, respectively, language or vision information lowers the performance, e.g., attributes 05 (hunted by people) and 07 (used for transportation) in function, and 20 (is fast) and 21 (eats) in motion. Because GeMix tends to set aside the "noisy" dimensions of the embeddings, it performs better than the concatenated embedding.

Let us now look at the overall average F1 measure. We evaluate the F1 measure with respect to two different aspects. First, the overall average F1 measure per attribute, i.e., $F1_{attr} = \frac{1}{|L|}\sum_{l \in L} F1_l$, where $|L|$ is the number of different attributes (42 in our case) and $F1_l$ is the F1 measure of a specific attribute. Second, the overall average F1 measure per sample, i.e., $F1_{samp} = \frac{1}{|S|}\sum_{s \in S} F1_s$, where $|S|$ is the number of samples (400 in our case) and $F1_s$ is the F1 measure of each sample. Table 3 shows that in both cases GeMix achieves the highest F1 measure.

Table 3: Overall F1 measure per attribute and per sample.

         F1_attr   F1_samp
CNN      0.535     0.469
GloVe    0.552     0.474
CON      0.572     0.495
GeMix    0.586     0.507
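As a worked illustration of the two averages, assume a 400 x 42 binary matrix of gold labels and one of predictions, so that the per-sample score is the F1 over a concept's 42 attribute decisions (the "samples" average in scikit-learn terms); this reading of the per-sample measure is our interpretation.

```python
import numpy as np
from sklearn.metrics import f1_score

def overall_f1(Y_true, Y_pred):
    """Y_true, Y_pred: binary arrays of shape (400 samples, 42 attributes).
    Returns (F1_attr, F1_samp): the mean of the per-attribute F1 scores and
    the mean of the per-sample F1 scores, as defined in Section 4.4.2."""
    f1_attr = np.mean([f1_score(Y_true[:, a], Y_pred[:, a])
                       for a in range(Y_true.shape[1])])
    f1_samp = np.mean([f1_score(Y_true[s, :], Y_pred[s, :])
                       for s in range(Y_true.shape[0])])
    return f1_attr, f1_samp
```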
5 Conclusion and Future Work

In this paper, we propose a genetic-based algorithm which learns a compact representation that combines visual and textual embeddings. Two embeddings, obtained by randomly and evenly dividing the shuffled concatenation of the visual and textual embeddings, are used as the initial chromosomes of the genetic algorithm. A variant of the one-point crossover method is used to move the most relevant components of the representation to one embedding and the non-relevant ones to the other. To avoid bias, we use two measures to evaluate fitness: one with respect to the positive labels and the other with respect to the negative labels. The learned embeddings can be seen as a combination of both visual and textual knowledge. In an attribute recognition task, the genetic-based representation outperformed a baseline composed of the concatenation of the visual and textual embeddings, as well as the monomodal visual and textual embeddings.

Another interesting finding of this paper is that for a small group of attributes for which either vision or language generally dominates, adding the other modality may lower the final performance. For example, for the attribute eats of the motion type, for which vision tends to perform better than language (Collell and Moens, 2016), the performance of the mixture of visual and textual representations is lower than that of the monomodal visual representation. Ultimately, our findings provide insights that can help build better multimodal representations by taking into account to what degree visual and textual knowledge should be mixed for different tasks.

References
L. M. Abualigah, A. T. Khader, and M. A. Al-Betar. 2016. Unsupervised feature selection technique based on genetic algorithm for improving the text clustering. In 2016 7th International Conference on Computer Science and Information Technology (CSIT), pages 1–6.

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In ACL (2), pages 809–815.

Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In ACL, pages 136–145.

Guillem Collell and Marie-Francine Moens. 2016. Is an image worth more than a thousand words? On the fine-grain semantic differences between visual and linguistic representations. In COLING, pages 2807–2817.

Guillem Collell, Ted Zhang, and Marie-Francine Moens. 2017. Imagined visual representations as multimodal embeddings. In AAAI, pages 4378–4384.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug):2493–2537.

Lawrence Davis. 1991. Handbook of Genetic Algorithms.

Golnoosh Farnadi, Jie Tang, Martine De Cock, and Marie-Francine Moens. 2018. User profiling through deep multimodal fusion. In WSDM, pages 171–179.

Christiane Fellbaum. 1998. WordNet. Wiley Online Library.

Juan Carlos Gomez, Stijn Hoskens, and Marie-Francine Moens. 2017. Evolutionary learning of meta-rules for text classification. In GECCO.

Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732.

Douwe Kiela and Léon Bottou. 2014. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In EMNLP, pages 36–45.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012a. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012b. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. 2015. Combining language and vision with a multimodal skip-gram model. arXiv preprint arXiv:1501.02598.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521(7553):436–444.

Ken McRae, George S. Cree, Mark S. Seidenberg, and Chris McNorgan. 2005. Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods 37(4):547–559.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR abs/1301.3781.

Aytug Onan, Serdar Korukoglu, and Hasan Bulut. 2017. A hybrid ensemble pruning approach based on consensus clustering and multi-objective evolutionary algorithm for sentiment classification. Information Processing & Management 53(4):814–833.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Jacques Periaux, Felipe Gonzalez, and Dong Seop Chris Lee. 2015. Evolutionary Optimization and Game Strategies for Advanced Multi-disciplinary Design. Springer Netherlands.

Dana Rubinstein, Effi Levi, Roy Schwartz, and Ari Rappoport. 2015. How well do distributional models capture different types of semantic knowledge? In ACL, volume 2, pages 726–730.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. IJCV 115(3):211–252.

Carina Silberer and Mirella Lapata. 2014. Learning grounded meaning representations with autoencoders. In ACL, pages 721–732.