=Paper=
{{Paper
|id=Vol-2226/paper7
|storemode=property
|title=A Genetic Algorithm for Combining Visual and Textual Embeddings Evaluated on Attribute Recognition
|pdfUrl=https://ceur-ws.org/Vol-2226/paper7.pdf
|volume=Vol-2226
|authors=Ruiqi Li,Guillem Collell,Marie-Francine Moens
|dblpUrl=https://dblp.org/rec/conf/swisstext/LiCM18
}}
==A Genetic Algorithm for Combining Visual and Textual Embeddings Evaluated on Attribute Recognition==
Ruiqi Li, Guillem Collell, Marie-Francine Moens
Computer Science Department, KU Leuven
3001 Heverlee, Belgium
ruiqi.li1993@outlook.com, gcollell@kuleuven.be, sien.moens@cs.kuleuven.be

In: Mark Cieliebak, Don Tuggener and Fernando Benites (eds.): Proceedings of the 3rd Swiss Text Analytics Conference (SwissText 2018), Winterthur, Switzerland, June 2018
Abstract

We propose a genetic-based algorithm for combining visual and textual embeddings in a compact representation that captures fine-grain semantic knowledge—or attributes—of concepts. The genetic algorithm is able to select the most relevant representation components from the individual visual and textual embeddings when learning the representations, thus combining complementary visual and linguistic knowledge. We evaluate the proposed model in an attribute recognition task and compare the results with a model that concatenates the two embeddings and with models that only use monomodal embeddings.

1 Introduction

Distributed representations of words (Collobert et al., 2011; Mikolov et al., 2013; Pennington et al., 2014; LeCun et al., 2015) in a vector space that capture the textual contexts in which words occur have become ubiquitous and have been used effectively for many downstream natural language processing tasks such as sentiment analysis and sentence classification (Kim, 2014; Bansal et al., 2014). In computer vision, convolutional neural network (CNN) based image representations have become mainstream in object and scene recognition tasks (Krizhevsky et al., 2012a; Karpathy et al., 2014). Vision and language capture complementary information that humans automatically integrate in order to build mental representations of concepts (Collell and Moens, 2016). Certain concepts or properties of objects cannot be explicitly visually represented while, at the same time, not all properties are easily expressible with language. Here, we assume that many properties of objects are learned by humans both by visual perception and through the use of words in a verbal context. For example, a cat has fur, which is visually observed, but from language this property can also be learned when speaking of the fur of this animal or of hairs that shake when moving. When building meaning representations of an object's attributes, combining visual representations or embeddings with textual representations seems beneficial.

In this paper we investigate how to integrate visual and textual embeddings that have been trained on large image and text databases respectively, in order to capture knowledge about the attributes of objects. We rely on the assumption that fine-grain semantic knowledge of attributes (e.g., shape, function, sound, etc.) is encoded in each modality (Collell and Moens, 2016). The results shed light on the potential benefit of combining vision and language data when creating better meaning representations of content. A first baseline model just concatenates the visual and textual vectors, while a second model keeps a compact vector representation, but selects relevant vector components to make up the representation based on a genetic algorithm, which allows capturing a mixture of the most relevant visual and linguistic features that encode object attributes. We additionally compare our model with vision-only and text-only baselines. Our contribution in this paper is as follows: to the best of our knowledge, we are the first to disentangle and recombine embeddings based on a genetic algorithm. We show that the genetic algorithm most successfully combines the complementary information of the visual and textual embeddings when evaluated in an attribute recognition task.
Moreover, with this genetic algorithm we learn compact and targeted embeddings, where we assume that compact meaning representations are preferred over longer vectors in many realistic applications that make use of large sets of representations. Ultimately, this work provides insight on building better representations of concepts, which is essential towards improving automatic language understanding.

The rest of the paper is organized as follows. In the next section we review and discuss related work. In Section 3 we describe the proposed genetic algorithm that combines visual and textual embeddings in encoding and classifying attributes, as well as a baseline method that concatenates the visual and textual embeddings and baseline vision-only and text-only models. Next, we present and discuss our experimental results. Finally, in the conclusions and future work, we summarize our findings and suggest future lines of research.

2 Related Work

Representations of concepts are often task specific, and they have mostly been used in word similarity tasks. In this context, integration of the visual and linguistic representations was realized by Collell et al. (2017); Lazaridou et al. (2015); Kiela and Bottou (2014); Silberer and Lapata (2014). Kiela and Bottou (2014) proposed the concatenation of visual and text representations, while Lazaridou et al. (2015) extend the skip-gram model to the multimodal domain, but none of these works regard attribute recognition. Silberer and Lapata (2014) obtain multimodal representations by implementing a stacked autoencoder with the visual and word vectors as input in an attribute recognition task; these vectors were separately trained with a classifier. In this work, we start from general pre-trained embeddings. Rubinstein et al. (2015) research attribute recognition by relying only on linguistic embeddings. Bruni et al. (2012) showed that the color attribute is better captured by visual representations than by linguistic representations. Farnadi et al. (2018) train a deep neural network for multimodal fusion of users' attributes as found in social media. They use a power-set combination of representation components in an attempt to better model shared and non-shared representations among the data sources, which are composed of images, texts of, and relationships between social media users. The closest work to ours is that of Collell and Moens (2016), who compare the performance of visual and linguistic embeddings, each pre-trained on respectively a large image and a large text dataset, for a large number of visual attributes, as well as for other non-visual attributes such as taxonomic, function or encyclopedic ones. In contrast to their work, we propose a model that integrates visual and linguistic embeddings, leveraging their finding that visual and linguistic embeddings encode complementary knowledge.

Genetic algorithms have been used for feature selection in text classification and clustering tasks (e.g., Abualigah et al. (2016); Gomez et al. (2017); Onan et al. (2017)), where the goal is to reduce the number of features. In this paper we continue this line of thinking for learning better multimodal embeddings.

3 Methodology

Given visual and textual embeddings of the same concept word but with different dimensionality, our goal is to combine the two embeddings so that the new embedding captures both visual and textual semantic knowledge in a more compact form than the concatenated representation. This section describes why and how we achieve this goal within the genetic algorithm (GA) framework.

3.1 Why Genetic Algorithms

When combining two embeddings, the idea that instinctively comes to mind is to check the meaning of each dimension in order to "pick" the dimensions that are really useful for a certain task. However, what exactly each dimension of a learned embedding contributes to the whole representation has been a long-standing and highly debated issue in NLP; answering it requires devoted observation, and up to now there has been no final judgment on the topic. Returning to our original goal, our task is not to investigate the exact meaning of each dimension but to choose the dimensions that really help. This motivates us to use a genetic algorithm, which can generate numerous candidate solutions and select among them based on the principle of natural selection. Specifically, genetic operators such as crossover can be used to vary the composition of the embeddings from one generation to the next.
3.2 Genetic Algorithm Basics

Belonging to the larger class of evolutionary algorithms, genetic algorithms (GAs) are meta-heuristics inspired by the process of natural selection. In a given environment, a population of individuals competes for survival and, more importantly, reproduction. The ability of each individual to achieve certain goals determines its chance of producing the next generation. In a GA setting, an individual is a solution to the problem, and the quality of the solution determines its fitness. The fittest individuals tend to survive and have children. By searching the solution space through simulated evolution, i.e., following the survival-of-the-fittest strategy, a GA achieves continuous improvement over the successive generations.

GAs have been shown to generate high-quality solutions to linear and non-linear problems through biologically inspired operators such as mutation, crossover, and selection. A more complete discussion can be found in the book of Davis (1991). Algorithm 1 summarizes the procedure of a basic genetic algorithm.

Algorithm 1 Framework of a Genetic Algorithm.
1: initialize population;
2: evaluate population;
3: while (!StopCondition) do
4:   select the fittest individuals;
5:   breed new individuals;
6:   evaluate the fitness of new individuals;
7:   replace the least fitted population;
8: end while
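To make the framework concrete, the following minimal Python sketch mirrors the steps of Algorithm 1; init_population, fitness and breed are placeholder functions standing in for the problem-specific choices discussed in the rest of this section, not code from the paper.

import random

def run_ga(init_population, fitness, breed, max_generations=1000):
    # 1-2: initialize the population and evaluate it via the fitness function
    population = init_population()
    for _ in range(max_generations):               # 3: stop condition (generation budget)
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:len(ranked) // 2 or 1]   # 4: select the fittest individuals
        children = [breed(random.choice(parents), random.choice(parents))
                    for _ in range(len(population) - len(parents))]  # 5: breed new individuals
        # 6-7: new individuals are evaluated (via fitness) and replace the least fitted part
        population = parents + children
    return max(population, key=fitness)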
There are six fundamental issues to be determined in order to use a genetic algorithm: the chromosome representation, initialization, the selection function, the genetic operators of reproduction, the evaluation function, and the termination criteria. The rest of this section describes these issues in detail for creating a compact representation that captures fine-grain semantic visual and textual knowledge.

3.3 Chromosome Representation, Initialization, and Selection

The chromosome representation determines the problem structure and the genetic operators in a GA. The floating-point representation of chromosomes has been shown to be natural for evolution strategies and evolutionary programming (Periaux et al., 2015). One may point out that the pre-trained visual and textual embeddings can naturally be used as the original chromosomes, since they consist of floating-point numbers. But recall that our goal is to form a compact embedding, and by "compact" we mean that the dimension of the final embedding should be smaller than the concatenation of the visual and textual embeddings. For this reason, we first concatenate the visual and textual embeddings, then shuffle the dimensions of the concatenation, and divide the concatenation into two embeddings of the same dimension. Those two embeddings are used as the original chromosomes. Specifically, each real number in an embedding vector, representing a feature of the target concept, can be seen as a gene. In this way, each chromosome (embedding) is made up of a sequence of shuffled real numbers (floating-point values) that come either from the original visual or from the original textual embedding. Thus the two embeddings can be seen as mixtures of visual and textual knowledge in different proportions. For clarity, we henceforth use the term embedding instead of chromosome.

In a standard GA, the initial population is often generated randomly and the selection function is usually based on the fitness of an individual. However, in our case, as explained previously, the initial population is formed by the original embeddings. Consequently, we change the target of the selection function. Instead of trying to select the fittest individuals to reproduce, the selection function first makes sure that every pair of visual and textual embeddings having the same target concept reproduces a group of candidates for the next generation, by repeating the reproduction method several times. The reproduction method involves randomly initialized parameters and will produce different children each time. Once a group of candidate children has been generated, they compete against each other to survive, but only the fittest one wins the opportunity of becoming the next generation. In this way, the fitness of the children generation is assured to be better than that of their parent generation, and the fitness is guaranteed to improve over generations.
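A minimal sketch of this initialization step, assuming the visual and textual embeddings of one concept are available as NumPy vectors (the function name and arguments are ours, chosen for illustration):

import numpy as np

def init_chromosomes(visual_vec, textual_vec, seed=0):
    # Concatenate (e.g. 4096 visual + 300 textual dimensions), shuffle the
    # dimensions, and split into two equal halves used as the initial chromosomes.
    rng = np.random.default_rng(seed)
    concat = np.concatenate([visual_vec, textual_vec])
    shuffled = concat[rng.permutation(concat.shape[0])]
    half = shuffled.shape[0] // 2   # e.g. 4396 // 2 = 2198, the GeMix dimensionality in Table 2
    return shuffled[:half], shuffled[half:]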
3.4 The Genetic Operators of Reproduction

Genetic operators determine the basic search mechanism and create new solutions based on existing ones in the population. Normally there are two types of operators: crossover and mutation. Crossover takes two individuals and produces two new individuals, while mutation alters one individual and produces one new one. Since the embeddings used in our problem represent mappings from spaces with one dimension per concept word to continuous vector spaces of lower dimension, the value in each dimension of an embedding characterizes the target concept in the vector space and should not be recklessly changed. For this reason, we only use the crossover operator to reproduce the next generation.

Recall that we now have two embeddings, each of which can be seen as a mixture of visual and textual knowledge in different proportions. Our goal is to find all dimensions that help to achieve a certain goal. To test whether a certain dimension is relevant, the crossover operator is defined as follows. Let X = (x_1, ..., x_n) and Y = (y_1, ..., y_n) be two n-dimensional embeddings. The crossover operator generates two random integers k, t from a uniform distribution over 1 to n, and creates two new embeddings X' = (x'_1, ..., x'_n) and Y' = (y'_1, ..., y'_n) according to:

x'_i = x_i  if i ≠ k,  and  x'_i = y_i  otherwise    (1)

y'_i = y_i  if i ≠ t,  and  y'_i = x_i  otherwise    (2)

As mentioned in the discussion of the selection function, in one round of reproduction the same crossover operator is applied to all embeddings, producing one candidate for the next generation. By repeating this a certain number of times, a group of different candidates is produced. We call such a repetition a "reproduction trial".
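A sketch of this crossover operator, assuming the two chromosomes are NumPy arrays of the same length n (variable names follow Eqs. (1) and (2); this is our illustration, not the authors' code):

import numpy as np

def crossover(x, y, rng=None):
    # Eq. (1): x' copies x except at the random position k, where it takes y's value.
    # Eq. (2): y' copies y except at the random position t, where it takes x's value.
    if rng is None:
        rng = np.random.default_rng()
    n = x.shape[0]
    k, t = rng.integers(0, n), rng.integers(0, n)
    x_new, y_new = x.copy(), y.copy()
    x_new[k] = y[k]
    y_new[t] = x[t]
    return x_new, y_new

Repeating this operator a fixed number of times within one reproduction trial yields the group of candidate children described above.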
3.5 Evaluation and Termination

Diverse evaluation functions can be used, depending on the specific task. For instance, for classification tasks the evaluation function can be any classification metric, such as precision or the Jaccard similarity score, as long as it maps the population into a partially ordered set. In regression, correlation is typically used as the evaluation function. In our experiments, the F1 measure is used as the evaluation function. To avoid bias, we use two types of F1 measure for the evaluation of fitness: one with respect to the positive labels and the other with respect to the negative labels. Section 4 details how we use the F1 measure as the evaluation function.

The GA moves through generations, selecting and reproducing, until a specific termination criterion is met. From the point of view of reproduction, the stopping criterion can be set as a maximum number of generations; for example, the algorithm stops once it has reproduced 1000 generations. This first termination criterion is the most frequently used. A second termination strategy is a population convergence criterion that evaluates the sum of deviations among individuals. Third, the algorithm can also be terminated when there is a lack of improvement over a certain number of generations or, alternatively, when the value of the evaluation measure meets a target acceptability threshold. For instance, one can terminate the algorithm if there is no improvement over a series of 10 reproduction trials, or if the fitness of the current generation is larger than the target threshold. Usually, several strategies are used in conjunction with each other. In the experiments described below, we use a conjunction of the maximum number of generations (the first criterion) and the maximum number of generations without improvement (the third criterion). Note that the maximum number of generations and the number of generations allowed without improvement are two different concepts. For example, if the former is set to 1000 and the latter to 10, the algorithm terminates when either 1) it has reproduced 1000 generations, or 2) there has been no improvement in fitness over 10 consecutive reproduction trials.
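This conjunction of stopping rules can be written as a simple check; the sketch below uses the example values from the preceding paragraph, and the counter names are ours:

def should_stop(generation, trials_without_improvement,
                max_generations=1000, patience=10):
    # Terminate when either the generation budget is exhausted or there has been
    # no improvement in fitness for `patience` consecutive reproduction trials.
    return generation >= max_generations or trials_without_improvement >= patience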
4 Experiments and Results

4.1 Experimental Setup

4.1.1 Pre-trained Visual Embeddings

Following Collell and Moens (2016), we use ImageNet (Russakovsky et al., 2015) as our source of visual data. ImageNet is the largest labeled image dataset, covering 21,841 WordNet synsets or meanings (Fellbaum, 1998) and over 14M images. We only preserve synsets with more than 50 images, and we set an upper bound of 500 images per synset to limit computation time. After this, 11,928 synsets are kept. We extract a 4096-dimensional feature vector for each image as the output of the last layer of a pre-trained AlexNet CNN (Krizhevsky et al., 2012b). For each concept, we combine the representations of its individual images into a single vector by averaging the CNN feature vectors of the individual images component-wise, which is equivalent to taking the cluster center of the individual representations.

4.1.2 Pre-trained Word Embeddings

Following Collell and Moens (2016), we employ 300-dimensional GloVe vectors (Pennington et al., 2014) trained on the largest available corpus (840B tokens and a 2.2M word vocabulary from the Common Crawl corpus), obtained from the GloVe website (http://nlp.stanford.edu/projects/glove).

4.1.3 Dataset

The dataset collected by McRae et al. (2005) consists of data gathered from 30 human participants who were asked to list properties—attributes—of concrete nouns. The data contains 541 concepts, 2,526 different attributes, and 10 attribute types.

Attribute type      # Attr.   Avg. # concepts   SD
encyclopedic        4         32.7              1.5
function            3         46                27.9
sound               1         34                -
tactile             1         26                -
taste               1         33                -
taxonomic           7         42                24.8
color               7         42.4              12.0
form and surface    14        63.7              29.9
motion              4         37.5              5.7

Table 1: Attribute types, number of attributes in each type (# Attr.), and average number of concepts in each type (Avg. # concepts) with their respective standard deviations (SD).

4.2 Attribute Recognition

To evaluate the composed embeddings, we assess how well the attributes from McRae et al. (2005) can be recognized by using the embeddings as input. For each attribute a, we build a dataset with the concepts to which this attribute applies as the positive class instances, while the rest of the concepts form the negative class. For example, a "beetle" is a negative instance and an "airplane" a positive instance for the attribute a = is large, and an "ant" is a negative instance and a "bear" a positive instance for the attribute a = has 4 legs. We consider that an attribute applies to a noun concept if a minimum of 5 people have listed it (this threshold was set by McRae et al. (2005)). We treat attribute recognition as a binary classification problem: for each attribute a we learn a predictor

f_a : X → Y

where X ⊂ R^d is the input space of (d-dimensional) concept representations and Y = {0, 1} is the binary output space. We report results with a linear SVM classifier, implemented with the scikit-learn machine learning toolkit of Pedregosa et al. (2011).

To guarantee sufficient positive instances, only attributes with at least 25 positive instances in the above dataset are kept. This leads to a total of 42 attributes, covering 9 attribute types, and their corresponding instance sets. The concept selection in ImageNet described in Sect. 4.1.1 results in a visual coverage of 400 concepts (out of the 541 from the McRae et al. (2005) data), and, for a fair vision-language comparison, only the word embeddings (from GloVe) of these nouns are employed. Hence, our training data {(x_i, y_i) : i = 1, ..., 400} consists of 400 instances. Table 1 shows the details of each attribute type.
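A minimal sketch of this per-attribute evaluation with scikit-learn, assuming X is the matrix of concept embeddings and y the binary label vector of one attribute (the 5 runs of 5-fold cross validation anticipate the protocol described in Section 4.3; the function name is ours):

import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate_attribute(X, y, runs=5, folds=5):
    # Mean F1 of a linear SVM predicting one attribute, over repeated k-fold CV.
    scores = []
    for seed in range(runs):
        cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
        scores.append(cross_val_score(LinearSVC(), X, y, cv=cv, scoring="f1").mean())
    return float(np.mean(scores))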
4.3 Parameter Setting

Notice that each reproduction operation gives birth to two forms of embedding. To avoid potential bias, we evaluate one embedding with the F1 measure on the positive labels and the other on the negative labels. The average of these two F1 measures could be an option for evaluating fitness. However, in practice, the negative labels are more numerous than the positive ones, so an increase of the F1 measure on the negative labels together with a decrease on the positive ones can still result in an increase of the average F1 measure. Thus, we use the F1 measure on the positive labels as the first measure of fitness and the F1 measure on the negative labels as the second. Only the child with the largest increase in the first F1 measure and an increase, or at least no decrease, in the second F1 measure is chosen as the next generation.
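One possible reading of this selection rule is sketched below: each candidate child is scored by the pair (F1 on the positive labels, F1 on the negative labels), and a child is kept only if it gives the largest gain on the first measure without decreasing the second (helper names are ours, not from the paper's code):

from sklearn.metrics import f1_score

def fitness_pair(y_true, y_pred):
    # (F1 with respect to the positive labels, F1 with respect to the negative labels)
    return (f1_score(y_true, y_pred, pos_label=1),
            f1_score(y_true, y_pred, pos_label=0))

def select_child(parent_fitness, scored_children):
    # scored_children: list of (child, (f1_pos, f1_neg)) pairs from one reproduction trial.
    best, best_gain = None, 0.0
    for child, (f1_pos, f1_neg) in scored_children:
        gain = f1_pos - parent_fitness[0]
        if gain > best_gain and f1_neg >= parent_fitness[1]:
            best, best_gain = child, gain
    return best  # None signals no acceptable improvement in this trial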
The maximum number of generations is set to 10^6. The maximum number of reproduction trials allowed without improvement among the candidate children is 10. The number of times the crossover operation is repeated in one reproduction trial is 10^3. In the final evaluation of each embedding on the attribute recognition task, we perform 5 runs of 5-fold cross validation.

We evaluate four different embeddings as input to the attribute recognition task: 1) embeddings of the concept obtained with the GA described above (GeMix); 2) embeddings obtained by concatenating the visual and textual embedding vectors (CON); 3) monomodal visual embeddings (CNN); and 4) monomodal text embeddings (GloVe). Table 2 shows the number of dimensions of each embedding.

         # dimensions
CNN      4096
GloVe    300
CON      4396
GeMix    2198

Table 2: Dimensionality of each embedding type.

4.4 Results and Discussion

4.4.1 Performance per Attribute Type

We first evaluate how the proposed method performs on each attribute type. There are 9 attribute types, and for each type we evaluate the four embeddings with the average F1 measure.

From Table 4 one can see that the GeMix embeddings outperform the other three embedding methods in 7 attribute types, i.e., encyclopedic, function, tactile, taste, taxonomic, color, and form and surface. Especially in encyclopedic, GeMix increases the average F1 measure by more than 0.02, and in function and taxonomic it increases by nearly 0.02. We perform a Wilcoxon signed-rank test between each pair of methods on the different feature sets and find that the differences are significant at p ≤ 0.05.

Another interesting finding is that the performance of the concatenated embedding (CON) is not always better than the performance of the monomodal embeddings, CNN or GloVe. For instance, in tactile, color and motion, the F1 measure of CNN or GloVe is higher than that of the concatenated embedding. This indicates that there are certain attributes for which combined visual and textual knowledge does not necessarily perform better than unimodal visual or textual knowledge. This is discussed further in Section 4.4.2.

4.4.2 Performance per Attribute and Overall

Table 5 provides a more detailed answer to our question, showing that GeMix outperforms the other three embeddings in 20 attributes, while CNN performs best in 6 attributes, GloVe in 7 attributes and the concatenated embedding (CON) in 9 attributes. Specifically, GeMix outperforms the second best method by more than 0.04 in attributes 04 (lays eggs), 09 (is soft), 12 (a vegetable), and 14 (a mammal), and by more than 0.10 in 30 (has a beak) and 37 (made of wood).

         F1_attr   F1_samp
CNN      0.535     0.469
GloVe    0.552     0.474
CON      0.572     0.495
GeMix    0.586     0.507

Table 3: Overall F1 measure per attribute and per sample.

According to Collell and Moens (2016), visual embeddings perform better than textual ones when recognizing three main attribute types: motion, form and surface, and color, while textual embeddings (GloVe) outperform the visual CNN embeddings in recognizing encyclopedic and function attributes. A closer look at Table 5 further reveals that for attribute types where vision or language embeddings show better performance than the other modality, it is highly likely that adding, respectively, language or vision information lowers the performance, e.g., attributes 05 (hunted by people) and 07 (used for transportation) in function, and 20 (is fast) and 21 (eats) in motion. Because GeMix tends to set aside the "noisy" dimensions of the embeddings, it performs better than the concatenated embedding.

Let us take a look at the overall average F1 measure. We evaluate the F1 measure with respect to two different aspects. First, the overall average F1 measure per attribute, i.e., F1_attr = (1/|L|) Σ_{a∈L} F1_a, where |L| is the number of different attributes (42 in our case) and F1_a is the F1 measure of a specific attribute a. Second, the overall average F1 measure per sample, i.e., F1_samp = (1/|S|) Σ_{s∈S} F1_s, where |S| is the number of samples (400 in our case) and F1_s is the F1 measure of sample s. Table 3 shows that in both cases GeMix achieves the highest F1 measure.
Encyc Funct Sound Tactile Taste Taxon Color Form&Surf Motion
CNN 0.429 0.738 0.513 0.470 0.421 0.486 0.676 0.567 0.628
GloVe 0.422 0.743 0.747 0.517 0.341 0.495 0.630 0.563 0.595
CON 0.457 0.760 0.762 0.477 0.433 0.512 0.663 0.582 0.623
GeMix 0.471 0.786 0.758 0.520 0.438 0.528 0.671 0.588 0.620
Table 4: Performance per attribute type: average F1 measures per attribute type (i.e., averaged over the individual attributes of each type) for CNN, GloVe, CON and GeMix.
01 02 03 04 05 06 07 08 09 10 11
CNN 0.484 0.580 0.355 0.591 0.410 0.682 0.794 0.663 0.261 0.418 0.591
GloVe 0.463 0.611 0.521 0.591 0.486 0.602 0.971 0.701 0.551 0.311 0.668
CON 0.462 0.673 0.576 0.626 0.456 0.723 0.944 0.765 0.547 0.437 0.683
GeMix 0.502 0.661 0.570 0.672 0.466 0.734 0.944 0.717 0.605 0.439 0.690
12 13 14 15 16 17 18 19 20 21 22
CNN 0.460 0.284 0.632 0.405 0.431 0.422 0.439 0.581 0.915 0.527 0.812
GloVe 0.292 0.233 0.628 0.491 0.484 0.530 0.320 0.617 0.822 0.510 0.622
CON 0.522 0.321 0.641 0.443 0.471 0.522 0.466 0.603 0.846 0.510 0.773
GeMix 0.565 0.228 0.702 0.475 0.437 0.476 0.475 0.651 0.863 0.524 0.822
23 24 25 26 27 28 29 30 31 32 33
CNN 0.513 0.643 0.884 0.699 0.647 0.544 0.727 0.325 0.489 0.649 0.738
GloVe 0.347 0.595 0.743 0.668 0.379 0.448 0.437 0.313 0.495 0.767 0.672
CON 0.476 0.668 0.852 0.728 0.640 0.558 0.433 0.298 0.503 0.722 0.651
GeMix 0.546 0.660 0.829 0.704 0.639 0.548 0.414 0.476 0.512 0.744 0.734
34 35 36 37 38 39 40 41 42
CNN 0.580 0.421 0.372 0.532 0.906 0.506 0.748 0.421 0.418
GloVe 0.444 0.440 0.368 0.345 0.970 0.522 0.543 0.415 0.291
CON 0.622 0.548 0.377 0.547 0.888 0.570 0.784 0.483 0.539
GeMix 0.622 0.562 0.387 0.670 0.900 0.573 0.791 0.468 0.500
Table 5: Performance on the attribute classification task per attribute, in terms of F1 measure for each embedding method. Attributes 01-04 belong to encyclopedic, 05-07 to function, 08 to sound, 09 to tactile, 10 to taste, 11-17 to taxonomic, 18-21 to motion, 22-28 to color, and 29-42 to form and surface.
5 Conclusion and Future Work

In this paper, we propose a genetic-based algorithm which learns a compact representation that combines visual and textual embeddings. Two embeddings, obtained by randomly and evenly dividing the shuffled concatenation of the visual and textual embeddings, are used as the initial chromosomes in the genetic algorithm. A variant of the one-point crossover method is used to move the most relevant components of the representation to one embedding and the non-relevant ones to the other. To avoid bias, we use two measures for the evaluation of fitness: one with respect to the positive labels and the other with respect to the negative labels. The learned embeddings can be seen as a combination of both visual and textual knowledge. In an attribute recognition task, the genetic-based representation outperformed a baseline composed of the concatenation of the visual and textual embeddings, as well as the monomodal visual and textual embeddings.

Another interesting finding in this paper is that for a small group of attributes in which either vision or language generally dominates, adding the other modality may lower the final performance. For example, for the attribute eats in the motion type, for which vision tends to perform better than language (Collell and Moens, 2016), the performance of the mixture of visual and textual representations is lower than that of the monomodal visual representation. Ultimately, our findings provide insights that can help build better multimodal representations by taking into account to what degree visual and textual knowledge should be mixed for different tasks.
References

L. M. Abualigah, A. T. Khader, and M. A. Al-Betar. 2016. Unsupervised feature selection technique based on genetic algorithm for improving the text clustering. In 2016 7th International Conference on Computer Science and Information Technology (CSIT), pages 1–6.

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In ACL (2), pages 809–815.

Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In ACL, pages 136–145.

Guillem Collell and Marie-Francine Moens. 2016. Is an image worth more than a thousand words? On the fine-grain semantic differences between visual and linguistic representations. In COLING, pages 2807–2817.

Guillem Collell, Ted Zhang, and Marie-Francine Moens. 2017. Imagined visual representations as multimodal embeddings. In AAAI, pages 4378–4384.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug):2493–2537.

Lawrence Davis. 1991. Handbook of Genetic Algorithms.

Golnoosh Farnadi, Jie Tang, Martine De Cock, and Marie-Francine Moens. 2018. User profiling through deep multimodal fusion. In WSDM, pages 171–179.

Christiane Fellbaum. 1998. WordNet. Wiley Online Library.

Juan Carlos Gomez, Stijn Hoskens, and Marie-Francine Moens. 2017. Evolutionary learning of meta-rules for text classification. In GECCO.

Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732.

Douwe Kiela and Léon Bottou. 2014. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In EMNLP, pages 36–45.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012a. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012b. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. 2015. Combining language and vision with a multimodal skip-gram model. arXiv preprint arXiv:1501.02598.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521(7553):436–444.

Ken McRae, George S. Cree, Mark S. Seidenberg, and Chris McNorgan. 2005. Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods 37(4):547–559.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR abs/1301.3781.

Aytug Onan, Serdar Korukoglu, and Hasan Bulut. 2017. A hybrid ensemble pruning approach based on consensus clustering and multi-objective evolutionary algorithm for sentiment classification. Information Processing & Management 53(4):814–833.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Jacques Periaux, Felipe Gonzalez, and Dong Seop Chris Lee. 2015. Evolutionary Optimization and Game Strategies for Advanced Multi-Disciplinary Design. Springer Netherlands.

Dana Rubinstein, Effi Levi, Roy Schwartz, and Ari Rappoport. 2015. How well do distributional models capture different types of semantic knowledge? In ACL, volume 2, pages 726–730.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. IJCV 115(3):211–252.

Carina Silberer and Mirella Lapata. 2014. Learning grounded meaning representations with autoencoders. In ACL, pages 721–732.