A Genetic Algorithm for Combining Visual and Textual Embeddings Evaluated on Attribute Recognition

Ruiqi Li, Guillem Collell, Marie-Francine Moens
Computer Science Department, KU Leuven, 3001 Heverlee, Belgium
ruiqi.li1993@outlook.com, gcollell@kuleuven.be, sien.moens@cs.kuleuven.be

In: Mark Cieliebak, Don Tuggener and Fernando Benites (eds.): Proceedings of the 3rd Swiss Text Analytics Conference (SwissText 2018), Winterthur, Switzerland, June 2018.

Abstract

We propose a genetic-based algorithm for combining visual and textual embeddings into a compact representation that captures fine-grain semantic knowledge (attributes) of concepts. The genetic algorithm is able to select the most relevant representation components from the individual visual and textual embeddings when learning the representations, thus combining complementary visual and linguistic knowledge. We evaluate the proposed model in an attribute recognition task and compare the results with a model that concatenates the two embeddings and with models that only use monomodal embeddings.

1 Introduction

Distributed representations of words (Collobert et al., 2011; Mikolov et al., 2013; Pennington et al., 2014; LeCun et al., 2015) in a vector space that capture the textual contexts in which words occur have become ubiquitous and have been used effectively for many downstream natural language processing tasks such as sentiment analysis and sentence classification (Kim, 2014; Bansal et al., 2014). In computer vision, convolutional neural network (CNN) based image representations have become mainstream in object and scene recognition tasks (Krizhevsky et al., 2012a; Karpathy et al., 2014). Vision and language capture complementary information that humans automatically integrate in order to build mental representations of concepts (Collell and Moens, 2016). Certain concepts or properties of objects cannot be explicitly represented visually while, at the same time, not all properties are easily expressible with language. Here, we assume that many properties of objects are learned by humans both by visual perception and through the use of words in a verbal context. For example, a cat has fur, which is visually observed, but this property can also be learned from language, when speaking of the fur of this animal or of the hairs that shake when it moves. When building meaning representations of an object's attributes, combining visual representations or embeddings with textual representations therefore seems beneficial.

In this paper we investigate how to integrate visual and textual embeddings that have been trained on large image and text databases, respectively, in order to capture knowledge about the attributes of objects. We rely on the assumption that fine-grain semantic knowledge of attributes (e.g., shape, function, sound, etc.) is encoded in each modality (Collell and Moens, 2016). The results shed light on the potential benefit of combining vision and language data when creating better meaning representations of content. A first baseline model simply concatenates the visual and textual vectors, while a second model keeps a compact vector representation but selects the relevant vector components that make up the representation based on a genetic algorithm, which allows capturing a mixture of the most relevant visual and linguistic features that encode object attributes. We additionally compare our model with vision-only and text-only baselines. Our contributions in this paper are as follows. To the best of our knowledge, we are the first to disentangle and recombine embeddings based on a genetic algorithm.
We show that the genetic algorithm most successfully combines the complementary information of the visual and textual embeddings when evaluated in an attribute recognition task. Moreover, with this genetic algorithm we learn compact and targeted embeddings, where we assume that compact meaning representations are preferred over longer vectors in many realistic applications that make use of large sets of representations. Ultimately, this work provides insight into building better representations of concepts, which is essential towards improving automatic language understanding.

The rest of the paper is organized as follows. In the next section we review and discuss related work. In Section 3 we describe the proposed genetic algorithm that combines visual and textual embeddings in encoding and classifying attributes, as well as a baseline method that concatenates the visual and textual embeddings and baseline vision-only and text-only models. Next, we present and discuss our experimental results. Finally, in the conclusions and future work, we summarize our findings and suggest future lines of research.

2 Related Work

Representations of concepts are often task specific, and they have mostly been used in word similarity tasks. In this context, integration of visual and linguistic representations was realized by Collell et al. (2017); Lazaridou et al. (2015); Kiela and Bottou (2014); Silberer and Lapata (2014). Kiela and Bottou (2014) proposed the concatenation of visual and text representations, while Lazaridou et al. (2015) extend the skip-gram model to the multimodal domain, but none of these works regard attribute recognition. Silberer and Lapata (2014) obtain multimodal representations by implementing a stacked autoencoder with the visual and word vectors as input in an attribute recognition task; these vectors were separately trained with a classifier. In this work, we start from general pre-trained embeddings. Rubinstein et al. (2015) research attribute recognition by relying only on linguistic embeddings. Bruni et al. (2012) showed that the color attribute is better captured by visual representations than by linguistic representations. Farnadi et al. (2018) train a deep neural network for multimodal fusion of users' attributes as found in social media. They use a power-set combination of representation components in an attempt to better model shared and non-shared representations among the data sources, which are composed of images, texts of, and relationships between social media users. The closest work to ours is that of Collell and Moens (2016), who compare the performance of visual and linguistic embeddings, each pre-trained on a large image and a large text dataset respectively, for a large number of visual attributes, as well as for other non-visual attributes such as taxonomic, function or encyclopedic ones. In contrast to their work, we propose a model that integrates visual and linguistic embeddings, leveraging their finding that visual and linguistic embeddings encode complementary knowledge.

Genetic algorithms have been used for feature selection in text classification and clustering tasks (e.g., Abualigah et al. (2016); Gomez et al. (2017); Onan et al. (2017)), where the goal is to reduce the number of features. In this paper we continue this line of thinking for learning better multimodal embeddings.

3 Methodology

Given visual and textual embeddings of the same concept word but with different dimensionality, our goal is to combine the two embeddings so that the new embedding captures both visual and textual semantic knowledge, but in a more compact form than the concatenated representation. This section describes why and how we achieve this goal within the genetic algorithm (GA) framework.

3.1 Why Genetic Algorithms

When combining two embeddings, the idea that naturally comes to mind is to inspect the meaning of each dimension in order to "pick" the dimensions that are really useful for a certain task. However, what exactly each dimension of a learned embedding contributes to the whole representation has long been a debated issue in NLP: answering it requires dedicated analysis, and so far no final judgment has been reached. Returning to our goal, our task is not to investigate the exact meaning of each dimension but to choose the dimensions that really help. This motivates our use of a genetic algorithm, which can generate numerous candidate solutions and select among them based on the principle of natural selection. Specifically, genetic operators such as crossover can be used to vary the composition of the embeddings from one generation to the next.

3.2 Genetic Algorithm Basics

Belonging to the larger class of evolutionary algorithms, genetic algorithms (GA) are meta-heuristics inspired by the process of natural selection. In a given environment, a population of individuals competes for survival and, more importantly, reproduction. The ability of each individual to achieve certain goals determines his or her chance of producing the next generation. In a GA setting, an individual is a solution to the problem at hand, and the quality of the solution determines its fitness. The fittest individuals tend to survive and have children. By searching the solution space through simulated evolution, i.e., following a survival-of-the-fittest strategy, a GA achieves continuous improvement over successive generations.

GAs have been shown to generate high-quality solutions to linear and non-linear problems through biologically inspired operators such as mutation, crossover, and selection. A more complete discussion can be found in the book of Davis (1991). Algorithm 1 summarizes the procedure of a basic genetic algorithm.

Algorithm 1 Framework of a Genetic Algorithm.
1: initialize population;
2: evaluate population;
3: while (!StopCondition) do
4:     select the fittest individuals;
5:     breed new individuals;
6:     evaluate the fitness of the new individuals;
7:     replace the least fit individuals;
8: end while

There are six fundamental issues to be settled in order to use a genetic algorithm: the chromosome representation, initialization, the selection function, the genetic operators of reproduction, the evaluation function, and the termination criteria. The rest of this section describes these issues in detail for creating a compact representation that captures fine-grain semantic visual and textual knowledge.
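For concreteness, the loop of Algorithm 1 can be sketched in Python as follows. This is a minimal, generic illustration rather than the authors' implementation: the `fitness` and `crossover` callables are placeholders (crossover is assumed here to return a single child), and the population handling shown is the textbook variant; the following subsections describe how our setting departs from it.

```python
import random

def run_ga(init_population, fitness, crossover, max_generations=1000):
    """Minimal sketch of Algorithm 1: evolve a population until the stop condition."""
    population = list(init_population)                  # 1: initialize population
    scores = [fitness(ind) for ind in population]       # 2: evaluate population
    for _ in range(max_generations):                    # 3: while not StopCondition
        ranked = sorted(zip(scores, population),
                        key=lambda p: p[0], reverse=True)
        parents = [ind for _, ind in ranked[: max(2, len(ranked) // 2)]]   # 4: select the fittest
        children = [crossover(random.choice(parents),                      # 5: breed new individuals
                              random.choice(parents))
                    for _ in range(len(population))]
        child_scores = [fitness(c) for c in children]                      # 6: evaluate the new individuals
        pool = sorted(zip(scores + child_scores, population + children),
                      key=lambda p: p[0], reverse=True)[: len(population)]  # 7: replace the least fit
        scores = [s for s, _ in pool]
        population = [ind for _, ind in pool]
    return max(zip(scores, population), key=lambda p: p[0])[1]   # fittest individual found
```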
3.3 Chromosome Representation, Initialization, and Selection

The chromosome representation determines the problem structure and the genetic operators in a GA. The floating point representation of chromosomes has been shown to be natural for evolution strategies and evolutionary programming (Periaux et al., 2015). One may point out that the pre-trained visual and textual embeddings can naturally be used as the original chromosomes, since they consist of floating point numbers. But recall that our goal is to form a compact embedding, and by "compact" we mean that the dimensionality of the final embedding should be smaller than that of the concatenation of the visual and textual embeddings. For this reason, we first concatenate the visual and textual embeddings, then shuffle the dimensions of the concatenation, and divide the concatenation into two embeddings of the same dimensionality. These two embeddings are used as the original chromosomes. Specifically, each real number in an embedding vector, representing a feature of the target concept, can be seen as a gene. In this way, each chromosome (embedding) is made up of a sequence of shuffled real numbers (floating point values) that come either from the original visual or from the original textual embedding. The two embeddings can thus be seen as mixtures of visual and textual knowledge in different degrees. For clarity, we henceforth use the term embedding instead of chromosome.
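As an illustration of this construction, the two initial chromosomes for one concept could be built as follows. This is a sketch based on our reading of the text; the NumPy formulation and the variable names are ours, and we assume one fixed permutation shared by all concepts so that a given dimension refers to the same original feature everywhere.

```python
import numpy as np

def initial_chromosomes(visual_vec, textual_vec, perm):
    """Concatenate the visual and textual embeddings of a concept, shuffle the
    dimensions with a fixed permutation, and split the result into two equally
    sized chromosome embeddings (e.g. 4096 + 300 = 4396 -> 2 x 2198)."""
    concat = np.concatenate([visual_vec, textual_vec])
    shuffled = concat[perm]            # perm: permutation of range(len(concat))
    half = shuffled.shape[0] // 2
    return shuffled[:half], shuffled[half:]

# Example: one shared permutation for all concepts (an assumption on our part).
rng = np.random.default_rng(0)
perm = rng.permutation(4096 + 300)
chrom_a, chrom_b = initial_chromosomes(np.zeros(4096), np.zeros(300), perm)
```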
In a standard GA, the initial population is often generated randomly, and the selection function is usually based on the fitness of an individual. In our case, however, as explained previously, the initial population is formed by the original embeddings. Consequently, we change the target of the selection function. Instead of trying to select the fittest individuals to reproduce, the selection function first makes sure that every pair of visual and textual embeddings with the same target concept reproduces a group of candidates for the next generation, by repeating the reproduction method several times. The reproduction method involves randomly initialized parameters and produces different children each time. Once a group of child candidates has been generated, they compete against each other for survival, and only the fittest one wins the opportunity of becoming the next generation. In this way, the fitness of the children's generation is assured to be better than that of their parents' generation, and the fitness is guaranteed to improve over the generations.

3.4 The Genetic Operators of Reproduction

Genetic operators determine the basic search mechanism and create new solutions based on existing ones in the population. Normally there are two types of operators: crossover and mutation. Crossover takes two individuals and produces two new individuals, while mutation alters one individual and produces one new individual. Since the embeddings used in our problem represent mappings from spaces with one dimension per concept word to continuous vector spaces of lower dimension, the value in each dimension of an embedding characterizes the target concept in the vector space and should not be recklessly changed. For this reason, we only use the crossover operator to reproduce the next generation.

Recall that we now have two embeddings, each of which can be seen as a mixture that combines visual and textual knowledge in different degrees. Our goal is to find all dimensions that help to achieve a certain goal. To test whether a certain dimension is relevant, the crossover operator is defined as follows. Let $X = (x_1, \dots, x_n)$ and $Y = (y_1, \dots, y_n)$ be two $n$-dimensional embeddings. The crossover operator generates two random integers $k, t$ drawn uniformly from $1$ to $n$ and creates two new embeddings $X' = (x'_1, \dots, x'_n)$ and $Y' = (y'_1, \dots, y'_n)$ according to:

$$x'_i = \begin{cases} x_i & \text{if } i \neq k \\ y_i & \text{otherwise} \end{cases} \qquad (1)$$

$$y'_i = \begin{cases} y_i & \text{if } i \neq t \\ x_i & \text{otherwise} \end{cases} \qquad (2)$$

As mentioned for the selection function, in one reproduction the same crossover operator is applied to all embeddings, producing one candidate for the next generation. By repeating this a certain number of times, a group of different candidates is produced. We call such a repetition a "reproduction trial".
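A minimal sketch of the crossover of Eqs. (1) and (2); the NumPy formulation and the 0-based indexing are ours:

```python
import numpy as np

def crossover(x, y, rng):
    """Swap one randomly chosen dimension of Y into X' (position k) and,
    independently, one dimension of X into Y' (position t), as in Eqs. (1)-(2)."""
    n = x.shape[0]
    k, t = rng.integers(n), rng.integers(n)   # uniform positions in [0, n)
    x_new, y_new = x.copy(), y.copy()
    x_new[k] = y[k]                           # Eq. (1): only component k changes
    y_new[t] = x[t]                           # Eq. (2): only component t changes
    return x_new, y_new
```

Within one reproduction the same (k, t) pair would be applied to the embedding pair of every concept, and repeating this a number of times yields the group of candidates of one reproduction trial.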
gence criteria that evaluates the sum of deviations Recall that we now have two embeddings and each among individuals. Third, the algorithm can also be one can be seen as a mixture that combined visual and terminated when a lack of improvement over a cer- textual knowledge with different degrees. Our goal tain number of generations happens or, alternatively, is to find all dimensions that help to achieve a cer- when the value for the evaluation measure meets a tar- tain goal. To test whether a certain dimension is rel- get acceptability threshold. For instance, one can set evant, the crossover operator is defined as follows: as threshold if there is no improvement over a series Let X = (x1 , · · · , xn ) and Y = (y1 , · · · , yn ) be of 10 times of reproduction, or if the fitness of the two n-dimensional embeddings. The crossover opera- current generation is larger than the target threshold, tor generates two random integers k, t from a uniform then the algorithm terminates. Usually, several strate- distribution from 1 to n, and creates two new embed- gies can be used in conjunction with each other. In ding X 0 = (x01 , · · · , x0n ), Y 0 = (y10 , · · · , yn0 ) accord- the experiments described below, a conjunction of the ing to: maximum number of generations reproduced in the  xi if i 6= k first termination criterion and the maximum number x0i = (1) of generations that allows a lack of improvement in yi otherwise the third termination criterion is used. Please noted  here that a maximum number of generations repro- yi if i 6= t yi0 = (2) duced in the first termination criterion and a certain xi otherwise number of generations that allows a lack of improve- As mentioned in the selection function, in one time ment are two different concepts. For example, if the of reproduction the same crossover operator is applied former is set to 1000 while the latter 10, the algorithm to all embeddings, producing one candidate of next will terminate when either 1) the algorithms reproduce generation. By repeating it a certain number of times, 1000 generations; or 2) during the algorithm, there is a group of different candidates is produced. We call no improvement in fitness 10 consecutive times of re- such repetition a “reproduction trial”. production trials. 3.5 Evaluation and Termination 4 Experiments and Results Diverse evaluation functions can be used, depending 4.1 Experimental Setup on the specific tasks. For instance, for classification 4.1.1 Pre-trained Visual Embeddings tasks the evaluation function can be any classification metric such as precision or Jaccard similarity score, as Following Collell and Moens (2016), we use Ima- long as it can map the population into a partially or- geNet (Russakovsky et al., 2015) as our source of dered set. In regression, correlation is typically used visual data. ImageNet is the largest labeled image as evaluation function. In our experiment, the F1 mea- dataset, and covers 21,841 WordNet synsets or mean- sure is used as the evaluation function. Generally, we ings (Fellbaum, 1998) and over 14M images. We only use two types of F1 measure as the evaluation of fit- preserve synsets with more than 50 images, and we ness to avoid bias, one with respect to positive labels set an upper bound of 500 images per synset for com- 4 61 putation time. After this, 11,928 synsets are kept. We = is large. 
4.1.2 Pre-trained Word Embeddings

Following Collell and Moens (2016), we employ 300-dimensional GloVe vectors (Pennington et al., 2014) trained on the largest available corpus (840B tokens and a 2.2M-word vocabulary from the Common Crawl corpus), obtained from the GloVe website (http://nlp.stanford.edu/projects/glove).

4.1.3 Dataset

The dataset collected by McRae et al. (2005) consists of data gathered from 30 human participants who were asked to list properties (attributes) of concrete nouns. The data contains 541 concepts, 2,526 different attributes, and 10 attribute types.

Table 1: Attribute types, number of attributes in each type (# Attr.), and average number of concepts in each type (Avg. # concepts) with their respective standard deviations (SD).

Attribute type      # Attr.   Avg. # concepts   SD
encyclopedic        4         32.7              1.5
function            3         46                27.9
sound               1         34                -
tactile             1         26                -
taste               1         33                -
taxonomic           7         42                24.8
color               7         42.4              12.0
form and surface    14        63.7              29.9
motion              4         37.5              5.7

4.2 Attribute Recognition

To evaluate the composed embeddings, we assess how well the attributes from McRae et al. (2005) can be recognized by using the embeddings as input. For each attribute a, we build a dataset with the concepts to which this attribute applies as the positive class instances, while the rest of the concepts form the negative class. For example, a "beetle" is a negative instance and an "airplane" a positive instance for the attribute a = "is large", and an "ant" is a negative instance and a "bear" a positive instance for the attribute a = "has 4 legs". We consider that an attribute applies to a noun concept if a minimum of 5 people have listed it (this threshold was set by McRae et al. (2005)). We treat attribute recognition as a binary classification problem: for each attribute a we learn a predictor

f_a : X -> Y

where X ⊂ R^d is the input space of (d-dimensional) concept representations and Y = {0, 1} is the binary output space. We report results with a linear SVM classifier, implemented with the scikit-learn machine learning toolkit of Pedregosa et al. (2011).

To guarantee sufficient positive instances, only attributes with at least 25 positive instances in the above dataset are kept. This leads to a total of 42 attributes, covering 9 attribute types, and their corresponding instance sets. The concept selection in ImageNet described in Sect. 4.1.1 results in a visual coverage of 400 concepts (out of the 541 from the McRae et al. (2005) data), and, for a fair vision-language comparison, only the word embeddings (from GloVe) of these nouns are employed. Hence, our training data $\{(\vec{x}_i, y_i)\}_{i=1}^{400}$ consists of 400 instances. Table 1 shows the details of each attribute type.
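To make the classification setup concrete, the per-attribute predictor could be trained and scored as follows with scikit-learn. This sketch follows the protocol of Sections 4.2 and 4.3 (linear SVM, F1 score, 5 runs of 5-fold cross validation); the exact SVM variant (`LinearSVC`) and the fold handling are our assumptions, not the authors' released code.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

def evaluate_attribute(X, y, runs=5, folds=5):
    """Binary attribute recognition f_a : X -> {0, 1} with a linear SVM,
    scored by F1 on the positive class over several runs of k-fold CV.
    X: (400, d) concept embeddings (CNN, GloVe, CON or GeMix); y: 0/1 labels."""
    scores = []
    for seed in range(runs):
        cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
        scores.extend(cross_val_score(LinearSVC(), X, y, cv=cv, scoring="f1"))
    return float(np.mean(scores))
```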
4.3 Parameter Setting

Notice that each reproduction operation gives birth to two new embeddings. To avoid potential bias, we evaluate one embedding with the F1 measure on the positive labels and the other with the F1 measure on the negative labels. The average of these two F1 measures could be used to evaluate fitness. However, in practice the negative labels are more numerous than the positive ones: an increase of the F1 measure on the negative labels combined with a decrease on the positive ones can still result in an increase of the average F1 measure. Thus, we use the F1 measure on the positive labels as the first measure of fitness and the F1 measure on the negative labels as the second. Only the candidate with the largest increase in the first F1 measure and an increase, or at least no decrease, in the second F1 measure is chosen as the next generation.
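The survival rule just described might be sketched as follows. How the predictions for the two candidate embeddings are obtained (a classifier trained on each) is left implicit, the helper names are ours, and the exact tie handling is our assumption.

```python
from sklearn.metrics import f1_score

def pair_fitness(y_true, pred_first, pred_second):
    """Fitness of one candidate pair: F1 on the positive labels for the first
    embedding, F1 on the negative labels for the second (Section 4.3)."""
    return (f1_score(y_true, pred_first, pos_label=1),
            f1_score(y_true, pred_second, pos_label=0))

def select_survivor(parent_fitness, candidate_fitnesses):
    """Return the index of the candidate with the largest gain in the first F1
    measure whose second F1 measure does not decrease; -1 keeps the parent."""
    best_idx, best_gain = -1, 0.0
    for i, (f1_pos, f1_neg) in enumerate(candidate_fitnesses):
        gain = f1_pos - parent_fitness[0]
        if gain > best_gain and f1_neg >= parent_fitness[1]:
            best_idx, best_gain = i, gain
    return best_idx
```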
The maximum number of generations is set to 10^6. The maximum number of reproduction trials without improvement among the child candidates is 10. The crossover operation is repeated 10^3 times within one reproduction trial. In the final evaluation of each embedding on the attribute recognition task, we perform 5 runs of 5-fold cross validation.

We evaluate four different embeddings as input to the attribute recognition task: 1) embeddings of the concept obtained with the GA described above (GeMix); 2) embeddings obtained by concatenating the visual and textual embedding vectors (CON); 3) monomodal visual embeddings (CNN); and 4) monomodal text embeddings (GloVe). Table 2 shows the number of dimensions of each embedding.

Table 2: Dimensionality of each embedding type.

         # dimensions
CNN      4096
GloVe    300
CON      4396
GeMix    2198

4.4 Results and Discussion

4.4.1 Performance per Attribute Type

We first evaluate how the proposed method performs on each attribute type. There are 9 attribute types, and we evaluate the four embeddings for each type using the average F1 measure.

From Table 4 one can see that the GeMix embeddings outperform the other three embedding methods in 7 attribute types, i.e., encyclopedic, function, tactile, taste, taxonomic, color, and form and surface. In particular, in encyclopedic GeMix increases the average F1 measure by more than 0.02, and in function and taxonomic it increases by nearly 0.02. We perform a Wilcoxon signed-rank test between each pair of methods on the different feature sets and find that the difference is significant at p ≤ 0.05.

Another interesting finding is that the performance of the concatenated embedding (CON) is not always better than the performance of the monomodal embeddings, CNN or GloVe. For instance, in tactile, color and motion, the F1 measure of CNN or GloVe is higher than that of the concatenated embedding. This indicates that there are certain attributes for which combined visual and textual knowledge does not necessarily perform better than unimodal visual or textual knowledge. This is further discussed in Section 4.4.2.

Table 4: Performance per attribute type: averages of F1 measures per attribute type (i.e., averaged over the individual attributes) for CNN, GloVe, CON and GeMix.

         Encyc   Funct   Sound   Tactile   Taste   Taxon   Color   Form&Surf   Motion
CNN      0.429   0.738   0.513   0.470     0.421   0.486   0.676   0.567       0.628
GloVe    0.422   0.743   0.747   0.517     0.341   0.495   0.630   0.563       0.595
CON      0.457   0.760   0.762   0.477     0.433   0.512   0.663   0.582       0.623
GeMix    0.471   0.786   0.758   0.520     0.438   0.528   0.671   0.588       0.620

4.4.2 Performance per Attribute and Overall

Table 5 provides a more detailed answer to our question, showing that GeMix outperforms the other three embeddings for 20 attributes, while CNN performs best for 6 attributes, GloVe for 7 attributes, and the concatenated embedding (CON) for 9 attributes. Specifically, GeMix outperforms the second best method by more than 0.04 for attributes 04 (lays eggs), 09 (is soft), 12 (a vegetable), and 14 (a mammal), and by more than 0.10 for 30 (has a beak) and 37 (made of wood).

Table 5: Performance on the attribute classification task per attribute in terms of F1 measure for each embedding method. Attributes 01-04 belong to encyclopedic, 05-07 to function, 08 to sound, 09 to tactile, 10 to taste, 11-17 to taxonomic, 18-21 to motion, 22-28 to color and 29-42 to form and surface.

         01      02      03      04      05      06      07      08      09      10      11
CNN      0.484   0.580   0.355   0.591   0.410   0.682   0.794   0.663   0.261   0.418   0.591
GloVe    0.463   0.611   0.521   0.591   0.486   0.602   0.971   0.701   0.551   0.311   0.668
CON      0.462   0.673   0.576   0.626   0.456   0.723   0.944   0.765   0.547   0.437   0.683
GeMix    0.502   0.661   0.570   0.672   0.466   0.734   0.944   0.717   0.605   0.439   0.690

         12      13      14      15      16      17      18      19      20      21      22
CNN      0.460   0.284   0.632   0.405   0.431   0.422   0.439   0.581   0.915   0.527   0.812
GloVe    0.292   0.233   0.628   0.491   0.484   0.530   0.320   0.617   0.822   0.510   0.622
CON      0.522   0.321   0.641   0.443   0.471   0.522   0.466   0.603   0.846   0.510   0.773
GeMix    0.565   0.228   0.702   0.475   0.437   0.476   0.475   0.651   0.863   0.524   0.822

         23      24      25      26      27      28      29      30      31      32      33
CNN      0.513   0.643   0.884   0.699   0.647   0.544   0.727   0.325   0.489   0.649   0.738
GloVe    0.347   0.595   0.743   0.668   0.379   0.448   0.437   0.313   0.495   0.767   0.672
CON      0.476   0.668   0.852   0.728   0.640   0.558   0.433   0.298   0.503   0.722   0.651
GeMix    0.546   0.660   0.829   0.704   0.639   0.548   0.414   0.476   0.512   0.744   0.734

         34      35      36      37      38      39      40      41      42
CNN      0.580   0.421   0.372   0.532   0.906   0.506   0.748   0.421   0.418
GloVe    0.444   0.440   0.368   0.345   0.970   0.522   0.543   0.415   0.291
CON      0.622   0.548   0.377   0.547   0.888   0.570   0.784   0.483   0.539
GeMix    0.622   0.562   0.387   0.670   0.900   0.573   0.791   0.468   0.500

According to Collell and Moens (2016), visual embeddings perform better than textual ones when recognizing three main attribute types: motion, form and surface, and color, while textual embeddings (GloVe) outperform the visual CNN embeddings in recognizing encyclopedic and function attributes. A closer look at Table 5 further reveals that for attribute types where the vision or the language embeddings perform better than the other modality, it is highly likely that adding, respectively, language or vision information lowers the performance, e.g., attributes 05 (hunted by people) and 07 (used for transportation) in function, and 20 (is fast) and 21 (eats) in motion. Because GeMix tends to set aside the "noisy" dimensions of the embeddings, it performs better than the concatenated embedding.

Let us now look at the overall average F1 measure. We evaluate the F1 measure with respect to two different aspects. First, the overall average F1 measure per attribute, i.e., $F1_{attr} = \frac{1}{|L|}\sum_{l \in L} F1_l$, where $|L|$ is the number of different attributes (42 in our case) and $F1_l$ is the F1 measure of a specific attribute. Second, the overall average F1 measure per sample, i.e., $F1_{samp} = \frac{1}{|S|}\sum_{s \in S} F1_s$, where $|S|$ is the number of samples (400 in our case) and $F1_s$ is the F1 measure of each sample. Table 3 shows that in both cases GeMix achieves the highest F1 measure.

Table 3: Overall F1 measure per attribute and per sample.

         F1_attr   F1_samp
CNN      0.535     0.469
GloVe    0.552     0.474
CON      0.572     0.495
GeMix    0.586     0.507
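As a worked illustration of the two averages, assume a 400 x 42 binary matrix of gold labels and one of predictions, so that the per-sample score is the F1 over a concept's 42 attribute decisions (the "samples" average in scikit-learn terms); this reading of the per-sample measure is our interpretation.

```python
import numpy as np
from sklearn.metrics import f1_score

def overall_f1(Y_true, Y_pred):
    """Y_true, Y_pred: binary arrays of shape (400 samples, 42 attributes).
    Returns (F1_attr, F1_samp): the mean of the per-attribute F1 scores and
    the mean of the per-sample F1 scores, as defined in Section 4.4.2."""
    f1_attr = np.mean([f1_score(Y_true[:, a], Y_pred[:, a])
                       for a in range(Y_true.shape[1])])
    f1_samp = np.mean([f1_score(Y_true[s, :], Y_pred[s, :])
                       for s in range(Y_true.shape[0])])
    return f1_attr, f1_samp
```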
5 Conclusion and Future Work

In this paper, we propose a genetic-based algorithm which learns a compact representation that combines visual and textual embeddings. Two embeddings, obtained by randomly and evenly dividing the shuffled concatenation of the visual and textual embeddings, are used as the initial chromosomes of the genetic algorithm. A variant of the one-point crossover method is used to move the most relevant components of the representation to one embedding and the non-relevant ones to the other. To avoid bias, we use two measures to evaluate fitness: one with respect to the positive labels and the other with respect to the negative labels. The learned embeddings can be seen as a combination of both visual and textual knowledge. In an attribute recognition task, the genetic-based representation outperformed a baseline composed of the concatenation of the visual and textual embeddings, as well as the monomodal visual and textual embeddings.

Another interesting finding of this paper is that for a small group of attributes for which either vision or language generally dominates, adding the other modality may lower the final performance. For example, for the attribute eats of the motion type, for which vision tends to perform better than language (Collell and Moens, 2016), the performance of the mixture of visual and textual representations is lower than that of the monomodal visual representation. Ultimately, our findings provide insights that can help build better multimodal representations by taking into account to what degree visual and textual knowledge should be mixed for different tasks.

References
L. M. Abualigah, A. T. Khader, and M. A. Al-Betar. 2016. Unsupervised feature selection technique based on genetic algorithm for improving the text clustering. In 2016 7th International Conference on Computer Science and Information Technology (CSIT), pages 1–6.

Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2014. Tailoring continuous word representations for dependency parsing. In ACL (2), pages 809–815.

Elia Bruni, Gemma Boleda, Marco Baroni, and Nam-Khanh Tran. 2012. Distributional semantics in technicolor. In ACL, pages 136–145.

Guillem Collell and Marie-Francine Moens. 2016. Is an image worth more than a thousand words? On the fine-grain semantic differences between visual and linguistic representations. In COLING, pages 2807–2817.

Guillem Collell, Ted Zhang, and Marie-Francine Moens. 2017. Imagined visual representations as multimodal embeddings. In AAAI, pages 4378–4384.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug):2493–2537.

Lawrence Davis. 1991. Handbook of Genetic Algorithms.

Golnoosh Farnadi, Jie Tang, Martine De Cock, and Marie-Francine Moens. 2018. User profiling through deep multimodal fusion. In WSDM, pages 171–179.

Christiane Fellbaum. 1998. WordNet. Wiley Online Library.

Juan Carlos Gomez, Stijn Hoskens, and Marie-Francine Moens. 2017. Evolutionary learning of meta-rules for text classification. In GECCO.

Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732.

Douwe Kiela and Léon Bottou. 2014. Learning image embeddings using convolutional neural networks for improved multi-modal semantics. In EMNLP, pages 36–45.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012a. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012b. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.

Angeliki Lazaridou, Nghia The Pham, and Marco Baroni. 2015. Combining language and vision with a multimodal skip-gram model. arXiv preprint arXiv:1501.02598.

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. 2015. Deep learning. Nature 521(7553):436–444.

Ken McRae, George S. Cree, Mark S. Seidenberg, and Chris McNorgan. 2005. Semantic feature production norms for a large set of living and nonliving things. Behavior Research Methods 37(4):547–559.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR abs/1301.3781.

Aytug Onan, Serdar Korukoglu, and Hasan Bulut. 2017. A hybrid ensemble pruning approach based on consensus clustering and multi-objective evolutionary algorithm for sentiment classification. Information Processing & Management 53(4):814–833.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP, volume 14, pages 1532–1543.

Jacques Periaux, Felipe Gonzalez, and Dong Seop Chris Lee. 2015. Evolutionary Optimization and Game Strategies for Advanced Multi-disciplinary Design. Springer Netherlands.

Dana Rubinstein, Effi Levi, Roy Schwartz, and Ari Rappoport. 2015. How well do distributional models capture different types of semantic knowledge? In ACL, volume 2, pages 726–730.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. ImageNet large scale visual recognition challenge. IJCV 115(3):211–252.

Carina Silberer and Mirella Lapata. 2014. Learning grounded meaning representations with autoencoders. In ACL, pages 721–732.