=Paper=
{{Paper
|id=Vol-2729/paper1
|storemode=property
|title=Embedding Formal Contexts Using Unordered Composition
|pdfUrl=https://ceur-ws.org/Vol-2729/paper1.pdf
|volume=Vol-2729
|authors=Esteban Marquer,Ajinkya Kulkarni,Miguel Couceiro
|dblpUrl=https://dblp.org/rec/conf/ecai/MarquerKC20
}}
==Embedding Formal Contexts Using Unordered Composition==
Esteban Marquer, Ajinkya Kulkarni, and Miguel Couceiro

Université de Lorraine, CNRS, Inria N.G.E., LORIA, F-54000 Nancy, France
{esteban.marquer,ajinkya.kulkarni,miguel.couceiro}@inria.fr

Abstract. Despite their simplicity, formal contexts possess a complex latent structure that can be exploited by formal concept analysis (FCA). In this paper, we address the problem of representing formal contexts using neural embeddings in order to facilitate knowledge discovery tasks. We propose Bag of Attributes (BoA), a dataset-agnostic approach to capture the latent structure of formal contexts into embeddings. Our approach exploits the relation between objects and attributes to generate representations in the same embedding space. Our preliminary experiments on attribute clustering on the SPECT heart dataset, and on co-authorship prediction on the ICFCA dataset, show the feasibility of BoA with promising results.

Keywords: Formal Concept Analysis · Vector Space Embedding · Neural Networks · Complex Data · Link Prediction · Clustering.

Acknowledgment: We would like to thank the Inria Project Lab (IPL) HyAIAI (“Hybrid Approaches for Interpretable AI”) for funding the research internship of E. Marquer. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

In recent years there has been an increasing interest in approaches that combine formal knowledge and artificial neural networks (NNs). As mentioned in [1, p. 6], these approaches have a “generally hierarchical organization”, with the “lowest-level network [taking] raw data as input and [producing] a model of the dataset”. These networks represent data as real-valued vectors called embeddings.

Formal concept analysis (FCA) is another powerful tool for understanding complex data. Replicating its mechanisms using NNs could help process complex and large datasets [5,17] by tackling FCA's scalability issues. Following this idea, we want to reproduce the general extraction process of FCA with NN architectures. This calls for a general embedding framework for formal contexts capable of handling data of arbitrary dimensions while encoding much of the contextual information. To our knowledge, there are only a few approaches in this direction [5,10,17,20]. Dürrschnabel et al. [5] propose FCA2VEC to embed formal contexts by encoding FCA's closure operators. It has three main components: attribute2vec and object2vec (both based on word2vec [19]), and closure2vec, which relies on a distance between closures of sets of attributes. In fact, closure2vec is based on [20], which showed how to encode closure operators with simple feed-forward NNs. An advantage of this framework is that it produces embeddings of low dimensions (2 and 3) to facilitate interpretability. Yet FCA2VEC has several limitations: the embedding models need to be trained on each formal context (without guarantees of generalization), and the embeddings for objects and attributes are not defined in the same embedding space, which can be problematic when processing objects and attributes together.

To overcome these limitations, we propose Bag of Attributes (BoA), which provides a shared embedding space for objects and attributes by composing object embeddings from attribute embeddings. It is based on unordered composition and long short-term memory (LSTM) neural networks.
BoA also predicts the number of concepts and a new measure of attribute similarity, called co-intent similarity, based on metric learning. This novel approach differs from the existing ones as it is agnostic to the data. Indeed, the model can accommodate any number of objects and attributes, and it generalizes to real-world formal contexts despite being trained on randomly generated ones. We also explore the advantages and limits of our approach and provide experimental results on attribute clustering on the SPECT heart dataset (https://archive.ics.uci.edu/ml/datasets/SPECT+Heart) and on co-authorship prediction on the ICFCA dataset (https://github.com/tomhanika/conexp-clj/tree/dev/testing-data/icfca_community). The comparison of BoA with FCA2VEC shows competitive performance on both tasks.

The paper is organized as follows. In Section 2, we briefly recall some basic background on FCA and NNs. The architecture of BoA is defined in Section 3, whereas the corresponding training process and datasets are presented in Section 4. We discuss the impact of dataset characteristics on the performance in Section 5. Section 6 describes our experiments on attribute clustering and co-authorship prediction using real-world datasets (compared to graph neural network approaches, FCA2VEC shows an improvement of at least 5% on all metrics for link prediction). Finally, we discuss potential developments for future work in Section 7.

2 Preliminaries And Basic Background

In this section we briefly recall some basic background on FCA (Subsection 2.1) and deep learning (Subsections 2.2 and 2.3), and define the new co-intent similarity for attributes. For further details on FCA see, e.g., [8,12], and on NN architectures see, e.g., [11].

2.1 Formal Concept Analysis

A formal context is a triple ⟨A, O, I⟩, where A is a finite set of attributes, O is a finite set of objects, and I ⊆ A × O is an incidence relation between A and O. A formal context can be represented by a binary table C with objects as rows C_o and attributes as columns C_a, for o ∈ O and a ∈ A. The entry of C corresponding to o and a is defined by C_{o,a} = 1 if (o, a) ∈ I, and 0 otherwise.

It is well known [8] that every formal context ⟨A, O, I⟩ induces a Galois connection between objects and attributes, defined for X ⊆ O and Y ⊆ A by:

X′ = {y ∈ A | (x, y) ∈ I for all x ∈ X} and Y′ = {x ∈ O | (x, y) ∈ I for all y ∈ Y}.

A formal concept is then a pair (X, Y) such that X′ = Y and Y′ = X, where X is called the extent and Y the intent. It should be noticed that both X and Y are closed sets, i.e., X = X′′ and Y = Y′′. The set of all formal concepts can be ordered by inclusion of the extents or, dually, by the reversed inclusion of the intents. We denote by I ⊆ 2^A the set of intents and by E ⊆ 2^O the set of extents.
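To make the derivation operators concrete, here is a minimal Python sketch (illustrative only, not from the paper; the helper names are ours) that computes X′ and Y′ on a toy context and checks that a pair of sets forms a formal concept.

```python
# Toy formal context stored as object -> set of attributes.
# prime_objects computes X' (attributes shared by all objects in X);
# prime_attributes computes Y' (objects having all attributes in Y).

def prime_objects(X, context):
    attrs = set().union(*context.values())
    return {a for a in attrs if all(a in context[o] for o in X)}

def prime_attributes(Y, context):
    return {o for o, attrs in context.items() if Y <= attrs}

context = {"o1": {"a", "b"}, "o2": {"a"}, "o3": {"b", "c"}}
X = {"o1", "o2"}
Y = prime_objects(X, context)                  # {'a'}
print(Y, prime_attributes(Y, context) == X)    # X' = Y and Y' = X: (X, Y) is a formal concept
```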
2.2 Auto-Encoders, Embeddings And Metric Learning

Auto-encoders are a class of deep learning models composed of (i) an encoder, which takes some x as an input and produces a latent representation z, and (ii) a decoder, which takes z as an input and reconstructs x̂, a prediction of x. The model learns to compress x into z by training the model to match x and x̂. The training objective matching x and x̂ is the reconstruction loss. Auto-encoders are one of the methods to generate representations of data as vectors. In that case z is called the embedding of x, and the real-valued space in which z is defined is called the embedding space.

Unlike traditional auto-encoders, variational auto-encoders (VAEs) [14] encode a distribution for each value of z instead of the value itself. In practice, for each component of z, the encoder produces two values: a mean µ and a standard deviation σ. When training the model, z is sampled from the normal distribution defined by µ and σ. Finally, the distribution defined by µ and σ is normalized by adding the Kullback–Leibler (KL) divergence loss term. To make this process differentiable and be able to train the model, a method called reparametrization [14] is used. VAEs are known to provide better generalization capabilities and are easier to use to decode arbitrary embeddings, compared to classic auto-encoders. This property is useful for generation since we can train a model generating embeddings and then decode them with a pre-trained VAE. In fact, VAEs are used in a wide variety of applications to improve the quality of embedding spaces, e.g., for image [14], speech [16] and graph generation [15].

Metric learning [18] is a training process used to ensure that embedding spaces have the properties of metric spaces. To achieve this, a loss is used to reduce the distance between the embeddings of equal elements and to increase the distance between the embeddings of different elements. Multiple losses can achieve this, such as the pairwise loss and the triplet loss. The triplet loss considers the embeddings of three elements: an input x1, some x2 judged equal to x1, and some y different from x1. In some approaches [16] a predictor (typically a multi-layer perceptron (MLP)) is used to predict a distance between the embeddings instead of applying a standard distance directly on the embeddings. It is also possible to learn distances on different properties of the embedded elements, by splitting the embedding into segments and learning a different distance on each one [16]. Metric learning is usually used to approximate actual distances. However, this process can be applied to learn other kinds of measures not fitting the definition of a distance, which is what we do in this paper.

We need both “equal” and “different” attributes to use metric learning losses on attribute embeddings. Nonetheless, even if we consider equivalent attributes (i.e., attributes with the same extent) as “equal”, they are usually rare within a given context. We define co-intent similarity to compare attributes and avoid this issue. Given two attributes a1 and a2, their pairwise co-intent similarity is defined as:

co-intent(a1, a2) = 1, if |{i ∈ I | a1 ∈ i}| + |{i ∈ I | a2 ∈ i}| = 0;
co-intent(a1, a2) = 2 × |{i ∈ I | a1 ∈ i and a2 ∈ i}| / (|{i ∈ I | a1 ∈ i}| + |{i ∈ I | a2 ∈ i}|), otherwise.    (1)

In other words, it is the ratio of intents containing both attributes over the intents containing a1 or a2 (observe that this is essentially the Jaccard index on the set of intents). In cases where no intent contains either attribute (both attributes are empty or padding columns), the similarity is set to 1. Co-intent similarity ranges from 0, for attributes never appearing in the same intents, to 1, for attributes always appearing together or for identical attributes.
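As a concrete reading of Equation (1), the short Python sketch below (our illustration, not the authors' code) computes the co-intent similarity of two attributes from a list of intents.

```python
# Co-intent similarity: twice the number of intents containing both attributes,
# divided by the summed counts of intents containing each attribute (Equation 1).

def co_intent(a1, a2, intents):
    with_a1 = sum(1 for i in intents if a1 in i)
    with_a2 = sum(1 for i in intents if a2 in i)
    if with_a1 + with_a2 == 0:       # empty or padding columns
        return 1.0
    both = sum(1 for i in intents if a1 in i and a2 in i)
    return 2 * both / (with_a1 + with_a2)

intents = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b", "c"}]
print(co_intent("a", "b", intents))   # 2*2 / (3+3) ≈ 0.67
```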
2.3 Unordered Composition

By unordered composition functions we mean operations that do not take into account the order of the input elements and that can accommodate any number of input elements. Typical examples are the componentwise min, max, and average (also called min-, max-, and average-pooling, respectively). Unordered composition-based models have proven their effectiveness in a variety of tasks, for instance sentence embedding [13], sentiment classification [4] and feature classification [9]. On the one hand, this family of methods allows inputs of varying sizes to be processed at a relatively low computational cost, in contrast to recurrent models like LSTMs [11]. On the other hand, we lose the information related to the order of the input elements.

3 Proposed approach

In this section we first define the proposed approach and explain the objectives used to train the model. The training process is detailed in Section 4.

3.1 Bag of Attributes

Bag of Attributes (BoA) takes a formal context as input and produces embeddings for its attributes. Then, object embeddings are computed using the embeddings of the attributes and the formal context. BoA has four main components: a pre-embedding generator, an attribute encoder (called the self-other encoder) to compute the attribute embeddings, an object encoder, and a decoder. It considers the attributes as an unordered set to produce the object and attribute embeddings. The name is inspired by Bag of Words (BoW) [19]. The structure of the encoder is schematized in Figure 1.

Fig. 1: Schematic representation of the BoA architecture: (a) self-other attribute encoder for an attribute; (b) BoA encoder architecture. Blue blocks correspond to tensors, orange blocks to neural components, and green blocks to non-neural computations. Arrows joining blocks represent concatenation of tensors.

The decoder itself is an MLP predicting whether an object has an attribute or not (1 or 0, respectively). Its input is the concatenation of the object and the attribute embeddings. A sigmoid function applied on the output ensures it is in [0, 1]. BoA is trained as a VAE on formal contexts, so a µ and a σ vector are produced for each attribute. The sampling of the attribute embeddings is done before the generation of the object embeddings.

The order of the attributes in the dataset does not matter for FCA, therefore in BoA each attribute is processed in a similar manner to capture this property. Each attribute is compared to all the other attributes, for each object of the dataset. In practice, the column of an attribute (self) is compared to an unordered composition (average-pooling) of all the other attributes (other). Self and other are then processed by a bidirectional LSTM (BLSTM) [11], with the object dimension as the sequence dimension. The last hidden state of the BLSTM is processed with a feed-forward layer into an embedding that represents the attribute. The structure of the attribute encoder is presented in Figure 1a. Finally, the object embeddings are computed by applying max-pooling on the embeddings of the attributes present in the object's intent. We apply an LSTM on each row of C before the self-other encoder: it produces different pre-embeddings for each attribute even when the input values are identical, which allows the model to determine which attribute is involved while avoiding the use of unordered composition directly on C.
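The following PyTorch sketch illustrates the self-other attribute encoding and the max-pooling composition of object embeddings described above. It is our simplified reading of the text, not the released implementation: it omits the per-row LSTM pre-embedding and the VAE sampling, and the class name and layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class SelfOtherEncoder(nn.Module):
    """Simplified self-other attribute encoder (illustrative layer sizes)."""

    def __init__(self, emb_dim=128):
        super().__init__()
        # per object, the input is [self value, average of the other columns]
        self.blstm = nn.LSTM(input_size=2, hidden_size=emb_dim, num_layers=2,
                             bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * emb_dim, emb_dim)

    def forward(self, context):
        """context: (num_objects, num_attributes) binary float tensor, >= 2 attributes."""
        n_obj, n_attr = context.shape
        attr_embs = []
        for a in range(n_attr):
            self_col = context[:, a:a + 1]                            # (n_obj, 1)
            other = torch.cat([context[:, :a], context[:, a + 1:]], dim=1)
            other_col = other.mean(dim=1, keepdim=True)               # average-pooling of "other"
            seq = torch.cat([self_col, other_col], dim=1)             # (n_obj, 2)
            _, (h, _) = self.blstm(seq.unsqueeze(0))                  # objects as the sequence dim
            last = torch.cat([h[-2], h[-1]], dim=1)                   # last layer, both directions
            attr_embs.append(self.proj(last).squeeze(0))
        attr_embs = torch.stack(attr_embs)                            # (n_attr, emb_dim)
        # object embeddings: max-pooling over the attributes of each object
        # (assumes every object has at least one attribute)
        mask = context.bool().unsqueeze(-1)                           # (n_obj, n_attr, 1)
        masked = torch.where(mask, attr_embs.unsqueeze(0), torch.tensor(float("-inf")))
        obj_embs = masked.max(dim=1).values                           # (n_obj, emb_dim)
        return attr_embs, obj_embs

enc = SelfOtherEncoder(emb_dim=8)
ctx = torch.tensor([[1., 0., 1.], [1., 1., 0.]])
attrs, objs = enc(ctx)
print(attrs.shape, objs.shape)   # torch.Size([3, 8]) torch.Size([2, 8])
```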
3.2 Training Objective

We train BoA using the KL divergence on the attribute embeddings exclusively, as the sampling happens before the computation of the object embeddings. We use the binary cross-entropy loss for reconstruction because the model predicts between two classes (1 and 0). On top of that, we use metric learning with the new co-intent similarity (see Equation 1) and the number of concepts [8], with the mean square error (MSE) as the loss function. Predicting the number of concepts from the context without actually computing the intents helps when generating the set of concepts using neural models: knowing beforehand how many elements to generate facilitates the generation process. We use MLPs to predict the co-intent similarity and the number of concepts. For the co-intent similarity between two attributes a1 and a2, the input is the concatenation of the embeddings of a1 and a2, and a sigmoid output function is added to ensure the predicted similarity is in [0, 1]. We apply max-pooling over the attribute embeddings before predicting the number of concepts, which corresponds to a deep averaging network (DAN) [13].

4 Training Setup

In this section we present our training process (Subsection 4.1) and dataset (Subsection 4.2), and we describe the data augmentation process (Subsection 4.3).

4.1 Training Process

We train BoA in two phases of 5000 epochs each. In the first phase, we apply the reconstruction loss and the KL divergence only. Then, we gradually introduce the prediction of the co-intent similarity and of the number of concepts. When using metric learning with multiple distances, a common approach is to split the embedding space and to learn one distance per sub-part of the embedding space [16]. We apply the same principle and use 50% of the embedding space to predict the co-intent similarity and 25% for the number of concepts. The exact embedding dimension of BoA is 128, with a pre-embedding size of 64. The LSTM and the BLSTM have two layers each. The decoder MLP has four layers, and the MLPs used for distance prediction both have two layers. We use a rectified linear unit activation function between all the layers of the model.

4.2 Training Data

The dataset used for training the BoA model is composed of 6000 randomly generated formal contexts split into training and validation, and the corresponding intents computed using the Coron system (http://coron.loria.fr/site/index.php). To generate a context of |O| objects and |A| attributes, we sample |O| × |A| values from a Poisson distribution and apply a threshold of 0.3. Values under the threshold correspond to 1 in the context, which leads to a density around 0.3. Note that the random generation process may result in empty rows and columns, which will be dropped by Coron if they are at the extremities of the context. For this reason, the actual size of a generated context may be smaller than the requested one. We generate a training set of 5000 contexts and a test set of 1000 samples. For the training phase, a development set of 10% of the training set is randomly sampled from the training set. For each set, we generate different sizes of contexts, 20% of each: 5 × 5, 10 × 10, 10 × 20, 20 × 10 and 20 × 20 contexts (|O| × |A|). The statistics of the generated datasets are reported in Table 1.

Table 1: Descriptive statistics on the dataset of randomly generated contexts.

|             | Dataset | # Objects    | # Attributes | # Concepts    | Context density |
|-------------|---------|--------------|--------------|---------------|-----------------|
| Mean ± std. | train   | 12.83 ± 6.11 | 12.98 ± 6.03 | 77.93 ± 78.39 | 0.329 ± 0.057   |
|             | test    | 12.83 ± 6.13 | 12.97 ± 6.04 | 78.12 ± 77.27 | 0.332 ± 0.057   |
| Range       | train   | 1 to 20      | 2 to 20      | 1 to 401      | 0 to 0.56       |
|             | test    | 2 to 20      | 3 to 20      | 2 to 401      | 0 to 0.49       |
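The random context generation described above can be sketched as follows (our reading of the text, not the authors' script). The rate of the Poisson distribution is not given in the paper; a rate around 1.2 makes the probability of drawing a value under the 0.3 threshold (i.e., a 0) roughly 0.3, matching the reported density.

```python
# Illustrative random formal context generation: Poisson samples thresholded at 0.3.
import numpy as np

def random_context(n_objects, n_attributes, lam=1.2, threshold=0.3, seed=None):
    rng = np.random.default_rng(seed)
    values = rng.poisson(lam, size=(n_objects, n_attributes))
    # values under the threshold correspond to 1 in the context
    return (values < threshold).astype(np.int8)

ctx = random_context(20, 20, seed=0)
print(ctx.mean())   # empirical density, close to 0.3
```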
4.3 Data Augmentation

We rely on plain random generation for the formal contexts, and not on more involved generation processes as discussed in [7,6], so the random data is biased. We introduce a simple way to compensate some of those biases while improving the generalization capability of the model. We implement the following data augmentation pipeline: (i) duplicating objects and attributes, (ii) inverting the value of entries, and (iii) shuffling objects and attributes. With this process, we simulate identical (duplication) and nearly identical (duplication + drop) objects or attributes that appear in real-world datasets.

Objects and attributes have a probability p of being duplicated. If duplicated, they have the same probability p of being duplicated again. From this definition, the number of copies of an object (or attribute) follows a geometric law with success probability 1 − p. Consequently, the exact numbers of objects and attributes actually seen during training do not match the ones reported in Table 1; nonetheless, since the duplication follows a geometric law, we can estimate the expected numbers of objects and attributes seen as the original numbers divided by (1 − p). Inverting some randomly selected values in the formal context is our adaptation of dropout, a common technique in deep learning. The shuffling after duplication prevents the model from relying on the order of the objects and attributes. We set the duplication probability to p = 0.1 and the drop (value inversion) probability to 0.01. In this setting, the estimated average numbers of objects and attributes are respectively 14.25 and 14.42, for both the training and development sets.

When co-intent similarity is used, duplication and shuffling are reproduced on the intents. However, drops in a formal context alter the corresponding lattice, so they are not applied at all when using data from the lattice. This precaution avoids making the model insensitive to small variations in the input.
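Below is a hedged sketch of this augmentation pipeline under the stated probabilities (an illustration, not the authors' implementation; helper names are ours).

```python
# Augmentation sketch: duplicate objects/attributes with a geometric number of
# copies, flip a few entries, then shuffle rows and columns.
import numpy as np

def augment(context, p_dup=0.1, p_flip=0.01, seed=None):
    rng = np.random.default_rng(seed)
    # (i) duplication: each row/column gets k copies with P(k) = p_dup**(k-1) * (1 - p_dup),
    #     i.e. a geometric law with success probability 1 - p_dup
    for axis in (0, 1):
        copies = rng.geometric(1 - p_dup, size=context.shape[axis])
        context = np.repeat(context, copies, axis=axis)
    # (ii) invert entries with probability p_flip (the "drop", an adaptation of dropout)
    flips = rng.random(context.shape) < p_flip
    context = np.where(flips, 1 - context, context)
    # (iii) shuffle objects and attributes
    context = context[rng.permutation(context.shape[0])]
    return context[:, rng.permutation(context.shape[1])]

print(augment(np.eye(5, dtype=np.int8), seed=0).shape)
```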
4.4 Issues With KL Divergence

When adding the KL divergence to the prototype of BoA (initially a simple auto-encoder), the performance of the model was greatly impaired. The analysis of the predictions revealed that the model was going for “low hanging fruits” and ignored the embeddings themselves, as described in [3]. To solve this issue we apply annealing [3] and multiply the KL divergence by a lambda that we set to 10⁻³. This reduces the impact of the KL divergence on the training and allows the model to learn some features before the KL divergence comes into effect. However, it reduces the benefits we get from using a VAE.

5 Exploring The Limits Of BoA

We now explore the limits of BoA w.r.t. the input data. All experiments are performed on randomly generated data to control the evaluation process.

5.1 Reconstruction Performance

To assess the reconstruction performance of the BoA auto-encoder, we use the area under the ROC curve (AUC ROC). It allows us to determine whether BoA has good predictive capacity and, similarly to the F1 measure, gives a general account of performance. To determine whether the results are significantly different, we use Student t-tests on means. The results are presented in Figure 2.

Fig. 2: Reconstruction performance on random contexts: (a) impact of the density, from 0.1 to 0.9, 100 samples per density; (b) impact of the size, from 5 to 500 objects and attributes, 20 samples per size; (c) impact of the concept number, for 200 samples with 20 objects and attributes (the blue line is the general tendency when rounding the concept number to 50). The error bars and the shaded area correspond to the standard deviation.

We first evaluate the impact of the density on the reconstruction by comparing the performance on random contexts with densities from 0.1 to 0.9. We use 100 samples per density with a fixed size of 20 objects and attributes. Student's t-tests show significant differences between the performance with the various densities: all the p-values are under 0.01, except between 0.4 and 0.8 (0.24), 0.5 and 0.6 (0.39), and 0.7 and 0.8 (0.19). However, the model performance stays overall stable across the densities, while slightly better with smaller densities. We suspect this tendency is due to the composition process of the object embeddings: the higher the density, the more attributes are present for an object, so more attribute embeddings are involved in the composition of the object embedding, making it more complex to decode.

We also examine the effect of the size on the AUC ROC. Square random contexts (|O| = |A|) of sizes in {5, 10, 20, 50, 100, 200, 500} and a fixed density of 0.3 are used for this experiment, with 20 samples per size. The model performs very well for seen data sizes, with a slight drop to 0.83 for 20 objects and attributes. As expected, the performance drops when manipulating larger contexts. For 50 objects and attributes (2.5 times the maximum seen size), the AUC ROC is above 0.63, but from 100 objects and attributes onward, it drops under 0.6. Finally, with 500 objects and attributes (25 times the largest seen data size and 4 times the embedding size), the reconstruction AUC ROC falls to 0.50 on average. This is the limit of reconstruction performance with the current training process.

Finally, we examine the impact of the number of concepts on the performance of the model. We use contexts of fixed size (20 objects and attributes) from the test set, totaling 200 contexts. We consider the concept number as an indicator of the variety of attributes and objects in the context. Indeed, if the concept number is high for a given size of context, we can expect the context to be close to the clarified context (the context with no equivalent objects or attributes). This implies a lower amount of duplicate objects and attributes. In addition, we can expect the model to have a harder time encoding and decoding irregular contexts than repetitive ones. Consequently, the drop of the AUC ROC for higher concept numbers is not surprising. However, we also observe a lower performance around 150 concepts. This second decrease requires further investigation.

5.2 Metric Learning Performance

We evaluate the performance of BoA for the prediction of the co-intent similarity and of the number of concepts, by computing the attribute embeddings and applying the predictors trained together with the BoA model. We use the 200 contexts of 20 objects and attributes from the test set. The prediction results are reported in Figure 3.

Fig. 3: Predicted pseudo-metrics against the actual values, for the 200 samples with 20 objects and attributes from the test set: (a) concept number; (b) co-intent similarity.

The Pearson correlation coefficient between the actual concept number and the prediction is 0.9, indicating a strong correlation. We can notice the tendency of the model to under-evaluate the concept number.
Even though BoA manages to identify identical attributes (ai = aj), the predictions in the other cases are not clear: for similarities between 0 and 0.8, they seem randomly picked between 0 and 0.4. The analysis of the training process reveals a small difference of the MSE between the first and the last training epochs: from 0.17 to 0.05.

6 Experiments On Real-World Datasets

To evaluate the performance of BoA on real-world datasets, we follow the empirical setting of [5]: we reproduce their link prediction and attribute clustering tasks, used to evaluate object2vec (o2v) and attribute2vec (a2v), respectively. (Experiments were carried out using the Grid'5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several universities as well as other organizations; see https://www.grid5000.fr.) We use the same ICFCA dataset as [5] for link prediction. For attribute clustering, however, we use SPECT heart (https://archive.ics.uci.edu/ml/datasets/SPECT+Heart) as it is smaller than wiki44k [5], with dimensions closer to the training data: 68 objects and 23 attributes. We train the CBoW and SG variants of the FCA2VEC models using the same settings as in [5], with 20 random iterations of each model and an embedding size of 3. To obtain comparable results, we reduce the embeddings produced by BoA to 3 dimensions by applying two standard dimensionality reduction techniques: principal component analysis (PCA) and t-distributed stochastic neighbor embedding (TSNE). We use Student t-tests on means to determine whether the results are significant.

We report the link prediction performance in Table 2. The three BoA variants show a significantly different performance from o2v SG, with all the p-values lower than 0.005. We found that the classifier based on BoA, the one with the best F1 score, systematically answers positive. Additionally, we fail to reproduce the performance of [5] (F1 score of 0.69 for o2v CBoW, 0.66 for o2v SG). Finally, the ICFCA context is very sparse: it has a density of 0.003 on the train set and 0.005 on the test set. Due to this, the task may not be representative of the performance of the object embeddings on most datasets. These results hint that the task needs to be adapted to get proper insight on the object embedding performance.

Table 2: Performance on the link prediction task (mean ± std.).

| Model       | Precision   | Recall      | F1          |
|-------------|-------------|-------------|-------------|
| o2v CBoW    | 0.63 ± 0.05 | 0.46 ± 0.05 | 0.53 ± 0.05 |
| o2v SG      | 0.70 ± 0.04 | 0.49 ± 0.03 | 0.57 ± 0.02 |
| BoA PCA 3d  | 0.65        | 0.42        | 0.51        |
| BoA TSNE 3d | 0.58        | 0.67        | 0.62        |
| BoA         | 0.50        | 1.00        | 0.67        |

The attribute clustering performance is reported in Table 3. In this experiment, we find that the CBoW variant performs significantly better than the SG one (all t-test p-values under 0.0005). This is the opposite of the result found by [5] for attribute clustering. However, this result may be due to using a different dataset. Interestingly, the BoA PCA variant performs equally to the full BoA. The performance of BoA (and BoA PCA) is significantly better than a2v CBoW for 2 and 5 clusters (p-values under 10⁻¹⁴). For 10 clusters, however, a2v CBoW performs significantly better (p-value under 0.001). The model improves the performance of a2v CBoW by 4% for 2 clusters and 8% for 5 clusters.

Table 3: Performance on the attribute clustering task with 2, 5 and 10 clusters.

| Model       | k = 2       | k = 5       | k = 10        |
|-------------|-------------|-------------|---------------|
| a2v CBoW    | 0.66 ± 0.00 | 0.14 ± 0.02 | 0.063 ± 0.013 |
| a2v SG      | 0.35 ± 0.13 | 0.11 ± 0.03 | 0.042 ± 0.010 |
| BoA TSNE 3d | 0.30        | 0.29        | 0.044         |
| BoA PCA 3d  | 0.70        | 0.22        | 0.051         |
| BoA         | 0.70        | 0.22        | 0.051         |
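For illustration, the dimensionality reduction used for the BoA PCA 3d and BoA TSNE 3d variants, followed by a k-means clustering step, can be sketched as follows. The embeddings below are random placeholders, scikit-learn is an assumed toolkit, and the clustering evaluation metric of [5] is not reproduced here.

```python
# Reduce 128-dimensional attribute embeddings to 3 dimensions (PCA and TSNE),
# then cluster them with k-means for k in {2, 5, 10}.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

attr_emb = np.random.rand(23, 128)   # stand-in for the 23 SPECT heart attribute embeddings
emb_pca = PCA(n_components=3).fit_transform(attr_emb)
emb_tsne = TSNE(n_components=3, perplexity=5).fit_transform(attr_emb)
for k in (2, 5, 10):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(emb_pca)
    print(k, np.bincount(labels))    # cluster sizes
```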
7 Conclusion And Future Work

We introduced the co-intent similarity for attributes and proposed BoA, a generalized embedding framework for formal contexts that integrates several FCA aspects. Our framework is data agnostic and scales to real-world datasets such as the SPECT heart dataset. It is also robust w.r.t. variations in the density and the concept number of formal contexts. The experimental results are encouraging, as our general approach achieves performance similar to FCA2VEC, a dataset-specific one. In addition to being data agnostic, BoA constitutes a promising alternative to FCA2VEC since it uses a single embedding space for all contexts. Moreover, the asymptotic time complexity of embedding a formal context through BoA is Θ(|O| × |A|²). In comparison, applying a linear embedding (the most basic embedding) or an LSTM embedding model (like our pre-embedding) to each entry has a complexity of Θ(|O| × |A|).

As future work we aim to tackle two active issues in the FCA community: the random generation of contexts and the scalability of concept lattices. Random context generation introduces several biases, as it does not match the distribution of real-world formal contexts (see, e.g., the “stegosaurus effect” discussed in [2]). It could be beneficial to use more accurate generation algorithms than our current one, like those discussed in [6,7]. Nonetheless, it is encouraging to see that BoA achieves acceptable performance when trained on simply generated contexts of relatively small size (up to 20 objects and attributes). Since this preliminary work enables the use of decoders in generation processes, using NNs seems a feasible direction for concept lattice construction. These are some directions of current ongoing work.

References

1. Besold, T.R., et al.: Neural-symbolic learning and reasoning: A survey and interpretation. CoRR 1711.03902 (2017)
2. Borchmann, D., Hanika, T.: Some experimental results on randomly generating formal contexts. In: CLA. vol. 1624, pp. 57–69 (2016)
3. Bowman, S.R., et al.: Generating sentences from a continuous space. pp. 10–21 (2016)
4. Chen, X., et al.: Adversarial deep averaging networks for cross-lingual sentiment classification. CoRR 1606.01614 (2016)
5. Dürrschnabel, D., et al.: FCA2VEC: Embedding techniques for formal concept analysis. CoRR 1911.11496 (2019)
6. Felde, M., et al.: Formal context generation using Dirichlet distributions. In: Graph-Based Representation and Reasoning. pp. 57–71. Cham (2019)
7. Ganter, B.: Random extents and random closure systems. In: CLA (2011)
8. Ganter, B., Wille, R.: Formal Concept Analysis: Mathematical Foundations. Springer (1999)
9. Gardner, A., et al.: Classifying unordered feature sets with convolutional deep averaging networks. CoRR 1709.03019 (2017)
10. Gaume, B., et al.: Clustering bipartite graphs in terms of approximate formal concepts and sub-contexts. IJCIS 6, 1125–1142 (2013)
11. Graves, A., et al.: Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18(5-6), 602–610 (2005)
12. Ignatov, D.I.: Introduction to formal concept analysis and its applications in information retrieval and related fields. CoRR 1703.02819 (2017)
13. Iyyer, M., et al.: Deep unordered composition rivals syntactic methods for text classification. In: 53rd ACL and 7th IJCNLP. pp. 1681–1691 (2015)
14. Kingma, D., Welling, M.: Auto-encoding variational Bayes (2013)
15. Kipf, T.N., et al.: Variational graph auto-encoders. CoRR 1611.07308 (2016)
16.
Kulkarni, A., et al.: Deep variational metric learning for transfer of expressivity in multispeaker text to speech (2020), to appear in SLSP 2020
17. Kuznetsov, S.O., et al.: On neural network architecture based on concept lattices. In: Foundations of Intelligent Systems. pp. 653–663. Cham (2017)
18. Lin, X., et al.: Deep variational metric learning. In: Computer Vision – ECCV. pp. 714–729. Cham (2018)
19. Mikolov, T., et al.: Efficient estimation of word representations in vector space. pp. 1–12 (2013)
20. Rudolph, S.: Encoding closure operators into neural networks. In: NeSy. pp. 40–45. CEUR-WS.org (2007)