<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Research Summary: Knowledge Transfer in Artificial Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pierre-Alexandre Murena</string-name>
          <email>murena@telecom-paristech.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Telecom ParisTech - Universite Paris Saclay</institution>
          ,
          <addr-line>46 rue Barrault, 75013 Paris</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>204</fpage>
      <lpage>208</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This document presents the main research problems addressed during my PhD
studies. All this research is carried out within two teams, DBWeb at Telecom
ParisTech and LInK (Learning and Integration of Knowledge) at AgroParisTech, both
located in Paris, under the supervision of Prof. Jean-Louis Dessalles and Prof. Antoine
Cornuejols.</p>
      <p>My research focuses on learning theory, both from the perspective of symbolic
machine learning and of learning in continuous domains. I aim at finding an
information-theoretic principle guiding information transfer in learning.</p>
      <p>The starting point of this research is the idea that most machine learning
takes a strong stationarity hypothesis for granted. The general framework of
statistical learning (mainly supervised and semi-supervised learning) considers two
data sets: a learning data set (from which the concepts have to be learned) and a
test data set (on which the quality of the learned concepts is evaluated). The key idea
of current learning methods and theories is to assume that training data and test
data are independent and identically distributed (i.i.d.). However, this strong
hypothesis does not hold in many cases: either the data generation process evolves
over time (aging effect, trending effect...) or the data belong to a different
domain. Because similar questions of transfer and domain adaptation had already
been addressed in analogical reasoning, we proposed to use an approach based
on Kolmogorov complexity instead of probabilities. Kolmogorov complexity is a
measure of the information contained inside an object. The use of Kolmogorov
complexity in machine learning is accepted by the community, but mainly from
a stationary point of view (when the key concept does not vary); we proposed
to extend its use to non-stationary environments, in the same way as done in
analogical reasoning. A presentation of these issues is given in section 2.</p>
      <p>The strong similarity between transfer learning and analogical reasoning led
me to consider this issue in my research. Analogical reasoning deals with
situations of the form "'b' is to 'a' as 'd' is to 'c'". Because its value has already been
demonstrated, I focus on Hofstadter's micro-world, made up of strings of
alphabetical characters that can be described with simple concepts like 'predecessor',
'successor' or 'repetition'. The use of Kolmogorov complexity for analogical
reasoning had already been considered, but our approach is slightly different. We
developed a small descriptive language for Hofstadter's problems and converted it
into a binary code, the length of which corresponds to Kolmogorov complexity.
Our work on analogy is presented in section 3. Finally, because of its very global
perspective, our research topic leads naturally to collaborations on various topics
related to learning. This side aspect of my research is presented in section 4.</p>
      <p>Copyright © 2017 for this paper by its authors. Copying permitted for private and
academic purposes. In Proceedings of the ICCBR 2017 Workshops. Trondheim, Norway.</p>
    </sec>
    <sec id="sec-2">
      <title>A global approach to learning</title>
      <p>Statistical machine learning in its current form is often considered to be based
mainly on three inductive principles: Empirical Risk Minimization (ERM),
Bayesianism and Minimum Description Length (MDL). The validity of ERM has been
demonstrated under the strong i.i.d. assumption in several frameworks, all
inspired by the Probably Approximately Correct (PAC) learning framework. Besides,
strong links have been established between Bayesianism and MDL.</p>
      <p>
        When the i.i.d. hypothesis is not verified, these three principles are no longer
valid and have to be replaced by new principles. Exploring this direction, we
considered the most straightforward transfer learning problem, namely the
classification problem, the purpose of which is to associate data to labels. Given source
data X_S associated to their classes Y_S and target data X_T, we aim at finding
the corresponding classes Y_T. The idea is to find a classification function f_S such
that f_S(X_S) = Y_S and to transfer this function into a classification function f_T
applicable to the target data X_T. Because this problem is the same as analogical
reasoning, we used a simplification of the general MDL principle in the context
of analogy [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        The mathematical tool used to measure the description length in MDL is
Kolmogorov complexity [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The Kolmogorov complexity (also called algorithmic
complexity) of an object x is an information-theoretic measure defined as the minimal
length of a program, defined on a Universal Turing Machine (UTM), the
output of which is the object x. This quantity can be shown to be machine-independent
(up to an additive constant), but not computable.
      </p>
      <p>The idea we developed is to consider an upper bound of this complexity
based on a restricted Turing machine. The choice of a restricted Turing
machine corresponds to an inductive bias, inevitable in any inductive reasoning
(see for example the no-free-lunch theorem). This choice also raises the problem
of machine dependency, which seems crucial in human learning.</p>
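      <p>As a concrete illustration of this idea, the length of a compressed encoding of a string is a computable upper bound on its Kolmogorov complexity (up to an additive constant). The sketch below uses an off-the-shelf compressor in place of a restricted machine; the choice of zlib here is an illustrative assumption, not the restricted machine used in our work, but it plays the same role of inductive bias.</p>

```python
import zlib

def complexity_upper_bound(s: str) -> int:
    """Computable upper bound on K(s): length in bits of a zlib encoding.
    The compressor plays the role of the restricted machine (inductive bias)."""
    return 8 * len(zlib.compress(s.encode("utf-8"), 9))

# A highly regular string compresses far better than a random-looking one:
regular = "AB" * 100
irregular = "qZ3k9fLw1xPo7vRb2nTy8sGd4hJm6cQe0uIa5"
print(complexity_upper_bound(regular))    # small: the repetition is captured
print(complexity_upper_bound(irregular))  # close to the raw length in bits
```

      <p>The bound is loose but machine-dependent by construction, which is exactly the phenomenon discussed above: a different compressor would rank some strings differently.</p>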
      <p>
        Our first contribution is a direct use of analogical MDL in the context of
transfer learning [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: I presented a two-part MDL equation based on a data
representation called a model. A model is any object which may be used to compress
data. In transfer learning, our purpose is to infer a source model M_S and a target
model M_T:
      </p>
      <p>min over M_S, M_T of</p>
      <p>C(M_S) + C(X_S | M_S) + C(Y_S | M_S, X_S) + C(M_T | M_S) + C(X_T | M_T)   (1)
where C(·) designates Kolmogorov complexity, X the data, Y_S the source labels
and M the models. This equation applies both to continuous data (X is a matrix,
the rows of which correspond to vector data points) and to symbolic data (X is
a sequence of symbols, for example a sequence of letters). The different terms
in the equation present a strong similarity with usual machine learning terms:
C(M) corresponds to a model penalization based on its complexity; C(X|M)
corresponds to a likelihood term, i.e. a fitness of the model toward the data; and
the term C(Y|M,X) corresponds to an empirical risk. In the paper, we also
proposed experimental validations on two toy data sets with a simple
prototype-based model. They present good results and high performance on these data
sets.</p>
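      <p>The stationary part of the objective, C(M) + C(X|M) + C(Y|M,X), can be sketched for a toy prototype-based model. All coding choices below (fixed precision per coordinate, logarithmic residual cost, labels coded as exceptions) are illustrative assumptions, not the coding scheme of the paper; the sketch only shows how the three terms trade off against each other.</p>

```python
import math

def mdl_score(prototypes, proto_labels, X, Y, bits_per_coord=16):
    """Two-part MDL score C(M) + C(X|M) + C(Y|M,X) for a prototype model.
    Illustrative coding: coordinates at fixed precision, residuals with a
    logarithmic cost, labels as exceptions to the nearest prototype's label."""
    d = len(X[0])
    # C(M): k prototypes, each with d coordinates and one label bit
    c_model = len(prototypes) * (d * bits_per_coord + 1)
    c_data = 0.0    # C(X|M): nearest-prototype index + residual
    c_labels = 0.0  # C(Y|M,X): one unit per label exception
    for x, y in zip(X, Y):
        dists = [math.dist(x, p) for p in prototypes]
        j = min(range(len(prototypes)), key=dists.__getitem__)
        c_data += (math.log2(len(prototypes)) if len(prototypes) > 1 else 0.0)
        c_data += d * math.log2(1 + dists[j])
        c_labels += 0.0 if proto_labels[j] == y else 1.0
    return c_model + c_data + c_labels

# Two well-separated classes: a two-prototype model pays more for C(M) but
# describes data and labels so much more cheaply that it wins overall.
X = [(i * 0.01, 0.0) for i in range(20)] + [(5 + i * 0.01, 5.0) for i in range(20)]
Y = [0] * 20 + [1] * 20
one_proto = mdl_score([(2.5, 2.5)], [0], X, Y)
two_protos = mdl_score([(0.1, 0.0), (5.1, 5.0)], [0, 1], X, Y)
print(two_protos < one_proto)
```

      <p>Adding a third prototype would decrease the residual term only marginally while paying the full C(M) increment, which is how the penalization term controls model selection.</p>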
      <p>
        A direct variant of this formula has been proposed for incremental
learning [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In incremental learning, the system faces a sequence of questions X_1, X_2, ...
and has to find the solution to each problem one by one. The model used to
describe the data can be updated if it is outdated and does not correspond to the
current data anymore. We propose the following simplified MDL objective:
      </p>
      <p>min over M of</p>
      <p>Σ_t C(M_t | M_{σ_t⁻¹({1})}) + C(X_t | M_t) + C(σ_t | M_t, X_t) + C(Y_t | M_t, X_t, σ_t)   (2)
where σ_t is a model association function such that σ_t(u) = 1 if model M_t can
be described with model M_u, and σ_t(u) = 0 otherwise. The consistency of our
framework with existing heuristic state-of-the-art methods has been established,
as well as the validity of a naive algorithm based on the same neural model as
presented for transfer learning.</p>
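      <p>The greedy decision behind such an objective can be sketched as follows: at each step, the current model is kept as long as coding the new batch with it stays cheaper than paying the description cost of a freshly fitted model. The cost functions below are illustrative assumptions, not the coding scheme of the paper; the point is that an update happens exactly when the data drift.</p>

```python
import math

def incremental_update(model_cost, code_length, fit_model, batches):
    """Greedy local optimization in the spirit of a sum-over-time MDL
    objective: at step t, keep the current model M while C(X_t | M) is
    cheaper than paying C(M') + C(X_t | M') for a fresh model M'."""
    current, history = None, []
    for batch in batches:
        candidate = fit_model(batch)
        new_cost = model_cost(candidate) + code_length(batch, candidate)
        if current is None or code_length(batch, current) > new_cost:
            current = candidate  # drift detected: the old model is outdated
        history.append(current)
    return history

# Toy 1-D stream with a mean model and a logarithmic residual cost. The
# stream drifts at the third batch, where the model is updated exactly once.
fit_model = lambda batch: sum(batch) / len(batch)
code_length = lambda batch, m: sum(2 * math.log2(1 + abs(v - m)) for v in batch)
model_cost = lambda m: 16.0
stream = [[0.1, 0.2, 0.0], [0.0, 0.1, 0.2], [10.1, 10.0, 9.9], [10.2, 10.0, 10.1]]
models = incremental_update(model_cost, code_length, fit_model, stream)
print(models)
```

      <p>This illustrates why only local optimizations are possible in the incremental setting: the decision at step t never revisits earlier batches.</p>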
      <p>The successful results obtained with MDL so far encourage future research
tracks. Several problems emerge from the developed framework. First, the
transfer objective (1) is valid for one target only. In practice, several target problems
may occur, hence a multi-target variant of transfer has to be given. In
particular, the i.i.d. hypothesis amounts to assuming infinitely many targets with specific
distributions. I am currently exploring an approach based on Pareto optimality,
implying a prior over the future and thus a new learning concept: concern for
future questions.</p>
      <p>Another question of interest is the theoretical validity of such approaches.
Unlike statistical learning, which has a clear measure of quality (given by the
risk), an approach based on MDL does not present any natural quality measure.
Such a function has to be found before an equivalent of PAC learning can be
developed. Additionally, incremental learning methods do not have access to the
whole objective function (2) at all steps: only local optimizations are possible.
A measure of the impact of this algorithmic simplification appears as a direct
consequence. Finally, we aim at finding a geometric interpretation of these
equations. An interesting track is offered by the domain of information geometry and
probability distribution manifolds. Under some specific conditions, Kolmogorov
complexity may be associated with a probability distribution, hence considered
from the perspective of information geometry.</p>
    </sec>
    <sec id="sec-3">
      <title>Approaches to analogy</title>
      <p>
        Because of its crucial role in my research on learning, I attach great
importance to studying analogical reasoning. Up to now, my research has focused on
Hofstadter's micro-world [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which presents highly general characteristics of
analogical thought. I will work on this research track with Dr. Laurent Orseau.
      </p>
      <p>
        Preliminary work has already given insightful results and promising
perspectives. I chose to work on the development of a small prototype language
generating Hofstadter's analogies (of the form ABC : ABD :: IJK : IJL,
which has to be read "ABD is to ABC what IJL is to IJK"). Among other
specifications, the proposed language had to be generative, i.e. describe a
dynamic generation rather than a static description (as opposed to the description
in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for instance). For example, the string ABC will be generated by the
program alphabet,sequence,3, which can be read as "Consider the alphabet and
take the sequence of its first three letters". Once such a language is defined, it is
turned into a prefix-delimited binary code, the length of which measures an
upper bound of Kolmogorov complexity.
      </p>
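      <p>To make the idea concrete, here is a minimal sketch of such a generative language. The token vocabulary and its semantics are hypothetical simplifications (the actual language is richer); with a fixed-length code over 8 tokens, every token costs 3 bits, and a fixed-length code is trivially prefix-delimited.</p>

```python
import math
import string

# Hypothetical toy vocabulary; the actual language of the paper is richer.
TOKENS = ["alphabet", "sequence", "successor", "repeat", "1", "2", "3", "4"]
BITS_PER_TOKEN = math.ceil(math.log2(len(TOKENS)))  # fixed-length prefix code

def run(program):
    """Interpret a token program, e.g. ['alphabet', 'sequence', '3'] -> 'ABC'."""
    it, base, out = iter(program), string.ascii_uppercase, ""
    for tok in it:
        if tok == "alphabet":
            base = string.ascii_uppercase
        elif tok == "sequence":   # take the first n letters of the base
            out = base[:int(next(it))]
        elif tok == "successor":  # replace the last letter by its successor
            out = out[:-1] + chr(ord(out[-1]) + 1)
    return out

def complexity(program):
    """Code length in bits: an upper bound on Kolmogorov complexity."""
    return len(program) * BITS_PER_TOKEN

print(run(["alphabet", "sequence", "3"]))               # -> ABC
print(run(["alphabet", "sequence", "3", "successor"]))  # -> ABD
print(complexity(["alphabet", "sequence", "3"]))        # -> 9 bits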
      <p>
        A more elaborate and general version of the language has recently been
proposed. This memory-based language offers flexibility in the management of
the user's prior knowledge and offers a simple way to set priorities on operations:
its grammar enables any possible operator, as long as the operator can be put
in long-term memory. The complexity of an element in memory is defined as the
complexity of its depth in memory. A more rigorous presentation of this new
framework, including the considerations on memory, will be presented at the ICCBR
2017 Computational Analogy Workshop [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>We propose a validation of our approach with a human experiment. In an
online survey, we submitted a few of Hofstadter's problems and asked participants
for their most intuitive solution. We showed that the majority answers correspond
to local minima of Kolmogorov complexity. These results are not yet published.</p>
      <p>The proposed approach offers a tool to compare two results of an analogical
problem when the generative instructions are given to the system. A first logical
direction is to provide automatic instruction generation, hence software able to
produce an optimal generative instruction for any complete analogy. Once this
is done, I will have to find a solution to an analogical problem (e.g. find the best
solution to ABC:ABD::IJK:?): because the space of solutions is infinite, it
cannot be explored naively, and search biases have to be found. In order to
address these issues, I am currently working on a Python interpreter for the
developed language.</p>
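      <p>A hedged sketch of what naive instruction generation could look like: enumerate programs of a toy language (hypothetical tokens, not the interpreter under development) in order of increasing code length and return the first that generates the target string. The combinatorial blow-up it suffers is precisely why stronger search biases are needed.</p>

```python
from itertools import product
import string

TOKENS = ["alphabet", "sequence", "successor", "1", "2", "3"]

def run(program):
    """Minimal toy interpreter (hypothetical language, for illustration only)."""
    it, base, out = iter(program), None, ""
    for tok in it:
        if tok == "alphabet":
            base = string.ascii_uppercase
        elif tok == "sequence":   # take the first n letters of the base
            out = base[:int(next(it))]
        elif tok == "successor":  # replace the last letter by its successor
            out = out[:-1] + chr(ord(out[-1]) + 1)
        else:
            raise ValueError("stray token")
    return out

def shortest_program(target, max_len=4):
    """Enumerate programs by increasing length (hence increasing code length
    under a fixed-length token code); return the first generating `target`."""
    for n in range(1, max_len + 1):
        for prog in product(TOKENS, repeat=n):
            try:
                if run(list(prog)) == target:
                    return list(prog)
            except (ValueError, TypeError, IndexError, StopIteration):
                continue  # ill-formed program: skip it
    return None

print(shortest_program("ABD"))  # -> ['alphabet', 'sequence', '3', 'successor']
```

      <p>Even in this six-token language, the search space grows exponentially with program length, so any realistic solver must prune it with biases on the operators.</p>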
    </sec>
    <sec id="sec-4">
      <title>Collaborations and side problems</title>
      <p>In the context of my research, I have the opportunity to collaborate on several
projects related to non-stationarity and transfer.</p>
      <p>
        A first project is led jointly with Dr. Jeremie Sublime (ISEP) and Dr. Basarab
Matei (Universite Paris 13) and concerns collaborative clustering. Classic
clustering consists in grouping similar data together in clusters. Collaborative and
multi-view clustering is a framework in which several clustering algorithms are
involved and try to influence each other. The algorithms do not produce the
same number of clusters, the same underlying model, nor the same final solution.
This problem is closely related to transfer because it involves sharing
information between several different domains. A first contribution has been proposed
using operations research tools to select relevant collaborators in an existing
probabilistic framework [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. A brand new approach, based on complexity, is
being developed: we expressed the problem of collaborative clustering in terms
of data compression and worked on minimal assumptions to obtain a tractable
model. This new perspective offers a theoretical background for a wide range of
heuristic state-of-the-art approaches and inspires a new algorithm [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
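      <p>The compression view of clustering can be illustrated with a small sketch: describe each point by its cluster index plus the residual to the cluster's center, and compare total code lengths. The coding costs below are illustrative assumptions, not the scheme developed in our work; the sketch only shows that grouping similar data yields a shorter description.</p>

```python
import math

def clustering_code_length(X, assignment, centers):
    """Code length of 1-D data described by a clustering: one cluster index
    per point plus a logarithmic cost for the residual to its center."""
    k = len(centers)
    index_cost = len(X) * (math.log2(k) if k > 1 else 0.0)
    residual_cost = sum(2 * math.log2(1 + abs(x - centers[c]))
                        for x, c in zip(X, assignment))
    return index_cost + residual_cost

X = [0.0, 0.1, 0.2, 9.9, 10.0, 10.1]
good = clustering_code_length(X, [0, 0, 0, 1, 1, 1], [0.1, 10.0])
bad = clustering_code_length(X, [0, 1, 0, 1, 0, 1], [0.1, 10.0])
print(good < bad)  # grouping similar data compresses better
```

      <p>In a collaborative setting, exchanging assignments between algorithms can then be framed as trading description length between views.</p>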
      <p>A related new collaboration has recently been initiated with Dr. Cristina
Manfredotti, Prof. Juliette Dibie (AgroParisTech) and Dr. Fatiha Sais (LRI),
exploring common approaches in transfer learning and structural mappings in
knowledge bases (ontology alignment).</p>
      <p>
        Finally, I am exploring the problem of transfer learning using boosting
algorithms with Prof. Antoine Cornuejols and Sema Akkoyunlu. The use of boosting
may offer new perspectives on transfer in general and be beneficial to my
understanding of transfer mechanisms. A first contribution has been submitted [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Cornuejols, A., Akkoyunlu, S., Murena, P.A., Olivier, R.: Transfer learning by boosting projections from the target domain to the source domain, submitted to NIPS 2017</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Cornuejols, A., Ales-Bianchetti, J.: Analogy and induction: which (missing) link? In: Workshop "Advances in Analogy Research: Integration of Theory and Data from Cognitive, Computational and Neural Sciences". Sofia, Bulgaria (1998)</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Hofstadter, D.: The Copycat project: An experiment in nondeterminism and creative analogies. AI Memo 755, Artificial Intelligence Laboratory, Massachusetts Institute of Technology (1984)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Li, M., Vitanyi, P.M.: An Introduction to Kolmogorov Complexity and Its Applications. Springer Publishing Company, Incorporated, 3rd edn. (2008)</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Murena, P., Cornuejols, A.: Minimum description length principle applied to structure adaptation for classification under concept drift. In: 2016 International Joint Conference on Neural Networks, IJCNN 2016, Vancouver, BC, Canada, July 24-29, 2016. pp. 2842-2849 (2016)</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Murena, P.A., Cornuejols, A., Dessalles, J.L.: Incremental learning with the minimum description length principle, accepted in International Joint Conference on Neural Networks 2017</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Murena, P.A., Dessalles, J.L., Cornuejols, A.: A complexity based approach for solving Hofstadter's analogies, accepted at Computational Analogy Workshop, ICCBR 2017</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Murena, P.A., Sublime, J., Matei, B., Cornuejols, A.: Multi-view clustering with the principle of minimum description length, submitted to ECML 2017</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Sublime, J., Matei, B., Murena, P.A.: Analysis of the influence of diversity in collaborative and multi-view clustering, accepted in International Joint Conference on Neural Networks 2017</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>