Towards a Software Product Line for Machine Learning Workflows: Focus on Supporting Evolution

Cécile Camillieri, Luca Parisi, Mireille Blay-Fornarino, Frédéric Precioso, Michel Riveill, Joël Cancela Vaz
camillie@i3s.unice.fr, parisi@i3s.unice.fr, blay@i3s.unice.fr, precioso@i3s.unice.fr, riveill@i3s.unice.fr, joel.cancelavaz@gmail.com
Université Côte d'Azur, CNRS, I3S, Bat Templier, 930 Route des Colles, Sophia Antipolis, France

ABSTRACT
The purpose of the ROCKFlows project is to lay the foundations of a Software Product Line (SPL) that helps the construction of machine learning workflows. Based on her data and objectives, the end user, who is not necessarily an expert, should be presented with workflows that address her needs in the "best possible way". To make such a platform durable, data scientists should be able to integrate new algorithms that can be compared to existing ones in the system, thus allowing the space of available solutions to grow. While comparing the algorithms is challenging in itself, Machine Learning, as a constantly evolving, extremely complex and broad domain, requires the definition of specific and flexible evolution mechanisms. In this paper, we focus on mechanisms based on meta-modelling techniques to automatically enrich a SPL while ensuring its consistency.

Keywords
Software Product Line, Machine Learning Workflow, Evolution

1. INTRODUCTION
The answer to the question "What Machine Learning (ML) algorithm should I use?" is always "It depends." It depends on the size, quality, and nature of the data. It also depends on what we want to do with the answer [21].

The industry of cloud-based machine learning (e.g., IBM's Watson Analytics, Amazon Machine Learning, Google's Prediction API) provides tools to learn "from your data" without having to worry about the cumbersome pre-processing and ML algorithms. To address such a challenge, they propose fully automated solutions to some classical learning problems such as classification. Other actors like Microsoft, with the Azure Machine Learning platform, allow users to build much more complex ML workflows in a graphical editor that is targeted towards ML experts.

The common point between these solutions is that they chose to select only a few algorithms, in comparison to the hundreds that are available. However, data scientists know that the best algorithm will not be the same for each dataset [22]. Moreover, new algorithms are regularly proposed by data scientists for dealing with more or less specific problems and for improving performance and accuracy [6]. Thus, in order to help users who want to build ML workflows, we have to propose a system that can present a large variety of algorithms to users, while helping them in their choices based on their data and objectives. At the same time, we should be able to extend the supported solutions, at least by incorporating new algorithms. The challenge is to hide the complexity of the choices from the end user and to revise our knowledge with each addition: an algorithm can become less efficient compared to a new one, while the introduction of new pre-processing operations can extend the reach of algorithms already present.

The contribution of this paper is thus to describe a tool-supported approach responding to this challenge: the ROCKFlows project (ROCKFlows stands for Request your Own Convenient Knowledge Flows).

The remainder of the paper is organized as follows. We discuss in the next section the challenges we face and some related works. Section 3 describes the architecture that supports the project and two usage scenarios, each focusing on a different user of ROCKFlows. We detail the evolution process and the correlated artefacts in Section 4. Section 5 concludes the paper and briefly discusses future work.

2. TOWARDS A SPL FOR MACHINE LEARNING WORKFLOWS
The purpose of the ROCKFlows project is to lay the foundations of a software platform that helps the construction of ML workflows. This task is highly complex because of the increasing number and variability of available algorithms and the difficulty in choosing suitable, parametrized algorithms and their combinations. The problem is not only choosing the proper algorithms, but also the proper transformations to apply to the input data. It is a trade-off between many requirements (e.g., accuracy, execution and training time).

Since Software Product Line (SPL) engineering is concerned with both variability and systematically reusing development assets in an application domain [5], we have based our project on SPL and model-driven techniques. SPL engineering separates two processes: domain engineering, for defining the commonality and variability of the product line, and application engineering, for deriving product line applications [15]. Similarly, the ROCKFlows project requires, on one hand, to build a consistent SPL, allowing end users to get reliable workflows, and on the other hand, to allow evolution of this SPL to integrate new algorithms and pre-processing treatments.

Based on these requirements, we have identified the following challenges, addressing the needs for building and evolving a SPL in a domain as complex and changing as ML, while ensuring a global consistency of the knowledge and the scalability of the system.

C1: Exploratory project in a complex environment.
Making a selection among the high number of data mining algorithms is a real challenge: hundreds of algorithms exist that can tackle a single ML problem such as classification. While work exists to try and rank their performance [6], it only gives an overview of which algorithms are best on average, not for a given specific problem and not according to different pre-processing pipelines.

Data scientists often approach new problems with a set of best practices acquired through experience. However, there is little scientific evidence as to why an algorithm or pre-processing technique leads to better results than another, and in which case. Thus, one of the biggest challenges for this project is to find a proper way to characterise algorithms and to compare them, relative to the very broad spectrum of user needs and data representations.

Collaboration between SPL developers and data scientists induces a complex software ecosystem [13], where some mathematical results may or may not find a correspondence at the end-user problem level. This heterogeneity of formalisms means that new evolutions are regularly discussed between the different stakeholders.

Given these domain requirements, meta-model driven engineering provides an efficient and powerful solution to address the complexity of the ecosystem through its support for separation of concerns and collaborations. In order to operationalize it, we chose to consider this environment as a set of components relying on different meta-models, for which the evolution mechanisms are exposed through services.

C2: SPL building in a constantly evolving environment.
While the number of ML algorithms and techniques constantly grows, the fundamental understanding of ML internal mechanisms is not stable enough to allow us to set any knowledge in stone. Both the domain and our understanding of it evolve quickly, forcing constant evolution of the SPL.

The line evolves in particular through the addition of new algorithms and pre-processings. We run experiments to identify dataset patterns leading to similar behavior of algorithms on different concrete datasets. The high number of possible combinations (variability of compositions and algorithms), as well as the frequent changes in ML, require evolution mechanisms that are both incremental and loosely coupled with the elements presented to the end user.

As of today, ROCKFlows' SPL contains roughly 300 features and 5000 constraints, representing 70 different ML algorithms and 5 pre-processing workflows, and is mostly focused on classification problems.

Evolution in SPLs has been a challenge for many years [15]. In particular, several works exist on the evolution of Feature Models (FM) [8, 1]. They propose different mechanisms for maintaining consistency of evolving FMs. In our case, we rely on these operations to update our models, but we had to encapsulate them in business-oriented services.

Moreover, despite the huge variability of the system, we have decided to propose to the end user only choices that can lead to a proper result. Hence, it should not be possible to build a configuration for which we would not be able to generate a workflow. For instance, it will not be possible for a user to select, relative to a given dataset, a performance value for which we have no algorithm that can reach such requirements. It is necessary for us to ensure a consistent configuration process [20].

Contrary to works that allow several users to modify a model in contradictory ways and then aim to reconcile those modifications [3], we are in a setting where only consistent evolutions are possible. Thus we did not have to handle co-evolution problems. Like the approach used in SPLEMMA [17], "Maintenance Services" define the semantics of evolution operations on the SPL, ensuring its consistency. However, the analogy between the meta-elements manipulated in ROCKFlows and in SPLEMMA is hard to establish, especially because our solution and problem spaces are in constant evolution. Thus the associated meta-models are not stable, implying intraspatial second degree evolutions [19], i.e., several spaces and mappings are simultaneously modified. For example, adding a new kind of non-functional property, due to some improvement of the experiment meta-model, involves extending the FM, the corresponding end user representation (problem space), and the generation tools to take this new feature into account.

C3: User Centered SPL.
We identify three stakeholders, each raising specific challenges.

- SPL users are the end users of the SPL: a user can be a neophyte who is looking for a solution to extract information from a dataset, as well as an expert who wants to check or learn dependencies among dataset properties, algorithms, targeted platforms and user objectives. They want to use a system that helps them to master the variability, i.e., to express their requirements and get their envisioned ML workflow. Neither of them knows about FMs, and they may need complementary information like examples of uses of the algorithms, details about the algorithm author or implementation, etc. Thus, the visualization of a FM as in standard tools is not adapted. Creating a user interface dedicated to the SPL is also problematic, given the changing nature of the system. At the same time, in an agile process, we need to test the SPL with users in order to align it with their needs, which are hard to identify a priori.

Some flowcharts have been designed to give users a rough guide on how to approach ML problems [16, 14]. ROCKFlows wants to make this approach operational. So, we do not aim for the construction of workflows by assembly, but for the automatic production of these workflows, without a direct contribution of the user to this construction process. Works such as these are, however, potential targets for the generation of the workflows, where the proposed optimizations could then be used automatically. Moreover, faced with the multitude of such systems (ClowdFlows [10], MLbase [11], Weka [9]), it is reasonable to allow the user to select her execution target(s), so that the production is limited only to the ML workflows implemented by these platforms.

- External Developers are domain experts who contribute to the SPL. They do not have all the knowledge of the system and contribute by adding new algorithms. They have to be able to contribute separately, with minimal interference.

- Internal developers are leaders of the SPL. They have the knowledge of the global architecture and manage the contributions of external developers in order to integrate them. They have to be able to maintain the platform and to ensure the consistency of all products despite the evolution of the ecosystem.

3. ARCHITECTURE FOR ROCKFlows

3.1 ROCKFlows Big Picture
Figure 1 represents the current proposed architecture for ROCKFlows at the component level. On the left of the central vertical line are the components that are necessary for the end user to configure her workflow. On the right are the components enabling users to add their own ML algorithms to the system. Relationships between the different components rely on meta-models.

Figure 1: High level architecture for ROCKFlows.

We now describe this architecture through two scenarios.

3.1.1 Scenario 1: Configuration of a ML Workflow
A user willing to configure a new ML workflow will do so through a web-based configuration interface (accessible at http://rockflows.i3s.unice.fr). The process requires at least the following steps, as visible on the left of Figure 1:

(a) The Graphical User Interface (GUI) requests display metadata on the Feature Model. Metadata associate each unique feature of the model with descriptions, references or other artefacts aiming to help non-experts in their choices. Figure 2 shows a screenshot of our GUI. Here, the user is presented with the choice of her main objective, in the form of questions.

(b) Once the FM is loaded and displayed properly with the metadata, the user configures the underlying FM by responding to questions. The Feature Model component in the figure exposes a web service that allows configuration of any FM, through the use of SPLAR's API [12].

(c) Once a valid and complete configuration has been defined, it is sent to the Workflow component. The configuration is then transformed into a Platform Independent workflow model that can, in a second step, be used to generate executable code for different target platforms.

(d) The Generator may require access to the base of algorithms handled by the system in order to be able to produce the proper code.

Figure 2: Screenshot of ROCKFlows' GUI.

Though it is not described here, depending on the user's preference, the generated workflow is either provided to the user or directly executed by the target platform.

3.1.2 Scenario 2: Submission of a new algorithm
We now focus on the introduction of new ML algorithms into the SPL. Such an action has impacts on several parts of the system:
- SPL: At least a new feature representing the algorithm should be added in the FM;
- GUI: Display metadata should be updated to reflect this new feature in the GUI;
- Generation: If we can execute the algorithm, the generator should be updated to allow it, either through code or through a reference to the corresponding element in the target platform(s);
- Experiments should be made with this new algorithm in order to compare it to the other known algorithms. Currently, the properties that are considered are the accuracy of the results, the execution time of the workflow, and memory usage. If experiments can be achieved, i.e., the Experiments module has access to the algorithm's execution and results, the SPL needs to be updated with performance information for this algorithm, but also for all the algorithms whose ranking has changed.

The role of the central component, SPLConsistencyManager, is to ensure that all required changes are made across the whole system. Through the present scenario, we describe how this component handles the impacts mentioned above.

(1) As an External Developer wants to add a new algorithm, the GUI presents her with the information that she needs to provide. Because this information is meant to change as the system encompasses more possibilities of Machine Learning, it should be easy to change. Hence, the SPLConsistencyManager provides the GUI with the information needed to add a new algorithm in the tool. The GUI then presents a generated form to the user.

(2) Once the External Developer has provided all information on the algorithm, it is sent to the Consistency Manager and dispatched among the other components.

(3) If the External Developer provided code to execute the algorithm, the Experiments component is requested to run tests on the algorithm, in order to find its performance. This component is described more precisely in the next section.

(4) Once experiments are finished, the manager analyses the results to find whether any inconsistency exists between the information provided by the user and the results. If not, the algorithm can be added both to the Feature Model and to the base of currently supported algorithms.

(5) Finally, once the feature has been added to the FM, its display metadata can be filled with the information provided by the expert.

3.2 Component for experiments on algorithms
As we want even non-expert users to be able to get the appropriate algorithm for their need, we chose to express higher level goals, such as best accuracy or quickest execution time, in the FM. This representation allows filtering out a number of algorithms by offering users trade-offs over these goals. In a second step, or in a more advanced mode, using other models such as requirements engineering Goal Models is considered. In combination with our FM, it could allow us, during configuration, to present the user with more precise expected values for her non-functional goals [2].

To be able to express knowledge such as "best", "average" or "worst" accuracy, we need an appropriate way to compare algorithms on a given problem. This ranking among algorithms is computed by our Experiment module. The component runs each known algorithm on the available compatible datasets and stores its performance on the different properties for each dataset. On top of that, it transforms the datasets with a set of available pre-processing operations and tests each algorithm again with those new sets.

Figure 3: Experiments

Algorithm results on similar datasets are then compared to one another to get a result similar to the one visible in Figure 3. Though it will not be discussed here, we have defined an algorithm based on classical ML and statistical methods to perform this comparison and regroup datasets into so-called dataset patterns. This knowledge can then be pushed into the FM in the form of constraints linking a functional objective, an algorithm, a dataset pattern and the ranking for each of the properties, also depending on the possible pre-processings for this dataset.

As presented in Section 3.1.2, the Experiment component is driven by the SPL Consistency Manager, which is in charge of starting experiments, then gathering and validating results before incorporating them into the SPL. As new algorithms are added, not all experiments need to be executed again; however, the ranking of the algorithms must be updated to take the newest algorithm into account.

4. ARTEFACTS TO SUPPORT EVOLUTION
This section discusses how we allow users to provide information about new algorithms, how we use it to update the system, and how we can ensure consistency despite the multiple impacts of these changes.

4.1 Meta-models
Figure 4 shows an excerpt of the meta-elements that deal with the evolution of the SPL. Each component described earlier corresponds to a meta-model, and the SPLConsistencyManager maintains consistency among them.

Figure 4: SPL Core Meta-models excerpt

4.1.1 SPL: Feature Model and Configuration
Our feature model is represented in the SXFM (Simple XML Feature Model) format (see http://ec2-52-32-1-180.us-west-2.compute.amazonaws.com:8080/SPLOT/sxfm.html). The rest of our SPL handling is also done through SPLOT's FM reasoning library, SPLAR (an excerpt of the meta-model used for both FM and configuration definition in the library can be found at https://github.com/FMTools/sxfm-ecore/blob/master/plugins/sxfm/model/sxfm.png).

4.1.2 Addressing non-expert users
In order to make the system as accessible as possible, additional information on features must be set, such as descriptions or examples, as well as the closed questions that will be asked to the end user during configuration. This metadata on the practical features is handled in a dedicated meta-model, AlgorithmDescriptionMM, and is used to build the GUIs that are presented to the end users and external developers. The model is briefly described in Section 4.2.

4.1.3 Handling the results of Experiments
Information such as the accuracy ranking of algorithms according to dataset patterns is managed by the Experiment component. The expected format of this data is designed in a dedicated meta-model, ExperimentPropertyMM. It evolves as we gain knowledge and experience.

4.2 Metadata on Algorithms
Data scientists adding their algorithms to the system need to provide at least:
- The high level ML objective the algorithm can be used for: classification of data, prediction of numerical values (regression), anomaly detection, etc.;
- Properties of the algorithm with regard to input data: which data types the algorithm supports, whether the algorithm can handle missing values in the data, etc.;
- A description of the algorithm, examples of its use, and references to publications or web pages describing the algorithm. As described in Section 4.1.2, those will be displayed in the configuration interface;
- If possible, code that will allow us to run the algorithm in our tool, so that we can both compare it to the other algorithms and provide executable workflows to end users.

Through the definition of these elements in the AlgorithmDescriptionMM meta-model, a form is automatically generated and presented to the domain expert. Thus, the model can be extended to add new properties for the algorithms, which will be automatically handled by the GUI. However, the impact of such changes on the FM still has to be handled by the SPL manager. We do not know whether tools such as the one described in [7] could help, because those changes mostly impact code.

4.3 Domain driven tooled approach to manage Feature Model evolution
Even though we have defined a single FM for ROCKFlows, we put a focus on separating concerns within it. The tree is currently separated into 4 sub-trees handling: input data description, user objectives, the processing algorithms, and the expected properties of the generated workflow. This separation of concerns provides a first level of modularity for the model.

Linking domain artefacts and FM structure. Metadata external to the FM itself defines particular points in the FM where features can be inserted. Only the sub-trees that need to be modified are considered. In our example, we add all algorithms responding to the classification problem in the same sub-tree. This mechanism enables us to extend the model cleanly, and to abstract ourselves from the exact hierarchy of the features. So, adding a new class of algorithms or modifying the structure of the FM can easily be achieved. Once the feature for the algorithm has been created, additional constraints need to be defined between the algorithm and other features, in particular those describing input data.

Generating domain constraints. Only certain types of constraints must be defined among those different sub-trees. For instance, algorithms can define constraints towards input data, such as "SVM implies Numerical Data", but never the other way around. However, such a constraint only applies in a workflow if no pre-processing is used. So, this constraint has to be transformed to express a constraint depending on the pre-processings that can be applied. The complexity of these cases, their multitude and the frequency of evolution led us to encapsulate the generation of those constraints in dedicated operators, working on given ensembles (e.g., the set of pre-processings that return numerical data). These operators also introduce features that are hidden from the end user, allowing us to tame this complexity.

It is interesting to note that, depending on the semantics associated with the features, different constraints should be generated. For instance, if an algorithm cannot deal with missing values, a constraint "algo excludes missing values" needs to be generated. In the opposite case, a constraint "algo implies missing values" should never be generated, because such an algorithm can still be used even if no missing value is present in the input data. As previously, this higher level knowledge about the features is defined in our metadata. It allows us to ensure that all necessary constraints are properly defined for all algorithms we add to the SPL.

5. CONCLUSION AND FUTURE WORK
In this paper, we have outlined some of the difficulties related to building and evolving a SPL for ML workflows. To handle the line's complexity and evolution, we have proposed an architecture organized around a set of meta-models and transformations encapsulated within services. A large number of complementary perspectives are considered, both on mechanisms to build the SPL and on the business approach of ML.

The complexity we are facing requires an agile and pragmatic approach, in which users are given the opportunity to provide feedback during the early stages of the project. The necessity for evolution of the system leads us today to consider moving towards approaches focused on the reuse of existing components [18]. This would allow for the necessary control over the system's evolution.

The domain of ML is currently booming with propositions for algorithms, distribution of execution and openness to a larger audience. We are considering extending our approach to integrate the parameters necessary for distributing workflow execution, as well as proposing deep learning workflows. A longer term question is to target Scientific Workflow Management systems, thus allowing us to explore data-driven execution through transformations targeting the specification interchange language called WISP [4].

Finally, we wish to allow the end user to express her needs in business terms, through an approach similar to IBM Watson Analytics (https://www.ibm.com/analytics/watson-analytics/). It is a question of integrating state of the art practices into our modelling, without losing the power of our evolutionary approach, driven by experiments.

6. REFERENCES
[1] M. Acher, P. Collet, P. Lahire, and R. France. Separation of Concerns in Feature Modeling: Support and Applications. In AOSD'12. ACM, 2012.
[2] O. Alam, J. Kienzle, and G. Mussbacher. Concern-Oriented Software Design, pages 604–621. Springer Berlin Heidelberg, 2013.
[3] A. Anwar, S. Ebersold, B. Coulette, M. Nassar, and A. Kriouile. A Rule-Driven Approach for composing Viewpoint-oriented Models. Journal of Object Technology, 9(2):89–114, 2010.
[4] B. F. Bastos, R. M. M. Braga, and A. T. A. Gomes. WISP: A pattern-based approach to the interchange of scientific workflow specifications. Concurrency and Computation: Practice and Experience, 2016.
[5] P. Clements and L. M. Northrop. Software Product Lines: Practices and Patterns. Addison-Wesley Professional, 2001.
[6] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim. Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? Journal of Machine Learning Research, 15:3133–3181, 2014.
[7] S. Getir, M. Rindt, and T. Kehrer. A generic framework for analyzing model co-evolution. In Proceedings of the Workshop on Models and Evolution co-located with MoDELS 2014, Valencia, Spain, pages 12–21, 2014.
[8] J. Guo, Y. Wang, P. Trinidad, and D. Benavides. Consistency maintenance for evolving feature models. Expert Systems with Applications, 39(5):4987–4998, 2012.
[9] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explor. Newsl., 11(1):10–18, Nov. 2009.
[10] J. Kranjc, V. Podpečan, and N. Lavrač. ClowdFlows: a cloud based scientific workflow platform. In Machine Learning and Knowledge Discovery in Databases, pages 816–819. Springer Berlin Heidelberg, 2012.
[11] T. Kraska, A. Talwalkar, J. Duchi, R. Griffith, M. J. Franklin, and M. Jordan. MLbase: A Distributed Machine-Learning System. In CIDR, 2013.
[12] M. Mendonça, M. Branco, and D. Cowan. S.P.L.O.T.: software product lines online tools. In OOPSLA, pages 761–762. ACM Press, 2009.
[13] T. Mens, M. Claes, P. Grosjean, and A. Serebrenik. Studying Evolving Software Ecosystems based on Ecological Models. In Evolving Software Systems, pages 297–326. Springer, 2014.
[14] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. http://scikit-learn.org/stable/tutorial/machine_learning_map/.
[15] K. Pohl, G. Böckle, and F. J. van der Linden. Software Product Line Engineering: Foundations, Principles and Techniques. Springer-Verlag, 2005.
[16] B. Rohrer. Machine learning algorithm cheat sheet for Microsoft Azure Machine Learning Studio. https://azure.microsoft.com/en-us/documentation/articles/machine-learning-algorithm-cheat-sheet/.
[17] D. Romero, S. Urli, C. Quinton, M. Blay-Fornarino, P. Collet, L. Duchien, and S. Mosser. SPLEMMA: a generic framework for controlled-evolution of software product lines. In International Workshop on Model-driven Approaches in SPL (MAPLE), pages 59–66, 2013.
[18] M. Schöttle, O. Alam, J. Kienzle, and G. Mussbacher. On the modularization provided by concern-oriented reuse. In Proceedings of MODULARITY'16, pages 184–189, New York, NY, USA, 2016. ACM.
[19] C. Seidl, F. Heidenreich, and U. Aßmann. Co-evolution of Models and Feature Mapping in Software Product Lines. In Proceedings of SPLC'12, pages 76–85, New York, NY, USA, 2012. ACM.
[20] S. Urli, M. Blay-Fornarino, and P. Collet. Handling Complex Configurations in Software Product Lines: a Tooled Approach. In Proceedings of SPLC'14, pages 112–121, Florence, Italy, 2014. ACM.
[21] I. H. Witten, E. Frank, and M. A. Hall. Data Mining: Practical Machine Learning Tools and Techniques. 2011.
[22] D. H. Wolpert. The lack of a priori distinctions between learning algorithms. Neural Comput., 8(7):1341–1390, Oct. 1996.
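The "only consistent configurations" policy of Section 2 can be sketched as follows. This is a minimal illustration, not ROCKFlows code: the dataset patterns, algorithm names and accuracy levels below are invented, and the real system reasons over the FM with SPLAR rather than over a lookup table.

```python
# Hypothetical knowledge base: (dataset_pattern, algorithm) -> accuracy level.
# In ROCKFlows this information comes from experiments; here it is hard-coded.
KNOWN_RESULTS = {
    ("small-numeric", "SVM"): "best",
    ("small-numeric", "NaiveBayes"): "average",
    ("large-sparse", "NaiveBayes"): "best",
}

def selectable_accuracy_levels(dataset_pattern):
    """Offer only the accuracy levels that at least one algorithm reaches,
    so every selectable choice can lead to a generatable workflow."""
    return sorted({level for (pattern, _), level in KNOWN_RESULTS.items()
                   if pattern == dataset_pattern})

def algorithms_reaching(dataset_pattern, level):
    """Algorithms that reach the requested level on this dataset pattern."""
    return sorted(algo for (pattern, algo), lvl in KNOWN_RESULTS.items()
                  if pattern == dataset_pattern and lvl == level)
```

For a "large-sparse" dataset, only "best" is offered: the user is never shown a performance value that no known algorithm can reach.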
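The step-wise configuration of Scenario 1 can be illustrated with a toy propagation step. This is a deliberately simplified sketch: the feature names and the single exclusion constraint are invented, and in ROCKFlows this reasoning is delegated to SPLAR over the full FM.

```python
# Invented example constraint: the two features cannot co-exist in a product.
EXCLUDES = {("SVM", "MissingValues")}

def remaining_choices(all_features, selected):
    """Features still selectable after propagating exclusion constraints,
    mimicking how the GUI only ever offers choices that stay consistent."""
    out = []
    for f in all_features:
        if f in selected:
            continue  # already part of the configuration
        if any((f, s) in EXCLUDES or (s, f) in EXCLUDES for s in selected):
            continue  # would violate a constraint
        out.append(f)
    return out
```

After the user states that her dataset contains missing values, an algorithm excluded by that choice simply disappears from the offered options.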
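The consistency check of Scenario 2, step (4), amounts to comparing the properties declared by the External Developer with what the experiments actually measured. A minimal sketch, assuming properties are plain key/value pairs (the property names below are invented stand-ins for AlgorithmDescriptionMM fields):

```python
def validate_submission(declared, measured):
    """Return the list of properties for which the developer's declaration
    contradicts the experiment results; an empty list means the algorithm
    may be added to the FM and to the base of supported algorithms."""
    issues = []
    for prop, claim in declared.items():
        if prop in measured and measured[prop] != claim:
            issues.append(prop)
    return issues
```

Only when the returned list is empty does the (sketched) manager proceed with updating the FM, the display metadata and the generator.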
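The incremental ranking update described for the Experiment component can be sketched as follows: stored results are reused, and only the ranking is recomputed when a new algorithm arrives. The accuracy figures and algorithm names are invented for illustration; the real module also covers execution time and memory usage.

```python
def rank_algorithms(accuracies):
    """Rank algorithms by measured accuracy on one dataset, best first."""
    return [algo for algo, _ in
            sorted(accuracies.items(), key=lambda kv: -kv[1])]

def add_algorithm(accuracies, name, score):
    """Incremental update: previous experiment results are kept as-is;
    only the new algorithm is measured, then the ranking is recomputed."""
    updated = dict(accuracies)
    updated[name] = score
    return rank_algorithms(updated)
```

This mirrors the paper's point that not all experiments need to be re-run: adding one algorithm costs one set of measurements plus a re-ranking.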
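To give an intuition of the feature-tree representation behind the FM, here is a much-simplified, SXFM-inspired sketch. It is not the real SXFM grammar (which is XML-based with mandatory/optional markers and cross-tree constraints); it only shows how an indented tree maps to (feature, parent) pairs.

```python
# Toy indented feature tree (feature names invented for illustration).
TREE = """\
Root
 InputData
  NumericalData
 Algorithms
  SVM
"""

def parse_tree(text):
    """Parse one-space-per-level indentation into (feature, parent) pairs."""
    pairs, stack = [], []
    for line in text.splitlines():
        name = line.strip()
        depth = len(line) - len(line.lstrip())
        stack = stack[:depth]          # drop siblings/descendants above depth
        parent = stack[-1] if stack else None
        pairs.append((name, parent))
        stack.append(name)
    return pairs
```

The separation into sub-trees discussed in Section 4.3 (input data, objectives, algorithms, workflow properties) hangs off such a root in the actual FM.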
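The form generation driven by AlgorithmDescriptionMM can be sketched as a mapping from meta-model fields to widgets. The field list and widget names below are simplified stand-ins, not the actual meta-model; the point is that adding a field to the model automatically extends the submission form.

```python
# Simplified stand-ins for AlgorithmDescriptionMM fields:
# (field name, field kind, allowed options).
FIELDS = [
    ("objective", "choice", ["classification", "regression", "anomaly detection"]),
    ("handles_missing_values", "boolean", None),
    ("description", "text", None),
]

def generate_form(fields):
    """Derive one form widget per meta-model field, so the GUI follows
    the meta-model's evolution without manual changes."""
    widgets = {"choice": "dropdown", "boolean": "checkbox", "text": "textarea"}
    return [{"name": name, "widget": widgets[kind], "options": options}
            for name, kind, options in fields]
```

Extending FIELDS with a new property immediately yields one more widget in the generated form; only the impact on the FM still needs manual handling, as noted above.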