Building Ecosystem-Aware Tools Using the Ecosystem Monitoring Framework Boris Spasojević University of Bern, Switzerland spasojev@inf.unibe.ch Abstract Integrating ecosystem data into developer tools can be very beneficial but is usually complicated. By automating the routine parts of this task we can reduce the amount of work needed to develop these tools. We have developed a framework that allows developers to quickly develop new tools that use ecosystem data. This framework automates the execution of user-defined analyses on ecosystem projects, allowing the developer to focus only on what ecosystem data is needed for her tool and how to present it. 1 Introduction Leveraging ecosystem data in the development process through ecosystem-aware developer tools can benefit software quality and speed of development in many ways. A large body of work has been done in integrating ecosystem data i.e., data from other related projects, into the development process. Examples of such work deal with code completion [2], method and argument recommendations [1], library scoring systems [5], code snippet recommenders [13] and others. Papers published in this field rarely take into account software evolution. They focus heavily on the usability of their respective tools and provide no support once the data it is based on is out of date. We argue that tool developers are in need of tool support to gather ecosystem data and keep it up to date. We have implemented one such tool for the Pharo Smalltalk ecosystem and named it “Ecosystem Monitoring Framework” or EMF for short. The main goal of this framework is to enable developers to write tools that leverage ecosystem data (ecosystem-aware tools) without requiring them to spend time on the infrastructure for running their analyses on the ecosystem projects and keeping that data fresh. As shown in Figure 1, EMF is implemented as a client-server application. The server side is in charge of gathering, storing and providing the ecosystem data used by the clients — the ecosystem-aware tools. The server side periodically loads all the source code of all the projects, executes the user-defined analysis on all loaded source code and stores the results in a database to be provided to clients on demand. We use o↵ the shelf modules for the database (MongoDB) and data provider (REST server) and implement the loading of the source code and running the analyses in bash and Pharo Smalltalk. Section 2 gives a short description of the Pharo ecosystem we used and Section 3 gives a description of how to implement a simple tool using EMF. Using this framework, we augmented 3 existing Pharo tools with ecosystem data: a type inference engine [9], a system browser [11, 10] and a type guessing system [12]. We also developed a system that stores object-producing Copyright c by the paper’s authors. Copying permitted for private and academic purposes. Proceedings of the Seminar Series on Advanced Techniques and Tools for Software Evolution SATToSE2016 (sattose.org), Bergen, Norway, 11-13 July 2016, published at http://ceur-ws.org 1 Server side Data gathering Database module Data providing module Client A Client B Client C Figure 1: The architecture of EMF. snippets, enabling another level of augmentation for tools that require objects on demand. All of these tools are described in Section 4. Finally, we discuss the limitations of the framework as well as interesting directions for future work and improvements in Section 5. 2 The Pharo Ecosystem We call a group of software systems united by common and mutual dependencies a Software Ecosystem. This is inline with previous definitions of software ecosystems, most notably the one by Lungu et al.: “A software ecosystem is a collection of software projects that are developed and evolve together in the same environment.” [4] In this section we present a short description of the Pharo ecosystem as defined in the configuration browser. The configuration browser is a tool for automated loading of Pharo projects similar to Maven1 for java. It reads its list of project from a human maintained meta repository2 containing scripts that automatically load projects and their dependencies. Many of these dependencies are also defined in the same meta repository, supporting the claim that these projects co-evolve, and that changes in one project a↵ect others throughout the ecosystem. In Figure 2 we can see the dependencies (represented as edges in the graph) between projects (represented as the nodes) from the configuration browser. This graph does not show the dependency to the classes from the base image which all the projects have. It is clear from the dense network of edges in the graph that these projects are very interdependent. To discuss the size of the ecosystem, we present Figure 3. In this figure we can see the number of projects in the ecosystem on the first of each month in the interval between 01.01.2014 and 01.08.2016. We can see that there is linear growth of the number of projects until mid 2015 after which the number stabilises around 160 projects. This illustrates that this ecosystem evolves not just in the content of individual projects, but also in its scope, supporting further the need for providing the ecosystem-aware tools with fresh ecosystem data. It also illustrates the advantage of using a meta repository as the source of the ecosystem projects rather than using a fixed set of projects. 3 Implementing an Ecosystem-aware Tool Using EMF To better understand the process of developing ecosystem-aware tools we discuss how to implement a simple tool we call “Class name clash prevention tool”. Pharo Smalltalk does not have a concept of namespace. This means that di↵erent projects could define a class with the same name, which could cause problems if such project needed to co-exist in the same image. We call 1 https://maven.apache.org/ 2 http://smalltalkhub.com/mc/Pharo/MetaRepoForPharo30/main/ 2 01 .0 1 0" 20" 40" 60" 80" 100" 120" 140" 160" 180" 01 .20 .0 14 3 " 01 .20 .0 14 5 " 01 .20 .0 14 7 " 01 .20 .0 14 9 " 01 .20 .1 14 1 " 01 .20 .0 14 1 " 01 .20 .0 15 3 " 01 .20 .0 15 5 3 " 01 .20 .0 15 7 " 01 .20 .0 15 9 " 01 .20 .1 15 1 " 01 .20 .0 15 1 " 01 .20 .0 16 3 " 01 .20 .0 16 5 " 01 .20 .0 16 7. " 20 16 " Figure 2: Dependencies between projects in the Pharo Configuration browser. Figure 3: The growth of the Pharo ecosystem between 01.01.2014. and 01.08.2016. 1 ClassClashDataGatherer>>emfAnalysis 2 | toBeStoredToDb | 3 toBeStoredToDb := OrderedCollection new. 4 EMFAnalysisCore projectClasses 5 do: [ :pc | 6 | classToProject | 7 classToProject := Dictionary new 8 at: ’className’ put: pc name; 9 at: ’project’ put: EMFAnalysisCore projectName; 10 yourself. 11 toBeStoredToDb add: classToProject . ]. 12 EMFAnalysisCore 13 save: toBeStoredToDbCollection 14 toDB: ’classClash’ 15 inCollection: ’classClash’ , EMFAnalysisCore executionID Listing 1: Implementation of the class name clash prevention tool back end. this a class name clash. Class name clash prevention tool3 is an ecosystem-aware tool that informs a developer if a newly created class shares a name with another class in the ecosystem, thus preventing later class name clashes. All ecosystem-aware tools that leverage EMF are developed in 3 steps: developing the back end, registering the back end with EMF and developing the front end. At no point is the developer concerned about the scope or structure of the ecosystem, as this is defined in advance as part of EMF. The following subsections describe each of the steps in general and the case of the Class Clash prevention tool. 3.1 Step 1: Developing the Back End The role of the back end part of a ecosystem-aware tool is to gather relevant data from the ecosystem and store it for later access by a front end. In the case of the class name clash prevention tool relevant data consists of class names from all ecosystem projects mapped to the name of the project that defines it. EMF o↵ers an API to obtain a list of projects classes and the project name, so creating such a mapping is trivial. Since the default database in EMF is MongoDB we store this data as JSON documents represented in Pharo Smalltalk as Dictionary objects. EMF provides an API for storing the data to MongoDB. All the functionality of the class name clash prevention tool back end is wrapped in the emfAnalysis method of the ClassClashDataGatherer class of which source code is shown in Listing 1. This method iterates over all classes of the project (lines 4–5) and creates a new entry for the database (lines 7–11) which encodes the relation between the class name and the project name. Finally, the entries are stored to the database at the end (line 12). 3.2 Step 2: Registering the Back End with EMF EMF uses an XML configuration file to describe the steps needed to perform an analysis on a project. The configuration file for the class name clash prevention tool is shown in Listing 2. This configuration contains only one target (lines 2–16) as it is used only for the class name clash prevention tool. The XML schema allows multiple targets to be specified, one for each analysis. This target has a “pre” step (lines 4–10), specifying the Gofer4 script that loads the source code of the back end of the tool described in Subsection 3.1 and an “analysis” step (lines 12–15) specifying how to run the loaded back end code by invoking the emfAnalysis method on an instance of ClassClashDataGatherer. Since no post processing is needed for this tool, the “post” step is absent from the configuration. Once this configuration file has been provided to the EMF, all the specified steps from the configuration file, and thus all the analyses, will be run once a week to gather fresh data from the ecosystem. 3 http://smalltalkhub.com/#!/ spasojev/ClassClash ~ http://smalltalkhub.com/#!/~spasojev/ClassClashFrontEnd 4 http://pharobooks.gforge.inria.fr/PharoByExampleTwo-Eng/latest/Gofer.pdf 4 1 2 3
 4         
 5           Gofer new
 6               url: ’http://smalltalkhub.com/mc/spasojev/ClassClash/main/’;
 7               package: ’ConfigurationOfClassClashBackEnd’; load.
 8           (Smalltalk at: #ConfigurationOfClassClashBackEnd)
 9               loadDevelopment.
10         
11      
12 13 14 ClassClashDataGatherer new emfAnalysis. 15 16 17
18
Listing 2: EMF configuration file for the class name clash prevention tool. 3.3 Step 3: Developing the Front End Once the EMF data gatherer runs, and the data is gathered by the class name clash prevention tool back end, it is available for serving by the data providing module of EMF. At this point, the front end of the class name clash prevention tool can function. We implement the front end as a plugin for Nautilus5 which is the default system browser for Pharo. The plugin reacts to a new class being created. Once that event is detected, the front end sends the name of the new class to the data providing module which returns from the database a list of all projects that contain a class with that name. This list is then presented to the developer. 4 Ecosystem-aware tools In this section we describe the ecosystem-aware tools we developed using EMF. We developed these tools in order to demonstrate the versatility of EMF and to verify that it does satisfy the requirements of a unified framework for ecosystem-aware tools. 4.1 Ecosystem-aware type inference Dynamically typed languages lack information about the types of variables in the source code. Developers care about this information as it supports program comprehension. Basic type inference techniques are helpful, but may yield many false positives or negatives. In ecosystem-aware type inference we track how many times messages are sent to instances of available types throughout the source code for all available projects from the ecosystem. This means that the back end of the ecosystem-aware type inference builds a weighted mapping from types to selectors, where the weight is the number of times a message with that selector was sent to an instance of that class. We use this information in the front end to sort the potential types of a variable the developer cares about based on their likelihood of being the actual type in the context. The likelihood is computed based on how many times the messages sent to this variable have been observed to be sent to each potential type throughout the ecosystem. Using EMF, we implemented a prototype and used it to evaluate the approach. We show that, for our implementation, measuring the frequency of association between a message and a type throughout the ecosystem source code is helpful in identifying correct types [9]. 4.2 Frequently used methods Software developers are often unsure of the exact name of the method they need to use to invoke the desired behavior in a given context. This results in a process of searching for the correct method name in documentation, which can be lengthy and distracting to the developer. 5 http://smalltalkhub.com/#!/ ~Pharo/Nautilus 5 We can decrease the method search time by enhancing the documentation of a class with the most frequently used methods. Usage frequency for methods is gathered by analyzing other projects from the same ecosystem [10]. Using EMF, we implemented a proof of concept of the approach. We use the same data gathered for the ecosystem-aware type inference, but a di↵erent front end. Since methods are commonly searched for using a system browser called Nautilus, we developed a plugin for it which adds a section with the most commonly used methods for the currently observed class [11]. 4.3 Ecosystem-aware type guessing A common practice when writing Smalltalk source code is to name method arguments in a way that hints at their expected type (i.e., aString, anInteger, aDictionary). This practice makes code more readable and some tools (such as the auto complete feature in the Pharo Smalltalk code editor) improve the developer experience by “guessing” the type of the method argument based on these hints. Using EMF we gather argument names throughout the ecosystem and generate a weekly report containing information about commonly used ones. This report can be used by the Type-guessing tool developer to include heuristics for better type guessing. Used these reports we developed heuristics that improved Pharo type-guesser by almost 40% [12]. 4.4 Object repository Unlike the previously described ecosystem-aware tools, which are augmentations of existing tools with ecosystem data, the Object Repository is just a back end for multiple potential front ends. The main idea behind the Object Repository is to enable front end tools to have “Objects on demand”. This means that the Object Repository mines, from the ecosystem, code snippets that, when executed, produce an instance of some class. These snippets are then stored and mapped to the class they can instantiate. A front end needs only to provide a class name, and the Object Repository will provide all available snippets that instantiate that class. These snippets have many potential use cases in software documentation, testing, program comprehension etc. Our proof of concept implementation of the Object Repository relies on brute force execution of code segments obtained through converting AST nodes of methods to source code. We show that applying the proposed approach to the Pharo ecosystem, as defined in EMF, results in an Object Repository that can instantiate almost 80% of the available classes in these projects [8]. 5 Discussion and open questions In order to provide fresh data to the front end tools, EMF has to periodically re-run all the analyses on new versions of source code. In our implementation we chose a one week interval, but for some tools and some quickly evolving ecosystems this might not be enough. The main challenge here is that if we wish to have completely fresh data we need to re-run the analyses on every commit to every project in the ecosystem. This is especially problematic if the projects are hosted by a third party (e.g., GitHub, SmalltalkHub) which are not willing to give us notifications and full access at every commit, requiring us to poll these repositories for changes. One way this issue could be solved is if the code hosting providers o↵ered a cloud solution for running analyses on the source code. Much like other “infrastructure as a service” solutions this should o↵er tool developers access to computing machines with preloaded source code of required projects, allowing the developer access to fresh data and providing a source of monetization for the hosting service. If we manage to obtain every commit in real time there is still the issue that for every commit we need to re-analyze the entire ecosystem. Alternatively, we could express the analyses in such a way as to operate on single commits (i.e., “di↵s” in the source code), single class or single project. This would make them much less elegant and understandable. Further we could define a model of the ecosystem which is easier to update in real time and have the analyses be expressed in terms of that model, relying on source code only when absolutely necessary. In time, the model would have to be updated as new user needs are identified. Another open question is how to define the scope of an ecosystem. In our implementation we relied on a manually maintained meta-repository to define our ecosystem. It would be interesting to try to find a way to express certain constraints that would define the ecosystem of interest e.g., all projects using a particular library or framework. This ecosystem definition would allow to automatically extend or shrink the ecosystem scope as the individual projects evolve e.g., adopt or abandon the library we are interested in. The main challenge for adoption is that EMF is, in its current form, limited to Pharo Smalltalk. In order to port it to more popular languages such as Java we would need a concise and expressive way to analyse the 6 source code of Java projects. Smalltalk has the advantage that, due to its high reflectivity [7] and being partly self implementing, analyses can be written in Smalltalk itself, operating on the source code of the project in question. This is mainly not true for other languages, and we would need to use additional tools (e.g., Moose [6] or Rascal [3]) to analyse the source code. In the future, it might also be worthwhile considering porting EMF to cross language ecosystems e.g., the JVM ecosystem — the ecosystem of all languages that compile to Java byte code. This raises a whole new set of questions and challenges regarding interlanguage analysis, mapping concepts from one language to another, etc. Many of these can be addressed by using a language agnostic meta-model (e.g., Famix) but using any model reduces the amount of available information so finding the right level of abstraction is imperative. Throughout this paper we discussed ecosystem-aware tools relying only on source code of the projects in the ecosystem. Software development today produces a wide range of di↵erent artifacts that supplement the source code and are used as additional data sources for ecosystem-aware tools. Including this additional data into our ecosystem definition and into EMF would open new possibilities for developing di↵erent kinds of ecosystem-aware tools, but also raise a series of challenges on how to obtain, analyze or keep that data fresh. Finally, we still need a comprehensive study on user needs when it comes to ecosystem-aware tools. The tools we developed were based on our own intuition and, even though our analyses show that ecosystem data improves these tools, future work would be best served by finding exact user needs, and ensuring that EMF can support tools that tackle those problems. With all this said, we can still conclude that EMF is an easier way of developing ecosystem-aware tools. It frees the developer from the routine parts (scoping the ecosystem, loading the projects, executing the analysis on each project, storing the resulting data and providing it to client tools) and allows her to focus on what makes her new tool unique and useful. Also, the tools we developed using it support its applicability to a wide range of problems. Acknowledgments We gratefully acknowledge the financial support of the Swiss National Science Foundation for the project “Agile Software Analysis” (SNSF project No. 200020-162352, Jan 1, 2016 - Dec. 30, 2018). References [1] M. Acharya, T. Xie, J. Pei, and J. Xu. Mining API patterns as partial orders from source code: From usage scenarios to specifications. In Proceedings of the the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ESEC-FSE ’07, pages 25–34, New York, NY, USA, 2007. ACM. [2] M. Bruch, M. Monperrus, and M. Mezini. Learning from examples to improve code completion systems. In Proceedings of the the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on The Foundations of Software Engineering, ESEC/FSE ’09, pages 213–222, New York, NY, USA, 2009. ACM. [3] M. Hills, P. Klint, and J. J. Vinju. Scripting a refactoring with rascal and eclipse. In Proceedings of the Fifth Workshop on Refactoring Tools, WRT ’12, pages 40–49, New York, NY, USA, 2012. ACM. [4] M. Lungu, M. Lanza, T. Gı̂rba, and R. Robbes. The Small Project Observatory: Visualizing software ecosystems. Science of Computer Programming, Elsevier, 75(4):264–275, Apr. 2010. [5] Y. M. Mileva, V. Dallmeier, and A. Zeller. Mining api popularity. In Proceedings of the 5th International Academic and Industrial Conference on Testing - Practice and Research Techniques, TAIC PART’10, pages 173–180, Berlin, Heidelberg, 2010. Springer-Verlag. [6] O. Nierstrasz, S. Ducasse, and T. Gı̂rba. The story of Moose: an agile reengineering environment. In Proceedings of the European Software Engineering Conference (ESEC/FSE’05), pages 1–10, New York, NY, USA, Sept. 2005. ACM Press. Invited paper. [7] F. Rivard. Smalltalk: a reflective language. In Proceedings of REFLECTION ’96, pages 21–38, Apr. 1996. 7 [8] B. Spasojević, M. Ghafari, and O. Nierstrasz. The object repository, pulling objects out of the ecosystem. In Proceedings of the 11th Edition of the International Workshop on Smalltalk Technologies, IWST’16, pages 7:1–7:10, New York, NY, USA, 2016. ACM. [9] B. Spasojević, M. Lungu, and O. Nierstrasz. Mining the ecosystem to improve type inference for dynamically typed languages. In Proceedings of the 2014 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, Onward! ’14, pages 133–142, New York, NY, USA, 2014. ACM. [10] B. Spasojević, M. Lungu, and O. Nierstrasz. Overthrowing the tyranny of alphabetical ordering in docu- mentation systems. In 2014 IEEE International Conference on Software Maintenance and Evolution (ERA Track), pages 511–515, Sept. 2014. [11] B. Spasojević, M. Lungu, and O. Nierstrasz. Towards faster method search through static ecosystem analysis. In Proceedings of the 2014 European Conference on Software Architecture Workshops, ECSAW ’14, pages 11:1–11:6, New York, NY, USA, Aug. 2014. ACM. [12] B. Spasojević, M. Lungu, and O. Nierstrasz. A case study on type hints in method argument names in Pharo Smalltalk projects. In Proceedings of the 23rd IEEE International Conference on Software Analysis, Evolution, and Reengineering (SANER), volume 1, pages 283–292, Mar. 2016. [13] S. Thummalapenta and T. Xie. Parseweb: a programmer assistant for reusing open source code on the web. In Proceedings of the twenty-second IEEE/ACM international conference on Automated software engineering, ASE ’07, pages 204–213, New York, NY, USA, 2007. ACM. 8