=Paper=
{{Paper
|id=Vol-1492/Paper_18
|storemode=property
|title=Knowledge Pit - A Data Challenge Platform
|pdfUrl=https://ceur-ws.org/Vol-1492/Paper_18.pdf
|volume=Vol-1492
|dblpUrl=https://dblp.org/rec/conf/csp/JanuszSSR15
}}
==Knowledge Pit - A Data Challenge Platform==
Knowledge Pit - A Data Challenge Platform⋆ Andrzej Janusz1 , Dominik Slezak1,2 , Sebastian Stawicki1,2 , and Mariusz Rosiak 1 Institute of Mathematics, University of Warsaw, Banacha 2, 02-097, Warsaw, Poland 2 Infobright Inc., Poland Krzywickiego 34, lok. 219, 02-078 Warsaw, Poland {janusza,slezak,stawicki}@mimuw.edu.pl mariusz.rosiak@gmail.com http://www.dominikslezak.org Abstract. Knowledge Pit (https://knowledgepit.fedcsis.org) is a web platform created to facilitate organization of data mining competitions. Its main aim is to stimulate collaborative research for solving practical problems related to real-life applications of predictive analysis and decision support systems. What makes Knowledge Pit different from other data challenge platforms is the fact that it is a non-commercial project focusing on a collaboration with international con- ferences. It promotes the idea of open research and encourages young researchers to involve in projects related to data science. The platform can also be used as a e- learning tool to support data mining courses and for defining interesting student projects. In this paper we discuss the architecture of Knowledge Pit and high- light its main functionalities. We also overview some of the already finished data challenges that were organized using our web platform. Key words: data mining competitions, collaborative research, web platform, e- learning 1 Introduction In this short paper we briefly describe a web platform, called Knowledge Pit, created in order to support organization of data mining competitions. On the one hand, this platform is appealing to members of the machine learning community for whom com- petitive challenges can be a source of new interesting research topics. Solving real-life complex problems can also be an attractive addition to academic courses for students who are interested in practical data mining. On the other hand, setting up a publicly available competition can be seen as a form of outsourcing the task to the community. This can be highly beneficial to the organizers who define the challenge, since it is an inexpensive way to solve the problem which they are investigating. Moreover, an open data mining competition can become a bridge between domain experts and data analysts. In a longer perspective, it may leverage a cooperation between industry and academic researchers. ⋆ The authors are partially supported by Polish National Science Centre grant DEC- 2012/05/B/ST6/03215 and by Polish National Centre for Research and Development grants PBS2/B9/20/2013 and O ROB/0010/03/001. 192 external interaction n issio subm try en sub mis ent sion ry ore n sc issio sub entry mis bm su sio n ion iss bm e su scor sub mis s sco ion re Fig. 1. A system architecture of the Knowledge Pit web platform. 2 System Architecture The Knowledge Pit platform is designed in a modular way, on top of an open-source e- learning platform Moodle.org [1] and as such, it follows the best practices of a software development. The current modules of the platform include user accounts management system, competition management subsystems, time and calendar functionalities, com- munications features (i.e. forums and messaging subsystems), and a flexible interface for connecting automated evaluation services prepared to assess contestants’ submis- sions. Figure 1 shows an architecture schema of the Knowledge Pit platform. Its two main parts are the platform’s engine located at a dedicated server and the evaluation subsys- tems. Currently, Knowledge Pit is hosted on a server belonging to Polish Information Processing Society (http://pti.org.pl/) and is located in the fedcsis.org domain. The two main parts of the platform are the platform’s engine and the evaluation subsystems. The first one provides interfaces for defining and maintaining of data challenges, management of user’s profiles, submissions and private files, maintaining Leaderboards3 , and the internal messaging systems (competition forums, chats, as well as email and notification sending services). It is based on a very popular solution stack, i.e. Apache, MySQL and PHP [6, 9, 2]. Together they constitute a bridge between the platform and different groups of users (guests, participants of competitions, moderators and organizers of particular challenges, managers and administrators of the system). The second part of the platform is responsible for assessment of solutions submit- ted by participants of particular competitions. Due to a flexible communication mech- anism, this service may be distributed among several independent workstations, which guarantees the scalability of the evaluation process. Since evaluating submissions for 3 A competition’s Leaderboard is an on-line ranking of participants competing in that particular data challenge. 193 some competitions may require a lot of resources (e.g. memory, CPU time, disc I/O or database connections), this is a very important aspect of system’s architecture. For example, the assessment of a single submission to AAIA’14 Data Mining Competition required constructing several Naive Bayes classification models for a data table con- sisting of 50, 000 objects and testing their performance on a different table with 50, 000 objects described by 11, 852 conditional attributes [3]. In that case, distribution of the required computations allowed for nearly real-time evaluation, even during the most busy moments of the competition. Another advantage of separating the evaluation subsystems from the platform’s en- gine is that it may be implemented in any suitable programming language, as a script or a stand alone compiled application that can use any external libraries. In this way, the responsibility for preparation of a suitable evaluation procedure can be delegated to organizers of individual competitions. In such a case, the only requirement for the im- plementation of the evaluator is that it should maintain a correct protocol of information exchange with the platform’s engine. This flow of responsibilities frees Knowledge Pit from the things which it cannot cope with in a generic way. It also gives competition organizers a very flexible method of expressing their data mining task in a form of a fully customizable evaluation procedures. For instance, the evaluation procedure can be implemented in R language [8], in a form of a script that runs independently on several machines. 3 Examples of Data Challenges Hosted by Knowledge Pit Knowledge Pit inaugurated in the beginning of 2014 and since then continues to orga- nize successful data mining competitions in cooperation with international conferences. By June 2015 it had hosted 4 major competitions and a few local student projects. It currently has over 700 active users who participated in at least one data challenge and this number grows with every new competition. Below we list the recent competitions and shortly describe their scope. Typically, after completion of a contest, its overview and detailed descriptions of top solutions are published in proceedings of the associ- ated conference. 3.1 AAIA’14 Data Mining Competition AAIA’14 Data Mining Competition: Key risk factors for Polish State Fire Service4 took place between February 3, 2014 and May 7, 2014. In this challenge the focus was on the feature selection problem and the data came from the public safety domain. We asked members of the machine learning community to identify characteristics extracted from the EWID reports [5], which are useful for predicting whether any people were harmed during a given incident. 4 Web page: https://knowledgepit.fedcsis.org/contest/view.php?id=83 194 3.2 AAIA’15 Data Mining Competition AAIA’15 Data Mining Competition: Tagging Firefighter Activities at a Fire Scene5 took place between March 9, 2015 and June 5. It was a continuation of the contest initiated during the previous edition of the data challenge associated with International Sym- posium on Advances in Artificial Intelligence and Applications (the AAIA conference series) [3]. The topic was related to real-time screening of firefighters’ vital functions and monitoring of ongoing physical activities at the incident scene [7]. 3.3 PAKDD’15 Data Mining Competition PAKDD’15 Data Mining Competition: Gender Prediction Based on E-commerce Data6 took place between March 23, 2015 and May 3 of the same year. The task in this com- petition was to reconstruct the information about user’s gender from product viewing logs from an on-line store. The data set was obtained from simulations of product view- ing activities for user with known gender and was provided by FTP Group - the leading information and communication technology enterprise in Vietnam. The results of this competition were presented at a major Asia-Pacific data mining conference PAKDD’15 and were acclaimed by the industry representatives from FTP Group. 3.4 IJCRS’15 Data Challenge IJCRS’15 Data Challenge: Mining Data from Coal Mines7 started April 13, 2015 and lasted until June 25, 2015. The task was to come up with a prediction model which could be effectively applied to foresee warning levels of methane concentrations at three methane meters placed in a longwall of the mine [4]. The data used in the compe- tition came from an active Polish coal mine. They consisted of multivariate time series corresponding to readings of sensors used for monitoring the safety conditions at the longwall. 4 Future Development of the Platform In this paper we briefly described our data challenge platform called Knowledge Pit. It is worth noticing that this non-commercial project is far from complete. We are contin- uously searching for new topics of data mining contests related to important practical issues. We are also working on developing new features and functionalities for our plat- form. One example of such a feature is a support for an evaluation system that not only assesses submissions of participants with regard to their predictive quality but also tries to grasp adaptiveness of the proposed solutions, i.e. how fast they can produce results with sufficient quality and how much training data do they need. 5 Web page: https://knowledgepit.fedcsis.org/contest/view.php?id=106 6 Web page: https://knowledgepit.fedcsis.org/contest/view.php?id=107 7 Web page: https://knowledgepit.fedcsis.org/contest/view.php?id=109 195 References 1. Cole, J.: Using Moodle. O’Reilly, first edn. (2005) 2. Isaacson, P.C.: Building a simple website using open source software (gnu/linux, apache, mysql, and python). J. Comput. Sci. Coll. 19(1), 286–288 (Oct 2003), http://dl.acm.org/ citation.cfm?id=948737.948777 3. Janusz, A., Krasuski, A., Stawicki, S., Rosiak, M., Śl˛ezak, D., Nguyen, H.S.: Key Risk Factors for Polish State Fire Service: a Data Mining Competition at Knowledge Pit. In: Ganzha, M., Maciaszek, L.A., Paprzycki, M. (eds.) Proceedings of FedCSIS’14. pp. 345–354 (2014) 4. Janusz, A., Śl˛ezak, D., Sikora, M., Wróbel, Ł., Stawicki, S., Grzegorowski, M., Wojtas, P.: Mining data from coal mines: IJCRS’15 Data Challenge. In: Proceedings of IJCRS’15 (2015), in print November 2015 5. Krasuski, A., Janusz, A.: Semantic tagging of heterogeneous data: Labeling fire & rescue incidents with threats. In: Ganzha, M., Maciaszek, L.A., Paprzycki, M. (eds.) Proceedings of FedCSIS’13. pp. 295–302 (2013) 6. Lee, Ware, B.: Open Source Development with LAMP: Using Linux, Apache, MySQL and PHP. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA (2002) 7. Meina, M., Janusz, A., Rykaczewski, K., Śl˛ezak, D., Celmer, B., Krasuski, A.: Tagging fire- fighter activities at the emergency scene: Summary of aaiaŠ15 data mining competition at Knowledge Pit. In: Ganzha, M., Maciaszek, L.A., Paprzycki, M. (eds.) Proceedings of FedC- SIS’15 (2015), in print September 2015 8. R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria (2008), http://www.R-project. org 9. Rosebrock, E., Filson, E.: Setting Up LAMP: Getting Linux, Apache, MySQL, and PHP Working Together. SYBEX Inc., Alameda, CA, USA (2004)