=Paper=
{{Paper
|id=None
|storemode=property
|title=Massively Scalable EDM with Spark
|pdfUrl=https://ceur-ws.org/Vol-1633/tu2-intro.pdf
|volume=Vol-1633
}}
==Massively Scalable EDM with Spark==
Massively Scalable EDM with Spark

Tristan Nixon<br/>
Institute for Intelligent Systems, University of Memphis<br/>
365 Innovation Drive, Memphis, TN, USA, 38152<br/>
t.nixon@memphis.edu

1. INTRODUCTION

The creation and availability of ever-larger datasets is motivating the development of new distributed technologies to store and process data across clusters of servers. Apache Spark has emerged as the new standard platform for developing highly scalable cluster-computing applications. It offers a wide range of connectors to numerous databases and enterprise data management systems, an ever-growing library of machine-learning algorithms, and the ability to process streaming data in near-real-time. Developers can write their applications in Java, Scala, Python, and R. Applications can be run locally (for easy development and testing) and deployed to dedicated clusters or to clusters leased from cloud-computing providers.

2. TUTORIAL

This day-long tutorial will provide a hands-on introduction to developing massively scalable machine learning and data mining applications with Spark. Participants will be expected to follow along with all examples on their own laptops throughout the tutorial, and to collaborate in small groups. All code used in the tutorial will either be taken from publicly available examples or be available for download from the IEDMS GitHub repository [1], and made available under a very liberal open-source license. All examples will be designed to process a modestly sized sample of a recent Cognitive Tutor dataset available from the PSLC DataShop [2]. In advance of the day, participants will be given instructions on how to install and configure Spark and Scala on their laptops, so that they might arrive at the tutorial ready to begin. Throughout the tutorial, participants will be given exercises and problems to solve in small groups. This will give them experience with the material as it is presented and hands-on practice with structuring a distributed application in Spark.

2.1 Outline

The following material will be covered in the course of the tutorial:
* An overview and history of cluster computing and the development of map-reduce
* An example of a very simple map-reduce algorithm (distributed word-count) in Spark
* An introduction to the Spark runtime model, including:
** Basic import and export operations
** Resilient distributed datasets (RDDs)
** RDD transformations and actions
** How Spark optimizes the execution of distributed computation
* An overview of the different deployment options for Spark, including:
** Launching and using the interactive Spark command-line shell program
** Running Spark programs locally on a single machine
** Launching a Spark cluster on Amazon Web Services
** Submitting applications to remote clusters
* An introduction to Spark Streaming
* An introduction to SparkSQL and working with DataFrames, including:
** How to load and manipulate an EDM dataset (KDD Cup data)
** Data representations needed to fit various EDM algorithms
* An introduction to Spark's machine-learning library MLlib, including:
** Transformers and Estimators
** Chaining transformers into machine-learning pipelines
** Examples of common EDM algorithms in Spark:
*** IRT algorithms using logistic regression (AFM, PFM, IFM)
*** BKT parameter fitting (brute-force, HMMs)

Any remaining time will be devoted to discussing potential applications that participants may have in mind for their own data or projects.

[1] https://github.com/IEDMS/spark-tutorial<br/>
[2] https://pslcdatashop.web.cmu.edu/
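The distributed word-count example listed in the outline uses the same flatMap/map/reduce chain in Spark as it does on ordinary Scala collections. The sketch below runs on an in-memory Seq so it needs no cluster; in Spark the input would come from sc.textFile and the final step would be reduceByKey(_ + _) on an RDD (groupMapReduce plays that role here).

```scala
object WordCountSketch {
  // Mirrors Spark's textFile -> flatMap -> map -> reduceByKey chain,
  // but on plain Scala collections so it runs without a cluster.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))          // like rdd.flatMap: lines -> words
      .map(word => (word, 1))            // like rdd.map: word -> (word, 1)
      .groupMapReduce(_._1)(_._2)(_ + _) // like rdd.reduceByKey(_ + _)

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("to be or not", "to be")))
}
```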
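The key point behind "RDD transformations and actions" is laziness: transformations only describe a computation, and nothing executes until an action is called. A rough local analogy, using a Scala view in place of an RDD (the evaluation counter is purely illustrative, not part of any Spark API):

```scala
object LazyPipeline {
  // Returns (evaluations before the "action", evaluations after, result).
  def run(): (Int, Int, Int) = {
    var evaluated = 0
    // A view is lazy, like RDD transformations: no work happens here
    val pipeline = (1 to 5).view
      .map { x => evaluated += 1; x * x } // like rdd.map
      .filter(_ % 2 == 1)                 // like rdd.filter
    val before = evaluated                // still 0: nothing has run
    val total = pipeline.sum              // like an action (reduce/collect)
    (before, evaluated, total)
  }

  def main(args: Array[String]): Unit = println(run())
}
```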
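MLlib's Transformer/Estimator contract, mentioned in the outline, can be illustrated with deliberately simplified toy types (the real MLlib classes operate on DataFrames and carry parameter maps): an Estimator's fit learns from data and returns a Transformer, and a pipeline is just fitted transformers applied in order.

```scala
// Toy, simplified stand-ins for MLlib's Transformer and Estimator.
trait Transformer { def transform(data: Seq[Double]): Seq[Double] }
trait Estimator   { def fit(data: Seq[Double]): Transformer }

// An Estimator learns a parameter (here, the mean) and returns a Transformer.
object Standardizer extends Estimator {
  def fit(data: Seq[Double]): Transformer = {
    val mean = data.sum / data.size
    (d: Seq[Double]) => d.map(_ - mean) // SAM conversion to Transformer
  }
}

object PipelineDemo {
  // Like an MLlib Pipeline: apply fitted stages in sequence.
  def chain(stages: Seq[Transformer]): Transformer =
    (d: Seq[Double]) => stages.foldLeft(d)((acc, t) => t.transform(acc))

  def main(args: Array[String]): Unit = {
    val data  = Seq(1.0, 2.0, 3.0)
    val model = Standardizer.fit(data) // estimator -> fitted transformer
    println(model.transform(data))     // centered values
  }
}
```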
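Brute-force BKT parameter fitting, listed under the EDM algorithm examples, amounts to scoring the BKT likelihood of an observed response sequence at every point of a parameter grid and keeping the best combination. A minimal single-skill sketch (the grid values and observation sequence are invented for illustration; in the tutorial setting the scoring loop is what would be distributed across a Spark cluster):

```scala
object BruteForceBKT {
  // Log-likelihood of a binary response sequence under standard BKT
  // parameters: pL0 (initial knowledge), pT (learn), pG (guess), pS (slip).
  def logLik(obs: Seq[Boolean], pL0: Double, pT: Double,
             pG: Double, pS: Double): Double = {
    var pL = pL0
    var ll = 0.0
    for (correct <- obs) {
      val pCorrect = pL * (1 - pS) + (1 - pL) * pG
      ll += math.log(if (correct) pCorrect else 1 - pCorrect)
      // Bayesian update on the observation, then the learning transition
      val post = if (correct) pL * (1 - pS) / pCorrect
                 else pL * pS / (1 - pCorrect)
      pL = post + (1 - post) * pT
    }
    ll
  }

  // Brute force: score every parameter combination on the grid.
  def fit(obs: Seq[Boolean], grid: Seq[Double])
      : ((Double, Double, Double, Double), Double) =
    (for { l0 <- grid; t <- grid; g <- grid; s <- grid }
      yield ((l0, t, g, s), logLik(obs, l0, t, g, s))).maxBy(_._2)

  def main(args: Array[String]): Unit = {
    val obs  = Seq(false, false, true, true, true)
    val grid = Seq(0.05, 0.15, 0.25, 0.35, 0.45)
    println(fit(obs, grid))
  }
}
```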