CEUR-WS Vol-1633: https://ceur-ws.org/Vol-1633/tu2-intro.pdf
                             Massively Scalable EDM with Spark
                                                           Tristan Nixon
                                                 Institute for Intelligent Systems
                                                        University of Memphis
                                                         365 Innovation Drive
                                                       Memphis, TN, USA, 38152
                                                       t.nixon@memphis.edu

1. INTRODUCTION
The creation and availability of ever-larger datasets is motivating the development of new distributed technologies to store and process data across clusters of servers. Apache Spark has emerged as the new standard platform for developing highly scalable cluster-computing applications. It offers a wide range of connectors to numerous databases and enterprise data management systems, an ever-growing library of machine-learning algorithms, and the ability to process streaming data in near real time. Developers can write their applications in Java, Scala, Python, and R. Applications can be run locally (for easy development and testing) and deployed to dedicated clusters or to clusters leased from cloud-computing providers.

2. TUTORIAL
This day-long tutorial will provide a hands-on introduction to developing massively scalable machine learning and data mining applications with Spark. Participants will be expected to follow along with all examples on their own laptops throughout the tutorial, and to collaborate in small groups. All code used in the tutorial will either be taken from publicly available examples or be available for download from the IEDMS GitHub repository1, and will be made available under a very liberal open-source license. All examples will be designed to process a modestly sized sample of a recent Cognitive Tutor dataset available from the PSLC DataShop2.

In advance of the day, participants will be given instructions on how to install and configure Spark and Scala on their laptops, so that they can arrive at the tutorial ready to begin. Throughout the tutorial, participants will be given exercises and problems to solve in small groups. This will give them experience with the material as it is presented and hands-on practice with structuring a distributed application in Spark.

2.1 Outline
The following material will be covered in the course of the tutorial:

     •    An overview and history of cluster computing and the development of map-reduce
     •    An example of a very simple map-reduce algorithm (distributed word-count) in Spark
     •    An introduction to the Spark runtime model, including:
               o    Basic import and export operations
               o    Resilient distributed datasets (RDDs)
               o    RDD transformations and actions
               o    How Spark optimizes the execution of distributed computation
     •    An overview of the different deployment options for Spark, including:
               o    Launching and using the interactive Spark command-line shell program
               o    Running Spark programs locally on a single machine
               o    Launching a Spark cluster on Amazon Web Services
               o    Submitting applications to remote clusters
     •    An introduction to Spark Streaming
     •    An introduction to SparkSQL and working with DataFrames, including:
               o    How to load and manipulate an EDM dataset (KDD Cup data)
               o    Data representations needed to fit various EDM algorithms
     •    An introduction to Spark's machine-learning library MLlib, including:
               o    Transformers and Estimators
               o    Chaining transformers into machine-learning pipelines
               o    Examples of common EDM algorithms in Spark:
                         §    IRT algorithms using logistic regression (AFM, PFM, IFM)
                         §    BKT parameter fitting (brute-force, HMMs)
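The distributed word-count example listed in the outline is the canonical map-reduce illustration. As a minimal local sketch (plain Python, no Spark; the partitioned input and function names are illustrative, not from the tutorial materials), the same map/reduce semantics look like this:

```python
from collections import Counter
from functools import reduce

# Hypothetical input: one list of text lines per "partition", standing
# in for an RDD of lines distributed across a cluster.
partitions = [
    ["to be or not to be"],
    ["to learn is to grow"],
]

# Map phase: each partition independently tokenizes its lines and
# counts words locally (analogous to the map-side combine that
# Spark's reduceByKey performs before the shuffle).
def count_partition(lines):
    return Counter(word for line in lines for word in line.split())

partial_counts = [count_partition(p) for p in partitions]

# Reduce phase: merge the per-partition counts by key, as the
# shuffle-and-reduce stage would across the cluster.
totals = reduce(lambda a, b: a + b, partial_counts, Counter())

print(totals["to"])  # "to" occurs twice in each partition: 4
```

The key property Spark exploits is visible even in this toy version: the map phase touches each partition independently, so it parallelizes with no coordination, and only the much smaller per-partition summaries cross the (simulated) network in the reduce phase.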
Any remaining time will be devoted to discussing potential applications that participants may have in mind for their own data or projects.
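The brute-force BKT fitting mentioned in the outline can be sketched locally without Spark. The toy version below (all names and the synthetic response sequence are illustrative, not taken from the tutorial materials) assumes the standard four-parameter BKT model and grid-searches the parameters to maximize the log-likelihood of one student's correct/incorrect sequence:

```python
import math
from itertools import product

def bkt_log_likelihood(obs, L0, T, S, G):
    """Log-likelihood of a binary response sequence under standard BKT.

    L0 = P(initial knowledge), T = P(learn), S = P(slip), G = P(guess).
    """
    L = L0
    ll = 0.0
    for correct in obs:
        p_correct = L * (1 - S) + (1 - L) * G
        p = p_correct if correct else 1 - p_correct
        ll += math.log(max(p, 1e-12))
        # Posterior P(known | response), then the learning transition.
        post = (L * (1 - S) / p_correct) if correct else (L * S / (1 - p_correct))
        L = post + (1 - post) * T
    return ll

def brute_force_fit(obs, step=0.1):
    """Exhaustive grid search over (L0, T, S, G); returns (best_ll, params)."""
    grid = [round(step * i, 2) for i in range(1, int(1 / step))]
    best = (float("-inf"), None)
    for L0, T, S, G in product(grid, repeat=4):
        if S >= 0.5 or G >= 0.5:  # conventional caps to avoid degenerate fits
            continue
        ll = bkt_log_likelihood(obs, L0, T, S, G)
        if ll > best[0]:
            best = (ll, (L0, T, S, G))
    return best

# Toy response sequence: mostly incorrect early, correct later,
# consistent with gradual learning.
obs = [0, 0, 1, 0, 1, 1, 1, 1]
ll, (L0, T, S, G) = brute_force_fit(obs)
print("best log-likelihood:", ll, "params:", (L0, T, S, G))
```

The grid search is embarrassingly parallel, which is what makes it a natural Spark exercise: each worker can evaluate a disjoint slice of the parameter grid (or a disjoint set of skills), and only the per-slice best results need to be reduced.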



1 https://github.com/IEDMS/spark-tutorial
2 https://pslcdatashop.web.cmu.edu/