Scientific IT Services will offer a workshop for scientists on “big data” analysis with Apache Spark. This second iteration of the workshop will take place in two sessions: Nov. 23-25 and Nov. 30-Dec. 2, 2015. If you are interested, please see the Contact section below.

Motivation

Scientific data sets produced by today's state-of-the-art experiments are rapidly exceeding our ability to process them by simple means on a laptop or even a capable workstation. While it is possible to use machines with extremely large amounts of on-board memory (~1-3 TB), such machines are very expensive, and as data sets grow further this approach will not scale. Many scientists therefore struggle to make sense of such vast quantities of data, since doing so often requires knowledge of complex parallel programming tools. In addition, analyses of data sets this large may take a very long time to run, inhibiting creative data exploration.

Outside of academia, however, quickly extracting a signal from enormous quantities of data is daily practice. The best-known technology enabling such data crunching is Hadoop, an open-source framework implementing the MapReduce programming model. Such systems are used every day to analyze endless streams of data on the Internet in real time for various consumer, marketing, and business applications. This model of computation also holds great promise for state-of-the-art scientific data, since it is designed to scale to extremely large data sets on rather inexpensive hardware. In practice, however, Hadoop is a poor match for scientists: it is fairly cumbersome to use and not suited for interactive data exploration.
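
To make the idea concrete, here is a toy illustration (our own, not part of the workshop material) of the MapReduce programming model in plain Python: a "map" phase emits (key, value) pairs, and a "reduce" phase aggregates all values that share a key.

    from collections import defaultdict

    def map_phase(line):
        # Emit a (word, 1) pair for every word in the input line.
        return [(word, 1) for word in line.split()]

    def reduce_phase(pairs):
        # Sum the counts of all pairs that share the same word.
        counts = defaultdict(int)
        for word, n in pairs:
            counts[word] += n
        return dict(counts)

    lines = ["spark makes big data small", "big data big insight"]
    pairs = [pair for line in lines for pair in map_phase(line)]
    print(reduce_phase(pairs))  # e.g. 'big' appears 3 times in total

On a cluster, a framework like Hadoop runs many map and reduce workers in parallel and takes care of shuffling the intermediate pairs between them.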

In recent years a new framework named Spark has emerged that solves some of Hadoop's problems. Spark is likewise designed to handle analyses at any scale, from a single machine to many hundreds of computers. Unlike Hadoop, however, it is also built to be used interactively for data exploration via scripting languages like Python. This makes it particularly attractive for scientific applications, where the goal of an analysis is often not clear from the outset. Spark finally offers scientists a straightforward way to scale their analyses efficiently, and to do so from the familiar environment of a scripting language like Python.
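
To give a flavour of this, below is a minimal sketch of an interactive PySpark session in local mode; the file name and the parsing step are hypothetical, and in the pyspark shell a SparkContext named sc is created for you automatically.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "exploration-demo")

    # Load a (hypothetical) text file as a distributed collection of lines.
    lines = sc.textFile("measurements.txt")

    # Chain transformations interactively: drop empty lines, parse the
    # first column as a number, then ask for summary statistics.
    values = (lines.filter(lambda line: line.strip())
                   .map(lambda line: float(line.split()[0])))
    print(values.count(), values.mean())

    sc.stop()

The same script scales from a laptop to a cluster simply by pointing the SparkContext at a different master.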

Spark @ ETH

Over the past year, Scientific IT Services (SIS) has been exploring the usability of Spark on the existing central ETH computing infrastructure (including a prototype stand-alone Hadoop cluster and the Euler batch queue). To introduce the ETH scientific community to this framework, we have developed a short workshop.

The next round of the workshop will take place in two sessions at the end of November/beginning of December 2015. Each day will be split into a 2-hour morning session, usually consisting of lecture slides and discussion, followed by a longer hands-on session in the afternoon. The first two days may be adjusted depending on the knowledge and experience of the participants.

Workshop Outline

The general outline is as follows:

Day 1

  • introduction to distributed data analysis
  • review of key Python concepts (data types, functional programming; a short sketch follows this list)
  • basic introduction to concepts of Map/Reduce
  • introduction to Spark
  • discussion of the Spark system and nomenclature
  • hands-on sessions with Python and Spark basics on the laptop
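
As a taste of the Day 1 review, the sketch below shows the kind of functional constructs we have in mind (the exact material may differ); lambda, map, filter, and reduce translate almost one-to-one into Spark's API.

    from functools import reduce

    numbers = list(range(10))

    squares = list(map(lambda x: x * x, numbers))        # transform each element
    evens = list(filter(lambda x: x % 2 == 0, squares))  # keep a subset
    total = reduce(lambda a, b: a + b, evens)            # aggregate to one value

    print(squares, evens, total)  # the last number printed is 120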

Day 2

  • introduction to Spark on the ETH computing infrastructure (dedicated Hadoop cluster with HDFS, and via the batch queuing system on Euler)
  • start of an extensive hands-on session during which participants will become familiar with data ingestion/pre-processing and develop a full data analysis pipeline, including the use of the Spark machine learning library (a small sketch follows this list)
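
To give a rough idea of where Day 2 ends up, here is a hedged sketch that uses Spark's machine learning library (the RDD-based MLlib API) to cluster a handful of toy points; the data, the parameters, and the HDFS path mentioned in the comment are illustrative only.

    from pyspark import SparkContext
    from pyspark.mllib.clustering import KMeans

    sc = SparkContext("local[*]", "mllib-demo")

    # Toy two-dimensional points; a real pipeline would instead ingest data
    # from HDFS, e.g. via sc.textFile("hdfs:///..."), and pre-process it first.
    points = sc.parallelize([[0.0, 0.0], [0.1, 0.2], [9.0, 8.8], [9.2, 9.1]])

    model = KMeans.train(points, k=2, maxIterations=10)
    print(model.clusterCenters)        # the two learned cluster centres
    print(model.predict([0.05, 0.1]))  # cluster assignment for a new point

    sc.stop()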

Day 3

  • continuing work on the Spark analysis pipeline
  • developing a prototype on samples of own data

If you have a dataset you would like to analyze, bring a sample along! We can try to develop a prototype application on Day 3.

Structure

The entire workshop is envisioned to be mostly hands-on learning through the use of IPython notebooks, with only a minimal amount of "lecturing". The participants will have ample opportunities to ask questions, and our intention is that they will leave the workshop with ideas on how to apply this technology to their own daily workflows. While participants don't need to be computer science wizards, they will need to be fluent in basic programming constructs like for loops and functions. As we will use Python throughout the course, becoming familiar with it beforehand is required; please make use of the freely available online resources, or get in touch if you need assistance.

Contact

To reserve a spot in the workshop, contact Rok Roškar and indicate whether you would like to participate in week 1 (Nov. 23-25) or week 2 (Nov. 30-Dec. 2), or whether you have no preference (the two sessions will cover identical material). Please note that each session will be limited to 15 participants.