Apache Spark

Apache Spark is an open-source cluster-computing framework for large-scale data processing. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark has become a popular tool for data analytics.

Spark supports Python (PySpark), R (SparkR), Java, and Scala. Examples for Python, Java, and Scala can be found in the Apache documentation (https://spark.apache.org/examples.html). The sample job below runs the pi estimation example from https://github.com/apache/spark/blob/master/examples/src/main/python/pi.py
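For reference, the linked pi.py is a Monte Carlo estimate: it scatters random points over the unit square and counts the fraction that land inside the quarter circle, which approximates pi/4. The listing below is a lightly condensed version of that example.

import sys
from random import random
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("PythonPi").getOrCreate()

    # Number of partitions controls how the sampling is spread over executors.
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def inside(_):
        # Sample a point in [-1, 1] x [-1, 1]; 1 if it falls in the unit circle.
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    count = spark.sparkContext.parallelize(range(1, n + 1), partitions) \
                              .map(inside).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()

The following Slurm script submits pi.py as a batch job.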

#!/bin/bash
#SBATCH -N 2                    # 2 nodes
#SBATCH -t 01:00:00             # 1 hour walltime
#SBATCH --ntasks-per-node 3     # 3 tasks per node
#SBATCH --cpus-per-task 5       # 5 cores per task (2 x 3 x 5 = 30 cores total)

module load spark               # set up the Spark environment variables
spark-start                     # launch a Spark cluster inside the job allocation
echo $MASTER                    # spark-start exports the master URL here
spark-submit --master $MASTER --total-executor-cores 30 pi.py

The spark module sets up the necessary environment variables, and the spark-start script launches a standalone Spark cluster (a master and its workers) within the job allocation, exporting the master's URL in the MASTER environment variable. Note that the 30 cores requested with --total-executor-cores match the 30 cores allocated by the Slurm directives above.
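As a minimal sketch (assuming, as the echo in the script suggests, that spark-start exports a URL of the form spark://host:port in $MASTER), an application run inside the allocation can also attach to the job's cluster explicitly instead of passing --master on the spark-submit command line:

import os

from pyspark.sql import SparkSession

# Attach to the cluster started by spark-start. Assumes the MASTER
# environment variable holds a URL such as spark://node001:7077.
spark = (SparkSession.builder
         .appName("AttachToJobCluster")
         .master(os.environ["MASTER"])
         .getOrCreate())

# Sanity check: print the number of cores the cluster exposes.
print(spark.sparkContext.defaultParallelism)

spark.stop()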