PBS
This section walks through how to submit and run a Spark job using PBS on the compute nodes of Hammer.
pbs-spark-submit launches an Apache Spark program within a PBS job: it starts the Spark master and worker processes in standalone mode, runs a user-supplied Spark job, and stops the master and worker processes afterwards. The Spark program and its associated services are constrained by the resource limits of the PBS job and are terminated when the job ends. This effectively allows PBS to act as a Spark cluster manager.
The following steps assume that you have a Spark program that can run without errors.
To use Spark and pbs-spark-submit, you need to load the following two modules, which set up the SPARK_HOME and PBS_SPARK_HOME environment variables.
module load spark
module load pbs-spark-submit
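To confirm that the modules loaded correctly, you can print both variables; the exact paths depend on the installed versions.
echo $SPARK_HOME
echo $PBS_SPARK_HOME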
The following example submission script serves as a template for building your own, more complex Spark job submissions. This job requests 2 whole compute nodes for 10 minutes and submits to the standby queue.
#PBS -N spark-pi
#PBS -l nodes=2:ppn=20
#PBS -l walltime=00:10:00
#PBS -q standby
#PBS -o spark-pi.out
#PBS -e spark-pi.err
cd $PBS_O_WORKDIR
module load spark
module load pbs-spark-submit
pbs-spark-submit $SPARK_HOME/examples/src/main/python/pi.py 1000
The last line of the submission script above submits the pi.py program to the nodes that are allocated to your job:
pbs-spark-submit $SPARK_HOME/examples/src/main/python/pi.py 1000
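Assuming the script above is saved as spark-pi.sub (the file name is arbitrary), submit it and monitor its progress with the standard PBS commands:
qsub spark-pi.sub        # returns the job ID
qstat -u $USER           # check the status of your jobs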
You can set various environment variables in your submission script to change the settings of the Spark program. For example, the following line sets SPARK_LOG_DIR to $HOME/log. The default value is the current working directory.
export SPARK_LOG_DIR=$HOME/log
The same environment variables can be set via pbs-spark-submit command line arguments. For example, the following sets SPARK_LOG_DIR to $HOME/log2.
pbs-spark-submit --log-dir $HOME/log2
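In a real submission, the command line arguments are combined with the program to run. For example, the following invocation (reusing the pi.py example from above) writes the Spark logs to $HOME/log2:
pbs-spark-submit --log-dir $HOME/log2 $SPARK_HOME/examples/src/main/python/pi.py 1000
The table below summarizes the supported environment variables, their defaults, and the corresponding command line arguments.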
| Environment Variable | Default | Shell Export | Command Line Argument |
|---|---|---|---|
| SPARK_CONF_DIR | $SPARK_HOME/conf | export SPARK_CONF_DIR=$HOME/conf | --conf-dir |
| SPARK_LOG_DIR | current working directory | export SPARK_LOG_DIR=$HOME/log | --log-dir |
| SPARK_LOCAL_DIR | /tmp | export SPARK_LOCAL_DIR=$RCAC_SCRATCH/local | N/A |
| SCRATCHDIR | current working directory | export SCRATCHDIR=$RCAC_SCRATCH/scratch | --work-dir |
| SPARK_MASTER_PORT | 7077 | export SPARK_MASTER_PORT=7078 | N/A |
| SPARK_DAEMON_JAVA_OPTS | none | export SPARK_DAEMON_JAVA_OPTS="-Dkey=value" | -D key=value |
Note that SCRATCHDIR must be a shared scratch directory across all nodes of a job.
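As a minimal sketch, the body of a submission script might combine several of these settings before invoking pbs-spark-submit; the directory names below are only illustrative.
mkdir -p $HOME/spark-logs $RCAC_SCRATCH/spark-work
export SPARK_LOG_DIR=$HOME/spark-logs            # where the Spark daemons write their logs
export SPARK_LOCAL_DIR=$RCAC_SCRATCH/spark-work  # scratch space for Spark's local files
export SCRATCHDIR=$RCAC_SCRATCH/spark-work       # must be shared across all nodes of the job
pbs-spark-submit $SPARK_HOME/examples/src/main/python/pi.py 1000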
In addition, pbs-spark-submit supports command line arguments that change the properties of the Spark daemons and the Spark jobs. For example, the --no-stop argument tells pbs-spark-submit not to stop the master and worker daemons after the Spark application finishes, and the --no-init argument tells it not to initialize the Spark master and worker processes. These arguments are intended for use in a sequence of invocations of Spark programs within the same job.
pbs-spark-submit --no-stop $SPARK_HOME/examples/src/main/python/pi.py 800
pbs-spark-submit --no-init $SPARK_HOME/examples/src/main/python/pi.py 1000
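Within a single PBS job, the two commands above can be chained so that the second run reuses the daemons started by the first; a minimal sketch of the corresponding job script body follows.
cd $PBS_O_WORKDIR
module load spark
module load pbs-spark-submit
# First run: start the master and worker daemons and leave them running afterwards.
pbs-spark-submit --no-stop $SPARK_HOME/examples/src/main/python/pi.py 800
# Second run: reuse the running daemons; without --no-stop they are stopped when this run finishes.
pbs-spark-submit --no-init $SPARK_HOME/examples/src/main/python/pi.py 1000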
Use the following command to see the complete list of command line arguments.
pbs-spark-submit -h
To learn programming in Spark, refer to the Spark Programming Guide.
To learn how to submit Spark applications, refer to Submitting Applications.