Setting up a Spark cluster using Docker Compose


Prerequisites
Before we begin, make sure that you have Docker and Docker Compose installed on your system.
Step 1: Create a Docker Compose file
Create a new file called docker-compose.yml in your project directory and add the following code to it:
version: '2'
services:
  spark-master:
    image: docker.io/bitnami/spark:3.3
    container_name: spark-master
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
    ports:
      - '8080:8080'
      - '7077:7077'
  spark-worker1:
    image: docker.io/bitnami/spark:3.3
    container_name: spark-worker1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
  spark-worker2:
    image: docker.io/bitnami/spark:3.3
    container_name: spark-worker2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
      - SPARK_USER=spark
Alternatively, you can download Bitnami's reference docker-compose.yml instead of writing your own (note that its service names may differ slightly from the file above):
curl -LO https://raw.githubusercontent.com/bitnami/containers/main/bitnami/spark/docker-compose.yml
This defines three services: a Spark master node (spark-master) and two Spark worker nodes (spark-worker1 and spark-worker2). The spark-master service publishes ports 8080 and 7077 to the host, which you can use to access the Spark Web UI and to connect to the Spark master from your PySpark code.
Step 2: Start the Spark cluster
To start the Spark cluster, open a terminal window in your project directory and run the following command:
docker-compose up
This will download the necessary Docker images and start the Spark master and worker nodes. Once the containers are running, open the Spark Web UI at http://localhost:8080 to monitor the status of your cluster and confirm that both workers have registered with the master.
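Before moving on, you can also run a quick sanity check from Python. The sketch below assumes PySpark 3.3 is installed on your host (matching the cluster image) and that you are connecting from the same machine that published port 7077; it simply runs a trivial job on the cluster:
from pyspark.sql import SparkSession

# Connect to the master that Docker Compose published on localhost:7077
spark = SparkSession.builder \
    .appName('ClusterSanityCheck') \
    .master('spark://localhost:7077') \
    .getOrCreate()

# A trivial distributed job: sum the integers 0..999 on the workers
total = spark.range(1000).groupBy().sum('id').collect()[0][0]
print(total)  # expected output: 499500

spark.stop()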
Step 3: Connect to the Spark cluster from PySpark
Now that your Spark cluster is up and running, you can connect to it from your PySpark code. Here's some sample code to get you started:
from pyspark.sql import SparkSession

# create a new SparkSession
spark = SparkSession.builder \
    .appName('MyApp') \
    .master('spark://localhost:7077') \
    .getOrCreate()

# read some data from a CSV file
df = spark.read \
    .format('csv') \
    .option('header', 'true') \
    .load('/path/to/my/data.csv')

# do some processing on the data
result = df.groupBy('category') \
    .agg({'price': 'max', 'quantity': 'sum'}) \
    .orderBy('category')

# write the result to a Parquet file
result.write \
    .format('parquet') \
    .mode('overwrite') \
    .save('/path/to/my/result.parquet')

# stop the SparkSession
spark.stop()
This code creates a new SparkSession and connects to the Spark master at spark://localhost:7077. It then reads data from a CSV file, processes it with the DataFrame API (grouping by category and aggregating the maximum price and total quantity), and writes the result to a Parquet file. Finally, it stops the SparkSession.
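Since each worker in the Compose file is capped at 1 GB of memory and one core, it can also help to size executor requests to fit within those limits when you build the SparkSession. Here is a minimal sketch; the exact values are assumptions you should adjust to your own compose settings:
from pyspark.sql import SparkSession

# Request executors that fit inside the 1G / 1-core workers defined in
# docker-compose.yml; asking for more would leave the application waiting
# for resources the cluster cannot provide.
spark = SparkSession.builder \
    .appName('MyApp') \
    .master('spark://localhost:7077') \
    .config('spark.executor.memory', '512m') \
    .config('spark.executor.cores', '1') \
    .getOrCreate()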
Conclusion
In this blog post, we walked through the steps of setting up a Spark cluster using Docker Compose with Bitnami images and connecting to it from PySpark. With this setup, you can easily experiment with PySpark and build big data applications without having to worry about the complexities of setting up a distributed cluster.

Wuttichai Kaewlomsap
Sr. Data Engineer