Apache Spark is a data processing framework that performs processing tasks over large data sets quickly. It can also distribute data processing tasks across several computers, either on its own or in tandem with other distributed computing tools. These two qualities make it particularly useful in the worlds of big data and machine learning. Spark also features an easy-to-use API that reduces the programming burden associated with data crunching, taking on most of the heavy lifting of big data processing and distributed computing.
Step 1 – Install Java
Spark is a Java-based application, so you will need to install Java on your system. You can install Java using the following command:
dnf install java-11-openjdk-devel -y
Once Java is installed, you can verify the Java version with the following command:
java --version
You should get the Java version in the following output:
openjdk 11.0.8 2020-07-14 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.8+10-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.8+10-LTS, mixed mode, sharing)
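If more than one Java version is installed on your system, you can set the default java binary with the alternatives tool that ships with CentOS:

alternatives --config java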
Step 2 – Install Spark
First, you will need to download the latest version of Spark from its official website. You can download it with the following command:
wget https://mirrors.estointernet.in/apache/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
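Optionally, you can verify the integrity of the downloaded archive by comparing its SHA-512 checksum against the one published on the official Apache Spark downloads page:

sha512sum spark-3.0.1-bin-hadoop2.7.tgz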
Once the download is completed, extract the downloaded file with the following command:
tar -xvf spark-3.0.1-bin-hadoop2.7.tgz
Next, move the extracted directory to /opt with the following command:
mv spark-3.0.1-bin-hadoop2.7 /opt/spark
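Optionally, you can add the Spark binaries to your shell's PATH so that tools such as spark-shell and spark-submit can be run without typing the full path. A minimal sketch using a profile script:

echo 'export SPARK_HOME=/opt/spark' > /etc/profile.d/spark.sh
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> /etc/profile.d/spark.sh
source /etc/profile.d/spark.sh

This step is not required for the rest of this guide, since the systemd service files below use full paths.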
Next, create a separate user to run Spark with the following command:
useradd spark
Next, change the ownership of the /opt/spark directory to the spark user with the following command:
chown -R spark:spark /opt/spark
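You can confirm the new ownership with:

ls -ld /opt/spark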
Step 3 – Create a Systemd Service File for Spark
Next, you will need to create a systemd service file for the Spark master and slave.
First, create a master service file with the following command:
nano /etc/systemd/system/spark-master.service
Add the following lines:
[Unit]
Description=Apache Spark Master
After=network.target

[Service]
Type=forking
User=spark
Group=spark
ExecStart=/opt/spark/sbin/start-master.sh
ExecStop=/opt/spark/sbin/stop-master.sh

[Install]
WantedBy=multi-user.target
Save and close the file when you are finished, then create a Spark slave service with the following command:
nano /etc/systemd/system/spark-slave.service
Add the following lines:
[Unit]
Description=Apache Spark Slave
After=network.target

[Service]
Type=forking
User=spark
Group=spark
ExecStart=/opt/spark/sbin/start-slave.sh spark://your-server-ip:7077
ExecStop=/opt/spark/sbin/stop-slave.sh

[Install]
WantedBy=multi-user.target
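Note: replace your-server-ip in the ExecStart line with your server's actual IP address. If you are unsure of the address, you can list the addresses assigned to the server with:

hostname -I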
Save and close the file, then reload the systemd daemon with the following command:
systemctl daemon-reload
Step 4 – Start the Master Service
Now, you can start the Spark master service and enable it to start at boot with the following command:
systemctl start spark-master
systemctl enable spark-master
You can verify the status of the Master service with the following command:
systemctl status spark-master
You should get the following output:
● spark-master.service - Apache Spark Master
   Loaded: loaded (/etc/systemd/system/spark-master.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2020-10-02 10:40:40 EDT; 18s ago
  Process: 2554 ExecStart=/opt/spark/sbin/start-master.sh (code=exited, status=0/SUCCESS)
 Main PID: 2572 (java)
    Tasks: 32 (limit: 12523)
   Memory: 174.5M
   CGroup: /system.slice/spark-master.service
           └─2572 /usr/lib/jvm/java-11-openjdk-11.0.8.10-0.el8_2.x86_64/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spar>

Oct 02 10:40:37 centos8 systemd[1]: Starting Apache Spark Master...
Oct 02 10:40:37 centos8 start-master.sh[2554]: starting org.apache.spark.deploy.master.Master, logging to /opt/spark/logs/spark-spark-org.apac>
Oct 02 10:40:40 centos8 systemd[1]: Started Apache Spark Master.
You can also check the Spark log file to verify the master server:
tail -f /opt/spark/logs/spark-spark-org.apache.spark.deploy.master.Master-1-centos8.out
You should get the following output:
20/10/02 10:40:40 INFO Utils: Successfully started service 'MasterUI' on port 8080.
20/10/02 10:40:40 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://centos8:8080
20/10/02 10:40:40 INFO Master: I have been elected leader! New state: ALIVE
At this point, the Spark master server is started, and its web UI is listening on port 8080.
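If firewalld is running on your server, you will need to open the Spark ports before you can reach the dashboard from a remote machine. Port 8080 serves the master web UI, 7077 accepts worker and application connections, and 8081 serves the worker web UI:

firewall-cmd --permanent --add-port=8080/tcp
firewall-cmd --permanent --add-port=7077/tcp
firewall-cmd --permanent --add-port=8081/tcp
firewall-cmd --reload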
Step 5 – Access Spark Dashboard
Now, open your web browser and access the Spark dashboard using the URL http://your-server-ip:8080. You should see the Spark dashboard on the following page:
As the above page shows, no workers are attached to the master yet.
Step 6 – Start Slave Service
Now, start the Slave service and enable it to start at boot with the following command:
systemctl start spark-slave
systemctl enable spark-slave
Next, check the status of the Slave with the following command:
systemctl status spark-slave
You should get the following output:
● spark-slave.service - Apache Spark Slave
   Loaded: loaded (/etc/systemd/system/spark-slave.service; disabled; vendor preset: disabled)
   Active: active (running) since Fri 2020-10-02 10:45:12 EDT; 10s ago
  Process: 2671 ExecStart=/opt/spark/sbin/start-slave.sh spark://45.58.32.165:7077 (code=exited, status=0/SUCCESS)
 Main PID: 2680 (java)
    Tasks: 35 (limit: 12523)
   Memory: 197.9M
   CGroup: /system.slice/spark-slave.service
           └─2680 /usr/lib/jvm/java-11-openjdk-11.0.8.10-0.el8_2.x86_64/bin/java -cp /opt/spark/conf/:/opt/spark/jars/* -Xmx1g org.apache.spar>

Oct 02 10:45:09 centos8 systemd[1]: Starting Apache Spark Slave...
Oct 02 10:45:09 centos8 start-slave.sh[2671]: starting org.apache.spark.deploy.worker.Worker, logging to /opt/spark/logs/spark-spark-org.apach>
Oct 02 10:45:12 centos8 systemd[1]: Started Apache Spark Slave.
You can also check the Spark slave log file for confirmation:
tail -f /opt/spark/logs/spark-spark-org.apache.spark.deploy.worker.Worker-1-centos8.out
You should get the following output:
20/10/02 10:45:12 INFO Worker: Spark home: /opt/spark
20/10/02 10:45:12 INFO ResourceUtils: ==============================================================
20/10/02 10:45:12 INFO ResourceUtils: Resources for spark.worker:
20/10/02 10:45:12 INFO ResourceUtils: ==============================================================
20/10/02 10:45:12 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
20/10/02 10:45:12 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://centos8:8081
20/10/02 10:45:12 INFO Worker: Connecting to master 45.58.32.165:7077...
20/10/02 10:45:12 INFO TransportClientFactory: Successfully created connection to /45.58.32.165:7077 after 66 ms (0 ms spent in bootstraps)
20/10/02 10:45:13 INFO Worker: Successfully registered with master spark://centos8:7077
Now, go to the Spark dashboard and reload the page. You should see the newly registered worker on the page.
You can also access the worker directly using the URL http://your-server-ip:8081.
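To confirm that the cluster accepts jobs, you can run the SparkPi example that ships with the Spark distribution. This is a quick smoke test; the exact name of the examples jar may differ depending on the Spark build you downloaded:

/opt/spark/bin/spark-submit --master spark://your-server-ip:7077 --class org.apache.spark.examples.SparkPi /opt/spark/examples/jars/spark-examples_2.12-3.0.1.jar 100

If everything is working, the job output should include a line similar to "Pi is roughly 3.14".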
Conclusion
In this guide, you learned how to set up a single-node Spark cluster on CentOS 8. You can now easily expand this setup into a multi-node Spark cluster and use it for big data and machine learning processing. Set up Apache Spark on your VPS hosting account with Atlantic.Net!