# Submitting Spark Applications
# Prerequisites
- kubectl
- KUDO Spark installed (please refer to the Installation section)
To deploy a Spark application to Kubernetes using the KUDO Spark Operator, you describe it as a Kubernetes object: create a specification in YAML format with all the configuration the application requires.
Let's take a simple SparkPi application as an example. The YAML specification can be found here: spark-pi.yaml. This example assumes that you installed KUDO Spark into the spark namespace.
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: spark-pi
namespace: spark
spec:
type: Scala
mode: cluster
image: "mesosphere/spark:spark-3.0.0-hadoop-2.9-k8s"
imagePullPolicy: Always
mainClass: org.apache.spark.examples.SparkPi
mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar"
sparkConf:
"spark.ui.port": "4041"
sparkVersion: "3.0.0"
restartPolicy:
type: Never
driver:
cores: 1
memory: "512m"
labels:
version: 3.0.0
serviceAccount: spark-spark-service-account
executor:
cores: 1
instances: 2
memory: "512m"
labels:
version: 3.0.0
All of the Spark application configuration is placed under the spec section. Here you can specify Spark-related configuration properties, such as the number of executors, the number of cores for the driver and executors, the amount of memory, and so on. There is also a sparkConf section, where you can place configuration parameters as key-value pairs. In this example, we override the default spark.ui.port with a custom value.
Crucial to running this Spark job successfully is the serviceAccount. The service account name follows the format {spark-operator name}-spark-service-account. If you installed the Spark operator without specifying an instance name, its name will be spark-instance, so the service account name will need to be spark-instance-spark-service-account.
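To double-check which service account the operator created in your namespace (the exact name depends on the instance name chosen at install time), you can list the service accounts:
$ kubectl get serviceaccounts -n spark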
# Creating the application
After the application spec is ready, the following command can be used to submit it to the operator:
$ kubectl create -f ./resources/spark-pi.yaml
You can list all Spark applications with the following command:
$ kubectl get sparkapplications -n spark
NAME AGE
spark-pi 1m
In the example above, the -n flag specifies the namespace where the application is deployed.
To describe the spark-pi
application, use the following command:
$ kubectl describe sparkapplications spark-pi -n spark
To get the list of pods:
$ kubectl get pods -n spark
NAME READY STATUS RESTARTS AGE
spark-gq4bf8-spark-867b8c8d79-btt4q 1/1 Running 0 58m
spark-gq4bf8-spark-init-cdcrn 0/1 Completed 0 58m
spark-pi-1571911449587-exec-1 1/1 Running 0 4s
spark-pi-1571911449587-exec-2 1/1 Running 0 4s
spark-pi-driver 1/1 Running 0 11s
From the output above, we can see that the application has been created and is currently running as three pods: one driver and two executors (the remaining spark-gq4bf8-* pods belong to the Spark operator itself).
To see the driver pod's logs, use:
$ kubectl logs -f spark-pi-driver -n spark
To get the last n lines of a pod's log:
$ kubectl logs --tail=n spark-pi-driver -n spark
You can get the full description of the driver pod with the following command:
$ kubectl describe pod spark-pi-driver -n spark
Verify the driver pod has completed successfully:
$ kubectl logs --tail=20 spark-pi-driver -n spark | grep 'Pi is'
Pi is roughly 3.141644502283289
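You can also ask the operator for the application's state directly. The exact layout of the status fields can vary between operator versions, but with the v1beta2 API the state is typically exposed under status.applicationState.state:
$ kubectl get sparkapplication spark-pi -n spark -o jsonpath='{.status.applicationState.state}'
Once the job finishes successfully, this should print COMPLETED.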
# Configuring Logging
Logging can be configured by placing a custom log4j.properties file in the SPARK_CONF_DIR directory.
Spark Operator provides a mechanism for mounting Spark configuration files via K8s ConfigMap objects.
- Create a ConfigMap using the following log4j.properties as an example:
$ cat <<'EOF'>> log4j.properties
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
EOF
$ kubectl create configmap spark-conf-map --from-file log4j.properties -n spark
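To verify that the ConfigMap was created with the expected contents, you can inspect it:
$ kubectl get configmap spark-conf-map -n spark -o yaml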
- Then, add the following lines to the SparkApplication spec:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: spark-pi
namespace: spark
spec:
...
sparkConfigMap: spark-conf-map
...
executor:
javaOptions: "-Dlog4j.configuration=file:/etc/spark/conf/log4j.properties"
The contents of spark-conf-map will be placed under the /etc/spark/conf directory in the driver and executor pods, and the SPARK_CONF_DIR environment variable will be set to this directory.
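The example above applies the custom log4j.properties only to the executors. If you also want the driver to use it, you can set driver.javaOptions in the same way; a minimal sketch, assuming the same /etc/spark/conf mount path:
  driver:
    javaOptions: "-Dlog4j.configuration=file:/etc/spark/conf/log4j.properties"
  executor:
    javaOptions: "-Dlog4j.configuration=file:/etc/spark/conf/log4j.properties"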
# Updating the application
Let's say you want to update the application and increase the number of executors from 2 to 4.
To do so, modify the spec file and update the value of spec.executor.instances.
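For reference, the relevant fragment of the spec would then look like this (only the instances value differs from the original example):
  executor:
    cores: 1
    instances: 4
    memory: "512m"
    labels:
      version: 3.0.0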
Save the changes and apply the updated spec using the following command:
$ kubectl apply -f ./resources/spark-pi.yaml -n spark
Now let's verify the number of executors has changed:
$ kubectl get pods -n spark | grep spark-pi
spark-pi-1571999377454-exec-1 1/1 Running 0 118s
spark-pi-1571999377454-exec-2 1/1 Running 0 118s
spark-pi-1571999377454-exec-3 1/1 Running 0 117s
spark-pi-1571999377454-exec-4 1/1 Running 0 117s
spark-pi-driver 1/1 Running 0 2m4s
# Deleting the application
You can delete the application with the following command:
$ kubectl delete sparkapplication spark-pi -n spark
This will delete all the pods and services related to the application.
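To confirm the cleanup, list the pods again; the driver and executor pods should be gone (or in Terminating state) shortly after:
$ kubectl get pods -n spark | grep spark-pi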
# Accessing Spark UI
When running Spark applications on K8s, it can be useful to have access to the Spark UI to monitor and inspect Spark job executions in a web browser.
There are a few ways to expose the Spark UI for your application.
When an application is submitted to Kubernetes, the operator automatically creates a Service of the default type ClusterIP for the Spark UI, which makes the UI reachable from within the cluster:
$ kubectl get svc -n spark --field-selector metadata.name=spark-pi-ui-svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
spark-pi-ui-svc ClusterIP 10.0.5.197 <none> 4041/TCP 7s
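A ClusterIP service is only reachable from inside the cluster, so one quick way to reach it from a workstation is to port-forward to the service itself (this is functionally similar to the pod-level port forwarding described below):
$ kubectl port-forward svc/spark-pi-ui-svc 4040:4041 -n spark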
# Using Port Forwarding
Port forwarding works by forwarding connections made to a local port to a port of the pod that is running the Spark driver. With this connection in place, you can use your local workstation to access the Spark UI running in the driver pod.
Command example:
$ kubectl port-forward <resource-name> <local-port>:<container-port>
For the spark-pi application, the command will look like the following:
$ kubectl port-forward spark-pi-driver 4040:4041 -n spark
Forwarding from 127.0.0.1:4040 -> 4041
Forwarding from [::1]:4040 -> 4041
After that, the Spark UI should be available at http://localhost:4040.
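To quickly confirm that the forwarded port responds without opening a browser, you can issue a plain HTTP request against it; it should return a response from the Spark UI:
$ curl -sI http://localhost:4040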
# Using a Service
From the K8s documentation:
"When creating a service, you have the option of automatically creating a cloud network load balancer. This provides an externally-accessible IP address that sends traffic to the correct port on your cluster nodes provided your cluster runs in a supported environment and is configured with the correct cloud load balancer provider package."
Now let's create a service of a special type that will expose our Spark UI externally via a cloud network load balancer.
The service specification can look like the following:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: spark-pi-loadbalancer
  name: spark-pi-loadbalancer
spec:
  type: LoadBalancer
  selector:
    spark-role: driver
  ports:
    - protocol: TCP
      port: 80
      targetPort: 4041
Specifying spark-role: driver as the selector is the mechanism by which our service targets the required pod.
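If you want to confirm that the driver pod actually carries this label (Spark sets spark-role=driver on driver pods), you can display the pod's labels:
$ kubectl get pod spark-pi-driver -n spark --show-labels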
Create the service using kubectl create -f ./resources/spark-pi-service.yaml -n spark. Wait a few minutes, then check the services output to verify that our new service has an EXTERNAL-IP assigned:
$ kubectl get svc -n spark --field-selector metadata.name=spark-pi-loadbalancer
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
spark-pi-loadbalancer LoadBalancer 10.0.3.19 a55f8bba6615346149d96bf438d87438-1803869262.us-west-2.elb.amazonaws.com 80:31917/TCP 10m
Now you can use the a55f8bba6615346149d96bf438d87438-1803869262.us-west-2.elb.amazonaws.com address in your browser to access the Spark UI.
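As with port forwarding, you can verify the endpoint from the command line before opening it in a browser (substitute the hostname reported by your own cluster):
$ curl -sI http://a55f8bba6615346149d96bf438d87438-1803869262.us-west-2.elb.amazonaws.com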
Spark UI provides valuable insights into various aspects of an application, but its main drawback is that it's only available for the duration of the application. If you want to be able to explore an application's UI after it's finished, consider setting up the Spark History Server.
You also might want to consider setting up a complete monitoring solution to be able to visualize and analyze your Spark cluster performance using real-time metrics and dashboards. To do so, please refer to the Monitoring section of the documentation.