# KUDO Spark Operator Configuration
## Default configuration
The default KUDO Spark configuration enables installation of the following resources:
- CRDs (Custom Resource Definitions) for `SparkApplication` and `ScheduledSparkApplication`
- Service Account, `ClusterRole`, and `ClusterRoleBinding` for the Operator
- Service Account, `Role`, and `RoleBinding` for Spark Applications (Drivers)
- Spark Operator Controller which handles submissions and launches Spark Applications
- Mutating Admission Webhook for customizing Spark driver and executor pods based on the specification
- `MetricsService` for Spark Operator and Spark Applications to enable metrics scraping by Prometheus
The full list of configuration parameters and their defaults is available in the KUDO Spark params.yaml.
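To sanity-check a default installation, the created resources can be listed with `kubectl`; the CRD group below follows the upstream Spark Operator naming, and the `spark-operator` namespace is an assumption based on the examples in this document:
```bash
# List the CRDs registered for Spark applications
# (the sparkoperator.k8s.io group follows the upstream Spark Operator naming)
kubectl get crds | grep sparkoperator.k8s.io

# Check the operator pod, services, and service accounts
# (the spark-operator namespace is an assumption; adjust it to your install)
kubectl get pods,services,serviceaccounts -n spark-operator
```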
## Docker
The Docker images used by KUDO Spark and the image pull policy can be specified by providing the following parameters:
```bash
kubectl kudo install spark --instance=spark-operator \
  -p operatorImageName=mesosphere/kudo-spark-operator \
  -p operatorVersion=3.0.0-1.1.0 \
  -p imagePullPolicy=Always
```
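To double-check which image the operator is actually running after installation, the deployments in the namespace can be inspected; this is a generic `kubectl` sketch, with the namespace taken from the examples above:
```bash
# Print each deployment in the namespace together with its first container image
# (the spark-operator namespace is an assumption; adjust it to your install)
kubectl get deployments -n spark-operator \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'
```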
## Monitoring
Metrics reporting is disabled by default and can be enabled by providing the `-p enableMetrics=true` parameter:
```bash
kubectl kudo install spark --instance=spark-operator -p enableMetrics=true
```
When enabled, KUDO Spark installs Metrics Services and Service Monitors to enable metrics scraping by Prometheus. More information about how KUDO Spark exposes metrics and how to configure them is available in the KUDO Spark Operator Monitoring documentation.
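A quick way to confirm that the monitoring resources exist is to list them; note that the `ServiceMonitor` kind is only available when the Prometheus Operator CRDs are installed in the cluster, and the namespace below is an assumption:
```bash
# List the metrics Services and ServiceMonitors created by KUDO Spark
# (ServiceMonitor requires the Prometheus Operator CRDs; the namespace is an assumption)
kubectl get services,servicemonitors -n spark-operator
```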
## Spark History Server
KUDO Spark comes bundled with Spark History Server, which is disabled by default. It can be enabled via `-p enableHistoryServer=true`:
```bash
kubectl kudo install spark --instance=spark-operator -p enableHistoryServer=true
```
More information about Spark History Server configuration is available in the Spark History Server Configuration documentation.
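Once enabled, the History Server UI can be reached from a workstation via a port-forward; the service name below is hypothetical and should be replaced with the one created by your installation (Spark History Server listens on port 18080 by default):
```bash
# Forward the History Server UI to localhost:18080
# (the service name is hypothetical; find the real one with `kubectl get svc -n spark-operator`)
kubectl port-forward -n spark-operator svc/spark-history-server 18080:18080
```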
## Service Accounts and RBAC
By default, KUDO Spark will create service accounts, roles, and role bindings with preconfigured permissions required for running Spark Operator and submitting Spark Applications.
### Service Accounts
The following parameters configure service account settings for Spark Operator:
```bash
kubectl kudo install spark --instance=spark-operator \
  -p operatorServiceAccountName=<service account name> \
  -p createOperatorServiceAccount=<true|false>
```
To use an existing service account for the Operator, the `createOperatorServiceAccount` parameter should be set to `false`.
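For example, a pre-created service account can be wired in as follows; the account name is illustrative and the namespace is an assumption:
```bash
# Create the service account up front (name and namespace are illustrative)
kubectl create serviceaccount spark-operator-sa -n spark-operator

# Install KUDO Spark against the existing account and skip automatic creation
kubectl kudo install spark --instance=spark-operator \
  -p operatorServiceAccountName=spark-operator-sa \
  -p createOperatorServiceAccount=false
```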
Service account configuration for Spark Applications follows the same pattern:
```bash
kubectl kudo install spark --instance=spark-operator \
  -p sparkServiceAccountName=<service account name> \
  -p createSparkServiceAccount=<true|false>
```
To use an existing service account for Spark, the `createSparkServiceAccount` parameter should be set to `false`.
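The Spark service account is the one driver pods run under; inside a `SparkApplication` spec it is referenced through the driver's `serviceAccount` field. A minimal sketch, with illustrative names:
```yaml
# Minimal sketch: run the driver under the account passed via sparkServiceAccountName
# (application name and account name are illustrative)
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-app
  namespace: spark-operator
spec:
  driver:
    serviceAccount: <service account name>
```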
### RBAC
A KUDO Spark `ClusterRole` with the necessary permissions is created automatically, but this can be disabled by providing the `-p createRBAC=false` parameter:
```bash
kubectl kudo install spark --instance=spark-operator -p createRBAC=<true|false>
```
KUDO Spark requires a `ClusterRole` in order to handle custom resources, listen to events, and work with secrets, configmaps, and services. The full list of permissions is available in spark-operator-rbac.yaml.
If the `createRBAC` parameter is set to `false`, a `ClusterRole` and `ClusterRoleBinding` should be created or already exist before the installation. The `ClusterRoleBinding` should be linked to the Service Account used by the Operator.
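Assuming such a `ClusterRole` already exists, the binding to the Operator's service account could be created like this; all names are placeholders:
```bash
# Bind an existing ClusterRole to the Operator's service account
# (all names are placeholders; the namespace is an assumption)
kubectl create clusterrolebinding spark-operator-rb \
  --clusterrole=<cluster role name> \
  --serviceaccount=spark-operator:<operator service account name>
```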
Spark Applications submitted to KUDO Spark also require permissions to launch Spark Executors and monitor their state. For this purpose, KUDO Spark creates a `Role` for them and binds it to a service account provided via `-p sparkServiceAccountName=<service account name>`.
If the `createRBAC` flag is set to `false`, a `Role` (with a `RoleBinding` linked to the Spark Service Account) should be created or already exist prior to the submission of Spark Applications. The `Role` configuration and the list of required permissions are available in the spark-rbac.yaml template file.
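Analogously to the cluster-scoped case above, a pre-created `Role` can be bound to the Spark service account in the application namespace; the names below are placeholders:
```bash
# Bind an existing Role to the Spark service account in the application namespace
# (names are placeholders; the namespace is an assumption)
kubectl create rolebinding spark-driver-rb \
  --role=<role name> \
  --serviceaccount=spark-operator:<spark service account name> \
  -n spark-operator
```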
## Integration with AWS S3
This section describes the steps required for configuring secure access to S3 buckets for Spark workloads and Spark History Server. With this method, all sensitive data is stored as a Kubernetes Secret and mounted into pods via environment variables.
In order to store AWS credentials in a `Secret`, the credentials should be converted to base64 first:
```bash
$ echo "<AWS_ACCESS_KEY_ID>" | base64
QVdTX0FDQ0VTU19LRVlfSUQK
$ echo "<AWS_SECRET_ACCESS_KEY>" | base64
aVJhTVlQZVM1NTJzdFNpaDhVWDNEVkRyMndaMXpZOGxtWWlOKy9TQwo=
```
Create a `Secret` with the following contents:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: aws-credentials
  namespace: spark-operator
type: Opaque
data:
  AWS_ACCESS_KEY_ID: QVdTX0FDQ0VTU19LRVlfSUQK
  AWS_SECRET_ACCESS_KEY: aVJhTVlQZVM1NTJzdFNpaDhVWDNEVkRyMndaMXpZOGxtWWlOKy9TQwo=
```
Note: the Secret must be in the same namespace as the Spark Operator.
To read more about Secrets, refer to the official K8s documentation.
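Alternatively, the same Secret can be created in a single step with `kubectl`, which performs the base64 encoding automatically; replace the placeholder values with real credentials:
```bash
# Create the Secret directly; kubectl handles the base64 encoding
# (replace the placeholders with real credentials)
kubectl create secret generic aws-credentials -n spark-operator \
  --from-literal=AWS_ACCESS_KEY_ID="<AWS_ACCESS_KEY_ID>" \
  --from-literal=AWS_SECRET_ACCESS_KEY="<AWS_SECRET_ACCESS_KEY>"
```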
Here is an example configuration which uses a `Secret` to pass AWS credentials as environment variables to a `SparkApplication`:
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: spark-app
namespace: spark-operator
spec:
...
sparkConf:
"spark.eventLog.enabled": "true"
"spark.eventLog.dir": "s3a://<s3-location>"
"spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
...
driver:
env:
- name: AWS_ACCESS_KEY_ID
valueFrom:
secretKeyRef:
name: aws-credentials
key: AWS_ACCESS_KEY_ID
- name: AWS_SECRET_ACCESS_KEY
valueFrom:
secretKeyRef:
name: aws-credentials
key: AWS_SECRET_ACCESS_KEY
# in case when Temporary Security Credentials are used
- name: AWS_SESSION_TOKEN
valueFrom:
secretKeyRef:
name: aws-credentials
key: AWS_SESSION_TOKEN
optional: true
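Assuming the manifest above is saved to a file, it can be submitted and monitored with standard `kubectl` commands; the file name is illustrative:
```bash
# Submit the application (the file name is illustrative)
kubectl apply -f spark-app.yaml

# Watch the application status reported by the operator
kubectl get sparkapplications -n spark-operator
kubectl describe sparkapplication spark-app -n spark-operator
```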