# Kerberos
# Overview
Kerberos is a network authentication protocol that allows Spark to securely read data from and write data to a Kerberos-enabled HDFS cluster. Spark versions 2.4.5 and earlier do not support retrieval, distribution, and renewal of delegation tokens (authentication credentials) on Kubernetes and require the delegation token to be provided via a Secret. Starting from version 3.0, Spark provides full support for Kerberos authentication and delegation token handling. Detailed information about delegation token handling in Spark is available in the official documentation.

This section assumes you have previously set up a Kerberos-enabled HDFS cluster and have access to it to execute CLI commands.
# Retrieving delegation tokens
To provide a Spark Application and the Spark History Server with a delegation token:

- Retrieve a delegation token from the HDFS cluster. This can be done using the HDFS CLI, e.g.:

```bash
hdfs fetchdt --renewer hdfs /var/keytabs/hadoop.token
```

This command will fetch a delegation token and save it to the file `/var/keytabs/hadoop.token`.
- Create a file-based secret using the delegation token retrieved in the previous step:

```bash
kubectl create secret generic hadoop-token --from-file hadoop.token
```

Note: the Spark Operator assumes the delegation token file is named `hadoop.token`, so if the token file has a different name, it should be renamed to `hadoop.token`.
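The secret-creation step above can be wrapped in a small script that fails fast when the token file is missing or empty. This is a minimal sketch: the `create_token_secret` helper is hypothetical (not part of the operator or its docs), and it assumes `kubectl` is on the `PATH` with a configured cluster context:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: validate the fetched token file, then create the
# Secret under the key name the Spark Operator expects (hadoop.token).
create_token_secret() {
  local token_file="$1"
  [ -s "$token_file" ] || { echo "missing or empty token file: $token_file" >&2; return 1; }
  # The key must be hadoop.token - the file name the operator assumes:
  kubectl create secret generic hadoop-token --from-file "hadoop.token=$token_file"
}

# Usage (after `hdfs fetchdt --renewer hdfs /var/keytabs/hadoop.token`):
# create_token_secret /var/keytabs/hadoop.token
```

The `hadoop.token=` key prefix in `--from-file` pins the key name inside the Secret regardless of the source file's path.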
# Configuring access to a Kerberos-enabled HDFS cluster for Spark Applications
To provide a Spark Application with access to a Kerberos-enabled HDFS cluster, use the secret containing the delegation token named `hadoop.token`. To retrieve a delegation token and store it in a secret, check the "Retrieving delegation tokens" section of this document.

To provide the Spark Application with the Hadoop delegation token, mount the secret with the token and set its type to `HadoopDelegationToken` in the corresponding spec sections of the `SparkApplication` for both the Driver and Executors, e.g.:
```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: <app-name>
  namespace: <namespace>
spec:
  ...
  hadoopConfigMap: hadoop-config
  driver:
    serviceAccount: spark-instance-spark-service-account
    secrets:
      - name: hadoop-token
        path: /mnt/secrets
        secretType: HadoopDelegationToken
  executor:
    secrets:
      - name: hadoop-token
        path: /mnt/secrets
        secretType: HadoopDelegationToken
```
Once specified, the delegation token from the `hadoop-token` secret will be used to authenticate with the Kerberos-enabled HDFS cluster.

Note: to provide Hadoop configuration files such as `core-site.xml` and `hdfs-site.xml`, use the `hadoopConfigMap` field in the `SparkApplication` spec to specify the name of the ConfigMap containing them. The operator will mount the ConfigMap at the path `/etc/hadoop/conf` and set the environment variable `HADOOP_CONF_DIR` to point to it in both the Driver and Executors.
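As an illustration, a ConfigMap with the name `hadoop-config` used in the example above could be created from a manifest like the following. The property values are a sketch, not a complete Hadoop configuration: the namenode address is the one used elsewhere in this document, and only `fs.defaultFS` and `hadoop.security.authentication` are shown:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: hadoop-config
data:
  core-site.xml: |
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://namenode.hdfs-kerberos.svc.cluster.local:9000</value>
      </property>
      <property>
        <name>hadoop.security.authentication</name>
        <value>kerberos</value>
      </property>
    </configuration>
```

Alternatively, if the configuration files already exist locally, the same ConfigMap can be created with `kubectl create configmap hadoop-config --from-file=core-site.xml --from-file=hdfs-site.xml`.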
# Configuring Spark History Server to use a Kerberos-enabled HDFS cluster for storage
To provide the Spark History Server with access to a Kerberos-enabled HDFS cluster for storing Spark Application history data, use the secret containing the delegation token named `hadoop.token`. To retrieve a delegation token and store it in a secret, check the "Retrieving delegation tokens" section of this document.
- To install the Spark History Server with Kerberos enabled, install the Operator with the following parameters:

```bash
kubectl kudo install spark --namespace=<namespace> \
  -p enableHistoryServer=true \
  -p historyServerFsLogDirectory=hdfs://namenode.hdfs-kerberos.svc.cluster.local:9000/history \
  -p delegationTokenSecret=hadoop-token
```

Here, the `delegationTokenSecret` parameter specifies the name of the secret containing the delegation token, and `historyServerFsLogDirectory` specifies an HDFS path for storage in a Kerberos-enabled cluster.
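To keep these flags consistent when installing into several environments, the command can be generated from a small helper. This is a sketch: the `spark_install_cmd` function is hypothetical, and only the parameter names (`enableHistoryServer`, `historyServerFsLogDirectory`, `delegationTokenSecret`) come from the example above:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: assemble the KUDO install command for a given
# namespace, HDFS log directory, and delegation-token secret name.
spark_install_cmd() {
  local ns="$1" log_dir="$2" secret="$3"
  printf 'kubectl kudo install spark --namespace=%s -p enableHistoryServer=true -p historyServerFsLogDirectory=%s -p delegationTokenSecret=%s\n' \
    "$ns" "$log_dir" "$secret"
}

# Print the command for the example values (pipe to `sh` to actually run it):
spark_install_cmd spark \
  hdfs://namenode.hdfs-kerberos.svc.cluster.local:9000/history \
  hadoop-token
```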
- To make Spark Application logs available in the History Server HDFS location, enable event log collection and set the event log directory to point to the History Server HDFS location. Hadoop delegation tokens must be provided to Spark Applications so they can access the Kerberos-enabled HDFS cluster:
```yaml
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: <app-name>
  namespace: <namespace>
spec:
  ...
  sparkConf:
    "spark.eventLog.enabled": "true"
    "spark.eventLog.dir": "hdfs://namenode.hdfs-kerberos.svc.cluster.local:9000/history"
  driver:
    serviceAccount: spark-instance-spark-service-account
    secrets:
      - name: hadoop-token
        path: /mnt/secrets
        secretType: HadoopDelegationToken
  executor:
    secrets:
      - name: hadoop-token
        path: /mnt/secrets
        secretType: HadoopDelegationToken
```