Apache Kafka
Learn how to install and use Apache Kafka
Get started
Apache Kafka is an open-source distributed event streaming platform for high-performance data pipelines, streaming analytics, data integration and mission-critical applications.
Before exploring the chart characteristics, let’s start by deploying the default configuration:
helm install <release-name> oci://dp.apps.rancher.io/charts/apache-kafka \
--set global.imagePullSecrets={application-collection}
Please check our authentication guide if you need to configure Application Collection OCI credentials in your Kubernetes cluster.
Chart overview
The Apache Kafka Helm chart distributed in Application Collection is made from scratch, which allowed us to include every best practice and standardization we deemed necessary. When creating it, the objective was to keep the underlying Apache Kafka features intact while simplifying some of its mechanisms when deployed in a Kubernetes environment.
It is important to understand that the chart can only be configured to work in KRaft mode. Although Apache Kafka still supports it, ZooKeeper mode is deprecated and will be removed in Apache Kafka 4.0. That said, the industry still uses the old ZooKeeper mode and we plan to support it, though there is no fixed date.
By default, the chart will deploy three nodes that serve as both controllers and brokers. These nodes will only have SASL authentication enabled, which will be explained in depth later in this guide. This is the minimum setup supported by the chart and is configured to work out of the box.
Chart configuration
To view the supported configuration options and documentation, run:
helm show values oci://dp.apps.rancher.io/charts/apache-kafka
Configure the cluster
As explained earlier, the chart’s standard configuration consists of a three-node deployment, each having the dual role of controller and broker. That said, there are a few variations to this setup that can be effortlessly configured.
Deploy broker-only nodes
A typical Kafka cluster will have three or five controller nodes, potentially having a larger number of broker nodes. As such, you may want to deploy broker-only nodes. As an example, the following command will deploy three broker-only nodes in addition to the controller/broker ones:
helm upgrade --install <release-name> oci://dp.apps.rancher.io/charts/apache-kafka \
--set broker.enabled=true
Deploy controller-only nodes
Given the clear difference between controllers and brokers, it can be more manageable to have the controller nodes operating only as such and not as controller/broker nodes:
helm upgrade --install <release-name> oci://dp.apps.rancher.io/charts/apache-kafka \
--set broker.enabled=true \
--set cluster.controllerBrokerRole=false
Enabling broker-only nodes is a must for this cluster setup to work properly.
Deploy a custom number of nodes
Following the above logic, a standard scenario involves adjusting the number of deployed nodes. The following design will result in a cluster of nine nodes: five controller-only nodes and four broker-only nodes:
helm upgrade --install <release-name> oci://dp.apps.rancher.io/charts/apache-kafka \
--set broker.enabled=true \
--set cluster.controllerBrokerRole=false \
--set cluster.nodeCount.controller=5 \
--set cluster.nodeCount.broker=4
The helm upgrade command will perform a cascading upgrade, so the new cluster will take some time to come online as the old pods need to be redeployed with the new configuration.
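The same topology can be kept in a values file instead of repeating --set flags. The sketch below is only an illustration: the file name custom-nodes.yaml and the helper function apply_nodes are hypothetical, while the keys simply mirror the --set paths used above. The upgrade is wrapped in a function so nothing runs until you call it with your release name.

```shell
# Write the node layout to a values file (hypothetical file name).
# The keys mirror the --set flags shown above.
cat > custom-nodes.yaml <<'EOF'
broker:
  enabled: true
cluster:
  controllerBrokerRole: false
  nodeCount:
    controller: 5
    broker: 4
EOF

# Hypothetical helper: apply the file to a release.
# Usage: apply_nodes <release-name>
apply_nodes() {
  helm upgrade --install "$1" oci://dp.apps.rancher.io/charts/apache-kafka \
    -f custom-nodes.yaml
}
```

Keeping the layout in a file makes it easier to review and to reuse in later upgrades.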
Retrieve the cluster ID
Kafka clusters are identified by a unique ID called cluster ID. Unless explicitly defined with the cluster.clusterID parameter, this ID is randomly generated in each new chart installation.
When performing an upgrade, the chart will automatically look for its current cluster ID and configure it in the newly deployed pods. Even so, there may be instances where you will need this ID. The current Kafka cluster ID can be checked by executing:
kubectl get configmap <release-name>-cluster -o jsonpath="{.data.clusterID}"
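If you would rather pin the ID up front through cluster.clusterID, you need a value to pass in. Kafka ships a kafka-storage.sh random-uuid command for this; where that tool is not at hand, the following sketch produces an ID of the same shape (16 random bytes, URL-safe base64, 22 characters). The function name new_cluster_id is hypothetical, and it is an assumption that the chart accepts any ID of this form.

```shell
# Hypothetical helper: print a 22-character URL-safe base64 ID built from
# 16 random bytes. IDs starting with "-" are skipped because they are
# awkward to pass as command-line arguments.
new_cluster_id() {
  id="-"
  while [ "${id#-}" != "$id" ]; do
    id=$(head -c 16 /dev/urandom | base64 | tr '+/' '-_' | tr -d '=\n')
  done
  printf '%s\n' "$id"
}

# Example usage (release name is a placeholder):
#   helm install my-kafka oci://dp.apps.rancher.io/charts/apache-kafka \
#     --set cluster.clusterID="$(new_cluster_id)"
```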
Use a multidisk setup
The chart simplifies the configuration process for log data replication through the cluster.disksPerBroker parameter. This parameter deploys “N” volumes per broker and automatically configures log.dirs in Kafka:
$ helm install <release-name> oci://dp.apps.rancher.io/charts/apache-kafka \
--set cluster.disksPerBroker=2
$ kubectl get pvc -l app.kubernetes.io/instance=<release-name> -o name
persistentvolumeclaim/logs-0-<release-name>-controller-0
persistentvolumeclaim/logs-0-<release-name>-controller-1
persistentvolumeclaim/logs-0-<release-name>-controller-2
persistentvolumeclaim/logs-1-<release-name>-controller-0
persistentvolumeclaim/logs-1-<release-name>-controller-1
persistentvolumeclaim/logs-1-<release-name>-controller-2
When persistence is disabled, the chart will deploy the number of volumes specified in disksPerBroker instead.
Security
Apache Kafka supports the authentication of connections to nodes from clients, other nodes and tools using SSL, SASL or a combination of both.
SASL
Before delving into the different options, it is necessary to understand that our Apache Kafka chart has SASL enabled and uses the SASL/PLAIN mechanism by default. As a result, an administrator and initial users are created using placeholder credentials. You should always modify the default users in the initial setup. Let’s check how to do this.
When deploying the Apache Kafka chart as it is, you will deploy a cluster using SASL authentication via the PLAIN mechanism. That said, the list of supported SASL mechanisms is more extensive:
- SASL/GSSAPI
- SASL/PLAIN
- SASL/OAUTHBEARER
Any of the above configurations can be modified via the auth parameter:
auth:
  # -- Enable Apache Kafka password authentication
  enabled: true
  sasl:
    # -- Comma-separated list of enabled SASL mechanisms. Valid values are `GSSAPI`, `OAUTHBEARER` and `PLAIN`
    enabledMechanisms: "PLAIN"
    gssapi:
      ...
    plain:
      interbrokerUsername: "admin"
      interbrokerPassword: "admin_password"
      users:
        user_test: password_test
    oauthbearer:
      ...
As seen above, you can easily disable SASL authentication, enable additional SASL mechanisms or change the credentials in use. Remember to configure the parameters of each enabled mechanism so that the cluster is properly configured.
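As a minimal sketch of replacing the placeholder credentials, the snippet below writes an override file that can be passed to helm with -f. The file name sasl-values.yaml and every username and password in it are hypothetical, and the nesting of plain under auth.sasl is an assumption that you should confirm against the output of helm show values.

```shell
# Hypothetical override file; replace every credential with your own secrets.
# Assumes the PLAIN settings live under auth.sasl.plain.
cat > sasl-values.yaml <<'EOF'
auth:
  enabled: true
  sasl:
    enabledMechanisms: "PLAIN"
    plain:
      interbrokerUsername: "kafka-admin"
      interbrokerPassword: "change-me-admin"
      users:
        app_user: "change-me-app"
EOF

# Example usage (release name is a placeholder):
#   helm upgrade --install my-kafka oci://dp.apps.rancher.io/charts/apache-kafka \
#     -f sasl-values.yaml
```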
The SASL/SCRAM mechanism is not yet supported in KRaft mode and is therefore not included in our chart.
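To connect a client to a cluster deployed with the default SASL/PLAIN setup, the client needs matching settings. The sketch below writes a client.properties file using standard Kafka client properties. It assumes that the client listener uses SASL_PLAINTEXT (that is, TLS has not been enabled) and reuses the placeholder user from the example values; the file name and the bootstrap address in the usage comment are placeholders as well.

```shell
# Client settings for SASL/PLAIN without TLS (assumption: SASL_PLAINTEXT listener).
# The credentials are the placeholder user from the example values; use your own.
cat > client.properties <<'EOF'
security.protocol=SASL_PLAINTEXT
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="user_test" password="password_test";
EOF

# Example usage with Kafka's CLI tools ($BOOTSTRAP is a placeholder for host:port):
#   kafka-topics.sh --bootstrap-server "$BOOTSTRAP" \
#     --command-config client.properties --list
```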
TLS
Apache Kafka allows clients to use SSL for traffic encryption and authentication. By default, SSL is disabled but can be turned on if needed. Similar to SASL, TLS configuration can be found under the tls parameter:
tls:
  enabled: true
  # -- Store format for file-based keys and trust stores. Valid values are `JKS` and `PEM`
  format: "JKS"
  # -- Configures kafka broker to request client authentication. Valid values are `none`, `required` and `requested`
  clientAuth: "none"
  # -- Whether to require Apache Kafka to perform host name verification
  hostnameVerification: true
  # -- Name of the secret containing the Apache Kafka certificates
  ## Note: The secret must contain a keystore file per node
  ## Each keystore must follow the "<release-name>-<nodeType>-<N>.keystore.jks" naming schema
  ## That is, for a cluster with 3 controller and 3 broker nodes using the "JKS" format we'll need:
  ##   - <release-name>-controller-[0..2].keystore.jks
  ##   - <release-name>-broker-[0..2].keystore.jks
  existingSecret: "apache-kafka-tls-secret"
  # -- Password to access the JKS keystore file in case it is encrypted
  keystorePassword: "test_pass"
  # -- Truststore filename in the secret
  truststoreFilename: "truststore.jks"
  # -- Password to access the JKS truststore file in case it is encrypted
  truststorePassword: "test_pass"
  # -- The password of the private key in the keystore file
  keystoreKeyPassword: "test_pass"
Besides enabling TLS, the listener’s protocols must be updated to use an SSL protocol, be it SSL or SASL_SSL:
cluster:
  listeners:
    client:
      protocol: SSL
    controller:
      protocol: SSL
    interbroker:
      protocol: SSL
The above configuration entails the most basic TLS setup needed. As it is, you need to provide a Kubernetes secret containing a keystore for each node and the truststore file. After that, you will have to define both the secret and the passwords in our values.yaml, and the chart will configure the provided .jks files in every deployed node. For a cluster with three controllers and three brokers that would mean:
$ kubectl create secret generic apache-kafka-tls-secret \
--from-file=keystore/apache-kafka-controller-0.keystore.jks \
--from-file=keystore/apache-kafka-controller-1.keystore.jks \
--from-file=keystore/apache-kafka-controller-2.keystore.jks \
--from-file=keystore/apache-kafka-broker-0.keystore.jks \
--from-file=keystore/apache-kafka-broker-1.keystore.jks \
--from-file=keystore/apache-kafka-broker-2.keystore.jks \
--from-file=truststore/truststore.jks
secret/apache-kafka-tls-secret created
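For larger clusters, typing every keystore name by hand becomes error-prone. The helper below is a small sketch (the function name keystore_names is hypothetical) that prints the file names following the "<release-name>-<nodeType>-<N>.keystore.jks" schema, using zero-based indices to match the kubectl create secret example above.

```shell
# Hypothetical helper: print the expected keystore file names, one per line.
# Usage: keystore_names <release-name> <nodeType> <count>
keystore_names() {
  i=0
  while [ "$i" -lt "$3" ]; do
    printf '%s-%s-%s.keystore.jks\n' "$1" "$2" "$i"
    i=$((i + 1))
  done
}

# Example: list the three controller keystores of a release named apache-kafka.
#   keystore_names apache-kafka controller 3
```

The output can be compared against the contents of your keystore directory before creating the secret.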
Our Apache Kafka chart currently supports TLS using JKS and PEM files, which should be generated as explained in the documentation’s Security SSL section. You can also simplify this process by running the kafka-generate-ssl.sh script.
Custom configuration
We have covered the Kafka configuration parameters that have a direct equivalent in our chart values. However, as shown in the official documentation, Kafka supports many parameters that the chart does not cover directly. Fortunately, you can configure any parameter you need using either of these two options:
- Append the configuration directly to the node’s configuration file
- Pass the configuration via environment variables
Configuration specified through environment variables takes precedence over the one set via file.
Expand the configuration file
Each node takes its configuration from a server.properties file generated when the pod is initialized. A simple way to add extra configuration is to append any key=value pair directly to the .properties file using the controller.configuration or broker.configuration parameter.
You can define this in a few ways, so let’s see a couple of examples:
# custom-values.yaml
# Using <nodeType> as a placeholder of controller/broker for readability
<nodeType>:
  configuration: |-
    log.retention.hours=72
    log.retention.ms=300

# custom-values.yaml
<nodeType>:
  configuration:
    log.retention.hours: 72
    log.retention.ms: 300
The configuration parameter also supports using an array to set a key=value pair in each entry. Whatever the method used, the chart will create a ConfigMap with the defined configuration and append it to the server.properties file once it is created.
The chart supports different configurations for controller and broker nodes, so both configuration parameters must be set when using both types of nodes.
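As an illustration of the array form applied to both node types, the sketch below writes a values file that sets the same entries for controllers and brokers. The file name custom-config.yaml is hypothetical, and the properties are the sample ones used in this section rather than recommended settings.

```shell
# Hypothetical values file using the array form of the configuration parameter.
# Both node types are set because each one has its own configuration.
cat > custom-config.yaml <<'EOF'
controller:
  configuration:
    - log.retention.hours=72
    - log.retention.ms=300
broker:
  configuration:
    - log.retention.hours=72
    - log.retention.ms=300
EOF

# Example usage (release name is a placeholder):
#   helm upgrade --install my-kafka oci://dp.apps.rancher.io/charts/apache-kafka \
#     -f custom-config.yaml
```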
Use environment variables
Apache Kafka retrieves configuration data from environment variables natively. This offers another avenue to configure Kafka when using containers, and it is our preferred way. It requires a specific naming schema explained in the documentation.
We can review the existing environment variables and add new ones under <nodeType>.podTemplates.containers.<nodeType>.env:
<nodeType>:
  podTemplates:
    containers:
      <nodeType>:
        env:
          ...
          KAFKA_LOG_RETENTION_HOURS:
            enabled: true
            values: '72'
          KAFKA_LOG_RETENTION_MS: '300'
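Judging by the KAFKA_LOG_RETENTION_HOURS example, a dotted property name maps to an environment variable by adding the KAFKA_ prefix, replacing dots with underscores and upper-casing the result. The helper below sketches only that inferred rule (the function name to_kafka_env is hypothetical); property names containing other characters may follow additional rules, so the naming schema in the documentation remains the authoritative source.

```shell
# Hypothetical helper: derive the environment variable name for a dotted,
# lowercase Kafka property. Only covers the rule inferred from the example.
# Usage: to_kafka_env <property.name>
to_kafka_env() {
  printf 'KAFKA_%s\n' "$1" | tr '.' '_' | tr '[:lower:]' '[:upper:]'
}

# Example:
#   to_kafka_env log.retention.hours   # prints KAFKA_LOG_RETENTION_HOURS
```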
Persistence
Enabled by default, data persistence in the Apache Kafka chart affects data configuration and logs and is configured with the persistence parameter. You can define additional volumes to persist by appending them as follows:
<nodeType>:
  statefulset:
    volumeClaimTemplates:
      data:
        enabled: '{{ .Values.persistence.enabled }}'
      logs:
        enabled: '{{ .Values.persistence.enabled }}'
        volumeCount: '{{ int .Values.cluster.disksPerBroker }}'
      newVolume:
        enabled: '{{ .Values.persistence.enabled }}'
  podTemplates:
    containers:
      <nodeType>:
        volumeMounts:
          data:
            enabled: true
            mountPath: /mnt/kafka/data
          logs:
            enabled: true
            volumeCount: '{{ .Values.cluster.disksPerBroker }}'
            mountPath: /mnt/kafka/logs
          newVolume:
            enabled: true
            mountPath: /mnt/kafka/new_path
Operations
Thanks to the Apache Kafka architecture, running clusters can be easily modified using helm upgrade. Knowing this, the methods explained in the Custom configuration section can be used to modify your already deployed cluster’s configuration. For specific cases, please check Apache Kafka’s documentation.
Upgrade to a new version
Chart upgradeability is paramount and can be affected by changes coming from two sources:
- Changes in Apache Kafka itself
- Changes in the chart templates
Although rare, Apache Kafka can introduce changes from version to version (especially from one minor version to the next) that may require manual intervention. This is well documented and can be accessed at the Upgrade documentation section. Similarly, the chart’s templates can undergo modifications that include breaking changes. This will always entail a bump to the chart’s major version.
Regardless of the source, breaking changes affecting the Apache Kafka chart will be documented in its README file. Any required manual steps will also be included.
Uninstall the chart
Removing an installed Apache Kafka cluster is simple:
$ kubectl get pods -l app.kubernetes.io/instance=<release-name> -o name
pod/<release-name>-controller-0
pod/<release-name>-controller-1
pod/<release-name>-controller-2
$ helm uninstall <release-name>
Keep in mind that PVCs won’t be removed unless you define the <nodeType>.statefulset.persistentVolumeClaimRetentionPolicy.whenDeleted=Delete parameter.
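If the retention policy was not set and you want to clean up manually, the leftover PVCs can be selected with the same app.kubernetes.io/instance label used in the commands above. The helper below is a sketch (the function name purge_pvcs is hypothetical). It permanently deletes the claims, and depending on the storage class the underlying data as well, so review the listed PVCs before relying on it.

```shell
# Hypothetical helper: list, then delete, the PVCs left behind by a release.
# Destructive: the claims are removed permanently.
# Usage: purge_pvcs <release-name>
purge_pvcs() {
  kubectl get pvc -l "app.kubernetes.io/instance=$1" -o name
  kubectl delete pvc -l "app.kubernetes.io/instance=$1"
}
```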