Strimzi: An Open-Source Data Pipeline

What is Strimzi?: Strimzi provides a way to run an Apache Kafka cluster on Kubernetes in various deployment configurations. You can also manage Kafka topics, users, Kafka MirrorMaker and Kafka Connect using Custom Resources. This means you can use your familiar Kubernetes processes and tooling to manage complete Kafka applications.
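To make the Custom Resource idea concrete, here is a minimal sketch of a Kafka topic declared as a Kubernetes resource. The `topic_manifest` helper, the topic name `orders`, and the partition and replica counts are assumptions made for illustration, and `my-cluster` is assumed to be the name of the Kafka cluster. The `apiVersion` shown is the one used by recent Strimzi releases; older releases may need a different value.

```shell
#!/usr/bin/env bash
# Sketch: print a Strimzi KafkaTopic manifest for a given topic.
# topic_manifest is a hypothetical helper, not part of Strimzi.
# Arguments: topic name, partitions, replicas, Kafka cluster name.
topic_manifest() {
  local name="$1"
  local partitions="${2:-1}"
  local replicas="${3:-1}"
  local cluster="${4:-my-cluster}"
  cat <<EOF
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: ${name}
  labels:
    strimzi.io/cluster: ${cluster}
spec:
  partitions: ${partitions}
  replicas: ${replicas}
EOF
}

# Print the manifest only; nothing is applied to a cluster here.
topic_manifest orders 3 1
```

Piping the output into `kubectl apply -n kafka -f -` would hand the topic to the operator, assuming the Kafka cluster was deployed with the Topic Operator enabled.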

Use case introduction: We needed to set up a Kafka streaming data pipeline that pulls data from an on-prem database and loads it into a BigQuery table on GCP.

Data transformation logic: We expected to load the last 6 months of data for 20+ tables. These tables hold approximately 100 GB of data, with columns of the Char, Integer, Date and CLOB data types.

Data should flow in 24x7, so a streaming data pipeline is expected. We don't need any transformation logic implemented on the Kafka cluster; it will be a simple incremental load of data.

Setup Requirements:

  1. Create a Kubernetes cluster on GCP.
  2. Create a namespace with a name that suits your requirements.
  3. Apply the Strimzi installation file to the Kubernetes cluster: ```kubectl apply -f 'https://strimzi.io/install/latest?namespace=kafka' -n kafka``` (here the namespace is 'kafka').
  4. Provision the Apache Kafka cluster using the example file from the GitHub repository (https://github.com/strimzi/strimzi-kafka-operator/tree/master/examples/kafka): ```kubectl apply -f kafka-persistent-single.yaml --namespace=kafka```. kubectl responds that the resource is 'configured'.
  5. Wait for the cluster to become ready: ```kubectl wait kafka/my-cluster --for=condition=Ready --timeout=300s --namespace=kafka```.
  6. Provision Kafka Connect through the operator with the following YAML: ```kubectl apply -f kafka-connect.yaml --namespace=kafka```. The file is in the GitHub repository (https://github.com/strimzi/strimzi-kafka-operator/blob/release-0.19.x/examples/connect/kafka-connect.yaml).
  7. Download the BigQuery connector and configure it.
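
The kubectl steps above can be collected into one script. This is a sketch under assumptions: the `run` wrapper, the `setup_pipeline` function, and the `DRY_RUN` and `NAMESPACE` variables are illustrative names, and the two YAML files are assumed to have been downloaded from the Strimzi examples into the current directory. Creating the GCP cluster and configuring the BigQuery connector are left out. By default the script only prints the commands; setting `DRY_RUN=0` would execute them against whichever cluster `kubectl` currently points at.

```shell
#!/usr/bin/env bash
# Stop at the first failing command (for example, if the namespace already exists).
set -e

NAMESPACE="${NAMESPACE:-kafka}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print commands, 0 = execute them

# Print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

setup_pipeline() {
  # Step 2: create the namespace.
  run kubectl create namespace "$NAMESPACE"
  # Step 3: install the Strimzi operator into that namespace.
  run kubectl apply -f "https://strimzi.io/install/latest?namespace=${NAMESPACE}" -n "$NAMESPACE"
  # Step 4: provision the Kafka cluster from the example file.
  run kubectl apply -f kafka-persistent-single.yaml -n "$NAMESPACE"
  # Step 5: block until the cluster reports Ready (or 300 seconds pass).
  run kubectl wait kafka/my-cluster --for=condition=Ready --timeout=300s -n "$NAMESPACE"
  # Step 6: provision Kafka Connect.
  run kubectl apply -f kafka-connect.yaml -n "$NAMESPACE"
}

setup_pipeline
```

Keeping the wait between steps 4 and 6 means Kafka Connect is only applied once the brokers it depends on are ready.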

A few commonly used commands:

· kubectl get kafka -n kafka

· kubectl get pods

· kubectl config set-context --current --namespace=kafka (changes the current namespace)

· kubectl exec -it my-kafka-connect-cluster-connect-86c6d6f744-25ksw -- /bin/bash

· kubectl exec --stdin --tty xxxx-xxx -- /bin/bash
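
Generated pod names such as the one above can change when a pod is recreated, so it may be handier to look the pod up by label before opening a shell. The sketch below assumes the Connect pods carry the `strimzi.io/kind` and `strimzi.io/cluster` labels and that the Connect cluster is named `my-kafka-connect-cluster` (inferred from the pod name above). `connect_selector` and `connect_shell` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Build the label selector assumed to match the pods of one Kafka Connect cluster.
connect_selector() {
  echo "strimzi.io/kind=KafkaConnect,strimzi.io/cluster=${1}"
}

# Open a shell in the first matching Connect pod.
# Arguments: Connect cluster name, namespace (defaults to kafka).
connect_shell() {
  local cluster="$1"
  local namespace="${2:-kafka}"
  local pod
  pod="$(kubectl get pods -n "$namespace" \
    -l "$(connect_selector "$cluster")" \
    -o jsonpath='{.items[0].metadata.name}')"
  kubectl exec --stdin --tty "$pod" -n "$namespace" -- /bin/bash
}

# Show the selector that would be used; no cluster is contacted here.
connect_selector my-kafka-connect-cluster
```

With a cluster available, `connect_shell my-kafka-connect-cluster kafka` would open the shell without copying the pod name by hand.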

Hi, I am a certified Google Cloud Data Engineer. I use the Medium platform to share my experience with other members of the Medium network.