The Data Engineering learning roadmap is given below:


Recently I came across a discussion, ‘What are the main building blocks of a Data Pipeline?’ With our busy schedules of professional and personal life, we keep doing significant work as Data Engineers that impacts millions and billions of lives, sometimes directly and sometimes not immediately. Every professional…


Introduction

Apache Airflow is an open-source workflow management platform. Airflow is written in Python, and workflows are created via Python scripts. Airflow is designed under the principle of “configuration as code”. …
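To make “configuration as code” concrete, here is a minimal DAG sketch (assuming Airflow 2.x; the DAG id, schedule, and task below are illustrative placeholders, not from the original article):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def say_hello():
    # The task body: any plain Python callable works here
    print("Hello from Airflow")

# The workflow itself is ordinary Python code: configuration as code
with DAG(
    dag_id="example_hello_dag",       # placeholder name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    hello_task = PythonOperator(task_id="say_hello", python_callable=say_hello)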


Recently, I successfully passed my GCP Data Engineering Certification, and I would like to share my experience here, which may help other members of the community.

Preparation Roadmap

The official study book for the GCP Data Engineering certification is the Professional Data Engineer Study Guide, which can be ordered from here.

Week 1


What is Strimzi? Strimzi provides a way to run an Apache Kafka cluster on Kubernetes in various deployment configurations. You can also manage Kafka topics, users, Kafka MirrorMaker, and Kafka Connect using Custom Resources. This means you can use your familiar Kubernetes processes and tooling to manage complete Kafka applications.


Introduction

Sometimes we struggle while investigating BigQuery costs, during data audits, accidental query executions, or mishandling of BigQuery best practices. We need a dashboard to investigate who executed which query and what volume of data was processed by each query execution.

Here is a solution that can help you track all queries executed in the last N days (N = 10 in the query below).

Query

SELECT
  job_id,
  start_time,
  user_email,
  total_bytes_processed,
  query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 DAY)
    AND CURRENT_TIMESTAMP()
  AND job_type = "QUERY"
  AND end_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 DAY)
    AND CURRENT_TIMESTAMP()
ORDER BY total_bytes_processed DESC
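If you prefer to pull the same report from Python (for example, to feed a dashboard), a minimal sketch with the google-cloud-bigquery client could look like this; authentication is assumed to be configured via GOOGLE_APPLICATION_CREDENTIALS, and the 10-day window is a placeholder:

from google.cloud import bigquery

# Assumes GOOGLE_APPLICATION_CREDENTIALS points to a service account key
client = bigquery.Client()

sql = """
SELECT job_id, start_time, user_email, total_bytes_processed, query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 10 DAY)
  AND job_type = "QUERY"
ORDER BY total_bytes_processed DESC
"""

# Print the heaviest queries first
for row in client.query(sql).result():
    print(row.user_email, row.total_bytes_processed, row.job_id)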


If we have data like the sample below (a JSON string) and we need to convert it into a JSON object, the solution is given below.

Sample data:

{"id":"2917",
"reqid":"4643f7fd",
"guid":"7ac170210b4643f7fd",
"type":"Raw",
"cp":"750266",
"start":"1595579266754",
"processedTime":"15955630",
"message":{
"protoVer":"TLSv1.3",
"cliIP":"85.184.183.27",
"reqMethod":"GET",
"fwdHot":"kickout-service.com.comp.com",
"proto":"https",
"rrerere":"kickout-com.comp.com",
"UA":"Mozilla/5.0 (khkjjljk; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1 Safari/605.1.15",
"reqPath":"/dk/da/poders/7ytrr6607",
"respLen":"Content-bfrecff0d%0a",
"status":"200"
},
"reqHdr":{
"referer":"https://poder.comp.com/dk/da/kickout/shoppingbag/",
"reqTime":"1266.75"
},
"respHdr":{
"date":"Date: Fri, 24 Jul 2020 08:27:47 GMT",
"respCacheCtl":"Cache-Control: no-cache, no-store, max-age=0, must-revalidate",
"allowOrigin":"Access-Control-Allow-Origin: https://poder.comp.com"},
"netPerf":{
"midMileLatency":"35",
"errCdF29":"ERR_NONE",
"lastByte":"1",
"clientRTT":"45",
"asnum":"204274",
"retdwedd":"ERR_NONE",
"hgetet":"1110",
"netOriginLatency":"525",
"hrere":"568",
"edgeIP":"xxxx.xxxx.xxx"
},
"geo":{
"country":"Donk"
}
}

To convert this string into a JSON object (a Python dictionary), we use the following Python code.

import json

# thatstring holds the JSON text shown above
dict_json_obj = json.loads(thatstring)
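Once parsed, the result is an ordinary Python dictionary, so nested fields from the sample above can be read directly:

# Continuing from above: dict_json_obj is a plain Python dict
print(dict_json_obj["message"]["cliIP"])  # "85.184.183.27"
print(dict_json_obj["geo"]["country"])    # "Donk"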

In today’s digital world we often need to connect to Splunk and pull data out of it for ML modeling or prediction. There are multiple ways to do this, through both licensed and open-source tools.

In this article I demonstrate a way we can connect to Splunk and pull data…


If you are getting the following error when you try to connect to an HTTP URL using Python code, the solution is not that tough; it is given below.

I was trying to connect to Splunk using Python scripts when I got this error. The code I was running:

import splunklib.client  as…
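For context, a minimal connection sketch with the Splunk SDK for Python (splunk-sdk) looks roughly like this; the host, port, and credentials below are placeholders for your own Splunk instance, not values from the original article:

import splunklib.client as client

# Placeholder connection details; replace with your Splunk instance
service = client.connect(
    host="localhost",
    port=8089,          # Splunk's default management port
    username="admin",
    password="changeme",
)

# Quick sanity check: list installed apps
print([app.name for app in service.apps])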

We get the error ‘google.api_core.exceptions.Forbidden: 403 GET https://bigquery.googleapis.com/bigquery/v2/projects/XXXXX/queries/XXXXXXXX?maxResults=0&location=XX: Request had insufficient authentication scopes.’ when we try to operate on a BigQuery table through Python code.

Solution: To solve this error, we need to perform the following steps:

  1. Download a GCP service account key file with the required privileges.
  2. Set the following environment variable in your Python code:

import os

# Replace with the path to your service account key file
path_service_account = 'XXXXXXXXX.json'
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = path_service_account
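With the environment variable set, the client libraries pick up the service account automatically and the 403 should disappear. A quick sanity check (assuming the google-cloud-bigquery package is installed):

from google.cloud import bigquery

# A trivial query to confirm the credentials and scopes are sufficient
client = bigquery.Client()
rows = client.query("SELECT 1 AS ok").result()
print(list(rows))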

Vibhor Gupta

Hi, I am a certified Google Cloud Data Engineer. I use the Medium platform to share my experience with other members of the Medium network.
