What Are the Main Building Blocks of a Data Pipeline?
Recently I came across a discussion titled "What are the main building blocks of a data pipeline?" With our busy professional and personal schedules, we data engineers keep doing meaningful work that affects millions or even billions of lives, sometimes directly and sometimes indirectly. Every working day we deal with some of these building blocks, but we do not always realize it. This post covers the main building blocks and explains why each one matters.
My version. What is yours?
In my experience, the following are the main building blocks of a data pipeline, though they may vary with the project, solution, platform, and business requirements.
- Identify data scope: Identifying the current and future data scope is very important, because accessing, processing, and storing data that is not useful adds cost, degrades pipeline performance, and wastes energy and time.
- Identify schema design, infrastructure, storage, scripting language, and applications: The storage schema is chosen based on the source system and data type (structured, semi-structured, or unstructured), the type of data processing (heavy, light, long-term storage, regular updates, and so on), expected performance, the volume and velocity of the data, and scalability. For structured data (relational DB): modeling techniques (star or snowflake schema) and relational tables (aggregate, fact, and dimension tables) can be defined. We also need to identify partitioning and indexing for the tables. For semi-structured or unstructured data: NoSQL databases and HDFS (both cloud and on-prem) can be used as storage areas. We also need to identify partitioning and bucketing for the storage areas. Note: Performance improvement is a continuous process that is evaluated in the PROD environment.
- Identify the data ingestion process: We need to identify the process or medium through which data is transported from the source system (the data origin point) to the DE platform, where it is consumed, cleansed, processed, and pushed to storage. Common data ingestion tools include APIs and messaging tools (Kafka, Pub/Sub, etc.).
- Data Cleansing: Data cleansing is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data from a dataset.
- Data Preparation: We get raw data from the source system, and it needs data munging: transforming the raw data into another format to make it more appropriate and valuable for downstream purposes and systems such as analysis, prediction, and reporting. We generally perform data aggregation, transformation, parsing, and wrangling, depending on business requirements.
- Data Archival: Data archiving is the practice of identifying data that is no longer active and moving it out of production systems into long-term storage systems.
- Data Purging: Purging is the process of freeing up space in the database or of deleting obsolete data that is not required by the system. The purge process can be based on the age of the data or the type of data.
- Data Orchestrator: Data pipeline orchestration is the process that manages the dependencies between pipeline tasks, runs dependent tasks in the right order, and schedules the job for end-to-end execution.
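
To make a few of these blocks concrete, here is a minimal sketch in Python (3.9+, standard library only) that chains cleansing, age-based archival and purging, and preparation, and lets a tiny orchestrator decide the run order from declared dependencies. Everything in it is an assumption made for illustration: the record fields (id, region, amount, day), the function names, and the 30-day and 365-day retention thresholds are hypothetical, and a real pipeline would normally use a proper orchestration tool rather than a loop like this.

```python
from datetime import date, timedelta
from graphlib import TopologicalSorter

REQUIRED_FIELDS = ("id", "region", "amount", "day")  # hypothetical record layout


def cleanse(rows):
    """Drop incomplete rows and duplicate ids, and normalise the region text."""
    seen, out = set(), []
    for row in rows:
        if any(row.get(field) is None for field in REQUIRED_FIELDS):
            continue  # incomplete record
        if row["id"] in seen:
            continue  # duplicate record
        seen.add(row["id"])
        out.append({**row, "region": row["region"].strip().lower()})
    return out


def split_by_age(rows, today, archive_after_days, purge_after_days):
    """Return (active, archived, purged) lists based on record age in days."""
    active, archived, purged = [], [], []
    for row in rows:
        age = (today - row["day"]).days
        if age >= purge_after_days:
            purged.append(row)      # obsolete: delete
        elif age >= archive_after_days:
            archived.append(row)    # inactive: move to long-term storage
        else:
            active.append(row)
    return active, archived, purged


def prepare(rows):
    """Aggregate rows into a total amount per region."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]
    return totals


def run_pipeline(raw, today):
    """Run each task after the tasks it depends on; return (order, results)."""
    results = {}
    tasks = {
        "cleanse": lambda: cleanse(raw),
        "retention": lambda: split_by_age(results["cleanse"], today, 30, 365),
        "prepare": lambda: prepare(results["retention"][0]),  # active rows only
    }
    # Each task maps to the set of tasks that must finish before it.
    dependencies = {
        "cleanse": set(),
        "retention": {"cleanse"},
        "prepare": {"retention"},
    }
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        results[name] = tasks[name]()
    return order, results


# Made-up sample data, only for demonstration.
TODAY = date(2024, 1, 31)
RAW = [
    {"id": 1, "region": " EU ", "amount": 10, "day": TODAY - timedelta(days=1)},
    {"id": 1, "region": " EU ", "amount": 10, "day": TODAY - timedelta(days=1)},
    {"id": 2, "region": "eu", "amount": 5, "day": TODAY - timedelta(days=2)},
    {"id": 3, "region": "US", "amount": None, "day": TODAY},
    {"id": 4, "region": "us", "amount": 7, "day": TODAY - timedelta(days=60)},
    {"id": 5, "region": "us", "amount": 3, "day": TODAY - timedelta(days=400)},
]

if __name__ == "__main__":
    order, results = run_pipeline(RAW, TODAY)
    print(order)
    print(results["prepare"])
```

With the sample data above, the duplicate of record 1 and the incomplete record 3 are dropped during cleansing, record 4 lands in the archive list, record 5 in the purge list, and only the two recent EU records are aggregated. The dependency map is the part worth noticing: adding a new task means declaring what it depends on, not rewriting the run order by hand.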
This post lists my version, and I would love to hear yours too. Please post your comments for a healthy discussion.