Chapter 1. Meet Airflow

This chapter explores the concepts of data pipelines.

To handle the ever-increasing volumes of data, companies can use a variety of tools, including Airflow, a batch-oriented framework for building data pipelines.

Airflow sits in the middle of data-processing systems to coordinate the work happening across them. Airflow does not process data itself; it orchestrates the different tools involved in the processing.

Introduction To Data Pipelines

A data pipeline typically consists of several tasks that execute in order to move data from one representation or source to another, combining it with other data along the way to produce a desired outcome.

Pipelines As Graphs

To make dependencies between tasks more explicit, you can draw the pipeline as a graph, where individual tasks are represented as nodes and dependencies between tasks are represented by directed edges. This type of graph is known as a directed graph.

In Airflow, this representation is known as a directed acyclic graph (DAG): the edges give the tasks a one-way order, and the graph contains no loops or cycles, which prevents circular dependencies.

Execution Algorithm

A DAG structure allows a straightforward execution algorithm (a code sketch follows the list):

  1. For each uncompleted task:

    • check whether all of its upstream tasks have completed

    • if so, add the task to the execution queue

  2. Execute the tasks in the queue, marking each one complete as it finishes

  3. Go back to step 1 and repeat until all tasks in the DAG have completed.
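
The following is a minimal sketch of this algorithm in plain Python; the task graph is a hypothetical example, not Airflow code:

```python
# Each task maps to the set of upstream tasks it depends on
# (a directed acyclic graph as an adjacency mapping).
dag = {
    "fetch_sales": set(),
    "fetch_weather": set(),
    "clean_sales": {"fetch_sales"},
    "clean_weather": {"fetch_weather"},
    "join_datasets": {"clean_sales", "clean_weather"},
}

completed = set()

# Repeat until every task in the DAG has completed.
while len(completed) < len(dag):
    # 1. Queue each uncompleted task whose upstream tasks have all completed.
    queue = [
        task
        for task, upstream in dag.items()
        if task not in completed and upstream <= completed
    ]
    if not queue:
        raise RuntimeError("cycle detected: no runnable tasks remain")
    # 2. Execute the queued tasks and mark them complete.
    for task in queue:
        print(f"executing {task}")
        completed.add(task)
```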

Airflow

Airflow is an open-source tool for developing and monitoring workflows. It lets you flexibly define pipelines as DAGs of tasks in Python: a Python script describes the structure of a DAG, that is, the set of tasks and the dependencies between them.
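
A minimal sketch of such a script, assuming Airflow 2.x; the DAG id, schedule, and task commands are hypothetical:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Define the DAG: its id, when it starts, and how often it runs.
with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    # Two tasks, each running a shell command.
    fetch = BashOperator(task_id="fetch", bash_command="echo fetching data")
    clean = BashOperator(task_id="clean", bash_command="echo cleaning data")

    # The >> operator declares the dependency: fetch runs before clean.
    fetch >> clean
```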

One of the major benefits of using Python here is the wide range of Airflow extensions now available, which enable you to execute tasks across a variety of systems, such as databases and big-data technologies like Google BigQuery.
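
For example, a hedged sketch using the Google provider package (apache-airflow-providers-google); the task id, project, and query are hypothetical:

```python
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator,
)

# Submits a SQL query as a BigQuery job; assumes a configured
# Google Cloud connection. The table and query are illustrative.
aggregate_sales = BigQueryInsertJobOperator(
    task_id="aggregate_sales",
    configuration={
        "query": {
            "query": "SELECT product, SUM(amount) "
                     "FROM `project.dataset.sales` GROUP BY product",
            "useLegacySql": False,
        }
    },
)
```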

Benefits

Some of the many benefits of using Airflow are:

  • Extensive scheduling and execution options

  • Monitoring and handling failures

  • Alerting options

  • UI dashboard for monitoring DAGs

  • Incremental loading and backfilling of data

  • Programmatic pipeline definitions in Python

  • Extensions for integrating with many external systems
