  Apache Airflow
    Data Engineering | 2022. 6. 6. 11:32

    Introduction

    Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows. Airflow is an orchestrator: it lets you execute your tasks at the right time, in the right way, and in the right order.

    With Airflow you can build data pipelines that interact with many different tools, manage those pipelines reliably, and monitor and retry your tasks automatically.

    Benefits

    Python

    Data pipelines are coded in Python: everything you can do in Python, you can do in your data pipelines, so the possibilities are virtually limitless.

    Scalability

    You can execute as many tasks in parallel as you want; how many depends on your architecture and your resources. For example, if you have a Kubernetes cluster, you can execute your tasks on that cluster.

    GUI

    Airflow provides a user interface from which you can monitor your data pipelines, retry your tasks, and interact with your pipelines in many other ways.

    Extensibility

    If there is a new tool you want to interact with, you don't have to wait for Airflow to be upgraded: you can create your own operator or plugin, add it to your Airflow instance, and interact with that new tool, as sketched below.
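
    A minimal sketch of such a custom operator (the class name, endpoint parameter, and log message are hypothetical); in recent Airflow versions you can simply import a class like this from your own module and use it in a DAG like any built-in operator:

    # custom_operator.py - hypothetical sketch of a user-defined operator
    from airflow.models.baseoperator import BaseOperator

    class MyToolOperator(BaseOperator):
        """Hypothetical operator wrapping a call to some new external tool."""

        def __init__(self, endpoint: str, **kwargs):
            super().__init__(**kwargs)
            self.endpoint = endpoint

        def execute(self, context):
            # Replace this with the real call to the new tool's API or SDK.
            self.log.info("Sending data to %s", self.endpoint)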

    Core Components

    • Web server: Flask server with Gunicorn serving the UI
    • Scheduler: Daemon in charge of scheduling workflows
    • Metastore: Database where metadata is stored
    • Executor: Class defining how your tasks should be executed
    • Worker: Process/subprocess executing your task

    Concepts

    DAG

    Directed Acyclic Graph. In Airflow, a DAG is a data pipeline: a set of tasks with directed dependencies between them and no cycles.
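
    A minimal sketch of a DAG definition (the dag_id, schedule, and command are hypothetical):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # A DAG is just a Python object describing the pipeline:
    # an ID, a start date, a schedule, and the tasks declared inside it.
    with DAG(
        dag_id="my_first_dag",              # hypothetical name
        start_date=datetime(2022, 6, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        hello = BashOperator(task_id="hello", bash_command="echo hello")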

    Operator

    A wrapper around the task you want to execute. For example, the snippet below shows the kind of logic an operator wraps: connect to a database and insert a row.

    # Pseudocode for the task an operator wraps: "connect" and "insert"
    # stand for whatever your task actually does.
    db = connect(host, credentials)
    db.insert(sql_request)
    
    • Action Operators: actually execute a function or a command, such as the BashOperator or the PythonOperator
    • Transfer Operators: transfer data between a source and a destination, such as the PrestoToMySqlOperator
    • Sensor Operators: differ from the other operators in that they wait for something to happen before moving on to the next task, such as the FileSensor (all three kinds are illustrated in the sketch after this list)
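
    A minimal sketch with one operator of each kind inside a DAG (the dag_id, file path, and Python function are hypothetical; the FileSensor relies on Airflow's default "fs_default" connection):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator
    from airflow.operators.python import PythonOperator
    from airflow.sensors.filesystem import FileSensor

    def _insert_rows():
        # Placeholder for the connect/insert logic sketched above.
        print("inserting rows")

    with DAG(
        dag_id="operator_examples",          # hypothetical name
        start_date=datetime(2022, 6, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Sensor operator: waits for a file to land before the pipeline moves on.
        wait_for_file = FileSensor(task_id="wait_for_file", filepath="data.csv")

        # Action operators: actually execute a bash command or a Python function.
        extract = BashOperator(task_id="extract", bash_command="echo extracting")
        load = PythonOperator(task_id="load", python_callable=_insert_rows)

        # Transfer operators such as PrestoToMySqlOperator ship in provider
        # packages and follow the same pattern.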

    Task

    A task is an operator that has been added to your data pipeline: once an operator is part of a DAG, it is a task.

    Task Instance

    As soon as you trigger your data pipeline, its tasks become task instances: a task instance represents one specific run of a task. In other words, as soon as an operator is executed, it becomes a task instance.

    Workflow

    A workflow is the combination of a DAG, its operators/tasks, and the dependencies between them: when the DAG is triggered, everything runs together as one workflow, as in the sketch below.
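
    A minimal sketch (task names are hypothetical) of how dependencies are declared with the >> operator:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="workflow_example",           # hypothetical name
        start_date=datetime(2022, 6, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        extract = BashOperator(task_id="extract", bash_command="echo extract")
        transform = BashOperator(task_id="transform", bash_command="echo transform")
        load = BashOperator(task_id="load", bash_command="echo load")

        # The dependencies below turn these three tasks into a workflow:
        # extract runs first, then transform, then load.
        extract >> transform >> load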

    What Airflow is not?

    Airflow is neither a data streaming solution such as Apache Flink nor a data processing framework such as Apache Spark.

    Architecture

    One Node Architecture

    • The web server fetches some metadata from the Metastore
    • The scheduler talks to the Metastore and to the executor in order to send it the tasks you want to execute
    • The executor updates the status of the tasks in the Metastore
    • These components talk to each other through the Metastore
    • The executor defines how your tasks will be executed
    • A queue is used to define the order in which your tasks will be executed
    • In this architecture, the queue is part of the executor (see the airflow.cfg sketch below)
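
    A rough airflow.cfg sketch for this architecture (the connection string is hypothetical; in newer Airflow releases the sql_alchemy_conn key lives in the [database] section):

    # airflow.cfg - single-node sketch
    [core]
    # All tasks run as (sub)processes on the same machine as the scheduler.
    executor = LocalExecutor
    # Metastore shared by the web server, the scheduler and the executor.
    sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow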

    Multi-Node Architecture (Celery)

    • The queue is external to the executor, unlike in the one-node architecture
    • You have multiple nodes (machines), and on each of them an Airflow worker is running
    • As usual, the web server fetches some metadata and the status of your tasks from the Metastore
    • The scheduler talks to the Metastore and to the executor in order to schedule the tasks
    • The executor pushes the tasks you want to trigger into the queue
    • Once your tasks are in the queue, the Airflow workers fetch them and execute them on their own machines (see the airflow.cfg sketch below)
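
    A rough airflow.cfg sketch for this architecture (URLs and credentials are hypothetical):

    # airflow.cfg - Celery multi-node sketch
    [core]
    executor = CeleryExecutor

    [celery]
    # External queue (broker) the executor pushes task messages to,
    # typically Redis or RabbitMQ.
    broker_url = redis://localhost:6379/0
    # Backend where workers report the state of executed tasks.
    result_backend = db+postgresql://airflow:airflow@localhost:5432/airflow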

    How it works

    1. A new pipeline (a Python file) is added to the dags folder
    2. Both the web server and the scheduler parse the dags folder in order to become aware of your new data pipeline (you can reproduce this parsing step with the sketch after this list)
    3. As soon as your data pipeline is ready to be triggered, a DagRun object is created in the Metastore
    4. The DagRun object has the state "running" in the Metastore
    5. The first task is scheduled
    6. When a task is ready to be triggered in your data pipeline, a task instance object is created
    7. The executor runs the task instance and updates the task instance object, i.e. the status of the task, in the Metastore
    8. Once your task is done, its state is updated in the Metastore, the scheduler checks it, and the web server updates the UI
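
    As a quick way to see the parsing step yourself, you can load a dags folder with Airflow's DagBag class; a minimal sketch (the folder path is hypothetical):

    from airflow.models import DagBag

    # Parse a dags folder and list what Airflow would discover in it.
    bag = DagBag(dag_folder="/path/to/dags", include_examples=False)

    print(list(bag.dags))        # dag_ids discovered in the folder
    print(bag.import_errors)     # files that failed to parse, if any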

    Reference

    https://airflow.apache.org/docs/

    https://www.youtube.com/watch?v=jnxlEWZwsNU 
