Apache Airflow Short Introduction And Simple DAG – Cool Airflow Introduction In 5 Min!


Apache Airflow is software you can easily use to schedule and monitor your workflows. It is written in Python. Like any software, Airflow is built around a set of concepts that describe its main, atomic functionalities.

Introduction

Apache Airflow

Apache Airflow is a platform to programmatically author, schedule, and monitor workflows. It is an open-source project that allows users to define and execute workflows as directed acyclic graphs (DAGs) of tasks.

Airflow was developed as a solution for ETL (extract, transform, load) pipelines and has since grown to support a wide variety of use cases such as data pipelines, machine learning workflows, and many others.

Here are some key features of Apache Airflow:

  • DAG (Directed Acyclic Graph): A DAG is a directed acyclic graph of tasks, represented as Python code. Each task in the DAG represents an operation that needs to be performed, and the dependencies between tasks represent the order in which they should be executed.
  • Operators: Operators are the building blocks of Airflow. They represent individual tasks in a DAG and define the execution logic for each task. Airflow comes with a wide variety of built-in operators for common tasks such as running a Python function, executing a Bash script, or reading and writing data to and from databases.
  • Scheduling: Airflow allows you to specify when and how often your DAGs should be executed. You can use cron-like syntax to specify when tasks should run, and Airflow will take care of scheduling and executing the tasks at the specified times.
  • Monitoring: Airflow provides a web-based UI for monitoring and debugging your workflows. The UI allows you to view the status of your DAGs and tasks, see the execution history and logs, and even trigger tasks manually.
  • Task: A task is an instantiation of an operator within a DAG; a specific run of a task is called a task instance.
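The core idea in the list above — tasks plus dependency edges that fix the execution order — can be sketched without Airflow at all. The toy scheduler below (plain Python, not Airflow code) topologically sorts a tiny dependency graph and "runs" each task in order:

```python
from graphlib import TopologicalSorter

# A toy DAG: extract must run before transform, transform before load.
# Each key depends on the tasks listed in its value set.
dag = {
    "transform": {"extract"},
    "load": {"transform"},
    "extract": set(),
}

def run(task_name: str) -> str:
    # Stand-in for real work (a bash command, a Python callable, ...).
    return f"ran {task_name}"

# TopologicalSorter yields a task only after all of its dependencies.
order = list(TopologicalSorter(dag).static_order())
results = [run(t) for t in order]
print(order)  # ['extract', 'transform', 'load']
```

Airflow does the same kind of ordering, but adds scheduling, retries, distribution across workers, and monitoring on top.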

Apache Airflow Operators

Please find below a list of some of the operators available in Airflow 2.2.5 (in the 2.x series, many of these ship in separate provider packages):

  • BashOperator: Executes a Bash command or script as a task in your workflow.
  • BranchPythonOperator: Chooses which task to run next based on the output of a Python function.
  • CheckOperator: Performs a check using a SQL query; the task fails if the query does not return a truthy result.
  • DockerOperator: Executes a command inside a Docker container as a task in your workflow.
  • DummyOperator: Does nothing and serves as a placeholder in your workflow.
  • DruidCheckOperator: Verifies the status of a Druid cluster.
  • EmailOperator: Sends an email as a task in your workflow.
  • GenericTransfer: Transfers data between two systems using a customizable SQL statement.
  • HiveToDruidTransfer: Transfers data from Hive to Druid.
  • HiveToMySqlTransfer: Transfers data from Hive to MySQL.
  • Hive2SambaOperator: Copies data from Hive to a Samba share.
  • HiveOperator: Executes a HiveQL statement as a task in your workflow.
  • HiveStatsCollectionOperator: Collects statistics about a Hive table.
  • IntervalCheckOperator: Checks that the current values of given metrics are within an expected tolerance of their values from a previous interval (e.g., the same query run a number of days back).
  • JdbcOperator: Executes a SQL statement on a JDBC-compatible database as a task in your workflow.
  • LatestOnlyOperator: Ensures that only the latest run of a task is executed.
  • MsSqlOperator: Executes a SQL statement on a Microsoft SQL Server database as a task in your workflow.
  • MsSqlToHiveTransfer: Transfers data from Microsoft SQL Server to Hive.
  • MySqlOperator: Executes a SQL statement on a MySQL database as a task in your workflow.
  • MySqlToHiveTransfer: Transfers data from MySQL to Hive.
  • OracleOperator: Executes a SQL statement on an Oracle database as a task in your workflow.
  • PigOperator: Executes a Pig Latin script as a task in your workflow.
  • PostgresOperator: Executes a SQL statement on a PostgreSQL database as a task in your workflow.
  • PrestoCheckOperator: Performs a check against Presto using a SQL query.
  • PrestoIntervalCheckOperator: Runs an interval check against Presto, comparing current metric values with those from a previous interval.
  • PrestoToMySqlTransfer: Transfers data from Presto to MySQL.
  • PrestoValueCheckOperator: Verifies the output of a Presto query against an expected value.
  • PythonOperator: Executes a Python function as a task in your workflow.
  • PythonVirtualenvOperator: Executes a Python function in a specified virtual environment as a task in your workflow.
  • S3FileTransformOperator: Transforms a file in S3 using a provided script.
  • S3ToHiveTransfer: Transfers data from S3 to Hive.
  • S3ToRedshiftTransfer: Transfers data from S3 to Redshift.
  • ShortCircuitOperator: Skips downstream tasks based on the output of a Python function.
  • SimpleHttpOperator: Makes an HTTP request as a task in your workflow.
  • SlackAPIOperator: Sends a message to a Slack channel using the Slack API as a task in your workflow.
  • SlackAPIPostOperator: Posts a message to a Slack channel using the Slack API as a task in your workflow.
  • SqliteOperator: Executes a SQL statement on a SQLite database as a task in your workflow.
  • TriggerDagRunOperator: Triggers the execution of another DAG as a task in your workflow.
  • ValueCheckOperator: Verifies the output of a SQL query against an expected value.

Apache Airflow Architecture

Apache Airflow is a distributed platform for programmatically authoring, scheduling, and monitoring workflows. It consists of several components that work together to provide a complete workflow management solution.

Here is a high-level overview of the main components of the Apache Airflow architecture:

  • Scheduler: The scheduler is responsible for detecting new DAGs and tasks, and for determining which tasks should be executed based on their dependencies and execution schedules. It also handles retrying tasks that fail.
  • Webserver: The webserver provides a web-based UI for interacting with Airflow. It allows users to view and manage their DAGs, view the status of tasks and their execution history, and trigger tasks manually.
  • Executor: The executor is responsible for executing tasks on worker nodes. Airflow supports several types of executors, including the SequentialExecutor, LocalExecutor, and CeleryExecutor.
  • Worker: The worker is a daemon process that runs on each worker node and listens for tasks to be executed. When a task is scheduled by the scheduler, the worker picks up the task and executes it.
  • Metadata database: The metadata database stores information about DAGs, tasks, and their execution history. It is used by the scheduler and webserver to keep track of the state of the workflow and to provide information to users.
  • Message queue: The message queue is used by the CeleryExecutor to distribute tasks to workers. It allows tasks to be executed asynchronously, allowing the scheduler to scale horizontally.

Overall, the Apache Airflow architecture is designed to be flexible and scalable, allowing it to handle a wide range of workloads and use cases.
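These components map directly onto a few lines of AIRFLOW_HOME/airflow.cfg. A hedged sketch (section and option names as in Airflow 2.2; the connection strings are placeholders, not values from this post):

```ini
[core]
# SequentialExecutor (the default with SQLite, one task at a time),
# LocalExecutor (parallel tasks on one machine),
# or CeleryExecutor (distributed workers fed by a message queue)
executor = LocalExecutor

# Metadata database; SQLite only supports the SequentialExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow

[celery]
# Only needed for the CeleryExecutor: the message queue for the workers
broker_url = redis://localhost:6379/0
```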

Deferrable Operators & Triggers

In Apache Airflow (2.2 and later), a deferrable operator is an operator that can suspend itself while it waits for an external condition, freeing up its worker slot instead of blocking a worker for the whole wait. This makes long-running sensors and polling tasks much cheaper to run.

Two pieces work together here:

  • Trigger: A trigger is a small, asynchronous piece of Python code that encapsulates the condition being waited for (for example, reaching a point in time). Triggers run together inside a dedicated triggerer process, so many of them can wait concurrently.
  • Deferral: A deferrable operator calls self.defer(), passing a trigger and the name of the method to resume in. The task instance is suspended and its worker slot released; when the trigger fires, the scheduler resumes the task in the specified method.

Create Simple DAG With Two Operators

As an exercise, we will create a DAG with two BashOperator tasks. Both of them will print a message to the console.

First of all, we have to create a new Python file in the AIRFLOW_HOME/dags directory. The Airflow scheduler monitors this folder at a regular interval, so after a few seconds you will be able to see your DAG in the Airflow UI.

In my case AIRFLOW_HOME=/home/pawel/airflow, which means I need to put my DAGs into the /home/pawel/airflow/dags folder. Below you can find the DAG implementation; I saved it as simple-dag.py in the dags directory.

from datetime import datetime

from airflow.models import DAG
from airflow.operators.bash import BashOperator

args = {
    'owner': 'Pawel',
    # Note: no leading zeros in the date parts (02 is invalid in Python 3)
    'start_date': datetime(2018, 12, 2, 16, 40, 0),
    'email': ['bigdataetlcom@gmail.com'],
    'email_on_failure': False,
    'email_on_retry': False
}
dag = DAG(dag_id='Simple_DAG', default_args=args, schedule_interval='@daily', concurrency=1, max_active_runs=1,
          catchup=False)
task_1 = BashOperator(
    task_id='task_1',
    bash_command='echo Hello, This is my first DAG in Airflow!',
    dag=dag
)
task_2 = BashOperator(
    task_id='task_2',
    bash_command='echo With two operators!',
    dag=dag
)
# task_1 must complete before task_2 starts
task_1 >> task_2

Airflow UI

In the configuration file (AIRFLOW_HOME/airflow.cfg) you can find the "endpoint_url" parameter. By default it is set to http://localhost:8080. If you are running Airflow on your local machine, you should be able to open this URL in your browser. If Airflow is installed on a remote host, I recommend changing "localhost" to the public IP of your Airflow host; please change the "base_url" parameter as well.

endpoint_url = http://localhost:8080
base_url = http://localhost:8080

Run Created DAG

You should be able to see the newly created DAG in the Airflow UI, as in the picture below. I have already enabled it, which is why the "On" toggle has a dark green background. When you trigger your DAG, you should see new tasks appear in the "Recent Tasks" section, moving through each task status (in the successful scenario, of course):

  1. Scheduled
  2. Queued
  3. Running
  4. Finished

The "DAG Runs" section shows information about all runs at the DAG level, whereas "Recent Tasks", as the name suggests, shows only the status of the most recent tasks of the most recent DAG run (the DAG is a container for its tasks).

Let's open the successful DAG run and look at the logs of each task. The image below shows the "Graph View" of our DAG. It is very useful when you want to verify that the tasks in your DAG execute in the correct order, and to see the execution status of each task. In case of problems, "task_2", for example, will be marked yellow (when its status is set to "up_for_retry") or red when it has failed.

From here you can easily get to the logs of each task: click on the selected task and you will see a modal dialog with many options. One of them is the "Logs" button.


Logs of task_1


Logs of task_2


Summary

After this post you should be able to create, run, and debug a simple DAG in Airflow.

Apache Airflow is a powerful tool for orchestrating workflows in projects and organizations, and it is used by many "big players" in the IT world.

