Apache Airflow" is a software which you can easily use to schedule and monitor your workflows. [ Apache Airflow Short introduction and simple DAG" ]. It’s written in Python". As each software Airflow" also consist of concepts which describes main and atomic functionalities.
Introduction
Apache Airflow
Apache Airflow" is a platform to programmatically author, schedule, and monitor workflows. It is an open-source project that allows users to define and execute workflows as directed acyclic graphs (DAGs) of tasks.
Airflow was developed as a solution for ETL" (extract, transform, load) pipelines and has since grown to support a wide variety of use cases such as data pipelines, machine learning" workflows, and many others.
Here are some key features of Apache Airflow":
- DAG (Directed Acyclic Graph): A DAG is a directed acyclic graph of tasks, represented as Python code. Each task in the DAG represents an operation that needs to be performed, and the dependencies between tasks represent the order in which they should be executed.
- Operators: Operators are the building blocks of Airflow. They represent individual tasks in a DAG and define the execution logic for each task. Airflow comes with a wide variety of built-in operators for common tasks such as running a Python function, executing a Bash script, or reading and writing data to and from databases.
- Scheduling: Airflow allows you to specify when and how often your DAGs should be executed. You can use cron-like syntax to specify when tasks should run, and Airflow will take care of scheduling and executing the tasks at the specified times.
- Monitoring: Airflow provides a web-based UI for monitoring and debugging your workflows. The UI allows you to view the status of your DAGs and tasks, see the execution history and logs, and even trigger tasks manually.
- Task: A task is an operator instantiated within a DAG; a specific run of that task is called a task instance. A minimal example of these concepts follows this list.
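To make these concepts concrete, here is a minimal sketch of a DAG with a single task. It is not part of the original example; the dag_id, function, and cron schedule (every day at 06:00) are chosen purely for illustration.
from datetime import datetime

from airflow.models import DAG
from airflow.operators.python import PythonOperator


def say_hello():
    print("Hello from Airflow!")


# schedule_interval accepts cron-like syntax: here, every day at 06:00.
with DAG(
    dag_id="concepts_demo",
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 6 * * *",
    catchup=False,
) as dag:
    # A task is an operator instantiated inside the DAG.
    hello = PythonOperator(
        task_id="say_hello",
        python_callable=say_hello,
    )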
Apache Airflow Operators
Below is a list of some of the operators available in Airflow 2.2.5 (a short usage sketch follows the list):
- BashOperator: Executes a Bash command or script as a task in your workflow.
- BranchPythonOperator: Chooses which task to run next based on the output of a Python function.
- CheckOperator: Verifies the state of an external system or data source.
- DockerOperator: Executes a command inside a Docker container as a task in your workflow.
- DummyOperator: Does nothing and serves as a placeholder in your workflow.
- DruidCheckOperator: Verifies the status of a Druid cluster.
- EmailOperator: Sends an email as a task in your workflow.
- GenericTransfer: Transfers data between two systems using a customizable SQL statement.
- HiveToDruidTransfer: Transfers data from Hive to Druid.
- HiveToMySqlTransfer: Transfers data from Hive to MySQL.
- Hive2SambaOperator: Copies data from Hive to a Samba share.
- HiveOperator: Executes a HiveQL statement as a task in your workflow.
- HiveStatsCollectionOperator: Collects statistics about a Hive table.
- IntervalCheckOperator: Verifies the state of an external system or data source at a regular interval.
- JdbcOperator: Executes a SQL statement on a JDBC-compatible database as a task in your workflow.
- LatestOnlyOperator: Ensures that only the latest run of a task is executed.
- MsSqlOperator: Executes a SQL statement on a Microsoft SQL Server database as a task in your workflow.
- MsSqlToHiveTransfer: Transfers data from Microsoft SQL Server to Hive.
- MySqlOperator: Executes a SQL statement on a MySQL database as a task in your workflow.
- MySqlToHiveTransfer: Transfers data from MySQL to Hive.
- OracleOperator: Executes a SQL statement on an Oracle database as a task in your workflow.
- PigOperator: Executes a Pig Latin script as a task in your workflow.
- PostgresOperator: Executes a SQL statement on a PostgreSQL database as a task in your workflow.
- PrestoCheckOperator: Verifies the status of a Presto cluster.
- PrestoIntervalCheckOperator: Verifies the state of a Presto cluster at a regular interval.
- PrestoToMySqlTransfer: Transfers data from Presto to MySQL.
- PrestoValueCheckOperator: Verifies the output of a Presto query against an expected value.
- PythonOperator: Executes a Python function as a task in your workflow.
- PythonVirtualenvOperator: Executes a Python function in a specified virtual environment as a task in your workflow.
- S3FileTransformOperator: Transforms a file in S3 using a provided script.
- S3ToHiveTransfer: Transfers data from S3 to Hive.
- S3ToRedshiftTransfer: Transfers data from S3 to Redshift.
- ShortCircuitOperator: Skips downstream tasks based on the output of a Python function.
- SimpleHttpOperator: Makes an HTTP request as a task in your workflow.
- SlackAPIOperator: Sends a message to a Slack channel using the Slack API as a task in your workflow.
- SlackAPIPostOperator: Posts a message to a Slack channel using the Slack API as a task in your workflow.
- SqliteOperator: Executes a SQL statement on a SQLite database as a task in your workflow.
- TriggerDagRunOperator: Triggers the execution of another DAG as a task in your workflow.
- ValueCheckOperator: Verifies the output of a SQL query against an expected value.
Apache Airflow Architecture
Apache Airflow" is a distributed platform for programmatically authoring, scheduling, and monitoring workflows. It consists of several components that work together to provide a complete workflow management solution.
Here is a high-level overview of the main components of the Apache Airflow" architecture:
- Scheduler: The scheduler is responsible for detecting new DAGs and tasks, and for determining which tasks should be executed based on their dependencies and execution schedules. It also handles retrying tasks that fail.
- Webserver: The webserver provides a web-based UI for interacting with Airflow. It allows users to view and manage their DAGs, view the status of tasks and their execution history, and trigger tasks manually.
- Executor: The executor is responsible for executing tasks on worker nodes. Airflow supports several types of executors, including the SequentialExecutor, LocalExecutor, and CeleryExecutor.
- Worker: The worker is a daemon process that runs on each worker node and listens for tasks to be executed. When a task is scheduled by the scheduler, the worker picks up the task and executes it.
- Metadata database: The metadata database stores information about DAGs, tasks, and their execution history. It is used by the scheduler and webserver to keep track of the state of the workflow and to provide information to users.
- Message queue: The message queue is used by the CeleryExecutor to distribute tasks to workers. It allows tasks to be executed asynchronously and lets the pool of workers scale horizontally.
Overall, the Apache Airflow" architecture is designed to be flexible and scalable, allowing it to handle a wide range of workloads and use cases.
Deferrable Operators & Triggers
In Apache Airflow", a deferrable operator is an operator that can be configured to delay its execution until a specified time or until a certain condition is met. This allows you to build more flexible and dynamic workflows.
There are two main types of deferrable operators in Airflow":
- TriggeredOperator: A
TriggeredOperator
is an operator that waits for a trigger before executing its task. The trigger can be a specified time, an external event, or the completion of another task. - DeferredMixin: The
DeferredMixin
is a mixin class that can be added to any operator to make it deferrable. It allows you to specify a delay in the execution of the operator, either as an absolute time or as a relative time (e.g., “3 hours from now”).
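Below is a minimal sketch of a hypothetical deferrable operator, assuming Airflow 2.2+ with a triggerer process running. The class name, wait duration, and log message are invented for illustration; TimeDeltaTrigger and BaseOperator.defer() are the pieces provided by Airflow.
from datetime import timedelta

from airflow.models.baseoperator import BaseOperator
from airflow.triggers.temporal import TimeDeltaTrigger


class WaitThenRunOperator(BaseOperator):
    """Hypothetical operator that waits without blocking a worker slot."""

    def __init__(self, wait_minutes=5, **kwargs):
        super().__init__(**kwargs)
        self.wait_minutes = wait_minutes

    def execute(self, context):
        # Hand the wait over to the triggerer and release the worker slot.
        self.defer(
            trigger=TimeDeltaTrigger(timedelta(minutes=self.wait_minutes)),
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # Airflow resumes the task here once the trigger has fired.
        self.log.info("Wait finished, doing the actual work now.")
When execute() calls self.defer(), the task enters the deferred state and the triggerer takes over; only after the TimeDeltaTrigger fires is the task sent back to a worker to run execute_complete().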
Create Simple DAG With Two Operators
As an exercise, we will create a DAG with two BashOperator tasks. Both of them will print a message to the console.
First of all, we have to create a new Python file in the AIRFLOW_HOME/dags directory. The Airflow scheduler monitors this folder at a regular interval, so after a few seconds you will be able to see your DAG in the Airflow UI.
In my case AIRFLOW_HOME=/home/pawel/airflow, which means I need to put my DAGs into the /home/pawel/airflow/dags folder. Below you can find the DAG implementation. I saved it as simple-dag.py in the dags directory.
import airflow
from airflow.operators.bash_operator import BashOperator
from airflow.models import DAG
from datetime import datetime, timedelta

args = {
    'owner': 'Pawel',
    'start_date': datetime(2018, 12, 2, 16, 40, 0),
    'email': ['bigdataetlcom@gmail.com'],
    'email_on_failure': False,
    'email_on_retry': False
}

dag = DAG(
    dag_id='Simple_DAG',
    default_args=args,
    schedule_interval='@daily',
    concurrency=1,
    max_active_runs=1,
    catchup=False
)

task_1 = BashOperator(
    task_id='task_1',
    bash_command='echo Hello, This is my first DAG in Airflow!',
    dag=dag
)

task_2 = BashOperator(
    task_id='task_2',
    bash_command='echo With two operators!',
    dag=dag
)

# task_2 runs only after task_1 has finished successfully.
task_1 >> task_2
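Before waiting for the scheduler to pick the file up, you can sanity-check it from the command line. These are standard Airflow 2.x CLI commands; the dag_id, task_id, and date simply match the example above.
# Check that the file parses and the DAG is registered
airflow dags list

# Run a single task locally, without the scheduler and without recording state in the UI
airflow tasks test Simple_DAG task_1 2018-12-02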
Airflow UI
In the configuration file (AIRFLOW_HOME/airflow.cfg) you can find the “endpoint_url” parameter. By default it is set to http://localhost:8080. If you are running your Airflow service on your local machine, you should be able to open this URL in your browser. If Airflow is installed on a remote host, I recommend changing “localhost” to the public IP of your Airflow host. In that case, please change the “base_url” parameter as well.
endpoint_url = http://localhost:8080
base_url = http://localhost:8080
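If the scheduler and webserver are not already running as services, you can start them manually with the standard Airflow CLI; the port below is just the default from the configuration above.
airflow scheduler
airflow webserver --port 8080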
Run Created DAG
You should be able to see the created DAG in the Airflow UI, as in the picture below. I have already enabled it, which is why you can see the “On” toggle with a dark green background. When you trigger your DAG (you can also do this from the command line, as shown after the status list below), you should see new task instances appear in the “Recent Tasks” section, moving through each task status (in the successful scenario, of course):
- Scheduled
- Queued
- Running
- Success
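As an alternative to the “Trigger DAG” button in the UI, the same run can be started from the command line with the standard Airflow 2.x CLI, using the dag_id from our example:
airflow dags trigger Simple_DAG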

The “DAG Runs” section shows information about all runs at the DAG level, whereas “Recent Tasks”, as the name suggests, shows only the status of the tasks of the most recent DAG run (the DAG is a container for its tasks).
Let’s open" the succeeded run of DAG and see the logs of each task. In below image was presented the “Graph view” of our DAG". It is very useful when you would like to verify whether you task in DAG are in a correct order of execution and see the execution status of each task. In case of failure you will see that, for example “task_2” will be marked as yellow (when the task status will be set to “up_for_retry”) or red in case of failed.
Here you can easily go to the logs of each task, just click left button on selected task you will see the modal dialog with many options. One of them is button “Logs”. (Apache Airflow Short introduction and simple DAG" )

Logs of #Task_1

Logs of #Task_2

Summary
After this post you should be able to create, run, and debug a simple DAG in Airflow.
Apache Airflow is a powerful tool for orchestrating workflows in projects and organizations. It started in the Apache Incubator and is now a top-level Apache project used by many big players in the IT world.