Apache Airflow: Short introduction and simple DAG

Apache Airflow is a platform, written in Python, that you can easily use to schedule and monitor your workflows.

Like any tool, Airflow is built around a few core concepts that describe its main building blocks. In Airflow you will encounter:

  • DAG (Directed Acyclic Graph) – a collection of tasks which together form the workflow. In a DAG you specify the relationships between tasks (sequential or parallel execution), their order and dependencies.
  • Operator – represents a single task. From a technical point of view you can treat an operator as a “wrapper” around an operation.
  • Task – an instance of an operator for a specific run of the DAG (see the short sketch below).
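
To make these concepts concrete, here is a minimal sketch (the DAG id and operator here are illustrative only, not part of the exercise later in this post) showing how an operator instantiated inside a DAG becomes a task:

from airflow.models import DAG
from airflow.operators.dummy_operator import DummyOperator
from datetime import datetime

# The DAG is the workflow container.
concept_dag = DAG(dag_id='concept_demo', start_date=datetime(2018, 12, 2), schedule_interval='@daily')

# The operator is the "wrapper" around a single operation; attached to the DAG it defines a task.
noop = DummyOperator(task_id='noop', dag=concept_dag)

# For every scheduled run of 'concept_demo', the scheduler creates a task instance of 'noop'.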

Please find the full list of Operators available in Airflow version 1.10.1 (a short usage sketch follows the list):

  • BashOperator
  • BranchPythonOperator
  • CheckOperator
  • DockerOperator
  • DummyOperator
  • DruidCheckOperator
  • EmailOperator
  • GenericTransfer
  • HiveToDruidTransfer
  • HiveToMySqlTransfer
  • Hive2SambaOperator
  • HiveOperator
  • HiveStatsCollectionOperator
  • IntervalCheckOperator
  • JdbcOperator
  • LatestOnlyOperator
  • MsSqlOperator
  • MsSqlToHiveTransfer
  • MySqlOperator
  • MySqlToHiveTransfer
  • OracleOperator
  • PigOperator
  • PostgresOperator
  • PrestoCheckOperator
  • PrestoIntervalCheckOperator
  • PrestoToMySqlTransfer
  • PrestoValueCheckOperator
  • PythonOperator
  • PythonVirtualenvOperator
  • S3FileTransformOperator
  • S3ToHiveTransfer
  • S3ToRedshiftTransfer
  • ShortCircuitOperator
  • SimpleHttpOperator
  • SlackAPIOperator
  • SlackAPIPostOperator
  • SqliteOperator
  • SubDagOperator
  • TriggerDagRunOperator
  • ValueCheckOperator
  • RedshiftToS3Transfer
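
To give a feeling for how the operators from this list are used, here is a minimal sketch (the DAG id and callable are made up for illustration) wrapping a Python function in a PythonOperator, using the 1.10 import path:

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator
from datetime import datetime

def greet():
    # Any Python callable can be wrapped by a PythonOperator.
    print('Hello from a PythonOperator!')

operator_demo = DAG(dag_id='operator_demo', start_date=datetime(2018, 12, 2),
                    schedule_interval='@daily', catchup=False)

greet_task = PythonOperator(task_id='greet', python_callable=greet, dag=operator_demo)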

Create a simple DAG with two operators

As an exercise we will create a DAG with two BashOperator tasks. Both of them will print a message to the console.

First of all, we have to create a new Python file in the AIRFLOW_HOME/dags directory. The Airflow scheduler monitors this folder at a regular interval, so after a few seconds you will be able to see your DAG in the Airflow UI.

In my case AIRFLOW_HOME=/home/pawel/airflow, which means my DAGs need to be placed in the /home/pawel/airflow/dags folder. Below you can find the DAG implementation. I saved it as simple-dag.py in the dags directory.

from airflow.operators.bash_operator import BashOperator
from airflow.models import DAG
from datetime import datetime

# Default arguments applied to every task in this DAG.
args = {
    'owner': 'Pawel',
    'start_date': datetime(2018, 12, 2, 16, 40, 0),
    'email': ['bigdataetlcom@gmail.com'],
    'email_on_failure': False,
    'email_on_retry': False
}

# The DAG object is the container that holds the tasks.
dag = DAG(dag_id='Simple_DAG', default_args=args, schedule_interval='@daily', concurrency=1, max_active_runs=1,
          catchup=False)

# Two BashOperator tasks, each printing a message to the console.
task_1 = BashOperator(
    task_id='task_1',
    bash_command='echo Hello, This is my first DAG in Airflow!',
    dag=dag
)
task_2 = BashOperator(
    task_id='task_2',
    bash_command='echo With two operators!',
    dag=dag
)

# task_2 runs after task_1 has succeeded.
task_1 >> task_2
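
The last line uses the bitshift dependency syntax. For reference, the same ordering can also be declared with the explicit helper methods; these lines are equivalent restatements of the one above:

# All three lines declare that task_2 runs after task_1:
task_1 >> task_2               # bitshift syntax (used in the DAG above)
task_1.set_downstream(task_2)  # explicit method on the upstream task
task_2.set_upstream(task_1)    # the same dependency, declared from the downstream side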

Airflow UI

In the configuration file (AIRFLOW_HOME/airflow.cfg) you can find the “endpoint_url” parameter. By default it is set to http://localhost:8080. If you are running Airflow on your local machine, you should be able to open this URL in your browser. If Airflow is installed on a remote host, I recommend changing “localhost” to the public IP of your Airflow host. Please change the “base_url” parameter as well.

endpoint_url = http://localhost:8080
base_url = http://localhost:8080
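
For a remote installation the two parameters might look roughly like this (the placeholder stands for the actual public IP or hostname of your Airflow host):

endpoint_url = http://<public-ip-of-airflow-host>:8080
base_url = http://<public-ip-of-airflow-host>:8080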

Run the created DAG

You should be able to see the created DAG in the Airflow UI, as in the picture below. I have already enabled it, which is why you can see the “On” toggle with a dark green background. When you trigger your DAG, new tasks appear in the “Recent Tasks” section and move through each task status (in the successful scenario, of course):

  1. Scheduled
  2. Queued
  3. Running
  4. Success


The “DAG Runs” section shows information about all runs at the DAG level, whereas “Recent Tasks”, as the name suggests, shows only the status of the most recent tasks of the most recent DAG run. (The DAG is a container for its tasks.)

Let’s open the succeeded DAG run and look at the logs of each task. The image below shows the “Graph View” of our DAG. It is very useful when you want to verify that the tasks in your DAG are wired in the correct order of execution, and to see the execution status of each task. In case of a failure you will see that, for example, “task_2” is marked yellow (when the task status is set to “up_for_retry”) or red when it has failed. From here you can easily reach the logs of each task: left-click on the selected task and a modal dialog with many options will appear. One of them is the “Logs” button.


Logs of #Task_1


Logs of #Task_2


Summary

After this post you should be able to create, run and debug a simple DAG in Airflow.

Apache Airflow is a powerful tool for orchestrating workflows in projects and organizations. Although it is still in the Apache Incubator, Airflow is used by many “big players” in the IT world.

If you enjoyed this post, please leave a comment below or share it on Facebook, Twitter, LinkedIn or another social media site.

Thanks in advance!
