Fastest way to deploy Airflow to AWS or GCP

Ankur Dahiya
Published in Run[X]
Jun 23, 2021 · 4 min read


Airflow is one of the most popular tools for running workflows. It’s really powerful software with a bunch of moving parts, which makes production deployment complicated.

In this post, I’d like to demonstrate how we can simplify this process by using a new open source project called Opta. We will be creating a production-ready setup for Airflow in under 30 minutes!

About Opta: It is an infrastructure-as-code solution that aims to abstract away the complexity of cloud providers and Kubernetes — providing a clean cloud agnostic interface to deploy and run workloads.

Disclaimer: I’m one of the main people behind Opta but I’ll try to keep the bias to a minimum :)

Airflow Architecture

This is the Airflow architecture that we’ll be trying to create (relevant airflow docs). There are three types of pods that run inside Kubernetes (webserver, scheduler, and workers), and then there is a SQL database and a Redis instance (we’ll be using the underlying cloud services for these).

We’ll be using the official Airflow chart to configure the pods — and the rest will be set up by Opta.

Why not Helm directly

A brief aside on why we’re using Opta here: we already have the Airflow helm chart, don’t we?

Note that the helm chart only manages resources running inside Kubernetes. And we don’t want to run Postgres/Redis inside K8s, as that’s neither efficient nor robust; the managed cloud services for these are far more performant!

Additionally, we’d need to set up:

  • Ingress to route traffic to the pods
  • Network configuration
  • Security groups
  • Kubernetes provisioning and configuration

And we need to make sure this is all done securely so we don’t introduce unnecessary vulnerabilities! That’s why Opta is a good fit here — it sets up all these resources for us with a very robust architecture.

Opta configuration for Airflow

To deploy airflow with Opta, we need to do 2 things:

  • Set up the environment (this will set up Ingress, Kubernetes, Network etc)
  • Set up the Airflow “service” (this will create the Postgres and Redis resources and configure the Airflow helm chart)

Opta is an infrastructure-as-code (..as-config?) solution. So the first step is to write a config file for the environment. Here’s the one we will be using:
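
It looks roughly like the sketch below. The module names and exact fields here are illustrative, so double-check the Opta docs for the precise schema:

# opta-env.yml -- illustrative Opta environment file for AWS
name: airflow-env
org_name: my-org                 # placeholder, use your own org name
providers:
  aws:
    region: us-east-1            # replace with your region
    account_id: "123456789012"   # replace with your AWS account id
modules:
  - type: base                   # networking: VPC, subnets, security groups
  - type: dns
    domain: airflow.example.com  # optional, used for the DNS/SSL step later
  - type: k8s-cluster            # provisions the Kubernetes cluster (EKS on AWS)
  - type: k8s-base               # in-cluster basics: ingress, autoscaler, etc.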

Here we are selecting AWS as the cloud provider (we’ll look at GCP later in the article). Make sure to replace region and account with the appropriate values.

Additionally, we are configuring some Opta modules — this is where the magic happens! More details about these can be found in the docs.

Once this file is ready, we can install Opta and run apply.

Installation:

/bin/bash -c "$(curl -fsSL https://docs.opta.dev/install.sh)"

Note that Opta is completely open source. So do check out the code first if you’re suspicious of running things from the internet (which you should be ;))

Apply:

opta apply --config <filename>

Next step

Next is setting up the Airflow “service”. Again, we need to write a configuration file for it and then run Apply.

Here’s the file we’ll be using:
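
Roughly, it is along these lines (again, the exact module names and fields are illustrative; consult the Opta docs for the current schema):

# airflow.yml -- illustrative Opta service file
name: airflow
environments:
  - name: airflow-env
    path: "opta-env.yml"   # path to the environment file from the previous step
modules:
  - type: postgres         # backed by RDS on AWS / Cloud SQL on GCP
    name: db
  - type: redis            # backed by ElastiCache on AWS / Memorystore on GCP
    name: cache
  - type: helm-chart       # installs the official Airflow chart
    chart: airflow
    repository: https://airflow.apache.org
    namespace: airflow
    values:
      # helm values go here -- this is where the Postgres/Redis connection
      # details from the modules above get wired into the chart
      executor: CeleryExecutor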

We’re creating the Postgres DB (backed by RDS) and Redis (backed by ElastiCache), and then instantiating the Airflow helm chart. Make sure to update the env file path to point to the environment file we created in the previous step.

As you can see, Opta allows us to connect Postgres/Redis pretty seamlessly with the helm chart! The helm chart has quite a few configuration options and Opta also supports a lot more resources, so feel free to play around with this :)

Once the file is ready, we just need to run apply:

opta apply --config <filename>

Once Opta finishes, run opta output and note down load_balancer_raw_dns. Now you can access the Airflow UI at http://<load_balancer_raw_dns>. The default username and password are admin and admin.
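
The output will include a line along these lines (the DNS name below is just a placeholder):

load_balancer_raw_dns: a1b2c3d4e5-1234567890.us-east-1.elb.amazonaws.com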

To put this behind a domain name and use SSL, please follow the dns delegation steps here.

We can run a particular DAG by turning it on and then triggering a run. Then we can browse the logs for that run and make sure it ran successfully!
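
If you prefer the CLI over the UI, the same thing can be done with the standard Airflow 2.x commands from inside one of the Airflow pods (see the Debugging section below for how to get a shell). example_bash_operator is one of the bundled example DAGs:

# unpause one of the bundled example DAGs and trigger a run
airflow dags unpause example_bash_operator
airflow dags trigger example_bash_operator
# check the state of recent runs for that DAG
airflow dags list-runs -d example_bash_operator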

Debugging

To debug things or explore the underlying containers, Opta enables you to use kubectl.

Just run opta configure-kubectl and it will set up the appropriate config for kubectl. Now you can run kubectl get pods -n airflow to list all the pods we are running, and kubectl exec -it -n airflow <pod-name> -- /bin/bash to open a shell inside a pod and explore the environment.
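
For example (airflow-worker-0 is just a stand-in for whatever pod name shows up in your get pods output):

opta configure-kubectl
kubectl get pods -n airflow
kubectl exec -it -n airflow airflow-worker-0 -- /bin/bash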

Running on GCP

Running this same example on GCP is pretty straightforward — as Opta is (mostly) cloud agnostic! We just need to update the env file to point to our GCP project and run apply.
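
Concretely, that means swapping the providers block of the environment file to the GCP equivalent, roughly like this (project and region are placeholders, and the exact field names should be checked against the Opta docs):

providers:
  google:
    region: us-central1       # replace with your GCP region
    project: my-gcp-project   # replace with your GCP project id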

The “service” yml file doesn’t need to be changed as it’s completely cloud agnostic! Note that Postgres will be provisioned via Cloud SQL and Redis via Memorystore.

Running a custom DAG

This example just enables the example DAGs that come with Airflow. To use a custom DAG, you’d need to build a Docker image containing your DAGs, specify that image in the Opta yml, and run opta apply.
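
A minimal way to build such an image is to start from the official Airflow image and copy your DAGs in (the tag and paths below are just placeholders):

# Dockerfile -- bake custom DAGs into the Airflow image
FROM apache/airflow:2.1.0
COPY dags/ /opt/airflow/dags/

You’d then point the chart at the pushed image via its image values (e.g. images.airflow.repository and images.airflow.tag) in the helm-chart module of the Opta yml.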

What’s next

This was a quick overview of how we can use Opta to deploy Airflow to AWS or GCP. We were able to get a robust environment set up with minimal work!

All this code can be found on our GitHub. Make sure to check out the Airflow docs and the Opta docs for further configuration options.

If you run into any problems or have suggestions for what else you’d like to use Opta for, please let us know in the comments or in our slack :))


CEO / Co-Founder @RunX. Previously led Infrastructure @Flexport