An Opinionated Guide to Consolidating your Data

The Tool

You don’t have too many choices for FOSS ELT/ETL. For this guide, we’ll use Airbyte, an open-source data integration platform with a large catalog of source and destination connectors.

The Destination

Let’s use a data lake, specifically an Amazon S3 bucket. Because a lake stores data in its raw, unstructured form, it leaves more flexibility for later purposes, and we’ll assume that our data has not been processed or filtered yet.
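To make “raw and unstructured” concrete, here’s a small sketch of records landing in a lake as newline-delimited JSON, partitioned by source and date. The directory layout and file name are assumptions for illustration; Airbyte’s actual S3 path format is configurable.

```python
import json
from pathlib import Path

# Hypothetical lake layout: <lake root>/<source>/<date>/records.jsonl
# (illustrative only -- Airbyte's real S3 path format is configurable)
def land_raw(lake_root, source, date, records):
    """Append raw records, untouched, as JSON lines."""
    path = Path(lake_root) / source / date
    path.mkdir(parents=True, exist_ok=True)
    out = path / "records.jsonl"
    with out.open("a") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    return out

def read_raw(path):
    """Read the records back, still unprocessed and unfiltered."""
    with open(path) as f:
        return [json.loads(line) for line in f]
```

The point is that nothing is schematized or filtered on the way in; structure gets imposed later, at query time.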

The Deployment

Today, we’ll be deploying Airbyte locally on a workstation. Alternatively, you can deploy it on your own infrastructure, but that requires managing networking and security, which is unpalatable for a quick demonstration. If you want your syncs to keep running in perpetuity, though, you’ll want to deploy Airbyte somewhere other than your own machine; Airbyte’s documentation includes guides for deploying on EC2 and on Kubernetes.

git clone git@github.com:airbytehq/airbyte.git
cd airbyte
docker-compose up
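docker-compose can return before the web app is actually serving, so rather than refreshing blindly, you can poll until the UI answers. This is a small sketch; the URL and timeouts are assumptions.

```python
import time
import urllib.request
from urllib.error import URLError

def wait_for_ui(url="http://localhost:8000", timeout=180.0, interval=2.0):
    """Poll `url` until it responds, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True  # the web app answered
        except (URLError, OSError):
            time.sleep(interval)  # not up yet; try again shortly
    return False
```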

The Data Sources

Head over to localhost:8000 on your machine and complete the sign-up flow; you’ll be greeted with an onboarding workflow. We’re going to skip this workflow to mirror everyday usage of Airbyte. Click the Sources tab in the left sidebar, then click +New Source. This is where we’ll set up each of our disparate data sources.
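Under the hood, each source you create is just a connector plus a JSON configuration. The field names below are hypothetical; the real ones come from each connector’s spec, which the UI renders as a form.

```python
import json

# Hypothetical GitHub source definition -- field names are assumptions;
# the authoritative spec is the form Airbyte shows for each connector.
github_source = {
    "name": "github",
    "configuration": {
        "repository": "airbytehq/airbyte",                    # assumption
        "credentials": {"personal_access_token": "<token>"},  # assumption
    },
}

def validate_source(source):
    """Minimal sanity check before saving a source definition."""
    missing = [k for k in ("name", "configuration") if k not in source]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return json.dumps(source)
```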

The Destination (again)

Head over to the Destinations tab in the left sidebar and follow the same process to set up our connection to S3: click +New Destination, search for S3, and fill out the configuration for your bucket. This is where the access key we generated earlier comes in!
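The S3 destination form boils down to a bucket, a region, and the IAM credentials. A quick sketch of a fail-fast check is below; the field names are assumptions, and the S3 connector’s form in the UI is the authoritative list.

```python
# Hypothetical S3 destination settings -- field names are assumptions.
s3_destination = {
    "s3_bucket_name": "my-data-lake",   # assumption: your bucket's name
    "s3_bucket_region": "us-east-1",
    "access_key_id": "<from IAM>",
    "secret_access_key": "<from IAM>",
}

REQUIRED = ("s3_bucket_name", "s3_bucket_region",
            "access_key_id", "secret_access_key")

def check_destination(cfg):
    """Fail fast on blank or missing credentials instead of at sync time."""
    problems = [k for k in REQUIRED if not cfg.get(k)]
    if problems:
        raise ValueError(f"incomplete S3 config: {problems}")
    return True
```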

The Connection

Finally, head over to the Connections tab in the sidebar and click +New Connection. You’ll repeat this process for each data source you’ve set up: select an existing source, then pick the S3 destination you configured from the drop-down. I failed to set up a connection with my GitHub source, so I headed to the Airbyte Troubleshooting Discourse and filed an issue. Response times there are really fast, so I’ll likely be able to resolve this within a day or two.
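Since every source needs its own connection to the one S3 destination, the repetition is mechanical. As a sketch, here’s that fan-out expressed in code; the payload shape and the daily schedule are assumptions for illustration.

```python
def build_connections(source_ids, destination_id):
    """One connection per configured source (hypothetical payload shape)."""
    return [
        {
            "source_id": sid,
            "destination_id": destination_id,
            "schedule": {"units": 24, "time_unit": "hours"},  # assumption: daily sync
        }
        for sid in source_ids
    ]
```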

The Analysis

Now that you’ve set up your data pipelines, if you want to run transformation jobs, Trino serves that use case well; Lyft, Pinterest, and Shopify have all used it to great success. There’s also a dbt-trino plugin maintained by the folks over at Starburst. Alternatively, you could use S3 Object Lambda if you want to stay within the AWS landscape where possible.
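To give a flavor of what such a transformation might look like, here is a hypothetical Trino query, shown as a Python string you might run through the trino CLI or wrap in a dbt-trino model. The catalog, schema, table, and column names are all assumptions, and it presumes a Hive or Iceberg connector configured over the bucket.

```python
import re

# Hypothetical Trino SQL -- catalog, schema, table, and column names are
# assumptions; a Hive/Iceberg connector over the S3 bucket is presumed.
TRANSFORM_SQL = """
CREATE TABLE lake.analytics.issues_by_repo AS
SELECT repository, count(*) AS issue_count
FROM lake.raw.github_issues
GROUP BY repository
"""

def tables_referenced(sql):
    """Tiny helper: pull the fully-qualified table names out of the query."""
    return sorted(set(re.findall(r"\blake\.\w+\.\w+", sql)))
```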
