The Path to Becoming a Data Engineer

[This article was first published on DataCamp Community - python, and kindly contributed to python-bloggers.]

The world of data science is evolving, and it’s changing rapidly. In the good old days, all your data was readily available in a single database and all you needed to know as a data scientist was some R or Python to build simple scripts. I, for one, remember setting up an R script, making it munch some data from a single table, and spit out some markdown reports, all glued together in a lonely CRON job. But we all must learn a precious lesson: data grows.

As companies grow, more and more data sources inevitably get added. Some of those data arrive in batches and others stream in through various channels: terabytes, petabytes of data accumulating so quickly, your head feels like it’ll explode. Some of you might’ve recognized this years ago when you moved into a new role as a data engineer, tasked with storing data safely and correctly. If you find yourself in this sticky situation, or if you’re just getting started as a data engineer, I have some good news for you: This article provides you with all the resources you need to learn data engineering. A lot of these resources are bundled in DataCamp’s career track for data engineers.

Data engineering is essential for data-driven companies, but what do data engineers actually do? The following analogy may help:

Think of data engineers like crop farmers, who ensure that their fields are well maintained and the soil and plants are healthy. They’re responsible for cultivating, harvesting, and preparing their crops for others. This includes removing damaged crops to ensure a high-quality, high-yielding harvest.

This is similar to the work data engineers do to ensure clean, raw data can be used by other people in their organization to make data-driven business decisions.

1. Become proficient at programming

Before we dive into the tools you’ll need, you have to understand that data engineers sit at the intersection of software engineering and data science. If you want to become a data engineer, you’ll first need to become a software engineer. So you should start brushing up on foundational programming skills.

The industry standard mostly revolves around two technologies: Python and Scala.

Learn Python

With Python programming, it’s essential that you not only know how to write scripts in Python, but that you also understand how to create software. Good software is well-structured, tested, and performant. That means you should use the right algorithm for the job. These courses lay out a path to become a Python programming rockstar:

  1. Introduction to Python: Start here if Python lists or NumPy don’t ring a bell.
  2. Intermediate Python for Data Science: Start here if you don’t know how to build a loop in Python.
  3. Python Data Science Toolbox (Part 1): Start here if you’ve never written a function in Python.
  4. Software Engineering for Data Scientists in Python: Start here if you’ve never written a class in Python.
  5. Writing Efficient Python Code: Start here if you’ve never timed your Python code.

To deepen your knowledge of Python even further, take our new skill track on Coding Best Practices with Python. The knowledge you build in these courses will give you a strong foundation for writing efficient and testable code.
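To make "use the right algorithm for the job" a bit more concrete, here is a minimal sketch that times two ways of checking a list for duplicates. The function names and the input size are my own choices for illustration; the actual timings will depend on your machine.

```python
import timeit


def has_duplicates_quadratic(values):
    """Compare every pair of items: O(n^2) comparisons in the worst case."""
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] == values[j]:
                return True
    return False


def has_duplicates_set(values):
    """Remember seen items in a set: roughly O(n) for hashable items."""
    seen = set()
    for value in values:
        if value in seen:
            return True
        seen.add(value)
    return False


if __name__ == "__main__":
    data = list(range(2_000))  # worst case for both: no duplicates at all
    for func in (has_duplicates_quadratic, has_duplicates_set):
        seconds = timeit.timeit(lambda: func(data), number=3)
        print(f"{func.__name__}: {seconds:.4f}s for 3 runs")
```

Both functions return the same answer, which is exactly why timing matters: only a measurement tells you which one you can afford on a large input.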

Learn the basics of Scala

A lot of tooling in the data engineering world revolves around Scala. Scala is built on strong functional programming foundations and a static typing system. It runs on the Java Virtual Machine (or JVM), which means it’s compatible with the many Java libraries available in the open-source community. If this sounds intimidating, we can ease you in with our course Introduction to Scala.

2. Learn automation and scripting

Why automation is crucial for data engineers

Data engineers must understand how to automate tasks. Many tasks you need to perform on your data may be tedious or may need to happen frequently. For example, you might want to clean up a table in your database on an hourly schedule. This comic by xkcd says it best:

Tl;dr: If you know an automatable task takes a long time, or it needs to happen frequently, you should probably automate it.

Essential tools for automation

Shell scripting is a way to tell a UNIX server what to do and when to do it. From a shell script, you can start Python programs or run a job on a Spark cluster, for example.

CRON is a time-based job scheduler that has a particular notation to mark when specific jobs need to be executed. The best way to illustrate how it works is by giving you some examples:
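A standard crontab schedule has five fields: minute, hour, day of month, month, and day of week (with 0 meaning Sunday). As a rough sketch of how such an expression is read, the small matcher below checks whether a given moment falls on a schedule. It is a simplified illustration, not how cron itself is implemented: the helper names are hypothetical, month and weekday names (`JAN`, `MON`) are not supported, and all five fields are simply combined with AND, whereas real cron treats day of month and day of week specially when both are restricted.

```python
from datetime import datetime


def field_matches(field, value, minimum):
    """Check one cron field against a value.

    Supports '*', single numbers, comma lists, ranges ('1-5') and steps ('*/15').
    """
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            step = int(part[2:])
            if (value - minimum) % step == 0:
                return True
        elif "-" in part:
            low, high = (int(x) for x in part.split("-"))
            if low <= value <= high:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expression, moment):
    """Return True if `moment` falls on the 5-field schedule `expression`."""
    minute, hour, day, month, weekday = expression.split()
    return (
        field_matches(minute, moment.minute, 0)
        and field_matches(hour, moment.hour, 0)
        and field_matches(day, moment.day, 1)
        and field_matches(month, moment.month, 1)
        # cron counts Sunday as 0; isoweekday() counts Sunday as 7
        and field_matches(weekday, moment.isoweekday() % 7, 0)
    )


examples = {
    "0 * * * *": "every hour, on the hour",
    "*/15 * * * *": "every 15 minutes",
    "30 2 * * 1": "every Monday at 02:30",
    "0 9 * * 1-5": "weekdays at 09:00",
}
monday_night = datetime(2024, 1, 1, 2, 30)  # 1 January 2024 was a Monday
for expression, meaning in examples.items():
    print(f"{expression:<14} {meaning:<24} {cron_matches(expression, monday_night)}")
```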

Here’s an awesome website that can help you figure out the correct schedule: https://crontab.guru/. If you can’t wait to get started on shell scripting and CRON jobs, check out these courses:

  1. Introduction to Shell for Data Science
  2. Data Processing in Shell (the last lesson is about CRON)

Later in this post, I’ll talk about Apache Airflow, which is a tool that also relies on your scripting capabilities to schedule your data engineering workflows.

3. Understand your databases

Start by learning SQL basics

SQL is the lingua franca of everything related to data. It’s a well-established language and it won’t be going away anytime soon. Have a look at the following piece of SQL code:

What’s so beautiful about SQL is that it’s a declarative language. This means that the code describes what to do, not how to do it; the database works out the “query plan” for you. It also means that almost anyone can understand the piece of code I wrote here, even without prior knowledge of SQL: Return how many distinct IP addresses are used for all logins from each user.
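A query matching that description could look roughly like the one in the sketch below. The table name `logins` and the columns `user_id` and `ip_address` are assumptions made for this example, and it uses Python’s built-in SQLite module instead of PostgreSQL so that it runs without a database server.

```python
import sqlite3

# In-memory database with a small, made-up logins table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (user_id INTEGER, ip_address TEXT)")
conn.executemany(
    "INSERT INTO logins (user_id, ip_address) VALUES (?, ?)",
    [(1, "10.0.0.1"), (1, "10.0.0.2"), (1, "10.0.0.1"), (2, "10.0.0.3")],
)

# Declarative: we state which result we want, not how to compute it
query = """
    SELECT user_id, COUNT(DISTINCT ip_address) AS distinct_ips
    FROM logins
    GROUP BY user_id
    ORDER BY user_id
"""
rows = conn.execute(query).fetchall()
print(rows)  # [(1, 2), (2, 1)]
```

Nothing in the query says whether to sort, hash, or use an index; that decision is left to the database engine.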

SQL has several dialects. As a data engineer, you don’t necessarily need to know them all, but it may help to have some familiarity with PostgreSQL and MySQL. Intro to SQL for Data Science gives a gentle introduction on using PostgreSQL, and Introduction to Relational Databases in SQL goes into more detail.

Learn how to model data

As a data engineer, you also need to understand how data is modeled. Data models define how entities in a system interact and what they’re built out of. In other words, you should be able to read database diagrams, like this one:

You should recognize techniques like database normalization or a star schema. A data engineer also knows that some databases are optimized for transactions (OLTP), and others are better for analysis (OLAP). Don’t worry if these data modeling topics don’t ring a bell yet—our Database Design course covers all of them in detail.
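To give you a feel for what a star schema looks like in practice, here is a minimal sketch: one fact table with measurements in the middle, pointing at dimension tables with descriptive attributes. All table names, column names, and rows are invented for illustration, and SQLite stands in for a real analytical database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables hold descriptive attributes
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE dim_store   (store_id INTEGER PRIMARY KEY, city TEXT);

    -- The fact table holds measurements plus foreign keys to the dimensions
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        quantity   INTEGER,
        amount     REAL
    );

    INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO dim_store   VALUES (1, 'Brussels'), (2, 'Boston');
    INSERT INTO fact_sales  VALUES (1, 1, 2, 2000.0), (2, 1, 1, 300.0), (1, 2, 1, 1000.0);
""")

# A typical analytical query: join the fact table to a dimension and aggregate
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(revenue_by_category)  # [('Electronics', 3000.0), ('Furniture', 300.0)]
```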

Learn how to work with less structured data

Sometimes you’ll find yourself in a situation where data is not represented in a structured way, but is stored in a less structured document database like MongoDB. It definitely helps to know how to extract data from these. Our Introduction to MongoDB in Python course can help you with that.
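Documents from a store like MongoDB typically reach your Python code as nested dictionaries, and a common first step is to flatten them into rows with predictable keys. The sketch below does that with plain dictionaries and makes no database calls; the `flatten` helper and the sample documents are hypothetical.

```python
def flatten(document, parent_key="", separator="."):
    """Turn a nested dict into a flat dict with dotted keys."""
    flat = {}
    for key, value in document.items():
        full_key = f"{parent_key}{separator}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into sub-documents, carrying the key prefix along
            flat.update(flatten(value, full_key, separator))
        else:
            flat[full_key] = value
    return flat


# Documents in the same collection do not have to share the same fields
documents = [
    {"name": "Ada", "address": {"city": "London", "geo": {"lat": 51.5}}},
    {"name": "Linus", "tags": ["kernel", "git"]},
]
rows = [flatten(doc) for doc in documents]
for row in rows:
    print(row)
```

Note that lists are left untouched here; how to handle them (one row per element, a joined string, a separate table) is a modeling decision that depends on your use case.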

4. Master data processing techniques

So far, I’ve only covered the fundamentals of knowing how to program and automate tasks, and how to leverage SQL. Now it’s time to start building on top of that. Since you now have a strong foundation, the sky’s the limit!

Learn how to process big data in batches

First, you need to know how to get your data from several sources and process it: This is called data processing. If your datasets are small, you might get away with processing your data in R with dplyr or in Python with pandas. Or you could let your SQL engine do the heavy lifting. But if you have gigabytes or even terabytes of data, you’d be better off taking advantage of parallel processing. The benefits of using parallel processing are two-fold: (1) You can use more processing power, and (2) you can also make better use of the memory on all of the processing units.
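The basic shape of parallel processing is split, process, combine: partition the data, let each worker handle its own partition, and merge the partial results. Here is a small sketch of that pattern using only the standard library. The function names are my own, and the thread pool is just a stand-in for real processing units; for CPU-heavy Python work you would need separate processes or a cluster engine to see an actual speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Partition a list into roughly equal, contiguous chunks."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division, at least 1
    return [records[i:i + size] for i in range(0, len(records), size)]


def total_amount(chunk):
    """The work each worker does independently on its own partition."""
    return sum(chunk)


def parallel_total(records, n_workers=4):
    """Split the records, process the chunks concurrently, combine the results."""
    chunks = split_into_chunks(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(total_amount, chunks))
    return sum(partial_sums)  # combine step


print(parallel_total(list(range(1, 101))))  # 5050
```

Each worker only ever holds its own chunk, which is the idea behind the second benefit above: the data no longer has to fit in the memory of a single machine.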

The most commonly used engine for parallel processing is Apache Spark, which, according to its website, is a unified analytics engine for large-scale data processing. Let me decode that for you: Spark provides an easy-to-use API by using common abstractions like DataFrames to do parallel processing tasks on clusters of machines.

Spark often significantly outperforms older systems for parallel processing like Hadoop MapReduce. It’s written in Scala, and it helps that it interfaces with several popular programming languages like Python and R. Lesser-known tools like Dask can be used to solve similar problems. Check out the following courses if you want to learn more:

  1. Introduction to PySpark
  2. Big Data Fundamentals with PySpark
  3. Introduction to Spark SQL in Python
  4. Parallel Computing with Dask

Data processing often happens in batches, like when there’s a scheduled daily cleaning of the prior day’s sales table. We call this batch processing because the processing operates on a collection of observations that occurred in the past.

Learn how to process big data in streams

In some cases, you might have a continuous stream of data that you want to process right away, known as stream processing. An example is filtering out mentions of specific stocks from a stream of Tweets. In this case, you might want to look into other data processing platforms like Apache Kafka or Apache Flink, which are more focused on processing streams of data. Apache Spark also has an extension called Spark Streaming to do stream processing. If you want to learn more about stream processing with Kafka or Flink, check out this gentle introduction.
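To show the difference with batch processing, here is a toy version of the stock-mention filter written with Python generators. It does not use Kafka, Flink, or Spark; the sample messages, the tickers, and the function names are made up. The point is only that each message is handled the moment it arrives, without waiting for a complete dataset.

```python
def tweet_stream():
    """Stand-in for an unbounded source; a real stream would never end."""
    yield from [
        "Thinking about buying $AAPL today",
        "Great weather in Leuven",
        "$TSLA and $AAPL both moved a lot",
        "Learning data engineering",
    ]


def filter_mentions(stream, tickers):
    """Yield (ticker, text) pairs as soon as a matching message arrives."""
    for text in stream:
        for ticker in tickers:
            # Simple substring check; good enough for a sketch
            if f"${ticker}" in text:
                yield ticker, text


matches = list(filter_mentions(tweet_stream(), ["AAPL", "TSLA"]))
for ticker, text in matches:
    print(f"{ticker}: {text}")
```

Because `filter_mentions` is itself a generator, you can chain further steps (counting, windowing, writing to a sink) behind it without ever holding the full stream in memory.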

Load the result in a target database

Finally, with the scheduled data processing job in place, you’ll need to dump the result in some kind of database. Often, the target database after data processing is an MPP database. We’ll see some examples of MPP databases in the upcoming section about cloud computing. They’re basically databases that use parallel processing to perform analytical queries.

5. Schedule your workflows

Workflow scheduling with Apache Airflow

Once you’ve built the jobs that process data in Spark or another engine, you’ll want to schedule them regularly. You can keep it simple and use CRON, as discussed earlier. At DataCamp, we choose to use Apache Airflow, a tool to schedule workflows in a data engineering pipeline. You should use whichever tool is best suited for your workflow. A simple CRON job might be enough for your use case. If the CRON jobs start adding up and some tasks depend on others, then Apache Airflow might be the tool for you. Apache Airflow has the added benefit of being scalable as it can run on a cluster using Celery or Kubernetes—but more on this later. Apache Airflow visualizes the workflows you author using Directed Acyclic Graphs, or DAGs:

The above DAG demonstrates the steps to assemble a car.

Tl;dr: You can use Airflow to orchestrate jobs that perform parallel processing using Apache Spark or any other tool from the big data ecosystem.
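If the idea of a DAG feels abstract, this small sketch may help. It is not Airflow code; it uses Python’s standard-library `graphlib` module (Python 3.9+) to express task dependencies and derive a valid execution order. The task names are hypothetical steps for assembling a car.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks that must finish before it can start
dependencies = {
    "assemble_frame": set(),
    "place_tires": {"assemble_frame"},
    "assemble_body": {"assemble_frame"},
    "apply_paint": {"place_tires", "assemble_body"},
}

execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)

# "Acyclic" matters: a cycle means no task in the loop can ever start
try:
    list(TopologicalSorter({"a": {"b"}, "b": {"a"}}).static_order())
except CycleError:
    print("cycle detected, no valid order exists")
```

The frame always comes first and the paint always comes last, while the two middle tasks have no dependency on each other. A scheduler is free to run independent tasks like these in parallel.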

The ecosystem of tools

Speaking of tools, it’s easy to get lost in all the terminology and tools related to data engineering. I think the following diagram illustrates that point perfectly:

Source: https://mattturck.com/data2019/

This diagram is very complete, but it’s not very helpful in our case. Instead of overwhelming you with an excess of information, let’s wrap up the past two sections with an illustration that should help you make sense of all of the tools that I’ve presented.

A lot of the more popular tools, like Apache Spark or Apache Airflow, are explained in more detail in our Introduction to Data Engineering course.

An observant reader might see a pattern emerging in these open-source tools. Indeed, a lot of them are maintained by the Apache Software Foundation. In fact, many of Apache’s projects are related to big data, so you might want to keep an eye out for them. The next big thing might be on its way! All of their projects are open source, so if you know some Python, Scala, or Java, you might want to have a peek at their GitHub organization.

6. Study cloud computing

The case for using a cloud platform

Next, I want you to once again think about parallel processing. Remember that in the previous section, we talked about clusters of computers. In the old days, companies that needed to handle big data would have their own data center or would rent racks of servers in a data center. That worked, and a lot of companies still do it this way if they handle sensitive data, such as banks, hospitals, or public services. The drawback of this setup is that a lot of server time goes to waste. Here’s why: Let’s say you have to do some batch processing once a day. Your data center needs to handle the peak in processing power, but the same servers would sit idle the rest of the time.

This is clearly not efficient. And we haven’t even talked about geo-replication yet, where the same data needs to be replicated in different geographical locations to be disaster-proof.

The impracticality of every company managing their servers themselves was the problem that gave rise to cloud platforms, which centralize processing power. If one customer has idle time, another might be having a peak moment, and the cloud platform can distribute processing power accordingly. Data engineers today need to know how to work with these cloud platforms. The most popular cloud platforms for companies are Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP).

Common services provided by cloud platforms

Cloud platforms provide all kinds of services that are useful to data engineers. AWS alone offered more than 165 services at the time of writing. Let me highlight a few:

  • Cloud storage: Data storage forms the foundation for data engineering. Each cloud platform provides its version of cheap storage. AWS has S3, Microsoft has Azure Storage, and Google has Google Cloud Storage. They all pretty much do the same thing: store a ton of files.
  • Computation: Each cloud platform also has its own low-level computation service. This means they provide a remote machine to do computations on. AWS has EC2, Azure has Virtual Machines, and Google has its Compute Engine. They’re configured differently, but they pretty much do the same thing.
  • Cluster management: All cloud platforms have their version of a managed cluster environment. AWS has EMR, Azure hosts HDInsight, and Google has Cloud Dataproc.
  • MPP databases: “Massively parallel processing database” is a fancy term for a database that runs over several machines and uses parallel processing to handle expensive queries. Common examples include Amazon Redshift, Azure SQL Data Warehouse, and Google BigQuery.
  • Fully managed data processing: Each cloud platform has data processing services that operate the infrastructure/DevOps setup for you. You don’t have to worry about how many machines to use in your cluster, for example. AWS has a service called AWS Data Pipeline, and there’s also Azure Data Factory and Google Cloud Dataflow. Google has open-sourced the programming model of Dataflow into another Apache project: Apache Beam.

That’s just a small subset of relevant services for data engineers. If any of this has piqued your interest, be sure to check out the first chapter of Introduction to Data Engineering, which has a lesson on cloud platforms. If you want to get more hands-on experience, check out Introduction to AWS Boto in Python.

7. Internalize infrastructure

You might be surprised to see this here, but being a data engineer means you also need to know a thing or two about infrastructure. I’m not going to go into too much detail on this topic, but let me tell you about two crucial tools: Docker and Kubernetes.

When to use Docker

First, let’s see if you recognize this situation:

You probably know where I’m going with this. Mastering Docker can help you make applications reproducible on any machine, no matter what the specifications of that machine are. It’s containerization software that helps you create a reproducible environment. It allows you to collaborate in teams and ensures that any application you make in development will work similarly in production. With Docker, data engineers waste considerably less time setting up local environments. Take Apache Kafka as an example, which can be pretty overwhelming to set up locally. You can use a platform called Confluent that packages Kafka along with other useful tools for stream processing, and the Confluent documentation provides an easy-to-follow guide on how to get started using Docker.

When to use Kubernetes

What logically follows single containers is a whole bunch of containers running on several machines. Managing them is called container orchestration in infrastructure jargon, and Kubernetes is the tool to use. You might rightfully be reminded of parallel processing and tools like Apache Spark here. In fact, you can run Spark on a Kubernetes-managed cluster. This is one of the more advanced topics in data engineering, but even newbies should be aware of it.

If you’ve made it this far, don’t get discouraged if you feel that you don’t have a full understanding of the data engineering landscape. It’s a huge field that’s constantly changing. What’s most important is to use the right tool for the job, and to not overcomplicate the big data solutions you build.

That said, it doesn’t hurt to keep up with recent developments. Here’s a handful of useful resources:

Start taking your first steps as a data engineer

That’s it! You’re at the end of the road. At this point, you’re practically a data engineer… But you must apply what you’ve learned. Building experience as a data engineer is the hardest part. Luckily, you don’t need to be an expert in all of these topics. You could specialize in one cloud platform, like Google Cloud Platform. You could even start your first pet project using one of Google’s public BigQuery datasets.

I hope you feel inspired by this blog post, and that the resources I provided are useful to you. At DataCamp, we’re committed to building our data engineering curriculum and adding courses on topics like streaming data and cloud platforms—so stay tuned!
