5 Data Engineering Projects Ideas To Put On Your Resume
Data Engineering Projects for Your Resume
In this video, Ben Rogojan discusses the types of projects that data engineers can work on to showcase their skills and add to their resumes. He provides a list of five projects, some of which have existing GitHub repositories, and outlines the components that a data engineering project should include.
Components of a Data Engineering Project
- Pull data from several kinds of source systems, such as APIs, CSVs, JSON files, or scraped websites.
- Merge disparate data sources to show that you can manage and manipulate different types of data.
- Use a tool such as Airflow to automate ingestion.
- A data storage system holds all the ingested data. Examples include BigQuery, an RDS Postgres instance, or Snowflake.
- Transformation tooling reshapes the stored data. SQL and dbt are examples.
- Data visualization tools like Tableau are used to display final results.
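The components above can be sketched end to end in a few lines. This is a minimal illustration, not a recommended stack: the sample CSV and JSON data, the table name, and every function name are assumptions made up for the example, and an in-memory SQLite database stands in for a warehouse such as BigQuery or Snowflake.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample inputs standing in for a CSV export and a JSON API response.
CSV_TEXT = "restaurant_id,name,city\n1,Pho Place,Seattle\n2,Taco Stop,Austin\n"
JSON_TEXT = '[{"restaurant_id": 1, "rating": 4.5}, {"restaurant_id": 2, "rating": 3.9}]'


def extract_csv(text):
    """Read CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def extract_json(text):
    """Parse a JSON array into a list of dicts."""
    return json.loads(text)


def merge_sources(restaurants, ratings):
    """Join the two sources on restaurant_id."""
    rating_by_id = {r["restaurant_id"]: r["rating"] for r in ratings}
    merged = []
    for row in restaurants:
        rid = int(row["restaurant_id"])  # CSV values arrive as strings
        merged.append((rid, row["name"], row["city"], rating_by_id.get(rid)))
    return merged


def load(conn, rows):
    """Store the merged rows in a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS restaurants "
        "(id INTEGER PRIMARY KEY, name TEXT, city TEXT, rating REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO restaurants VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def average_rating_by_city(conn):
    """Transformation step expressed in SQL."""
    cursor = conn.execute(
        "SELECT city, AVG(rating) FROM restaurants GROUP BY city ORDER BY city"
    )
    return cursor.fetchall()


def run_pipeline():
    """Run extract, merge, load, and transform against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    rows = merge_sources(extract_csv(CSV_TEXT), extract_json(JSON_TEXT))
    load(conn, rows)
    return average_rating_by_city(conn)


if __name__ == "__main__":
    for city, rating in run_pipeline():
        print(city, rating)
```

In a real project, a scheduler such as Airflow would call each step as a separate task, and the final query would feed a dashboard rather than a print statement.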
Five Data Engineering Projects
1. Yelp Fusion API Project
- Uses Yelp's Fusion API to extract restaurant information based on user input
- Stores extracted information in a database
- Displays results using Flask web framework
2. Spotify ETL Pipeline Project
- Extract music track information from Spotify API
- Transform the extracted information into a usable format
- Load the transformed data into a database
- Visualize the data using Tableau
3. NYC Taxi Data Pipeline Project
- Extract taxi trip data from NYC Open Data API
- Transform the extracted information into a usable format
- Load the transformed data into a database
- Visualize the data using Tableau
4. Twitter Streaming ETL Pipeline Project
- Stream tweets in real-time using Tweepy library and Twitter API
- Transform the extracted information into a usable format
- Load the transformed data into a database
- Visualize the data using Tableau
5. MovieLens Recommendation System Project
- Build a recommendation system based on user ratings of movies from MovieLens dataset.
- Store user ratings in a database.
- Use Spark to build collaborative filtering model for movie recommendations.
- Display results using Flask web framework.
Conclusion
Data engineering projects can be challenging to showcase on resumes, but by following these guidelines and working on one of these five projects, you can demonstrate your skills and experience to potential employers.
Introduction
In this section, the speaker introduces five different data engineering projects that can be used to showcase skills during interviews. The speaker emphasizes the importance of documenting the process and showing off the work.
Five Data Engineering Projects
- The speaker provides links to all five projects.
- The first project involves scraping stock and Twitter data for sentiment analysis and visualization on a website.
- The creator documented the process well, including a high-level diagram of the project's components.
- The diagram includes Kafka, Spark Streaming, Cassandra, HDFS, and other components used in the project.
Project Details
Scraping Stock and Twitter Data
- The goal of this project was to correlate Twitter sentiment with stock prices.
- They used components such as Kafka, Spark Streaming, Cassandra, and HDFS to collect and process Twitter data and stock data.
- They pulled JSON from sources such as the StockTwits API using GET requests.
Insights
- A correlation between the number of tweets and the actual change in price would have been interesting to see.
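The comparison suggested in the insight above could be computed with a Pearson correlation. The sketch below is only an illustration: the daily tweet counts and price changes are invented numbers, and the function name is an assumption, not something taken from the project.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Invented sample data: tweets per day and the same day's price change in percent.
tweet_counts = [120, 340, 90, 410, 220]
price_changes = [0.4, 1.1, -0.2, 1.5, 0.6]

if __name__ == "__main__":
    print(round(pearson(tweet_counts, price_changes), 3))
```

A value near 1 or -1 would suggest that tweet volume moves with price, while a value near 0 would suggest little linear relationship. Either way, correlation alone does not show which one drives the other.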
Project Overview
This section provides an overview of a project that uses various components such as Druid, Delta Lake, Kubernetes, and Dagster to approach a complex problem.
Approach to Complex Problems
- The creator documented the project well for others to understand.
- It is important to learn about technology stacks when working on projects.
- Focusing only on getting the right tool for the job may not be enough.
Real Estate Data Scraping
This section discusses how the project creator scraped real estate data from a website by parsing its HTML.
Scraping Real Estate Data
- The project creator scraped rental and purchase information for housing from a website by parsing its HTML.
- Scraping HTML is more difficult than pulling data from CSVs or JSON files.
- Pulling information requires digging into the page's markup to find the specific elements and starting points that hold the data.
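To make the difficulty concrete, here is a minimal sketch of pulling fields out of HTML with Python's standard `html.parser` module. The markup, the class names (`listing`, `address`, `price`), and the parser class are all hypothetical; a real site would need its own selectors, found by inspecting the page source.

```python
from html.parser import HTMLParser

# Hypothetical markup; real listing pages will be structured differently.
SAMPLE_HTML = """
<div class="listing">
  <span class="address">12 Example St</span>
  <span class="price">$1,850</span>
</div>
<div class="listing">
  <span class="address">98 Sample Ave</span>
  <span class="price">$2,400</span>
</div>
"""


class ListingParser(HTMLParser):
    """Collect the address and price text found inside each listing div."""

    def __init__(self):
        super().__init__()
        self.listings = []
        self._field = None  # name of the field whose text we are currently inside

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "listing" in classes:
            self.listings.append({})
        elif tag == "span" and self.listings:
            for name in ("address", "price"):
                if name in classes:
                    self._field = name

    def handle_data(self, data):
        if self._field and data.strip():
            self.listings[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def parse_listings(html):
    """Return a list of dicts with a text address and an integer price."""
    parser = ListingParser()
    parser.feed(html)
    parser.close()
    return [
        {
            "address": item["address"],
            "price": int(item["price"].replace("$", "").replace(",", "")),
        }
        for item in parser.listings
        if "address" in item and "price" in item
    ]


if __name__ == "__main__":
    print(parse_listings(SAMPLE_HTML))
```

Unlike a CSV or JSON source, nothing here is guaranteed to stay stable: if the site renames a class or nests the elements differently, the parser silently returns fewer rows, which is part of why scraped sources need more validation.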
UI with Dashboards
This section discusses how the project creator displayed analytics through dashboards in their user interface.
Displaying Analytics through Dashboards
- They got a decent amount of analytics despite not having too much data.
- There are interesting visualizations that can be seen by following the link provided below.
Analyzing Public GitHub Repositories
This section discusses how public GitHub repositories can be analyzed using Google BigQuery.
Analyzing Public GitHub Repositories with Google BigQuery
- Google BigQuery already hosts public GitHub repository data as a dataset that can be queried directly, so no scraping is needed.
- Felipe Hoffa ran several kinds of analytics on public GitHub repositories using BigQuery.
- He analyzed tabs versus spaces and which was more common in different programming languages.
- He also analyzed code bases.
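The counting logic behind a tabs-versus-spaces comparison can be sketched locally in Python. This is not the BigQuery query used in the original analysis; it is a small stand-in that classifies each file by its leading whitespace and tallies the result per file extension. The sample files and function names are assumptions for illustration.

```python
import os
from collections import Counter, defaultdict


def indentation_style(source):
    """Classify a file by which character starts more of its indented lines."""
    counts = Counter()
    for line in source.splitlines():
        if line.startswith("\t"):
            counts["tabs"] += 1
        elif line.startswith(" "):
            counts["spaces"] += 1
    if counts["tabs"] > counts["spaces"]:
        return "tabs"
    if counts["spaces"] > counts["tabs"]:
        return "spaces"
    return "mixed_or_none"  # a tie, or no indented lines at all


def tally_by_extension(files):
    """Take (path, contents) pairs and count indentation styles per extension."""
    tally = defaultdict(Counter)
    for path, contents in files:
        ext = os.path.splitext(path)[1]
        tally[ext][indentation_style(contents)] += 1
    return {ext: dict(counter) for ext, counter in tally.items()}


# Invented sample files standing in for rows of a repository-contents table.
SAMPLE_FILES = [
    ("a.go", "func main() {\n\tprintln(1)\n}\n"),
    ("b.py", "def f():\n    return 1\n"),
    ("c.py", "def g():\n\treturn 2\n"),
]

if __name__ == "__main__":
    print(tally_by_extension(SAMPLE_FILES))
```

At the scale of all public repositories the same grouping would be expressed in SQL and run inside the warehouse, which is the skill this project is meant to show.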
SQL and Project Examples
In this section, the speaker talks about the importance of becoming familiar with SQL, including more complex queries. They also discuss project examples that showcase different skills.
Becoming Familiar with SQL
- It is important to become familiar with SQL, including more complex queries.
- Project 4 is open-ended, allowing for creativity in building something off of it.
- Project 5 is a real example where someone built something on GitHub and showcased their work.
Project Example: Common Crawl Data Analysis
- The creator used Common Crawl as their data source to detect inflation by pulling pricing information from various sources.
- They documented how they pulled the data and displayed their overall pipeline using tools like Spark and Pandas.
Project Example: PredictIt API
- PredictIt is a platform where users can predict outcomes of events like elections or financial events.
- The API provides XML data that can be combined with JSON or CSV data from other sources, such as Twitter, to create interesting projects.
- Specific market data can also be pulled from PredictIt's API.
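Working with an XML source like the one described above mostly means flattening nested elements into rows. The sketch below uses Python's standard `xml.etree.ElementTree`; the element names (`Markets`, `Market`, `Contract`, `LastTradePrice`) and the sample values are assumptions for illustration and may not match the real API's schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML payload; the real API's element names may differ.
SAMPLE_XML = """
<Markets>
  <Market>
    <Name>Hypothetical election market</Name>
    <Contract>
      <Name>Candidate A</Name>
      <LastTradePrice>0.62</LastTradePrice>
    </Contract>
    <Contract>
      <Name>Candidate B</Name>
      <LastTradePrice>0.38</LastTradePrice>
    </Contract>
  </Market>
</Markets>
"""


def parse_markets(xml_text):
    """Flatten market XML into one row per contract."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for market in root.findall("Market"):
        market_name = market.findtext("Name")
        for contract in market.findall("Contract"):
            rows.append(
                {
                    "market": market_name,
                    "contract": contract.findtext("Name"),
                    "price": float(contract.findtext("LastTradePrice")),
                }
            )
    return rows


if __name__ == "__main__":
    for row in parse_markets(SAMPLE_XML):
        print(row)
```

Once the XML is flattened into rows like these, it can be joined with JSON or CSV data from other sources on a shared key such as a date or a candidate name.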
Conclusion
In this section, the speaker concludes by encouraging viewers to try out these project ideas themselves and share them.
Newsletter and Content Overview
In this section, the speaker talks about his newsletter and the content he covers.
Newsletter Content
- The speaker puts out one newsletter a week covering interesting startups and technology.
- The newsletter includes videos and links to articles on topics such as data analytics companies backed by Greylock VC or how Netflix achieves observability.
- The speaker shares interesting articles that he thinks people in the tech space would like to know about.
- The speaker thanks viewers for watching and encourages them to sign up for his newsletter.