Working with AWS Glue DataBrew to prepare data without writing code - Part 1/4 | Amazon Web Services
Introduction to AWS Glue DataBrew
Overview of the Learning Series
- The series is presented by Harshad Adipati, a Principal Solutions Architect at AWS, focusing on analytics.
- This is part one of a four-part series covering the fundamentals of AWS Glue DataBrew.
- Participants will learn how to load data into DataBrew and conduct exploratory data analysis without coding.
Upcoming Parts in the Series
- Part two will focus on cleaning datasets, including handling null values and filtering transforms.
- Part three will cover enriching datasets using transformations like joins and aggregations.
- Part four will delve into feature engineering techniques such as outlier handling and one-hot encoding.
Challenges in Data Preparation
Issues with Traditional Methods
- Data preparation is often time-consuming, involving multiple steps for cleaning and normalizing data before analysis.
- Existing tools can be expensive and may require rework, complicating workflows for analysts and scientists.
- Manual processes introduce complexity in maintaining reusable workflows, leading to significant overhead.
Introduction to AWS Glue DataBrew
Features of DataBrew
- AWS Glue DataBrew allows users to clean and normalize data up to 80% faster through a visual interface.
- It supports over 250 built-in transformations commonly used in daily operations.
- Users can build reusable recipes and monitor data transformation lineage easily.
Integration Capabilities
- Seamless integration with various data sources via JDBC connectors enhances flexibility for users.
- The visual interface is accessible through the AWS console or can be embedded into Jupyter Notebook or SageMaker Studio.
Demo Walkthrough of AWS Glue DataBrew
Navigating the Interface
- The demo begins with an overview of the Glue DataBrew landing page, highlighting options such as datasets, projects, recipes, and jobs.
Creating Datasets
- Users can create new datasets from supported sources or upload files directly from their desktops.
- Supported JDBC sources include Microsoft SQL Server, MySQL, Oracle, PostgreSQL, Redshift, and Snowflake.
Example Dataset Creation Process
- For this demo, sample data sources are introduced, including product sales (Excel) and customer data (CSV), which are representative of self-service analysis scenarios.
- Steps include selecting the S3 bucket where the files are stored and specifying the format (Excel or CSV) during dataset creation.
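The console steps above can also be scripted with the AWS SDK. Below is a minimal sketch using the boto3 `databrew` client; the bucket, key, and dataset names are illustrative placeholders, not values from the demo.

```python
# Sketch: registering a CSV file stored in S3 as a DataBrew dataset.
# Bucket, key, and dataset names below are illustrative placeholders.

def build_dataset_params(name: str, bucket: str, key: str) -> dict:
    """Assemble the request for databrew.create_dataset()."""
    return {
        "Name": name,
        "Format": "CSV",  # DataBrew also accepts EXCEL, JSON, PARQUET, ORC
        "Input": {
            "S3InputDefinition": {
                "Bucket": bucket,
                "Key": key,
            }
        },
    }

params = build_dataset_params("customer", "my-databrew-demo-bucket", "raw/customer.csv")
# With AWS credentials configured, the call would be:
#   import boto3
#   boto3.client("databrew").create_dataset(**params)
```

Building the request as a plain dict keeps the dataset definition reviewable and reusable across environments.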
Data Profiling and Analysis with DataBrew
Overview of Data Profiling
- Profiling begins by selecting either the entire dataset or a sample; results are saved in JSON format to an S3 bucket. Users can restrict profiling to specific columns to optimize cost and runtime.
- It is recommended to rerun profiling jobs periodically as new data is added, although this step is skipped in the demo. An IAM role with access to both DataBrew and the S3 dataset must be selected before running the job.
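The profiling setup described above can be expressed as a boto3 request. A minimal sketch, assuming a dataset named `sales`, a placeholder bucket, a placeholder IAM role ARN, and illustrative column names:

```python
# Sketch: configuring a DataBrew profile job restricted to chosen columns.
# Dataset, bucket, role, and column names are illustrative placeholders.

def build_profile_job_params(dataset: str, bucket: str, role_arn: str) -> dict:
    """Assemble the request for databrew.create_profile_job().
    The service writes profile results to S3 as JSON."""
    return {
        "Name": f"{dataset}-profile",
        "DatasetName": dataset,
        "OutputLocation": {"Bucket": bucket, "Key": "profiles/"},
        "RoleArn": role_arn,  # must grant access to DataBrew and the S3 data
        # Profiling only selected columns keeps cost and runtime down.
        "Configuration": {
            "ProfileColumns": [{"Name": "order_total"}, {"Name": "order_date"}],
        },
    }

params = build_profile_job_params(
    "sales",
    "my-databrew-demo-bucket",
    "arn:aws:iam::123456789012:role/DataBrewDemoRole",
)
# create_profile_job(**params) followed by start_job_run(Name=params["Name"])
# would launch the job when credentials are configured.
```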
Insights from Profile Output
- Upon completion of the profiling job, users receive detailed information including total records, column counts, missing or duplicate data, correlation between columns, and value distributions.
- Column statistics provide insights such as data quality metrics, histograms showing data distribution, minimum/maximum values, standard deviation, outlier detection using z-scores, and additional insights on selected columns.
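The z-score outlier detection mentioned above flags values that lie many standard deviations from the mean. A minimal stand-alone sketch of the underlying statistic (the sales figures are made-up illustrative data):

```python
# Minimal sketch of z-score outlier detection, the statistic the
# profile output uses to flag outliers.
from statistics import mean, pstdev

def zscore_outliers(values, threshold=2.5):
    """Return values whose z-score magnitude exceeds the threshold.
    Thresholds of 2.5-3 are common; note that with n samples the
    attainable z-score is capped at sqrt(n - 1), so small samples
    need a lower threshold."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs((v - mu) / sigma) > threshold]

sales = [12, 14, 13, 15, 11, 14, 13, 12, 120]  # one obvious outlier
print(zscore_outliers(sales))  # → [120]
```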
Understanding Data Lineage
- Clicking on the data lineage tab reveals how data has evolved over time and what transformations were applied across different projects. This includes tracking recipes created from these projects.
Creating a Project for Analysis
- To start analysis, users create a project by naming it and selecting a dataset (e.g., the sales dataset). Visual analysis in DataBrew works on a sample of 500 to 5,000 records.
- Users can sample random rows for their analysis; once the project is created, the dataset is displayed in a grid view at the selected sample size.
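Project creation with a random sample can likewise be sketched via boto3; the project, dataset, recipe, and role names below are illustrative placeholders:

```python
# Sketch: creating a DataBrew project that analyzes a random sample.
# Project, dataset, recipe, and role names are illustrative placeholders.

def build_project_params(name: str, dataset: str, role_arn: str) -> dict:
    """Assemble the request for databrew.create_project().
    The interactive grid view works on a sample of 500-5,000 rows."""
    return {
        "Name": name,
        "DatasetName": dataset,
        "RecipeName": f"{name}-recipe",  # each project is paired with a recipe
        "Sample": {"Type": "RANDOM", "Size": 500},  # FIRST_N / LAST_N also allowed
        "RoleArn": role_arn,
    }

params = build_project_params(
    "sales-analysis",
    "sales",
    "arn:aws:iam::123456789012:role/DataBrewDemoRole",
)
# boto3.client("databrew").create_project(**params) would open the project.
```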
Transformations Available in DataBrew
- Over 250 transformations are available within DataBrew, including common functions such as filtering datasets and changing date formats.
- Additional capabilities handle missing values and outliers through various aggregation functions, and more complex operations such as case statements or pivots require no coding.
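Under the hood, the transformations applied in the grid view are recorded as recipe steps, each a JSON object with an Action (an Operation plus Parameters). The sketch below shows the step structure only; the specific operation and parameter names are illustrative assumptions and should be checked against the DataBrew recipe actions reference before use:

```python
# Sketch of the DataBrew recipe step structure: a list of steps, each
# with an Action holding an Operation and its Parameters.
# Operation and parameter names below are illustrative, not verified.
recipe_steps = [
    {
        "Action": {
            "Operation": "REMOVE_VALUES",  # e.g. drop rows by value
            "Parameters": {"sourceColumn": "order_total"},
        }
    },
    {
        "Action": {
            "Operation": "DATE_FORMAT",  # e.g. normalize a date column
            "Parameters": {
                "sourceColumn": "order_date",
                "dateTimeFormat": "yyyy-mm-dd",
            },
        }
    },
]
# boto3.client("databrew").create_recipe(Name="sales-clean", Steps=recipe_steps)
# would register the recipe for reuse across datasets and jobs.
```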
Conclusion on DataBrew's Capabilities
- The session highlights that DataBrew simplifies data preparation through its visual interface while offering reusable recipes and easy monitoring of changes to datasets.