Snowflake VHOL - Build Data Lakes using Apache Iceberg with Snowflake and AWS Glue (14 Jan)

Introduction and Overview Starting the Webinar

Welcome and Initial Remarks

  • The session begins with a brief wait for participants to join, indicating a welcoming atmosphere.
  • A light-hearted discussion about the expected number of attendees occurs, showcasing camaraderie among presenters.
  • Participants are encouraged to use the chat for questions, emphasizing engagement and interaction during the webinar.

Q&A Process

  • Clarification is provided that recordings will be shared post-session, ensuring attendees can revisit content.
  • The Q&A feature is highlighted as the preferred method for inquiries, promoting organized communication.

Presenters Introduction Meet the Speakers

Lester's Introduction

  • Lester introduces himself as a Solutions Architect from AWS, focusing on cloud architecture design and optimization.
  • He mentions his experience working with customers across the ASEAN region and highlights his three years at AWS.

Jono's Background

  • Jonathan (Jono), a Senior Partner Solution Engineer at Snowflake, shares his four-plus years of experience in building solutions with partners.
  • Jono outlines today's agenda: integrating Apache Iceberg with Snowflake and AWS Glue through a hands-on lab format.

Understanding Snowflake What is Snowflake?

Key Features of Snowflake

  • Snowflake is described as a cloud-based data platform that serves as a single source of truth for various data types (structured, semi-structured, unstructured).
  • It offers fully managed services without requiring user maintenance while ensuring security controls and resiliency.

Scalability and Customer Base

  • The platform boasts scalability to meet workload demands; currently serving over 12,000 customers since its inception in 2014.

Versatile Use Cases

  • Snowflake supports diverse workloads, including data warehousing and data lakes, while enabling transactional database capabilities through Unistore.

Data Engineering Capabilities

  • Users can execute code directly within Snowflake for data engineering tasks without needing separate environments or additional steps.

How to Leverage AI in Today's World

The Role of AI in Data Management

  • Discussion of leveraging AI technologies, including chatbots and Snowflake Intelligence, emphasizing the ability to bring code to the data rather than moving data around.
  • Mention of leading AI models, such as those from Anthropic and OpenAI, running within the Snowflake platform, allowing application deployment on managed infrastructure.

Evolution of Snowflake

  • Explanation that Snowflake has evolved beyond a traditional data warehouse to support multiple workloads and different user personas.
  • Overview of Snowflake's architecture, highlighting its storage layer and elastic compute capabilities for efficient resource management.

Cloud Services and Governance

  • Introduction to the "brains" behind Snowflake—cloud services that manage governance, compliance, and optimization through metadata collection.
  • Description of Snowgrid as a unique capability for cross-region and cross-cloud replication, facilitating seamless data sharing across different locations.

Partnership with AWS

  • Recognition of winning Partner of the Year with AWS; 80% of Snowflake customers utilize AWS services across 25 regions.
  • Comparison indicating that using Snowflake on AWS is approximately 10% faster due to Graviton processors' efficiency.

Characteristics of Successful Organizations in the AI Era

  • Insight into how successful organizations adapt their data infrastructure for flexibility without incurring costly migrations or vendor lock-in.
  • Emphasis on performance scalability where costs drop when not in use but can respond instantly during demand spikes while maintaining security.

Importance of Apache Iceberg

  • Discussion on prioritizing Apache Iceberg as an open-source table format that simplifies processing large datasets stored in data lakes.
  • Noted industry trends showing customer preference for Iceberg due to lower implementation costs and better performance benchmarks compared to alternatives.

What is Apache Iceberg and Its Integration with AWS?

Introduction to Apache Iceberg

  • The discussion begins with a partnership between the speaker's organization and Snowflake to enable advanced data architectures.
  • Apache Iceberg was created by Netflix and later contributed to the Apache Foundation, which is why it carries the "Apache" prefix.

Integration with AWS Services

  • There is a tight integration of Iceberg with native AWS services like Glue, Kinesis, and QuickSight, facilitating seamless data operations.
  • Data ingestion can be managed through various AWS services such as Kinesis Firehose or EMR, showcasing the versatility of integrating Snowflake with these tools.

Understanding Iceberg's Functionality

  • Iceberg serves as an open table format that allows multiple engines (e.g., Trino, Glue, Snowflake) to access shared datasets while maintaining proper access controls.
  • It enables users to manage their storage effectively while ensuring compatibility across different querying engines.

Challenges in Data Management

  • Key challenges include vendor lock-in due to proprietary storage formats and potential cost increases from maintaining multiple copies of data across different engines.
  • Different teams may require various tools for distinct workloads, leading to complexity when data is stored in siloed formats.

Benefits of Using Iceberg

  • Utilizing Iceberg promotes open standards and reduces duplication costs by allowing a single copy of data stored in AWS accessible by multiple engines.
  • It enhances interoperability among compute engines, enabling teams to select appropriate tools for specific tasks while ensuring consistent data governance.

Use Cases for Data Architecture

  • A combined data lake approach consolidates storage without duplicating datasets; this allows continued use of Snowflake’s query capabilities on S3-stored iceberg tables.
  • The architecture supports bi-directional sharing between Snowflake and S3, addressing issues related to data silos and fostering collaboration among teams.

Reference Architecture Overview

  • An example reference architecture illustrates how various data sources can feed into Snowflake via methods like Snowpipe or Kinesis before being transformed within the platform.

Snowflake and Apache Polaris: Enhancing Data Governance

Leveraging Serverless Tasks and Open Source Solutions

  • Snowflake offers serverless task capabilities for executing user-defined functions (UDFs) or stored procedures, enhancing operational efficiency.
  • The introduction of Apache Polaris, an open-source catalog donated to the Apache Foundation, provides cross-engine governance capabilities for data management.

Security and Access Control

  • Snowflake emphasizes security by ensuring that users have appropriate access levels to parquet and iceberg data, facilitating accountability in data usage.
  • Once data is created within Snowflake, it can be shared with analytics tools or AI models, enabling broader application of insights derived from the data.

Catalog Management for Iceberg Tables

  • Different types of catalogs are available for managing iceberg tables; Snowflake can manage an iceberg catalog while integrating with external sources like AWS S3.
  • Data copied from AWS S3 into Snowflake is captured in a catalog that includes metadata, allowing users to understand the available datasets.

Benefits of Using Snowflake with Iceberg

  • Key benefits include tracking reads/writes to data, full interoperability with Iceberg formats, and leveraging Snowflake's features for optimized query processing based on captured metadata statistics.

Reference Architecture Overview

  • The reference architecture illustrates a flow where multiple data sources feed into Snowflake; internal tables can be created from this ingested data.
  • The Glue Data Catalog integrates with Snowflake by capturing metadata updates from S3 buckets, allowing seamless sharing across various analytics platforms like Spark or Athena.

External Catalog Integration

  • An external Glue Data Catalog allows collaboration and governance while keeping the original data in Amazon S3; this flexibility supports real-time analytics applications.
  • Metadata from the Glue catalog is shared within Snowflake, enabling the creation of iceberg tables outside of its environment while maintaining comprehensive oversight.

Data Integration and Management with Snowflake and AWS

Overview of Data Cataloging

  • The discussion begins with the integration of data into platforms like QuickSight or SageMaker, emphasizing the use of a managed catalog in Snowflake to capture metadata.
  • Collaboration with AWS Glue is highlighted, aiming for seamless access to data through an open-source approach, indicating potential availability for users today.
  • Users can read and write to external engines while maintaining transaction metadata consistency across environments like AWS and Snowflake.

Hands-On Lab Preparation

  • The session transitions into a hands-on lab; because attendance exceeded expectations, participants will create their own trial accounts on both Snowflake and AWS.
  • Participants will work with data stored in S3 buckets, creating Iceberg tables while updating the Glue catalog using Lake Formation for governance and security.

Data Set Context

  • The workshop focuses on financial use cases involving insurance data aimed at identifying customer churn or potential fraud based on recent claims activity.
  • Three datasets are introduced: quotes (existing as parquet files), customers, and policies (to be created in Snowflake), culminating in a combined Iceberg table for querying.

Lab Execution Steps

  • The lab aims to demonstrate how different datasets residing in various locations can be combined using Iceberg format while ensuring underlying data remains in S3.
  • Instructions are provided for participants to create their Snowflake trial account first, followed by an AWS trial account setup.

Account Setup Details

  • Participants are reminded to select the free plan during AWS account creation; credit card details may be required but won't incur charges unless specific conditions are met.
  • Clarification about the Snowflake trial account duration (30 days with $400 credit), along with options for multiple trial accounts using the same email address.

Getting Started with Snowflake and AWS

Signing Up for Snowflake Trial Account

  • The process begins with signing up for a free trial of Snowflake, where users are prompted to enter personal details such as first name, last name, work email, phone number, and company name.
  • Users must select the enterprise edition of Snowflake and specifically choose the US West region to ensure compatibility with the lab setup.
  • It is important to agree to the terms and conditions before proceeding. Users can skip optional questions during signup if they prefer.
  • After entering their information, users will receive an activation link via email to complete their account setup by creating a password.
  • Participants are given time to create their accounts and activate them before moving on to setting up an AWS trial account.

Creating an AWS Free Tier Account

  • Users are encouraged to create a free tier AWS account which provides $100 in credits initially; additional credits can be earned through specific actions within AWS services.
  • During account creation, users will need to provide credit card details for verification purposes but will not incur charges unless they switch plans.
  • If participants have previously created accounts in other regions (e.g., Singapore), they can still create new trial accounts in the US West region as required by the lab setup.
  • Emphasis is placed on selecting the US West region due to potential latency issues when accessing resources located elsewhere (like S3 buckets).
  • For those without credit cards or existing accounts, alternative solutions may be discussed; however, having a free trial account is recommended.

Lab Instructions Overview

  • Once both accounts are set up, participants are directed towards lab instructions provided via a shared link in chat for further guidance on what will be built during the session.
  • The workshop will cover setting up various services like S3 and Glue that interact with Snowflake later on in the process.
  • Participants should feel free to review steps post-workshop and ask questions throughout for clarification or assistance from facilitators present.

Initial Setup Steps in AWS

  • The first step involves ensuring that participants are operating within the US West 2 region in their AWS account for consistency with predefined roles and policies used during the lab activities.
  • Participants will search for CloudFormation at the top of their dashboard as part of setting up necessary infrastructure using base templates provided.

Cloud Formation and Data Governance in AWS

Setting Up Cloud Formation

  • The speaker demonstrates how to create a YAML file for cloud formation templates, allowing users to set up resources on AWS by clicking through the interface.
  • Cloud formation automates the setup of necessary services on AWS, similar to Terraform, streamlining resource creation.
  • Users can refer to the lab guide for details on created resources; initial setups are expected to complete quickly due to their simplicity.

IAM Role Configuration

  • An IAM role is created by the CloudFormation template for use with Glue and Lake Formation, which will access data stored in S3.
  • The speaker instructs users to edit the trust policy of this IAM role in the IAM console, adding Lake Formation as a trusted service.
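After the edit, the role's trust policy might look roughly like the following (a sketch; the exact role name and statement layout come from your own CloudFormation stack, with `lakeformation.amazonaws.com` being the service principal added in this step):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "glue.amazonaws.com",
          "lakeformation.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```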

Introduction to Lake Formation

  • Lake Formation is introduced as a data governance service that helps manage governance controls across datasets using S3.
  • The lab instructions need updating due to recent integrations with Lake Formation; participants are encouraged to pause and perform necessary actions.

Creating Glue Database and Table

  • Participants are guided through creating a Glue database named "iceberg" before registering tables within it.
  • A table named "quotes" is being defined within the newly created database, focusing on metadata organization.

Defining Data Location and Schema

  • Users must specify where their data will be located in S3; currently, it's an empty bucket designated for iceberg format files.
  • The schema for the quotes table can be defined by pasting JSON schema directly from provided links or documents into the interface.

Loading Data into Iceberg Format

  • After setting up databases and tables, attention shifts towards loading data into S3 in iceberg format from another public account's parquet format storage.
  • The architecture involves transferring data from one account's S3 bucket into another while converting it into iceberg table format using Glue ETL processes.

Visual ETL Pipeline Setup

Initial Steps in Visual ETL

  • Begin by clicking on "Visual ETL" and then access the job details to change the name of the job.
  • Ensure that the IAM role is set to the same Glue Snowflake lab role created earlier before saving changes.

Configuring Source Data

  • Specify the source data location, which is a public S3 bucket used for workshop purposes, and note that files are in Parquet format.
  • Discuss options for handling Change Data Capture (CDC), indicating that one method involves dumping data into Parquet files in S3.

Data Transformation and Target Configuration

  • Perform any necessary schema changes or data transformations; however, for simplicity, no additional transformations will be applied at this stage.
  • Select a target S3 bucket to write data in Iceberg format, ensuring proper compression options are chosen.

Finalizing Pipeline Setup

  • Add specific paths within your target bucket to organize where data will be stored.
  • Choose the Glue Data Catalog database for managing metadata related to Iceberg tables and save configurations.

Running and Monitoring the Pipeline

Executing the Job

  • Click on "Run" to execute the pipeline; expect it to take about one to two minutes for completion.
  • While waiting for execution, prepare AWS Lake Formation settings in a new tab.

Configuring AWS Lake Formation

  • Grant permissions under administration settings to allow external engines like Snowflake access to specified S3 locations with full table access.

Data Security and Integration

Registering Data Locations

  • Register your S3 bucket as a data lake location where Iceberg data will be stored during processing by Glue jobs.

Setting Up Permissions

  • Provide IAM credentials within Lake Formation and establish integration with Snowflake by creating necessary roles and permissions.
  • Ensure that permissions are granted correctly on databases so Snowflake can access underlying data effectively.

Job Completion Confirmation

Verifying Successful Execution

  • Confirm that Glue jobs have run successfully; check latest job status indicating successful loading of quotes data into S3 as an Iceberg table.

AWS and Snowflake Integration Setup

Overview of the Setup Process

  • The speaker confirms completion of AWS setup, including the creation of a quotes iceberg table in S3.
  • Jono emphasizes the importance of time management, aiming to get participants up and running within 10 minutes. He mentions available resources for assistance, including video reviews and trial accounts for Snowflake.

SQL Script Execution

  • Participants are instructed to log into their Snowflake account and prepare to run SQL scripts necessary for configuration.
  • A setup SQL file is provided via hyperlink; participants should download both setup SQL and workflow SQL files locally.

Creating a SQL Worksheet

  • Instructions are given on how to create a new SQL worksheet in Snowflake's UI by clicking the plus button.
  • The speaker demonstrates importing the downloaded SQL file into the worksheet using an import function.

Setting Context and Roles

  • Participants are guided to execute commands that set context by highlighting statements and using keyboard shortcuts or play buttons.
  • The process includes creating roles specific to this lab while adhering to a least privilege access security model.

Granting Permissions

  • After role creation, permissions must be granted using account admin privileges, allowing access to various components like integration, external volume, database, and warehouse.
  • Emphasis is placed on executing these permission-granting statements successfully before proceeding with data loading.
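The role-and-grant pattern described above can be sketched as follows (all identifiers here are illustrative placeholders, not the lab's exact names):

```sql
USE ROLE ACCOUNTADMIN;

-- Lab-specific role, following a least-privilege access model
CREATE ROLE IF NOT EXISTS iceberg_lab_role;

-- Grant only the access the lab needs
GRANT USAGE ON WAREHOUSE iceberg_lab_wh        TO ROLE iceberg_lab_role;
GRANT USAGE ON DATABASE  iceberg_lab_db        TO ROLE iceberg_lab_role;
GRANT USAGE ON INTEGRATION glue_catalog_int    TO ROLE iceberg_lab_role;
GRANT USAGE ON EXTERNAL VOLUME iceberg_ext_vol TO ROLE iceberg_lab_role;

-- Make the role available to your user
GRANT ROLE iceberg_lab_role TO USER my_user;
```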

Data Loading Process

  • Once permissions are set, participants can load data into specified tables as part of the lab exercise.

Workflow Integration Steps

  • A new worksheet is created for loading workflow SQL files; participants need to repeat previous steps for setting context again.

Configuring Glue Catalog Integration

  • Instructions begin for integrating AWS Glue catalog with Snowflake; correct credentials must be entered carefully.
  • Participants learn how to find their AWS account ID from their console for use in catalog name entries within their worksheets.

Finalizing Configuration Details

  • Specific instructions include entering the copied AWS account ID in two locations: catalog name and authentication block.
  • Additional details such as changing catalog URI settings to US West 2 region are also highlighted as crucial steps in finalizing configurations.
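The Glue catalog integration statement follows this general shape (a hedged sketch: `<aws-account-id>` and the role ARN are placeholders you replace with your own values, and `iceberg` is the Glue database created earlier):

```sql
CREATE CATALOG INTEGRATION glue_catalog_int
  CATALOG_SOURCE    = GLUE
  CATALOG_NAMESPACE = 'iceberg'
  TABLE_FORMAT      = ICEBERG
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::<aws-account-id>:role/glue-snowflake-lab-role'
  GLUE_CATALOG_ID   = '<aws-account-id>'
  GLUE_REGION       = 'us-west-2'
  ENABLED           = TRUE;

-- Returns the IAM user ARN and external ID that Snowflake generates,
-- which are needed for the trust policy on the AWS side
DESCRIBE CATALOG INTEGRATION glue_catalog_int;
```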

Integration Steps for Snowflake and AWS

Executing Initial Statements

  • After highlighting the necessary code, execute the statement to confirm success. This is a prerequisite before proceeding with further integration steps.

Editing Trust Policy

  • It’s crucial to accurately edit the trust policy in AWS IAM. Ensure all parameters are entered correctly to avoid issues later on.

Configuring Trust Relationships

  • In the trust policy, add the principal AWS IAM role provided by Snowflake. This establishes a connection between Snowflake and AWS through a trust relationship.

Updating Trust Policy Values

  • Copy the IAM user ARN from the output of your previous command and update it in your trust policy. This step is essential for proper access configuration.
  • Additionally, enter the external ID obtained from Snowflake into your trust relationship under external ID settings before saving or updating the policy.
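The resulting trust policy typically takes a shape like this (a sketch; both placeholder values come from the output of the Snowflake commands mentioned above):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<IAM user ARN returned by Snowflake>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<external ID returned by Snowflake>"
        }
      }
    }
  ]
}
```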

Verifying Connection Success

  • Once you’ve updated everything, run a verification statement that should return success as true, indicating that Snowflake can now access data within AWS.
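For the external volume part of the setup, Snowflake provides a built-in verification function; a minimal sketch, assuming the volume is named `iceberg_ext_vol`:

```sql
SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('iceberg_ext_vol');
-- A healthy configuration returns a JSON payload containing "success" : "true"
```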

Creating Database in Snowflake

Initiating Database Creation

  • After confirming successful connection, proceed to create a database inside Snowflake that links to an existing catalog in AWS Glue.

Collecting Metadata

  • The creation process involves collecting metadata from the catalog so that Snowflake can query data stored in an S3 bucket associated with this database.

Setting Context for Queries

  • Set context for using this new database and schema within Snowflake to enable querying of relevant data effectively.
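One common way to surface a Glue-cataloged table inside the new database is an externally managed Iceberg table (a sketch with illustrative names; the lab's script may use slightly different identifiers):

```sql
CREATE DATABASE IF NOT EXISTS glue_lab_db;
USE DATABASE glue_lab_db;
USE SCHEMA PUBLIC;

-- Metadata comes from the Glue catalog; the data files stay in the S3 bucket
CREATE ICEBERG TABLE quotes
  EXTERNAL_VOLUME    = 'iceberg_ext_vol'
  CATALOG            = 'glue_catalog_int'
  CATALOG_TABLE_NAME = 'quotes';
```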

Combining Data into Iceberg Table

Creating Iceberg Table

  • Begin creating an Iceberg table named "quote_analysis_ice," which will consolidate the various datasets into one location within S3 storage.

Specifying S3 Bucket Location

  • When creating this table, ensure you specify the correct S3 bucket location where all related data resides. The naming convention must be followed precisely for successful integration.

Inserting Combined Data

  • Insert combined data from internal tables (customer policies etc.) into the newly created iceberg table after ensuring all configurations are correct.
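The create-and-insert step might look roughly like this (a hedged sketch; table and column names are illustrative, not the lab's exact schema):

```sql
-- Snowflake-managed Iceberg table, with data files written to S3
CREATE OR REPLACE ICEBERG TABLE quote_analysis_ice
  CATALOG         = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'iceberg_ext_vol'
  BASE_LOCATION   = 'quote_analysis_ice'
AS
SELECT c.customer_id,
       c.customer_name,
       p.policy_id,
       q.quote_amount
FROM   customers c
JOIN   policies  p ON p.customer_id = c.customer_id
JOIN   quotes    q ON q.policy_id   = p.policy_id;
```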

Conclusion of Lab Session

Final Remarks

  • Upon completion of these steps, participants are encouraged to reach out with any questions regarding their lab experience. A recording will be shared for reference purposes at week’s end.

Closing Remarks and Event Announcement

Gratitude and Session Conclusion

  • The speaker expresses gratitude towards Lester for his time, indicating a collaborative effort throughout the session.
  • The session is concluded with an invitation to future events, emphasizing the importance of community engagement.

Upcoming Event: Discover AI

  • An event titled "Discover AI" is announced, featuring hands-on labs ranging from level 100 to level 300.
  • Participants are encouraged to scan a QR code for registration; the sessions will be live and not recorded, highlighting the interactive nature of the labs.

Hands-On Labs Details

  • Attendees can choose from various labs that cater to different skill levels, promoting personalized learning experiences.
  • Experts like the speaker and Lester will guide participants through the labs, enhancing familiarity with Snowflake and AWS technologies.

Final Thoughts

  • The speaker wishes everyone a good rest of their day, reinforcing a positive closing sentiment for the session.
Video description

Learn more about Snowflake:
  • Website: https://www.snowflake.com
  • Careers: http://careers.snowflake.com
  • X: https://x.com/Snowflake
  • Instagram: /_snowflake_inc
  • Facebook: /snowflakedb
  • LinkedIn: https://bit.ly/2QUexl4
  • YouTube: https://bit.ly/3lzfeeB

#Snowflake #AIDataCloud