PyIceberg: Python Development Setup

Introduction

The speaker introduces himself and the Python development setup he wants to share with the audience: a self-contained Spark environment with Iceberg support that includes a REST catalog and an S3-compatible backend.

Getting Started

  • Clone the "spark iceberg" repository and start it with Docker Compose.
  • The project comes with a Jupyter notebook.
  • Open the "Getting Started" notebook and run the first few cells to create a partitioned table of New York City taxi data.

Configuring Iceberg

  • Create a new PyIceberg configuration file named ".pyiceberg.yaml" in the home directory.
  • Set up the default configuration for the local catalog and the S3 endpoint.
  • Save the configuration file.
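
The configuration described above might look roughly like the output of the following sketch. It is not the file shown in the video: the catalog name "local" comes from the walkthrough, while the URI, the endpoint, and the credentials are assumed placeholder values for a Docker Compose setup and should be replaced with whatever the running containers expose. The helper name render_config is hypothetical.

```python
# Assumed placeholder values; adjust them to match the running containers.
LOCAL_CATALOG = {
    "uri": "http://localhost:8181",          # REST catalog endpoint (assumption)
    "s3.endpoint": "http://localhost:9000",  # S3-compatible endpoint (assumption)
    "s3.access-key-id": "admin",             # credential placeholder (assumption)
    "s3.secret-access-key": "password",      # credential placeholder (assumption)
}


def render_config(catalog_name: str, properties: dict) -> str:
    """Render a YAML document with one catalog entry under the `catalog` key."""
    lines = ["catalog:", f"  {catalog_name}:"]
    for key, value in properties.items():
        lines.append(f"    {key}: {value}")
    return "\n".join(lines) + "\n"


# Print the text that would go into the configuration file.
print(render_config("local", LOCAL_CATALOG), end="")
```

Generating the text from a dictionary keeps the endpoint values in one place, which helps when the container ports change.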

Installing PyIceberg

  • Install PyIceberg locally by running pip install -e . in the python directory of the Iceberg repository.
  • List the tables in the New York City namespace using pyiceberg list nyc.
  • Describe the New York City taxis table using pyiceberg describe nyc.taxis.
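
The list and describe commands can also be driven from Python, which is convenient inside a notebook. This is a minimal sketch under stated assumptions: the pyiceberg executable is on the PATH, the catalog is selected with the CLI's --catalog option, and the namespace and table are named nyc and nyc.taxis. The helper names cli_args and run_cli are hypothetical.

```python
import subprocess


def cli_args(command: str, identifier: str, catalog: str = "local") -> list:
    """Build the argument list for one pyiceberg CLI invocation."""
    return ["pyiceberg", "--catalog", catalog, command, identifier]


def run_cli(command: str, identifier: str, catalog: str = "local") -> str:
    """Run the CLI and return its standard output.

    Requires the Docker services to be up and the configuration file to exist.
    """
    result = subprocess.run(
        cli_args(command, identifier, catalog),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


# Example usage (needs the containers running):
#   print(run_cli("list", "nyc"))
#   print(run_cli("describe", "nyc.taxis"))
```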

Testing

The speaker demonstrates how to test PyIceberg by creating a simple throwaway test.

Creating a Test

  • Call load_catalog with the name "local", which picks up the configuration created earlier.
  • Load the New York City taxis table (nyc.taxis).
  • Create a new scan and call to_arrow on it directly.
  • Run this code to fetch the data from the services running in Docker and print it to the console.
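
The steps above can be sketched as follows. This is an illustration, not the code from the video: it assumes PyIceberg is installed with PyArrow support, the Docker services are running, a catalog named "local" is configured, and the table is called nyc.taxis. The function names fetch_taxis and bogus_check are hypothetical.

```python
def fetch_taxis(catalog_name: str = "local", table_name: str = "nyc.taxis"):
    """Load the table through the catalog and read it into a PyArrow table."""
    # Imported inside the function so this module can be imported
    # even when PyIceberg is not installed.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog(catalog_name)    # picks up the configuration file
    table = catalog.load_table(table_name)  # resolves the table via the catalog
    return table.scan().to_arrow()          # full scan, materialized with PyArrow


def bogus_check():
    """Throwaway end-to-end check: fetch the data and print it to the console."""
    data = fetch_taxis()
    print(data)
    assert data.num_rows >= 0
```

Calling bogus_check() from a test runner or an interactive session exercises the catalog, the object store, and the read path in one go, which is the point of the end-to-end setup.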

Debugging Tests

  • Set breakpoints in the test to step through the code while debugging.
  • Check schema information at a breakpoint by inspecting the row data or the file schema.
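
A breakpoint for this kind of inspection can be as small as the sketch below. It assumes that table is a PyIceberg table obtained from a catalog and that its schema() method returns the current schema; the function name inspect_schema and the debug flag are hypothetical.

```python
def inspect_schema(table, debug: bool = False):
    """Return the table schema, optionally pausing in the debugger first."""
    if debug:
        # Drops into pdb; `table` and its attributes can be explored interactively.
        breakpoint()
    return table.schema()
```

Keeping the breakpoint behind a flag means the same helper can stay in place during non-interactive test runs.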

Video description

This video will walk you through the steps required to set up the Python development environment for PyIceberg. We will set up a local instance of Spark, a REST catalog, and MinIO for querying an actual table. This makes it easy to do interactive development and test everything end to end.