PyIceberg: Python Development Setup
Introduction
The speaker introduces himself and the Python development project he would like to share with the audience. He explains that the project is a self-contained Spark setup with Iceberg support, which includes a REST catalog and an S3 backend.
Getting Started
- Clone the "spark iceberg" repository and start the environment with Docker Compose.
- The project comes with a Jupyter notebook.
- Open the "Getting Started" notebook and run the first few cells to create a partitioned table of New York City taxi data.
Configuring Iceberg
- Create a new file named ".pyiceberg.yaml".
- Define a catalog named "local" with default settings for the catalog URI and the S3 endpoint.
- Save the configuration file.
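PyIceberg reads its configuration from a `.pyiceberg.yaml` file in the home directory. The sketch below renders such a file for a single catalog. The catalog name `local` follows the talk; the endpoint addresses and credentials are assumptions about a typical local Docker setup, and the helper names `render_pyiceberg_config` and `write_config` are hypothetical.

```python
from pathlib import Path


def render_pyiceberg_config(
    catalog_name: str = "local",
    rest_uri: str = "http://localhost:8181",     # assumed REST catalog address
    s3_endpoint: str = "http://localhost:9000",  # assumed S3-compatible endpoint
    access_key: str = "admin",                   # assumed credentials
    secret_key: str = "password",                # assumed credentials
) -> str:
    """Return YAML text describing one PyIceberg catalog."""
    lines = [
        "catalog:",
        f"  {catalog_name}:",
        f"    uri: {rest_uri}",
        f"    s3.endpoint: {s3_endpoint}",
        f"    s3.access-key-id: {access_key}",
        f"    s3.secret-access-key: {secret_key}",
    ]
    return "\n".join(lines) + "\n"


def write_config(path: Path, text: str) -> Path:
    """Write the rendered configuration to `path` and return the path."""
    path.write_text(text, encoding="utf-8")
    return path
```

Calling `write_config(Path.home() / ".pyiceberg.yaml", render_pyiceberg_config())` would overwrite any existing configuration, so adjust the values to match the running containers first.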
Installing PyIceberg
- Install PyIceberg locally by running the following in the Python directory of the repository:
pip install -e .
- List the New York City tables using
pyiceberg list nyc
- Describe the New York City taxis table using
pyiceberg describe nyc.taxis
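The same listing and inspection can be done through the Python API. This is a minimal sketch, assuming the catalog is called `local`, the namespace is `nyc`, and the table is `nyc.taxis`; those identifiers are assumptions, and the helper names are hypothetical. The helpers take the catalog as an argument so they can be exercised without a running environment.

```python
def list_table_names(catalog, namespace: str) -> list[str]:
    """Return dotted table names for every table in `namespace`."""
    # Catalog.list_tables yields identifiers as tuples, e.g. ("nyc", "taxis").
    return [".".join(identifier) for identifier in catalog.list_tables(namespace)]


def describe_table(catalog, identifier: str) -> str:
    """Return the table identifier followed by its schema as text."""
    table = catalog.load_table(identifier)
    return f"{identifier}\n{table.schema()}"


def main() -> None:
    # Imported here so the helpers above work without PyIceberg installed.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("local")  # assumed catalog name
    for name in list_table_names(catalog, "nyc"):  # assumed namespace
        print(describe_table(catalog, name))
```

Run `main()` once the Docker services are up and the configuration file is in place.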
Testing
The speaker demonstrates how to test PyIceberg by creating a simple bogus test.
Creating a Test
- Call load_catalog with the name "local", which picks up the configuration created earlier.
- Load the table "New York City taxis".
- Create a new scan and convert it directly to PyArrow with to_arrow.
- Run this code to fetch the data from the Docker services and print it to the console.
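The steps above can be sketched as a small throwaway test. This is an illustration under assumptions rather than the speaker's exact code: the catalog name `local` comes from the talk, while the identifier `nyc.taxis`, the row limit, and the function names are assumptions.

```python
def read_rows(table, limit: int = 5):
    """Scan an Iceberg table and materialize at most `limit` rows via to_arrow()."""
    return table.scan(limit=limit).to_arrow()


def bogus_test() -> None:
    # Imported here so read_rows stays usable without PyIceberg installed.
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("local")          # reads the "local" entry from the config file
    table = catalog.load_table("nyc.taxis")  # assumed table identifier
    rows = read_rows(table, limit=5)
    print(rows)                              # show the fetched rows on the console
    assert rows.num_rows <= 5
```

Renaming `bogus_test` so that it starts with `test_` lets pytest collect it; it needs the Docker services to be running.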
Debugging Tests
- Insert breakpoints into tests to debug them.
- Inspect schema information at a breakpoint by querying the raw data or the file schema.
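One way to do this is with Python's built-in `breakpoint()`. The sketch below pauses after a small scan so the table's Iceberg schema and the schema of the returned data can be compared by hand; the function name `inspect_schemas` and the one-row limit are assumptions made for illustration.

```python
def inspect_schemas(table, limit: int = 1):
    """Fetch a few rows, pause in the debugger, then return both schemas."""
    arrow_table = table.scan(limit=limit).to_arrow()
    # Pauses in pdb; try `p table.schema()` or `p arrow_table.schema` at the prompt.
    # Setting the environment variable PYTHONBREAKPOINT=0 turns this call into a no-op.
    breakpoint()
    return table.schema(), arrow_table.schema
```

Pass in a table obtained from `catalog.load_table(...)` and type `c` at the debugger prompt to continue.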