AWS re:Invent 2018: A Chronicle of Airbnb Architecture Evolution (ARC407)
Introduction to Airbnb's Evolution
Overview of the Presentation
- The presentation is led by members from the production infrastructure team at Airbnb, focusing on the evolution of their platform and architecture.
- Airbnb started as a simple platform connecting hosts with guests, allowing room rentals. The initial product design was basic and focused on listings.
Evolution of Product Offerings
- Over the years, Airbnb has expanded its offerings beyond homes to include experiences, restaurants, and a luxury tier called Beyond by Airbnb.
- The talk will cover infrastructure developments that supported these expansions over several years.
Challenges Faced in 2015
Major Incident Overview
- In April 2015, a significant incident occurred where the website went down due to high CPU usage on databases.
- It took 10 days to identify that a cron job had deleted a critical memcache key, causing widespread server issues.
Technical Architecture at the Time
- At that time, Airbnb's architecture relied heavily on Amazon EC2 instances within a single region (us-east-1), using Ruby on Rails for its web services.
- The architecture included multiple clients (desktop and mobile), internal load balancers built on Nginx, and various databases managed through ActiveRecord.
Complexity of Business Logic
Database Interactions
- A single page could interact with numerous database tables; for example, one listing detail page accessed 19 core tables and additional tables across different databases.
- Another example, the checkout confirmation page, required data from over 200 tables related to payment information.
Traffic Growth Challenges
- From 2013 to 2015, traffic nearly doubled annually. By January 2015, traffic surpassed previous peak levels consistently.
Database Stability Issues
Error Rates During High Traffic
- As traffic increased significantly in early 2015, database stability became an issue; error rates reached around 30% during peak times.
Formation of Special Teams
- Due to frequent outages (hundreds of hours in total), a special team was formed in response to tackle database stability challenges throughout that year.
Performance Optimization Strategies
Understanding Regression Detection
- The team enabled the Performance Schema on the database servers to identify regression sources, focusing on specific tables and query patterns.
- Client-side logging was introduced to track where issues originated, including web handlers and lines of code, enabling quick detection of regressions.
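The client-side attribution described above can be sketched as a small wrapper that tags every query with its call site, so a slow query seen on the server can be traced back to the web handler and line of code that issued it (the function and SQL names here are hypothetical):

```python
import inspect

def annotate_query(sql):
    # Append the calling handler and line number as a SQL comment so
    # server-side slow-query logs can be traced back to application code.
    caller = inspect.stack()[1]
    return f"{sql} /* handler={caller.function} line={caller.lineno} */"

def listing_detail_handler():
    # A toy "web handler" issuing a query through the wrapper.
    return annotate_query("SELECT * FROM listings WHERE id = 42")
```

The comment travels with the query to the database, so server-side logs and the client-side logs can be joined on the same attribution.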
Measuring Database Headroom
- Database headroom was measured by replaying between 100% and 600% of production traffic against replicas in stress tests.
- Initiatives included separating tables from the main database and implementing caching for simple queries, resulting in over a 50% improvement in performance.
Addressing Connection Limitations
- The team faced connection limitations as web servers reached maximum connections with the main databases due to increased traffic.
- They adopted an open-source proxy for connection pooling and back-pressure management, which allowed the web tier to scale linearly.
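The pooling-plus-back-pressure idea can be sketched as a bounded pool that sheds load instead of queueing: when every pooled connection is busy, new requests fail fast rather than piling more connections onto the database. This is an illustrative sketch, not the actual proxy; the class name is hypothetical.

```python
import threading

class BoundedPool:
    # Minimal sketch of connection pooling with back pressure: a fixed
    # number of slots stands in for a fixed number of DB connections.
    def __init__(self, size):
        self._sem = threading.BoundedSemaphore(size)

    def acquire(self, timeout=0.01):
        # Returns False under saturation -- the caller sheds load
        # (back pressure) rather than queueing indefinitely.
        return self._sem.acquire(timeout=timeout)

    def release(self):
        self._sem.release()
```

Because the pool caps connections, adding web servers no longer multiplies the connection count seen by the database, which is what allowed linear web-tier scaling.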
Impact of Infrastructure Changes
- After deploying new infrastructure solutions, incidents related to databases decreased significantly from 15 to just 2.
- Downtime hours reduced dramatically from over 100 hours to less than three hours post-improvements.
Challenges in Deployment and Scaling
Release Protocol Issues
- The engineering team operated under a single repository model with frequent releases leading to high batch sizes (40–50 changes).
- Even with low per-change bug density, batches that large meant most releases contained at least one bad change; the resulting reverts and coordination among engineers delayed releases by an average of one and a half hours.
Locking Deployments
- During debugging or reverting processes, deployments were locked for up to 15 hours at times, hindering productivity across teams.
Exploring Service-Oriented Architecture (SOA)
- To address deployment challenges and improve ownership among engineers, the team began exploring SOA concepts in 2016.
- Initial steps included building a message bus for decoupling services communication between different systems.
Asynchronous Search Services and Dynamic Configuration
Building Asynchronous Search Services
- The team developed asynchronous search services to ensure users receive the latest pricing and availability information when searching on their pages.
- A message bus was created to allow services to publish standardized messages, enabling other services to subscribe and listen for relevant updates.
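The publish/subscribe pattern the message bus enables can be sketched with a toy in-process bus: producers publish standardized events to a topic, and any interested service registers a handler (topic and payload names are illustrative, not Airbnb's actual schema):

```python
from collections import defaultdict

class MessageBus:
    # Toy in-process stand-in for the message bus: decouples the
    # publisher from whichever services care about an event.
    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subs[topic].append(handler)

    def publish(self, topic, event):
        # Deliver the event to every subscriber of this topic.
        for handler in self._subs[topic]:
            handler(event)
```

In the search use case, a pricing or availability change is published once, and the search indexer subscribes and refreshes its data without the publisher knowing it exists.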
Dynamic Configuration Distribution
- Engineers were previously using various methods (e.g., code changes, Chef) to push configuration updates, leading to inefficiencies.
- The introduction of a dynamic configuration distribution system reduced propagation delays from minutes to seconds, becoming essential for feature management and experimentation.
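The push-style distribution described above can be sketched as a store that notifies watchers the moment a value changes, instead of waiting for a code deploy or a Chef run; key and class names here are hypothetical:

```python
class ConfigStore:
    # Sketch of dynamic configuration: setting a key notifies every
    # subscribed service immediately (seconds, not minutes).
    def __init__(self):
        self._values = {}
        self._watchers = []

    def watch(self, callback):
        self._watchers.append(callback)

    def set(self, key, value):
        self._values[key] = value
        for callback in self._watchers:
            callback(key, value)
```

Fast propagation is what makes this usable for feature flags and experiment assignment, where a bad rollout must be reversible in seconds.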
Improving Data Storage Solutions
Challenges with Existing Datastore
- A project in 2016 revealed limitations in the existing datastore: pricing was stored as integer values, restricting what amounts could be charged.
- Modifying large tables (hundreds of gigabytes) posed significant challenges, as online schema changes could take weeks to complete.
Enhancements in Data Management
- The need for better datastore support led to the development of systems allowing rapid product iterations and experimentation without long-term commitments.
- Mutation events propagation was introduced, similar to the message bus concept, facilitating downstream service updates based on database changes.
Transitioning from Ruby to Java Services
Adapting Service Architecture
- The company began transitioning from Ruby-based services towards Java, aiming for parity in libraries and monitoring tools across both languages.
- ActiveRecord adapters were developed allowing engineers to write traditional Ruby queries while routing certain queries directly to services instead of databases.
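The adapter idea can be sketched as a data-access layer that keeps a single `find()` interface while routing selected models to a service client instead of SQL; the service and class names below are hypothetical stand-ins, not Airbnb's actual adapters:

```python
class ListingsService:
    # Hypothetical service client standing in for a network call.
    def find(self, record_id):
        return {"id": record_id, "source": "service"}

class Repository:
    # Callers keep writing ordinary-looking queries; the adapter decides
    # per model whether to hit the database or a backing service.
    def __init__(self, service_models):
        self._service_models = service_models

    def find(self, model, record_id):
        if model in self._service_models:
            return ListingsService().find(record_id)
        # Stand-in for the ordinary database-backed query path.
        return {"id": record_id, "source": "database"}
```

The payoff is incremental migration: a model can move behind a service without rewriting every call site in the monolith.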
Configuration Management Challenges
OneTouch Service Management Framework
- In 2017, a new framework called OneTouch was initiated to simplify service creation by managing complex dependency graphs effectively.
- This framework aimed at reducing the time taken (previously up to two weeks) for engineers—especially newcomers—to spin up new services.
Service Configuration and Deployment Strategies
Unified Configuration Management
- The integration of various configurations, including secrets and deployable applications, within the same repository enhances project management by consolidating all necessary components.
- This approach allows for simultaneous updates to both configuration and source code, promoting immutable infrastructure that facilitates safe deployments.
Accelerated Service Development
- By streamlining the creation of new services, teams can significantly reduce development time—potentially saving up to two weeks in service launch preparation.
- In a service-oriented architecture, developers must consider inter-service communication alongside business logic, necessitating additional planning and implementation efforts.
Standardized API Development
- Utilizing an Interface Definition Language (IDL), such as gRPC, enables automatic generation of server-side code from a defined service API schema.
- This automation includes standardized metrics collection and request/response validation, enhancing consistency across services while reducing manual coding efforts.
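What the IDL-generated scaffolding provides "for free" can be sketched with a decorator that wraps every handler in standardized metrics and request validation; the decorator, field names, and metrics dictionary are all illustrative assumptions, not generated code:

```python
import functools

METRICS = {"requests": 0, "errors": 0}

def endpoint(required_fields):
    # Sketch of generated scaffolding: every handler gets the same
    # request counting and required-field validation without hand-writing it.
    def wrap(fn):
        @functools.wraps(fn)
        def handler(request):
            METRICS["requests"] += 1
            missing = [f for f in required_fields if f not in request]
            if missing:
                METRICS["errors"] += 1
                return {"error": f"missing fields: {missing}"}
            return fn(request)
        return handler
    return wrap

@endpoint(required_fields=["listing_id"])
def get_listing(request):
    # Only the business logic is hand-written.
    return {"listing_id": request["listing_id"], "title": "Cozy loft"}
```

Because every service gets identical wrappers from the same schema, metrics and validation behave consistently across the fleet.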
Enhanced Monitoring and Debugging
- Distributed tracing is integrated into the service framework for Java and Ruby applications, improving observability across multiple services during debugging processes.
- Middleware services are developed to support request lifecycles outside traditional monolithic architectures, addressing concerns like user authentication and session management.
Growth in Production Services
- By early 2018, multiple major API services were successfully launched into production, demonstrating significant progress in handling live traffic.
API Service Framework Development
Overview of API Services
- The discussion begins by introducing two key services built on a service framework aimed at sustaining high throughput for API requests.
Transition to API Services
- Initially, traffic was routed through Monorail for practical reasons. The need arose to build an API service that could handle requests outside of Monorail, necessitating changes in security policies and user verification processes.
Policy Checks and Data Scrubbing
- Implementing policy checks is crucial for ensuring data privacy; sensitive PII data must be scrubbed from responses. Unlike Monorail, where these checks were integrated, the new API service requires explicit handling of these tasks.
Development of the API Service Framework
- In early 2018, work began on an API service framework designed to simplify policy checks and response scrubbing by letting developers annotate their requirements directly in code.
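The annotation-driven scrubbing can be sketched as a filter that strips fields marked as PII unless the endpoint explicitly declares it may return them; the field names and function signature are hypothetical:

```python
# Fields annotated as PII somewhere in the schema (illustrative set).
PII_FIELDS = {"email", "phone"}

def scrub(response, allowed_pii=frozenset()):
    # Remove PII fields from a response unless the endpoint's
    # annotations explicitly allow them -- the "explicit handling"
    # the monolith used to do implicitly.
    return {k: v for k, v in response.items()
            if k not in PII_FIELDS or k in allowed_pii}
```

Making the policy declarative means a new public-facing service gets correct scrubbing by annotating fields, rather than re-implementing privacy checks by hand.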
Impact on Public-Facing Services
- This framework significantly accelerated the launch of public-facing services such as homes PDP and checkout functionalities. Currently, there are approximately 20 to 30 public APIs in production traffic due to this development.
Traffic Management and Architecture Evolution
Introduction of Java-Based API Gateway
- A Java-based API gateway was introduced in 2017 to manage traffic more effectively. It not only guards access but also facilitates user authentication and session management across various services.
Current Architecture Insights
- The current architecture includes a load balancer directing requests through Hypernova (a Node.js server-side rendering cluster) before they reach specific API services such as the homes PDP service.
Performance Improvements Observed
- Significant performance improvements followed the launch of the new services; for instance, the homes PDP response time dropped from roughly 1,200 milliseconds to about 226 milliseconds, more than a five-fold improvement.
Service Consistency and Business Unit Integration
Centralization Benefits
- Centralizing pricing logic has enhanced consistency across listings by eliminating discrepancies that previously existed when pricing information was scattered across different pages.
Fast Product Iteration Across Units
- The improved architecture supports rapid product iteration across various business units (homes, experiences, etc.), enabling faster development cycles for new features or products.
Enabling Comprehensive Platform Functionality
- The evolving service platform aims not just at housing listings but at creating an end-to-end trips platform that integrates multiple aspects of travel planning beyond accommodations.
Transitioning from Monolithic to Service-Oriented Architecture
Challenges in Migration
- The transition from a monolithic architecture to a service-oriented architecture (SOA) introduces various infrastructure challenges, including changes in traffic and data loading patterns due to the increasing number of services.
- As the number of services grew, scalability and reliability issues emerged across the technical stack, particularly with ZooKeeper, which sat at the center of their service management.
- During the SOA migration, which involved over 200 APIs, monitoring dependencies became increasingly complex, and ZooKeeper remained a persistent source of problems throughout the transition.
Operational Difficulties
- Rising watch counts on ZooKeeper highlighted operational pain points; stability issues often arose when multiple services interacted poorly.
- Configuration changes required frequent restarts of proxy instances, complicating operations further. This was manageable with simpler topologies but became problematic as complexity increased.
Scalability Issues
- A significant rise in proxy instances, driven by constant service flapping around the Monorail monolith, created major scalability challenges for the organization.
- To address these issues, they are migrating from HAProxy-based routing to Envoy, a more modern proxy that offers richer features and better performance for their service stack.
Infrastructure Management
- The growing number of EC2 instances has led to increased operational load; routine upgrades necessitate restarting all services across the stack, consuming valuable time that could be spent on developing new technologies.
- They are exploring Kubernetes as a potential solution for managing their entire stack more efficiently amidst rising operational demands.
Evolution and Lessons Learned
- Since starting service development in mid-2017, there has been a notable increase in message ingestion rates and partition numbers within their messaging systems.
- With limited engineering resources (only two engineers), scaling the messaging systems effectively has been challenging; they are evaluating replacement options to better handle growth.
- Reflecting on past experiences since 2015 reveals that proactive planning could have mitigated many infrastructure problems encountered during rapid growth phases.
- By focusing on building robust foundational technologies early on (like SOA), they aim to facilitate faster service development while ensuring quality standards are met moving forward.
- Modernizing both their service mesh and data storage solutions is crucial for accommodating future traffic growth and organizational expansion effectively.