AWS re:Invent 2018: A Chronicle of Airbnb Architecture Evolution (ARC407)
Introduction to Airbnb's Evolution
Overview of the Presentation
- The presentation is led by members from the production infrastructure team at Airbnb, focusing on the evolution of their platform and architecture.
- Airbnb started as a simple platform connecting hosts with guests, allowing room rentals. The initial product design was basic and focused on listings.
Evolution of Product Offerings
- Over the years, Airbnb has expanded its offerings beyond homes to include experiences, restaurants, and a luxury tier called Beyond by Airbnb.
- The talk will cover infrastructure developments that supported these expansions over several years.
Challenges Faced in 2015
Major Incident Overview
- In April 2015, a significant incident occurred where the website went down due to high CPU usage on databases.
- It took 10 days to identify that a cron job had deleted a critical memcache key, causing widespread server issues.
Technical Architecture at the Time
- At that time, Airbnb's architecture relied heavily on Amazon EC2 instances within a single region (us-east-1), using Ruby on Rails for its web services.
- The architecture included multiple clients (desktop and mobile), internal load balancers built on Nginx, and various databases managed through ActiveRecord.
Complexity of Business Logic
Database Interactions
- A single page could interact with numerous database tables; for example, one listing detail page accessed 19 core tables and additional tables across different databases.
- Another example, the checkout confirmation page, required data from over 200 tables related to payment information.
Traffic Growth Challenges
- From 2013 to 2015, traffic nearly doubled annually. By January 2015, traffic surpassed previous peak levels consistently.
Database Stability Issues
Error Rates During High Traffic
- As traffic increased significantly in early 2015, database stability became an issue; error rates reached around 30% during peak times.
Formation of Special Teams
- Due to frequent outages (hundreds of hours in total), a special team was formed in response to tackle database stability challenges throughout that year.
Performance Optimization Strategies
Understanding Regression Detection
- The team enabled the Performance Schema on the database servers to identify regression sources, focusing on specific tables and query patterns.
- Client-side logging was introduced to track where issues originated, including web handlers and lines of code, enabling quick detection of regressions.
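The client-side attribution described above can be sketched as a small wrapper that tags every query with its call site, so a slow query seen on the server can be traced back to the web handler and line of code that issued it (the function and SQL names here are hypothetical):

```python
import inspect

def annotate_query(sql):
    # Append the calling handler and line number as a SQL comment so
    # server-side slow-query logs can be traced back to application code.
    caller = inspect.stack()[1]
    return f"{sql} /* handler={caller.function} line={caller.lineno} */"

def listing_detail_handler():
    # A toy "web handler" issuing a query through the wrapper.
    return annotate_query("SELECT * FROM listings WHERE id = 42")
```

The comment travels with the query to the database, so server-side logs and the client-side logs can be joined on the same attribution.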
Measuring Database Headroom
- Database headroom was measured by replaying between 100% and 600% of production traffic against replicas in stress tests.
- Initiatives included separating tables from the main database and implementing caching for simple queries, resulting in over a 50% improvement in performance.
Addressing Connection Limitations
- The team faced connection limitations as web servers reached maximum connections with the main databases due to increased traffic.
- They adopted an open-source proxy for connection pooling and back-pressure management, which allowed the web tier to scale linearly.
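The pooling-plus-back-pressure idea can be sketched as a bounded pool that sheds load instead of queueing: when every pooled connection is busy, new requests fail fast rather than piling more connections onto the database. This is an illustrative sketch, not the actual proxy; the class name is hypothetical.

```python
import threading

class BoundedPool:
    # Minimal sketch of connection pooling with back pressure: a fixed
    # number of slots stands in for a fixed number of DB connections.
    def __init__(self, size):
        self._sem = threading.BoundedSemaphore(size)

    def acquire(self, timeout=0.01):
        # Returns False under saturation -- the caller sheds load
        # (back pressure) rather than queueing indefinitely.
        return self._sem.acquire(timeout=timeout)

    def release(self):
        self._sem.release()
```

Because the pool caps connections, adding web servers no longer multiplies the connection count seen by the database, which is what allowed linear web-tier scaling.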
Impact of Infrastructure Changes
- After deploying new infrastructure solutions, incidents related to databases decreased significantly from 15 to just 2.
- Downtime hours reduced dramatically from over 100 hours to less than three hours post-improvements.
Challenges in Deployment and Scaling
Release Protocol Issues
- The engineering team operated under a single repository model with frequent releases leading to high batch sizes (40–50 changes).
- Even with low per-change bug density, batches that large meant most releases contained at least one bad change; the resulting reverts and coordination among engineers delayed releases by an average of one and a half hours.
Locking Deployments
- During debugging or reverting processes, deployments were locked for up to 15 hours at times, hindering productivity across teams.
Exploring Service-Oriented Architecture (SOA)
- To address deployment challenges and improve ownership among engineers, the team began exploring SOA concepts in 2016.
- Initial steps included building a message bus for decoupling services communication between different systems.
Asynchronous Search Services and Dynamic Configuration
Building Asynchronous Search Services
- The team developed asynchronous search services to ensure users receive the latest pricing and availability information when searching on their pages.
- A message bus was created to allow services to publish standardized messages, enabling other services to subscribe and listen for relevant updates.
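The publish/subscribe pattern the message bus enables can be sketched with a toy in-process bus: producers publish standardized events to a topic, and any interested service registers a handler (topic and payload names are illustrative, not Airbnb's actual schema):

```python
from collections import defaultdict

class MessageBus:
    # Toy in-process stand-in for the message bus: decouples the
    # publisher from whichever services care about an event.
    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subs[topic].append(handler)

    def publish(self, topic, event):
        # Deliver the event to every subscriber of this topic.
        for handler in self._subs[topic]:
            handler(event)
```

In the search use case, a pricing or availability change is published once, and the search indexer subscribes and refreshes its data without the publisher knowing it exists.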
Dynamic Configuration Distribution
- Engineers were previously using various methods (e.g., code changes, Chef) to push configuration updates, leading to inefficiencies.
- The introduction of a dynamic configuration distribution system reduced propagation delays from minutes to seconds, becoming essential for feature management and experimentation.
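The push-style distribution described above can be sketched as a store that notifies watchers the moment a value changes, instead of waiting for a code deploy or a Chef run; key and class names here are hypothetical:

```python
class ConfigStore:
    # Sketch of dynamic configuration: setting a key notifies every
    # subscribed service immediately (seconds, not minutes).
    def __init__(self):
        self._values = {}
        self._watchers = []

    def watch(self, callback):
        self._watchers.append(callback)

    def set(self, key, value):
        self._values[key] = value
        for callback in self._watchers:
            callback(key, value)
```

Fast propagation is what makes this usable for feature flags and experiment assignment, where a bad rollout must be reversible in seconds.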
Improving Data Storage Solutions
Challenges with Existing Datastore
- A project in 2016 revealed limitations in the existing datastore: pricing was stored as integer values, restricting what amounts could be charged.
- Modifying large tables (hundreds of gigabytes) posed significant challenges, as online schema changes could take weeks to complete.
Enhancements in Data Management
- The need for better datastore support led to the development of systems allowing rapid product iterations and experimentation without long-term commitments.
- Mutation events propagation was introduced, similar to the message bus concept, facilitating downstream service updates based on database changes.
Transitioning from Ruby to Java Services
Adapting Service Architecture
- The company began transitioning from Ruby-based services towards Java, aiming for parity in libraries and monitoring tools across both languages.
- ActiveRecord adapters were developed allowing engineers to write traditional Ruby queries while routing certain queries directly to services instead of databases.
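The adapter idea can be sketched as a data-access layer that keeps a single `find()` interface while routing selected models to a service client instead of SQL; the service and class names below are hypothetical stand-ins, not Airbnb's actual adapters:

```python
class ListingsService:
    # Hypothetical service client standing in for a network call.
    def find(self, record_id):
        return {"id": record_id, "source": "service"}

class Repository:
    # Callers keep writing ordinary-looking queries; the adapter decides
    # per model whether to hit the database or a backing service.
    def __init__(self, service_models):
        self._service_models = service_models

    def find(self, model, record_id):
        if model in self._service_models:
            return ListingsService().find(record_id)
        # Stand-in for the ordinary database-backed query path.
        return {"id": record_id, "source": "database"}
```

The payoff is incremental migration: a model can move behind a service without rewriting every call site in the monolith.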
Configuration Management Challenges
OneTouch Service Management Framework
- In 2017, a new framework called OneTouch was initiated to simplify service creation by managing complex dependency graphs effectively.
- This framework aimed at reducing the time taken (previously up to two weeks) for engineers—especially newcomers—to spin up new services.
Service Configuration and Deployment Strategies
Unified Configuration Management
- The integration of various configurations, including secrets and deployable applications, within the same repository enhances project management by consolidating all necessary components.
- This approach allows for simultaneous updates to both configuration and source code, promoting immutable infrastructure that facilitates safe deployments.
Accelerated Service Development
- By streamlining the creation of new services, teams can significantly reduce development time—potentially saving up to two weeks in service launch preparation.
- In a service-oriented architecture, developers must consider inter-service communication alongside business logic, necessitating additional planning and implementation efforts.
Standardized API Development
- Utilizing an Interface Definition Language (IDL), such as gRPC, enables automatic generation of server-side code from a defined service API schema.
- This automation includes standardized metrics collection and request/response validation, enhancing consistency across services while reducing manual coding efforts.
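What the IDL-generated scaffolding provides "for free" can be sketched with a decorator that wraps every handler in standardized metrics and request validation; the decorator, field names, and metrics dictionary are all illustrative assumptions, not generated code:

```python
import functools

METRICS = {"requests": 0, "errors": 0}

def endpoint(required_fields):
    # Sketch of generated scaffolding: every handler gets the same
    # request counting and required-field validation without hand-writing it.
    def wrap(fn):
        @functools.wraps(fn)
        def handler(request):
            METRICS["requests"] += 1
            missing = [f for f in required_fields if f not in request]
            if missing:
                METRICS["errors"] += 1
                return {"error": f"missing fields: {missing}"}
            return fn(request)
        return handler
    return wrap

@endpoint(required_fields=["listing_id"])
def get_listing(request):
    # Only the business logic is hand-written.
    return {"listing_id": request["listing_id"], "title": "Cozy loft"}
```

Because every service gets identical wrappers from the same schema, metrics and validation behave consistently across the fleet.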
Enhanced Monitoring and Debugging
- Distributed tracing is integrated into the service framework for Java and Ruby applications, improving observability across multiple services during debugging processes.
- Middleware services are developed to support request lifecycles outside traditional monolithic architectures, addressing concerns like user authentication and session management.
Growth in Production Services
- By early 2018, multiple major API services were successfully launched into production, demonstrating significant progress in handling live traffic.
API Service Framework Development
Overview of API Services
- The discussion begins by introducing two key services built on a service framework aimed at sustaining high throughput for API requests.
Transition to API Services
- Initially, traffic was routed through Monorail for practical reasons. The need arose to build an API service that could handle requests outside of Monorail, necessitating changes in security policies and user verification processes.
Policy Checks and Data Scrubbing
- Implementing policy checks is crucial for ensuring data privacy; sensitive PII data must be scrubbed from responses. Unlike Monorail, where these checks were integrated, the new API service requires explicit handling of these tasks.
Development of the API Service Framework
- In early 2018, work began on an API service framework designed to simplify policy checks and response scrubbing by letting developers annotate their requirements directly in code.
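The annotation-driven scrubbing can be sketched as a filter that strips fields marked as PII unless the endpoint explicitly declares it may return them; the field names and function signature are hypothetical:

```python
# Fields annotated as PII somewhere in the schema (illustrative set).
PII_FIELDS = {"email", "phone"}

def scrub(response, allowed_pii=frozenset()):
    # Remove PII fields from a response unless the endpoint's
    # annotations explicitly allow them -- the "explicit handling"
    # the monolith used to do implicitly.
    return {k: v for k, v in response.items()
            if k not in PII_FIELDS or k in allowed_pii}
```

Making the policy declarative means a new public-facing service gets correct scrubbing by annotating fields, rather than re-implementing privacy checks by hand.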
Impact on Public-Facing Services
- This framework significantly accelerated the launch of public-facing services such as homes PDP and checkout functionalities. Currently, there are approximately 20 to 30 public APIs in production traffic due to this development.
Traffic Management and Architecture Evolution
Introduction of Java-Based API Gateway
- A Java-based API gateway was introduced in 2017 to manage traffic more effectively. It not only guards access but also facilitates user authentication and session management across various services.
Current Architecture Insights
- The current architecture includes a load balancer directing requests through Hypernova (a Node.js server-side rendering cluster) before they reach specific API services such as the homes PDP service.
Performance Improvements Observed
- Significant performance improvements followed the launch of the new services; for instance, the homes PDP response time dropped from roughly 1,200 milliseconds to about 226 milliseconds, more than a five-fold improvement.
Service Consistency and Business Unit Integration
Centralization Benefits
- Centralizing pricing logic has enhanced consistency across listings by eliminating discrepancies that previously existed when pricing information was scattered across different pages.
Fast Product Iteration Across Units
- The improved architecture supports rapid product iteration across various business units (homes, experiences, etc.), enabling faster development cycles for new features or products.
Enabling Comprehensive Platform Functionality
- The evolving service platform aims not just at housing listings but at creating an end-to-end trips platform that integrates multiple aspects of travel planning beyond accommodations.
Transitioning from Monolithic to Service-Oriented Architecture
Challenges in Migration
- The transition from a monolithic architecture to a service-oriented architecture (SOA) introduces various infrastructure challenges, including changes in traffic and data loading patterns due to the increasing number of services.
- As the number of services grew, scalability and reliability issues emerged across the technical stack, particularly with ZooKeeper, which sat at the center of their service management.
- During the SOA migration, which involved over 200 APIs, monitoring dependencies became increasingly complex, and ZooKeeper remained a persistent source of problems throughout the transition.
Operational Difficulties
- Rising watch counts on ZooKeeper highlighted operational pain points; stability issues often arose when multiple services interacted poorly.
- Configuration changes required frequent restarts of proxy instances, complicating operations further. This was manageable with simpler topologies but became problematic as complexity increased.
Scalability Issues
- A significant rise in proxy instances, driven by constant service flapping around the Monorail monolith, created major scalability challenges for the organization.
- To address these issues, they are migrating from HAProxy-based routing to Envoy, a more modern proxy that offers richer features and better performance for their service stack.
Infrastructure Management
- The growing number of EC2 instances has led to increased operational load; routine upgrades necessitate restarting all services across the stack, consuming valuable time that could be spent on developing new technologies.
- They are exploring Kubernetes as a potential solution for managing their entire stack more efficiently amidst rising operational demands.
Evolution and Lessons Learned
- Since starting service development in mid-2017, there has been a notable increase in message ingestion rates and partition numbers within their messaging systems.
- With limited engineering resources (only two engineers), scaling the messaging systems effectively has been challenging; they are evaluating replacement options to better handle growth.
- Reflecting on past experiences since 2015 reveals that proactive planning could have mitigated many infrastructure problems encountered during rapid growth phases.
- By focusing on building robust foundational technologies early on (like SOA), they aim to facilitate faster service development while ensuring quality standards are met moving forward.
- Modernizing both their service mesh and data storage solutions is crucial for accommodating future traffic growth and organizational expansion effectively.