Michelangelo - Machine Learning @Uber

Machine Learning at Uber

Overview of the Talk

  • The speaker introduces the topic of machine learning (ML) at Uber, outlining three main parts: interesting use cases, the initial platform built to support those use cases, and current work on developer experience to broaden ML adoption.

Exciting Aspects of Machine Learning at Uber

  • Uber is a dynamic environment for ML due to diverse projects across the company rather than a few dominant use cases.
  • The data collected by Uber is unique as it operates in the physical world, utilizing GPS and accelerometer data from rider and driver apps.
  • Being a relatively young company allows for innovative applications of ML without being constrained by legacy systems or incremental improvements.
  • ML is central to Uber's strategy, influencing decision-making and operational efficiency significantly.

Growth and Scale of Data

  • At the time of the talk (2018), Uber had grown to 75 million riders and 3 million drivers, completing over 4 billion trips annually across 600 cities.
  • An example provided shows GPS traces from London drivers over six hours, illustrating how quickly cars can cover large areas—data that supports various ML applications.

Diverse Use Cases for Machine Learning

  • There are over 100 active ML use cases at Uber spanning multiple domains including ridesharing, food delivery (Uber Eats), self-driving technology, customer support, pricing forecasting, anomaly detection in system metrics, and capacity planning for data centers.

Specific Applications in Uber Eats

  • In the Uber Eats app, hundreds of models score user preferences to generate personalized homepages based on likely restaurant choices.
  • Models predict delivery times by analyzing various stages from order preparation to delivery logistics.

Self-driving Cars and ETA Predictions

  • Self-driving cars utilize lidar and cameras with ML algorithms for object detection—identifying streets and pedestrians while planning routes effectively.
  • Accurate Estimated Time of Arrival (ETA) predictions are crucial; they enhance user experience while also impacting internal systems like pricing and routing.

Challenges with ETA Predictions

Predictive Models and Error Correction

  • The use of historical data to predict ETAs (Estimated Time of Arrival) often results in consistent errors, which can be modeled and corrected for improved accuracy.
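The idea above—that ETA errors are systematic and therefore learnable—can be sketched with a residual-correction model. This is a minimal illustration, not Uber's implementation: the segmentation key and the sample data are hypothetical, and a real system would use a learned regressor rather than per-segment mean error.

```python
from collections import defaultdict

def fit_residual_correction(history):
    """Learn the mean ETA error (actual - predicted) per segment.

    `history` is a list of (segment, predicted_eta, actual_eta) tuples;
    segmenting by e.g. hour-of-day or region is a modeling choice.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for segment, predicted, actual in history:
        sums[segment] += actual - predicted
        counts[segment] += 1
    return {seg: sums[seg] / counts[seg] for seg in sums}

def corrected_eta(correction, segment, predicted):
    # Add the learned bias for this segment; unseen segments get no correction.
    return predicted + correction.get(segment, 0.0)

# hypothetical history: the base estimator consistently under-predicts at rush hour
history = [("rush_hour", 10.0, 13.0), ("rush_hour", 20.0, 24.0),
           ("off_peak", 10.0, 10.5), ("off_peak", 8.0, 8.1)]
correction = fit_residual_correction(history)
```

The corrected estimate for a 12-minute rush-hour prediction becomes 15.5 minutes, reflecting the consistent under-prediction seen in the historical data.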

Mapping Infrastructure Development

  • Uber is developing its own mapping infrastructure, starting with a base street map that is enhanced through evidence collection from vehicles equipped with cameras.

Data Collection and Machine Learning Integration

  • Images captured by cars are tagged with GPS coordinates; machine learning models are employed to identify addresses and street signs for database enhancement.

Marketplace Dynamics: Supply and Demand

  • In the context of ride-sharing, proximity in space and time between riders and drivers is crucial for operational efficiency, contrasting with e-commerce platforms like eBay where distance is less critical.

Deep Learning for Marketplace Predictions

  • Uber utilizes deep learning models to forecast marketplace metrics such as driver availability and rider demand, helping to optimize supply-demand balance.

Customer Support Optimization

  • With millions of rides daily, Uber employs deep learning to streamline customer support, predicting the issue type from ticket text so agents choose from a much smaller set of suggested responses.

Enhancing Communication Between Riders and Drivers

  • A new one-click chat feature allows quick communication between drivers and riders using predicted responses instead of typing, improving interaction efficiency.

Michelangelo Platform Overview

  • The Michelangelo platform supports various machine learning applications at Uber, aiming to empower data scientists to manage end-to-end processes from model development to deployment.

Empowering Data Scientists at Scale

Iterative Process of Machine Learning

  • The iterative nature of machine learning (ML) allows for rapid movement through modeling cycles, leading to compounding effects as new models are experimented with.
  • Initial efforts in ML at Uber focused on enabling users to engage with the technology, which proved successful but presented challenges in usability.

Evolution from v1 to v2

  • Version 2 (v2) of Uber's ML platform aims to enhance developer productivity and streamline the modeling and deployment processes.
  • Acknowledgment that effective machine learning encompasses more than just model training; it requires a comprehensive workflow management system.

Data Management Challenges

  • Managing data is often the most complex aspect of the ML process, involving accurate datasets for both training and retraining models.
  • Models deployed in production require real-time access to relevant data, necessitating intricate pipelines for data delivery.

Model Training and Evaluation

  • The training phase involves iterative model evaluation, where tools are essential for comparing different models' performance.
  • Deployment should be seamless, allowing developers to easily push models into production via simple commands or API calls.

Monitoring and Continuous Improvement

  • Post-deployment monitoring is crucial as it assesses whether models perform accurately against new incoming data.
  • The same foundational workflow applies across various ML problems, including traditional methods like trees and linear models as well as deep learning techniques.

Centralized Feature Store Development

  • A centralized feature store has been established at Uber to facilitate sharing and curation of features across different teams, enhancing collaboration.
  • This feature store simplifies the modeling process by allowing developers to select pre-existing features rather than creating new queries each time.
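The select-by-name workflow can be sketched as follows. This is a toy in-memory stand-in—the names `FeatureStore`, `put`, and `get_vector` are invented for illustration; at Uber the offline side is backed by Hive and the online side by Cassandra.

```python
# Hypothetical in-memory feature store; the interface idea (curated,
# named features keyed by entity) is what matters, not the storage.
class FeatureStore:
    def __init__(self):
        self._features = {}  # (feature_name, entity_id) -> value

    def put(self, feature_name, entity_id, value):
        self._features[(feature_name, entity_id)] = value

    def get_vector(self, feature_names, entity_id, default=0.0):
        # Consumers select curated features by name instead of
        # writing a fresh pipeline or query for each new model.
        return [self._features.get((n, entity_id), default)
                for n in feature_names]

store = FeatureStore()
store.put("restaurant.avg_prep_time_7d", "r42", 14.2)
store.put("restaurant.order_count_1h", "r42", 9)
vec = store.get_vector(["restaurant.avg_prep_time_7d",
                        "restaurant.order_count_1h"], "r42")
```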

Distributed Training Infrastructure

  • Large-scale distributed training is conducted using CPU clusters for simpler models and GPU clusters for deep learning applications.
  • Uber developed its own distributed training framework, Horovod, which improves efficiency by eliminating parameter servers during model training.

Enhanced Usability in Model Management

Importance of Metadata in Model Development

  • The collection of metadata about models is crucial for modelers to iterate effectively, helping them identify the best model for production use.
  • Standard error metrics are essential for evaluating regression and classification models, guiding data scientists in refining their models based on specific use cases.

Feature Analysis and Model Understanding

  • Analyzing feature importance and statistics (mean, min, deviation, distribution) aids in understanding the data used in models.
  • A tool is available to visualize the structure of learned trees, allowing users to comprehend how input features influence predictions.

Model Deployment Strategies

  • Uber supports both batch predictions (scheduled jobs generating multiple predictions at once) and real-time web service deployments for immediate prediction requests.
  • The architecture includes a routing infrastructure that directs feature vectors to the appropriate model based on incoming requests.
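The routing idea can be illustrated with a toy dispatcher. The class and method names here are hypothetical; the real serving infrastructure routes over the network to deployed model containers.

```python
class ModelRouter:
    """Toy version of a serving router: each request names a model,
    and the router dispatches the feature vector to that model."""
    def __init__(self):
        self._models = {}

    def deploy(self, model_id, predict_fn):
        self._models[model_id] = predict_fn

    def predict(self, model_id, features):
        if model_id not in self._models:
            raise KeyError(f"no model deployed under {model_id!r}")
        return self._models[model_id](features)

router = ModelRouter()
# hypothetical ETA model: 2 min/km plus a 3-minute pickup constant
router.deploy("eta_v3", lambda f: 2.0 * f["distance_km"] + 3.0)
eta = router.predict("eta_v3", {"distance_km": 5.0})
```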

Performance Metrics and Prediction Speed

  • Currently, Uber processes close to one million predictions per second across various applications.
  • Scoring times for tree-based models are typically under five milliseconds; however, additional delays may occur when fetching online features from databases like Cassandra.

Monitoring Model Performance Post-deployment

  • After deploying a model into production, it’s vital to evaluate its performance against current data rather than just historical benchmarks.

Architecture of Data Systems

Overview of the Data Architecture

  • The architecture consists of offline systems at the bottom and online systems at the top, starting with a data lake where data is stored.
  • Batch processing begins in Hadoop using Hive tables, allowing developers to write Spark or SQL jobs for data aggregation and feature collection.
  • Features calculated from batch jobs are stored in Cassandra for online serving, enabling real-time predictions based on historical data.

Handling Freshness in Data

  • For features requiring freshness, such as current meal prep times, a streaming path is utilized to gather metrics from Kafka.
  • Flink jobs aggregate streaming data and write results into both Cassandra and Hive to maintain parity between online and offline datasets.
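A sliding-window aggregate captures the freshness idea; this plain-Python sketch stands in for the Flink job, with dictionaries standing in for the Cassandra and Hive sinks. The key point is writing the same aggregate to both sinks so online and offline values stay in parity.

```python
from collections import deque

class SlidingAverage:
    """Stand-in for a streaming aggregation (the talk uses Flink over
    Kafka): keeps the last `window` observations and exposes a fresh
    average, e.g. the current meal-prep time for a restaurant."""
    def __init__(self, window):
        self._values = deque(maxlen=window)

    def observe(self, value):
        self._values.append(value)

    def value(self):
        return sum(self._values) / len(self._values)

online_store, offline_log = {}, []   # stand-ins for Cassandra and Hive

prep = SlidingAverage(window=3)
for minutes in [10, 14, 12, 20]:     # only the last 3 observations count
    prep.observe(minutes)
    # write the same aggregate to both sinks to keep
    # online and offline feature values in parity
    online_store["r42.prep_time"] = prep.value()
    offline_log.append(prep.value())
```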

Model Training and Deployment

  • Batch training pulls data from Hive tables to run algorithms (tree models, linear models, deep learning), storing metadata about trained models in a dedicated database.
  • At deployment time, trained models can be pushed into either an online serving container for real-time predictions or batch jobs for periodic predictions.

Real-Time Prediction Process

  • When users request delivery time estimates via an app, relevant features (location, time of day) are sent to the model along with pre-computed features from Cassandra.
  • The model combines context-specific features with additional computed ones before scoring them for prediction.
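The merge-then-score step can be sketched like this. The function name, the feature keys, and the toy delivery-time model are all invented for illustration.

```python
def serve_prediction(request, online_features, model):
    """Merge request-time context (location, time of day) with
    precomputed features fetched from the online store, then score."""
    features = dict(online_features)   # e.g. fetched from Cassandra
    features.update(request)           # request context wins on conflicts
    return model(features)

request = {"hour_of_day": 18, "distance_km": 2.5}
online_features = {"restaurant.avg_prep_time_7d": 14.0}

# toy delivery-time model: prep time plus travel at ~4 min/km
model = lambda f: f["restaurant.avg_prep_time_7d"] + 4.0 * f["distance_km"]
estimate = serve_prediction(request, online_features, model)
```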

Monitoring and Management

  • Effective monitoring is crucial; predictions are logged back to Hadoop for outcome analysis and alerting through metric systems.

Machine Learning Workflow Management and Deployment

Overview of the Platform

  • The platform facilitates workflow management and deployment, allowing users to write automation code in Python or Java to interact with the system.
  • Projects serve as containers for modeling problems, enabling connections to data sources like Hive tables for model training.

Model Visualization and Metrics

  • Users can access visualizations such as confusion matrices and various metrics that assess model accuracy.
  • Detailed reports on model features are available, showcasing distributions and split points within decision trees.

Model Deployment Process

  • The deployment process involves packaging the model, which takes a few minutes before it becomes operational.
  • A history log tracks all deployed models, including timestamps and user actions.

Enhancements in Machine Learning Development

Accelerating Machine Learning Projects

  • Current efforts focus on making machine learning faster and easier from idea generation through prototyping to production scaling.
  • Introduction of PyML, a new Python-focused path that caters specifically to data scientists who prefer working in Python over web UIs or other languages.

AutoML Features

  • The platform includes auto-tuning capabilities that assist users in training effective models without needing extensive manual configuration.

Real-Time Monitoring Improvements

  • New features allow for real-time monitoring of models in production, enhancing responsiveness compared to previous hourly refresh rates.

Addressing Developer Experience Challenges

Identifying Friction Points

  • The development process is recognized as having friction points at every step; efforts are underway to streamline these workflows.

Empowering Data Scientists

  • By allowing data scientists more ownership over their workflows—from prototype through production—efficiency is expected to improve significantly.

Evolution of Michelangelo's Capabilities

Target Audience Shift

  • Initially designed for high-scale use cases involving large datasets, Michelangelo has evolved due to feedback indicating a need for greater usability among data scientists facing unique challenges.

Integration with Python Ecosystem

Infrastructure and Model Deployment Overview

Trade-offs in Model Infrastructure

  • Discussion of the trade-offs between PyML and other serving paths, highlighting flexibility, resource efficiency, scale, and latency.
  • Introduction of a simple logistic regression model built using a Pandas DataFrame within a Jupyter notebook environment.

Model Training and Serving

  • Explanation of training a logistic regression model with scikit-learn; includes saving the model file and implementing an interface for predictions.
  • Description of uploading the trained model to Michelangelo back-end for management via UI or API, enabling real-time request-response scoring.
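The "save the model and expose a predict interface" pattern can be sketched with stdlib pieces only. Here the logistic-regression coefficients are hand-set rather than fit with scikit-learn, and the class name is hypothetical; the pickle round-trip stands in for saving the model file and having the serving side load it.

```python
import math
import pickle

class LogisticModel:
    """Minimal predict interface of the kind a serving back end
    can call uniformly, whatever library trained the model."""
    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias

    def predict(self, features):
        # standard logistic regression: sigmoid of the linear score
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

model = LogisticModel(weights=[2.0, -1.0], bias=0.0)
blob = pickle.dumps(model)          # "save the model file"
restored = pickle.loads(blob)       # what the serving side would load
p = restored.predict([1.0, 1.0])
```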

Deployment Options

  • Overview of deploying models either through UI or API to high-scale infrastructure for real-time scoring or batch processing using Spark jobs.
  • Emphasis on leveraging Python's flexibility while providing necessary infrastructure to support data scientists' tools at scale.

Model Packaging and Serving Architecture

Local Development Environment

  • Outline of the workflow: training models locally, saving them, building Docker containers with dependencies for deployment.
  • Explanation of how models can be pushed to online serving systems for both batch predictions (via Spark jobs) and online predictions (request-response).

High Scale Prediction Service

  • Description of deploying nested Docker containers that encapsulate Python resources; existing prediction services act as proxies routing requests to local containers.
  • Flexibility in using various libraries like scikit-learn or custom algorithms while noting potential trade-offs such as higher latency.

Distributed Deep Learning Systems

Advantages of Distributed Systems

  • Introduction to Horovod, Uber's distributed deep learning system, noted for its efficient scaling compared to other approaches.
  • Simplified API design allows easier transition from single-node training jobs to distributed environments.

Example from TensorFlow Documentation

  • Reference to an example setup from TensorFlow documentation demonstrating distributed training with minimal complexity involved in environment setup.

Challenges in Model Evaluation

Global Accuracy Metrics Limitations

Understanding Model Behavior Across Data Segments

Importance of Data Segmentation

  • Different segments of data exhibit unique characteristics, leading to varied model performance. A model may perform well overall but poorly on critical data slices.
  • Visualization tools are being developed to help users analyze how models behave on smaller data segments, identifying anomalies or problematic areas.
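A per-segment breakdown makes the "good overall, bad on a slice" failure mode concrete. This is a generic sketch, not a Michelangelo API; the segment labels and rows are hypothetical.

```python
from collections import defaultdict

def accuracy_by_segment(rows):
    """rows: (segment, predicted_label, true_label) tuples. Returns
    overall accuracy plus a per-segment breakdown, surfacing slices
    where a globally good model is failing."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for segment, pred, truth in rows:
        total[segment] += 1
        correct[segment] += int(pred == truth)
    per_segment = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_segment

rows = [("city_a", 1, 1), ("city_a", 0, 0), ("city_a", 1, 1),
        ("city_b", 1, 0), ("city_b", 0, 1)]
overall, per_segment = accuracy_by_segment(rows)
# overall looks passable (0.6) while city_b is completely wrong (0.0)
```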

Hyperparameter Tuning Challenges

  • After determining model features, selecting the right combination of hyperparameters is crucial for achieving optimal accuracy. This process can be complex due to multiple parameters involved.
  • Common tuning methods include brute-force grid search and random search; however, these approaches can be inefficient and time-consuming.
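Random search, the baseline the talk calls inefficient, looks like this. The objective and search space are toy stand-ins; Bayesian/black-box optimizers replace the uniform sampling below with a model of the objective so good settings are found in fewer trials.

```python
import random

def random_search(objective, space, n_trials, seed=0):
    """Baseline hyperparameter search: sample configurations
    uniformly from `space` and keep the best-scoring one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(choices) for name, choices in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# toy objective with a known optimum at depth=6, lr=0.1
objective = lambda c: -abs(c["depth"] - 6) - 10 * abs(c["lr"] - 0.1)
space = {"depth": [2, 4, 6, 8], "lr": [0.01, 0.1, 0.3]}
best_cfg, best_score = random_search(objective, space, n_trials=50)
```

Note that every trial costs a full model training run in practice, which is why cutting the number of trials matters so much.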

Advanced Optimization Techniques

  • Black box optimization offers a more efficient way to explore hyperparameter combinations by learning the parameter space and directing searches towards optimal settings.
  • Collaboration with Uber's research teams has led to the development of Bayesian black-box optimization techniques that outperform traditional grid search in fewer iterations.

Ensuring Model Performance in Production

Monitoring Model Behavior

  • Once deployed, it’s essential to ensure that models function correctly in production environments. This involves linking predictions back to actual outcomes.
  • Delays in outcome reporting (e.g., credit card fraud cases taking 90 days for results) complicate timely assessments of model performance.

Anomaly Detection Strategies

  • An alternative approach involves monitoring distributions of input features and predictions over time, allowing for quicker identification of anomalies despite lower precision.
  • Regular distributions are expected; deviations often indicate issues such as bad incoming data from broken services or pipelines.
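One simple form of this monitoring is comparing the current window of a feature against its historical baseline. The z-score threshold here is an illustrative choice, not the talk's method; real systems compare full distributions, but the trade-off is the same—fast, coarse signals versus slow, precise outcome joins.

```python
import statistics

def drift_alert(baseline, current, threshold=3.0):
    """Coarse drift check: flag when the current window's mean is more
    than `threshold` baseline standard deviations from the baseline
    mean. Cheap and immediate, but lower precision than joining
    predictions to true outcomes, which may lag by days or months."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    z = abs(statistics.mean(current) - mu) / sigma
    return z > threshold, z

baseline = [10, 11, 9, 10, 10, 11, 9, 10]   # healthy feature values
healthy  = [10, 10, 11]
broken   = [0, 0, 0]                        # e.g. an upstream service outage

alert_ok, _ = drift_alert(baseline, healthy)
alert_bad, _ = drift_alert(baseline, broken)
```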

Key Lessons Learned in Machine Learning Development

Focus on Developer Tools

  • Emphasizing productivity by providing developers with effective tools early on is crucial. The focus has shifted towards enhancing velocity while still considering scalability.

Infrastructure and Automation Needs

Real-Time Machine Learning Challenges

The Complexity of Real-Time ML

  • Real-time machine learning (ML) is described as challenging and difficult to implement effectively.
  • There is a significant focus on empowering modelers to take ownership of the end-to-end process in real-time ML systems.
  • Investments are being made to develop systems that can operate autonomously, reducing the burden on developers.
  • The goal is for developers to concentrate more on modeling work rather than managing system complexities.
Video description

Jeremy Hermann talks about Michelangelo - the ML Platform that powers most of the ML solutions at Uber. The early goal was to enable teams to deploy and operate ML solutions at Uber scale. Now, their focus has shifted towards developer velocity and empowering the individual model owners to be fully self-sufficient from early prototyping through full production deployment & operationalization. This presentation was recorded at QCon San Francisco 2018: https://bit.ly/2uYyHLb Video with transcript included: http://bit.ly/2Psd9B6