Michelangelo - Machine Learning @Uber

Machine Learning at Uber

Overview of the Talk

  • The speaker introduces the topic of machine learning (ML) at Uber, outlining three main parts: interesting use cases, the initial platform built to support those use cases, and current work on developer experience to broaden ML adoption.

Exciting Aspects of Machine Learning at Uber

  • Uber is a dynamic environment for ML due to diverse projects across the company rather than a few dominant use cases.
  • The data collected by Uber is unique as it operates in the physical world, utilizing GPS and accelerometer data from rider and driver apps.
  • Being a relatively young company allows for innovative applications of ML without being constrained by legacy systems or incremental improvements.
  • ML is central to Uber's strategy, influencing decision-making and operational efficiency significantly.

Growth and Scale of Data

  • At the time of the talk (2018), Uber had grown to 75 million riders and 3 million drivers, completing over 4 billion trips annually across 600 cities.
  • An example provided shows GPS traces from London drivers over six hours, illustrating how quickly cars can cover large areas—data that supports various ML applications.

Diverse Use Cases for Machine Learning

  • There are over 100 active ML use cases at Uber spanning multiple domains including ridesharing, food delivery (Uber Eats), self-driving technology, customer support, pricing forecasting, anomaly detection in system metrics, and capacity planning for data centers.

Specific Applications in Uber Eats

  • In the Uber Eats app, hundreds of models score user preferences to generate personalized homepages based on likely restaurant choices.
  • Models predict delivery times by analyzing various stages from order preparation to delivery logistics.

Self-driving Cars and ETA Predictions

  • Self-driving cars utilize lidar and cameras with ML algorithms for object detection—identifying streets and pedestrians while planning routes effectively.
  • Accurate Estimated Time of Arrival (ETA) predictions are crucial; they enhance user experience while also impacting internal systems like pricing and routing.

Challenges with ETA Predictions

Predictive Models and Error Correction

  • The use of historical data to predict ETAs (Estimated Time of Arrival) often results in consistent errors, which can be modeled and corrected for improved accuracy.
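The idea above—that ETA errors are systematic and therefore learnable—can be sketched with a residual-correction model. This is a minimal illustration, not Uber's implementation: the segmentation key and the sample data are hypothetical, and a real system would use a learned regressor rather than per-segment mean error.

```python
from collections import defaultdict

def fit_residual_correction(history):
    """Learn the mean ETA error (actual - predicted) per segment.

    `history` is a list of (segment, predicted_eta, actual_eta) tuples;
    segmenting by e.g. hour-of-day or region is a modeling choice.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for segment, predicted, actual in history:
        sums[segment] += actual - predicted
        counts[segment] += 1
    return {seg: sums[seg] / counts[seg] for seg in sums}

def corrected_eta(correction, segment, predicted):
    # Add the learned bias for this segment; unseen segments get no correction.
    return predicted + correction.get(segment, 0.0)

# hypothetical history: the base estimator consistently under-predicts at rush hour
history = [("rush_hour", 10.0, 13.0), ("rush_hour", 20.0, 24.0),
           ("off_peak", 10.0, 10.5), ("off_peak", 8.0, 8.1)]
correction = fit_residual_correction(history)
```

The corrected estimate for a 12-minute rush-hour prediction becomes 15.5 minutes, reflecting the consistent under-prediction seen in the historical data.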

Mapping Infrastructure Development

  • Uber is developing its own mapping infrastructure, starting with a base street map that is enhanced through evidence collection from vehicles equipped with cameras.

Data Collection and Machine Learning Integration

  • Images captured by cars are tagged with GPS coordinates; machine learning models are employed to identify addresses and street signs for database enhancement.

Marketplace Dynamics: Supply and Demand

  • In the context of ride-sharing, proximity in space and time between riders and drivers is crucial for operational efficiency, contrasting with e-commerce platforms like eBay where distance is less critical.

Deep Learning for Marketplace Predictions

  • Uber utilizes deep learning models to forecast marketplace metrics such as driver availability and rider demand, helping to optimize supply-demand balance.

Customer Support Optimization

  • With millions of rides daily, Uber employs deep learning to streamline customer support, predicting the issue type from ticket text so agents choose from a much smaller set of suggested responses.

Enhancing Communication Between Riders and Drivers

  • A new one-click chat feature allows quick communication between drivers and riders using predicted responses instead of typing, improving interaction efficiency.

Michelangelo Platform Overview

  • The Michelangelo platform supports various machine learning applications at Uber, aiming to empower data scientists to manage end-to-end processes from model development to deployment.

Empowering Data Scientists at Scale

Iterative Process of Machine Learning

  • The iterative nature of machine learning (ML) allows for rapid movement through modeling cycles, leading to compounding effects as new models are experimented with.
  • Initial efforts in ML at Uber focused on enabling users to engage with the technology, which proved successful but presented challenges in usability.

Evolution from v1 to v2

  • Version 2 (v2) of Uber's ML platform aims to enhance developer productivity and streamline the modeling and deployment processes.
  • Acknowledgment that effective machine learning encompasses more than just model training; it requires a comprehensive workflow management system.

Data Management Challenges

  • Managing data is often the most complex aspect of the ML process, involving accurate datasets for both training and retraining models.
  • Models deployed in production require real-time access to relevant data, necessitating intricate pipelines for data delivery.

Model Training and Evaluation

  • The training phase involves iterative model evaluation, where tools are essential for comparing different models' performance.
  • Deployment should be seamless, allowing developers to easily push models into production via simple commands or API calls.

Monitoring and Continuous Improvement

  • Post-deployment monitoring is crucial as it assesses whether models perform accurately against new incoming data.
  • The same foundational workflow applies across various ML problems, including traditional methods like trees and linear models as well as deep learning techniques.

Centralized Feature Store Development

  • A centralized feature store has been established at Uber to facilitate sharing and curation of features across different teams, enhancing collaboration.
  • This feature store simplifies the modeling process by allowing developers to select pre-existing features rather than creating new queries each time.
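The select-by-name workflow can be sketched as follows. This is a toy in-memory stand-in—the names `FeatureStore`, `put`, and `get_vector` are invented for illustration; at Uber the offline side is backed by Hive and the online side by Cassandra.

```python
# Hypothetical in-memory feature store; the interface idea (curated,
# named features keyed by entity) is what matters, not the storage.
class FeatureStore:
    def __init__(self):
        self._features = {}  # (feature_name, entity_id) -> value

    def put(self, feature_name, entity_id, value):
        self._features[(feature_name, entity_id)] = value

    def get_vector(self, feature_names, entity_id, default=0.0):
        # Consumers select curated features by name instead of
        # writing a fresh pipeline or query for each new model.
        return [self._features.get((n, entity_id), default)
                for n in feature_names]

store = FeatureStore()
store.put("restaurant.avg_prep_time_7d", "r42", 14.2)
store.put("restaurant.order_count_1h", "r42", 9)
vec = store.get_vector(["restaurant.avg_prep_time_7d",
                        "restaurant.order_count_1h"], "r42")
```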

Distributed Training Infrastructure

  • Large-scale distributed training is conducted using CPU clusters for simpler models and GPU clusters for deep learning applications.
  • Uber developed its own distributed training framework, Horovod, which improves efficiency by eliminating parameter servers during model training.

Enhanced Usability in Model Management

Importance of Metadata in Model Development

  • The collection of metadata about models is crucial for modelers to iterate effectively, helping them identify the best model for production use.
  • Standard error metrics are essential for evaluating regression and classification models, guiding data scientists in refining their models based on specific use cases.

Feature Analysis and Model Understanding

  • Analyzing feature importance and statistics (mean, min, deviation, distribution) aids in understanding the data used in models.
  • A tool is available to visualize the structure of learned trees, allowing users to comprehend how input features influence predictions.

Model Deployment Strategies

  • Uber supports both batch predictions (scheduled jobs generating multiple predictions at once) and real-time web service deployments for immediate prediction requests.
  • The architecture includes a routing infrastructure that directs feature vectors to the appropriate model based on incoming requests.
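The routing idea can be illustrated with a toy dispatcher. The class and method names here are hypothetical; the real serving infrastructure routes over the network to deployed model containers.

```python
class ModelRouter:
    """Toy version of a serving router: each request names a model,
    and the router dispatches the feature vector to that model."""
    def __init__(self):
        self._models = {}

    def deploy(self, model_id, predict_fn):
        self._models[model_id] = predict_fn

    def predict(self, model_id, features):
        if model_id not in self._models:
            raise KeyError(f"no model deployed under {model_id!r}")
        return self._models[model_id](features)

router = ModelRouter()
# hypothetical ETA model: 2 min/km plus a 3-minute pickup constant
router.deploy("eta_v3", lambda f: 2.0 * f["distance_km"] + 3.0)
eta = router.predict("eta_v3", {"distance_km": 5.0})
```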

Performance Metrics and Prediction Speed

  • Currently, Uber processes close to one million predictions per second across various applications.
  • Scoring times for tree-based models are typically under five milliseconds; however, additional delays may occur when fetching online features from databases like Cassandra.

Monitoring Model Performance Post-deployment

  • After deploying a model into production, it’s vital to evaluate its performance against current data rather than just historical benchmarks.

Architecture of Data Systems

Overview of the Data Architecture

  • The architecture consists of offline systems at the bottom and online systems at the top, starting with a data lake where data is stored.
  • Batch processing begins in Hadoop using Hive tables, allowing developers to write Spark or SQL jobs for data aggregation and feature collection.
  • Features calculated from batch jobs are stored in Cassandra for online serving, enabling real-time predictions based on historical data.

Handling Freshness in Data

  • For features requiring freshness, such as current meal prep times, a streaming path is utilized to gather metrics from Kafka.
  • Flink jobs aggregate streaming data and write results into both Cassandra and Hive to maintain parity between online and offline datasets.
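A sliding-window aggregate captures the freshness idea; this plain-Python sketch stands in for the Flink job, with dictionaries standing in for the Cassandra and Hive sinks. The key point is writing the same aggregate to both sinks so online and offline values stay in parity.

```python
from collections import deque

class SlidingAverage:
    """Stand-in for a streaming aggregation (the talk uses Flink over
    Kafka): keeps the last `window` observations and exposes a fresh
    average, e.g. the current meal-prep time for a restaurant."""
    def __init__(self, window):
        self._values = deque(maxlen=window)

    def observe(self, value):
        self._values.append(value)

    def value(self):
        return sum(self._values) / len(self._values)

online_store, offline_log = {}, []   # stand-ins for Cassandra and Hive

prep = SlidingAverage(window=3)
for minutes in [10, 14, 12, 20]:     # only the last 3 observations count
    prep.observe(minutes)
    # write the same aggregate to both sinks to keep
    # online and offline feature values in parity
    online_store["r42.prep_time"] = prep.value()
    offline_log.append(prep.value())
```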

Model Training and Deployment

  • Batch training pulls data from Hive tables to run algorithms (tree models, linear models, deep learning), storing metadata about trained models in a dedicated database.
  • At deployment time, trained models can be pushed into either an online serving container for real-time predictions or batch jobs for periodic predictions.

Real-Time Prediction Process

  • When users request delivery time estimates via an app, relevant features (location, time of day) are sent to the model along with pre-computed features from Cassandra.
  • The model combines context-specific features with additional computed ones before scoring them for prediction.
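The merge-then-score step can be sketched like this. The function name, the feature keys, and the toy delivery-time model are all invented for illustration.

```python
def serve_prediction(request, online_features, model):
    """Merge request-time context (location, time of day) with
    precomputed features fetched from the online store, then score."""
    features = dict(online_features)   # e.g. fetched from Cassandra
    features.update(request)           # request context wins on conflicts
    return model(features)

request = {"hour_of_day": 18, "distance_km": 2.5}
online_features = {"restaurant.avg_prep_time_7d": 14.0}

# toy delivery-time model: prep time plus travel at ~4 min/km
model = lambda f: f["restaurant.avg_prep_time_7d"] + 4.0 * f["distance_km"]
estimate = serve_prediction(request, online_features, model)
```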

Monitoring and Management

  • Effective monitoring is crucial; predictions are logged back to Hadoop for outcome analysis and alerting through metric systems.

Machine Learning Workflow Management and Deployment

Overview of the Platform

  • The platform facilitates workflow management and deployment, allowing users to write automation code in Python or Java to interact with the system.
  • Projects serve as containers for modeling problems, enabling connections to data sources like Hive tables for model training.

Model Visualization and Metrics

  • Users can access visualizations such as confusion matrices and various metrics that assess model accuracy.
  • Detailed reports on model features are available, showcasing distributions and split points within decision trees.

Model Deployment Process

  • The deployment process involves packaging the model, which takes a few minutes before it becomes operational.
  • A history log tracks all deployed models, including timestamps and user actions.

Enhancements in Machine Learning Development

Accelerating Machine Learning Projects

  • Current efforts focus on making machine learning faster and easier from idea generation through prototyping to production scaling.
  • Introduction of PyML, a new Python-focused path that caters specifically to data scientists who prefer working in Python over web UIs or other languages.

AutoML Features

  • The platform includes auto-tuning capabilities that assist users in training effective models without needing extensive manual configuration.

Real-Time Monitoring Improvements

  • New features allow for real-time monitoring of models in production, enhancing responsiveness compared to previous hourly refresh rates.

Addressing Developer Experience Challenges

Identifying Friction Points

  • The development process is recognized as having friction points at every step; efforts are underway to streamline these workflows.

Empowering Data Scientists

  • By allowing data scientists more ownership over their workflows—from prototype through production—efficiency is expected to improve significantly.

Evolution of Michelangelo's Capabilities

Target Audience Shift

  • Initially designed for high-scale use cases involving large datasets, Michelangelo has evolved due to feedback indicating a need for greater usability among data scientists facing unique challenges.

Integration with Python Ecosystem

Infrastructure and Model Deployment Overview

Trade-offs in Model Infrastructure

  • Discussion of the trade-offs between PyML and other serving paths, highlighting flexibility, resource efficiency, scale, and latency.
  • Introduction of a simple logistic regression model built using a Pandas DataFrame within a Jupyter notebook environment.

Model Training and Serving

  • Explanation of training a logistic regression model with scikit-learn; includes saving the model file and implementing an interface for predictions.
  • Description of uploading the trained model to Michelangelo back-end for management via UI or API, enabling real-time request-response scoring.
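The "save the model and expose a predict interface" pattern can be sketched with stdlib pieces only. Here the logistic-regression coefficients are hand-set rather than fit with scikit-learn, and the class name is hypothetical; the pickle round-trip stands in for saving the model file and having the serving side load it.

```python
import math
import pickle

class LogisticModel:
    """Minimal predict interface of the kind a serving back end
    can call uniformly, whatever library trained the model."""
    def __init__(self, weights, bias):
        self.weights, self.bias = weights, bias

    def predict(self, features):
        # standard logistic regression: sigmoid of the linear score
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return 1.0 / (1.0 + math.exp(-z))

model = LogisticModel(weights=[2.0, -1.0], bias=0.0)
blob = pickle.dumps(model)          # "save the model file"
restored = pickle.loads(blob)       # what the serving side would load
p = restored.predict([1.0, 1.0])
```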

Deployment Options

  • Overview of deploying models either through UI or API to high-scale infrastructure for real-time scoring or batch processing using Spark jobs.
  • Emphasis on leveraging Python's flexibility while providing necessary infrastructure to support data scientists' tools at scale.

Model Packaging and Serving Architecture

Local Development Environment

  • Outline of the workflow: training models locally, saving them, building Docker containers with dependencies for deployment.
  • Explanation of how models can be pushed to online serving systems for both batch predictions (via Spark jobs) and online predictions (request-response).

High Scale Prediction Service

  • Description of deploying nested Docker containers that encapsulate Python resources; existing prediction services act as proxies routing requests to local containers.
  • Flexibility in using various libraries like scikit-learn or custom algorithms while noting potential trade-offs such as higher latency.

Distributed Deep Learning Systems

Advantages of Distributed Systems

  • Introduction to Horovod, Uber's distributed deep learning system, noted for its efficient scaling compared to other approaches.
  • Simplified API design allows easier transition from single-node training jobs to distributed environments.

Example from TensorFlow Documentation

  • Reference to an example setup from TensorFlow documentation demonstrating distributed training with minimal complexity involved in environment setup.

Challenges in Model Evaluation

Global Accuracy Metrics Limitations

Understanding Model Behavior Across Data Segments

Importance of Data Segmentation

  • Different segments of data exhibit unique characteristics, leading to varied model performance. A model may perform well overall but poorly on critical data slices.
  • Visualization tools are being developed to help users analyze how models behave on smaller data segments, identifying anomalies or problematic areas.
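A per-segment breakdown makes the "good overall, bad on a slice" failure mode concrete. This is a generic sketch, not a Michelangelo API; the segment labels and rows are hypothetical.

```python
from collections import defaultdict

def accuracy_by_segment(rows):
    """rows: (segment, predicted_label, true_label) tuples. Returns
    overall accuracy plus a per-segment breakdown, surfacing slices
    where a globally good model is failing."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for segment, pred, truth in rows:
        total[segment] += 1
        correct[segment] += int(pred == truth)
    per_segment = {s: correct[s] / total[s] for s in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_segment

rows = [("city_a", 1, 1), ("city_a", 0, 0), ("city_a", 1, 1),
        ("city_b", 1, 0), ("city_b", 0, 1)]
overall, per_segment = accuracy_by_segment(rows)
# overall looks passable (0.6) while city_b is completely wrong (0.0)
```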

Hyperparameter Tuning Challenges

  • After determining model features, selecting the right combination of hyperparameters is crucial for achieving optimal accuracy. This process can be complex due to multiple parameters involved.
  • Common tuning methods include brute-force grid search and random search; however, these approaches can be inefficient and time-consuming.
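Random search, the baseline the talk calls inefficient, looks like this. The objective and search space are toy stand-ins; Bayesian/black-box optimizers replace the uniform sampling below with a model of the objective so good settings are found in fewer trials.

```python
import random

def random_search(objective, space, n_trials, seed=0):
    """Baseline hyperparameter search: sample configurations
    uniformly from `space` and keep the best-scoring one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {name: rng.choice(choices) for name, choices in space.items()}
        score = objective(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# toy objective with a known optimum at depth=6, lr=0.1
objective = lambda c: -abs(c["depth"] - 6) - 10 * abs(c["lr"] - 0.1)
space = {"depth": [2, 4, 6, 8], "lr": [0.01, 0.1, 0.3]}
best_cfg, best_score = random_search(objective, space, n_trials=50)
```

Note that every trial costs a full model training run in practice, which is why cutting the number of trials matters so much.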

Advanced Optimization Techniques

  • Black box optimization offers a more efficient way to explore hyperparameter combinations by learning the parameter space and directing searches towards optimal settings.
  • Collaboration with Uber's research teams has led to the development of Bayesian black-box optimization techniques that outperform traditional grid search in fewer iterations.

Ensuring Model Performance in Production

Monitoring Model Behavior

  • Once deployed, it’s essential to ensure that models function correctly in production environments. This involves linking predictions back to actual outcomes.
  • Delays in outcome reporting (e.g., credit card fraud cases taking 90 days for results) complicate timely assessments of model performance.

Anomaly Detection Strategies

  • An alternative approach involves monitoring distributions of input features and predictions over time, allowing for quicker identification of anomalies despite lower precision.
  • Regular distributions are expected; deviations often indicate issues such as bad incoming data from broken services or pipelines.
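One simple form of this monitoring is comparing the current window of a feature against its historical baseline. The z-score threshold here is an illustrative choice, not the talk's method; real systems compare full distributions, but the trade-off is the same—fast, coarse signals versus slow, precise outcome joins.

```python
import statistics

def drift_alert(baseline, current, threshold=3.0):
    """Coarse drift check: flag when the current window's mean is more
    than `threshold` baseline standard deviations from the baseline
    mean. Cheap and immediate, but lower precision than joining
    predictions to true outcomes, which may lag by days or months."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    z = abs(statistics.mean(current) - mu) / sigma
    return z > threshold, z

baseline = [10, 11, 9, 10, 10, 11, 9, 10]   # healthy feature values
healthy  = [10, 10, 11]
broken   = [0, 0, 0]                        # e.g. an upstream service outage

alert_ok, _ = drift_alert(baseline, healthy)
alert_bad, _ = drift_alert(baseline, broken)
```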

Key Lessons Learned in Machine Learning Development

Focus on Developer Tools

  • Emphasizing productivity by providing developers with effective tools early on is crucial. The focus has shifted towards enhancing velocity while still considering scalability.

Infrastructure and Automation Needs

Real-Time Machine Learning Challenges

The Complexity of Real-Time ML

  • Real-time machine learning (ML) is described as challenging and difficult to implement effectively.
  • There is a significant focus on empowering modelers to take ownership of the end-to-end process in real-time ML systems.
  • Investments are being made to develop systems that can operate autonomously, reducing the burden on developers.
  • The goal is for developers to concentrate more on modeling work rather than managing system complexities.
Video description

Jeremy Hermann talks about Michelangelo - the ML Platform that powers most of the ML solutions at Uber. The early goal was to enable teams to deploy and operate ML solutions at Uber scale. Now, their focus has shifted towards developer velocity and empowering the individual model owners to be fully self-sufficient from early prototyping through full production deployment & operationalization. This presentation was recorded at QCon San Francisco 2018: https://bit.ly/2uYyHLb Video with transcript included: http://bit.ly/2Psd9B6