Low-Code Machine Learning on Databricks with AutoML
Introduction
In this section, Nicholas and Stephanie introduce themselves and provide an overview of the topics they will cover in the video.
Low-Code Machine Learning on Databricks
- Nicholas and Stephanie discuss low-code machine learning on Databricks.
- They focus mostly on AutoML but also show other low-code tools.
- The goal is to answer the question: does applying machine learning have to be a challenge?
Personal Story
- Nicholas shares his personal story about working with a domain expert named Chuck in the aerospace field.
- Chuck had extensive knowledge about jet engines but lacked coding skills.
- Nicholas had coding skills but lacked domain expertise.
- They worked together to build machine learning models for trend analysis of fleets of military jet engines.
Challenges of Applying Machine Learning
- Applying machine learning involves several challenges, including exploratory data analysis, software development skills, model training and optimization, deployment and hosting, tracking models for accuracy, and scaling production-level systems.
Databricks Solutions
- Databricks offers solutions that cover many aspects of applying machine learning.
- Specifically, they will focus on AutoML in this video.
Data Science and Low Code
In this section, the speaker explains how data science and machine learning fit into the lakehouse and discusses low-code tools and their benefits.
Data Science in the Lakehouse
- The lakehouse covers a wide range of workloads, including data engineering, data streaming, and data warehousing.
- AutoML makes people more productive by automating the repetitive steps of building a model.
- MLflow is a tool for reproducible research that tracks every step of every experiment.
- bamboolib is a new tool integrated into Databricks notebooks that simplifies day-to-day development work.
Low-Code Demo
In this section, the speaker demonstrates how to use bamboolib in a low-code environment.
Using bamboolib
- Open a notebook and import bamboolib as bam.
- Load a database table (in this case, COVID hospitalizations).
- Use the menu to access different options such as history or export.
- Be careful when inputting values (e.g., row limit) to avoid errors.
Formatting Date as a Time Series
In this section, the speaker explains how to format dates as time series when working with time series data.
Formatting Dates
- When working with time series data, it is important to ensure that the date is formatted correctly.
- Click on the date column and explore it to ensure there are no missing values.
- If the value column is still considered an object, convert it to a float so that it can be viewed as a time series.
- Once converted, view the data over time to see any patterns or waves.
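The same clean-up can be done in code rather than through the menus. The following is a minimal pandas sketch, not the code from the demo; the sample rows are made up, and the column names "date" and "value" are assumptions based on the columns mentioned above.

```python
import pandas as pd

# Hypothetical sample rows; column names "date" and "value" are assumptions.
df = pd.DataFrame({
    "date": ["2021-01-03", "2021-01-01", "2021-01-02"],
    "value": ["128.0", "120.0", "135.5"],  # stored as text rather than numbers
})

# Parse the date strings into real timestamps.
df["date"] = pd.to_datetime(df["date"])

# Convert the value column to float; entries that cannot be parsed become NaN.
df["value"] = pd.to_numeric(df["value"], errors="coerce").astype(float)

# Count missing values so gaps are caught before plotting or forecasting.
missing = df[["date", "value"]].isna().sum()

# Index by date and sort so the values can be viewed over time.
series = df.set_index("date")["value"].sort_index()
```

Calling `series.plot()` in a notebook would then show the values over time.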
Understanding Data Metrics
In this section, the speaker discusses understanding data metrics when trying to predict future trends.
Filtering Data Metrics
- To predict future trends accurately, we need to filter down our data metrics and focus on a single metric.
- Click on the indicator column and notice that there are many different metrics available.
- Filter down by selecting "daily hospital occupancy" rows only.
- Further filter by selecting "United States" rows only.
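The two filters above amount to a boolean mask on two columns. This is a small pandas sketch under assumptions: the column holding the country is assumed to be called "entity", and the exact indicator string is assumed to be "Daily hospital occupancy"; check both against the real table.

```python
import pandas as pd


def filter_metric(df, indicator, entity):
    """Keep only the rows for one indicator and one entity (country)."""
    mask = (df["indicator"] == indicator) & (df["entity"] == entity)
    return df.loc[mask].reset_index(drop=True)


# Hypothetical rows for illustration only.
sample = pd.DataFrame({
    "entity": ["United States", "United States", "Canada"],
    "indicator": ["Daily hospital occupancy", "Daily ICU occupancy",
                  "Daily hospital occupancy"],
    "value": [120.0, 30.0, 15.0],
})

us_occupancy = filter_metric(sample, "Daily hospital occupancy", "United States")
```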
Writing the Dataset to a Table
In this section, the speaker explains how to write a dataset to a table for use in AutoML experiments.
Writing Dataset
- Create a new table for storing data using SQL commands.
- Name the table and specify what type of data will be stored in it (in this case daily hospitalizations).
- Copy and paste code into a cell for easy reference later on.
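One way to do this from a notebook cell is sketched below. It is an illustration, not the code shown in the video: the function name, the staging view name, and the default table name `daily_hospitalizations` are assumptions, and `spark` is the session object that Databricks notebooks provide.

```python
def write_table_with_sql(spark, pandas_df, table_name="daily_hospitalizations"):
    """Stage a pandas DataFrame as a temp view, then create a table from it."""
    view_name = f"{table_name}_staging"  # hypothetical staging view name

    # Convert to a Spark DataFrame and expose it to SQL as a temporary view.
    spark.createDataFrame(pandas_df).createOrReplaceTempView(view_name)

    # Create (or replace) the table from the staged rows.
    statement = f"CREATE OR REPLACE TABLE {table_name} AS SELECT * FROM {view_name}"
    spark.sql(statement)
    return statement
```

In a notebook this would be called as `write_table_with_sql(spark, df)`, where `df` is the filtered pandas DataFrame.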
Starting AutoML Experiment
In this section, the speaker explains how to start an AutoML experiment in Databricks.
Starting Experiment
- Navigate to the Machine Learning tab within Databricks.
- Select "Experiments" and create a new AutoML experiment.
- Choose the cluster and problem type (forecasting in this case).
- Select the dataset to be used for the experiment.
Advanced Configuration Options
The speaker discusses the advanced configuration options available in the AutoML tool.
Evaluation Metrics
- The tool defaults to an evaluation metric suited to the problem type (SMAPE for forecasting), which is a solid option if you are unsure of what to use.
- However, there are other options available for more advanced users.
Limiting Training Frameworks
- You can limit the training frameworks by excluding some, such as Prophet, to speed up your AutoML experiment.
Timeout Feature
- The timeout feature allows you to set a time limit for your models to train.
- This is useful if you have limited time and want to ensure that your models will be done by a certain point.
- If training stops early because the models are no longer improving, the experiment ends before the timeout instead of running unnecessarily.
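The same options are also exposed through the AutoML Python API. The sketch below collects them as keyword arguments for `databricks.automl.forecast`; the column names, horizon, and timeout value are assumptions chosen for illustration, and the `databricks.automl` import only works on a Databricks ML runtime.

```python
def build_forecast_args(target_col="value", time_col="date", horizon=30,
                        frequency="D", primary_metric="smape",
                        exclude_frameworks=("prophet",), timeout_minutes=20):
    """Collect the advanced options into keyword arguments for automl.forecast."""
    return {
        "target_col": target_col,            # column to predict (assumed name)
        "time_col": time_col,                # timestamp column (assumed name)
        "horizon": horizon,                  # number of periods to forecast
        "frequency": frequency,              # "D" means daily data
        "primary_metric": primary_metric,    # metric used to rank trials
        "exclude_frameworks": list(exclude_frameworks),  # frameworks to skip
        "timeout_minutes": timeout_minutes,  # upper bound on experiment time
    }


def run_forecast(dataset, **overrides):
    """Start an AutoML forecasting experiment; requires a Databricks ML runtime."""
    from databricks import automl  # not installable outside Databricks
    return automl.forecast(dataset=dataset, **build_forecast_args(**overrides))
```

On a cluster, `run_forecast(spark.table("daily_hospitalizations"), timeout_minutes=10)` would launch the experiment with a shorter time limit.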
Launching an Experiment and Viewing Results
The speaker demonstrates how to launch an experiment and view its results.
Launching an Experiment
- To launch an experiment, simply click on the "Start" button.
- While training, the tool will tweak hyperparameters automatically.
Viewing Results
- Each trial is logged as a run within the experiment, with its own parameters and evaluation metrics.
- AutoML generates two notebooks: one for data exploration and one for the best model found during training.
- These notebooks are auto-generated code that can be used freely.
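The results shown in the experiment page can also be queried with MLflow. This is a minimal sketch, assuming the validation metric is logged under the name `val_smape` (an assumption; check the metric names in your own experiment) and that lower values are better.

```python
def order_clause(metric, lower_is_better=True):
    """Build an MLflow order_by clause for ranking runs by one metric."""
    direction = "ASC" if lower_is_better else "DESC"
    return f"metrics.{metric} {direction}"


def best_run(experiment_id, metric="val_smape", lower_is_better=True):
    """Return the top-ranked run of an experiment, or None if it has no runs."""
    import mlflow  # imported lazily so the helper above works without MLflow

    runs = mlflow.search_runs(
        experiment_ids=[experiment_id],
        order_by=[order_clause(metric, lower_is_better)],
        max_results=1,
    )
    return runs.iloc[0] if len(runs) else None
```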
Benefits of AutoML Tool
The speaker explains some benefits of using the AutoML tool.
Glass Box Approach
- The AutoML tool uses a "glass box" approach, meaning that it does not hide any of the code or processes used during training.
- This allows users to see exactly what is happening and make changes as needed.
Free Code
- All generated code is free to use and can be cloned into a git repository for future use.
Domain Knowledge vs. Python Expertise
- The AutoML tool is useful for those with domain knowledge but limited Python expertise.
- Users can set up their data sets and experiment without needing to know how to write specific code.
- Results can then be passed on to data scientists for further analysis.
Best Model Found During Training
The speaker discusses the best model found during training.
Best Model Notebook
- The best model notebook is auto-generated code that shows all of the steps taken during training.
- It includes graphics that show the accuracy of the model over time.
Solid Starting Point
- The best model notebook provides a solid starting point for further analysis or deployment.
- Data scientists can take this notebook and build upon it as needed.
Registering a Model to the Model Registry
In this section, the speaker explains how to register a model to the model registry and its benefits.
Benefits of Model Registry
- The model registry stores all versions of your models.
- It keeps track of what's been reviewed and what hasn't been reviewed.
- You can view inputs, outputs, and metrics associated with each version.
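Registration can be done from the UI or in code. The sketch below uses `mlflow.register_model`; the artifact path `model` and the function names are assumptions for illustration, and the run ID would come from the AutoML experiment.

```python
def run_model_uri(run_id, artifact_path="model"):
    """Build the URI of a model logged under a given MLflow run."""
    return f"runs:/{run_id}/{artifact_path}"


def register_best_model(run_id, name):
    """Register the model from a run as a new version under the given name."""
    import mlflow  # imported lazily; requires access to an MLflow tracking server

    return mlflow.register_model(model_uri=run_model_uri(run_id), name=name)
```

Each call with the same `name` creates a new version, which is how the registry accumulates the version history described above.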
Reviewing Models Before Moving to Production
- Before moving a model into production or staging, it is important to ensure that it works.
- An automated job should be run to check for descriptions, basic accuracy, and prediction capability.
- A human should also review the model before pushing it into production.
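The automated part of that review could look roughly like the following. It is a hypothetical sketch of the three checks named above (description, basic accuracy, prediction capability), not the job used by the speakers; the function name, parameters, and threshold are all assumptions.

```python
def review_model(description, metric_value, max_metric, predict, sample_input):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []

    # Check 1: the model version must carry a non-empty description.
    if not description or not description.strip():
        problems.append("missing description")

    # Check 2: the error metric must not exceed the agreed threshold.
    if metric_value > max_metric:
        problems.append(f"metric {metric_value} exceeds limit {max_metric}")

    # Check 3: the model must return one prediction per input row.
    try:
        output = predict(sample_input)
        if output is None or len(output) != len(sample_input):
            problems.append("prediction returned the wrong number of rows")
    except Exception as exc:
        problems.append(f"prediction failed: {exc}")

    return problems
```

A model version with an empty list of problems would then go to a human reviewer before promotion.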
Doing Inference with Registered Models
- Batch inference is an option for running predictions on data and storing them in a table for use in dashboards.
- Real-time inference is served through an API that any application can call.
- Auto-generated code makes doing inference easier by saving time.
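Both inference modes can be sketched in a few lines. This is an illustration rather than the auto-generated code: the stage name `Production`, the function names, and the `dataframe_records` request format are assumptions to verify against your workspace.

```python
import json


def registered_model_uri(name, stage="Production"):
    """Build the URI of a registered model at a given stage."""
    return f"models:/{name}/{stage}"


def batch_predict(name, features, stage="Production"):
    """Load a registered model and score a pandas DataFrame of features."""
    import mlflow.pyfunc  # imported lazily; requires access to the registry

    model = mlflow.pyfunc.load_model(registered_model_uri(name, stage))
    return model.predict(features)


def scoring_payload(records):
    """Serialize rows as the JSON body for a real-time scoring request."""
    return json.dumps({"dataframe_records": records})
```

The batch result can be written to a table for dashboards, while the JSON payload would be sent in a POST request to the model's serving endpoint.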
Centralizing Models in One Place
In this section, the speaker discusses how centralizing models in one place can make managing them easier.
Challenges Faced When Managing Multiple Models
- Different models came from different sources, such as C# code or Python scripts.
- Some features were generated in stored procedures in MSSQL databases.
- This made managing models difficult as they were scattered across different places.
Benefits of Centralizing Models
- Centralizing models makes it easy to manage them from one place.
- Data scientists can easily test new models by looking at the inputs, outputs, and metrics associated with each version.
- Approving or denying new models becomes easier.
Doing Batch Inference with Auto-generated Code
In this section, the speaker explains how auto-generated code can make doing batch inference easier.
Benefits of Auto-generated Code
- Auto-generated code saves time by eliminating the need to copy and paste notebooks.
- It generates all the necessary pieces for running predictions on data and saving them in a table.
Lowering the Barrier of Entry
- The goal is to lower the barrier of entry for machine learning.
- This does not mean that you will never need to know Python or write code again.
- It just makes it easier to get started with machine learning and save time.
Conclusion
The speaker discusses how registering models to a model registry, centralizing models in one place, and using auto-generated code can make managing and doing inference with models easier. The goal is to lower the barrier of entry for machine learning while still requiring knowledge of Python and coding.
Introduction to AutoML Features in Databricks
In this section, the speaker introduces the problem types currently supported in AutoML (forecasting, regression, and classification) and how it helps with feature importance.
Features Supported in AutoML
- AutoML currently supports forecasting, regression, and classification, along with feature importance.
- SHAP values are a fan favorite for looking at feature importance.
- Training on AutoML is available through Databricks Academy. However, there is no training on bamboolib at this time.
- The 8080 Labs website has many videos and resources for learning about the transformations that can be done with bamboolib.
- A blog post was recently published about using AutoML, with an example notebook that generates data for regression.
Limitations of AutoML
This section discusses the limitations of AutoML when it comes to handling large datasets.
Handling Large Datasets
- Current versions of the ML runtimes allow larger datasets to be processed. However, if the dataset is too large to run in a single experiment, it will be randomly sampled.
- Once you have the generated notebook, you can run it on all of your data. However, running AutoML on a huge dataset can consume a great deal of compute if you are not careful.
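The idea of random sampling can be illustrated in plain Python. This sketch only shows the concept of capping a dataset at a fixed number of rows; it is not how AutoML implements sampling, and the function name and seed are assumptions.

```python
import random


def sample_rows(rows, max_rows, seed=42):
    """Return all rows if they fit, otherwise a reproducible random subset."""
    if len(rows) <= max_rows:
        return list(rows)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(rows, max_rows)
```

Sampling the data yourself before an experiment gives you control over how much compute the run can use.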
Conclusion
The speaker concludes by pointing to the Databricks Community as a helpful resource.
Helpful Resources
- The Databricks Community is a helpful resource for answering questions related to Databricks.