ML Zoomcamp 1.5 - Model Selection Process

Introduction to Model Selection Process

In this lesson, we will discuss the modeling step in machine learning and focus on the process of selecting the best model, known as the model selection process.

Modeling Step and Goal of Model Selection

  • The modeling step is where machine learning actually happens.
  • Different models are tried to choose the best one for a specific problem.
  • Logistic regression, decision trees, neural networks, etc., are examples of different models that will be covered in later sessions.
  • The goal is to find a model that works well for the given problem.

Mimicking Real-world Usage

  • Models are used to make predictions on unseen data.
  • To evaluate a model's performance, it needs to be tested on data it hasn't seen during training.
  • A validation dataset is created by setting aside a portion (e.g., 20%) of the entire dataset.
  • The remaining data (80%) is used for training only.
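The 80/20 holdout split above can be sketched in plain NumPy; the function name `split_train_val` and the seed are illustrative choices, not from the lesson:

```python
import numpy as np

def split_train_val(n, val_fraction=0.2, seed=42):
    """Shuffle row indices and split them into train/validation parts."""
    idx = np.arange(n)
    rng = np.random.default_rng(seed)
    rng.shuffle(idx)                 # shuffle so the split is random
    n_val = int(n * val_fraction)
    val_idx = idx[:n_val]            # e.g. first 20% -> validation
    train_idx = idx[n_val:]          # remaining 80% -> training
    return train_idx, val_idx

train_idx, val_idx = split_train_val(1000)
print(len(train_idx), len(val_idx))  # 800 200
```

Shuffling before splitting matters: if the rows are ordered (say, by date), taking the last 20% unshuffled would give a validation set that does not resemble the training data.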

Evaluating Model Performance

  • Feature matrix (X) and target variable (y) are extracted from the training dataset.
  • A model (g) is trained using X and y.
  • Another feature matrix (X_validation) and target variable (y_validation) are extracted from the validation dataset.
  • The trained model (g) is applied to X_validation to obtain predictions (y_hat).
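The train-then-predict step can be illustrated with a toy stand-in for the model g; here a tiny nearest-centroid classifier on synthetic data plays the role of g (in the course, g would be a real model such as logistic regression):

```python
import numpy as np

# Synthetic data standing in for the extracted matrices.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(80, 3))           # feature matrix X (training)
y_train = (X_train[:, 0] > 0).astype(int)    # target variable y
X_val = rng.normal(size=(20, 3))             # X_validation

# "Training" g: learn one centroid per class from X, y.
centroids = {c: X_train[y_train == c].mean(axis=0) for c in (0, 1)}

def g(X):
    """Predict the class whose centroid is closest to each row of X."""
    d0 = np.linalg.norm(X - centroids[0], axis=1)
    d1 = np.linalg.norm(X - centroids[1], axis=1)
    return (d1 < d0).astype(int)

y_hat = g(X_val)  # predictions on the validation feature matrix
```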

Comparing Predictions with Actual Values

  • Predictions (y_hat) are compared with actual values (y_validation).
  • Accuracy is calculated by dividing the number of correct predictions by the total number of predictions made.
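With NumPy, this comparison is a one-liner; the labels below are a made-up toy example:

```python
import numpy as np

y_val = np.array([1, 0, 1, 1, 0, 1, 0, 0])  # actual labels (toy example)
y_hat = np.array([1, 0, 0, 1, 0, 1, 1, 0])  # model predictions

# Elementwise comparison gives booleans; their mean is the fraction correct.
accuracy = (y_hat == y_val).mean()
print(accuracy)  # 0.75 (6 correct out of 8)
```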

Evaluating Multiple Models

In this section, we explore evaluating multiple models using accuracy as a metric.

Comparing Model Accuracies

  • Different models like logistic regression, decision tree, random forest, and neural network are evaluated.
  • Each model is trained on the training dataset and tested on the validation dataset.
  • Accuracy is calculated for each model to determine its performance.
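The compare-several-models loop might look like the following sketch with scikit-learn; the model choices mirror the lesson, but the data is synthetic and the exact classifiers and settings are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic train/validation data standing in for the real split.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = (X_train.sum(axis=1) > 0).astype(int)
X_val = rng.normal(size=(50, 5))
y_val = (X_val.sum(axis=1) > 0).astype(int)

models = {
    "logistic regression": LogisticRegression(),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)             # train on the training set
    y_hat = model.predict(X_val)            # predict on the validation set
    scores[name] = (y_hat == y_val).mean()  # validation accuracy

best = max(scores, key=scores.get)          # pick the best-scoring model
```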

Example Comparison

  • Logistic regression: 66% accuracy
  • Decision tree: 60% accuracy
  • Random forest: 67% accuracy
  • Neural network: 80% accuracy

The above accuracies are just hypothetical examples.

Conclusion

In this lesson, we learned about the modeling step in machine learning and the importance of selecting the best model. We discussed how to mimic real-world usage by creating a validation dataset and evaluating model performance using accuracy. Finally, we explored comparing multiple models based on their accuracies.

The Problem with a Simple Model

In this section, the speaker discusses the problem with using a simple model to classify spam emails. They use a coin flip analogy to illustrate how different models can produce varying results based on chance.

Coin Flip Model Example

  • A coin flip model is used as an example to demonstrate the issue.
  • The model flips a coin and classifies emails as spam or not spam based on whether it lands on heads or tails.
  • Different currencies (e.g., euro, dollar, zloty) are used to represent different models.
  • Each currency produces different sequences of correct and incorrect classifications.
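The coin-flip analogy is easy to simulate: each "model" predicts at random, yet on a small validation set some coins will score noticeably better than others purely by chance. The labels and coin names below are toy data:

```python
import numpy as np

rng = np.random.default_rng(7)
y_val = rng.integers(0, 2, size=20)  # true spam/not-spam labels (toy data)

for coin in ["euro", "dollar", "zloty"]:
    y_hat = rng.integers(0, 2, size=20)  # heads/tails -> spam/not spam
    acc = (y_hat == y_val).mean()
    print(f"{coin}: {acc:.2f}")
```

Every coin has a true accuracy of 50%, but the observed accuracies scatter around it; the best coin looks like a good classifier even though it learned nothing.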

Randomness and Luck in Models

  • The example shows that one model may get lucky and perform well on the validation data set purely by chance.
  • This randomness can also occur with real machine learning models, where one model may perform well on a specific subset of data due to luck.
  • This phenomenon is known as the multiple comparisons problem in statistics.

Need for Multiple Data Sets

  • To guard against lucky models, it is important to have separate training, validation, and test data sets.
  • The training data set is used to train the models, while the validation data set helps select the best model.
  • Finally, the test data set is used to validate the chosen model's performance without bias.

Final Model Selection

  • After selecting the best model using the validation data set, it is crucial to evaluate its performance on the test data set.
  • This step ensures that the chosen model performs well consistently and was not just lucky on the validation set.

Holding Out Two Data Sets for Model Evaluation

In this section, the speaker explains why two separate data sets (validation and test) are necessary for evaluating machine learning models accurately.

Splitting Data into Three Subsets

  • The original data set is divided into three subsets: training, validation, and test.
  • The percentages allocated to each subset can vary but are typically 60% for training, 20% for validation, and 20% for testing.
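A 60/20/20 three-way split can be sketched by extending the earlier holdout idea; the function name `split_3way` is an illustrative choice:

```python
import numpy as np

def split_3way(n, seed=42):
    """Shuffle indices and split them 60/20/20 into train/validation/test."""
    idx = np.arange(n)
    np.random.default_rng(seed).shuffle(idx)
    n_val = int(n * 0.2)
    n_test = int(n * 0.2)
    n_train = n - n_val - n_test  # whatever remains (~60%) is training
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

train_idx, val_idx, test_idx = split_3way(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
```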

Model Selection with Validation Data

  • The model selection process involves training different models on the training data set and evaluating their performance on the validation data set.
  • The best-performing model is chosen based on accuracy or other evaluation metrics.

Evaluating Chosen Model with Test Data

  • To ensure that the chosen model's performance is not due to luck or overfitting, it needs to be evaluated on the test data set.
  • Applying the chosen model to the test data helps validate its generalizability and assess its real-world performance.

Avoiding Luck in Model Selection

In this section, the speaker emphasizes the importance of using a separate test data set to avoid relying solely on luck when selecting a machine learning model.

Different Models' Performance

  • Various models (e.g., logistic regression, decision tree, random forest, neural network) are evaluated based on their accuracy in predicting spam emails.
  • Each model produces different accuracy rates ranging from 60% to 80%.

Selecting Best Model

  • After comparing the models' performances, the best model (neural network) is selected based on its higher accuracy rate.

Validating Chosen Model with Test Data

  • To ensure that the neural network's high accuracy rate is not due to luck or chance alignment with the validation data set, it needs to be tested on unseen test data.
  • Applying the neural network to the test data helps confirm its effectiveness and reliability as a spam email classifier.

The Six-Step Model Selection Process

In this section, the speaker explains the process of model selection using a split data set approach.

Splitting the Data Set

  • The first step in the model selection process is to split the data set into three parts: training, validation, and test.
  • This allows for training the model on one part, validating it on another, and testing it on the remaining part.

Training and Validating the Model

  • After splitting the data set, the next step is to train a model using the training data.
  • The trained model is then applied to the validation data set to assess its accuracy or performance.
  • This process is repeated for different models and recorded in a table.

Selecting the Best Model

  • After evaluating multiple models, the best one is selected based on its performance on the validation data set.

Testing with Test Data

  • The best model from the previous steps is applied to the test data set to confirm its effectiveness.

Utilizing Unused Data

  • Instead of discarding unused validation data completely, it can be combined with training data to create a larger training set.
  • A new model can be trained using this combined dataset for improved performance.
  • The final model's performance is then checked against the test dataset.

Reusing the Validation Data

In this section, an alternative approach for utilizing unused validation data is discussed.

Combining Training and Validation Data

  • Rather than discarding unused validation data, it can be combined with training data to create a larger training dataset.

Training with Combined Dataset

  • A new model (model g) can be trained using this combined dataset which includes both original training and validation datasets.

Testing with Test Data Again

  • The newly trained model (model g) is then tested against the test dataset to evaluate its performance.

Retaining Validation Set

  • By utilizing the validation data for training, it is not completely wasted and still contributes to the final model selection process.
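The combining step can be sketched with `np.concatenate`; the toy arrays below simply stand in for the real feature matrices and targets:

```python
import numpy as np

# Toy stand-ins: 60 training rows and 20 validation rows with 3 features each.
X_train = np.ones((60, 3)); y_train = np.ones(60)
X_val = np.zeros((20, 3));  y_val = np.zeros(20)

# Once the best model is chosen, merge train + validation into one
# larger training set and retrain model g on it.
X_full = np.concatenate([X_train, X_val])  # 80 rows of features
y_full = np.concatenate([y_train, y_val])  # 80 targets
print(X_full.shape)  # (80, 3)
```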

Practical Implementation in Python

In this section, the speaker introduces the practical implementation of the discussed concepts using Python and relevant libraries.

Using Python for Model Selection

  • Python is used as the programming language for implementing the model selection process.

Required Libraries

  • Several Python libraries are used, including NumPy and Pandas.


Video description

Timecodes:

  • 00:00 Introduction
  • 01:02 Holdout + Train
  • 03:28 Making Predictions
  • 06:54 Scoring
  • 07:58 Multiple Comparisons Problem
  • 13:12 Train + Validation + Test
  • 15:17 Further scoring
  • 16:28 Model Selection (6 steps)
  • 20:49 Summary

Links:

  • Lesson page: https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/01-intro/05-model-selection.md
  • Slides: https://www.slideshare.net/AlexeyGrigorev/ml-zoomcamp-15-model-selection-process
  • Course GitHub repo: https://github.com/alexeygrigorev/mlbookcamp-code/tree/master/course-zoomcamp
  • Register here for the course: https://airtable.com/shr6Gz46UZCgJ9l6w
  • Public Google calendar: https://calendar.google.com/calendar/?cid=cGtjZ2tkbGc1OG9yb2lxa2Vwc2g4YXMzMmNAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ
  • The book - Machine Learning Bookcamp: http://bit.ly/mlbookcamp (Get 40% off with code "grigorevpc")
  • Join DataTalks.Club: https://datatalks.club/slack.html