How to Handle MISSING DATA: A Complete Guide
Understanding Missing Data in Machine Learning
Introduction to Missing Data
- The video discusses the inevitable issue of missing data in machine learning and data science projects, emphasizing its significance in exploratory data analysis.
- A guide will be provided for viewers to download, summarizing the techniques discussed regarding handling missing data.
Mechanisms of Missing Data
- There are three mechanisms behind missing data (commonly abbreviated MCAR, MNAR, and MAR):
- Completely Random (MCAR): Missingness is unrelated to any observed or unobserved data.
- Not Random (MNAR): The absence of data depends systematically on the missing values themselves (e.g., values below a threshold are never recorded).
- Random but Related (MAR): Missingness depends on other observed variables but not on the missing values themselves.
Completely Random Mechanism
- In this case, missing values appear across all categories with no dependence on their own values or on any other variable, so the observed data remain a representative sample.
Not Random Mechanism
- Here, values are systematically absent according to their own magnitude (e.g., values below a specific threshold are never recorded), which makes recovery difficult without external information.
Random but Related Mechanism
- Under this mechanism, the probability of a value being missing does not depend on the missing value itself, but it may depend on other observed variables in the dataset.
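The three mechanisms are easiest to see by simulating them. The sketch below (using pandas and NumPy, with invented `age`/`income` columns purely for illustration) masks the same variable three ways:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "income": rng.normal(50_000, 15_000, n),
})

# Completely random (MCAR): every income value goes missing with the
# same probability, regardless of anything in the data.
mcar = df.copy()
mcar.loc[rng.random(n) < 0.1, "income"] = np.nan

# Random but related (MAR): income is more likely to be missing for
# younger respondents -- missingness depends on the *observed* age,
# not on the income value itself.
mar = df.copy()
mar.loc[(df["age"] < 30) & (rng.random(n) < 0.4), "income"] = np.nan

# Not random (MNAR): high incomes are preferentially missing --
# missingness depends on the (unobserved) income value itself.
mnar = df.copy()
mnar.loc[df["income"] > 70_000, "income"] = np.nan
```

In the MNAR frame, every surviving income is below the threshold, which is exactly why that mechanism biases any naive analysis.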
Techniques for Handling Missing Data
- Two primary techniques exist for addressing missing data:
- Deletion: Removing records with missing values.
- Imputation: Estimating and filling in missing values using information from neighboring records or related columns.
Deletion Methods
- There are two deletion strategies:
- Listwise Deletion: Removes entire rows with any missing value, risking significant information loss.
- Pairwise Deletion: Only omits specific cells with missing values, preserving more information but potentially leading to inconsistent column sizes during model training.
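Both deletion strategies can be sketched with pandas on a toy frame. Listwise deletion is `dropna()`; pairwise deletion is not a single call but a behavior, and pandas' `corr()` exhibits it by using, for each pair of columns, only the rows where both values are present:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [10.0, 20.0, np.nan, 40.0],
    "c": [100.0, 200.0, 300.0, 400.0],
})

# Listwise deletion: drop every row that contains at least one NaN.
listwise = df.dropna()          # keeps rows 0 and 3 only

# Pairwise deletion: each pairwise computation uses all rows available
# for that pair of columns; pandas' corr() does this by default, so
# the 'a'-vs-'c' correlation still uses three rows, not two.
pairwise_corr = df.corr()
```

The trade-off is visible here: listwise deletion discards half the rows, while pairwise deletion keeps more information at the cost of each statistic being computed on a different sample size.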
Imputation Methodology
- Imputation aims to estimate missing values based on existing nearby data rather than arbitrarily assigning them. It should ideally maintain original distribution patterns within the dataset.
Imputation Techniques for Missing Data
Overview of Imputation Methods
- The original distribution of data should be preserved after imputation, which can be achieved through simple or multiple imputation techniques. These methods are recommended only if the missing data mechanism is random or completely random.
Simple Imputation Techniques
- Simple imputation uses an algorithm to make a single estimation for missing values. Common techniques include mean/median imputation, regression imputation, and hot deck imputation.
Mean/Median Imputation
- This method involves replacing missing values with the mean or median of known values in the variable. While easy to implement, it alters the data distribution by substituting many missing values with a single value.
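A minimal sketch of mean/median imputation with pandas, including a check of the distortion the text describes (variance shrinks because many identical values are injected at the center of the distribution):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan])

# Mean imputation: every NaN becomes the mean of the observed
# values, (1 + 2 + 4) / 3 = 7/3 ≈ 2.33.
s_mean = s.fillna(s.mean())

# Median imputation: the median (2.0) is more robust to outliers.
s_median = s.fillna(s.median())

# The distortion: the filled series has a smaller spread than the
# observed values, because identical values were stacked at the mean.
assert s_mean.std() < s.std()
```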
Regression Imputation
- In regression imputation, each missing value is replaced with the prediction of a regression model fitted on the rows where the value is observed, using the complete columns as predictors. This preserves the original data distribution better than mean/median substitution.
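Regression imputation can be sketched with scikit-learn's `LinearRegression`: fit on the rows where the target column is observed, then predict for the rows where it is missing. The synthetic `x`/`y` data below are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=n)
df = pd.DataFrame({"x": x, "y": y})
df.loc[rng.random(n) < 0.2, "y"] = np.nan   # punch MCAR holes in y

# Fit on the observed rows, predict the missing ones.
obs = df["y"].notna()
model = LinearRegression().fit(df.loc[obs, ["x"]], df.loc[obs, "y"])
df.loc[~obs, "y"] = model.predict(df.loc[~obs, ["x"]])
```

Because the imputed values follow the fitted line rather than a single constant, the relationship between `x` and `y` in the completed data stays close to the original one.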
Hot Deck Imputation
- Hot deck imputation replaces missing values with observed values from similar records ("donors") within the same category. The k-nearest neighbors (KNN) algorithm is commonly used here: each gap is filled with the average of the k closest observations.
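The KNN variant is available directly as scikit-learn's `KNNImputer`, a reasonable stand-in for the approach described above. It measures distance on the observed features (via `nan_euclidean_distances`) and averages the feature over the k nearest rows:

```python
import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],   # in feature 0, closest to rows 0 and 2
    [3.0, 6.0],
    [8.0, 16.0],
])

# Each NaN becomes the average of that feature over the k nearest
# rows, with distances computed on the features both rows observe.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
# Row 1's gap is filled with the mean of rows 0 and 2: (2 + 6) / 2 = 4
```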
Limitations of Simple Imputations
- While KNN provides more accurate imputations than mean/median methods, it can be computationally intensive when dealing with large datasets due to distance calculations for each missing value.
Multiple Imputation as an Advanced Technique
- Multiple imputation addresses limitations of single estimations by generating several estimates for each missing value and combining them into one final estimate, reducing bias in replacements.
Chained Equations Method (MICE)
- The most common multiple-imputation technique is MICE (Multiple Imputation by Chained Equations). Starting from an initial mean imputation, it cycles through the incomplete columns, regressing each one on the remaining columns and updating its imputed values, until a predefined number of iterations is reached.
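A MICE-style round-robin imputation is available in scikit-learn as `IterativeImputer` (the library describes it as inspired by MICE rather than a full multiple-imputation implementation, since by default it returns a single completed dataset). A minimal sketch on invented correlated data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
n = 300
a = rng.normal(size=n)
b = 3.0 * a + rng.normal(scale=0.2, size=n)
X = np.column_stack([a, b])
X[rng.random(n) < 0.2, 1] = np.nan   # holes in the second column

# Chained-equations style: start from mean imputation, then
# repeatedly regress each incomplete column on the others,
# refining the estimates for max_iter rounds.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X)
```

To obtain true multiple imputations, the same imputer can be run several times with `sample_posterior=True` and different random seeds, then the results pooled.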
Considerations for Effective Use of Techniques
- MICE is robust, but with linear regressions as its per-column estimators it assumes roughly linear relationships among variables; otherwise, more flexible statistical models or machine learning approaches may be needed.
Summary and Recommendations
- It’s crucial to first identify the mechanism behind the missing data before choosing a technique. If the mechanism is not random, imputation can introduce bias, and collecting additional data may be the better option.
Final Thoughts on Technique Selection
- The choice of technique depends significantly on the specific dataset characteristics in your project. A comprehensive guide with recommendations will help apply these techniques effectively in machine learning or data science projects.