How do we handle EXTREME VALUES in our data?
Understanding Outliers in Data Science
Introduction to Outliers
- The management of outliers, or extreme values, is crucial in any data science and machine learning project.
- The video will cover the definition of outliers, their causes, detection techniques, and how to handle them effectively.
Resources for Learning
- Viewers are invited to visit codificandobits.com for online courses and services related to data science and machine learning.
- Contact information is provided for tailored training services for individuals or companies.
Definition of Outliers
- An outlier is defined as a sample that is exceptionally distant from the majority of data points in a dataset.
- This definition may seem subjective but will be clarified with examples throughout the video.
Examples of Datasets
- Three datasets are introduced:
- Heights of individuals from a specific location (25,000 records).
- Bank transaction amounts (1,300 records).
- Salaries within a company in the U.S. (997 records).
Visualizing Outliers
- A graph representing heights shows most data points clustered between 160 cm and 185 cm.
- Notable outliers include heights of 198 cm and two significantly lower heights at 135 cm and 132 cm.
Analyzing Transaction Data
- In bank transactions, most values fall within a certain range; however, some transactions are identified as outliers.
- Examples include unusually high transactions ($953 and $1923), as well as an illogical negative transaction value.
Causes of Outliers
- Two primary reasons for outlier occurrence:
- Human error during data collection can lead to incorrect entries (e.g., negative height).
- Sensor malfunctions can produce abnormal readings outside expected ranges.
Real-world Implications
- Some observed outlier values may not stem from errors but could represent genuine phenomena worth studying.
- For instance, suspiciously high transaction amounts might indicate fraudulent activity rather than recording mistakes.
How to Detect Outliers in Data
Visualization Techniques for Identifying Outliers
- The first approach to detect outliers is through data visualization, which involves generating graphs to observe data behaviors. However, this method may not always be effective.
- For example, when visualizing height data, most values fall within a specific range, but some points lie outside this range, indicating potential outliers.
- In contrast, salary data has a more complex distribution in which extreme values are harder to spot by eye. A point at $240,000, for example, may or may not be an outlier.
- The effectiveness of visualization tools varies depending on the dataset's characteristics; sometimes they can easily identify outliers while other times they cannot.
Statistical Methods for Outlier Detection
Standard Deviation Approach
- When visualization fails, one alternative is using statistical methods like standard deviation to analyze data distribution and identify outliers.
- If the histogram of the dataset resembles a bell curve (Gaussian distribution), it indicates normality. Specific statistical tests can confirm this assumption.
- Once normality is confirmed, two parameters describe the distribution: the mean (the average value) and the standard deviation (a measure of how far values typically lie from the mean).
Understanding Normal Distribution
- In a normal distribution:
- About 68% of the data falls within one standard deviation of the mean.
- About 95% falls within two standard deviations.
- About 99.7% falls within three standard deviations.
- Values that exceed three or four standard deviations from the mean are considered extreme and likely represent outliers in normally distributed datasets.
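The percentages above can be checked numerically. The following is a minimal sketch using only Python's standard library; the helper name `share_within` is an illustrative choice and does not come from the video.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


# Coverage for 1, 2, 3 and 4 standard deviations.
coverage = {k: share_within(k) for k in (1, 2, 3, 4)}
for k, share in coverage.items():
    print(f"within {k} standard deviation(s): {share:.2%}")
```

The first three results round to the 68%, 95% and 99.7% figures quoted above, which is why values beyond three standard deviations are rare enough to be treated as candidates for outliers.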
Practical Application with Height Data
Analyzing Height Distribution
- Plotting the height distribution and computing the mean and the limits at three standard deviations on either side of it, using Python libraries such as Pandas, shows where most data points lie.
- Most height values cluster around the mean (approximately 172 cm), but some extreme values lie well beyond this range (e.g., heights as low as 132 cm).
Identifying Extreme Values
- Superimposing calculated limits on the graph helps determine which values fall outside acceptable ranges. Points outside these red bars are classified as extreme values or potential outliers.
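As a rough sketch of how these limits might be computed, the code below uses only Python's standard library rather than Pandas, so that it stays self-contained. The heights are synthetic (not the dataset from the video), and the function names `sigma_limits` and `find_outliers` are hypothetical.

```python
import random
from statistics import fmean, stdev


def sigma_limits(values, k=3.0):
    """Return (lower, upper) limits at k sample standard deviations around the mean."""
    mean = fmean(values)
    spread = stdev(values)
    return mean - k * spread, mean + k * spread


def find_outliers(values, k=3.0):
    """Return the values that fall outside the k-sigma limits."""
    lower, upper = sigma_limits(values, k)
    return [v for v in values if v < lower or v > upper]


# Synthetic heights in cm, for illustration only.
rng = random.Random(0)
heights = [rng.gauss(172.0, 5.0) for _ in range(1000)]
heights += [132.0, 135.0, 198.0]  # the extreme values mentioned in the notes

lower, upper = sigma_limits(heights)
outliers = find_outliers(heights)
print(f"limits: {lower:.1f} cm to {upper:.1f} cm, {len(outliers)} candidate outlier(s)")
```

Note that the extreme values themselves inflate the standard deviation slightly, so the limits are a little wider than they would be on clean data.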
Alternative Method: Interquartile Range (IQR)
Understanding Percentiles and Quartiles
- When the data does not follow a normal distribution, another method relies on the interquartile range (IQR).
- Percentiles divide a sorted dataset into 100 equal parts. The 25th percentile, for example, is also known as the first quartile.
Understanding Quartiles and Outliers in Data Analysis
Introduction to Quartiles
- The first quartile (Q1) represents the value below which 25% of the data falls, while the second quartile (Q2), or median, divides the dataset into two equal halves at 50%. The third quartile (Q3) indicates that 75% of the data is below this value.
- Whereas percentiles divide a dataset into 100 parts, quartiles divide it into four equal parts, which gives a compact summary of how the data is distributed.
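A small sketch of computing the three quartiles with Python's standard library follows. The salary figures are made up for illustration, and the `"inclusive"` interpolation method is an assumption; other tools may use slightly different rules and return slightly different values.

```python
from statistics import quantiles


def quartiles(values):
    """Return (Q1, Q2, Q3); n=4 asks for the three cut points between four parts."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q2, q3


# Hypothetical annual salaries in USD.
salaries = [40_000, 50_000, 60_000, 70_000, 80_000,
            90_000, 100_000, 110_000, 120_000]

q1, q2, q3 = quartiles(salaries)
print(f"Q1={q1:,.0f}  Q2 (median)={q2:,.0f}  Q3={q3:,.0f}")
```

For this toy list, a quarter of the salaries lie at or below Q1 and the median sits in the middle of the sorted values.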
Visualizing Quartiles with Salary Data
- A graphical representation using salary data illustrates how these quartiles are positioned. The red line marks Q1, indicating where the lowest 25% of salaries lie.
- The black line signifies Q2 (the median), showing that half of the salaries fall below this point, while half exceed it.
- Similarly, Q3 is represented by another red line above which only 25% of salaries exist.
Interquartile Range (IQR)
- The interquartile range is calculated as the difference between Q3 and Q1. It provides insight into the middle spread of data points within a dataset.
- To identify extreme values or outliers, one can establish lower and upper limits based on IQR: lower limit = Q1 - K * IQR; upper limit = Q3 + K * IQR, where K is typically set to 1.5.
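The limits described above can be sketched as follows. This is a minimal illustration under stated assumptions: the salaries are hypothetical, the names `iqr_limits` and `iqr_outliers` are invented for this example, and the quartiles use the standard library's `"inclusive"` method.

```python
from statistics import quantiles


def iqr_limits(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k * IQR and Q3 + k * IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the IQR-based limits."""
    lower, upper = iqr_limits(values, k)
    return [v for v in values if v < lower or v > upper]


# Hypothetical salaries in USD with one very high value.
salaries = [40_000, 50_000, 60_000, 70_000, 80_000,
            90_000, 100_000, 110_000, 240_000]

lower, upper = iqr_limits(salaries)
print(f"limits: {lower:,.0f} to {upper:,.0f}")
print("candidate outliers:", iqr_outliers(salaries))
```

A larger K widens the limits and flags fewer points; 1.5 is the conventional default mentioned above.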
Box Plot Visualization
- A box plot visually represents these quartiles and limits: a shaded box spans the IQR (from Q1 to Q3), a line inside it marks the median, and whiskers extend toward the lower and upper limits.
- In this box plot example:
- Lower edge of the box (Q1): approximately $65,800
- Median (Q2): $87,000
- Upper edge of the box (Q3): approximately $115,000
Identifying Outliers
- Points outside the whiskers in a box plot are considered outliers. These represent extreme values that deviate significantly from other observations in the dataset.
Handling Outliers Effectively
- Domain knowledge is crucial for deciding whether an observed value really is an outlier, and that decision comes before any action is taken on it.
- There are four main strategies for managing outliers:
- Trimming: removing them from the dataset.
- Capping: limiting their values to predefined bounds without removing them.
- Testing both with and without outliers to assess their impact on outcomes.
- Taking no action, if the context makes that appropriate.
Data Cleaning Techniques: Trimming and Capping Outliers
Understanding Data Trimming
- Trimming involves removing extreme data points from a dataset. This can lead to a smaller dataset, which may hinder exploratory data analysis and machine learning model implementation.
- A reduced dataset limits the ability to extract relevant information, potentially altering the distribution of the data, especially if it originally follows a normal distribution.
Implementing Data Trimming
- The process includes identifying outliers that fall below or above defined thresholds in datasets, such as height measurements.
- By comparing original and trimmed datasets, one can observe changes in size; for instance, reducing from 25,000 to 24,944 records after removing outliers based on standard deviation criteria.
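A hedged sketch of trimming based on the standard deviation criterion is shown below. It uses synthetic heights rather than the 25,000-record dataset from the video, so the record counts will differ, and `trim_by_sigma` is a hypothetical helper name.

```python
import random
from statistics import fmean, stdev


def trim_by_sigma(values, k=3.0):
    """Keep only the values within k sample standard deviations of the mean."""
    mean = fmean(values)
    spread = stdev(values)
    lower, upper = mean - k * spread, mean + k * spread
    return [v for v in values if lower <= v <= upper]


# Synthetic heights in cm, for illustration only.
rng = random.Random(0)
heights = [rng.gauss(172.0, 5.0) for _ in range(1000)]
heights += [132.0, 135.0, 198.0]

trimmed = trim_by_sigma(heights)
print(f"records before: {len(heights)}, after trimming: {len(trimmed)}")
```

Comparing the two lengths shows how many records the trimming step discards, which is the trade-off discussed above.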
Exploring Data Capping
- Capping (or clipping) involves adjusting extreme values to predefined limits instead of removing them entirely. This preserves the number of data points while changing their distribution.
- For the salary data, the interquartile range is used to define the lower and upper limits, capping salaries at $50k and $189k respectively.
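Capping can be sketched in a few lines. The example below is an illustration under stated assumptions: the salaries are hypothetical, so the limits differ from the $50k and $189k figures above, and `cap_by_iqr` is an invented name. In Pandas, `Series.clip(lower=..., upper=...)` performs the equivalent clipping step.

```python
from statistics import quantiles


def cap_by_iqr(values, k=1.5):
    """Clip each value to the range [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [min(max(v, lower), upper) for v in values]


# Hypothetical salaries in USD with one very high value.
salaries = [40_000, 50_000, 60_000, 70_000, 80_000,
            90_000, 100_000, 110_000, 240_000]

capped = cap_by_iqr(salaries)
print(f"records before: {len(salaries)}, after capping: {len(capped)}")
print(f"largest salary before: {max(salaries):,.0f}, after: {max(capped):,.0f}")
```

Unlike trimming, the number of records is unchanged; only the extreme value is pulled back to the upper limit.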
Visualizing Impact of Capping
- After applying capping techniques on salary data, visualizations like box plots illustrate how all values now fit within established boundaries compared to the original dataset with outliers.
Evaluating Outlier Management Strategies
- When real-world data contains outliers that are not errors, it is crucial to investigate the underlying reasons rather than simply removing them.
- An alternative approach is training predictive models both with and without outliers to assess their impact on model performance.
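To illustrate the idea of comparing models with and without outliers, here is a minimal sketch that fits a straight line by ordinary least squares on toy data. The data, the choice of a linear model, and the helper `fit_line` are all assumptions made for this example; the video does not specify a particular model.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_mean, y_mean = fmean(xs), fmean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


# Toy data: y = 2x + 1 everywhere except the last point, which is extreme.
xs = list(range(1, 11))
ys = [2 * x + 1 for x in xs[:-1]] + [100]

slope_with, intercept_with = fit_line(xs, ys)
slope_without, intercept_without = fit_line(xs[:-1], ys[:-1])

print(f"slope with the outlier:    {slope_with:.2f}")
print(f"slope without the outlier: {slope_without:.2f}")
```

A large gap between the two fits suggests the model is sensitive to the extreme point, which is useful evidence when deciding how to treat it.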
Understanding Anomalies in Banking Transactions
Importance of Anomalous Data
- In some applications, the anomalous data is precisely the relevant information. Most banking transactions fall within a normal range, and banks need systems that identify the suspicious ones that deviate from it.
Characteristics of Suspicious Transactions
- Unusually high transaction amounts are often the anomalies such systems must detect. These outliers can point to potentially fraudulent activity.
Handling Outliers in Data Science Projects
- In projects like this, it is essential to retain the extreme values (outliers), because they carry the information the project is looking for. Discarding them would mean losing crucial insights.
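One way to keep the extreme values while still marking them is to label transactions instead of deleting them. The sketch below is only an illustration: the $953 and $1923 amounts come from the notes above, the remaining amounts are invented, and the labels and the function `flag_transactions` are hypothetical. Real fraud detection systems are considerably more elaborate.

```python
from statistics import quantiles


def flag_transactions(amounts, k=1.5):
    """Label each amount as 'invalid', 'suspicious', or 'normal' without dropping any."""
    # Negative amounts are treated as recording errors and excluded from the quartiles.
    valid = [a for a in amounts if a >= 0]
    q1, _, q3 = quantiles(valid, n=4, method="inclusive")
    upper = q3 + k * (q3 - q1)

    labels = []
    for amount in amounts:
        if amount < 0:
            labels.append("invalid")
        elif amount > upper:
            labels.append("suspicious")
        else:
            labels.append("normal")
    return labels


# Mostly invented amounts in USD, plus the two high values cited in the notes.
amounts = [20, 35, 40, 55, 60, 75, 80, 95, 953, 1923, -50]
labels = flag_transactions(amounts)

for amount, label in zip(amounts, labels):
    print(f"{amount:>6}: {label}")
```

Every record is preserved, so the flagged transactions remain available for the investigation the project is meant to support.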
Conclusions on Outlier Management
- Domain expertise and a clear understanding of the problem at hand are crucial for effective outlier management, since they determine how these data points should be handled.
- Not all outliers are undesirable; their value depends on the specific application and intended use within data science or machine learning projects.
Techniques for Managing Outliers
- If one decides to manage outliers through techniques like removal or capping, it’s important to weigh the advantages and disadvantages each method offers during processing.
Additional Resources and Engagement
- A link will be provided in the video description for downloading relevant notebooks discussed. Viewers are encouraged to leave comments with questions or suggestions and share content if they find it helpful.