Statistics for Data Science Full Course | Probability and Statistics for Engineers | Great Learning
Introduction to Statistics for Data Science
In this video, the speaker introduces statistics as one of the most important concepts in data science and machine learning. The speaker explains how statistics is used in various fields such as weather forecasting, credit card fraud analysis, and stock market prediction. The agenda for the video is also presented.
Difference between Statistics and Machine Learning
- Statistical way of thinking involves formulating a problem and then getting data to solve it.
- Machine learning way of thinking involves having data and asking what that data is telling you.
- There is a difference in the way these two communities approach things.
Types of Statistics
- Descriptive statistics
- Prescriptive statistics
- Predictive statistics
Types of Data
- Categorical data
- Numerical data
- Discrete numerical data
- Continuous numerical data
Correlation Concepts
- Positive correlation
- Negative correlation
- No correlation
Probability Concepts
- Introduction to probability
- Sample space
- Events
- Probability axioms
- Conditional probability with Bayes' theorem
- Independence
- Random variables
- Discrete random variables
- Continuous random variables
Normal Distribution
Conclusion
The speaker concludes by stating that while there may be differences in how statisticians and machine learning experts approach problems, both are important in the field of data science. Additionally, he emphasizes that asking the right questions is crucial in this field where collecting quality data can be expensive.
Analyzing Sales Data
In this section, the speaker discusses how to analyze sales data when a company's business model is disappearing.
What Data to Look For
- Determine what data you want to see and what questions you want to ask.
- Look at sales year by year and make comparisons.
- Analyze region-wise age groups to understand which sections of customers are buying your product.
- Identify your biggest set of customers.
Understanding Sales Trends
- Determine which segments are experiencing the most significant decline in sales.
- Understand how fast sales are going down.
Using Sales Data
- Use the information gathered from analyzing sales data to draw conclusions about the state of the business.
- Follow a three-step process: gather data, analyze it, and use it to make informed decisions.
Descriptive, Predictive, and Prescriptive Problems
In this section, the speaker explains the differences between descriptive, predictive, and prescriptive problems.
Descriptive Problem
- A problem that describes an issue without providing a solution.
- Helps locate and isolate the problem.
Predictive Problem
- Uses data to predict what might happen if certain changes are made.
- Example: Relating sales to prices to understand how reducing prices may increase sales.
Prescriptive Problem
- Provides a solution or action based on data analysis.
- The solution must meet various requirements such as optimizing welfare while avoiding harm.
- Can be complex in business settings where multiple factors need consideration such as profits, earnings, labor force, finances etc.
Constraints in Prescription
In this section, the speaker discusses constraints in prescription problems.
Autonomous Vehicle Example
- An autonomous vehicle must follow rules of the road when avoiding obstacles.
- Example: If it sees someone crossing the road it should stop but not too suddenly to avoid hurting passengers or damaging the car.
- Constraints can be added to programs to ensure compliance with requirements.
Conclusion on Prescription Problems
In this section, the speaker concludes his discussion on prescription problems.
Challenges with Prescription Problems
- Prescription is problematic due to its complexity and requirement for meeting various needs simultaneously.
- It requires domain knowledge and modeling of data before translating into an action that meets all requirements.
Examples
- Examples include diagnosing pre-diabetic patients and issuing prescriptions for controlling blood sugar levels while postponing diabetes onset as much as possible.
- In business, prescription problems can include increasing sales while maintaining profits and earnings.
Introduction to Descriptive Analytics
The speaker introduces the concept of descriptive analytics and explains how it involves simply describing data without building any predictions or models.
What is Descriptive Analytics?
- Descriptive analytics involves describing data without building any predictions or models.
- It is a skillful task that requires knowing how to look at data and identify what is interesting.
- A doctor recommending a blood test based on symptoms is an example of a descriptive analytics problem.
- The challenge in descriptive analytics is identifying what information is relevant from the vast amount of available data.
Understanding Random Variables
The speaker explains how bodily fluids, such as blood, are random variables because they constantly change based on various factors.
Bodily Fluids as Random Variables
- Bodily fluids, such as blood, are random variables because they constantly change based on various factors like nutrient intake and energy requirements.
- This variability makes it challenging to identify relevant information from bodily fluid tests.
How Doctors Reach Conclusions
In this section, the speaker discusses how doctors reach conclusions about a patient's health based on blood samples and averages.
Blood Collection
- Blood samples collected from different parts of the body may not look the same even if taken at the same time, due to left-right asymmetry in the body.
- The heart is in the middle but beats to the left, causing blood to flow out from the left side and come back in on the right side. This can affect blood sample results.
- Doctors take a single sample of blood from patients and compare it to a threshold number to determine if there is an issue.
Averaging Over Time
- Doctors can average multiple blood samples taken over time to get a better understanding of a patient's health.
- Averaging neutralizes values and provides context for comparison with threshold numbers.
Conclusion Threshold
- Doctors set a threshold number for certain health indicators, such as blood sugar levels.
- If a patient's reading is below this threshold, no action is needed. If it is above, further testing or treatment may be necessary.
- The threshold number may not be as simple as initially thought and may require additional consideration based on other factors.
Introduction to Descriptive Analytics
The speaker introduces the concept of fuzzy logic and explains how it can be used to create uncertainty around data. They also discuss the importance of descriptive analytics in understanding data.
Fuzzy Logic and Uncertainty
- Fuzzy logic is a way of creating uncertainty around data.
- Uncertainty can be created by fiddling with the boundary or standard, or by adding a little plus/minus around the reading itself.
- Multiple readings can help determine if a number is varying very little or a lot.
Importance of Descriptive Analytics
- Descriptive analytics helps understand certain things about data that lead to conclusions more rigorously.
- Quantifying plus/minuses takes time and requires two instruments: one for knowing what to measure, and one for expressing it mathematically.
- Comparing one number to another is typically not helpful; instead, ranges or thresholds are used to account for variation.
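The idea of adding a plus/minus band around a threshold before drawing a conclusion can be sketched as follows; the threshold and margin values are hypothetical, chosen only for illustration:

```python
# Sketch: comparing a reading to a threshold with an uncertainty band,
# rather than a single hard cutoff. Both numbers below are hypothetical.
def classify(reading, threshold=126.0, margin=5.0):
    """Return 'high', 'normal', or 'borderline' for a reading."""
    if reading > threshold + margin:
        return "high"
    if reading < threshold - margin:
        return "normal"
    return "borderline"  # within the uncertainty band around the threshold

print(classify(140.0))  # high
print(classify(110.0))  # normal
print(classify(128.0))  # borderline
```

A "borderline" result is where multiple readings over time become useful, as the speaker suggests.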
Introduction to Probability
The speaker introduces probability as a language for expressing uncertainty in measurements.
Language of Probability
- Probability is used as a way to express statements like "I'm 95% confident that something is happening."
- Two sets of instruments are needed: one descriptive and one mathematical, so that mathematical statements can be put on top of descriptions.
- Medical tests often report intervals rather than single numbers, which capture the variation in a measurement.
Identifying Customer Characteristics for Treadmill Product
The market research team is tasked with identifying the profile of the typical customer for a free treadmill product offered by the company. They investigate whether there are differences across product lines with respect to customer characteristics.
Collecting Data on Treadmill Customers
- The team collects data on individuals who purchased a treadmill at a particular store during the past three months.
- They collect data on product purchased, gender, age in years, years of education, relationship status, annual household income, the average number of times the customer plans to use the treadmill each week, the average number of miles the customer expects to walk or run each week, and a self-rated fitness scale.
- Some of this data is opinion-based.
Understanding Product-Market Fit
In business, it's important to understand what people will buy and what you can make. This section discusses how to match between what you can make and what people will buy.
Product Market Fit
- Entrepreneurs often think about product-market fit when making something.
- It's important to match between what you can make and what people will buy.
- Isolating products or customers can help figure out what they tell us.
Differences Between Pandas and Numpy Libraries
This section discusses the differences between pandas and numpy libraries.
Pandas vs Numpy Libraries
- Pandas has more statistics built into it than numpy.
- Numpy was built more for mathematical problems than anything else.
- There are other plotting libraries, such as Matplotlib and Seaborn, that have been seen already.
- Python is still figuring out how to arrange these libraries well enough.
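A minimal sketch of the contrast, using a small made-up list of ages:

```python
import numpy as np
import pandas as pd

data = [18, 19, 20, 21, 22, 40]  # made-up ages

# numpy: array mathematics first
arr = np.array(data)
print(arr.mean(), arr.std(ddof=1))

# pandas: the same data with statistics-oriented conveniences built in
s = pd.Series(data)
print(s.describe())          # count, mean, std, min, quartiles, max in one call
print(s.median(), s.skew())  # statistical methods available on the Series itself
```

A plain numpy array has no `.median()` or `.skew()` methods, which is the sense in which pandas has "more statistics built in".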
Introduction to Data Variables
In this section, the speaker introduces the concept of data variables and explains how they are categorized.
Understanding Data Variables
- The speaker explains that data variables provide a sense of what variables are available and what kinds of variables they are.
- The speaker provides examples of numerical and categorical variables, such as income and gender.
- The software used to create a data frame determines whether a variable is numerical or categorical.
- The granularity at which data is stored in a data frame depends on the software used.
Challenges with Describing Data Variables
In this section, the speaker discusses some challenges associated with describing data variables.
Decimal Places in Data Frames
- The speaker asks why there are so many decimal places for certain numbers in the data frame.
- This occurs because the software stores the data to a certain number of digits based on its granularity requirements.
- When requesting a full description of the data, it may be displayed in an irritating way due to these decimal places.
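One way to tame those decimal places is to round the `describe()` report after the fact; the frame below is hypothetical, with column names that mimic the treadmill data:

```python
import pandas as pd

# Hypothetical frame; column names mimic the treadmill data set.
df = pd.DataFrame({"Age": [18, 19, 24, 26, 33, 50],
                   "Income": [29562, 31836, 30699, 32973, 35247, 104581]})

# Raw describe() can show many decimal places...
raw = df.describe()

# ...rounding makes the report readable without changing the stored data.
print(raw.round(2))
```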
Reporting Variable Descriptions
- The software reports out descriptions for all variables using the same syntax, regardless of their significance or type.
- For numerical variables like age, it calculates minimum, maximum, mean, standard deviation, and percentiles.
- For categorical variables like gender, it reports lots of NaN values since it cannot calculate meaningful statistics for them.
Representative Age Calculation
- To determine one representative age for a data set, the software calculates minimum and maximum values. However, this does not provide a complete picture of the data.
- A single representative age is difficult to determine without additional information.
Conclusion
In this transcript, the speaker introduces the concept of data variables and explains how they are categorized as numerical or categorical. They also discuss some challenges associated with describing data variables, including decimal places in data frames and reporting variable descriptions. Finally, they explain how calculating a single representative age for a data set can be difficult without additional information.
Designing for Weight
In this section, the speaker discusses the importance of knowing the weight that a product will bear and how to engineer for it.
Importance of Knowing Weight
- As a design engineer, it is important to know what weight will be on a product.
- When measuring a variable by one number, it is important to consider what makes sense.
- Over-engineering a product can lead to negative consequences.
Designing for Weight
- When designing products like mattresses or treadmills, it is important to consider how much weight they should bear.
- Over-engineering can lead to discomfort for users who weigh less than the designed weight limit.
- A five-point summary can be used to report out minimum, 25th percentile point, median (50th percentile), 75th percentile point, and maximum values.
- Distributions capture variability in data and are useful in understanding which value to use in different situations.
- The median represents the age of the average person while the mean represents an algebraic calculation based on all ages.
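The five-point summary above can be computed in one call; the ages here are made up for illustration:

```python
import numpy as np

ages = np.array([18, 19, 20, 22, 24, 25, 26, 29, 33, 38, 43, 50])  # made-up ages

# Five-point summary: minimum, Q1, median, Q3, maximum
summary = np.percentile(ages, [0, 25, 50, 75, 100])
print(summary)
```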
Understanding Mean and Median
In this section, the speaker explains the difference between mean and median. He also discusses how to calculate them and when to use each one.
Mean vs Median
- The speaker describes the median as the age of the average person and the mean as the average age of a person.
- The speaker admits that this can be confusing for some people.
- To understand it better, he suggests thinking of mean as adding up all ages and dividing by how many there are, while median is sorting ages from smallest to largest and picking off the middle one(s).
- If there are an even number of ages, take the average of the two middle ones.
- The choice between using mean or median depends on context.
Parameters
- When talking about parameters, mean and median are different parameters.
- The median is a parameter such that 50% of data falls on either side.
- The mean is what's called "the first moment" or center of gravity (CG) of data. It's where you would balance a plate if it were made out of metal.
Skewed Data
- If data is skewed to one side, then mean will move towards that side while median remains unchanged.
- When mean does not equal median, it indicates that left and right sides are not equal.
- A little more than median means some data has been pushed to right (right-skewed).
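The effect described above can be seen with two small made-up data sets, one symmetric and one with a value pushed to the right:

```python
import numpy as np

# Symmetric data: mean equals median.
sym = np.array([20, 25, 30, 35, 40])
print(sym.mean(), np.median(sym))        # 30.0 30.0

# Push one value far to the right: the mean moves, the median does not.
skewed = np.array([20, 25, 30, 35, 90])
print(skewed.mean(), np.median(skewed))  # 40.0 30.0
```

Here mean minus median is positive, matching the right-skew diagnostic discussed later.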
Sensitivity of Median to Outliers
In this section, the speaker explains why median is less sensitive to outliers than mean.
- The speaker says that one reason median often doesn't move is because it's not that sensitive to outliers.
- He gives an example of calculating mean and median income for a group of people. If someone with very high income joins the group, the mean will increase significantly while the median will remain relatively unchanged.
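The income example can be sketched with made-up numbers:

```python
import numpy as np

incomes = [40_000, 45_000, 50_000, 55_000, 60_000]  # made-up incomes
print(np.mean(incomes), np.median(incomes))  # 50000.0 50000.0

# A very high earner joins the group.
incomes.append(10_000_000)
print(np.mean(incomes))    # jumps past 1.7 million
print(np.median(incomes))  # moves only to 52500.0
```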
Typical Person's Movement
The speaker discusses how the typical person's movement is not likely to change.
Conclusion on Typical Person's Movement
- The typical person may move by at most half.
- The typical person is an actual individual in the room or maybe an average of two individuals in the room.
- That person is not going to change.
Histogram and Conclusion
The speaker talks about the histogram and draws a conclusion from it.
Middle of Data and Difference Between 18 and 26
- From a median perspective, the middle of the data is at 26.
- The difference between 26 and the smallest value (18) is eight years.
- This eight-year range contains 90 observations because there are 180 total.
Observations Between 26 and 50
- On the other side of the median (26), it takes 24 years, from 26 to 50, to contain the same number of observations (90).
Picture of Data Distribution
- A picture of this distribution would look like a right-skewed graph with more variation on its right side.
Skewness and Statistics Book
The speaker explains skewness, how it can be measured, and introduces a statistics book.
Skewness Definition and Variation on Sides
- Skewness is often measured using mean minus median. If it is positive, it usually corresponds to right skewness; if negative, left skewness.
- Skewed data sometimes causes difficulties in analysis because variation on one side means something different than variation from another side.
Statistics Book Introduction
- The speaker introduces a statistics book that includes statistical tables at the back.
- It's an excellent book for understanding the statistics side of things, but it's not a Python book.
Statistics and Computers
The speaker talks about how statistics is often done without computers.
Use of Statistics in Non-Computer Environments
- Many statistical methods are applied in settings where you either don't have access to computers, or you have them but not at runtime.
- A lot of statistics is done under that kind of situation, even probabilities.
Importance of Understanding Why You're Doing Something
The speaker emphasizes the importance of understanding why you're doing something and recommends a book for thinking about this. He also discusses the limitations of books and suggests using Google to supplement your learning.
Book Recommendation for Thinking
- The speaker recommends a book for thinking about why you're doing something.
- The book does not provide Python syntax, but it is well-written and helpful for conceptual understanding.
Using Google to Supplement Learning
- The speaker suggests using Google to supplement your learning.
- While books can be helpful, they may not provide all the information you need.
- Googling a topic together with a term like "Python" can help you find the code you need.
Managing Overwhelming Amounts of Material
The speaker warns that there will be a lot of material thrown at students during their program. He advises students to pick their battles and focus on what they want to learn in depth.
Dealing with Overwhelming Amounts of Material
- Students will encounter a lot of material during their program.
- Trying to learn everything in equal depth will take up too much time.
- Focus on what you want to learn in depth and pick your battles accordingly.
Understanding Statistics First
The speaker explains why statistics is taught first in the program, as it is easier from a computational perspective but harder from a conceptual perspective. He encourages students to hold onto this idea as they progress through the program.
Why Statistics is Taught First
- Statistics is taught first because it is easier from a computational perspective but harder from a conceptual perspective.
- Students should hold onto this idea as they progress through the program.
Summary of Five Numbers and Standard Deviation
The speaker provides a summary of the five numbers that help describe data, including minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum. He also explains what standard deviation is and how it is calculated.
Five Numbers for Describing Data
- The five numbers that help describe data are minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum.
- These numbers provide a graphical description of the data.
What is Standard Deviation?
- Standard deviation is a measure of how spread out a typical observation is from the mean.
- It is calculated by taking the distance from the average for every observation and averaging it.
- The formula for the sample standard deviation is std = sqrt(((x1 - x_bar)^2 + (x2 - x_bar)^2 + ... + (xn - x_bar)^2) / (n - 1)).
- n - 1 is used in the denominator instead of n because one degree of freedom is used up when the mean is estimated from the same data (Bessel's correction).
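The formula can be written out directly and checked against numpy, which uses the n - 1 denominator when `ddof=1`; the observations are made up:

```python
import math
import numpy as np

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up observations
n = len(x)
x_bar = sum(x) / n

# Sample standard deviation with the n-1 denominator, written out term by term
s = math.sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))

# numpy agrees when told to use the n-1 denominator via ddof=1
print(s, np.std(x, ddof=1))
```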
Mean Absolute Deviation vs Standard Deviation
In this section, the speaker explains the difference between mean absolute deviation and standard deviation.
Why do we square deviations?
- Squaring deviations allows us to look at both positive and negative deviations.
- If we didn't square it, positive and negative deviations would cancel out.
Mean Absolute Deviation
- Mean absolute deviation (MAD) is a measure of variability that uses absolute values instead of squares.
- MAD is represented as (|x1 - x̄| + ... + |xn - x̄|) / (n - 1).
- MAD is used in modern machine learning algorithms.
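A sketch of the computation, keeping the n - 1 denominator used in the bullet above (the more common convention divides by n); the observations are made up:

```python
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up observations
n = len(x)
x_bar = sum(x) / n

# Mean absolute deviation: absolute values instead of squares,
# so no single large deviation dominates the sum.
mad = sum(abs(xi - x_bar) for xi in x) / (n - 1)
print(mad)
```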
Gauss vs Laplace
- Gauss and Laplace argued about whether to use standard deviation or mean absolute deviation.
- Gauss won because he found it easier to do calculations using calculus.
- Today, Laplace's way of thinking is being used more and more because it is less sensitive to outliers.
Criticism of Standard Deviation
- The standard deviation can be criticized for being too sensitive to outliers.
- Nassim Taleb criticizes the standard deviation as a measure of anything in his books "The Black Swan" and "Fooled by Randomness".
Measures of Variability
The speaker explains the logic behind variability and how it affects observations.
Variability and Observations
- The average observation is not always equal to the actual average.
- Variability measures how far an observation is from the average.
- Interquartile range and range are two measures of variability.
- The interquartile range is the difference between the upper quartile and lower quartile, while the range is the difference between maximum and minimum values.
- These measures can be used to draw conclusions about data, such as age or income.
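Both measures can be computed from the percentiles; the ages are made up for illustration:

```python
import numpy as np

ages = np.array([18, 19, 20, 22, 24, 25, 26, 29, 33, 38, 43, 50])  # made-up ages

q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1                          # upper quartile minus lower quartile
data_range = ages.max() - ages.min()   # maximum minus minimum
print(iqr, data_range)
```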
Five Point Summary
- The five point summary includes minimum value, lower quartile, median, upper quartile, and maximum value.
- This summary provides a measure of location (median) and dispersion (interquartile range and range).
- Mental conclusions can be drawn from these five numbers to understand data better.
Conclusion
- Descriptive statistics provide useful insights into data for various applications.
Description and Information
In this section, the speaker explains the difference between description and information in the context of data. He also discusses how metadata is used to store additional information about data.
Data Description vs Information
- Information and description are two different things when it comes to data.
- The word "data" means different things to different people, but for IT professionals, it usually refers to bytes or information.
- Information tells you about the data object, such as its variable setting or integer field.
- A data object summary provides a non-statistical summary of what is in a data frame, including numeric and categorical variables and non-null counts.
Metadata
- When storing real data, a data dictionary or metadata is often stored alongside it to provide additional information about the data.
- Archival data should never be a mixture of numerical and categorical objects according to many professional organizations.
- However, sometimes archiving both types of objects together can be convenient despite the added complexity.
Histogram Plotting
In this section, the speaker discusses histogram plotting using Matplotlib library.
Histogram Plotting with Matplotlib Library
- The speaker uses Matplotlib library for histogram plotting.
- Histogram syntax includes bin sizes and figure sizes that can be adjusted for customization.
- A histogram distribution shows a set of bins with counts within each bin based on a range of values.
- Histogram programs require some artistry in determining bin sizes and plotting shapes.
- Writing a histogram code is a good challenge for testing one's understanding of data, programming language, and visualization.
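Before any drawing, the binning itself can be computed with numpy; Matplotlib's `hist` produces the same counts from the same bins. The ages are made up:

```python
import numpy as np

ages = np.array([18, 19, 20, 22, 24, 25, 26, 29, 33, 38, 43, 50])  # made-up ages

# Choosing the bins is the "artistry": four equal-width bins spanning 18-50.
counts, edges = np.histogram(ages, bins=4)
print(counts)  # observations per bin: [6 3 1 2]
print(edges)   # bin boundaries: [18. 26. 34. 42. 50.]
```

Changing `bins` reshapes the histogram without changing the data, which is the sensitivity the speaker warns about.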
Storing Data in Excel
In this section, the speaker discusses how to store data in Excel and what metadata needs to be stored with it.
Storing Data in Excel
- The data should be stored along with metadata that describes the variables.
- One possibility is a text file that describes the data, accompanied by a data file and a separate file that describes the variables.
- It is important to store what kind of computational object each variable is so that code can run on it.
- The metadata should also include information about whether the variable is discrete or continuous, as this affects which algorithms can be used on it.
Converting Continuous Data into Categories
- To ensure that every algorithm knows how each variable will be stored, some companies convert all variables into categories.
- This involves dividing continuous data into deciles and storing each variable as a number from 1 to 10.
- This allows algorithms to be written assuming that every variable will be stored in this way, making them more efficient.
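The decile conversion can be sketched with pandas' `qcut`; the income values are randomly generated for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
income = pd.Series(rng.normal(50_000, 10_000, size=1_000))  # made-up incomes

# Convert a continuous variable into decile labels 1..10.
deciles = pd.qcut(income, q=10, labels=range(1, 11)).astype(int)
print(deciles.value_counts().sort_index())  # 100 observations per decile
```

Every variable stored this way is guaranteed to be an integer from 1 to 10, which is what lets downstream algorithms assume a uniform storage format.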
Balancing Efficiency and Accuracy
- Professional analysts often have to balance doing the right thing well versus doing the wrong thing quickly when working with new data sets.
Auditing and Risk Decisions
The importance of clear decision-making in the face of audits is discussed, as well as the difference between making the best risk decision versus the most obvious one.
Auditing and Decision-Making
- Regulatory agencies such as the Reserve Bank of India may audit companies to review their data.
- Clear decision-making is important during audits, even if it means making an obvious risk decision rather than the best one.
- Making an obvious risk decision may not always be the same as making the best one.
Problem-Solving with Limited Resources
The challenges of problem-solving with limited resources are discussed, including how to ensure continuity when dealing with a team that may have high turnover rates.
Problem-Solving with Limited Resources
- When solving problems, assume you have an infinitely smart client and an infinitely fast computer.
- However, in reality, resources are limited and continuity must be ensured for future teams.
- Keeping things simple and obvious can help ensure continuity for future teams.
Histogram Command Summaries
Histogram command summaries are explained, including how they provide insight into variable distribution.
Histogram Command Summaries
- Histogram command summaries provide insight into variable distribution.
- Most variables tend to have a right skew when they do have a skew.
- Education may have a slight left skew where few people are highly educated and most people fall within a certain range.
Box Plots
Box plots are explained, including how they provide insight into the distribution of data.
Box Plots
- A box plot is a visual representation of data that shows the median, upper and lower quartiles, and whiskers.
- By default, each whisker extends up to 1.5 times the interquartile range beyond the box.
Understanding Box Plots
In this section, the speaker explains what a whisker is and how it relates to box plots. They also discuss outliers and how they are defined in box plots.
Whiskers and Outliers
- The top of the whisker marks the largest data point within the whisker range, and the bottom marks the smallest; if no points lie beyond the whiskers, these coincide with the maximum and minimum.
- If there are no other points on the box plot besides the box and whiskers, your five-point summary is sitting right there.
- An outlier is a point that lies more than 1.5 times the interquartile range beyond the box, above Q3 or below Q1. Any such points are shown as individual dots beyond the whiskers.
Changing Parameters in Box Plots
- You can change parameters such as the size of the whisker by going to box plot syntax and changing it from its default value of 1.5.
- The colors in a box plot can be changed depending on what you want to show (e.g., male vs female).
Components of a Box Plot
- The lower part of the box represents Q1, while the upper part represents Q3. The middle line is median or Q2.
- A mean can be added as a dot using a different function but it's not part of standard five-point summary calculation used in creating a box plot.
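The whisker fences and outlier rule can be computed by hand; the data are made up, with one deliberately extreme value:

```python
import numpy as np

data = np.array([18, 19, 20, 22, 24, 25, 26, 29, 33, 38, 43, 90])  # made-up data

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
whis = 1.5  # the default whisker multiplier, adjustable in box plot syntax

# Whiskers reach the most extreme points within 1.5*IQR of the box;
# anything beyond these fences is drawn as an outlier dot.
upper_fence = q3 + whis * iqr
lower_fence = q1 - whis * iqr
outliers = data[(data > upper_fence) | (data < lower_fence)]
print(outliers)  # [90]
```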
Summarizing Categorical Data
In this section, we learn about summarizing categorical data using cross-tabulation.
Cross-tabulation for Categorical Variables
- Cross-tabulation is used to summarize categorical data.
- A cross-tabulation table shows how many products are there in each product category.
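A minimal cross-tabulation with `pd.crosstab`; the product codes and rows below are hypothetical, merely styled after a treadmill data set:

```python
import pandas as pd

# Hypothetical purchases; product codes are placeholders.
df = pd.DataFrame({
    "Product": ["TM195", "TM195", "TM498", "TM798", "TM498", "TM195"],
    "Gender":  ["Male", "Female", "Male", "Male", "Female", "Female"],
})

# Counts of each product, split by gender.
table = pd.crosstab(df["Product"], df["Gender"])
print(table)
```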
Categorizing Preferences and Analyzing Data
In this section, the speaker talks about how to categorize preferences and analyze data using statistics.
Categorizing Preferences
- The speaker explains that the data presented is descriptive and simply tells you the data as it is.
- To do more analysis on the data, one needs to reach a conclusion based on it. For example, one can ask if men and women have the same preferences when it comes to fitness products they use.
- To answer such questions, one needs to find a statistic that measures the difference between men and women's preferences. This is called a chi-square test.
Visualizing Data
- The speaker shows how counts can be used instead of tables to visualize data. Count plots and bar plots are useful for this purpose.
- A pair plot of a data frame is another popular way of visualizing data. It requires little thinking as it simply figures out a way to plot variables pair by pair.
Pivot Tables
- The speaker mentions pivot tables in Excel as an example of tools that have been implemented in many software programs.
- The pivot table version of the same dataset is shown.
Univariate Analysis
- The speaker explains that univariate analysis means looking at variables one at a time. Histogram plots are an example of univariate analysis.
Overall, this section covers how to categorize preferences and analyze data using statistics, as well as different ways of visualizing data such as count plots, bar plots, pair plots, and pivot tables.
Understanding Pair Plots
In this section, the speaker explains how pair plots work and what happens when objects are present in the data frame.
Pair Plot Command
- The pair plot command remembers the nature of the graph.
- If there is gender in the data, it will be plotted.
- When using pair plot on data, gender is included.
Object Identification
- Product, gender, and marital status were identified as objects in the data frame.
- The pair plot command ignores objects and only plots numerics or integers.
Histogram
- The histogram is a visual tool to see shape and count of data.
- Changing bin width can change histogram shape.
X vs Y Plots
- Age vs education is an example of an x vs y plot.
- Mirror images occur when switching x and y axes on a plot.
Understanding Ordinal and Categorical Variables
In this section, the speaker explains the difference between nominal, ordinal, and categorical variables. They also discuss how some variables can be treated as both ordinal and categorical.
Defining Fitness Variable
- The variable "fitness" has very few numbers in it (1-5).
- This variable was created based on the participant's perception of their fitness level.
- Fitness is an example of an ordinal variable because there is a sense of order to it.
Treating Variables as Ordinal or Categorical
- Nominal variables are essentially names (e.g. gender, place), while ordinal variables have a sense of order (e.g. dissatisfied, very dissatisfied).
- The fitness variable can be treated as either an ordinal or categorical variable.
- Sometimes data that looks like a category will be recognized by software as a number because it was entered that way.
- To analyze data that looks like a category but is recognized as a number, change it to a character.
Special Cases: Zip Codes
- Zip codes are an example of a categorical variable that shows up as a number in databases.
- You cannot do arithmetic with zip codes because they are not numerical values.
- As your dataset grows, the number of zip codes also grows which makes it difficult to state how many categories there will be present.
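The conversion the speaker describes can be sketched in pandas; the fitness scores and zip codes are made up:

```python
import pandas as pd

# Made-up data: both columns are stored as integers by default.
df = pd.DataFrame({"Fitness": [3, 5, 2, 4, 3],
                   "ZipCode": [560001, 400001, 560001, 110001, 400001]})
print(df.dtypes)

# Treat them as categories by converting to category/character types,
# so software counts them instead of averaging them.
df["Fitness"] = df["Fitness"].astype("category")
df["ZipCode"] = df["ZipCode"].astype(str)

print(df["ZipCode"].value_counts())  # counted, not summed or averaged
```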
Plotting Data
In this section, the speaker discusses plotting data and how to make sure certain variables are plotted correctly.
Graphical Representation
- If software recognizes data as a number, it will plot it accordingly.
- If you don't want data to be plotted as a number, change it to a character.
Extracting Mean and Standard Deviation
- You can extract mean and standard deviation for any given variable using Python.
- The standard deviation formula is shown on the board.
Interesting Plot
- There is an interesting plot in Seaborn, but the speaker does not go into detail about it.
Understanding Distributions
In this section, the speaker explains the concept of distributions and how they are used to analyze data.
What is a distribution?
- A distribution refers to the underlying pattern or shape of a set of data.
- The goal is to understand the distribution of the data rather than just looking at individual data points.
- The speaker uses an example of drawing lines on a graph to represent distributions.
Why is raw data not enough?
- Raw data only provides information about a specific group of people at a specific point in time.
- To make predictions about future customers, we need to assume that there is an underlying population with a certain distribution.
- The speaker explains that we need to use mathematical logic to make conclusions about future customers based on past customers.
How do we calculate distributions?
- We can't directly observe the distribution, but we can estimate it using a sample from that population.
- The speaker explains that he will discuss specific distributions in more detail in later sections.
- He also briefly mentions that one way to estimate the distribution is by taking averages of points.
Sample vs Population
In this section, the speaker discusses why samples are used instead of populations when analyzing data.
Why do we use samples instead of populations?
- It's often impractical or impossible to collect data from an entire population.
- Instead, we take a sample from that population and use it as an estimate for the entire population.
- The speaker gives an example of taking blood pressure readings over time and using those readings as estimates for future blood pressure readings.
What is statistical inference?
- Statistical inference refers to making conclusions about a population based on information from a sample.
- The goal is to minimize uncertainty and error when making these conclusions.
Predicting Future Outcomes
In this section, the speaker explains how distributions and statistical inference can be used to make predictions about future outcomes.
How can we predict future outcomes?
- By understanding the distribution of a population, we can make predictions about future outcomes.
- The speaker gives an example of predicting blood sugar levels based on past readings and lifestyle factors.
- Statistical inference allows us to make these predictions with some degree of certainty.
Why is it important to analyze data?
- Analyzing data allows us to make informed decisions and predictions about future outcomes.
- The speaker emphasizes that analyzing data is crucial for businesses that want to grow and attract new customers.
Understanding Distributions
In this section, the speaker explains the concept of distributions and how they can be used to make predictions.
The Importance of Distributions
- Eight readings can provide insight into what will happen next month.
- A distribution abstracts away from the data's random and systematic parts.
- Statistical inference is at the heart of statistics.
Creating a Distribution Plot
- A distribution plot estimates an underlying true distribution.
- Gaussian kernel density estimate is a sophisticated way of representing a distribution.
- Histogram bins allow for customization of boundaries.
- The plot has many functions available for customization.
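The Gaussian kernel density estimate can be sketched by hand with NumPy; the simulated sample and the Silverman rule-of-thumb bandwidth below are illustrative assumptions, not the lecture's actual data:

```python
import numpy as np

# Simulated sample standing in for observed data.
rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=15, size=200)

# Silverman's rule-of-thumb bandwidth for a Gaussian kernel.
h = 1.06 * sample.std(ddof=1) * len(sample) ** (-1 / 5)

def kde(x):
    # Average of Gaussian bumps centred at each data point.
    z = (x[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(sample) * h * np.sqrt(2 * np.pi))

# The estimated density should integrate to roughly 1.
xs = np.linspace(20, 180, 2000)
area = kde(xs).sum() * (xs[1] - xs[0])
print(round(area, 2))
```

This is essentially what Seaborn's distribution plot draws as its smooth curve.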
Comparing Distributions
- If the reasoning is correct, the distributions of the old and new data should look similar when compared.
Understanding Blood Sample Variability
In this section, the speaker explains how taking a small sample of blood can lead to variability in readings and why it is important to take samples in different situations.
Importance of Covering Range of Possibilities
- To understand the sales distribution, cover the range of possibilities when sampling.
- Not covering the range of possibilities leads to an incomplete understanding and an inability to predict or prescribe.
Definition of Distribution Function
- The distribution function is the probability that the random variable X is less than or equal to a value x: F(x) = P(X ≤ x).
- The density function is the derivative of the distribution function.
- The area under the density curve up to x gives back the distribution function F(x).
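The derivative and area-under-the-curve relationships can be checked numerically for the standard normal distribution (a small sketch, not from the lecture):

```python
import numpy as np
from math import erf, sqrt

# Standard normal: distribution function F(x) = P(X <= x) and density f(x).
def F(x):
    return 0.5 * (1 + erf(x / sqrt(2)))

def f(x):
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

# The density is the derivative of the distribution function...
h = 1e-6
deriv = (F(1 + h) - F(1 - h)) / (2 * h)
assert abs(deriv - f(1.0)) < 1e-4

# ...and the area under the density up to x recovers F(x).
xs = np.linspace(-10, 1, 200001)
area = f(xs).sum() * (xs[1] - xs[0])
assert abs(area - F(1)) < 1e-3
```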
Analyzing Data and Reaching Conclusions
- Drawing conclusions outside your data is difficult but necessary.
- Analyzing data helps organizations make informed decisions about their portfolios, sales, etc.
Managing Finances
In this section, the speaker discusses how to manage finances based on a sense of distribution and experience.
Negotiating Salary
- Negotiate for a salary based on a sense of distribution.
- Figure out how much money you need and how much money you are expecting based on your expenditure.
- Spend money based on your income and expenditure.
Making Decisions Based on Experience
- Use past data and experience to make decisions.
- Make assumptions based on the data set available, even if it is incomplete.
- Use logic to make decisions, even if it goes against established rules.
Translating Logic into Algorithms
- Translate logic into an algorithm that can be understood by computers.
- Estimate the mean age of customers using population distribution.
- Understand the relationship between new customer data and existing data.
Estimating Population Mean
- Estimate the population mean around 28.7 plus or minus something.
- Acknowledge that there is uncertainty in estimating population mean.
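That "plus or minus something" can be made concrete with a rough 95% interval for the population mean; the customer ages below are hypothetical, chosen only so the sample mean comes out to 28.7:

```python
import numpy as np

# Hypothetical customer ages, chosen so the sample mean is 28.7.
ages = np.array([25, 31, 27, 29, 35, 24, 30, 28, 26, 32])

n = len(ages)
mean = ages.mean()
se = ages.std(ddof=1) / np.sqrt(n)  # standard error of the mean

# A rough 95% interval: mean plus or minus about two standard errors.
low, high = mean - 1.96 * se, mean + 1.96 * se
print(round(mean, 1), round(low, 1), round(high, 1))
```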
The speaker talks about managing finances, making decisions based on experience, translating logic into algorithms, and estimating population mean using past data and experience. The notes provide insights into negotiating salaries, making assumptions based on incomplete data sets, translating logic into algorithms that can be understood by computers, understanding the relationship between new customer data and existing data, and estimating population mean with uncertainty.
Descriptive Analytics
The speaker discusses how the amount of data affects the certainty of repeatability and prediction. They also explain how descriptive analytics is used to measure a population and its distribution.
Data Averaging and Certainty
- The more data that is being averaged, the surer one can be about repeatability.
- The level of certainty depends on how sure one wants to be.
- Giving a range of values allows for prediction.
Descriptive Analytics
- Today's discussion focuses on descriptive analytics, not prediction.
- The shape of a curve indicates variation in data.
- More data is needed when there is greater variation in the middle.
Measuring Highly Variable Data
- When measuring highly variable data, it's important to take multiple measurements under different circumstances.
- Instructions can be given to cover all possible scenarios or measurements can be taken frequently over time.
Tracking Changes in Distribution
- When there is a change in distribution, more data should be collected to better understand it.
- In business situations, tracking market trends closely after introducing a new product helps measure its success.
Overall, this section covers the importance of collecting enough data to ensure accuracy when measuring highly variable populations. It also emphasizes the use of descriptive analytics as an initial step towards understanding distributions.
The Challenge of Utilizing Big Data
In this section, the speaker discusses how big data is changing the way we approach problem-solving and decision-making.
The Importance of Experience with Data
- As more data becomes available, experience with understanding and utilizing that data becomes increasingly important.
- With new data comes new problems to solve, rather than simply improving upon existing solutions.
- This presents a challenge for statisticians and big data professionals.
Making Efficient Use of Information
- Utilizing large amounts of data requires making efficient use of that information.
- When analyzing sentiment analysis or other changing opinions, it's important to consider the granularity at which opinions change in order to make accurate estimates.
- It can be difficult to determine whether a changing thing or a solid thing is being estimated.
Balancing Variability and Precision
- There are times when variability is acceptable, while other times require precise estimation.
- Precise estimation may be necessary for targeted advertising or product development.
Overall, the speaker emphasizes the importance of experience with utilizing big data and making efficient use of information while balancing variability and precision.
Assumptions in Statistics
In this section, the speaker discusses how statisticians make assumptions about their data and why they do so. They also explain that making assumptions can make calculations easier, but it does not guarantee accuracy.
Making Assumptions
- Statisticians need to make assumptions about their data.
- One reason for making assumptions is to simplify calculations.
- Many tests and procedures in statistics rely on certain assumptions.
Distribution Assumptions
- The assumption of a normal distribution is common in statistics.
- However, some industries have their own preferred distributions based on the shape of their data (e.g., Weibull distribution for engineering).
- Violations of these rules can lead to complexities and require more powerful tools.
Risks of Making Assumptions
- Making an assumption involves taking a risk, as it may not be correct for every case.
- Precedents are important when extending models beyond specific cases.
- Historically successful assumptions are often preferred over changing them for particular cases.
Regulatory Constraints
- Certain industries have regulatory constraints that limit the ways in which they can measure or report data (e.g., accountants measuring cash flow).
- Deviating from standard practices can cause problems outside of those constraints.
Introduction to Probably Approximately Correct (PAC) Learning
In this section, the speaker introduces the concept of Probably Approximately Correct (PAC) learning and explains how it combines statistical thinking with machine learning thinking.
PAC Learning
- PAC stands for "Probably Approximately Correct".
- PAC learning is a field that puts a probabilistic statement or an approximation on machine learning.
- The probabilistic part comes from statistical thinking, while the approximately part comes from machine learning thinking.
- At the end of the day, whatever method you use in PAC learning, there has to be a sense of how generalizable it is.
Hackathons and Model Evaluation
In this section, the speaker discusses hackathons and model evaluation in PAC learning.
Hackathons
- A common format in a hackathon is building your model on one data set and having it tested on another data set.
- In hackathons, being very good on your data set doesn't necessarily mean you are successful. You have to be good on other data sets as well.
Model Evaluation
- Model evaluation involves showing improvement on someone else's data set.
- Being very good on your own data set doesn't necessarily mean you are successful; you have to be good on other data sets as well.
Mean vs Median
In this section, the speaker discusses the difference between mean and median.
Mean vs Median
- If the mean equals the median, the distribution is balanced around its centre.
- If the distribution is nice and symmetric, the unknown centre can be estimated using either the mean or the median.
- Which is better depends on whether there are many outliers or whether the distribution has a more bell-shaped curve.
- Whether to use per capita income (a mean) or the income of a typical Indian (a median) depends on the question you are asking.
- The amount of time spent on a website varies depending on different websites and purposes.
Mean, Median and Mode
In this section, the speaker explains the differences between mean, median and mode in statistics.
Mean vs Median
- The mean is calculated by adding up all the values and dividing by the number of values.
- The median is the middle value when all values are sorted in order.
- Browsing habits can be used as an example to explain how mean and median can carry different information.
Heavy Tail Distribution
- A heavy tail distribution is a distribution where there are many small numbers and a few big numbers.
- Network traffic is an example of a typical heavy tail distribution.
Mode
- The mode is the most common value in a set of data.
- For continuous data, the mode is harder to calculate than the mean or median because it requires binning the data and finding the highest-frequency (most common) value.
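A minimal sketch of finding the mode of categorical data (the `devices` values are hypothetical):

```python
from collections import Counter

# Hypothetical categorical data: which device each visitor used.
devices = ["mobile", "desktop", "mobile", "tablet", "mobile", "desktop"]
mode = Counter(devices).most_common(1)[0][0]
print(mode)  # mobile

# For continuous data there is no such direct count: the mode depends on
# the chosen bin width, so different analysts can get different answers.
```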
Numeric vs Categorical Data
In this section, the speaker discusses the difference between numeric and categorical data and how it affects calculating the mode.
Mode Calculation in Pre-Computer Era
- The mode was easy to calculate by hand in the pre-computer era.
- However, for continuous data it is not easy to calculate on a computer, because a bin width must be chosen to estimate the mode.
- Because the answer depends on these choices, two different people may not arrive at the same mode.
Descriptive Analysis with Histograms
- Histograms can be used to compare distributions of different variables such as gender or device usage.
- Differences in shape and values can be observed between male and female distributions.
- Statistical differences between product usage can also be compared using histograms.
Bivariate Analysis
In this section, the speaker introduces bivariate analysis and explains how it differs from univariate analysis.
Location and Variation
- Univariate analysis focuses on location (mean or median) and variation (standard deviation, range, interquartile range).
- Bivariate analysis adds a sense of relation or correlation between two variables.
Measuring Correlation
- A number is needed to describe correlation between two variables.
- There are many ways of defining what that number should be.
Understanding Covariance
In this section, the speaker explains the concept of covariance and how it is used to measure the relationship between two variables.
Definition of Covariance
- Covariance is a measure of the degree to which two variables change together.
- It is calculated by taking the sum of the product of each variable's deviation from its mean.
- Covariance can be positive, negative or zero.
Interpreting Positive and Negative Covariance
- Positive covariance indicates that both variables move in the same direction.
- Negative covariance indicates that both variables move in opposite directions.
Applications of Covariance
- Covariances are heavily used in certain areas such as dimension reduction, principal components, finance, and portfolio management.
- They are used to determine whether two variables are moving together or in opposite directions.
Overall, this section provides an introduction to covariance and its applications. The speaker explains how covariance can be used to measure the relationship between two variables and how it can be interpreted based on its sign.
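The sum-of-deviation-products calculation can be sketched with NumPy; the `usage` and `miles` numbers below are invented for illustration:

```python
import numpy as np

# Invented paired data: machine usage (sessions/week) and miles run.
usage = np.array([2, 3, 4, 5, 6])
miles = np.array([40, 55, 65, 80, 95])

# Sample covariance: sum of products of deviations from the means, over n - 1.
cov = ((usage - usage.mean()) * (miles - miles.mean())).sum() / (len(usage) - 1)
print(cov)  # positive, so the variables move in the same direction

# np.cov gives the full covariance matrix; the off-diagonal entry matches.
assert np.isclose(cov, np.cov(usage, miles)[0, 1])
```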
Variance and Covariance
In this section, the speaker explains the concepts of variance and covariance.
Variance
- The variance is the square of the standard deviation of x.
- The quantity before taking the square root in the standard deviation formula is called the variance; the formula is standard and appears in any statistics text.
Covariance
- The covariance is a measure of the nature of the relationship between x and y.
- If the covariance is positive, they're moving in the same direction. If it's negative, they're moving in opposite directions.
- If it's zero, many things can happen: either there is no relationship, or the relationship is non-linear and covariance simply fails to capture it.
- For example, price and profit have a theoretical relationship where as price goes up, profit increases because you're getting more money per product. But with even higher prices, fewer people buy your product so your profit goes down.
Relationship Between Experience and Distance to Home
In this section, the speaker talks about an analysis done on attrition rates at a company and how distance from home relates to experience.
Analysis on Attrition Rates
- An analysis was done on attrition rates at a company to understand why people leave companies.
- A model was created to find out if there was a relationship between experience and distance from home.
Relationship Between Experience and Distance from Home
- Early on in their careers, people live close by. In middle age, they move away. Towards the end, they become closer again.
- The story made up was that in the beginning, people have low dependencies and are typically unmarried bachelors who are ready to work a lot harder. Staying close by is convenient. But as you reach middle age, things become complicated with spouses, kids, schools, and houses that you can afford. This solves a more complicated optimization problem and staying close to work may not be possible. People who survive longer in the company earn enough to solve this problem through other means and then move back closer to work again when there are fewer dependencies.
Normalizing Units
In this section, the speaker explains how statisticians normalize units to make them comparable.
Normalizing Units
- The value of a variable is dependent on its unit of measurement.
- Statisticians normalize units by dividing each variable by its standard deviation.
- The resulting normalized covariance is called the correlation between two variables and ranges from -1 to 1.
- A correlation close to +1 indicates a strong positive relationship between two variables, while a correlation close to -1 indicates a strong negative relationship.
- Correlation only measures linear relationships and does not imply causation.
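Scaling the covariance by both standard deviations gives the correlation; a quick sketch with invented numbers:

```python
import numpy as np

# Invented paired data for illustration.
usage = np.array([2, 3, 4, 5, 6])
miles = np.array([40, 55, 65, 80, 95])

# Correlation: covariance divided by both standard deviations (unit-free).
cov = ((usage - usage.mean()) * (miles - miles.mean())).mean()
corr = cov / (usage.std() * miles.std())
print(round(corr, 3))  # close to +1: strong positive linear relationship

assert np.isclose(corr, np.corrcoef(usage, miles)[0, 1])
```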
Understanding Correlation
In this section, the speaker explains what correlation means and how it can be interpreted.
Correlation and Relationship between Variables
- Correlation close to zero indicates low correlation between variables.
- Variables with low correlation are useful for clustering and segmentation.
- There is no negative correlation in the given dataset.
- Rich old people can be an interesting segment.
Interpreting Correlation
- Zero correlation means there is no relationship between variables.
- Correlations are useful summaries for large data but notoriously hard to interpret.
- The relationship between two variables may or may not be linear.
- Height and weight have a linear relationship, but removing outliers may change that.
- Body mass index (BMI), which is weight divided by height squared, reflects the roughly parabolic relationship between weight and height in the human body.
Understanding BMI
In this section, the speaker explains what BMI is and how it relates to height and weight.
What is BMI?
- BMI stands for body mass index.
- BMI is calculated as weight divided by height squared.
- A healthy person's BMI should be around 25.
Relationship Between Height, Weight, and BMI
- If you are taller, your weight should increase in proportion to your height squared for your BMI to stay constant.
- Weight divided by height squared remains roughly constant across such bodies.
Understanding BMI and Dimension Reduction
In this section, the speaker explains how BMI is calculated based on height and weight, and how it relates to the shape of objects like cylinders and footballs. The concept of dimension reduction is also introduced.
Calculating BMI
- BMI is calculated by dividing weight by height squared.
- For a ball, the BMI should be calculated by weight divided by height cubed.
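A one-line sketch of the BMI calculation (the example weight and height are arbitrary):

```python
# BMI = weight (kg) divided by height (m) squared.
def bmi(weight_kg, height_m):
    return weight_kg / height_m ** 2

print(round(bmi(70, 1.75), 1))  # 22.9

# If a sphere doubled in height its weight would grow 8x (height cubed),
# so a "sphere BMI" would divide by height ** 3 instead.
```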
Shape Comparison
- If a football's height doubles, its volume increases by a factor of 8.
- The relationship between height and volume for a football is non-linear.
- If humans grew like footballs, they would be much fatter than they are now.
- Empirically, the relationship between height and weight for humans has been found to be best represented by height squared rather than height cubed, which is why BMI divides by height squared.
Dimension Reduction
- Using two variables to create one that carries information is called dimension reduction.
- A heat map is an example of dimension reduction that uses colors to represent correlations between variables.
- Heat maps are useful when dealing with large sets of variables or data.
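A sketch of the correlation matrix behind a heat map, using pandas with an invented three-variable data set:

```python
import pandas as pd

# Invented data; corr() builds the pairwise correlation matrix
# that a heat map colours in.
df = pd.DataFrame({
    "usage":   [2, 3, 4, 5, 6],
    "fitness": [1, 2, 2, 4, 5],
    "miles":   [40, 55, 65, 80, 95],
})
corr_matrix = df.corr()
print(corr_matrix.round(2))

# To draw the heat map: import seaborn as sns; sns.heatmap(corr_matrix, annot=True)
```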
Usage and Fitness Variables
In this section, the speaker discusses how to use usage and fitness variables to predict the number of miles an instrument will run.
Linear Regression Model
- The speaker presents a targeted equation: miles = -5.75 + 20(usage) + 27(fitness).
- This equation is a linear regression model that goes from descriptive to predictive.
- There are three uses for linear regression: descriptive, predictive, and prescriptive.
Describing Relationships Between Variables
- The speaker wants to fit an equation for the number of miles an instrument will run based on self-rated fitness and frequency of use.
- Linear regression can be used with any number of variables to describe the relationship between them.
- Linear regression has three uses: descriptive, predictive, and prescriptive.
Predictive Use of Linear Regression
- The predictive use of linear regression involves using the model to predict a new value of y based on new values of x1 and x2.
- Prescriptive use involves changing x1 and x2 in order to get a different targeted y.
Using a Correlation Matrix for Regression
The speaker discusses using a correlation matrix to perform regression analysis.
Using a Correlation Matrix
- A 3x3 correlation matrix can be used for regression analysis.
- The speaker mentions the scikit-learn module and the linear model regression function.
- Regression coefficients and intercept are calculated using the linear model function.
- Interpretation of regression coefficients is discussed, including how changes in usage and fitness affect miles walked.
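A hedged sketch of computing an intercept and regression coefficients: NumPy's least-squares solver stands in here, with the equivalent scikit-learn calls shown in a comment; the data values are invented, not the lecture's:

```python
import numpy as np

# Invented data: predict miles from usage and fitness.
usage   = np.array([2, 3, 4, 5, 6, 3, 4])
fitness = np.array([1, 2, 2, 4, 5, 3, 4])
miles   = np.array([40, 62, 75, 120, 150, 85, 110])

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(usage), usage, fitness])
beta, *_ = np.linalg.lstsq(X, miles, rcond=None)
b0, b1, b2 = beta  # intercept and the two regression coefficients
print(b0, b1, b2)

# Equivalent scikit-learn calls (as mentioned in the lecture):
#   from sklearn.linear_model import LinearRegression
#   model = LinearRegression().fit(np.column_stack([usage, fitness]), miles)
#   model.intercept_, model.coef_
```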
Describing Linear Regression
The speaker explains how linear regression is used to describe data.
Describing Linear Regression
- Linear regression can be used to describe relationships between variables.
- The relationship between three variables (miles, usage, and fitness) is summarized using linear regression as a descriptive tool.
- Positive signs in the equation indicate that as fitness or usage increase, miles walked also increases.
- Linear regression is empirical and based on data.
Describing Model Quality
The speaker discusses ways to describe the quality of a model and how accurate it is. They explain that these questions cannot be answered until the middle of the process.
Hypotheses Testing
- To determine if there is a relationship between output and variable, we ask if beta1 is equal to 0.
- We need to know what the error around that number is to answer this question.
- If the plus-minus range includes 0, then we can't say the coefficient is different from 0. If it doesn't include 0, the variable contributes to a predictive model.
Inferential Phase
- All models will now see an inferential phase where they will be put into an interpretative test to determine how useful they are for new data.
- The mean must come with a plus-minus, and each regression coefficient must now face a test of whether it equals zero or not.
Writing Equations for Models
The speaker explains how they write equations for models and what each part means.
Writing Equations
- The equation miles = beta naught + beta1 * usage + beta2 * fitness can be written using code.
- You can add another variable by putting in another comma-separated value in x.
Plotting Data
- Plotting becomes difficult with three or more variables.
- Arbitrary equations can be used to express the relationship between multiple variables.
Intercept
- The intercept is where the fitted line cuts the y-axis, i.e. the value of the equation at x = 0.
- The equation is written with an intercept so the line can be placed freely, even though x = 0 itself may not make sense for the data.
Using the Intercept to Get a Better Line
The speaker explains how having the freedom to include an intercept (beta naught) allows more flexibility in getting the best fit. The intercept is not used for selecting which variables to model.
The Intercept and Flexibility
- Including an intercept allows more flexibility in getting the best-fitting line.
- The regression equation describes all three variables in one shot, but the intercept itself is not used for selecting variables to model.
- With no explanatory variable at all, there is nothing to model beyond the intercept.
Descriptive Statistics vs Predictive Analytics
The speaker discusses descriptive statistics and predictive analytics, including their differences and uses.
Descriptive Statistics
- Descriptive statistics involves summarizing data with the purpose of visualizing it or using it for inference and prediction later on.
- Univariate data refers to one variable, and descriptive statistics includes location (mean, median), variation (standard deviation), quartiles, and five-point summary.
Predictive Analytics
- Predictive analytics involves predicting what will happen if something changes.
- Prediction is giving an x value and receiving a y value. It doesn't necessarily involve forecasting anything.
- Forecasting involves predicting future events based on past data.
Parameters Used in Describing Data
The speaker discusses parameters that are used in conveying information about data.
Five Point Summary
- A five point summary includes minimum, 25th percentile, median, 75th percentile, and maximum.
Quartiles
- Quartiles include upper quartile and lower quartile.
Other Parameters
- Other parameters used in describing data include standard deviation, range, and interquartile range.
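These summaries can be sketched with NumPy percentiles (the data values are invented):

```python
import numpy as np

data = np.array([12, 15, 17, 20, 22, 25, 28, 30, 35, 60])  # invented values

# Five-point summary: minimum, lower quartile, median, upper quartile, maximum.
five = np.percentile(data, [0, 25, 50, 75, 100])
print(five)

iqr = five[3] - five[1]   # interquartile range
rng = five[4] - five[0]   # range
std = data.std(ddof=1)    # sample standard deviation
print(iqr, rng, round(std, 2))
```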
Bivariate Data and Multivariate Summary
In this section, the speaker discusses bivariate data and multivariate summary. They explain covariance, correlation, and linear regression as tools for describing relationships between variables.
Covariance and Correlation
- Covariance measures the variability of two variables together.
- The variance is the univariate version of covariance.
- Correlation is a scaled version of covariance that ranges from -1 to 1.
- A correlation close to +1 indicates a strong positive relationship between variables, while a correlation close to -1 indicates a strong negative relationship.
- Correlation does not imply causation.
Linear Regression
- Linear regression is an equation that describes the relationship between variables.
- It can be used for prediction or prescription but is primarily used for describing relationships.
- Linear regression can be useful when dealing with multiple variables by providing a mechanism to describe how they are connected.
Visualization
- Visualization is another way to summarize data when dealing with multiple variables.
- Histograms, box plots, and scatter plots are examples of visualizations that can help us see patterns in data.
- However, visualization has limitations in terms of how many dimensions it can represent effectively.
Finding Beta Values
- To find beta values in linear regression, we form an equation and take the values closest to the data points using distance measurements.
Finding the Best Line
In this section, the speaker explains how to find a line that best describes the relationship between data points.
Finding the Equation of a Line
- To describe the relationship between data points, we need to find a line that goes through these points.
- The equation of this line is y = a + bx.
- We want to find values for a and b that will make this line as close as possible to all of our data points.
Measuring Distance from Data Points to Line
- For every choice of a and b, we can calculate the distance from each data point to the line.
- We do this by finding the difference between each y-value and its corresponding value on our line (a + bx).
- We then square these differences and sum them up for all data points.
- This gives us a measure of how far our line is from our data.
Choosing Values for A and B
- We want to choose values for a and b that minimize this distance measure.
- If every point is on the line, then this distance measure will be 0.
- By minimizing this distance measure, we can find the best-fitting line for our data.
Linear Regression
In this section, the speaker explains how linear regression works and why it is a convex optimization problem.
Linear Regression
- Choose a and b to minimize.
- The software gives you a and b that minimizes the function.
- This is called linear regression.
- Gauss was successful partly because of least squares; Laplace, who worked with absolute deviations instead, was not as successful.
- If you use modulus (absolute) values here, there is a possibility that you will not get a single answer.
- Because of the nice bowl-shaped curve that the square function gives you, you will find a unique solution.
Convex Optimization
- This was called a convex problem and this is a convex optimization because of squaring.
- The system differentiates with respect to a and sets the result equal to 0, differentiates with respect to b and sets that equal to 0, and solves the two equations. Solving this way breaks down when the problem becomes very high-dimensional.
- In high dimensions you typically need linear algebra (for example, solving the normal equations).
Machine Learning Algorithms
In this section, the speaker explains how machine learning algorithms are built using optimization problems.
Optimization Problems in Machine Learning
- Most machine learning algorithms are built based on an input-output relationship where an algorithm tries to come closest to the output based on the input given.
- For example, object recognition or text recognition requires measuring the distance between what the computer thinks the word is and what the correct word is.
Understanding Least Squares Algorithm
In this section, the speaker explains how algorithms work by comparing predictions with actual data and minimizing the distance between them. The least squares algorithm is introduced as a popular fitting algorithm that minimizes the distance.
Introduction to Least Squares Algorithm
- Algorithms compare predictions with actual data and minimize the distance between them.
- The least squares algorithm is a popular fitting algorithm that minimizes the distance.
- Any type of algorithm can use the least squares method, including neural networks, support vector machines, random forests, and association rules.
Training Data for Algorithms
- Training data is given to an algorithm to teach it what to do in certain situations.
- The training set is not the same as real-world data because it has been given to the algorithm beforehand.
- Test data is used to evaluate how well an algorithm performs on new data.
Making Programs Generalizable
- To make a program generalizable, you need to train it on all available data except for a portion that is kept aside for testing purposes.
- This allows you to test how well your program will perform on new data that you have not seen before.
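A minimal sketch of holding out test data by shuffling row indices with NumPy (the proportions and seed are arbitrary choices):

```python
import numpy as np

# Shuffle row indices, keep 20% aside for testing, train on the rest.
rng = np.random.default_rng(42)  # arbitrary seed
n = 100                          # pretend we have 100 rows of data
idx = rng.permutation(n)

test_size = int(0.2 * n)
test_idx, train_idx = idx[:test_size], idx[test_size:]
print(len(train_idx), len(test_idx))  # 80 20

# scikit-learn offers the same via:
#   from sklearn.model_selection import train_test_split
```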
Test, Validate, Train
In this section, the speaker discusses the train-validate-test process. The algorithm needs to know how good it is, and for that it needs measures such as means and standard deviations.
Validation and Measures
- Validation is done on somebody else's new data.
- The algorithm needs to know how good it is, and it needs measures like means and standard deviations.
- The descriptive method is used as a criterion for building predictive models.
- Means and standard deviations are approximations.
Classifying Algorithms
In this section, the speaker talks about classifying algorithms as good or bad based on positive or negative sentiment in tweets.
Positive vs Negative Sentiment
- If you're classifying an algorithm based on positive or negative sentiment in tweets, you don't need measures like means and standard deviations.
- You only need to determine if you are correct or incorrect.
- Zero distance is given for correct classification while one distance is given for incorrect classification.
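The zero-or-one distance can be sketched directly (the sentiment labels are hypothetical):

```python
# Zero-one loss: distance 0 for a correct classification, 1 for an incorrect one.
def zero_one_loss(predicted, actual):
    return sum(p != a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical tweet-sentiment labels.
predicted = ["pos", "neg", "pos", "pos"]
actual    = ["pos", "neg", "neg", "pos"]
print(zero_one_loss(predicted, actual))  # 0.25: one mistake out of four
```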
Estimating Numbers
In this section, the speaker discusses estimating numbers such as miles and things like that.
Measuring Closeness
- When estimating numbers such as miles, there is no outright mistake; you are simply either close or far.
- A measure of closeness is needed when estimating numbers.
Predictive Models
In this section, the speaker talks about using descriptive methods as a criteria for building predictive models.
Building Predictive Models
- A descriptive method can be used as a predictive model, but it is rarely good enough on its own.
- Straight lines and the like are approximations; better versions of these will be built when they are used for actual prediction.
Building Models
In this section, the speaker discusses building models for cricket.
Building Cricket Models
- Models are built to figure out whether a team is doing well or how well a run chase is going.
- These models borrow the form of physical laws to predict something that is not governed by a physical law.
- The laws of physics do not literally apply to cricket in that way.
Conclusion
In this section, the speaker concludes by saying that there are many ways to solve specific problems and better estimates can be obtained for doing so.
Better Estimates
- Specific problems require better estimates.
- There are many ways to solve specific problems.
Linear Regression
In this section, the speaker explains how to calculate the estimate of b and a in linear regression.
Calculation of Estimate of b and a
- The estimate of b is: b̂ = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)².
- The estimate of a is: â = ȳ − b̂x̄.
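The two closed-form estimates above translate directly into code. This is a minimal sketch; the data below is chosen so the true line is known in advance.

```python
def ols_fit(xs, ys):
    """Least-squares estimates:
    b = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2),  a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]      # lies exactly on y = 1 + 2x
a, b = ols_fit(xs, ys)     # recovers a = 1, b = 2
```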
Minimizing with respect to a
- To minimize a function, we set its derivative equal to zero; because the sum of squared errors is convex, this critical point is the minimum.
- We are minimizing with respect to 'a', not 'x'. 'x' and 'y' are fixed for our data.
Visualization and Uses of Linear Regression
- Linear regression can be used for prediction, prescription, or simply visualizing or summarizing the relationship between two variables.
- One use case is measuring price sensitivity by calculating elasticity of demand which is essentially a slope that relates sales and price.
Understanding Elasticity of Demand
In this section, the speaker explains how an equation can be used to describe a parameter and not predict anything. The elasticity of demand is discussed as an example.
Elasticity of Demand
- If a product has inelastic demand, changing its price won't affect its demand much. Salt is given as an example.
- Marketing people are concerned with whether their product's demand is elastic or inelastic because it affects pricing strategies.
- Equations can be built just to describe something like elasticity of demand.
Building an Equation for Miles and Usage
In this section, the speaker builds an equation for miles and usage using the data set provided.
Building the Equation
- The fitted slope and intercept for miles against usage are approximately 36.3 and −22 respectively.
- The resulting equation is miles ≈ −22 + 36.3 × usage.
- To find the covariance between miles and usage, we can use the correlation or covariance function in Python.
- The covariance matrix gives us the covariance between all variables in our data set.
- To find b (the slope), we need to divide the covariance of miles and usage by variance of usage.
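The covariance-based route described above can be sketched with NumPy. The numbers here are hypothetical stand-ins for the lecture's miles/usage dataset, which is not reproduced in these notes.

```python
import numpy as np

# Hypothetical data standing in for the lecture's usage and miles columns.
usage = np.array([2.0, 3.0, 5.0, 7.0, 9.0])
miles = np.array([50.0, 90.0, 160.0, 230.0, 310.0])

cov_matrix = np.cov(usage, miles)        # 2x2 covariance matrix
b = cov_matrix[0, 1] / cov_matrix[0, 0]  # cov(usage, miles) / var(usage)
a = miles.mean() - b * usage.mean()      # intercept from the means
```

`np.cov` returns the full covariance matrix, so the slope is the off-diagonal entry divided by the variance of the predictor.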
Linear Regression
In this section, the speaker explains how to calculate linear regression coefficients and discusses the importance of normalizing numbers by standard deviation.
Calculating Coefficients
- The variance is used in the equation for calculating linear regression coefficients.
- The covariance of x and y divided by the variance of x gives the coefficient.
- The value of b is 36.318.
Finding Intercept
- To find the intercept â, use the means: â = ȳ − b̂x̄.
- Plug in the mean values of miles and usage for ȳ and x̄.
- This gives an intercept of approximately −22.
Units and Difficulty with Predictive Models
- The unit of b is miles per usage.
- Because b is not dimensionless, it can be difficult to use in predictive models as changing units can change its magnitude.
- Normalizing numbers by standard deviation can help with hypothesis testing.
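The normalization mentioned above can be illustrated: after z-scoring both variables, the fitted slope is dimensionless and equals the correlation coefficient. The data below is a made-up perfectly linear example, so the result is 1.

```python
import statistics

def standardized_slope(xs, ys):
    """Slope after z-scoring x and y; dimensionless, equals Pearson's r."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    zx = [(x - mx) / sx for x in xs]
    zy = [(y - my) / sy for y in ys]
    # Slope of zy on zx is cov(zx, zy) / var(zx), and var(zx) == 1
    return sum(a * b for a, b in zip(zx, zy)) / (len(xs) - 1)

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]        # perfectly linear, so r == 1
r = standardized_slope(xs, ys)
```

Because `r` has no units, it does not change when miles are converted to kilometers, which is what makes it usable in hypothesis testing.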
Bayes' Theorem
In this section, the speaker introduces Bayes' theorem and explains how it works.
Introduction to Bayes' Theorem
- Bayes' theorem lets us switch between the two conditional probabilities P(A | B) and P(B | A).
Example: Identifying Spam Emails
- Gmail identifies spam emails partly by looking at email headers.
Probability of Spam Given Words
In this section, the speaker explains how to find the probability of spam given words and why it is easier to solve the opposite problem.
Finding the Opposite Conditional
- The opposite conditional is finding the probability of words given spam.
- This is easier because we can manually tag documents as spam or not spam and collect data on their word distributions.
Using Data to Solve for Probability
- By collecting data on tagged documents, we can find the distribution of words for both spam and non-spam emails.
- We can use this information to solve for probability of words given spam.
- We also need an estimate of the proportion of emails that are spam or not spam.
Solving for Probability of Spam Given Words
- Using Bayes' theorem: P(spam | words) = P(words | spam) × P(spam) / P(words).
- To execute this formula, we need a lexicon or dictionary for probability of words and tagged data for both spam and non-spam emails.
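The recipe above — word distributions from tagged documents, a prior proportion of spam, then Bayes' theorem — is essentially a naive Bayes classifier. Here is a toy sketch with an invented, hand-tagged corpus; the naive independence assumption (multiplying per-word probabilities) is named explicitly since the notes do not spell it out.

```python
from collections import Counter

# Toy tagged corpus, standing in for manually labelled emails.
spam_docs = [["win", "money", "now"], ["free", "money", "offer"]]
ham_docs  = [["meeting", "tomorrow"], ["project", "report", "tomorrow"]]

vocab = {w for d in spam_docs + ham_docs for w in d}

def word_probs(docs, alpha=1.0):
    """P(word | class) with Laplace smoothing over the lexicon."""
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    return {w: (counts[w] + alpha) / (total + alpha * len(vocab)) for w in vocab}

p_w_spam = word_probs(spam_docs)
p_w_ham  = word_probs(ham_docs)
p_spam   = len(spam_docs) / (len(spam_docs) + len(ham_docs))  # prior

def posterior_spam(words):
    """P(spam | words) via Bayes' theorem, assuming word independence."""
    ps, ph = p_spam, 1 - p_spam
    for w in words:
        if w in vocab:
            ps *= p_w_spam[w]
            ph *= p_w_ham[w]
    return ps / (ps + ph)      # normalizing by P(words)

score = posterior_spam(["free", "money"])   # well above 0.5: likely spam
```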
Understanding Posterior Probability
In this section, the speaker explains how posterior probabilities work in Bayesian language.
Posterior Probability Definition
- Posterior probabilities are an update on prior probabilities based on new information.
- For example, if we know certain words in an email, we have a better idea whether it is spam or not.
Examples
- Other examples include knowing someone's income to determine if they will buy a product or knowing someone's profession to determine if they will be interested in a certain topic.
- Knowing more information allows us to update our prior probabilities and make more accurate predictions.
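The prior-to-posterior update can be shown with one worked calculation. All numbers here are hypothetical, chosen only to illustrate the arithmetic.

```python
# Hypothetical numbers for the spam example.
p_spam = 0.2               # prior: 20% of all mail is spam
p_word_given_spam = 0.30   # the word appears in 30% of spam
p_word_given_ham = 0.01    # ...and in 1% of legitimate mail

# Total probability of seeing the word at all.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: the posterior jumps well above the 20% prior.
posterior = p_word_given_spam * p_spam / p_word
```

Seeing the word raises the probability of spam from 0.2 to roughly 0.88, which is exactly the "update on prior probabilities" described above.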
Bayes Theorem and Machine Learning
In this section, the speaker discusses how Bayes theorem has become central to machine learning, particularly in decision-making problems such as those faced by autonomous cars.
Bayes Theorem in Machine Learning
- Bayes theorem is central to machine learning, especially in decision-making problems.
- Autonomous cars face a decision problem when something crosses the road. They must decide whether to stop or not based on what they see.
- To solve this problem using Bayes theorem, the program needs to be told what situations require stopping and which do not. This can be done by looking at situations where the car stopped and situations where it did not.
- By analyzing these situations, the program can determine whether to stop or not based on what it sees.
Other Supervised Learning Algorithms
- Linear discriminant analysis is another supervised learning algorithm that finds posterior distributions of being in a class given data.
- Bayesian belief networks are explicitly designed for this type of supervised learning.
Simple Rules vs. Bayesian Methods for Autonomous Cars
In this section, the speaker compares simple rules with Bayesian methods for programming autonomous cars.
Simple Rules for Autonomous Cars
- A simple rule for an autonomous car would be to stop if it sees something on the road.
- However, programming a car with simple rules becomes difficult when trying to account for all possible scenarios. For example, what should the car do if it doesn't see anything?
Bayesian Methods for Autonomous Cars
- Bayesian methods are more effective for programming autonomous cars because they can account for all possible scenarios.
- When the model encounters evidence it has not seen before (such as an unseen word in the spam example), Bayesian methods contribute nothing for it and update the decision only on the evidence that is present.