Statistics Lecture 3.4: Finding Z-Score, Percentiles and Quartiles, and Comparing Standard Deviation

Comparing Standard Deviations and Coefficient of Variation

In this section, the speaker discusses how comparing standard deviations alone is not an effective way to determine which set has more variation. The coefficient of variation is introduced as one method to compare variation, but it is not commonly used in practice.

Coefficient of Variation

  • The coefficient of variation is a measure that converts the standard deviation into a percentage, allowing for comparison between sets.
  • However, it is not widely used in practical applications.
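
As a rough illustration of the idea above, the coefficient of variation is the standard deviation divided by the mean, written as a percentage. The sketch below assumes sample data and uses made-up height and weight values; the function name `coefficient_of_variation` is only a label chosen for this example.

```python
import statistics

def coefficient_of_variation(values):
    """Return the sample standard deviation as a percentage of the mean."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return s / mean * 100

# Hypothetical data sets measured in different units.
heights_in = [60, 62, 64, 66, 68]
weights_lb = [120, 140, 160, 180, 200]

cv_heights = coefficient_of_variation(heights_in)
cv_weights = coefficient_of_variation(weights_lb)

print(f"heights CV: {cv_heights:.1f}%")
print(f"weights CV: {cv_weights:.1f}%")
```

Because both results are percentages of their own mean, they can be compared even though the raw standard deviations are in inches and pounds.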

Finding Percentages within a Range

This section builds upon the previous topic by discussing how to find the percentage of data that falls within a given range using the concept of distance from the mean and standard deviation.

Z-Score and Measures of Relative Standing

  • The z-score is defined as the number of standard deviations a particular data value is away from the mean.
  • It allows a data value from one data set to be compared directly with a value from another, because each is expressed in standard deviations from its own mean.
  • Measures of relative standing involve comparing measures within or between data sets.

Introduction to Z-Score and Calculation

This section introduces the concept of z-scores and explains their calculation. The speaker emphasizes that z-scores are familiar because they were previously discussed in another session.

Definition and Calculation of Z-Score

  • A z-score represents the number of standard deviations a specific data value is away from the mean.
  • It can be calculated for both sample and population data using an identical formula.
  • The formula involves finding the distance between a data value and the mean, then dividing it by the standard deviation.

Representation of Data Values

This section discusses the representation of data values in the class and clarifies that the letter "X" is used to represent data values.

Representation of Data Values

  • In this class, data values are represented by the letter "X."

Z-Score Calculation for Sample and Population

This section explains that the calculation of z-scores is the same for both sample and population data. The speaker highlights that the major difference lies in calculating standard deviation.

Calculation of Z-Score for Sample and Population

  • Unlike standard deviation, which has different formulas for sample and population data, z-score calculation remains identical.
  • For a sample, use (N - 1) in the denominator when calculating standard deviation.
  • For a population, use N in the denominator when calculating standard deviation.
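
A minimal sketch of the point above, assuming a small made-up data set: the z-score formula is the same in both cases, and only the standard deviation that is plugged in changes. Python's `statistics.stdev` divides by n - 1 (sample) and `statistics.pstdev` divides by N (population).

```python
import statistics

def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

# Hypothetical data, used only to show the two standard deviations.
data = [2, 4, 4, 4, 5, 5, 7, 9]
mean = statistics.mean(data)

s = statistics.stdev(data)       # sample: divides by n - 1
sigma = statistics.pstdev(data)  # population: divides by N

z_sample = z_score(9, mean, s)
z_population = z_score(9, mean, sigma)

print(f"s = {s:.3f}, z = {z_sample:.3f}")
print(f"sigma = {sigma:.3f}, z = {z_population:.3f}")
```

The sample standard deviation is slightly larger, so the same data value gets a slightly smaller z-score when the data are treated as a sample.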

Distance Calculation and Percentage within Range

This section revisits finding percentages within a range using distance calculations from previous examples.

Distance Calculation and Percentage within Range

  • To find out what percentage of data falls within a given range, consider the mean and standard deviation.
  • Calculate how many standard deviations fit within the range instead of adding individual distances.
  • Last time's example involved finding a range from 10 to 58 with a mean of 34 and a standard deviation of 8.
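
The arithmetic for that example can be sketched as follows. The variable names are chosen for this illustration; the numbers are the ones quoted above. Each end of the range is converted into a count of standard deviations from the mean, and that count is what a rule such as the empirical rule then turns into a percentage.

```python
mean = 34
sd = 8
low, high = 10, 58

# Distance from the mean to each end of the range, in standard deviations.
k_low = (mean - low) / sd
k_high = (high - mean) / sd

print(k_low, k_high)  # both ends are 3.0 standard deviations from the mean
```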

Mean and Standard Deviation Consideration

This section emphasizes considering mean and standard deviation when determining percentages within a given range.

Mean and Standard Deviation Consideration

  • When finding percentages within a range, it is crucial to consider both mean and standard deviation.
  • The example discussed had a mean of 34 and a standard deviation of 8.

Calculating Standard Deviation and Z-Scores

In this section, the speaker explains how to calculate the distance between a data value and the mean using standard deviation. They also introduce the concept of z-scores and how they can be used to compare different sets of data.

Calculating Distance and Standard Deviation

  • The distance between a data value (x) and the mean (x-bar) is calculated by subtracting the mean from the data value.
  • To calculate how many standard deviations fit in this distance, divide the distance by the standard deviation (s).

Z-Scores for Samples and Populations

  • For samples, use x-bar to represent the mean and s to represent the standard deviation.
  • For populations, use mu (μ) to represent the mean and sigma (σ) to represent the standard deviation.

Using Z-Scores

  • To calculate a z-score, divide the distance between a data value (x) and the mean by the standard deviation.
  • The z-score indicates how many standard deviations away a data value is from the mean.
  • Z-scores can have decimal values, which provide more precise information than whole numbers.

Comparing Variation in Samples or Populations

This section discusses how z-scores can be used to compare data values from different samples or populations. It also explores comparing populations with samples.

Comparing Across Data Sets

  • Z-scores allow for easy comparison of values drawn from two different samples or populations, because each value is measured in standard deviations from its own mean.
  • By calculating a z-score for each value, you can determine which one stands farther from its own group's mean.

Comparing Populations with Samples

  • It is possible to compare a population with a sample using z-scores.
  • If you know the z-score of a value from each set, you can assess whether the two values have a similar relative standing.
  • A large enough sample size can represent the population well, but it may not be an exact match.

Example: Comparing Heights of Shaquille O'Neal and Lyndon B. Johnson

In this section, the speaker provides an example of comparing the heights of two individuals, Shaquille O'Neal and Lyndon B. Johnson, using z-scores.

Example Scenario

  • The heights of Shaquille O'Neal and Lyndon B. Johnson are being compared.
  • Shaquille O'Neal's height is 85 inches, while Lyndon B. Johnson's height is 76 inches (six foot four).
  • Directly comparing their heights shows that Shaquille O'Neal is taller.

Comparing in Relation to Populations

  • To compare them in relation to their respective populations, calculate z-scores for each individual's height.
  • This allows for a standardized comparison considering the variation within each population.

Comparing Heights of Presidents and Shaquille O'Neal

In this section, the speaker discusses the heights of presidents and compares them to the height of Shaquille O'Neal.

Heights of Presidents and Shaquille O'Neal

  • The speaker mentions that Lyndon B. Johnson (LBJ) will be compared to other presidents.
  • Abraham Lincoln and LBJ were both six foot four inches tall.
  • The average height for presidents is seventy-one point five inches, which is slightly under six feet.
  • The standard deviation for presidents' heights is two point one inches.
  • The speaker explains the concept of absolute and relative height comparisons.
  • Shaquille O'Neal's height is mentioned as eighty-five inches (considered old data).
  • The mean height for Miami Heat players is given as eighty inches with a standard deviation of three point three inches.

Calculating Z-Scores for Heights

In this section, the speaker explains how to calculate z-scores for comparing heights.

Calculating Z-Scores

  • Z-scores are used to determine how far away a value is from the mean in terms of standard deviations.
  • Two populations are considered: all presidents and Miami Heat players.
  • A z-score can be positive or negative depending on whether the value is greater or smaller than the mean.
  • It's important to subtract correctly when calculating z-scores to avoid errors.

[t=22m35s] Calculating the Z-Score

In this section, the speaker explains how to calculate the z-score and emphasizes the importance of using consistent data values and avoiding excessive rounding.

Calculating the Z-Score

  • The z-score formula is X minus mu divided by Sigma.
  • X represents the data value (in this case, LBJ's height).
  • Mu represents the mean for all presidents' heights.
  • Sigma represents the standard deviation for all presidents' heights.
  • To calculate the z-score:
  1. Subtract the mean from the data value (X minus mu).
  2. Divide the result by the standard deviation (Sigma).
  • Avoid excessive rounding to ensure accuracy in subsequent calculations.

[t=23m17s] Interpreting Z-Scores

In this section, the speaker explains how to interpret z-scores and their significance in relation to mean and standard deviation.

Understanding Z-Scores

  • A z-score indicates how many standard deviations a data value is away from the mean.
  • A positive z-score means that a data value is above the mean, while a negative z-score means it is below.
  • In this example, a calculated z-score of 2.14 indicates that LBJ's height is approximately 2.14 standard deviations above the mean height of all presidents.
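
The calculation behind that figure can be sketched as below, using the values given in these notes: a height of six foot four (76 inches), a mean of 71.5 inches, and a standard deviation of 2.1 inches for presidents. The function name `z_score` is just a label for this example.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

lbj_height = 76          # six foot four, in inches
presidents_mean = 71.5   # inches
presidents_sd = 2.1      # inches

z_lbj = z_score(lbj_height, presidents_mean, presidents_sd)
print(round(z_lbj, 2))  # 2.14
```

Rounding is left to the final step, in line with the advice above to avoid excessive rounding along the way.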

[t=27m22s] The Meaning of Z-Scores

In this section, the speaker reinforces that a z-score represents a number of standard deviations a data value is away from the mean and discusses its implications.

Significance of Z-Scores

  • A z-score provides information about how much a data value varies from its population's mean.
  • Being within two standard deviations of the mean is considered usual or typical, while being outside that range is considered unusual or atypical.
  • In the case of LBJ's height, with a z-score of 2.14, it is considered unusual as it is more than two standard deviations away from the mean.

[t=29m46s] Using Appropriate Data for Z-Scores

In this section, the speaker emphasizes the importance of using appropriate data when calculating z-scores for different populations.

Using Different Means and Standard Deviations

  • When calculating z-scores for different populations or groups, it is crucial to use the appropriate mean and standard deviation.
  • In this example, since Shaq is not a president, his height should be compared to a different population's mean and standard deviation: those of the Miami Heat players.
  • The formula remains the same: X minus mu divided by Sigma, but with values specific to the relevant population.

Comparing Shaq and LBJ Using Z-Scores

This section discusses the comparison of heights between Shaq and LBJ using z-scores.

Comparison of Heights

  • Shaq is tall in absolute terms, but relative to his own population, the Miami Heat players, he is only 1.82 standard deviations above the mean.
  • LBJ was relatively taller than Shaq because his z-score of 2.14 places him farther above the mean of his own population, the presidents.
  • Positive z-scores indicate being taller than average, while negative z-scores indicate being shorter.
  • The empirical rule states that, for bell-shaped data, about 68% of values fall within one standard deviation of the mean, about 95% within two, and about 99.7% within three.
  • A z-score of one means being one standard deviation away from the mean in either direction.
  • Any z-score between -2 and 2 is considered usual, while any z-score outside this range is considered unusual.

Classifying Data Values as Usual or Unusual

This section explains how to determine if a data value is usual or unusual using z-scores.

Usual and Unusual Data Values

  • A z-score between -2 and 2 is considered usual, while a z-score outside this range is considered unusual.
  • For example, Shaq's height would be considered usual as it falls within the range of -2 to 2.
  • LBJ's height crosses over into the unusual range but not by much.
  • A z-score of 3 or higher would be very unusual.
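
One way to sketch these cutoffs in code is shown below. How the exact boundary values (a z-score of exactly 2 or exactly 3) are treated is an assumption made for this example, and the z-score of 4.0 is a made-up value; 1.82 and 2.14 are the z-scores quoted in these notes.

```python
def classify(z):
    """Label a z-score using the cutoffs described in the notes."""
    size = abs(z)  # direction does not matter, only distance from the mean
    if size >= 3:
        return "very unusual"
    if size > 2:
        return "unusual"
    return "usual"

for name, z in [("Shaq", 1.82), ("LBJ", 2.14), ("made-up value", 4.0)]:
    print(name, z, classify(z))
```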

Very Unusual Data Values and Z-Scores

This section concludes the discussion on usual and unusual data values based on z-scores.

Very Unusual Data Values

  • A z-score greater than 3 would be very unusual for a data value.
  • Being 3 standard deviations away from the mean is considered very unusual.
  • For example, if the z-score was 4, it would indicate an extremely unusual height.
  • The z-score helps determine how usual or unusual a data value is based on its distance from the mean.

Using Statistics to Determine Unusual Data

In this section, the speaker discusses using statistics to determine whether certain data points are unusual or not.

Determining Unusual Data

  • The speaker mentions that it would be interesting to determine if a specific scenario is usual or unusual based on statistical analysis.
  • They use the example of Lyndon B. Johnson (LBJ) hypothetically playing for the Miami Heat and question whether his height would be unusual there.
  • The speaker explains that they will calculate the z-score to determine if LBJ's height would be usual or unusual on the Miami Heat team.

Analyzing LBJ Playing for the Miami Heat

In this section, the speaker calculates the z-score to analyze whether LBJ's height would be usual or unusual for a Miami Heat player.

Calculating Z-Score for LBJ on Miami Heat

  • The speaker sets up the calculation by assigning an x-value (LBJ's height) and determining which mean should be used now that he is being placed on the Miami Heat team.
  • They explain that they will use the same Miami Heat mean and standard deviation as before, since they have no data that includes LBJ playing for the Heat.
  • The calculation results in a negative z-score, indicating that LBJ's height is below the Miami Heat average.

Determining Usualness of LBJ Playing for Miami Heat

In this section, the speaker discusses whether it would be usual or unusual for someone of LBJ's height to play for the Miami Heat.

Evaluating Usualness Based on Z-Score

  • The speaker presents a z-score of -1.21 and questions whether it would be usual or unusual for someone of LBJ's height to play for the Miami Heat.
  • They compare it to the range from -2 to 2 and conclude that it would be usual for someone of LBJ's height to play for the Miami Heat, since -1.21 falls inside that range.

Analyzing Another Player's Height

In this section, the speaker analyzes another player's height and discusses its usualness in playing for the Miami Heat.

Analyzing Mr. Leonard's Height

  • The speaker mentions Mr. Leonard, who is 6 feet (72 inches) tall, and questions whether it would be usual or unusual for him to play for the Miami Heat.
  • They calculate a z-score based on Mr. Leonard's height, (72 - 80) / 3.3 ≈ -2.42, which falls outside the range from -2 to 2, so it would be unusual for someone of his height to play for the Miami Heat.
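
Both Miami Heat comparisons can be checked with the same formula. This sketch uses the figures stated in these notes (a team mean of 80 inches and a standard deviation of 3.3 inches, with heights of 76 and 72 inches); the helper names are chosen only for this example.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma

def is_usual(z):
    """Usual means the z-score lies between -2 and 2."""
    return -2 <= z <= 2

heat_mean = 80.0  # inches
heat_sd = 3.3     # inches

z_lbj = z_score(76, heat_mean, heat_sd)      # LBJ, six foot four
z_leonard = z_score(72, heat_mean, heat_sd)  # Mr. Leonard, six feet

print(round(z_lbj, 2), is_usual(z_lbj))          # -1.21 True
print(round(z_leonard, 2), is_usual(z_leonard))  # -2.42 False
```

Both z-scores are negative because both heights are below the team mean; only the second one is far enough below to fall outside the usual range.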

Rareness of Data Based on Z-Score

In this section, the speaker explains how z-scores indicate the rareness of data points.

Understanding Rareness Based on Z-Score

  • The speaker discusses Muggsy Bogues as an example of a very short player who played in the NBA.
  • They explain that while rare cases can happen, data points with larger absolute value z-scores are more rare.
  • The further away from the mean (in terms of standard deviation), the more rare a piece of data becomes.

Relationship Between Z-Score and Data Rareness

In this section, the speaker emphasizes the relationship between z-scores and data rareness.

Relationship Between Z-Score and Data Rareness

  • The speaker reiterates that larger absolute value z-scores indicate rarer data points.
  • A negative z-score means being below average, while a positive z-score means being above average.
  • The larger the absolute value of a z-score, the rarer the data point.

Introduction to Quartiles and Percentiles

In this section, the speaker introduces quartiles and percentiles as ways to analyze data.

Understanding Quartiles

  • The speaker explains that quartiles divide data into quarters.
  • They mention that quartiles are similar to percentiles and will discuss their similarities in the following sections.

Understanding Quartiles

In this section, the speaker explains the concept of quartiles and their significance in data analysis. The first quartile marks the value below which the bottom 25% of the data falls, the second quartile (also known as the median) marks the bottom 50%, and the third quartile marks the bottom 75%. The speaker also mentions that there is no fourth quartile, since three cut points already split the data into four parts.

Quartiles and Their Percentages

  • The first quartile (Q1) is the value that separates the bottom 25% of sorted data from the rest.
  • The second quartile (Q2), also known as the median, separates the bottom 50% of sorted data from the top 50%.
  • The third quartile (Q3) separates the bottom 75% of sorted data from the top 25%.
  • There is no fourth quartile; three cut points are enough to split the data into four parts.

Finding Quartiles

  • To find Q1, we need to identify the lowest 25% of values in a sorted dataset.
  • Q2 can be found by determining the median value in a dataset.
  • Q3 can be calculated by identifying the lowest 75% of values in a sorted dataset.

Calculator Functionality

  • Calculators can easily determine medians and other quartiles using one-variable statistics functions.
  • Some calculators may label Q2 as "median" but will still provide its value along with Q1 and Q3.

Example and Caveat

  • It is important to ensure that your dataset is sorted before calculating quartiles manually or using a calculator.
  • Different programs may calculate quartiles differently, so it's best to stick with one method consistently.
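
As a small illustration of that caveat, Python's standard `statistics.quantiles` function offers two methods, "exclusive" (the default) and "inclusive", and they can return different quartiles for the same data. The data set below is made up for this example.

```python
import statistics

# Hypothetical sorted data set, used only for illustration.
data = [4, 6, 8, 10, 12, 13, 15, 18]

# n=4 asks for the three cut points that split the data into quarters.
q_exclusive = statistics.quantiles(data, n=4, method="exclusive")
q_inclusive = statistics.quantiles(data, n=4, method="inclusive")

print("exclusive:", q_exclusive)
print("inclusive:", q_inclusive)
```

The middle value (the median) agrees between the two methods, while Q1 and Q3 differ, which is why sticking to one method matters when comparing results.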

Finding Quartiles - Step by Step

This section provides step-by-step instructions for finding quartiles in a dataset.

Checking Data Order

  • Before finding quartiles, it is crucial to ensure that the dataset is sorted in ascending or descending order.

Finding Q2 (Median)

  • The first step is to find the median (Q2) of the dataset.
  • If the number of data points is odd, the median will be a single value.
  • If the number of data points is even, average the two middle values to find the median.

Finding Q1 and Q3

  • After finding Q2, divide the dataset into two halves at Q2.
  • For an odd number of data points, exclude Q2 when dividing.
  • For an even number of data points, Q2 falls between the two middle values, so every data value goes into one half or the other.
  • The lower half represents values below Q2 and can be used to find Q1.
  • The upper half represents values above Q2 and can be used to find Q3.

Example Calculation

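  • Given a sorted dataset with an even number of data points, locate the two middle values: 10 and 12.
  • Average these two values to find Q2 (median): 11.
  • Divide the dataset into lower and upper halves. Because the count is even, both middle values stay in the data: 10 is the last value of the lower half (4, 6, 8, 10) and 12 is the first value of the upper half (12, 13, 15, ...).
  • Find the median of each half: Q1 = (6 + 8) / 2 = 7, and Q3 is the average of the two middle values of the upper half.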

It's important to follow consistent methods for calculating quartiles manually or using calculators.
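
The median-of-halves procedure described in this section can be sketched as follows. This is only one of several quartile conventions, and the data sets are hypothetical: the first eight values echo the example above with an assumed final value of 18, and the second adds one more value to make the count odd.

```python
import statistics

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # With an odd count the median is a data value and is left out of both halves.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return statistics.median(lower), statistics.median(data), statistics.median(upper)

even_data = [4, 6, 8, 10, 12, 13, 15, 18]  # hypothetical, even count
odd_data = even_data + [20]                # hypothetical, odd count

print(quartiles(even_data))  # (7.0, 11.0, 14.0)
print(quartiles(odd_data))   # (7.0, 12, 16.5)
```

In the odd case the median 12 is excluded before the halves are formed, so each half still has four values.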

Finding the Median and Quartiles in a Data Set

This section discusses how to find the median and quartiles in a data set.

Finding the Median and Quartiles

  • The median divides the data set into two groups: the top 50% and the bottom 50%. The dividing value is known as Q2 or the median.
  • Quartiles mark off the data in 25% steps. If you know how to find the median, you can find Q1 and Q3 by finding the median within each group.
  • When calculating quartiles, all data values are included in the two groups, except when the median is itself a single data value; that value is left out.
  • To find Q1, take the median of the bottom group; to find Q3, take the median of the top group. When a group has an even number of values, this means averaging its middle two numbers.
  • It's important to note that Q2 (the median) can be described as either Q2 or simply as "the median."

Quartiles When There Is No Exact Middle Number

This section explains how to calculate quartiles when there is no exact middle number in a data set.

Calculating Quartiles with No Exact Middle Number

  • When a group has no exact middle number, that is, it contains an even number of data points, we average its two middle numbers to find Q1 or Q3.
  • If an additional number makes the total count odd, the median becomes an actual data value, and we exclude that value while finding the remaining quartiles.
  • We pretend the excluded median does not exist and split the rest of the data into two 50% portions for the calculations.

Effect of an Extra Data Point on Quartiles

This section demonstrates how adding an extra data point affects quartile calculations.

Impact of Adding an Extra Data Point

  • Adding an extra data point does not change how the median (Q2) is found; with an odd count it is simply the middle value.
  • However, when calculating Q1 and Q3, we exclude the median from consideration.
  • The quartile values may change slightly due to the addition of an extra data point, but it does not impact the overall process.

Quartiles on a Calculator and an Introduction to Percentiles

This section discusses how quartiles can be calculated using calculators and introduces the concept of percentiles.

Calculating Quartiles with Calculators

  • Quartile calculations can be done quickly using calculators without much manual effort.
  • Percentiles are similar concepts to quartiles. They represent a specific percentage of data points in a dataset.
  • Percentages indicate proportions or fractions of a whole.

Understanding Percentiles

In this section, the speaker explains the concept of percentiles and how they are used to compare data.

What is a Percentile?

  • Percentiles separate sorted data into 100 equal parts.
  • It represents the position of a value in relation to other values in a dataset.
  • Percentiles are commonly used in tests like SAT to compare scores.

Number of Percentiles

  • When data is divided into 100 parts, there are 99 percentiles marking the boundaries between the parts.
  • There is no 100th percentile: a value cannot be greater than 100% of a data set that includes itself.

Calculating Percentile

  • To calculate percentile, count the number of values less than X (the given value).
  • Divide this count by the total number of data values.
  • Multiply the resulting decimal by 100 to get the percentile.

Example Calculation

Let's say you scored 87 out of 100 on a test. To find your percentile:

  1. Count how many people scored worse than you.
  2. Divide that count by the total number of test takers.
  3. Multiply the decimal by 100 to get your percentile.

By following these steps, you can determine your position relative to others who took the same test.
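
The three steps can be sketched in code as below. The score of 87 comes from the example above; the other scores are invented for illustration, and the function name `percentile_of` is just a label for this sketch.

```python
def percentile_of(value, values):
    """Percentage of data values that are strictly less than value."""
    below = sum(1 for v in values if v < value)  # step 1: count values below
    return below * 100 / len(values)             # steps 2 and 3: divide, then scale to 100

# Hypothetical test scores for a class of ten, including the score of 87.
scores = [55, 62, 68, 70, 74, 79, 81, 85, 87, 93]

p = percentile_of(87, scores)
print(p)  # 80.0
```

With these made-up scores, a test score of 87 lands at the 80th percentile, which shows that the score and the percentile are different numbers.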

Comparing Scores with Percentiles

This section explores how percentiles compare an individual's score with others who took the same test.

Understanding Percentile Comparison

  • The percentile compares an individual's score with everyone else who took the test.
  • It does not represent a percentage score on the test itself.
  • A higher percentile indicates performing better compared to others.

Example Scenario

If lots of people perform exceptionally well on a test:

  • Scoring at or near average may still result in a high percentile rank.
  • A low score could still be considered good if most people scored poorly.

Comparing to Others

  • Percentiles allow for comparing an individual's performance to the entire group.
  • It provides a relative measure of how well someone did compared to others.

Calculating Percentile

This section explains the process of calculating percentile using a ratio and total number of data values.

Calculation Steps

To calculate percentile:

  1. Count the number of data values less than X (the given value).
  2. Divide this count by the total number of data values.
  3. Multiply the resulting decimal by 100 to obtain the percentile.

Numerator and Denominator

  • The numerator represents the count of values less than X.
  • The denominator is the total number of data values in the set.

Converting Ratio to Percentile

After obtaining the ratio, multiply it by 100 to convert it into a percentile value.

Example Calculation

In this section, an example calculation is provided to demonstrate how to calculate percentile based on a given score.

Example Scenario

Let's assume you scored 87 out of 100 on a test.

Calculating Percentile

To calculate your percentile:

  1. Determine how many people scored worse than you.
  2. Divide that count by the total number of test takers.
  3. Multiply the resulting decimal by 100 to get your percentile rank.

By following these steps, you can determine your position relative to others who took the same test.

Understanding Percentiles and Percentages

In this section, the instructor explains the difference between percentiles and percentages and how to calculate percentiles based on data values.

Percentiles vs. Percentages

  • Percentile compares a data value to others in a group, while percentage represents a score out of 100.
  • To calculate a percentile, compare your data value to others in the group.

Calculating Percentiles

  • Determine how many people scored worse than you. If you count down from the total, subtract one for yourself, since you did not score worse than yourself.
  • Divide the number of people who scored worse by the total number of people in the group.
  • Multiply the result by 100 to get the percentile.

Understanding Scores and Percentiles

  • The score represents how you did on the test or exam itself.
  • The percentile indicates your placement compared to others who took the test.

Interpreting Percentiles

  • Scoring in a higher percentile means you performed better than a larger percentage of people who took the test.
  • For example, scoring in the 95th percentile means you performed better than 95% of test-takers.

Relationship Between Score and Percentile

  • A score does not directly correspond to a percentile.
  • The score measures your result on the test, while the percentile reflects your performance compared to others.

Going Back and Forth Between Score and Percentile

  • Given a percentile and the class size, it is possible to work backward to how many people scored below you.

Understanding Quartiles and Interquartile Range

In this section, the instructor introduces quartiles and explains what interquartile range (IQR) represents.

Quartiles

  • Only two quartiles are labeled: Q1 (first quartile) and Q3 (third quartile).
  • The median is also known as Q2.
  • The interquartile range (IQR) is the difference between Q3 and Q1.

Interquartile Range (IQR)

  • IQR represents the middle 50% of the data.
  • It is calculated by subtracting Q1 from Q3.

Conclusion

Understanding percentiles, percentages, quartiles, and interquartile range is essential for analyzing data and interpreting test scores. Percentiles help us compare our performance to others, while quartiles provide insights into the distribution of data.

Understanding Box Plots

In this section, the speaker introduces the concept of box plots and explains how they represent the five-number summary of a dataset.

Five Number Summary

  • The five number summary consists of the minimum value, maximum value, median (q2), and quartiles (q1 and q3).
  • The minimum and maximum values are self-explanatory.
  • Quartiles divide the data into four equal parts. Q1 marks the top of the lower 25% of the data, while Q3 marks the bottom of the upper 25%.

Creating a Box Plot

  • A box plot visually represents the five number summary on a number line.
  • The box in the middle represents the middle 50% of the data (IQR), with Q1 at one end and Q3 at the other.
  • The median is represented by a line within the box.
  • The minimum and maximum values are plotted as points outside of the box.

Example Calculation

  • Given a dataset already ordered from smallest to largest, starting at 1 and ending at 21.
  • The minimum is 1 and the maximum is 21.
  • To find Q1 and Q3, split the data at the median and take the median of the lower half and of the upper half.
  • In this example, the median is 9, Q1 = 5, and Q3 = 13.

Constructing a Box Plot

  • Draw a number line scaled from one to twenty-one.
  • Place markers for each value on their respective positions on this line.
  • Markers for median (9), Q1 (5), and Q3 (13) are placed accordingly.
  • Connect the markers for minimum and maximum to form a line segment.
  • Draw a box around the markers for Q1, median, and Q3.

Outliers

  • An outlier is a data point that significantly deviates from the rest of the dataset.
  • Determining outliers can be subjective, as it depends on what is considered "far enough away" from normal.
  • In this example, 21 might look like an outlier because it sits well away from the box, but a judgment by eye is subjective.

Identifying Outliers

The speaker discusses how to identify outliers in a dataset and introduces a mathematical approach to determine their presence.

Defining Outliers

  • An outlier is a data point that is significantly different from other values in the dataset.
  • Subjectively determining outliers based on visual inspection can be challenging.

Mathematical Approach

  • There is a mathematical method to identify outliers more objectively.
  • However, explaining this method requires further explanation and may be tricky for some viewers.
  • It is recommended to rewatch or pay close attention to understand this approach fully.

Conclusion

The section provides an introduction to box plots and explains how they represent the five-number summary of a dataset. It also discusses identifying outliers using both subjective and mathematical approaches.

Understanding the IQR and Calculating Outliers

In this section, the speaker explains how to calculate the Interquartile Range (IQR) and identify outliers using mathematical calculations.

Calculating the IQR and Multiplying by 1.5

  • The IQR is the difference between the first quartile (Q1) and third quartile (Q3).
  • To find the IQR, subtract Q1 from Q3.
  • Multiply the IQR by 1.5 to determine a threshold for identifying potential outliers.

Determining Outliers

  • Subtract 1.5 times the IQR from Q1 to get a lower bound.
  • Add 1.5 times the IQR to Q3 to get an upper bound.
  • Any data points outside this range are considered potential outliers.

Example Calculation

  • Given an IQR of 8, multiply it by 1.5 to get a threshold of 12.
  • Subtract 12 from Q1 and add it to Q3.
  • If any data points fall below Q1 - 12 or above Q3 + 12, they are considered potential outliers.

Identifying Outliers in a Dataset

In this section, the speaker demonstrates how to identify outliers in a dataset using calculated thresholds.

Checking Data Range

  • Look at the range of numbers in your dataset.
  • Determine if any values fall outside of the lower and upper bounds calculated earlier.

No Outliers Found

  • If all data points are within the calculated range, there are no outliers present in the dataset.

Modifying Data for Outlier Detection

  • Changing a value in the dataset can potentially introduce new outliers.
  • By adjusting one value, such as changing 21 to 32, the presence of an outlier can be observed.
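
The outlier check can be sketched with the quartiles from the box plot example (Q1 = 5 and Q3 = 13, giving an IQR of 8). The function names are chosen for this illustration only.

```python
def outlier_fences(q1, q3):
    """Return the lower and upper bounds beyond which values count as outliers."""
    iqr = q3 - q1
    margin = 1.5 * iqr
    return q1 - margin, q3 + margin

def is_outlier(x, low, high):
    """True when x falls below the lower bound or above the upper bound."""
    return x < low or x > high

q1, q3 = 5, 13
low, high = outlier_fences(q1, q3)
print(low, high)  # -7.0 25.0

print(is_outlier(21, low, high))  # False: the original maximum is inside the bounds
print(is_outlier(32, low, high))  # True: the modified value is above the upper bound
```

This treats the quartiles as fixed when 21 is replaced by 32, which holds here because changing only the largest value does not move Q1 or Q3.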

Conclusion

  • Outliers are data points that fall outside the calculated range.
  • Identifying outliers is important for understanding the distribution and characteristics of a dataset.
Video description

https://www.patreon.com/ProfessorLeonard Statistics Lecture 3.4: Finding the Z-Score, Percentiles and Quartiles, and Comparing Standard Deviation