Lecture 2 "Supervised Learning Setup Continued" -Cornell CS4780 SP17

Uploaded: 2018-07-09T20:54:25.000Z
Duration: 1 h 36 min 45 s

Exam Overview and Class Announcements

Placement Exam Details

Students are reminded to submit their placement exam at the end of class.

An advertisement for a new project is presented, highlighting competition resources available this semester.

Data Science Project Insights

A large dataset (2.7 GB text file) caused issues for many teams last week; a smaller 400 MB file is now available for use with Spark.

For further inquiries, students can contact Cornell Data Science via email.

Machine Learning Concepts

Understanding Data Distribution

The dataset consists of n data points, each represented as a feature vector and label drawn from an unknown distribution P. This distribution is crucial for machine learning success.

Real-world examples illustrate how the distribution of the training data affects the result; e.g., a face recognition system trained on a non-representative population performs poorly when deployed in a different, more diverse setting such as the U.S. or Beijing.

Importance of Training Data

Emphasizes that training data must reflect the target population for a deployed model to work well; failing to ensure this leads to embarrassing outcomes, as seen with Nokia's face recognition software.

Feature Representation in Machine Learning

Vector Representation of Data

Each data instance can be represented as a vector X. An image, for example, becomes a vector of per-pixel RGB values, which leads to high-dimensional representations (e.g., 18 million dimensions).

Distinction between dense and sparse vectors: dense vectors have mostly nonzero entries, while sparse vectors consist mostly of zeros. Sparse vectors are common for text documents, because any single document uses only a small part of the vocabulary.
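The dense/sparse distinction can be made concrete with a small bag-of-words sketch. The vocabulary, the example sentence, and the helper names below are made up for illustration and do not come from the lecture.

```python
from collections import Counter

def bag_of_words_dense(text, vocabulary):
    """Return one count per vocabulary word, including the zeros."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def to_sparse(dense):
    """Keep only the nonzero entries as {index: value}."""
    return {i: v for i, v in enumerate(dense) if v != 0}

# Hypothetical six-word vocabulary; real vocabularies have many thousands of words.
vocabulary = ["learning", "machine", "pixel", "spark", "tree", "vector"]
doc = "machine learning is learning from data"

dense = bag_of_words_dense(doc, vocabulary)   # one entry per vocabulary word
sparse = to_sparse(dense)                     # only the words that occur

print(dense)
print(sparse)
```

With a realistic vocabulary almost every dense entry would be zero, which is why storing only the nonzero entries pays off for text.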

Machine Learning Algorithms and Hypothesis Space

Defining Label and Data Spaces

Identifying the label space (regression vs classification) is essential before applying machine learning algorithms; regression deals with real numbers while classification involves discrete labels.

Choosing Hypothesis Class

The hypothesis class (denoted as H) is the set of possible functions that can be learned from the data; choosing the right class is critical and often requires expertise from data scientists rather than automation alone.

Examples of Machine Learning Models

Common Algorithms Discussed

Models such as decision trees, linear classifiers, artificial neural networks, and support vector machines are introduced; which one is appropriate depends on the problem at hand.

Evaluating Model Performance

Loss Functions Explained

Loss functions measure how well a hypothesis performs on a given dataset. Common types include zero-one loss (the error rate, i.e., one minus accuracy), squared loss (used in regression), and absolute loss; the latter two differ in how strongly they penalize large errors, which matters when the dataset contains outliers.

Zero-One Loss Function

The zero-one loss function counts misclassifications: each prediction contributes 1 if it differs from the true label and 0 if it matches.
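A minimal sketch of zero-one loss, normalized by the dataset size so that it reads as an error rate. The classifier `h` and the four data points are hypothetical.

```python
def zero_one_loss(h, data):
    """Fraction of (x, y) pairs for which the prediction h(x) differs from y."""
    return sum(1 for x, y in data if h(x) != y) / len(data)

def h(x):
    """Hypothetical classifier: predicts +1 for non-negative inputs, -1 otherwise."""
    return 1 if x >= 0 else -1

# Hypothetical labeled data; the third point is misclassified by h.
data = [(2.0, 1), (-1.5, -1), (0.5, -1), (-3.0, -1)]

loss = zero_one_loss(h, data)
print(loss)
```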

Squared Loss Function

Squared loss penalizes larger errors more heavily than smaller ones, making it suitable for regression tasks where large deviations are especially costly.
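A minimal sketch of squared loss, averaged over the dataset. The regressor `h` and the data are hypothetical; the point is that an error of 3 contributes nine times as much as an error of 1.

```python
def squared_loss(h, data):
    """Mean squared difference between predictions h(x) and targets y."""
    return sum((h(x) - y) ** 2 for x, y in data) / len(data)

def h(x):
    """Hypothetical regressor: doubles its input."""
    return 2.0 * x

# Prediction errors are 0, -1 and 3, so the squared terms are 0, 1 and 9.
data = [(1.0, 2.0), (2.0, 5.0), (3.0, 3.0)]

loss = squared_loss(h, data)
print(loss)
```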

Absolute Loss Function

Absolute loss grows only linearly with the size of the error, so a single large deviation counts for less than under squared loss; it may therefore be preferred when dealing with outliers or skewed distributions.
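The effect of an outlier can be sketched by comparing two constant predictors under both losses. The targets and predictor names are made up for illustration; the behavior shown relies on the standard fact that the median minimizes absolute loss while the mean minimizes squared loss.

```python
def absolute_loss(h, data):
    """Mean absolute difference between predictions h(x) and targets y."""
    return sum(abs(h(x) - y) for x, y in data) / len(data)

def squared_loss(h, data):
    """Mean squared difference between predictions h(x) and targets y."""
    return sum((h(x) - y) ** 2 for x, y in data) / len(data)

# Hypothetical targets: four typical values and one outlier (inputs are unused).
data = [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0), (0.0, 1.0), (0.0, 100.0)]

def predict_median(x):
    """Constant prediction at the median of the targets."""
    return 1.0

def predict_mean(x):
    """Constant prediction at the mean of the targets (104 / 5)."""
    return 20.8

abs_median = absolute_loss(predict_median, data)
abs_mean = absolute_loss(predict_mean, data)
sq_median = squared_loss(predict_median, data)
sq_mean = squared_loss(predict_mean, data)

print(abs_median, abs_mean)  # absolute loss favors the median
print(sq_median, sq_mean)    # squared loss favors the mean, pulled by the outlier
```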


Understanding Loss Functions in Machine Learning

The Trade-off Between Absolute and Squared Loss

Discusses the implications of using squared loss versus absolute loss in predictions, highlighting that squared loss amplifies large errors far more than absolute loss does.

Introduces the concept of learning algorithms where a function h is chosen from the hypothesis class H, emphasizing the importance of selecting effective machine learning models.
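The idea of picking one function out of a hypothesis class can be sketched with a toy class of five threshold classifiers and an exhaustive search for the lowest training loss. The thresholds, data, and names are hypothetical, and exhaustive search only works because this class is tiny.

```python
def zero_one_loss(h, data):
    """Fraction of (x, y) pairs that h misclassifies."""
    return sum(1 for x, y in data if h(x) != y) / len(data)

def make_threshold(t):
    """Build the classifier that predicts +1 when x >= t and -1 otherwise."""
    def h(x):
        return 1 if x >= t else -1
    return h

# Hypothetical hypothesis class: one classifier per candidate threshold.
thresholds = [0.0, 1.0, 2.0, 3.0, 4.0]
hypothesis_class = {t: make_threshold(t) for t in thresholds}

# Hypothetical training data, separable between 1.5 and 2.5.
train = [(0.5, -1), (1.5, -1), (2.5, 1), (3.5, 1)]

# "Learning" here means choosing the member with the lowest training loss.
best_t = min(thresholds, key=lambda t: zero_one_loss(hypothesis_class[t], train))
best_h = hypothesis_class[best_t]

print(best_t, zero_one_loss(best_h, train))
```

Realistic hypothesis classes are far too large to enumerate like this, which is why more clever search methods are needed.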

Evaluating Function Performance

Explains how different functions can be evaluated for their performance, noting that clever methods are needed to navigate large sets of possible functions.

Proposes a hypothetical algorithm that outputs labels based on exact matches in the training dataset, prompting discussion about its effectiveness.

Memorization vs. Generalization

Identifies that the proposed algorithm would yield zero loss on training data due to memorization but fails to generalize beyond it.

Compares this memorization approach to students who memorize answers without understanding, leading to poor performance on unseen questions.
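A minimal sketch of such a memorizing algorithm. The default prediction of 0 for unseen inputs is an assumption, as are the toy training and test points.

```python
def fit_memorizer(train, default=0):
    """Return a function that looks up the label of an exact training match."""
    table = {x: y for x, y in train}
    def h(x):
        # Unseen inputs fall back to the default (an assumed choice).
        return table.get(x, default)
    return h

def zero_one_loss(h, data):
    """Fraction of (x, y) pairs that h misclassifies."""
    return sum(1 for x, y in data if h(x) != y) / len(data)

# Hypothetical data: feature tuples with labels +1 / -1.
train = [((1, 0), 1), ((0, 1), -1), ((1, 1), 1)]
test = [((2, 0), 1), ((0, 2), -1), ((3, 1), 1)]

h = fit_memorizer(train)
train_loss = zero_one_loss(h, train)  # every training point is an exact match
test_loss = zero_one_loss(h, test)    # no test point was seen during training

print(train_loss, test_loss)
```

The training loss is zero by construction, yet the function carries no information about inputs it has not seen.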

Distribution and Data Consistency Issues

Highlights the necessity for training and test data to come from the same distribution, citing an example involving military vehicle classification by the US Army which failed due to differing image conditions.

Emphasizes that classifiers trained on non-representative samples can lead to significant errors when deployed in real-world scenarios.

Generalization Loss Concept

Defines generalization as finding a function h that performs well not just on the training data but also on any new data drawn from the same distribution.

States the goal as a low expected loss over examples drawn from the distribution P, stressing that low expected loss indicates good generalization capabilities.
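In symbols, the expected loss of a function h can be written as below. The notation is an assumption made here for illustration: ε for the generalization loss and ℓ for the per-example loss do not appear in the summary itself.

```latex
\epsilon(h) = \mathbb{E}_{(\mathbf{x}, y) \sim P}\left[ \ell\big(h(\mathbf{x}), y\big) \right]
```

A function generalizes well when this quantity is small, regardless of how it scored on the training data.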

Estimating Generalization with Test Sets

Addresses the challenge that expected loss cannot be minimized directly, since the true distribution P is unknown; suggests splitting the dataset into training and test sets as a solution.

Describes how keeping part of the dataset hidden allows for fair evaluation of model performance against unseen data, ensuring better estimates of real-world applicability.
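The split can be sketched as follows. The 80/20 ratio, the fixed seed, the toy data, and the function names are assumptions for illustration; a uniformly random split like this one presumes the examples are independent draws from the same distribution.

```python
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    """Shuffle a copy of the data and hold out a fraction as the test set."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def zero_one_loss(h, data):
    """Fraction of (x, y) pairs that h misclassifies."""
    return sum(1 for x, y in data if h(x) != y) / len(data)

# Hypothetical dataset: label is +1 when x >= 50 and -1 otherwise.
data = [(x, 1 if x >= 50 else -1) for x in range(100)]
train, test = train_test_split(data)

def h(x):
    """Hypothetical classifier, imagined as fitted on the training part only."""
    return 1 if x >= 50 else -1

# The held-out loss serves as the estimate of the loss on new data.
test_loss = zero_one_loss(h, test)
print(len(train), len(test), test_loss)
```

The estimate is only fair if the test part is never used to choose or tune the function.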