Data, Models and ML Task
What is Data in Machine Learning?
Understanding Data
- The course introduces fundamental terms in machine learning, focusing on the definitions of data, models, and ML tasks.
- In machine learning, data typically refers to a collection of vectors rather than just bits or bytes.
- An example illustrates that each house can be represented by a vector (e.g., dimensions like number of rooms and price), forming a dataset.
- Metadata provides context for the data, explaining what each number in the vector represents (e.g., number of rooms, area).
- While metadata makes data interpretable for humans, computers only require consistency in how data is represented.
What is a Model?
Defining Models
- A model serves as a mathematical simplification of reality and has been utilized across various scientific fields for centuries.
- The Ideal Gas model exemplifies how models represent complex realities through simplified equations (e.g., PV = nRT).
- Other examples include Newton's gravitational law which approximates gravitational attraction but does not account for all variables.
- Models are essential in economics to predict outcomes based on changes in supply and demand; they help quantify relationships even if exact answers are unknown.
Understanding Models in Machine Learning
The Nature of Models
- Models are mathematical simplifications of reality, not exact representations. They allow for analysis and planning but cannot encapsulate the full complexity of real-world scenarios.
- George Box's famous quote highlights that "all models are wrong, but some are useful," emphasizing that while models may be approximations, they can still provide valuable insights into complex systems.
Types of Models in Machine Learning
- In machine learning, models refer to predictive and probabilistic types. Predictive models forecast outcomes based on input data, while probabilistic models focus on understanding uncertainty rather than making predictions.
Predictive Models
- Two primary types of predictive models are regression and classification. Regression predicts continuous values (e.g., prices), while classification predicts discrete categories (e.g., yes/no outcomes).
Regression Model Insights
- A regression model can predict house prices based on factors like area and distance from a metro station. For example, a model might suggest that price increases with area and decreases with distance from the city.
- While there will always be exceptions to these trends, such general rules help guide decisions about property searches within budget constraints.
Application of Regression Models
- Using a regression model allows one to estimate unknown values based on known variables. For instance, predicting the price of a new house using its size and location relative to public transport.
Classification Model Insights
- Classification models differ by predicting categorical outcomes instead of continuous values. An example could involve determining if a house is close or far from a metro station based on its price and area.
Example of Classification Model Usage
- A simple classification rule might state that if the number of rooms multiplied by two minus the price is less than one, then it’s considered close to the metro; otherwise, it’s far away.
Understanding Probabilistic Models
Understanding Probability Models in Machine Learning
Evaluating Events and Configurations
- Probability models assess the likelihood of specific events or configurations, such as determining the probability of a randomly chosen person being located at given latitude and longitude coordinates.
- The probability varies significantly based on location; for instance, being in the Sahara Desert yields a low probability compared to being in a populated area like Bombay.
Scoring Reality with Probability Models
- Different geographical coordinates have varying probabilities of containing randomly selected individuals, effectively "scoring" reality based on these configurations.
- A practical application includes evaluating whether a tweet was generated by a specific individual (e.g., Mr. Chopra), where similar tweets receive higher scores than random strings of characters.
Learning Algorithms: Converting Data into Models
- Learning algorithms play a crucial role in machine learning by transforming data into predictive models.
- These algorithms select from various model structures, typically choosing the best fit based on parameters that define relationships within the data.
Model Parameters and Predictions
- For example, when predicting house prices, one might establish an initial model structure involving parameters like area size and distance to metro stations before analyzing any data.
- The learning algorithm determines optimal values for these parameters (A, B, C), which characterize the model's predictions.
Human Guidance in Model Development
- In machine learning tasks, humans provide broad outlines rather than explicit instructions for model creation; they guide the process without detailing every aspect.
- The learning algorithm utilizes historical data to construct accurate models based on human-defined guidelines and existing datasets.