Why Apache Spark | Lec-2
Introduction to Apache Spark
Overview of the Video
- Manish Kumar introduces the topic of Apache Spark, building on previous discussions about what Apache Spark is and its significance.
- The video aims to explore how Apache Spark addresses specific problems in data processing and its future potential.
Historical Context of Data Management
- Before Apache Spark, traditional databases were primarily used for structured data storage, such as SQL databases.
- Structured data was typically organized in a tabular format with fixed rows and columns, similar to an Excel sheet.
The Evolution of Data Formats
Changes Post-Internet Era
- With the advent of the internet, there was a significant increase in data generation across various formats including text files, images, and videos.
- Traditional databases struggled to handle these diverse file formats since they were designed only for structured data.
Emergence of Semi-Structured Data
- Semi-structured formats like JSON and YAML began to emerge, allowing for more flexible data representation without a fixed structure.
- The inability of traditional databases to manage high volumes and varieties of unstructured or semi-structured data became apparent.
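To make the contrast concrete, here is a minimal Python sketch of what happens when semi-structured records are forced into a fixed set of columns. The column names and sample records are made up for illustration and do not come from the video.

```python
import json

# Fixed schema, like a SQL table or a spreadsheet: every row has the same columns.
COLUMNS = ("id", "name", "city")

# Semi-structured input (hypothetical records): fields vary from record to record.
raw_records = [
    '{"id": 1, "name": "Asha", "city": "Pune"}',
    '{"id": 2, "name": "Ravi", "phones": ["111", "222"]}',
    '{"id": 3, "name": "Meera", "address": {"city": "Delhi", "pin": "110001"}}',
]

def to_fixed_row(record):
    """Force a record into the fixed columns and report the fields that do not fit."""
    row = tuple(record.get(col) for col in COLUMNS)
    dropped = sorted(set(record) - set(COLUMNS))
    return row, dropped

results = [to_fixed_row(json.loads(line)) for line in raw_records]
for row, dropped in results:
    print(row, "fields that do not fit:", dropped)
```

The second and third records lose information (a list of phone numbers, a nested address) because the table has no column for them, which is the kind of mismatch the notes describe.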
Understanding Big Data
Defining Big Data Challenges
- Big Data is characterized by three main pillars: Volume, Velocity, and Variety (the "Three Vs").
Key Characteristics:
- Volume: A large amount of data on its own (e.g., 100GB or 200GB) does not make something Big Data; what matters is how quickly that volume keeps growing.
- Velocity: The speed at which new data is generated. For instance, a system receiving 10TB per second faces a serious Big Data challenge.
- Variety: Incoming data takes multiple forms, structured (tabular), semi-structured (JSON/YAML), and unstructured, which complicates management.
Transitioning from Traditional Systems
From ETL to ELT Processes
- Traditional systems operated on an ETL model (Extract, Transform, Load): data was extracted from its sources, transformed, and only then loaded into storage.
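The ETL order can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the CSV content, the cleaning rules, and the list standing in for a warehouse table are all hypothetical.

```python
import csv
import io

# Hypothetical source data: a small CSV export from an operational system.
SOURCE_CSV = "order_id,amount,country\n1,100.50,in\n2,,us\n3,75.00,in\n"

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing amount and normalise types."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # incomplete row, skipped before loading
        cleaned.append({
            "order_id": int(row["order_id"]),
            "amount": float(row["amount"]),
            "country": row["country"].upper(),
        })
    return cleaned

def load(rows, warehouse):
    """Load: append cleaned rows to the target (a list stands in for a warehouse table)."""
    warehouse.extend(rows)
    return warehouse

# ETL: transformation happens before anything reaches storage.
warehouse_table = load(transform(extract(SOURCE_CSV)), [])
print(warehouse_table)
```

In an ELT flow (Extract, Load, Transform) the same steps would run in a different order: the raw rows would be stored first and cleaned later, when they are needed.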
New Approaches:
- Data Lakes:
Data Management Challenges and Solutions
Data Volume, Variety, and Velocity
- The data warehouse previously used for storing transformed data is now overwhelmed by the massive volume, variety, and velocity of incoming data.
- The challenge arises from the continuous influx of data every minute or second, leading to difficulties in storage and processing.
- Key issues identified include the need for increased storage capacity and efficient processing methods for raw data formats.
Storage and Processing Issues
- The primary concern is where to store the rapidly generated data while ensuring it can be processed effectively.
- Two main options are presented: expanding existing systems (monolithic approach) or adopting a distributed approach to handle the challenges.
Monolithic vs. Distributed Approaches
- A monolithic approach involves vertically scaling a single system to accommodate more storage and processing power but has limitations due to hardware constraints.
- Vertical scaling can lead to performance drops when CPU limits are reached, making it an inefficient long-term solution.
Limitations of Monolithic Approach
- Vertical scaling becomes costly, since upgrading hardware such as hard disks or CPUs raises expenses significantly.
- If a monolithic system fails, it results in total downtime since all operations depend on that single system.
Advantages of Distributed Approach
- A distributed approach allows for horizontal scaling: more machines are added to a cluster, each at a comparatively modest cost.
- This method provides economic benefits through commodity hardware, which is cheaper than high-end systems, and lets capacity grow well beyond the limits of any single machine.
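As a rough illustration of horizontal scaling, the sketch below spreads record keys across machines with a stable hash. The key names, cluster sizes, and the modulo scheme are assumptions chosen for simplicity; real systems, Spark included, use their own partitioning logic.

```python
import zlib

def assign_partition(key, num_machines):
    """Map a record key to a machine index using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_machines

def distribute(keys, num_machines):
    """Group keys by the machine that would store and process them."""
    cluster = {m: [] for m in range(num_machines)}
    for key in keys:
        cluster[assign_partition(key, num_machines)].append(key)
    return cluster

# Hypothetical workload: 1000 record keys.
keys = [f"record-{i}" for i in range(1000)]
small_cluster = distribute(keys, 4)
large_cluster = distribute(keys, 8)

print("largest share on 4 machines:", max(len(v) for v in small_cluster.values()))
print("largest share on 8 machines:", max(len(v) for v in large_cluster.values()))
```

Adding machines shrinks the share of data each one has to hold and process, which is the core idea behind scaling out instead of scaling up.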
High Availability in Distributed Systems
- If a machine fails in a distributed setup, its workload can be redistributed among the remaining machines. This makes the cluster more reliable than a monolithic system, where the single machine is a single point of failure and its loss brings everything down.
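The redistribution idea can be sketched as follows. This is a simplified illustration, not how any particular system implements failover: the node and partition names are hypothetical, the round-robin policy is an assumption, and the sketch assumes the failed machine's data is still reachable (for example through replication).

```python
def redistribute(assignments, failed):
    """Move the failed machine's partitions to the surviving machines, round-robin."""
    survivors = sorted(m for m in assignments if m != failed)
    if not survivors:
        # Monolithic case: the only machine is gone, so nothing can take over.
        raise RuntimeError("no machines left: total outage")
    new_assignments = {m: list(assignments[m]) for m in survivors}
    for i, partition in enumerate(assignments.get(failed, [])):
        target = survivors[i % len(survivors)]
        new_assignments[target].append(partition)
    return new_assignments

# Hypothetical three-node cluster, two partitions per node.
cluster = {
    "node-a": ["p0", "p1"],
    "node-b": ["p2", "p3"],
    "node-c": ["p4", "p5"],
}
after_failure = redistribute(cluster, "node-b")
print(after_failure)
```

Calling `redistribute` on a one-machine setup raises an error instead, mirroring the total downtime of the monolithic approach.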
Conclusion on Data Handling Strategies
- The rapid influx of diverse data formats calls for a shift away from traditional storage methods, which makes a distributed approach essential for managing data effectively.