Why Apache spark | Lec-2

Name: Why Apache spark | Lec-2
Uploaded: 2023-03-21T11:30:10.000Z
Duration: 26 min 43 s

Introduction to Apache Spark

Overview of the Video

Manish Kumar introduces the topic of Apache Spark, building on previous discussions about what Apache Spark is and its significance.

The video aims to explore how Apache Spark addresses specific problems in data processing and its future potential.

Historical Context of Data Management

Before Apache Spark, traditional databases such as SQL databases were used primarily to store structured data.

Structured data was typically organized in a tabular format with fixed rows and columns, similar to an Excel sheet.

The Evolution of Data Formats

Changes Post-Internet Era

With the advent of the internet, there was a significant increase in data generation across various formats including text files, images, and videos.

Traditional databases struggled to handle these diverse file formats since they were designed only for structured data.

Emergence of Semi-Structured Data

Semi-structured formats like JSON and YAML began to emerge, allowing for more flexible data representation without a fixed structure.

The inability of traditional databases to manage high volumes and varieties of unstructured or semi-structured data became apparent.
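To make the contrast with fixed rows and columns concrete, here is a minimal Python sketch. It is not from the video: the column names, the two sample records, and the `fits_fixed_schema` helper are all hypothetical, chosen only to show a semi-structured record that a fixed tabular layout cannot hold as-is.

```python
import json

# A fixed, table-like schema: every row must have exactly these columns.
# (Hypothetical column names, for illustration only.)
COLUMNS = ("id", "name", "email")

# Hypothetical JSON records. The second one has no "email" and carries
# an extra nested "devices" field, which a fixed table has no column for.
records = [
    json.loads('{"id": 1, "name": "Asha", "email": "asha@example.com"}'),
    json.loads('{"id": 2, "name": "Ravi", "devices": [{"type": "phone"}]}'),
]

def fits_fixed_schema(record):
    """Return True only if the record's keys match the fixed columns exactly."""
    return set(record) == set(COLUMNS)

fits = [fits_fixed_schema(r) for r in records]
print(fits)
```

The first record maps cleanly onto the table; the second is still valid JSON but does not fit the fixed columns.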

Understanding Big Data

Defining Big Data Challenges

Big Data is characterized by three main pillars: Volume, Velocity, and Variety (the "Three Vs").

Key Characteristics:

Volume:

A large amount of data on its own (e.g., 100GB or 200GB) does not qualify as Big Data; what matters is how quickly that volume grows.

Velocity:

Refers to the speed at which new data is generated. For instance, if 10TB per second is introduced into a system, it indicates a serious Big Data challenge.

Variety:

Involves multiple forms of incoming data, including structured (tabular), semi-structured (JSON/YAML), and unstructured (text files, images, videos), which complicates management.
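The interplay between volume and velocity comes down to simple arithmetic: how long a given storage capacity lasts at a given ingest rate. The sketch below uses the 10TB-per-second rate from the example above; the 100TB capacity and the `seconds_until_full` helper are assumptions made up for illustration.

```python
def seconds_until_full(capacity_tb, ingest_tb_per_second):
    """Time until storage fills, assuming a constant ingest rate."""
    if ingest_tb_per_second <= 0:
        raise ValueError("ingest rate must be positive")
    return capacity_tb / ingest_tb_per_second

# Hypothetical 100 TB system receiving 10 TB per second.
time_left = seconds_until_full(capacity_tb=100, ingest_tb_per_second=10)
print(time_left)
```

Under these assumed numbers the system fills in seconds, which is why the rate of growth matters more than the current size.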

Transitioning from Traditional Systems

From ETL to ELT Processes

Traditional systems operated on an ETL model (Extract, Transform, Load), where data was extracted from sources before being transformed and loaded into storage.

New Approaches:

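Data Lakes:

Under ELT (Extract, Load, Transform), raw data is loaded into a data lake first and transformed later, when it is needed.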
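The difference in ordering between ETL and ELT can be sketched in a few lines of Python. This is an illustrative toy, not something shown in the video: the `extract`, `transform`, `run_etl`, and `run_elt` functions, the sample events, and the in-memory lists standing in for a warehouse and a data lake are all hypothetical.

```python
import json

# Hypothetical raw events as they might arrive from a source system.
RAW_EVENTS = [
    '{"user": "a", "amount": "10.5"}',
    '{"user": "b", "amount": "3"}',
]

def extract():
    """Pull raw records from the source (here, a hard-coded list)."""
    return list(RAW_EVENTS)

def transform(raw_record):
    """Parse a raw JSON string and cast the amount to a float."""
    record = json.loads(raw_record)
    record["amount"] = float(record["amount"])
    return record

def run_etl(warehouse):
    """ETL: transform before loading, so only cleaned rows are stored."""
    for raw in extract():
        warehouse.append(transform(raw))

def run_elt(data_lake):
    """ELT: load raw records unchanged; transformation is deferred."""
    data_lake.extend(extract())

warehouse, data_lake = [], []
run_etl(warehouse)
run_elt(data_lake)

# The lake still holds the raw strings; they are transformed on demand.
transformed_on_read = [transform(raw) for raw in data_lake]
```

Both paths end with the same cleaned records, but in the ELT path the raw data is kept, so a different transformation can be applied later without going back to the source.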

Data Management Challenges and Solutions

Data Volume, Variety, and Velocity

The data warehouse previously used for storing transformed data is now overwhelmed by the massive volume, variety, and velocity of incoming data.

The challenge arises from the continuous influx of data every minute or second, leading to difficulties in storage and processing.

Key issues identified include the need for increased storage capacity and efficient processing methods for raw data formats.

Storage and Processing Issues

The primary concern is where to store the rapidly generated data while ensuring it can be processed effectively.

Two main options are presented: expanding existing systems (monolithic approach) or adopting a distributed approach to handle the challenges.

Monolithic vs. Distributed Approaches

A monolithic approach involves vertically scaling a single system to accommodate more storage and processing power but has limitations due to hardware constraints.

Vertical scaling can lead to performance drops when CPU limits are reached, making it an inefficient long-term solution.

Limitations of Monolithic Approach

Vertical scaling becomes costly because each hardware upgrade (such as larger hard disks or faster CPUs) raises expenses significantly.

If a monolithic system fails, it results in total downtime since all operations depend on that single system.

Advantages of Distributed Approach

A distributed approach scales horizontally: more machines are added to a cluster, each at a comparatively modest incremental cost.

This method is economical because it runs on commodity hardware, which is cheaper than high-end systems, and capacity grows by adding machines rather than being capped by a single machine's hardware limits.

High Availability in Distributed Systems

If a machine fails within a distributed setup, its workload can be redistributed among the remaining machines. This makes the setup more reliable than a monolithic system, where the single machine is a single point of failure and its loss means complete downtime.
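The redistribution idea can be illustrated with a toy scheduler. This is a simplified sketch, not how Spark or any particular cluster manager assigns work: the node names, the round-robin policy, and the `assign` function are assumptions made for illustration.

```python
def assign(partitions, workers):
    """Spread partitions across workers in round-robin order."""
    if not workers:
        # With no machines left there is nowhere to run the work.
        raise RuntimeError("no workers left: total downtime")
    assignment = {worker: [] for worker in workers}
    for i, partition in enumerate(partitions):
        assignment[workers[i % len(workers)]].append(partition)
    return assignment

partitions = list(range(6))
cluster = ["node-1", "node-2", "node-3"]  # hypothetical machine names

before = assign(partitions, cluster)

# Simulate node-2 failing: reassign all work to the surviving machines.
survivors = [w for w in cluster if w != "node-2"]
after = assign(partitions, survivors)

print(before)
print(after)
```

After the failure every partition is still assigned to some machine, only with more work per machine. Calling `assign(partitions, [])` models the monolithic case where the only machine is gone, and raises an error.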

Conclusion on Data Handling Strategies

The rapid influx of diverse data formats necessitates a shift away from traditional storage methods, making a distributed approach essential for effective data management.