Why Apache Spark | Lec-2

Introduction to Apache Spark

Overview of the Video

  • Manish Kumar introduces the topic of Apache Spark, building on previous discussions about what Apache Spark is and its significance.
  • The video aims to explore how Apache Spark addresses specific problems in data processing and its future potential.

Historical Context of Data Management

  • Before Apache Spark, data was stored mainly in traditional databases, such as SQL databases, which were designed for structured data.
  • Structured data was typically organized in a tabular format with fixed rows and columns, similar to an Excel sheet.

The Evolution of Data Formats

Changes Post-Internet Era

  • With the advent of the internet, there was a significant increase in data generation across various formats including text files, images, and videos.
  • Traditional databases struggled to handle these diverse file formats since they were designed only for structured data.

Emergence of Semi-Structured Data

  • Semi-structured formats like JSON and YAML began to emerge, allowing for more flexible data representation without a fixed structure.
  • The inability of traditional databases to manage high volumes and varieties of unstructured or semi-structured data became apparent.
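
The mismatch between semi-structured records and a fixed table can be illustrated with a minimal Python sketch. The records, field names, and the two-column schema below are hypothetical, chosen only for illustration; they do not come from the video.

```python
import json

# Two hypothetical JSON records; the second carries a field the first lacks.
raw_records = [
    '{"id": 1, "name": "Asha"}',
    '{"id": 2, "name": "Ravi", "tags": ["vip", "beta"]}',
]

records = [json.loads(line) for line in raw_records]

# A fixed-schema table expects the same columns in every row
# (assumed schema for this sketch: id, name).
columns = ["id", "name"]
rows = [tuple(rec.get(col) for col in columns) for rec in records]

# Fields that do not fit the fixed schema and would have to be
# dropped, flattened, or stored elsewhere.
extra_fields = [sorted(set(rec) - set(columns)) for rec in records]

print(rows)
print(extra_fields)
```

The nested `tags` list has no natural place in the two-column table, which is the kind of friction the bullets above describe.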

Understanding Big Data

Defining Big Data Challenges

  • Big Data is characterized by three main pillars: Volume, Velocity, and Variety (the "Three Vs").

Key Characteristics:

  1. Volume:
  • A large dataset alone (e.g., 100GB or 200GB) does not qualify as Big Data; what matters is how quickly that volume keeps growing.
  2. Velocity:
  • Refers to the speed at which new data is generated. For instance, data arriving at 10TB per second poses a serious Big Data challenge.
  3. Variety:
  • Involves multiple forms of incoming data: structured (tabular), semi-structured (JSON/YAML), and unstructured, which complicates management.
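
To make the velocity figure concrete, here is a small arithmetic sketch. It takes the 10TB-per-second example from the list above and assumes decimal units (1 PB = 1000 TB); the function name is made up for this illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

def daily_volume_tb(rate_tb_per_second: float) -> float:
    """Return how many terabytes arrive in one day at a constant rate."""
    return rate_tb_per_second * SECONDS_PER_DAY

# The 10TB-per-second example from the list above.
per_day_tb = daily_volume_tb(10)
per_day_pb = per_day_tb / 1000  # assuming 1 PB = 1000 TB

print(per_day_tb)  # 864000 TB per day
print(per_day_pb)  # 864.0 PB per day
```

At that rate, a fixed 200GB dataset is negligible; the difficulty comes from the volume that keeps arriving.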

Transitioning from Traditional Systems

From ETL to ELT Processes

  • Traditional systems operated on an ETL model (Extract, Transform, Load), where data was extracted from sources before being transformed and loaded into storage.

New Approaches:

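  1. Data Lakes:
  • Under the ELT model (Extract, Load, Transform), raw data is loaded into a data lake first and transformed afterwards, as needed.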
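
The difference between ETL and ELT is mainly the order of the steps, which a minimal Python sketch can show. Everything here is hypothetical: the sample values, the `extract` and `transform` helpers, and the plain lists standing in for a warehouse and a data lake.

```python
def extract() -> list[str]:
    """Pretend source system: returns raw, messy string values."""
    return [" 42 ", "7", "n/a"]

def transform(raw: list[str]) -> list[int]:
    """Keep only values that parse as non-negative integers."""
    cleaned = []
    for value in raw:
        stripped = value.strip()
        if stripped.isdigit():
            cleaned.append(int(stripped))
    return cleaned

# ETL: transform before loading, so the warehouse holds only cleaned data.
warehouse: list[int] = []
warehouse.extend(transform(extract()))

# ELT: load the raw data first, then transform later when it is needed.
lake: list[str] = []
lake.extend(extract())
transformed_on_demand = transform(lake)

print(warehouse)              # [42, 7]
print(lake)                   # [' 42 ', '7', 'n/a']
print(transformed_on_demand)  # [42, 7]
```

In the ELT variant the raw value `"n/a"` is still available in the lake, so a different transformation can be applied to it later.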

Data Management Challenges and Solutions

Data Volume, Variety, and Velocity

  • The data warehouse previously used for storing transformed data is now overwhelmed by the massive volume, variety, and velocity of incoming data.
  • The challenge arises from the continuous influx of data every minute or second, leading to difficulties in storage and processing.
  • Key issues identified include the need for increased storage capacity and efficient processing methods for raw data formats.

Storage and Processing Issues

  • The primary concern is where to store the rapidly generated data while ensuring it can be processed effectively.
  • Two main options are presented: expanding existing systems (monolithic approach) or adopting a distributed approach to handle the challenges.

Monolithic vs. Distributed Approaches

  • A monolithic approach involves vertically scaling a single system to accommodate more storage and processing power but has limitations due to hardware constraints.
  • Vertical scaling can lead to performance drops when CPU limits are reached, making it an inefficient long-term solution.

Limitations of Monolithic Approach

  • Vertical scaling becomes costly as upgrading hardware (like hard disks or CPUs) increases expenses significantly.
  • If a monolithic system fails, it results in total downtime since all operations depend on that single system.

Advantages of Distributed Approach

  • A distributed approach allows for horizontal scaling: more machines are added to a cluster at a comparatively low incremental cost.
  • This method is economical because it relies on commodity hardware, which is cheaper than high-end systems, and capacity can keep growing by adding machines rather than being capped by one machine's hardware limits.
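
Horizontal scaling can be pictured as splitting the same workload across more machines. The sketch below uses simple round-robin partitioning in plain Python; it is an illustration under that assumption, not how Spark itself partitions data, and the `partition` helper is hypothetical.

```python
def partition(records: list[int], num_machines: int) -> list[list[int]]:
    """Distribute records across machines in round-robin order."""
    buckets: list[list[int]] = [[] for _ in range(num_machines)]
    for index, record in enumerate(records):
        buckets[index % num_machines].append(record)
    return buckets

records = list(range(12))

# With 3 machines, each one handles 4 records.
three_machines = partition(records, 3)

# Adding a fourth machine lowers the load to 3 records per machine.
four_machines = partition(records, 4)

print([len(b) for b in three_machines])  # [4, 4, 4]
print([len(b) for b in four_machines])   # [3, 3, 3, 3]
```

Each added machine reduces the share of work per machine, whereas a single vertically scaled system has to absorb the full workload itself.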

High Availability in Distributed Systems

  • If a machine fails in a distributed setup, its workload can be redistributed among the remaining machines. This makes the system more reliable than a monolithic one, where the single machine is a single point of failure and its failure stops everything.
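
The redistribution idea can be sketched in a few lines of Python. The node names, task names, and the `reassign` helper are hypothetical, and real cluster managers use more involved scheduling; this only shows why a cluster can survive the loss of one machine while a single machine cannot.

```python
def reassign(assignments: dict[str, list[str]], failed: str) -> dict[str, list[str]]:
    """Move the failed machine's tasks onto the surviving machines."""
    survivors = [machine for machine in assignments if machine != failed]
    if not survivors:
        # A single-machine (monolithic) setup has nowhere to move the work.
        raise RuntimeError("no machines left: total downtime")
    result = {machine: list(assignments[machine]) for machine in survivors}
    for index, task in enumerate(assignments.get(failed, [])):
        result[survivors[index % len(survivors)]].append(task)
    return result

cluster = {
    "node-a": ["t1", "t2"],
    "node-b": ["t3", "t4"],
    "node-c": ["t5", "t6"],
}
after_failure = reassign(cluster, "node-b")
print(after_failure)
# {'node-a': ['t1', 't2', 't3'], 'node-c': ['t5', 't6', 't4']}

monolith = {"big-box": ["t1", "t2", "t3", "t4", "t5", "t6"]}
try:
    reassign(monolith, "big-box")
    monolith_survived = True
except RuntimeError:
    monolith_survived = False
print(monolith_survived)  # False
```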

Conclusion on Data Handling Strategies

  • The rapid influx of diverse data formats necessitates a shift from traditional storage methods; thus, adopting a distributed approach becomes essential for effective management.