spark architecture | Lec-5

Understanding Spark Architecture

Introduction to Spark Architecture

  • The video discusses the intricacies of Spark architecture, emphasizing its depth and importance for understanding future concepts.
  • Viewers are encouraged to watch the entire video attentively, as interconnected concepts may be challenging if viewed in parts.
  • The speaker highlights that the discussion will revolve around potential interview questions related to Spark architecture.

Cluster Setup

  • A cluster is defined as a network of multiple computers connected together, which forms the basis of Spark's architecture.
  • The example setup has 10 machines, each with 100 GB of RAM and 20 CPU cores, for a cluster total of 1,000 GB (about 1 TB) of RAM and 200 CPU cores.
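
The cluster totals follow from simple multiplication. A minimal sketch of that arithmetic, using the figures from the example (the variable names are illustrative only):

```python
# Cluster sizing from the example: 10 machines, each with 100 GB RAM and 20 cores.
machines = 10
ram_per_machine_gb = 100
cores_per_machine = 20

total_ram_gb = machines * ram_per_machine_gb   # 1000 GB, roughly 1 TB
total_cores = machines * cores_per_machine     # 200 cores

print(f"Total RAM: {total_ram_gb} GB, total cores: {total_cores}")
```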

Master-Slave Architecture

  • The cluster operates on a master-slave architecture where one machine acts as the master (resource manager), while others serve as worker nodes (slaves).
  • The master node manages resources and assigns tasks to worker nodes, which execute applications based on requests from developers.

Resource Management

  • Each worker node is responsible for executing tasks assigned by the resource manager; this relationship is crucial for efficient task execution.
  • Developers submit applications to the resource manager, specifying requirements such as memory and CPU needs for their applications.

Application Execution Process

  • When an application request is made, it goes through the resource manager which allocates resources accordingly.
  • For instance, if an application requires a container with a specific memory allocation (e.g., 20 GB), the container is created on a worker node that has enough free resources.
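
The allocation step can be pictured with a small simulation. This is only a sketch under stated assumptions: the `WorkerNode` class, the `allocate_container` function, the 4-core request, and the "first worker that fits" rule are all hypothetical, and a real resource manager uses more elaborate scheduling. The worker sizes and the 20 GB request come from the example.

```python
from dataclasses import dataclass


@dataclass
class WorkerNode:
    name: str
    free_memory_gb: int
    free_cores: int


def allocate_container(workers, memory_gb, cores):
    """Place a container on the first worker with enough free memory and cores.

    Returns the chosen worker's name, or None if no worker can fit the request.
    """
    for worker in workers:
        if worker.free_memory_gb >= memory_gb and worker.free_cores >= cores:
            worker.free_memory_gb -= memory_gb
            worker.free_cores -= cores
            return worker.name
    return None


# Nine workers (the tenth machine is the master), each with 100 GB and 20 cores.
workers = [WorkerNode(f"worker-{i}", 100, 20) for i in range(1, 10)]

# Request a 20 GB container; the 4 cores are an assumed value.
chosen = allocate_container(workers, memory_gb=20, cores=4)
print(chosen, workers[0].free_memory_gb, workers[0].free_cores)
```

A request larger than any single worker (for example 500 GB) returns `None` in this sketch, which mirrors the idea that a container must fit inside one machine.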

Understanding the Application Master and Driver in Spark

Introduction to Containers and Application Master

  • The discussion moves from the cluster as a whole to a single, smaller container with 20 GB of memory. This container hosts the "Application Master" for the Spark application.
  • The driver is the entry point for running code. Application code can be written in Python (using PySpark) or in a JVM language such as Java or Scala.

Code Execution Flow

  • Python is not a JVM language, so when Python code is submitted, a JVM main method is started alongside the Python process to do the actual work on the JVM.
  • The Python code is not literally converted into Java. Instead, PySpark calls are forwarded to the JVM (through the Py4J bridge), which shows how the two runtimes interact within Spark.
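
One way to picture this interaction is as call forwarding: the Python side does not do the work itself but hands each call to the JVM side. The sketch below is a toy model in plain Python. `ToyJvm`, `ToyGateway`, and the `readCsv` method name are hypothetical, and nothing here uses Py4J (the bridge library PySpark relies on) or a real JVM.

```python
class ToyJvm:
    """Stands in for the JVM side; it simply records the calls it receives."""

    def __init__(self):
        self.calls = []

    def invoke(self, method, *args):
        self.calls.append((method, args))
        return f"jvm:{method}"


class ToyGateway:
    """Stands in for the Python-to-JVM bridge: any method call is forwarded."""

    def __init__(self, jvm):
        self._jvm = jvm

    def __getattr__(self, method):
        # Called only for names not found on the object itself, so every
        # unknown method name becomes a function that forwards to the "JVM".
        def forward(*args):
            return self._jvm.invoke(method, *args)

        return forward


jvm = ToyJvm()
gateway = ToyGateway(jvm)
result = gateway.readCsv("data.csv")  # hypothetical method name
print(result, jvm.calls)
```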

Role of Application Driver

  • The "Application Driver" is the component that manages all of the application's processes within Spark. Understanding this concept is crucial for tackling advanced interview questions, even though it is covered in a beginner's course.
  • An Application Driver is always present, whereas the PySpark driver exists only when the code is written in Python; it is not needed for Java code.

Resource Management and Container Allocation

  • Once the main method is running inside the container, the driver requests resources (executors), specifying how much memory each one needs.
  • These requests go back to the Resource Manager, which allocates executors according to the specified requirements and the resources available.
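
In practice, such a request is usually expressed as configuration supplied when the application is submitted. The sketch below builds a `spark-submit` argument list from standard Spark configuration properties. The executor memory and core values, the use of YARN as the master, the `to_submit_args` helper, and the `app.py` file name are assumptions made for illustration.

```python
# Illustrative request: only the 20 GB driver container and the count of five
# executors come from the example; the other values are assumed.
request = {
    "spark.driver.memory": "20g",      # memory for the driver container
    "spark.executor.instances": "5",   # number of executors to ask for
    "spark.executor.memory": "25g",    # memory per executor (assumed)
    "spark.executor.cores": "5",       # CPU cores per executor (assumed)
}


def to_submit_args(conf, app_file):
    """Turn a settings dict into a spark-submit argument list."""
    args = ["spark-submit", "--master", "yarn"]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args + [app_file]


submit_args = to_submit_args(request, "app.py")  # app.py is a placeholder name
print(" ".join(submit_args))
```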

Worker Nodes and Executor Creation

  • The Resource Manager checks available resources and assigns them accordingly while maintaining control over scheduling tasks across worker nodes.
  • Five containers are created based on resource allocation requests, demonstrating how multiple executors can be managed simultaneously.
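
How five executor containers might end up spread over worker nodes can be sketched as follows. This is a toy model, not the scheduler's actual algorithm: the `place_executors` function, the three-worker cluster, the 25 GB executor size, and the "most free memory first" rule are all assumptions.

```python
def place_executors(free_memory_gb, count, memory_gb):
    """Place `count` executors, each time on the worker with the most free memory.

    `free_memory_gb` maps worker name -> free memory in GB and is updated in place.
    Returns the list of chosen worker names, one entry per executor.
    """
    placements = []
    for _ in range(count):
        worker = max(free_memory_gb, key=free_memory_gb.get)
        if free_memory_gb[worker] < memory_gb:
            raise RuntimeError("not enough free memory for all executors")
        free_memory_gb[worker] -= memory_gb
        placements.append(worker)
    return placements


# worker-1 is assumed to already host the 20 GB driver container.
free = {"worker-1": 80, "worker-2": 100, "worker-3": 100}
placements = place_executors(free, count=5, memory_gb=25)
print(placements)
print(free)
```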

Handling User Defined Functions (UDF)

  • A new diagram illustrates what happens when user-defined functions (UDFs) are introduced into applications.
  • A Python UDF must be executed at runtime by a Python process, but the executors described so far only run a JVM. Without a Python worker inside the executor container, the UDF cannot run.

Conclusion on Execution Dynamics

  • Supporting UDFs therefore requires changing the executor design so that each container can also run Python code.

Understanding Python Workers in Isolated Containers

The Need for Python Workers

  • In a distributed system, each container operates independently, necessitating the inclusion of a Python worker within each container to handle specific tasks.
  • The Java Virtual Machine (JVM) is essential for execution, but a Python worker is also required to run user-defined functions (UDFs), which is why UDFs matter so much for performance.

Performance Considerations

  • Writing optimized code is crucial; poorly optimized UDFs can lead to significant performance impacts due to the overhead of requiring a Python environment at runtime.
  • Keeping Python UDFs and extra libraries to what the job actually needs improves overall performance by reducing resource consumption.
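
Part of the UDF overhead comes from moving data between the JVM and the Python worker: rows are serialized, handed to Python, processed, and the results are serialized back. The sketch below is a toy model of that round trip using only the standard library. `run_in_python_worker` and `shout` are hypothetical names, and no Spark or JVM is involved.

```python
import pickle


def run_in_python_worker(serialized_rows, func):
    """Toy Python worker: deserialize rows, apply func, serialize the results."""
    rows = pickle.loads(serialized_rows)
    return pickle.dumps([func(row) for row in rows])


def shout(name):
    # Hypothetical UDF; any plain Python function would do.
    return name.upper()


names = ["alice", "bob", "carol"]

# "Executor" side: serialize the batch, hand it to the worker, read results back.
payload = pickle.dumps(names)
results = pickle.loads(run_in_python_worker(payload, shout))
print(results)
```

Every batch pays for two serialization steps on top of the function call itself, and that cost disappears when the same logic runs entirely inside the JVM.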

Cluster Architecture Overview

  • The cluster consists of multiple machines (e.g., desktops), where one acts as the master node directing other worker nodes.
  • Developers submit code specifying cluster details, which are processed by the master node that manages resources and allocates containers based on requirements.

Resource Management Process

  • The resource manager on the master node creates the necessary containers on worker nodes based on application demands, ensuring efficient resource allocation.
  • Once the application driver is running, it requests the executors it needs from the resource manager and then works with them to process tasks.

Execution Flow and Data Processing

  • After receiving the allocated containers, the application driver prepares the required files and libraries on them and then starts sending data for processing.
Video description

In this video I talk about Spark architecture in great detail. Please watch the entire video and ask your doubts in the comment section below. Connect with me directly at https://topmate.io/manish_kumar25 or follow me on LinkedIn: https://www.linkedin.com/in/manish-kumar-373b86176/