Spark Architecture | Lec-5
Understanding Spark Architecture
Introduction to Spark Architecture
- The video discusses the intricacies of Spark architecture, emphasizing its depth and importance for understanding future concepts.
- Viewers are encouraged to watch the entire video attentively, as interconnected concepts may be challenging if viewed in parts.
- The speaker highlights that the discussion will revolve around potential interview questions related to Spark architecture.
Cluster Setup
- A cluster is defined as a network of multiple computers connected together, which forms the basis of Spark's architecture.
- The example given includes a setup with 10 machines, each equipped with 100 GB RAM and 20 CPU cores, for a combined 1 terabyte of RAM and 200 cores across the cluster.
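The example's totals can be checked with simple arithmetic (the machine counts and sizes below are the lecture's hypothetical figures):

```python
# Hypothetical cluster from the example: 10 machines,
# each with 100 GB RAM and 20 CPU cores.
machines = 10
ram_per_machine_gb = 100
cores_per_machine = 20

total_ram_gb = machines * ram_per_machine_gb   # 1000 GB = 1 TB
total_cores = machines * cores_per_machine     # 200 cores

print(f"Total RAM:   {total_ram_gb} GB (~1 TB)")
print(f"Total cores: {total_cores}")
```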
Master-Slave Architecture
- The cluster operates on a master-slave architecture where one machine acts as the master (resource manager), while others serve as worker nodes (slaves).
- The master node manages resources and assigns tasks to worker nodes, which execute applications based on requests from developers.
Resource Management
- Each worker node is responsible for executing tasks assigned by the resource manager; this relationship is crucial for efficient task execution.
- Developers submit applications to the resource manager, specifying requirements such as memory and CPU needs for their applications.
Application Execution Process
- When an application request is made, it goes through the resource manager which allocates resources accordingly.
- For instance, if an application requires a container with specific memory allocation (e.g., 20 GB), it will be created within one of the worker nodes' available resources.
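A request like this is typically made through spark-submit. The flags below are standard spark-submit options, but the values (memory sizes, executor counts, file name) are made-up figures matching the lecture's example:

```shell
# Submit an application to the cluster's resource manager (YARN here),
# declaring how much memory and how many cores/executors it needs.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 20g \
  --num-executors 5 \
  --executor-memory 25g \
  --executor-cores 5 \
  my_application.py
```

The resource manager uses these figures to decide where the driver container and each executor container can fit.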
Understanding the Application Master and Driver in Spark
Introduction to Containers and Application Master
- The discussion moves from the cluster as a whole to a single container with 20 GB of RAM created on a worker node. This first container is referred to as the "Application Master" in Spark.
- The entry point for running code is identified as the driver, where two types of code can be written: Python (using PySpark) or Java (using JVM methods).
Code Execution Flow
- Python is not a JVM language; when PySpark code is executed, its calls must therefore be translated into JVM calls (PySpark uses the Py4J bridge for this).
- In the flow of execution, the Python driver forwards each Spark operation to a companion JVM driver process, which actually runs it — this is how different languages interoperate within Spark.
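The bridging described above can be pictured as a proxy pattern. The sketch below is a toy stand-in (the class names are invented for illustration); real PySpark forwards each call over a Py4J socket to the driver JVM rather than calling a local object:

```python
# Toy sketch of the PySpark <-> JVM bridge (illustrative only).

class FakeJvmGateway:
    """Stands in for the JVM side; records which Java methods were invoked."""
    def __init__(self):
        self.calls = []

    def invoke(self, method, *args):
        self.calls.append((method, args))
        return f"JVM executed {method}{args}"

class PySparkProxy:
    """Python-side wrapper: every Python call is forwarded to the JVM."""
    def __init__(self, gateway):
        self._gateway = gateway

    def read_csv(self, path):
        # The Python method does no data processing itself; it just
        # relays the request to the JVM, where Spark actually runs.
        return self._gateway.invoke("DataFrameReader.csv", path)

gateway = FakeJvmGateway()
spark = PySparkProxy(gateway)
print(spark.read_csv("/data/sales.csv"))
print(gateway.calls)  # the JVM side, not Python, performed the work
```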
Role of Application Driver
- The "Application Driver" is the process that manages the entire Spark application. Understanding this concept is crucial for tackling advanced interview questions despite being part of a beginner's course.
- An Application Driver is always present, but the separate PySpark driver exists only to support Python; an application written in Java or Scala runs directly on the JVM and does not need it.
Resource Management and Container Allocation
- After creating the main method within the container, requests are made for resources (executors), specifying memory requirements.
- These resource requests are sent back to the Resource Manager, which allocates available executors based on specified needs.
Worker Nodes and Executor Creation
- The Resource Manager checks available resources and assigns them accordingly while maintaining control over scheduling tasks across worker nodes.
- Five containers are created based on resource allocation requests, demonstrating how multiple executors can be managed simultaneously.
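The allocation step can be sketched as a toy scheduler. The greedy "most free memory first" policy and the worker sizes below are illustrative assumptions, not YARN's actual algorithm:

```python
# Toy sketch (not Spark's real scheduler): a resource manager placing
# five 25 GB executor containers onto workers with free memory.

def allocate(workers, num_executors, mem_gb):
    """Return the worker chosen for each executor, debiting free RAM."""
    placements = []
    for _ in range(num_executors):
        # Pick the worker with the most free memory (simple greedy policy).
        host = max(workers, key=workers.get)
        if workers[host] < mem_gb:
            raise RuntimeError("cluster cannot satisfy the request")
        workers[host] -= mem_gb
        placements.append(host)
    return placements

free_ram = {"worker-1": 100, "worker-2": 100, "worker-3": 80}  # GB free
print(allocate(free_ram, num_executors=5, mem_gb=25))
```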
Handling User Defined Functions (UDF)
- A new diagram illustrates what happens when user-defined functions (UDFs) are introduced into applications.
- A Python UDF cannot execute inside the executor's JVM; if no Python worker process exists alongside the executor at runtime, there is nowhere to run the UDF code.
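Conceptually, a UDF is an ordinary Python function defined in the driver and shipped to a Python worker on each executor, which applies it to that executor's partition of the data. The sketch below simulates that flow in plain Python (real Spark serializes the function with cloudpickle and streams rows to the worker; the function and data here are invented for illustration):

```python
# Toy simulation of why a Python UDF needs a Python worker.

def price_with_tax(price):
    """The 'UDF': plain Python logic the JVM executor cannot run itself."""
    return round(price * 1.18, 2)

# Data split across two executors, one partition each.
partitions = [[100.0, 250.0], [80.0]]

def python_worker(udf, partition):
    # Each executor's Python worker applies the UDF row by row.
    return [udf(row) for row in partition]

results = [python_worker(price_with_tax, p) for p in partitions]
print(results)  # [[118.0, 295.0], [94.4]]
```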
Conclusion on Execution Dynamics
- As UDFs are integrated, the executor design must change to accommodate them: each container additionally needs a Python worker process to run the UDF code.
Understanding Python Workers in Isolated Containers
The Need for Python Workers
- In a distributed system, each container operates independently, so each container must include its own Python worker to run Python UDF code.
- A Java Virtual Machine (JVM) handles normal execution, but a Python worker is also required to run user-defined functions (UDFs), and this extra process has significant performance implications.
Performance Considerations
- Writing optimized code is crucial; poorly designed UDFs can hurt performance significantly, because every row must be serialized and shipped between the JVM and the Python worker at runtime.
- Minimizing the use of libraries and focusing on essential functionalities can enhance overall performance by reducing resource consumption.
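The performance point above can be made concrete with a back-of-the-envelope cost model. The per-row overhead figure here is an assumption chosen for illustration, not a measured number:

```python
# Illustrative cost model behind "avoid unnecessary UDFs":
# a Python UDF pays a per-row round trip (serialize row -> Python worker
# -> deserialize result), while a built-in function stays inside the JVM.

ROWS = 1_000_000
PER_ROW_ROUNDTRIP_US = 5  # assumed serialization + call cost per row, in µs

udf_overhead_s = ROWS * PER_ROW_ROUNDTRIP_US / 1_000_000
builtin_overhead_s = 0.0  # built-ins never cross the JVM/Python boundary

print(f"Extra overhead for a per-row Python UDF: ~{udf_overhead_s:.0f} s")
print(f"Extra overhead for a JVM built-in:        {builtin_overhead_s:.0f} s")
```

Even a few microseconds per row adds up to seconds per million rows, which is why built-in functions are preferred where they exist.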
Cluster Architecture Overview
- The cluster consists of multiple machines (e.g., desktops), where one acts as the master node directing other worker nodes.
- Developers submit code specifying cluster details, which are processed by the master node that manages resources and allocates containers based on requirements.
Resource Management Process
- The resource manager on the master node creates the necessary containers based on application demands, ensuring efficient resource allocation.
- Once the application driver is established, it requests executors from the resource manager and then communicates with those executors directly, asking for additional resources as processing demands grow.
Execution Flow and Data Processing
- After receiving allocated containers, the application driver prepares files and libraries before initiating data transfer for processing.
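The file-and-library preparation step can be sketched with another spark-submit invocation. The flags shown are standard spark-submit options for distributing dependencies to every container; the file names are hypothetical:

```shell
# Ship supporting files and Python dependencies to every container
# before any data processing begins:
#   --py-files : Python modules/packages the executors' workers need
#   --files    : plain files (e.g., configs) placed in each container
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --py-files deps.zip \
  --files app.conf \
  main_job.py
```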