Spark Architecture | Lec-5
Understanding Spark Architecture
Introduction to Spark Architecture
- The video discusses the intricacies of Spark architecture, emphasizing its depth and importance for understanding future concepts.
- Viewers are encouraged to watch the entire video attentively, as interconnected concepts may be challenging if viewed in parts.
- The speaker highlights that the discussion will revolve around potential interview questions related to Spark architecture.
Cluster Setup
- A cluster is defined as a network of multiple computers connected together, which forms the basis of Spark's architecture.
- The example given includes a setup with 10 machines, each equipped with 100 GB RAM and 20 CPU cores, for a combined 1 terabyte of RAM and 200 cores across the cluster.
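The example's totals can be checked with simple arithmetic (the machine counts and sizes below are the lecture's hypothetical figures):

```python
# Hypothetical cluster from the example: 10 machines,
# each with 100 GB RAM and 20 CPU cores.
machines = 10
ram_per_machine_gb = 100
cores_per_machine = 20

total_ram_gb = machines * ram_per_machine_gb   # 1000 GB = 1 TB
total_cores = machines * cores_per_machine     # 200 cores

print(f"Total RAM:   {total_ram_gb} GB (~1 TB)")
print(f"Total cores: {total_cores}")
```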
Master-Slave Architecture
- The cluster operates on a master-slave architecture where one machine acts as the master (resource manager), while others serve as worker nodes (slaves).
- The master node manages resources and assigns tasks to worker nodes, which execute applications based on requests from developers.
Resource Management
- Each worker node is responsible for executing tasks assigned by the resource manager; this relationship is crucial for efficient task execution.
- Developers submit applications to the resource manager, specifying requirements such as memory and CPU needs for their applications.
Application Execution Process
- When an application request is made, it goes through the resource manager which allocates resources accordingly.
- For instance, if an application requires a container with specific memory allocation (e.g., 20 GB), it will be created within one of the worker nodes' available resources.
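A request like this is typically made through spark-submit. The flags below are standard spark-submit options, but the values (memory sizes, executor counts, file name) are made-up figures matching the lecture's example:

```shell
# Submit an application to the cluster's resource manager (YARN here),
# declaring how much memory and how many cores/executors it needs.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 20g \
  --num-executors 5 \
  --executor-memory 25g \
  --executor-cores 5 \
  my_application.py
```

The resource manager uses these figures to decide where the driver container and each executor container can fit.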
Understanding the Application Master and Driver in Spark
Introduction to Containers and Application Master
- The discussion moves from the cluster as a whole to a single container with 20 GB of RAM created on a worker node. This first container is referred to as the "Application Master" in Spark.
- The entry point for running code is identified as the driver, where two types of code can be written: Python (using PySpark) or Java (using JVM methods).
Code Execution Flow
- Python is not a JVM language; when PySpark code is executed, its calls must therefore be translated into JVM calls (PySpark uses the Py4J bridge for this).
- In the flow of execution, the Python driver forwards each Spark operation to a companion JVM driver process, which actually runs it — this is how different languages interoperate within Spark.
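The bridging described above can be pictured as a proxy pattern. The sketch below is a toy stand-in (the class names are invented for illustration); real PySpark forwards each call over a Py4J socket to the driver JVM rather than calling a local object:

```python
# Toy sketch of the PySpark <-> JVM bridge (illustrative only).

class FakeJvmGateway:
    """Stands in for the JVM side; records which Java methods were invoked."""
    def __init__(self):
        self.calls = []

    def invoke(self, method, *args):
        self.calls.append((method, args))
        return f"JVM executed {method}{args}"

class PySparkProxy:
    """Python-side wrapper: every Python call is forwarded to the JVM."""
    def __init__(self, gateway):
        self._gateway = gateway

    def read_csv(self, path):
        # The Python method does no data processing itself; it just
        # relays the request to the JVM, where Spark actually runs.
        return self._gateway.invoke("DataFrameReader.csv", path)

gateway = FakeJvmGateway()
spark = PySparkProxy(gateway)
print(spark.read_csv("/data/sales.csv"))
print(gateway.calls)  # the JVM side, not Python, performed the work
```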
Role of Application Driver
- The "Application Driver" is the process that manages the entire Spark application. Understanding this concept is crucial for tackling advanced interview questions despite being part of a beginner's course.
- An Application Driver is always present, but the separate PySpark driver exists only to support Python; an application written in Java or Scala runs directly on the JVM and does not need it.
Resource Management and Container Allocation
- After creating the main method within the container, requests are made for resources (executors), specifying memory requirements.
- These resource requests are sent back to the Resource Manager, which allocates available executors based on specified needs.
Worker Nodes and Executor Creation
- The Resource Manager checks available resources and assigns them accordingly while maintaining control over scheduling tasks across worker nodes.
- Five containers are created based on resource allocation requests, demonstrating how multiple executors can be managed simultaneously.
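The allocation step can be sketched as a toy scheduler. The greedy "most free memory first" policy and the worker sizes below are illustrative assumptions, not YARN's actual algorithm:

```python
# Toy sketch (not Spark's real scheduler): a resource manager placing
# five 25 GB executor containers onto workers with free memory.

def allocate(workers, num_executors, mem_gb):
    """Return the worker chosen for each executor, debiting free RAM."""
    placements = []
    for _ in range(num_executors):
        # Pick the worker with the most free memory (simple greedy policy).
        host = max(workers, key=workers.get)
        if workers[host] < mem_gb:
            raise RuntimeError("cluster cannot satisfy the request")
        workers[host] -= mem_gb
        placements.append(host)
    return placements

free_ram = {"worker-1": 100, "worker-2": 100, "worker-3": 80}  # GB free
print(allocate(free_ram, num_executors=5, mem_gb=25))
```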
Handling User Defined Functions (UDF)
- A new diagram illustrates what happens when user-defined functions (UDFs) are introduced into applications.
- A Python UDF cannot execute inside the executor's JVM; if no Python worker process exists alongside the executor at runtime, there is nowhere to run the UDF code.
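Conceptually, a UDF is an ordinary Python function defined in the driver and shipped to a Python worker on each executor, which applies it to that executor's partition of the data. The sketch below simulates that flow in plain Python (real Spark serializes the function with cloudpickle and streams rows to the worker; the function and data here are invented for illustration):

```python
# Toy simulation of why a Python UDF needs a Python worker.

def price_with_tax(price):
    """The 'UDF': plain Python logic the JVM executor cannot run itself."""
    return round(price * 1.18, 2)

# Data split across two executors, one partition each.
partitions = [[100.0, 250.0], [80.0]]

def python_worker(udf, partition):
    # Each executor's Python worker applies the UDF row by row.
    return [udf(row) for row in partition]

results = [python_worker(price_with_tax, p) for p in partitions]
print(results)  # [[118.0, 295.0], [94.4]]
```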
Conclusion on Execution Dynamics
- As UDFs are integrated, the executor design must change to accommodate them: each container additionally needs a Python worker process to run the UDF code.
Understanding Python Workers in Isolated Containers
The Need for Python Workers
- In a distributed system, each container operates independently, so each container must include its own Python worker to run Python UDF code.
- A Java Virtual Machine (JVM) handles normal execution, but a Python worker is also required to run user-defined functions (UDFs), and this extra process has significant performance implications.
Performance Considerations
- Writing optimized code is crucial; poorly designed UDFs can hurt performance significantly, because every row must be serialized and shipped between the JVM and the Python worker at runtime.
- Minimizing the use of libraries and focusing on essential functionalities can enhance overall performance by reducing resource consumption.
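The performance point above can be made concrete with a back-of-the-envelope cost model. The per-row overhead figure here is an assumption chosen for illustration, not a measured number:

```python
# Illustrative cost model behind "avoid unnecessary UDFs":
# a Python UDF pays a per-row round trip (serialize row -> Python worker
# -> deserialize result), while a built-in function stays inside the JVM.

ROWS = 1_000_000
PER_ROW_ROUNDTRIP_US = 5  # assumed serialization + call cost per row, in µs

udf_overhead_s = ROWS * PER_ROW_ROUNDTRIP_US / 1_000_000
builtin_overhead_s = 0.0  # built-ins never cross the JVM/Python boundary

print(f"Extra overhead for a per-row Python UDF: ~{udf_overhead_s:.0f} s")
print(f"Extra overhead for a JVM built-in:        {builtin_overhead_s:.0f} s")
```

Even a few microseconds per row adds up to seconds per million rows, which is why built-in functions are preferred where they exist.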
Cluster Architecture Overview
- The cluster consists of multiple machines (e.g., desktops), where one acts as the master node directing other worker nodes.
- Developers submit code specifying cluster details, which are processed by the master node that manages resources and allocates containers based on requirements.
Resource Management Process
- The resource manager on the master node creates the necessary containers based on application demands, ensuring efficient resource allocation.
- Once the application driver is established, it requests executors from the resource manager and then communicates with those executors directly, asking for additional resources as processing demands grow.
Execution Flow and Data Processing
- After receiving allocated containers, the application driver prepares files and libraries before initiating data transfer for processing.
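The file-and-library preparation step can be sketched with another spark-submit invocation. The flags shown are standard spark-submit options for distributing dependencies to every container; the file names are hypothetical:

```shell
# Ship supporting files and Python dependencies to every container
# before any data processing begins:
#   --py-files : Python modules/packages the executors' workers need
#   --files    : plain files (e.g., configs) placed in each container
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --py-files deps.zip \
  --files app.conf \
  main_job.py
```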