Distributed Processing Systems: Hadoop and YARN (Extended Session)
Introduction to Advanced Data Engineering
Overview of Hadoop and YARN
- The session introduces Hadoop, a crucial component in the big data ecosystem, focusing on its resource management system, YARN.
- It discusses how Hadoop enables distributed parallel processing across various nodes to handle large volumes of data efficiently.
Cluster Architecture
- A cluster consists of interconnected computers that facilitate rapid communication between the master node and worker nodes, enhancing processing efficiency.
- The master node manages and coordinates tasks within the system, equipped with significant computational power (e.g., 12 cores and 128 GB RAM).
- Worker nodes execute tasks assigned by the master node, each having their own resources (e.g., 8 cores and 64 GB RAM), allowing for independent parallel processing.
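The cluster layout above can be sketched in a few lines of Python. The resource figures follow the example in the notes (12 cores / 128 GB for the master, 8 cores / 64 GB per worker); the number of workers (four) is a hypothetical assumption for illustration.

```python
# Minimal model of the cluster described above; names and the worker
# count are illustrative assumptions, not a real deployment.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cores: int
    ram_gb: int

master = Node("master", cores=12, ram_gb=128)
workers = [Node(f"worker-{i}", cores=8, ram_gb=64) for i in range(1, 5)]

# Aggregate capacity for parallel task execution comes from the
# workers; the master only coordinates.
total_cores = sum(w.cores for w in workers)
total_ram = sum(w.ram_gb for w in workers)
print(total_cores, total_ram)  # 4 workers -> 32 cores, 256 GB
```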
Understanding YARN's Role
Resource Management
- YARN is described as resource management software that dynamically allocates CPU and memory among the applications running on Hadoop (storage is handled separately by HDFS).
- It acts as an intermediary layer between the operating system of nodes and processing applications, optimizing resource usage.
Key Components of YARN
Resource Manager
- The Resource Manager operates on the master node with a comprehensive view of cluster resources; it plans and coordinates resource allocation among applications.
Node Manager
- Each worker node runs a Node Manager that monitors local resources (memory usage, active cores), reporting back to the Resource Manager about its status.
Containers
- Containers represent allocated fractions of cluster resources for specific tasks. They provide isolation, ensuring applications do not interfere with one another.
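The Node Manager / container relationship can be illustrated with a small sketch: a Node Manager grants containers only as fixed slices of its local resources and reclaims them on release. All class and method names here are illustrative, not the real YARN API.

```python
# Hedged sketch: a NodeManager handing out containers as slices of its
# local resources. Figures (8 cores / 64 GB) match the worker-node
# example in the notes.
class NodeManager:
    def __init__(self, cores, ram_gb):
        self.free_cores = cores
        self.free_ram = ram_gb
        self.containers = []

    def allocate(self, cores, ram_gb):
        """Grant a container only if the slice fits in what is free."""
        if cores <= self.free_cores and ram_gb <= self.free_ram:
            self.free_cores -= cores
            self.free_ram -= ram_gb
            container = {"cores": cores, "ram_gb": ram_gb}
            self.containers.append(container)
            return container
        return None  # not enough local resources

    def release(self, container):
        """Return the container's slice to the free pool."""
        self.containers.remove(container)
        self.free_cores += container["cores"]
        self.free_ram += container["ram_gb"]

nm = NodeManager(cores=8, ram_gb=64)
c1 = nm.allocate(cores=2, ram_gb=8)   # granted
c2 = nm.allocate(cores=8, ram_gb=64)  # refused: exceeds free capacity
```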
Application Master
Understanding the Role of Application Master in YARN
Overview of Application Master Functions
- The application master controls task execution and plays a crucial role in fault tolerance by detecting errors and reprogramming failed tasks on new containers or nodes.
- Each client sending an application to the cluster has its own application master running in an independent container, coordinating various containers distributed across multiple nodes.
- The application master acts as the conductor for each application, ensuring proper resource allocation, task progression, and system recovery from failures.
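The fault-tolerance behaviour described above (detect a failed task, reschedule it on a new container) reduces to a retry loop. The sketch below is plain Python, not the YARN API; the function names and the retry limit are assumptions for the example.

```python
# Illustrative sketch: an Application Master re-running a failed task
# on a "new container" until it succeeds or the attempt budget runs out.
def run_with_retries(task, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return task(attempt)
        except RuntimeError:
            continue  # task failed: request a new container and retry
    raise RuntimeError("task failed on all containers")

def flaky_task(attempt):
    # Fails on the first container, succeeds on the second.
    if attempt < 2:
        raise RuntimeError("container lost")
    return "done"

result = run_with_retries(flaky_task)
```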
Execution of MapReduce within YARN
- YARN is responsible for managing resources and coordinating the execution of MapReduce jobs within the cluster.
- Users can specify the number of mappers or containers needed and their memory requirements when launching applications; each job is decomposed into tasks processed across the cluster.
- The resource manager serves as the main communication point between users and the cluster, verifying resource availability before executing jobs.
Resource Management and Job Execution
- If sufficient resources are available, jobs are launched; otherwise, they are queued until resources become available. This allows YARN to run multiple jobs in parallel efficiently.
- After receiving a job request from a client, the resource manager checks overall cluster status to identify available resources on worker nodes.
- Once a container is granted by node managers, an application master is launched within it to coordinate job execution.
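The admission logic described above (launch if resources suffice, otherwise queue) can be sketched as a simple FIFO scheduler. Real YARN schedulers (Capacity, Fair) are considerably more sophisticated; the class below is only a toy model, and all names are illustrative.

```python
# Sketch of job admission: launch when the cluster has capacity,
# otherwise queue until a running job finishes and frees resources.
from collections import deque

class ResourceManager:
    def __init__(self, total_cores):
        self.free_cores = total_cores
        self.queue = deque()
        self.running = []

    def submit(self, job_name, cores):
        self.queue.append((job_name, cores))
        self._schedule()

    def _schedule(self):
        # Launch queued jobs in order while resources suffice.
        while self.queue and self.queue[0][1] <= self.free_cores:
            name, cores = self.queue.popleft()
            self.free_cores -= cores
            self.running.append(name)

    def finish(self, job_name, cores):
        self.running.remove(job_name)
        self.free_cores += cores
        self._schedule()

rm = ResourceManager(total_cores=8)
rm.submit("job-a", cores=6)  # launches immediately
rm.submit("job-b", cores=4)  # queued: only 2 cores free
rm.finish("job-a", cores=6)  # job-b now launches
```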
Load Balancing and Fault Tolerance
- Application masters do not run on the master node but on worker nodes, so that overload or failure on the master node does not become a single point of failure for all running applications.
- The application master's next step is to request additional containers from the resource manager for executing the various tasks that make up a MapReduce job.
Task Execution and Completion
- Node managers manage assigned containers on each node while launching corresponding tasks; distributing load across different nodes optimizes performance by leveraging data locality.
- Tasks report their status directly to their application master during execution, while YARN oversees overall completion.
- Upon task completion, containers are automatically released back into circulation for other jobs; once all tasks finish, so does the application master process.
Finalization of Job Execution
- When all tasks have completed successfully, resources like CPU and memory return to availability for new jobs, concluding this cycle of MapReduce processing on YARN.
Data Processing and Storage Frameworks
Overview of Data Ingestion Tools
- Tools like Flume and Sqoop are responsible for receiving information from various sources, including databases, sensors, or applications.
- These tools facilitate the transfer of data into a storage system, marking the first phase of data handling.
Distributed Data Storage Solutions
- In the second phase, data is stored in a distributed manner using HDFS (Hadoop Distributed File System) or HBase.
- This approach ensures fault tolerance and scalability within the cluster environment.
Data Processing and Analysis Frameworks
- The third stage involves processing and analyzing data with frameworks such as Pig and Hive.
- These frameworks connect with MapReduce to perform parallel processing on the data, transforming it and extracting valuable insights.
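To make the map/reduce style these frameworks compile down to concrete, here is a toy word count expressed as a map phase, a shuffle/sort, and a reduce phase. This is plain Python for illustration, not the Hadoop API.

```python
# Toy word count in the MapReduce style: mappers emit (word, 1) pairs,
# the shuffle groups pairs by key, and reducers sum the counts.
from itertools import groupby

def map_phase(line):
    # Each mapper processes one input split and emits (word, 1) pairs.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    pairs = sorted(pairs)  # the shuffle/sort step
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=lambda kv: kv[0])}

lines = ["big data big cluster", "data cluster data"]
intermediate = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(intermediate)
# counts == {"big": 2, "cluster": 2, "data": 3}
```

In a real cluster, each `map_phase` call would run in its own container on a different worker, and the shuffle would move pairs across the network to the reducers.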
Accessing Processed Data