Understanding MapReduce With Example | Hadoop Tutorial for Beginners | Hadoop [Part 11]
Understanding MapReduce Through a Word Count Example
Introduction to MapReduce
- The discussion begins with a theoretical overview of MapReduce, emphasizing the need for an example to clarify the concept.
- The speaker compares learning MapReduce to writing a "Hello World" program in Java, introducing the WordCount program as a foundational example.
The WordCount Program
- WordCount is described as the de facto introductory program for anyone learning MapReduce, much as "Hello World" serves in general programming.
- A hypothetical scenario is presented where a large text file is stored across three data nodes, illustrating how data can be distributed in a real-world application.
Data Representation and Use Case
- The speaker uses characters from "Game of Thrones" (Arya, Sansa, John) as examples of words within the text file to make the explanation relatable.
- The goal of the WordCount program is established: counting occurrences of each word in potentially millions of lines of text.
Mechanics of Map-Reducing
- To count words, one must develop a mapper program that outputs key-value pairs, emitting each word as a key with the value one.
- Key-value pairs are emphasized as the required output format for any mapper program, per the MapReduce documentation.
Implementation Details
- In Java, developers can utilize classes like StringTokenizer to extract individual words from lines of text efficiently.
- A record reader component within MapReduce feeds the input data line by line into the mapper function.
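The tokenization step described above can be sketched in plain Java using java.util.StringTokenizer. This is a standalone illustration of splitting one record (line) into words, not Hadoop mapper code; the class and method names are chosen for this example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeDemo {
    // Split a line into words, as a mapper would do before emitting key-value pairs.
    static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(line); // splits on whitespace by default
        while (tok.hasMoreTokens()) {
            words.add(tok.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Arya Sansa John Arya"));
    }
}
```

In a real Hadoop job, the record reader would hand each such line to the mapper; here the line is simply passed in as a string.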
Execution Flow and Output Generation
- Each mapper runs concurrently on different data blocks; they process lines independently using string tokenization to generate key-value pairs.
Understanding MapReduce: Key Concepts and Processes
Overview of Record Processing
- Each line in the file is treated as a record, where individual words are extracted using string tokenization. Each word serves as a key with an associated value of one.
- The output from the map function repeats keys with a value of one (e.g., Arya 1, Sansa 1, John 1) for each line processed.
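The per-record behavior above can be simulated in plain Java: one (word, 1) pair is emitted per token. This is a sketch of the map step's logic, not the actual Hadoop Mapper API (which would write pairs through a Context object):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

public class MapPhaseDemo {
    // Emit one (word, 1) pair per token, mirroring what a WordCount mapper writes.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tok = new StringTokenizer(line);
        while (tok.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(tok.nextToken(), 1));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Each word is emitted with value 1, even if it repeats on the line.
        System.out.println(map("Arya Sansa John"));
    }
}
```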
Map Function and Data Handling
- The map function processes data in chunks; because its intermediate output can be large, it is written to the hard disk after processing, which contributes to slower performance.
- These disk read/write operations are the main reason MapReduce is slow; ideally, only final outputs would be stored on disk.
Fault Tolerance Mechanism
- In case of machine failure during processing, the system can wait for recovery or utilize replica blocks to ensure continuity in processing.
- If a mapper crashes, YARN (Yet Another Resource Negotiator) identifies a replica block and initiates another mapper to continue processing without losing data integrity.
Data Persistence and Management
- Technical documentation notes that when a mapper's intermediate output exceeds the in-memory buffer (100 MB by default), it cannot be held in RAM and is spilled to the hard disk for reliability.
- Despite potential crashes during processing, MapReduce ensures that final outputs are typically retrievable through fault tolerance mechanisms.
Shuffle and Sort Phase
- After all mappers complete their tasks, a shuffle and sort phase occurs automatically without additional coding requirements. This phase organizes keys in ascending order.
- During this phase, values corresponding to each key are aggregated together from multiple mappers before being sent for further processing.
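The shuffle-and-sort behavior described above can be simulated in plain Java: pairs from all mappers are grouped by key, and a TreeMap keeps the keys in ascending order. This is a single-machine sketch of what the framework does automatically across the network:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleDemo {
    // Group every value emitted for a key into one list; TreeMap sorts keys ascending.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> mapperOutput) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapperOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Pairs arriving unsorted from two mappers end up grouped and key-sorted.
        System.out.println(shuffle(List.of(
                Map.entry("Sansa", 1), Map.entry("Arya", 1), Map.entry("Arya", 1))));
    }
}
```

After this phase, each key appears once with the full list of its values, which is exactly the shape the reducer receives.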
Cost Implications of Shuffle Phase
- The shuffle phase can be resource-intensive as it requires transferring data across potentially hundreds of machines over the network.
- While typically handled by one machine, shuffling may occur across multiple machines depending on data size.
Reducer Process Initiation
- Once shuffling is complete, the reducer process begins. A custom reducer program must be written by developers to handle key-value pairs effectively.
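For WordCount, the custom reducer logic mentioned above is a simple sum over each key's grouped values. The sketch below is plain Java rather than the Hadoop Reducer API, but it shows the same contract: one key, a list of values in, one aggregated pair out:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceDemo {
    // For each key, sum the grouped list of 1s into a final word count.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(reduce(Map.of(
                "Arya", List.of(1, 1, 1),
                "Sansa", List.of(1))));
    }
}
```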
Understanding MapReduce Logic
Importance of Correct Logic in MapReduce
- The logic defined in the map function is crucial; if it is incorrect, both shuffle and reduce operations will yield wrong results.
- A faulty map function affects the entire process, as the reducer relies on the output from shuffle, which in turn depends on a correctly executed map function.
- The shuffle phase must assemble all values associated with a key into a single data structure, which can require significant RAM when dealing with large datasets.
Handling Large Datasets
- When processing billions of values, it's essential to configure shuffling across multiple machines to manage data effectively.
- In scenarios where only simple aggregation is needed (e.g., adding three values), one reducer suffices. However, complexity increases with more reducers.
Challenges with Multiple Reducers
- Introducing additional reducers raises questions about how data will be distributed among them; each key's values must be processed by a single reducer to maintain integrity.
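The usual way this distribution problem is solved is hash partitioning: every occurrence of a key hashes to the same reducer, so no key's values are ever split across reducers. The plain-Java sketch below mirrors the idea behind Hadoop's default hash partitioner (the masking keeps the result non-negative even for negative hash codes):

```java
public class PartitionDemo {
    // Route a key to one of numReducers partitions; the same key always
    // maps to the same reducer, preserving per-key integrity.
    static int partition(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // "Arya" lands on the same reducer every time it is emitted.
        System.out.println(partition("Arya", 3) == partition("Arya", 3));
    }
}
```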