HDFS Architecture | Hadoop Tutorial for Beginners | Hadoop [Part 6]
What Happens When You Store a File in a Hadoop Cluster?
Understanding the Basics of Hadoop File Storage
- The speaker introduces the concept of storing a file (192 MB) in a Hadoop cluster, emphasizing that understanding the basics is crucial for real-world applications.
- To connect to a Hadoop cluster, users typically install a Hadoop client package or log into a gateway (edge) machine, which acts as an intermediary for security and access control.
- The gateway machine lets users issue commands without directly accessing the cluster's internal machines, keeping communication between user machines and the cluster secure.
File Upload Process in Hadoop
- When the upload of the 192 MB file from the user's machine begins, the HDFS client first contacts the NameNode to learn the configured block size.
- Files are divided into blocks of the configured block size (assumed here as 64 MB; Hadoop 2's default is 128 MB), similar to how operating systems store files in fixed-size disk blocks.
- Since 192 MB spans three 64 MB blocks, the file is split into three blocks (B1, B2, B3); the client library does this automatically, with no user intervention.
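The split described above is simple arithmetic. A minimal Python sketch (illustrative only — in reality the HDFS client library performs the split, not user code; the 64 MB block size mirrors the example, not the Hadoop 2 default of 128 MB):

```python
def split_into_blocks(file_size_mb, block_size_mb=64):
    """Return (start_offset, length) pairs for each block of a file.

    Toy model of how an HDFS client carves a file into fixed-size
    blocks, with the last block holding whatever remains.
    """
    blocks = []
    offset = 0
    while offset < file_size_mb:
        length = min(block_size_mb, file_size_mb - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# A 192 MB file with 64 MB blocks yields exactly three blocks (B1, B2, B3).
print(split_into_blocks(192))  # [(0, 64), (64, 64), (128, 64)]
```

Note that only the final block can be smaller than the block size: a 200 MB file would produce three full 64 MB blocks plus one 8 MB block.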
Data Distribution and Replication
- After the file is divided into blocks, the client asks the NameNode where to place them; the NameNode returns a list of DataNodes with available space, and the block data is then written directly to those DataNodes (it never passes through the NameNode itself).
- The NameNode assigns blocks to different DataNodes to optimize processing efficiency, ideally preventing all blocks from landing on one machine.
- This distribution lets multiple machines process the data in parallel; if all blocks sat on a single node, that node would become a bottleneck during processing tasks.
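The spread-the-blocks idea can be sketched as a simple round-robin assignment (hypothetical node names; the real HDFS placement policy also weighs rack topology, free space, and replica constraints):

```python
def distribute_blocks(blocks, datanodes):
    """Assign each block to a DataNode round-robin, so blocks spread
    across machines instead of piling up on one node (illustrative)."""
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = datanodes[i % len(datanodes)]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
print(distribute_blocks(["B1", "B2", "B3"], nodes))
# {'B1': 'dn1', 'B2': 'dn2', 'B3': 'dn3'}
```

With three blocks and four nodes, each block lands on a different machine, so three machines can later process the file in parallel.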
Heartbeat Mechanism and Data Reliability
- All DataNodes regularly send heartbeat signals to the NameNode (every three seconds by default). This mechanism keeps the NameNode aware of which nodes are alive and helps ensure data availability.
- By default, each block is replicated three times across different machines. This replication safeguards against data loss if any single machine fails.
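The heartbeat bookkeeping amounts to tracking when each node last checked in and treating silence beyond a timeout as failure. A toy model of the NameNode's view (the 10-second timeout here is illustrative; real HDFS uses much longer stale/dead intervals):

```python
import time

HEARTBEAT_TIMEOUT_S = 10  # illustrative timeout, not the HDFS default


class HeartbeatTracker:
    """Toy model of how a NameNode tracks DataNode liveness."""

    def __init__(self):
        self.last_seen = {}  # node name -> timestamp of last heartbeat

    def heartbeat(self, node, now=None):
        self.last_seen[node] = time.time() if now is None else now

    def live_nodes(self, now=None):
        """Nodes whose last heartbeat is within the timeout window."""
        now = time.time() if now is None else now
        return {n for n, t in self.last_seen.items()
                if now - t <= HEARTBEAT_TIMEOUT_S}


tracker = HeartbeatTracker()
tracker.heartbeat("dn1", now=100)
tracker.heartbeat("dn2", now=95)
tracker.heartbeat("dn3", now=85)  # last heartbeat too long ago
print(sorted(tracker.live_nodes(now=100)))  # ['dn1', 'dn2']
```

When a node drops out of the live set, the real NameNode re-replicates that node's blocks onto healthy machines to restore the replication factor.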
Metadata Management by NameNode
Disaster Recovery in Hadoop: Key Insights
Understanding NameNodes and Disaster Recovery
- The high-availability architecture pairs an active NameNode with a standby NameNode that stay in constant communication, so the standby can take over for disaster recovery.
- WANdisco is highlighted as a notable company specializing in disaster-recovery solutions for Hadoop, taking a non-traditional approach to backup systems.
Data Classification and Backup Strategies
- In managing large data sets (e.g., 100 terabytes), it's crucial to classify data based on its importance; only critical data (e.g., 10 terabytes) should be backed up periodically.
- Hadoop's DistCp (distributed copy, invoked as distcp) tool synchronizes primary and backup clusters efficiently, avoiding the need for dedicated backup infrastructure.
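distcp itself is invoked as `hadoop distcp <source> <target>` and runs as a MapReduce job across clusters. Its core idea — copy only what is missing or changed on the backup side — can be sketched on a single machine in plain Python (a conceptual stand-in, not how distcp is actually implemented):

```python
import os
import shutil


def sync_dirs(primary, backup):
    """Copy files from primary to backup when they are absent or differ
    in size -- a toy, local-filesystem stand-in for what distcp does
    between two HDFS clusters."""
    copied = []
    for name in sorted(os.listdir(primary)):
        src = os.path.join(primary, name)
        dst = os.path.join(backup, name)
        if not os.path.isfile(src):
            continue  # sketch handles flat directories only
        if (not os.path.exists(dst)
                or os.path.getsize(dst) != os.path.getsize(src)):
            shutil.copy2(src, dst)
            copied.append(name)
    return copied
```

Running the sync repeatedly is cheap: a second pass over an unchanged primary copies nothing, which is what makes periodic backup of only the critical data practical.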
Evolution of Hadoop Versions
- There are three major releases of Hadoop: version 1 is outdated, version 2 remains widely used, and version 3 was released in December 2017.
- Companies can adjust the replication factor when storing data, but high replication factors are debated because each extra replica multiplies the raw storage consumed.
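The space-loss argument is easy to quantify: under n-way replication, every logical terabyte occupies n physical terabytes. A one-line sketch of the arithmetic:

```python
def raw_storage_tb(logical_tb, replication_factor=3):
    """Physical capacity needed to store the given logical data size
    under simple n-way replication (HDFS default factor is 3)."""
    return logical_tb * replication_factor


# Even backing up only the critical 10 TB of a 100 TB data set
# consumes 30 TB of raw disk at the default replication factor.
print(raw_storage_tb(10))   # 30
print(raw_storage_tb(100))  # 300
```

This 3x overhead is exactly why classifying data by importance (as above) matters: replicating everything at the default factor triples the hardware bill.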
Handling Unstructured Data
- Hadoop 3 can store some data with erasure coding instead of full three-way replication, reducing storage overhead; this raises questions about how unstructured data is best handled.
- Analyzing unstructured data such as images or videos poses challenges; the files must first be read into a binary representation that analysis tools can process.
Real-world Applications and Limitations
- A real-world example from customer care: companies often analyze call metadata rather than the audio recordings themselves, because analyzing raw audio at scale is impractical.