HDFS Architecture | Hadoop Tutorial for Beginners | Hadoop [Part 6]

What Happens When You Store a File in a Hadoop Cluster?

Understanding the Basics of Hadoop File Storage

  • The speaker introduces the concept of storing a file (192 MB) in a Hadoop cluster, emphasizing that understanding the basics is crucial for real-world applications.
  • To connect to a Hadoop cluster, users typically install a Hadoop client package or log into a gateway (edge) machine, which serves as an intermediary for security and access control.
  • The gateway lets users issue Hadoop commands without logging into the cluster machines themselves, keeping communication between user machines and the cluster controlled.

File Upload Process in Hadoop

  • When the user initiates the upload of the 192 MB file from the gateway, the HDFS client first contacts the NameNode to determine the configured block size.
  • Files are divided into fixed-size blocks (64 MB is assumed here; it was the default in Hadoop 1, while Hadoop 2 defaults to 128 MB), much as operating systems split files into disk blocks.
  • Since 192 MB spans three 64 MB blocks, the file is split into three blocks (B1, B2, B3) automatically, without user intervention.
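The split described above is simple arithmetic. A minimal sketch, assuming the example's 64 MB block size (the function name is illustrative, not part of any Hadoop API):

```python
# Sketch of splitting a file into fixed-size HDFS-style blocks.
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB in bytes, as assumed in the example

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block the file occupies."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])

blocks = split_into_blocks(192 * 1024 * 1024)
# A 192 MB file yields exactly three 64 MB blocks: B1, B2, B3.
```

Note that a file whose size is not a multiple of the block size simply gets a smaller final block; the unused remainder of that block is not wasted on disk.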

Data Distribution and Replication

  • After the client learns the block boundaries, it asks the NameNode where to place each block; the NameNode returns target DataNodes chosen according to available storage space. The block data itself flows directly from the client to the DataNodes, not through the NameNode.
  • The NameNode allocates the blocks across separate DataNodes to optimize processing efficiency, ideally avoiding placing all blocks on one machine.
  • This distribution lets multiple machines process the data in parallel; if all blocks sat on one node, that node would become a bottleneck during processing tasks.
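The placement idea can be sketched with a toy policy (not HDFS's real one): put each block on the DataNode with the most free space, which naturally spreads blocks across machines. Node names and capacities are invented for illustration:

```python
# Toy block-placement policy: greediest-free-space-first.
def place_blocks(block_sizes, free_space_mb):
    """block_sizes: {block: MB}; free_space_mb: {node: MB}. Returns {block: node}."""
    placement = {}
    space = dict(free_space_mb)
    for block, size in block_sizes.items():
        target = max(space, key=space.get)  # node with most free space wins
        placement[block] = target
        space[target] -= size
    return placement

plan = place_blocks({"B1": 64, "B2": 64, "B3": 64},
                    {"dn1": 500, "dn2": 490, "dn3": 480})
# Each block lands on a different DataNode, so all three
# can be processed in parallel.
```

Real HDFS placement is rack-aware and considers load as well as space, but the effect is the same: no single machine collects every block of a file.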

Heartbeat Mechanism and Data Reliability

  • All DataNodes send heartbeat signals to the NameNode regularly. This mechanism helps maintain awareness of active nodes within the cluster and ensures data availability.
  • By default, each block is replicated three times across different machines for redundancy. This replication safeguards against data loss if any single machine fails.
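The heartbeat bookkeeping on the NameNode side can be sketched as follows; the 30-second timeout and class name are illustrative assumptions, and real HDFS uses longer, configurable stale/dead intervals:

```python
# Minimal sketch of NameNode-side heartbeat tracking.
class HeartbeatMonitor:
    def __init__(self, timeout_s=30.0):
        self.timeout_s = timeout_s
        self.last_seen = {}  # DataNode name -> timestamp of last heartbeat

    def heartbeat(self, node, now):
        """Record that `node` reported in at time `now` (seconds)."""
        self.last_seen[node] = now

    def live_nodes(self, now):
        """Nodes whose last heartbeat is within the timeout window."""
        return {n for n, t in self.last_seen.items() if now - t <= self.timeout_s}

mon = HeartbeatMonitor(timeout_s=30.0)
mon.heartbeat("dn1", now=100.0)
mon.heartbeat("dn2", now=125.0)
live = mon.live_nodes(now=131.0)  # dn1 last seen 31 s ago -> treated as dead
```

When a node drops out of the live set, the NameNode re-replicates that node's blocks from the surviving replicas, which is why the default factor of three matters.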

Metadata Management by NameNode

  • The NameNode holds the cluster's metadata: the directory tree, the mapping of each file to its blocks, and which DataNodes currently hold each block replica. If this metadata is lost, the data sitting on the DataNodes becomes unreadable, which motivates the disaster-recovery discussion below.

Disaster Recovery in Hadoop: Key Insights

Understanding Name Nodes and Disaster Recovery

  • The architecture involves an active NameNode and a standby NameNode in constant communication, so the standby can take over if the active one fails.
  • WANdisco is highlighted as a notable company specializing in disaster-recovery solutions for Hadoop, emphasizing a non-traditional approach to backup systems.

Data Classification and Backup Strategies

  • In managing large data sets (e.g., 100 terabytes), it's crucial to classify data based on its importance; only critical data (e.g., 10 terabytes) should be backed up periodically.
  • The tool "Distributed Copy" (distcp) is used in Hadoop to synchronize primary and backup clusters efficiently without the need for extensive backup infrastructure.
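The classification idea above can be sketched as a toy plan: tag datasets by tier and back up only the critical ones. The dataset names, tiers, and sizes below are invented; the actual copy between clusters would be done with `hadoop distcp`:

```python
# Toy backup-selection sketch: only the critical tier is synchronized.
datasets = {  # name -> (tier, size in TB); all values invented
    "transactions":    ("critical", 6),
    "user_profiles":   ("critical", 4),
    "raw_clickstream": ("bulk", 55),
    "archived_logs":   ("bulk", 35),
}

def backup_plan(datasets):
    """Return the critical datasets and their total size in TB."""
    critical = {name: size for name, (tier, size) in datasets.items()
                if tier == "critical"}
    return critical, sum(critical.values())

to_copy, backup_tb = backup_plan(datasets)
# Of the 100 TB total, only the 10 TB critical tier goes to the backup cluster.
```

This mirrors the 10-of-100 TB figure from the talk: the backup cluster only needs to be sized for the critical slice, not the whole primary.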

Evolution of Hadoop Versions

  • There are three major release lines of Hadoop: version 1 is outdated, version 2 remains widely used, and version 3 became generally available in December 2017.
  • The replication factor is configurable when storing data; however, debates exist about high replication rates, since every extra replica consumes a full additional copy's worth of disk space.
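The space-loss argument is plain arithmetic: raw disk consumption scales linearly with the replication factor. A small illustration:

```python
# Storage arithmetic behind the replication debate.
def raw_storage_tb(logical_tb, replication):
    """Raw disk consumed when each block is stored `replication` times."""
    return logical_tb * replication

default_3x = raw_storage_tb(100, 3)  # 100 TB of data -> 300 TB of raw disk
higher_5x = raw_storage_tb(100, 5)   # raising replication to 5 costs 500 TB
```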

Handling Unstructured Data

  • Hadoop 3 also changes how data can be stored: for suitable data, plain replication can be replaced by erasure coding, trading some CPU for a large reduction in storage overhead. This raises the question of how unstructured data is handled effectively.
  • Analyzing unstructured data such as images or videos poses its own challenges; these files are stored in HDFS as raw binary, and specialized processing is needed before analysis can occur.
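Hadoop 3's erasure coding is the main storage change worth quantifying. A rough overhead comparison, assuming the RS(6,3) Reed-Solomon policy (6 data cells plus 3 parity cells, tolerating 3 losses):

```python
# Storage multiplier: total cells written per unit of user data.
def storage_multiplier(data_units, parity_units):
    return (data_units + parity_units) / data_units

triple_replication = 3.0                    # classic HDFS default
rs_6_3 = storage_multiplier(6, 3)           # RS(6,3) erasure coding -> 1.5x
savings = 1 - rs_6_3 / triple_replication   # half the raw disk is saved
```

Both schemes survive the loss of any three copies/cells of a stripe, which is why erasure coding is attractive for cold data despite its higher reconstruction cost.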

Real-world Applications and Limitations

  • A real-world example from a customer-care context illustrates that companies often analyze call metadata rather than the actual audio recordings, due to practical constraints on processing raw audio at scale.
Video description

About the video: #BigData | What is Big Data Hadoop? How does it help in processing and analyzing Big Data? In this course, you will learn the basic concepts of Big Data analytics, the skills required for it, how Hadoop helps solve the problems associated with traditional systems, and more.

About the speaker: Raghu Raman A V is a Big Data and AWS expert with over a decade of training and consulting experience in AWS and the Apache Hadoop ecosystem, including Apache Spark. He has worked with global customers such as IBM, Capgemini, HCL, and Wipro, as well as Bay Area startups in the US.