Oozie Hadoop Tutorial | HDFS Processing | Hadoop Tutorial for Beginners | Hadoop [Part 7]
Understanding Hadoop and Resource Management
Overview of the Apache Oozie Tool
- The tool discussed is Apache Oozie, which is used alongside Hadoop to schedule and coordinate jobs.
- Users can schedule jobs to analyze large datasets (e.g., 100 TB); submitting such a job immediately may fail if the cluster's resources are already occupied by other work.
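The kind of scheduling described above is typically expressed in an Oozie coordinator definition. The sketch below is a minimal, hypothetical coordinator that runs a workflow once a day; the app name, dates, and `${nameNode}/apps/analysis-workflow` path are placeholders, not values from the tutorial.

```xml
<coordinator-app name="daily-analysis" frequency="${coord:days(1)}"
                 start="2024-01-01T00:00Z" end="2024-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <action>
    <workflow>
      <!-- Path in HDFS to the workflow that actually runs the analysis -->
      <app-path>${nameNode}/apps/analysis-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>
```

Oozie materializes one workflow run per day between the start and end dates, rather than firing everything at once against the cluster.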
Resource Allocation in Hadoop
- Different roles (developer, tester, researcher) require varying levels of access to the Hadoop cluster; resource allocation is managed through queues.
- Parallelism is emphasized; multiple teams can submit jobs to their respective queues rather than directly to the cluster.
Queue Management and Scheduling
- Example given: a Hadoop cluster with 10 data nodes, each with a fixed amount of RAM and number of processor cores. Proper queue management prevents any single job from monopolizing these resources.
- Multiple queues can be created with defined resource allocations (e.g., the developer queue gets 60% of cluster resources).
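With YARN's Capacity Scheduler, the split described above goes in `capacity-scheduler.xml`. A minimal sketch, assuming three hypothetical queues (`dev`, `test`, `research`) with the 60% developer share from the example; the other percentages are illustrative and must sum to 100:

```xml
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>dev,test,research</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.dev.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.test.capacity</name>
  <value>25</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.research.capacity</name>
  <value>15</value>
</property>
```

A job is then directed at a queue at submission time, e.g. with `-D mapreduce.job.queuename=dev`.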
Scheduler Types and Policies
- Various schedulers exist (capacity scheduler, fair scheduler); users must specify a queue when submitting jobs.
- The default scheduling policy within a queue is FIFO (First In, First Out), but custom policies can also be implemented.
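To make the default policy concrete, here is a toy Python model of a single queue scheduling jobs FIFO. It is purely illustrative (the queue and job names are made up), not YARN code:

```python
from collections import deque

class FifoQueue:
    """Toy model of a scheduler queue with a FIFO policy:
    jobs run strictly in the order they were submitted."""

    def __init__(self, name):
        self.name = name
        self.pending = deque()

    def submit(self, job):
        self.pending.append(job)

    def next_job(self):
        # FIFO: the earliest-submitted pending job is picked first.
        return self.pending.popleft() if self.pending else None

# Three jobs submitted to a hypothetical "dev" queue.
q = FifoQueue("dev")
for job in ["etl-load", "report-gen", "model-train"]:
    q.submit(job)

order = [q.next_job() for _ in range(3)]
print(order)  # ['etl-load', 'report-gen', 'model-train']
```

A fair or capacity scheduler would instead interleave jobs from multiple such queues according to their configured shares.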
Data Storage and Processing in Hadoop
- HDFS allows storage of various file types without strict format requirements; processing logic must be written separately.
- Hive serves as a data warehouse on top of Hadoop that accepts SQL-like queries but operates very differently from a traditional RDBMS.
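The schema-on-read idea from the two bullets above can be sketched in HiveQL. The table name, columns, and HDFS path here are hypothetical:

```sql
-- External table over files already sitting in HDFS; Hive projects
-- a schema onto the raw data at read time (schema-on-read).
CREATE EXTERNAL TABLE page_views (
  user_id   STRING,
  url       STRING,
  view_time TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/page_views';

-- Familiar SQL syntax, but executed as a distributed batch job
-- over HDFS, not as an in-place RDBMS lookup.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```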
Comparison Between Hadoop and RDBMS
- Query performance can be improved in Hive (e.g., via indexing), but Hive should not be compared directly with an RDBMS: Hive runs distributed batch jobs over files in HDFS, while an RDBMS performs transactional, row-level operations on managed storage.