Hadoop Ecosystem | Hadoop Tutorial for Beginners | Hadoop [Part 9]
Understanding Data Processing in Hadoop
Overview of Virtual Machines and Performance
- Discussion on the types of hypervisors, specifically bare-metal (Type 1) hypervisors like Hyper-V and VMware ESXi, which generally deliver better performance than hosted (Type 2) setups.
- Introduction to ETL (Extract, Transform, Load) processes traditionally used for data handling.
Transition from ETL to ELT in Hadoop
- Explanation of why Hadoop favors ELT (Extract, Load, Transform) over traditional ETL due to challenges in transforming large datasets on-the-fly.
- Mention of tools for extracting structured data into Hadoop systems, such as Apache Sqoop.
Using Apache Sqoop for Data Transfer
- Description of Sqoop's role in transferring data from SQL databases like Oracle and MySQL into Hadoop; it is not classified as an ETL tool but rather a data transfer tool.
- Clarification that Sqoop can only handle structured data and cannot process flat files or JSON directly.
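As a sketch of how such a transfer might look, the following Sqoop command imports a relational table into HDFS; the connection string, credentials, table name, and target directory are all hypothetical placeholders.

```shell
# Hypothetical import of a MySQL table into HDFS; all names are placeholders.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table transactions \
  --target-dir /data/raw/transactions \
  --num-mappers 4
```

Each mapper opens its own JDBC connection, so the import itself runs as a parallel, map-only MapReduce job.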
Change Data Capture with GoldenGate
- Introduction to the concept of Change Data Capture (CDC), particularly using Oracle's GoldenGate when direct access to core banking databases is restricted.
- Explanation that CDC captures only changes (deltas) in the database without allowing direct interaction with the original database.
Utilizing Apache Flume for Unstructured Data
- Overview of Apache Flume as a tool designed for collecting unstructured data from various sources and delivering it to destinations like HDFS.
- Distinction between Flume’s point-to-point delivery system versus Kafka’s message queue capabilities; Flume does not store data but simply transfers it.
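A minimal Flume agent illustrating this transfer-only role might be configured as below; the agent name, log path, and HDFS path are assumptions for illustration.

```properties
# Hypothetical Flume agent: tail an application log and deliver events to HDFS.
agent1.sources  = tail-src
agent1.channels = mem-ch
agent1.sinks    = hdfs-sink

agent1.sources.tail-src.type     = exec
agent1.sources.tail-src.command  = tail -F /var/log/app/app.log
agent1.sources.tail-src.channels = mem-ch

# Memory channel: a transient buffer, not durable storage.
agent1.channels.mem-ch.type     = memory
agent1.channels.mem-ch.capacity = 10000

agent1.sinks.hdfs-sink.type      = hdfs
agent1.sinks.hdfs-sink.hdfs.path = /data/logs/%Y-%m-%d
agent1.sinks.hdfs-sink.channel   = mem-ch
```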
Comparing Flume and Kafka
- Highlighting the differences between Flume's temporary storage model and Kafka's persistent message queuing system that allows multiple consumers access to stored messages.
- Discussion on how Kafka can retain messages for up to seven days by default, enabling more complex analytics scenarios.
Real-Time Data Analysis with Spark
Introduction to Spark Streaming and Real-Time Data Processing
Overview of Spark Streaming
- Spark Streaming is a utility for analyzing real-time data, which may be new to some learners.
Comparison of Flume and Kafka
- Flume can pull data from existing sources (for example, by tailing log files) without modifying the application, while Kafka requires a producer to be embedded in the application to write data directly.
- In an existing setup, therefore, using Flume to forward data into Kafka is often more practical than adding a Kafka producer to the application.
Data Flow Architecture
- The architecture involves sending one copy of the data to HDFS (Hadoop Distributed File System) and another copy into Spark Streaming for real-time processing.
- Other tools capable of real-time processing include Apache Flink and Apache Storm.
Functionality of Flume and Kafka
- Flume, like a messaging app such as WhatsApp, pushes data onward but does not store it; in contrast, Kafka retains the data until it is consumed (and up to its retention period).
- In the ICICI Bank architecture example, one copy goes to Spark Streaming for processing while another remains in Kafka.
Kafka's Data Management Features
Data Retention Configuration
- Kafka allows configuration of how long data should remain stored (default is seven days), with options based on size limits (e.g., 10 GB or 20 GB).
Topics in Kafka
- Kafka organizes messages into topics; retention can be configured per topic, controlling how long each dataset remains within the cluster.
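The retention settings described above map onto broker and topic configuration roughly as follows (values are illustrative):

```properties
# server.properties — broker-wide defaults
log.retention.hours=168          # keep messages for seven days (the default)
log.retention.bytes=10737418240  # optional size cap per partition (here 10 GB)

# A per-topic override can be set at creation time, e.g.:
#   kafka-topics.sh --create --topic txns ... --config retention.ms=86400000
```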
Real-Time Processing Applications
Use Cases for Real-Time Processing
- An example is sending immediate offers based on customer transactions or clickstream events (such as Facebook ad clicks) as they occur.
Importance of Original Data Storage
- The original transaction data must be stored securely while real-time analysis is enabled through tools like Splunk or the ELK stack.
Integration with Other Tools
Using Sqoop with HDFS and Hive
- Sqoop pushes data primarily into HDFS but can also be configured to send it to Kafka or Hive for further analysis.
Transactional Data Analysis Challenges
Real-Time Data Processing and Fraud Detection
Change Data Capture (CDC) Systems
- The discussion begins with the importance of capturing data using CDC systems, specifically mentioning that Flume is used to collect log files for real-time processing.
- Real-time data processing requires a CDC system to push data out efficiently; examples include Spark Streaming and Flink, which are highlighted as popular frameworks.
Use Case: Credit Card Fraud Detection
- A practical example involving Citibank illustrates how credit card transactions are monitored in real time to detect fraud.
- The speaker emphasizes the volume of transactions processed daily by banks like HDFC, indicating that manual monitoring is impractical.
Transaction Monitoring Process
- All credit card transaction data is collected and sent through systems like Kafka or directly into Spark Streaming or Flink for analysis.
- Logic can be implemented to trigger alerts based on unusual transaction patterns, such as spending outside typical behavior.
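Such alert logic could be sketched as a simple rule over a per-card profile; the fields, thresholds, and profile data below are hypothetical, and a real system would learn them from historical transactions.

```python
from dataclasses import dataclass

@dataclass
class Txn:
    card_id: str
    amount: float
    country: str

# Hypothetical per-card profile; in practice this comes from historical data.
PROFILES = {
    "card-42": {"avg_amount": 80.0, "home_country": "IN"},
}

def is_suspicious(txn: Txn, spike_factor: float = 10.0) -> bool:
    """Flag a transaction that is far outside the card's typical behavior."""
    profile = PROFILES.get(txn.card_id)
    if profile is None:
        return False  # no history yet; let it pass
    too_large = txn.amount > spike_factor * profile["avg_amount"]
    abroad = txn.country != profile["home_country"]
    return too_large or abroad

print(is_suspicious(Txn("card-42", 75.0, "IN")))    # False: typical purchase
print(is_suspicious(Txn("card-42", 5000.0, "US")))  # True: spike + unusual country
```

In production, this rule would be evaluated inside the streaming job for every incoming transaction, with the profiles kept in a fast lookup store.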
Processing Capabilities of Frameworks
- Apache Storm can process individual events (like single transactions), while Spark Streaming operates on micro-batches, collecting several seconds' worth of transactions before processing them.
- Both frameworks aim for real-time processing but differ in their operational methodologies—Storm processes one event at a time while Spark uses micro-batching.
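The operational difference can be illustrated in miniature: a Storm-style loop handles each event on arrival, while a Spark-Streaming-style approach first groups events into small batches (here by count rather than by time interval, purely for illustration).

```python
from typing import Callable, Iterable, List

def process_per_event(events: Iterable[int], handle: Callable[[int], None]) -> None:
    """Storm-style: hand each event to the handler the moment it arrives."""
    for e in events:
        handle(e)

def process_micro_batches(events: List[int], batch_size: int) -> List[List[int]]:
    """Spark-Streaming-style: group events into small batches before processing.

    (Real Spark Streaming batches by a time interval; batching by count here
    keeps the illustration deterministic.)
    """
    return [events[i:i + batch_size] for i in range(0, len(events), batch_size)]

txns = [101, 102, 103, 104, 105]

seen = []
process_per_event(txns, seen.append)
print(seen)                            # [101, 102, 103, 104, 105]
print(process_micro_batches(txns, 2))  # [[101, 102], [103, 104], [105]]
```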
Machine Learning Integration
- To determine if a transaction is fraudulent, historical transaction data must be analyzed. Machine learning models built from this data help predict potential fraud in real time.
- The challenge lies in comparing new transactions against extensive historical datasets without significant delays; machine learning models facilitate this comparison efficiently.
Resource Requirements for Real-Time Processing
- Implementing these systems requires substantial computational resources (CPU and RAM), highlighting the need for robust infrastructure to support machine learning operations.
Batch Processing with Hadoop
Introduction to Batch Processing Frameworks
- Once data is stored in HDFS, batch processing can be performed using MapReduce as the default framework.
Alternative Tools for Batch Processing
- Pig was mentioned as another tool available for batch processing but has fallen out of favor due to the rise of Spark's popularity.
SQL-like Querying with Hive
Hive and Spark: Understanding Batch Processing
Introduction to Hive and MapReduce
- Hive is a tool that simplifies data processing by letting developers create tables and run SQL-like queries without writing low-level MapReduce code.
- MapReduce is a batch processing system that can take significant time to execute queries; for example, a query in Flipkart took around 12 hours to run.
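For illustration, a typical HiveQL workflow might look like the following; the table layout is hypothetical, and Hive compiles the query into MapReduce jobs behind the scenes.

```sql
-- Hypothetical HiveQL: define a table over delimited files in HDFS.
CREATE TABLE transactions (
  txn_id  BIGINT,
  card_id STRING,
  amount  DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- A full-scan aggregation; Hive translates this into MapReduce.
SELECT card_id, SUM(amount) AS total
FROM transactions
GROUP BY card_id;
```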
Spark as an Alternative
- Spark is introduced as an in-memory execution engine that offers faster performance compared to traditional MapReduce systems, classified as batch processing but capable of near real-time execution.
- While Spark provides speed advantages, it operates differently from Hive, which remains relevant as a data warehouse for table creation and data storage.
Integration of Spark with Hive
- Spark can connect with Hive using its SQL library, enabling users to read tables from Hive and perform operations on them efficiently.
- For large group-by queries requiring full table scans, both Hive and Spark are effective. However, for smaller queries targeting specific rows, loading entire datasets may be inefficient.
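A minimal PySpark sketch of this integration, assuming a Spark installation configured against the Hive metastore and a hypothetical Hive table `sales.transactions`:

```python
from pyspark.sql import SparkSession

# Sketch only: requires a cluster where Spark can reach the Hive metastore.
spark = (SparkSession.builder
         .appName("hive-example")
         .enableHiveSupport()
         .getOrCreate())

# Read the Hive table and run a full-scan style aggregation in memory.
df = spark.table("sales.transactions")
df.groupBy("region").sum("amount").show()
```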
MPP Engines: Impala and Others
- Multiple MPP (Massively Parallel Processing) engines like Impala, HAWQ, Presto, and Phoenix are discussed as alternatives that optimize query performance by scanning only the necessary data.
- Impala executes SQL queries directly without relying on MapReduce; it reads table metadata and then fetches only the required data from HDFS for efficient querying.
Reliability vs. Speed
- A key distinction between Impala and Hive lies in reliability; while Impala offers faster query execution times (e.g., two minutes), it lacks fault tolerance—if a machine crashes during an Impala query execution, the job fails.
- In contrast, Hive's reliance on MapReduce provides built-in fault tolerance; if a machine fails during processing, its tasks are re-run on other nodes and the job still completes.
The Role of HBase
- HBase is identified as Hadoop's NoSQL database designed for real-time random reads/writes rather than block-level access typical in HDFS.
- Although HBase allows individual record manipulation through its API, it requires learning HBase-specific commands instead of standard SQL.
Phoenix: Bridging SQL with HBase
- Phoenix serves as an interface allowing users to write SQL queries on top of HBase. This combination enhances usability by providing familiar syntax over non-SQL databases.
- Despite being fast due to inherent optimizations within HBase itself, Phoenix is still in development stages (alpha/beta), aiming towards becoming production-ready with support for transaction management.
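A small, hypothetical Phoenix session shows the familiar SQL surface over HBase; note that Phoenix uses `UPSERT` rather than `INSERT`:

```sql
-- Hypothetical Phoenix DDL: the table is backed by HBase under the hood.
CREATE TABLE IF NOT EXISTS customer (
  id      BIGINT PRIMARY KEY,
  name    VARCHAR,
  balance DECIMAL(10, 2)
);

UPSERT INTO customer VALUES (1, 'Asha', 2500.00);

-- A point lookup like this maps to a fast HBase row get.
SELECT name, balance FROM customer WHERE id = 1;
```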
Hadoop and Its Integration with Phoenix
Overview of Hadoop and Phoenix Integration
- The integration of Phoenix with Hadoop is discussed, highlighting that originally HBase lacked SQL support. Phoenix serves as a layer on top of HBase to enable SQL capabilities.
- There are ongoing developments aimed at enhancing transactional management within the system, promising true transactional capabilities in the near future.
Commercial Distributions vs. Apache Downloads
- When acquiring products from Cloudera, users receive a comprehensive package that includes various tools, unlike downloading from Apache where only basic components like HDFS and MapReduce are available.
- The commercial distributions provide an integrated learning experience by packaging multiple tools together for user convenience.
Tools and Technologies in Cloudera's Offerings
- A variety of technologies such as Spark, Impala, Presto, Drill, and Kafka are included in Cloudera’s offerings alongside traditional components like HDFS and MapReduce.
- Newer technologies like Kudu (a columnar storage engine for fast analytics on frequently updated data) are mentioned; however, there is limited knowledge about them among users.
Security Features in Hadoop Ecosystem
- Concerns regarding security within the Hadoop ecosystem are addressed through the introduction of Sentry, which provides role-based authorization and fine-grained access control.