What is Big Data? Introduction to Big Data | Hadoop Tutorial for Beginners | Hadoop [Part 1]
What is Big Data?
Introduction to Big Data
- The speaker opens with a question about the definition of Big Data, encouraging audience participation and practical understanding rather than relying solely on definitions found online.
- Emphasizes that knowledge of IT is essential for grasping the concept of Big Data, hinting at its technical nature.
Practical Use Case: Early Experience
- Shares a personal anecdote from 2007-2008 while working in Bangalore, where an application was developed for retail companies to capture sales data.
- Describes the use of an RDBMS (Relational Database Management System) to store sales data, highlighting its traditional row-column format.
Transition to Big Data: ICICI Bank Example
- Discusses a shift towards Big Data at ICICI Bank in 2011 due to limitations faced with traditional RDBMS systems.
- Identifies problems encountered by ICICI Bank, such as difficulties in managing increasing data volumes within their existing database systems.
Limitations of Traditional RDBMS
- Explains that a traditional RDBMS can handle gigabytes to terabytes of data but struggles as data size grows significantly beyond that.
- Mentions that while storage capacity can be expanded (e.g., adding storage boxes), there are inherent limitations and complexities involved.
Challenges with Data Management
- Highlights issues related to partitioning large datasets across multiple machines and the complications arising from this approach.
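As a rough illustration of why spreading one dataset across several machines adds complexity, here is a minimal Python sketch. It is not taken from the talk: the shard count, the `order-…` keys, and the helper names are all hypothetical, and plain dictionaries stand in for separate machines. A lookup by key touches one shard, while an aggregate has to visit every shard and combine the partial results.

```python
import zlib

NUM_SHARDS = 3  # hypothetical number of machines

# Each dict stands in for the storage on one machine.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key, num_shards=NUM_SHARDS):
    """Pick a shard with a stable hash so a key always maps to the same shard."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def put(key, amount):
    """Store a sale amount on the shard that owns this key."""
    shards[shard_for(key)][key] = amount


def get(key):
    """A point lookup touches exactly one shard."""
    return shards[shard_for(key)].get(key)


def total():
    """An aggregate must visit every shard and combine partial sums."""
    return sum(sum(shard.values()) for shard in shards)


# Hypothetical sales records.
put("order-1", 100)
put("order-2", 250)
put("order-3", 75)
```

Changing `NUM_SHARDS` later would move most keys to a different shard, which is one of the complications that makes manual partitioning hard to operate.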
Understanding Data Partitioning and Database Limitations
The Concept of Data Partitioning
- The speaker discusses the need for partitioning large tables in a database management system (DBMS) to improve query performance, as querying large datasets can be time-consuming.
- Logical partitioning is emphasized: the data is not physically divided, but it is organized by a specific column, such as country, so that a query only has to process the relevant partition.
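The idea of partitioning by a column can be sketched in a few lines of Python. This is only an in-memory illustration under assumed data: the `sales` rows, the column names, and the helper functions are hypothetical and do not reflect how any particular DBMS implements partitions.

```python
from collections import defaultdict

# Hypothetical sales rows; the column names are illustrative only.
sales = [
    {"order_id": 1, "country": "IN", "amount": 250.0},
    {"order_id": 2, "country": "US", "amount": 120.0},
    {"order_id": 3, "country": "IN", "amount": 75.5},
    {"order_id": 4, "country": "UK", "amount": 310.0},
]


def partition_by(rows, column):
    """Group rows into partitions keyed by the value of `column`."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return partitions


def total_for(partitions, key):
    """Sum amounts by scanning only the partition for `key`."""
    return sum(row["amount"] for row in partitions.get(key, []))


partitions = partition_by(sales, "country")
india_total = total_for(partitions, "IN")  # scans 2 rows instead of all 4
```

A query for one country reads only that country's rows, which is the performance benefit the partitioning is meant to provide.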
Challenges with Traditional DBMS
- A significant drawback of traditional DBMS is that as data size increases, processing speed decreases. This leads to questions about denormalizing data for better performance.
- Traditional relational designs favor normalization over denormalization: data is spread across multiple tables, so queries need joins to bring it back together.
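To make the difference between the two layouts concrete, the following sketch uses Python's built-in `sqlite3` module purely as a convenient stand-in for a relational database. The table names, column names, and sample rows are assumptions made for illustration. The normalized layout needs a join to read an order together with its customer name, while the flat layout repeats the name in every row and can be read without a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized design: customers and orders live in separate tables.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
cur.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])
cur.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 250.0), (11, 2, 120.0), (12, 1, 75.5)],
)

# Reading an order with its customer name requires a join.
joined = cur.execute(
    "SELECT o.id, c.name, o.amount FROM orders o "
    "JOIN customers c ON c.id = o.customer_id ORDER BY o.id"
).fetchall()

# Denormalized design: the customer name is copied into every order row.
cur.execute(
    "CREATE TABLE orders_flat (id INTEGER PRIMARY KEY, customer_name TEXT, amount REAL)"
)
cur.executemany("INSERT INTO orders_flat VALUES (?, ?, ?)", joined)

# The same information is now available without a join.
flat = cur.execute(
    "SELECT id, customer_name, amount FROM orders_flat ORDER BY id"
).fetchall()
conn.close()
```

The flat table trades storage and update cost (a renamed customer must be changed in many rows) for simpler, join-free reads.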
Handling Unstructured Data
- Traditional DBMS primarily handle structured data (row-column format), raising concerns about their ability to process unstructured data like images or audio files effectively.
- The speaker highlights the limitations of traditional systems in managing unstructured data and suggests alternative solutions are necessary.
NoSQL Databases: A Solution for Modern Needs
- Companies like Flipkart utilize NoSQL databases to manage vast amounts of unstructured data efficiently. For instance, Flipkart manages around 1 billion product images using these technologies.
- NoSQL databases such as MongoDB and DynamoDB allow faster storage and retrieval of unstructured data by using key-value pairs or documents rather than fixed rows and columns.
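The key-value access pattern can be illustrated with a toy in-memory store. This sketch is an assumption-laden illustration only: the `KeyValueStore` class, the key format, and the placeholder image bytes are hypothetical, and it does not use or imitate the actual MongoDB or DynamoDB APIs.

```python
class KeyValueStore:
    """Toy in-memory key-value store; not how MongoDB or DynamoDB work internally."""

    def __init__(self):
        self._items = {}

    def put(self, key, value, **metadata):
        """Store an opaque value under a key, with optional free-form metadata."""
        self._items[key] = {"value": value, "metadata": metadata}

    def get(self, key):
        """Return the stored value, or None if the key is unknown."""
        item = self._items.get(key)
        return None if item is None else item["value"]

    def metadata(self, key):
        """Return the metadata dict for a key, or an empty dict if unknown."""
        item = self._items.get(key)
        return {} if item is None else item["metadata"]


store = KeyValueStore()

# Placeholder bytes standing in for a real product image.
image_bytes = b"\x89PNG placeholder"
store.put("product:42:image", image_bytes, content_type="image/png")

fetched = store.get("product:42:image")
```

The value is opaque to the store, so no table schema has to be defined for images, audio, or other unstructured content.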
Scalability and Cost Issues
- Scalability is identified as a critical issue with traditional relational databases when handling millions of concurrent sessions, which modern applications require.
- Cost is another concern; traditional solutions like Oracle are expensive compared to more flexible NoSQL options that cater to contemporary business needs.
Big Data: Understanding Its Importance
- The concept of big data emerges from the necessity to analyze massive datasets that exceed the capabilities of conventional methods.
Understanding Transaction Management in Databases
The Role of DBMS in Transaction Management
- The discussion begins with the importance of data processing and the role of Database Management Systems (DBMS) in transaction management, emphasizing that DBMS is not obsolete despite the rise of NoSQL databases.
- A relational DBMS is crucial for ensuring ACID properties (Atomicity, Consistency, Isolation, Durability), which guarantee reliable transactions. For instance, if a purchase is made on Flipkart, it must be clear whether the transaction succeeded or failed.
- A hypothetical scenario illustrates the point: if Flipkart used Cassandra (a NoSQL database that is eventually consistent by default) for purchases, it might not be able to confirm immediately whether a transaction succeeded, which highlights the limitations of some NoSQL systems in handling transactions.
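The "succeeded or failed, nothing in between" guarantee can be sketched with Python's built-in `sqlite3` module, used here only as a small relational stand-in. The `accounts` table, the balances, and the `pay` helper are hypothetical. The payment is two updates inside one transaction; if the second update violates the balance constraint, the first is rolled back as well.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("buyer", 100), ("seller", 0)])
conn.commit()


def pay(conn, amount):
    """Move `amount` from buyer to seller; return True only if both updates commit."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = 'seller'",
                (amount,),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = 'buyer'",
                (amount,),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the whole transaction was rolled back.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


first_ok = pay(conn, 60)   # buyer has enough funds
second_ok = pay(conn, 60)  # buyer would go negative, so nothing is applied
final = balances(conn)
```

After the failed second payment the seller's balance is unchanged, so the caller gets a clear yes-or-no answer and the data never shows a half-applied purchase.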
Big Data and NoSQL Databases
- NoSQL databases form one category within the big data ecosystem. These systems are designed for real-time queries and include technologies like MongoDB, Cassandra, and HBase.