Wide Column Store NoSQL vs SQL Data Modeling

Introduction to Data Modeling with ScyllaDB

Overview of the Speaker and Company

  • Tzach Livyatan introduces himself and his background, mentioning previous work at Oracle and other companies.
  • He presents ScyllaDB, a high-performance big data database compatible with Apache Cassandra and Amazon DynamoDB at the protocol level.

Features of ScyllaDB

  • ScyllaDB is designed for performance, developed in C++ with a shard-per-core, thread-per-core architecture.
  • Users can run ScyllaDB in several forms: open source, an enterprise version (with added security features), and a managed cloud service.
  • The enterprise version offers commercial support and greater stability than the open-source variant.
  • ScyllaDB Cloud provides a managed service with proactive monitoring and upgrades, and a trial does not require a credit card up front.

Understanding NoSQL Databases

Definition and Scope of NoSQL

  • The session will cover what NoSQL means, its history spanning over 15 years, and its broad categorization into hundreds of databases.
  • Two dimensions are used to categorize NoSQL databases: availability vs consistency (CAP theorem) and data modeling approaches.

Availability vs Consistency

  • Most modern NoSQL databases such as ScyllaDB, Cassandra, and DynamoDB use eventual consistency by default; users can choose the consistency level they need per operation.
  • For deeper insight into availability versus consistency, resources such as the architecture pages in the ScyllaDB docs or the Jepsen reports are recommended.

Data Modeling Approaches in NoSQL

Types of Data Models

  • NoSQL databases can be categorized from simple key-value stores to more complex structures:
  • Key-value Stores: Basic structure where each key maps to a value; examples include Aerospike and Redis.
  • Document Stores: Store documents as values associated with keys; typically use nested JSON or XML formats for data representation.
  • Wide Column Stores: Focused on partition keys and clustering keys; this model will be elaborated upon during the session's main discussion on data modeling techniques.

Additional Database Types

  • Other types of NoSQL databases include graph databases, time-series databases, etc., highlighting the dynamic nature of this field. A diagram illustrating these categories is available for further exploration later in the presentation materials.

Data Modeling in NoSQL Databases

Understanding Distributed Databases

  • Data modeling begins with recognizing that most NoSQL databases are distributed, meaning data is spread across multiple machines (nodes).
  • The distribution of data necessitates partitioning; you cannot store all data on a single node due to size constraints.
  • Unlike traditional databases like MySQL, where scaling out requires manual sharding, NoSQL databases like ScyllaDB partition data automatically using a hash function.
  • In addition to partitioning, NoSQL databases replicate data across nodes for redundancy; typically, each piece of information is replicated on three nodes.
  • Even with replication, no single node holds all the data; each node holds only a subset determined by the replication factor.
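The partitioning and replication scheme above can be sketched in Python. This is a deliberately simplified model, not ScyllaDB's actual implementation (ScyllaDB uses Murmur3 hashing and vnodes; MD5 and a fixed six-node ring are used here only for illustration): hash the partition key to a ring position, then place replicas on the next nodes around the ring.

```python
import hashlib

NODES = ["node-1", "node-2", "node-3", "node-4", "node-5", "node-6"]
REPLICATION_FACTOR = 3  # each partition is stored on three nodes

def replicas_for(partition_key: str) -> list[str]:
    """Map a partition key to its replica nodes (simplified ring model)."""
    # Hash the key to a deterministic position on the ring.
    digest = hashlib.md5(partition_key.encode()).digest()
    position = int.from_bytes(digest[:8], "big") % len(NODES)
    # The partition's replicas are the next RF nodes around the ring.
    return [NODES[(position + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

print(replicas_for("pet-chip-4711"))  # three distinct nodes, always the same three
```

Because the hash is deterministic, any node can compute where a given key lives without consulting a central directory.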

Data Center Configuration and Efficiency

  • ScyllaDB supports multiple data centers, allowing efficient regional configurations for applications (e.g., AWS or GCP regions).
  • Applications can query their local data center while updates are synchronized across regions asynchronously and efficiently.
  • A real-world test of this architecture occurred when an OVH cloud data center caught fire; Scylla continued functioning despite losing some nodes.

SQL vs. NoSQL Data Modeling Approaches

  • Traditional SQL modeling involves breaking down entities (like people or offices), creating tables per entity with defined relationships.
  • In contrast, NoSQL requires understanding specific application queries first to design the schema accordingly rather than following a relational model.

Iterative Schema Development

  • Developing a schema in NoSQL is iterative; initial assumptions may change over time as usage patterns evolve.
  • Continuous monitoring and adjustments are necessary as application needs shift after deployment.

Key Differences Between SQL and NoSQL

  • SQL offers more robust modeling capabilities compared to most NoSQL systems which prioritize high availability and performance over complex querying features.
  • The process of schema creation in NoSQL is often less flexible due to its denormalized nature and lack of joins or nested structures.

Terminology in Data Modeling

  • Important terms include "cluster" (the collection of nodes), "keyspace" (a namespace for tables), and other foundational concepts essential for understanding how Scylla operates.

Understanding Key Concepts in Database Partitioning

Overview of Database Structure

  • A keyspace is a collection of tables, similar to a database in SQL. It carries properties such as the replication factor, and its tables can use user-defined types.
  • Tables are structured like matrices with columns and rows, where each cell contains information representing records.
  • Data partitioning is crucial; rows in a table must be distributed across multiple servers (nodes), with partitions being collections of rows on the same physical node.

Defining Partitions and Keys

  • When defining a table in systems like ScyllaDB or Cassandra, identifying the partition key is essential. This key determines how data is organized across nodes.
  • The partition key example given is the pet chip ID, which can consist of one or more columns. A hash function maps this key to a specific node for data storage.

Understanding Partition Distribution

  • Each node holds a range of tokens within a token ring structure, allowing efficient data distribution across the cluster.
  • An example illustrates that each partition consists of rows and cells containing actual database information.
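The token-ring idea can be sketched with a toy lookup (hypothetical three-node ring and token values, not real ScyllaDB internals): each node owns the range of tokens up to and including its own token, and lookups wrap around the top of the ring.

```python
import bisect

# Hypothetical token ring: three nodes, each owning the range of tokens
# up to and including its own token, wrapping around at the top.
ring = [(100, "node-A"), (200, "node-B"), (300, "node-C")]
tokens = [t for t, _ in ring]

def owner_of(token: int) -> str:
    """Return the node whose token range contains the given token."""
    idx = bisect.bisect_left(tokens, token)
    return ring[idx % len(ring)][1]  # wrap around past the last node

print(owner_of(42))   # node-A
print(owner_of(150))  # node-B
print(owner_of(250))  # node-C
```

A partition key is first hashed to a token, and the token alone decides ownership; this is why the hash function, not the key's raw value, governs distribution.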

Example Use Case: Pet Information Table

  • An imaginary startup collects pet collar and owner information, starting with a simple table featuring pet chip IDs and names.
  • The schema resembles SQL but differs enough to cause confusion; while similar, it is not identical to a traditional SQL table.

Importance of Choosing Good Partition Keys

  • The presentation highlights the significance of selecting an effective partition key—this decision impacts performance significantly.
  • A good partition key should have high cardinality (many unique values) and ensure even distribution across nodes to avoid overloading any single node.

Characteristics of Effective vs. Ineffective Partition Keys

  • High cardinality means expecting billions of partitions; thus, choosing keys that distribute evenly prevents resource wastage on overloaded nodes.
  • Bad partition keys lead to hot partitions where too much data resides on one node. Examples include using state or age as keys due to limited grouping potential.

Conclusion on Partition Key Selection

  • Good examples for effective keys include UUID or customer IDs; poor choices group data into fewer categories leading to inefficiencies.
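The cardinality argument above can be made concrete with a quick simulation (illustrative numbers only, not ScyllaDB behavior): a low-cardinality key like a US state concentrates thousands of rows in a few huge partitions, while a UUID key yields many small, evenly hashable partitions.

```python
import uuid
from collections import Counter

ROWS = 10_000

# Low-cardinality key: every row falls into one of five buckets,
# so a few large partitions absorb most of the data.
states = ["CA", "NY", "TX", "FL", "WA"]
by_state = Counter(states[i % len(states)] for i in range(ROWS))

# High-cardinality key: one UUID per row yields one tiny partition per row,
# which the hash function can spread evenly across nodes.
by_uuid = Counter(str(uuid.uuid4()) for _ in range(ROWS))

print(max(by_state.values()))  # 2000 rows in the biggest state partition
print(max(by_uuid.values()))   # 1 row per UUID partition
```

The skew is the problem: the node holding "CA" does 2,000 rows' worth of work while a UUID-keyed table spreads the same load across the whole cluster.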

Understanding Partition Keys and Hot Partitions in Databases

The Importance of Choosing a Good Partition Key

  • Selecting an appropriate partition key is crucial for database performance. A case study revealed that using a customer as a partition key led to uneven traffic distribution, with 90% of requests coming from one customer.
  • The initial choice of the customer as a partition key was ineffective due to the low number of customers and their uneven distribution, necessitating a change in strategy.
  • A bad partition key can lead to "hot partitions," where excessive traffic hits the same node, causing performance issues. This problem is common across NoSQL databases.
  • The concept of hot partitions is illustrated through examples from various sources, emphasizing its prevalence in data management challenges.
  • Hot partitions can severely impact system performance; if most traffic targets one node while others remain idle, it leads to resource wastage and potential overload on the busy node.

Consequences of Hot Partitions

  • In scenarios with billions of data points and high request volumes (e.g., tens of millions per second), having a single overloaded node can cripple overall system functionality.
  • Users face dual setbacks: financial investment in infrastructure becomes inefficient when nodes are underutilized, and overloaded nodes struggle to manage incoming requests effectively.
  • To mitigate hot partition risks, it's essential to select effective partition keys and continuously monitor system performance for signs of imbalance.

Transitioning to Effective Data Modeling

  • After discussing hot partitions, the focus shifts toward effective data modeling strategies in databases like ScyllaDB, which support more complex queries beyond simple key-value lookups.
  • An example application collects pet health metrics (like heart rates), requiring specific queries such as retrieving the last heartbeat or all heartbeats within a time range for individual pets.
  • Supporting these complex queries necessitates an appropriate schema design that accommodates both unique identifiers (like pet chip IDs) and timestamps for efficient data retrieval.

Evaluating Partition Key Effectiveness

  • The pet chip ID serves as an ideal partition key due to its UUID nature, promoting even distribution across nodes.
  • Clustering keys play a vital role by ordering rows within each partition based on timestamps, enhancing query efficiency by allowing quick access to relevant records.
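The clustering-key behavior described above, rows kept sorted by timestamp within each partition so that a time-range query touches only one contiguous slice, can be sketched with a toy in-memory model (illustrative only; real partitions live in SSTables on disk):

```python
import bisect
from collections import defaultdict

# partition key -> list of (timestamp, heart_rate), kept sorted by the
# clustering key (timestamp), mimicking row order inside a partition.
partitions: dict[str, list[tuple[int, int]]] = defaultdict(list)

def insert(pet_chip_id: str, ts: int, heart_rate: int) -> None:
    bisect.insort(partitions[pet_chip_id], (ts, heart_rate))

def range_query(pet_chip_id: str, start: int, end: int) -> list[tuple[int, int]]:
    """Readings with start <= ts < end: one partition, one sorted slice."""
    rows = partitions[pet_chip_id]
    lo = bisect.bisect_left(rows, (start,))
    hi = bisect.bisect_left(rows, (end,))
    return rows[lo:hi]

insert("chip-1", 100, 72)
insert("chip-1", 300, 75)
insert("chip-1", 200, 74)
print(range_query("chip-1", 100, 250))  # [(100, 72), (200, 74)]
```

Because the partition key pins the query to one node and the clustering key keeps rows sorted there, "last heartbeat" and "heartbeats in a time range" never scan unrelated data.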

Demonstration of Query Efficiency

  • A live demonstration illustrates inserting heart rate data into the database followed by executing efficient queries based on specified parameters like pet chip ID and time ranges.
  • The demo highlights how well-designed schemas allow rapid querying capabilities while ensuring that operations are performed efficiently across designated nodes without overloading any single point.


Understanding Clustering and Partition Keys in Databases

Overview of Monitoring Dashboards

  • The speaker discusses the efficiency of querying with time as a clustering key, noting that such a query can execute in under a millisecond.
  • An overview of the ScyllaDB monitoring dashboards is provided, emphasizing their utility for tracking read and write operations, although the demo data may not be representative.

Defining Clustering Keys

  • The importance of defining a good clustering key is highlighted; it should align with the core application needs and how data is modeled.
  • Using timestamps as clustering keys is recommended for applications needing to retrieve recent values efficiently.

Managing Large Partitions

  • A discussion on partition sizes reveals that having millions of rows in one partition was once problematic but has been addressed through system improvements.
  • The speaker notes that while large partitions are now manageable, caution is advised to avoid "hot partitions" where all traffic targets the same partition.

Recommendations for Partitioning Strategies

  • General advice is to avoid excessively large partitions; where data would otherwise grow unbounded, compound partition keys can help bound partition size.
  • By adding an element such as the date to the partition key, each partition holds a limited amount of data (e.g., one day's worth), which prevents any single partition from growing without bound.
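The date-bucketing idea can be sketched as follows (the helper name and key shape are hypothetical, not a ScyllaDB API): the partition key becomes the pair (pet_chip_id, day), so each day's readings land in a separate, bounded partition.

```python
from datetime import datetime, timezone

def bucketed_partition_key(pet_chip_id: str, ts: float) -> tuple[str, str]:
    """Compound partition key (pet_chip_id, day): one partition per pet per day."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return (pet_chip_id, day)

# Readings from the same pet on different days map to different partitions,
# so no single partition grows without bound.
k1 = bucketed_partition_key("chip-1", 1700000000)  # a timestamp on 2023-11-14 UTC
k2 = bucketed_partition_key("chip-1", 1700100000)  # a timestamp on 2023-11-16 UTC
print(k1, k2)
```

The trade-off is that a multi-day range query must now touch one partition per day, which is usually acceptable when it keeps each partition small and evenly sized.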

Materialized Views and Query Efficiency

  • The concept of materialized views is introduced as a way to optimize queries by allowing alternative access patterns without compromising performance.
  • When querying by non-partition or non-clustering keys (like owner), developers must use "allow filtering," which may lead to inefficient scans across nodes.

Performance Considerations in Database Queries

Impact of Allow Filtering on Performance

  • Using allow filtering can lead to poor performance, especially with large datasets (billions of rows). Quick sub-millisecond responses are unlikely when this option is used.

Introduction to Materialized Views

  • A materialized view serves as automatic denormalization, in the spirit of how NoSQL data is typically modeled: the same information is stored under a different key without manual synchronization.
  • Creating a materialized view automates synchronization between the base table and the view, removing work that developers would otherwise do by hand.
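The automatic synchronization can be pictured with a toy in-memory model (illustrative only; a real materialized view is maintained by the database, not application code): every write to the base table also updates a view keyed by a different column, so both access paths stay consistent.

```python
# Toy model: a base table keyed by pet_chip_id and a "view" of the same
# rows keyed by owner; insert() keeps both in sync, as the database's
# materialized-view maintenance path would.
base: dict[str, dict] = {}                 # pet_chip_id -> row
view_by_owner: dict[str, list[dict]] = {}  # owner -> rows

def insert(row: dict) -> None:
    base[row["pet_chip_id"]] = row
    view_by_owner.setdefault(row["owner"], []).append(row)

insert({"pet_chip_id": "chip-1", "owner": "alice", "name": "Rex"})
insert({"pet_chip_id": "chip-2", "owner": "alice", "name": "Mia"})

# Query by a non-partition-key column without a full scan:
print([r["name"] for r in view_by_owner["alice"]])  # ['Rex', 'Mia']
```

The cost is visible even in the toy: every write now touches two structures, and the data is stored twice.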

Querying with Materialized Views

  • Once a materialized view is created, it enables efficient querying by non-partition-key columns, giving better access patterns than the base table allows.
  • Materialized views may not reside on the same nodes as their base tables because their partition keys differ; however, this is managed automatically by the database.

Cost and Efficiency of Materialized Views

  • Maintaining a materialized view incurs storage costs since data is duplicated, and updates involve writing to both the base table and the view.
  • Reading from a materialized view can be efficient: if the coordinator node holds the relevant data, it can fetch it directly rather than querying multiple nodes.

Choosing Effective Materialized View Keys

  • Selecting a partition key for a materialized view follows the same principles as choosing a good partition key in general. For example, 'owner' works well because of its diversity, unlike a low-cardinality option such as 'age'.

Secondary Indexes vs. Materialized Views

Understanding Secondary Indexes

  • Secondary indexes in ScyllaDB act as syntactic sugar over materialized views, an implementation that differs from other databases such as Apache Cassandra.

Optimization Strategies

  • When secondary indexes or materialized views share the same partition key as their base table, ScyllaDB optimizes performance by placing them on the same node.

Querying Strategies Overview

  • There are three primary methods for querying tables not based on partition keys:
  • Allow filtering (full scan)
  • Materialized views
  • Secondary indexes

Guidelines for Querying Options

  • If queries are infrequent and latency isn't critical (e.g., ad hoc queries), allow filtering may suffice despite its inefficiency.
  • For scenarios requiring low-latency responses (sub-second), creating a material view is advisable over relying solely on secondary indexes or allow filtering.

Understanding Query Optimization and Data Management

Key Concepts in Querying with ScyllaDB

  • Managing Synchronization: When denormalizing by hand, users must handle synchronization themselves. Simple queries work well, but complex scenarios may require joins, which ScyllaDB does not support natively.
  • Using Third-Party Tools: For more complex transformations and queries, tools like Presto or Spark can be integrated with ScyllaDB. However, these may perform full scans of the database, impacting efficiency.
  • Response Time Considerations: Sub-millisecond response times are crucial for some applications; in contrast, analytics jobs that run overnight can afford far longer processing times.

Best Practices for Data Modeling

  • Choosing Partition Keys: Selecting an appropriate partition key is paramount; it is the most critical aspect of data modeling. Clustering keys should only be used if necessary.
  • Materialized Views for Efficiency: When querying by a column that isn't the partition key, utilizing materialized views can enhance query performance significantly.

Monitoring and Iteration

  • Importance of Monitoring: Continuous monitoring is essential before benchmarking or testing new databases. Effective monitoring tools help avoid blind spots during performance assessments.
  • Iterative Schema Development: Database schemas will evolve over time as new information becomes available. Expect changes from initial designs to later iterations based on practical usage and insights gained.

Additional Resources and Q&A Insights

  • Further Learning Opportunities: Interested users can explore additional resources through ScyllaDB Cloud or ScyllaDB University for deeper insights into data modeling techniques.
  • Understanding Latency Metrics: Tail latency (e.g., p99 metrics) focuses on the worst-performing 1% of queries rather than average latencies, which could misrepresent application performance issues.
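The p99 point can be made concrete with a quick calculation (illustrative numbers): a handful of slow outliers barely move the average but dominate the tail that real users experience.

```python
# 1,000 queries: 99% take 1 ms, but 1% take 500 ms.
latencies_ms = [1.0] * 990 + [500.0] * 10

average = sum(latencies_ms) / len(latencies_ms)
p99 = sorted(latencies_ms)[int(len(latencies_ms) * 0.99)]  # 990th of 1,000

print(f"average = {average:.2f} ms")  # 5.99 ms: looks healthy
print(f"p99     = {p99:.1f} ms")      # 500.0 ms: 1% of requests wait half a second
```

This is why benchmarks and monitoring should report tail percentiles such as p99 rather than averages alone.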

Testing Environments and API Compatibility

  • Testing on Single Nodes: Functional testing can be performed on a single node using Docker; however, performance testing requires multiple nodes to accurately reflect production environments.
  • DynamoDB API Compatibility: ScyllaDB supports both the Apache Cassandra and DynamoDB APIs. The choice of partition keys remains critical in both for optimal performance.

Handling Large Collections

  • Telemetry Data Management: When capturing telemetry data from numerous sensors, consider how many data points per partition key you have to determine whether to consolidate or distribute them across multiple keys effectively.
  • Collection Size Limitations: Collections in ScyllaDB are meant to stay small, on the order of hundreds to a thousand elements; beyond that, restructure the data into a separate table for better management and efficiency.

Partition Key Design Considerations

Understanding Partitioning in Data Storage

  • The design supports billions of partition keys, indicating an architecture capable of handling extensive data sets.
  • It is also optimized for millions of rows within a single partition, emphasizing efficient data retrieval and management.
  • Caution is advised against storing millions of elements in a collection within a single cell, which can cause performance issues.
  • The discussion highlights the importance of structuring data effectively to leverage the system's capabilities fully.
  • Due to time constraints, further questions were left unanswered; however, contact information was provided for follow-up inquiries.

Video description

NoSQL schemas are designed with very different goals in mind than SQL schemas. Where SQL normalizes data, NoSQL denormalizes. Where SQL joins ad-hoc, NoSQL pre-joins. And where SQL tries to push performance to the runtime, NoSQL bakes performance into the schema. Join us for an exploration of the core concepts of NoSQL schema design, using Scylla as an example to demonstrate the tradeoffs and rationale.