Conception d'un entrepôt de données (DataWarehouse) #datawarehouse #businessintelligence #ssis #etl

Name: Conception d'un entrepôt de données (DataWarehouse) #datawarehouse #businessintelligence #ssis #etl
Uploaded: 2023-02-13T17:13:28.000Z
Duration: 1 h 32 min 31 s
Description: Introduction - Multidimensional representation - Implementing a data warehouse - Modeling: star and snowflake schemas - Designing a data warehouse

Data Warehousing and Business Intelligence Overview

This section discusses the concept of data warehousing, its importance in business intelligence, and how it supports decision-making processes.

Introduction to Data Warehousing

The discussion begins with an introduction to data warehouses and their role in business intelligence, emphasizing the storage of data organized by theme.

A detailed exploration of how to design a data warehouse will follow the initial introduction.

Structure and Implementation of Data Warehouses

The plan includes an introduction to data warehouses and Data Marts, detailing what a Data Mart is.

Multi-dimensional representations of data warehouses will be presented along with implementation strategies using star and snowflake schemas.

Transition from Production Systems to Decision-Making Systems

The objective is to transform production-oriented information systems into decision-support systems that aid managers in making informed decisions.

Examples include managing orders, stock, clients, suppliers, and human resources within production-oriented systems.

Decision-Making Insights

Emphasis on identifying key metrics such as best customers by region for targeted promotions or market share analysis for marketing strategies.

Utilization of employee statistics for recruitment purposes based on various demographics like age or specialization.

Definition and Characteristics of Data Warehouses

Bill Inmon's 1996 definition describes a data warehouse as a subject-oriented, integrated, non-volatile, and time-variant collection of data that supports decision-making.

Data integration involves consolidating heterogeneous sources into a coherent structure for analysis.

Data Integration Process

Discusses the necessity for cleaning and normalizing diverse data formats during the loading process.

Sources can include SQL databases, flat files, and Excel sheets, which illustrates how heterogeneous the incoming data formats can be.
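The cleaning and normalization step can be sketched in Python as follows. This is a minimal illustration, assuming two hypothetical sources (a semicolon-separated flat file and rows read from a SQL database) that encode names, dates, and amounts differently. All field names and sample values are assumptions made for the example, not taken from the lecture.

```python
import csv
import io
from datetime import datetime

# Hypothetical flat-file source: day/month/year dates, comma as decimal separator.
CSV_SOURCE = "client;date;amount\nDupont;13/02/2023;1200,50\nMartin;14/02/2023;80,00\n"

# Hypothetical SQL source: ISO dates, amounts already numeric.
SQL_ROWS = [{"customer": "Durand", "sold_on": "2023-02-15", "total": 310.0}]


def from_csv(text):
    """Normalize flat-file rows into the common target format."""
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    for row in reader:
        yield {
            "customer": row["client"].strip().upper(),
            "date": datetime.strptime(row["date"], "%d/%m/%Y").date().isoformat(),
            "amount": float(row["amount"].replace(",", ".")),
        }


def from_sql(rows):
    """Normalize database rows into the same target format."""
    for row in rows:
        yield {
            "customer": row["customer"].strip().upper(),
            "date": row["sold_on"],
            "amount": float(row["total"]),
        }


# Both sources end up in one coherent staging structure.
staged = list(from_csv(CSV_SOURCE)) + list(from_sql(SQL_ROWS))
for record in staged:
    print(record)
```

After this step every record has the same keys, ISO-formatted dates, and numeric amounts, whatever its origin.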

Non-volatility and Historical Perspective

Explains that once loaded into a warehouse, data remains unchanged over time; it is used solely for analysis rather than manipulation.

Because history is kept, changes can be followed over time; how far back data is archived depends on what the analyses require.

Differences Between Data Warehouses and Transactional Databases

Clarifies that while transactional databases handle real-time transactions (insertions/modifications), data warehouses are designed specifically for analytical purposes.

Understanding Transactional and Analytical Database Systems

This section provides an overview of the differences between transactional databases (OLTP) and analytical databases (OLAP), focusing on their structures, purposes, and data handling methods.

Characteristics of Transactional Databases

Transactional systems belong to the category of software known as Online Transaction Processing (OLTP), where data is processed in real-time by end-users who can insert, modify, or delete records.

These databases are designed for operational tasks and support complex relationships among many tables, which are normalized to avoid redundancy and are therefore linked through joins.

Normalization in OLTP aims to reduce storage size and ensure data integrity, which is crucial for maintaining accurate records during high-volume transactions.

In contrast to analytical systems, transactional databases prioritize write operations; they are optimized for quick processing of numerous transactions like inserts, updates, and deletions.

The design model used here is typically Entity-Relationship (ER), adhering to normalization standards such as the third normal form.

Features of Analytical Databases

Analytical databases or Data Warehouses utilize a different approach; they store data in a denormalized format aimed at facilitating read-only access for decision-makers.

Data stored in these warehouses is primarily sourced from transactional databases but transformed into formats suitable for analysis rather than production use.

OLAP systems focus on optimizing read operations with prepared datasets that enhance performance during complex analytical queries involving aggregations.

Unlike OLTP systems characterized by rapid transaction processing, OLAP systems handle extensive queries that require significant computational resources due to their complexity.

The primary goal of OLAP is not just speed but also the ability to maintain multi-user access while ensuring data integrity across various analyses.

Querying Differences Between OLTP and OLAP

Queries in transactional systems are straightforward—typically asking "what" or "who," such as retrieving product prices or sales figures linked to specific representatives.

Conversely, analytical queries are more complex; they often involve multiple aggregations necessary for extracting insights from large datasets intended for strategic decision-making.

The response time for these analytical queries becomes critical since they need efficient execution despite their complexity.

Data extracted from OLAP systems frequently undergoes further analysis using machine learning algorithms aimed at uncovering hidden patterns or supporting automated decision-making processes.

Historical data representation within these warehouses employs multidimensional models—such as star schema or snowflake schema—to facilitate easier analysis.

Example Use Case: Relational Database System

An example illustrates a relational database system where entities include product tables, sales tables, and store tables interconnected through associations.

Understanding OLAP and OLTP Systems

This section explores the differences between Online Analytical Processing (OLAP) and Online Transaction Processing (OLTP) systems, focusing on their structures, purposes, and data handling capabilities.

Key Differences Between OLAP and OLTP

Queries in OLAP are more complex than those in OLTP, often requiring joins across several tables, intricate predicates, and grouping of data, for example by city or region.

Analytical queries in OLAP are generally more resource-intensive compared to transactional queries in OLTP, which focus on simple operations like insertion or deletion.

It is not advisable to perform analytical queries on transactional systems (OLTP); instead, dedicated analytical systems (OLAP) should be utilized for better efficiency.

Data from OLTP systems is operationally focused, while OLAP systems utilize consolidated data for planning and decision-making processes.

In contrast to the real-time updates typical of OLTP systems, historical data with varying granularities is essential for effective analysis in OLAP environments.
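To make the contrast between the two query styles concrete, here is a small sketch using Python's built-in `sqlite3` module. The product, store, and sale tables echo the relational example mentioned earlier, but their columns and sample rows are assumptions chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE store   (store_id INTEGER PRIMARY KEY, city TEXT, region TEXT);
CREATE TABLE sale    (sale_id INTEGER PRIMARY KEY,
                      product_id INTEGER REFERENCES product(product_id),
                      store_id   INTEGER REFERENCES store(store_id),
                      quantity   INTEGER);
INSERT INTO product VALUES (1, 'Printer', 200.0), (2, 'Toner', 50.0);
INSERT INTO store   VALUES (1, 'Lyon', 'South'), (2, 'Lille', 'North');
INSERT INTO sale    VALUES (1, 1, 1, 2), (2, 2, 1, 4), (3, 1, 2, 1);
""")

# OLTP-style query: a single row looked up by its key.
price = conn.execute(
    "SELECT price FROM product WHERE product_id = ?", (2,)
).fetchone()[0]

# OLAP-style query: several joins plus an aggregation grouped by city.
revenue_by_city = conn.execute("""
    SELECT st.city, SUM(s.quantity * p.price) AS revenue
    FROM sale s
    JOIN product p ON p.product_id = s.product_id
    JOIN store st  ON st.store_id  = s.store_id
    GROUP BY st.city
    ORDER BY st.city
""").fetchall()

print(price)            # 50.0
print(revenue_by_city)  # [('Lille', 200.0), ('Lyon', 600.0)]
```

The first query touches one row; the second has to read and combine every sale, which is why such queries are better run on a dedicated analytical system.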

Data Storage and Structure

The speed of query execution differs significantly; OLTP queries are quick due to their simplicity, whereas OLAP queries can be lengthy depending on the volume of data processed.

While archival storage requirements for OLTP are minimal, they are substantial for OLAP due to the need for historical data aggregation.

Data normalization is a key feature of relational databases used in OLTP; however, multidimensional models with fewer tables characterize the structure of data in an OLAP system.

Purpose and Usage

The primary function of data within an OLTP system is operational support; losing this data can disrupt business functions. Conversely, data in an OLAP system is derived from the source systems and can be reloaded from them if needed, so its loss is less critical.

Decision-makers benefit from using an OLAP system as it provides comprehensive insights into various business themes through multidimensional representations of company data.

Visualization Techniques

For example, sales performance can be visualized across multiple dimensions such as seller performance by date or region—allowing a 4D representation that enhances understanding of sales trends.

A product-oriented approach allows businesses to analyze sales quantities by region effectively. This targeted analysis aids decision-making regarding product management strategies.
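Viewing the same measure along different dimensions can be sketched in a few lines of Python. The fact rows, axis names, and the `roll_up` helper below are hypothetical and only illustrate the idea of collapsing the dimensions that are not of interest.

```python
from collections import defaultdict

# Hypothetical sales facts: (seller, date, region, product, quantity sold).
FACTS = [
    ("Alice", "2023-01", "North", "Printer", 3),
    ("Alice", "2023-02", "North", "Toner", 5),
    ("Bob",   "2023-01", "South", "Printer", 2),
    ("Bob",   "2023-02", "North", "Printer", 4),
]
AXES = ("seller", "date", "region", "product")


def roll_up(facts, *dims):
    """Sum quantities along the requested dimensions, collapsing the others."""
    positions = [AXES.index(d) for d in dims]
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]  # last element is the quantity
    return dict(totals)


print(roll_up(FACTS, "region"))
print(roll_up(FACTS, "product", "region"))
```

Calling `roll_up(FACTS, "region")` gives total quantities per region, while `roll_up(FACTS, "product", "region")` gives the product-oriented view by region described above.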

Importance of Time Dimension

The time dimension plays a crucial role in database design; typically necessitating a dedicated time dimension table within the schema for effective temporal analysis across datasets.
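A dedicated time dimension table is usually generated rather than extracted from a source system. The sketch below, using Python's `sqlite3` module, fills a hypothetical `dim_date` table with one row per day of 2023. The table name, the column names, and the `YYYYMMDD` integer key are assumptions for illustration.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_date (
        date_key  INTEGER PRIMARY KEY,  -- integer key such as 20230213
        full_date TEXT,
        year      INTEGER,
        month     INTEGER,
        day       INTEGER,
        iso_week  INTEGER
    )
""")


def date_rows(start, end):
    """Yield one dimension row per calendar day between start and end (inclusive)."""
    current = start
    while current <= end:
        yield (
            current.year * 10000 + current.month * 100 + current.day,
            current.isoformat(),
            current.year,
            current.month,
            current.day,
            current.isocalendar()[1],
        )
        current += timedelta(days=1)


conn.executemany(
    "INSERT INTO dim_date VALUES (?, ?, ?, ?, ?, ?)",
    date_rows(date(2023, 1, 1), date(2023, 12, 31)),
)
row_count = conn.execute("SELECT COUNT(*) FROM dim_date").fetchone()[0]
days_in_feb = conn.execute("SELECT COUNT(*) FROM dim_date WHERE month = 2").fetchone()[0]
print(row_count, days_in_feb)  # 365 28
```

Storing year, month, week, and day as separate columns lets queries group facts at whichever level of the time hierarchy an analysis needs.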

Understanding the Entity-Relationship Principle in Decision Systems

Section Overview

This section covers how the entity-relationship principle carries over to decision systems, focusing on the data warehousing concepts of dimensions and measures. It outlines how these elements interact to support data analysis.

Dimensions and Measures

In decision systems, the entity-relationship principle is expressed through dimensions and measures. Measures are the numerical values compared during analysis, such as sales amounts or quantities sold.

Dimensions serve as viewpoints from which measures can be observed, encompassing subjects like location, product, and date. Data is extracted using queries that perform aggregation operations.

A dimension is defined as a subject or theme along which data is analyzed, such as a customer, product, or time dimension; it holds the descriptive information that gives the measures their context.

Each dimension consists of members organized hierarchically; for instance, time can be grouped by week, month, or day based on different levels of granularity.

Each dimension table is identified by a primary key (a simple integer is recommended), which appears as a foreign key in the fact table. A time dimension is essential in all data warehouse models.

Modifying Data in Dimension Tables

After loading data into a system, modifications may be necessary (e.g., changing a customer's address). There are three options: overwrite old data if previous states aren't needed; add a version attribute to track changes; or save both original and modified values with an additional attribute for the old value.
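The three options correspond to what is commonly called slowly changing dimension handling (types 1, 2, and 3). The following Python sketch applies each option to a hypothetical one-row customer dimension. The column names (`customer_key`, `version`, `previous_address`) are assumptions chosen for the example.

```python
import copy

# Hypothetical customer dimension with a single row.
ORIGINAL = [{"customer_key": 1, "customer_id": "C42", "address": "Paris"}]


def overwrite(dim, customer_id, new_address):
    """Option 1: overwrite in place; the previous address is lost."""
    for row in dim:
        if row["customer_id"] == customer_id:
            row["address"] = new_address


def add_version(dim, customer_id, new_address):
    """Option 2: keep the old row and add a new row with a higher version."""
    versions = [r for r in dim if r["customer_id"] == customer_id]
    latest = max(versions, key=lambda r: r.get("version", 1))
    latest.setdefault("version", 1)
    dim.append({
        "customer_key": max(r["customer_key"] for r in dim) + 1,
        "customer_id": customer_id,
        "address": new_address,
        "version": latest["version"] + 1,
    })


def keep_previous(dim, customer_id, new_address):
    """Option 3: store the old value in an extra attribute on the same row."""
    for row in dim:
        if row["customer_id"] == customer_id:
            row["previous_address"] = row["address"]
            row["address"] = new_address


type1, type2, type3 = (copy.deepcopy(ORIGINAL) for _ in range(3))
overwrite(type1, "C42", "Lyon")
add_version(type2, "C42", "Lyon")
keep_previous(type3, "C42", "Lyon")

for result in (type1, type2, type3):
    print(result)
```

Option 2 keeps the full history at the cost of extra rows, while option 3 keeps only one previous value per tracked attribute.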

Fact Tables and Performance Indicators

Fact tables hold the measure values and link them to the dimension tables. A measure is either recorded directly when the fact occurs or calculated from other existing data.

The fact table sits centrally among dimension tables and contains performance indicators. It includes quantitative attributes that form the basis of analysis.

Fact tables are identified by composite keys formed by concatenating the primary keys of the related dimension tables. Typical measures stored alongside the key include sales quantity and revenue.

Types of Facts

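Different types of facts exist: additive (e.g., total sales, which can be summed across every dimension), semi-additive (e.g., stock levels, which can be summed across some dimensions but not over time), and non-additive (e.g., a gross margin rate, which cannot be summed at all).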
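The difference between the three kinds of facts shows up as soon as values are aggregated over time. The short Python sketch below uses two hypothetical daily rows for one product; the figures are invented for illustration.

```python
# Hypothetical daily facts for one product.
DAILY = [
    {"day": "2023-02-13", "revenue": 1000.0, "cost": 600.0, "stock_level": 40},
    {"day": "2023-02-14", "revenue": 500.0,  "cost": 400.0, "stock_level": 35},
]

# Additive: revenue can be summed across any dimension, including time.
total_revenue = sum(r["revenue"] for r in DAILY)
total_cost = sum(r["cost"] for r in DAILY)

# Semi-additive: stock levels must not be summed over time;
# use the closing (last) level or an average instead.
closing_stock = DAILY[-1]["stock_level"]

# Non-additive: a margin rate cannot be summed or averaged row by row;
# it has to be recomputed from the additive components.
margin_rate = (total_revenue - total_cost) / total_revenue
naive_average = sum(
    (r["revenue"] - r["cost"]) / r["revenue"] for r in DAILY
) / len(DAILY)

print(total_revenue, closing_stock)
print(round(margin_rate, 4), round(naive_average, 4))
```

Here the correct margin rate over both days is about 0.33, whereas averaging the two daily rates gives 0.30, which is why non-additive facts are usually stored as their additive components.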

Logical Schema Description

The logical description of a data warehouse relies on relational modeling with entity-relationship diagrams, in which entities (tables) are linked by relationships.

In multi-dimensional modeling for data warehouses, three schema types are defined: star schema, snowflake schema, and constellation schema—each illustrating relationships between fact and dimension tables differently.

Schema Models Explained

The star schema resembles an actual star with fact tables at the center connected directly to various dimension tables without interconnections among them.

Snowflake schemas feature normalized dimensions that branch out like snowflakes from central fact tables while allowing connections between some dimensions themselves.

Understanding Star and Snowflake Schema in Data Warehousing

This section provides an overview of the star and snowflake schema models used in data warehousing, focusing on their structures, relationships, and applications.

Star Schema Overview

In a star schema model, the fact table contains numerical measures essential for analysis, such as sales amounts, quantities sold, distance traveled, and ordered quantities.

Dimension tables are utilized by decision-makers to analyze the fact table; they allow extraction of sales amounts by vendor or client and other metrics like mileage per vendor-client pair.

The granularity of both dimension and fact tables must align; each row in the fact table should relate to all dimension tables. The fact table serves as the central hub for these dimensions.

There are no direct relationships between dimension tables; instead, connections exist solely between dimensions and the fact table.
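A star schema along these lines can be sketched with Python's `sqlite3` module. The table and column names (`dim_vendor`, `dim_client`, `dim_date`, `fact_sales`) and the sample rows are assumptions for illustration; the fact table's composite primary key is built from the dimension keys, and the dimensions are only linked through the fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_vendor (vendor_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_client (client_key INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE dim_date   (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);

-- Central fact table: one row per vendor, client, and month.
CREATE TABLE fact_sales (
    vendor_key    INTEGER REFERENCES dim_vendor(vendor_key),
    client_key    INTEGER REFERENCES dim_client(client_key),
    date_key      INTEGER REFERENCES dim_date(date_key),
    sales_amount  REAL,
    quantity_sold INTEGER,
    distance_km   REAL,
    PRIMARY KEY (vendor_key, client_key, date_key)
);

INSERT INTO dim_vendor VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO dim_client VALUES (1, 'Acme', 'Lyon'), (2, 'Globex', 'Lille');
INSERT INTO dim_date   VALUES (202301, 2023, 1), (202302, 2023, 2);
INSERT INTO fact_sales VALUES
    (1, 1, 202301, 900.0, 3, 120.0),
    (1, 2, 202302, 300.0, 1, 80.0),
    (2, 1, 202302, 600.0, 2, 45.0);
""")

# Sales amount and distance traveled per vendor: one join per dimension used.
sales_by_vendor = conn.execute("""
    SELECT v.name, SUM(f.sales_amount), SUM(f.distance_km)
    FROM fact_sales f
    JOIN dim_vendor v ON v.vendor_key = f.vendor_key
    GROUP BY v.name
    ORDER BY v.name
""").fetchall()
print(sales_by_vendor)  # [('Alice', 1200.0, 200.0), ('Bob', 600.0, 45.0)]
```

Analyzing by client or by month only requires swapping the joined dimension; the fact table itself stays unchanged.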

Snowflake Schema Overview

The snowflake schema is a variant of the star schema that addresses performance issues when dealing with large dimensions containing millions of rows.

This model organizes data hierarchically across multiple related dimension tables, which reduces redundancy and keeps each dimension table smaller; the trade-off is that queries need additional joins.

In contrast to star schemas where dimensions directly connect to the fact table, snowflake schemas allow dimensions to link with other dimension tables without direct ties to the fact table.

Snowflake schemas typically feature varying granularities across different dimension tables while maintaining a cohesive subject focus.

Practical Application Example

To illustrate these concepts practically, an example will be provided comparing both star and snowflake models for designing a data warehouse focused on sales representatives' activities.

Decision-makers define requirements based on business needs; they specify subjects for analysis along with necessary measures. For instance, analyzing printer sales representatives' performance is highlighted.

Data Extraction Process

A goal is set to create a star schema that allows insights into sales activities—identifying coverage areas for each salesperson and evaluating their effectiveness through various metrics like average sales figures.

Data extraction involves identifying sources from transactional systems where relevant information resides before transferring it into a designated data warehouse for further analysis.

System Integration Considerations

Data may originate from centralized management systems or disparate departmental systems depending on organizational structure.

Access to diverse datasets, such as those from the human resources management system or the sales management system, including travel expenses or fuel usage metrics, is crucial for a comprehensive analysis of representatives' activities.

Analysis Framework Development

Identifying analytical axes equivalent to dimensions alongside their attributes forms part of designing effective data warehouses.

It’s important to gather indicators during initial phases by asking targeted questions about desired analyses—such as total sales amounts or number of visits made by representatives.

Data Analysis and Warehouse Design: Understanding Key Concepts

This section provides an overview of the critical aspects of data analysis and warehouse design, focusing on the necessary criteria for effective data management and analysis.

Criteria for Data Analysis

The analysis should consider various criteria such as date, vendor, product, geographical area, or client to ensure comprehensive insights.

It is essential to determine the level of granularity required for each dimension; for instance, dates should be analyzed by year, month, day, and even hour.

Identifying dimensions like date, vendor, product, geographical areas, and clients helps in structuring the data effectively for decision-makers.

Each dimension has its own attributes and levels of detail: vendors are described by name or initials; products by category or type; geographical areas by country or city; clients by name and address.

Time is a crucial axis in data warehousing since historical data is vital for business management.

Data Warehouse Structure

The star schema model is introduced: a fact table containing the performance indicators sits at the center, connected to the various dimension tables.

Fact tables are identified using composite keys derived from dimension table keys which help in maintaining relationships between different datasets.

The goal is to model a sales data warehouse based on product families while ensuring it can provide revenue figures per product over time.

Advanced Modeling Techniques

A snowflake schema variant may be used for performance reasons when dealing with large dimensions that could slow down data exploration processes.

Snowflake modeling involves creating hierarchical structures within dimension tables to reduce row counts per dimension while maintaining detailed information.

For example, if products are grouped into categories and repeating category details on every product row inflates the product dimension table, a separate category table can be created and linked back to the product table.

This approach allows better organization of dimensional hierarchies without cluttering fact tables directly with excessive detail.
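The product and category example can be sketched as a snowflaked dimension with Python's `sqlite3` module. The schema below is a simplified assumption in which each product belongs to exactly one category; table names, columns, and sample rows are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Category is split out of the product dimension (snowflaking).
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);

-- The fact table references only the product dimension, not the category.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER,
    revenue     REAL,
    PRIMARY KEY (product_key, date_key)
);

INSERT INTO dim_category VALUES (1, 'Printers'), (2, 'Consumables');
INSERT INTO dim_product  VALUES
    (1, 'Laser printer', 1), (2, 'Inkjet printer', 1), (3, 'Toner', 2);
INSERT INTO fact_sales   VALUES
    (1, 202301, 500.0), (2, 202301, 300.0), (3, 202301, 90.0), (3, 202302, 60.0);
""")

# Reaching the category level takes one extra join through dim_product.
revenue_by_category = conn.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p  ON p.product_key  = f.product_key
    JOIN dim_category c ON c.category_key = p.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(revenue_by_category)  # [('Consumables', 150.0), ('Printers', 800.0)]
```

The category name is stored once per category instead of once per product, and the extra join is the price paid for that normalization.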

Conclusion of Design Principles