Wednesday, February 10, 2016

Data Warehouse Concepts - 2

Data warehouse Schemas

Much like a database, a data warehouse also requires to maintain a schema. A database uses relational model, while a data warehouse uses Star, Snowflake, and Fact Constellation schema.

Star Schema
·         There is a fact table at the center which connects directly to the dimensions through the keys.
·         Each dimension in a star schema is represented with only one-dimension table.
·         This dimension table contains all the set of attributes.
·         May contain redundant data due to lack of normalization

Snowflake Schema
·         The fact table can be connected to multiple dimensions directly or indirectly.
·         Each dimension table can be normalized into multiple dimension tables (which results in the indirect dependency of the fact table).
·         Attributes are split across multiple dimension tables
·         Due to normalization in the Snowflake schema, the redundancy is reduced and therefore, it becomes easy to maintain and the save storage space.

Fact Constellation Schema
·         A fact constellation is a collection of multiple fact tables sharing dimension tables, viewed as a collection of stars. This is an improvement over Star schema.
·         It is also known as galaxy schema.


Data Modeling

A data model organizes data elements and standardizes how the data elements relate to one another.

Data modeling (or modelling) in software engineering is the process of creating a data model for an information system by applying formal data modeling techniques.

Widely used data modeling techniques are:

ER Model
·         ER model or entity-relationship model is a particular methodology of data modeling wherein the goal of modeling is to normalize the data by reducing redundancy.
·         This is different than dimensional modeling where the main goal is to improve the data retrieval mechanism.

Dimensional Model
Dimensional model consists of dimension and fact tables. Fact tables store different transactional measurements and the foreign keys from dimension tables that qualifies the data. The goal of Dimensional model is not to achieve high degree of normalization but to facilitate easy and faster data retrieval.

Dimensions and Facts

Dimensions

A dimension is something that qualifies a quantity (measure).
Dimensions are mutually independent. Technically speaking, a dimension is a data element that categorizes each item in a data set into non-overlapping regions.

·         Conformed Dimensions
·         Junk Dimensions
·         Role-playing Dimensions
·         Slowly Changing Dimensions
·         Degenerate Dimensions
·         Rapidly Changing Dimensions
·         Inferred Dimensions
·         Shrunken Dimensions
·         Static Dimensions

Conformed Dimensions
A conformed dimension is a dimension that has the same meaning to every fact with which it relates. Conformed dimensions allow facts and measures to be categorized and described in the same way across multiple facts and/or data marts, ensuring consistent reporting across the enterprise.
Eg: The date dimension table connected to the sales facts is identical to the date dimension connected to the inventory facts


Facts

A fact is something that is quantifiable (Or measurable). Facts are typically (but not always) numerical values that can be aggregated.

  • Fully Additive Facts: Additive measures can be used with any aggregation function like Sum(), Avg() etc. Example is Sales Quantity etc.
  • Semi Additive Facts: Semi-additive measures are those where only a subset of aggregation function can be applied. Let’s say account balance. A sum() function on balance does not give a useful result but max() or min() balance might be useful. Consider price rate or currency rate. Sum is meaningless on rate; however, average function might be useful.
  • Non-additive Facts: Non-additive measures are those which cannot be used inside any numeric aggregation function (e.g. SUM(), AVG() etc.). For example, profit margin is a fact which cannot be added across any of the dimensions. For example, if product P1 has a 10% profit and product P2 has a 10% profit then your net profit is still 10% and not 20%.  We cannot add profit margins across product dimensions. Similarly, if your profit margin is 10% on Day1 and 10% on Day2, then your net Profit Margin at the end of Day2 is still 10% and not 20%.
  • Derived Facts: Derived facts are the facts which are calculated from one or more base facts, often by applying additional criteria. Often these are not stored in the cube and are calculated on the fly at the time of accessing them. For example, profit margin.
  • Factless Facts: A factless fact table is one which only has references (Foreign Keys) to the dimensions and it does not contain any measures. These types of fact tables are often used to capture events (valid transactions without a net change in a measure value). For example, a balance enquiry at an automated teller machine (ATM). Though there is no change in the account balance, this transaction is still important for analysis purposes.
  • Textual Facts: Textual facts refer to the textual data present in the fact table, which is not measurable (non-additive), but is important for analysis purposes. For example, codes (i.e. product codes), flags (i.e. status flag), etc.


No comments:

Post a Comment