=Paper=
{{Paper
|id=Vol-39/paper-6
|storemode=property
|title=An Analysis of Many-to-Many Relationships Between Fact and Dimension Tables in Dimensional Modeling
|pdfUrl=https://ceur-ws.org/Vol-39/paper6.pdf
|volume=Vol-39
|authors=I. Song,B. Rowan,C. Medsker,E. Ewen
|dblpUrl=https://dblp.org/rec/conf/dmdw/SongRME01
}}
==An Analysis of Many-to-Many Relationships Between Fact and Dimension Tables in Dimensional Modeling==
An Analysis of Many-to-Many Relationships Between Fact and
Dimension Tables in Dimensional Modeling
Il -Yeol Song William Rowen
College of Information Science and Technology College of Information Science and Technology
Drexel University Drexel University
Philadelphia, PA 19104 Philadelphia, PA 19104
songiy@drexel.edu msis@drexel.edu
Carl Medsker Edward Ewen, M.D
Arynth, Inc. Christiana Care Health System
Cinnaminson, NJ 08077 Wilmington, DE 19899
cmedsker@arynth.com eewen@christianacare.org
1. Introduction
Abstract
The data warehouse (DW) is an integrated repository of
Star schema, which maintains one-to-many data, generated and used by an entire organization. The
relationships between dimensions and a fact table, data warehouse employs a suite of tools that transforms
is widely accepted as the most viable data raw data into meaningful business information. This
representation for dimensional analysis. Real- information depicts a view of a distinct business process to
world DW schema, however, frequently includes identify trends and patterns and serves as a foundation for
many-to-many relationships between a dimension decision-making.
and a fact table. Having those relationships in a
dimensional model causes several difficult issues, The dimensional model is a logical representation of a
such as losing the simplicity of the star schema business process whose significant features are user
structure, increasing complexity in forming understandability, query performance, and resilience to
queries, and degrading query performance by change. Dimensional modeling is widely accepted as the
adding more joins. Therefore, it is desirable to viable technique for delivering data to end users in a data
represent the many-to-many relationships with warehouse [KRRT98, AM97, AV98, AS97, DSHB98,
correct semantics while still keeping the structure MC98]. The main components of a dimensional model are
of the star schema. fact tables and dimension tables. A fact table contains
In this paper, we analyze many-to-many measurements of the business or records events. A
relationships between a dimension table and a fact dimension table contains attributes used to constrain,
table in dimensional modeling. We illustrate six group, or browse the fact data. There are two primary
different approaches and show the advantages and advantages of using a dimensional model in data
disadvantages of each. We propose two ad-hoc warehouse environments. First, a dimensional model
methods that maintain a star schema structure by provides a multidimensional analysis space in relational
denormalizing the dimensions to avoid many-to- database environments; we are analyzing factual data using
many relationships. This method allows quick dimensions. Second, a typical denormalized dimensional
query processing by using a concatenated attribute model has a simple schema structure, which simplifies end-
with minimal overhead. Other issues addressed user query processing and improves performance.
are data redundancy, weighting factors, storage
requirements, and performance concerns. The dimension tables contain a large number of
attributes, reflecting the details of the business processes.
Browsing is a user activity that explores the relationships
The copyright of this paper belongs to the paper’s authors. Permission to copy between attributes in a dimension table. The attributes will
without fee all or part of this material is granted provided that the copies are not serve as row headers and constraints for these views. It is
made or distributed for direct commercial advantage. common to have more than one hundred attributes in a real
Proceedings of the International Workshop on Design and world application. Dimension tables are considered wide
Management of Data Warehouses (DMDW'2001) for this reason. Denormalization of dimension tables is an
Interlaken, Switzerland, June 4, 2001 acceptable practice in data warehousing. A dimensional
(D. Theodoratos, J. Hammer, M. Jeusfeld, M. Staudt, eds.) model with highly normalized dimension structure is called
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-39/
I-Y. Song, W.Rowen, C. Medsker, E. Ewen 6-1
a snowflake schema [KRRT98]. Any attempts to dimension tables are typically joined to the fact table with
normalize a dimension table into a series of tables could a one-to-many relationship. When all the dimensions are
reduce the browsing capabilities of the user, resulting in related by one-to-many relationships with the fact table, the
more complex queries and increased retrieval time. Our schema is called a star schema. However, real-world DW
experiences with real-world data warehouse development schema frequently includes many-to-many relationships
shows that browsing and group-by queries are the two between a dimension and a fact table [KRRT98, AS97].
salient issues that drive the design of data warehouses. Having those relationships in a dimensional model causes
several difficult issues, such as losing the star schema
The fact table is where the numerical measurements of structure, increasing complexity in forming queries, and
the business processes are stored. These measurements or degrading query performance by adding more joins.
events are related to each dimension table by foreign keys. Therefore, it is desirable that we handle the many-to-many
The fact table contains thousands, or even millions of rows relationships while still keeping the structure of the star
of records. A typical query will compress or extract a large schema.
number of records into a handful of rows using
aggregation. Therefore, the most useful facts are numeric, In this paper, we analyze many-to-many relationships
continuously valued, and additive; Kimball calls this between a dimension table and a fact table in dimensional
premise the holy grail of dimensional database design modeling. Even though there are some previous studies on
[Kimb96]. how to represent a data warehouse conceptual schema
[GR98, SBHD98, TBC99] or how to derive/design a data
The grain of the fact table is a very important warehouse schema [AM97, KS97, TS98, LAW98, PJ99,
characteristic. The grain is the level of detail at which MK00, HLB00], the specific method of handling many-to-
measurements or events are stored. It determines the many relationships is rarely addressed. Two sources we
dimensionality of the data warehouse and dramatically found are books by Kimball et al. [KRRT98] and
impacts the size and conversely the performance. Giovinazzo [Giov00]. Not being satisfied by those
approaches for our real-world project, we have performed
The goal in designing the data warehouse model is to a thorough study on how to handle many-to-many
keep it simple to understand, simple to load with relationships. In this paper, we illustrate six different
operational data, and as fast as possible to query [Kimb96, approaches and show the advantages and disadvantages of
Kimb97, KRRT98, AM97, AV98]. We would like to have each. We propose two ad-hoc methods that maintain a star
neophyte and experienced business analysts creating schema structure by denormalizing the dimension to avoid
reports, so the logical model needs to be easy to many-to-many relationships. These methods allow us to
comprehend. Most business analysts frequently have a quickly process queries. Other issues that will be addressed
difficult time finding data in both highly normalized include data redundancy, weighting factors, storage
designs and abstract object designs. The flatter the requirements, and performance concerns.
dimensional model, the better for end-users. The more
complex the model, the more complex will be the The remainder of this paper is organized as follows:
extract/transform/load (ETL) routines to create and run. Section 2 presents a motivation example. Section 3
presents six approaches and discusses the advantages and
Finally, queries against the database will run faster if a disadvantages. Sections 4 presents a summary table and
minimal number of one-to-many relationships and joins are Section 5 concludes our paper.
present. To provide users with the views they need for
analysis, the one-to-many relationships between facts and 2. Motivational Example
dimensions should be flattened into a series of views or
In the healthcare billing process, there are usually
derived tables. For instance, the statistician may want to
multiple diagnoses for each patient visit. A design problem
create a regression model against diagnoses with the grain
arises in modeling a diagnosis dimension that has a many-
of the analysis being a single visit to the hospital.
to-many relationship with a fact table as shown in Figure 1.
Therefore, each row must completely define a visit with
We will explore specific data warehousing structures to
columns for specific diagnoses or columns that represent
analyze this predicament. We will use, as an illustrated
groups of diagnoses. To meet the fundamental goal of
example, the patient-billing situation throughout this paper
empowering end users to perform their own queries and
to compare and contrast the different solutions.
analyses, the design must balance elegance in conceptual
design with understandability, usability, and performance.
In Figure 1, the relationship between the diagnosis
dimension and the billable patient encounter fact table is
Design principles dictate that one should identify any
illustrated as a many-to-many. This considers the situation
dimensional attribute that has a single value for an
where a patient has more than one diagnosis for each
individual fact table record. The designer can build
billable encounter.
outward from the grain of the fact table and relate as many
dimensions as the business process demands. Therefore,
I-Y. Song, W.Rowen, C. Medsker, E. Ewen 6-2
Billable Patient factor will not directly add up (see Section 3.1.1 for the
Encounter Fact Table issues of weighting factors in a many-to-many
time_key (FK) relationship). You must guarantee by some other means
Diagnosis Dimension patient_key (FK) that the correct weight is applied for any subset of
diagnosis_key (PK) provider_key (FK) diagnoses. Users must be protected from retrieving a
diag_name location_key (FK)
payer_key (FK)
subset of data that aggregates incorrectly, which will occur
diag_description
procedure_key (FK) if no precautions are taken.
diagnosis_key (FK)
Note: only diagnosis dimension billedtopayer_amount 3. Methods for Handling Many-to-Many
is illustrated billedtopatient_amount Relationships
3.0. Assumptions
Figure 1: Healthcare Billable Encounter Schema Based on our experience of building a real-world data
[KRRT98] warehouse in a patient-billing domain, we have assumed
the following data for our analysis. The fact table contains
There are inherent problems with many-to-many patient billing information and each bill is assigned one
relationships between a fact table and a dimension. primary diagnosis and one to many secondary diagnoses.
Querying for records to find a particular combination of Although it is theoretically possible to have hundreds of
diagnoses requires multiple correlated subqueries. diagnoses, the maximum in practice is twenty or less.
Consider the query for retrieving 'billed to payer amount' Frequency distributions on an existing operational database
and 'patient key' for patients who have a combination of show that most bills have fewer than five secondary
diagnoses named 'heart' and 'cancer’ (for the remainder of diagnoses, with very few bills having more than 10. These
the discussion we will refer to the diagnosis dimension as are entered into the operational system in no particular
DD and the billable patient encounter fact table as BPE). order. There is no qualitative difference between secondary
1 and secondary 20. In addition, government and insurance
SELECT patient_key, billedtopayer_amount claim forms typically provide space for a maximum four or
FROM BPE eight secondary diagnoses, so the practical limit is fixed.
WHERE patient_key IN All have equal potential importance, depending on the
(SELECT patient_key context of use or the type of information that is compiled.
FROM DD, BPE Making a few assumptions can approximate the
WHERE diag_name = 'cancer' estimated size of the dimension and fact tables.
AND DD.diagnosis_key =
BPE.diagnosis_key • Fact table contains 1,000,000 records
INTERSECT • There are maximum 20 billable diagnoses for each
SELECT patient_key encounter.
FROM DD, BPE • There are maximum 500 billable diagnoses.
WHERE diag_name = 'heart' • There are on the average five separate diagnoses for
AND DD.diagnosis_key = each encounter
BPE.diagnosis_key); • All numerical field widths are an average four
bytes, names are eight bytes, and descriptions are
The subquery will select all patient numbers that have 15 bytes.
both heart and cancer diagnosis names. Queries for finding
patients with N different diagnoses will need N-level 3.1. Method A: The Bridge Table
subqueries. Therefore, report generation is very complex
and slow; you must search a large number of records with Figure 2 depicts Kimball’s use of the bridge table to
multiple correlated subqueries, increasing both the connect multiple diagnoses to a fact table [KRRT98]. The
processing time and the number of joins. bridge table is an intersection table between a diagnosis
When one requests additive measurements through the dimension table and the fact table. This table is similar to
relationship, the user may receive incorrect results. It is an intersection table that is created for a many-to-many
necessary to implement a weighting factor to give each relationship between two entities. However, what
separate diagnosis its appropriate contribution to the total distinguishes this bridge table in data warehouse modeling
bill [AV98]. from an intersection table in data modeling is the use of
An additional problem with this design is frequently a weighting factors and a diagnosis group key. A diagnosis
user may not want all of the diagnoses. When an end-user group key is assigned to clusters of diagnosis codes and the
retrieves fewer than all the diagnoses then the weighting combinations are inserted into the bridge table.
I-Y. Song, W.Rowen, C. Medsker, E. Ewen 6-3
Query: Given Tables 1, 2, and 3, find the ‘billed to payer
Diagnosis amount’ contributed by each diagnosis.
Dimension
diagnosis_key (PK) Situation 1 Impact Report:
diag_name
diag_description SELECT diag_name, SUM (billedtopayer_amount)
FROM DD,DGB,BPE
WHERE DD.diagnosis_key = DGB.diagnosis_key
AND DGB.diagnosis_group_key =
Diagnosis Group BPE.diagnosis_group_key
Bridge Table
GROUP BY diag_name;
diagnosis_key (PK)
diagnosis_group_key (PK)
weighting_factor Results:
diag_name billed_to_payer_amount
Cancer $ 3,000
Heart $ 3,000
Lung $ 2,000
Billable Patient
Encounter Fact Table
The results clearly indicate the inherent problem in a
time_key (FK)
patient-key (FK) many-to-many situation where the aggregation is counted
provider_key (FK) for the total amount for each occurrence of a diagnosis in
location_key (FK) the records (total amount billed is $8,000). Cancer
payer_key (FK) occurred in diagnosis group one and two, thus it was
procedure_key (FK) counted twice ($1,000 from diagnosis group one and $
diagnosis_group_key (FK) 2,000 from diagnosis group two returning an impact total
billedtopayer_amount of $ 3,000).
billedtopatient_amount
Situation 2 Weighting Factor Report:
Note: only diagnosis dimension is illustrated
SELECT diag_name,
Figure 2: Solving Multiple Diagnoses with a Bridge
SUM (billedtopayer_amount *weighting_factor)
Table [KRRT98]
FROM DD, DGB, BPE
WHERE DD.diagnosis_key = DGB.diagnosis_key
3.1.1. Weighting Factor AND DGB.diagnosis_group_key =
Observe the weighting factor attribute in Figure 2. The BPE.diagnosis_group_key
weighting factor is a percentage that identifies the GROUP BY diag_name;
contribution of the diagnosis to the specific encounter.
Within a diagnosis group, the sum of all the weighting Results:
factors must equal one. The weighting factor is multiplied diag_name billed_to_payer_amount
by fact values, through the joining of the two tables with Cancer $ 2,000
the diagnosis group key. In this manner, the involvement of Heart $ 800
each diagnosis in the diagnosis group is correctly Lung $ 200
calculated. Conversely, the user can request an impact
analysis, ignoring the weighting factors [KRRT98]. Such
impact reports will erroneously aggregate the amounts. It
Table 2: Diagnosis Group Bridge Table
will produce a summation based on the impact each
diagnosis has in relation to total amounts associated with
DGB
that diagnosis. Consider the following example, which
(Diagnosis Group Bridge)
shows only necessary attributes:
Diagnosis Diagnosis Weighting
key group_key factor
Table 1:Diagnosis Dimension
DK1 DG1 0.8
DD (Diagnosis Dimension) DK2 DG1 0.2
diagnosis_key diag_name
DK1 DG2 0.6
DK1 Cancer
DK2 DG2 0.3
DK2 Heart
DK3 DG2 0.1
DK3 Lung
I-Y. Song, W.Rowen, C. Medsker, E. Ewen 6-4
Table 3: Billable Patient Encounter Schema
The size of the bridge table would increase considerably
BPE if one encounter has many related diagnoses. We used an
(Billable Patient Encounter) average of five diagnoses per encounter for this example;
Patient Diagnosis billedtopayer this parameter produced a bridge table comparable to the
key group_key amount size of the fact table, as we will now demonstrate.
P1 DG1 $ 1,000
P2 DG2 $ 2,000 3.1.2. Database Sizing for the Bridge
Method.
The weighting factors produce a correct totaled report Base fact: 1,000,000 records
($3,000). During the summation, the weighting factor for Key fields = 7; Fact fields = 2; Total Fields = 9
each diagnosis key will be related to each bill through the Fact table size = 1,000,000 records * 9 fields * 4 bytes =
foreign key (diagnosis group key) found in the billable 36 MB
patient encounter table. Diagnosis dimension: 500 records
Key fields = 1; Name field =1; Description field =1
The weighting factor is necessary when using a bridge Record size = 4+8+15 =27 bytes
implementation to produce correct reports [KRRT98, Dimension table size = 500 records * 27 bytes =
AV98]. However, it is not always possible to rationalize 13,500 Bytes
the weighting factors for each diagnosis. In that case, it
would be possible to count the total diagnoses and produce Bridge table: 1,000,000 facts joined to 5 distinct diagnoses
an average cost through additional design measures. One in a diagnosis group = 5,000,000 records
method would be to add an additional attribute to the Key fields = 2; Weighting factor = 1; Total fields = 3
bridge, call it number_of_diagnosis; thus, you could divide Bridge table size=5,000,000 records * 3 fields * 4 bytes=
your impact total by this value to produce an average cost 60 MB
per diagnosis. This brute force method takes away from the Notice, if the average number of diagnosis is increased to
usefulness of your decision support based reports. Thus, it ten, our bridge table size will grow to 120 MB, nearly four
is recommended to use this method only when the correct times the size of our fact table
calculation of the weighting factors is not necessary. Total disk space = 96.1MB
A major benefit of this design is there is no fixed upper In summary, the bridge method can be considered a
limit, other than total possible diagnoses. Although in this logical solution for a many-to-many relationship with less
study, we have set an upper limit of twenty diagnoses, to redundancy. There are, however, various disadvantages to
meet the user requirements. The bridge method, as you can this method. Assigning weighting factors could prove to be
observe, implements a compound primary key for the difficult or cumbersome in a real-world environment;
bridge table comprised of diagnosis group key and additionally, adding a new diagnosis requires recalculating
diagnosis key. It is possible to find a group of related of the weighting factors. The logical structure would lose
diagnoses because the diagnosis group value is repeated for the simplicity and understandability of the star schema.
every member row in a set of diagnoses. More joins increase the overhead and query time. As
pointed out the size of the bridge table could increase
There may be other such many-to-many dimensions considerably based on the number of diagnosis assigned to
related to the same fact table, and the load times and query each diagnosis group.
times can be expected to be lengthy. For instance, there are
many procedure codes and Diagnosis Related Group
(DRG) codes assigned to a single visit or patient bill. A 3.2. Method B: Denormalizing the
DRG is a classification of a hospital stay in terms of what Dimension Table by Positional-Flag
was wrong with and what was done for a patient. There are Attributes
approximately 500 DRG codes, which are determined by a
program based on diagnoses and procedures coded in a Figure 3 illustrates denormalizing the diagnosis
standard International Classification of Disease (ICD-9) dimension using the positional-attribute approach. By
format and on patient attributes such as age, sex, and positional we mean that the location of each attribute is
duration of treatment. The DRG frequently determines the fixed. For example, the first attribute is cancer; the second
amount of reimbursements, regardless of the actually costs attribute is heart, etc. Thus, the same disease is always
incurred. A hospital visit is often coded by multiple indicated in the same column. In this method, each
systems, such as Systematized Nomenclature of Medicine diagnosis becomes a Boolean attribute being set to either
(SNOMED), Current Procedural Terminology (CPT4), and ‘TRUE’ or ‘FALSE’. For brevity and clarity, only five
others, all of which share a many-to-many relationship with attributes have been included in Figure 3.
the billable patient encounter fact table. Considering the
complexity of the healthcare billing system, the design and
performance using bridge tables will get quite complex.
I-Y. Song, W.Rowen, C. Medsker, E. Ewen 6-5
Billable Patient • there are no approaches to handle a weighting
Diagnosis Dimension Encounter Fact Table factor.
diagnosis_key (PK) INT time_key (FK)
However, a bitmap index scheme [OG95] can be
CancerFlag BOOL patient-key (FK) implemented on each positional attribute, which would
HeartFlag BOOL provider_key (FK) improve the query performance in this approach. It is clear
LungFlag BOOL location_key (FK) that this method would only be applicable when the
LiverFlag BOOL payer_key (FK) number of positional-attributes is limited and fixed.
KidneyFlag BOOL procedure_key (FK)
diagnosis_key (FK)
billedtopayer_amount 3.2.1. Database Sizing for the Positional-Flag
billedtopatient_amount
Attribute Method
Note: data types are displayed Number of base fact records: 1,000,000 records
in the diagnosis dimension to Key fields = 7; Fact fields = 2; Total Fields = 9
illustate positional-attribute Fact table size = 1,000,000 records * 9 fields *4 bytes =
concept in this example 36MB
Consider the total size of the dimension for 40 diagnoses:
Figure 3: Denormalizing the Dimension Table Diagnosis dimension: 240 records ≈ 1126.4 * 109 records
by Positional Flag attributes Number of key fields = 1; Number of attribute fields = 40;
Assuming 1 bit for each flag
Record size = 4+(1*40)/8= 9Bytes
This technique requires a very large diagnosis Dimension table size ≈ 1126.4 * 109 records * 9 bytes ≈
dimension table. N diagnoses require 2N records; for this 10.1TB
trivial example of five diagnoses, the table size is 32
records. Consider Table 4 that lists all the unique Total disk space = 10.1TB
diagnosis patterns. (for 40 diagnoses)
If we were to extend our model to include 10 diagnoses,
the table would be 1024 records in length; 20 diagnoses 3.3. Method C: Denormalizing the
would require 1,048,576 records; 40 diagnoses would Dimension Table by Non-Positional-
require about one trillion records. attributes & a Concatenated Field
In this approach, each attribute in the dimension will
store a different diagnosis value. By non-positional we
Additional disadvantages of this method include: mean that each attribute can have a different value in
different records. Other than the primary diagnosis, there is
• adding a new diagnosis value would require to rebuild no difference between secondary 1 and secondary 20.
the dimension table and the fact table. We need to use
Data Definition Language (DDL) to add a column and
reload the diagnosis dimension by adding (2N+1 - 2N)
rows and updating the diagnosis_key in the fact table;
Table 4: Diagnosis Dimension with Five Positional-attributes
diagnosis_key CancerFlag HeartFlag LungFlag LiverFlag KidneyFlag
001 FALSE FALSE FALSE FALSE FALSE
002 FALSE FALSE FALSE FALSE TRUE
003 FALSE FALSE FALSE TRUE FALSE
004 FALSE FALSE FALSE TRUE TRUE
005 FALSE FALSE TRUE FALSE FALSE
… … … … … …
031 TRUE TRUE TRUE TRUE FALSE
032 TRUE TRUE TRUE TRUE TRUE
I-Y. Song, W.Rowen, C. Medsker, E. Ewen 6-6
An example of this method is illustrated in Table 5. A
primary diagnosis and multiple secondary diagnoses can be
Table 5 Denormalized Non-Positional Diagnosis
assigned. Here, we introduce the notion of a concatenated
Dimensional Table
field to support query processing. A concatenated field is
used to store the primary and all the secondary values of
the diagnoses using the variable character data type. For Diagnosis Dimension
example, VARCHAR2 type in Oracle would be able to
store 4,000 characters. The LIKE clause of SQL could be diagnosis_key (PK)
employed to search for constrained information. The primary_diagnosis
concatenated value attributes will store diagnosis values in secondary_diagnosis1
a sorted order. One drawback is most bills have secondary_diagnosis2
approximately five diagnoses; therefore, there will be many secondary_diagnosis3
null values in secondary diagnoses. While queries across
diagnosis fields can be accomplished with multiple OR
clauses, the LIKE clause to the concatenated field will secondary_diagnosis20
simplify the search query. concatenated_ diagnoses (CD)
However, we note that most commercial database
systems do not employ B-tree type index for searching Note: secondary diagnosis 4 - 19 omitted for brevity
when LIKE clause begins with a wild character. Thus, an
efficient string indexing or string search mechanism will
enhance the query performance.
Observe the field concatenated_diagnoses (CD), which
In order to resolve the problem of LIKE clause, we can is a concatenation of the primary and all the secondary
enhance the non-positional model by incorporating the diagnoses related to a patient fact record. The primary
benefits of positional flag attributes. Additional Boolean diagnosis is included in the concatenated diagnosis to
attributes can be created for common or frequent enable the user to search for all assigned diagnoses for a
diagnoses. See Table 6 for an example. Bitmap indexes specific medical condition.
[OG95, CI98] can be created for these Boolean attributes
to facilitate searching based on these common diagnoses. Although we normally avoid fields with patterns or lists,
Unusual or intriguing diagnoses could also be included for an exception in this case is useful. There is no order or
specific business intelligence purposes. The hybrid method weighting to the secondary diagnoses, except the order in
allows both pattern matching with the LIKE command and which they come to mind of the evaluating physician or the
an index search through a limited number of Boolean order in which lab tests results become available, so when
fields. The main advantage here would be to constrain the this diagnosis dimension table is loaded, all the diagnoses
size of the dimension while allowing fast and efficient in a group are first sorted ascending and inserted across as
queries by maintaining the star schema. Consider Table 6 many diagnosis fields as required. This permits the use of a
to observe the usefulness of this approach. Most users are “wild card” query rather than multiple OR statements, to
interested in a disease category or combination of test whether a specific diagnosis was assigned to a patient
categories, not a single disease billing code. Multiple codes bill, regardless of its ordinal position. When searching for
can be assigned that all indicate the presence of a disease. a certain disease state, it usually does not matter if the
There may be as many as 20 codes that all indicate the diagnosis is primary or secondary; the physician just wants
patient has some form of diabetes. The analyst, for to ascertain if “any” diagnosis is for example “heart”.
reporting or regression purposes, simply needs a field for
diabetes that contains "TRUE" or "FALSE". In On-line
Analytical Processing (OLAP) designs users can combine Referring to Table 6:
(Boolean "AND") diseases by simply selecting "Yes"
across a series of OLAP categories. Pre-calculating and SELECT Patient_key, BPE. billedtopayer_amount
storing these clusters makes it simpler for users to query FROM DD,BPE
the database and for developers to create OLAP cubes. WHERE DD.diagnosis_key = BPE.diagnosis_key
AND DD.concatenated_diagnoses
LIKE ‘%heart%’;
I-Y. Song, W.Rowen, C. Medsker, E. Ewen 6-7
Table 6: Denormalized (non-positional) Diagnosis Dimension Table with Positional-attributes
Diagnosis Dimension
DK PD SD1 SD2 … SD20 CD AsthmaFlag DiabetesFlag (BOOL)
(BOOL)
1 heart lung asthma … null "asthma,heart,lung" TRUE FALSE
2 cancer diabetics null … null "cancer,diabetics" FALSE TRUE
3 liver heart null … null "heart,liver" FALSE FALSE
4 heart lung asthma … null "asthma,heart,lung" TRUE FALSE
…
Note: secondary diagnosis 3 - 19 omitted for brevity
The query is less difficult to write than a query with 3.3.1 Method C-1: One-to-One Relationship
multiple OR clauses. Adding flag columns for disease between Dimension and Fact Tables
groups can further enhance the design. For instance,
When each record in the diagnosis dimension can be
columns for diabetes and asthma can be used to tag all
related to one fact record, there exists a one-to-one
rows having specific diagnosis codes. Columns for certain
relationship between the tables (Figure 4). That is, we are
DRG codes can be included, since the dimension can be as
creating one dimension record for each new billing
wide as the designer desires to increase the usefulness to
encounter. The drawbacks in this design are three. First,
the end user. Note that there is a one-to-many relationship
most bills have fewer than 5 secondary diagnoses, so there
between a flag field and diagnosis codes. That is, the
will be many null values. Second, queries across
presence of any one or more of a set of diagnosis codes
secondary diagnosis fields will require multiple OR
may indicate an overall condition of diabetes. Flag fields
clauses, which are complex to write and slow to run.
are an elegant way for users to create simple queries that
However, this disadvantage can be solved using the
ask broad disease questions, but the determination and
concatenated attribute and LIKE clause as we explained in
loading of these fields during the ETL process is complex
the previous section. Third, it will take more storage.
and usually requires a mapping table created by medical
However, the most significant advantage of this approach
experts. They can be simple TRUE/FALSE flags that
is to maintain the simple star schema structure. Here,
allow rapid queries such as:
design is simpler in most ways and easier for analysts not
trained in data modeling to understand at the expense of
SELECT *
significant storage.
FROM DD
A weighting factor could be added to the diagnosis
WHERE Asthma = 'TRUE'
dimension, but will create a complexity in the actual usage
AND Diabetes = 'TRUE';
and is not recommended in this approach.
Note tuples one and four in Table 6. Both tuples have
the same primary and secondary diagnoses but have 3.3.1.1. Database Sizing for Denormalized
different diagnosis keys. The designer must make a Dimension Method C-1
decision, is this type of redundancy acceptable or should Number of base fact: 1,000,000 records
measures be taken to search for existing diagnosis patterns Key fields =7; Number of fact fields =2; Total Fields = 9
before issuing a new diagnosis key? It will be a trade-off Fact table size = 1,000,000 records * 9 fields * 4 bytes =
between more required memory space, or develop ad-hoc 36 MB
stored procedures to handle this situation. Diagnosis dimension: 1,000,000 records
This method of using non-positional attributes can be Key fields =1; Primary diagnosis size = 8 +15 = 23 bytes
implemented in two different ways: by one-to-one or one- Average secondary diagnoses size = 5 * (8+15)= 115 bytes
to-many relationship between the diagnosis dimension and Average Concatenated field size = 5*8 = 40 bytes
the fact table, depending on the allowance of redundant Number of Boolean Flag fields = 3 bits (1 bit each)
tuples. Record size = 4 + 23 + 115 + 40 + 1 = 183 bytes
This scheme described in this section allows the users to Dimension table size = 1,000,000 records * 183 bytes =
only be concerned with a single join between the fact table 183 MB
and dimension table. Total disk space = 219 MB
I-Y. Song, W.Rowen, C. Medsker, E. Ewen 6-8
Diagnosis
Dimension
In our project, we adopted this C-2 method. Many
diagnosis_key (PK)
primary_diagnosis
medical centers will purchase or download all necessary
Billable Patient
primary_diagnosis_desc Encounter fact table diagnosis codes and descriptions as flat files, then load this
secondary_diagnosis1 time_key (FK) data into a database table. We were able to create the
secondary_diagnosis1_desc patient-key (FK) initial diagnosis dimension using historic legacy data. For
secondary_diagnosis2 provider_key (FK)
secondary_diagnosis2_desc each billable encounter, a lookup is performed for the
location_key (FK)
secondary_diagnosis3 payer_key (FK) diagnosis description in a lookup table, the results are
secondary_diagnosis3_desc procedure_key (FK) sorted, and a new record is inserted. For future claims
...
secondary_diagnosis20
diagnosis_key (FK) records, a maintenance function will query the diagnosis
billedtopayer_amount
secondary_ diagnosis20_desc billedtopatient_amount
dimension to see if the pattern already exists. If the pattern
concatenated_diagnoses does not exist, the lookup table is accessed for a
SickleCellFlag
description and will update the dimension accordingly.
AsthmaFlag Note: secondary attributes 3 - 19
DiabetesFlag omitted for brevity.
3.3.2.1. Database Sizing for Denormalized
Figure 4 Denormalized Non-Positional Diagnosis Dimension Method C-2.
Dimension Table with Flag Attributes: One-to-One Number of base fact: 1,000,000 records
Relationship between Dimension and Fact Tables Key fields = 7; Fact fields = 2; Total Fields = 9
Fact table size =1,000,000 records * 9 fields * 4 bytes =
3.3.2. Method C-2: One-to-Many 36 MB
Relationship between Dimension and Diagnosis dimension records: 200,000 records
Fact Tables (Assumed on the average one pattern is associated with
five encounters.)
The diagnosis dimension can be related to the fact table Key fields =1;
in a one-to-many relationship (Figure 5). Thus, the same Primary diagnosis size = 8 +15 = 23 bytes
diagnosis pattern is associated with multiple encounters. Average secondary diagnoses size = 5 * (8+15)= 115 bytes
This method introduces many null values similar to Average Concatenated field size = 5*8 = 40 bytes
method C-1, but the redundancy is largely reduced. A Number of Boolean Flag fields = 3 bits (1 bit each)
sorting/concatenation procedure similar to the one used to Record size = 4 + 23 + 115 + 40 + 1 = 183 bytes
create the dimension table explained in section 3.3 can be Dimension table size = 200,000 records * 183 bytes =
employed to search for the same diagnosis pattern from 36.6 MB
existing records, however the insertion process will tend to
be complex and slow. Due to the one-to-many Total disk space = 72.6 MB
relationship, weighting factors cannot be exploited.
Diagnosis
3.4. Method D: Lowering the Grain of the
Dimension Fact Table
diagnosis_key (PK)
primary_diagnosis
Method D will lower the grain of the fact table to the
Billable Patient
primary_diagnosis_desc Encounter fact table dimension grain level (Figure 6). This method is briefly
secondary_diagnosis1 time_key (FK) discussed By Giovinazzo [Giov00]. For each event there
secondary_diagnosis1_desc patient-key (FK) will be multiple fact records relating to that specific event.
secondary_diagnosis2 provider_key (FK)
secondary_diagnosis2_desc An option is to add a diagnosis_group_key in the fact table
location_key (FK)
secondary_diagnosis3 payer_key (FK)
to group the multiple fact records. There is no need to
secondary_diagnosis3_desc procedure_key (FK) compute a weighting factor; each diagnosis can be directly
...
secondary_diagnosis20
diagnosis_key (FK) billed (see table 7). The star schema is retained in this
billedtopayer_amount
secondary_ diagnosis20_desc billedtopatient_amount
approach, giving the user a concise, clear logical view of
concatenated_diagnoses the business process, at the expense of increasing the size
SickleCellFlag of a fact table.
AsthmaFlag Note: secondary attributes 3 - 19
DiabetesFlag omitted for brevity. Since you can have many diagnoses for a singular event,
there is a need to calculate the aggregate of that instance.
The approach will be straightforward; sum the amounts of
each diagnosis for a specific date, grouping by
diagnosis_group_key. The size of the fact table will
Figure 5: Denormalized Non-Positional Diagnosis increase depending upon how many diagnoses will be
Dimension Table with Flag Attributes: One-to-Many stored in the fact table. There will also be redundant data
Relationship between Dimension and Fact Tables stored for other billing fields.
I-Y. Song, W.Rowen, C. Medsker, E. Ewen 6-9
Diagnosis Dimension Billable Patient
Encounter Fact Table Method E is similar to method D, but separates the
diagnosis_key (PK) time_key (FK) diagnosis data from the billing data by employing two fact
diag_name patient-key (FK) tables (Figure 7). Diagnosis data is recorded at individual
diagnosis_description provider_key (FK)
diagnosis grain in the Patient Medical Record fact table,
location_key (FK)
while a billing is recorded at each billing transaction grain.
payer_key (FK)
procedure_key (FK)
Using this structure, allows us to use two fact tables in
diagnosis_key (FK) different ways. When we perform diagnosis-related
diagnosis_group_key analysis, the Patient Medical Record fact table can be used.
billedtopatient_amount When we perform billing–related analysis, the Billing fact
billedtopayer_amount table can be used. When we analyze billing related to
diagnosis, both fact tables will be used. A billing event is
joined to diagnosis through the billing key. A Weighting
factor can be used if necessary.
Figure 6: Method D: Lowering the Grain of the In this approach, the patient medical record would
Fact Table contain redundant data about the patient due to the
granularity of the table, which is at the diagnosis level.
This method, however, causes less redundancy than
Method D because billing is recorded only once per event
in its own fact table. The size of the fact table will increase
Table 7: An Example of the Fact Table at depending upon how many diagnoses are stored in the fact
Individual Grain table. There will also be redundant data caused by the
foreign keys of an additional fact table.
BPE We note that if Method E is simplified by removing all
Billed to foreign keys except diagnosis_key and billing_key in the
Patient Diagnosis Diagnosis Other Patient Medical Record fact table, the structure is similar
payer
key group_key key FK's
amount to the method used with the Bridge table.
P1 DG1 cancer . $ 1,000
P1 DG1 heart . $ 2,000 Diagnosis
P2 DG2 cancer . $ 1,000 Dimension
P2 DG2 heart . $ 2,000 diagnosis_key (PK)
diag_name
P2 DG2 lung . $3,000 diag_description
3.4.1. Database Sizing for Method D:
Modifying the Fact Table Patient Medical
Record FactTable
Number of base fact records: 5,000,000 records (assume
time_key (FK)
five diagnoses on average). patient-key (FK)
provider_key (FK)
Key fields =7; Number of fact fields =3; Total Fields=10 location_key (FK)
Fact table size =5,000,000 records * 10 fields * 4 bytes = procedure_key (FK)
diagnosis_key (FK)
200 MB billing_key (FK)
Diagnosis dimension: 500 records WeightingFactor
Key fields = 1; Name field =1; Description field =1
Record size = 4+8+15 =27 bytes
Dimension table size = 500 records * 27 bytes =
13,500 Bytes
Billing
Fact Table
Total disk space = 200.1 MB
time_key (FK)
patient-key (FK)
provider_key (FK)
location_key (FK)
3.5. Method E: Lower the Grain of the Fact payer_key (FK)
procedure_key (FK)
Table: Separating Diagnosis from billing_key (FK)
Billing Data billedtopatient_amount
billedtopayer_amount
Figure 7: Method E: Lower the Grain of the Fact
Table and Separate Diagnosis/Billing Data
I-Y. Song, W.Rowen, C. Medsker, E. Ewen 6-10
Table 8: Summary of Six Methods
Name Weighting Factor Queries Redundancy Recommendation
Storage
Ideal for dimension with no
Necessary
Slight upper limit in Many side and
Method A WF assignment can be Additional join WF can be easily calculated or
Kimball’s Bridge cumbersome increases query time not needed.
Adding new diagnoses
96.1 MB A clean solution
require recalculation.
Dimension table
Requires many OR will list all
Method B Cannot handle a weighting commands, slows possible Because of enormous storage
Denormalizing factor processing response combination, can requirement, only applicable
Dimensional be extremely when the number of positional-
Table with large attributes is very small (less
Positional-Flag Bitmap index scheme than 10) and fixed
attributes Can use a primary can speed up query 10.1TB (for 40
diagnosis flag processing diagnoses)
Can maintain star schema
Method C
structure for ease of
Denormalizing
Can be very understanding and query
Dimensional
Depends on relationship Use of concatenated forming
Table with Non- large
between dimension and attribute and LIKE Use when the number of
positional-
fact table clause. dimension values is small and
attributes &
Concatenated query performance is important
field See C-1 and C-2
Creates many null values
below.
LIKE command Can be very
Can create a concatenated disables indexing. large Use when a dimension pattern
Method C-1
weighting factor, but Can use flags and appears in fact table only 1-2
One-to-One
cumbersome bitmap indexes for the 219 MB times.
flags
Use when a dimension pattern
Like command Less redundancy
Cannot use a weighting appears in fact table many
Method C-2 disables indexing. than Method C-1
factor because of one-to- times.
One-to-Many Can use flags and Requires complex logic to find
many nature
bitmap indexes for the 72.6 MB existing records before inserting
flags a new code.
Can cause
No need for a WF redundancy for The fact table can become very
Method D View is on single table
other FK's in fact large
Lower Grain of table
Fact Table to Line Need to calculate
Item level Use only for a limited number
Can use a primary aggregation. 201.1 MB of attributes and line items per
diagnosis flag.
Fast queries transaction
View is on single table Can cause
Method E Must deal with two redundancy for
Can use a WF
Lower Grain & fact tables. other FK's in fact Use when two fact tables can be
Separate Facts table used separately.
into two tables Fast queries
Can use a primary Need to calculate
196.1 MB
diagnosis flag. aggregation
I-Y. Song, W.Rowen, C. Medsker, E. Ewen 6-11
designer must consider the size of the fact table and
3.5.1 Database Sizing for the Positional- redundancy in relation to the query processing times.
attribute Method E There are several distinct types of users of the
warehouse; the executive browser, the data analyst, and the
Number of Patient Medical facts: 5,000,000 records
professional OLAP analysts. The executive browser will
(Assume five diagnoses on average).
see the data through interfaces designed by the warehouse
Key fields = 7; Fact fields = 1
developers and the development will be simpler using
Record size = 7*4 + 4 = 32 bytes
Method C-1 or C-2 because there are fewer joins in the
Fact table size = 5,000,000 records * 32 bytes =
queries and no flattening views to create. Data analysts and
160 MB
professional OLAP analysts such as statisticians will need
Diagnosis dimension: 500 records
to extract data from the warehouse tables into reports
Key fields = 1; Name field =1; Description field =1
employing statistical and data mining programs. Therefore,
Record size = 4+8+15 =27 bytes
the understandability of the model becomes very important.
Dimension table size = 500 records * 27 bytes =
Data from a flatter, horizontal dimension is easier to query
13,500 Bytes
into off-the-shelf application packages than data in a
Number of billing fact: 1,000,000 records
vertically oriented table. There is also no concern that
Key fields =7; Number of fact fields =2; Total Fields =9
analysts will forget to apply the weight factors to the facts.
Fact table size = 1,000,000 records * 9 fields * 4 bytes =
Maintenance is not appreciably different from a
36 MB
modeling and physical implementation perspective, which
are handled using a data-modeling tool. The extract,
Total disk space = 196.1 MB
transform, and load (ETL) routines may be more complex
and slow to run for the bridge method due to the multiple
4. Summary and Discussion tables that have to be tested and the need to use cursors and
procedural code to populate the bridge tables. Conversely,
In this section, we present the summary of the six
the ETL plans for method C may be easier to write and
methods discussed in the paper. The results are
should run faster. It is also important to note, that some
summarized in Table 8.
fact tables will have multiple many-to-many relationships
Kimball's bridge method [KRRT98] produces an elegant
with dimensions other than diagnosis. While theoretically,
design for many situations. Redundancy is kept at a
this poses no problem; in practice, the load times and query
minimum, with the added advantage of correct weighted
times may be excessive.
summaries. The cost is the size of the bridge table and
We finally note that the storage calculation in this paper
added joins can deteriorate the querying process. We note
was based on our assumption. When the domain and
that a weighting factor is needed when using a bridge. It is
assumption changes, the storage requirement needs to be
not always needed when fact and dimension tables are
recalculated.
flattened. We recommend this solution as a clean and
maintainable solution when the cardinality of many-to-
many is not limited and weighting factors can be easily
5. Conclusion
calculated or are not needed. While we were building a real-world patient-billing data
Method B, positional flag attributes, seems a likely warehouse, we met a many-to-many relationship problem
solution when the number of positional-attributes is very between a fact table and a dimension table. A survey of
limited (say less than 10 or 12) and fixed. As the number of literature showed us two methods (Method A and Method
attributes increases, the storage requirement becomes D). After a thorough study on the subject, we have
explosive. We do not recommend this approach for most identified four additional methods. In this paper, we have
situations. analyzed those six different approaches for handling many-
Methods C-1 and C-2 use non-positional attributes with to-many relationships. We have illustrated and shown the
a concatenated attribute and flags. . By maintaining the advantages and disadvantages of each solution.
star schema structure, these methods enhance
understandability of the model by non-professional Our preferred methods include two ad-hoc approaches
analysts and support easy query formation. However, the that maintain a star schema structure by denormalizing the
designer must take into consideration the null values and dimension to avoid many-to-many relationships. For our
redundancy. In addition, Method C-2 requires an efficient project, we implemented Method C-2. For a new
procedure to find an existing diagnosis key for each entry encounter, we query the Diagnosis dimension to see if the
of a fact table. We recommend Method C-1 when a pattern already exists. If the pattern does not exist, the
dimension pattern appears in the fact table only 1-2 times lookup table is accessed for a description and will update
and Method C-2 when a dimension pattern appears in the the dimension accordingly. We also note that Method C-2
fact table many times. We recommend these methods only uses the minimum storage. Our experience shows that the
when the number of the dimension value is small and fixed. Method C-2 proposed in this paper were easy to use and
Methods D and E offer still additional approaches that efficient given our situation. However, we recommend
will work well if the number of diagnoses is limited. The each data warehouse designer carefully evaluate each case
I-Y. Song, W.Rowen, C. Medsker, E. Ewen 6-12
to adopt the best method considering various options. Just [Kimb97] Kimball, R., A Dimensional Manifesto, DBMS,
as database designers trade off normalization for query August 1997, pp. 58-70.
response, a data warehouse designer must also resolve
many issues to solve this many-to-many dilemma. [KRRT98] Kimball, R., Reeves, L., Ross, M., &
Thornthwaite, W. (1998). The Data
Warehouse Lifecycle Toolkit.
Acknowledgements New York: John Wiley & Sons, Inc.
The authors thank the many students at Drexel
[KS97] Krippendorf, M. and Song, I.-Y, "Translation of
University who participated in the discussion on the many-
Star Schema into Entity-Relationship Diagrams," Proc.
to-many dilemma in data warehousing modeling. Their
of 8th International Conference and
insight has been invaluable in developing these solutions.
Workshop on Database and Expert Systems
Applications (DEXA97 ), September 1-5, 1997,
6. References Toulouse, France, pp. 390-395.
[AM97] Anahory, S. and Murray, D., Data [LAW98] Lehner, W., Albrecht, J., and Wedekind, H.,
Warehousing in the Real World, Addison Normal Forms for Multidimensional Databases,
Wesley, 1997 Proc. SSDBM, 1998, pp. 63-72.
[AV98] Adamson, C. and Venerable, M., Data [MC98] Maier, D. and Cannon,C., Building a Better
Warehouse Design Solutions, John Wiley, 1998. Data Warehouse, Prentice Hall, 1998.
[AS97] Axel, M. and Song, I. -Y., "Data Warehouse Design [MK00] Moody, D. and Kortink, M.A.R., From
for Pharmaceutical Drug Discovery Research," Enterprise Models to Dimensional Models:
Proc. of 8th International Conference and A Methodology for Data Warehouse and Data
Workshop on Database and Expert Systems Mart design, Proc. of Int’l Workshop on Design
Applications (DEXA97), September 1-5, and Management of Data Warehouses,
1997, Toulouse, France, pp. 644-650. Stockholm, Sweden., 2000.
[CI98] Chan, C.-Y. and Ioannidis, Y. , Bitmap Index [OG95] O’Neil, P. and Graefe, G., Multi-table Joins
Design and Evaluation, Proc. of 1998 SIGMOD Through Bitmapped Join Indices, SIGMOD Record,
Conference, pp. 355-366. Vol. 24, No.3, Sept. 1995, pp. 8-11.
[PJ99] Pedersen, T.B. and Jensen, C.S.,
[DSHB98] Dinter, B., Sapia, C., Hofling, G., and Multidimensional Data Modeling for Complex
Blaschka, M., The OLAP Market: State of the Data, Proc. of 15th ICDE,
Art and Research Issues, Proc. of Int’l Sidney, Australia, 1999, pp. 336-345.
Workshop on Data Warehousing and
OLAP, Washington, D.C., [SBHD98] Sapia, C., Blaschka, M., Hofling, G., and
1998, pp. 22-27. Dinter, B., “Extending the E/R Model for the
Multidimensional Paradigm,” Advances in
[Giov00] Giovinazzo, W.A., Object-Oriented Data Database Technologies (ER ’98 Workshop
Warehouse Design: Building a Star Schema, Proceedings), Springer-Verlag, pp. 105-116.
Prentice Hall, 2000.
[TBC99] Tryfona, N., Busborg, F., and Christiansen,
[GR98] Golfarelli, M. and Rizzi, S., A J.G.B., “starER: A Conceptual Model
Methodological Framework for Data for Data Warehouse Design,” Proc. of Int’l
Warehouse Design, Proc. of Int’l Workshop Workshop on Data Warehousing and OLAP
on Data Warehousing and OLAP, (DOLAP 99), Kansas City, MO., pp. 3-8.
Washington, D.C., 1998, pp. 3-9.
[TS98] Theodoratos, D., and Sellis, T, Data
[HLB00] Husemann, B., Lechtenborger, J., and Warehouse Schema and Instance Design, Proc.
Vossen, G., Conceptual Data Warehouse Design, of 17th International Conf. On Conceptual
Proc. Of Int’l Workshop on Design and Management Modeling (ER98), Singapore, Nov. 1998,
of Data Warehouses, Stockholm, Sweden., 2000. pp.363-376.
[Kimb96] Kimball, R. (1996). The Data Warehouse
Toolkit. New York: John Wiley & Sons, Inc.
I-Y. Song, W.Rowen, C. Medsker, E. Ewen 6-13