=Paper= {{Paper |id=Vol-39/paper-6 |storemode=property |title=An Analysis of Many-to-Many Relationships Between Fact and Dimension Tables in Dimensional Modeling |pdfUrl=https://ceur-ws.org/Vol-39/paper6.pdf |volume=Vol-39 |authors=I. Song,B. Rowan,C. Medsker,E. Ewen |dblpUrl=https://dblp.org/rec/conf/dmdw/SongRME01 }} ==An Analysis of Many-to-Many Relationships Between Fact and Dimension Tables in Dimensional Modeling== https://ceur-ws.org/Vol-39/paper6.pdf
              An Analysis of Many-to-Many Relationships Between Fact and
                      Dimension Tables in Dimensional Modeling

                    Il -Yeol Song                                                                      William Rowen
   College of Information Science and Technology                                       College of Information Science and Technology
                  Drexel University                                                                   Drexel University
               Philadelphia, PA 19104                                                              Philadelphia, PA 19104
                    songiy@drexel.edu                                                                   msis@drexel.edu
                                 Carl Medsker                                                           Edward Ewen, M.D
                                  Arynth, Inc.                                                     Christiana Care Health System
                             Cinnaminson, NJ 08077                                                    Wilmington, DE 19899
                             cmedsker@arynth.com                                                    eewen@christianacare.org
                                                                                        1. Introduction
                                       Abstract
                                                                                           The data warehouse (DW) is an integrated repository of
      Star schema, which maintains one-to-many                                          data, generated and used by an entire organization. The
  relationships between dimensions and a fact table,                                    data warehouse employs a suite of tools that transforms
  is widely accepted as the most viable data                                            raw data into meaningful business information. This
  representation for dimensional analysis. Real-                                        information depicts a view of a distinct business process to
  world DW schema, however, frequently includes                                         identify trends and patterns and serves as a foundation for
  many-to-many relationships between a dimension                                        decision-making.
  and a fact table. Having those relationships in a
  dimensional model causes several difficult issues,                                       The dimensional model is a logical representation of a
  such as losing the simplicity of the star schema                                      business process whose significant features are user
  structure, increasing complexity in forming                                           understandability, query performance, and resilience to
  queries, and degrading query performance by                                           change. Dimensional modeling is widely accepted as the
  adding more joins. Therefore, it is desirable to                                      viable technique for delivering data to end users in a data
  represent the many-to-many relationships with                                         warehouse [KRRT98, AM97, AV98, AS97, DSHB98,
  correct semantics while still keeping the structure                                   MC98]. The main components of a dimensional model are
  of the star schema.                                                                   fact tables and dimension tables. A fact table contains
     In this paper, we analyze many-to-many                                             measurements of the business or records events. A
  relationships between a dimension table and a fact                                    dimension table contains attributes used to constrain,
  table in dimensional modeling. We illustrate six                                      group, or browse the fact data. There are two primary
  different approaches and show the advantages and                                      advantages of using a dimensional model in data
  disadvantages of each. We propose two ad-hoc                                          warehouse environments. First, a dimensional model
  methods that maintain a star schema structure by                                      provides a multidimensional analysis space in relational
  denormalizing the dimensions to avoid many-to-                                        database environments; we are analyzing factual data using
  many relationships. This method allows quick                                          dimensions. Second, a typical denormalized dimensional
  query processing by using a concatenated attribute                                    model has a simple schema structure, which simplifies end-
  with minimal overhead. Other issues addressed                                         user query processing and improves performance.
  are data redundancy, weighting factors, storage
  requirements, and performance concerns.                                                  The dimension tables contain a large number of
                                                                                        attributes, reflecting the details of the business processes.
                                                                                        Browsing is a user activity that explores the relationships
The copyright of this paper belongs to the paper’s authors. Permission to copy          between attributes in a dimension table. The attributes will
without fee all or part of this material is granted provided that the copies are not    serve as row headers and constraints for these views. It is
made or distributed for direct commercial advantage.                                    common to have more than one hundred attributes in a real
Proceedings of the International Workshop on Design and                                 world application. Dimension tables are considered wide
Management of Data Warehouses (DMDW'2001)                                               for this reason. Denormalization of dimension tables is an
Interlaken, Switzerland, June 4, 2001                                                   acceptable practice in data warehousing. A dimensional
(D. Theodoratos, J. Hammer, M. Jeusfeld, M. Staudt, eds.)                               model with highly normalized dimension structure is called
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-39/




    I-Y. Song, W.Rowen, C. Medsker, E. Ewen                                                                                                      6-1
a snowflake schema [KRRT98].               Any attempts to      dimension tables are typically joined to the fact table with
normalize a dimension table into a series of tables could       a one-to-many relationship. When all the dimensions are
reduce the browsing capabilities of the user, resulting in      related by one-to-many relationships with the fact table, the
more complex queries and increased retrieval time. Our          schema is called a star schema. However, real-world DW
experiences with real-world data warehouse development          schema frequently includes many-to-many relationships
shows that browsing and group-by queries are the two            between a dimension and a fact table [KRRT98, AS97].
salient issues that drive the design of data warehouses.        Having those relationships in a dimensional model causes
                                                                several difficult issues, such as losing the star schema
   The fact table is where the numerical measurements of        structure, increasing complexity in forming queries, and
the business processes are stored. These measurements or        degrading query performance by adding more joins.
events are related to each dimension table by foreign keys.     Therefore, it is desirable that we handle the many-to-many
The fact table contains thousands, or even millions of rows     relationships while still keeping the structure of the star
of records. A typical query will compress or extract a large    schema.
number of records into a handful of rows using
aggregation. Therefore, the most useful facts are numeric,         In this paper, we analyze many-to-many relationships
continuously valued, and additive; Kimball calls this           between a dimension table and a fact table in dimensional
premise the holy grail of dimensional database design           modeling. Even though there are some previous studies on
[Kimb96].                                                       how to represent a data warehouse conceptual schema
                                                                [GR98, SBHD98, TBC99] or how to derive/design a data
  The grain of the fact table is a very important               warehouse schema [AM97, KS97, TS98, LAW98, PJ99,
characteristic. The grain is the level of detail at which       MK00, HLB00], the specific method of handling many-to-
measurements or events are stored. It determines the            many relationships is rarely addressed. Two sources we
dimensionality of the data warehouse and dramatically           found are books by Kimball et al. [KRRT98] and
impacts the size and conversely the performance.                Giovinazzo [Giov00]. Not being satisfied by those
                                                                approaches for our real-world project, we have performed
   The goal in designing the data warehouse model is to         a thorough study on how to handle many-to-many
keep it simple to understand, simple to load with               relationships. In this paper, we illustrate six different
operational data, and as fast as possible to query [Kimb96,     approaches and show the advantages and disadvantages of
Kimb97, KRRT98, AM97, AV98]. We would like to have              each. We propose two ad-hoc methods that maintain a star
neophyte and experienced business analysts creating             schema structure by denormalizing the dimension to avoid
reports, so the logical model needs to be easy to               many-to-many relationships. These methods allow us to
comprehend. Most business analysts frequently have a            quickly process queries. Other issues that will be addressed
difficult time finding data in both highly normalized           include data redundancy, weighting factors, storage
designs and abstract object designs. The flatter the            requirements, and performance concerns.
dimensional model, the better for end-users. The more
complex the model, the more complex will be the                    The remainder of this paper is organized as follows:
extract/transform/load (ETL) routines to create and run.        Section 2 presents a motivation example. Section 3
                                                                presents six approaches and discusses the advantages and
   Finally, queries against the database will run faster if a   disadvantages. Sections 4 presents a summary table and
minimal number of one-to-many relationships and joins are       Section 5 concludes our paper.
present. To provide users with the views they need for
analysis, the one-to-many relationships between facts and       2. Motivational Example
dimensions should be flattened into a series of views or
                                                                   In the healthcare billing process, there are usually
derived tables. For instance, the statistician may want to
                                                                multiple diagnoses for each patient visit. A design problem
create a regression model against diagnoses with the grain
                                                                arises in modeling a diagnosis dimension that has a many-
of the analysis being a single visit to the hospital.
                                                                to-many relationship with a fact table as shown in Figure 1.
Therefore, each row must completely define a visit with
                                                                We will explore specific data warehousing structures to
columns for specific diagnoses or columns that represent
                                                                analyze this predicament. We will use, as an illustrated
groups of diagnoses. To meet the fundamental goal of
                                                                example, the patient-billing situation throughout this paper
empowering end users to perform their own queries and
                                                                to compare and contrast the different solutions.
analyses, the design must balance elegance in conceptual
design with understandability, usability, and performance.
                                                                    In Figure 1, the relationship between the diagnosis
                                                                dimension and the billable patient encounter fact table is
   Design principles dictate that one should identify any
                                                                illustrated as a many-to-many. This considers the situation
dimensional attribute that has a single value for an
                                                                where a patient has more than one diagnosis for each
individual fact table record. The designer can build
                                                                billable encounter.
outward from the grain of the fact table and relate as many
dimensions as the business process demands. Therefore,



  I-Y. Song, W.Rowen, C. Medsker, E. Ewen                                                                                6-2
                                      Billable Patient         factor will not directly add up (see Section 3.1.1 for the
                                      Encounter Fact Table     issues of weighting factors in a many-to-many
                                      time_key (FK)            relationship). You must guarantee by some other means
     Diagnosis Dimension              patient_key (FK)         that the correct weight is applied for any subset of
     diagnosis_key (PK)               provider_key (FK)        diagnoses. Users must be protected from retrieving a
     diag_name                        location_key (FK)
                                      payer_key (FK)
                                                               subset of data that aggregates incorrectly, which will occur
     diag_description
                                      procedure_key (FK)       if no precautions are taken.
                                      diagnosis_key (FK)
     Note: only diagnosis dimension   billedtopayer_amount     3.        Methods for Handling Many-to-Many
              is illustrated          billedtopatient_amount             Relationships

                                                               3.0.      Assumptions
         Figure 1: Healthcare Billable Encounter Schema           Based on our experience of building a real-world data
                          [KRRT98]                             warehouse in a patient-billing domain, we have assumed
                                                               the following data for our analysis. The fact table contains
   There are inherent problems with many-to-many               patient billing information and each bill is assigned one
relationships between a fact table and a dimension.            primary diagnosis and one to many secondary diagnoses.
Querying for records to find a particular combination of       Although it is theoretically possible to have hundreds of
diagnoses requires multiple correlated subqueries.             diagnoses, the maximum in practice is twenty or less.
Consider the query for retrieving 'billed to payer amount'     Frequency distributions on an existing operational database
and 'patient key' for patients who have a combination of       show that most bills have fewer than five secondary
diagnoses named 'heart' and 'cancer’ (for the remainder of     diagnoses, with very few bills having more than 10. These
the discussion we will refer to the diagnosis dimension as     are entered into the operational system in no particular
DD and the billable patient encounter fact table as BPE).      order. There is no qualitative difference between secondary
                                                               1 and secondary 20. In addition, government and insurance
  SELECT    patient_key, billedtopayer_amount                  claim forms typically provide space for a maximum four or
  FROM      BPE                                                eight secondary diagnoses, so the practical limit is fixed.
  WHERE     patient_key IN                                     All have equal potential importance, depending on the
      (SELECT        patient_key                               context of use or the type of information that is compiled.
      FROM           DD, BPE                                      Making a few assumptions can approximate the
      WHERE          diag_name = 'cancer'                      estimated size of the dimension and fact tables.
            AND DD.diagnosis_key =
                     BPE.diagnosis_key                              •   Fact table contains 1,000,000 records
      INTERSECT                                                     •   There are maximum 20 billable diagnoses for each
             SELECT            patient_key                              encounter.
             FROM              DD, BPE                              •   There are maximum 500 billable diagnoses.
            WHERE              diag_name = 'heart'                  •   There are on the average five separate diagnoses for
                     AND DD.diagnosis_key =                             each encounter
                               BPE.diagnosis_key);                  •   All numerical field widths are an average four
                                                                        bytes, names are eight bytes, and descriptions are
   The subquery will select all patient numbers that have               15 bytes.
both heart and cancer diagnosis names. Queries for finding
patients with N different diagnoses will need N-level          3.1.      Method A: The Bridge Table
subqueries. Therefore, report generation is very complex
and slow; you must search a large number of records with          Figure 2 depicts Kimball’s use of the bridge table to
multiple correlated subqueries, increasing both the            connect multiple diagnoses to a fact table [KRRT98]. The
processing time and the number of joins.                       bridge table is an intersection table between a diagnosis
   When one requests additive measurements through the         dimension table and the fact table. This table is similar to
relationship, the user may receive incorrect results. It is    an intersection table that is created for a many-to-many
necessary to implement a weighting factor to give each         relationship between two entities.         However, what
separate diagnosis its appropriate contribution to the total   distinguishes this bridge table in data warehouse modeling
bill [AV98].                                                   from an intersection table in data modeling is the use of
   An additional problem with this design is frequently a      weighting factors and a diagnosis group key. A diagnosis
user may not want all of the diagnoses. When an end-user       group key is assigned to clusters of diagnosis codes and the
retrieves fewer than all the diagnoses then the weighting      combinations are inserted into the bridge table.




  I-Y. Song, W.Rowen, C. Medsker, E. Ewen                                                                               6-3
                                                              Query: Given Tables 1, 2, and 3, find the ‘billed to payer
        Diagnosis                                             amount’ contributed by each diagnosis.
        Dimension
        diagnosis_key (PK)                                    Situation 1 Impact Report:
        diag_name
        diag_description                                      SELECT   diag_name, SUM (billedtopayer_amount)
                                                              FROM     DD,DGB,BPE
                                                              WHERE    DD.diagnosis_key = DGB.diagnosis_key
                                                              AND      DGB.diagnosis_group_key =
            Diagnosis Group                                            BPE.diagnosis_group_key
            Bridge Table
                                                              GROUP BY diag_name;
            diagnosis_key (PK)
            diagnosis_group_key (PK)
            weighting_factor                                  Results:
                                                              diag_name        billed_to_payer_amount
                                                              Cancer           $ 3,000
                                                              Heart            $ 3,000
                                                              Lung             $ 2,000
                    Billable Patient
                    Encounter Fact Table
                                                                 The results clearly indicate the inherent problem in a
                    time_key (FK)
                    patient-key (FK)                          many-to-many situation where the aggregation is counted
                    provider_key (FK)                         for the total amount for each occurrence of a diagnosis in
                    location_key (FK)                         the records (total amount billed is $8,000). Cancer
                    payer_key (FK)                            occurred in diagnosis group one and two, thus it was
                    procedure_key (FK)                        counted twice ($1,000 from diagnosis group one and $
                    diagnosis_group_key (FK)                  2,000 from diagnosis group two returning an impact total
                    billedtopayer_amount                      of $ 3,000).
                    billedtopatient_amount
                                                              Situation 2 Weighting Factor Report:
        Note: only diagnosis dimension is illustrated
                                                              SELECT         diag_name,
     Figure 2: Solving Multiple Diagnoses with a Bridge
                                                                SUM (billedtopayer_amount *weighting_factor)
                     Table [KRRT98]
                                                              FROM           DD, DGB, BPE
                                                              WHERE          DD.diagnosis_key = DGB.diagnosis_key
3.1.1. Weighting Factor                                       AND            DGB.diagnosis_group_key =
   Observe the weighting factor attribute in Figure 2. The                   BPE.diagnosis_group_key
weighting factor is a percentage that identifies the          GROUP BY       diag_name;
contribution of the diagnosis to the specific encounter.
Within a diagnosis group, the sum of all the weighting        Results:
factors must equal one. The weighting factor is multiplied    diag_name        billed_to_payer_amount
by fact values, through the joining of the two tables with    Cancer           $ 2,000
the diagnosis group key. In this manner, the involvement of   Heart            $ 800
each diagnosis in the diagnosis group is correctly            Lung             $ 200
calculated. Conversely, the user can request an impact
analysis, ignoring the weighting factors [KRRT98]. Such
impact reports will erroneously aggregate the amounts. It
                                                                        Table 2: Diagnosis Group Bridge Table
will produce a summation based on the impact each
diagnosis has in relation to total amounts associated with
                                                                DGB
that diagnosis. Consider the following example, which
                                                                (Diagnosis Group Bridge)
shows only necessary attributes:
                                                                Diagnosis         Diagnosis            Weighting
                                                                key               group_key            factor
          Table 1:Diagnosis Dimension
                                                                DK1               DG1                  0.8
  DD (Diagnosis Dimension)                                      DK2               DG1                  0.2
  diagnosis_key                   diag_name
                                                                DK1               DG2                  0.6
  DK1                             Cancer
                                                                DK2               DG2                  0.3
  DK2                             Heart
                                                                DK3               DG2                  0.1
  DK3                             Lung



  I-Y. Song, W.Rowen, C. Medsker, E. Ewen                                                                           6-4
       Table 3: Billable Patient Encounter Schema
                                                                  The size of the bridge table would increase considerably
    BPE                                                        if one encounter has many related diagnoses. We used an
    (Billable Patient Encounter)                               average of five diagnoses per encounter for this example;
    Patient         Diagnosis           billedtopayer          this parameter produced a bridge table comparable to the
    key             group_key           amount                 size of the fact table, as we will now demonstrate.
    P1              DG1                 $ 1,000
    P2              DG2                 $ 2,000                3.1.2. Database Sizing for the Bridge
                                                                      Method.
   The weighting factors produce a correct totaled report      Base fact: 1,000,000 records
($3,000). During the summation, the weighting factor for       Key fields = 7; Fact fields = 2; Total Fields = 9
each diagnosis key will be related to each bill through the    Fact table size = 1,000,000 records * 9 fields * 4 bytes =
foreign key (diagnosis group key) found in the billable                                                            36 MB
patient encounter table.                                       Diagnosis dimension: 500 records
                                                               Key fields = 1; Name field =1; Description field =1
   The weighting factor is necessary when using a bridge       Record size = 4+8+15 =27 bytes
implementation to produce correct reports [KRRT98,             Dimension table size = 500 records * 27 bytes =
AV98]. However, it is not always possible to rationalize                                                       13,500 Bytes
the weighting factors for each diagnosis. In that case, it
would be possible to count the total diagnoses and produce     Bridge table: 1,000,000 facts joined to 5 distinct diagnoses
an average cost through additional design measures. One        in a diagnosis group = 5,000,000 records
method would be to add an additional attribute to the          Key fields = 2; Weighting factor = 1; Total fields = 3
bridge, call it number_of_diagnosis; thus, you could divide    Bridge table size=5,000,000 records * 3 fields * 4 bytes=
your impact total by this value to produce an average cost                                                           60 MB
per diagnosis. This brute force method takes away from the     Notice, if the average number of diagnosis is increased to
usefulness of your decision support based reports. Thus, it    ten, our bridge table size will grow to 120 MB, nearly four
is recommended to use this method only when the correct        times the size of our fact table
calculation of the weighting factors is not necessary.                                          Total disk space = 96.1MB

   A major benefit of this design is there is no fixed upper        In summary, the bridge method can be considered a
limit, other than total possible diagnoses. Although in this   logical solution for a many-to-many relationship with less
study, we have set an upper limit of twenty diagnoses, to      redundancy. There are, however, various disadvantages to
meet the user requirements. The bridge method, as you can      this method. Assigning weighting factors could prove to be
observe, implements a compound primary key for the             difficult or cumbersome in a real-world environment;
bridge table comprised of diagnosis group key and              additionally, adding a new diagnosis requires recalculating
diagnosis key. It is possible to find a group of related       of the weighting factors. The logical structure would lose
diagnoses because the diagnosis group value is repeated for    the simplicity and understandability of the star schema.
every member row in a set of diagnoses.                        More joins increase the overhead and query time. As
                                                               pointed out the size of the bridge table could increase
   There may be other such many-to-many dimensions             considerably based on the number of diagnosis assigned to
related to the same fact table, and the load times and query   each diagnosis group.
times can be expected to be lengthy. For instance, there are
many procedure codes and Diagnosis Related Group
(DRG) codes assigned to a single visit or patient bill. A      3.2.     Method B: Denormalizing the
DRG is a classification of a hospital stay in terms of what             Dimension Table by Positional-Flag
was wrong with and what was done for a patient. There are               Attributes
approximately 500 DRG codes, which are determined by a
program based on diagnoses and procedures coded in a              Figure 3 illustrates denormalizing the diagnosis
standard International Classification of Disease (ICD-9)       dimension using the positional-attribute approach. By
format and on patient attributes such as age, sex, and         positional we mean that the location of each attribute is
duration of treatment. The DRG frequently determines the       fixed. For example, the first attribute is cancer; the second
amount of reimbursements, regardless of the actually costs     attribute is heart, etc. Thus, the same disease is always
incurred. A hospital visit is often coded by multiple          indicated in the same column. In this method, each
systems, such as Systematized Nomenclature of Medicine         diagnosis becomes a Boolean attribute being set to either
(SNOMED), Current Procedural Terminology (CPT4), and           ‘TRUE’ or ‘FALSE’. For brevity and clarity, only five
others, all of which share a many-to-many relationship with    attributes have been included in Figure 3.
the billable patient encounter fact table. Considering the
complexity of the healthcare billing system, the design and
performance using bridge tables will get quite complex.


  I-Y. Song, W.Rowen, C. Medsker, E. Ewen                                                                               6-5
                                         Billable Patient                      •     there are no approaches to handle a weighting
       Diagnosis Dimension               Encounter Fact Table                        factor.
        diagnosis_key (PK) INT           time_key (FK)
                                                                            However, a bitmap index scheme [OG95] can be
        CancerFlag BOOL                  patient-key (FK)               implemented on each positional attribute, which would
        HeartFlag BOOL                   provider_key (FK)              improve the query performance in this approach. It is clear
        LungFlag BOOL                    location_key (FK)              that this method would only be applicable when the
        LiverFlag BOOL                   payer_key (FK)                 number of positional-attributes is limited and fixed.
        KidneyFlag BOOL                  procedure_key (FK)
                                         diagnosis_key (FK)
                                         billedtopayer_amount           3.2.1. Database Sizing for the Positional-Flag
                                         billedtopatient_amount
                                                                               Attribute Method
       Note: data types are displayed                                   Number of base fact records: 1,000,000 records
       in the diagnosis dimension to                                    Key fields = 7; Fact fields = 2; Total Fields = 9
       illustate positional-attribute                                   Fact table size = 1,000,000 records * 9 fields *4 bytes =
       concept in this example                                                                                                36MB
                                                                        Consider the total size of the dimension for 40 diagnoses:
        Figure 3: Denormalizing the Dimension Table                     Diagnosis dimension: 240 records ≈ 1126.4 * 109 records
               by Positional Flag attributes                            Number of key fields = 1; Number of attribute fields = 40;
                                                                        Assuming 1 bit for each flag
                                                                        Record size = 4+(1*40)/8= 9Bytes
    This technique requires a very large diagnosis                      Dimension table size ≈ 1126.4 * 109 records * 9 bytes ≈
dimension table. N diagnoses require 2N records; for this                                                                    10.1TB
trivial example of five diagnoses, the table size is 32
records. Consider Table 4 that lists all the unique                                                     Total disk space = 10.1TB
diagnosis patterns.                                                                                              (for 40 diagnoses)

   If we were to extend our model to include 10 diagnoses,
the table would be 1024 records in length; 20 diagnoses                 3.3.       Method C: Denormalizing the
would require 1,048,576 records; 40 diagnoses would                                Dimension Table by Non-Positional-
require about one trillion records.                                                attributes & a Concatenated Field
                                                                           In this approach, each attribute in the dimension will
                                                                        store a different diagnosis value. By non-positional we
    Additional disadvantages of this method include:                    mean that each attribute can have a different value in
                                                                        different records. Other than the primary diagnosis, there is
•    adding a new diagnosis value would require to rebuild              no difference between secondary 1 and secondary 20.
     the dimension table and the fact table. We need to use
     Data Definition Language (DDL) to add a column and
     reload the diagnosis dimension by adding (2N+1 - 2N)
     rows and updating the diagnosis_key in the fact table;


                              Table 4: Diagnosis Dimension with Five Positional-attributes

             diagnosis_key              CancerFlag          HeartFlag   LungFlag           LiverFlag        KidneyFlag
             001                        FALSE               FALSE       FALSE              FALSE            FALSE
             002                        FALSE               FALSE       FALSE              FALSE            TRUE
             003                        FALSE               FALSE       FALSE              TRUE             FALSE
             004                        FALSE               FALSE       FALSE              TRUE             TRUE
             005                        FALSE               FALSE       TRUE               FALSE            FALSE
             …                          …                   …           …                  …                …
             031                        TRUE                TRUE        TRUE               TRUE             FALSE
             032                        TRUE                TRUE        TRUE               TRUE             TRUE




    I-Y. Song, W.Rowen, C. Medsker, E. Ewen                                                                                      6-6
    An example of this method is illustrated in Table 5. A
primary diagnosis and multiple secondary diagnoses can be
                                                                       Table 5 Denormalized Non-Positional Diagnosis
assigned. Here, we introduce the notion of a concatenated
                                                                                   Dimensional Table
field to support query processing. A concatenated field is
used to store the primary and all the secondary values of
the diagnoses using the variable character data type. For                               Diagnosis Dimension
example, VARCHAR2 type in Oracle would be able to
store 4,000 characters. The LIKE clause of SQL could be                diagnosis_key (PK)
employed to search for constrained information. The                    primary_diagnosis
concatenated value attributes will store diagnosis values in           secondary_diagnosis1
a sorted order. One drawback is most bills have                        secondary_diagnosis2
approximately five diagnoses; therefore, there will be many            secondary_diagnosis3
null values in secondary diagnoses. While queries across
diagnosis fields can be accomplished with multiple OR
clauses, the LIKE clause to the concatenated field will                secondary_diagnosis20
simplify the search query.                                             concatenated_ diagnoses (CD)

   However, we note that most commercial database
systems do not employ B-tree type index for searching             Note: secondary diagnosis 4 - 19 omitted for brevity
when LIKE clause begins with a wild character. Thus, an
efficient string indexing or string search mechanism will
enhance the query performance.
                                                                   Observe the field concatenated_diagnoses (CD), which
    In order to resolve the problem of LIKE clause, we can      is a concatenation of the primary and all the secondary
enhance the non-positional model by incorporating the           diagnoses related to a patient fact record. The primary
benefits of positional flag attributes. Additional Boolean      diagnosis is included in the concatenated diagnosis to
attributes can be created for common or frequent                enable the user to search for all assigned diagnoses for a
diagnoses. See Table 6 for an example. Bitmap indexes           specific medical condition.
[OG95, CI98] can be created for these Boolean attributes
to facilitate searching based on these common diagnoses.           Although we normally avoid fields with patterns or lists,
Unusual or intriguing diagnoses could also be included for      an exception in this case is useful. There is no order or
specific business intelligence purposes. The hybrid method      weighting to the secondary diagnoses, except the order in
allows both pattern matching with the LIKE command and          which they come to mind of the evaluating physician or the
an index search through a limited number of Boolean             order in which lab tests results become available, so when
fields. The main advantage here would be to constrain the       this diagnosis dimension table is loaded, all the diagnoses
size of the dimension while allowing fast and efficient         in a group are first sorted ascending and inserted across as
queries by maintaining the star schema. Consider Table 6        many diagnosis fields as required. This permits the use of a
to observe the usefulness of this approach. Most users are      “wild card” query rather than multiple OR statements, to
interested in a disease category or combination of              test whether a specific diagnosis was assigned to a patient
categories, not a single disease billing code. Multiple codes   bill, regardless of its ordinal position. When searching for
can be assigned that all indicate the presence of a disease.    a certain disease state, it usually does not matter if the
There may be as many as 20 codes that all indicate the          diagnosis is primary or secondary; the physician just wants
patient has some form of diabetes. The analyst, for             to ascertain if “any” diagnosis is for example “heart”.
reporting or regression purposes, simply needs a field for
diabetes that contains "TRUE" or "FALSE". In On-line
Analytical Processing (OLAP) designs users can combine          Referring to Table 6:
(Boolean "AND") diseases by simply selecting "Yes"
across a series of OLAP categories. Pre-calculating and         SELECT            Patient_key, BPE. billedtopayer_amount
storing these clusters makes it simpler for users to query      FROM              DD,BPE
the database and for developers to create OLAP cubes.           WHERE             DD.diagnosis_key = BPE.diagnosis_key
                                                                AND               DD.concatenated_diagnoses
                                                                LIKE              ‘%heart%’;




  I-Y. Song, W.Rowen, C. Medsker, E. Ewen                                                                                6-7
         Table 6: Denormalized (non-positional) Diagnosis Dimension Table with Positional-attributes
Diagnosis Dimension

DK         PD          SD1          SD2          …     SD20    CD                     AsthmaFlag        DiabetesFlag (BOOL)
                                                                                      (BOOL)

1          heart       lung         asthma       …     null    "asthma,heart,lung"    TRUE              FALSE

2          cancer      diabetics    null         …     null    "cancer,diabetics"     FALSE             TRUE

3          liver       heart        null         …     null    "heart,liver"          FALSE             FALSE

4          heart       lung         asthma       …     null    "asthma,heart,lung"    TRUE              FALSE

…


                                      Note: secondary diagnosis 3 - 19 omitted for brevity


   The query is less difficult to write than a query with           3.3.1 Method C-1: One-to-One Relationship
multiple OR clauses. Adding flag columns for disease                between Dimension and Fact Tables
groups can further enhance the design. For instance,
                                                                       When each record in the diagnosis dimension can be
columns for diabetes and asthma can be used to tag all
                                                                    related to one fact record, there exists a one-to-one
rows having specific diagnosis codes. Columns for certain
                                                                    relationship between the tables (Figure 4). That is, we are
DRG codes can be included, since the dimension can be as
                                                                    creating one dimension record for each new billing
wide as the designer desires to increase the usefulness to
                                                                    encounter. The drawbacks in this design are three. First,
the end user. Note that there is a one-to-many relationship
                                                                    most bills have fewer than 5 secondary diagnoses, so there
between a flag field and diagnosis codes. That is, the
                                                                    will be many null values.         Second, queries across
presence of any one or more of a set of diagnosis codes
                                                                    secondary diagnosis fields will require multiple OR
may indicate an overall condition of diabetes. Flag fields
                                                                    clauses, which are complex to write and slow to run.
are an elegant way for users to create simple queries that
                                                                    However, this disadvantage can be solved using the
ask broad disease questions, but the determination and
                                                                    concatenated attribute and LIKE clause as we explained in
loading of these fields during the ETL process is complex
                                                                    the previous section. Third, it will take more storage.
and usually requires a mapping table created by medical
                                                                    However, the most significant advantage of this approach
experts. They can be simple TRUE/FALSE flags that
                                                                    is to maintain the simple star schema structure. Here,
allow rapid queries such as:
                                                                    design is simpler in most ways and easier for analysts not
                                                                    trained in data modeling to understand at the expense of
SELECT             *
                                                                    significant storage.
FROM               DD
                                                                       A weighting factor could be added to the diagnosis
WHERE              Asthma = 'TRUE'
                                                                    dimension, but will create a complexity in the actual usage
AND                Diabetes = 'TRUE';
                                                                    and is not recommended in this approach.
   Note tuples one and four in Table 6. Both tuples have
the same primary and secondary diagnoses but have                   3.3.1.1. Database Sizing for Denormalized
different diagnosis keys. The designer must make a                           Dimension Method C-1
decision, is this type of redundancy acceptable or should           Number of base fact: 1,000,000 records
measures be taken to search for existing diagnosis patterns         Key fields =7; Number of fact fields =2; Total Fields = 9
before issuing a new diagnosis key? It will be a trade-off          Fact table size = 1,000,000 records * 9 fields * 4 bytes =
between more required memory space, or develop ad-hoc                                                                     36 MB
stored procedures to handle this situation.                         Diagnosis dimension: 1,000,000 records
   This method of using non-positional attributes can be            Key fields =1; Primary diagnosis size = 8 +15 = 23 bytes
implemented in two different ways: by one-to-one or one-            Average secondary diagnoses size = 5 * (8+15)= 115 bytes
to-many relationship between the diagnosis dimension and            Average Concatenated field size = 5*8 = 40 bytes
the fact table, depending on the allowance of redundant             Number of Boolean Flag fields = 3 bits (1 bit each)
tuples.                                                             Record size = 4 + 23 + 115 + 40 + 1 = 183 bytes
   This scheme described in this section allows the users to        Dimension table size = 1,000,000 records * 183 bytes =
only be concerned with a single join between the fact table                                                             183 MB
and dimension table.                                                                               Total disk space = 219 MB



    I-Y. Song, W.Rowen, C. Medsker, E. Ewen                                                                                6-8
Diagnosis
Dimension
                                                                      In our project, we adopted this C-2 method. Many
diagnosis_key (PK)
primary_diagnosis
                                                                   medical centers will purchase or download all necessary
                                   Billable Patient
primary_diagnosis_desc             Encounter fact table            diagnosis codes and descriptions as flat files, then load this
secondary_diagnosis1               time_key (FK)                   data into a database table. We were able to create the
secondary_diagnosis1_desc          patient-key (FK)                initial diagnosis dimension using historic legacy data. For
secondary_diagnosis2               provider_key (FK)
secondary_diagnosis2_desc                                          each billable encounter, a lookup is performed for the
                                   location_key (FK)
secondary_diagnosis3               payer_key (FK)                  diagnosis description in a lookup table, the results are
secondary_diagnosis3_desc          procedure_key (FK)              sorted, and a new record is inserted. For future claims
...
secondary_diagnosis20
                                   diagnosis_key (FK)              records, a maintenance function will query the diagnosis
                                   billedtopayer_amount
secondary_ diagnosis20_desc        billedtopatient_amount
                                                                   dimension to see if the pattern already exists. If the pattern
concatenated_diagnoses                                             does not exist, the lookup table is accessed for a
SickleCellFlag
                                                                   description and will update the dimension accordingly.
AsthmaFlag                     Note: secondary attributes 3 - 19
DiabetesFlag                   omitted for brevity.

                                                                   3.3.2.1. Database Sizing for Denormalized
    Figure 4 Denormalized Non-Positional Diagnosis                          Dimension Method C-2.
    Dimension Table with Flag Attributes: One-to-One               Number of base fact: 1,000,000 records
    Relationship between Dimension and Fact Tables                 Key fields = 7; Fact fields = 2; Total Fields = 9
                                                                   Fact table size =1,000,000 records * 9 fields * 4 bytes =
3.3.2. Method C-2: One-to-Many                                                                                          36 MB
       Relationship between Dimension and                          Diagnosis dimension records: 200,000 records
       Fact Tables                                                 (Assumed on the average one pattern is associated with
                                                                   five encounters.)
   The diagnosis dimension can be related to the fact table        Key fields =1;
in a one-to-many relationship (Figure 5). Thus, the same           Primary diagnosis size = 8 +15 = 23 bytes
diagnosis pattern is associated with multiple encounters.          Average secondary diagnoses size = 5 * (8+15)= 115 bytes
   This method introduces many null values similar to              Average Concatenated field size = 5*8 = 40 bytes
method C-1, but the redundancy is largely reduced. A               Number of Boolean Flag fields = 3 bits (1 bit each)
sorting/concatenation procedure similar to the one used to         Record size = 4 + 23 + 115 + 40 + 1 = 183 bytes
create the dimension table explained in section 3.3 can be         Dimension table size = 200,000 records * 183 bytes =
employed to search for the same diagnosis pattern from                                                                36.6 MB
existing records, however the insertion process will tend to
be complex and slow.            Due to the one-to-many                                            Total disk space = 72.6 MB
relationship, weighting factors cannot be exploited.

Diagnosis
                                                                   3.4.     Method D: Lowering the Grain of the
Dimension                                                                   Fact Table
diagnosis_key (PK)
primary_diagnosis
                                                                      Method D will lower the grain of the fact table to the
                                   Billable Patient
primary_diagnosis_desc             Encounter fact table            dimension grain level (Figure 6). This method is briefly
secondary_diagnosis1               time_key (FK)                   discussed By Giovinazzo [Giov00]. For each event there
secondary_diagnosis1_desc          patient-key (FK)                will be multiple fact records relating to that specific event.
secondary_diagnosis2               provider_key (FK)
secondary_diagnosis2_desc                                          An option is to add a diagnosis_group_key in the fact table
                                   location_key (FK)
secondary_diagnosis3               payer_key (FK)
                                                                   to group the multiple fact records. There is no need to
secondary_diagnosis3_desc          procedure_key (FK)              compute a weighting factor; each diagnosis can be directly
...
secondary_diagnosis20
                                   diagnosis_key (FK)              billed (see table 7). The star schema is retained in this
                                   billedtopayer_amount
secondary_ diagnosis20_desc        billedtopatient_amount
                                                                   approach, giving the user a concise, clear logical view of
concatenated_diagnoses                                             the business process, at the expense of increasing the size
SickleCellFlag                                                     of a fact table.
AsthmaFlag                     Note: secondary attributes 3 - 19
DiabetesFlag                   omitted for brevity.                   Since you can have many diagnoses for a singular event,
                                                                   there is a need to calculate the aggregate of that instance.
                                                                   The approach will be straightforward; sum the amounts of
                                                                   each diagnosis for a specific date, grouping by
                                                                   diagnosis_group_key. The size of the fact table will
     Figure 5: Denormalized Non-Positional Diagnosis               increase depending upon how many diagnoses will be
   Dimension Table with Flag Attributes: One-to-Many               stored in the fact table. There will also be redundant data
    Relationship between Dimension and Fact Tables                 stored for other billing fields.




  I-Y. Song, W.Rowen, C. Medsker, E. Ewen                                                                                    6-9
  Diagnosis Dimension                 Billable Patient
                                      Encounter Fact Table         Method E is similar to method D, but separates the
  diagnosis_key (PK)                  time_key (FK)             diagnosis data from the billing data by employing two fact
  diag_name                           patient-key (FK)          tables (Figure 7). Diagnosis data is recorded at individual
  diagnosis_description               provider_key (FK)
                                                                diagnosis grain in the Patient Medical Record fact table,
                                      location_key (FK)
                                                                while a billing is recorded at each billing transaction grain.
                                      payer_key (FK)
                                      procedure_key (FK)
                                                                Using this structure, allows us to use two fact tables in
                                      diagnosis_key (FK)        different ways.      When we perform diagnosis-related
                                      diagnosis_group_key       analysis, the Patient Medical Record fact table can be used.
                                      billedtopatient_amount    When we perform billing–related analysis, the Billing fact
                                      billedtopayer_amount      table can be used. When we analyze billing related to
                                                                diagnosis, both fact tables will be used. A billing event is
                                                                joined to diagnosis through the billing key. A Weighting
                                                                factor can be used if necessary.
          Figure 6: Method D: Lowering the Grain of the            In this approach, the patient medical record would
                         Fact Table                             contain redundant data about the patient due to the
                                                                granularity of the table, which is at the diagnosis level.
                                                                This method, however, causes less redundancy than
                                                                Method D because billing is recorded only once per event
                                                                in its own fact table. The size of the fact table will increase
          Table 7: An Example of the Fact Table at              depending upon how many diagnoses are stored in the fact
                    Individual Grain                            table. There will also be redundant data caused by the
                                                                foreign keys of an additional fact table.
  BPE                                                              We note that if Method E is simplified by removing all
                                                    Billed to   foreign keys except diagnosis_key and billing_key in the
Patient    Diagnosis      Diagnosis     Other                   Patient Medical Record fact table, the structure is similar
                                                     payer
 key       group_key         key        FK's
                                                    amount      to the method used with the Bridge table.
  P1          DG1           cancer         .         $ 1,000
  P1          DG1           heart          .         $ 2,000            Diagnosis
  P2          DG2           cancer         .         $ 1,000            Dimension
  P2          DG2           heart          .         $ 2,000            diagnosis_key (PK)
                                                                        diag_name
  P2          DG2           lung           .         $3,000             diag_description


3.4.1. Database Sizing for Method D:
       Modifying the Fact Table                                                   Patient Medical
                                                                                  Record FactTable
Number of base fact records: 5,000,000 records (assume
                                                                                   time_key (FK)
five diagnoses on average).                                                        patient-key (FK)
                                                                                   provider_key (FK)
Key fields =7; Number of fact fields =3; Total Fields=10                           location_key (FK)
Fact table size =5,000,000 records * 10 fields * 4 bytes =                         procedure_key (FK)
                                                                                   diagnosis_key (FK)
                                                    200 MB                         billing_key (FK)
Diagnosis dimension: 500 records                                                   WeightingFactor
Key fields = 1; Name field =1; Description field =1
Record size = 4+8+15 =27 bytes
Dimension table size = 500 records * 27 bytes =
                                               13,500 Bytes
                                                                                           Billing
                                                                                           Fact Table
                              Total disk space = 200.1 MB
                                                                                           time_key (FK)
                                                                                           patient-key (FK)
                                                                                           provider_key (FK)
                                                                                           location_key (FK)
3.5.      Method E: Lower the Grain of the Fact                                            payer_key (FK)
                                                                                           procedure_key (FK)
          Table: Separating Diagnosis from                                                 billing_key (FK)
          Billing Data                                                                     billedtopatient_amount
                                                                                           billedtopayer_amount




                                                                          Figure 7: Method E: Lower the Grain of the Fact
                                                                             Table and Separate Diagnosis/Billing Data
  I-Y. Song, W.Rowen, C. Medsker, E. Ewen                                                                            6-10
                                         Table 8: Summary of Six Methods
      Name                Weighting Factor           Queries            Redundancy                     Recommendation
                                                                          Storage
                                                                                                Ideal for dimension with no
                     Necessary
                                                                           Slight               upper limit in Many side and
Method A             WF assignment can be        Additional join                                WF can be easily calculated or
Kimball’s Bridge     cumbersome                  increases query time                           not needed.
                     Adding new diagnoses
                                                                           96.1 MB              A clean solution
                     require recalculation.
                                                                           Dimension table
                                                 Requires many OR          will list all
Method B             Cannot handle a weighting   commands, slows           possible             Because of enormous storage
Denormalizing        factor                      processing response       combination, can     requirement, only applicable
Dimensional                                                                be extremely         when the number of positional-
Table with                                                                 large                attributes is very small (less
Positional-Flag                                  Bitmap index scheme                            than 10) and fixed
attributes           Can use a primary           can speed up query        10.1TB (for 40
                     diagnosis flag              processing                diagnoses)

                                                                                                Can maintain star schema
Method C
                                                                                                structure for ease of
Denormalizing
                                                                           Can be very          understanding and query
Dimensional
                     Depends on relationship     Use of concatenated                            forming
Table with Non-                                                            large
                     between dimension and       attribute and LIKE                             Use when the number of
positional-
                     fact table                  clause.                                        dimension values is small and
attributes &
Concatenated                                                                                    query performance is important
field                                                                      See C-1 and C-2
                                                                                                Creates many null values
                                                                           below.
                                                 LIKE command              Can be very
                     Can create a concatenated   disables indexing.        large                Use when a dimension pattern
Method C-1
                     weighting factor, but       Can use flags and                              appears in fact table only 1-2
One-to-One
                     cumbersome                  bitmap indexes for the    219 MB               times.
                                                 flags
                                                                                                Use when a dimension pattern
                                                 Like command              Less redundancy
                     Cannot use a weighting                                                     appears in fact table many
Method C-2                                       disables indexing.        than Method C-1
                     factor because of one-to-                                                  times.
One-to-Many                                      Can use flags and                              Requires complex logic to find
                     many nature
                                                 bitmap indexes for the    72.6 MB              existing records before inserting
                                                 flags                                          a new code.
                                                                           Can cause
                     No need for a WF                                      redundancy for       The fact table can become very
Method D                                         View is on single table
                                                                           other FK's in fact   large
Lower Grain of                                                             table
Fact Table to Line                               Need to calculate
Item level                                                                                      Use only for a limited number
                     Can use a primary           aggregation.              201.1 MB             of attributes and line items per
                     diagnosis flag.
                                                 Fast queries                                   transaction

                                                 View is on single table   Can cause
Method E                                         Must deal with two        redundancy for
                     Can use a WF
Lower Grain &                                    fact tables.              other FK's in fact   Use when two fact tables can be
Separate Facts                                                             table                used separately.
into two tables                                  Fast queries
                     Can use a primary           Need to calculate
                                                                           196.1 MB
                     diagnosis flag.             aggregation



  I-Y. Song, W.Rowen, C. Medsker, E. Ewen                                                                                  6-11
                                                               designer must consider the size of the fact table and
3.5.1 Database Sizing for the Positional-                      redundancy in relation to the query processing times.
      attribute Method E                                          There are several distinct types of users of the
                                                               warehouse; the executive browser, the data analyst, and the
Number of Patient Medical facts: 5,000,000 records
                                                               professional OLAP analysts. The executive browser will
(Assume five diagnoses on average).
                                                               see the data through interfaces designed by the warehouse
Key fields = 7; Fact fields = 1
                                                               developers and the development will be simpler using
Record size = 7*4 + 4 = 32 bytes
                                                               Method C-1 or C-2 because there are fewer joins in the
Fact table size = 5,000,000 records * 32 bytes =
                                                               queries and no flattening views to create. Data analysts and
                                                    160 MB
                                                               professional OLAP analysts such as statisticians will need
Diagnosis dimension: 500 records
                                                               to extract data from the warehouse tables into reports
Key fields = 1; Name field =1; Description field =1
                                                               employing statistical and data mining programs. Therefore,
Record size = 4+8+15 =27 bytes
                                                               the understandability of the model becomes very important.
Dimension table size = 500 records * 27 bytes =
                                                               Data from a flatter, horizontal dimension is easier to query
                                               13,500 Bytes
                                                               into off-the-shelf application packages than data in a
Number of billing fact: 1,000,000 records
                                                               vertically oriented table. There is also no concern that
Key fields =7; Number of fact fields =2; Total Fields =9
                                                               analysts will forget to apply the weight factors to the facts.
Fact table size = 1,000,000 records * 9 fields * 4 bytes =
                                                                  Maintenance is not appreciably different from a
                                                      36 MB
                                                               modeling and physical implementation perspective, which
                                                               are handled using a data-modeling tool. The extract,
                             Total disk space = 196.1 MB
                                                               transform, and load (ETL) routines may be more complex
                                                               and slow to run for the bridge method due to the multiple
4. Summary and Discussion                                      tables that have to be tested and the need to use cursors and
                                                               procedural code to populate the bridge tables. Conversely,
   In this section, we present the summary of the six
                                                               the ETL plans for method C may be easier to write and
methods discussed in the paper.            The results are
                                                               should run faster. It is also important to note, that some
summarized in Table 8.
                                                               fact tables will have multiple many-to-many relationships
   Kimball's bridge method [KRRT98] produces an elegant
                                                               with dimensions other than diagnosis. While theoretically,
design for many situations. Redundancy is kept at a
                                                               this poses no problem; in practice, the load times and query
minimum, with the added advantage of correct weighted
                                                               times may be excessive.
summaries. The cost is the size of the bridge table and
                                                                  We finally note that the storage calculation in this paper
added joins can deteriorate the querying process. We note
                                                               was based on our assumption. When the domain and
that a weighting factor is needed when using a bridge. It is
                                                               assumption changes, the storage requirement needs to be
not always needed when fact and dimension tables are
                                                               recalculated.
flattened. We recommend this solution as a clean and
maintainable solution when the cardinality of many-to-
many is not limited and weighting factors can be easily
                                                               5. Conclusion
calculated or are not needed.                                      While we were building a real-world patient-billing data
   Method B, positional flag attributes, seems a likely        warehouse, we met a many-to-many relationship problem
solution when the number of positional-attributes is very      between a fact table and a dimension table. A survey of
limited (say less than 10 or 12) and fixed. As the number of   literature showed us two methods (Method A and Method
attributes increases, the storage requirement becomes          D). After a thorough study on the subject, we have
explosive. We do not recommend this approach for most          identified four additional methods. In this paper, we have
situations.                                                    analyzed those six different approaches for handling many-
   Methods C-1 and C-2 use non-positional attributes with      to-many relationships. We have illustrated and shown the
a concatenated attribute and flags. . By maintaining the       advantages and disadvantages of each solution.
star schema structure, these methods enhance
understandability of the model by non-professional                Our preferred methods include two ad-hoc approaches
analysts and support easy query formation. However, the        that maintain a star schema structure by denormalizing the
designer must take into consideration the null values and      dimension to avoid many-to-many relationships. For our
redundancy. In addition, Method C-2 requires an efficient      project, we implemented Method C-2. For a new
procedure to find an existing diagnosis key for each entry     encounter, we query the Diagnosis dimension to see if the
of a fact table. We recommend Method C-1 when a                pattern already exists. If the pattern does not exist, the
dimension pattern appears in the fact table only 1-2 times     lookup table is accessed for a description and will update
and Method C-2 when a dimension pattern appears in the         the dimension accordingly. We also note that Method C-2
fact table many times. We recommend these methods only         uses the minimum storage. Our experience shows that the
when the number of the dimension value is small and fixed.     Method C-2 proposed in this paper were easy to use and
   Methods D and E offer still additional approaches that      efficient given our situation. However, we recommend
will work well if the number of diagnoses is limited. The      each data warehouse designer carefully evaluate each case


  I-Y. Song, W.Rowen, C. Medsker, E. Ewen                                                                               6-12
to adopt the best method considering various options. Just       [Kimb97]   Kimball, R., A Dimensional Manifesto, DBMS,
as database designers trade off normalization for query                     August 1997, pp. 58-70.
response, a data warehouse designer must also resolve
many issues to solve this many-to-many dilemma.                  [KRRT98]   Kimball, R., Reeves, L., Ross, M., &
                                                                            Thornthwaite, W. (1998). The Data
                                                                            Warehouse Lifecycle Toolkit.
Acknowledgements                                                            New York: John Wiley & Sons, Inc.
   The authors thank the many students at Drexel
                                                                 [KS97]     Krippendorf, M. and Song, I.-Y, "Translation of
University who participated in the discussion on the many-
                                                                            Star Schema into Entity-Relationship Diagrams," Proc.
to-many dilemma in data warehousing modeling. Their
                                                                            of 8th International Conference and
insight has been invaluable in developing these solutions.
                                                                            Workshop on Database and Expert Systems
                                                                            Applications (DEXA97 ), September 1-5, 1997,
6. References                                                               Toulouse, France, pp. 390-395.

[AM97]       Anahory, S. and Murray, D., Data                    [LAW98]    Lehner, W., Albrecht, J., and Wedekind, H.,
             Warehousing in the Real World, Addison                         Normal Forms for Multidimensional Databases,
             Wesley, 1997                                                   Proc. SSDBM, 1998, pp. 63-72.
[AV98]       Adamson, C. and Venerable, M., Data                 [MC98]     Maier, D. and Cannon,C., Building a Better
             Warehouse Design Solutions, John Wiley, 1998.                  Data Warehouse, Prentice Hall, 1998.
[AS97]       Axel, M. and Song, I. -Y., "Data Warehouse Design   [MK00]     Moody, D. and Kortink, M.A.R., From
             for Pharmaceutical Drug Discovery Research,"                   Enterprise Models to Dimensional Models:
             Proc. of 8th International Conference and                      A Methodology for Data Warehouse and Data
             Workshop on Database and Expert Systems                        Mart design, Proc. of Int’l Workshop on Design
             Applications (DEXA97), September 1-5,                          and Management of Data Warehouses,
             1997, Toulouse, France, pp. 644-650.                           Stockholm, Sweden., 2000.
[CI98]       Chan, C.-Y. and Ioannidis, Y. , Bitmap Index        [OG95]     O’Neil, P. and Graefe, G., Multi-table Joins
             Design and Evaluation, Proc. of 1998 SIGMOD                    Through Bitmapped Join Indices, SIGMOD Record,
             Conference, pp. 355-366.                                       Vol. 24, No.3, Sept. 1995, pp. 8-11.

                                                                 [PJ99]     Pedersen, T.B. and Jensen, C.S.,
[DSHB98] Dinter, B., Sapia, C., Hofling, G., and                            Multidimensional Data Modeling for Complex
         Blaschka, M., The OLAP Market: State of the                        Data, Proc. of 15th ICDE,
         Art and Research Issues, Proc. of Int’l                            Sidney, Australia, 1999, pp. 336-345.
         Workshop on Data Warehousing and
         OLAP, Washington, D.C.,                                 [SBHD98]   Sapia, C., Blaschka, M., Hofling, G., and
         1998, pp. 22-27.                                                   Dinter, B., “Extending the E/R Model for the
                                                                            Multidimensional Paradigm,” Advances in
[Giov00]     Giovinazzo, W.A., Object-Oriented Data                         Database Technologies (ER ’98 Workshop
             Warehouse Design: Building a Star Schema,                      Proceedings), Springer-Verlag, pp. 105-116.
             Prentice Hall, 2000.
                                                                 [TBC99]    Tryfona, N., Busborg, F., and Christiansen,
[GR98]       Golfarelli, M. and Rizzi, S., A                                J.G.B., “starER: A Conceptual Model
             Methodological Framework for Data                              for Data Warehouse Design,” Proc. of Int’l
             Warehouse Design, Proc. of Int’l Workshop                       Workshop on Data Warehousing and OLAP
             on Data Warehousing and OLAP,                                  (DOLAP 99), Kansas City, MO., pp. 3-8.
             Washington, D.C., 1998, pp. 3-9.
                                                                 [TS98]     Theodoratos, D., and Sellis, T, Data
[HLB00]      Husemann, B., Lechtenborger, J., and                           Warehouse Schema and Instance Design, Proc.
             Vossen, G., Conceptual Data Warehouse Design,                  of 17th International Conf. On Conceptual
            Proc. Of Int’l Workshop on Design and Management                Modeling (ER98), Singapore, Nov. 1998,
             of Data Warehouses, Stockholm, Sweden., 2000.                  pp.363-376.
[Kimb96]     Kimball, R. (1996). The Data Warehouse
             Toolkit. New York: John Wiley & Sons, Inc.



  I-Y. Song, W.Rowen, C. Medsker, E. Ewen                                                                           6-13