Granular Ontologies and Graphical In-Place Querying

                     Janis Barzdins, Edgars Rencis and Agris Sostaks

       Institute of Mathematics and Computer Science, University of Latvia, Riga, Latvia
                    {Janis.Barzdins, Edgars.Rencis, Agris.Sostaks}@lumii.lv


       Abstract. The data ontologies in a form of UML Class diagrams are discussed
       in this paper. We call the data ontology granular, if its corresponding instance
       diagrams (data) can be divided into separate parts called slices. A typical
       example of granular ontologies is process ontologies, where slices are run-time
       instances of these processes. Based on the notion of granularity a graphical in-
       place query language is presented in this paper. The proposed language is easy
       to use by domain experts that are not IT specialists. Besides that it has a very
       efficient (linear) execution time for answering queries.

       Keywords: Data ontologies, run-time instances, query language, graphical
       querying.


1 Introduction

While working with models, we have observed an interesting phenomenon – data can
be often divided into separate parts naturally. These parts have their own semantics,
which we would like to use while querying the model. If the division of the data
ontology is well formalized, it is possible to develop a query language for the
ontology that is both very efficiently executable and very convenient and easy-to-use
for the end user being the domain expert, not an IT specialist.
   In this paper we specify a set of ontologies, for which we can define a natural
division in parts. We call these ontologies granular and define the granularity
principles very formally in Section 2. After that, we describe the query language in
Section 3, which we have developed for granular ontologies. Here we lay out the
principles and primitives of the language, as well as define its time-efficiency.
   We, however, understand that not all the real world ontologies fall into the class of
granular ontologies. Therefore, in future we plan to extend the notion of ontology
granularity a bit in order to widen the class of granular ontologies by extending the
query language at the same time. The main objective that needs to be taken in mind in
the process is that the query language must preserve its time-efficiency.
2 Granular Ontologies

In this paper we inspect data ontologies in a form of UML Class Diagrams. More
precisely, we use only a subset of UML Class Diagrams containing classes, oriented
associations, typed attributes and generalization.
   From syntactic point of view our data ontology language is also a subset of the
OWL (see the comparison in [1]). From the semantic point of view there is, however,
a significant difference – while OWL uses the open-world semantics, we exploit the
closed-world semantics. Our proposed data ontology language is a convenient mean
for describing data of concrete domains, e.g., the structure of hospital registry. A very
simple example of a data ontology describing study programs is seen in Fig. 1 (we use
a traditional shorthand notation for associations, which are oriented in both
directions).
         Study program     program
                                                               self.program.course ->
        name: String       1
                                                               includesAll(self.course)
               1 program                   student *
                                                  Student
                                           id: Integer
          course *                         name: String
                                           age: Integer
               Course
       name: String               course         * student
       creditpoints: Integer      *

                               Fig. 1. The Study program ontology.

   We will depict the concrete data of the ontology as legal instances of the
corresponding class diagram. The specification of legality can be performed either
only through multiplicities (which must always be satisfied), or additionally through
OCL expressions (as in Fig. 1) or in any other way (even using the natural language).
   Let us assume we have a data ontology in a form of UML Class Diagram D and a
class A belonging to the ontology D. Let us also assume we have some instance G of
the diagram D. G consists of two kinds of elements – class instances called objects
and association instances called links. Since we only operate with oriented
associations, also the links are oriented. Therefore we can perceive G also as an
oriented graph. Let us now take an arbitrary instance x of the class A such that x∈G
(in shorthand notation: x∈A∩G). We can now introduce a concept of a slice respective
to the object x within instance G being the maximal subgraph of G (let us denote it
with S(x,G)) such that S(x,G) consists of the object x and all those objects that are
reachable from x via edges.
   When we inspect some data ontology D, it always comes together with the set of
its legal instances UD. We will now call a class A∈D a Master class, iff the two
following statements are satisfied:
1) ∀G ∈ UD ∀x ∈ A ∩ G∀y ∈ A ∩ G(x ≠ y ⇒ S(x, G) ∩ S(y, G) = Ø)
2)   ∀G ∈ UD U S(x, G) = G
               x∈A ∩ G
   The first statement states that all the slices respective to instances of the Master
class are distinct, that is, they do not have common objects. The second statement
states that these slices cover the whole instance G.
   There is only one Master class in the Study Program ontology seen in Fig. 1.
(given the specified OCL constraint) – the class “Study program”. Indeed, if we take
an instance of the class “Study program”, its respective slice covers all the courses of
that program together with its students. Since the OCL constraint prohibits for any
student to take course from a different program than he is attached to, it is clear that
respective slices of instances of the class “Study program” are distinct thus dividing
any legal instance of the class diagram into slices.
   Data ontologies (together with the legality constraints), for which there exists a
Master class, are called granular ontologies. We depict the Master class with a bold
frame in granular ontologies as can be seen in Fig. 1 (in case there is more than one
Master class in an ontology we just choose one). Further in this paper we inspect only
granular ontologies.
   The main objective of dividing the instance graph into slices is that thus we could
form natural queries over the instance easily. Since the data are naturally divided into
slices, we can formulate questions either within some concrete slice, or over a set of
slices. For example, we can take one concrete slice specified by the name of the study
program and count the sum of credit points of all courses of that program (e.g., the
question “How much credit points are to be collected in the Computer Science Master
study program?”). Another example – we can select a set of slices specified by the
age of the students and see, in which study programs these students are assigned to
(e.g., the query “Please, give me the list of all the programs, in which there are at least
one student older than 30 years!”). The means for formulating such kind of queries
and getting the results are described in the next Section.
   It must be mentioned that the class of granular ontologies is relatively rich. We can
see, how the division into slices becomes apparent in case of static class diagram seen
in Fig. 1. However, the situation with division into slices is especially characteristic, if
the data ontology describes some processes, and instances of the ontology are run-
time instances (transactions) of those processes. Good example of such process is the
shopping basket process widely used in the field of data mining. However, in this
paper we will use another example as a base – clinical processes in a hospital. For
describing such processes a special language MEDMOD is introduced in authors’
paper [2].
   Formally, the language MEDMOD is defined as a profile on UML Class diagrams
according to Fig. 2 (OCL constraints are omitted here). An example clinical process
describing the Emergency department management is seen in Fig. 3.


                  Fig. 2. The UML profile defining the MEDMOD language.
   On the basis of Fig. 3 we will now shortly explain the used notations. As is
described in the profile, Activities are divided into three categories. Start Activity,
e.g., “Patient enters the hospital” (called also the Master Activity) is depicted with
bolder frame in Fig. 3. Aggregate Activities (consisting of subactivities), e.g.,
“Clinical process in ward” are depicted with dashed frames (see Fig. 3). Simple
activities are all the other activities, e.g., the activity “Doctor sets diagnosis” in Fig. 3.
As is seen in Fig. 3, some activities are depicted with a multiple frame. That means
that several instances (more than one) of these activities can appear in one slice.
                                              Patient enters the
                                                   hospital
                                             date&time: DateTime
                                             surname: String
                                             age: Integer


                                            Doctor at emergency    {second opinion is necessary}
                                            department evaluates
                                           patient's medical needs        Patient consulted by
                                          emergency_doctor: String
                                                                              second doctor
                                                                         consulting_doctor: String


                    Patient admitted to                                     Patient treated at
                       hospital ward                                      emergency department
                                                 Patient scheduled
                   date&time: DateTime                                    procedure_code: pCode
                                                   for transfer to
                   ward_code: wCode                                       cost: Real
                                                    another ward
                                                 ward_code: wCode
                    Clinical process in
                            ward
                    ward_doctor: String


                         *                           *
                     Doctor sets diagnosis        Doctor assigns
                    diagnosis_code: dCode           procedure
                                              date&time: DateTime
                                              procedure_code: pCode


                                               Procedure is executed          Patient leaves the
                                              date&time: DateTime                  hospital
                                              procedure_code: pCode          date&time: DateTime
                                              cost: Real                     total_expenses: Real


Fig. 3. An example of a MEDMOD process – the Emergency department management process.

   Associations are divided into four categories:
1) Follows. This type of oriented relation can be established between two Activities
   A and B meaning that Activity B can only start after Activity A has ended. It is
   allowed for several Activities to follow the same Activity – the XOR semantics is
   implied in this case meaning that only one of those outgoing flows can be
   executed. We denote this situation by introducing a new diamond-shaped
   graphical element seen in Fig. 3.
2) Composition. A composition between two Activities can be established, if one
   Activity (called the Aggregate) semantically consists of one or more other
   Activities (called the Components).
3) Interruption. If there is an outgoing Interruption flow from the Aggregate Activity
   A to some Activity B, it means that the Activity A is suspended, when the flow is
   executed (i.e., when the Activity B needs to be started) meaning that it can no
   more create new Component instances (already created Component instances
   continues to execute normally).
4) Extension. Extension is an oriented relation between two Activities A and B
    meaning that Activity B can be called at some time during the execution of
    Activity A. The call is triggered, when some predefined condition occurs. The
    condition is described as an Extension point name and attached to the Extension.
   The reason behind developing a new language was that the traditional process
modeling languages have found a limited use in the hospital settings (see, e.g., [3],
[4]). One of the reasons behind this delay has been the lack of clear definition of the
sequence of activities that are carried out in clinical processes.
   Since a MEDMOD diagram is formally a Class diagram according to the profile
seen in Fig. 2, we can talk about instances of this class diagram, and we can
investigate the notion of granularity of MEDMOD diagrams. The Master class comes
out very naturally in this case, because the process diagram always has the starting
action, which can serve as the Master class. The conclusion is that the ontology given
by the MEDMOD language is granular.
   Since the instance graph is again divided into slices (assuming we have formulated
the instance legality criteria), we can query it either by specifying one concrete slice
or several similar slices (e.g., “What is total expenses for the patient Wolf?”), or over
a set of slices (e.g., “What is the average age of all patients treated by the doctor Stan
Lee?”).
   The query language described in the next section is explained based on the
MEDMOD example.


3 Query language

    If an ontology is granular – its underlying instance graph can be divided into
slices, then we can define simple and efficient means of querying the instance graph.
In this Section we describe an ontology based graphical in-place query language that
is easy to use even by non-IT specialists and the result of a query can be retrieved in
the linear time O(n) where n is the number of objects in the instance graph.
    Since instance graph has been divided into slices accordingly to some granular
ontology, questions can be asked accordingly to that ontology. Building a query has
two main activities – filtering and retrieving answers. Filtering, actually, is setting
simple constraints on objects. Constraints can be set on any attribute of any class in
the ontology. Once a constraint has been set, the instance graph is reduced to those
slices, which contains at least one object that meets the constraint. Let’s call it the
filtered instance graph. We allow to retrieve answers for two types of questions, the
first has an answer as a single number, e.g., “How much did the Dr. Jekyll’s patients
cost?” the second has an answer as a list of objects, e.g., “Which patients with
Pneumonia had no X-ray?”.
    Very important aspect of the query language is that its concrete syntax is based on
the data ontologies language used to specify the ontology. An example of a query
based on the MEDMOD language is given in the Fig. 4. The real-world examples, of
course, are not as tiny as the given example - just 3 patients. An average hospital in
Latvia (500 beds hospital) treats about 30 000 patients per year [5]. In order to better
understand the query language we give an insight in the process of building queries.
  Fig. 4. An example of query based on MEDMOD – the emergency department management.

   Let’s assume that we have obtained the instance graph conforming to the given
ontology (MEDMOD diagram). We leave behind the problem of getting data from
hospital’s information system. The person in interest (e.g., physician or manager)
starts with a query diagram that is based on the given MEDMOD diagram – the query
diagram has the same layout and elements as the MEDMOD diagram. It describes the
familiar for the physician process of the emergency department management in the
hospital. By default the query diagram contains boxes indicating the number of
objects of each class in the instance graph. (See Fig. 4, boxes labeled Count). These
are answers to simple questions like “How many patients have been treated at
emergency department?” or “How many procedures have been executed?” It should
be noted that every answer (result) is depicted as a box in the query diagram. Thus
ontology, constraints, results - everything can be seen graphically in-place - in the
same diagram. The same principle is used by spreadsheet applications – the user can
make changes in any cell of the spreadsheet and observe the immediate effects on
calculated values. In contrast most of query languages, e.g., SPARQL or SQL, have
separate representations for data model, query and data. Now one can start filtering
data by pointing to a class in the diagram and selecting an attribute. Simple
constraints on attribute’s values can be set – comparisons like equals, greater than,
less than, etc., can be made to the constants of appropriate type. Following the
simplicity of spreadsheet applications, no more than two constraints (comparison
operations) are allowed on each attribute. Both constraints may be mandatory (logical
AND), or at least one of the constraints must be met (logical OR).
   When a constraint on an attribute has been set, the instance graph is filtered and all
answers (result boxes) in the diagram are reevaluated and all boxes refreshed. Thus
the dynamic response to each step in construction of the complete query allows the
physician to see immediate reaction to every action. It shortens the learning curve
greatly and reduces the number of errors – they can be recognized much earlier. This
effect is called direct manipulation interaction mechanism [6].
   As it was mentioned earlier, all answers were depicted as boxes in the diagram. At
any moment these boxes can be removed and new boxes can be added. Possible single
number answers are: Number of objects of given type in the filtered instance graph,
Sum of values of given attribute in the filtered instance graph, Average of values of
given attribute in the filtered instance graph. The only allowed answer that is not a
single number is the list of objects (with attribute values) of given type in the filtered
instance graph. (See Fig. 4 for all types of answers).
   Let’s define the constraint, the query and the answer more formally. Assume that
we have a granular ontology D which consists of classes which in turn contains
attributes. Since the ontology D is granular, there exists some Master class A∈D.
Before one can query the instance graph G, it must be divided into slices respective to
objects of class A. Thus the queries must be executed over set of non-overlapping
slices S.
                            S = {s|s = S(x, G) & x ∈ A }
   Slices consist of objects with associated key-value lists, where keys are attribute
names and values are attribute values. The ontology determines possible attributes
and their range of values (type) for objects of given class.
   Let a be an attribute of some class B∈  ∈D and let T∈D be the type of the attribute a.
Then constraint on attribute a is the following Boolean expression:
1) One of the simple comparisons – a > const, a = const, a < const, where const∈T;
2) Conjunction (and) or disjunction (or) of two simple comparisons, e.g., “a<10 and
    a>5”.
   Such constraint can be checked on an object of class B in a time that consists of
time that is needed to locate the value of the given attribute in the object’s list of
attribute values and time that is needed to do actual comparison and logical
operations. Thus, the total time needed to check a constraint on a single object
depends only on the size of the given ontology and implementation (coding) of
objects. Therefore for each ontology and its implementation there exists such constant
C, that a single constraint can be checked on a single object in time less than C.
   As it was mentioned before, the physician is allowed to set just one constraint at
once. After the constraint is set it is evaluated immediately. Let’s define more
precisely, what does it mean to evaluate a constraint c on attribute a of class B on the
instance graph (set of slices S) and obtain the filtered instance graph – the subset of S.
   The main idea is to go through all slices and check all objects in particular slice. If
there is an object of the given type and the constraint c evaluates to true on that object,
then the slice is added to the filtered instance graph. It is easy to see, that in the worst
case all objects in instance graph have to be checked to evaluate the constraint, but no
more, because slices are non-overlapping. However, checking a single object does not
require more time than the constant C, thus the total time needed to evaluate a
single constraint on the instance graph G is O(n), where n is the number of
objects in G.
   The complete query Q is the ordered set of constraints. The execution of the query
starts with evaluation of the first constraint in the set and continues with gradual
evaluation of next constraints on the result of the previous. As it was mentioned
above, the typical number of patients treated in an average hospital in Latvia is 30000
per year. It would be the number of slices in the instance graph for the MEDMOD
example. Our experience and initial experiments with query language show that last
constraints in typical queries are evaluated on much smaller filtered instance graph
comparing to the initial instance graph. It may allow us to predict that the execution
of complex queries would be efficient for instance graphs even larger than
abovementioned 30000 slices.
   Now we can define more precisely, what answers (result boxes) can be retrieved.
Once the filtered instance graph FS has been obtained, here are possible answers:
1) Number of instances of given class B in the filtered instance graph FS
2) Sum of values of given attribute attr (in class B) in the filtered instance graph FS
3) Average of values of given attribute attr (in class B) in filtered instance graph FS
4) List of objects of given class B in the filtered instance graph FS
   Just like in case of constraints, also retrieving an answer does require a single
inspection of an object in the instance graph. Thus, the total time to retrieve an answer
on the instance graph G is O(n), where n is the number of objects in G. It should be
noted that the query language may be extended without loss of efficiency by other
means that also can be evaluated in the linear time, e.g., retrieving average number of
instances of given type per slice, filtering slices by number of instances of given type,
however we do not describe them all because of limitations of space.
To sum up, the main advantages of the proposed query language are:
• The view on data through “glasses” of familiar ontology (e.g., everybody in the
  hospital should know, how does it work!);
• The simple and easy-to-perceive means of setting filtering conditions require no
  more expertise than using spreadsheet applications (like MS Excel);
• The dynamic response to each step in construction of the complete query – the
  doctor sees immediate reaction to every action. It shortens the learning curve
  greatly and encourages even non-experienced users to try this out;
• The efficiency of query execution. It is required the linear time regarding to the
  size of the instance graph to filter and retrieve answers.


4 Related Work

Graphical query languages have been interesting to the researchers as long as textual
query languages exist. They have been developed as an attempt to fulfill the promises
of query languages to give an easy-to-use means for ad-hoc data analysis, because in
practice the powerful query languages (like SQL) have not became the mainstream
tools for non-IT users. Number of graphical (visual) query languages for relational
databases emerged in the late 80-s of the previous century [7, 8, 9]. However at that
time the implementation of graphical languages was an expensive and time-
consuming, not even thinking of usability issues that came along the involving non-IT
users. They tend to cover every feature of SQL and as the result of that we can name
just few examples of graphical query building tools, like, query designer in Microsoft
Access [10] that provides means to build SQL queries graphically.
   At the same time the spreadsheet applications have been widely used by non-IT
users. They allow dealing with data in tabular form (no relations). One of the reasons
of the spreadsheet’s success story is the usage simple concepts like cell, row, column,
etc., coming from real paper-based documents. Another reason is the dynamic
response on every action that takes place in the spreadsheet – user sees all changes in
the document immediately, just like in the query language we propose in this paper.
   Nowadays the graphical language workbenches [11, 12] allow building graphical
languages quickly. Thus the merely forgotten question about building visual query
languages is back on the timetable. Ontologies have become popular in recent years.
Therefore, the attention has been shifted from relational databases and ER models to
ontologies. Thus the query languages for ontologies have emerged and particularly
the graphical query languages for ontologies [13, 14]. And once again, these
languages focus on graphical representation of the query, try to cover all features of
SPARQL and separate the representations of ontology, query and data.


5 Conclusions and Future Work

One of the main results of this paper is the notion of granular data ontologies. This
notion is defined very formally in the paper. Based on the notion of granularity an in-
place graphical query language is then introduced. It is partly tested on real end-users
– doctors of a hospital. As the first experience has shown, the query language
possesses two essential features:
1) it is easily perceptible, and it is therefore easy to use by domain experts that are not
    IT specialists;
2) it has very efficient (linear regarding to the size of an instance graph) execution
    time for retrieving answers to queries.
   Many noticeable data ontologies turn out to be granular, which means an efficient
query language can be developed for them. At the same time there are also lots of
other ontologies, which are not granular, and that prohibits us to use our query
language for them. One of the main directions of our future research is to specify
another meaningful class of data ontologies, which are granular in a wider sense. We
will therefore extend the notion of ontology granularity allowing one to use the
efficient query language for this class of ontologies. The efficiency of the query
language will be preserved, i.e. the time evaluation of the query execution will remain
linear. Other future research directions include, but are not limited to the following:
1) To keep on improving the query language and to test it on a wider range of
    potential end-users;
2) To continue optimizing the language implementation in order to improve the time
   needed for retrieving answers to queries formed over data containing about 30000
   hospital transactions (our goal is to get the answer in less than a second here);
3) To further investigate practical use-cases of our approach in other areas outside
   the context of a hospital.


Acknowledgments

This work has been partially supported by the European Regional Development Fund
within the project Nr. 2010/0325/2DP/2.1.1.1.0/10/APIA/VIAA/109 and by the
Latvian National Research Program Nr. 2 „Development of Innovative
Multifunctional Materials, Signal Processing and Information Technologies for
Competitive Science Intensive Products” within the project Nr. 5 „New Information
Technologies Based on Ontologies and Model Transformations”.


References

1. Barzdins, J., Barzdins, G., Cerans, K., Liepins, R., Sprogis, A.: UML Style Graphical
    Notation and Editor for OWL 2. P. Forbrig and H. Günther (eds.), Perspectives in Business
    Informatics Research, LNBIP, Vol. 64, Springer, p. 102-113, 2010.
2. Barzdins, J., Barzdins, J., Rencis, E., & Sostaks, A.: Modeling and query language for
    hospitals. Health Information Science, LNCS, Vol. 7798, Springer, pp. 113-124, 2013.
3. Müller, R., & Rogge-Solti, A.: BPMN for Healthcare Processes. In Proceedings of the 3rd
    Central-European Workshop on Services and Their Composition, ZEUS 2011, Karlsruhe,
    Germany, February 21--22, pp. 65-72, Karlsruhe: CEUR-WS.org., 2011.
4. Agt, H., Kutsche, R.D., & Wegeler, T.: Guidance for domain specific modeling in small
    and medium enterprises. Proceedings of the compilation of the co-located workshops on
    SPLASH '11 Workshops, 63, 2011.
5. Central Statistical Bureau of Latvia, http://www.csb.gov.lv
6. Shneiderman, B.: Direct manipulation: A Step beyond Programming Languages, IEEE
    Computer, 16, pp. 57-69, 1983.
7. Angelaccio, M., Catarci, T. and Santucci, G.: QBD*: A graphical query language with
    recursion. In Proc. Third Human Computer Interaction Conf., 1989.
8. Cruz, I.F., Mendelzon, A.O. and Wood, P.T.: A graphical query language supporting
    recursion. In Proc. ACM SIGMOD Conf. Management of Data, 1987.
9. Czejdo, B., Embley, D., Reddy, V. and Rusinkiewicz, M.: A visual query language for an
    E-R data model. In Proc. Int. Workshop Visual Languages, Rome, Italy, 1989.
10. Microsoft Access – Office.com, http://office.microsoft.com/access
11. Sproģis, A., Liepiņš, R., Bārzdiņš, J., Čerāns, K., Kozlovičs, S., Lāce, L., Rencis, E.,
    Zariņš, A.: GRAF: a Graphical Tool Building Framework. Proceedings of the Tools and
    Consultancy Track. European Conference on Model-Driven Architecture Foundations and
    Applications, Paris, France, pp. 18-21, 2010.
12. Graphical Modeling Framework (GMF) Tooling, http://eclipse.org/gmf-tooling
13. Zviedris, M., Barzdins, G.: ViziQuer: A Tool to Explore and Query SPARQL Endpoints,
    The Semantic Web: Research and Applications, LNCS, Vol. 6644, pp. 441 – 445, 2011.
14. Fadhil, A., Haarslev, V.: OntoVQL: A graphical query language for OWL ontologies. In:
    International Workshop on Description Logics. (2007)