Semantic Indexing Based on Description Logics Albrecht Schmiedel Technische Universitat Berlin atms@cs.tu-berlin.de Abstract Basic entities: patient :< anything. A method for constructing and maintaining a examination :< anything and `semantic index' using a system based on de- not(patient). scription logics is described. A persistent index observation :< anything and into a large number of objects is built by classi- not(patient) and not(examination). fying the objects with respect to a set of index- ing concepts and storing the resulting relation Basic relations: between object-ids and most speci c indexing hasExam :< domain(patient) and concepts on a le. These les can be incre- range(examination). mentally updated. The index can be used for hasItem :< domain(examination) and eciently accessing the set of objects matching range(observation). a query concept. The query is classi ed, and, hasValue :< domain(observation) and based on subsumption and disjointness reason- range(number). ing with respect to indexing concepts, instances Other Primitives: are immediately categorized as hits, misses or hce :< examination. candidates with respect to the query. Based on the index only, delayless feedback concerning bloodPressure :< observation. the cardinality of the query (upper and lower bloodPressureSystolic :< bloodPressure. bounds) can be provided during query editing. bloodPressureDiastolic :< bloodPressure and not(bloodPressureSystolic). normal :< observation. 1 Introduction abnormal :< observation and not(normal). Indexing generally involves an association between some kind of key and the actual target. The key is used to Table 1: Primitive concepts and roles jump directly to a desired piece of information, thereby avoiding an exhaustive search through large sets of can- didates. In the context of databases, keys are usually 2 Index Construction based on the set of values of a particular attribute of In a description logic such as BACK a data base is viewed the objects to be indexed: if we know the value, we can as a set of distinct objects (also instances or individuals) move directly to the corresponding object(s). typically representing domain entities, each of which is In the following, a description logic (DL) based ap- associated with a description. proach to indexing is sketched which broadens the no- Descriptions are terms built with tion of a key: instead of using attribute values, indexing  term-forming operators such as and, all, some, elements can be arbitrary structured concepts as pro- etc., the logical constants provided by the language, vided by a terminological language such as BACK (cf.  primitive concepts and roles introduced by the user, [Hoppe et al.,1993]). Firstly, I will show how the con- and struction of such an index falls out quite naturally from  named de ned concepts and roles. the normal workings of a terminological reasoner, and secondly I will discuss how such an index can be used. Table 1 shows some top level primitive concepts This approach, and an experimental implementation, is roles for building a data base containing descriptions described in more detail in [Schmiedel,1993]. of patients, examinations, and observations made in examinations1. Patients are related to examinations via patient1 :: patient and hasExam:closed(hce1 and hce2 ). hasExam, and examinations to observations via hasItem. hce1 :: hce and hasItem:closed(bpsys1 and bpdia1 ). examSomeBpSysAbnorm := hce2 :: hce and hasItem:closed(bpsys2 and bpdia2 ). examination and some(hasItem, bpsys1 :: bloodPressureSystolic and hasValue:130. bloodPressureSystolic and bpdia1 :: bloodPressureDiastolic and hasValue:90. abnormal). bpsys2 :: bloodPressureSystolic and hasValue:150. patSomeBpAbnorm := bpdia2 :: bloodPressureDiastolic and hasValue:95. patient and some(hasExam, examSomeBpAbormal). Table 4: Object descriptions Table 2: De ned concepts an asterisk) are related by subsumption links; disjoint- Table 2 gives two examples for named descriptions (de- ness has been left out for the sake of simplicity. The ned concepts) using the primitives. examSomeBpSys- individuals at the bottom of the graph, a patient with Abnorm is an examination which has an item which is two examinations, each of which with two observations, an abnormal systolic blood pressure, and patSomeBpAb- are linked to the most speci c concepts they instantiate. norm is a patient which has an examination which has an For example, bpsys2 is classi ed under the conjunction of abnormal blood pressure. De ned concepts are syntactic bPSystolic, which was explicitly told, and abnormal, due sugar for abbreviating possibly complex descriptions. to an abnormality rule as in the example above. This leads to the classi cation of hce2 under examSomeBp- bloodPressureSystolic and all(hasValue, gt(140)) SysAbnorm (`an examination with an abnormal systolic => abnormal. blood pressure') which in turn triggers the classi cation bloodPressureSystolic and all(hasValue, 110..140) of patient1 as an instance of patSomeBpSysAbnorm (`a => normal. patient with an examination containing an abnormal sys- bloodPressureSystolic and all(hasValue, lt(110)) tolic blood pressure'). Note that hce2 (patient1) was ex- => abnormal. plicitly told to be only an examination (patient); the more speci c concepts were derived by the system as a conse- Table 3: Rules quence of the role ller relations, the de nitions and the rules. Descriptions are also used to de ne rules, which are ex- In the following, two properties of description logic pressed as implications between two descriptions. The based systems not present in mainstream database sys- left hand sides of the rules shown in Table 3 are descrip- tems play a crucial role: tions of certain sets of observations which are asserted to be in the set of normal or abnormal ones by the de-  the ability to handle any degree of partial informa- scription on the right hand side. tion in conjunction with an open world assumption, Table 4 shows how data is actually entered into the and system. The `::' operator is used for asserting that the  the ability to describe individuals with complex con- description on the right hand side is true for the object cepts and to use these descriptions for query answer- referenced on the left hand side. Here, there is an object ing. patient1, an instance of patient, with two examinations, hce1 and hce2, both instantiating the concept hce. The These two properties make it possible, for example, to keyword closed indicates that all llers of the hasExam remove all the information concerning observations (the role are known, i.e. there are only two examinations. shaded part in Fig. 1), but to keep all the information The examinations each have exactly two observations, that was derived from observations concerning other en- each of which has exactly one numeric value. tities. Thus, hce2 will still be known to be an instance of Based on this type of input, the system computes examSomeBpSysAbnorm, but the observations and their  for concepts the subsumption and disjointness re- values from which this was derived will become unknown. lation, i.e., for each pair of concepts whether one We can now de ne a set of individuals to be indexed subsumes the other or whether they are disjoint, (for example the set of patients), choose a set of index- ing concepts (e.g., the concepts specializing patient), and  for each individual the set of concepts it is (and is store the relation which associates each indexing concept not) an instance of. with the individuals it instantiates. This relation can ef- For our example containing three kinds of entities, pa- ciently be stored in two hashtables: one maps individ- tients, examinations, and observations, the result of this is ual names to the set of most speci c concepts describing illustrated in Fig. 1. Concepts (primitives marked with them, and the other maps concept names to the set of individuals they directly instantiate, i.e. those which are 1 For a more detailed description of the BACK language not instances of any subconcept. It is also useful to store see [Hoppe et al.,1993]. the associated cardinalities. anything * * * patient examination observation examAllBpSysNorm examSomeBpAbnorm patAllBpDiaNorm * bloodPressure patSomeBpAbnorm examAllBpDiaNorm * * Norm bPSystolic * * patAllBpNorm examAllBpNorm bPDiastolic Abnorm examSomeBpDiaAbnorm patSomeBpSysAbnorm examSomeBpSysAbnorm SN DN SAb DAb hasValue hasItem bPSys1 130 hasExam hce1 bPDia1 90 patient1 bPSys2 150 hce2 bPDia2 95 subsumed by instantiates filled by Figure 1: Example KB 3 Using the Index Queries using the index are processed in three distinct phases, each one providing progessively more informa- Based on this stored relation and the original concept tion at additional costs. The rst phase is designed to de nitions, a new knowledge base can be built which provide cheap and immediate feedback on the expected contains only the classi cation of the individuals with cardinality of the result of a query. For this only the respect to the indexing concepts, but lacks the full indi- cardinalities associated with indexing concepts need to vidual descriptions (see Table 5). It may thus be much be loaded. The query is classi ed, and cardinality con- smaller than the original KB. Due to the semantic prop- straints for it are computed based on the known cardinal- erties mentioned above, it will be ignorant w.r.t to some ities of indexing concepts, and their logical interrelations. information contained in the original KB, but it will Thus, the example query shown in Fig. 2 must have at never produce contradictory answers. This makes it use- least 40 instances, since there are two indexing subcon- ful as an index. cepts the cardinalities of which are added because they can be proved disjoint by the system. Similarly, there is an upper bound of 80 instances for the query, because the patient1 :: patSomeBpSysAbnorm and indexing superconcept with the least cardinality (100) patAllBpDiaNorm. has an indexing subconcept (20) disjoint from the query. :: hce and examAllBpNorm. Depending on this cardinality information, the user can hce1 hce2 :: hce and examSomeBpSysAbnorm and either re ne his query, specializing or generalizing it as examAllBpDiaNorm. desired, or proceed to the second phase. The quality of the cardinality feedback depends very Table 5: Abstracted object descriptions much on how close the query is related to already exist- n=100 n=300 Compared with standard value-based indexes, this re- someBpAbnormal male sults in the following characteristics: (1) A semantic index is inherently multidimensional since any combination of properties cast into a DL con- n=20 cept (i.e. an arbitrary query) can serve as an indexing element. n=40..80 (Phase I) (2) As a structured concept the indexing elements are someBpDiaAbnormal and someBpSysAbnormal allBpSysNormal and male not just attribute values, but can be based on complex descriptions of related individuals. (3) A semantic index as a whole is highly adaptable to patterns of usage. Indexing concepts can be added or n=30 n=10 someBpSysAbnormal someBpSysAbnormal removed at will, making it very dense and precise w.r.t and male and male to interesting sets of individuals, or very sparse in other, and age < 40 and age > 65 less interesting areas. (4) Since the index is actually a set of partial descrip- tions for the indexed instances, lots of information (such as cardinality estimates) can be drawn from the index alone without accessing (possibly remote) individual de- Query Concept Indexing Concept Disjoint Concepts scriptions at all. Figure 2: Approximating the cardinality of a query These properties may turn out useful for building local information servers which cache information at various levels of completeness, depending on usage patterns. ing indexing concepts. If we ask for a concept which is equivalent to an indexing one, we get the exact cardinal- ity. If we ask for a concept which is totally unrelated to References existing indexing concepts, i.e. there are no subsuming, [Hoppe et al., 1993] Hoppe, Th., Kindermann, C., no subsumed, and no disjoint ones, we will get a lower Quantz, J.J., Schmiedel, A., and Fischer, M., BACK bound of 0 and an upper bound equal to the number V5 Tutorial and Manual. KIT Report 100, Depart- of indexed instances. This means no information at all ment of Computer Science, Technische Universitat from the index. Typically, one should get something in Berlin, Berlin, Germany, March 1993. between, some partial information. [Schmiedel, 1993] Schmiedel, A., Persistent Mainte- The second phase additionally utilizes the actual ex- nance of Object Descriptions using BACK. KIT Re- tensions of indexing concepts also stored in the index. port 112, Department of Computer Science, Technis- This generally results in much better cardinality esti- che Universitat Berlin, Berlin, Germany, November mates at the cost of having to load the instances, com- 1993. puting intersections and unions, etc. In case the query is a combination of indexing concepts, its exact extension (and cardinality) can be computed. Otherwise there is a remaining set of candidates, the individuals for which the query is not known to be ei- ther true or false. In this case the index alone does not contain enough information to determine the extension of the query, and the third phase must be entered. For each candidate instance the original description must be accessed and explicitly tested against the query. After this has been done, the user can choose to declare the query as a new indexing concept, making the index more dense at that particular point in the semantic space. 4 Concluding Remarks This semantic indexing mechanism is crucially depen- dent on reasoning with descriptions as provided by ter- minological systems. The indexing elements are poten- tially complex descriptions logically related by subsump- tion and disjointness. Note that incomplete algorithms for computing subsumption are not disastrous for index- ing: they will simply result in a less informed, subopti- mal index.