=Paper=
{{Paper
|id=None
|storemode=property
|title=Reified Literals: A Best Practice Candidate Design Pattern for Increased Expressivity in the Intelligence Community
|pdfUrl=https://ceur-ws.org/Vol-713/STIDS_A1_Peterson.pdf
|volume=Vol-713
|dblpUrl=https://dblp.org/rec/conf/stids/Peterson10
}}
==Reified Literals: A Best Practice Candidate Design Pattern for Increased Expressivity in the Intelligence Community==
Reified Literals: A Best Practice Candidate Design
Pattern for Increased Expressivity in the Intelligence
Community
Eric Peterson,
Global Infotek, 1920 Association Drive
Reston, VA 20191, USA
epeterson@globalinfotek.com
Abstract. Reifying literals clearly increases expressivity. But reified literals
appear to waste memory, slow queries, and complicate graph-based models.
We show where this practice can be comparable to unreified literals in these
respects and we characterize the cost where it is not. We offer examples of how
reification allows literals to participate in a variety of relations enabling a
marked increase in expressivity. We begin with a case study in reified person
names, and then extend this analysis to reified dates and simple reified scalar
values. We show benefits for name matching and temporal analysis such as
would be of interest to the Intelligence Community (IC). We then show how
these same sorts of analyses can drive or inform any decision as to whether to
reify literals.
Keywords: reified literal, semantic, ontology, expressivity, best practice,
design pattern, Intelligence Community
1 Introduction
Reifying literals is not uncommon among popular ontologies and relational data
models. But data architecture teams in the IC can draw from varied backgrounds and
the use of reified literals may not be desired.
The practice may not be desired because it appears to waste memory, slow queries,
and complicate graph-based models despite the increase in expressivity that it offers.
We lay out simple, general metrics for judging such memory waste, slowness and
complication. We show where the practice can be comparable in those respects to
unreified literals and we characterize the modest cost where it is not. We show how
reification allows literals to participate in a variety of relations which foster
expressivity. Beginning with a case study comparing reified with unreified person
names, we then extend the analysis to dates and simple scalar values. We then show
how these same analyses can drive or inform any decision regarding whether to use or
not use literal reification. This paper works toward establishing the reified literal
design pattern as a best practice component.
STIDS 2010 Proceedings Page 58 of 135
We define reified literals as instances that represent literal values. A reification of
a name string literal value might be an instance of type Name with a datatype property
containing the value of the name string. This name instance might then be attached to
a person instance by the givenName object property statement (See Figure 1).
Figure 1: In the top unreified example, givenName is a datatype property statement. In the
bottom reified example, givenName is an object property statement referencing a Name
instance.
2 Current Practice, Related Work, and Contributions
Literal reification is not an uncommon practice. OpenCyc[1], Iode[2], and SUMO[3],
ontologies have pervasively reified literals. DOLCE[4] leaves literal definition to
extending ontologies.
W3C Semantic Web Best Practices and Deployment Working Group presented a
draft by Hobbs and Pan[5] of a time ontology that uses only reified time literals.
Project NeOn’s Ontology Design Pattern Repository[6] contains a pattern for reified
lexical items (terms). Many more main-stream examples exist.
The closest work related to the reified literals design pattern is Presutti and
Gangemi’s[7] content ontology design pattern requirements - to which reified literals
comply1.
Reified literals are not new. But we choose to characterize the virtues and cost of
using this design pattern. We address some common misconceptions about this
design pattern’s performance by detailing memory footprint cost, speed, and design
1 The reified literal design pattern is computational in that it is language independent and is
encoded in a higher order language (OWL). It is clearly small. It is autonomous (deployable
as a single file). It is hierarchical in that the class Literal must be subclassed for each
particular literal. It is inference-enabling in that its reified instances become the foci of
relationships that say something about the literal. Dates participate in Allen’s interval
calculus relations via transitive closure for example. The pattern is cognitively relevant in
that it is intuitive, compact, and captures relevant notions in a domain. It is linguistically
relevant in that we speak of names, dates, etc. as real things.
STIDS 2010 Proceedings Page 59 of 135
complexity. We give examples of the pattern’s increase in expressivity. We show
how to apply theses analyses to all literals.
3 Methods and Metrics
We describe simple, simple metrics to compare reified literal costs in memory and
query speed with respect to those of unreified literals. With the relative complexity of
reified literals and the benefits of their increased expressivity, however, we do not
attempt to go beyond a qualitative description.
For memory usage comparison the two approaches differ structurally only by the
type of one statement (after shared structure is amortized away). With the reified
approach the type of the statement in question is ObjectProperty and with the un-
reified approach it is of type DatatypeProperty. If datatype property statements are
less compact in a particular implementation than object property statements, then the
reified literal approach is correspondingly more compact for commonly referenced
literals. The opposite is true if datatype properties are more compact in memory.
Our query speed comparison shows one type of reified literal queries that are faster
than or equal to non-reified queries. This is due to the fact that there is a faster than
or equal relationship between (i) an equijoin and (ii) a join equating two unreified
literals. The other type of reified literal query is slowed by the speed associated with
addition of a single equijoin. Further, we state that (i) an equijoin may be much faster
than (iii) a join inexactly matching two unreified literals2.
i: {?person1 givenName ?reified_name .
?person2 givenName ?reified_name . }
ii: {?person1 givenName ?unreified_name .
?person2 givenName ?unreified_name . }
iii: {?person1 givenName ?unreified_name1 .
?person2 givenName ?unreified_name2 .
FILTER (likeTerm(?unreified_name1,
?unreified_name2,
partialMatchSpec) }
Since the equijoin uses fast instance or integer comparison, it is faster or
comparable to the join of two unreified literals3.
2 The likeTerm function is a non-standard extension to SPARQL for performing partial
matching of strings. It is a more powerful version of the SQL LIKE keyword. The third
argument is a partial match specification string that specifies the type of partial match and an
optional matching template. In our implementation and others, inexact match is onerously
slow compared to the speed of an equijoin. But the usage of a full text index, could make
inexact search significantly faster.
3 In RDF store implementations where strings are shared resources, the reified and unreified
approaches are comparable because both are based on the matching of integers rather than
strings.
STIDS 2010 Proceedings Page 60 of 135
The comparison of structural complexity we can base in part on statement counts
and amortization of memory footprint comparison. But we ultimately rely on our
reader’s judgment as to the relative complexity.
The comparison of these sometimes negative costs against the benefits of increased
expressivity is similarly subjective.
4 Case Study: Reified vs. Non-reified Names
First we treat the question of memory waste for reified names. Creating a new reified
name each time a data source mentions that name would clearly waste memory. We,
however, create a particular reified name instance just once for all its references to
share. The cost of representing a non-reified name is one statement (). The cost of a reified name would count the following object
property statement () plus the amortized cost of
the shared reified name components for the name John: and . If just one reference to a person with the
name "John" is in the data store, the extra reification cost is two statements – three
times the non-reified approach. If the cost is shared between two references, the
amortized extra cost is one statement; and among ten, the extra cost drops to one fifth
of a statement (see Figure 2).
Figure 2: Because reified names are shared, the memory cost of the name rdf:type statement
and name value statement is amortized out over all eleven name references. Note the several
examples of the expressive power of reified names.
STIDS 2010 Proceedings Page 61 of 135
If names are common, the cost for their shared statements is negligible. So
uncommon or rare reified names are individually costlier than unreified names, but
the overall reified name cost is a function of the average number of references to the
names.
Second, we treat the question of query speed. Exact name matching queries need
not be string based. With name reification, one can match name instances rather than
name strings. Implementations, then, can use integer comparison for speed equal to
or faster than matching un-reified names.
When attempting an exact match query on a particular name, on the other hand,
one must create the URI for that particular name, and one must create it in some
repeatable canonical fashion. One must, for example, always translate the name John
to precisely the same URI (e.g.: http://foo.gov/bar#Name_JOHN). This same name
URI creation algorithm must be used for all names in the knowledge base. With these
precautions, particular name matching also can also be a matter of simply comparing
instances/integers rather than using an extra join to compare strings.
Inexact name matching requires an extra join when using reified names because the
actual name string must be consulted. But this increase in query time is never large
and is moot when using a system whose time for inexact matching overwhelms the
cost of that extra join.
Next, we consider the structural complexity associated with reified names. The
path length from a person node to her actual name value is one statement longer with
the reified approach, but paying this price allows us to say things about names (see
next section) in an organized fashion. We claim that if name meta-information is
required, it is more intuitive to link it to the reified name instance and that the net
effect is a reduction in complexity.
We now consider the benefits of reified names. Perhaps the most obvious benefit
is being able to conveniently and intuitively say things about names and to reason
about that information. A reified name can be linked directly to various information
of interest to the IC such as its New York State Identification and Intelligence System
(NYSIS) encoding, its variants, its nicknames, its ethnic derivation, a notion of its
level of formality, its gender association, etc. (See Figure 2). Such information can
be encoded in a system using non-reified literals, but the querier would need to
understand the association between the name meta-information and the respective
names. Clearly it is more intuitive to directly link the information about a name to
some shared representation of the name. A newcomer need not know where the
NYSIS information is stored. She simply queries on the name and sees, by
inspection, that NYSIS information is associated with its corresponding names.
5 Generalizing Name Reification Results for Additional Literals
We discuss in detail the merits of reifying other literal types. We begin with dates,
heights and weights.
As with reified names, reified dates are shared among all the events that reference
them and, consequently, experience the same potential for memory cost amortization.
Exact date match query speed, as for all reified literals, is at least comparable to the
STIDS 2010 Proceedings Page 62 of 135
non-reified case. Simple time range queries bounding the reified date’s xsd:date
value with two xsd:date values require an additional join Structural complexity of
reified dates is clearly comparable to that of reified names. Expressivity-wise, dates
are, of course, actual time intervals rather than simple time points. As an immediate
result of date reification, one can concretely begin to better support temporal
reasoning for IC applications. One can start by attaching beginning and end date-
times to a date so as to allow dates to participate in precise time-interval-based
queries. Reified dates can be unknown and yet known to be before some other known
date. Such unknown date instances can participate readily in temporal interval
overlap relations such as temporallyContains in order to bound the date if possible
(see Figure 3).
With height and weight reification, amortized sharing cost efficiency is somewhat
moot as likely usage tends toward few height and weight values4. Because even a
minute difference in an exact match query is a miss, range queries are much more
useful with scalars. Query speed in this case, therefore, is reduced by the cost of an
equijoin, as actual weight and height values must be accessed. Reification increases
expressivity such as being able to encode that one person is taller than another and
both heights were unknown.
Figure 3: The following diagram shows six birth events each with a different sort of partial
date. Years, months, dates, and birth events all have temporal extent and can, therefore,
participate the various temporal interval algebra relations. Unknown date and months have
URI names with embedded SHA 256 hash values to prevent them from coalescing with similar
unknown dates and months. Were birthdays modeled as a datatype property rather than as
reified dates, this sort of query-time expressivity would difficult and less intuitive.
These analyses apply to all twenty five of our twenty five literals. Without going
into prohibited detail, all our literals are inherently sharable and, therefore, offer the
same memory cost amortization potential. Zipf-Mandlebrot power law results are
infrequently available for reifiable literals and they offer no guarantees that their
exponents will be such that we can know that memory cost will be negligible[8]. All
4 There are 289 English half inch values between zero and twelve feet inclusive.
STIDS 2010 Proceedings Page 63 of 135
but four of our date types and four of our scalars inherently lend themselves to using
exact match and partial match, as with reified names. The reified name analysis, then,
directly applies to these 17 literals. The two other scalar types behave just as height
and weight with the range queries that are slower by one equijoin. The other date
types are Month, Year and TimePoint. They are all simply date-like time intervals of
various sizes, so their analyses are comparable to Date5
6 Results, Discussion, and Future Work
We established that common reified names have comparable in-memory cost to non-
reified names. We similarly established that query speed for exact matching of reified
names is equal or better than non reified names. And the additional speed cost of
inexact matching is negligible in systems where inexact matching speed dominates
that of a join. We argue that the overall structure of reified names and their metadata
is simpler. We showed that reified names allow a sort of tightly linked expressivity
that un-reified names do not.
We similarly analyzed dates, heights, and weights and found them to be slower by
one join in interval queries. We found date expressivity to be significant and height
and weight expressivity similar yet less likely to be justified by our data sets.
We distilled the following rules from the above analysis to help determine when
the literal reification design pattern should be used:
1. Rare reified literals are individually costly, but the net cost is only a
concern if there are very many rare types.
2. Range queries such as with reified scalars and dates are slower by an
equijoin.
3. Inexact match queries over reified literals are slower by an equijoin. That
equijoin is inconsequential on systems where inexact match dominates the
query time.
4. Otherwise the speed and memory cost is comparable.
As we value expressivity, we found most of our literals to be reasonably strong
candidates for reification. Our desire for expressivity also makes us less concerned as
to how well amortized our shared structure is. We found all of our scalars to be
weaker/marginal candidates because we have no present or near future need for
scalar-related expressivity. Their inclusion would be based more on a desire to apply
all design patters consistently.
In all cases, the value of the expressivity gain must be subjectively weighed against
the possible cost in memory and speed. We have used literal reification for over
fifteen years in the IC and in two different data integration projects at scale.
We expect that as we continue to observe the results of our choices to reify and not
to reify literals, we will more finely characterize how to make such choices in the
future. We expect to have opportunity to garner shared structure amortization
statistics on our various reified literals.
5 TimePoint is encoded as an interval as per common convention. TimePoint duration varies
with the number of significant digits in the input.
STIDS 2010 Proceedings Page 64 of 135
7 Conclusions
Commonly referenced reified literals come at little or no significant cost in memory,
speed, or complexity. Queries over such literals are never slower than the cost of one
join with respect to unreified literals and are usually comparable. Where literal-related
expressivity is specifically needed or expected, reified literals should be considered.
References
1. Cycorp Inc.: OpenCyc. http://opencyc.org
2. Hightfleet (formerly OntologyWorks): IODE, http://www.highfleet.com
3. Pease, A., Niles, I., and Li, J.: The suggested upper merged ontology: A large ontology for
the semantic Web and its applications. In Proceedings of the AAAI-2002 Workshop on
Ontologies and the Semantic Web, Edmonton, Alta., Canada (2002)
4. Masolo, C., Borgo, S., Gangemi, A., Guarino, N., Oltramari, A., and Schneider, L.: DOLCE:
A Descriptive Ontology for Linguistic and Cognitive Engineering. WonderWeb Project,
Deliverable D17 v2.1 (2003)
5. Hobbs, J., Pan, F.: Time Ontology in OWL. Working draft, http://www.w3.org/TR/owl-
time (2006)
6. Charlet, J., Vandenbussche, P.: Concept Terms. Ontology Design Patterns.
http://ontologydesignpatterns.org/wiki/Submissions:ConceptTerms
7. Gangemi, A.: Ontology Design Patters for Semantic Web Content. ISWC 2005. LNCS,
vol. 1729, pp. 262-276 (2005)
8. Zipf, G., Selected Studies of the Principle of Relative Frequency in Language. Harvard
University Press, Cambridge, MA (1932)
STIDS 2010 Proceedings Page 65 of 135