Recommendation-Based Evolvement of Dynamic
           Schemata in Semistructured Information Systems

                         Wolfgang Gassler                                          Eva Zangerle
                Databases and Information Systems                      Databases and Information Systems
                   Institute of Computer Science                          Institute of Computer Science
                  University of Innsbruck, Austria                       University of Innsbruck, Austria
                    wolfgang.gassler@uibk.ac.at                             eva.zangerle@uibk.ac.at


ABSTRACT                                                          systems, where information can simply be added as text.
Community-based collaborative information systems provide         However, such structureless storage prevents every possibil-
the flexibility of storing information without having to ad-      ity to provide structured access or search facilities on the
here to any predefined, rigid schema. However, the stored         stored information, as for example relational databases do.
knowledge lacks common structure, which is crucial in terms       Consider the example of Wikipedia, which does not provide
of data access and search capability. We present a novel con-     structured search capabilities at all. The only way to search
cept for semistructured information systems, which features       through this unstructured information is to perform fulltext
a dynamic and self-learning schema system and provides            search, which is incapable of answering precise queries like
the users with recommendations when entering information.         “Which Austrian cities have between 10,000 and 20,000 in-
The recommendations ensure the creation and maintenance           habitants?”. For this reason information repositories need
of a common, homogeneous schema while at the same time            to support both structured and unstructured information to
not restricting the user’s flexibility to enter any kind of in-   fully exploit the advantages of both worlds, as also pointed
formation.                                                        out by Weikum et al. [13]. Our paper presents the Snoopy
                                                                  concept which combines the structured and unstructured ap-
1.   INTRODUCTION                                                 proach and takes advantages of both concepts.
When storing information, there are two common ways of            The rest of the paper is organized as follows. In Section
doing so: either one uses a (relational) database to store in-    2 the basic concept of our approach is described. Section
formation in a structured way or one stores information in        3 outlines the measures taken for the creation of a common
a mostly unstructured manner.                                     schema. Our proposed solution is described in Section 4 and
The first approach is very well suited for applications which     related work is outlined in Section 5. Section 6 summarizes
have to store strictly structured information. Considering        the paper and points out open research issues.
the example of bank accounts, all information is structured
and can easily be matched into a schema for the storage of        2.   THE SNOOPY CONCEPT
e.g. bank accounts and their owners. A big disadvantage is        The Snoopy concept offers the same flexibility as wiki sys-
that a change within the schema can be very time-consuming        tems but at the same time provides the possibility to struc-
and complex, as it has to be done manually and the already        ture information like (relational) databases. This is achieved
stored information has to be adapted to the new structure.        by using a self-adapting and self-learning schema system and
Therefore, the end-user is fixed to a given schema and can-       recommendations, which support the user during the inser-
not insert additional information not matching the given          tion process.
schema. For flexible information systems, such a restricting      In the Snoopy concept, information about a certain subject
approach is not very well suited, as it can result in a big       is stored as a collection, similar to a wiki page. A collection
loss of information because of the fact that the user cannot      consists of an arbitrary number of key-value pairs, which can
insert all information he might want to.                          be specified by the user without any restrictions. The fol-
On the other hand, the second, unstructured way of storage        lowing example shows how information about the city “Inns-
allows the user to store information in any arbitrary struc-      bruck” can be stored in a simple and understandable way:
ture or format. This approach has the advantage that the
user does not have to match the data into some predefined         Collection Name:      Innsbruck
schema. A popular example of such flexible systems are wiki
                                                                  Country:Austria
                                                                  State:Tyrol
                                                                  numberOfInhabitants:117,916
                                                                  PanoramaImage:pano_innsbruck_2010.jpg

                                                                  The key-value pairs are similar to relational database columns
                                                                  (keys) and its rows (values). Despite the fact that infor-
                                                                  mation is still plain text, it still features structure. Such
GvD Workshop ’10, 25.-28.05.2010, Bad Helmstedt, Germany.         semistructured data, where structure is present but does
                                                                  not comply to a unified schema, can be used for structured
access (e.g. queries like “All entries containing the prop-      3.1    Suggestion of Keys and Structure
erty Country:Austria”). One of the main challenges of such       The suggestion of keys and structure is the most important
semistructured data is the proliferation of keys and values.     feature within the Snoopy concept. Considering a user enter-
Furnas et. al. [6] showed that two people would choose           ing information about cities (consisting of key-value pairs)
the same term for a certain object (within a restricted do-      into the system, there might be additional useful keys that
main) with a probability of only 20%. Consider the key           the user might not think of at first hand. This is where
“numberOfInhabitants”. It can be assumed that a multiuser        the recommendation engine comes into play. If a user wants
system would produce many synonyms of this word, eg. “in-        to add the key ”inhabitants”, the system computes that 90
habitants”, “citizens” or “numberOfInhabitants”. This fact       percent of the users who added “inhabitants” as a key, also
would imply that the stored information is no longer search-     added “mayor”. The recommendation feature is based on
able in terms of unified access, as the keys are not aligned     a data mining process within the already existing collec-
to a common schema. Wikipedia also has to cope with              tions, which calculates keys that frequently occur together.
this problem of proliferation of structures and tries to solve   Having done these computations, the system suggests this
it by a very big and committed community, which creates          additional key to the user who can then accept this key and
templates (schemata) and manually unifies already present        enter a respective value. Assume, the user wants to store in-
knowledge. Boulain et al. [3] showed that only 35% of all        formation about the city “Innsbruck”. Based on the already
edits in Wikipedia are changes of content. All other edits       inserted keys “country”, “state” and “numberOfInhabitants”,
are related to structure and do not concern the content it-      the system recommends adding the keys “mayor” and “ZIP”.
self.                                                            Recommendations are simply further form fields displayed
The Snoopy concept takes a different approach and pushes         to the user that give him the opportunity to enter appro-
part of these structural adaptions to the user. The user         priate values for the suggested keys. The recommendations
knows more about the collection he enters than any process,      encourage the user to insert more information than originally
machine or community, which try to enhance the informa-          intended and thereby increase the amount of information in
tion after the author inserted it. Therefore, the key idea       the system.
of the Snoopy concept is to “snoop“ as much information as
possible from the user during the insertion process. Further-
more, the user is guided by an adapting, self-learning guid-
ance engine which provides recommendations to the user
                                                                 3.2    Key Completion
                                                                 When entering new key-value pairs, the user gets assistance
based on key-value pairs entered earlier. These recommen-
                                                                 by an auto-completion system. This feature suggests keys
dations support the user in aligning the information he in-
                                                                 that are already stored in the information system, are sim-
tends to enter to a commonly used schema.
                                                                 ilarly structured and therefore semantically related to the
It is important to note that all these recommendations are
                                                                 currently entered key. If the user types e.g. “number”, the
just suggestions and the user is not forced to use these rec-
                                                                 system automatically provides the user with the possibility
ommendations in any way. Therefore, the user is able to
                                                                 to choose “number of inhabitants” or “number of districts”,
enter information without any restrictions.
                                                                 which are keys that are already existent in the database.
                                                                 The user still has the opportunity to provide a new key, e.g.
3.   SCHEMA ALIGNMENT                                            “number of universities” by ignoring the recommendations
The process of integrating and transforming two or more          and continue typing. This recommendation contributes to
(database) schemata into a common schema has proven to           the creation of a common schema and data basis and is
be a very complex task [10]. The Snoopy concept avoids           demonstrated by the following example. If the user wants
any schema matching after the insertion of data as the guid-     to enter the key “inhabitants”, after typing “inhab”, the user
ance of the user significantly contributes to the alignment of   is informed that a similar key “number of inhabitants” is al-
the entered structured information to an already commonly        ready present. As accepting the recommendation is faster
used schema. This aligned “schema” is different from a fixed     than typing the word “inhabitants”, the user is encouraged
schema of a relational database to which information has         to accept the recommendation.
to be adapted to. The user is free to extend or modify the
recommended structure. Therefore, alignment can never be
done for the totality of data.
                                                                 4.    PROPOSED SOLUTION
                                                                 Our proposed solution is mostly concerned with the compu-
The self-adapting schema is implicit and dynamically cal-
                                                                 tation of recommendations and the underlying storage mech-
culated based on already stored information. It consists of
                                                                 anisms, which are explained in the following section.
keys which are used together by the majority of similar col-
lections. Hence, newly entered information can influence the
schema and can result in a change of the schema. For exam-       4.1    Recommendations
ple adding the key “numberOfStudents” to many collections        The computation of the recommendations can be achieved
about cities can lead to a recommendation of this key to         by using association rules [1], which can easily be adapted for
further users.                                                   the mining of relations between triples: formally, a frequent
As information systems contain information of various struc-     item set can be defined as a set of items I = {i1 , i2 , ...in },
tures and types, the Snoopy concept automatically computes       where a transaction T within the database consists of arbi-
a suitable number of schemata according to the stored infor-     trary many items of I. In the case of schema alignment, the
mation - without any predefined settings. Schema alignment       item set I consists of all properties pi occurring in the sys-
in the case of the Snoopy concept can be achieved by taking      tem and a transaction comprises all properties pij occurring
the following measures, which are all based on recommen-         together within the subject sj . The set of all transactions
dations of values and keys.                                      forms the transaction database T = {T1 , T2 , ...Tm }. Based
on this transaction database, the goal is to calculate asso-        ommendation-based version motivated the test users to in-
ciation rules which are implications X → Y , where X is a           sert 24% more key-value pairs.
property and Y is another property which co-occurs with X
on the same subject.                                                5.   RELATED WORK
This is the basis of the calculation of recommendation can-
                                                                    The Snoopy concept can benefit from many research areas
didates, which are then further examined in order to pro-
                                                                    which cover parts of the Snoopy concept. Lots of research is
vide the best possible recommendations to the user. Based
                                                                    done on how to introduce structure in Wikipedia or gener-
on this (probably large) set of association rules, the final
                                                                    ally in information systems. All of the following approaches
recommendations are computed taking the following impact
                                                                    try to enhance information after the insertion process, do
factors into acount: (i) support of a rule, (ii) novelty of prop-
                                                                    not consult the user and do not benefit from the user’s ex-
erties, (iii) recent popularity rise of properties and (iv) some
                                                                    tensive knowledge during the insertion process.
random choice to give new properties a chance to be used
                                                                    The DBPedia project [2] extracts structured information
and therefore rise in popularity.
                                                                    from Wikipedia infoboxes and stores it as RDF-triples. An-
Fundamentally, the computation process of the recommen-
                                                                    other approach, YAGO [11] is also based on Wikipedia data
dations has to fulfill the following requirements: (i) ability
                                                                    and tries to semantically enhance this data. The Kylin/KOG
to cope with huge amounts of data, (ii) compute recommen-
                                                                    System [14] automatically verifies semantically enhanced data
dation candidates based on mining association rules, (iii)
                                                                    by explicit community feedback. Semantic Wikipedia [12]
evaluation of the candidate sets according to the previously
                                                                    extends MediaWiki by adding typed links between Wikipedia
mentioned factors and (iv) real-time calculation of recom-
                                                                    entries as well as attributes and types. However, the user
mendations.
                                                                    is not guided in the process of specifying this additional se-
                                                                    mantics. Cimple/DBLife [5] presents an approach to build
4.2    RDF Storage System                                           a structured community portal from already existing com-
The underlying storage system is a crucial part of the appli-       munity sources. ExDB [4] extracts information from the
cation as it holds all Snoopy data. Basically, the main task        web, adds structure and is then able to query this data in a
is to store many key-value pairs belonging to a certain sub-        structured manner.
ject (collection). This pattern suggests storing this data as
triples using the W3C Resource Description Framework [8],           6.   CONCLUSION AND FUTURE WORK
which has proven to be a modelling language very well suited        We presented the Snoopy-concept, a novel approach for cre-
for the description of any (real-world) resource [9]. Basically,    ating and maintaining a common schema in semistructured
every resource can be stored as a triple consisting of a sub-       information systems by using recommendations. These rec-
ject (same as the collection name in the Snoopy concept), a         ommendations guide the user during the insertion process in
predicate (the respective key) and an object (the respective        order to align the information to a commonly used schema
value). The sentence “Innsbruck’s number of inhabitants             and vocabulary. The implicit schema is based on all stored
is 117,916” can therefore be stored as the following triple:        information in the system and adapts itself to the struc-
<Innsbruck>, <numberOfInhabitants>, <117,916>. The                  ture of this data. The recommendations are based on as-
main advantage of RDF is the possibility to easily model            sociation rules and were implemented in a prototype. First
knowledge in a machine-understandable way. SPARQL, the              results showed that recommendations ensure homogeneous
query language for RDF, can be used to query all informa-           schemata and vocabulary in a multi-user semistructured in-
tion.                                                               formation system and therefore improve the structured data
Basically, the underlying RDF storage system has to fulfill         access capabilities. Further work will include ranking of
the following requirements in order to ensure a scalable, ef-       rules enabling the system to compute top-n recommenda-
ficient and highly performant system: (i) SPARQL support            tions aiming at a higher precision of recommendations. Fur-
to query all stored data in a fast and efficient way, (ii) op-      thermore, capabilities to query the stored information effi-
timized RDF store: triples have to be stored in the most            ciently have to be introduced and the storage mechanisms
efficient and compact way as the representation of knowl-           have to be improved.
edge as triples leads to a very large number of triples, (iii)
an interface optimized for data mining tasks, as SPAQRL
is not best suited for data mining (iv) fast data mining:
                                                                    7.   REFERENCES
                                                                     [1] R. Agrawal, T. Imieliński, and A. Swami. Mining
mining structure and content within the triples is the most
                                                                         association rules between sets of items in large
time-critical part of the application as all recommendations
                                                                         databases. In SIGMOD ’93: Proceedings of the 1993
are based on it.
                                                                         ACM SIGMOD international conference on
                                                                         Management of data, pages 207–216, New York, NY,
4.3    Preliminary Results                                               USA, 1993. ACM.
SnoopyDB, a first prototype of the Snoopy concept has al-            [2] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann,
ready been developed and evaluated in [7]. The evaluation                R. Cyganiak, and Z. Ives. Dbpedia: A nucleus for a
showed that recommendations in SnoopyDB guide the user                   web of open data. Lecture Notes in Computer Science,
to common schemata and reduce the number of distinct                     4825:722, 2007.
properties. A system without any recommendations lead                [3] P. Boulain, N. Shadbolt, and N. Gibbins.
to 229 distinct properties on 50 test collections, whereas               Hyperstructure maintenance costs in large-scale wikis.
SnoopyDB including recommendation and guidance was able                  In Peter Dolog, Markus Kroetzsch, Sebastian
to store the same information by using only 154 distinct                 Schaffert, and Denny Vrandecic, editors, SWKM,
properties. Besides the 33% smaller property set, the rec-               volume 356 of CEUR Workshop Proceedings.
     CEUR-WS.org, 2008.
 [4] M. J. Cafarella, C. Re, D. Suciu, and O. Etzioni.
     Structured querying of web text data: A technical
     challenge. In CIDR, pages 225–234, 2007.
 [5] P. DeRose, W. Shen, F. Chen, A. Doan, and
     R. Ramakrishnan. Building structured web
     community portals: a top-down, compositional, and
     incremental approach. In VLDB ’07: Proceedings of
     the 33rd international conference on Very large data
     bases, pages 399–410. VLDB Endowment, 2007.
 [6] G. W. Furnas, T. K. Landauer, L. M. Gomez, and
     S. T. Dumais. The vocabulary problem in
     human-system communication. Communications of the
     ACM, 30(11):971, 1987.
 [7] W. Gassler, E. Zangerle, M. Tschuggnall, and
     G. Specht. Snoopydb: Narrowing the gap between
     structured and unstructured information using
     recommendations. In Proceedings of the 21th ACM
     Conference on Hypertext and Hypermedia. ACM, 2010.
 [8] G. Klyne and J. J. Carroll. Resource description
     framework (rdf): Concepts and abstract syntax. W3c
     recommendation, World Wide Web Consortium,
     February 2004. http://www.w3.org/TR/2004/REC-
     rdf-concepts-20040210/.
 [9] B. McBride. The resource description framework (rdf)
     and its vocabulary description language rdfs.
     Handbook on Ontologies, pages 51–66, 2004.
[10] P. Shvaiko and J. Euzenat. A survey of schema-based
     matching approaches. Journal on Data Semantics,
     4:146–171, 2005.
[11] F. Suchanek, G. Kasneci, and G. Weikum. Yago: A
     large ontology from wikipedia and wordnet. Web
     Semantics: Science, Services and Agents on the World
     Wide Web, 6(3):203 – 217, 2008. World Wide Web
     Conference 2007, Semantic Web Track.
[12] M. Voelkel, M. Kroetzsch, D. Vrandecic, H. Haller,
     and R. Studer. Semantic wikipedia. In WWW ’06:
     Proceedings of the 15th international conference on
     World Wide Web, pages 585–594, New York, NY,
     USA, 2006. ACM.
[13] G. Weikum, G. Kasneci, M. Ramanath, and
     F. Suchanek. Database and information-retrieval
     methods for knowledge discovery. Commun. ACM,
     52(4):56–64, 2009.
[14] D.S. Weld, F. Wu, E. Adar, S. Amershi, J. Fogarty,
     R. Hoffmann, K. Patel, and M. Skinner. Intelligence in
     wikipedia. In Twenty-Third Conference on Artificial
     Intelligence (AAAI’08), 2008.