Recommendation-Based Evolvement of Dynamic Schemata in Semistructured Information Systems Wolfgang Gassler Eva Zangerle Databases and Information Systems Databases and Information Systems Institute of Computer Science Institute of Computer Science University of Innsbruck, Austria University of Innsbruck, Austria wolfgang.gassler@uibk.ac.at eva.zangerle@uibk.ac.at ABSTRACT systems, where information can simply be added as text. Community-based collaborative information systems provide However, such structureless storage prevents every possibil- the flexibility of storing information without having to ad- ity to provide structured access or search facilities on the here to any predefined, rigid schema. However, the stored stored information, as for example relational databases do. knowledge lacks common structure, which is crucial in terms Consider the example of Wikipedia, which does not provide of data access and search capability. We present a novel con- structured search capabilities at all. The only way to search cept for semistructured information systems, which features through this unstructured information is to perform fulltext a dynamic and self-learning schema system and provides search, which is incapable of answering precise queries like the users with recommendations when entering information. “Which Austrian cities have between 10,000 and 20,000 in- The recommendations ensure the creation and maintenance habitants?”. For this reason information repositories need of a common, homogeneous schema while at the same time to support both structured and unstructured information to not restricting the user’s flexibility to enter any kind of in- fully exploit the advantages of both worlds, as also pointed formation. out by Weikum et al. [13]. Our paper presents the Snoopy concept which combines the structured and unstructured ap- 1. INTRODUCTION proach and takes advantages of both concepts. When storing information, there are two common ways of The rest of the paper is organized as follows. In Section doing so: either one uses a (relational) database to store in- 2 the basic concept of our approach is described. Section formation in a structured way or one stores information in 3 outlines the measures taken for the creation of a common a mostly unstructured manner. schema. Our proposed solution is described in Section 4 and The first approach is very well suited for applications which related work is outlined in Section 5. Section 6 summarizes have to store strictly structured information. Considering the paper and points out open research issues. the example of bank accounts, all information is structured and can easily be matched into a schema for the storage of 2. THE SNOOPY CONCEPT e.g. bank accounts and their owners. A big disadvantage is The Snoopy concept offers the same flexibility as wiki sys- that a change within the schema can be very time-consuming tems but at the same time provides the possibility to struc- and complex, as it has to be done manually and the already ture information like (relational) databases. This is achieved stored information has to be adapted to the new structure. by using a self-adapting and self-learning schema system and Therefore, the end-user is fixed to a given schema and can- recommendations, which support the user during the inser- not insert additional information not matching the given tion process. schema. For flexible information systems, such a restricting In the Snoopy concept, information about a certain subject approach is not very well suited, as it can result in a big is stored as a collection, similar to a wiki page. A collection loss of information because of the fact that the user cannot consists of an arbitrary number of key-value pairs, which can insert all information he might want to. be specified by the user without any restrictions. The fol- On the other hand, the second, unstructured way of storage lowing example shows how information about the city “Inns- allows the user to store information in any arbitrary struc- bruck” can be stored in a simple and understandable way: ture or format. This approach has the advantage that the user does not have to match the data into some predefined Collection Name: Innsbruck schema. A popular example of such flexible systems are wiki Country:Austria State:Tyrol numberOfInhabitants:117,916 PanoramaImage:pano_innsbruck_2010.jpg The key-value pairs are similar to relational database columns (keys) and its rows (values). Despite the fact that infor- mation is still plain text, it still features structure. Such GvD Workshop ’10, 25.-28.05.2010, Bad Helmstedt, Germany. semistructured data, where structure is present but does not comply to a unified schema, can be used for structured access (e.g. queries like “All entries containing the prop- 3.1 Suggestion of Keys and Structure erty Country:Austria”). One of the main challenges of such The suggestion of keys and structure is the most important semistructured data is the proliferation of keys and values. feature within the Snoopy concept. Considering a user enter- Furnas et. al. [6] showed that two people would choose ing information about cities (consisting of key-value pairs) the same term for a certain object (within a restricted do- into the system, there might be additional useful keys that main) with a probability of only 20%. Consider the key the user might not think of at first hand. This is where “numberOfInhabitants”. It can be assumed that a multiuser the recommendation engine comes into play. If a user wants system would produce many synonyms of this word, eg. “in- to add the key ”inhabitants”, the system computes that 90 habitants”, “citizens” or “numberOfInhabitants”. This fact percent of the users who added “inhabitants” as a key, also would imply that the stored information is no longer search- added “mayor”. The recommendation feature is based on able in terms of unified access, as the keys are not aligned a data mining process within the already existing collec- to a common schema. Wikipedia also has to cope with tions, which calculates keys that frequently occur together. this problem of proliferation of structures and tries to solve Having done these computations, the system suggests this it by a very big and committed community, which creates additional key to the user who can then accept this key and templates (schemata) and manually unifies already present enter a respective value. Assume, the user wants to store in- knowledge. Boulain et al. [3] showed that only 35% of all formation about the city “Innsbruck”. Based on the already edits in Wikipedia are changes of content. All other edits inserted keys “country”, “state” and “numberOfInhabitants”, are related to structure and do not concern the content it- the system recommends adding the keys “mayor” and “ZIP”. self. Recommendations are simply further form fields displayed The Snoopy concept takes a different approach and pushes to the user that give him the opportunity to enter appro- part of these structural adaptions to the user. The user priate values for the suggested keys. The recommendations knows more about the collection he enters than any process, encourage the user to insert more information than originally machine or community, which try to enhance the informa- intended and thereby increase the amount of information in tion after the author inserted it. Therefore, the key idea the system. of the Snoopy concept is to “snoop“ as much information as possible from the user during the insertion process. Further- more, the user is guided by an adapting, self-learning guid- ance engine which provides recommendations to the user 3.2 Key Completion When entering new key-value pairs, the user gets assistance based on key-value pairs entered earlier. These recommen- by an auto-completion system. This feature suggests keys dations support the user in aligning the information he in- that are already stored in the information system, are sim- tends to enter to a commonly used schema. ilarly structured and therefore semantically related to the It is important to note that all these recommendations are currently entered key. If the user types e.g. “number”, the just suggestions and the user is not forced to use these rec- system automatically provides the user with the possibility ommendations in any way. Therefore, the user is able to to choose “number of inhabitants” or “number of districts”, enter information without any restrictions. which are keys that are already existent in the database. The user still has the opportunity to provide a new key, e.g. 3. SCHEMA ALIGNMENT “number of universities” by ignoring the recommendations The process of integrating and transforming two or more and continue typing. This recommendation contributes to (database) schemata into a common schema has proven to the creation of a common schema and data basis and is be a very complex task [10]. The Snoopy concept avoids demonstrated by the following example. If the user wants any schema matching after the insertion of data as the guid- to enter the key “inhabitants”, after typing “inhab”, the user ance of the user significantly contributes to the alignment of is informed that a similar key “number of inhabitants” is al- the entered structured information to an already commonly ready present. As accepting the recommendation is faster used schema. This aligned “schema” is different from a fixed than typing the word “inhabitants”, the user is encouraged schema of a relational database to which information has to accept the recommendation. to be adapted to. The user is free to extend or modify the recommended structure. Therefore, alignment can never be done for the totality of data. 4. PROPOSED SOLUTION Our proposed solution is mostly concerned with the compu- The self-adapting schema is implicit and dynamically cal- tation of recommendations and the underlying storage mech- culated based on already stored information. It consists of anisms, which are explained in the following section. keys which are used together by the majority of similar col- lections. Hence, newly entered information can influence the schema and can result in a change of the schema. For exam- 4.1 Recommendations ple adding the key “numberOfStudents” to many collections The computation of the recommendations can be achieved about cities can lead to a recommendation of this key to by using association rules [1], which can easily be adapted for further users. the mining of relations between triples: formally, a frequent As information systems contain information of various struc- item set can be defined as a set of items I = {i1 , i2 , ...in }, tures and types, the Snoopy concept automatically computes where a transaction T within the database consists of arbi- a suitable number of schemata according to the stored infor- trary many items of I. In the case of schema alignment, the mation - without any predefined settings. Schema alignment item set I consists of all properties pi occurring in the sys- in the case of the Snoopy concept can be achieved by taking tem and a transaction comprises all properties pij occurring the following measures, which are all based on recommen- together within the subject sj . The set of all transactions dations of values and keys. forms the transaction database T = {T1 , T2 , ...Tm }. Based on this transaction database, the goal is to calculate asso- ommendation-based version motivated the test users to in- ciation rules which are implications X → Y , where X is a sert 24% more key-value pairs. property and Y is another property which co-occurs with X on the same subject. 5. RELATED WORK This is the basis of the calculation of recommendation can- The Snoopy concept can benefit from many research areas didates, which are then further examined in order to pro- which cover parts of the Snoopy concept. Lots of research is vide the best possible recommendations to the user. Based done on how to introduce structure in Wikipedia or gener- on this (probably large) set of association rules, the final ally in information systems. All of the following approaches recommendations are computed taking the following impact try to enhance information after the insertion process, do factors into acount: (i) support of a rule, (ii) novelty of prop- not consult the user and do not benefit from the user’s ex- erties, (iii) recent popularity rise of properties and (iv) some tensive knowledge during the insertion process. random choice to give new properties a chance to be used The DBPedia project [2] extracts structured information and therefore rise in popularity. from Wikipedia infoboxes and stores it as RDF-triples. An- Fundamentally, the computation process of the recommen- other approach, YAGO [11] is also based on Wikipedia data dations has to fulfill the following requirements: (i) ability and tries to semantically enhance this data. The Kylin/KOG to cope with huge amounts of data, (ii) compute recommen- System [14] automatically verifies semantically enhanced data dation candidates based on mining association rules, (iii) by explicit community feedback. Semantic Wikipedia [12] evaluation of the candidate sets according to the previously extends MediaWiki by adding typed links between Wikipedia mentioned factors and (iv) real-time calculation of recom- entries as well as attributes and types. However, the user mendations. is not guided in the process of specifying this additional se- mantics. Cimple/DBLife [5] presents an approach to build 4.2 RDF Storage System a structured community portal from already existing com- The underlying storage system is a crucial part of the appli- munity sources. ExDB [4] extracts information from the cation as it holds all Snoopy data. Basically, the main task web, adds structure and is then able to query this data in a is to store many key-value pairs belonging to a certain sub- structured manner. ject (collection). This pattern suggests storing this data as triples using the W3C Resource Description Framework [8], 6. CONCLUSION AND FUTURE WORK which has proven to be a modelling language very well suited We presented the Snoopy-concept, a novel approach for cre- for the description of any (real-world) resource [9]. Basically, ating and maintaining a common schema in semistructured every resource can be stored as a triple consisting of a sub- information systems by using recommendations. These rec- ject (same as the collection name in the Snoopy concept), a ommendations guide the user during the insertion process in predicate (the respective key) and an object (the respective order to align the information to a commonly used schema value). The sentence “Innsbruck’s number of inhabitants and vocabulary. The implicit schema is based on all stored is 117,916” can therefore be stored as the following triple: information in the system and adapts itself to the struc- , , <117,916>. The ture of this data. The recommendations are based on as- main advantage of RDF is the possibility to easily model sociation rules and were implemented in a prototype. First knowledge in a machine-understandable way. SPARQL, the results showed that recommendations ensure homogeneous query language for RDF, can be used to query all informa- schemata and vocabulary in a multi-user semistructured in- tion. formation system and therefore improve the structured data Basically, the underlying RDF storage system has to fulfill access capabilities. Further work will include ranking of the following requirements in order to ensure a scalable, ef- rules enabling the system to compute top-n recommenda- ficient and highly performant system: (i) SPARQL support tions aiming at a higher precision of recommendations. Fur- to query all stored data in a fast and efficient way, (ii) op- thermore, capabilities to query the stored information effi- timized RDF store: triples have to be stored in the most ciently have to be introduced and the storage mechanisms efficient and compact way as the representation of knowl- have to be improved. edge as triples leads to a very large number of triples, (iii) an interface optimized for data mining tasks, as SPAQRL is not best suited for data mining (iv) fast data mining: 7. REFERENCES [1] R. Agrawal, T. Imieliński, and A. Swami. Mining mining structure and content within the triples is the most association rules between sets of items in large time-critical part of the application as all recommendations databases. In SIGMOD ’93: Proceedings of the 1993 are based on it. ACM SIGMOD international conference on Management of data, pages 207–216, New York, NY, 4.3 Preliminary Results USA, 1993. ACM. SnoopyDB, a first prototype of the Snoopy concept has al- [2] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, ready been developed and evaluated in [7]. The evaluation R. Cyganiak, and Z. Ives. Dbpedia: A nucleus for a showed that recommendations in SnoopyDB guide the user web of open data. Lecture Notes in Computer Science, to common schemata and reduce the number of distinct 4825:722, 2007. properties. A system without any recommendations lead [3] P. Boulain, N. Shadbolt, and N. Gibbins. to 229 distinct properties on 50 test collections, whereas Hyperstructure maintenance costs in large-scale wikis. SnoopyDB including recommendation and guidance was able In Peter Dolog, Markus Kroetzsch, Sebastian to store the same information by using only 154 distinct Schaffert, and Denny Vrandecic, editors, SWKM, properties. Besides the 33% smaller property set, the rec- volume 356 of CEUR Workshop Proceedings. CEUR-WS.org, 2008. [4] M. J. Cafarella, C. Re, D. Suciu, and O. Etzioni. Structured querying of web text data: A technical challenge. In CIDR, pages 225–234, 2007. [5] P. DeRose, W. Shen, F. Chen, A. Doan, and R. Ramakrishnan. Building structured web community portals: a top-down, compositional, and incremental approach. In VLDB ’07: Proceedings of the 33rd international conference on Very large data bases, pages 399–410. VLDB Endowment, 2007. [6] G. W. Furnas, T. K. Landauer, L. M. Gomez, and S. T. Dumais. The vocabulary problem in human-system communication. Communications of the ACM, 30(11):971, 1987. [7] W. Gassler, E. Zangerle, M. Tschuggnall, and G. Specht. Snoopydb: Narrowing the gap between structured and unstructured information using recommendations. In Proceedings of the 21th ACM Conference on Hypertext and Hypermedia. ACM, 2010. [8] G. Klyne and J. J. Carroll. Resource description framework (rdf): Concepts and abstract syntax. W3c recommendation, World Wide Web Consortium, February 2004. http://www.w3.org/TR/2004/REC- rdf-concepts-20040210/. [9] B. McBride. The resource description framework (rdf) and its vocabulary description language rdfs. Handbook on Ontologies, pages 51–66, 2004. [10] P. Shvaiko and J. Euzenat. A survey of schema-based matching approaches. Journal on Data Semantics, 4:146–171, 2005. [11] F. Suchanek, G. Kasneci, and G. Weikum. Yago: A large ontology from wikipedia and wordnet. Web Semantics: Science, Services and Agents on the World Wide Web, 6(3):203 – 217, 2008. World Wide Web Conference 2007, Semantic Web Track. [12] M. Voelkel, M. Kroetzsch, D. Vrandecic, H. Haller, and R. Studer. Semantic wikipedia. In WWW ’06: Proceedings of the 15th international conference on World Wide Web, pages 585–594, New York, NY, USA, 2006. ACM. [13] G. Weikum, G. Kasneci, M. Ramanath, and F. Suchanek. Database and information-retrieval methods for knowledge discovery. Commun. ACM, 52(4):56–64, 2009. [14] D.S. Weld, F. Wu, E. Adar, S. Amershi, J. Fogarty, R. Hoffmann, K. Patel, and M. Skinner. Intelligence in wikipedia. In Twenty-Third Conference on Artificial Intelligence (AAAI’08), 2008.