Ontology Based Queries – Investigating a Natural Language Interface

Ielka van der Sluis, Computer Science, Trinity College Dublin, vdsluis@cs.tcd.ie
Feikje Hielkema, Computing Science, University of Aberdeen, f.hielkema@abdn.ac.uk
Chris Mellish, Computing Science, University of Aberdeen, c.mellish@abdn.ac.uk
Gavin Doherty, Computer Science, Trinity College Dublin, gavin.doherty@cs.tcd.ie

Workshop on Visual Interfaces to the Social and Semantic Web (VISSW2010), IUI2010, Feb 7, 2010, Hong Kong, China. Copyright is held by the author/owner(s).

ABSTRACT
In this paper we look at what may be learned from a comparative study examining non-technical users with a background in social science browsing and querying metadata. Four query tasks were carried out with a natural language interface and with an interface that uses a web paradigm with hyperlinks. While it can be difficult to attribute differences in performance to specific design features, a qualitative analysis of the user behavior provides some insight into the task and into problematic aspects of existing interfaces. In general it was found that casual subjects have difficulties recognizing typical ontology-based concepts like objects, attributes and values.

Author Keywords
Querying and browsing, metadata, evaluation, natural-language interfaces, web-based interfaces.

ACM Classification Keywords
H5.m. Information interfaces and presentation (e.g., HCI): Miscellaneous.

INTRODUCTION
The advent of Semantic Web technologies [2] has generated a number of challenges relating to the use of technology by domain experts and researchers in areas such as social science [3]. Among the questions to be addressed are the extent to which these researchers are comfortable with the Web as a framework for research practice and in particular collaboration; whether ontologies are appropriate (and acceptable) to this community as a way of representing concepts to facilitate their research activities; the utility (or otherwise) of existing metadata frameworks in use by the social sciences; and how best to integrate e-science tools and methods into existing working practices.

A key aspect is support for the creation of metadata and for access to resources annotated with semantic metadata. This semantic metadata is captured with RDF (Resource Description Framework; www.w3.org/RDF/): statements of the type Property(subject, object) whose semantics are defined by OWL ontologies (www.w3.org/TR/owl-features/). These ontologies consist of classes (e.g. City, State) and properties (hasCapital, Name). The RDF statements describe instances of these classes (e.g. 'The State of New York, whose capital is New York'). RDF is usually serialized as XML and is potentially difficult to understand for most non-technical users. This paper focuses on browsing RDF and the task of constructing complex queries.
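To make this kind of data concrete, the short sketch below builds the statements of the example above with the rdflib library in Python and prints them in Turtle syntax. The namespace and the exact class and property identifiers (geo:State, geo:hasCapital, geo:name) are our own assumptions for illustration, not the vocabulary of any particular ontology discussed in this paper.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF

    GEO = Namespace("http://example.org/geography#")
    g = Graph()

    # Instances of the ontology classes State and City ...
    g.add((GEO.NewYorkState, RDF.type, GEO.State))
    g.add((GEO.NewYorkCity, RDF.type, GEO.City))

    # ... and property statements of the form Property(subject, object),
    # here encoding "The State of New York, whose capital is New York".
    g.add((GEO.NewYorkState, GEO.name, Literal("New York")))
    g.add((GEO.NewYorkState, GEO.hasCapital, GEO.NewYorkCity))
    g.add((GEO.NewYorkCity, GEO.name, Literal("New York")))

    print(g.serialize(format="turtle"))

Even in this toy form, the distance between such triples and the way a social scientist would naturally phrase the same information is apparent, which motivates the interfaces discussed below.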
Support for these activities for casual, non-technical users is an important challenge for the entire Semantic Web research community. As most members of the social science community are unfamiliar with complex formalisms such as RDF, they are a representative group of non-technical users of the Semantic Web. Non-technical users may benefit from what the Semantic Web offers, but may be deterred by its complexity and by the need to learn graphical representations or controlled languages. While well-designed graphical tools can provide advantages, tools that use graphical representations (e.g. CREAM [6] or SHAKEN [13]) may be difficult to interpret for users unused to complex graphical presentations or ontologies. For instance, Petre [9] argues that graphical readership is an acquired skill, and describes experiments on reading comprehension of graphical and textual representations. These showed that for some tasks people process graphical representations significantly more slowly than text, with novices suffering from mis-readings and confusion.

Kaufmann and Bernstein [7] demonstrated, via an experiment that compared four different query interfaces for the Semantic Web, that naive users preferred the interface that used full natural language sentences (as opposed to keywords, partial sentences and a graphical interface). Hence, it is worth considering whether a natural language representation of metadata could serve as a good solution for novices to the Semantic Web (such as many social scientists). To investigate this possibility a tool named LIBER was developed, which uses natural language to provide access to metadata. This paper presents a comparative study that was set up to assess and explore the querying and browsing interface of LIBER.

INTERFACES FOR QUERY CONSTRUCTION
LIBER (Language Interface for Browsing and Editing RDF) was developed to provide access to descriptions of social science resources (e.g. papers, statistical datasets, interview transcripts) held in a data repository. The interface (driven by a number of ontologies) enables users to find resources in the repository through querying and browsing of metadata, and to deposit new resources with a metadata description. Each component of the LIBER interface uses natural language generation to present information to the user through the WYSIWYM (What You See Is What You Meant) approach [12]. WYSIWYM has been used by a number of other projects, such as MILE [10] and CLEF [5]. The positive results from these projects [4, 11] suggest that WYSIWYM could be a suitable approach for constructing and accessing metadata.

With WYSIWYM a system generates a feedback text for the user that is based on a semantic representation. The representation includes generic phrases, or 'anchors', which correspond to objects in the description. Each object has a pop-up menu which lists the properties it can have; to add information, the user selects a property and provides an appropriate value. In LIBER, properties of objects are used in queries, which may also include boolean operators ('and', 'or', 'not') and optional elements. Results are presented as the query is constructed.
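The sketch below is purely illustrative of this interaction style and is not LIBER's implementation: a description is a collection of objects, the class of each object determines the property menu offered in its pop-up, and every property the user fills in extends the description, from which a feedback text with anchors is regenerated. The toy classes, property names and wording are assumptions made for the example.

    # Toy ontology fragment: which properties each class offers in its pop-up menu.
    PROPERTY_MENU = {
        "State": ["has city", "has capital", "area"],
        "City": ["name", "population"],
    }

    def feedback_text(obj):
        """Regenerate the WYSIWYM-style feedback text for a single object."""
        filled = "; ".join(f"{p} {v}" for p, v in obj["properties"].items())
        anchors = " ".join(f"[{p}]" for p in PROPERTY_MENU[obj["class"]]
                           if p not in obj["properties"])
        described = filled if filled else "has no properties yet"
        return f"A {obj['class'].lower()} that {described}. Anchors: {anchors}"

    state = {"class": "State", "properties": {}}
    print(feedback_text(state))
    # -> A state that has no properties yet. Anchors: [has city] [has capital] [area]

    # The user opens the state's pop-up menu, picks 'has city' and supplies a value;
    # the feedback text is then regenerated with one anchor fewer.
    state["properties"]["has city"] = "a city named 'Springfield'"
    print(feedback_text(state))
    # -> A state that has city a city named 'Springfield'. Anchors: [has capital] [area]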
As many other querying tools have been developed in the Semantic Web community, we could compare LIBER's querying and browsing modules to existing systems. The question of which approach (natural language, graphics, faceted browsing) produces more usable interfaces is far from settled. We were therefore interested in comparing the natural language interface of LIBER to one that uses a different approach.

Kaufmann and Bernstein [7] describe an evaluation study in which they compared four querying interfaces: a graphical interface, a controlled language interface, a natural language interface that uses confirmation dialogues for disambiguation (Querix), and a natural language interface that identifies relevant key phrases in the search term. The study showed that all natural language interfaces outperformed the graphical interface, and that subjects preferred Querix and achieved the best results with it. We decided to use a similar set-up and materials for our evaluation, so that we could adopt a simple ontology and have a reference point for the evaluation results.

We compare the LIBER interface with Longwell [8], a web-based, RDF-powered faceted browser developed by the SIMILE project at MIT. Longwell takes an RDF dataset as input and creates a website in which the data can be browsed and filtered using classes, properties and keywords. The user browses through the dataset by clicking hyperlinks (which correspond to classes, properties and values) and by keyword searching; each click and keyword search adds (or removes) a filter. Longwell thus uses the web paradigm to present information rather than natural language, and we were interested to see which would prove more effective and/or popular.
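To make the contrast with LIBER's query construction concrete, the sketch below imitates this click-to-filter behaviour over a handful of invented records; the field names and values are assumptions for illustration and are not taken from the SIMILE dataset.

    # Toy dataset: a few resources with the facets they expose.
    items = [
        {"type": "city", "name": "Springfield", "cityOf": "Illinois"},
        {"type": "city", "name": "Springfield", "cityOf": "Missouri"},
        {"type": "city", "name": "Chicago", "cityOf": "Illinois"},
        {"type": "state", "name": "Illinois"},
    ]
    filters = {}

    def click(facet, value):
        """Simulate clicking a hyperlink: add a filter, or remove it if already active."""
        if filters.get(facet) == value:
            del filters[facet]
        else:
            filters[facet] = value

    def visible():
        """Return the items that satisfy every active filter."""
        return [item for item in items
                if all(item.get(f) == v for f, v in filters.items())]

    click("type", "city")
    click("name", "Springfield")
    print([item["cityOf"] for item in visible()])
    # -> ['Illinois', 'Missouri']: the states that contain a city called Springfield

Each click merely narrows or widens the visible set, so a wrong step is easy to undo by clicking again; this property becomes relevant in the discussion of errors later on.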
Following Kaufmann and Bernstein's study, it might be expected that users would be more accurate and would complete tasks more quickly with the natural language tool LIBER than with the faceted browser Longwell. Realistically, we knew this inference might not apply, as that study compared the natural language based interface to a graphical interface, while Longwell is a faceted browser; moreover, Longwell was developed by a company and has a user community, while Kaufmann and Bernstein produced their own graphical interface, so we cannot be sure that its deficiencies reflect those of such interfaces in general.

EXPERIMENTAL STUDY
Before describing the experiment, we note that there can be problems with interpreting comparison studies. Importantly, it can be difficult to attribute differences in performance to specific design features, such as the use of a natural language interface, as such choices necessitate many other differences in the design. For example, a badly executed natural language design might be outperformed by another interface, whereas a well-executed natural language design might perform better.

Methodology
Twenty students and researchers with backgrounds in various social science related disciplines participated, one of whom did not finish the experiment and was excluded (N=19). None had previous experience with LIBER or Longwell, and only two had used an ontology before. Subjects were asked to supply some background information, were then handed a one-page description of one of the tools, and were asked to follow its instructions to become acquainted with the tool's operation. They then received four questions to answer, and were asked to find each answer using the tool without relying on their own general knowledge about the world. When finished, subjects were asked to fill out a SUS questionnaire [1], a standardized usability test containing ten standardized questions (e.g. 'I felt very confident using the system') which are rated on a 5-point Likert scale. This procedure was repeated for the other tool. Afterwards, subjects were asked to complete a questionnaire in which the two tools were compared directly. On average subjects needed about 45 minutes to finish the task. Both the order of the tools and the order of the questions were varied per subject. For both tools we recorded the answers the subjects provided and the time it took to answer a question, and we made video captures of the screen for qualitative analysis.

To drive both tools, we used a simple ontology that models the geography of the USA, which was developed for Kaufmann and Bernstein's study and is available online (http://www.ifi.uzh.ch/ddis/research/semweb/talking-to-the-semantic-web/owltest-data/). It is not faithful to the real world situation (Alaska appears to have the smallest state area, for example), but this made it easier to prevent subjects from relying on their own knowledge and thus biasing the results. We used two sets of questions, which were based on those used by Kaufmann and Bernstein in their study. One of the two sets is exemplified below:

1. What is the area of Alaska?
2. How many lakes are there in Florida?
3. Which states contain a city called Springfield?
4. Which rivers run through the state that contains the largest city in the US?
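Question 3 above is also the query walked through in the screenshots below. To make clear what such a question amounts to at the RDF level, the sketch below formulates it directly as a SPARQL query with rdflib; the prefix, the file name and the property names (geo:hasCity, geo:name) are assumptions for illustration, and both LIBER and Longwell are intended to spare users exactly this formulation.

    from rdflib import Graph

    g = Graph()
    # "geography.owl" stands in for a local copy of the dataset linked above;
    # the file name is hypothetical.
    g.parse("geography.owl", format="xml")

    SPRINGFIELD_STATES = """
    PREFIX geo: <http://example.org/geography#>
    SELECT ?state WHERE {
        ?state a geo:State ;
               geo:hasCity ?city .
        ?city geo:name "Springfield" .
    }
    """

    for row in g.query(SPRINGFIELD_STATES):
        print(row.state)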
Figures 1, 2 and 3 show screenshots of LIBER, and Figures 4, 5 and 6 show screenshots of Longwell, in which the user is searching for the answer to question 3, 'Which states contain a city called Springfield?'. Both interfaces support multiple strategies for finding this answer; the screenshots portray merely one of them. In LIBER the user has created a search term that provides the answer without further browsing, by searching for all states that have the property 'hasCity' with as value a city named 'Springfield'; the answer appears when the user presses 'search'. In Longwell, the user has first added a filter 'city' to select all cities, then another filter on the name (Springfield), and finally opened the facet 'cityOf' on the right-hand side to view the four states.

Figure 1. LIBER: The user chooses the property 'Has city'.
Figure 2. LIBER: The user specifies the name of the city.
Figure 3. LIBER: Search results for question 3.
Figure 4. Longwell: The user clicks 'city'.
Figure 5. Longwell: The user clicks 'Springfield' in the 'Name' filter.
Figure 6. Longwell: The user opens the facet 'cityOf' to view the results.

Results: Comparative Analysis
Two-tailed paired t-tests show that the Longwell interface outperformed the LIBER interface in terms of completion time (LIBER: mean 191.6 sec, stdv 57.1 sec; Longwell: mean 96.5 sec, stdv 30.0 sec; p = 0.000) and SUS score (LIBER: mean 37.63, stdv 18.11; Longwell: mean 61.16, stdv 19.65; p = 0.000). Subjects failed to complete tasks more often in LIBER (missing answers: LIBER, mean .47, stdv .62; Longwell, mean .11, stdv .32; p = 0.015), but tended to provide more incorrect answers in Longwell (wrong answers: LIBER, mean .58, stdv 1.02; Longwell, mean .84, stdv .90; p = 0.384). When asked to compare LIBER and Longwell directly, all but three users preferred Longwell; opinions on reliability were more divided but still in favour of Longwell (11 subjects).
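For readers unfamiliar with the measures: SUS scores are on a 0-100 scale computed from the ten Likert ratings (odd items contribute the rating minus 1, even items contribute 5 minus the rating, and the sum is multiplied by 2.5 [1]). The sketch below shows how a two-tailed paired t-test of the kind reported above can be run with SciPy; the per-subject scores in it are invented placeholders, not the study's data.

    from scipy import stats

    # One SUS score (0-100) per subject and per tool; placeholder values only.
    sus_liber = [35.0, 42.5, 30.0, 55.0, 37.5, 27.5]
    sus_longwell = [60.0, 65.0, 57.5, 72.5, 55.0, 62.5]

    t, p = stats.ttest_rel(sus_liber, sus_longwell)  # paired, two-tailed by default
    print(f"t = {t:.2f}, p = {p:.4f}")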
Results: Screen Capture Analysis
We recorded screen captures and annotated the strategies that subjects employed in carrying out the querying tasks. Some videos did not record properly (N=16). Analysis of the data helped us to identify common errors, delaying factors and misunderstandings, as reported below.

Strategies
A clear difference was found between the preferred strategy employed in subjects' initial use of the LIBER interface and the way in which subjects used LIBER over time. In answering the first question, the most frequently used strategy (7 subjects) was phrasing a query that, when submitted, retrieves the correct answer immediately, without the need for further browsing. Five subjects used a different strategy: they formed a small query and used the LIBER browsing interface to find the final answer. From the second question onwards the 'query then browse' strategy dominated (used by 10, 8 and 7 subjects respectively).

With the Longwell interface the most popular strategy for finding answers to the questions was to use the provided descriptions rather than the filters. This preference was independent of the type of the question, and independent of the experience with the interface that was built up during the task.

Errors
In general, subjects appeared to gain little understanding from the interfaces of how the data in the geographical ontology was modelled (e.g. classes, properties and values). For instance, in both interfaces subjects entered keywords such as 'largest city' (LIBER, 4 subjects; Longwell, 9 subjects). This shows the extent to which subjects were used to other types of search engines (e.g. a web search on 'largest city' will list the pages that include these search terms) and had difficulty adapting to search strategies suitable for RDF data, which simply lists population sizes without comparing them. To search the RDF data you therefore need a different kind of search: a query that finds those population sizes and then compares them for you, as in the sketch below.
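A minimal sketch of such a query, again with rdflib, assumed property names (geo:population, geo:name) and a hypothetical local copy of the dataset: it retrieves every city's population and lets the query engine do the comparison, rather than matching the words 'largest city'.

    from rdflib import Graph

    g = Graph()
    g.parse("geography.owl", format="xml")  # hypothetical local copy, as before

    LARGEST_CITY = """
    PREFIX geo: <http://example.org/geography#>
    SELECT ?name WHERE {
        ?city a geo:City ;
              geo:name ?name ;
              geo:population ?pop .
    }
    ORDER BY DESC(?pop)
    LIMIT 1
    """

    for row in g.query(LARGEST_CITY):
        print(row.name)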
Compared to Longwell, in LIBER subjects made more mistakes that can be ascribed to minor issues in the interface, such as those caused by not moving values to the boxes for inclusion in the query before confirming the query (18 subjects), and those caused by the usage of the 'optional' checkbox (7 subjects). Most of these situations were catered for, in that LIBER provided a warning or clarification which brought subjects back on track. Still, in LIBER some errors seem to be specific to the natural language interface, like assigning a property or value to the wrong object (e.g. looking for lakes called 'Florida' rather than for lakes in a state called 'Florida') (4 subjects).

With Longwell fewer things could go wrong but, most likely because subjects did not receive any feedback on what went wrong, the same errors were made repeatedly. Compared to LIBER, errors were of a different kind, such as selecting the wrong value both for filters (5 subjects) and for descriptions (2 subjects), browsing through only one of multiple results (3 subjects), typos (5 subjects), and misinterpretations of descriptions (5 subjects).

Delays
With both interfaces, subjects sometimes appeared unsure whether all matches had been found (Longwell, 5 subjects). In LIBER this happened when the system stated the number of matches to the query without actually listing them (6 subjects), or when only one match was found (4 subjects). In contrast, it also happened that browsing was stopped after only a partial answer had been found (LIBER, 5 subjects; Longwell, 4 subjects). In Longwell, subjects often clicked on links that did not lead them to anything useful, like the description of the ontology itself rather than the instances (10 subjects). In LIBER uncertainties appeared in the selection of menu items (8 subjects), and there were some interface issues that caused delays in task performance; for instance, many subjects had trouble closing pop-up windows (11 subjects) or browsing windows (9 subjects). Many of them also experienced focus issues with pop-up windows; it was not understood that pop-up windows needed to be closed before a task could be continued (11 subjects).

DISCUSSION
From the experimental data, it is clear that subjects preferred Longwell over LIBER and performed better with Longwell than with LIBER in almost all respects. It should be noted, however, that subjects felt that both interfaces were needlessly complicated. While the subjects' preference for Longwell might help in choosing between the two applications at the current time, we are more interested in what the experiment tells us about the task of performing complex queries, and in how to improve interfaces to support this activity.

When contrasting the difficulties encountered in the LIBER interface with the comparatively fluid performance in Longwell, we see that with Longwell subjects generally used the same strategy in answering all four questions. In contrast, with LIBER subjects learned while working on the task that a browsing facility was available and that spending less time on a perfect query yielded better results. This indicates that novice users' initial expectations of the querying interface are incorrect. With LIBER many errors and delays can be attributed to minor usability issues in the interface, although some issues do appear to be related to the interface style. The analysis of the screen captures helped to identify areas where the LIBER interface might be improved, such as clarification of the 'optional' checkbox and the handling of pop-ups and browsing windows. Compared to LIBER, in Longwell fewer things can go wrong: users click on links and end up somewhere else (useful or not). Because of their familiarity with the web paradigm, users may explore the interface more confidently, as they can backtrack when they find themselves on an irrelevant page.

CONCLUSIONS
This paper described a study that was performed to help in the design and refinement of LIBER's interfaces for querying and browsing metadata. The study compares subjects' performance using LIBER with the existing Longwell interface, which provides a benchmark for performance. The study allows us to look at differences in interaction strategy, and to identify issues which may be associated with the interface style, including the use of natural language. The study has focused on initial use of tools for querying and browsing metadata by researchers with backgrounds in social science, yielding insight into the difficulties experienced by casual, non-technical users when operating an interface to an unknown database that nevertheless stored a general domain. A longer training time or a more longitudinal study could well yield different results, and could help to improve the system for use by more experienced users. Also, the use of a database that is less simple, as well as more relevant for the subjects, might make a difference, in that subjects would have intuitions and expectations about the ontology used for representing the data, which would be more representative of real world use.

In general, it was found that subjects who do not have any knowledge of RDF data or SQL querying seem to have difficulties recognizing and distinguishing concepts like classes, properties and values, and the way in which they are defined in the ontology used in this study. Subjects seemed to rely on their methods for searching the internet, without realizing that different rules apply to metadata and to the particular database that was used for the study. Neither LIBER nor Longwell provides the user with sufficient information about what type of input the system expects. In other words, neither LIBER nor Longwell has yet succeeded in providing an interface that supports users in efficiently constructing metadata-based queries.

We believe that the usability of LIBER and Longwell (and of natural language interfaces and faceted browsers in general) depends on a number of factors that will vary between and even within domains, such as:

- The experience of users with ontologies and other metadata;
- The data described by the ontologies (for instance, a recipe is more usually described in natural language than geographical data);
- The type of interfaces that users normally utilise (those used to working with databases through e.g. Access would prefer Longwell);
- The size of the ontologies, and the number of individuals within them (large numbers of individuals might cause the generation of very long and therefore confusing descriptions in LIBER);
- The mix of tasks and goals, which might have an effect on strategy (e.g. users may have a whole range of interaction types with a browsing system depending on their goals and mode of working);
- The heterogeneity of the data (Longwell's filters work better if each individual has the same set of properties, while LIBER generates separate menus for each individual, and can thus deal better with heterogeneity).

Further studies should evaluate each of these factors separately in order to provide a better understanding of interfaces to support ontology-based queries.

ACKNOWLEDGMENTS
This research is funded by SFI as part of the CNGL project and by the ESRC as part of the PolicyGrid project.
REFERENCES
1. J. Brooke, SUS: a "quick and dirty" usability scale. In: P. Jordan, B. Thomas, B. Weerdmeester, A. McClelland (eds.), Usability Evaluation in Industry, Taylor and Francis, London, 1996.
2. D. De Roure, N. Jennings, N. Shadbolt, The Semantic Grid: Past, Present and Future. Proceedings of the IEEE, 93(3), 2005.
3. P. Edwards, A. Chorley, F. Hielkema, E. Pignotti, A. Preece, C. Mellish, J. Farrington, Using the Grid to Support Evidence-Based Policy Assessment in Social Science. In Proc. UK e-Science All Hands Meeting, Nottingham, 2007.
4. C. Hallett, D. Scott, R. Power, Composing Questions through Conceptual Authoring. Computational Linguistics, 33(1) (2007) 105-133.
5. C. Hallett, Generic Querying of Relational Databases using Natural Language Generation Techniques. In Proc. INLG'06, pages 88-95, Nottingham, UK, 2006.
6. S. Handschuh, S. Staab, A. Maedche, CREAM: creating relational metadata with a component-based, ontology-driven annotation framework. In Proc. K-CAP'01, ACM Press, Victoria, British Columbia, Canada, 2001.
7. E. Kaufmann, A. Bernstein, How Useful Are Natural Language Interfaces to the Semantic Web for Casual End-Users? In Proc. ISWC'07, vol. 4825 of LNCS, Springer Verlag, Busan, Korea, 2007.
8. Longwell. http://simile.mit.edu/wiki/Longwell
9. M. Petre, Why Looking isn't always Seeing: Readership Skills and Graphical Programming. Communications of the ACM, 38(6) (1995) 33-44.
10. P. Piwek, R. Evans, L. Cahill, N. Tipper, Natural Language Generation in the MILE System. In Proc. IMPACTS in NLG Workshop, 33-42, Schloss Dagstuhl, Germany, 2000.
11. P. Piwek, Requirements Definition, Validation, Verification and Evaluation of the CLIME Interface and Language Processing Technology. Technical Report ITRI-02-03, ITRI, University of Brighton, 2002.
12. R. Power, D. Scott, R. Evans, What You See Is What You Meant: Direct Knowledge Editing with Natural Language Feedback. In Proc. ECAI'98, Brighton, UK, 1998.
13. J. Thoméré, K. Barker, V. Chaudhri, P. Clark, M. Eriksen, S. Mishra, B. Porter, A. Rodriguez, A Web-based Ontology Browsing and Editing System. In Proc. AAAI-02, Edmonton, Alberta, Canada, 2002.