Using Basic Level Concepts in a Linked Data Graph to Detect User's Domain Familiarity

Marwan Al-Tawil, Vania Dimitrova, Dhavalkumar Thakker
School of Computing, University of Leeds, United Kingdom

Abstract. We investigate how to provide personalized nudges to aid a user’s exploration of linked data in a way that leads to expanding her domain knowledge. This requires a model of the user’s familiarity with domain concepts. The paper examines an approach to detect user domain familiarity by exploiting anchoring concepts which provide a backbone for probing interactions over the linked data graph. Basic level concepts studied in Cognitive Science are adopted. A user study examines how such concepts can be utilized to deal with the cold start user modelling problem, which informs a probing algorithm.

Keywords: Linked data, knowledge utility, user modeling, basic level concepts.

1 Introduction

The recent growth of the Web of Linked Data¹ (LD), which provides access to big data graphs representing domain entities and their relationships, has opened a new avenue of research on developing computational models to facilitate data exploration by lay users [9]. This has brought together research from Semantic Web and HCI to shape novel tools for interactive exploration of semantic data². One of the key challenges is ensuring that the interaction with linked data brings benefits for the users. Hence, personalization and adaptation can play a crucial role. Research in personalized exploration of linked data is still in an embryonic stage. Current work includes improving search efficiency by considering user interests [4, 7] or diversifying the user exploration paths with recommendations based on the browsing history [8]. Our research brings a new dimension of personalization and adaptation to enhance the benefits of linked data exploration, namely knowledge utility.
We investigate how to aid a user’s exploration of linked data in a way that leads to expanding her domain knowledge. This can have a broad implication for facilitating sense making while exploring linked data. Learning is an inevitable part of exploratory search, as users are discovering new connections and associations. Our earlier research has shown that although linked data exploration can promote domain knowledge expansion (‘serendipitous learning’ effect), not all paths can be beneficial. We derived empirically strategies to nudge the user to beneficial paths [10]. The user familiarity with the entities in the linked data graph (LDG) was identified as a crucial input for the nudging strategies we aim to develop: profitable exploration sequences include a start (anchor) in a familiar entity followed by bringing a new (unexpected/interesting) entity.

¹ http://linkeddata.org/
² See the series of IESD workshops, e.g. IESD2014 held @ ISWC: https://iesd14.wordpress.com/

Identifying the user familiarity with the domain entities (domain concepts or instances) from the LDG is not a trivial task because LDGs usually include thousands of entities at different levels of abstraction. This brings forth the well-known cold start problem of user modelling, which is aggravated by the sheer number of LDG entities. One way to address cold start is via a probing dialogue. While an LDG can provide a knowledge pool to implement a probing dialogue for user modelling (cf. [6]), it is not clear what domain entities to select from the vast amount of possibilities for probing. Consequently, the interactions can be too long and may refer to entities that do not bring high value for modelling a user’s domain familiarity.

This paper examines an approach to detect user domain familiarity by using anchoring concepts in the LDG around which a probing dialogue can be developed.
We adopt the Cognitive Science notion of basic level concepts (BLCs): domain concepts that are highly informative and can be easily retrieved from memory. An example of a basic level concept in the Music domain is Guitar [3]; it has Musical Instrument as a superordinate concept (more abstract) and Classical Guitar as a subordinate concept (more specific). BLCs are likely to provide knowledge bridges to learn new concepts in big information spaces and to serve as indicators for user modelling. Cognitive Science research has shown that the use of BLCs may indicate domain familiarity, e.g. experts tend to recognise subordinate concepts [5].

To get insights into how BLCs can be utilized to identify a user's domain familiarity, we conducted a user study that adopts earlier Cognitive Science methods for deriving BLCs in a specific domain [3, 5] to identify the BLCs in an LDG. Based on the study, we derive heuristics for how BLCs can be related to user domain familiarity. We then suggest a user modelling probing algorithm that utilizes BLCs.

2 Identifying Basic Level Concepts in a Linked Data Graph

We conducted a user study to examine how BLCs in an LDG can be utilized to model a user’s domain familiarity.

2.1. Study Design

Dataset. We have used a dataset from the music domain which underpins a linked data browser (MusicPinta) developed by us in earlier research [2]. The MusicPinta LDG is fairly large and diverse, yet of manageable size for experimentation. It contains 2.4M entities and 38M triple statements, and includes facts about 876 musical instruments from various categories, including many country-specific instruments. Musical instruments, which have been used in Cognitive Science studies of BLCs, provide a suitable domain for cognitive activities linked with BLCs [11].

Participants.
The study involved 40 participants recruited on a voluntary basis, varied in gender (28 male and 12 female), cultural background (1 Belgian, 10 British, 5 Bulgarian, 1 French, 1 German, 5 Greek, 1 Indian, 2 Italian, 6 Jordanian, 1 Libyan, 2 Malaysian, 1 Nigerian, 1 Polish, and 3 Saudi Arabian), and age (18 to 55, mean = 25).

Method. We follow the experimental set up in earlier Cognitive Science studies which derived BLCs using free-naming tasks in specific domains [3, 5], including the Music domain. Participants were asked to freely name objects that were shown in image stimuli, under limited response time (10s). 364 taxonomical musical instruments were extracted from the MusicPinta dataset by running SPARQL queries against the MusicPinta SPARQL endpoint to get all musical instrument concepts linked via the rdfs:subClassOf relationship. The musical instrument concepts were classified either into leaf (l) instruments (total = 256) or category (c) instruments (total = 108). Leaf instruments are found at the bottom of a hierarchy and do not have children, whereas category instruments have at least one child. For each leaf instrument l, a representative image (stimulus) was collected from the Musical Instrument Museums Online (MIMO)³ and Wikipedia⁴. For a category c, all leaves from that category were shown as a group. Following the Cognitive Science studies, additional objects, outside the domain, were included to minimize response bias: 64 additional images were randomly chosen from the most frequently occurring concepts in artificial and natural categories from the Battig and Montague category norms [1], including vehicles, clothing, furniture, tools, fruits, vegetables, animals and birds.

Ten online surveys were run adopting two strategies: (i) Strategy 1 (leaf instruments): eight surveys presented the leaf instruments; each survey presented 32 leaves and 8 additional images.
(ii) Strategy 2 (category instruments): two surveys presented the category instruments; each survey showed 54 categories and 14 additional images. The image allocation in surveys was random. Every survey had 4 participants; each participant conducted one survey following an online link, including:

• Pre-task questionnaire: collected information about the user profile (e.g. age group, nationality, and gender).
• Free-naming task: each image was shown for 10 seconds on the participant's screen and he/she was asked to type the name of the given object(s) in the image as quickly as possible. Figures 1-4 show example instrument images and participants’ answers from the study. For this task, we recorded the accuracy of the answers (i.e. whether the participant answered correctly) and their frequency (i.e. how many times a particular instrument name was mentioned correctly).
• Post-task questionnaire: collected information about the participant's familiarity level for the six top level musical instrument categories (String Instruments, Wind Instruments, Percussion Instruments, Electronic Instruments, and Other Instruments). Participants were asked to rate their knowledge in these categories on a scale of 1 to 7 (1 = No Knowledge and 7 = Expert).

³ http://www.mimo-international.com/MIMO/
⁴ Wikipedia images were used only in the cases when a MIMO image did not exist.

2.2. Basic Level Concepts Identified

To extract BLCs from the MusicPinta dataset we considered accuracy and frequency of the participants' answers [5], grouping the answers into:

• Group 1: Naming a leaf instrument with its category instead of its own name. In this group, we calculated the frequency of exact matches between the participants' answers and the category of instruments seen. For example, as shown in Fig. 1, a participant has named the leaf instrument Violotta with its parent category Violin. We counted how many times Violin was named when its leaves were seen.
• Group 2: Exact naming of categories.
In this group, we considered the cases when participants were able to exactly name the category of the instrument they saw, e.g. Fig. 3 shows a response where the category Violins was seen and named.
• Group 3: Naming a category level instrument with its parent or children instrument name. This is illustrated in Fig. 2 and Fig. 4: the participant saw a category level instrument (Fiddle and Plucked String Instruments) and named its parent (Violin and String Instruments, respectively).

Fig. 1. Leaf Violotta seen, named as Violin.
Fig. 2. Category Fiddle seen, named as Violin.
Fig. 3. Category Violins seen, named as Violins.
Fig. 4. Category Plucked String Instruments seen, named as String Instruments.

In each of the groups, LDG entities with a frequency of at least 2 (i.e. named by at least two users) were included. The entities identified in Group 1 are derived from Strategy 1, while Group 2 and Group 3 give complementary output for the LDG entities derived by Strategy 2 (i.e. when the participants saw categories of instruments). Hence, the union of Group 2 and Group 3 gives the likely BLCs identified with Strategy 2, which is then intersected with the output from Strategy 1 to obtain the final BLC list. This included: Accordion, Bells, Bouzouki, Clarinet, Drums, Flute, Guitars, Harmonica, Harp, Saxophone, String instruments, Trumpet, Violins and Xylophone.

3 Using Basic Level Concepts for User Modeling

We compared the participants' survey answers with their self-rated familiarity for the six top level musical instrument categories. The findings were used to derive probing heuristics.

When the user is able to name an instance (leaf instrument as seen in Strategy 1) instead of its corresponding BLC, she has high familiarity in the corresponding top level category. There were 27 cases (out of 41) where the participants could name leaf instruments rather than using their BLCs (as the majority of users did).
For example, a participant named the leaf Electric cello instead of naming its BLC Violin. In 67% of these cases users had high familiarity with the top level instrument category.

When the user successfully names children of a basic level concept from images of the corresponding categories (as seen in Strategy 2), she has high familiarity in the corresponding top level category. There were 34 cases where the participants named children that belonged to the basic level. For example, one participant named the child Cello instead of naming it with its BLC Violin. In 62% of these cases participants had high familiarity with the top level instrument category.

When the user cannot name a basic level concept from the corresponding BLC images (as seen in Strategy 2), she has low familiarity in the corresponding top level category. There were 11 cases (out of 64) where participants were shown a BLC and were unable to name it. In these cases, the participants had indicated low or no knowledge of the top level instrument category.

When the user can name a basic level concept from the corresponding images for the BLC category (as seen in Strategy 2), she is likely to have high familiarity with the corresponding top level category. There were 43 cases (out of 64) where participants were shown a BLC and named it correctly. In more than half (58%) of these cases the participants had high familiarity with the top level instrument category of the BLC.

Based on the above heuristics, we propose a probing algorithm based on BLCs.
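As an illustration only, the control flow of the probing algorithm listed below can be sketched in Python. The function name probe_user, the dictionary lookups standing in for C(v), L(v) and T(v), the ask callback and the simulated user are all assumptions made for this sketch; they are not part of MusicPinta. Vertices that are never probed are simply absent from the result, which corresponds to the initial level none.

```python
from typing import Callable, Dict, List

Hierarchy = Dict[str, List[str]]


def probe_user(
    blcs: List[str],
    children: Hierarchy,        # stands in for C(v) (hypothetical lookup table)
    leaves: Hierarchy,          # stands in for L(v) (hypothetical lookup table)
    top_categories: Hierarchy,  # stands in for T(v) (hypothetical lookup table)
    ask: Callable[[str], str],  # shows the image of a concept, returns the typed name
) -> Dict[str, str]:
    """Overlay probed concepts with a familiarity level (none/low/medium/high)."""
    familiarity: Dict[str, str] = {}

    def names(concept: str) -> bool:
        # Assumption: an answer counts as correct on a case-insensitive exact match.
        return ask(concept).strip().lower() == concept.lower()

    def mark_top(concept: str, level: str) -> None:
        for top in top_categories.get(concept, []):
            familiarity[top] = level

    for b in blcs:
        if not names(b):                      # cannot name the BLC
            familiarity[b] = "none"
            mark_top(b, "low")
            continue
        familiarity[b] = "medium"             # names the BLC
        mark_top(b, "medium")
        for c in children.get(b, []):         # check subordinate concepts
            if names(c):
                familiarity[c] = "high"
                familiarity[b] = "high"
        if familiarity[b] == "high":          # check leaves
            for leaf in leaves.get(b, []):
                if names(leaf):
                    familiarity[leaf] = "high"
                    mark_top(leaf, "high")
    return familiarity


# Simulated user (hypothetical): knows Violin and Cello only.
known = {"Violin", "Cello"}
model = probe_user(
    blcs=["Violin", "Flute"],
    children={"Violin": ["Cello"]},
    leaves={"Violin": ["Violotta"]},
    top_categories={
        "Violin": ["String Instruments"],
        "Violotta": ["String Instruments"],
        "Flute": ["Wind Instruments"],
    },
    ask=lambda concept: concept if concept in known else "",
)
```

In this simulated run the user names Violin and its child Cello, but neither the leaf Violotta nor the BLC Flute. Violin and Cello therefore end up high, String Instruments medium, Flute none and Wind Instruments low. Assignments are applied in probing order, so a later BLC can overwrite the level of a shared top level category.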
Input

Domain: a linked data graph G = (V, E), where V = {v1, v2, ..., vn}.
Set of images I = {i1, i2, ..., in} and a function image : V → I assigning an image i to each vertex v.
Set of basic level concepts B = {b1, b2, ..., bk}.

User diagnosis is a mapping that overlays G’s vertices with a familiarity level (none, low, medium, high):
familiarity : V → W, where V = {v1, v2, ..., vn} and W = {none, low, medium, high}.

For every vertex from G, we define the following functions, which are implemented with simple inferences using hierarchical relationships:
- P(v): returns all parent concepts for v
- C(v): returns all children concepts (including the leaves) for v
- L(v): returns the leaves (instances) for v
- T(v): returns the top level categories for v

Processing

//initialization
for all v ∈ V do familiarity(v) = none
for all b ∈ B do //BLC naming
    show image(b) and ask to name it
    if user_answer ≠ b do //cannot name BLC
        familiarity(b) = none
        for all t ∈ T(b) familiarity(t) = low
    else do //names BLC
        familiarity(b) = medium
        for all t ∈ T(b) familiarity(t) = medium
        for all c ∈ C(b) do //check subordinate
            show image(c) and ask to name it
            if user_answer == c do
                familiarity(c) = high
                familiarity(b) = high
            end if
        end for
        if familiarity(b) == high do //check leaves
            for all l ∈ L(b) do
                show image(l) and ask to name it
                if user_answer == l do
                    familiarity(l) = high
                    for all t ∈ T(l) familiarity(t) = high
                end if
            end for
        end if
    end if
end for

Output

User model U ⊆ V × W, defined by the mapping familiarity : V → W.

4 Current State and Future Work

In this work, we examine the advantage of using basic level concepts to detect user familiarity in a linked data graph. The user study identifies the BLCs in a Music domain in a free-naming task and illustrates how these concepts can be utilized to detect the user familiarity with a subset of entities from the LDG. These findings can only be applied if it is possible to automatically detect BLCs from an LDG.
Following the Cognitive Science definition of BLCs (domain concepts that carry the most information, possess the highest category cue validity, and are, thus, the most differentiated from one another [3]), we have implemented eight algorithms for extracting BLCs from the LDG. The algorithms search for basic categories at the most inclusive level at which attributes are common to most categories' members, and for basic categories which are most differentiated from other categories (categories with highest cue validity, i.e. their members have attributes common to the category and not belonging to other categories). We have implemented appropriate SPARQL queries over the MusicPinta dataset adopting several semantic relationships and similarity measures. The set of BLCs identified in the study is used as a ‘ground truth’ to benchmark the algorithms. Current results show that the best performing algorithms achieve precision of 0.48, which is promising but not yet sufficiently high.

Our immediate future work is to tune the BLC algorithms and explore various fusion methods to improve the precision results. We will then be able to implement the probing algorithm and utilise it in developing the nudging strategies derived in [10].

References

1. Van Overschelde, J. P., Rawson, K. A., & Dunlosky, J. Category norms: An updated and expanded version of the Battig and Montague (1969) norms. Journal of Memory and Language, 2004, 50, 289-335.
2. Thakker, D., Dimitrova, V., Lau, L., Yang-Turner, F., & Despotakis, D. Assisting User Browsing over Linked Data: Requirements Elicitation with a User Study. In proceedings of ICWE 2013, pp. 376-383.
3. Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. Basic objects in natural categories. Cognitive Psychology, 1976, 8, 382-439.
4. Sah, M., & Wade, V. Personalized Concept-based Search and Exploration on the Web of Data using Results Categorization. In ESWC 2013.
5. Tanaka, J., & Taylor, M. Object Categories and Expertise: Is the Basic Level in the Eye of the Beholder? Cognitive Psychology, 1991, 23(3), 457-482.
6. Thakker, D., Lau, L., Denaux, R., Dimitrova, V., Brna, P., & Steiner, C. M. Using DBpedia as a Knowledge Source for Culture-Related User Modelling Questionnaires. In UMAP 2014.
7. Rossel, O. Implementation of a “search and browse” scenario for the LinkedData. In Intelligent Exploration of Linked Data (IESD), 2014.
8. Vocht, et al. A Visual Exploration Workflow as Enabler for the Exploitation of Linked Open Data. In Intelligent Exploration of Linked Data (IESD), 2014.
9. Schraefel, M. C. What does it look like, really? Imagining how citizens might effectively, usefully and easily find, explore, query and re-present open/linked data. In ISWC 2010.
10. Al-Tawil, M., Thakker, D., & Dimitrova, V. Nudging to Expand User’s Domain Knowledge while Exploring Linked Data. In Intelligent Exploration of Linked Data (IESD), 2014, @ ISWC2014.
11. Palmer, F., Jones, K., Hennessy, L., Unze, G., & Pick, A. D. How is a trumpet known? The “basic object level” concept and the perception of musical instruments. American Journal of Psychology, 1989, 102.