
Adaptive Retrieval and Composition of Socio-Semantic Content for Personalised Customer Care

Ben Steichen

Ben.Steichen@cs.tcd.ie

Vincent Wade

Vincent.Wade@cs.tcd.ie

Knowledge and Data Engineering Group, School of Computer Science and Statistics, Trinity College, Dublin, Ireland

The parallel rise of the Semantic and Social Web provides unprecedented possibilities for the development of novel search system architectures. However, many traditional search systems have so far followed a simple one-size-fits-all paradigm, ignoring differences in user information needs, preferences and intentions. In recent years, initial evidence has emerged that personalisation may be applied within web search engines; however, little detail has been published other than adaptation based on user histories. Moreover, current implementations often fail to combine the mutual benefits of both structured and unstructured information resources. This paper presents techniques and architectures for leveraging socio-semantic content and adaptively retrieving and composing such content in order to provide personalised result presentations. The system is presented in a customer care scenario, which provides an application area for personalisation in terms of available heterogeneous resources as well as user preferences, context and characteristics. The presented architectures combine techniques from the fields of Information Retrieval, Semantic Search and Adaptive Hypermedia in order to enable efficient adaptive retrieval as well as personalised compositions.

Keywords: Adaptive Information Retrieval, Adaptive Result Composition, Socio-Semantic Search, Personalised Search

1 Introduction

The vast growth of the World Wide Web has resulted in search engines playing an integral part in people’s daily pursuit of information. In particular, with the rise of the Social Web, or Web 2.0, a significant part of the growing number of resources consists of user-generated content such as forum posts, tags, media uploads, etc. Although web search engines have become very efficient at indexing, retrieving and ranking unstructured documents (including such Web 2.0 resources), traditionally they have often followed a one-size-fits-all paradigm: the same results are returned in the same form and order for each user with the same query. More recently the notion of Personalised Information Retrieval (PIR) has emerged in research projects in order to retrieve more relevant results for users’ personal information needs [1]. However, the conceived solutions have mainly focussed on improving ranked list scores by boosting documents depending on their similarity to a mined user profile. They do not take into account the different search expertise, preferences or knowledge levels of users, nor do they make use of search strategies in order to assist more complex informational queries. The rise of the Semantic Web has provided new possibilities for representing information using semantic data formats such as ontologies, allowing the development of Semantic Search (SS) systems. However, the current state of the art of such systems has largely followed the IR approach of ranking relevant documents and presenting them in ranked lists. They have so far failed to use semantic knowledge to provide improved guidance for querying users. The field of Adaptive Hypermedia (AH) has traditionally focussed on providing such guidance using personalised result compositions and presentations through multi-dimensional adaptivity. However, the reliance of AH systems on heavily marked-up content has often hampered the inclusion of open-corpus documents such as user-generated content.

This paper proposes to combine techniques and architectures from PIR, SS and AH in order to provide Adaptive Information Retrieval and Composition. The proposed system consolidates both social and semantic data sources and provides a single query interface that supports personalised query responses. Customer Care is used as an example field where such a personalised system can be applied, since in addition to providing traditional technical documentation, many organisations now provide their own versions of community resources where users increasingly engage in forums in order to solve technical problems. By applying our search system across these different data sources, we are able to provide users with result compositions that are (i) personalised to their own needs with respect to the product, (ii) semantically structured according to organisational knowledge and (iii) combined from closed (semantic) as well as open (social) content.

2 Related Work

A variety of techniques and technologies have been developed in several research fields in order to i) search across increasingly large volumes of data and ii) tailor the content retrieval towards users’ personal interests and preferences. A broad characterisation of such techniques reveals three distinct research areas: Personalised Information Retrieval, Semantic Search and Adaptive Hypermedia.

The field of Information Retrieval [2] has typically focused on improving ranked result lists using one-size-fits-all algorithms. More recently, Personalised Information Retrieval systems make use of personal information (e.g. gathered from previous search interactions [3]) in order to either expand the original user query with personalised keywords [4] and logical operators (e.g. AND, OR, NOT) [5], or to bias traditional ranking algorithms towards more personally relevant information [6]. Alternative composition and presentation attempts such as result clustering [7] have most often been confined to keyword frequency calculations, largely lacking a more fine-grained representation of i) the knowledge space that is being queried and ii) the user’s personal knowledge state and preferences.

In order to overcome this lack of structured representations of both domain and user models, Semantic Search engines draw from the expressive power of ontologies, which can be used for modelling and reasoning across the knowledge space as well as user interests [8]. Although early Semantic Search systems often made use of manual one-to-one mappings between documents and ontological concepts, more “lightweight” systems [9][10] now (semi-)automatically annotate documents using multiple concepts drawn from ontologies. These annotations can then be used to rank open-corpus documents not only by their statistical similarity to a user’s keywords, but also by the importance of their particular annotations [10]. The usage of semantic user models such as in [8] has advanced the field towards more personalised rankings of documents; however, the sole dimension of adaptation has again been that of user interests. Moreover, user guidance has so far been largely neglected, as documents have mostly been composed and presented in a flat ranked list format, failing to guide the user through the result space.

Adaptive Hypermedia (AH) [11] is a field that has inherently focussed on providing multi-dimensional adaptation by creating personalised information compositions and presentations. Since the earliest systems such as AHA! [12] and APeLS [13], their focus has been on providing information compositions, which contain documents that are not only adaptively selected for the particular users, but also sequenced according to current user knowledge states as well as to a variety of user preferences. Moreover, presentational cues such as link hiding [12] or link annotation [14] provide additional navigation guidance across the document space. This increased adaptivity is facilitated by a new type of model called the Adaptation Model [12] or Narrative Model [13]. This model describes the strategy by which concepts can be traversed to support particular objectives. For example, a “how to” query of an inexperienced user might have a narrative that would first choose content containing a general introduction of the topic and its concepts, followed by examples on how to carry out the queried task. However, AH systems have inherently been hampered by their reliance on fine-grained concept-to-content indexing of the document space, making it hard to incorporate “unknown” open corpus data.

An additional search paradigm that has emerged over recent years is the notion of social search or collaborative recommendation. In these systems, users are presented with documents or items that are either globally popular [15] or recommended by users with similar interests (e.g. Amazon¹ recommendations). With the growth of online communities, these techniques might become increasingly powerful for future adaptive and personalised search. However, such collaborative techniques are out of the scope of the research presented in this paper.

In conclusion, the major gap in current search systems lies in the failure to augment Personalised Information Retrieval with Semantic Search and Adaptive Hypermedia techniques in order to create Personalised Result Compositions and Presentations. In order to overcome this gap, search systems need to integrate the notion of query adaptation based on a wider variety of user characteristics in order to enable more personalised retrieval. Moreover, the expressive power of ontologies that drives Semantic Search systems needs to be integrated in order to model both the knowledge domain as well as the system users. Finally, the Adaptive Hypermedia notion of a Narrative Model needs to be incorporated in order to i) retrieve documents that most closely correspond to the current domain and user model states and ii) adaptively compose and present the results to improve the guidance of users.

¹ http://www.amazon.com

3 Methodology

In order to study and address the identified gaps in current adaptive search techniques, a vertical application area is needed, which provides i) the necessary heterogeneous content and ii) an authentic evaluation scenario. For the research presented in this paper, a case-based study of customer care has been chosen, which represents an application area where users are currently already searching across both structured (closed corpus) and unstructured (open corpus) content. Additionally, this case study provides the necessary context for addressing different user information needs, skills and preferences.

4 Personalised Customer Care

Customer care is a crucial area for companies wishing to establish long-term relationships with their user base. Despite offering a strong product or service, it is often the post-purchase assistance that influences a user’s decision to consider purchasing more products or services from this particular company [16]. However, it is surprising that the type of support in this massive area has been confined to the simple one-size-fits-all paradigm. Users are left having to either consult complete user manuals in order to find the relevant section for the problem in question, or perform a keyword query and search through traditional ranked result lists regardless of their personal background in terms of product knowledge, skills and preferences.

From a technical perspective, there are three types of help files that are available for supported products. First of all, a company internally produces technical documentation that is often sliced to a fine granularity in order to ensure its reuse in the case of software updates. These smaller units are then compiled into manuals in order to be shipped as complete user guides. By composing these knowledge items into manuals, chapters and sections, companies provide a great array of implicit metadata information that can potentially be used for adaptive and personalised retrieval. In addition to these highly structured data sources, companies often produce a second type of document, containing knowledge resources that have been generated by support staff following a direct interaction with customers. These documents are generally less structured than technical support documents, containing limited metadata such as topic categorisation. Nevertheless, these articles contain valuable information for an end-user who might be facing a similar issue. Finally, a third type of document is emerging increasingly with the rise of the social web, or Web 2.0. Users increasingly engage in community forums, asking questions to the general user community in the hope that either a similar problem might have been solved previously or that a user in the community has the technical knowledge to identify the problem area. In terms of technical markup, these documents contain the least structure for several reasons. First of all, users inherently use different terminologies depending on their linguistic and technical background. Secondly, when users categorise or tag forum posts, they might have differing intentions and perceptions of what might be relevant for future use. Finally, even if users agree on the type of tags, categorisations and language, the problems of synonymy and polysemy increase the mismatch between user-generated terms and the organisational terminology.

It becomes apparent that current customer care is not lacking in terms of support document quantity, but rather in terms of aggregating and structuring existing content in order to make it i) consistent, ii) reusable and iii) suitable for adaptation and personalisation. Hence it is necessary to develop new techniques and architectures for structuring and aggregating the different document types. Additionally, new search architectures are required that leverage such improved data models in order to make full usage of the complete document space.

5 Structuring Heterogeneous Content

The heterogeneous support content that is available for software products needs to be transformed to a semantically richer form in order to allow reasoning, adaptation and personalisation across it. As mentioned earlier, Semantic Web technologies such as ontologies represent an opportunity to base such structuring and markup on. The different types of content can be broadly categorised by their amount of existing metadata and structure. Consequently, different types of usage can be drawn from each: whereas highly structured content (such as technical documentation) can be used to derive an ontology of the knowledge domain, unstructured content (such as forum posts) can be marked up in order to provide querying users with a larger range of problem solutions. Key challenges in using marked up content and ontologies lie in identifying (i) how high quality markup needs to be, (ii) how extensive the vocabulary can be and (iii) how extensive the ontology needs to be.

5.1 Structuring organisational content

Organisational structured content is often of a fine granularity in order to ensure its reusability for future product updates. Transforming both the individual knowledge items as well as their compositions (e.g. from product manuals) to a domain ontology allows the content to be more reusable and suitable for adaptation and personalisation.

First of all, for each individual knowledge item, there exist a number of content fields such as title, paragraph, procedure, etc., as well as metadata fields such as index terms (i.e. keywords) or media type (e.g. text, image). By modelling the different fields as ontological classes, each knowledge item and its constituent parts can be populated as instances of these classes. This is particularly useful in the case of content and metadata fields that can be used for reasoning and adaptation (e.g. a metadata field indicating a procedure). For example, if a particular user has only just installed the product, explanatory items should introduce the user to a particular feature first, before showing a detailed procedure on how to configure this feature. The difficulty of an item can also be inferred from a variety of structural features contained such as the number of procedural steps, the content length, the number of paragraphs, etc. Corporate product documentation is often extensively marked up to a deep structural level, allowing such a detailed content analysis.

Secondly, the composition of knowledge items is transformed to ontological form by creating classes for the hierarchical components of the document (see Figure 1). Moreover, components such as chapters, sections and subsections often contain additional data (e.g. overview titles), providing valuable information about the overall subject of its constituent knowledge items. The individual content items (e.g. chapters, sections, subsections) are used to populate the different ontological classes as instances, with instance relationships ensuring the ability to reason across connected items. For example, if a section explains a particular product feature, its subsections typically provide more detailed information. Again in the case of a less experienced user, it is important to not only show the detailed information about how to configure a particular feature, but also to introduce the feature with the explanation that is contained in a higher level section.
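The following Python fragment is a minimal sketch of how such a hierarchy could support the novice example above. It is not the system's actual OWL model: the `Component` class, its fields and the identifiers are hypothetical, and plain object references stand in for ontological instance relationships.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Component:
    """One hierarchical component of a manual (hypothetical structure)."""
    identifier: str
    kind: str                      # "chapter", "section" or "subsection"
    title: str
    parent: Optional["Component"] = None


def enclosing_components(item: Component) -> List[Component]:
    """Return the components that contain `item`, outermost first."""
    chain: List[Component] = []
    node = item.parent
    while node is not None:
        chain.append(node)
        node = node.parent
    chain.reverse()
    return chain


def compose_for_user(item: Component, is_novice: bool) -> List[str]:
    """Novices get the higher-level explanations before the detailed item."""
    prefix = enclosing_components(item) if is_novice else []
    return [c.identifier for c in prefix] + [item.identifier]


# Illustrative instances only; the titles are invented for this sketch.
chapter = Component("ch-backup", "chapter", "Backing up your data")
section = Component("sec-schedule", "section", "Scheduled backups", parent=chapter)
detail = Component("sub-configure", "subsection", "Configuring a schedule", parent=section)
```

Calling `compose_for_user(detail, is_novice=True)` walks up the hierarchy so that the introductory chapter and section precede the detailed subsection, whereas an experienced user receives the subsection alone.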

By transforming the complete technical documentation into classes and instances, a domain ontology can be created, which accurately describes the subject area from the point of view of the product provider. In particular, implicit knowledge from the existing item compositions in product manuals is effectively transformed into a form that allows making this knowledge explicit using ontological reasoning. Since the technical documentation is marked up consistently according to predefined schemas, most of the transformations can be applied automatically. However, in order to extract additional, more high-level concepts, a certain amount of manual effort is involved. For example, in the case of several product manual chapters referring to the same product features (one chapter explaining its installation, another one its configuration), the domain ontology should capture these cross-chapter relationships. Unless such references to higher level concepts (e.g. particular product features) are mentioned explicitly in the document markup, a domain expert needs to manually add these ontology classes and relationships.

After a domain ontology has been generated, it is possible to link new “unknown” documents with the existing ontological instances. Two separate components are needed in order to generate i) the right granularity from the open corpus content and ii) conceptual indexing according to the ontological structure (see Figure 2).

First of all, a content slicer described in [17] is responsible for transforming the original documents into fine-grained “slices”. Such slices are viewed as stand-alone pieces of information, containing their own semantic properties and metadata. During the slicing of the original open-corpus data (i.e. forum content and knowledge resources), structural as well as semantic analysis techniques are applied in order to generate fine-grained knowledge items as well as an initial set of metadata fields.

In a second step, the Web 2.0 concept of “crowd sourcing” is used to generate additional and more accurate annotations by presenting the content slices and their initial associated metadata to voluntary annotators (similar to [18]). Ideally, this socio-semantic annotation client is embedded within the actual community forum, allowing the initial content generators to tag their own posts. The domain ontology is also available to annotators as a preferred vocabulary in order to correspond their conceptual understanding of the slice to the terminology of the underlying semantic knowledge representation. The ontology is presented in hierarchical form, allowing annotators to easily browse and select concepts for the displayed slice. Furthermore, the annotation user interface includes several drop-down lists, which offer an annotation vocabulary for additional properties, such as the difficulty or interactivity level of the content. Finally, the selected annotations are stored in a triple store.
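As an illustration of how the selected annotations might be represented, the sketch below keeps subject-predicate-object triples in memory and supports simple pattern matching. It is an assumption-laden stand-in for a real triple store: the `AnnotationStore` class, the predicate names (`about`, `difficulty`) and the slice and concept identifiers are all hypothetical.

```python
from typing import List, NamedTuple, Optional, Set


class Triple(NamedTuple):
    subject: str      # the annotated content slice
    predicate: str    # the annotation property
    obj: str          # an ontology concept or a vocabulary value


class AnnotationStore:
    """A tiny in-memory stand-in for a triple store (illustration only)."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self._triples.add(Triple(subject, predicate, obj))

    def match(self, subject: Optional[str] = None,
              predicate: Optional[str] = None,
              obj: Optional[str] = None) -> List[Triple]:
        """Return all triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t.subject == subject)
            and (predicate is None or t.predicate == predicate)
            and (obj is None or t.obj == obj)
        )


store = AnnotationStore()
# An annotator links a forum slice to a domain concept and rates its difficulty.
store.add("slice:42", "about", "concept:ScheduledBackup")
store.add("slice:42", "difficulty", "beginner")
store.add("slice:43", "about", "concept:ScheduledBackup")
```

A pattern such as `store.match(predicate="about", obj="concept:ScheduledBackup")` then returns every slice that annotators have linked to that concept, which is the kind of lookup the retrieval stage relies on.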

As a result of this two-stage approach, the original user-generated forum content as well as the knowledge resources have been annotated and consequently integrated with the semantic knowledge representation of the domain ontology. Even if the annotations are not as complete or accurate as the fine-grained technical documentation, they nevertheless enable partial reasoning, adaptation and personalisation during the content retrieval and composition stage.

6 Knowing and adapting to the user

Knowing the different characteristics, context and preferences of users is crucial for the development of any adaptive and personalised system. In the particular case of Personalised Customer Care, there are a number of user characteristics that product providers can adapt on.

First of all, a customer is using one or more particular products or services out of a potentially large portfolio from the company in question. Instead of leaving users sorting through search results in order to find the information that is related to their particular product, a system can automatically adapt the information retrieval and result composition accordingly. Secondly, upon each interaction with a search system, the user has a particular product state. For example, a user might have just purchased the product, consequently finding him-/herself at the “product installation and activation state”. Other examples would be the state of “configuring” after installation or the execution of “pro-active actions” (e.g. the user simply wants to find out more about a certain feature) or “re-active actions” (e.g. an error message has occurred in the product and the user wants to solve the problem). Another characteristic of a customer is one’s personal knowledge state, which depends on previous interactions with the product and the search system. Users could range from being complete novices to being considerable experts regarding particular parts of the product. Additionally, users can have different content preferences, for example some users might prefer looking at the content that contains the procedures for solving a problem, whereas other users might prefer to consult explanations or overviews first. Also, language preferences of users can be used during the adaptation phase, given that most of the content produced by a company as well as the community forums are available in different languages. In particular, consider a user who types in a keyword query in his/her native language other than English. If this particular user also speaks English, the system can adaptively retrieve additional resources in the case of poor coverage in the user’s native language.

In addition to these user characteristics and preferences, there are additional axes of adaptation that arise at query time. A particular query can have a question type, which represents the type of intent of the user’s question. For example, a user can have a query that is a “what”-type question, which requires an explanation as an answer. On the other hand, a “how” question requires the result to be a tutorial or procedure that the user has to follow in order to solve a particular problem. Additionally, the preferred answer structure might vary from query to query. For example, some queries are preferably answered with a “highly structured” result composition (including overviews, explanations, tutorials, related items, etc.), whereas a “quick” answer would simply provide a tutorial or reference resources (e.g. registry entry values, etc.).

The different user characteristics and preferences are stored using a hybrid user model, consisting of simple key-value pairs (e.g. for language preferences), semantic structures that mirror the domain ontology (i.e. overlay user model), as well as keyword vectors that represent users’ historical interactions with the system (i.e. based on resources a user has looked at/clicked on).
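A hybrid user model of this kind could be sketched as follows. The class name, field names, the 0 to 1 knowledge scale and the threshold are assumptions made for illustration; the paper does not specify the concrete representation.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HybridUserModel:
    """Three-part user model: key-value pairs, ontology overlay, keyword vector."""
    # Simple key-value preferences, e.g. {"language": "de"}.
    preferences: Dict[str, str] = field(default_factory=dict)
    # Overlay model: domain concept id -> estimated knowledge in [0, 1] (assumed scale).
    concept_knowledge: Dict[str, float] = field(default_factory=dict)
    # Keyword vector built from resources the user has viewed or clicked on.
    keyword_history: Counter = field(default_factory=Counter)

    def record_view(self, resource_text: str) -> None:
        """Update the keyword vector with the terms of a viewed resource."""
        self.keyword_history.update(resource_text.lower().split())

    def knows(self, concept: str, threshold: float = 0.5) -> bool:
        """Treat a concept as known once its overlay value reaches the threshold."""
        return self.concept_knowledge.get(concept, 0.0) >= threshold


user = HybridUserModel(preferences={"language": "de", "fallback_language": "en"})
user.concept_knowledge["concept:ScheduledBackup"] = 0.8
user.record_view("Backup schedule backup")
```

Keeping the three parts separate lets each be consulted at a different point: the key-value pairs when filtering by language, the overlay when deciding whether introductory material is needed, and the keyword vector when biasing term weights.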

7 Retrieval and Composition System architecture

In order to provide multi-dimensional adaptation, the domain and user models need to be consolidated with the Adaptive Hypermedia concept of a Narrative/Adaptation Model (as mentioned in section 2). This model contains the particular rules on i) what should be adapted on and ii) how the adaptation should occur. In this section, a Retrieval and Composition system architecture will be explained, which incorporates these three models in order to deliver Personalised Customer Care. The retrieval and composition process is broken down into several stages (see Figure 3) and incorporates influences from the areas of Adaptive Hypermedia, Semantic Search and Information Retrieval. In particular, this work extends an initial prototype presented in [18], which has already proven the benefits of personalised retrieval and composition of open-corpus content in an educational scenario.

In the first stage, a user is requested to input a standard keyword query, along with a drop-down selection of query types (i.e. what/how). Additionally, users indicate their current activity or intent regarding the product, i.e. getting started, reacting to a problem, etc. Ideally, this property would already be stored in a user model (e.g. from previous interactions with the product or search engine), thus not requiring a user to manually select this information. The keyword query is executed on an indexed version of the domain ontology, yielding a collection of instance results. From these results, several statistics can be generated. First of all, it is possible to determine which “conceptual area” of the domain ontology has yielded the most results, i.e., which are the high level concepts that have the most results. For example, by ordering the results by their corresponding chapter or subject, one can infer the particular part of the domain ontology that contains many of the keywords. Additionally, by analysing the search results, it is also possible to generate statistics about the type of content that is retrieved, such as the activity-level (i.e. amount of procedures and tutorials), the compositional properties (number of detailed subsections results), etc.
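The statistics described for this first stage can be sketched in a few lines of Python. The `InstanceResult` structure, its field names and the set of content types counted as "active" are hypothetical; the sketch only shows the kind of aggregation involved, not the system's actual index.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InstanceResult:
    """One ontology instance returned by the keyword query (assumed fields)."""
    identifier: str
    chapter: str         # high-level concept the instance belongs to
    content_type: str    # e.g. "procedure", "tutorial", "explanation"


ACTIVE_TYPES = {"procedure", "tutorial"}


def conceptual_area_counts(results: List[InstanceResult]) -> Counter:
    """Count results per high-level concept to find the dominant area."""
    return Counter(r.chapter for r in results)


def activity_level(results: List[InstanceResult]) -> float:
    """Share of results that are procedures or tutorials (0.0 if empty)."""
    if not results:
        return 0.0
    active = sum(1 for r in results if r.content_type in ACTIVE_TYPES)
    return active / len(results)


results = [
    InstanceResult("i1", "Backup", "procedure"),
    InstanceResult("i2", "Backup", "explanation"),
    InstanceResult("i3", "Restore", "tutorial"),
    InstanceResult("i4", "Backup", "explanation"),
]
```

For this sample, the `Backup` chapter dominates the result set, which would mark it as the conceptual area to expand in the next stage.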

These initial statistics are used in a second stage to group results and to extend the subject space in order to personalise the results shown to the user. By consolidating the initial results with the domain and user model, a strategy is then applied to provide a “storyline” across the conceptual space. Particular ontological relationships of the initial results are followed depending on user model preferences. For example, in the case of a user who has just purchased the product, knowledge items (i.e. instances) that focus on installing and activating the product are added to the results. Another example would be to add related instances that fill a particular user’s current knowledge gap (e.g. overview resources about a product feature, related features, etc.). Also, the activity level and difficulty level of instances influence their inclusion in the result space based on the user model preferences. At the end of this second step, a complete personalised result space has been selected from the domain ontology, which is not only more personally relevant than the initial results, but also more diversified and complete, containing additional relevant instances that would not have been found using conventional keyword search. The different results are composed according to their ontological relationships (provided by the domain ontology), their subject coverage, as well as their relevance to the querying user.
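The extension of the subject space can be illustrated with a one-hop traversal over ontological relationships, where the relationships to follow depend on the user's product state. The relation names, the state labels and the `RELATIONS_BY_STATE` rule table are assumptions for this sketch; in the described system such rules would live in the Narrative Model.

```python
from typing import Dict, Iterable, List

# instance id -> relation name -> related instance ids (all names hypothetical)
Relations = Dict[str, Dict[str, List[str]]]

# Which relations to follow for a given product state (assumed narrative rules).
RELATIONS_BY_STATE: Dict[str, List[str]] = {
    "installation": ["hasInstallationGuide", "hasOverview"],
    "configuring": ["hasProcedure"],
    "reactive": ["hasTroubleshooting"],
}


def extend_subject_space(initial: Iterable[str], relations: Relations,
                         user_state: str) -> List[str]:
    """Add instances one hop away from the initial results, without duplicates."""
    follow = RELATIONS_BY_STATE.get(user_state, [])
    extended: List[str] = []
    seen = set()

    def add(instance: str) -> None:
        if instance not in seen:
            seen.add(instance)
            extended.append(instance)

    for instance in initial:
        add(instance)
        for relation in follow:
            for related in relations.get(instance, {}).get(relation, []):
                add(related)
    return extended


relations: Relations = {
    "feature-backup": {
        "hasInstallationGuide": ["install-backup"],
        "hasOverview": ["overview-backup"],
        "hasProcedure": ["proc-backup"],
    }
}
```

With these sample relations, a user in the installation state receives the installation guide and overview alongside the initial hit, while a user in the configuring state receives the procedure instead.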

In stages 4 and 5, additional resources are retrieved by generating and executing expanded information retrieval queries across the user-generated content base. For each instance result in the extended subject space, an adapted query is generated, which contains the various aspects of resources that should be retrieved (in terms of keywords and metadata attributes). By indexing the content as well as the user-generated annotations, structured queries can be used to retrieve topically as well as personally relevant data. Additionally, logical operators and query term weights are used in order to minimise overlap between the different result sets.
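One way to picture an adapted query is as a Lucene-style query string that combines weighted keywords, metadata constraints and exclusions. The sketch below assumes the classic Lucene query syntax (`term^boost`, `field:"value"`, `AND`, `OR`, `NOT`); the field names (`difficulty`, `concept`) and the function itself are hypothetical, and no escaping of special characters is attempted.

```python
from typing import Dict, List


def build_adapted_query(keywords: Dict[str, float],
                        metadata: Dict[str, str],
                        exclude_concepts: List[str]) -> str:
    """Build a Lucene-style query string for one instance of the subject space.

    keywords:         term -> weight, rendered as term^weight
    metadata:         required annotation fields, joined with AND
    exclude_concepts: concepts covered by other queries, excluded with NOT
                      to reduce overlap between result sets
    """
    term_clause = " OR ".join(f"{term}^{weight:g}" for term, weight in keywords.items())
    parts = [f"({term_clause})"]
    parts += [f'{name}:"{value}"' for name, value in metadata.items()]
    query = " AND ".join(parts)
    for concept in exclude_concepts:
        query += f' NOT concept:"{concept}"'
    return query


query = build_adapted_query(
    keywords={"backup": 2.0, "schedule": 1.0},
    metadata={"difficulty": "beginner"},
    exclude_concepts=["restore"],
)
```

Here the user's own keyword is boosted over the expansion term, the difficulty constraint comes from the user model, and the excluded concept is one that a sibling query in the same subject space already covers.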

In the final step, the different results are composed together with the instance results from the domain ontology in order to provide a complete result space. The combined sets are grouped, sequenced and linked according to the particular structure of the personalised subject space that was generated in step 2. This additional notion of sequence or narrative corresponds to a typical Adaptive Hypermedia presentation that guides users through the result space rather than presenting a flat list [13]. For example, for a novice user, advanced features are preceded by simpler (overview-type) resources, and followed by additionally relevant/related results. Also, due to these highly structured and personalised characteristics of the result space, additional Adaptive Hypermedia techniques can be applied. For example, on the result overview page, visual cues and link annotations guide a user to the currently most appropriate items to look at. Lastly, the composition of both organisational content as well as user-generated content ensures structure while still maintaining great topic coverage.
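The grouping and sequencing of the combined results can be sketched as a two-level sort: first by the concept order of the personalised subject space, then by a content-type order that depends on the user. The two type orderings and the result fields below are assumptions chosen to mirror the novice example; they are not taken from the system's Narrative Model.

```python
from typing import Dict, List

# Assumed sequencing preferences; a real narrative would encode these as rules.
NOVICE_TYPE_ORDER = ["overview", "explanation", "tutorial", "procedure", "forum"]
EXPERT_TYPE_ORDER = ["procedure", "tutorial", "forum", "explanation", "overview"]


def _rank(value: str, ordering: List[str]) -> int:
    """Position of value in ordering; unknown values sort last."""
    return ordering.index(value) if value in ordering else len(ordering)


def compose_results(results: List[Dict[str, str]], concept_order: List[str],
                    is_novice: bool) -> List[str]:
    """Group by concept, then sequence by content type for this user."""
    type_order = NOVICE_TYPE_ORDER if is_novice else EXPERT_TYPE_ORDER
    ordered = sorted(
        results,
        key=lambda r: (_rank(r["concept"], concept_order), _rank(r["type"], type_order)),
    )
    return [r["id"] for r in ordered]


combined = [
    {"id": "p1", "concept": "backup", "type": "procedure"},   # organisational content
    {"id": "f1", "concept": "backup", "type": "forum"},       # user-generated content
    {"id": "o1", "concept": "backup", "type": "overview"},
    {"id": "o2", "concept": "restore", "type": "overview"},
]
concept_order = ["backup", "restore"]
```

For a novice, the overview of each concept precedes its procedure and the related forum content; for an expert, the procedure comes first and the overview is pushed to the end of its group.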

8 Ongoing Work

The system implementation is currently being completed using a variety of technologies. The organisational content has been transformed into the Web Ontology Language (OWL)² using customised scripts, whereas the annotation store consists of a standard installation of the ARC triple store³. To ensure both efficiency as well as reasoning capabilities, the domain ontology is stored in both eXist⁴ (which allows efficient indexing using the built-in Lucene⁵ functionality), as well as its ontological form (for reasoning during the extended subject search stage). The retrieval and composition system builds on work presented in an educational scenario [18] (see Figure 4) and uses an Adaptive Engine to consolidate the User, Domain and Narrative Models. Ontological reasoning is performed within the Adaptive Engine using the Jena Framework⁶. Similarly, the extended queries are generated by the rules encoded in the narrative, which can either be scripted (JavaScript), or rule-based (Drools⁷). The adapted queries are executed on an indexed version of the annotated content slices and the results are presented in a web-based interface using JSP and JavaScript.

The system evaluation will consist of authentic users performing activities over the domain content, with assessment measures focussing on retrieval accuracy and appropriateness, as well as the general task assistance in terms of task completion time and user effort. A second evaluation will capture typical user queries, which will be used as test evaluations of system response accuracy by product experts.

² http://www.w3.org/2004/OWL/
³ http://arc.semsol.org/
⁴ http://exist.sourceforge.net/
⁵ http://www.exist-db.org/lucene.html
⁶ http://jena.sourceforge.net/
⁷ http://www.jboss.org/drools/

9 Conclusions

This paper has presented a novel approach to providing personalised information retrieval and composition from a variety of heterogeneous data sources. The presented architectures for structuring and retrieving both structured and user-generated content combine the latest advances in Personalisation, Semantic Search, Information Retrieval as well as the Social Web. Firstly, existing content resources are leveraged and structured in order to make them reusable, as well as suitable for adaptation and personalisation. Secondly, large sets of user-generated content are annotated using a socio-semantic annotation tool. Finally, an adaptive retrieval and composition architecture is responsible for aggregating the different data sources into personalised result presentations, which guide users towards relevant and appropriate resources.

The system is presented in a Customer Care scenario, which provides both the necessary heterogeneous data sources, as well as the context for different user information needs and preferences. It makes full usage of existing organisational structured knowledge and applies this across the user-generated content. The resulting user experience is a vastly improved customer care service, which provides an automated personalised assistance without the need of technical support staff intervention. Existing socio-semantic resources are hence leveraged and combined not only to improve customer satisfaction, but also to save costs for the product provider.

Acknowledgements

This research is supported by the Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (http://www.cngl.ie) at Trinity College Dublin. We would like to acknowledge the contributions of the Localisation Department at Symantec that have provided us with a variety of customer care content, especially Fred Hollowood, Johann Roturier and Jason Rickard.

References

1. Micarelli, A., Gasparetti, F., Sciarrone, F., Gauch, S.: Personalized Search on the World Wide Web. In: The Adaptive Web, LNCS, vol. 4321, pp. 195-230 (2007)

2. Baeza-Yates, R. A., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley Longman Publishing Co., Inc. (1999)

3. Speretta, M., Gauch, S.: Personalized Search Based on User Search Histories. In: Web Intelligence, WI2005, pp. 622-628 (2005)

4. Teevan, J., Dumais, S. T., Horvitz, E.: Personalizing search via automated analysis of interests and activities. In: Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '05 (2005)

5. Koutrika, G., Ioannidis, Y.: A Unified User Profile Framework for Query Disambiguation and Personalization. In: Proceedings of the Workshop on New Technologies for Personalized Information Access, PIA2005, pp. 44-53, Edinburgh, Scotland, UK (2005)

6. Micarelli, A., Sciarrone, F.: Anatomy and empirical evaluation of an adaptive web-based information filtering system. In: User Modeling and User-Adapted Interaction, 14, 2-3, pp. 159-200 (2004)

7. Xu, S., Jin, T., Lau, F. C.: A new visual search interface for web browsing. In: Second ACM International Conference on Web Search and Data Mining, Barcelona, Spain (2009)