Automated Assembly of Custom Narratives from Modular Content using Semantic Representations of Real-world Domains and Audiences

Joshua Wulf, David Jorm, Matthew Casperson, and Lee Newson
{jwulf,djorm,mcaspers,lnewson}@redhat.com
Red Hat Engineering Content Services

Abstract. We present an approach to automatically assembling customized technical documents that cover a specified area of interest, are tailored to the needs of a specific audience, and have a meaningful narrative structure, using semantically annotated modular units of information (topics) and ontologies that describe the structure of the real-world domain of interest. In this paper we explore the nature of narrative, and how an automated document assembler can produce a coherent narrative using semantic representation. We introduce a Semantic Publishing system, named Skynet, that implements these ideas in the context of documenting commercial software products.

Keywords: semantic publication, automated assembly, customized narrative

1 Introduction

Automated Document Assembly is the aggregation by a software agent of smaller units of information into a larger structure to meet the information requirements of a specific audience. We define the larger structure as a document, which may be instantiated as a static linear document on paper (or as a PDF), or as a customized view of hyperlinked text. Narrowing the definition used by Andre et al. [AFQ], we attempt to formally define a document as "a collection of information targeted to an audience interest, constrained to include relevant information and exclude irrelevant information, and grouped and sequenced to match the audience's hierarchy of concern". We distinguish technical documents – documentation that seeks to inform the reader about factual information – from other types of documents, such as prose or fiction, which fall outside our definition of a document (see Wright [Wri]).
The optimum structure for a document is a function of the audience's range of interest, existing knowledge, and hierarchy of concern. To produce an optimum document structure, an automated document assembler must have knowledge of these three aspects of the audience. Additionally, it requires knowledge of the domain within the audience's interest, a collection of modular units of documentation that describe this domain, and knowledge of how these units relate to the domain and to each other.

This allows an automated assembler to produce a customized document describing the range of the domain that matches the audience's range of interest, in a way that matches the audience's hierarchy of concern and level of knowledge. This can be done in the form of bespoke PDF documents or by modifying the dynamic presentation of a hyperlinked web of information. We will look at each of these four areas in turn: semantic representation of the domain; modular units of content; semantic representation of an audience; and automated custom assembly. Before examining these four areas, however, we explain the theory of cognition that informs our implementation.

2 Theory of Cognition

We use a model of communication based on the theory of cognition proposed by van Dijk and Kintsch [DK]. They propose a categorization of communication elements that includes a textual representation (the words used to describe something) and a situation model (an internal mental model). The situation model represents some real-world domain, while the textual representation describes a situation model in language.
The process of technical communication in this model is one of deconstructing the internal mental model possessed by a subject matter expert, marshaling the elements of that mental model into a verbal or written representation, streaming that representation to a receiver, demarshalling those verbal or written representations into mental elements in the mind of the receiver, and integrating those elements into the receiver's internal mental model.

All three of the textual representation, the situation model, and a mapping between the two must be available to the automated processor. We will first examine the situation model, which is made available to the automated processor as a semantic representation of a real-world domain.

3 Semantic Representation of a Real-World Domain

A situation model is an internal predictive mental model that is used to predict how objects in the real world will act and react. The situation model encodes categories, membership, and relationships between elements of the world of experience [Gar] [Joh].

The structure of a document is itself semantic: the spatial and temporal relationships between textual units convey information about the relationships between the real-world elements that the textual units represent. Important information appears before less important information; things that occur or are encountered first are presented first; dependencies are presented before the things that depend on them. These kinds of decisions about the structure of a document are made by a human author using their own internal mental model. In our experience, in cases where a human author is missing or has an incomplete mental model of a domain, they must have recourse to subject matter experts for guidance on where to place information (for an illustrative example, see this discussion).

We define a coherent narrative as a semantic structure composed of comprehensible textual units.
In our system, textual units are authored by human authors, and their assembly into a meaningful (semantic) structure, or coherent narrative, is the role of the automated processor.

We conceptualize a situation model as an n-dimensional hypercube which encodes a multiplicity of relationships along different dimensions of interest. This n-dimensional hypercube enables us to formally (and programmatically) answer the question: "Why is it that certain sentences should be 'close' to each other in an instructional document...?" [Hor]. This Semantic Representation of a Real-World Domain (Domain Model) acts as a situation model for an automated document assembler, which performs the role of the human author/subject matter expert in deciding what information to include or exclude, and how to structure it in response to a given audience.

The Domain Model is implemented as categories and tags, which represent dimensions and points, respectively, in the n-dimensional hypercube of the situation model that the Domain Model simulates.

3.1 Ordered and Unordered Dimensions

Construction of a coherent narrative involves grouping and sequencing operations. Related information is grouped into sections and chapters and sequenced within that container, and the containers themselves are grouped and sequenced. We distinguish between ordered and unordered dimensions in the Domain Model to encode the bases for these grouping and sequencing decisions.

Ordered dimensions are those dimensions whose points have an intrinsic sequential relationship that should be considered when constructing a narrative. Examples of such dimensions include "Lifecycle" (which is tied to dependency and also to time), "Temperature", and "Location" (for example: layers in a software stack).

Unordered dimensions are those dimensions whose points belong to the dimension but do not have an intrinsic sequential relationship. Examples of such dimensions include "End User Demographic" and "Name".
Unordered dimensions cannot be used as the basis for sequencing operations. When information must be sequenced on the basis of a common unordered dimension, the correct convention to use is alphabetical ordering, a structure that semantically communicates: "There is no meaning to this ordering".¹

¹ It is important to note for implementation purposes that alphabetical ordering is completely extrinsic to the semantic dimension, as it will change in the document output depending on the target language.

Table 1. Example Ordered Dimension in a Domain Model

  Category    Tag             Tag Order
  Lifecycle   Prerequisites   0
              Download        1
              Installation    2
              Configuration   3
              Deployment      4
              Shut down       5
              Redeployment    6
              Upgrade         7
              Removal         8

Table 2. Example Unordered Dimension in a Domain Model

  Category           Tag                       Tag Order
  End User Concern   Application Development   null
                     Server Administration     null
                     Migration                 null
                     Troubleshooting           null

Tags belong to categories, and categories that encode ordered dimensions (Category.IsOrdered) make use of the TagOrder property of tags to encode the ordered nature of the dimension.

4 Modular Units of Content

In order to construct a document, an automated assembler requires a semantic representation of a situation model and modular units of content to assemble. The modular units must be sufficiently atomic to be meaningfully mapped to points within the n-dimensional hypercube of the Domain Model. A modular unit may be mapped to multiple points within the Domain Model, but all of the content in the unit must map to the same point(s). Otherwise, inclusion of the content in the output document based on its mapping to the Domain Model will result in the inclusion of content "in the wrong place".

To achieve this level of atomicity we have adopted the Darwin Information Typing Architecture (DITA) [DITA] topic types. The DITA topic types are based in part on the Information Mapping work of Horn [Hor+1].
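The metadata model of Tables 1 and 2 can be sketched in a few lines of code. This is a minimal sketch, not our implementation: the field names follow the Category.IsOrdered and TagOrder properties described above, while the class and method names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class Tag:
    name: str
    # TagOrder is only meaningful for tags in an ordered category.
    tag_order: Optional[int] = None

@dataclass
class Category:
    name: str
    is_ordered: bool                       # Category.IsOrdered
    tags: list = field(default_factory=list)

    def sort_tags(self):
        """Sequence tags by TagOrder if ordered, else alphabetically."""
        if self.is_ordered:
            return sorted(self.tags, key=lambda t: t.tag_order)
        return sorted(self.tags, key=lambda t: t.name)

# Part of the "Lifecycle" dimension of Table 1:
lifecycle = Category("Lifecycle", is_ordered=True, tags=[
    Tag("Installation", 2), Tag("Prerequisites", 0), Tag("Download", 1),
])

# Part of the "End User Concern" dimension of Table 2:
concern = Category("End User Concern", is_ordered=False, tags=[
    Tag("Migration"), Tag("Application Development"),
])
```

Note that the alphabetical fallback is exactly the convention for unordered dimensions described above, and in a production system would need to be locale-aware.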
Documents produced using formal division into Information Mapping units have been shown to be more effective than those produced with an ad hoc information architecture [Hor+2].

The information in a DITA topic is constrained to a single subject and a single information role. This level of atomicity makes topics ideal candidates for mapping to a Domain Model through metadata tagging. We implement the modular units using the DITA categorizations of concept, task, and reference topics, but using the Docbook XML schema, to leverage our existing open-source Docbook publishing toolchain, Publican. It is the information typing aspect of DITA [IMI] that is of principal use and interest to us.¹

¹ Although we also plan to support DITA XML encoded content in the future.

Constraining the content of a textual unit (topic) to a single subject means that it can be unambiguously mapped to points inside the n-dimensional hypercube of the Domain Model. This allows topics to be reliably assembled according to the macrostructure of the document. Constraining them to a single information role means that they can be reliably assembled within that macrostructure, as concepts, references, and tasks play deterministic roles in the textual representation of an area of a situation model (something we will examine in due course).

Some additional information may be required to assemble the units into a coherent narrative at this level. In addition to mapping topics to the Domain Model, topics can be mapped to each other. So a task can declare that a specific concept "is a dependency", or a reference can declare that it "illustrates" a specific task.

4.1 Natural Language Processing and Ontology Tagging

The content of the textual units (topics) is generated by human authors. The metadata tagging of the topics against the Domain Model and against each other is also performed by human authors. We have some rudimentary natural language processing tools that assist in this process. A conceptual vocabulary is generated based on the Title property of the topic. A scanner then examines the textual content of other topics and suggests potential relationships based on the content, which can then be accepted or rejected by a human moderator. This is similar to the approach used by the BBC World Cup semantic publishing system [BBC].

We further examine the role that a topic's information role plays in assembling output when we examine Automated Custom Assembly.

5 Semantic Representation of an Audience

Generally speaking, an audience is a group of people with a shared interest and level of pre-existing knowledge [MS]. Audience is usually an approximation of a range, within which individual readers may completely or partially fit. The economics of document production dictate that a small number of documents be produced to serve a large number of people, and hence the idea of "audience" as a range. With automated assembly and electronic delivery, however, the economics of producing narrative change, and the problems associated with defining an audience as generalized ranges [EL] [Ong] can be mitigated by using very specific definitions.

Different people are interested in different information, and they are interested in the same information in different ways. Consider the following two cases:

1. An organization where the layers of a software stack are horizontally divided. In this case, one group is responsible for the database component, and another group is responsible for the Operating System component.
2. An organization where the layers of a software stack are vertically divided. In this case, one group is responsible for installing both the Operating System and the Database, while another group is responsible for maintaining them.

In these two distinct cases the same information is needed by people in either organization, but it is needed in a different combination in each case. Customized narratives can be generated for each of these use cases.
In the first case, the narrative is generated by creating two documents. The information in the first document is constrained to topics tagged with the "Component: Database" tag. The information in the second document is constrained to topics tagged with the "Component: Operating System" tag.

In the second case, two documents are also created. Both documents are constrained to ("Component: Database" OR "Component: Operating System"). The content of the documents is also grouped at the first level on these tags, resulting in two sections: "Database" and "Operating System". If the "Component" category is implemented as an ordered category based on the software layer, then the Operating System section will precede the Database section; otherwise they will be alphabetically ordered. The first document is further constrained to information tagged "Lifecycle: Installation", and the second document is constrained to information NOT tagged "Lifecycle: Installation".

Audiences can be further defined by linking an audience with a list of concepts that they can be expected to know. These concepts can then be elided from documents that are produced for this audience. Tasks can also be tagged with a tag from an ordered category "Difficulty", and a threshold set for an audience, so that introductory and advanced guides can be produced.

When a formal definition of an audience is available to an automated processor, it can use this definition in conjunction with a semantic representation of a situation model and semantically-annotated modular units of content to assemble a custom narrative relevant to the audience's needs. Because the production cost of this narrative assembly is so low, documents produced for an audience of one become economically feasible.
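The audience constraints above amount to Boolean queries over topic tags. A minimal sketch of such filtering for the second (vertically divided) use case might look as follows; the topic titles and the representation of topics as sets of "Category: Tag" strings are illustrative, not our actual schema.

```python
# Topics represented simply as a title plus a set of "Category: Tag" strings.
topics = [
    ("Installing the OS",     {"Component: Operating System", "Lifecycle: Installation"}),
    ("Installing the DB",     {"Component: Database",         "Lifecycle: Installation"}),
    ("Tuning DB performance", {"Component: Database",         "Lifecycle: Configuration"}),
]

def select(topics, predicate):
    """Constrain a document to the topics matching an audience predicate."""
    return [title for title, tags in topics if predicate(tags)]

components = {"Component: Database", "Component: Operating System"}

# First document: either component, AND "Lifecycle: Installation".
installers = select(
    topics,
    lambda t: (components & t) and "Lifecycle: Installation" in t,
)

# Second document: either component, AND NOT "Lifecycle: Installation".
maintainers = select(
    topics,
    lambda t: (components & t) and "Lifecycle: Installation" not in t,
)
```

Here `installers` selects the two installation topics and `maintainers` the configuration topic, mirroring the two documents described in the text.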
The costs of narrative production move away from human authors (who can hardly be expected to write a different book for each reader), and move to processor cycles and a cognitive cost on readers, which we examine later.

6 Automated Custom Assembly

We use an algorithm to solve the general case of creating a coherent narrative structure from an arbitrary collection of textual units. To automatically generate a semantic structure, we examine all of the topics that have been returned for a query, and assemble them into intermediate units based on topic types and any declared relationships between topics. We then examine the resulting aggregate units to determine whether there exist metadata dimensions in which we can locate all of the units in the query.

1. If there is no metadata category for which all topics at this order of structure have a metadata tag, the topics are sequenced alphabetically.
2. If all topics at this order of structure have a metadata tag from the same sequenced category, that category is a candidate for sequencing.
3. If all topics at this order of structure have a metadata tag from the same non-sequenced category, that category is a candidate for grouping.
4. If one grouping candidate exists and no sequencing candidates exist, group on that category and sequence alphabetically.
5. If one sequencing candidate exists and no grouping candidates exist, group and sequence on that category.
6. If more than one grouping or sequencing candidate exists, follow the semantic rules for hierarchy of concern.

In addition to the interaction of the audience's range of interest and hierarchy of concern with the semantic representation of the situation model, the information type of the textual units (topics) influences the output structure of the document. We use some basic patterns to structure the output, all other semantic considerations being equal, based on topic type. The basic pattern we use is "Concept, Task, Reference".
Relevant or dependent concepts precede a Task, which is followed by additional reference material, including any example that illustrates its use. Figure 1 illustrates the output structure of a group of related topics, based on their topic type. The dotted lines represent concepts that may be elided or collapsed (in an HTML output) depending on the audience's level of knowledge, or the predicted relevance of the concept.

Fig. 1. An assembly pattern for textual content based on topic type

7 Current Status of Our System

Currently our semantic publishing system, known internally as Skynet and under heavy development, is implemented as a JBoss Seam application, with a MySQL database to store the topics, and a topics-to-tags, tags-to-categories database schema to implement extensible metadata. Our processing engine is implemented as a combination of procedural code and rules using JBoss Drools.

The system is implemented as a platform, and has a REST API that allows it to be easily integrated and extended. One of the first extensions that we have developed for it is a content specification processor that allows an arbitrary topic map to be passed to the system and returned as a Docbook book. This allows external semantic processors to generate their own output structures, and request the platform to build them from the content in the repository. This open and extensible design allows us to innovate and extend outside of the core code of the project.

The system is under development as an open source project on Sourceforge, and the current implementation is running on the Red Hat internal network and being used to develop the product documentation for the upcoming release of JBoss Enterprise Application Platform 6.
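The grouping and sequencing rules of Section 6 can be sketched as a single decision function for one level of structure. This is a simplified sketch under stated assumptions: rule 6's hierarchy-of-concern resolution is deferred rather than implemented, and the function and parameter names are illustrative, not those of our Drools rule base.

```python
def assemble(topics, categories):
    """Decide grouping/sequencing for one order of structure.

    topics     -- list of (title, {category: tag}) pairs
    categories -- {name: is_ordered} for each Domain Model category
    """
    # Rules 2 and 3: a category is a candidate only if every topic
    # at this order of structure carries one of its tags.
    shared = [c for c in categories
              if all(c in tags for _, tags in topics)]
    sequencing = [c for c in shared if categories[c]]       # ordered
    grouping   = [c for c in shared if not categories[c]]   # unordered

    if not shared:
        # Rule 1: no common category -- alphabetical ordering.
        return ("alphabetical", sorted(title for title, _ in topics))
    if len(sequencing) == 1 and not grouping:
        # Rule 5: group and sequence on the single ordered candidate.
        return ("sequence", sequencing[0])
    if len(grouping) == 1 and not sequencing:
        # Rule 4: group on the category, sequence alphabetically within.
        return ("group", grouping[0])
    # Rule 6: multiple candidates -- defer to hierarchy of concern.
    return ("hierarchy-of-concern", shared)
```

For example, topics sharing only the ordered "Lifecycle" category are sequenced on it, while topics with no common category fall through to alphabetical ordering.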
The documentation for JBoss Enterprise Application Platform 6 is written as modular content units (topics) and tagged against a semantic representation. The final output documentation is generated by an automated processor that locates the modular content units in an aggregate structure, using the semantic representation to generate the document structure. In this sense it functions in much the same way as the BBC World Cup semantic publishing system [BBC]: the automated processor handles the publishing into the larger structure, while the human author is concerned with writing the content and accurately describing it in terms of the semantic representation.

The document structure in Figure 2, for the JBoss Enterprise Application Platform 6 documentation, is generated by an automated processor using our semantic representation and semantically-annotated textual units. Information is constrained to the area of the hypercube containing topics tagged "JBEAP 6". It is then grouped at the first level on the "Technology Component" dimension, which is an unordered dimension, hence the alphabetical ordering. At the second level of structure it is grouped on "End-User Concern", which is an ordered dimension based on lifecycle. This is shown here as a tree structure, but it could also be instantiated as the table of contents of PDF output.

8 Challenges and Opportunities

At this point in time we are able to generate multiple output structures for various audience definitions. We have the data in place to allow the generation of Active Documents [DGK]; however, our production infrastructure currently serves over 2 TB of documentation to users each month, so introducing dynamic content generation on our public-facing website represents a scalability and security challenge.

Our next plan is to make our dataset available to the public, possibly through a web service end-point, with our semantic representation available as RDF data.
This will allow users to design their own documentation. In this case, users will be able to express their interest, either at a high level or as a specific, complex query, and we will return a PDF file.

A significant challenge is encapsulating the complexity of the system. Users are not used to defining the content and architecture of documents. Making the power of the system available to users without overwhelming them with its complexity is the greatest challenge and opportunity. We are investigating three avenues to work towards this: natural language question answering; progressive customization of view; and ant colony optimization of the semantic representation.

Fig. 2. Output structure created by an automated processor using a semantic representation of a real-world domain and semantically-annotated textual units

8.1 Natural Language Question Answering

Users are not accustomed to designing books, or to articulating the exact nature of their domain of interest in a wide sense. They are, however, used to searching for information relating to a specific query that they have. With a semantic representation and semantically-annotated textual units, we are now in a position to investigate generating customized answers to queries from units of documentation. Rather than attempting to build a complete book, and requiring a user to specify an entire book, we would attempt to assemble the pieces to answer a specific query. Some disambiguation questions may be necessary to derive the exact area of interest and the user's preferred format (hierarchy of concern), and then we can produce a small document to answer their query.

8.2 Progressive Customization of View

Rather than requiring users to define a view of the information, we can present users with a default view of the information (as we do now with the JBoss Enterprise Application Platform 6 documentation). However, over time we can customize that view based on the user's behavior.
We can infer things about the user based on their interaction with the material. When a user clicks on a search result, we know something about the result that they have clicked on. If we detect a preference for one area of the hypercube, we can weight search results to favor that area. We can establish weak assumptions about the user based on this kind of behavior. If we introduce the ability for the user to provide us with qualitative feedback, such as a "Like" or "this is what I was looking for" button, then we can also create strong assumptions about the user's preferences and further weight customizations.

Progressive customization of view is less taxing on users, although it may be more taxing on hardware. Offline static rendering in response to a specific user query pushes the burden more to the user's side, and will be the first approach for the early adopters.

8.3 Ant Colony Optimization of the Domain Model

Ant Colony Optimization [DD] is a meta-heuristic optimization method that simulates the behavior of an ant colony to approximate an optimal solution. It relies on many agents, in this case users, to iteratively explore the solution space and approximate a global optimum.

The Domain Model allows us to formally answer "Why is it that certain sentences should be 'close' to each other in an instructional document...?" [Hor]. However, there may be dimensions of interest to users that are not captured in our Domain Model, or that are incorrectly encoded. If the Domain Model captures the situation model accurately, then rotation of the hypercube should allow two related points in the model to come into proximity with each other. This means that information that is required sequentially by the user will be presented sequentially when the user's axis of interest is used to orient the hypercube.
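Whether some rotation of the hypercube can bring two topics into proximity can be checked directly from their tags. In the minimal sketch below (topic and tag names are illustrative), two topics are reachable by rotation if they share a point on at least one dimension; a frequently co-accessed pair that shares none would then be flagged for Domain Model review.

```python
def share_a_dimension(tags_a, tags_b):
    """True if two topics share a point on any dimension of the
    hypercube, i.e. some rotation can bring them into proximity."""
    common = set(tags_a) & set(tags_b)
    return any(tags_a[c] == tags_b[c] for c in common)

# Two topics that users repeatedly access together (illustrative):
install_db = {"Component": "Database", "Lifecycle": "Installation"}
tune_gc    = {"Component": "JVM",      "Lifecycle": "Configuration"}

if not share_a_dimension(install_db, tune_gc):
    # No rotation brings these together: flag the pair for review,
    # since the Domain Model may be missing a dimension they share.
    print("possible missing dimension")
```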
If we find that users consistently search for and then "like" two topics within a defined temporal period, and that these two topics are not available in proximity in any rotation of the hypercube, then it is an indication that there is something missing from the Domain Model, and this can be examined.

References

AFQ. Andre, J., Furuta, R.K., Quint, V.: "Structured Documents", Cambridge University Press, 1989, pp. 3.
BBC. Rayfield, J.: "BBC World Cup 2010 dynamic semantic publishing", BBC Internet Blog, 2010, http://www.bbc.co.uk/blogs/bbcinternet/2010/07/bbc_world_cup_2010_dynamic_sem.html.
DD. Dorigo, M., Di Caro, G.: "The Ant Colony Optimization Meta-Heuristic", IRIDIA, Université Libre de Bruxelles, 1999.
DGK. David, C., Ginev, D., Kohlhase, M., Matican, B., Mirea, S.: "A Framework for Semantic Publishing of Modular Content Objects", In: Proceedings of the 1st Workshop on Semantic Publishing 2011, 2011, http://ceur-ws.org/Vol-721/paper-03.pdf.
DITA. OASIS: "Darwin Information Typing Architecture (DITA) Version 1.2 Specification", OASIS Standard, 2010, http://docs.oasis-open.org/dita/v1.2/spec/DITA1.2-spec.html.
DK. van Dijk, T.A., Kintsch, W.: "Strategies of Discourse Comprehension", New York: Academic Press, 1983, pp. 305.
EL. Ede, L., Lunsford, A.: "Audience Addressed/Audience Invoked: The Role of Audience in Composition Theory and Pedagogy", In: College Composition and Communication, Vol. 35, No. 2, National Council of Teachers of English, 1984.
Gar. Garnham, A.: "Mental Models As Representations of Discourse and Text", Ellis Horwood Ltd, 1988.
Hor. Horn, R.E.: "Structured Writing as a Paradigm", N.J.: Educational Technology Publications, 1998.
Hor+1. Horn, R.E.: "Information Mapping", In: Training in Business and Industry, Vol. 11, No. 3, March 1974.
Hor+2. Horn, R.E.: "How High Can It Fly? Examining the Evidence on Information Mapping's Method of High Performance Communication", The Lexington Institute, 1992.
IMI. Information Mapping Institute: "Information Mapping and DITA: Two Worlds, One Solution", whitepaper, 2011, http://www.informationmapping.com/us/resources/whitepapers/261-information-mapping-and-dita.
Joh. Johnson-Laird, P.N.: "Mental Models: Towards a Cognitive Science of Language, Inference, and Consciousness", Cambridge: Cambridge University Press, 1983.
MS. Mathes, J.C., Stevenson, D.W.: "Designing Technical Reports: Writing for Audiences in Organizations", The Bobbs-Merrill Company, Inc., 1976.
Ong. Ong, W.J.: "The Writer's Audience is Always a Fiction", In: PMLA, Vol. 90, No. 1, Modern Language Association, 1975.
Wri. Wright, P.: "Writing Technical Information", In: Review of Research in Education, Vol. 14, 1987, pp. 327-385, American Educational Research Association.