On Creating Metadata With Authoring Tools

Sheremetyeva, Alexei Pervuchin, Vladislav Trotsenko and Alexei Tkachev
LanA Consulting, Madvigs Alle, 9, 2 tv, DK-1829 Copenhagen, Denmark

Abstract. The paper addresses issues of knowledge markup with authoring tools in which users construct representations of their knowledge. It attempts to contribute to the problem by suggesting a methodology for authoring new documents so that they contain markup directly. The methodology is illustrated with a case study: knowledge markup with the AutoPat tool. AutoPat is an application for authoring technical documents, such as patent claims, that guides users towards expert ways of thinking. It consists of two stages: a semantic authoring module for interactive elicitation of technical knowledge about an invention, and a syntactic authoring module that automatically generates a legal (syntactically complex) claim text. The semantic authoring module is implemented as a user-friendly interface that can be used as a stand-alone markup tool. The markup includes morpho-syntactic information augmented by semantic data, such as concept (semantic class) and predicate-argument (case-role) structure. A particular focus is set on an easy-to-use environment for markup automation.

1 INTRODUCTION

Knowledge capture is a major focus of research on content-oriented intelligent applications. A wide range of activities can be found in the area of developing annotated corpora resources, markup languages and tools targeted for experiments in NLP; see, e.g., [1]-[4]. Development of such resources is usually done, or at least supervised, by highly qualified specialists, requires a lot of time and effort, and is thus very expensive. It would be a clear advantage to have a tool based on a methodology that provides a much cheaper way of acquiring metadata. The methodology should be universal in the sense that it can be applied to any language and requires no skilled labour of professionals.

Constructing general-purpose markup guidelines and tools, such as XML, SGML, etc., that can be shared by the community is a popular topic of interest nowadays. It is also recognized that, although many increasingly convivial, more widely distributed and hardware-independent software applications are currently available for this purpose, it is difficult to find a system that exactly matches the end-user requirements [5]. It seems highly problematic (at least nowadays) to develop metadata suitable for all kinds of applications. If, however, the concept of genericity is applied to a family of applications, i.e., applications sharing tasks and domains, one can probably suggest particular approaches to solve the problem. In this paper we attempt just that.

We suggest knowledge markup with authoring tools in which users construct representations of their knowledge. The approach is illustrated by knowledge markup with the AutoPat tool, an application for authoring technical documents, such as patent claims. It consists of two stages: a semantic authoring module for interactive elicitation of technical knowledge about an invention, and a syntactic authoring module that automatically generates a legal (syntactically complex) claim text. The knowledge elicited from the user is the knowledge about the invention to be patented. The tool further annotates the user's raw input with syntactico-semantic information to create an internal representation of the claim content that serves as an input to the generator. We thus get semantically annotated knowledge as a by-product of an authoring session of the main AutoPat user (a patent officer or inventor). The tool can also be used directly for knowledge markup.

The specificity of our approach is that the semantic annotation of knowledge contained in a claim text is created before such text comes into being rather than after, as in many other applications. In fact, our annotation does not necessarily require a text as a starting point for markup. In what follows we first discuss linguistic aspects of semantic annotation in specific IT applications. We then present the context of the AutoPat application and give an overview of its procedures and of the components relevant for knowledge markup: an analyzer and a user interface.

2 MARKUP LANGUAGE

We use the AutoPat internal knowledge representation language to annotate the technical knowledge of the patent domain. The problem of semantic knowledge representation (and annotation) is directly connected with decisions on the depth of semantic descriptions. Deeper descriptions promise better results but require a greater acquisition effort. Practical considerations make it reasonable to follow the demand-side approach to knowledge acquisition, which places a premium on what must be done (vs. what can be done) to put together a useful working application [6]. We based our approach on the ideas of lexicalism, which shifts all linguistic descriptions to the lexicon [7]. This move, in turn, has led to an increased interest in argument structure, the representation and characterization of argument-taking properties of predicates [8], [9], which has proved useful for encoding a wide range of information. In our system the technical knowledge about an invention described by a claim text (see Figure 2) is represented as a set of filled predicate templates in the form:

    text      ::= {template}{template}*
    template  ::= {predicate-class predicate ((case-role)(case-role)*)}
    case-role ::= (rank status value)
    value     ::= {word tag}*

where predicate-class is a label of an ontological concept; predicate is a string corresponding to one of the predicates from the system lexicon; case-roles are ranked according to the frequency of their co-occurrence with each predicate in the training corpus; status is the semantic status of a case-role, such as agent, theme, place, instrument, etc.; and value is a string which fills a case-role. Tag is a label which conveys morphological information (such as POS, number and inflection type) and semantic information: a concept defining word membership in a certain semantic class (such as object, process, substance, etc.). For example, the tag Nf means that a word is a singular noun (N), denotes a process (f), and does not end in -ing. This tag will be assigned, for example, to such words as activation or alignment. At present we use 23 tags that are combinations of 1 to 4 features out of a set of 19 semantic, morphological and syntactic features for 14 parts of speech. For example, the feature structure of noun tags is as follows:

    Tag [ POS [ Noun [ object    [plural, singular]
                       process   [-ing, other [plural, singular]]
                       substance [plural, singular]
                       other     [plural, singular] ]]]

The number of semantic classes (concepts) and case-roles is domain based and rather small, but it can be easily augmented. In general, our annotation can be classified, following the definition of the Network Working Group (www.landfield.com), as "out-of-band" annotation, which conveys the textual content by metadata or hyperstructure of some sort.

3 THE AUTOPAT TOOL

3.1 Overview

Claims are parts of patents that contain crucial information about the invention and are the subject of legal protection. They must be formulated according to a set of precise rules and so as to make patent infringement difficult. Composing a patent claim that meets all legal requirements on its structure is a complex task, even for experts (see Figure 2 for a sample claim text). AutoPat is designed to reduce composition effort, time and costs. It can also be used for training patent attorneys.

AutoPat is an NLP application (see footnote 1) that consists of an interactive semantic authoring module for technical knowledge elicitation with a sophisticated but easy-to-use interface at the user end, an analysis module and a fully automatic text generation module.

Footnote 1. AutoPat is a product developed from the prototype described in [11]. It is a 32-bit Windows application developed to run in a number of operating environments: Windows 95/98/2000/NT. It covers claims about apparatuses.

Figure 1. The overall architecture of AutoPat

The architecture of AutoPat with its integrated development environment is given in Figure 1. Superficially, the architecture of our system conforms to the standard that has emerged in natural language generation, in that it includes the stages of content specification, text planning and surface generation (realization), as expressed, for instance, in [10]. However, there are some important differences. Unlike typical content specification modules, our system relies on an authoring workstation environment equipped with a scenario for joint human-computer content specification. The latter starts with the user supplying natural language phrases to the system in the process of a computer interview and (after passing through the analyzer) results in the production of a "draft" claim. This draft is a list of proposition-level structures ("templates") specifying the proposition head with its semantic class membership, a set of its case-roles, and case-role values filled by tagged word strings. The tags mark both POS and semantic class membership of the words (see Section 2, MARKUP LANGUAGE). The draft is then submitted to the automatic generator, which produces a claim text in a legally accepted format (see Figure 2).

The draft of a nascent claim is in fact annotated knowledge about an invention or, which is the same, a syntactically and semantically annotated claim text, the legal format of which is produced at a later stage. The AutoPat knowledge base is corpus-based and draws heavily on the sublanguage. It contains AutoPat inherent knowledge and authoring memory (cf. "translation memory"). The inherent knowledge includes two lexicons. The first is a shallow lexicon of lexical units simply listed with their class membership, that is, a morpho-semantic classification of words and phrases; this lexicon is used for content support in claim composition and for morphological analysis of the input. The second is a deep (information-rich) lexicon of predicates (heads of predicative phrases describing essential features of an invention). This lexicon is the main part of the AutoPat static knowledge; it covers lexical, semantic and syntactic knowledge and is the basis of knowledge representation. It is used both to provide content support for technical knowledge elicitation and for generation heuristics. The user can customize these lexicons. Authoring memory is created by the user in the course of authoring sessions.

3.2 Elicitation/Annotation Module

Analyzer. In AutoPat knowledge annotation takes place in the course of a domain-dependent, mixed-initiative knowledge elicitation interview and is, in fact, semantico-syntactic analysis of the user's input. Our analyzer differs from many other application analyzers in that the morphological analysis module switches on after rather than before syntactic analysis. The early application of syntactic analysis allows the morphological analyzer to avoid overgeneration and produce unambiguous results.

The knowledge elicitation scenario consists of the system requesting the user, in English, to supply information about the invention by offering the user a choice for lexical selection of what amounts to heads (predicates) of phrases in the nascent text. A significant amount of knowledge about predicate subcategorization and argument properties is available to the system from the lexicon. The user is presented with a predicate template (see Figure 3) based on knowledge about the case-roles (semantic arguments) of the selected dictionary item and fills the appropriate slots: "What", "Where", "How", and so forth. Filling case-roles in a predicate template during the knowledge elicitation procedure is, in fact, an interactive semantico-syntactic analysis and knowledge annotation. The system marks the boundaries of the fillers (syntax) and their case-role status (semantics), to be used later for morphological disambiguation.

Morphological analysis is applied to the case-role fillers and consists of tagging proper, which assigns all possible tags to each word. To help resolve POS ambiguities, all lexical units in our analyzer are put into classes specified by tags marking not only POS but also such morphological features as number and inflection type. Some of the tags include semantic features (see Section 2). After the set of tags is assigned to a word, a disambiguation procedure switches on. Discarding faulty readings of ambiguously tagged words is done in two passes. During the first pass the analyzer applies disambiguation rules of a more general character, relying only on the knowledge in the morphological lexicon and a 5-word window context with the tag in question in the middle. If ambiguities are still pending after this first step, the second step of disambiguation is applied. It uses knowledge about case-role boundaries and their semantic status delivered by the interactive module of the analyzer. The output of the knowledge elicitation procedure is an out-of-band annotated claim (shallow content representation) that, when stored in a knowledge base, provides a resource where syntactic information is augmented by semantic knowledge.

3.3 Interface

AutoPat features a sophisticated but extremely user-friendly interface that can be adapted to different user profiles (beginners and experts) and has two modes: a) Wizard, which guides the user through an ordered step-by-step procedure of describing the invention, and b) Professional, which allows for more speed and flexibility when authoring a claim: the user may freely navigate among the stages of claim composition, authoring them in any order.

Figure 2. A screen shot of the AutoPat user interface at the final stage of knowledge elicitation. It displays a fragment of annotated knowledge (top right) that is also represented in two unannotated texts: a) a set of simple sentences (middle) corresponding to quantums of technical knowledge elicited from the user and generated for user content control, and b) a single sentence (bottom right) meeting the legal requirements on the claim format. The left part of the screen shows the element tree of the invention.

As mentioned above, the knowledge elicitation scenario consists of the system requesting the user, in English, to supply information about the invention. Using common graphical interface tools (mouse support, dialogue boxes, menus, templates and slide bars), the interface draws the user through a step-by-step procedure of describing essential features of the invention. If the content appears incorrect, the user can undo the latest quantum of acquisition and do it again correctly. The user can also easily edit the knowledge elicited at any of the earlier stages using the second, Professional mode of the interface.

The interface has two main components: the background window, where the results of the elicitation procedure stay displayed through the whole session, and a set of pop-up windows corresponding to elicitation steps (see Figures 2 and 3). The two modes of the interface share the background window, while the sets of pop-up windows are mainly different. All pop-up windows in both modes can be moved freely around the screen to allow the user to see any part of the background window at any time.

Background Window (Figures 2 and 3). The left pane of this window is headed "Your invention comprises" and displays a graphical representation of the hierarchy of all main elements and sub-elements after the user supplies the knowledge about them to the system. The names of the elements at its nodes can be transferred to any of the pop-up windows by simply clicking on them. The right pane is headed "Essential features of your invention". It displays the title of the invention and every essential feature of the invention in the form of a simple sentence (see Figure 2) that is generated every time the user supplies a quantum of technical knowledge. The results of the elicitation procedure are visualized as simple sentences only to make it possible and convenient for the user to control the results of the authoring/annotation session. The simple sentences correspond to statements in the system's internal knowledge representation language that are created following the knowledge elicitation procedures (see Figure 2). At the stage of eliciting knowledge about relations between invention elements, a new section headed "Your terminology" appears at the bottom of the left pane. From now on all phrases used in relation descriptions stay displayed and "clickable" there for further reuse.

A brief description of the interface windows and functionalities is given below.

Title. Helps the user to select the most appropriate title for the invention. This window contains a title template. The slots of this template contain menus of words and phrases for optional inclusion in the title.

Main Elements and Sub-Elements. Prompt the user to describe the element hierarchy of the prototype of the invention. These windows display a template of menus similar to that in the Title window.

Element with Novel Characteristics. Makes the user specify the element whose novel properties (which, according to Patent Law, can only be its shape or material) must be included in the claim.

Shape/Material. Prompts the user to describe novel shapes or materials of the elements specified in the previous window ([...] background window).

Relations. Within this procedure the user selects two or more objects in the element tree and then specifies the relation between them. The initial setup in this window involves two menus, one listing names of relation types (semantic classes) and another listing words (predicates) that can describe these relations. One can start by first selecting a relation type; after a semantic class is selected, the second menu displays the predicates which belong to this class for further selection. By checking a corresponding radio button it is possible to start directly with selecting a predicate among all the predicates included in the AutoPat knowledge base and listed in the predicate menu. If the selected predicate belongs to more than one semantic class, these classes appear in the semantic class menu and the user is asked to select one of them to specify the meaning of the predicate. Selecting a predicate constitutes lexical selection, whereupon the system determines the roles played by the highlighted elements.

Relation Specification. Presents the user with a predicate (sentence) template based on knowledge about the case-roles (semantic arguments) of the semantic class underlying the selected dictionary item. The user fills the appropriate slots: "What", "Where", "How", and so forth (Figure 3). To make this easier, apart from clickable nodes in the element tree and phrases in the "Your terminology" section, every template slot has a pop-up menu of auxiliary phrases from the underlying predicate dictionary entry.

Co-reference. Highlights coreference candidates and asks the user to mark any elements that are coreferential among them. The coreference candidates are found by the morphosyntactic analyzer and are noun lexemes regardless of their grammatical form.

Main Claim Format. Presents a "checkable" menu of all generated sentences-features. The user can either check the novel features of the invention, to obtain a final claim text containing generic and difference parts with the "characterized in that" expression between them (as in Figure 2), which is a must according to the European Patent Office, or skip this stage. In the latter case the final claim text will be generated without generic and difference parts, in the format accepted by the US Patent Office. The underlying knowledge will be accordingly marked as referring to novel or prototype features of the invention. This is relevant if the annotated knowledge base is searched in a novelty examination of a new patent.

Main Claim Text. Presents the output of the AutoPat generator: the claim text in legally acceptable format, as shown in Figure 2. If necessary, the user may edit the text right in this interface window. This window is accessible through the "GET TEXT" button.

Dependent Claims and Dependent Claim Text. The former appears only if called by a user who wants to compose a dependent claim; it elicits information on which of the other claims the current one depends. The latter displays the generated text of the dependent claim as it should appear in a patent document.

All user-computer communication is done in natural language. The interface provides content, composition and terminology consistency maintenance support through choices of standing and pull-down menus. These menus supply access to words and phrases required in a claim. Though the user is encouraged to use the AutoPat controlled language given in the menus, s/he always has the choice to type in the active text areas of the interface windows. If a word is in a menu, it will be automatically completed right after the first characters are typed. The interface looks for spelling errors and, more importantly, provides for lexicon customization in a way that does not require any linguistic skills. If a word cannot be found in the knowledge of the system, the user will be asked to add it through an easy-to-use pop-up entry box. The interface automates tedious tasks such as typing and propagating changes through the document and, more importantly, it has effective means to control the knowledge supplied to the system. The user can check the content elicited so far in an output window where the immediate results of each quantum of acquisition are displayed in the form of simple sentences (see Figure 2).

Figure 3. A screen shot of the user interface at the stage of describing relations between the elements of an invention. The window on the right displays filling case-roles in a predicate template as a result of choices made during the "Relation Type" step of the knowledge elicitation scenario shown on the left.

4 CONCLUSIONS

We have suggested knowledge markup with authoring tools in which users construct representations of their knowledge, and illustrated our approach by knowledge markup with the AutoPat tool, an application for authoring patent claims. Annotated knowledge is created as a by-product of an authoring session of the main AutoPat user (a patent officer or inventor). The tool can also be used directly for knowledge markup. In general, our annotation can be classified as "out-of-band" annotation, which conveys the textual content by metadata mainly stored in the predicate lexicon of the system. Annotated knowledge is stored as a formal shallow content representation. Annotation marks both patent domain technical knowledge and linguistic data about the patent sublanguage on the morpho-syntactic and semantic levels. The annotated knowledge can be stored in the domain knowledge base without its unannotated text form; the latter can always be generated by AutoPat. A knowledge base of such annotations is beneficial for other applications of the AutoPat family, such as domain-tuned machine translation, information retrieval, summarization, etc. It can also be used for constructing linguistic metadata. We have also described an interactive procedure which allows for simultaneous elicitation of technical knowledge from the user and its annotation by the system. Both the methodology of annotation and the components of AutoPat relevant to annotation (analyzer and user interface) can be ported to other domains, languages and applications.

REFERENCES

[1] T. McEnery, A. Wilson, F. Sanchez-Leon, A. Nieto-Serrano, 'Multilingual resources in European Languages, Contributions of the CRATER Project', Literary and Linguistic Computing 12/4 (1997).
[2] T. Erjavec, N. Ide, 'The MULTEXT-East Corpus', Proceedings of the MT Summit VII, Singapore (1999).
[3] C.M. Sperberg-McQueen, L. Burnard, 'Guidelines for Electronic Text Encoding and Interchange', Chicago and Oxford (1994).
[4] P. Anick, J. Pustejovsky, 'An application of lexical semantics to knowledge acquisition of corpora', COLING-90, Proceedings of the 13th International Conference on Computational Linguistics, Helsinki, 2 (1990).
[5] P. Degoulet, F.C. Jean, U. Engelmann, H.P. Meinzer, R. Baud, B. Sadbald, O. Wigertz, R. Le Meur, C.A. Jagermann, 'The component-based architecture of the HELIOS medical software' (1994).
[6] S. Nirenburg, V. Raskin, 'Ten Choices for Lexical Semantics', New Mexico State University, Computing Research Laboratory Technical Report MCCS 96-304 (1996).
[7] B. Y. Ooi, 'Computer Corpus Lexicography', Edinburgh University Press (1996).
[8] Ch.J. Fillmore, 'Subjects, speakers and roles', Synthese 21/3/4 (1970).
[9] B. Levin, 'English Verb Classes and Alternations', University of Chicago Press, Chicago (1993).
[10] E.B. Reiter, 'Has a consensus natural language generation architecture appeared and is it psycholinguistically plausible?', Proceedings of the 7th International Workshop on Natural Language Generation (1994).
[11] S. Sheremetyeva, S. Nirenburg, 'Interactive Knowledge Elicitation in a Patent Expert's Workstation', IEEE Computer, Vol. 7 (1996).