PATExpert: Semantic Processing of Patent Documentation

Leo Wanner, Sören Brügmann, Barrou Diallo, Mark Giereth, Yiannis Kompatsiaris, Emanuele Pianta, Gautam Rao, Pia Schoester, and Vasiliki Zervaki

L. Wanner is with ICREA and Foundation Barcelona Media, email: leo.wanner@upf.edu; S. Brügmann is with Industrie Software J. Brügmann, email: soeren.bruegmann@isjb.de; B. Diallo is with the European Patent Office; M. Giereth is with the University of Stuttgart, email: mark.giereth@vis.uni-stuttgart.de; Y. Kompatsiaris and V. Zervaki are with the Aristotle University; E. Pianta is with the Istituto Trentino di Cultura, email: pianta@itc.it; G. Rao is with IALE, email: rao@iale.es; and P. Schoester is with the Fraunhofer Gesellschaft, email: pia.schoester@pst.fhg.de

Abstract: PATExpert is a recently started “Specific Targeted Research Project” funded by the EC in FP 6, IST priority. PATExpert’s goal is to change the paradigm currently followed for patent processing from textual to semantic. We are about to develop a semantic multimedia content representation based on Semantic Web technologies for selected technology areas and to investigate some central topics from the semantic representation angle: patent retrieval and classification, content extraction, generation of multilingual user-comprehensible patent information, visualization of and navigation in patent content spaces, and patent valuing and technology area assessment.

Index Terms: semantic representation, (multimedia) ontology, OWL-DL, SUMO, PULO, patent processing techniques.

I. INTRODUCTION

Patents belong to the few types of public information that have a big impact on the European economy, and whose proper monitoring, retrieval, representation, interpretation, and assessment so clearly depend on access to their content, and, thus, on advances in semantics-based techniques. However, research and development in the area of patent processing still focuses on selected traditional tasks such as text retrieval, classification, and shallow linguistic analysis. Recent initiatives that target automatic access to the content of patents attempt to cover all knowledge areas. This forces them to rely on term frequency, term co-occurrence and grammatical term categories. That is, despite the use of a Semantic Web-based formalism, the resulting representation is not a real content representation. As a consequence, tasks that ultimately require knowledge-based multimedia techniques (content-oriented search, assessment, abstracting, etc.) are still, to a major extent, carried out manually.

The overall goal of the PATExpert project, which started on 01.02.2006 and is funded by the EC (FP6, IST-028116, http://www.patexpert.org), is to change the paradigm currently followed for patent processing from textual (viewing patents as text blocks enriched by “canned” picture material, sequences of morpho-syntactic tokens, or collections of partial syntactic structures) to semantic (viewing patents as multimedia knowledge objects) processing. PATExpert is about to develop a multimedia content representation formalism based on Semantic Web technologies for selected technology areas and to investigate some central topics from the semantic representation angle: patent retrieval and classification, content extraction, generation of multilingual user-comprehensible patent information, visualization of and navigation in patent content spaces, and patent valuing and technology area assessment, taking into account the information needs of all user types as defined in a user typology. PATExpert’s technological goal is to develop a showcase that demonstrates the viability of PATExpert’s approach to content representation for real applications.

II. SEMANTIC REPRESENTATION OF PATENTS

The semantic representation of patent documentation must cover, on the one hand, propositional, multimedia and metadata information and, on the other hand, linguistic knowledge, first of all the characteristic text structures encountered in patent documentation and the lexical information. We developed an initial working schema of the knowledge representation (KR) in PATExpert, with OWL-DL as the KR language.

As a rule, the content in patent documentation makes reference to knowledge at three levels of abstraction: (a) common sense knowledge, (b) patent-specific knowledge and terminology, and (c) domain-specific knowledge that refers to technology area details. PATExpert focuses on the ontologies of two technology areas: optical recording media and mechanical engineering tools.

Traditionally, common sense knowledge representation is dealt with by core ontologies and domain-specific knowledge representation by domain ontologies. As core ontology, we use SUMO [2]. Following a recent trend in Semantic Web representation technologies (see, e.g., the Midlevel Ontology, MILO, by Teknowledge Knowledge Systems Group [3]), we capture patent-specific knowledge and terminology by a midlevel ontology, called Patent Upper Level Ontology, PULO. PULO aims to bridge the gap between the high level concept descriptions in SUMO and the detailed domain ontologies.

Multiple media (in particular images such as photographs, diagrams, flow charts, drawings, etc.) come into play in the representation of patent documentation and must thus be modelled by a multimedia ontology. For linguistic knowledge representation, we foresee a patent document structure ontology and lexical (word level) ontologies. As lexical ontologies, we use WordNets [1]. Patent-related meta information (such as the patent holder, the inventor and the inventor's current affiliation, etc.) is captured in a separate metadata ontology. Figure 1 shows the initial working schema of the knowledge representation in PATExpert. It will be revised as needed when the work on the tasks that make use of the semantic representation progresses.

[Figure 1: diagram with the modules Core Ontology: Holistic Representation - SUMO; Patent Upper Level Ontology - PULO; Multimedia Ontologies; Domain Ontologies; Linguistic Ontologies; Patent Document Structure Ontologies; Metadata Ontologies]

Fig. 1. Knowledge representation modules in PATExpert

III. PROCESSING PATENT DOCUMENTS

The state of the art in a number of central patent processing areas suffers from the lack of an adequate representation of the content and content structure of patent documentation. With the KR schema presented above at hand, PATExpert addresses these areas. The most central of them are listed below. In the poster presentation, more details will be given on each topic.

A. Information Extraction from Patents

Extraction of content information (e.g., composition and function of the invention) and meta information (e.g., the productivity of an inventor) from patent documentation is one of the burning issues in patent processing. In PATExpert, this topic is approached from two angles: as a stand-alone task, and as a way to populate the knowledge base. Strategies using partial syntactic and semantic analysis, information extraction techniques, and inference mechanisms are being explored.

B. Patent Retrieval and Classification

The (multi)media content representation of patent documentation will allow us to develop patent retrieval strategies that go considerably beyond state-of-the-art patent retrieval techniques. In particular, it will facilitate semantic retrieval, i.e., search for patents that describe inventions with specific content features, image-based retrieval, and document similarity-based retrieval. It will also allow for classification and clustering of patent documents along semantic criteria.

C. Production of Multilingual Patent Information

The language style in patent documentation is very complex and repetitive. It is thus hard for human readers to comprehend. Our goal is to provide readers with a comprehensible variant of text passages they select, in the language of their choice. Two topics are addressed: (a) paraphrasing of patent passages and (b) generation of multilingual gists of given passages. For both, we are developing shallow techniques as well as deep techniques that draw on a content representation of the text passages in question (and thus implement text generation proper).

D. Visualization and Navigation in Patent Knowledge Spaces

Given the complexity of patent knowledge spaces, it is crucial to have techniques for visualizing patent content material that is retrieved from the patent KB or selected by the user while browsing the patent KB. We develop techniques that make the complex content structures transparent (e.g., the IS-A, PART-OF, CAUSE, OPERATE, etc. relations between objects, semantic similarity / entailment links, etc.) and help navigate through such structures. The navigation techniques combine browsing mechanisms with advanced strategies that guide navigation, taking the user’s focus, the context and the discourse relations between knowledge objects into account.

E. Patent Valuing and Technology Area Assessment

Currently, high quality valuing of patents and patent applications and the assessment of technology areas with respect to their potential to give rise to patent applications are done mainly manually, which is very costly and time consuming. We are developing techniques that use statistical and semantic information from patents (and patent applications) as well as user-based data on market aspects to forecast the value of a patent (or patent application).

IV. THE STATE OF AFFAIRS

After having developed the initial schema of the knowledge representation, we work on the topics sketched in Section III. The first prototypical implementations of the techniques are planned to be operational by the end of May 2007, some of them (e.g., the gist generation) already by the end of January 2007.

REFERENCES

[1] C. Fellbaum (ed.), WordNet. An Electronic Lexical Database. Cambridge, MA: The MIT Press, 1998.
[2] I. Niles and A. Pease, “Towards a Standard Upper Ontology,” in Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), C. Welty and B. Smith, Eds., Ogunquit, ME, 2001.
[3] I. Niles and A. Terry, “The MILO: A General-Purpose, Mid-Level Ontology,” in Proceedings of the 2004 International Conference on Information and Knowledge Engineering, Las Vegas, NV, 2004.