Information retrieval in Current Research Information Systems

Andrei S. Lopatenko
Vienna University of Technology
Gusshausstrasse 28 / E015
A-1040 Vienna, Austria
andrei@derpi.tuwien.ac.at

ABSTRACT
In this paper we describe the functional requirements for research information systems and the problems that arise in developing such a system. We show which of these problems can be solved by using knowledge markup technologies. We present a DAML + OIL ontology for research information systems, and we analyze and compare previously developed ontologies for research. We describe an architecture based on knowledge markup for collecting research data and providing access to it, and we show how RDF query facilities can be used to retrieve information about research data.

Keywords
Current Research Information System, Ontology, Information Retrieval, DAML, RDF, Knowledge Markup, scientific publishing

INTRODUCTION
Information about research results, projects, publications, organizations, researchers and so on that is published on the web plays an increasingly pervasive role in modern research. Because modern research depends more and more on results already achieved, researchers need to be able to retrieve research information more efficiently.

Information overload, caused by the exponential growth in the amount of information, makes it difficult for researchers to find relevant information. To address these problems a number of Current Research Information Systems (CRIS) are being developed.

In most cases, however, such systems do not fulfil their task of providing complete and up-to-date information with a minimum of information noise. This is one reason why researchers are reluctant to publish their results through information systems. Publishing is usually limited to the researcher's or the project's web pages.

To provide up-to-date and complete information to interested persons, information from research web pages should therefore also be included in information retrieval operations.

The demand of researchers or policy-makers for research information is usually not limited to information from one single system. Research information in any area of science or technology is scattered among a number of heterogeneous information systems. There is a strong need to gather this information or to point researchers to the systems where it can be found. It is also very important to know whether the gathered research information is up to date and complete.

We are developing the AURIS-MM information system (Austrian Research Information System - MultiMedia enhanced) to provide research information to interested consumers in a more attractive way. The system builds on the existing AURIS (Austrian Research Information System) and FoDok-Online (Research Documentation of Vienna University of Technology).

Our experience and the newest web technologies have shown us that centralized database systems are very efficient but are not the best solution for providing access to research data, because the data is widely distributed over the web. The new version of AURIS-MM is based on Semantic Web technologies:

• RDF - Resource Description Framework, www.w3.org/rdf
• RDFS - Resource Description Framework Schema, www.w3.org/rdf
• DAML + OIL (DARPA Agent Markup Language + Ontology Inference Layer), www.daml.org

ONTOLOGY DEVELOPMENT FOR SCIENCE
Some efforts have already been made to give researchers, industry and policy-makers efficient access to research data. These efforts cover particular sectors of science, or are limited to one organization (university research information systems) or to geographical boundaries (national networks, ERGO [ERGO] - European Research Gateways Online).
The development and use of such systems has shown that it is very hard to collect complete and up-to-date data about research in a sector, or in an organization such as a university, in a central system, because the providers must make a huge effort to copy or key in the data periodically.

Because a huge amount of data is already provided on the web pages of projects, researchers and universities, it is hard to get researchers to provide their data once more to a centralized system.

Full text search engines like Google (http://www.google.com) index, among others, pages with research information. But they cannot limit a search to trusted data, understand the context of a page, or provide search based on the meaning of the data.

One possible way to collect data about research is page annotation. Knowledge can be annotated on the page in such a way that automatic tools can collect and understand it [BL-2001, Hend-2001, Erd-2001].

Ontologies make it possible for software agents to understand knowledge that is marked up [Staab-2001, SWA]. The benefits of ontologies and the Semantic Web for scientific publishing are described in [Lee-2001].

Some effort has already been made to develop markup for scientific data.

SHOE [Hefl-99, SHOE] is a small extension to HTML which allows some knowledge about web page content to be annotated. SHOE is a very simple language for declaring ontologies and defining classifications, relationships, inference rules, categories, etc. SHOE was developed in the Department of Computer Science, University of Maryland. The SHOE specification, tools, SHOE ontologies in plain text and DAML, and examples are accessible at the SHOE home page.

Several ontologies for university and research data were developed for SHOE, among them the University ontology and the Computer Science Department ontology (http://www.cs.umd.edu/projects/plus/SHOE/onts/index.html).

OIL (Ontology Inference Layer) [OIL, Fens-2000] "is a proposal for a web-based representation and inference layer for ontologies, which combines the widely used modeling primitives from frame-based languages with the formal semantics and reasoning services provided by description logics. It is compatible with RDF Schema (RDFS), and includes a precise semantics for describing term meanings (and thus also for describing implied information)." OIL was sponsored by the European Community via the IST projects Ibrow and On-To-Knowledge.

The ontologies developed in OIL for research data are SWRC (Semantic Web Research Community Ontology) (http://ontobroker.semanticweb.org/ontologies/swrc-onto-2000-09-10.oil) and KA2 (Ontology of the Knowledge Acquisition community).

DAML (DARPA Agent Markup Language) [DAML] is an ontology markup language developed as an extension to RDF and RDFS. DAML makes it possible to specify ontologies and to mark up pages for automatic knowledge extraction. The latest version of DAML is named DAML + OIL. DAML specifications, examples, tools and ontologies are published at the DAML home page.

Several ontologies for research information have been developed in DAML. Among them are the DAML version of the SHOE University ontology (http://www.cs.umd.edu/projects/plus/DAML/onts/univ1.0.daml), the SWRC (Semantic Web Research Community) ontology (http://www.semanticweb.org/ontologies/swrc-onto-2000-09-10.daml) and the homework assignment ontology (http://www.ksl.stanford.edu/projects/DAML/ksl-daml-desc.daml).

A more complete list of ontologies for research data, as well as of metadata standards, thesauri and system architectures, can be found at the European Current Research Information Systems Platform home page (http://www.eurocris.org) and in Andrei Lopatenko's Resource Guide to Metadata for Science, Research and Technology (http://derpi.tuwien.ac.at/~andrei/Metadata_Science.htm).

ONTOLOGY
The main goal of our ontology development was to produce an ontology that helps users of research information to retrieve relevant information.

The primary use cases of information retrieval for CRIS are [Jeff-98, CERIF-2000, Lind-2000, Aks-2000]:

• Retrieving information about research results by researchers or students in order to reuse the results; the estimation of research results
• Seeking collaborators who can take part in research projects as partners or sell their expertise, results and intellectual rights
• Finding facilities and equipment which can be used for research
• Assessment of, and access to, Research and Development capabilities by policy-makers
• Finding ongoing research and technology activities and results of projects by users in commerce and industry
• Finding sponsors for a new research project

The ontology should contain terms already known to developers of Current Research Information Systems, so that the new infrastructure is easier to integrate with the old ones.

There are not many metadata standards for science. They are reviewed in [Grot-98, Lop-01].

Math-Net developed a metadata format based on Dublin Core and RDF Schema for marking up knowledge about the content of researchers' and institutes' pages [MathNet]. The Math-Net metadata set makes it possible to describe researchers, research groups, organizations, projects, results, events and publications.

For our ontology development we decided to use the CERIF-2000 metadata standard (Common European Research Information Format) [CERIF-2000].

According to the CERIF documents [CERIF], "CERIF 2000 is a set of guidelines meant for everyone dealing with research information systems. The CERIF 2000 guidelines are developed by a group of experts from the EU Member States and Associated Member states, under the co-ordination of the European Commission."

CERIF 2000 is now used by several groups of developers and researchers in different EU member states; it is proven and stable. Different groups of developers are also well acquainted with CERIF-2000, which will make the process of ontology adoption easier.

Despite the excellence of CERIF as a metadata format for research, it has certain gaps in the description of some types of research information resources. In developing our ontology we decided to enrich it with terms and slots from some other ontologies, to make it more suitable for research information retrieval.

The following table compares the enriched CERIF ontology with a few of the previously developed ontologies described earlier.

Table 1. Comparison of selected ontologies for science

Entity | CERIF 2000 | Math-Net ontology | SWRC (Semantic Web Research Community) | University Ontology
Person | Yes. Not classified in CERIF | Yes | Advanced hierarchy suitable for research and education | Advanced hierarchy suitable for research and education
Project | Not classified in CERIF | Yes | Yes. Classified | No
Organization | Yes. Classified | Yes | Close to CERIF classification | Only educational
Publication | Advanced classification which can serve research and educational IS | Yes | Close to CERIF classification of publications. Grey literature is not included | Close to CERIF classification of publications. Grey literature is not included
Event | Yes | Conferences | Yes. Very close to CERIF classification | Conference. Very basic
Equipment | Yes | No | No | No
Patent | Patent | No | No | No
Product/Research result | Product | Only software and product | Yes | Only software, software libraries
Expertise skill/Research topic | Expertise skill | Yes. Subject Value | Research Topic | No
Multimedia elements | Multimedia elements | No | No | No
Sites/pages | No | Yes | No | No

After the comparative analysis of the CERIF ontology, the selected ontologies and some research information systems, we concluded that the CERIF ontology can serve as a base technology because of the richness of its base terms and its relevance to research information systems (RIS). In some areas, however, CERIF has certain gaps. Enriching the CERIF ontology with terms from other ontologies can therefore be useful for research information systems.

The primitive units of the CERIF ontology are Person, Project, Organization Unit, Publication, Event, Site (Internet service/page), Equipment, Result, Multimedia element and Research topic (Expertise skill).

Research results which can be reused might be described in publications (articles, theses, technical reports, etc.).

Research results might also be described precisely (Research result or Product). They can be presented by advanced presentation techniques (Multimedia element), which may be video, images, drawings, diagrams or MS PowerPoint presentations.

Research results are results of research projects, produced by persons (researchers, students) in organization units (universities, labs, institutes, departments). Information about the expertise skills of persons and organizations can also be significant for the estimation of research results.

Some research results are patented, and valuable information about them can be stored in patents.

To make the search for research results easier, information about any entity can be classified by research topics.

Finding a partner is a further use case. A partner might be an organization unit or a person with research results and experience relevant to the partner seeker. Information about the results and experience of a partner can be extracted from its publications and from the descriptions of its projects.

Information about organization units, publications, results, projects and persons can be stored on sites. No research information system stores all relevant information, so users need to know about other information systems which can help in the search for research results and partners.

To help users find information, data about other relevant sites and internet services with research data should be provided to them.

Research may need equipment or facilities. Information about those entities should also be retrievable and searchable.

Table 2. Research Information Ontology terms

Organization unit: Enterprise; Higher Education Establishment; University; Faculty; Institute; International organization; Joint Research Center; Non-research private non-profit; Non-research public sector; Private research center; Private non-profit research center; Public research center; Laboratory; Research Group
Project: European project; Fundamental research project; Applied research project; Financed by official bodies project
Person: Researcher; Student
Product/Research result: Fundamental; Applied; Software; Software library; Information system; Compound; Process; Technology; Algorithm; Documentation; Proposal
Event: Conference; Cultural event; Exhibition; Political event; Sport event; Trade fair; Workshop
Publication: Abstract; Book; Conference paper; Conference proceedings; Dissertation; Guideline; Index; Journal article; Lecture; Multimedia; Patent; Report; Review
Equipment
Multimedia element: Audio; AudioVisual; DataForMultimedia (data for scientific software modules, such as GIS); ExecutableFile (which visualizes information, processes, etc.); Flash; Image; RealMedia; ShockWave; Slide presentation; Video
Site: Organization's site; Project's site; Personal home page; Publication on the web; List of the publications; Reference page; Information system; Library (access to articles); Research Information System (access to research data: projects, persons, organizations)

The complete ontology and set of terms are presented at http://derpi.tuwien.ac.at/~andrei/Metadata_Science.htm.

For the ontology development the CERIF-2000 Guidelines and Subject Index recommendations were used, as well as the Multimedia Ontology [Hunt-2001] and the science and university ontologies mentioned earlier.

As guidelines for ontology development we used [Noy-2001, Noy-G].

INFORMATIONAL RETRIEVAL ARCHITECTURE
The research data for retrieval has to be collected and analyzed. To make it possible for software to analyze and understand the meaning of the data, the data has to be published in a format understandable by software agents, or annotated. The annotations then have to be collected and analyzed and, if necessary, transformed into one model or format. During a search operation, queries and data are processed by search engines and the response is sent to the information consumers.

So the process of information retrieval consists of the following steps:

1. knowledge markup (by the researcher)
2. harvesting of marked-up knowledge by crawlers or software agents
3. transformation of the harvested data into formats appropriate for metadata repositories and search engines
4. loading into the repository
5. retrieval by search engines according to the user's request

WEB PAGE ANNOTATION
The ontology serves for understanding the meaning of data. But to make data understandable by software agents, the data has to be provided in a format which an agent can parse.

A number of annotation tools are described in [Staab-2001].

For page annotation we use two tools: OntoMat and the AURIS-MM metadata generating facilities.

OntoMat [OntoMat] is a user-friendly interactive webpage annotation tool. It includes a web browser and an ontology browser. The ontology browser supports exploration of DAML + OIL ontologies. The web browser supports web browsing, highlighting parts of web pages and creating annotations based on the highlighted parts. To annotate a web page, the researcher opens the page in the browser and then opens the ontology from the URL provided by the project. The researcher can then create annotations by highlighting regions of the page and describing them in the ontology browser according to the ontology terms, relations and attributes. OntoMat automatically creates the RDF annotation and a new web page with the RDF annotation included. The annotated web pages can be published on the web in place of the original, unannotated pages.

The AURIS-MM metadata generating facilities generate an RDF description of the data from the AURIS-MM relational database.

To create an annotated web page, the researcher inputs data about his research (projects, publications, etc.) into AURIS-MM and then uses the metadata generating facility just by pressing buttons. The generated RDF file can then be published on the web directly or embedded into a web page. The generated RDF file for an object has a persistent location in AURIS-MM, which can be used as an identifier for that object. This is very important because information about one object can be asserted on different pages. OntoMat supports only annotation and does not generate persistent URLs, because it is an annotation tool. Currently AURIS-MM does not support any ontology for semantic annotation as OntoMat does. But it supports vocabularies and thesauri for advanced annotations, and it also supports workflows and allows data that has already been entered to be reused.

Fig. Annotation of the page
Fig. Metadata collecting into RDF database
Fig. The registration of multimedia element.

COLLECTING METADATA
To make knowledge annotated on web pages accessible for retrieval, it has to be collected, analyzed, stored and made accessible to a query engine.

RDF metadata can be harvested (collected) with RDF Crawler (http://ontobroker.semanticweb.org/rdfcrawl/index.html), a Java application which can crawl web pages and collect RDF data. After crawling, RDF Crawler produces one file which stores all the RDF data and the declarations of all RDF Schemas used.

Data about research is currently provided in different markup formats. The Austrian research information system, Math-Net (http://www.math-net.org) and other societies use different markups to annotate data.

In our approach all data is converted to RDF so that it is accessible for search and analysis through one search engine.

QUERYING COLLECTED METADATA, GETTING KNOWLEDGE FROM ANNOTATIONS
Once the annotated metadata has been collected, how can it be used?

There are several tools which can be used to search annotated pages.

The SHOE Search Engine (Semantic Search, http://www.cs.umd.edu/projects/plus/SHOE/search/) searches registered annotated pages. The user of the search engine chooses an ontology, then chooses the type of resource to search for, creates very simple filter conditions and searches the SHOE metadata database.

Our approach assumes that the data is described in RDF or can be translated into RDF by a transformation procedure. To provide search services for researchers, the query facilities also have to be able to search data by its meaning (type of resource or property), by the values of attributes (properties) and by the relations between resources.

There are several query engines for RDF [Karv-2000]: Squish, Ontobroker, Redland RDF Application Framework, MetaLog, RDF Data Query Language.

In our project the Sesame RDF repository and querying facility is used to query the RDF database.

Sesame supports RQL (RDF Query Language) [Vass], which is being developed by the ICS-FORTH Institute. Sesame supports storing both RDF and RDF Schema information. The querying facilities of Sesame make use of schema information about subclasses and subproperties and support searching by attribute values and resource relations.

Table. Examples of SESAME queries to retrieve research information

    http://derpi.tuwien.ac.at/~andrei/cerif.rdfs#Person
All persons in the database (including any subtype of person: researchers and students).

    http://derpi.tuwien.ac.at/~andrei/cerif.rdfs#Researcher
All persons who are researchers (or any subtype of researcher).

    ^http://derpi.tuwien.ac.at/~andrei/cerif.rdfs#Researcher
All persons who are researchers and not any subtype of researcher.

    select X,Y
    from #Project {X}. #project_persons{Y}, {Z} #expertise_skill {E}
    where X = Z and N = "Semantic Web"
All projects in Semantic Web, with a description of the persons participating in them. If an organization, a person or a Research Information System asserts a new type of project (for example, software project) and declares in RDF Schema that it is a subtype of the AURIS-MM project type, then projects of that type will also be found.

    select X,Y
    from ^#Project {X}. #project_persons{Y}, {Z} #expertise_skill {E}
    where X = Z and N = "Semantic Web"
Only projects in Semantic Web asserted as exactly CERIF projects, and the participants of those projects.

Sesame provides an application interface through the HTTP protocol, so applications can query and update RDF databases over the network.

CONCLUSIONS
The use of Semantic Web technologies might be very fruitful for the development of Research Information Systems.

The annotation of knowledge makes it easier for researchers and research organizations to assert information about their research for dissemination. There is no need to register it in a number of information systems. Software agents can collect the information and understand its meaning.

Not only research data but also new domain knowledge can be asserted and shared for use.

Because of their inference abilities and schema exploration, query engines for the Semantic Web can make the development of a Research Information System easier than conventional technologies such as relational database management systems, since exploration of domain knowledge is crucial for CRIS.

ACKNOWLEDGMENTS
I thank Walter Niedermayer and all AURIS-MM project staff, Vienna University of Technology, for support and helpful comments on previous versions of this article.

REFERENCES
ERGO  European Research Gateways Online. http://www.cordis.lu/ergo
BL-2001  Berners-Lee T., Hendler J., Lassila O., The Semantic Web, Scientific American, May 2001
Hend-2001  Hendler J., Agents and the Semantic Web, IEEE Intelligent Systems Journal, March/April 2001
Erd-2001  M. Erdmann, A. Maedche, H.-P. Schnurr, and S. Staab, From Manual to Semi-automatic Semantic Annotation, Linköping Electronic Articles in Computer and Information Science
SWA  Semantic Web Activity. http://www.w3.org/2001/sw
Lee-2001  Berners-Lee T., Hendler J., Scientific publishing on the 'semantic web', 12 April, Nature, http://www.nature.com/nature/debates/e-access/Articles/bernerslee.htm
Hefl-99  Jeff Heflin, James Hendler, and Sean Luke, SHOE: A Knowledge Representation Language for Internet Applications, Technical Report CS-TR-4078 (UMIACS TR-99-71), 1999. http://www.cs.umd.edu/projects/plus/SHOE/pubs/#tr99
SHOE  SHOE home page. http://www.cs.umd.edu/projects/plus/SHOE/
OIL  Ontology Inference Layer web site. http://www.ontoknowledge.org/oil
Fens-2000  D. Fensel et al., OIL in a nutshell. In: Knowledge Acquisition, Modeling, and Management, Proceedings of the European Knowledge Acquisition Conference (EKAW-2000), R. Dieng et al. (eds.), Lecture Notes in Artificial Intelligence, LNAI, Springer-Verlag, October 2000
DAML  DARPA Agent Markup Language. http://www.daml.org
Jeff-98  K. G. Jeffery, "ERGO: European Research Gateways Online and CERIF: Computerized Exchange of Research Information Format", ERCIM News N. 35, 1998, http://www.ercim.org/publication/Ercim_News/enw35/jeffery.html
CERIF-2000  Common European Research Information Format 2000 Guidelines. ftp://ftp.cordis.lu/pub/cerif/docs/cerif2000.htm
Lind-2000  Niclas Lindgren, Anita Rautamäki, Managing Strategic Aspects of Research, CRIS-2000 (ftp://ftp.cordis.lu/pub/cris2000/docs/rautamdki_fulltext.pdf)
Aks-2000  Dag W Aksnes, Johanne-Berit Revheim, The Application of CRIS for Analyzing Research Output - Problems and Prospects, CRIS-2000 (ftp://ftp.cordis.lu/pub/cris2000/docs/aksnes_fulltext.pdf)
Grot-98  M. Grotschel, L. Lugger, "Scientific Information systems and Metadata", Classification in the Information Age. Proc. of the 22nd Annual GfKl Conference, Dresden, March 4-6, 1998
Lop-01  Lopatenko A. S., Kulagin M. V., "Current Research Information Systems and Digital Libraries. Needs for integration", to appear in proceedings of "Digital Libraries: Advanced Methods and Technologies, Digital Collections", Sep. 2001
MathNet  Math-Net Application Profile. http://www.iwi-iuk.org/material/RDF/1.1/profile/MNPage/
CERIF  CERIF Homepage. http://www.cordis.lu/cerif
Hunt-2001  J. Hunter, "Adding Multimedia to the Semantic Web - Building an MPEG-7 Ontology", SWWS, Stanford, July 2001
Noy-2001  Noy N. F., Ontology Engineering, Semantic Web Working Symposium, 2001, Stanford
Noy-G  Noy N. F., McGuinness D. L., Ontology Development 101: A Guide to Creating Your First Ontology, http://protege.stanford.edu/publications/ontology_development/ontology101-noy-mcguinness.html
Staab-2001  Staab S., Maedche A., and Handschuh S., Creating Metadata for the Semantic Web: An Annotation Framework and the Human Factor. Technical Report, 2001. http://www.aifb.uni-karlsruhe.de/WBS/sha/papers/semantic-annotation.pdf
OntoMat  Webpage annotation tool. http://ontobroker.semanticweb.org/annotation/ontomat/index.html
Karv-2000  Karvounarakis G., Querying RDF Metadata and Schemas, Technical Report, Institute of Computer Science, Foundation for Research and Technology-Hellas (FORTH), Crete, Greece, http://www.ics.forth.gr/proj/isst/RDF/rdfquerying.pdf
Vass  G. K. Vassilis, C. D. Plexousakis, S. Alexaki, "Querying Community Web Portals", http://139.91.183.30:9090/RDF/publications/sigmod2000.html
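
The class queries in the table of Sesame query examples differ in one respect: a plain class name matches instances of that class and of every subclass, while the ^ prefix restricts the match to the class itself. The following Python sketch illustrates that distinction only. It is not Sesame or RQL code, and the dictionaries, instance names and helper functions are hypothetical stand-ins for an RDF Schema hierarchy and its instance data.

```python
# Hypothetical subclass declarations (child -> parent), standing in for
# rdfs:subClassOf statements in an RDF Schema.
SUBCLASS_OF = {
    "Researcher": "Person",
    "Student": "Person",
    "SoftwareProject": "Project",
}

# Hypothetical instance data (resource -> asserted class).
INSTANCE_OF = {
    "alice": "Researcher",
    "bob": "Student",
    "carol": "Person",
    "auris_mm": "SoftwareProject",
    "fodok": "Project",
}


def is_subclass(cls, ancestor):
    """Return True if cls is ancestor or a direct/transitive subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)  # walk one step up the hierarchy
    return False


def instances(cls, exact=False):
    """List instances of cls, sorted by name.

    exact=False mimics a plain class query (subclasses included);
    exact=True mimics the ^ prefix (only resources asserted as exactly cls).
    """
    if exact:
        return sorted(r for r, c in INSTANCE_OF.items() if c == cls)
    return sorted(r for r, c in INSTANCE_OF.items() if is_subclass(c, cls))


if __name__ == "__main__":
    print(instances("Person"))               # subclass-aware query
    print(instances("Person", exact=True))   # exact-class query
    print(instances("Project"))
    print(instances("Project", exact=True))
```

In this sketch a newly declared project subtype only needs one additional entry in the subclass mapping to be picked up by the subclass-aware query, which mirrors the behaviour described for newly asserted project types.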