Using ontologies for querying and analysing protein-protein interaction data Mario Cannataro, Pietro Hiram Guzzi Bioinformatics Laboratory, Department of Experimental Medicine and Clinic University Magna Graecia, Catanzaro, Italy {cannataro,hguzzi}@unicz.it 1 Introduction Many wet lab experiments lead to the accumulation of a large amount of data related to interaction among proteins [1] also referred as Protein to Protein Interaction (PPI) data. The whole set of protein interactions of a single organism is also referred to as Protein to protein Interaction Network (PIN) and it is built from a set of binary interactions. PINs have been easily modeled by using undirected graphs [2] where nodes are associated to proteins, and edges represent interactions among proteins. The high dimension of this graph makes infeasible the manual inspection even for simple organisms, so the study of PINs requires graph-based computational methods. PPI databases, such as the Database of Interacting Proteins (DIP) [3], are often publicly available on the Internet and offer to the user the possibility to retrieve data of interest through simple querying interfaces. User, in fact, can conduct a search by the insertion of: (i) one or more protein identifiers, or (ii) a protein sequence. Results may consist of, respectively, a list of proteins that interact directly with the seed protein or that are at distance k from the seed protein in the PIN. It is impossible to formulate even simple queries involving biological concepts or annotations, such as: retrieve all the interactions that are related to glucose synthesis‘. Nevertheless, these annotations there still exist and are spread in different data sources. The main hypothesis of this paper is that annotating PPI data with biological information may result in more rich querying interfaces and in more powerful PINs analysis algorithms that may use such biological information [4]. This work presents a first prototype of a system system able to adding to actual data the information coming from ontologies such as Gene Ontology [5] and from other sources. Moreover the proposed system allows the use of annotated data into an analysis pipeline. The annotation of a PPI network consists of three main phases: (i) retrieval of PPI data (Data Extraction Module), (ii) retrieval of existing annotations for that data (Metadata Extraction Module), (iii) generation of annotated interactions and storage into the annotated database. Initially, the proposed system queries the existing interaction database and retrieves data about inter- actions. Then the protein identifiers are used to find related annotations. For instance, the Gene Ontology Annotation Database [6] (GOA) can be queried by using the UniProt identifiers or Gene Ontology terms. Finally, data are merged together and encoded by using an XML-based syntax, and stored into the an- notated database. Figure 1 depicts the architecture of the system for extracting annotations from Gene Ontology and for annotating the PPI database. Fig. 1. The Architecture for annotating PPI databases Main advantage of such system is the possibility to retrieve interactions, non only proteins whose nodes have a given annotation. Let us consider protein MEC1 of yeast and its interacting partners. Let us consider, moreover, the kinase activity process. When a user searches in existing databases it will retrieve the interactions: (MEC1, TEL1), (MEC1, RNR1). Successively, he/she has to check the annotation manually to discover which proteins are annotated with kinase activity. By using the annotated PPI database user can directly specify the process retrieving desired informations. Such a system could be useful not only for the semantic search of data, but also for the semantic-based analysis of PPI data. The analysis of PPI networks is usually done by using graph-based algorithms, and associating graph properties to biological properties of the modeled PPI. The availability of annotated data could enable the development of novel algoritms able to gather such information. References 1. Uetz, a.: A comprehensive analysis of proteinprotein interactions in saccharomyces cerevisiae. Nature 403 (2000) 623627 2. West, D.B.: Introduction to Graph Theory. Prentice Hall, NY (August 2000) 3. Salwinski, S.e.a.: The Database of Interacting Proteins: 2004 update. Nucl. Acids Res. 32(suppl1) (2004) D449–451 4. Cannataro, M., Guzzi, P.H., Veltri, P.: Using ontologies for annotating and retriev- ing protein-protein interactions data. In: Computer-Based Medical Systems, 2009. CBMS 2009. 22nd IEEE International Symposium on. (Aug. 2009) 1–5 5. Harris, M.A., et al: The gene ontology (go) database and informatics resource. Nucleic Acids Res Nucleic Acids Res 32(Database issue) (January 2004) 258–61 6. Camon, E., et al: The gene ontology annotation (goa) database: sharing knowledge in uniprot with gene ontology. Nucleic Acids Res 32(Database issue) (January 2004)