Close Integration of ML and NLP Tools in BioAlvis for Semantic Search in Bacteriology

Robert Bossy¹, Alain Kotoujansky¹, Sophie Aubin¹, Claire Nedellec¹
¹ INRA-MIG, Domaine de Vilvert, F-78352 Jouy-en-Josas
{robert.bossy, alain.kotoujansky, sophie.aubin, claire.nedellec}@jouy.inra.fr

Abstract. This paper focuses on the use of corpus-based machine learning (ML) methods for fine-grained semantic annotation of text. The state of the art in semantic annotation in Life Science, as in other technical and scientific domains, takes advantage of recent breakthroughs in the development of natural language processing (NLP) platforms. The resources required to run such platforms include named entity dictionaries, terminologies, grammars and ontologies. The demand for domain-specific, comprehensive and low-cost resources has led to the intensive use of ML methods. The precise specification of the ML task goal and target knowledge, and the adequate normalization of the training corpus representation, can notably increase the quality of the acquired knowledge. We argue in this paper that integrated ML-NLP architectures facilitate such specifications. We illustrate our demonstration with four representative NLP tasks that are part of the BioAlvis semantic annotation platform. Their impact on the quality of the semantic annotation is assessed through the evaluation of an IR application in Bacteriology.

Keywords: Semantic Annotation, Machine Learning, Ontology Learning, Natural Language Processing.

1 Introduction

Despite the growing number of available structured databases dedicated to biomedicine, a large part of the domain knowledge is only available in natural language documents. In addition, several services centralize publications in Health and Life Sciences. The main one, Entrez PubMed (NCBI), references over 16 million papers [1].
However, at the same time, the size of the bibliographic bases grows exponentially and the scope of the scientific questions crosses the traditional boundaries of biologists' fields of expertise, so that classical Information Retrieval (IR) applications are no longer sufficient to target the useful and relevant documents. Advanced techniques involving more semantics have to be applied to textual information processing in the biomedical domain. Life and Health sciences are recognized as critical knowledge-intensive domains for the Semantic Web [2]. Research efforts towards the Semantic Web aim “at replacing the current web of links with a web of meaning” [3], producing large-scale methods for automating deep semantic analysis and markup of Web pages in a machine-readable form suitable for information extraction (IE) or information retrieval (IR) applications in the biomedical domain.

Semantic analysis methods increasingly involve Natural Language Processing (NLP) and powerful representation languages that are reaching a stage of maturity where fine-grained semantic markup of large Web corpora of text in various domains and languages becomes possible. This is demonstrated for instance by the GATE [4], UIMA [5], and Alvis [6] NLP platforms, which make extensive use of linguistic knowledge. A large part of this knowledge is domain-specific and requires costly development and maintenance efforts that can be alleviated by Machine Learning (ML) methods.

Corpus-based ML methods yield impressive knowledge acquisition results for a wide variety of NLP tasks such as named entity recognition (NER) [7, 8], POS tagging [9] and concept and relation tagging [10, 11]. However, the cost remains high for (i) the production of the appropriate features for representing the training examples, (ii) the manual annotation of the training examples and (iii) the evaluation of the quality of the ML results. The close integration of ML methods and end-user applications, e.g.
IE or IR, into semantic annotation platforms provides a useful framework to overcome these limitations. Such efficient platform integration implies the proper characterization of the type and role of the knowledge that is used and produced by each platform component. This formalization step makes it possible to avoid many cases of redundancy and inconsistency in semantic annotations. Translated into ML concerns, this means that the learning target concept must be clearly specified according to the overall knowledge model and that the design of the example representation should be derived accordingly.

Following this principle, we have defined four representative and related learning steps and the NLP process that computes the necessary training corpora. Experimental results with the BioAlvis ML-NLP platform show that the appropriate normalization of the example representation according to the learning task improves ML performance and facilitates further knowledge integration. By applying the BioAlvis platform to IR of biomedical documents, we measure the quality improvement of the semantic annotation performed with the learned knowledge.

2 From Words to Concepts

2.1 Semantic Annotation

Automatic semantic annotation supplies a meaningful structure to free texts expressed in natural language with the purpose of allowing machine processing. In the Semantic Web framework, semantic annotation consists of an interpretation of the text supported by an ontology, i.e. the assignment of concepts and relations of an ontology to fragments of text. The extent of the annotated text fragments varies considerably depending on the target application. IE and IR target specific bits of information contained in short fragments of text, i.e. terms, words and multi-word units. Fig. 1 shows an example of the semantic annotation of a sentence from a scientific article on Molecular Genetics. The word GerE denotes a protein and the word sigK denotes a gene.
The negative interaction concept is supported by the verb inhibit. GerE (resp. sigK) and the verb inhibit are instantiations of the arguments of the ontology relation agent (resp. target) between the protein (resp. the gene) and the concept negative interaction.

[Figure 1: the sentence “The GerE protein inhibits transcription in vitro of the sigK gene”, annotated with the concepts Protein, Negative interaction and Gene, linked by the relations agent and target.]
Fig. 1. Example of semantic annotation in Biology.

As an illustration of the production and exploitation of a semantic annotation in the context of IR, we present BioAlvis, the variant of the Alvis framework focused on bacteriology, which works as follows. The annotation pipeline enriches documents with fine-grained semantic annotations acquired through the successive application of NLP tools. The result is passed to the indexing component and exploited by the semantic search engine. The IR service normalizes the user queries in the same way as the documents: words are lemmatized, terms and named entities (NE) are replaced by their canonical forms and the concepts are replaced by their paths from the ontology root. This strategy differs from the usual query expansion, which consists of replacing each query term with the set of its synonyms and sub-concepts. The Alvis method drastically reduces the complexity of queries and makes their interpretation legible for the user. The user can also directly benefit from the annotation of concepts by performing ontology-based facet refinement through a rich Web user interface.

2.2 NLP/ML cooperation towards semantic annotation

Software platforms for text corpus annotation integrate a common range of linguistic processes into pipelines, typically: tokenization, word and sentence segmentation, named entity and term tagging, part-of-speech tagging, syntactic parsing and semantic concept and relation annotation. Each process relies on linguistic resources relevant to the target domain, which require substantial acquisition efforts.
Most platforms either do not include automatic knowledge acquisition facilities (e.g. Luxid®, MedScan®, AKS2®, InGenuity®) or include them only in a limited way (e.g. Luxid I2E® for NER), although corpus-based Machine Learning provides an attractive alternative to the manual acquisition of such resources. Technically, a single annotation pipeline can process documents for application purposes as well as for preparing training corpora with the intent of acquiring new linguistic resources. However, in most implementations, this virtuous feedback does not translate into close ML and NLP software integration. The input of the ML is usually computed by a subpart of the NLP pipeline but the output is not directly usable by subsequent NLP components. This is the case in GATE / Amilcare [4].

We claim that semantic annotation can greatly benefit from a full integration of the ML components that feed the knowledge bases. Beyond format homogeneity, close integration compels the architecture designer to specify the respective roles of each NLP component involved in the semantic annotation process, and to identify precisely all types of knowledge along with their interdependencies, as well as the target knowledge to be acquired by each ML component.

Breaking down the semantic annotation task into well-identified elementary NLP steps has a positive effect on the output quality of the associated ML component. Relevant regularities are more easily identified by the ML system, and human annotation of training examples is easier and of higher quality when it concerns a single, well-delimited knowledge type. For example, in formal knowledge representation frameworks, the tagging of semantic types and the tagging of properties are two distinct steps. In the phrase “mouse synaptophysin gene”, the annotation of synaptophysin as an object of type gene is handled separately from the annotation of its property belongs to the species mouse.
The knowledge acquisition goals for the recognition of gene names and their related species must be achieved by at least two distinct ML tasks applied to two different training corpora. The increased homogeneity of corpora that results from normalization reduces the number of examples to annotate manually. Unfortunately, many knowledge acquisition approaches to NER do not follow this principle [8].

Moreover, the clarification of dependencies among the different types of knowledge provides a basis for increasing knowledge modularity and reducing annotation inconsistencies. Operationally, the dependencies between knowledge types impose a constrained order on the linguistic, semantic and acquisition processing steps, and this order should be made explicit. In a structured modular view of the linguistic knowledge base, higher-level knowledge should encapsulate lower-level knowledge. Then, in order to learn a given target knowledge K, the training example representation should be based on the knowledge on which K depends and not on any lower-level knowledge. For instance, for the sake of modularity, relation recognition rules should not be learned from shallow clues such as punctuation marks. Previous NLP steps should have interpreted the punctuation marks into relevant information such as sentence ends (sentence segmentation) or abbreviations (named-entity normalization).

Hereafter, we present the results obtained by applying these principles to the development and the integration of knowledge acquisition facilities into the Alvis platform. We focus on the acquisition of critical resources that are required by semantic annotation, covering the variety of learnable linguistic knowledge, i.e. named entities, terms, concepts and relations. We demonstrate their learnability (section 3) and the benefit of fine-grained semantic annotation in terms of quality and density of annotations for a given domain and application: IR in Biology (section 4).
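As a minimal sketch of how the dependencies between knowledge types can be made explicit, the following Python fragment declares, for each annotation layer, the layers it reads, and derives a compatible processing order. The layer names follow the steps listed in section 2.2, but the exact dependency edges and the function names are illustrative assumptions, not the actual Alvis configuration.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each annotation layer lists the layers it reads.
# The edges are an illustrative assumption, not the actual Alvis configuration.
DEPENDS_ON = {
    "segmentation": set(),
    "lemmatization": {"segmentation"},
    "named_entities": {"segmentation", "lemmatization"},
    "terms": {"lemmatization"},
    "parsing": {"named_entities", "terms"},
    "semantic_types": {"named_entities", "terms", "parsing"},
    "relations": {"semantic_types", "parsing"},
}


def processing_order(depends_on):
    """Return one processing order compatible with the declared dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


def allowed_features(target, depends_on):
    """Layers a learner for `target` may read: only its direct dependencies."""
    return depends_on[target]


order = processing_order(DEPENDS_ON)
print(order)
print(allowed_features("relations", DEPENDS_ON))
```

Under these assumptions, a learner for relations may only read the semantic type and parsing layers; lower-level clues such as raw segmentation output are deliberately out of its reach.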
3 The BioAlvis Experience

3.1 Architecture

The Alvis annotation/acquisition pipeline (A3P in the following) has been developed within the Alvis project [12]. Alvis aimed at developing an open software platform that supports the quick development of distributed semantic search engines in specific domains. The Alvis platform integrates a semantic crawler, the annotation/acquisition pipeline and a semantic search engine based on Zebra [13]. As a proof of flexibility, various instances of Alvis have been deployed in a short time for different languages (e.g. Chinese, Slovene, English and French) and different genres and domains (e.g. textbooks, news, patents, Wikipedia entries, MedLine abstracts, Agrobiotechnology patents). The Biology instance that we developed, BioAlvis, is presented here.

Following the principles advanced in section 2.2, A3P is composed of a sequence of modules, based on Ogmios [14], that produce a layered annotation of the input document (central area of Fig. 2). The modules communicate by means of a common layered XML annotation format.

Fig. 2. BioAlvis architecture.

The XML annotation format relies on a layered representation where each layer gathers the annotations from a single type of knowledge, in a way similar to that described in [15]. The first annotation steps identify semantic units, i.e. named entities and terms, that denote the reference concepts of the domain in the document (Semantic Units box of Fig. 2). Their recognition, normalization and disambiguation require prior word and sentence segmentation and word lemmatization. The next annotation steps assign ontology categories to the semantic units (Conceptualization box of Fig. 2). This includes fine-grained sense disambiguation based on selectional restrictions on the semantic units (section 3.3). Prior syntactic parsing produces the required syntactic dependencies.
Finally, ontology relations among the semantic units are identified by the application of Information Extraction rules. The rules use the semantic categories and the syntactic dependency context of the semantic units. A3P bootstraps by providing an annotated corpus for the acquisition of the knowledge needed by the next components in the pipeline sequence.

As shown in Fig. 2, the linguistic analysis modules are fed by knowledge bases (drums). Their acquisition is achieved by an array of corpus-based acquisition tools involving ML methods. The next subsections describe four representative acquisition tasks of BioAlvis, their target knowledge, the example representation and the principles of the integration of the learned knowledge into the knowledge bases. Most of the learning results described in the following were obtained from a representative training corpus in bacteriology, referred to as the Bacillus corpus, which consists of 2,397 scientific paper references from MedLine and was designed in 2001. It is the result of the following query to PubMed: “Bacillus subtillis AND (transcription OR promoter OR sigma factor)”.

3.2 Semantic Units

3.2.1 Named Entities

The term named entity usually designates proper names associated with an ontological category or semantic type (e.g. place, person). Proper names are rigid designators that denote a referential entity in an unambiguous way [16]. The BioAlvis NER component, RenBio, focuses on protein/gene and species recognition. It achieves NE tagging by GenBank-based dictionary mapping and by the application of disambiguation and variation rules. The disambiguation rules specify what contexts are required for each type of named entity. In parallel, variant dictionaries and variation rules in the form of hand-crafted regular expressions deal with common typographic alterations. Rules for disambiguation and recognition of new entities are automatically acquired by supervised machine learning from a reference training corpus.
For example, the simple rule “A word, followed by the word protein, 4 letters long, starting and ending with an upper-case letter, is a protein name”, applied to the text “The GerE protein inhibits transcription”, assigns GerE to the protein category, even if GerE is not in the protein dictionary.

The linguistic features of the training examples are computed by segmentation, lemmatization and typographic analysis (e.g. length, case, presence of symbols and digits, co-occurring words) of the training corpus, performed by the annotation pipeline. The annotation of positive examples is done by first mapping the NE dictionary onto the corpus and then having domain experts correct the result manually. Negative examples are automatically generated under the closed-world assumption.

The RenBio rules for gene and protein name annotation in BioAlvis were learnt from the Bacillus corpus. The dictionary mapping on the training corpus tagged 7,185 occurrences. Ten biologists analyzed, corrected and completed the tagging. They found 12% false positives due to ambiguities and 12% false negatives due to new names. We applied the C4.5 decision tree induction algorithm (J48 version from the WEKA library [17]) to the revised training corpus. The cross-validation evaluation reported in Table 1 showed significantly better results in terms of recall and precision compared to the best results of the two gene/protein recognition challenges NLPBA [18] and BioCreative II [19].

Table 1. NER performances (recall - precision).

| System | Recall | Precision |
|---|---|---|
| Best NLPBA | 76% | 69% |
| Best BioCreative II | 86% | 88% |
| RenBio Bacillus | 94% | 92% |

We claim that the example representation features and the accurate specification of the learning goal yielded training examples of higher quality, thus improving the conditions for the learning algorithms. The good results were not due to any breakthrough in ML, since we applied a standard, well-known algorithm.
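The example rule for GerE given above can be sketched in a few lines of Python, as a conjunction of tests over typographic and contextual features of the kind listed in this section. The feature names and both functions are hypothetical illustrations; this is not the RenBio implementation nor the output of J48.

```python
def features(tokens, i):
    """Typographic and contextual features of tokens[i] (illustrative subset)."""
    word = tokens[i]
    return {
        "length": len(word),
        "starts_upper": word[0].isupper(),
        "ends_upper": word[-1].isupper(),
        "has_digit": any(c.isdigit() for c in word),
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else None,
    }


def is_protein(feat):
    """The example rule: 4 letters, upper-case at both ends, followed by 'protein'."""
    return (
        feat["next_word"] == "protein"
        and feat["length"] == 4
        and feat["starts_upper"]
        and feat["ends_upper"]
    )


def tag_proteins(text):
    """Return the words of `text` that the example rule tags as protein names."""
    tokens = text.split()  # naive whitespace tokenization, an assumption
    return [tokens[i] for i in range(len(tokens)) if is_protein(features(tokens, i))]


print(tag_proteins("The GerE protein inhibits transcription"))  # ['GerE']
```

A learned decision tree would combine many such tests; the sketch only shows how one path of tests over the example representation corresponds to one rule.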
On the one hand, the automatic linguistic pre-processing of the examples by BioAlvis contributed to drastically reducing the dimension of the example description space and to removing potential sources of errors. Moreover, in order to discriminate between NE and non-NE by their context, we picked the most relevant trigger words by feature selection. For instance, words like gene, operon or transcription are more likely to be present around gene names than any other word.

On the other hand, the learning goal was specified according to the role of the NER in the semantic annotation process. This strongly determines the guidelines for the manual annotation of the training examples by experts. Our strict annotation guidelines address several phenomena that could hinder the quality of the annotation. The principles are as follows: NE annotation should be restricted to single entities for learnability and knowledge modeling reasons; it should exclude terms that denote general semantic categories and properties; and the entity span should exclude the description of entity qualifiers (e.g. in “recA gene”, only “recA” is annotated). The detailed description of the guidelines can be found in [20].

Our experiments demonstrate that the combination of the appropriate normalization of the data with the consistent annotation of training examples by experts improves machine learning performance in terms of precision, recall and size of the training sample (see [20] for more details).

3.2.2 Terms

The BioAlvis term analysis component identifies the phrases that represent semantic units. These are single or multi-word nominal or verbal expressions that refer to specific domain concepts (e.g. plant pest). The term analysis module achieves term recognition and normalization, which consists of tagging the term with its canonical form. Semantic ambiguities are processed afterwards by the semantic typing module (section 3.3), while inflections are handled beforehand by the lemmatization module.
As for NE, off-the-shelf lists of terms are not sufficient to annotate documents because terms may be ambiguous and subject to variation. Moreover, in scientific and technical domains, terminologies are generally incomplete with respect to the specific application needs [21]. For instance, less than 1% of the 410,000 terms of MeSH [22] and Gene Ontology [23] occur in the 16,000 sentences of the Bacillus corpus. In addition to being mostly related to eukaryotes, those terms are suited to manual indexing rather than to automatic annotation, as they do not appear as such in NL documents. BioAlvis includes two acquisition modules for enriching the terminology with new terms and variants from a training corpus.

The term acquisition component is the YaTeA term extractor [24]. It takes as input a training corpus with segmented sentences and words, lemmas and POS tags. YaTeA identifies candidate terms in an unsupervised way. Its strategy is based on declarative linguistic rules for boundary detection and term analysis and on endogenous disambiguation. Extracted candidate terms are usually validated by domain experts.

The variation spectrum is much larger for terms than for NE. In addition to minor graphical variations, it includes morpho-syntactic alterations that can deeply modify the form of the term (e.g. plant pest, pests on plants and pests that attack plants). Term normalization involves complex linguistic and domain knowledge encoded in variation rules. BioAlvis integrates FASTR [25], a tool that computes candidate term variants from training corpora and controlled terminologies such as those produced by YaTeA and experts. For instance, a FASTR insertion rule extracts genetic competence from the Bacillus corpus as a variation of competence. Domain experts then validate the proposed variation relations as synonymy, hyponymy or other relations. Note that sets of synonym variants of the same term are similar to WordNet synsets. A canonical form is chosen for normalization purposes to represent the synonym set.
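Once variants have been validated, normalization to a canonical form amounts to a lookup in the synonym sets. The following Python sketch illustrates this step only; it does not reproduce FASTR's variation rules, the matching is a naive substring search, and treating the three plant pest variants as one synonym set is an assumption made for the example.

```python
# Hypothetical validated synonym sets: canonical form -> accepted variants.
SYNONYM_SETS = {
    "plant pest": ["plant pest", "pests on plants", "pests that attack plants"],
}


def build_variant_index(synonym_sets):
    """Map every (lower-cased) variant to the canonical form of its synonym set."""
    index = {}
    for canonical, variants in synonym_sets.items():
        for variant in variants:
            index[variant.lower()] = canonical
    return index


def normalize_terms(text, index):
    """Return (variant, canonical form) pairs found in text, longest variants first."""
    lowered = text.lower()
    found = []
    for variant in sorted(index, key=len, reverse=True):
        if variant in lowered:
            found.append((variant, index[variant]))
            # Blank out the match so that shorter variants do not match inside it.
            lowered = lowered.replace(variant, " ")
    return found


index = build_variant_index(SYNONYM_SETS)
print(normalize_terms("Pests that attack plants reduce yields", index))
```

In a pipeline setting the lookup would run on lemmatized tokens rather than on raw strings, so that inflections are already handled before the variants are matched.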
Applied to the Bacillus corpus, YaTeA extracted 6,699 candidate terms occurring at least twice, among which 3,560 were validated by a group of 3 experts in terminology and biology, yielding 52% precision. The recall evaluated on a gold standard subset of Bacillus was 67%. FASTR then extracted 2,335 variants. The validation by two experts was done in a few days and resulted in 676 synonym sets and 1,569 hyponyms. Additionally, from 715 MeSH terms found to occur in the Bacillus corpus, FASTR identified 1,899 new variants, among which 397 hyponyms and 117 synonyms.

Such methods, when integrated in a pipeline, appear to be very competitive compared to manual acquisition. The approach offers exhaustiveness with respect to a corpus, which is a clear advantage for knowledge-based applications.

3.3 Semantic Types

Semantic typing relates concepts from the ontology to semantic units in the text after their identification by the term and NE recognition components. The ontology concepts are organized into hierarchies. Semantic typing annotates the semantic units with the whole concept path to the hierarchy root, without any a priori assumption on the relevance of the concept level. In A3P, the ontology-lexicon relation is explicit: the concepts of the ontology are represented at the lexical level by the canonical forms of terms and named entities. When a given semantic unit is polysemous, disambiguation rules select the right concept in the ontology with respect to its syntactic context in the document; BioLG [26, 27], the dedicated version of Link Grammar, is integrated in BioAlvis for computing the syntactic contexts.

The acquisition of concept hierarchies and disambiguation rules is supported by the ML hierarchical conceptual clustering tool Asium [10]. Asium takes as input a training corpus annotated in the same manner as the input of the semantic typing module, i.e. semantic unit tagging and parsing.
The formation of concept classes by Asium is based on distributional analysis, assuming that semantic units occurring in similar syntactic contexts in specific domain corpora are semantically close. Asium suggests their corresponding concepts as candidate members of the same semantic class. The disambiguation rules are automatically learned together with the semantic classes. They are expressed as selectional restrictions stating the syntactic dependency constraints on the context of the semantic unit being typed. For instance, cat may denote either a mammal or a gene as defined by the ontology. In the phrase hypokalaemic myopathy of Burmese cat, cat must be tagged as mammal rather than as gene: myopathy is a disease, and the disambiguation constraints express the knowledge that mammals have diseases but genes do not.

Semantic classes are successively merged according to the Asium distance formula. Asium includes a cooperative user interface to validate, revise and name the learned classes and hierarchies on the fly as they are built. This iterative process avoids error amplification along the hierarchy formation.

Like all methods based on distributional semantics, Asium produces large coverage classes that may include three types of errors: (i) the input syntactic dependencies computed by the parser may be incorrect (e.g. between 20 and 30% of the dependencies computed by BioLG [27]); (ii) the syntactic context may reflect different meanings (e.g. the preposition in expresses either a time or a place relation, as in transcription in mitotic cell cycle / transcription in cell), which implies splitting the class; (iii) the semantic relation may not represent only close meanings but also antonyms or lexical variations that were missed by the term and variant analysis. A large part of the potential learning errors is avoided upstream by choosing an appropriate representation of the training examples.
Terminology and NE normalization significantly improves the quality of the learned classes by increasing the homogeneity of the training data. It also decreases the number of parsing errors and reduces the computation time, since the parser avoids computing dependencies inside the terms, as detailed in [26]. Indeed, normalization reduces irrelevant variations by a factor of 3 to 4. Moreover, syntactic contexts as used in Asium reflect the semantic roles more accurately than typographic windows do. Extensive evaluation of the quality of the semantic classes acquired by methods based on distributional semantics has not been conducted yet. However, a general comparison of Asium with other systems can be found in [28].

Although the concepts are validated by Asium as they are constructed, they cannot be integrated as such in the ontology. Their structure does not necessarily represent the model needed for the application and may require validation and revision. The alignment between the learned ontology and existing ones also remains a critical problem. The modeling strategy we adopted for the development of BacteriOntology is based on an ontology skeleton crafted by hand by biologist experts and computer scientists from the MIG-INRA laboratory. The hierarchical model results from the integration of several existing resources: (1) the highest levels of the ontologies and thesauri GO and MeSH; (2) relevant domain-specific information resources (Riley and Subtilist function classifications and NCBI species taxonomy); (3) concepts denoted by the 300 most frequent terms (section 3.2.2) acquired from our corpus. Asium results were then used to extend this core ontology and populate its classes. The current version of the resulting BacteriOntology defines 5,888 concepts, structured into 6 generality levels (excluding the extremely deep species hierarchy). The quality of BacteriOntology acquired with Asium support was globally evaluated through IR (section 4).
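The two operations of semantic typing described in this section, annotation with the whole concept path and disambiguation by selectional restriction, can be sketched as follows in Python. The ontology fragment, the lexicon and the restriction table are hypothetical toy data built around the cat example; they are not taken from BacteriOntology, and the rule format is not that of Asium.

```python
# Hypothetical ontology fragment: child concept -> parent concept.
PARENT = {
    "mammal": "organism",
    "gene": "genome component",
    "disease": "phenomenon",
}

# Hypothetical lexicon: canonical form -> candidate concepts.
LEXICON = {"cat": ["mammal", "gene"], "myopathy": ["disease"]}

# Hypothetical selectional restrictions:
# (concept of the syntactic head, dependency) -> concepts allowed for the dependent.
RESTRICTIONS = {("disease", "of"): {"mammal"}}


def concept_path(concept):
    """Return the path from the hierarchy root down to `concept`."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return list(reversed(path))


def type_unit(unit, head, dependency):
    """Type `unit`, using its syntactic head and dependency to disambiguate."""
    candidates = LEXICON[unit]
    if len(candidates) > 1:
        for head_concept in LEXICON.get(head, []):
            allowed = RESTRICTIONS.get((head_concept, dependency), set())
            kept = [c for c in candidates if c in allowed]
            if kept:
                candidates = kept
                break
    return [concept_path(c) for c in candidates]


# "myopathy of Burmese cat": cat depends on myopathy through the preposition "of".
print(type_unit("cat", "myopathy", "of"))  # [['organism', 'mammal']]
```

When no restriction applies, the sketch keeps every candidate concept, which leaves the ambiguity visible instead of guessing.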
3.4 Domain-specific Relations

Domain-specific relations are usually more difficult to identify in the text than concepts because they are less directly supported by contiguous text fragments. BioAlvis annotation of relations focuses on gene interaction and relies on relation extraction rules. For a given relation, the rules check the type of the semantic units in the ontology in order to spot candidate relation arguments, and the type of the syntactic dependencies between them. For instance, in the text GerE inhibits the expression of sigK, the gene interaction between the protein agent GerE and the target gene sigK is identified, in the simplest case, by the following rule expressed in first-order logic:

gene_interaction(X,Z) :- type(X,Protein), subject(X,Y), type(Y,Interaction_action), obj(Z,Y), type(Z,Gene).

where Protein, Interaction_action and Gene are ontology concepts, and obj and subject are syntactic dependencies. Many complex gene interaction cases are handled with the same method, including those involving regulon membership and promoter binding (detailed method in [29]).

Relation extraction rules are learned by the supervised Inductive Logic Programming method LP-Propal. The training examples are expressed in the same way as the input corpus of the relation tagging component: typed semantic units and syntactic dependencies. The sentences are selected by the naïve Bayes classifier STFilter [30], so that manual annotation focuses on the relation arguments in the sentences that most probably express a genic interaction. The successive filtering, disambiguation and normalization of the lexicon and the syntactic analysis improve the training set homogeneity. The LLL dataset on genic interaction [31] has been designed from the Bacillus corpus for evaluation. Experiments on the subset action without coreference (70 positive examples) yielded 89.4% F-measure.
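To make the gene_interaction rule above concrete, here is a minimal Python sketch that evaluates it over a toy set of facts. The fact encoding (a type dictionary and dependency triples) is a hypothetical choice for illustration, and attaching sigK directly to the verb as its object is a simplifying assumption corresponding to the "simplest case" mentioned in the text. The sketch applies one hand-written rule; it does not reproduce LP-Propal or the learned rule set.

```python
# Toy facts for "GerE inhibits the expression of sigK" (illustrative encoding).
TYPES = {"GerE": "Protein", "inhibits": "Interaction_action", "sigK": "Gene"}

# Syntactic dependencies as (relation, dependent, head) triples.
# Assumption: sigK is attached directly to the verb as its object.
DEPENDENCIES = {("subject", "GerE", "inhibits"), ("obj", "sigK", "inhibits")}


def gene_interactions(types, dependencies):
    """Apply the example rule and return the (agent, target) pairs it derives."""
    results = []
    for rel1, x, y in dependencies:
        # type(X, Protein), subject(X, Y), type(Y, Interaction_action)
        if rel1 != "subject" or types.get(x) != "Protein":
            continue
        if types.get(y) != "Interaction_action":
            continue
        for rel2, z, y2 in dependencies:
            # obj(Z, Y), type(Z, Gene)
            if rel2 == "obj" and y2 == y and types.get(z) == "Gene":
                results.append((x, z))
    return sorted(results)


print(gene_interactions(TYPES, DEPENDENCIES))  # [('GerE', 'sigK')]
```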
The 89.4% F-measure obtained on this LLL subset is significantly better than the best LLL challenge score on the same dataset (65.5%) [32] and than the BioCreative II result (48%) on the protein-protein interaction task [33]. We tested our system with an altered representation of the same data, where syntactic dependencies were replaced by word neighborhood relations. Given the poor results obtained (34.7% recall and 22.8% precision), we conclude that syntactic dependencies convey major semantic relation information.

4 IR Experimentation in Biology

We have designed the BioAlvis version of Alvis for the evaluation of the benefit of ML-based semantic annotation and for the delivery of knowledge-based applications (e.g. IR) to biologists. This section reports on the experimental evaluation of the BioAlvis semantic annotation for semantic search and its comparison to other indexing and search models. We characterize Alvis search as semantic in the sense that it automatically interprets the meaning of the query with respect to the terminology and the ontology: Alvis searches for more specific and variant terms and it assists query refinement by ontology and terminology navigation (see [34] for more details).

We compare Alvis retrieval capabilities to three representative IR services that are intensively used by specialists in specific domains and particularly in Biology: (1) Google and (2) Google Scholar represent automatic full-text indexing with shallow linguistic processing and (3) PubMed Entrez is representative of hand-crafted indexing by thesaurus keywords and full-text indexing without linguistic processing. The comparison focuses on the effect of semantic annotation and query expansion on the answer set quality. We exclude the effects of result sorting (ranking) and of interface facilities (query refinement). Although they are obviously important features, they are outside the scope of the evaluation.

4.1 Test Data

The reported experiments concern the adaptation of enterobacteria to changes in their environment.
Enterobacteriaceae is a large family of intestinal bacteria that includes many human pathogens, such as the well-known Salmonella, which causes inflammation of the intestine (gastroenteritis). Their virulence is due to their capacity to survive and grow in the hostile environmental conditions imposed by their hosts (acidity, high temperature, or oxidative stress induced by iron starvation and superoxide radicals). Part of these conditions is due to the response of the host organisms to pathogen infection. A deep understanding of the molecular mechanisms by which bacteria respond to these stress factors is a key step toward the design of more efficient drugs. The goal of the search is to find descriptions of pathogen reactions; it was translated into the following query: enterobacteria stress genome component.

In order to test BioAlvis, the Bacteriology document collection was built by first querying PubMed with all bacterial genera names from the GenBank taxonomy. We then used a Bayesian filter to exclude documents that were not bacteriology stricto sensu. The result is a medium-size corpus containing 322,982 references of 70 words on average. This corpus was processed by BioAlvis in 60 hours on a cluster of 20 processors. The resulting semantic annotation was indexed and supplied to the Alvis search engine. Table 2 summarizes the main figures of the acquired linguistic resources (as described in section 3) and tagging features.

Table 2. Annotation of the Bacteriology corpus.

Type of resource        Size of the resource          Tagging
Gene/protein names      1,686,244 different forms     2,046,262 occurrences
                        666,797 canonical forms       200,225 different names
                                                      12% of the dictionary
                                                      Avg. 6 gene or protein names / doc.
Species names           748,262 different forms       1,309,801 occurrences
                        270,159 canonical forms       30,985 different names
                                                      4% of the dictionary
                                                      Avg. 4 species names / doc.
Terms                   7,279 canonical forms         2,449,669 occurrences
                                                      5,804 different terms
                                                      80% of the dictionary
                                                      Avg. 7 terms / doc.
Conceptual hierarchy    5,888 concepts                2,305,747 occurrences
                        (831 > level 0)               740 concepts of level > 0
                                                      89% of the concepts > 0
                                                      Avg. 11 concepts / doc.

The annotation is dense due to the type of documents and the corpus-based strategy of the lexicon acquisition. For instance, the BioCreative II corpus contains on average 4.6 gene or protein names per document, whereas our corpus contains 6.

4.2 Compared Systems

Google and Google Scholar index very large collections. Google references around 24 billion web pages. The size of Google Scholar is estimated at more than one billion references. Both systems perform simple stemming on documents and queries. Our hypothesis is that, in specific domains, they will (1) retrieve more incorrect results than semantic search because they do not disambiguate words, and (2) miss relevant documents because they do not exploit synonymy and related terms.

In Entrez PubMed, each indexed reference is manually assigned a set of terms from the MeSH thesaurus that are representative of the document topic. Manual annotation avoids ambiguities in document indexing but is quite expensive to maintain, since it requires highly trained experts who read the full-text articles. Entrez PubMed searches for query terms in the full text without any linguistic analysis, and also in the MeSH term index, by expanding the query with synonyms and more specific terms according to the MeSH thesaurus. In all cases, the resolution of query ambiguity is postponed to query refinement by the user.

To illustrate the strategy of BioAlvis, we detail how the example query enterobacteria stress genome component is processed: the words are lemmatized; the recognized semantic units are normalized and assigned to the BacteriOntology concepts that belong to taxonomies: enterobacteria, stress and genome component. BioAlvis expands the search to documents where sub-concepts of the query term occur.
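As an illustration only, the processing steps just described (lemmatization, recognition of semantic units, expansion to sub-concepts and synonyms) can be sketched in Python. This is a minimal sketch under stated assumptions, not the BioAlvis implementation: the dictionaries SUB_CONCEPTS, SYNONYMS and LEMMAS are toy stand-ins for the ontology, the terminology and the lemmatizer, all function names are hypothetical, and the greedy longest-match segmentation is an assumption made for the example.

```python
# Toy stand-ins for the ontology and terminology (assumed for illustration;
# the real BacteriOntology is far larger and its format is not described here).
SUB_CONCEPTS = {
    "enterobacteria": ["escherichia coli", "salmonella enterica"],
    "stress": ["heat-shock", "phosphate starvation"],
    "genome component": ["operon", "promoter"],
}
SYNONYMS = {
    "heat-shock": ["temperature upshift", "thermal upshock", "temperature upshock"],
}
# Hypothetical lemma table standing in for a real lemmatizer.
LEMMAS = {"stresses": "stress", "stressing": "stress", "components": "component"}


def lemmatize(query):
    """Lowercase the query and map each word to its lemma when one is known."""
    return [LEMMAS.get(word, word) for word in query.lower().split()]


def segment(lemmas):
    """Greedily group lemmas into the longest units known to the ontology."""
    units, i = [], 0
    while i < len(lemmas):
        for j in range(len(lemmas), i, -1):
            candidate = " ".join(lemmas[i:j])
            # Fall back to the single word when no longer unit is known.
            if candidate in SUB_CONCEPTS or j == i + 1:
                units.append(candidate)
                i = j
                break
    return units


def expand(concept):
    """Return the concept, all of its descendants, and their synonyms."""
    terms, visited, stack = set(), set(), [concept]
    while stack:
        current = stack.pop()
        if current in visited:
            continue
        visited.add(current)
        terms.add(current)
        terms.update(SYNONYMS.get(current, []))
        stack.extend(SUB_CONCEPTS.get(current, []))
    return terms


def expand_query(query):
    """Map each semantic unit of the query to its set of acceptable terms."""
    return {unit: expand(unit) for unit in segment(lemmatize(query))}


def matches(document, expanded):
    """A document matches when every unit has at least one term in its text."""
    text = document.lower()
    return all(any(term in text for term in terms) for terms in expanded.values())
```

With these toy resources, expand_query("enterobacteria stress genome component") yields three units, and a document mentioning Escherichia coli, thermal upshock and promoter matches even though it contains none of the original query words. The sketch matches terms by plain substring search, whereas a system of the kind described here would search the semantic annotations produced for each document.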
For instance, the taxonomic group of enterobacteria contains Escherichia coli and Salmonella enterica among hundreds of other bacteria species. In the same way, stress covers 17 different types of stress, such as heat-shock and phosphate starvation. Likewise, genome component covers 62 different sub-concepts (e.g. operon and promoter). Each of these concepts references its variants and synonyms. For instance, heat-shock is a synonym of temperature upshift, thermal upshock, and temperature upshock according to our terminology. Additionally, query lemmatization allows BioAlvis to search regardless of word inflections and derivations (stress / stressing / stresses). The interface displays the details of the interpretation so that the result is understandable.

4.3 Experiment and Evaluation

As we could benefit from Bacteriology expert analysis, we opted for a qualitative evaluation of our system. Beyond raw figures, a comparative study of the answer sets characterized the documents missed and the irrelevant documents retrieved by each service. More complete results can be found in [34].

Table 3 summarizes the features of the answer set for the four IR services, including Alvis. The very large answer set of Google (245,000) was expected because of the document collection size and the generality of the query. Google and Google Scholar search for the stemmed query words in the documents. As no semantics is used, documents that mention sub-concepts rather than the query words themselves were missed. We tested the query with genome component replaced by (gene OR promoter), two productive sub-concepts, and found that 50% more documents were retrieved. As Google and Google Scholar make use of stemming, they find 8 times more documents than with exact matches. For instance, documents with enterobacterial were found thanks to stemming.

Table 3. Size of the query answer sets for the tested search services.

Google     Google Scholar   Entrez PubMed   Entrez PubMed w/o MeSH   BioAlvis
245,000    2,740            1,031           0                        1,870

Entrez PubMed expands queries in a similar way to BioAlvis, by following MeSH relations top-down. The term enterobacteria is expanded into tens of sub-concepts, as is genome component. The query yields 1,031 relevant answers, but documents about specific stresses, such as phosphate starvation (97), were missed because stress is not defined in MeSH, despite its importance in Biology, and is therefore searched without any specialization. Five more documents could have been found if Entrez PubMed had lemmatized the documents. When the MeSH term index is disabled, no document is retrieved, as no paper full text contains all the query words.

BioAlvis retrieved fewer documents than Google Scholar for two reasons: (1) its document collection is smaller, and (2) Google Scholar indexes the full text of scientific papers whereas BioAlvis only indexes abstracts and titles. This does not call into question the semantic annotation approach, only the preparation of the document collection. BioAlvis also missed relevant documents because some relevant sub-concepts of stress, such as acid shock, are lacking in the ontology. This can be addressed by completing the ontology from a larger training corpus.

Regarding the relevance of the documents, the accuracy of the answer sets varies considerably among the services. Google and Google Scholar results contain a vast number of false positives. This is mainly because the answer set contains many documents that mention only a subset of the query terms. These documents are ranked very low, but they are nevertheless counted in the answer set. The number of false positives in Google Scholar is smaller because the indexed corpus is smaller and more focused. Beyond the main problem of spurious co-occurrence of the query words mentioned above, the indexing of irrelevant subparts of the document caused many errors.
For instance, citations of the document as they occur on the Citeseer or SpringerLink sites are indexed with the document itself. BioAlvis retrieves false positives to a much lesser extent. Most of the irrelevant documents were papers about organisms other than Enterobacteriaceae, many of which mention a homology with the extensively studied enterobacterium Escherichia coli. This observation stresses the importance of filtering semantic annotations for IR purposes, so that semantic annotation focuses on the main topics of the paper.

5 Conclusion

While formal languages for ontology representation have made great advances, there are few formal or operational proposals designed to tie ontologies to linguistic knowledge [35]. Ontologies can no longer be considered as organized vocabularies or hierarchies of terms that can be simply mapped to the text for semantic markup. Intermediate levels of linguistic knowledge are necessary to connect the textual information to the conceptual knowledge. Sophisticated and operational NLP platforms such as Alvis are available for developing such integrated applications. Still, the cost of maintaining and configuring them increases exponentially with the complexity of the linguistic knowledge. As highlighted above, the linguistic knowledge is scattered across various heterogeneous resources in order to feed distinct successive linguistic analyses. In this paper, we pointed out the challenge of integrating ontology knowledge and linguistic knowledge into a consistent model. In order to alleviate the lack of specialized knowledge to feed NLP tools, knowledge acquisition and ML methods are applied to training corpora. This raises the problem of integrating the processes of knowledge resource acquisition and the exploitation of these resources. We proposed an operational approach based on the clear specification of the learning task and the normalization of the example representation.
Following these principles, we developed large resources in Biology for each linguistic step and demonstrated their efficiency through the semantic annotation of a representative Web corpus and its use in an IR application [36].

Acknowledgment. This research has been partially supported by the RNTL ExtraPloDocs and the FP6-STREP Alvis projects. Special thanks are due to Philippe Bessières. Without his expert participation in Biology, BioAlvis development would not have been possible. The terminology part of this work has greatly benefited from Annick Lacombe’s expertise.

References

1. Entrez PubMed. http://www.ncbi.nlm.nih.gov/sites/entrez
2. Buitelaar P., Declerck T., Sacaleanu B., Vintar S., Raileanu D., Crispi C. A Multi-Layered, XML-Based Approach to the Integration of Linguistic and Semantic Annotations. In Proceedings of EACL NLPXML’03 Workshop, Budapest, Hungary, 2003.
3. Fensel D., Hendler J. A., Lieberman H. and Wahlster W. (Eds.) Spinning the Semantic Web: bringing the World Wide Web to its full potential. Cambridge, MA: MIT Press, 2003.
4. Bontcheva K., Tablan V., Maynard D., Cunningham H. Evolving GATE to Meet New Challenges in Language Engineering. Natural Language Engineering, 10:349-373, 2004.
5. IBM Unstructured Information Management Architecture. http://domino.research.ibm.com/comm/research_projects.nsf/pages/uima.index.html
6. Nédellec C., Nazarenko A. and Bossy R. Ontology and Information Extraction. In Ontology Handbook, S. Staab, R. Studer (eds.), Springer Verlag, to appear, 2008.
7. Collier N. and Takeuchi K. Comparison of character-level and part of speech features for name recognition in biomedical texts. J. of Biomedical Informatics 37, 423-435, 2004.
8. Yeh A., Morgan A., Colosimo M., Hirschman L. BioCreAtIvE Task 1A: gene mention finding evaluation. BMC Bioinformatics, 6(Suppl 1), 2005.
9. Marquez L., Padro L., Rodriguez H. A machine learning approach to POS tagging. Machine Learning Journal, Vol 39, Iss 1, pp 59-91, 2000.
10. Faure D. and Nédellec C. A Corpus-based Conceptual Clustering Method for Verb Frames and Ontology Acquisition. In Adapting lexical and corpus resources to sublanguages and applications, workshop of the 1st LREC, p. 1-8, Velardi P. (Ed.), Granada, Spain, 1998.
11. Buitelaar P., Cimiano P., Loos B. Proceedings of the Workshop on Ontology Learning and Population at the 16th ECAI, Valencia, Spain, 2004.
12. W3C Semantic Web Health Care and Life Sciences Interest Group. http://www.w3.org/2001/sw/hcls/
13. Alvis project. http://cosco.hiit.fi/search/alvis.html
14. Hamon T., Nazarenko A., Poibeau T., Aubin S., Derivière J. A Robust Linguistic Platform for Efficient and Domain specific Web Content Analysis. Proceedings of RIAO, Pittsburgh, 2007.
15. Zebra software. http://www.indexdata.dk/zebra/
16. Kripke S. A. Naming and Necessity. In Semantics of Natural Language, D. Davidson, G. Harman (eds.), Reidel, Dordrecht, pp. 253-355, 762-769, 1972.
17. Witten I. H., Frank E. Data Mining: Practical machine learning tools and techniques, 2nd Edition. Morgan Kaufmann, San Francisco, 2005.
18. Kim J.-D., Ohta T., Tsuruoka Y., Tateisi Y. and Collier N. Introduction to the Bio-Entity Recognition Task at JNLPBA. Collier et al. (eds), Proc. of NLPBA/Coling wshp, 2004.
19. Ando R. K. BioCreative II Gene Mention Tagging System at IBM Watson. Proceedings of the Second BioCreative Challenge Evaluation Workshop, 2007.
20. Nédellec C., Bessières P., Bossy R., Kotoujansky A. and Manine A.-P. Annotation Guidelines for Machine Learning-Based Named Entity Recognition in Microbiology. In Proceedings of the ECML/PKDD workshop Data and Text Mining in Integrative Biology, M. Hilario and C. Nedellec (eds), 40-54, 2006.
21. McCray A. T., Browne A. C., Bodenreider O. The lexical properties of the Gene Ontology. In Proceedings of AMIA Symposium, San Antonio, 2002.
22. MeSH thesaurus. http://www.nlm.nih.gov/mesh/
23. The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nature Genetics 25: 25-29, 2000.
24. Aubin S. and Hamon T. Improving Term Extraction with Terminological Resources. In Proceedings of FinTAL'2006, pp. 380-387.
25. Jacquemin C. A Symbolic and Surgical Acquisition of terms Through Variation. In Connectionist, Statistical and Symbolic Approaches to Learning for NLP, Wermter S., Riloff E. & Scheler G. (eds), pp. 425-438, Springer-Verlag, 1996.
26. Pyysalo S., Salakoski T., Aubin S., and Nazarenko A. Lexical adaptation of Link Grammar to the biomedical sublanguage: a comparative evaluation of three approaches. BMC Bioinformatics, 7(Suppl 3), 2006.
27. Aubin S., Nazarenko A. and Nédellec C. Adapting a General Parser to a Sublanguage. Proceedings of RANLP'05, pp 89-93, Borovets, Bulgaria, 2005.
28. Bisson G., Nédellec C. and Canamero D. Designing clustering methods for ontology building - The Mo'K workbench. In Proc. of the workshop on Ontology Learning (ECAI-2000), Staab S. et al. (Eds.), p. 13-19, Berlin, 2000.
29. Manine A.-P., Alphonse E. and Bessières Ph. Genic Interaction Extraction by Reasoning on an Ontology. In SMBM’2008, pp. 93-100, 2008.
30. Nédellec C., Ould Abdel Vetah M., and Bessières P. Sentence Filtering for Information Extraction in Genomics, a Classification Problem. In PKDD'2001, p. 326-338, 2001.
31. LLL dataset. http://genome.jouy.inra.fr/texte/LLLchallenge/
32. Nédellec C. Genic Interaction Extraction Challenge. In Proc. of the Learning Language in Logic (LLL05) workshop joint to ICML'05, Cussens J. and Nedellec C. (eds), Bonn, 2005.
33. Krallinger M. The interaction-Pair and Interaction Method Sub-Task evaluation. In Proceedings of the BioCreAtIvE II Workshop, 2007.
34. Buntine W., Zhou L., Toan V. L., Hamon T., Ardö A., Nazarenko A., Nedellec C., Pedersen G. and Podnar Y. Report on Tests, D8.3 IST-FP6 Alvis project, 2007.
35. Buitelaar P., Declerck T., Frank A., Racioppa S., Kiesel M., Sintek M., Engel R., Romanelli M., Sonntag D., Loos B., Micelli V., Porzel R., Cimiano P. LingInfo: Design and Applications of a Model for the Integration of Linguistic Information in Ontologies. In Proc. of OntoLex06, a Workshop at LREC, Genoa, Italy, 2006.
36. http://genome.jouy.inra.fr/alvis/front