A Survey of Identifiers and Labels in OWL Ontologies

Nor Azlinayati Abdul Manaf, Sean Bechhofer, and Robert Stevens

School of Computer Science
The University of Manchester, UK
norazlinayati.abdulmanaf@postgrad.manchester.ac.uk, {sean.bechhofer, robert.stevens}@manchester.ac.uk

Abstract. We present a survey of the usage and style of identifiers and labels of named entities in a corpus of OWL ontologies. We investigated how frequently those ontologies use labels and how frequently they use meaningful or meaningless identifiers. We also surveyed common practices of lexical encoding styles for identifiers. We found that most ontologies do not use labels for named entities. When they do use labels, those labels are mostly meaningful, and most ontologies also use meaningful identifiers. CamelCase style appears to be the most widely used style of lexical encoding for identifiers. We observed, however, that the majority of the ontologies use a mixture of two or more lexical encoding styles. The result of this survey is useful when considering strategies for tasks such as natural language generation from ontologies, or converting artefacts such as OWL ontologies into languages like the Simple Knowledge Organisation System (SKOS), where the notion of label is important. Given that labels are optional in OWL ontologies, what is the best way to handle the label selection when converting them into SKOS? Merging multiple entities may require selection from labels or identifiers assigned to these entities for skos:prefLabel and skos:altLabel.

Keywords: survey, identifiers, labels.

1 Introduction

In this paper we present a survey of how identifiers and labels are used within Web Ontology Language (OWL) (footnote 1) ontologies. We are interested in transforming such ontologies into other forms, such as natural language, and into other Semantic Web representations such as the Simple Knowledge Organisation System (SKOS) (footnote 2).
In these transformations it is important to be able to deal with both identifiers and labels in OWL ontologies. In natural language generation, for example, a human-understandable form of the entity needs to be available to place within a natural language setting [1]. In SKOS, a concept has alternate and preferred labels: where do these labels come from (identifier or label), and how is a choice made between preferred and alternate labels? [2–4]. As OWL does not mandate the use of labels, and an Internationalized Resource Identifier (IRI) alone can be used to ‘identify’ an entity for both machine and human, how does any transformation programme deal with labels and identifiers? [5]. What labelling and identifier situations should a designer of such a system expect to encounter?

1 http://www.w3.org/2004/OWL/
2 http://www.w3.org/2004/02/skos/

CEUR Workshop Proceedings, ceur-ws.org, ISSN 1613-0073

This survey was thus motivated by a need to understand the degree and style of use of labels and identifiers in OWL ontologies. Once this is known, strategies to deal with various situations can be devised with an understanding of the cost and benefit of realising those strategies.

Ontologies are used to capture knowledge about some domain of interest. An ontology describes the concepts in the domain and also the relationships that hold among these concepts. These concepts can be represented by classes or individuals, and the relationships are represented using properties. In OWL, the concepts and properties can be referred to as entities. A named entity refers to a named class, a named individual or a named property. Each named entity must have a unique identifier, called an IRI. An IRI refers to an object that can act as a reference to something that has identity.

Identifiers are not only used by computers, but also by humans. Humans prefer using meaningful identifiers, in which the name encapsulates the nature of the entity that it names.
An identifier is meaningful if there is a direct relationship between the natural language term used and the characteristics of the entity being identified. For example, if the entity is used to represent the concept “dog”, then using dog as an identifier for this concept makes the identifier meaningful (to an English speaker). It is also possible, however, to have a meaningless identifier, also called a “semantic-free” identifier. For example, ABC_20020 is a semantic-free identifier. An identifier is meaningless or semantic-free if there is no direct relationship between the natural language term used and the characteristics of the entity being identified.

In OWL it is possible to separate the IRI for the entity and the label for that entity (usually provided through rdfs:label). When identifiers are meaningless in human terms, the entity needs a label that is a natural language term for that entity. This separation can have several desirable effects, including the ability to provide renderings in different languages, and being able to change the label without having to change the identity of the entity (which is useful when the ontology is being used to encode data).

As identifiers in OWL can contain no spaces, meaningful identifiers that would normally contain spaces have to be encoded in a way that excludes spaces but retains the meaningful nature of the identifier for human readers. An identifier can be encoded in various lexical encoding styles. The styles used in meaningful IRIs include camel case style, which uses internal upper case letters within an identifier to denote word boundaries (e.g. PetOwner or petOwner); underscore style (_) (e.g. pet_owner); and hyphen style (-). This means that a meaningful identifier needs some processing to bring it into a conventional form for human reading, that is, a form containing spaces between words.
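As a rough illustration of the processing described above, the following Python sketch splits an identifier at camel case boundaries, underscores and hyphens to recover a space-separated form. It is not the implementation used in the survey: the function names word_separators and normalise and the regular expression are assumptions made for illustration, the identifier pet-owner is an invented hyphen-style example, and the choice to leave letter case unchanged is arbitrary.

```python
import re

# Zero-width boundary before an upper-case letter that starts a new word:
# either after a lower-case letter or digit (petOwner), or between an
# acronym and a following capitalised word (OWLClass -> OWL Class).
CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])")


def word_separators(identifier: str) -> set:
    """Return which word-boundary conventions appear (hypothetical helper)."""
    found = set()
    if CAMEL_BOUNDARY.search(identifier):
        found.add("camelCase")
    if "_" in identifier:
        found.add("underscore")
    if "-" in identifier:
        found.add("hyphen")
    return found


def normalise(identifier: str) -> str:
    """Rewrite an encoded identifier in space-separated form (hypothetical helper)."""
    spaced = CAMEL_BOUNDARY.sub(" ", identifier)  # PetOwner -> Pet Owner
    spaced = re.sub(r"[_-]+", " ", spaced)        # pet_owner -> pet owner
    return " ".join(spaced.split())               # collapse repeated spaces


if __name__ == "__main__":
    for name in ["PetOwner", "petOwner", "pet_owner", "pet-owner", "dog"]:
        print(name, "->", normalise(name), sorted(word_separators(name)))
```

An identifier for which word_separators returns an empty set is a single word; one for which it returns more than one entry mixes conventions within a single name.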
We used a survey to determine the current use of labels and identifiers in ontologies, including the naming conventions of identifiers. The goal was to allow us to answer the following questions:

1. Given that labels are optional, what is the frequency of label use in an ontology?
2. Given that a label should be meaningful, and if labels are used in an ontology, what is the frequency of meaningful labels?
3. Given that an entity could have multiple labels, what is the frequency of entities having multiple labels?
4. What is the frequency of meaningful and meaningless identifiers used in an ontology?
5. What is the frequency of the following combinations of identifiers and labels in an ontology?
   (a) an entity with a meaningful identifier and meaningful label(s)
   (b) an entity with a meaningful identifier and meaningless label(s)
   (c) an entity with a meaningful identifier and no label
   (d) an entity with a meaningless identifier and meaningful label(s)
   (e) an entity with a meaningless identifier and meaningless label(s)
   (f) an entity with a meaningless identifier and no label
6. What is the frequency of the camel case, underscore and hyphen encoding styles used in an ontology to encode identifiers?

2 Materials and Methods

An overview of the methods used to answer these questions is:

1. Corpus preparation;
2. Isolation of identifiers and labels;
3. Determination of whether the identifiers and labels are meaningful or meaningless;
4. Result recording;
5. Data analysis.

2.1 Corpus Preparation

In this survey we used ontologies in the TONES repository (footnote 3). We also searched on Google using filetype:owl for more OWL ontologies to be added to our corpus. From the search results, we looked through each ontology for OWL constructs such as owl:Class, owl:Individual, owl:ObjectProperty or owl:DataProperty; an ontology containing such constructs was considered a “valid” OWL ontology for this survey.
All collected ontologies from both sources were compared to eliminate duplication. We utilized the OWL API (footnote 4) for loading and managing the OWL ontologies. All ontologies were stored locally for future reference.

3 http://owl.cs.manchester.ac.uk/repository/
4 http://owlapi.sourceforge.net/

2.2 Isolation of identifiers and labels

For each ontology, we isolated the identifiers and labels for each named entity. The identifier was extracted from the IRI of each named entity. If the IRI contained a fragment identifier, then the identifier for this entity is the fragment identifier (the part after the # character). For example, for the named class http://owl.cs.manchester.ac.uk/2010/people#person, we extracted person as the identifier. Otherwise, we took the last portion of the path component as the identifier (the part after the last / character) (footnote 5). For example, for the named class http://owl.cs.manchester.ac.uk/2010/pizza/pizzaTopping, we extracted pizzaTopping as the identifier.

Entity labels were identified through the annotation property rdfs:label in the ontology. An entity is considered to have a label if one or more rdfs:label annotations are associated with the entity. We also considered labels made through sub-properties of rdfs:label.

2.3 Determination of whether the identifiers and labels are meaningful or meaningless

Our aim is to test a label or identifier against the Web to see if it is meaningful. Gaining many pages or ‘hits’ for a query based on a label or identifier would suggest that it is meaningful, based on the assumption that use of the string on Web pages suggests its use in natural language. This is a two-stage process.

Normalise the lexical encoding style of identifiers

Prior to this test, however, an identifier must be put into a form suitable for querying, as identifiers are formed with no spaces. As described in Section 1, identifiers are normally encoded in various lexical encoding styles.
In order to determine the meaningfulness of an identifier, a human reader mentally converts the encoded identifier into a space-separated form that can be more readily interpreted. We call this process ‘normalisation’. To check the meaningfulness of an identifier, we normalised the string by transforming it into a space-separated form. In order to do the transformation, we first needed to identify the style of lexical encoding used to encode the identifiers. We identified the following commonly used lexical encoding styles and limited our categorisation to these, placing any identifiers not using these styles in an “Other style” category.

1. CamelCaseStyle.
2. Underscore_style.
3. Hyphen-style.
4. HybridCamelCase_underscore_style.
5. HybridCamelCase-hyphen-style.
6. Hybrid-hyphen_underscore_style.
7. Single word.
8. Other style: any identifiers that are encoded using styles other than those mentioned above are grouped under this category.

5 http://www.ietf.org/rfc/rfc2396.txt

All single word identifiers are grouped in the “single word” category. This category can be considered a “wild card” as it is compatible with all other categories. Therefore, we used the following rules to decide the lexical encoding style used in the ontologies.

1. If all identifiers in an ontology are single words, then the ontology is classified as having only single word identifiers.
2. If some identifiers are single words and only one other lexical encoding style is used for the rest of the identifiers in an ontology, the single word category can be made compatible with that style, and the ontology is classified as having that one lexical encoding style. For example, if the rest of the identifiers in an ontology are encoded in camelCase style, then the single word category can be made compatible with camelCase style and the ontology is classified as encoded using camelCase style.
3. If some identifiers are single words and more than one lexical encoding style is used for the rest of the identifiers in an ontology, then the single word category cannot be made compatible with any one of these styles, and the ontology is classified as having a mixture of lexical encoding styles.

Once the lexical encoding style has been identified, we normalise the identifier into a space-separated form to be used in the meaningfulness check.

Check for meaningfulness

For our meaningfulness check we used Web search queries through the Bing API (footnote 6). For each label and normalised identifier, we sent a Web search query to find the number of Web pages containing the words in the label or normalised identifier. We are interested in the number of results returned from the search query.

6 http://www.bing.com/developers

There are three options for sending strings to the Web search query. First, the search string can be enclosed in quotation marks (“ ”). For example, the string hello world is searched as "hello world". Since this query searches for exact occurrences of the string on the Web, it returns a limited number of search results, because not all words in an identifier occur together in the natural language text on the Web. Second, the string can be searched without quotation marks, for example using hello world as the search string. This type of query searches for the words in an identifier occurring anywhere in a Web page, and not necessarily in the same order. The number of results returned by this type of query is moderate and acceptable. Third, if a string consists of multiple words, the words can be searched as separate strings, with or without quotation marks [6]. For example, hello world is searched as the separate queries hello and world. Since this approach searches for each word separately, it can return two different hit counts, and further processing is needed to determine which one should be chosen.
For this survey, we chose the second option: searching for the terms together without quotation marks. Additionally, we kept the words in the order in which they appear in the identifiers and labels. For example, if the normalised identifier is “hello world”, then we use the string hello world, with the same word order, for the Web search query. We set a threshold value of 100 hits to determine the meaningfulness of the searched term. A term whose hit count is below 100 is considered not meaningful. The choice of 100 as the threshold is a heuristic, based on running the check on a few ontologies from various domains and judging what threshold was reasonable. We found that ontologies with medical terminologies get fewer hits for meaningful identifiers.

2.4 Result recording

We recorded the results of this survey at various stages. All extracted identifiers and labels were recorded in XML format for future reference and analysis. We also recorded the hit counts for each of the identifiers and labels from the Web search query in CSV format for future reference and analysis.

2.5 Data analysis

Based on the recorded results, we calculated and recorded, for each ontology and for each named entity, the following:

1. frequency of labels used;
2. frequency of meaningful and meaningless labels;
3. frequency of identifiers with one label and with more than one label;
4. frequency of meaningful and meaningless identifiers;
5. frequency of the combinations of identifiers and labels;
6. frequency of lexical encoding styles.

For each entity type in each ontology, we also calculated each of these frequencies as a percentage of the total number of entities of that type. Finally, we calculated the mode, mean and median of these percentages for each of the criteria.

3 Results

We used 219 valid ontologies from the TONES repository (footnote 7), after discarding any URIs that no longer existed or ontologies that were too big to be loaded on our machine.
There were 354 hits returned from the search query (footnote 8) after all the duplicate results were omitted. After looking at each URL, we found that only 264 URLs represented valid OWL ontologies; the rest were URLs linked to pages that no longer existed. We also compared the list of URLs with the ontologies from the TONES repository to avoid duplication. Out of the remaining 241 ontologies, we randomly selected 87 ontologies to be added to our corpus, making a total of 306 ontologies.

7 As at 22 February 2010
8 As at 30 March 2010

Out of the 306 ontologies, 5 contained no named entities, leaving 301 ontologies in the corpus. There are 296 ontologies containing named classes, 105 ontologies with named individuals, 264 ontologies with named object properties and 138 ontologies with named data properties. The rest of the analyses were performed on ontologies that contained named entities. Table 1 shows the result summary with the number of ontologies for each criterion surveyed (footnote 9). The mean shown represents the mean of the proportion of the measured criteria.
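The way such a summary can be derived is sketched below in Python for one criterion and one entity type. This is an illustration under stated assumptions rather than the analysis script used in the survey: the per-ontology counts are invented toy values, the function name summarise is hypothetical, and we assume that the count is the number of ontologies in which the criterion occurs at least once, while the mean is taken over all ontologies that contain the entity type.

```python
from statistics import mean, median, multimode


def summarise(per_ontology):
    """Summarise one criterion for one entity type (hypothetical helper).

    per_ontology: list of (matching_entities, total_entities) pairs, one per
    ontology; ontologies without any entity of this type are skipped.
    """
    percentages = [100.0 * matching / total
                   for matching, total in per_ontology if total > 0]
    return {
        # number of ontologies in which the criterion occurs at all
        "count": sum(1 for p in percentages if p > 0),
        # statistics over the per-ontology percentages
        "mean": mean(percentages),
        "median": median(percentages),
        "mode": multimode(percentages),
    }


if __name__ == "__main__":
    # Toy data (not from the survey): (classes with a label, total classes)
    toy_counts = [(5, 10), (0, 4), (3, 3), (0, 8)]
    print(summarise(toy_counts))
```

Because the percentages are averaged per ontology, a small ontology in which every entity has a label contributes as much to the mean as a large one, which is one reason the count and the mean can diverge.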
                              Classes             Individuals         Object Properties   Data Properties
Type                          Count    Mean       Count    Mean       Count    Mean       Count    Mean
Total ontologies              296                 105                 264                 138
Ontologies with Labels        122      32.9%      32       20%        82       27.8%      23       14.4%
Meaningful labels             122      89.4%      32       94.4%      82       95.6%      23       91.7%
Meaningless labels            70       6.9%       12       5.6%       13       4.4%       6        8.3%
Ontologies with single label  121      93.9%      31       88.6%      81       97.5%      22       94.5%
Ontologies with multi labels  21       6.1%       8        11.5%      4        2.5%       2        5.5%
Meaningful identifiers        286      85.2%      103      90.6%      263      97.8%      137      97.2%
Meaningless identifiers       135      14.9%      49       9.4%       37       2.3%       23       3.5%
Meaningful identifiers
  with meaningful labels      107      18.8%      31       16.3%      81       25.2%      21       11.9%
  with meaningless labels     46       1.4%       9        1.3%       11       1.2%       3        0.7%
  with no label               242      65%        90       70%        200      71.3%      125      83.9%
Meaningless identifiers
  with meaningful labels      66       10.1%      9        3.2%       5        1.1%       3        1.5%
  with meaningless labels     53       2.6%       6        0.2%       7        0.1%       5        0.4%
  with no label               76       2.1%       42       7.1%       28       1.1%       16       1.7%
Lexical Encoding Style of Identifiers
  CamelCase style             116      64.9%      35       47.4%      133      52.5%      97       62.4%
  Underscore style            0        1%         4        7.3%       63       24.4%      4        5.9%
  Hyphen-style                0        0.8%       0        0.6%       5        2.1%       1        0.9%
  CamelCase underscore style  40       27%        5        23.4%      35       4.9%       4        7%
  CamelCase-hyphen style      1        1.6%       0        0.9%       0        0.4%       1        1.8%
  Hyphen-underscore style     0        0.1%       1        1%         0        0.1%       0        0%
  Single word                 4        4.5%       52       20.4%      0        14.9%      7        22.2%
  Others                      0        0.1%       0        0.01%      0        0%         0        0%
  Mixture                     135                 52                  58                  24

Table 1. Number of ontologies for different criteria surveyed. (The mean was calculated over the total ontologies for each entity type.)

9 The complete analysis of the result for this survey is made available at http://www.myexperiment.org/packs/110

4 Discussion

Table 1 provides basic answers to the questions raised in Section 1 in numerical terms. Here we present some observations based on an initial analysis of those results, along with closer examination of some of the ontologies.

First, we appreciate that the technique used to determine meaningfulness of both labels and identifiers, a search with a fixed cut-off threshold, is rather basic.
The threshold value was selected based on some preliminary experiments, but it is likely that a single static threshold value is not appropriate for all domains (see the discussion below). However, for the purpose of this survey, the technique is enough to show some interesting results. We are currently exploring possible mechanisms for selecting variable threshold values based on the content of each ontology, rather than having a single static cut-off value for all ontologies.

Labels are not widely used for any of the named entity types. However, when labels are used in an ontology, those labels are usually meaningful. In terms of the number of labels per entity, we observed that, for all named entity types, almost all ontologies with labels contained entities with single labels. Where an ontology does contain more than one label per entity, closer investigation revealed that the multiple labels were used to represent labels in different languages. Single labels usually represent labels in one language only.

Almost all of the ontologies used meaningful identifiers for named entities, with object property and data property entities showing the highest use of meaningful identifiers. Further analysis shows that those identifiers for object and data properties that were classified as meaningless are actually meaningful, but the meaningfulness test gave a hit count below the threshold (as discussed above). As for meaningless identifiers, even though the results show that quite a number of ontologies used meaningless identifiers, their percentage of usage (in terms of the proportion of entities in the ontologies) is quite small.

For all named entity types, most of the ontologies contained meaningful identifiers with no label. This observation supports our findings that labels are not widely used in the ontologies and that most ontologies do have meaningful identifiers.
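The identifier and label combinations discussed here can be sketched in Python as follows. The sketch applies the 100-hit threshold from Section 2.3 to hit counts that are invented for illustration; the names is_meaningful and combination are hypothetical. The survey does not state how an entity with several labels of differing meaningfulness is counted, so the sketch assumes that one meaningful label suffices.

```python
from collections import Counter

HIT_THRESHOLD = 100  # a hit count below this is treated as not meaningful


def is_meaningful(hits: int) -> bool:
    """Apply the fixed cut-off to a Web search hit count."""
    return hits >= HIT_THRESHOLD


def combination(identifier_hits: int, label_hits: list) -> str:
    """Name the identifier/label combination for one entity (hypothetical helper).

    Assumption: an entity has "meaningful label(s)" if at least one of its
    labels passes the threshold.
    """
    if is_meaningful(identifier_hits):
        ident = "meaningful identifier"
    else:
        ident = "meaningless identifier"
    if not label_hits:
        label = "no label"
    elif any(is_meaningful(h) for h in label_hits):
        label = "meaningful label(s)"
    else:
        label = "meaningless label(s)"
    return f"{ident}, {label}"


if __name__ == "__main__":
    # Toy entities (not from the survey): (identifier hits, hits per label)
    entities = [(25000, [1200]), (40, []), (310, []), (12, [90000, 3])]
    tally = Counter(combination(i, labels) for i, labels in entities)
    for name, n in tally.items():
        print(n, name)
```

A domain-specific threshold, as suggested above, would only require replacing the constant with a value chosen per ontology.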
Interestingly, we observed that there are also a few ontologies that use meaningless identifiers with meaningless labels or no labels. However, their mean percentage of use is rather small. We suspect again that our approach to identifying the meaningfulness of terms is a factor in this anomaly. Having a meaningless identifier and no label makes little sense; it is reasonable to suspect that specialised language will appear meaningless in the face of the simple threshold approach used.

As for lexical encoding styles, the results show that camel case style is the most used lexical encoding style for all named entity types. There are also a significant number of ontologies that use a mixture of lexical encoding styles. A small number of ontologies used lexical encoding styles that fall under the “others” category. Further analysis showed that the identifiers classified into this category used other punctuation symbols, such as the dot (.), to encode the identifiers. Some examples of identifiers in this category are as follows:

– E1.CRM_Entity (combination of dot (.) and underscore)
– E71.Man-Made_Thing (combination of dot (.), hyphen and underscore)
– erbB-2_Genes (combination of camel case style (erbB), hyphen and underscore)

The small numbers obtained for the “others” category suggest that the categories identified in Section 2.3 are indeed sufficient to characterize the bulk of ontologies in the corpus.

5 Related Work

There are several surveys that analyse Semantic Web documents, especially OWL ontologies, to help in understanding the nature of OWL ontologies. Bechhofer and Volz [7] surveyed a sample of 227 OWL ontologies to answer the question “how much OWL DL is there on the Web?” and found that the answer is “not much”. A majority of them are OWL Full, which in many cases was caused by syntactic errors such as missing type triples. However, they presented a patching technique for these errors that increases this “a little bit”. In [8], Wang et al.
extended the work in [7] to a much larger sample size. They were interested in evaluating those ontologies to determine trends in modeling practices, OWL construct usage and OWL species utilization. They surveyed a sample of 1300 ontological documents, covering not only OWL ontologies but also RDFS documents. The survey reported in our paper adds to these surveys and takes a finer-grained look at identifiers and labels within ontologies. The information gained is important, as discussed in the introduction, for deciding upon strategies for handling the ‘names of entities’ within software where some human-oriented presentation is required.

6 Conclusion

We found that most ontologies do not use labels for named entities. When they do use labels, these labels are mostly meaningful. Only a few ontologies have more than one label per named entity. We also found that most of the ontologies use meaningful identifiers and, where they do use meaningless identifiers, these identifiers make up only a small portion of the ontology. Most ontologies that have meaningful identifiers do not have labels. Interestingly, there are also a few ontologies that use meaningless identifiers with meaningless labels or no label, though this may well be an artefact of our test for meaningfulness. Camel case style appears to be the most widely used lexical encoding style for identifiers. However, most ontologies are inconsistent in their identifier encoding style, as more than one style is used to encode the identifiers within an ontology.

We hope to extend this survey to a larger corpus of ontologies, for example by collecting more ontologies from other sources such as Swoogle (footnote 10) and Watson (footnote 11).
It might also be interesting to extend the survey to investigate not only the use of labels and identifiers, but also other OWL constructs such as property restrictions, to gain a better understanding of the common practice of use of these constructs in existing OWL ontologies.

10 http://swoogle.umbc.edu/
11 http://kmi-web05.open.ac.uk/WatsonWUI/

We can raise further questions about the effect of the domain for which an ontology was built on its style of identifier and label use. For example, the Open Biomedical Ontologies consortium [9] has a policy of semantic-free identifiers and use of labels. In addition, it would be useful to answer the question of whether it makes a difference to identifier and label use if an ontology is ‘in service’ with a community (that is, it is actually being used to do a job of work) rather than being developed for research purposes.

When transforming OWL ontologies into other forms, such as natural language or other Semantic Web representations such as SKOS, an understanding of the use of labels and identifiers within the ontologies is beneficial. If nothing else, it allows developers to make judgements about situations for which strategies should be developed.

Acknowledgements: This work was funded in part by the SWAT project EP/G032459/1. The authors would like to thank Majlis Amanah Rakyat (MARA), an agency under the Malaysian Government, for funding the student. Many thanks to the reviewers who gave insightful comments and suggestions to improve this paper.

References

1. Smart, P.R.: Controlled natural languages and the semantic web. Technical report ITA/P12/SemWebCNL, School of Electronics and Computer Science, University of Southampton (2008)
2. Jupp, S., Stevens, R., Bechhofer, S., Kostkova, P., Yesilada, Y.: Document navigation: Ontology or knowledge organisation system? In: Network Tools and Applications in Biology (NETTAB’2007) - A Semantic Web for Bioinformatics: Goals, Tools, Systems, Applications. (2007)
3. Jupp, S., Stevens, R., Bechhofer, S., Yesilada, Y., Kostkova, P.: Knowledge representation for web navigation. In: Semantic Web Applications and Tools for the Life Sciences (SWAT4LS 2008) Workshop. (2008)
4. Abdul Manaf, N.A., Bechhofer, S., Stevens, R.: Exploring the relationships between OWL and SKOS. ISWC 2009 Doctoral Consortium (2009)
5. Cimino, J.J.: Desiderata for controlled medical vocabularies in the twenty-first century. In: Methods of Information in Medicine. (1998) 394–403
6. Alani, H., Brewster, C.: Ontology ranking based on the analysis of concept structures. In: Proceedings of the Third International Conference on Knowledge Capture (K-CAP 05), Banff, Canada, ACM (2005)
7. Bechhofer, S., Volz, R.: Patching syntax in OWL ontologies. In: Proceedings of the 3rd International Semantic Web Conference. (2004)
8. Wang, T.D., Parsia, B., Hendler, J.: A survey of the web ontology landscape. In: Proc. of the International Semantic Web Conference, ISWC. (2006)
9. Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L.J., Eilbeck, K., et al.: The OBO Foundry: Coordinated evolution of ontologies to support biomedical data integration. Nature Biotechnology 25(11) (2007) 1251–1255