ADDING VALUE TO BIODIVERSITY IMAGES THROUGH COMMUNITY ANNOTATION

Greg Riccardi
College of Information
Florida State University
Tallahassee, FL 32306-2100 USA
01-850-644-2869
Riccardi@ci.fsu.edu

Andrew Deans, David Gaitros, Neelima Jammingumpula, Corinne Jorgensen, Peter Jorgensen, Austin Mast, Karolina Maneva-Jakimoska, Debbie Paul, Fredrik Ronquist, Katja Seltman, and Steven Winner
School of Computational Science
College of Information
Department of Biological Science
Florida State University

1 Supported by NSF contract DBI-0446224, 2005-2008.
WWW 2007, May 8-12, 2007, Banff, Canada.

ABSTRACT
Morphbank, an on-line collection of museum-quality biological images, is an NSF-funded project designed to facilitate the on-line collaboration of biologists from around the world [3]. Our primary focus is to aid in the collection and management of images that are useful in phylogenetic research. Morphbank users are actively collaborating on the creation of information that represents the associations among images and related biodiversity data objects. This paper describes the Morphbank annotation tool and data models and gives examples of how users create structured information in the system. Schematized annotation provides biologists with a flexible framework to create semantically rich annotations using their own data models.

Keywords
Annotation, association, biodiversity

1. INTRODUCTION
The discovery, identification, and documentation of biological entities are time-consuming and tedious tasks. The differences between similar species may be so subtle that identification requires the collaboration of several experts. Each taxonomic group has many experts who can assist in the identification of specific organisms. However, with the increase in the number of newly discovered organisms and a decrease in the number of senior specialists, identification and curation of data have become more difficult. Often, identification has required scientists to travel to the location of the specimens, or specimens to be sent to the scientists for first-hand examination. This is still standard practice among most biologists today.

Morphbank contains information about organisms. Each image in the system is associated with one or more specimens. Each specimen is a representation of information about an organism. Specimens are in turn associated with localities, contributors, taxonomic concepts, and a variety of annotations.

The design and development of the Morphbank system identified several challenges in discovering and creating information about images and their related objects:

• Finding images and specimens associated with a specific species and genus,
• Finding and recording information about that image and its related objects, and
• The discovery and recording of ad-hoc associations among the various objects.

Discovering and recording ad-hoc data is the most problematic. It is particularly difficult to find ways that users can record associations among objects.

As long as data is well formatted and constrained to the database schema, finding and retrieving it is simple. However, as we have discovered, there is no practical limit to the amount of information a scientist may wish to store with a particular specimen. Most of the knowledge is contained in the memory of these scientists or in handwritten notebooks. Although it is recognized that manual annotation is expensive and time consuming, it is nevertheless still essential in documenting collaborative knowledge in biological systems [2]. Translating and storing this knowledge in a searchable form is the challenge.

2. BACKGROUND
Morphbank is an open Web repository of images serving the biological research community. It is currently being used to document specimens in natural history collections, to voucher DNA sequence data, and to share research results in disciplines such as taxonomy, morphometrics, comparative anatomy, and phylogenetics. Morphbank can serve as a virtual reference collection of named organisms or a resource for comparative morphological study; new use cases are continuously added [7]. Each image in the database is associated with a fully searchable set of text information. Additionally, images can be downloaded in several different formats [3]. Understanding the background of Morphbank is important to understanding the complexity of the problem of collaborating with other scientists on the identification and curation of biodiversity data.

2.1 MORPHBANK OBJECTS
Each object in the Morphbank system is uniquely identified and includes a set of standard fields that assist in cataloging: the location and type of the object, the identity of the user who added the object, the date and time of creation, an optional description of the object, and the time the object was last modified. These attributes give anyone accessing Morphbank sufficient information to find and catalog data and to associate related objects. Each object is externally identified by a Life Science Identifier (LSID) [13].

2.2 MORPHBANK OBJECT RELATIONSHIPS
Since each Morphbank object is uniquely identified, any object can be the target of a stored reference. A single column within a Morphbank table holding a foreign key may refer to an object of any type. Thus a collection object can be heterogeneous. For instance, an annotation object may define an association among images, specimens, locations, users, or even other annotations.

This flexibility allows for the creation of complex collections of objects that can be shared with other users of the Morphbank system. Although there is a series of predefined relationships in Morphbank, the use of unique identifiers allows users to define an unrestricted set of complex relationships among objects within the confines of the system.

Figure 1 shows the result of searching for images that are related to the taxon with id 30244, the species Asclepias amplexicaulis. The search looks through the known associations between objects to find the proper set. Each image in the set is associated with a specimen, which is associated with the proper taxon. The structure of these predefined associations allows the search to be both effective and efficient. The information about the images in Figure 1 comes from the image, its related specimen, and its related taxon.

Figure 1. The result of searching for images for a particular taxon

3. BIOLOGICAL ANNOTATION REQUIREMENTS
The users of the Morphbank database system have identified several requirements for image and object annotation to be used by authorized users of the system. These requirements are consistent with the Specifications For Image Annotation On The Semantic Web as described by the W3C in its draft document [5]. A major restriction placed on Morphbank development was that the annotation software must be accessible through a Web browser without the need to download an extensive set of client-based applications. This requirement was established because research biologists frequently travel from one location to another and often have access only to a Web browser. Additionally, annotations must be made in real time and directly to the actual data source to avoid the update anomalies associated with multiple copies of the data. Updates and annotations made by one scientist must be readily available to other colleagues for collaboration in a timely manner.

There has been considerable effort put into the development of general-purpose Web-based annotation tool sets over the past several years. In their paper on Web annotations, Venu Vasudevan and Mark Palmer [15] described an approach 6 years ago to the development of a Web-based annotation tool that could be used to annotate documents over the Internet using only a Web browser. However, they discovered several limitations in the use of Web browsers and of HTML as a layout language that made digital annotations somewhat cumbersome. The increased use of Javascript, higher speed communications, improved Web interface standards, and increased browser capability have made Web-based digital annotations more of a reality. However, there is still no convenient method for making annotations on the sides of Web pages as one would on paper documents [8].

The problem of biodiversity annotation is that biologists have increased the number of specimens they can gather but have not increased their ability to catalog, identify, and study them. Collaborations still include the exchange of physical specimens and the manual annotation of images using index cards and paper documents. At the functional level, many users have developed their own specific but proprietary solutions to this problem. Through the use of Morphbank and a Web-based annotation tool, we can solve most if not all of these problems.

3.1 MORPHBANK OBJECT ANNOTATION
A variety of annotation technologies allow users to add value to images by creating associations between those images, text, and other digital objects. Morphbank takes this one step further by making the associations into first-class objects that can themselves be annotated and associated with other objects. Morphbank also allows associations to take on specific semantic characteristics that constrain their meaning and thereby improve searching and understanding.

Image annotation is available in a variety of image management Web sites. The simplest annotations are found in systems that support attaching tags to images and other media. Flickr.com and YouTube.com, for example, allow users to add text attributes (tags) to images and use those tags to support searching. FotoTagger.com, among others, goes a step further and allows the tags to be attached to specific locations on images.

Blogging is another form of image annotation in which text passages are linked to images, Web pages, and other digital objects. A blog entry creates an association between its own text and the linked objects.

Annotea.org supports the creation of RDF attributes for image tags. These attributes can be used to provide search inference capabilities for users of image repositories.

Another annotation strategy involves the development of laboratory notebooks such as those under development at the United States Department of Energy, National Collaboratories under the guidance of Dr. Jim Myers [11]. These middleware products present researchers, applications, problem-solving environments (PSE), and software agents with a layered set of application services that provide a finite set of capabilities for the creation and management of metadata, the definition of semantic relationships between data objects, and the development of electronic research records [10]. Users are able to record associations between digital objects across and among projects.

Morphbank seeks to combine these ideas by incorporating an extensible annotation type system and by systematically expanding the scope of associations to include any objects referenced by globally unique IDs (GUID).

Morphbank was designed to allow users to take advantage of Web service products to gain access to the data by conforming to industry practices and standards while maintaining the ontology of the original data. Users will browse or search the Web site for Morphbank objects using a variety of tools provided through the Web site.

3.2 BASIC ANNOTATION TEMPLATE
An annotation is an assertion that a collection of objects is related in a particular way. For annotation and search purposes, the Morphbank object annotation tool provides a minimum set of tools common to all annotation requirements. The tool uses the terminology of the Darwin Core [1] biodiversity ontology initiative. We strove to keep the tool set as simple and straightforward as possible and to provide specializations that make it easy to create particular types of annotations.

Flexibility is particularly important because all annotations must be made using only a Web browser. The template for the tool defines several functional areas required for basic biodiversity annotation and specimen determination.

3.3 TYPES OF ANNOTATIONS
The ability to store complex metadata with annotations allows us to define associative semantic relationships with ad-hoc data and other Morphbank data. The data model that supports annotation is intended to be extended to incorporate additional types as needed by users. The categories of annotations in the current system are as follows:

• General: There are instances where users wish to make ad-hoc comments concerning a collection of images, specimens, or other objects. This type of annotation was required to allow maximum flexibility in storing comments, measurements, and other related data and associating them with the collection of objects. A very useful example of a general annotation is a simple collection of objects, much like a shopping cart, that can be stored, organized, and labeled for later use.

• Image: In a phylogenetic database, images are vitally important to the users of the system. Therefore, many of the annotation types described in this section apply specifically to images. The types of image annotations are:
  - Spot location on an image associated with the annotation. The user will identify a specific spot on the image to associate with a label, title, and paragraph description.
  - Circle associated with an area on the image. The user will place a circle encapsulating an area to associate with a label, title, and paragraph description.
  - Rectangle associated with an area on the image. The user will place a rectangle encapsulating an area to associate with a label, title, and paragraph description.

• Taxon Determination: Used for discussion concerning the species or other taxonomic determination of a specimen. Users will select a specimen and, using the associated images, make a recommendation as to the specific genus and species determination. Taxon determinations are extremely important to the research activities of the primary users.

• Phylogenetic Character and State: This type of annotation will be used to organize physical features (called "characters") of organisms into objects of interest to research users. Phylogenetic characters and possible values (states) of those characters are associated with specific images, with species, and with collections of species. In this type of annotation, the user will associate an image or specimen in the database with phylogenetic characters and states.

• Relationship: Morphbank comes standard with predefined data relationships. Relationship annotations allow the user to define additional relationships associating Morphbank objects with each other. The user will select any two Morphbank objects (image, specimen, view, location, publication, user, group, etc.) and then describe the relationship between the two.

4. EXAMPLES OF ANNOTATIONS
Specimen image annotation captures people's knowledge of species, such as new observations and disagreements with previous annotations. Image annotation enables semantic image retrieval and maintains a record of user comments concerning the data. Furthermore, a collection of featured annotations provides a way to assign species to a specimen. Image annotation associates textual information with a specific region of an image to enable semantic querying.

Two approaches are frequently used: text-based and field-based. The former simply adds keywords to the whole image using natural language. However, keyword-based retrieval returns irrelevant documents (i.e., it has low retrieval accuracy). A field-based method describes and retrieves an item using one or more field-value pairs, thus improving retrieval precision. Figure 2 shows an image annotation that uses the field-based approach. This annotation asserts that a particular portion of an image (of a wasp leg) is a femur.

Figure 2. An Image Annotation Example

However, both text-based and field-based approaches store the information in a plain text format. It is known that querying plain text is inefficient. Furthermore, storing annotation information using only plain text does not satisfy the higher-level requirements for the system. Meaning and ontology must be associated with the data. The heterogeneous data models of different biologists and the diversity of association types require frequent updates and evolving data structures.

Figure 3 shows a Morphbank image annotation in context. The annotation contains attribution (upper left), a small instance of the annotated image (upper right), detailed comments with technical terms highlighted (lower left), and brief descriptions of other annotations of the same image (lower right).

The annotation of Fig. 3 asserts that the wasp whose leg is shown has a particular feature, which is called "femur swollen medially". Such features are used by experts to categorize specimens into taxonomic units (genus, species, etc.) and, after analysis, to develop evolutionary models.

Morphbank is using annotation and association technology to collect information that is directly used in scientific research. Each of the Morphbank objects related to the annotation of Figure 3 (the image, the annotations, the related specimen, etc.) is represented as a first-class object with globally unique identity. Thus the objects can be stored in collections, included in other annotations, and referenced in external sites.

Figure 3. Image Annotation In Context

Mass annotations are possible as well. Figure 4 shows an interface that allows a user to annotate each of a group of objects. In this case, the user is preparing to comment on the species identification, also called the determination, of several botanical specimens. This annotation interface has been developed to enable a specific activity to be performed by experts on plant morphology.

Figure 4. Group Annotations

5. PRELIMINARY RESULTS
The Morphbank research team has been working closely with a group of botanists at the Department of Biological Sciences at Florida State University to use the annotation tool for the curation of specimens from the Robert K. Godfrey Herbarium at Florida State University. Figure 5 shows some of the Morphbank information for a typical herbarium sheet.

Figure 5. Morphbank display of the image of a herbarium sheet

Creating the determination annotation sheet began with interviews with domain experts and the evaluation of typical manual records. Figure 6 shows a detail of the herbarium sheet of Figure 5 that contains the information cards attached to the sheet. Two cards are attached. The lower card holds the primary information about the specimen, including who collected it, when, and where. The lower card also shows the species determination that was recorded when the specimen was collected.

Figure 6. Information card from herbarium sheet

The upper card shows a determination annotation that was added to the specimen in 1983. J. Farmer of the University of North Carolina agreed that the determination was correct.

In pencil, between the two cards, is a second annotation. D. D. Ward in 1983 also agreed on the correctness of the determination.

The Morphbank annotation tool is intended to allow the online collection and dissemination of information like that shown in Fig. 6. The tool will allow researchers to evaluate the determination of the specimen, that is, the association between each specimen and its taxon. The activity is an evaluation of the quality of the information stored in the herbarium.

A major benefit of the Web tools is their support for distributed collaboration. Before the sheets were

The annotation interface shown in Fig. 4 can be used to agree with the recorded determination of the set of specimens, or to disagree and select a different taxon. In this way the annotation represents a qualitative evaluation of the recorded information. Fig. 4 shows that 19 annotations already record agreement (A) with the determination.

The results so far are very promising. Fifteen taxonomists were asked to use Morphbank images of specimens from the Robert K. Godfrey Herbarium at Florida State University to make determination annotations for 50 specimens each. The scientists found the online tools to be an excellent replacement for the manual task. They were particularly pleased to be able to see the results online and to see the effects of this online collaboration.

An additional study of the feasibility of making determinations from images in lieu of physical specimens was conducted by bringing some of these experts to Florida. The study is ongoing. We hope to be able to establish that digital representations of these specimens are more than adequate replacements for the real objects.

6. CONCLUSION
We have described an existing need in the biological community to store and retrieve complex information on specimens and related images. In creating a Web site that stores the elements common to all entities in the Tree of Life, we have made biodiversity research more effective.

Our work in developing a tool that allows users to annotate images via the Web using only the essential elements has proven successful. The non-intrusive method permits biologists to mark images without altering the original image, and to share these annotations with others in an easy and open format. Our hope is that the work performed under this NSF grant by the Morphbank project will provide the Tree-of-Life initiative with a stable digital image database and annotation tool set that can be used by biologists around the world.

7. REFERENCES
[1] L. Alexander, A. Runyan, and V. Anderson. Taxonomic data working group, Darwin Core 2. TDWG.org
[2] A. Dingli, F. Ciravegna, and Y. Wilks. Automatic semantic annotation using unsupervised information extraction and integration. In Workshop on Knowledge Markup and Semantic Annotation, KCAP03, 2003.
[3] D. Gaitros, G. Riccardi, F. Ronquist, N. Jammigumpula, and W. Blanco. Morphbank, the development of a general purpose bioinformatics database. Conference on Internet Computing (ICOMP'05), pages 31-37, Jun 2005.
[4] L. Haas, D. Kossmann, E. Wimmers, and J. Yang. An optimizer for heterogeneous systems with non-standard data search capabilities. In special issue on query processing for non-standard data, IEEE Data Engineering Bulletin 19(4), pages 37-43, Dec 1996.
[5] C. Halaschek-Wiener, J. Hunter, N. Simou, J. Smith, and V. Tzouvaras. Image annotation on the semantic Web, Jan 2006.
[6] P. Korica, H. Maurer, and N. Scerbakov. Extending annotations to make them truly valuable. World Conference on E-Learning in Corporate, Government, Healthcare, and Higher Education (E-Learn) 2005, 2005.
[7] J. Liljeblad and F. Ronquist. A phylogenetic analysis of higher-level gall wasp relationships (Hymenoptera: Cynipidae). Systematic Entomology, 23:229-252, 1998.
[8] P. Marshall. Annotations: From paper books to the digital library. In Proceedings of the ACM Digital Libraries 97 Conference, Philadelphia, PA, Jul 1997.
[9] C. Meng. Biological information standards. Bulletin of the American Society for Information Science and Technology, 2004.
[10] J. Myers. http://collaboratory.emsl.pnl.gov/, 2004.
[11] J. Myers, A. Chappell, M. Elder, A. Geist, and J. Schwidder. Reintegrating the research record. IEEE Computing in Science and Engineering, May 2003.
[12] MySQL. http://dev.mysql.com/techresources/articles/mysql-5.1-xml.html.
[13] D. Smith, S. Martin, and B. Szekely. LSID (life science identifier) project, 2005. http://lsid.sourceforge.net.
[14] P. Spyns, R. Meersman, and M. Jarrar. Data modeling versus ontology engineering. SIGMOD Record, 31(4):12-17, December 2002.
[15] V. Vasudevan and M. Palmer. On Web annotations: Promises and pitfalls of current Web infrastructure. 32nd Hawaii International Conference on Systems Sciences, Jan 1999.