Social Web Meets Sensor Web: From User-Generated Content to Linked Crowdsourced Observation Data∗

Dong–Po Deng†, Institute of Information Science, Academia Sinica, Taipei, Taiwan
Guan–Shuo Mai, Biodiversity Research Center, Academia Sinica, Taipei, Taiwan
Tyng–Ruey Chuang‡, Institute of Information Science, Academia Sinica, Taipei, Taiwan
Rob Lemmens, Faculty of Geo–Information Science and Earth Observation (ITC), University of Twente, Enschede, Netherlands
Kwang–Tsao Shao, Biodiversity Research Center, Academia Sinica, Taipei, Taiwan

ABSTRACT
The reach of dominant social media like Facebook and Twitter in the current population is enormous, and these media have long been leveraged for diverse applications. In particular, for some citizen science projects, existing social media increasingly become platforms on which participants interact and contribute. These user contributions, often termed User-Generated Content (UGC), can be a mixed bag of posts, comments, images, and other media. We report in this paper a work-in-progress in formalizing user contributions from a large Facebook group (more than 4,000 users) established for biodiversity observation. A major part of our work is to extract structured datasets with well-defined semantics from unstructured UGC collections. We use common vocabularies from Darwin Core (DwC), Friend-of-a-friend (FOAF), Semantically-Interlinked Online Communities (SIOC), and Semantic Sensor Network (SSN), among others, to formalize the extracted datasets and hence make them readily linkable. A nice consequence of this approach is that a multi-faceted browser can be quickly built to explore biodiversity information in large collections of UGC.

Categories and Subject Descriptors
H.3.5 [Online Information System]: [Web-based services]; H.5.3 [Group and Organization Interfaces]: [Web-based Interaction]; I.2.4 [Knowledge Representation Formalisms and Methods]: Semantic Networks

General Terms
Management, Design, Human Factors.

Keywords
Citizen Science, Crowdsourcing, Facebook, GeoSPARQL, Linked Data, Sensor Network, User-Generated Content (UGC).

∗ This research is supported in part by the Ministry of Science and Technology (grant no. 102-2627-M-001-009) and by the Endemic Species Research Institute, Council of Agriculture, Taiwan. We are grateful to Te–En Lin and his group at the Endemic Species Research Institute for their help with the collected data.
† Dong–Po Deng is also a PhD candidate at the Faculty of Geo–Information Science and Earth Observation (ITC), University of Twente.
‡ Tyng–Ruey Chuang is also affiliated with the Research Center for Information Technology Innovation and the Research Center for Humanities and Social Sciences (Center for Geographic Information Science), both at Academia Sinica.

This paper is released under the Creative Commons Attribution 4.0 License. You are free to share and adapt this paper for any purpose, even commercially, as long as you give appropriate credit, provide a link to the license, and indicate if changes were made. These freedoms cannot be revoked as long as you follow the license terms. For a copy of the license, please visit .
Linked Data on the Web (LDOW2014), April 8, 2014. Seoul, Korea.

1. INTRODUCTION
Citizen science is a crowdsourcing mechanism that refers to a distributed, collaborative problem-solving model in which a crowd of undefined size is engaged to solve a complex or scientific problem through an open call [3, 20]. The participation of trained volunteers in scientific studies as field assistants has a long history [26]. However, the landscape of citizen science has been transformed by modern Web services and communications that enable people around the world to spread information. Social media are among the significant tools changing the ways information is produced and used in citizen science projects. A social media site can offer participants of citizen science projects not only a virtual environment for social interactions but also a platform for sharing, discussing, and modifying data together. On one hand, social media potentially provide situational awareness and opportunities for assistance on an individual level [12]. The communication channels make it possible for participants to share and manage their own sightings on a globally accessible database [29]. That is, the citizens are locally acting as human sensors, and social media are acting as platforms connecting these human sensors. On the other hand, social media enable scientists to reach out to a large number of people, over a large geographic region and over an extended time period, to introduce them to citizen science projects. Therefore, the use of social media has greatly increased citizen participation and improved the data collection process in citizen science projects. Such a crowdsourced approach can often reduce cost and effort in data management and exchange [26].

However, to utilize social media for citizen science projects, there is a need to bridge a knowledge gap between humans and machines. When social media are used for collecting participants' observations, it is often hard to control the quality of the content. Social media applications and services facilitate social interactions, but not scientific activities and data exchanges. Valuable scientific content is mixed up with huge amounts of noisy, low-quality, unstructured text and media. Often a crowdsourcing effort only creates human-readable content but not machine-readable data. Moreover, the lack of sufficient metadata for crowdsourced data often makes it difficult to derive meaningful interpretations from the data. Correspondingly, data integration and sharing across different knowledge domains is hampered. Semantic computing on crowdsourced data requires not only text mining for extracting valuable information from user-generated content but also semantic enrichment for interpreting the meaning of the extracted information.

An ontology, as a “shared conceptualization”, plays an important role as the basis of connections between datasets [14]. This is because an ontology presents a formal model for knowledge representation geared towards resolving semantic ambiguity, and consequently it contributes to the achievement of semantic interoperability between information communities [17]. Linked Data refers to the publication of structured data on the Web in such a way that it is machine-readable, its meaning is explicitly defined, it is linked to other external datasets, and it can in turn be linked to from external datasets [2]. Technically, the Linked Data paradigm combines knowledge representation technologies, e.g. RDF and OWL, with traditional Web technologies, e.g. HTTP and REST, for publishing and interlinking data and information [28]. These technologies enable an evolving transition from the current document-oriented Web into a Web of interlinked data and, ultimately, into the Semantic Web [1].

This paper reports our experiences on processing crowdsourced data from social media into interlinked data for the Web. The process can be elaborated by the following:

• how the crowdsourced observation data can be transformed and represented by an ontology of citizens as sensors,

• how the crowdsourced observation data can be interlinked with other Linked Data resources such as biodiversity (TaiCOL) and geospatial information (Geonames),

• how the crowdsourced observation data can be accessible to machines by using the Linked Data paradigm and be readable for humans by means of a faceted browser.

The paper is organized as follows. After introducing the citizen science project in Section 2, we describe how named-entities can be extracted from crowdsourced data, and the evaluation of the information extraction, in Section 3. We explain the design of the synthesis ontology of citizens as sensors, and how crowdsourced data can be transferred to the RDF data model, in Section 4. In Section 5 we make spatiotemporal queries and present a faceted browser for the linked crowdsourced sensor data. Then we provide related work in Section 6. Finally we conclude in Section 7 with an outlook to future work.

2. REPTILE ROAD MORTALITY: A CITIZEN SCIENCE PROJECT
This section introduces the data collection in the citizen science project Reptile Road Mortality (in Chinese, 路殺社). This citizen science project is hosted by the Endemic Species Research Institute, Council of Agriculture, Taiwan. The project aims to collect, through the use of a Facebook group, reports of dead animals that have been struck and/or killed by motor vehicles. The reason for using Facebook as a crowdsourced data collection platform is its high user base in the Taiwanese population. According to a statistic of Socialbakers¹, over half of the Taiwanese population has a Facebook account. Facebook thus can be a good social place for recruiting participants. The number of participants in Reptile Road Mortality was 4,187 at the end of year 2013, but only 618 persons ever posted at least one observation. The ratio of contributing participants to all participants reflects the reality of mass collaboration, of which it is often said that 80% of the work is done by 20% of the people. Up to Jan. 4, 2014, the group has assembled 7,842 posts, as shown in Figure 1.

Figure 1: The growth of data in the Facebook group Reptile Road Mortality.

¹ http://www.socialbakers.com

Any user possessing a Facebook account can join this citizen project and post his/her observations of roadkill animals. Figure 2 illustrates a roadkill observation posted in the Facebook group Reptile Road Mortality. Chuang Yu-Ta saw a killed animal on the road, so he took a photo and posted his observation with a location and time description on the group. When Joyce read the post, she identified the species in Chuang Yu-Ta's photo and left the species name as a comment. Thus, the roadkill observation was composed of a photo, a description of location and time, and an identification of species.

Figure 2: A post on the Facebook group Reptile Road Mortality, as well as biodiversity observation information embedded in the post. The post carries the observation provider (Chuang Yu Ta), the observation location (Geoname: 新店 (Sindian); road kilometre: 16.3K on Province Road No. 9; Lat: 24.95149, Lon: 121.57520, Alt: 151 m), the observation date (2013/12/4), and a photo as proof of occurrence. The comment section carries the species identifier (Joyce Chen), the species name (鼬獾, Melogale moschata), and the identification date (Dec. 4, 2013).

The participants of this citizen project are asked to provide the location and time descriptions for their observations. Because of privacy and security issues, Facebook strips metadata (EXIF) from the photos. Without EXIF data, a photo from Facebook is just an image; the photo cannot in itself indicate the date and location at which it was taken. The text messages accompanying the photos are therefore the main sources for extracting biodiversity information about the species in the photos.

Facebook posts can be retrieved using the Facebook Graph API², which enables developers to read from and write data to Facebook. This API offers a simple, consistent view of the Facebook social graph, uniformly representing objects in the graph (e.g., people, photos, events, and pages) and the connections in between them (e.g., friend relationships, shared content, and photo tags).

3. INFORMATION EXTRACTION

3.1 Name-Entity Recognition
The data offered by the Facebook Graph API is structured around the Facebook social graph, which is useful for processing social relationships. However, this citizen science project focuses on collecting occurrences of roadkill animals. The valuable information is in the photos, which prove occurrences of roadkill animals, and in the texts, which describe the time and location of those occurrences. To extract the occurrence information, we apply name-entity recognition to identify location, time, and species in Facebook posts and comments. Because the participants in the Facebook group use traditional Chinese as the communication language, our task of name-entity recognition actually aims at Chinese text processing.

Several different algorithms have been proposed to deal with the challenge. Generally speaking, the algorithms can be classified into character-based and word-based approaches [33]. The character-based approaches ignore the concept of words, and use characters to extract word-level information in the construction of an information extraction system. The word-based approaches apply a lexicon to segment Chinese words. They often rely on a rich lexicon, sophisticated word segmentation, and/or syntactic analysis in extracting word-level information from documents [4]. However, existing Chinese lexicons are constructed for general applications. The lack of domain-specific corpora often hampers information extraction in specific domains such as geography and biodiversity. For example, the group Chinese Knowledge Information Processing (CKIP)³ is con- [...] information. The lexicon now contains over 140,000 word entries, and is used in a corpus with over a million parsed sentences. This is a great research resource. Unfortunately, using the CKIP lexicon for extracting location and species names is not efficient.

To efficiently extract species and location names from Facebook threads, it is necessary to compile domain-specific lexicons. We compiled a geo-name lexicon from the Taiwan Geographic Names database⁴ and a species-name lexicon from the Taiwan Catalogue of Life databases (TaiCOL)⁵. Note, however, that species names and place names found in Facebook posts and comments are not always in these two specific lexicons. The name-entity recognition approach we use was elaborated in a paper we previously published [10].

3.2 Evaluation of Name-Entity Recognition
Precision and recall are the basic measures used in Natural Language Processing to evaluate information extraction methods [9, 18, 30]. Generally speaking, a training dataset is needed to assess the quality of information extraction. Our training dataset is generated by domain experts and serves as the reference against which the names extracted by Name-Entity Recognition (NER) are judged. According to whether an identification is correct, four sets can be distinguished: true positive, false positive, true negative, and false negative. From the statistical point of view, false positives are Type I errors, and false negatives are Type II errors. Precision is the ratio of the number of correct names identified by both NER and domain experts (True Positive) to the total number of incorrect and correct names identified by NER (True Positive + False Positive) (Eq. 2). Recall is the ratio of the number of correct names identified by both NER and domain experts (True Positive) to the total number of correct names identified by domain experts (True Positive + False Negative) (Eq. 1). The F-score is an overall metric that is calculated from both precision and recall, treating these two metrics as equally important (Eq. 3).
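As a worked illustration of the three measures just described, the following minimal Python sketch computes them from raw counts. The function name `prf_scores` is ours and is not taken from the paper; the counts are those reported in Table 1 (282 true positives, 7 false positives, 10 false negatives). It assumes at least one true positive, so no division by zero is handled.

```python
def prf_scores(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F-score) from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of NER-predicted names that are correct
    recall = tp / (tp + fn)      # share of expert-identified names that NER found
    f_score = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_score


# Counts taken from Table 1 of the paper.
precision, recall, f_score = prf_scores(tp=282, fp=7, fn=10)
print(f"precision={precision:.4f} recall={recall:.4f} F-score={f_score:.4f}")
```

Because the F-score is the harmonic mean of two values in [0, 1], it always lies between precision and recall, which is a quick sanity check on any reported figure.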
Chinese texts are character-based, not word-based. Moreover, there is often no space between characters in written Chinese sentences. This unique language feature leads to a challenge of word segmentation.

² https://developers.facebook.com/docs/graph-api
³ http://ckip.iis.sinica.edu.tw/CKIP/engversion/index.htm
⁴ http://placesearch.moi.gov.tw
⁵ http://col.taibif.tw

Recall = \frac{|name_{actual} \cap name_{predict}|}{|name_{actual}|}    (1)

Precision = \frac{|name_{actual} \cap name_{predict}|}{|name_{predict}|}    (2)

F-score = \frac{2 \times Recall \times Precision}{Recall + Precision}    (3)

where name_{actual} is the set of place names or species names that has been identified from Facebook messages by domain experts, and name_{predict} is the set of place names or species names that has been identified from Facebook messages by the NER.

400 posts are randomly selected from the entire 7,842 posts for the evaluation. The confusion matrix of the information extraction assessment is shown in Table 1. Thus, the precision is 282/(282 + 7) = 0.9758, the recall is 282/(282 + 10) = 0.9658, and the F-score, computed from the unrounded precision and recall, is 0.9707.

Table 1: Confusion matrix of information extraction assessment.

                        Identified by experts    Not identified by experts
Predicted by NER                 282                         7
Not predicted by NER              10                       101

4. AN ONTOLOGY FOR CITIZENS AS SENSORS

4.1 A synthesis of social networks and sensor networks
Before we begin to transform the crowdsourced content to RDF, we first develop an ontology for not only expressing the notion of “Citizens as Sensors” but also formalizing the extracted name-entities, e.g. species and geospatial names. To make linked data interoperable, the ontology reuses suitable vocabularies from existing ontologies as much as possible. Since the crowdsourced dataset is retrieved from Facebook, a social media site, its content can be mapped to RDF using existing social semantic web ontologies. The Semantically Interlinked Online Communities (SIOC)⁶ vocabulary is used for representing the content of the Facebook group Reptile Road Mortality, e.g. threads, posts, and images. The Friend of a Friend (FOAF)⁷ vocabulary can be used to describe content creators. Figure 3 shows the vocabularies of SIOC and FOAF used in our ontology.

⁶ http://sioc-project.org

Figure 3: The vocabularies of SIOC and FOAF used in our ontology.

In this study, “Citizens as Sensors” means that a Citizen voluntarily reports his/her observations via social media for a citizen science project. The citizen acts as a Sensor, which enables automatic measurement and/or recording of physical properties. To express this notion, the vocabularies of the W3C Semantic Sensor Network (SSN) ontology are used to express the content from social networks. Conceptually, the action by which a participant reports her/his roadkill observation matches the pattern of Stimulus-Sensor-Observation. The pattern describes a process in which a sensor transforms a stimulus from the physical world into an observation, and thereby it allows us to reason about the observed properties of particular features of interest [15]. A roadkill animal is the stimulus which triggers a citizen to post her/his observations on Facebook at a specific time and location. Also, the species of the animal is the feature of interest. Figure 4 displays the use of the vocabularies of the SSN ontology in our ontology.

However, the citizen is a person and cannot exactly be regarded as a sensor. The persons can be expressed as foaf:Person, and the sensors can be defined as ssn:Sensor. Not all individuals of foaf:Person are individuals of ssn:Sensor. Only some of these individuals can be expressed as not only foaf:Person but also ssn:Sensor. To clarify the concept, we create the class Citizen_As_Sensor, which is a subclass of the intersection of the two classes. That is, an individual of the class Citizen_As_Sensor is an instance of both classes, but the instances of foaf:Person or ssn:Sensor are not necessarily individuals of the class Citizen_As_Sensor. Moreover, the same situation occurs for ssn:SensorOutput, as some instances are in sioc:Post or in sioc:Image. Therefore, we define the class Post_As_SensorOutput to be in the intersection of sioc:Post and ssn:SensorOutput, and the class Image_As_SensorOutput to be a subclass of both sioc:Image and ssn:SensorOutput.

4.2 Formalizations of the extracted name-entities

4.2.1 Geospatial information
In the process of information extraction, name entity recognition is used to identify the geospatial and species names. The extraction of geospatial information includes not only location names (such as names of populated places and points of interest) and road names with kilometers but also coordinates (longitude and latitude). If coordinates are not written in the text of an observation post, the location names are used to retrieve the longitude and latitude. To semantically encode geospatial data, we use the vocabularies of Open Geospatial Consortium (OGC) GeoSPARQL. GeoSPARQL is an OGC standard which provides three main components for semantically encoding geographic data: (1) the definitions of vocabularies for representing features, geometries, and their relationships; (2) a set of domain-specific, spatial functions for use in SPARQL queries; (3) a set of query transformation rules [21].
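The fallback from written coordinates to a place-name lookup, and the encoding of the result as a geometry literal, might look like the sketch below. Everything named here is an assumption made for illustration: the `GAZETTEER` dictionary stands in for a real geo-name lookup, its single entry reuses the coordinates shown in Figure 2, and the function names are hypothetical rather than the authors' implementation. The only fixed conventions are that WKT writes a point as `POINT(x y)` and that GeoSPARQL's default coordinate reference system (CRS84) uses longitude, latitude order.

```python
from typing import Optional, Tuple

# Hypothetical stand-in for a geo-name lexicon: place name -> (longitude, latitude).
# The single entry reuses the coordinates shown in Figure 2, for illustration only.
GAZETTEER = {"新店": (121.57520, 24.95149)}


def resolve_coordinates(
    place_name: Optional[str],
    lon: Optional[float] = None,
    lat: Optional[float] = None,
) -> Optional[Tuple[float, float]]:
    """Prefer coordinates written in the post; otherwise look the place name up."""
    if lon is not None and lat is not None:
        return lon, lat
    if place_name is not None:
        return GAZETTEER.get(place_name)  # None when the name is not in the lexicon
    return None


def to_wkt_point(lon: float, lat: float) -> str:
    """Encode a position as a WKT point in longitude-latitude order."""
    return f"POINT({lon} {lat})"


coords = resolve_coordinates("新店")  # the post text carried no coordinates
if coords is not None:
    print(to_wkt_point(*coords))
```

A triple store would then attach such a string to a geometry node through geo:asWKT, typed with the WKT literal datatype.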
Figure 4: The vocabularies of W3C SSN used in our ontology.

The ontology of the GeoSPARQL standard includes three main classes: geo:SpatialObject, geo:Feature, and geo:Geometry. geo:Feature and geo:Geometry are subclasses of geo:SpatialObject. The geo:Feature class represents features, which are abstractions of real world phenomena. The concept of feature is derived from the ISO 19109 General Feature Model. geo:Geometry, expressing spatial geometries of the features, has sixteen subclasses defining a hierarchy of geometry types such as point, polygon, curve, arc, and multi-curve. These geometry classes are derived from the ISO 19107 Spatial Schema. RDF literals are used to store geometry values. There are two ways to store geometry values via RDF literals: Well Known Text (WKT) and Geography Markup Language (GML). The geo:asWKT and geo:asGML properties map between the geometry entities and the geometry literals. Geometry values for these two properties use the geo:wktLiteral and geo:gmlLiteral data types respectively. Figure 5 shows the classes and properties of GeoSPARQL used in our ontology.

Figure 5: The vocabularies of GeoSPARQL used in our ontology.

Although DUL:hasLocation is usually a predicate between ssn:Observation and DUL:Entity in W3C SSN, it can actually be a property between any entities. To clarify the place of observation, we create a class PlaceOfObservation which is a subclass of both DUL:Entity and geo:Feature. The class PlaceOfObservation not only keeps the DUL:hasLocation property but also inherits the formal geospatial concepts from geo:Feature. As for the time of an ssn:Observation event, ssn:observationResultTime can be a predicate between the class ssn:Observation and the class time:DateTimeInterval.

4.2.2 Biodiversity information
Discovery and inventory of specimen data is a fundamental work in biodiversity informatics. With the development of Internet technologies, the aggregation and dissemination of biodiversity data has increased in scale from regional to global, and has broadened in scope beyond that of establishing species ranges [16]. To reach global biodiversity data coordination, standardized metadata vocabularies such as Darwin Core are used to develop data infrastructures for sharing biodiversity data. Darwin Core is a standard for sharing data about biodiversity, that is, the occurrence of life on earth and its associations with the environment [32]. However, Darwin Core is comprised of technology-independent vocabularies. The classes in Darwin Core are categories and have no formal domain declarations for vocabularies [31]. To improve the knowledge representation of Darwin Core, Darwin-SW⁸ designs the properties between classes and formalizes the classes, including five existing core classes of Darwin Core (i.e. Taxon, Event, Identification, Location, Occurrence) and two new ones (i.e. Token and Individual Organism). Figure 6 shows the classes and properties of Darwin Core and Darwin-SW used in our ontology.

Figure 6: The vocabularies of Darwin-SW used in our ontology.

Traditionally, a specimen, consisting of all or part of an organism, serves as evidence for the occurrence of the organism, and is a basis for identifying the organism to a taxon concept. However, documentation nowadays can take many forms, such as images, sound, or DNA sequences. The class dsw:Token is used to represent evidence from the classes dwctype:Occurrence and dwctype:Identification. To connect Darwin Core to W3C SSN, we create the classes Token_As_FeatureOfInterest and Occurrence_As_Stimulus. Token_As_FeatureOfInterest is a subclass of the intersection of ssn:FeatureOfInterest and dsw:Token. The class Occurrence_As_Stimulus is in the intersection of ssn:Stimulus and dwctype:Occurrence.

⁷ http://www.foaf-project.org
⁸ https://code.google.com/p/darwin-sw/

4.3 Transformations from the extracted name-entities to the RDF model
Assembling the above-mentioned vocabularies, we can create the ontology of “Citizen as Sensor”, as shown in Figure 7. This ontology serves as the schema for transforming crowdsourced content to linked sensor data. Taking Figure 2 as an example, we can correspondingly transform the user-generated content to RDF data, as shown in the Appendix. The extracted name entities of species and place names are pointed to by URLs. The word “鼬獾” (Melogale moschata subaurantiaca) is identified as a taxon concept linked to a URI in TaiBIF, as shown in Figure 8, and mapped to the scientific name, also a URI in TaiBIF, as shown in Figure 9. The extracted place name “新店” (Sindian) is likewise linked to a URI in Taiwan Geographic Name, whose URIs are all mapped to Geonames.org, as shown in Figure 10.

Figure 7: The ontology of “Citizen as Sensor”.

Figure 8: The taxon concept of extract species name is linked to a URI in TaiBIF.

Figure 9: The taxon name of extract species name is linked to a URI in TaiBIF.

Figure 10: The extract place name points to a URI in Taiwan Geographic Name.

5. SPATIOTEMPORAL QUERIES
Since the geospatial information is formalized by the vocabularies of OGC GeoSPARQL, information in our RDF dataset can be retrieved via spatiotemporal queries. This study uses BBN Parliament, which is an open source triple store developed by Raytheon BBN Technologies. BBN Parliament is compliant with the OGC GeoSPARQL standard, and supports spatial and non-spatial SPARQL queries. Using BBN Parliament, we build a GeoSPARQL endpoint⁹ for the linked crowdsourced sensor dataset. The following lists a GeoSPARQL query, and Figure 11 is the result of the query.

⁹ http://lod.tw/parliament/

    PREFIX geo:
    PREFIX geof:
    PREFIX owl:
    PREFIX rdf:
    PREFIX rdfs:
    PREFIX sf:
    PREFIX time:
    PREFIX units:
    PREFIX xsd:
    PREFIX eoe:
    PREFIX DUL:
    PREFIX ssn:

    SELECT DISTINCT ?Obs ?POO_geo ?POO_wkt
    WHERE {
      ?Obs a ssn:Observation ;
           DUL:hasLocation ?POO ;
           ssn:observationResultTime ?Int .
      ?POO geo:hasGeometry ?POO_geo .
      ?POO_geo geo:asWKT ?POO_wkt .
      ?Int time:xsdDateTime ?Time_xsd .
      FILTER (geof:sfWithin(?POO_wkt, "POLYGON((121.756555 24.488236, 121.207238 24.488236, 121.207238 25.141394, 121.756555 25.141394, 121.756555 24.488236))"^^sf:wktLiteral))
      FILTER (?Time_xsd > "2013-12-19T16:00:00Z"^^xsd:dateTime)
    }

Figure 11: The result of a spatiotemporal query.

To efficiently browse the RDF triples, we develop a faceted viewer¹⁰ including a taxon tree, a social relation graph, and an observation map, as shown in Figure 12. The taxon tree can visualize the identified species names via their taxon concepts such as kingdom, phylum, class, order, family, and genus. The social relation graph shows the connections between the participants in the citizen science project. It can be used to view who observes what species, and where the species occurs. To display locations of species occurrences, the coordinates are used to pin the species on the map. A timeline is also used to show the times of the species occurrences.

Figure 12: A faceted viewer.

¹⁰ http://taibif.tw/vgd/ldow2014/viewer.php

6. RELATED WORK
Traditionally, in order to ensure the quality of data collections, training and educating volunteers by experts or experienced participants is a common method in citizen science [11]. The volunteers are thus able to fill in designated forms, to use well-defined terms, and/or to follow default steps on the web for reporting their observations. The user-contributed data can thus be fitted to a default data model. However, this method is difficult to apply when citizen science projects depend on Web applications and services. It is argued that there exists an inherent trade-off between data quality and data quantity [23]. The growth of data quantity will be slow if data contribution is restricted to experts or trained volunteers. On the contrary, data volume often increases rapidly if data contribution is entirely open to volunteers, but data quality is then hard to guarantee. Such volunteered contributions can easily be imperfect (e.g. erroneous, incomplete, or fraudulent) and unstructured (e.g. in the form of texts and/or images) [6, 10]. Crowdsourcing is the first step of data collection in citizen science. After preprocessing and cleaning up the noise, crowdsourced data can provide more valuable information to scientists than the raw data can. The role of semantic web technologies is increasingly important for tackling crowdsourced data. To enable semantic computing to process crowdsourced data, Sheth proposed a semantics-empowered social computing architecture for dealing with crowdsourced data [25]. The architecture emphasized the use of domain-specific or spatial-temporal-thematic ontologies for extracting meaning in the data.

The idea of citizen sensing is not new. Goodchild coined the term “Volunteered Geographic Information” (VGI) to describe a contemporary trend where Web technologies empower a network of human sensors voluntarily reporting and interpreting in-situ information [13]. Sheth also described Internet users or Web-enabled social communities as citizens. The ability to interact with Web 2.0 services can augment these citizens into citizen sensors [24]. He further explained the advantages of “human-in-the-loop sensing”, emphasizing the background knowledge and past experiences of humans in citizen sensing. Janowicz and Compton developed the Stimulus-Sensor-Observation ontology pattern which forms the Semantic Sensor Network (SSN) ontology as developed by the W3C SSN Incubator Group [15]. The design pattern provides a knowledge representation for the integration of social web and sensor web. Some studies not only transformed the crowdsourced data to a standard format such as RDF but also leveraged the power of the SSN ontology to describe the sensors on mobile devices for passenger information systems and in emergency reporting applications on microblogging platforms [6, 7].

Linked Data has established itself as the de facto means for the publication of structured data over the Web. More and more ICT ventures offer innovative data management services on top of Linked Open Data (LOD) [27]. Ortmann et al. described an approach based on LOD to alleviate the integration problems of crowdsourced data, and to improve the exploitation of crowdsourced data in disaster management [22]. To solve the problem of structural data in disaster management, and it shall help humanitarian agencies make informed decisions. The exploitation of external semantic resources to disambiguate contents is often said to be an effective method. To enrich the semantics of folksonomies, Choudhury et al. not only built up relations among tags via statistical analysis but also integrated the structured tags with the linked data cloud through DBpedia [5]. Mendes et al. proposed a Linked Open Social Signals architecture for collection, semantic annotation, and analysis of real-time social signals from microblogging data [19]. The design of Linked Data management often aims to “reach a high level of automation with respect to the processing of an open and decentralized data space bringing together data sources published by different parties, of varying quality and using heterogeneous conceptual schemas and vocabularies” [27]. Crowley et al. proposed a generic framework for aggregating and linking heterogeneous data from various sources and transforming them to Linked Data [8]. The framework allows reuse and integration of the produced data with other data resources (including social media and sensors), enabling spatial business intelligence for various domain-specific applications.

7. CONCLUSION AND FUTURE WORK
Social media creates new opportunities for citizen science. The information created from social media is considered a new resource for scientific works. Meanwhile, the use of social media in citizen science projects also brings new issues to research data. This study explored the issues involved in the use of social media in citizen science projects, and reported our experiences in transferring unstructured collaborative information to structured data for scientific purposes. We shared our experiences in carrying the data collection from a social process to a scientific process. The successful implementation of this approach can further facilitate the development of social-media based citizen science projects. We believe it also has broader applications in user-generated content management, and promises to be a practical solution to an important design problem in citizen science projects on the Web.

This study deals with crowdsourced content from a citizen science project via a “Citizen as Sensor” ontology. The processed data is formalized by inheriting the concepts from the ontology. Thus, the extracted name entities can be mapped to the existing resources and linked to domain-specific concepts. With clarified domain-specific semantics, the triplified data can be applied in faceted exploration for new knowledge. This study uses several tools for storing and visualizing the RDF triples. To make the browser more usable, integrating the tools into a knowledge-based browser remains to be done in the future. Moreover, the triplified dataset should be considered for linkage to larger linked datasets such as DBPedia and other resources.

8. REFERENCES
[1] S. Auer, J. Lehmann, and A.-C. N. Ngomo. Introduction to linked data and its lifecycle on the web.
In and semantic interoperability, they also suggested engage Proceedings of the 7th International Conference on people in processing unstructured observations into struc- Reasoning Web: Semantic Technologies for the Web tured RDF-triples according to Linked Open Data principles. of Data, RW’11, pages 1–75, Berlin, Heidelberg, 2011. The process would increase the impact of crowdsourced Springer-Verlag. [2] C. Bizer, T. Heath, and T. Berners-Lee. Linked data - [16] S. Kelling, J. Gerbracht, D. Fink, C. Lagoze, W.-K. the story so far. Int. J. Semantic Web Inf. Syst., Wong, J. Yu, T. Damoulas, and C. P. Gomes. A 5(3):1–22, 2009. human/computer learning network to improve [3] G. Chatzimilioudis, A. Konstantinidis, C. Laoudias, biodiversity conservation and research. AI Magazine, and D. Zeinalipour-Yazti. Crowdsourcing with 34(1):10–20, 2013. smartphones. Internet Computing, IEEE, 16(5):36–44, [17] R. Lemmens and D. Deng. Web 2.0 and semantic web: Sept 2012. Clarifying the meaning of spatial features. Semantic [4] L.-F. Chien. Pat-tree-based keyword extraction for Web meets Geopatial Applications, AGILE, 2008. chinese information retrieval. In ACM SIGIR Forum, [18] C. D. Manning and H. Schütze. Foundations of volume 31, pages 50–58. ACM, 1997. statistical natural language processing. MIT press, [5] S. Choudhury, J. G. Breslin, and A. Passant. 1999. Enrichment and ranking of the YouTube tag space [19] P. N. Mendes, A. Passant, P. Kapanipathi, and A. P. and integration with the linked data cloud. In Sheth. Linked open social signals. In Proceedings of International Semantic Web Conference, volume 5823 the 2010 IEEE/WIC/ACM International Conference on of LNCS, pages 747–762. Springer, 2009. Web Intelligence and Intelligent Agent [6] D. Corsar, P. Edwards, N. Velaga, J. Nelson, and J. Z. Technology-Volume 01, pages 224–231. IEEE Pan. Short paper: Addressing the challenges of Computer Society, 2010. semantic citizen-sensing. In Proceedings of the 4th [20] G. Newman, D. 
Zimmerman, A. Crall, M. Laituri, International Workshop on Semantic Sensor J. Graham, and L. Stapel. User-friendly web mapping: Networks(SSN’11), pages 101–106, 2011. lessons from a citizen science website. Int. J. Geogr. [7] D. Crowley, A. Passant, and J. G. Breslin. Short paper: Inf. Sci., 24(12):1851–1869, Dec. 2010. Annotating microblog posts with sensor data for [21] OGC. GeoSPARQL - A Geographic Query Language for emergency reporting applications. In Proceedings of RDF Data. Technical report, the 4th International Workshop on Semantic Sensor http://www.opengeospatial.org/standards/geosparql, Networks (SSN’11), pages 95–100, 2011. 2011. [8] D. N. Crowley, M. Dabrowski, and J. G. Breslin. [22] J. Ortmann, M. Linbu, W. Dong, and T. Kauppinen. Decision support using linked, social, and sensor data. Crowdsourcing linked open data for disaster In Proceedings of the Nineteenth Americas management. In W. W. Cohen and S. Gosling, editors, Conference on Information Systems, 2013. Terra Cognita, pages 11–22, 2011. [9] K. Crowston, E. E. Allen, and R. Heckman. Using [23] J. Parsons, R. Lukyanenko, and Y. Wiersma. Easier natural language processing technology for citizen science is better. Nature, 471(7336):37, Mar. qualitative data analysis. International Journal of 2011. Social Research Methodology, 15(6):523–543, 2012. [24] A. Sheth. Citizen sensing, social signals, and [10] D.-P. Deng, G.-S. Mai, C.-H. Hsu, T.-R. Chuang, T.-E. enriching human experience. Internet Computing, Lin, H.-H. Lin, K.-T. Shao, R. Lemmens, and M.-J. IEEE, 13(4):87–92, July 2009. Kraak. Using social media for collaborative species [25] A. Sheth. Computing for human experience: identification and occurrence: Issues, methods, and Semantics-empowered sensors, services, and social tools. In Proceedings of the 1st ACM SIGSPATIAL computing on the ubiquitous web. Internet International Workshop on Crowdsourced and Computing, IEEE, 14(1):88–91, 2010. Volunteered Geographic Information, GEOCROWD [26] J. 
Silvertown. A new dawn for citizen science. Trends ’12, pages 22–29, New York, NY, USA, 2012. ACM. in Ecology & Evolution, 24(9):467 – 471, 2009. [11] A. Flanagin and M. Metzger. The credibility of [27] E. Simperl. Crowdsourcing semantic data volunteered geographic information. GeoJournal, management: Challenges and opportunities. In 72:137–148, 2008. Proceedings of the 2Nd International Conference on [12] H. Gao, G. Barbier, and R. Goolsby. Harnessing the Web Intelligence, Mining and Semantics, WIMS ’12, crowdsourcing power of social media for disaster pages 1:1–1:3, New York, NY, USA, 2012. ACM. relief. Intelligent Systems, IEEE, 26(3):10–14, May [28] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. 2011. Linkedgeodata: A core for a web of spatial open data. [13] M. Goodchild. Citizens as sensors: the world of Semantic Web Journal, 3(4):333–354, 2012. volunteered geography. GeoJournal, 69:211–221, [29] B. L. Sullivan, C. L. Wood, M. J. Iliff, R. E. Bonney, 2007. D. Fink, and S. Kelling. ebird: A citizen-based bird [14] T. R. Gruber. A translation approach to portable observation network in the biological sciences. ontology specifications. KNOWLEDGE ACQUISITION, Biological Conservation, 142(10):2282 – 2292, 2009. 5:199–220, 1993. [30] K. Verspoor, K. B. Cohen, A. Lanfranchi, C. Warner, [15] K. Janowicz and M. Compton. The H. L. Johnson, C. Roeder, J. D. Choi, C. Funk, stimulus-sensor-observation ontology design pattern Y. Malenkiy, M. Eckert, et al. A corpus of full-text and its integration into the semantic sensor network journal articles is a robust evaluation tool for ontology. In Proceedings of The 3rd International revealing differences in performance of biomedical workshop on Semantic Sensor Networks 2010 natural language processing tools. BMC (SSN10) in conjunction with the 9th International bioinformatics, 13(1):207, 2012. Semantic Web Conference (ISWC 2010), ISWC’10, [31] C. Webb and S. Baskauf. Darwin-sw: Darwin core data 2010. for the semantic web. 
TDWG Annual Meeting; 2011-10-18, 2011. owl:sameAs http://lod.tw/placenames/159624 . [32] J. Wieczorek, D. Bloom, R. Guralnick, S. Blum, eoe:point_559070840853748 rdf:type geo:Point , M. Döring, R. Giovanni, T. Robertson, and D. Vieglais. owl:NamedIndividual ; w3c_geo:long "121.575200" ; Darwin core: An evolving community-developed w3c_geo:lat "24.951490" ; biodiversity data standard. PLoS ONE, 7(1):e29715, geo:asWKT "Point(121.575200 2012. 24.951490)"^^sf:wktLiteral . [33] K.-F. Wong, W. Li, R. Xu, and Z.-s. Zhang. Introduction eoe:thread_559070840853748 rdf:type sioc:Thread , owl:NamedIndividual ; to Chinese Natural Language Processing. Morgan & sioc:has_container fb:groups/roadkilled . Claypool Publishers, 2010. eoe:occr_559070840853748 rdf:type eoe:Occurrence_As_Stimulus , owl:NamedIndividual ; APPENDIX dsw:hasEvidence eoe:token_559070840853748 . eoe:person_100002525111203 rdf:type eoe:Person_As_Sensor , A. FROM UGC TO ENRICHED RDF DATA owl:NamedIndividual ; rdfs:label "Chuang Yu Ta" ; @prefix rdf: . ssn:detects eoe:occr_559070840853748 ; @prefix geo: . ssn:observes eoe:token_559070840853748 ; @prefix foaf: . foaf:account fb:100002525111203 . @prefix DUL: . @prefix dwc: . taxon:380522 rdf:type dwctype:Taxon , @prefix dsw: . owl:NamedIndividual ; @prefix taibif: . dsw:hasName taibif:380522 ; @prefix ssn: . skos:preLabel "Melogale moschata subaurantiaca" ; @prefix sf: . skos:altLabel " 鼬獾 ’" . @prefix w3c_geo: . @prefix schema: . @prefix sioc: . @prefix rdfs: . @prefix dwctype: . @prefix time: . @prefix dct: . @prefix owl: . @prefix xsd: . @prefix rdf: . @prefix eoe: . @prefix fb: . @prefix tgn: . @prefix taxon: . @prefix skos: . @prefix gn: . eoe:img_559070840853748 rdf:type eoe:Image_As_SensorOutput , owl:NamedIndividual ; sioc:has_container eoe:thread_559070840853748 ; sioc:has_owner fb:100002525111203 ; ssn:isProducedBy eoe:person_100002525111203 . 
fb:238918712815615 _694835510557264 rdf:type eoe:Post_As_SensorOutput , owl:NamedIndividual ; sioc:has_container eoe:thread_559070840853748 ; _ sioc:has creator fb:100002525111203 ; ssn:isProducedBy eoe:person_100002525111203 . eoe:iden_559070840853748_01 rdf:type dwctype:Identification , owl:NamedIndividual ; dwc:dateIdentified eoe:iden_time_559070840853748 ; dsw:identifies eoe:idv_238918712815615_694835510557264 ; dsw:isBasedOn eoe:token_559070840853748 ; dsw:toTaxonConcept taxon:380522 . eoe:token_559070840853748 rdf:type eoe:Token_As_FeatureOfInterest , owl:NamedIndividual . eoe:idv_238918712815615_694835510557264 rdf:type dsw:IndividualOrganism , owl:NamedIndividual . eoe:obs_559070840853748 rdf:type ssn:Observation , owl:NamedIndividual ; ssn:observationResultTime eoe:obs_time_559070840853748 ; DUL:hasLocation eoe:placeOfOb_559070840853748 ; ssn:observationResult eoe:img_559070840853748 , fb:238918712815615 _694835510557264 ; ssn:featureOfInterest eoe:token_559070840853748 ; ssn:observedBy eoe:person_100002525111203 . eoe:obs_time_559070840853748 rdf:type time:DateTimeInterval , owl:NamedIndividual ; time:xsdDateTime "2013-12-04T07:42:15"^^xsd:dateTime . eoe:iden_time_559070840853748 rdf:type time:DateTimeInterval , owl:NamedIndividual ; time:xsdDateTime "2013-12-11T07:42:15"^^xsd:dateTime . eoe:placeOfOb_559070840853748 rdf:type eoe:PlaceOfObservation , owl:NamedIndividual ; geo:hasGeometry eoe:point_559070840853748 ; gn:name " 新店 " ;
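
To illustrate how the two FILTER clauses of the GeoSPARQL query given earlier (geof:sfWithin and the time threshold) relate to the sample observation in this appendix, the following is a minimal Python sketch rather than the implementation used in this work. It assumes simple WKT literals that contain only coordinate pairs, approximates geof:sfWithin with a plain ray-casting test, and treats an xsd:dateTime without an offset as UTC. The function names (parse_wkt_coords, point_in_polygon, parse_xsd_datetime, matches_query) are hypothetical and do not come from any triple store or library.

```python
import re
from datetime import datetime, timezone


def parse_wkt_coords(wkt):
    """Return the (lon, lat) pairs of a simple WKT POINT or single-ring POLYGON."""
    values = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", wkt)]
    return list(zip(values[0::2], values[1::2]))


def point_in_polygon(point, ring):
    """Ray-casting test on a closed ring; boundary cases get no special handling."""
    x, y = point
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def parse_xsd_datetime(text):
    """Parse an xsd:dateTime; a value without an offset is ASSUMED to be UTC."""
    parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def matches_query(point_wkt, time_text, polygon_wkt, after_text):
    """Evaluate the spatial and the temporal filter separately."""
    point = parse_wkt_coords(point_wkt)[0]
    ring = parse_wkt_coords(polygon_wkt)
    within = point_in_polygon(point, ring)
    recent = parse_xsd_datetime(time_text) > parse_xsd_datetime(after_text)
    return within, recent


# Polygon and threshold copied from the query; point and time from the appendix.
POLYGON = ("POLYGON((121.756555 24.488236, 121.207238 24.488236, "
           "121.207238 25.141394, 121.756555 25.141394, 121.756555 24.488236))")
within, recent = matches_query(
    "Point(121.575200 24.951490)",
    "2013-12-04T07:42:15",
    POLYGON,
    "2013-12-19T16:00:00Z",
)
print(within, recent)  # prints: True False
```

Under these assumptions the sample point lies inside the query polygon, while its observation time precedes the threshold, so this particular observation would pass the spatial filter but not the temporal one. A GeoSPARQL-enabled triple store would evaluate both filters directly; the sketch only makes the logic of the filters explicit.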