Social Web Meets Sensor Web: From User-Generated Content to Linked Crowdsourced Observation Data∗

Dong–Po Deng†
Institute of Information Science, Academia Sinica, Taipei, Taiwan

Guan–Shuo Mai
Biodiversity Research Center, Academia Sinica, Taipei, Taiwan

Tyng–Ruey Chuang‡
Institute of Information Science, Academia Sinica, Taipei, Taiwan

Rob Lemmens
Faculty of Geo–Information Science and Earth Observation (ITC), University of Twente, Enschede, Netherlands

Kwang–Tsao Shao
Biodiversity Research Center, Academia Sinica, Taipei, Taiwan

ABSTRACT
The reach of dominant social media like Facebook and Twitter in the current population is enormous, and these media have long been leveraged for diverse applications. In particular, for some citizen science projects, existing social media increasingly become platforms on which participants interact and contribute. These user contributions, often termed User-Generated Content (UGC), can be a mixed bag of posts, comments, images, and other media. We report in this paper a work-in-progress in formalizing user contributions from a large Facebook group (more than 4,000 users) established for biodiversity observation. A major part of our work is to extract structured datasets with well-defined semantics from unstructured UGC collections. We use common vocabularies from Darwin Core (DwC), Friend of a Friend (FOAF), Semantically-Interlinked Online Communities (SIOC), and Semantic Sensor Network (SSN), among others, to formalize the extracted datasets and hence make them readily linkable. A nice consequence of this approach is that a multi-faceted browser can be quickly built to explore biodiversity information in large collections of UGC.

Categories and Subject Descriptors
H.3.5 [Online Information System]: [Web-based services]; H.5.3 [Group and Organization Interfaces]: [Web-based Interaction]; I.2.4 [Knowledge Representation Formalisms and Methods]: Semantic Networks

General Terms
Management, Design, Human Factors.

Keywords
Citizen Science, Crowdsourcing, Facebook, GeoSPARQL, Linked Data, Sensor Network, User-Generated Content (UGC).

∗This research is supported in part by the Ministry of Science and Technology (grant no. 102-2627-M-001-009) and by the Endemic Species Research Institute, Council of Agriculture, Taiwan. We are grateful to Te–En Lin and his group at the Endemic Species Research Institute for their help with the collected data.
†Dong–Po Deng is also a PhD candidate at the Faculty of Geo–Information Science and Earth Observation (ITC), University of Twente.
‡Tyng–Ruey Chuang is also affiliated with the Research Center for Information Technology Innovation and the Research Center for Humanities and Social Sciences (Center for Geographic Information Science), both at Academia Sinica.

This paper is released under the Creative Commons Attribution 4.0 License. You are free to share and adapt this paper for any purpose, even commercially, as long as you give appropriate credit, provide a link to the license, and indicate if changes were made. These freedoms cannot be revoked as long as you follow the license terms. For a copy of the license, please visit .

Linked Data on the Web (LDOW2014), April 8, 2014. Seoul, Korea.

1. INTRODUCTION
Citizen science is a crowdsourcing mechanism that refers to a distributed, collaborative problem-solving model in which a crowd of undefined size is engaged to solve a complex or scientific problem through an open call [3, 20]. The incorporation of trained volunteers into scientific studies as field assistants has a long history [26]. However, the landscape of citizen science has been transformed by modern Web services and communications that enable people around the world to spread information. Social media is one of the significant tools changing the ways information is produced and used in citizen science projects. A social media site can offer participants of citizen science projects not only a virtual environment for social interactions but also a platform for sharing, discussing, and modifying data together. On the one hand, social media potentially provide situational awareness and opportunities for assistance on an individual level [12]. The communication channels make it possible for participants to share and manage their own sightings on a globally accessible database [29]. That is, the citizens are locally acting as human sensors, and social media are acting as platforms connecting these human sensors. On the other hand, social media enable scientists to reach out to a large number of people, over a large geographic region and over an extended time period, to introduce them to citizen science projects. Therefore, the use of social media has greatly increased citizen participation and improved the data collection process in citizen science projects. Such a crowdsourced approach can often reduce the cost and effort of data management and exchange [26].

However, to utilize social media for citizen science projects, there is a need to bridge a knowledge gap between humans and machines. When social media are used for collecting participants’ observations, it is often hard to control the quality of the content. Social media applications and services facilitate social interactions, but not scientific activities and data exchanges. Valuable scientific content is mixed up with huge amounts of noisy, low-quality, unstructured text and media. Often a crowdsourcing effort only creates human-readable content but not machine-readable data. Moreover, the lack of sufficient metadata for crowdsourced data often makes it difficult to derive meaningful interpretations from the data. Correspondingly, data integration and sharing across different knowledge domains is hampered. Semantic computing on crowdsourced data requires not only text mining for extracting valuable information from user-generated content but also semantic enrichment for interpreting the meaning of the extracted information.

An ontology, as a “shared conceptualization”, plays an important role as the basis for connections between datasets [14]. This is because an ontology provides a formal model for knowledge representation geared towards resolving semantic ambiguity, and consequently it contributes to the achievement of semantic interoperability between information communities [17]. Linked Data refers to the publication of structured data on the Web in such a way that it is machine-readable, its meaning is explicitly defined, it is linked to other external datasets, and can in turn be linked to from external datasets [2]. Technically, the Linked Data paradigm combines knowledge representation technologies, e.g. RDF and OWL, with traditional Web technologies, e.g. HTTP and REST, for publishing and interlinking data and information [28]. These technologies enable an evolving transition from the current document-oriented Web into a Web of interlinked data and, ultimately, into the Semantic Web [1].

This paper reports our experiences on processing crowdsourced data from social media into interlinked data for the Web. The process can be elaborated by the following:

• how the crowdsourced observation data can be transformed and represented by an ontology of citizens as sensors,

• how the crowdsourced observation data can be interlinked with other Linked Data resources such as biodiversity (TaiCOL) and geospatial information (Geonames),

• how the crowdsourced observation data can be accessible to machines by using the Linked Data paradigm and be readable for humans by means of a faceted browser.

The paper is organized as follows. After introducing the citizen science project in Section 2, we describe in Section 3 how named-entities can be extracted from crowdsourced data, and how the information extraction is evaluated. We explain the design of the synthesis ontology of citizens as sensors, and how crowdsourced data can be transferred to the RDF data model, in Section 4. In Section 5 we make spatiotemporal queries and present a faceted browser for the linked crowdsourced sensor data. Then we provide related work in Section 6. Finally we conclude in Section 7 with an outlook to future work.

2. REPTILE ROAD MORTALITY: A CITIZEN SCIENCE PROJECT
This section introduces the data collection in the citizen science project Reptile Road Mortality (in Chinese, 路殺社). This citizen science project is hosted by the Endemic Species Research Institute, Council of Agriculture, Taiwan. The project aims to collect reports of dead animals that have been struck and/or killed by motor vehicles through the use of a Facebook group. The reason for using Facebook as a crowdsourced data collection platform is its large user base in the Taiwanese population. According to statistics from Socialbakers¹, over half of the Taiwanese population has a Facebook account. Facebook thus can be a good social place for recruiting participants. The number of participants in Reptile Road Mortality was 4,187 at the end of 2013, but only 618 persons had ever posted at least one observation. The ratio of contributing participants to all participants reflects the reality of mass collaboration, in which it is often said that 80% of the work is done by 20% of the people. Up to Jan. 4, 2014, the group had assembled 7,842 posts, as shown in Figure 1.

Figure 1: The growth of data in the Facebook group Reptile Road Mortality.

¹ http://www.socialbakers.com

Any user possessing a Facebook account can join this citizen project and post his/her observations of roadkill animals. Figure 2 illustrates a roadkill observation posted in the Facebook group Reptile Road Mortality. Chuang Yu-Ta saw a killed animal on the road, so he took a photo and posted his observation with a description of location and time in the group. When Joyce read the post, she identified the species in Chuang Yu-Ta’s photo and left the species name as a comment. Thus, the roadkill observation was composed of a photo, a description of location and time, and an identification
of species.

Figure 2: A post on the Facebook group Reptile Road Mortality, as well as biodiversity observation information embedded in the post. The post is published on .
Annotations in Figure 2 (a thread with a post and a comment section):
Post: Observation Provider: Chuang Yu Ta. Observation location: Geoname: (Sindian); Road kilometre: 916.3K (Province Road No.9, section 16.3 kilometer); Lat:24.95149, Lon:121.57520, Lat:151m. Observation date: 2013/12/4. Photo: A proof of occurrence.
Comment section: Species Identifier: Joyce Chen. Species name: (Melogale moschata). Identification Date: Dec. 4, 2013.

The participants of this citizen project are asked to provide location and time descriptions for their observations. Because of privacy and security issues, Facebook strips metadata (EXIF) from the photos. Without EXIF data, a photo from Facebook is just an image; the photo cannot in itself indicate the date and location at which it was taken. The text messages accompanying the photos are therefore the main sources for extracting biodiversity information about the species in the photos.

Facebook posts can be retrieved using the Facebook Graph API², which enables developers to read from and write data to Facebook. This API offers a simple, consistent view of the Facebook social graph, uniformly representing objects in the graph (e.g., people, photos, events, and pages) and the connections in between them (e.g., friend relationships, shared content, and photo tags).

² https://developers.facebook.com/docs/graph-api

3. INFORMATION EXTRACTION

3.1 Name-Entity Recognition
The data offered by the Facebook Graph API is structured around the Facebook social graph, which is useful for processing social relationships. However, this citizen science project focuses on collecting occurrences of roadkill animals. The valuable information is in the photos, which prove occurrences of roadkill animals, and in the texts, which describe the time and location of these occurrences. To extract the information about occurrences of roadkill animals, we apply name-entity recognition to identify location, time, and species in Facebook posts and comments. Because the participants in the Facebook group use traditional Chinese as the communication language, our task of name-entity recognition actually aims at Chinese text processing. Chinese texts are character-based, not word-based. Moreover, there is often no space between characters in written Chinese sentences. This unique language feature leads to a challenge of word segmentation.

Several different algorithms have been proposed to deal with the challenge. Generally speaking, the algorithms can be classified into character-based and word-based approaches [33]. The character-based approaches ignore the concept of words, and use characters to extract word-level information in the construction of an information extraction system. The word-based approaches apply a lexicon to segment Chinese words. They often rely on a rich lexicon, sophisticated word segmentation, and/or syntactic analysis in extracting word-level information from documents [4]. However, existing Chinese lexicons are constructed for general applications. The lack of domain-specific corpora often hampers information extraction in specific domains such as geography and biodiversity. For example, the group Chinese Knowledge Information Processing (CKIP)³ is con- information. The lexicon now contains over 140,000 word entries, and is used in a corpus with over a million parsed sentences. This is a great research resource. Unfortunately, using the CKIP lexicon for extracting location and species names is not efficient.

³ http://ckip.iis.sinica.edu.tw/CKIP/engversion/index.htm

To efficiently extract species and location names from Facebook threads, it is necessary to constitute specific lexicons. We compiled a geo-name lexicon from the Taiwan Geographic Names database⁴ and a species-name lexicon from the Taiwan Catalogue of Life database (TaiCOL)⁵. Note, however, that species names and place names found in Facebook posts and comments are not always in these two specific lexicons. The name-entity recognition approach we use was elaborated in a paper we previously published [10].

⁴ http://placesearch.moi.gov.tw
⁵ http://col.taibif.tw

3.2 Evaluation of Name-Entity Recognition
Precision and recall are the basic measures used in Natural Language Processing to evaluate information extraction methods [9, 18, 30]. Generally speaking, an expert-labelled dataset is needed to assess the quality of information extraction. Our evaluation dataset is generated by domain experts. The names identified by the experts are treated as the actual set, and the names extracted by Name-Entity Recognition (NER) as the predicted set. According to whether an identification is correct, four sets can be distinguished: true positive, false positive, true negative, and false negative. From the statistical point of view, false positives are Type I errors, and false negatives are Type II errors. Precision is the ratio of the number of correct names identified by both NER and domain experts (True Positive) to the total number of incorrect and correct names identified by NER (True Positive + False Positive) (Eq. 2). Recall is the ratio of the number of correct names identified by both NER and domain experts (True Positive) to the total number of correct names identified by domain experts (True Positive + False Negative) (Eq. 1). The F-score is an overall metric that is calculated from both precision and recall, treating these two metrics as equally important (Eq. 3).

$$\mathrm{Recall} = \frac{|name_{actual} \cap name_{predict}|}{|name_{actual}|} \tag{1}$$
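
As a worked check of the measures defined in Section 3.2, the following minimal sketch computes precision, recall, and F-score from confusion-matrix counts. The function name `precision_recall_f` is hypothetical and introduced only for illustration; the counts are the ones reported in Table 1 (282 names found by both NER and the experts, 7 found by NER only, 10 found by the experts only).

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and F-score from confusion-matrix counts."""
    precision = tp / (tp + fp)  # |actual ∩ predict| / |predict|
    recall = tp / (tp + fn)     # |actual ∩ predict| / |actual|
    f_score = 2 * recall * precision / (recall + precision)
    return precision, recall, f_score


# Counts taken from Table 1: true positives, false positives, false negatives.
precision, recall, f_score = precision_recall_f(tp=282, fp=7, fn=10)
print(f"precision={precision:.4f} recall={recall:.4f} F-score={f_score:.4f}")
```

Because the F-score is the harmonic mean of precision and recall, it always lies between the two values and can never exceed 1, which gives a quick sanity check on any reported figure.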

Table 1: Confusion matrix of information extraction assessment.

                    Expert determine    Expert not determine
NER predict               282                    7
NER not predict            10                  101

$$\mathrm{Precision} = \frac{|name_{actual} \cap name_{predict}|}{|name_{predict}|} \tag{2}$$

$$\mathrm{F\text{-}score} = \frac{2 \times \mathrm{Recall} \times \mathrm{Precision}}{\mathrm{Recall} + \mathrm{Precision}} \tag{3}$$

where $name_{actual}$ is the set of place names or species names that has been identified from Facebook messages by domain experts, and $name_{predict}$ is the set of place names or species names that has been identified from Facebook messages by the NER.

For the evaluation, 400 posts are randomly selected from the entire 7,842 posts. The confusion matrix of the information extraction assessment is shown in Table 1. Thus, the precision is 282/(282 + 7) = 0.9758, the recall is 282/(282 + 10) = 0.9658, and the F-score is 2 × 0.9658 × 0.9758/(0.9658 + 0.9758) = 0.9707.

4. AN ONTOLOGY FOR CITIZENS AS SENSORS

4.1 A synthesis of social networks and sensor networks
Before we begin to transform the crowdsourced content to RDF, we first develop an ontology for not only expressing the notions of “Citizens as Sensors” but also formalizing the extracted name-entities, e.g. species and geospatial names. To make linked data interoperable, the ontology reuses suitable vocabularies from existing ontologies as much as possible. Since the crowdsourced dataset is retrieved from Facebook, a social media site, its content can be mapped to RDF using existing social semantic web ontologies. The Semantically Interlinked Online Communities (SIOC)⁶ ontology is used for representing the content of the Facebook group Reptile Road Mortality, e.g. threads, posts, and images. The Friend of a Friend (FOAF)⁷ vocabulary can be used to describe content creators. Figure 3 shows the vocabularies of SIOC and FOAF used in our ontology.

⁶ http://sioc-project.org
⁷ http://www.foaf-project.org

Figure 3: The vocabularies of SIOC and FOAF used in our ontology.

In this study, “Citizens as Sensors” means that a Citizen voluntarily reports his/her observations via social media for a citizen science project. The citizen acts as a Sensor which enables automatic measurement and/or recording of physical properties. To express the notion, the vocabularies of the W3C Semantic Sensor Network (SSN) ontology are used to express the content from social networks. Conceptually, the action of a participant reporting her/his roadkill observation matches the pattern of Stimulus-Sensor-Observation. The pattern describes a process in which a sensor transforms a stimulus from the physical world into an observation, and thereby it allows us to reason about the observed properties of particular features of interest [15]. A roadkill animal actually is the stimulus which triggers a citizen to post her/his observations on Facebook at a specific time and location. Also, the species of the animal is the feature of interest. Figure 4 displays the use of the vocabularies of the SSN ontology in our ontology.

However, the citizen is a person and cannot exactly be regarded as a sensor. Persons can be expressed as foaf:Person, and sensors can be defined as ssn:Sensor. Not all individuals of foaf:Person are individuals of ssn:Sensor, and vice versa. Only some of these individuals can be expressed as not only foaf:Person but also ssn:Sensor. To clarify the concept, we create the class Citizen_As_Sensor which is a subclass of the intersection of the two classes. That is, an individual of the class Citizen_As_Sensor is an instance of both classes, but instances of foaf:Person or ssn:Sensor are not necessarily individuals of the class Citizen_As_Sensor. Moreover, the same situation occurs for ssn:SensorOutput, as some of its instances are in sioc:Post or in foaf:Image. Therefore, we define the class Post_As_SensorOutput to be in the intersection of sioc:Post and ssn:SensorOutput, and the class Image_As_SensorOutput to be a subclass of both foaf:Image and ssn:SensorOutput.

4.2 Formalizations of the extracted name-entities

4.2.1 Geospatial information
In the process of information extraction, name-entity recognition is used to identify the geospatial and species names. The extraction of geospatial information includes not only location names (such as names of populated places and points of interest) and road names with kilometers but also coordinates (longitude and latitude). If coordinates are not written in the text of an observation post, the location names are used to retrieve the longitude and latitude. To semantically encode geospatial data, we use the vocabularies of Open Geospatial Consortium (OGC) GeoSPARQL. GeoSPARQL is an OGC standard that provides three main components for semantically encoding geographic data: (1) the definitions of vocabularies for representing features, geometries, and their relationships; (2) a set of domain-specific, spatial functions for use in SPARQL queries; (3) a set of query transformation rules [21].

The ontology of the GeoSPARQL standard includes three
main classes: geo:SpatialObject, geo:Feature, and geo:Geometry. The classes geo:Feature and geo:Geometry are subclasses of geo:SpatialObject. The geo:Feature class represents features, which are abstractions of real world phenomena. The concept of feature is derived from the ISO 19109 General Feature Model. The class geo:Geometry, expressing spatial geometries of the features, has sixteen subclasses defining a hierarchy of geometry types such as point, polygon, curve, arc, and multi-curve. These geometry classes are derived from the ISO 19107 Spatial Schema. RDF literals are used to store geometry values. There are two ways to store geometry values via RDF literals: Well Known Text (WKT) and Geography Markup Language (GML). The geo:asWKT and geo:asGML properties map between the geometry entities and the geometry literals. Geometry values for these two properties use the geo:wktLiteral and geo:gmlLiteral data types respectively. Figure 5 shows the classes and properties of GeoSPARQL used in our ontology.

Figure 4: The vocabularies of W3C SSN used in our ontology.

Figure 5: The vocabularies of GeoSPARQL used in our ontology.

Although DUL:hasLocation is usually a predicate between ssn:Observation and DUL:Entity in W3C SSN, it actually can be a property between any entities. To clarify the place of observation, we create a class PlaceOfObservation which is a subclass of both DUL:Entity and geo:Feature. The class PlaceOfObservation not only keeps the DUL:hasLocation property but also inherits the formal geospatial concepts from geo:Feature. As for the time of an ssn:Observation event, ssn:observationResultTime can be a predicate between the class ssn:Observation and the class time:DateTimeInterval.

4.2.2 Biodiversity information
Discovery and inventory of specimen data is fundamental work in biodiversity informatics. With the development of Internet technologies, the aggregation and dissemination of biodiversity data has increased in scale from regional to global, and has broadened in scope beyond that of establishing species ranges [16]. To reach global biodiversity data coordination, standardized metadata vocabularies such as Darwin Core are used to develop data infrastructures for sharing biodiversity data. Darwin Core is a standard for sharing data about biodiversity: the occurrence of life on earth and its associations with the environment [32]. However, Darwin Core is comprised of technology-independent vocabularies. The classes in Darwin Core are categories and have no formal domain declarations for vocabularies [31]. To improve the knowledge representation of Darwin Core, Darwin-SW⁸ designs the properties between classes and formalizes the classes, including five existing core classes of Darwin Core (i.e. Taxon, Event, Identification, Location, Occurrence) and two new ones (i.e. Token and Individual Organism). Figure 6 shows the classes and properties of Darwin Core used in our ontology.

⁸ https://code.google.com/p/darwin-sw/

Traditionally, a specimen comprising all or part of an organism serves as evidence for the occurrence of the organism, and is a basis for identifying the organism to a taxon concept. However, the documentation process nowadays has many possible methods such as images, sound, or DNA sequences. The class dsw:Token is used to represent evidence from the classes dwctype:Occurrence and dwctype:Identification. To connect Darwin Core to W3C SSN, we create the classes Token_As_FeatureOfInterest and Occurrence_As_Stimulus. Token_As_FeatureOfInterest is a subclass of the intersection of ssn:FeatureOfInterest and dsw:Token. The class Occurrence_As_Stimulus is in the intersection of ssn:Stimulus and dwctype:Occurrence.

4.3 Transformations from the extracted name-entities to the RDF model
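
Before the vocabularies are assembled, the bridging classes introduced in Sections 4.1 and 4.2 can be written out as explicit subclass statements. The following is a minimal sketch, not the authors' implementation: it emits Turtle in which each bridging class is a subclass of both parent classes, which places it inside their intersection. The prefix IRIs are assumptions based on the commonly published namespaces of these vocabularies, `ex:` is a hypothetical namespace for the new classes, and `rdfs:subClassOf` is used as the standard subclass property.

```python
# Assumed prefix IRIs; "ex:" is a hypothetical namespace for the new classes.
PREFIXES = {
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "ssn": "http://purl.oclc.org/NET/ssnx/ssn#",
    "DUL": "http://www.loa-cnr.it/ontologies/DUL.owl#",
    "geo": "http://www.opengis.net/ont/geosparql#",
    "dsw": "http://purl.org/dsw/",
    "dwctype": "http://rs.tdwg.org/dwc/dwctype/",
    "ex": "http://example.org/citizen-as-sensor#",
}

# Bridging classes from Sections 4.1 and 4.2, each with its two parent classes.
BRIDGING_CLASSES = {
    "ex:Citizen_As_Sensor": ["foaf:Person", "ssn:Sensor"],
    "ex:PlaceOfObservation": ["DUL:Entity", "geo:Feature"],
    "ex:Token_As_FeatureOfInterest": ["ssn:FeatureOfInterest", "dsw:Token"],
    "ex:Occurrence_As_Stimulus": ["ssn:Stimulus", "dwctype:Occurrence"],
}


def to_turtle(prefixes, classes):
    """Serialize prefix declarations and subclass statements as Turtle text."""
    lines = [f"@prefix {name}: <{iri}> ." for name, iri in prefixes.items()]
    lines.append("")
    for cls, parents in classes.items():
        # One statement with two objects: subclass of both parents.
        lines.append(f"{cls} rdfs:subClassOf {', '.join(parents)} .")
    return "\n".join(lines)


turtle_doc = to_turtle(PREFIXES, BRIDGING_CLASSES)
print(turtle_doc)
```

Declaring a class as a subclass of two parents is weaker than defining it as exactly their intersection: every Citizen_As_Sensor is both a person and a sensor, but a person who happens to be typed as a sensor is not automatically a Citizen_As_Sensor, which matches the description in Section 4.1.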

Figure 6: The vocabularies of Darwin-SW used in our ontology.

Assembling the above-mentioned vocabularies, we can create the ontology of “Citizen as Sensor”, as shown in Figure 7. The ontology so designed serves as the schema for transforming crowdsourced content to linked sensor data. Taking Figure 2 as an example, we can correspondingly transform the user-generated content to RDF data, as shown in the Appendix. The extracted name entities of species and place names are pointed to by URLs. The word “鼬獾” (Melogale moschata subaurantiaca) is identified as a taxon , as shown in Figure 8, and mapped to the scientific name , as shown in Figure 9. The extracted place name “新店” (Sindian) is also linked to a URI in Taiwan Geographic Name, whose URIs are all mapped to Geonames.org, as shown in Figure 10.

Figure 7: The ontology of “Citizen as Sensor”. (Panels: Sensor Network, Geospatial, Time, Social Network, Biodiversity.)

Figure 8: The taxon concept of the extracted species name is linked to a URI in TaiBIF.

Figure 9: The taxon name of the extracted species name is linked to a URI in TaiBIF.

Figure 10: The extracted place name points to a URI in Taiwan Geographic Name.

5. SPATIOTEMPORAL QUERIES
Since the geospatial information is formalized by the vocabularies of OGC GeoSPARQL, information in our RDF dataset can be retrieved via spatiotemporal queries. This study uses BBN Parliament, which is an open source triple store developed by Raytheon BBN Technologies. BBN Parliament is compliant with the OGC GeoSPARQL standard, and supports spatial and non-spatial SPARQL queries. Using BBN Parliament, we build a GeoSPARQL endpoint⁹ for the linked crowdsourced sensor dataset. The following lists a GeoSPARQL query, and Figure 11 is the result of the query.

PREFIX geo:
PREFIX geof:
PREFIX owl:
PREFIX rdf:
PREFIX rdfs:
PREFIX sf:
PREFIX time:
PREFIX units:
PREFIX xsd:
PREFIX eoe:
PREFIX DUL:
PREFIX ssn:

SELECT DISTINCT ?Obs ?POO_geo ?POO_wkt
WHERE {
  ?Obs a ssn:Observation ;
       DUL:hasLocation ?POO ;
       ssn:observationResultTime ?Int .
  ?POO geo:hasGeometry ?POO_geo .
  ?POO_geo geo:asWKT ?POO_wkt .
  ?Int time:xsdDateTime ?Time_xsd .
  FILTER (geof:sfWithin(?POO_wkt, "POLYGON((121.756555 24.488236, 121.207238 24.488236, 121.207238 25.141394, 121.756555 25.141394, 121.756555 24.488236))"^^sf:wktLiteral))
  FILTER (?Time_xsd > "2013-12-19T16:00:00Z"^^xsd:dateTime)
}

Figure 11: The result of a spatiotemporal query.

To efficiently browse the RDF triples, we develop a faceted viewer¹⁰ including a taxon tree, a social relation graph, and an observation map, as shown in Figure 12. The taxon tree can visualize the identified species names via their taxon concepts such as kingdom, phylum, class, order, family, and genus. The social relation graph shows the connections between the participants in the citizen science project. It can be used to view who observes what species, and where the species occurs. To display locations of species occurrences, the coordinates are used to pin the species on the map. A timeline is also used to show the times of the species occurrences.

Figure 12: A faceted viewer.

⁹ http://lod.tw/parliament/
¹⁰ http://taibif.tw/vgd/ldow2014/viewer.php
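
The two filters of the spatiotemporal query in Section 5 (a bounding polygon and a lower time bound) can be generated from plain parameters, which is convenient when a viewer lets users pick a map extent and a start date. The sketch below is illustrative only: the helper names `bbox_to_wkt_polygon` and `spatiotemporal_filters` are hypothetical, and the variable names `?POO_wkt` and `?Time_xsd` and the `sf:wktLiteral` datatype are simply copied from the query listed in the paper. No endpoint is contacted.

```python
def bbox_to_wkt_polygon(min_lon, min_lat, max_lon, max_lat):
    """Return a closed WKT polygon ring for a longitude/latitude bounding box."""
    corners = [
        (max_lon, min_lat),
        (min_lon, min_lat),
        (min_lon, max_lat),
        (max_lon, max_lat),
        (max_lon, min_lat),  # repeat the first corner to close the ring
    ]
    ring = ", ".join(f"{lon:.6f} {lat:.6f}" for lon, lat in corners)
    return f"POLYGON(({ring}))"


def spatiotemporal_filters(bbox, after_iso_utc):
    """Build the two FILTER clauses: inside the bbox and later than a timestamp."""
    wkt = bbox_to_wkt_polygon(*bbox)
    return (
        f'FILTER (geof:sfWithin(?POO_wkt, "{wkt}"^^sf:wktLiteral))\n'
        f'FILTER (?Time_xsd > "{after_iso_utc}"^^xsd:dateTime)'
    )


# Bounding box and timestamp taken from the query in Section 5.
filters = spatiotemporal_filters(
    (121.207238, 24.488236, 121.756555, 25.141394),
    "2013-12-19T16:00:00Z",
)
print(filters)
```

The generated clauses can be pasted into the WHERE block of the query; keeping the polygon on a single line avoids line breaks inside the quoted literal.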

6. RELATED WORK
Traditionally, in order to ensure the quality of data collections, training and educating volunteers by experts or experienced participants is a common method in citizen science [11]. The volunteers are thus able to fill in designated forms, to use well-defined terms, and/or to follow default steps on the web for reporting their observations. The user-contributed data can thus be fitted to a default data model. However, this method is difficult to apply when citizen science projects depend on Web applications and services. It is argued that there exists an inherent trade-off between data quality and data quantity [23]. The growth of data quantity will be slow if data contribution is restricted to experts or trained volunteers. On the contrary, data volume often increases rapidly if data contribution is entirely open to volunteers, but data quality is then hard to guarantee. Such volunteered contributions can easily be imperfect (e.g. erroneous, incomplete, or fraudulent) and unstructured (e.g. in the form of texts and/or images) [6, 10]. Crowdsourcing is the first step of data collection in citizen science. After preprocessing and cleaning up the noise, crowdsourced data can provide more valuable information to scientists than raw data can. The role of semantic web technologies is increasingly important for tackling crowdsourced data. To enable semantic computing to process crowdsourced data, Sheth proposed a semantics-empowered social computing architecture for dealing with crowdsourced data [25]. The architecture emphasized the use of domain-specific or spatial-temporal-thematic ontologies for extracting meaning in the data.

The idea of citizen sensing is not new. Goodchild coined the term “Volunteered Geographic Information” (VGI) to describe a contemporary trend where Web technologies empower a network of human sensors voluntarily reporting and interpreting in-situ information [13]. Sheth also described Internet users or Web-enabled social communities as citizens. The ability to interact with Web 2.0 services can augment these citizens into citizen sensors [24]. He further explained the advantages of “human-in-the-loop sensing”, emphasizing the background knowledge and past experi-

data in disaster management, and it shall help humanitarian agencies make informed decisions. The exploitation of external semantic resources to disambiguate contents is often said to be an effective method. To enrich the semantics of folksonomies, Choudhury et al. not only built up relations among tags via statistical analysis but also integrated the structured tags with the linked data cloud through DBpedia [5]. Mendes et al. proposed a Linked Open Social Signals architecture for collection, semantic annotation, and analysis of real-time social signals from microblogging data [19]. The design of Linked Data management often aims to “reach a high level of automation with respect to the processing of an open and decentralized data space bringing together data sources published by different parties, of varying quality and using heterogeneous conceptual schemas and vocabularies” [27]. Crowley et al. proposed a generic framework for aggregating and linking heterogeneous data from various sources and transforming them to Linked Data [8]. The framework allows reuse and integration of the produced data with other data resources (including social media and sensors), enabling spatial business intelligence for various domain-specific applications.

7. CONCLUSION AND FUTURE WORK
Social media creates new opportunities for citizen science. The information created from social media is considered a new resource for scientific works. Meanwhile, the use of social media in citizen science projects also brings new issues to research data. This study explored the issues involved in the use of social media in citizen science projects, and reported our experiences in transferring unstructured collaborative information to structured data for scientific purposes. We shared our experiences in moving the data collection from a social process to a scientific process. The successful implementation of this approach can further facilitate the development of social-media based citizen science projects. We believe it also has broader applications in user-generated content management, and promises to be a practical solution to an important design
ences from human in citizen sensing. Janowicz and Comp- problem in citizen science projects on the Web.
ton developed the Stimulus-Sensor-Observation ontology This study deals with crowdsourced content from a citi-
pattern which forms the Semantic Sensor Network (SSN) zen science project via a “Citizen as Sensor” ontology. The
ontology as developed by the W3C SSN Incubator Group processed data is formalized by inheriting the concepts
[15]. The design pattern provides a knowledge represen- from the ontology. Thus, the extracted name entities can
tation for integration of social web and sensor web. Some be mapped to the existing resources and linked to domain-
studies not only transformed the crowdsourced data to a specific concepts. With clarified domain-specific semantics,
standard format such as RDF but also leverage the power the triplified data can be applied in faceted exploration for
of the SSN ontology to describe the sensors on mobile de- new knowledge. This study uses several tools for storing
vices for passenger information system and in emergency and visualizing the RDF triples. To make the browser more
reporting applications on microblogging platforms [6, 7]. usable, a task to integrate the tools into a knowledge-based
Linked Data has established itself as the de facto means browser remains to be done in the future. Moreover, the
for the publication of structured data over the Web. More triplified dataset should be considered for linkage to larger
and more ICT ventures offer innovative data management linked datasets such as DBPedia and other resources.
services on the top of Linked Open Data (LOD) [27]. Ort-
mann et al. described an approach based on LOD to allevi-
ating the integration problems of crowdsourced data, and
8. REFERENCES
to improving the exploitation of crowdsourced data in dis- [1] S. Auer, J. Lehmann, and A.-C. N. Ngomo. Introduction
aster management [22]. To solve the problem of structural to linked data and its lifecycle on the web. In
and semantic interoperability, they also suggested engage Proceedings of the 7th International Conference on
people in processing unstructured observations into struc- Reasoning Web: Semantic Technologies for the Web
tured RDF-triples according to Linked Open Data principles. of Data, RW’11, pages 1–75, Berlin, Heidelberg, 2011.
The process would increase the impact of crowdsourced Springer-Verlag.
[2] C. Bizer, T. Heath, and T. Berners-Lee. Linked data -
the story so far. Int. J. Semantic Web Inf. Syst.,
5(3):1–22, 2009.
[3] G. Chatzimilioudis, A. Konstantinidis, C. Laoudias,
and D. Zeinalipour-Yazti. Crowdsourcing with
smartphones. Internet Computing, IEEE, 16(5):36–44,
Sept 2012.
[4] L.-F. Chien. Pat-tree-based keyword extraction for
chinese information retrieval. In ACM SIGIR Forum,
volume 31, pages 50–58. ACM, 1997.
[5] S. Choudhury, J. G. Breslin, and A. Passant.
Enrichment and ranking of the YouTube tag space
and integration with the linked data cloud. In
International Semantic Web Conference, volume 5823
of LNCS, pages 747–762. Springer, 2009.
[6] D. Corsar, P. Edwards, N. Velaga, J. Nelson, and J. Z.
Pan. Short paper: Addressing the challenges of
semantic citizen-sensing. In Proceedings of the 4th
International Workshop on Semantic Sensor
Networks (SSN’11), pages 101–106, 2011.
[7] D. Crowley, A. Passant, and J. G. Breslin. Short paper:
Annotating microblog posts with sensor data for
emergency reporting applications. In Proceedings of
the 4th International Workshop on Semantic Sensor
Networks (SSN’11), pages 95–100, 2011.
[8] D. N. Crowley, M. Dabrowski, and J. G. Breslin.
Decision support using linked, social, and sensor data.
In Proceedings of the Nineteenth Americas
Conference on Information Systems, 2013.
[9] K. Crowston, E. E. Allen, and R. Heckman. Using
natural language processing technology for
qualitative data analysis. International Journal of
Social Research Methodology, 15(6):523–543, 2012.
[10] D.-P. Deng, G.-S. Mai, C.-H. Hsu, T.-R. Chuang, T.-E.
Lin, H.-H. Lin, K.-T. Shao, R. Lemmens, and M.-J.
Kraak. Using social media for collaborative species
identification and occurrence: Issues, methods, and
tools. In Proceedings of the 1st ACM SIGSPATIAL
International Workshop on Crowdsourced and
Volunteered Geographic Information, GEOCROWD
’12, pages 22–29, New York, NY, USA, 2012. ACM.
[11] A. Flanagin and M. Metzger. The credibility of
volunteered geographic information. GeoJournal,
72:137–148, 2008.
[12] H. Gao, G. Barbier, and R. Goolsby. Harnessing the
crowdsourcing power of social media for disaster
relief. Intelligent Systems, IEEE, 26(3):10–14, May
2011.
[13] M. Goodchild. Citizens as sensors: the world of
volunteered geography. GeoJournal, 69:211–221,
2007.
[14] T. R. Gruber. A translation approach to portable
ontology specifications. Knowledge Acquisition,
5:199–220, 1993.
[15] K. Janowicz and M. Compton. The
stimulus-sensor-observation ontology design pattern
and its integration into the semantic sensor network
ontology. In Proceedings of the 3rd International
Workshop on Semantic Sensor Networks 2010
(SSN10) in conjunction with the 9th International
Semantic Web Conference (ISWC 2010), ISWC’10,
2010.
[16] S. Kelling, J. Gerbracht, D. Fink, C. Lagoze, W.-K.
Wong, J. Yu, T. Damoulas, and C. P. Gomes. A
human/computer learning network to improve
biodiversity conservation and research. AI Magazine,
34(1):10–20, 2013.
[17] R. Lemmens and D. Deng. Web 2.0 and semantic web:
Clarifying the meaning of spatial features. Semantic
Web meets Geospatial Applications, AGILE, 2008.
[18] C. D. Manning and H. Schütze. Foundations of
statistical natural language processing. MIT press,
1999.
[19] P. N. Mendes, A. Passant, P. Kapanipathi, and A. P.
Sheth. Linked open social signals. In Proceedings of
the 2010 IEEE/WIC/ACM International Conference on
Web Intelligence and Intelligent Agent
Technology-Volume 01, pages 224–231. IEEE
Computer Society, 2010.
[20] G. Newman, D. Zimmerman, A. Crall, M. Laituri,
J. Graham, and L. Stapel. User-friendly web mapping:
lessons from a citizen science website. Int. J. Geogr.
Inf. Sci., 24(12):1851–1869, Dec. 2010.
[21] OGC. GeoSPARQL - A Geographic Query Language for
RDF Data. Technical report,
http://www.opengeospatial.org/standards/geosparql,
2011.
[22] J. Ortmann, M. Linbu, W. Dong, and T. Kauppinen.
Crowdsourcing linked open data for disaster
management. In W. W. Cohen and S. Gosling, editors,
Terra Cognita, pages 11–22, 2011.
[23] J. Parsons, R. Lukyanenko, and Y. Wiersma. Easier
citizen science is better. Nature, 471(7336):37, Mar.
2011.
[24] A. Sheth. Citizen sensing, social signals, and
enriching human experience. Internet Computing,
IEEE, 13(4):87–92, July 2009.
[25] A. Sheth. Computing for human experience:
Semantics-empowered sensors, services, and social
computing on the ubiquitous web. Internet
Computing, IEEE, 14(1):88–91, 2010.
[26] J. Silvertown. A new dawn for citizen science. Trends
in Ecology & Evolution, 24(9):467–471, 2009.
[27] E. Simperl. Crowdsourcing semantic data
management: Challenges and opportunities. In
Proceedings of the 2nd International Conference on
Web Intelligence, Mining and Semantics, WIMS ’12,
pages 1:1–1:3, New York, NY, USA, 2012. ACM.
[28] C. Stadler, J. Lehmann, K. Höffner, and S. Auer.
Linkedgeodata: A core for a web of spatial open data.
Semantic Web Journal, 3(4):333–354, 2012.
[29] B. L. Sullivan, C. L. Wood, M. J. Iliff, R. E. Bonney,
D. Fink, and S. Kelling. ebird: A citizen-based bird
observation network in the biological sciences.
Biological Conservation, 142(10):2282–2292, 2009.
[30] K. Verspoor, K. B. Cohen, A. Lanfranchi, C. Warner,
H. L. Johnson, C. Roeder, J. D. Choi, C. Funk,
Y. Malenkiy, M. Eckert, et al. A corpus of full-text
journal articles is a robust evaluation tool for
revealing differences in performance of biomedical
natural language processing tools. BMC
bioinformatics, 13(1):207, 2012.
[31] C. Webb and S. Baskauf. Darwin-sw: Darwin core data
for the semantic web. TDWG Annual Meeting;
2011-10-18, 2011.
[32] J. Wieczorek, D. Bloom, R. Guralnick, S. Blum,
M. Döring, R. Giovanni, T. Robertson, and D. Vieglais.
Darwin core: An evolving community-developed
biodiversity data standard. PLoS ONE, 7(1):e29715,
2012.
[33] K.-F. Wong, W. Li, R. Xu, and Z.-s. Zhang. Introduction
to Chinese Natural Language Processing. Morgan &
Claypool Publishers, 2010.

APPENDIX

A. FROM UGC TO ENRICHED RDF DATA

@prefix rdf: .
@prefix geo: .
@prefix foaf: .
@prefix DUL: .
@prefix dwc: .
@prefix dsw: .
@prefix taibif: .
@prefix ssn: .
@prefix sf: .
@prefix w3c_geo: .
@prefix schema: .
@prefix sioc: .
@prefix rdfs: .
@prefix dwctype: .
@prefix time: .
@prefix dct: .
@prefix owl: .
@prefix xsd: .
@prefix rdf: .
@prefix eoe: .
@prefix fb: .
@prefix tgn: .
@prefix taxon: .
@prefix skos: .
@prefix gn: .

eoe:img_559070840853748 rdf:type eoe:Image_As_SensorOutput ,
    owl:NamedIndividual ;
    sioc:has_container eoe:thread_559070840853748 ;
    sioc:has_owner fb:100002525111203 ;
    ssn:isProducedBy eoe:person_100002525111203 .

fb:238918712815615_694835510557264 rdf:type eoe:Post_As_SensorOutput ,
    owl:NamedIndividual ;
    sioc:has_container eoe:thread_559070840853748 ;
    sioc:has_creator fb:100002525111203 ;
    ssn:isProducedBy eoe:person_100002525111203 .

eoe:iden_559070840853748_01 rdf:type dwctype:Identification ,
    owl:NamedIndividual ;
    dwc:dateIdentified eoe:iden_time_559070840853748 ;
    dsw:identifies eoe:idv_238918712815615_694835510557264 ;
    dsw:isBasedOn eoe:token_559070840853748 ;
    dsw:toTaxonConcept taxon:380522 .

eoe:token_559070840853748 rdf:type eoe:Token_As_FeatureOfInterest ,
    owl:NamedIndividual .

eoe:idv_238918712815615_694835510557264 rdf:type dsw:IndividualOrganism ,
    owl:NamedIndividual .

eoe:obs_559070840853748 rdf:type ssn:Observation ,
    owl:NamedIndividual ;
    ssn:observationResultTime eoe:obs_time_559070840853748 ;
    DUL:hasLocation eoe:placeOfOb_559070840853748 ;
    ssn:observationResult eoe:img_559070840853748 ,
    fb:238918712815615_694835510557264 ;
    ssn:featureOfInterest eoe:token_559070840853748 ;
    ssn:observedBy eoe:person_100002525111203 .

eoe:obs_time_559070840853748 rdf:type time:DateTimeInterval ,
    owl:NamedIndividual ;
    time:xsdDateTime "2013-12-04T07:42:15"^^xsd:dateTime .

eoe:iden_time_559070840853748 rdf:type time:DateTimeInterval ,
    owl:NamedIndividual ;
    time:xsdDateTime "2013-12-11T07:42:15"^^xsd:dateTime .

eoe:placeOfOb_559070840853748 rdf:type eoe:PlaceOfObservation ,
    owl:NamedIndividual ;
    geo:hasGeometry eoe:point_559070840853748 ;
    gn:name "新店" ;
    owl:sameAs http://lod.tw/placenames/159624 .

eoe:point_559070840853748 rdf:type geo:Point ,
    owl:NamedIndividual ;
    w3c_geo:long "121.575200" ;
    w3c_geo:lat "24.951490" ;
    geo:asWKT "Point(121.575200 24.951490)"^^sf:wktLiteral .

eoe:thread_559070840853748 rdf:type sioc:Thread ,
    owl:NamedIndividual ;
    sioc:has_container fb:groups/roadkilled .

eoe:occr_559070840853748 rdf:type eoe:Occurrence_As_Stimulus ,
    owl:NamedIndividual ;
    dsw:hasEvidence eoe:token_559070840853748 .

eoe:person_100002525111203 rdf:type eoe:Person_As_Sensor ,
    owl:NamedIndividual ;
    rdfs:label "Chuang Yu Ta" ;
    ssn:detects eoe:occr_559070840853748 ;
    ssn:observes eoe:token_559070840853748 ;
    foaf:account fb:100002525111203 .

taxon:380522 rdf:type dwctype:Taxon ,
    owl:NamedIndividual ;
    dsw:hasName taibif:380522 ;
    skos:prefLabel "Melogale moschata subaurantiaca" ;
    skos:altLabel "鼬獾" .
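
To illustrate how triples of this kind could support the faceted exploration mentioned in the conclusion, the following is a minimal sketch in plain Python. It is not the tooling used in this study. It assumes a small subset of the triples from this appendix, copied by hand as prefixed-name strings (with the taxon label predicate written as skos:prefLabel); the names TRIPLES, facet_by_type, objects, and taxon_names_for_observation are hypothetical helpers introduced only for this example.

```python
from collections import defaultdict

# A hand-copied subset of the appendix triples, kept as (subject, predicate, object)
# tuples of prefixed names. This is an illustrative assumption, not a parsed graph.
TRIPLES = [
    ("eoe:obs_559070840853748", "rdf:type", "ssn:Observation"),
    ("eoe:obs_559070840853748", "ssn:observedBy", "eoe:person_100002525111203"),
    ("eoe:obs_559070840853748", "ssn:featureOfInterest", "eoe:token_559070840853748"),
    ("eoe:person_100002525111203", "rdf:type", "eoe:Person_As_Sensor"),
    ("eoe:iden_559070840853748_01", "rdf:type", "dwctype:Identification"),
    ("eoe:iden_559070840853748_01", "dsw:isBasedOn", "eoe:token_559070840853748"),
    ("eoe:iden_559070840853748_01", "dsw:toTaxonConcept", "taxon:380522"),
    ("taxon:380522", "rdf:type", "dwctype:Taxon"),
    ("taxon:380522", "skos:prefLabel", "Melogale moschata subaurantiaca"),
]


def facet_by_type(triples):
    """Group subjects by their rdf:type value, giving one facet per class."""
    facets = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == "rdf:type":
            facets[obj].add(subject)
    return facets


def objects(triples, subject, predicate):
    """Return all objects of the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def taxon_names_for_observation(triples, observation):
    """Follow observation -> token -> identification -> taxon -> preferred label."""
    names = []
    for token in objects(triples, observation, "ssn:featureOfInterest"):
        # Identifications point at the token they are based on, so look them up in reverse.
        identifications = [s for s, p, o in triples
                           if p == "dsw:isBasedOn" and o == token]
        for identification in identifications:
            for taxon in objects(triples, identification, "dsw:toTaxonConcept"):
                names.extend(objects(triples, taxon, "skos:prefLabel"))
    return names


if __name__ == "__main__":
    for rdf_class, members in sorted(facet_by_type(TRIPLES).items()):
        print(rdf_class, sorted(members))
    print(taxon_names_for_observation(TRIPLES, "eoe:obs_559070840853748"))
```

Grouping by rdf:type gives one facet per class (observations, persons acting as sensors, identifications, taxa), and the link-following helper shows how an observation reaches a taxon name through the shared token. A triple store with a query language such as GeoSPARQL [21] would replace these list scans in practice.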