A Semantic Wiki Alerting Environment Incorporating
Credibility and Reliability Evaluation
Brian Ulicny (a), Christopher J. Matheus (a), Mieczyslaw M. Kokar (a, b)
(a) VIStology, Inc.   (b) Northeastern University
Abstract. In this paper, we describe a system that semantically annotates streams of reports
about transnational criminal gangs in order to automatically produce models of the gangs’
membership and activities in the form of a semantic wiki. A gang ontology and semantic
inferencing are used to annotate the reports and supplement entity and relationship annotations
based on the local document context. Reports in the datastream are annotated for reliability and
credibility in the proof-of-concept system.
Keywords: media monitoring; semantic analysis; entity/relation extraction;
event tracking; gangs; reliability; credibility
1 Introduction
In this paper, we describe a prototype we are developing that we call the Semantic
Wiki Alerting Environment (SWAE). SWAE ingests streams of open-source news
media and social media and automatically constructs a model of transnational
criminal street gangs, including their membership and their activities. The system
automatically provides updates and alerts to significant changes in that model in the
form of emails, text alerts and semantic wiki pages. The system relies heavily on
ontology-based semantic annotation [1].
In today’s intelligence and battlespace environment, large amounts of data from
many sources must be effectively analyzed in a timely manner in order to provide an
accurate and up-to-date understanding of current and potential threats. Key to
understanding these threats is the identification and characterization of the various
entities that they involve. These include the relevant individuals, groups, locations
and events along with their corresponding interrelationships.
A wiki is a Web-based environment in which users can easily edit the text and
layout of documents using a simplified, non-HTML syntax. Wikipedia is the most
familiar example: a world-wide encyclopedia that any user can edit. Change-tracking
by author and automatic hyperlinking are important aspects of wiki functionality. A
semantic wiki is a wiki in which users can not only easily insert hyperlinks between
documents, but in which semantic annotations of documents can be easily edited by
users. In a semantic wiki, semantic web triples are encoded directly in the text. The
subject of the triple is the topic of the page itself; predicate and object are then
encoded as attribute::value pairs in the text. Thus, the markup
[[population::3,396,990]] on a page for Berlin asserts that Berlin has a population of
that size. One can further represent that the population predicate is of
Type::Number, to enable proper sorting and comparison. These triples can be used
within semantic queries and to populate visualizations such as maps, timelines, and
graphs automatically. We use the Semantic MediaWiki platform, an extension of the
MediaWiki platform that underlies Wikipedia.
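For concreteness, such an annotation corresponds to an ordinary RDF triple whose subject is the page's topic. The following minimal sketch builds that triple with the rdflib Python library; the wiki and property namespace URIs are placeholders, not the URIs Semantic MediaWiki actually assigns.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import XSD

    # Placeholder namespaces standing in for the wiki's page and property URIs.
    WIKI = Namespace("http://example.org/wiki/")
    PROP = Namespace("http://example.org/wiki/Property/")

    g = Graph()
    # [[population::3,396,990]] on the Berlin page asserts, in effect:
    g.add((WIKI["Berlin"], PROP["population"], Literal(3396990, datatype=XSD.integer)))
    print(g.serialize(format="turtle"))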
In their current state, semantic wikis are relatively primitive and require significant
human effort in order to annotate the wiki’s contents with semantic markup
consistently [5]. However, in this project we have customized a semantic wiki to
automatically pre-process incoming data from multiple sources, extracting relevant
semantic information (explicit metadata and implicit relationships) and rendering it in
a form readily consumable, and editable, by human analysts through the wiki
interface. We also implement user-definable alerting capabilities to permit automated
notification regarding significant new events or critical changes in the composite
representations of key entities such as dangerous individuals or groups. Ontology-
based alerting capabilities of this sort necessitate the use of a formal inference engine,
ideally one that is rule-based, to facilitate and simplify user customization.
2 System Overview
The high-level design of SWAE is depicted in Figure 1. Data flows into the system
from the left in the form of data streams (e.g. Tweets (Twitter updates), Blogs, news,
alerts (standing news queries)). These reports are processed by the entity and relation
extraction and semantic analysis algorithms. The annotated results are placed into the
data repository and trigger the invocation of the alert engine, which is based on the
SPARQL query engine available in the Sesame RDF data store. The results are
used to inform the user of significant items and to update the semantic model
maintained in the semantic wiki for subsequent access and further analysis by users.
Semantic wiki pages are created automatically from the RDF produced during
semantic analysis and entity extraction.
Figure 1. SWAE Data Flow
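The processing loop implied by Figure 1 can be summarized in the following Python skeleton. Every helper is a stub standing in for a component described in later sections (feed polling, OpenCalais extraction, rule-based semantic analysis, the SPARQL alert engine, and wiki page generation); none of the names reflect the system's actual API.

    def fetch_new_reports(feed_urls):
        """Poll the RSS/Atom feeds and return newly seen reports (stub)."""
        return []

    def extract_entities_and_relations(report):
        """Submit a report to an extraction service such as OpenCalais; return RDF triples (stub)."""
        return []

    def run_semantic_analysis(triples):
        """Apply ontology- and rule-based inference to augment and correct the triples (stub)."""
        return triples

    def evaluate_alert_queries(store):
        """Run the standing SPARQL alert queries against the RDF store (stub)."""
        return []

    def process_cycle(feed_urls, store, notify, update_wiki):
        """One polling cycle: ingest, analyze, alert, and refresh the semantic wiki."""
        for report in fetch_new_reports(feed_urls):
            store.extend(run_semantic_analysis(extract_entities_and_relations(report)))
        for alert in evaluate_alert_queries(store):
            notify(alert)        # e-mail or text-message alert
        update_wiki(store)       # regenerate Semantic MediaWiki pages from the store

    process_cycle([], [], notify=print, update_wiki=lambda store: None)  # no-op demonstration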
For development purposes, we have chosen to monitor data about the activities of
transnational gangs as the focus of our investigation. There are many parallels
between countering organized gang activity and counterinsurgency. Reports about
gang activities are readily available from open sources and do not require translation.
We monitor several RSS feeds and periodically download and process new items
in order to update the system. In addition, we track news media outlets and law
enforcement press releases that we obtain via the news aggregator service Topix.net.
Social media platforms such as Twitter (twitter.com) and Flickr (flickr.com) contain
many reports by both self-professed gang associates and those chronicling their
activity; these data streams, however, are quite noisy. Twitter status updates
mentioning gang names contain a mix of chatter about the gang, unrelated uses of the
term, and links to news articles. Photo-sharing sites such as Flickr contain
many depictions of gang graffiti, which can often be mapped to specific times and
locations; several groups on Flickr are dedicated to documenting gang graffiti.
Our goal is to monitor these social media and open-source media streams in order
to trigger alerts such as:
• A 10% increase in gang G's weekly incidents of type I in location L
• First occurrence of incident I by G in L in past year
• A 10% increase in attacks of G1 on G2
• New member of gang G (see the query sketch following this list)
• A 10% increase in G membership in L since T
• New leader L of G
• A 20% increase in communications between members of G in past 24 hours
• Social media report of gang activity not correlated with media report
• Graffiti by or about gang G.
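As an illustration of how such alerts can be posed as standing SPARQL queries against the repository, the sketch below looks for gang members first reported after a given date, a simplification of the "new member of gang G" alert. It uses the SPARQLWrapper Python library; the endpoint URL and the gang:memberOf and gang:reportedOn property names are placeholders rather than the actual repository location or ontology terms.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Placeholder endpoint and vocabulary; not the deployed repository or ontology.
    sparql = SPARQLWrapper("http://localhost:8080/openrdf-sesame/repositories/swae")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
        PREFIX gang: <http://example.org/gang#>
        SELECT ?person ?g WHERE {
            ?person gang:memberOf ?g ;
                    gang:reportedOn ?date .
            FILTER (?date >= "2010-06-01"^^<http://www.w3.org/2001/XMLSchema#date>)
        }
    """)

    for row in sparql.query().convert()["results"]["bindings"]:
        print("New-member alert:", row["person"]["value"], "->", row["g"]["value"])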
3 Ontology
In the Street Gang ontology (Figure 2) there are four primary top-level classes:
Organization, Person, Incident and Information. The ontology defines numerous types
of Incidents but distinguishes between CriminalIncidents and non-criminal incidents,
the former of which are used to infer membership in the Criminal class. There are also
several types of Information corresponding to the source data that SWAE processes.
There are two secondary classes, IncidentRate and Source. IncidentRate is intended
to record the count of incidents of a certain incidentType that are carried out by an
Organization in a given period of time; these elements were added as it became clear
from our sample rules that such constructs would be necessary to support many of
them. The Source class was created to permit the author of a piece of Information to
be either an Organization or a Person.
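The sketch below renders a fragment of this class structure in Turtle and loads it with rdflib. Only the class and property names mentioned in this section come from the ontology; the URIs, the subclass axioms, and the exact property definitions are illustrative assumptions.

    from rdflib import Graph

    ONTOLOGY_FRAGMENT = """
    @prefix :     <http://example.org/gang#> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    :Organization a owl:Class .   :Person      a owl:Class .
    :Incident     a owl:Class .   :Information a owl:Class .
    :CriminalIncident a owl:Class ; rdfs:subClassOf :Incident .
    :Criminal         a owl:Class ; rdfs:subClassOf :Person .

    # IncidentRate records how many incidents of a given type an Organization
    # carried out in a period of time (property names are assumptions).
    :IncidentRate a owl:Class .
    :incidentType a owl:ObjectProperty ; rdfs:domain :IncidentRate .
    :carriedOutBy a owl:ObjectProperty ; rdfs:domain :IncidentRate ; rdfs:range :Organization .

    # A Source may be either an Organization or a Person.
    :Source a owl:Class ;
        owl:equivalentClass [ a owl:Class ; owl:unionOf ( :Organization :Person ) ] .
    """

    g = Graph()
    g.parse(data=ONTOLOGY_FRAGMENT, format="turtle")
    print(len(g), "triples in the ontology fragment")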
4 System Components
Feeds for data sources are periodically re-queried in order to obtain the latest
reports, both from media outlets (which are analogous to analyzed intelligence reports)
and from social media such as Twitter and Flickr (reports which, if not citing media
outlets, are analogous to source material that has not yet been subject to intelligence
analysis). Feeds in non-RDF-compliant formats are converted to RDF automatically.
These source feeds provide useful metadata about the reports. Links from the RSS
feeds are automatically extracted and are then processed using the OpenCalais API
(http://www.opencalais.com/calaisAPI) to extract basic-level objects and relations
based on their local context in the text.
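A minimal sketch of the feed-polling step using the feedparser Python library is shown below. The feed URL is a placeholder, and the submission to the extraction service is a stub; the actual OpenCalais request parameters are not reproduced here.

    import feedparser

    FEED_URL = "http://example.org/rss/gang-news.xml"   # placeholder feed

    def poll_feed(url):
        """Yield (link, title, published) for each item currently in the feed."""
        for entry in feedparser.parse(url).entries:
            yield entry.get("link"), entry.get("title"), entry.get("published", "")

    def submit_to_extractor(link):
        """Stub standing in for submitting the linked article text to OpenCalais."""
        print("would submit for entity/relation extraction:", link)

    for link, title, published in poll_feed(FEED_URL):
        submit_to_extractor(link)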
Figure 2. Gang Ontology

In the following extract, OpenCalais’s output detects the presence of an arrest
relationship in the string:

    …The arrest follows the May 28 arrest in Santa Cruz of [X], another [Gang Y] member…
The RDF output of OpenCalais encodes the detection of an instance of the Arrest
relation as follows: an entity of type InstanceInfo is created with URI
“…/Instance/40”. This InstanceInfo points, via oc:subject, to the URI of the detected
relation instance. Note that this InstanceInfo doesn’t by itself provide explicit
information about who was arrested, when, or where. (The ‘oc’ prefix denotes an
OpenCalais namespace.) The key fields of this InstanceInfo are:
    detection: “The arrest follows the May 28 arrest in Santa Cruz of [X], another
                [Gang Y], or [VariantName V], member”
    exact: “the May 28 arrest in Santa Cruz of [X]”
    character offset and length values: 55, 1071
A further part of the OpenCalais output says that the indicated incident is of
rdf:type oc:Arrest. It also identifies the oc:person of the Arrest incident by URI,
and it gives the date string (“May 28”) and normalized date (2010-05-28) of the
incident. Note that this RDF snippet about the incident URI (i.e., all the information
about this incident in RDF form) does not specify, in particular, where the Arrest
took place. Moreover, none of the elements below is guaranteed to be present in the
OpenCalais output depicting an incident of rdf:type Arrest:
    date string: “May 28”
    normalized date: 2010-05-28
5 Semantic Analysis
OpenCalais’s processing is quite sophisticated, but because it does not always
specify what our alert processing requires, we perform semantic analysis of the RDF
graph and the original text in order to both augment and correct the RDF output.
OpenCalais recognizes entities and relationships based on their local context only;
we often need global- or document-level inferencing to determine other relationships
and entities.
We use the VIStology-developed inference engine, BaseVISor
(http://vistology.com/basevisor/basevisor.html), to modify and
augment the RDF produced by OpenCalais, and save the modified RDF. BaseVISor
is VIStology’s forward-chaining inference and rule engine that infers facts from an
RDF/OWL store based on an ontology (using OWL 2 RL) as well as user-specified
rules that can involve procedural attachments for things like computing the distance
between two latitude/longitude pairs. BaseVISor has been optimized to process
triples very efficiently.
This semantic processing by BaseVISor results in a number of augmentations to the
data. First, the OpenCalais RDF output lacks datatypes on elements, so these must be
supplied for integers, dates and other datatypes used in OpenCalais output. Second,
we use BaseVISor rules to correct systematic misidentifications that OpenCalais
makes. For example, OpenCalais always identifies one particular gang name as a
Person, not as a variant name for a specific gang. These revision rules are necessary
because end users cannot customize OpenCalais with a custom vocabulary at present.
Third, we employ BaseVISor rules to make rule-based inferences about the text in
order to supplement OpenCalais’s event representations. As noted above, while
OpenCalais identifies Arrest-type incidents in texts, it does not always identify the
who, what, where, and when attributes of these events, presumably because they cannot
be determined by the local context. We use BaseVISor rules to infer times and
locations for the surrounding event based on the entire text. For example, if no
location is specified for an event in the OpenCalais RDF output, to a first
approximation, we specify the closest instance of a City in the text as the location of
the Arrest. Similarly, if no date for an arrest is specified, then we take the date of the
report itself as the arrest date, and so on.
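The rules themselves are not reproduced here; the sketch below illustrates the same two kinds of revision in plain Python over an rdflib graph rather than in BaseVISor's rule language: re-typing a systematically misidentified gang alias, and backfilling a missing arrest location from the nearest City mention in the text. The namespaces, property names, and offset bookkeeping are illustrative assumptions.

    from rdflib import Namespace
    from rdflib.namespace import RDF

    OC   = Namespace("http://example.org/opencalais#")   # stand-in for the oc: namespace
    GANG = Namespace("http://example.org/gang#")          # stand-in for the gang ontology

    def retype_gang_alias(g, alias_uri, gang_uri):
        """If the extractor typed a known gang alias as a Person, record it instead
        as a variant name of the gang."""
        if (alias_uri, RDF.type, OC.Person) in g:
            g.remove((alias_uri, RDF.type, OC.Person))
            g.add((alias_uri, GANG.variantNameOf, gang_uri))

    def backfill_arrest_location(g, arrest_uri, arrest_offset, city_mentions):
        """If the arrest has no location, use the City mention closest to it in the
        text; city_mentions is a list of (city_uri, character_offset) pairs."""
        if (arrest_uri, OC.location, None) not in g and city_mentions:
            nearest_city, _ = min(city_mentions, key=lambda m: abs(m[1] - arrest_offset))
            g.add((arrest_uri, OC.location, nearest_city))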
We also use BaseVISor to insert RDF triples for instances of types of things not
identified by OpenCalais, such as the names of gangs, and to associate persons with
gangs based on the OpenCalais RDF. For example, if OpenCalais specifies that there
is a joining relationship, and the subject of the joining is a certain person, and the
object of the joining event is “the ABC prison gang”, then based on the presence of
the term “ABC” in the object, we assert an association between the person and the
ABC gang.
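A sketch of this association rule, again in plain Python rather than the rule engine's syntax; the event modelling, the property names, and the gang-name table are assumptions made for illustration.

    from rdflib import Namespace
    from rdflib.namespace import RDF

    OC   = Namespace("http://example.org/opencalais#")   # stand-in namespaces
    GANG = Namespace("http://example.org/gang#")

    KNOWN_GANGS = {"ABC": GANG["ABC"]}                     # illustrative gang-name table

    def associate_joiners_with_gangs(g):
        """For each joining event, match its textual object against known gang names
        and assert an association between the joining person and that gang."""
        for event in g.subjects(RDF.type, OC.Joining):
            person = g.value(event, OC.person)
            target = g.value(event, OC.organization)       # e.g. "the ABC prison gang"
            if person is None or target is None:
                continue
            for name, gang_uri in KNOWN_GANGS.items():
                if name in str(target):
                    g.add((person, GANG.associatedWith, gang_uri))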
BaseVISor is also used to make explicit, as triples, relations that are only implicit
in the data and the ontology. For instance, if the ontology says that “ABC” is a
Gang, and John is a member of the ABC gang, then John is a gang member. A triple encoding
this fact will be inferred and imported into the RDF store. All of the triples that can
be inferred by means of these semantic analysis rules and the combination of the RDF
output and the OWL ontology, using OWL 2 RL, are inserted into the global fact
base.
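BaseVISor computes this closure natively. As a rough stand-in, the owlrl Python package can expand an rdflib graph under OWL 2 RL semantics, which is enough to illustrate the gang-member example; the simplified modelling below (a single rdfs:domain axiom) is an assumption, not the ontology's actual axiomatization.

    from rdflib import Graph, URIRef
    from rdflib.namespace import RDF
    import owlrl

    DATA = """
    @prefix :     <http://example.org/gang#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    # Simplified modelling: anyone who is a member of some gang is a GangMember.
    :memberOfGang rdfs:domain :GangMember .

    :ABC  a :Gang .
    :John :memberOfGang :ABC .
    """

    g = Graph()
    g.parse(data=DATA, format="turtle")
    owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)

    GANG = "http://example.org/gang#"
    print((URIRef(GANG + "John"), RDF.type, URIRef(GANG + "GangMember")) in g)   # True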
Finally, based on the OpenCalais RDF graph, we make API calls to other data
sources in order to augment the RDF data store with the necessary data for querying.
Although OpenCalais sometimes provides resolved geolocations for spatial entities
like cities, it does not always do so. For instance, OpenCalais may identify “Santa
Cruz” as being an instance of rdf:type City, but it does not always specify that this
mention of a City actually refers to “Santa Cruz, California” with the corresponding
latitude and longitude. Because OpenCalais cannot be forced to make a guess for
every instance of City, we invoke the GeoNames.org API in order to determine the
latitude and longitude of the city, guided by the document’s source metadata from the feed.
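A sketch of the GeoNames lookup using its public search web service is shown below; the username, the country hint, and the way the hint would be derived from the feed metadata are assumptions.

    import requests

    GEONAMES_SEARCH = "http://api.geonames.org/searchJSON"

    def geocode_city(name, country_hint="US", username="demo"):
        """Return (lat, lng) for the best GeoNames match of a city name, or None."""
        params = {"q": name, "country": country_hint, "featureClass": "P",
                  "maxRows": 1, "username": username}
        matches = requests.get(GEONAMES_SEARCH, params=params, timeout=10).json().get("geonames", [])
        if not matches:
            return None
        return float(matches[0]["lat"]), float(matches[0]["lng"])

    print(geocode_city("Santa Cruz"))   # e.g. the coordinates of Santa Cruz, California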
After this, the data gathered and processed by the extraction component is
imported into a Sesame RDF store and queried via SPARQL in order to update
the model of the gang organization: its members, incident rates, event times and
locations, and so on. The RDF data that has been input into the data store is
periodically queried to provide semantic alerts, which are sent as email messages or
text messages. Additionally, SPARQL queries are used to create and update topical
pages in the Semantic MediaWiki reflecting our current knowledge of a gang.
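The wiki-update step can be sketched as a SPARQL query over the store whose results are rendered as Semantic MediaWiki markup, as below. The query, property names, and page template are illustrative, and pushing the generated text to the wiki (e.g., via the MediaWiki API) is not shown.

    from rdflib import Graph, Literal, Namespace

    GANG = Namespace("http://example.org/gang#")    # illustrative namespace

    MEMBER_QUERY = """
    PREFIX gang: <http://example.org/gang#>
    SELECT ?person ?city WHERE {
        ?person gang:associatedWith gang:ABC .
        OPTIONAL { ?person gang:operatesIn ?city }
    }
    """

    def render_gang_page(store):
        """Render a Semantic MediaWiki page body for the (illustrative) ABC gang."""
        lines = ["== Known members =="]
        for person, city in store.query(MEMBER_QUERY):
            name = str(person).rsplit("#", 1)[-1]
            note = f" ([[operatesIn::{city}]])" if city else ""
            lines.append(f"* [[hasMember::{name}]]{note}")
        return "\n".join(lines)

    g = Graph()
    g.add((GANG["John"], GANG["associatedWith"], GANG["ABC"]))
    g.add((GANG["John"], GANG["operatesIn"], Literal("Santa Cruz")))
    print(render_gang_page(g))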
6 Example and Discussion
In Figure 3, we contrast the approach outlined with more traditional keyword-
based approaches to alerting and event tracking. The blue column shows the number
of news stories containing a specified gang name, the name of the indicated city, and
“arrest” over a two-week period in June 2010. There were sixteen documents
corresponding to Santa Cruz, eight to Los Angeles, and one to Alexandria. Based on
document counts alone, then, one would suppose that there were far more arrests in
California than in Virginia during that period. However, the red column shows that
automated semantic analysis identifies three arrests in Santa Cruz and one each in Los
Angeles and Alexandria, VA. These are much closer to the actual figures (two in
Santa Cruz; zero in Los Angeles; three in Alexandria, VA).
Figure 3. Event Counts: Keywords vs. Semantic Analysis
The result of the semantic processing shows promise in that the total number of
arrests identified per city is much closer to the actual result than one would infer from
the document counts. Three arrestees out of four are correctly identified out of
eighteen news articles (precision = 75%), and four out of six total arrestees in the
corpus are identified (recall = 66%).
7 Information Evaluation
NATO STANAG (Standardization Agreement) 2022, “Intelligence Reports,” states that,
where possible, “an evaluation of each separate item of information included in an
intelligence report, and not merely the report as a whole” should be made. It presents
an alphanumeric rating of “confidence” in a piece of information (compare [9])
which combines an assessment of the reliability of the source of the information with
an assessment of the credibility of the piece of information “when examined in the light
of existing knowledge”.³ The alphabetic Reliability scale ranges from A (Completely
Reliable) to E (Unreliable) and F (Reliability Cannot Be Judged). A parallel numeric
information credibility scale ranges from 1 (Confirmed by Other Sources) to 5
(Improbable) and 6 (Credibility Cannot Be Judged).
As a first approximation, we have implemented some crude, initial rules for
reliability and credibility. For example, if a source is from Topix (a news source), we
mark its reliability B (Usually Reliable). We could potentially mark reports from
official government sources, such as FBI press releases, even higher. If a source is
from Twitter or Flickr, we mark its reliability F (Reliability Cannot Be Judged).
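These initial rules amount to a simple lookup keyed on the type of source that produced a report, as in the sketch below; the source-type labels, the default, and the higher rating shown for government press releases are assumptions.

    # Source-based reliability assignment (sketch). Labels are placeholders for
    # the source metadata SWAE attaches to each report.
    RELIABILITY_BY_SOURCE = {
        "topix_news": ("B", "Usually Reliable"),
        "gov_press":  ("A", "Completely Reliable"),   # "could potentially" be rated higher than B
        "twitter":    ("F", "Reliability Cannot Be Judged"),
        "flickr":     ("F", "Reliability Cannot Be Judged"),
    }

    def rate_reliability(source_type):
        """Return the (letter, label) reliability rating for a report's source type."""
        return RELIABILITY_BY_SOURCE.get(source_type, ("F", "Reliability Cannot Be Judged"))

    print(rate_reliability("topix_news"))   # ('B', 'Usually Reliable')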
³ North Atlantic Treaty Organization (NATO) STANAG (Standardization Agreement) 2022
(Edition 8) Annex. Essentially the same matrix is presented as doctrine in Appendix B,
“Source and Information Reliability Matrix,” of US Army FM 2-22.3, “Human Intelligence
Collector Operations” (2006), although STANAG 2022 is not explicitly cited.
For credibility, if two reports identify the arrest/trial/conviction/killing of the same
person, we mark each such report as 1 (Confirmed By Other Sources). STANAG
2022 does not prioritize coherence with the earliest reports; rather, it says that the
largest set of internally consistent reports on a subject is more likely to be true, unless
there is contrary evidence. It is a military truism that “the first report is always
wrong” [6], so a bias towards coherence with the first report on a subject should be
avoided. Further research is needed to determine the degree to which two reports must
be similar in order to count as independent confirmation of one another.
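A sketch of the cross-report confirmation rule follows: reports describing the same kind of incident involving the same person are marked 1 (Confirmed by Other Sources) when they come from more than one source. The report record layout and the default rating of 6 for unconfirmed reports are assumptions, and, as noted above, the similarity and independence criteria remain open questions.

    from collections import defaultdict

    def mark_confirmed(reports):
        """reports: dicts with 'id', 'source', 'incident_type', and 'person' keys.
        Returns {report_id: credibility}, where 1 = Confirmed by Other Sources and
        6 = Credibility Cannot Be Judged (assumed default when unconfirmed)."""
        by_incident = defaultdict(list)
        for r in reports:
            by_incident[(r["incident_type"], r["person"])].append(r)
        credibility = {}
        for group in by_incident.values():
            confirmed = len({r["source"] for r in group}) > 1   # two or more distinct sources
            for r in group:
                credibility[r["id"]] = 1 if confirmed else 6
        return credibility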
8 Conclusion and Future Work
We have described a proof-of-concept system for automatically creating and
updating a model of a group (here, a criminal gang) in the form of a semantic wiki.
We incorporate into this model a preliminary implementation of the STANAG 2022
metrics for source reliability and information credibility. Initial work on this system
raised interesting design decisions, which we have outlined here along with plans for
future work. Our work differs from other available systems in that it attempts to
create and maintain a usable model of a group and its activities automatically by
creating semantic wiki pages that represent the current state of knowledge of the
group. Significant changes in this model are sent as email or text alerts to concerned
parties. By normalizing references to entities, relations and events across documents,
the system provides a solution to the problem of data redundancy in reports. In
ongoing work, we plan to investigate the incorporation of social-network metrics of
centrality as proxies for estimating source reliability [7], and to incorporate social-
network measures of source independence into our credibility calculation.
Acknowledgments. This material is based upon work supported by the United
States Navy under SBIR Award No. N00014-10-M-0088.
References
1. W3C OWL Working Group. http://www.w3.org/2007/OWL/wiki/OWL_Working_Group
2. Gasevic, D., Djuric, D., Devedzic, V.: Model Driven Architecture and Ontology
Development. Springer (2006).
3. Wikipedia entry for Wiki. http://en.wikipedia.org/wiki/Wiki
4. Davies, J., Studer, R., Warren, P.: Semantic Web Technologies: Trends and Research in
Ontology-based Systems. Wiley, October 2007.
5. Bao, J.: The Unbearable Lightness of Wiking: A Study of SMW Usability. Presentation,
Spring 2010 SMWCon (Semantic MediaWiki Conference), MIT.
http://www.slideshare.net/baojie_iowa/2010-0522-smwcon
6. Sanchez, R.S. (LTG, Ret.): Military Reporters and Editors Luncheon Address, 12 Oct
2007. http://www.militaryreporters.org/sanchez_101207.html
7. Ulicny, B., Matheus, C., Kokar, M.: Metrics for Monitoring a Social-Political
Blogosphere: A Malaysian Case Study. IEEE Internet Computing, Special Issue on Social
Computing in the Blogosphere, March/April 2010.
8. Semantic MediaWiki. http://semantic-mediawiki.org
9. Schum, D., Tecuci, G., Boicu, M., Marcu, D.: Substance-Blind Classification of
Evidence for Intelligence Analysis. In: Proceedings of “Ontology for the Intelligence
Community,” George Mason University, Fairfax, Virginia, October 2009.