Detection of Similar Terrorist Events

                       Vittoria Cozza1 and Michelangelo Rubino2
           1
               Department of Computer Science, University of Pisa, 56124 Pisa, IT
                                 cozza@di.unipi.it
                            2
                               Expert System, 41123 Modena, IT
                              mrubino@expertsystem.it


       Abstract. Event counting is significant when it allows us to discover and repre-
       sent implicit knowledge. We realize that a particular event happens somewhere
       not just by mere chance, it is unlikely to be what we call as accidental event. E.g.
       the number of violent attacks and terrorist acts can give the measure of the safety
       for a given country and can help us to predict where and/or when similar events
       are likely to happen next time.
       This work proposes an approach for detecting terrorist events sharing common
       details, available from open datasets, with the aim of merging their descriptions
       and counting them exactly. Events are aggregated according to a space-time-
       textual similarity function.


1   Introduction

Generally speaking, a high number of particular events within a geographical area can
give you a clue about what that place is characterized by. We can use this logic likewise
with events connected with terrorism, car bombs, suicide attacks and any other event
identified as terrorist act allow us to realize whether a country is dangerous or not: the
higher the number of violent acts, the greater the risk related to that place. This is the
reason why we need a methodology to count this kind of events; the number of terrorist
acts is therefore the indicator of safety for a place.
     In the following, we propose a two-stage approach, that is first extracting useful in-
formation from events and representing it as space-time-textual records, then clustering
these records according to a similarity function by combining space and time proximity
and keyword relevance.
     The first stage starts with the news contained in public datasets, where events con-
sist of a short description including place, date, act and casualty, with no comments or
personal opinions. Sometimes we can also find the group (typically an acronym) claim-
ing the act. There is a point about the length of the description to be considered when
we use these public databases: more details mean generally a longer description and a
longer description means a more important event. In its turn, an important event is more
likely to be found in each dataset we use and often to be found more times in the same
dataset. This because of updates, especially when dealing with terrorist acts considered
as relevant: daily updates can increase the news size. Here indeed, we need the second
stage, whose purpose is grouping the occurrences of the same event. In short, one event,
one record. Once we get single records, the similarity function can be exploited with


                                               28
different settings either by using the three dimensions or by combining them two by
two. This way, we have clusters of events with the same keywords to be placed on the
basis of space and/or time: e.g. the increasing presence of female terrorists in a partic-
ular area starting from a particular event or date, the use of the AK-47 rifle in a region
and so forth.


2      Related works

This work refers to existing approaches for information integration, e.g. entity resolu-
tion or deduplication, that also aims at finding real-world entities occurring in different
forms in multiple data records. For a review [5]. In [6] authors deal with deduplication
too. In particular, facing the problem of identifying redundant social network messages,
they are able to identify whether one message subsumes information from another one
(textual entailment) or they both convey the same information (paraphrase).
     Despite these works that mainly focus on textual similarity, in terrorist event sce-
narios space and time components too can be considered as relevant information for
duplicate events detection.
     Several problems connected to scoring spatio-temporal data have already been stud-
ied in the fields of spatial keyword queries, time-dependent text queries and in sensor
networks. In [1] the authors face the problem of efficiently processing spatial keyword
queries with AND semantic after evaluating them. Cong et al. [2] proposed to use a
hybrid index: an inverted file is associated to each R-tree node so that both location
information and text can be used to prune the search space at query time. In [3] the
authors introduced a new index named as Spatial Inverted Index (S2I) to efficiently
process top-k spatial keyword queries. With regard to temporal ranking, the time ma-
chine proposed in [4], allows to retrieve documents according to keywords and those
existing at a specific time. In [7] textual, temporal and spatial dimensions are combined
all together.


3      Data sources

After the 11 September attacks, starting with GDT, many terrorism databases, such as
WITS, CTC, ISVG, have been published, to make news collections publicly accessible
for scholars and specialists. Our approach has been applied to a few free sources listed
as follows:

 1. CTC (Combating terrorism center) SENTINEL3 , in particular the short news con-
    tained below the section Recent Highlights in Terrorist Activity in the monthly jour-
    nal Sentinel;
 2. ISVG (The Institute for the Study of Violentv Groups)4 , a research center providing
    data about transnational terrorism through its Violent extremism Knowledge Base
    (VKB);
 3
     CTC: http://www.ctc.usma.edu/sentinel
 4
     ISVG: http://vkb.isvg.org/Special:IsvgSearch


                                               29
 3. NCTC (National Counter Terrorism Center), whose WITS (Worldwide Incidents
    Tracking System)5 has been stopped since 2010.

Yet, there are also providers whose services are available with fee, the most popular is
the Terrorism and Insurgency Centre by Jane’s Information Group (JTIC). From these
sources a common main structure of the news can be recognized, in particular it can
be always identified a short description in natural language (text), a location, usually
understood as the country (space), a date (time) and the publisher, that is the dataset
source name, as in Table 1.
    In ISVG and WITS, that provide data in a semistructured format, the city of the
event is not detected neither, but only the country name. This because it is not always
possible to extract this entity, due to a few reasons, mainly to misspelling or inaccuracy
or, even, lack of this piece of information. Moreover, it has to be considered that of-
ten the date these sources provide is not correct, this because if an event happens late
evening, the date is likely to be the date the news has been reported (that is the day after
the current one), not the date the event has happened. The same piece of news may be
identified with the first date in a source and with the second date (the real time of the
event) in another one. For a deeper overview, Daniel J. Mabrey, executive Director of


                                   Table 1: Similar events
id description
n1 IRAQ - April 4, 2008 -On 4 April 2008, in As Sa’diyah, Diyala, Iraq, a suicide bomber deto-
   nated an improvised explosive device (IED) he was wearing near a funeral procession in Sed
   Himreen cemetery, killing 20 civilians and wounding 30 others. No group claimed responsibil-
   ity.” - WITS
n2 IRAQ - April 4, 2008 -April 4, 2008 (IRAQ): Iraq police officer funeral suicide attack kills
   nine people A SUICIDE bomber attacked a funeral of an Iraqi policeman in Hamrin in Diyala
   province on 4 April, Reuters reported. The blast killed nine people and wounded 30 others” -
   JTIC
n3 PHILIPPINES - May 29, 2008 - Blast in southern Philippines leaves three people dead On 29
   May, a bomb blast struck a building in front of the Edwin Andrews Airbase in Zamboanga City
   in the southern Philippines, killing three people and injuring 17 others. According to the local
   Filipino newspaper the- JTIC
n4 PHILIPPINES - May 29, 2008 - Philippines bomb kills three people SUSPECTED militants
   detonated a bomb targeting a building in front of the Edwin Andrews Airbase in Zamboanga
   City in the southern Philippines on 29 May, killing three people and injuring 17 others, the
   Philippine Inquirer reported. - JTIC


the Institute for the Study of Violent Groups, has examined and compared these inci-
dents databases in [8], where he highlights the differences among them in organizing
terrorist information. In addition, terrorist events providers, besides these datasets, typ-
ically have lists and taxonomies about groups, weapons, people connected to terrorism
 5
     WITS: https://wits.nctc.gov/FederalDiscoverWITS/index.do?N=0


                                              30
and that can be used to create customized dictionaries of frequent and less-frequent
domain words, as we explain later on.


4      Methodology
Given a terrorist event dataset as shown in Table 1, the goal is identifying the records
representing the same event and joining them into a single record. Analysing just the
space dimension can return inaccuracy, as well as analysing just the time dimension.
On the other hand, the simple keyword analysis is not enough to our goal, consequently
we need to use the three dimensions: space, time and text have to be intersected to get
one record for one event.
     Technically, we preprocess event records with NLP tools to extract relevant infor-
mation in structured format when not already available: time, space, keywords. In our
knowledge base the event is an ennuple (id; p; l; t; k) where id is a unique event identi-
fier, p is the publisher, l is the city location, t the time when the news was published and
k the list of keywords extracted from text. For instance, given semi-structured news as
in Table 1 we extract structured data as in Table 2.
     The event description includes place, event and casualty, seldom the group blaming
the act. From the description we first aim at extracting a more precise location, instead
of getting just the country name. This can be done as a brute-force approach by simple
searching whether any upper case word in the text correspond to a region or city name in
the given country, in particular we extract geographical places i.e. by Geonames. Indeed
geonames web service provides a function to search for places by name6 . In our table
the first two records show Himreen and Hamrin, the same location, though Himreen is
a less known name in the local language for the town of Hamrin. In this case, the most
widespread name is used (Hamrin). Please notice that when it is not possible to extract
unambiguously the location l from text, we use the centroid nation location.
     Furthermore, we extract keywords, neither identified as stop words, nor as locations,
characterizing the event description and corresponding to the type of attack. The idea
is to use a dictionary with the type of event or the terrorist group list for finding related
words. This is the way how we can extract bomb, police, officer, suicide, blast, militants
from the examples above.
     Even if it is not possible to know it beforehand, another interesting set of keywords
is represented by the facility types, which can be extracted starting from a consistent
dataset of old news of the same domain, e.g. the less frequent words we can find in our
sources over one year. If we consider the example, we refer to words such as airbase,
funeral, procession, cemetery, bus, as well as market, hotel, school
     Identifying the keywords follows the text analysis and extraction and would require
a long discussion, though this is not included in this paper’s goals. For further informa-
tion we may suggest to refer to the literature about this subject: NLP tools for Entity
Recognition (NERs), e.g. Stanford NLP group NER7 or tanl NER [9, 10], indeed the
NER goal is finding all proper nouns in a text and classifying them into categories of
interest as e.g. location, organization and so forth.
 6
     Geonames search: http://www.geonames.org/export/geonames-search.html
 7
     Standford NER: http://nlp.stanford.edu/software/CRF-NER.shtml


                                             31
                               Table 2: Event database snapshot
     id publisher location                    time      keywords
     n1 WITS Diyala, IRAQ                 April 4, 2008 suicide bomber IED funeral
                                                        procession cemetery
     n2   JTIC    Hamrin, Diyala, IRAQ    April 4, 2008 police officer funeral suicide
                                                        attack bomber blast
     n3   JTIC    Zamboanga City, PHILIP- May 29, 2008 blast bomb building airbase
                  PINES
     n4   JTIC    Zamboanga City, PHILIP- May 29, 2008 bomb militants building air-
                  PINES                                 base


    At the second stage, given the dataset of event, we cluster data when closer for time
and space and with high percentage of keywords expressing analogous concepts. The
clustering works on the three publishers searching for different descriptions of the same
event (E.g. the same terrorist act can be shorter in the first publisher database compared
to the others or it can be characterized with the acronym IED in the first case and
the compound word homemade bomb in the second one), complementary descriptions
(E.g. a publisher can supplement a piece of news with details not contained anywhere)
or updates as well.
    The similarity function is the ST-IR ranking function from [7], as shown in the
following.

Definition 1 (ST-IR rank). Given a reference event n1 and a comparative event n2, the
aggregation function τ returns a similarity score between n1 and n2, based on spatio-
temporal and textual proximity:

      τ (n1, n2) = α × δs (n1, n2) + β × δt (n1, n2) + (1 − α − β) × δw (n1, n2)      (1)

                             with 0 ≤ α ≤ 1 and 0 ≤ β ≤ 1 − α.

δs (n1, n2) and δt (n1, n2) and δw (n1, n2), ranging between 0 and 1, represent three
distance scores respectively for space, time and text.
    To give an example, setting α = 0 means not to give relevance to space rank and
considering events similar only compared to time and text. As previously mentioned,
these parameters can be modified to create different views on the basis of ones own
needs.
    Different functions can be used to model the spatial, temporal and textual score. For
space and time proximity, we use Euclidean distance. The textual rank function is any
of Jaccard or Cosine. It could be also interesting to consider other keywords having the
same meaning by computing synonyms e.g. from Wordnet8 .

 8
     Wordnet: http://wordnet.princeton.edu/


                                              32
5   Conclusion and future works

This work highlights the importance of clustering similar terrorist events according to
the three dimensions space, time and keywords, though each one should have a different
weight in calcuating the overall score, based on qualitative and quantitative analysis.
    By considering these three components, we can deduplicate events coming from dif-
ferent databases and aggregate them along more dimensions. Duplicate events detection
has the advantage of counting exactly the events, avoiding duplicates and/or incomplete
news.
    As a future task, this approach could be implemented, optimized and tested over
KBs enumerated in section 3.


References
 [1] Chen, Y.y., Markowetz, A.: Efficient Query Processing in Geographic Web Search Engines.
     In: proc. of ACM Sigmod. (2006) 277–288
 [2] G. Cong, C. S. Jensen, and D. Wu. Efficient Retrieval of the Top-k Most Relevant Spatial
     Web Objects. In: Int. Conf. on Very Large Data Bases (VLDB). (2009) 337–348
 [3] Rocha-junior, B., Gkorgkas, O., Jonassen, S., Nø rv, K.: Efficient Processing of Top-k
     Spatial Keyword Queries. Proceedings of the International Symposium on Spatial and
     Temporal Databases, Springer, LNCS 6849 (2011)
 [4] Berberich, K., Bedathur, S., Neumann, T., Weikum, G.: A time machine for text search.
     Proceedings of the 30th annual international ACM SIGIR conference on Research and
     development in information retrieval - SIGIR ’07 (2007) 519
 [5] G. Costa, A. Cuzzocrea, G. Manco and R. Ortale. Data De-duplication: A Review. Learning
     Structure and Schemas from Documents, Volume 375:385–412, 2011.
 [6] F. M. Zanzotto, M. Pennacchiotti and K Tsioutsiouliklis. 2011. Linguistic redundancy in
     Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language
     Processing (EMNLP ’11). Association for Computational Linguistics, Stroudsburg, PA,
     USA, 659-669.
 [7] V. Cozza, A. Messina, D. Montesi, L. Arietta, and M. Magnani. Spatio-temporal key-
     word queries in social networks. In Springer, editor, B. Catania, G. Guerrini, and J.
     Pokorn(Eds.): ADBIS 2013, volume 8133 of LNCS, pages 70–83, 2013.
 [8] Daniel. J. Mabrey Analyzing Terrorist Activities through Operational & Associational
     Coding of Events: Introducing the Institute for the Study of Violent Groups’ Relational
     Database. Copyright 2010 - Institute for the Study of Violent Groups - All Rights Reserved
 [9] G. Attardi, S. Dei Rossi, F. Dell’Orletta, E.M. Vecchi. The Tanl Named Entity Recognizer
     at Evalita 2009. In: Proc. of Workshop Evalita 2009, ISBN 978-88-903581-1-1, 2009.
[10] G. Attardi, G. Berardi, S. Dei Rossi, M. Simi. The Tanl Tagger for Named Entity Recogni-
     tion on Transcribed Broadcast News at Evalita 2011. In B. Magnini et al. (Eds.), Proc. of
     Evalita 2011, LNCS 7689, pp. 116-125, 2012. ISBN 978-3-642-35827-2.


                                             33