<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Detection of Similar Terrorist Events</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vittoria Cozza</string-name>
          <email>cozza@di.unipi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michelangelo Rubino</string-name>
          <email>mrubino@expertsystem.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Pisa</institution>
          ,
          <addr-line>56124 Pisa, IT</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Expert System</institution>
          ,
          <addr-line>41123 Modena, IT</addr-line>
        </aff>
      </contrib-group>
      <fpage>28</fpage>
      <lpage>33</lpage>
      <abstract>
        <p>Event counting is significant when it allows us to discover and represent implicit knowledge. A particular event rarely happens somewhere by mere chance; it is unlikely to be what we call an accidental event. For example, the number of violent attacks and terrorist acts can give a measure of the safety of a given country and can help us predict where and/or when similar events are likely to happen next. This work proposes an approach for detecting terrorist events sharing common details, available from open datasets, with the aim of merging their descriptions and counting them exactly. Events are aggregated according to a space-time-textual similarity function.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Generally speaking, a high number of events of a particular kind within a geographical area
gives a clue about what characterizes that place. The same logic applies to events
connected with terrorism: car bombs, suicide attacks and any other event identified as a
terrorist act allow us to assess whether a country is dangerous or not. The higher the
number of violent acts, the greater the risk related to that place. This is why we need a
methodology to count this kind of event: the number of terrorist acts serves as an
indicator of safety for a place.</p>
      <p>In the following, we propose a two-stage approach: first extracting useful
information from events and representing it as space-time-textual records, then clustering
these records according to a similarity function that combines space and time proximity
with keyword relevance.</p>
      <p>The first stage starts with the news contained in public datasets, where events
consist of a short description including place, date, act and casualties, with no comments or
personal opinions. Sometimes we can also find the group (typically an acronym)
claiming the act. The length of the description deserves attention when using these
public databases: more details generally mean a longer description, and a longer
description means a more important event. In turn, an important event is more likely to be
found in each dataset we use, and often to be found several times in the same dataset.
This is because of updates, especially for terrorist acts considered relevant: daily
updates can increase the size of the news. This is where the second stage is needed, whose
purpose is grouping the occurrences of the same event. In short, one event, one record.
Once we have single records, the similarity function can be exploited with different
settings, either by using the three dimensions or by combining them two by two. This
way, we obtain clusters of events sharing the same keywords, arranged on the basis of
space and/or time: e.g. the increasing presence of female terrorists in a particular area
starting from a particular event or date, the use of the AK-47 rifle in a region and so
forth.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Related works</title>
      <p>
        This work refers to existing approaches for information integration, e.g. entity
resolution or deduplication, that also aim at finding real-world entities occurring in different
forms in multiple data records. For a review, see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] the authors deal with deduplication
too. In particular, facing the problem of identifying redundant social network messages,
they are able to identify whether one message subsumes information from another one
(textual entailment) or whether both convey the same information (paraphrase).
      </p>
      <p>While these works mainly focus on textual similarity, in terrorist event
scenarios the space and time components can also be considered relevant information for
duplicate event detection.</p>
      <p>
        Several problems connected to scoring spatio-temporal data have already been
studied in the fields of spatial keyword queries, time-dependent text queries and sensor
networks. In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] the authors face the problem of efficiently processing spatial keyword
queries with AND semantics after evaluating them. Cong et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] proposed the use of a
hybrid index: an inverted file is associated with each R-tree node, so that both location
information and text can be used to prune the search space at query time. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] the
authors introduced a new index, named Spatial Inverted Index (S2I), to efficiently
process top-k spatial keyword queries. With regard to temporal ranking, the time
machine proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] allows retrieving the documents that match the given keywords and
existed at a specific time. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] the textual, temporal and spatial dimensions are combined
all together.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Data sources</title>
      <p>After the 11 September attacks, starting with GDT, many terrorism databases, such as
WITS, CTC and ISVG, have been published to make news collections publicly accessible
to scholars and specialists. Our approach has been applied to a few free sources, listed
as follows:</p>
      <list list-type="order">
        <list-item>
          <p>CTC (Combating Terrorism Center) SENTINEL<sup>3</sup>, in particular the short news
contained in the section Recent Highlights in Terrorist Activity of the monthly
journal Sentinel;</p>
        </list-item>
        <list-item>
          <p>ISVG (The Institute for the Study of Violent Groups)<sup>4</sup>, a research center providing
data about transnational terrorism through its Violent Extremism Knowledge Base
(VKB);</p>
        </list-item>
        <list-item>
          <p>NCTC (National Counter Terrorism Center), whose WITS (Worldwide Incidents
Tracking System)<sup>5</sup> has been discontinued since 2010.</p>
        </list-item>
      </list>
      <p>There are also providers whose services are available for a fee; the most popular is
the Terrorism and Insurgency Centre by Jane’s Information Group (JTIC). A common main
structure of the news can be recognized across these sources. In particular, it is always
possible to identify a short description in natural language (text), a location, usually
understood as the country (space), a date (time) and the publisher, that is the dataset
source name, as in Table 1.</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr>
              <th>id</th>
              <th>description</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>n1</td>
              <td>IRAQ - April 4, 2008 - On 4 April 2008, in As Sa’diyah, Diyala, Iraq, a suicide bomber detonated an improvised explosive device (IED) he was wearing near a funeral procession in Sed Himreen cemetery, killing 20 civilians and wounding 30 others. No group claimed responsibility. - WITS</td>
            </tr>
            <tr>
              <td>n2</td>
              <td>IRAQ - April 4, 2008 - April 4, 2008 (IRAQ): Iraq police officer funeral suicide attack kills nine people A SUICIDE bomber attacked a funeral of an Iraqi policeman in Hamrin in Diyala province on 4 April, Reuters reported. The blast killed nine people and wounded 30 others - JTIC</td>
            </tr>
            <tr>
              <td>n3</td>
              <td>PHILIPPINES - May 29, 2008 - Blast in southern Philippines leaves three people dead On 29 May, a bomb blast struck a building in front of the Edwin Andrews Airbase in Zamboanga City in the southern Philippines, killing three people and injuring 17 others. According to the local Filipino newspaper the- JTIC</td>
            </tr>
            <tr>
              <td>n4</td>
              <td>PHILIPPINES - May 29, 2008 - Philippines bomb kills three people SUSPECTED militants detonated a bomb targeting a building in front of the Edwin Andrews Airbase in Zamboanga City in the southern Philippines on 29 May, killing three people and injuring 17 others, the Philippine Inquirer reported. - JTIC</td>
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
        In ISVG and WITS, which provide data in a semi-structured format, the city of the
event is not reported either, only the country name. This is because it is not always
possible to extract this entity, mainly owing to misspelling, inaccuracy or even the lack
of this piece of information. Moreover, the date these sources provide is often not
correct: if an event happens late in the evening, the date is likely to be the date the
news was reported (that is, the day after), not the date the event happened. The same
piece of news may be identified with the first date in one source and with the second
date (the real time of the event) in another one. For a deeper overview, Daniel J.
Mabrey, Executive Director of the Institute for the Study of Violent Groups, has examined
and compared these incident databases in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], where he highlights the differences among them in organizing
terrorist information. In addition, terrorist event providers, besides these datasets,
typically have lists and taxonomies of groups, weapons and people connected to terrorism,
which can be used to create customized dictionaries of frequent and less frequent
domain words, as we explain later on.
      </p>
      <p><sup>3</sup> CTC: http://www.ctc.usma.edu/sentinel</p>
      <p><sup>4</sup> ISVG: http://vkb.isvg.org/Special:IsvgSearch</p>
      <p><sup>5</sup> WITS: https://wits.nctc.gov/FederalDiscoverWITS/index.do?N=0</p>
    </sec>
    <sec id="sec-4">
      <title>4 Methodology</title>
      <p>Given a terrorist event dataset as shown in Table 1, the goal is to identify the records
representing the same event and join them into a single record. Analysing only the
space dimension can yield inaccurate results, as can analysing only the time dimension.
On the other hand, simple keyword analysis is not enough for our goal. Consequently,
we need to use all three dimensions: space, time and text have to be intersected to get
one record per event.</p>
      <p>Technically, we preprocess event records with NLP tools to extract the relevant
information in a structured format when it is not already available: time, space and
keywords. In our knowledge base an event is a tuple (id, p, l, t, k), where id is a unique
event identifier, p is the publisher, l is the city location, t is the time when the news was
published and k is the list of keywords extracted from the text. For instance, given
semi-structured news as in Table 1, we extract structured data as in Table 2.</p>
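      <p>As a minimal sketch of the record described above, the tuple can be written as a small Python data class. The field names and types are illustrative assumptions, and the values for n2 are read by hand from the Table 1 description rather than taken from the authors' extraction.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet


@dataclass(frozen=True)
class Event:
    """One event record (id, p, l, t, k); field names are assumptions."""
    id: str                   # unique event identifier
    publisher: str            # p: dataset source name
    location: str             # l: city, or the country when no city is found
    time: date                # t: date the news was published
    keywords: FrozenSet[str]  # k: keywords extracted from the text


# Record n2 of Table 1, filled in by hand for illustration only.
n2 = Event(
    id="n2",
    publisher="JTIC",
    location="Hamrin",
    time=date(2008, 4, 4),
    keywords=frozenset({"suicide", "police", "officer", "blast"}),
)
```
]]></preformat>
      <p>Making the record immutable and storing the keywords as a set keeps later comparisons between records simple, since set operations apply directly.</p>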
      <p>The event description includes place, event and casualties, and seldom the group claiming
the act. From the description we first aim at extracting a more precise location, instead
of just the country name. This can be done with a brute-force approach, by simply
checking whether any capitalized word in the text corresponds to a region or city name in
the given country; in particular, we look up geographical places in a gazetteer such as
Geonames. Indeed, the Geonames web service provides a function to search for places by
name<sup>6</sup>. In Table 1 the first two records show Himreen and Hamrin, the same location,
though Himreen is a less known name in the local language for the town of Hamrin. In this
case, the most widespread name is used (Hamrin). Please note that when it is not possible
to extract the location l unambiguously from the text, we use the centroid of the country.</p>
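      <p>The brute-force lookup described above might be sketched as follows. The tiny in-memory gazetteer is a hypothetical stand-in for a real gazetteer such as Geonames, and the function name and the rule of returning nothing when the match is not unique are assumptions made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re
from typing import Dict, Optional

# Hypothetical mini-gazetteer: known place name -> most widespread name.
# A real system would query a gazetteer service instead of this dictionary.
GAZETTEER: Dict[str, Dict[str, str]] = {
    "IRAQ": {"Hamrin": "Hamrin", "Himreen": "Hamrin"},
    "PHILIPPINES": {"Zamboanga": "Zamboanga"},
}


def extract_location(text: str, country: str) -> Optional[str]:
    """Return the canonical place name if exactly one known place occurs."""
    names = GAZETTEER.get(country.upper(), {})
    # Candidate tokens are the words that start with an upper-case letter.
    candidates = re.findall(r"\b[A-Z][A-Za-z'-]*\b", text)
    found = {names[word] for word in candidates if word in names}
    # Ambiguous or missing matches are left to the caller (country centroid).
    return found.pop() if len(found) == 1 else None


n1_place = extract_location("a funeral procession in Sed Himreen cemetery", "IRAQ")
n2_place = extract_location("an Iraqi policeman in Hamrin in Diyala province", "IRAQ")
```
]]></preformat>
      <p>Both calls resolve to the same canonical name, which is what lets the two descriptions be compared on the space dimension. When the function returns nothing, the caller would fall back to the centroid of the country, as stated in the text.</p>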
      <p>Furthermore, we extract keywords that are neither stop words nor locations, and that
characterize the event description and correspond to the type of attack. The idea
is to use a dictionary of event types or the list of terrorist groups to find related
words. This is how we can extract bomb, police, officer, suicide, blast and militants
from the examples above.</p>
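      <p>A minimal sketch of this dictionary-based extraction is given below. The word lists are illustrative assumptions built from the keywords named in the text; a real dictionary would come from the providers' taxonomies of attack types, weapons and groups.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re
from typing import List, Set

# Illustrative word lists (assumptions, not the authors' dictionaries).
ATTACK_TERMS: Set[str] = {"bomb", "police", "officer", "suicide", "blast", "militants"}
STOP_WORDS: Set[str] = {"a", "an", "the", "in", "of", "on", "and"}


def extract_keywords(text: str, locations: Set[str]) -> List[str]:
    """Return dictionary terms found in the text, in order of first occurrence."""
    excluded = STOP_WORDS | {name.lower() for name in locations}
    keywords: List[str] = []
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in excluded or token in keywords:
            continue  # skip stop words, location names and repeats
        if token in ATTACK_TERMS:
            keywords.append(token)
    return keywords


n4_keywords = extract_keywords(
    "Philippines bomb kills three people SUSPECTED militants detonated a bomb",
    locations={"Philippines", "Zamboanga"},
)
```
]]></preformat>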
      <p>Even if it is not possible to know it beforehand, another interesting set of keywords
is represented by the facility types, which can be extracted starting from a consistent
dataset of old news from the same domain, e.g. the less frequent words found in our
sources over one year. In our example, these are words such as airbase,
funeral, procession, cemetery and bus, as well as market, hotel and school.</p>
      <p>
        Identifying the keywords follows text analysis and extraction and would require
a long discussion, which is outside the scope of this paper. For further
information, we refer the reader to the literature on this subject: NLP tools for Named
Entity Recognition (NER), e.g. the Stanford NLP group NER<sup>7</sup> or the Tanl NER [
        <xref ref-type="bibr" rid="ref10 ref9">9, 10</xref>
        ]. Indeed, the
goal of NER is finding all proper nouns in a text and classifying them into categories of
interest such as location, organization and so forth.
      </p>
      <p><sup>6</sup> Geonames search: http://www.geonames.org/export/geonames-search.html</p>
      <p><sup>7</sup> Stanford NER: http://nlp.stanford.edu/software/CRF-NER.shtml</p>
      <table-wrap>
        <label>Table 2</label>
        <table>
          <thead>
            <tr>
              <th>id</th>
              <th>publisher</th>
              <th>location</th>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td>n1</td>
              <td>WITS</td>
              <td>Diyala, IRAQ</td>
            </tr>
            <tr>
              <td>n2</td>
              <td>JTIC</td>
              <td />
            </tr>
            <tr>
              <td>n3</td>
              <td>JTIC</td>
              <td />
            </tr>
            <tr>
              <td>n4</td>
              <td>JTIC</td>
              <td />
            </tr>
          </tbody>
        </table>
      </table-wrap>
      <p>At the second stage, given the dataset of events, we cluster records that are close in
time and space and share a high percentage of keywords expressing analogous concepts. The
clustering works across the three publishers, searching for different descriptions of the
same event (e.g. the same terrorist act can have a shorter description in the first
publisher's database than in the others, or it can be described with the acronym IED in
one case and the compound homemade bomb in another), complementary descriptions
(e.g. a publisher can supplement a piece of news with details not contained anywhere else)
or updates.</p>
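      <p>The text does not fix a particular clustering algorithm. As one possible sketch, records can be grouped by linking every pair whose similarity reaches a threshold and taking the connected groups. The function name, the threshold value and the keyword-only similarity used in the example are all assumptions made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from typing import Callable, Dict, FrozenSet, List, Sequence, TypeVar

T = TypeVar("T")


def cluster(records: Sequence[T],
            similarity: Callable[[T, T], float],
            threshold: float) -> List[List[T]]:
    """Group records linked, directly or transitively, by similarity >= threshold."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        # Follow parent links to the group representative (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i], records[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups: Dict[int, List[T]] = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)
    return list(groups.values())


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Keyword overlap, used here as a stand-in similarity function."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


records = [
    frozenset({"suicide", "funeral"}),
    frozenset({"suicide", "funeral", "police"}),
    frozenset({"bomb", "airbase"}),
    frozenset({"bomb", "airbase", "militants"}),
]
clusters = cluster(records, jaccard, threshold=0.5)
```
]]></preformat>
      <p>In this toy input the first two keyword sets end up in one group and the last two in another. Comparing all pairs is quadratic in the number of records, which is acceptable only for small datasets.</p>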
      <p>
        The similarity function is the ST-IR ranking function from [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], as shown in the
following.
      </p>
      <p>Definition 1 (ST-IR rank). Given a reference event n1 and a comparative event n2, the
aggregation function returns a similarity score between n1 and n2, based on
spatio-temporal and textual proximity:
        <disp-formula>
          <label>(1)</label>
          <tex-math><![CDATA[\mathrm{rank}(n_1, n_2) = \alpha \, s(n_1, n_2) + \beta \, t(n_1, n_2) + (1 - \alpha - \beta) \, w(n_1, n_2)]]></tex-math>
        </disp-formula>
with 0 ≤ α ≤ 1 and 0 ≤ β ≤ 1.
s(n1, n2), t(n1, n2) and w(n1, n2), ranging between 0 and 1, represent three
distance scores for space, time and text respectively.</p>
      <p>To give an example, setting α = 0 means giving no relevance to the space rank and
considering events similar only with respect to time and text. As previously mentioned,
these parameters can be modified to create different views on the basis of one's own
needs.</p>
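      <p>As a minimal sketch, the aggregation can be read as a weighted sum of the three scores, with a weight for space (here called alpha), a weight for time (here called beta) and the remaining weight for text. The parameter names and the extra check that the two weights do not sum to more than 1 are assumptions, not taken from the definition.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def st_ir_rank(s: float, t: float, w: float, alpha: float, beta: float) -> float:
    """Weighted sum of the space (s), time (t) and text (w) scores, each in [0, 1]."""
    if not (0.0 <= alpha <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("alpha and beta must lie in [0, 1]")
    if alpha + beta > 1.0:
        # Extra assumption: keeps the text weight (1 - alpha - beta) non-negative.
        raise ValueError("alpha + beta must not exceed 1")
    return alpha * s + beta * t + (1.0 - alpha - beta) * w


# With alpha = 0 the space score is ignored: only time and text contribute.
score = st_ir_rank(s=0.9, t=0.5, w=0.25, alpha=0.0, beta=0.5)
```
]]></preformat>
      <p>Changing the space score in this call leaves the result unchanged, which matches the view in which events are compared only on time and text.</p>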
      <p>Different functions can be used to model the spatial, temporal and textual scores. For
space and time proximity, we use the Euclidean distance. The textual rank function is
either Jaccard or cosine similarity. It could also be interesting to consider other
keywords having the same meaning by computing synonyms, e.g. from Wordnet<sup>8</sup>.</p>
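      <p>The component scores might be sketched as follows. The Jaccard coefficient and the Euclidean distance are standard. The linear mapping from a distance to a score in [0, 1], the cut-off parameters and the treatment of coordinates as points in a plane are assumptions added for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from datetime import date
from typing import Iterable, Sequence


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Shared keywords divided by all distinct keywords."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0  # convention chosen here for two empty keyword lists
    return len(set_a & set_b) / len(set_a | set_b)


def proximity(distance: float, max_distance: float) -> float:
    """Map a distance to [0, 1]: 1 at distance 0, 0 at or beyond max_distance."""
    if max_distance <= 0:
        raise ValueError("max_distance must be positive")
    return max(0.0, 1.0 - distance / max_distance)


def space_score(p: Sequence[float], q: Sequence[float], max_distance: float) -> float:
    """Euclidean distance between two points, turned into a proximity score."""
    return proximity(math.dist(p, q), max_distance)


def time_score(d1: date, d2: date, max_days: float) -> float:
    """Absolute difference in days (one-dimensional Euclidean distance)."""
    return proximity(abs((d1 - d2).days), max_days)


text = jaccard(["bomb", "blast"], ["bomb", "suicide"])
space = space_score((0.0, 0.0), (3.0, 4.0), max_distance=10.0)
time = time_score(date(2008, 4, 4), date(2008, 4, 5), max_days=2)
```
]]></preformat>
      <p>A one-day cut-off window of this kind is one way to tolerate the reporting-date shift discussed for the data sources, where the same event may carry two consecutive dates.</p>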
      <p><sup>8</sup> Wordnet: http://wordnet.princeton.edu/</p>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusion and future works</title>
      <p>This work highlights the importance of clustering similar terrorist events according to
the three dimensions of space, time and keywords, though each one should have a different
weight in calculating the overall score, based on qualitative and quantitative analysis.</p>
      <p>By considering these three components, we can deduplicate events coming from
different databases and aggregate them along more dimensions. Duplicate event detection
has the advantage of counting events exactly, avoiding duplicates and/or incomplete
news.</p>
      <p>As a future task, this approach could be implemented, optimized and tested over
the KBs enumerated in Section 3.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Markowetz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Efficient Query Processing in Geographic Web Search Engines</article-title>
          .
          <source>In: proc. of ACM Sigmod</source>
          .
          (
          <year>2006</year>
          )
          <fpage>277</fpage>
          -
          <lpage>288</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Jensen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Wu</surname>
          </string-name>
          .
          <article-title>Efficient Retrieval of the Top-k Most Relevant Spatial Web Objects</article-title>
          .
          <source>In: Int. Conf. on Very Large Data Bases (VLDB)</source>
          .
          (
          <year>2009</year>
          )
          <fpage>337</fpage>
          -
          <lpage>348</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Rocha-Junior</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gkorgkas</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jonassen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nørvåg</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Efficient Processing of Top-k Spatial Keyword Queries</article-title>
          .
          <source>Proceedings of the International Symposium on Spatial and Temporal Databases</source>
          , Springer, LNCS
          <volume>6849</volume>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Berberich</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bedathur</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weikum</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>A time machine for text search</article-title>
          .
          <source>Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '07</source>
          (
          <year>2007</year>
          )
          <fpage>519</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Costa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cuzzocrea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Manco</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Ortale</surname>
          </string-name>
          .
          <article-title>Data De-duplication: A Review</article-title>
          .
          <source>Learning Structure and Schemas from Documents</source>
          , Volume
          <volume>375</volume>
          :
          <fpage>385</fpage>
          -
          <lpage>412</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Zanzotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pennacchiotti</surname>
          </string-name>
          and
          <string-name>
            <given-names>K</given-names>
            <surname>Tsioutsiouliklis</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Linguistic redundancy in Twitter</article-title>
          .
          <source>In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '11)</source>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Stroudsburg, PA, USA,
          <fpage>659</fpage>
          -
          <lpage>669</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Cozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Messina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Montesi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Arietta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Magnani</surname>
          </string-name>
          .
          <article-title>Spatio-temporal keyword queries in social networks</article-title>
          . In B. Catania, G. Guerrini, and J. Pokorný (Eds.):
          <source>ADBIS 2013</source>
          , volume
          <volume>8133</volume>
          of LNCS, Springer, pages
          <fpage>70</fpage>
          -
          <lpage>83</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Daniel J.</given-names>
            <surname>Mabrey</surname>
          </string-name>
          .
          <article-title>Analyzing Terrorist Activities through Operational &amp; Associational Coding of Events: Introducing the Institute for the Study of Violent Groups' Relational Database</article-title>
          .
          Institute for the Study of Violent Groups,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Attardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Dei</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.M.</given-names>
            <surname>Vecchi</surname>
          </string-name>
          .
          <article-title>The Tanl Named Entity Recognizer at Evalita 2009</article-title>
          .
          <source>In: Proc. of Workshop Evalita 2009</source>
          , ISBN 978-88-903581-1-1,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.</given-names>
            <surname>Attardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Berardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Dei</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Simi</surname>
          </string-name>
          .
          <article-title>The Tanl Tagger for Named Entity Recognition on Transcribed Broadcast News at Evalita 2011</article-title>
          . In B. Magnini et al. (Eds.),
          <source>Proc. of Evalita</source>
          <year>2011</year>
          , LNCS 7689, pp.
          <fpage>116</fpage>
          -
          <lpage>125</lpage>
          ,
          <year>2012</year>
          . ISBN 978-3-642-35827-2.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>