<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.1007/978-3-642-19668-3_8</article-id>
      <title-group>
        <article-title>Online News Event Extraction for Crime Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Federica Rollo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Po</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Bonisoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>"Enzo Ferrari" Engineering Department, University of Modena and Reggio Emilia</institution>
          ,
          <addr-line>Via Vivarelli 10, Modena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>1</volume>
      <fpage>19</fpage>
      <lpage>22</lpage>
      <abstract>
        <p>Event Extraction is a complex and interesting topic in Information Extraction that includes methods for identifying an event's type, participants, location, and date from free text or web data. The results of event extraction systems can be used in several fields, such as online monitoring systems or decision support tools. In this paper, we introduce a framework that combines several techniques (lexical, semantic, machine learning, neural networks) to extract events from Italian news articles for crime analysis purposes. Furthermore, we focus on representing the extracted events in a Knowledge Graph. An evaluation on crimes in the province of Modena is reported.</p>
      </abstract>
      <kwd-group>
        <kwd>crime analysis</kwd>
        <kwd>NLP</kwd>
        <kwd>word embeddings</kwd>
        <kwd>question answering</kwd>
        <kwd>localization</kwd>
        <kwd>deduplication</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        the lack of up-to-date crime information [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6 ref7">3, 4, 5, 6, 7</xref>
        ]. Detailed information about crime
events can be extracted through Natural Language Processing (NLP) techniques applied to the
text of news articles. Newspapers provide reliable, localized, and timely information (the time
delay between the occurrence of the event and the publication of the news does not exceed
24 to 48 hours). The main drawback is that newspapers do not collect and publish all the facts
related to crimes, but only the ones that arouse the readers’ interest. Therefore, a share of
police reports is never turned into news articles and is lost.
      </p>
      <p>
        The aim of this paper is to describe a framework that extracts crime data from news articles,
enriches them with semantic information and provides useful visualizations. The strategy employs
several techniques and extends a previous work [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]: crime categorization, named entity
extraction, 5W+1H extraction, linked data mapping, geo-localization, time expression normalization,
entity linking and duplicate detection. The novelty of such a framework is the integration of
multiple techniques, previously used in different contexts to solve various sub-problems, into
a common framework for crime analysis. Moreover, the framework transforms the texts contained
in news articles into a Crime Knowledge Graph that accurately describes and links the crimes.
The framework has been tested successfully on news articles related to the city of Modena.
However, it can be adapted to manage data of other cities or areas.
      </p>
      <p>The paper is organized as follows: Section 2 presents the pipeline of the framework,
while Section 3 is devoted to the description of the use case in the province of
Modena. Finally, Section 4 draws some conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Crime Analysis framework</title>
      <p>The pipeline of the framework, which extracts semantic information related to an event starting
from news articles published on the web and alerts shared on social media, consists of 8 phases.
The phases are executed mainly in sequence for each news article (although some
phases can be run in parallel); in any case, different news articles can be
processed in parallel. The entire process is executed periodically to extract the latest published
news articles, analyze them and add information to the knowledge base (KB). The frequency of
execution depends on how close to real time the KB must be kept and on how often the selected
online newspapers publish news articles. Figure 1 illustrates the phases, in bold, and some
techniques and tools, in italic:</p>
      <p>1. Data extraction is performed by harvesting online newspapers and social media (web
scraping). The content of each news article is labeled, structured and semantically
annotated [9, 10]. Some web content may already expose a predefined structure, e.g., HTML
pages encoded with the Document Object Model (DOM), and some libraries allow
accessing the data encapsulated into HTML tags;<sup>1</sup> if this is not the case, other methods can be
used, such as RSS feeds or APIs.</p>
      <p>2. The categorization of the event is crucial to map a news article to a type of event
(business, sports, crime, politics, arts, culture, etc.). Given some pre-categorized news articles,
i.e., annotated training data, machine learning algorithms can be applied to uncategorized
news articles to assign them a type of event [11, 12]. Word embeddings can be exploited
to extract vector representations of the news articles; classifiers can then take
such representations as input to assign a category to each news article. Moreover, active
learning can be used to enhance the quality of categorization by retraining the classification
model on the original dataset enriched with high-confidence categorized news articles.</p>
      <p>Other approaches can exploit topic detection algorithms [13, 14].</p>
      <p>3. The identification and extraction of the 5W+1H (What, When, Where, Who, Why, How)
might be performed by employing Event Extraction models or through the Question
Answering task using BERT (Bidirectional Encoder Representations from Transformers),
adopting different questions according to the type of event [15, 16]. The 5W+1H are
the questions that a reporter must answer through the reporting. Therefore, these are
the essential elements of any news and also contribute to improving the value of news and
newsworthiness in journalism.</p>
      <p>4. By analyzing the news article’s body, temporal expressions can be identified (for example,
phrases like “two days ago” or “this morning”) and then normalized into a date format. This
operation allows identifying the exact date of the event, taking into account the date of
publication.</p>
      <p>5. Named Entity Recognition (NER) is applied to the text of the news articles to identify
references to persons, organizations, places, and temporal expressions, and can be
executed in parallel with the second phase. Its results can intersect the output of the
5W+1H phase.</p>
      <p>6. With Entity Linking, the entities identified in phase 5, such as persons, organizations,
and locations, are linked to resources (URIs) available in Linked Datasets. For example,
DBpedia Spotlight [17] can be used to link to resources of DBpedia and Linked Geo Data.
Besides, an Italian version of Blink<sup>2</sup> [18] can be used to link entities to Wikipedia or to
populate a new KB with entities not linked to external resources.</p>
      <p>7. The geographical localization exploits the entities that have been identified as locations
in phase 5 or as answers to “where” in phase 3 and processes them to be geo-referenced.</p>
      <sec id="sec-2-1">
        <title>1 An example is the Java HTML Parser named jsoup.</title>
        <p>2 https://github.com/rpo19/BLINK</p>
        <p>In case a location is not specified in the news article, organizations (identified in phase 5)
can also be exploited to geolocate the event.</p>
        <p>8. The identification of duplicates or storylines aims to find the same event described in
several news articles; this might also occur within the same newspaper, where updates
about one event are published over time. To avoid too many comparisons among news articles, it
is possible to identify candidate duplicates first and apply text similarity analysis
only to these candidates. In the end, the information from the duplicates can be merged.</p>
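        <p>The two-step duplicate detection of phase 8 (candidate selection followed by text similarity) can be sketched as follows. This is only a minimal illustration under stated assumptions: the Article record, the blocking rule (same crime type, publication dates at most a few days apart), the Jaccard similarity on word sets and the threshold values are all hypothetical choices made for the example, not the measures used by the framework.</p>
        <preformat>
```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class Article:
    # Hypothetical minimal record; a real news article carries many more fields.
    article_id: str
    crime_type: str
    published: date
    text: str


def tokens(text):
    """Lowercased word set with surrounding punctuation stripped."""
    return {w.strip(".,;:!?\"'()").lower() for w in text.split()} - {""}


def jaccard(a, b):
    """Jaccard similarity between the word sets of two texts (assumed measure)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta.intersection(tb)) / len(union) if union else 0.0


def candidate_pairs(articles, max_days=3):
    """Blocking step: same crime type and publication dates at most max_days apart."""
    for a, b in combinations(articles, 2):
        close_in_time = max_days >= abs((a.published - b.published).days)
        if a.crime_type == b.crime_type and close_in_time:
            yield a, b


def find_duplicates(articles, threshold=0.5, max_days=3):
    """Return id pairs of candidates whose text similarity reaches the threshold."""
    return [
        (a.article_id, b.article_id)
        for a, b in candidate_pairs(articles, max_days)
        if jaccard(a.text, b.text) >= threshold
    ]
```
        </preformat>
        <p>Restricting the similarity computation to the candidate pairs keeps the number of text comparisons well below the quadratic number of all article pairs, which is the motivation given above for selecting candidates first.</p>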
        <p>The use of semantic technologies is a key point in the presented approach for detecting events
from news articles and enriching them with information automatically extracted from the text.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Crime Knowledge Graph for Modena</title>
      <p>The Crime Analysis framework has been applied to a collection of news articles related to the
crimes that occurred in the province of Modena. We selected two major newspapers that publish
on average 850 news articles per year related to crimes in the Modena province and cover
95% of the total news articles published in Modena newspapers. The framework collected 17,500
reports from June 2011 to December 2021 (approximately 10 years) and is currently running
to analyze the news articles published every day by the two local newspapers. Of the 17,500 reports, the
framework was able to geolocalize almost 100% of the crime events and normalize the time
expressions in 83% of the news articles. The results allow performing crime mapping
studies and identifying crime hot spots in near real time: visualizations of these results
are shown through the “Modena Crime” web application.<sup>3</sup></p>
      <p>Figure 2 shows an example of Knowledge Graph generation, geo-localization and duplicate
detection for an Italian news article reporting a theft. The text shown is an English
translation of an article taken from the “Gazzetta di Modena” newspaper.<sup>4</sup>
The 5W+1H are extracted from the text and reported in the event-centric Knowledge Graph. The
central node (the one colored in red in Figure 2) identifies the event, while the other nodes report
information related to the 5W+1H. The time reference (“last Thursday”) and the publication
date (“12 March 2022”) are used to identify the date of the event, while the entities categorized
as locations are passed to OSM Nominatim to find the GPS coordinates where the theft
occurred. Then, the coordinates are used to represent the event on the map. Thanks to the
duplicate detection algorithm, two news articles are identified as duplicates of the examined
news article. They are follow-up articles, since they report updates on the theft. The publication
date is used to build the storyline of the event.</p>
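      <p>The resolution of a relative time reference against the publication date, as in the “last Thursday” example above, can be sketched as follows. The function name, the handful of English expressions it covers and the number-word table are assumptions made purely for illustration; the framework itself processes Italian text and its normalization rules are not described here.</p>
      <preformat>
```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]
# Hypothetical, deliberately tiny vocabulary of spelled-out numbers.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def resolve_relative_date(expression, published):
    """Map a relative time expression to a date, given the publication date.

    Returns None when the expression is not covered by this sketch.
    """
    words = expression.strip().lower().split()
    expr = " ".join(words)
    if expr in ("today", "this morning"):
        return published
    if expr == "yesterday":
        return published - timedelta(days=1)
    if len(words) == 3 and words[1:] == ["days", "ago"]:
        count = int(words[0]) if words[0].isdigit() else NUMBER_WORDS.get(words[0])
        if count is not None:
            return published - timedelta(days=count)
    if len(words) == 2 and words[0] == "last" and words[1] in WEEKDAYS:
        # Most recent such weekday strictly before the publication date.
        delta = (published.weekday() - WEEKDAYS.index(words[1])) % 7
        return published - timedelta(days=delta or 7)
    return None
```
      </preformat>
      <p>With the publication date of the example, resolve_relative_date("last Thursday", date(2022, 3, 12)) yields 10 March 2022, because 12 March 2022 was a Saturday.</p>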
      <p>We created a Crime Knowledge Graph for 1246 thefts that occurred in Modena in 2020 using
the Neo4j tool. An analysis of the interconnections between the crimes has been conducted
using centrality algorithms that determine, on the basis of the graph topology, the importance of
the individual nodes, and community detection algorithms to distinguish groups of nodes within
the overall graph. Each crime event can share some connected nodes with other events, such as
the place where the theft happened or the stolen objects, etc. We added direct relationships
among the crime nodes to represent that they share some connected nodes. Using PageRank,
we ranked the crime nodes based on their importance in the graph.</p>
      <sec id="sec-3-1">
        <title>3 https://dbgroup.ing.unimo.it/modenacrime 4 https://shorturl.at/vKSX2</title>
        <p>The higher the PageRank value of a node, the more connections the event node has with the other event nodes. For
example, thefts in which gold and valuables are stolen occur more frequently, and therefore
events reporting such stolen items are strongly connected. To detect the communities, we used
the label propagation algorithm on 5 different subgraphs obtained by examining the 5W+1H
relationships separately. For the Where subgraph, the result highlighted communities of nodes
sharing several locations, i.e., WHERE nodes. These first experiments provide some insights into
how Crime Analysis can benefit from graph-based methods.</p>
        <p><bold>3.1. Impact and scalability</bold></p>
        <p>To evaluate the impact of the proposed framework, the number of crimes collected by the
framework and the number of crimes published in the official report of ISTAT (i.e., the crimes
reported to the police) have been compared. The report related to the period from 2016 to
2020<sup>5</sup> has been taken into account. The information is only quantitative: the types of crime
are reported per province, and no information is provided about where and when, during the year, the
crime happened. To compare the two datasets, only the
crime categories in common have been taken into account. Unfortunately, a location-based
comparison is not possible because ISTAT provides a single report for the entire province.
With the total number of crimes in the city of Modena of 9590 from 2016 to 2020, the built KB
covers around 10% of the crimes reported by ISTAT. A plausible explanation for this low coverage is
that not all the criminal events recorded by ISTAT, and therefore in the
police reports, are of high impact and public interest. Therefore, not all of them are reported in
local news articles. Thefts are the most frequent crimes in both datasets in each year from 2016 to
2020. Figure 3 shows the total number of the top three types of crimes recorded in the report
of ISTAT, compared to the number collected by the framework. As can be seen, the lowest
coverage is for scams and frauds (the percentage is between 2% and 5% each year),
while the highest is for robberies (up to 50% in 2020).</p>
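        <p>The coverage figures discussed above are a ratio between the number of crimes collected by the framework and the number reported by ISTAT, restricted to the categories present in both datasets. A minimal sketch of this computation is given below; the function name and the dictionary-based input format are assumptions made for illustration, and no actual counts from either dataset are reproduced.</p>
        <preformat>
```python
def coverage_by_category(collected, official):
    """Percentage of officially reported crimes also found in the KB.

    Both arguments map a crime category to a yearly count. Only the
    categories present in both datasets are compared.
    """
    shared = sorted(set(collected).intersection(official))
    return {
        category: round(100.0 * collected[category] / official[category], 1)
        for category in shared
        if official[category] > 0
    }
```
        </preformat>
        <p>Categories that appear in only one of the two sources are dropped, mirroring the choice of comparing only the crime categories in common.</p>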
        <p>Even though the approach has been applied to a medium-sized area, the results highlight its potential. In
Italy, it is not possible to collect real-time crime information from official sources, since official
criminal statistics are reported annually with a delay of 6 months. The proposed approach can
be applied everywhere, including small or medium cities and areas, since there will always be one or
more newspapers that report the main crimes happening in that place.</p>
        <p>A first scalability test has been executed to ingest all the news articles related to crimes that
happened in the entire Emilia-Romagna region. Nine other newspapers, which publish news related
to the 9 provinces of the Emilia-Romagna region, were selected. All the available news articles
from 2011 to date that refer to 11 crime types have been collected. The total number of news
articles is 35,000 (on average 3,900 news articles for each province). The crime ingestion can
be run in parallel for different newspapers and different crime types. Therefore, 99 ingestion
processes have been executed in parallel to extract, analyze and store the data of the region. The
total loading time, which depends on the loading time of the province with the highest number
of news articles from 2011, is 3 hours for 35,000 news articles.<sup>6</sup></p>
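        <p>The parallel ingestion described above, with one task per combination of newspaper and crime type (9 x 11 = 99), can be sketched as follows. The ingest function is a hypothetical placeholder for the whole pipeline, a thread pool is used here instead of separate processes only to keep the example short, and the default of 12 workers is an assumption.</p>
        <preformat>
```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def ingest(newspaper, crime_type):
    """Placeholder for one ingestion task (scrape, analyze, store)."""
    # A real task would run the pipeline phases for this pair here.
    return (newspaper, crime_type, "done")


def run_ingestion(newspapers, crime_types, max_workers=12):
    """Run one ingestion task per (newspaper, crime type) pair in parallel."""
    jobs = list(product(newspapers, crime_types))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns the results in the same order as the jobs.
        return list(pool.map(lambda job: ingest(*job), jobs))
```
        </preformat>
        <p>Since the tasks are independent, the overall running time is bounded by the slowest group of tasks, which matches the observation that the total loading time depends on the province with the highest number of news articles.</p>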
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion and Future Work</title>
      <p>The framework presented in this paper is able to extract crime data from news articles, enrich
them with semantic information and provide a Knowledge Graph that can be exploited for
further analysis. It combines multiple techniques for solving various sub-problems: extracting
crime events from news articles, geo-locating them, linking entities to Linked Data resources,
and detecting duplicates. The framework has been successfully employed in the province of
Modena and has allowed collecting a sizeable dataset of more than 17,500 news articles about
13 types of crimes. A comparison with the official crime reports provided by ISTAT reveals that
this approach has allowed collecting about 10% of the crime events. This can be considered a
satisfactory result, since we are aware that news articles do not cover all the crimes that happen
in a city. The approach is domain-independent; it can be applied to any kind of news article,
not only crime news, and can also be adapted to other languages.</p>
      <p>5 http://dati.istat.it/Index.aspx?QueryId=25097&amp;lang=en</p>
      <p>6 The test has been performed on a Microsoft Windows 10 Pro machine, 16GB RAM, Processor Intel(R) Core(TM) i7-8750H
CPU @ 2.20GHz, 2208 MHz, 6 Cores, 12 Logical Processors.</p>
      <p>In future work, we will define a crime ontology to describe the crime
events. In addition, we will explore the capabilities of Neo4j in more depth to better analyze the Crime Knowledge Graph.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work is partially supported by the project “Deep Learning for Urban Event Extraction from
News and Social media streams” funded by the Engineering Department “Enzo Ferrari” of the
University of Modena and Reggio Emilia.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ristvej</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lacinák</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ondrejka</surname>
          </string-name>
          ,
          <article-title>On smart city and safe city concepts</article-title>
          ,
          <source>Mob. Networks Appl</source>
          .
          <volume>25</volume>
          (
          <year>2020</year>
          )
          <fpage>836</fpage>
          -
          <lpage>845</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/s11036-020-01524-4</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Oatley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ewart</surname>
          </string-name>
          ,
          <article-title>Data mining and crime analysis</article-title>
          ,
          <source>Wiley Interdiscip. Rev. Data Min. Knowl. Discov</source>
          .
          <volume>1</volume>
          (
          <year>2011</year>
          )
          <fpage>147</fpage>
          -
          <lpage>153</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1002/widm.6</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>S. K</surname>
          </string-name>
          , P. S. Thilagam,
          <article-title>Crime base: Towards building a knowledge base for crime entities and their relationships from online news papers</article-title>
          ,
          <source>Information Processing and Management</source>
          <volume>56</volume>
          (
          <year>2019</year>
          ). doi:
          <pub-id pub-id-type="doi">10.1016/j.ipm.2019.102059</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Po</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rollo</surname>
          </string-name>
          ,
          <article-title>Building an urban theft map by analyzing newspaper crime reports</article-title>
          ,
          <source>in: 13th International Workshop on Semantic and Social Media Adaptation and Personalization</source>
          , SMAP Zaragoza, Spain,
          <year>2018</year>
          , pp.
          <fpage>13</fpage>
          -
          <lpage>18</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/SMAP.2018.8501866</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Naskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Saha</surname>
          </string-name>
          , L. Dey,
          <article-title>Crimeprofiler: Crime information extraction and visualization from news media</article-title>
          , in:
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Sheth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ngonga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Slezak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Franczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Alt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tao</surname>
          </string-name>
          , R. Unland (Eds.),
          <source>Proceedings of the International Conference on Web Intelligence</source>
          , WI '17,
          Association for Computing Machinery, New York, NY, USA,
          <year>2017</year>
          , pp.
          <fpage>541</fpage>
          -
          <lpage>549</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1145/3106426.3106476</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P. R.</given-names>
            <surname>Boppuru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kenchappa</surname>
          </string-name>
          ,
          <article-title>Geo-spatial crime analysis using newsfeed data in indian context</article-title>
          ,
          <source>Int. J. Web Based Learn. Teach. Technol</source>
          .
          <volume>14</volume>
          (
          <year>2019</year>
          )
          <fpage>49</fpage>
          -
          <lpage>64</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.4018/IJWLTT.2019100103</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kulshreshtha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <article-title>Analyzing newspaper crime reports for identification of safe transit paths</article-title>
          , in:
          <string-name>
            <given-names>R.</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sarkar</surname>
          </string-name>
          (Eds.),
          <source>NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Denver, Colorado, USA, May 31 - June 5,
          <year>2015</year>
          , The Association for Computational Linguistics,
          <year>2015</year>
          , pp.
          <fpage>17</fpage>
          -
          <lpage>24</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.3115/v1/n15-2003</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Rollo</surname>
          </string-name>
          , L. Po,
          <article-title>Crime event localization and deduplication</article-title>
          , in:
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. A. M.</given-names>
            <surname>Tamma</surname>
          </string-name>
          , C. d'Amato,
          <string-name>
            <given-names>K.</given-names>
            <surname>Janowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Seneviratne</surname>
          </string-name>
          , L. Kagal (Eds.),
          <source>The Semantic Web - ISWC 2020 - 19th International Semantic Web Conference</source>
          , Athens, Greece, November 2-6,
          <year>2020</year>
          , Proceedings, Part II, volume
          <volume>12507</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2020</year>
          , pp.
          <fpage>361</fpage>
          -
          <lpage>377</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/978-3-030-62466-8_23</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>