1. Introduction

10.1007/978-3-642-19668-3_8

Online News Event Extraction for Crime Analysis

Federica Rollo

Laura Po

Giovanni Bonisoli

0 0 Enzo Ferrari" Engineering Department, University of Modena and Reggio Emilia , Via Vivarelli 10, Modena , Italy

2022

1 19 22

Event Extraction is a complex and interesting topic in Information Extraction that includes methods for the identification of event's type, participants, location, and date from free text or web data. The result of event extraction systems can be used in several fields, such as online monitoring systems or decision support tools. In this paper, we introduce a framework that combines several techniques (lexical, semantic, machine learning, neural networks) to extract events from Italian news articles for crime analysis purposes. Furthermore, we concentrate to represent the extracted events in a Knowledge Graph. An evaluation on crimes in the province of Modena is reported.

eol>crime analysis NLP word embeddings question answering localization deduplication

1. Introduction

the lack of crime up-to-date information [ 3, 4, 5, 6, 7 ]. Detailed information about the crime events can be extracted through Natural Language Processing (NLP) techniques applied to the news articles’ text. Newspapers provide reliable, localized, and timely information (the time delay between the occurrence of the event and the publication of the news does not exceed 24/48 hours). The main drawback is that newspapers do not collect and publish all the facts related to crimes, but only the ones that arouse the readers’interest. Therefore, a percentage of police reports will not be turned into news articles and is lost.

The scope of this paper is to describe a framework to extract crime data from news articles, enrich them with semantic information and provide useful visualization. The strategy employs several techniques and extends a previous work [ 8 ]: crime categorization, named entity extraction, 5W+1H extraction, linked data mapping, geo-localization, time expression normalization, entity linking and duplicate detection. The novelty of such a framework is the integration of multiple techniques, previously used in diferent contexts, for solving various sub-problems into a common framework for crime analysis. Moreover, the framework transforms texts contained in news articles into a Crime knowledge graph that accurately describes and links the crimes. The framework has been tested successfully on news articles related to the city of Modena. However, it can be adapted to manage data of other cities or areas.

The outline of the paper is the following: in Section 2 the pipeline of the framework is presented, while Section 3 is devoted to the description of the use case in the province of Modena. Finally, Section 4 depicts some conclusions.

2. Crime Analysis framework

The pipeline of the framework to extract semantic information related to an event starting from news articles published on the web and alerts shared on social media consists of 8 phases. The phases should be executed mainly in sequence for each news article (except for some phases where the execution can be run in parallel); in any case, diferent news articles can be processed in parallel. The entire process is executed periodically to extract the latest published news articles, analyze them and add information to the knowledge base (KB). The frequency of execution depends on the need of having a real-time up-to-date KB and how often the selected online newspapers publish news articles. Figure 1 illustrates the phases, in bold, and some techniques and tools, in italic: 1. Data extraction is performed by harvesting online newspapers and social media (web scraping). The content of each news article is labeled, structured and semantically annotated [9, 10]. Some web content may already expose a predefined structure, i.e., HTML pages encoded with the Document Object Model (DOM), and some libraries allow accessing the data encapsulated into HTML tags;1 if this is not the case, other methods can be used, such as RSS Feed, API, and so on. 2. The categorization of the event is crucial to map a news article w.r.t. a type of event (business, sports, crime, politics, arts, culture, etc.). Given some pre-categorized news articles, i.e., annotated training data, machine learning algorithms can be applied to uncategorized news articles to assign them a type of event [11, 12]. Word embeddings can be exploited to extract the vector representations of the news articles, then, classifiers can take in input such representations to assign a category to each news article. Moreover, active learning can be used to enhance the quality of categorization retraining the classification model on the original dataset enriched with high-confidence categorized news articles.

Other approaches can exploit topic detection algorithms [13, 14]. 3. The identification and extraction of the 5W+1H ( What, When, Where, Who, Why, How) might be performed by employing Event Extraction models or through the Question Answering task using BERT (Bidirectional Encoder Representations from Transformers) by adopting diferent questions according to the type of event [ 15, 16]. The 5W+1H are the questions that a reporter must answer through the reporting. Therefore, these are the essential elements of any news and also contribute to improve the value of news and newsworthiness in journalism. 4. By analyzing the news article’s body, temporal expressions can be identified (for example, words like “two days ago”, “this morning”) and then normalized in date format. This operation allows identifying the exact date of the event, taking into account the date of publication. 5. The Named Entity Extraction (NER) is applied to the text of the news articles to identify the reference to persons, organizations, places, and temporal expressions and can be executed in parallel with the second phase. Its results can intersect the output of the 5W+1H phase. 6. With the Entity Linking, the entities identified in phase 5, such as persons, organizations, and locations are linked to resources (URI) available in Linked Datasets. For example, DBpedia Spotlight [17] can be used to link to resources of DBpedia and Linked Geo Data. Besides, an Italian version of Blink2 [18] can be used to link entities to Wikipedia or to populate a new KB with entities not linked to external resources. 7. The geographical localization exploits the entities that have been identified as locations in phase 5 or as answers to “where” in phase 3 and processes them to be geo-referenced.

1An example is the Java HTML Parser named jsoup.

2https://github.com/rpo19/BLINK

In case a location is not specified in the news article, organizations (identified in phase 4) can also be exploited to geolocate the event. 8. The identification of duplicates or storylines aims to find the same event described in more news articles, this might occur also within the same newspaper where updates about one event are published over time. To avoid too many comparisons among news, it is possible to identify candidate duplicates and apply text similarity analysis methodology to these candidates. In the end, the information of duplicates can be merged.

The use of semantic technologies is a key point in the presented approach for detecting events from news articles and enriching them with information automatically extracted from the text.

3. Crime Knowledge Graph for Modena

The Crime Analysis framework has been applied to a collection of news articles related to the crimes that occurred in the province of Modena. We select two major newspapers that publish on average 850 news articles per year related to crimes in the Modena province and cover the 95% of the total news articles published in Modena newspapers. The framework collected 17,500 reports from June 2011 to December 2021 (approximately 10 years) and is currently running to analyze news articles published every day by two local newspapers. On 17,500 reports, the framework was able to geolocalize almost 100% of the crime events and normalize the time expressions on 83% of the news articles. The results produced allow performing crime mapping studies and the identification of crime hot spots in semi real-time: visualizations of these results are shown through the “Modena Crime” web application.3

Figure 2 shows an example of Knowledge Graph generation, geo-localization and duplicate detection of an Italian news article reporting a theft. The news article is derived from the translation from Italian to English of the news taken from the “Gazzetta di Modena” newspaper.4 The 5W+1H are extracted from the text and reported in the event-centric Knowledge Graph. The central node (the one colored in red in Figure 2) identifies the event, while the other nodes report information related to the 5W+1H. The time reference (“last Thursday”) and the publication date (“12 March 2022”) are used to identify the date of the event, while the entities categorized as locations are exploited by OSM Nominatim to find the GPS coordinates where the theft occurred. Then, the coordinates are used to represent the event on the map. Thanks to the duplicate detection algorithm, two news articles are identified as duplicates of the examined news article. They are follow-up news since they report updates on the theft. The publication date is used to build the storyline of the event.

We created a Crime Knowledge Graph for 1246 thefts that occurred in Modena in 2020 using the Neo4j tool. An analysis of the interconnections between the crimes has been conducted using centrality algorithms that determine, on the basis of the graph topology, the importance of the individual nodes, and community detection algorithms to distinguish groups of nodes within the overall graph. Each crime event can share some connected nodes with other events, such as the place where the theft happened or the stolen objects, etc. We added direct relationships among the crime nodes to represent that they share some connected nodes. Using Pagerank, we classified the crime nodes based on their importance in the graph. The higher the Pagerank

3https://dbgroup.ing.unimo.it/modenacrime 4https://shorturl.at/vKSX2

value of a node, the greater the connections of the event node with the other event nodes. For example, thefts in which gold and valuables are stolen occur more frequently, and therefore events reporting such stolen items are strongly connected. To detect the communities, we used the label propagation algorithm on 5 diferent subgraphs obtained by examining the 5W+1H relationships separately. For the Where subgraph, the result highlighted communities of nodes sharing several locations, i.e., WHERE nodes. These first experiments provide some insights on how Crime Analysis can benefit from graph-based methods. 3.1. Impact and scalability To evaluate the impact of the proposed framework, the number of crimes collected by the framework and the number of crimes published in the oficial report of ISTAT (i.e., the crimes reported to the police) have been compared. The report related to the period from 2016 to 20205 has been taken into account. The information is only quantitative, the types of crime are reported per province and no information about where and when, during the year, the crime happened is provided. For providing a comparison between the two datasets, only the crime categories in common have been taken into account. Unfortunately, a location-based comparison is not possible because ISTAT provides a unique report for the entire province. With the total number of crimes in the city of Modena of 9590 from 2016 to 2020, the built KB covers around 10% of the crimes reported by ISTAT. A hypothesis on this low coverage can be attributed to the fact that not all the criminal events recorded by ISTAT, and therefore in the police reports, are of high impact and public interest. Therefore, not all of them are reported in local news articles. The most frequent crimes in both datasets are thefts each year from 2016 to 2020. Figure 3 shows the total number of the top three types of crimes recorded in the report of ISTAT and compared to the number collected by the framework. As can be seen, the lower coverage is reported in scams and frauds (the percentage is between 2% and 5% each year), while the higher one is in robbery (up to 50% in 2020).

Even if the approach has been applied in a medium area, it highlights its potentiality. In Italy, it is not possible to collect real-time crime information from oficial sources, since oficial criminal statistics are reported annually with a delay of 6 months. The proposed approach can be applied everywhere, also in small or medium cities/areas, since there will be always one or more newspapers that report the main crimes to happen in that place.

A first scalability test has been executed to ingest all the news articles related to crimes that happened in the entire Emilia-Romagna region. Other 9 newspapers which publish news related to the 9 provinces of the Emilia-Romagna region were selected. All the available news articles, from 2011 till now, which refer to 11 crime types have been collected. The total number of news articles is 35,000 (on average 3,900 news articles for each province). The crime ingestion can be run in parallel for diferent newspapers and diferent crime types. Therefore, 99 ingestion processes have been executed in parallel to extract, analyze and store data of the region. The total loading time, which depends on the loading time of the province with the higher number of news articles from 2011, is 3 hours for 35,000 news articles.6

4. Conclusion and Future Work

The framework presented in this paper is able to extract crime data from news articles, enrich them with semantic information and provide a Knowledge Graph that can be exploited for further analysis. It suggests multiple techniques for solving various sub-problems: extracting crime events from news articles, geo-locating them, linking entities to Linked Data resources, and detecting duplicates. The framework has been successfully employed in the province of Modena and has allowed collecting a consistent dataset of more than 17,500 news articles about 5http://dati.istat.it/Index.aspx?QueryId=25097&lang=en 6The test has been performed on a Microsoft Windows 10 Pro, 16GB RAM, Processor Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz, 2208 Mhz, 6 Cores, 12 Logical Processors. 13 types of crimes. A comparison with the oficial crime reports provided by ISTAT unveil that this approach has allowed collecting about 10% of the crime events. This can be considered a satisfactory result since we are aware that news articles do not cover all the crimes that happen in a city. The approach is domain-independent; it can be applied to any kind of news article, not only crime news, and can also be adapted to other languages.

In future work, we will work on the definition of a crime ontology to describe the crime events. In addition, Neo4j will be deepened to better analyze the Crime Knowledge Graph.

Acknowledgments

This work is partially supported by the project “Deep Learning for Urban Event Extraction from News and Social media streams” founded by the Engineering Department “Enzo Ferrari” of the University of Modena and Reggio Emilia.

[1]

Ristvej ,

Lacinák ,

Ondrejka , On smart city and safe city concepts , Mob. Networks Appl . 25 ( 2020 ) 836 - 845 . doi: 10 .1007/s11036-020-01524-4.

[2]

Oatley ,

Ewart , Data mining and crime analysis , Wiley Interdiscip. Rev. Data Min. Knowl. Discov . 1 ( 2011 ) 147 - 153 . doi: 10 .1002/widm.6.

[3] S. K , P. S. Thilagam, Crime base: Towards building a knowledge base for crime entities and their relationships from online news papers , Information Processing and Management 56 ( 2019 ). doi: 10 .1016/j.ipm. 2019 . 102059 .

[4]

Po ,

Rollo , Building an urban theft map by analyzing newspaper crime reports , in: 13th International Workshop on Semantic and Social Media Adaptation and Personalization , SMAP Zaragoza, Spain, 2018 , pp. 13 - 18 . doi: 10 .1109/SMAP. 2018 . 8501866 .

[5]

Dasgupta ,

Naskar ,

Saha , L. Dey, Crimeprofiler: Crime information extraction and visualization from news media, in:

A. P.

Sheth ,

Ngonga ,

Wang ,

Chang ,

Slezak ,

Franczyk ,

Alt ,

Tao , R. Unland (Eds.), Proceedings of the International Conference on Web Intelligence , WI '17, Association for Computing Machinery, New York, NY, USA, 2017 , p. 541 - 549 . doi: 10 .1145/3106426.3106476.

[6]

P. R.

Boppuru ,

Kenchappa , Geo-spatial crime analysis using newsfeed data in indian context , Int. J. Web Based Learn. Teach. Technol . 14 ( 2019 ) 49 - 64 . doi: 10 .4018/IJWLTT. 2019100103.

[7]

Sharma ,

Kulshreshtha ,

Singh ,

Agrawal ,

Kumar , Analyzing newspaper crime reports for identification of safe transit paths , in: R. Mihalcea , J. Y. Chai , A . Sarkar (Eds.), NAACL HLT 2015 , The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , Denver, Colorado, USA, May 31 - June 5, 2015 , The Association for Computational Linguistics, 2015 , pp. 17 - 24 . doi: 10 .3115/v1/n15- 2003 .

[8]

Rollo , L. Po, Crime event localization and deduplication , in: J.

Pan ,

V. A. M.

Tamma , C. d'Amato,

Janowicz ,

Fu ,

Polleres ,

Seneviratne , L. Kagal (Eds.), The Semantic Web - ISWC 2020 - 19th International Semantic Web Conference , Athens, Greece, November 2- 6 , 2020 , Proceedings, Part

, volume 12507 of Lecture Notes in Computer Science, Springer, 2020 , pp. 361 - 377 . doi: 10 .1007/978-3- 030 -62466-8\_ 23 .