=Paper=
{{Paper
|id=Vol-1816/paper-22
|storemode=property
|title=Multilanguage Semantic Behavioural Algorithms to Discover Terrorist Related Online Contents
|pdfUrl=https://ceur-ws.org/Vol-1816/paper-22.pdf
|volume=Vol-1816
|authors=Maurizio Mencarini,Gianluca Sensidoni
|dblpUrl=https://dblp.org/rec/conf/itasec/MencariniS17
}}
==Multilanguage Semantic Behavioural Algorithms to Discover Terrorist Related Online Contents==
In Proceedings of the First Italian Conference on Cybersecurity (ITASEC17), Venice, Italy.
Copyright c 2017 for this paper by its authors. Copying permitted for private and academic purposes.
Multilanguage Semantic Behavioural
Algorithms to discover terrorist related online
contents
Maurizio Mencarini1 and Gianluca Sensidoni1
1
Expert System S.p.A.
mmencarini@expertsystem.com, gsensidoni@expertsystem.com
Abstract
Multilanguage Semantic behavioural algorithms mapped to machine learning
techniques in order to collect and analyze huge amount of heterogeneous and complex
Multimedia and Multilanguage terrorist-related contents from both the Surface and the
Deep Web, in order to discover (by “connecting the dots”), detect, analyze, and monitor
potential terrorist-related activities and people. At the conference a live demo of to-date
obtained results and some experiment coming from EU DANTE project (an H2020 EU
funded research project) will be shown.
1 Introduction
Contemporary terrorists and criminal organizations increasingly exploit the Internet to spread their
message and gain support throughout the world, using the Web as communication tool, in particular
for recruitment, training and propaganda activities, but also for disinformation, raising funds,
organization, planning, and financial transactions.
While the use of the Internet for propaganda and recruiting purposes (though not solved) has
received wide publicity, terrorist groups utilize the Internet for a variety of other purposes, including
fund raising. Not only Al-Qaeda, but also other terrorist groups (including ISIS) use the Internet to
raise funds to support their activities.
In order to promptly face this threat, it is become urgent to put in place countermeasures to enable
the Law Enforcement Agencies (LEAs) and intelligence officials to continuously monitoring in near
real time on-line relevant (for the purposes of counter terrorism, under lawful warrant)
communications and contents, both in the Surface Web, and in the Deep Web, where there is a vast
amount of data (there are estimates that this is more than 95% of the available data on the Internet) not
always indexed by automated search engines, but potentially providing useful contents and
information for detecting and fighting terrorist activities.
This activity must absolutely be part of an effective cybersecurity strategy.
222
Due to the escalating number of Internet users and the increasing speed of creation/deletion of
Internet contents, it is clear that searching terrorist-related contents (i.e. generated by terrorists or
individuals linked to terrorists, or linked to terrorist activities, including relevant contents generated
by non terrorists) and information by keywords and manually is highly error-prone in precision
and a lot time-consuming, dramatically slow and obsolete, making impractical the examination
of huge amount of resources.
2 Concept and Approach
The Multilanguage Semantic Behavioural Algorithms (also part of the EU DANTE project) are
aimed to support LEAs in the most advanced intelligence processes, through big data collection and
analysis. Knowing facts and events in advance is one of the key points of these proposed solutions,
used to prevent potential threats by detecting relevant online contents over the Internet. Thus, the
proposed solutions are mainly aimed at supporting the automatic discovery and analysis of relevant
online sources and contents in the Surface Web, but also in the Deep Web and Dark Nets. Indeed one
of the key elements of this kind of solutions aims at innovating and improving the intelligence
processes in such a dark part of Internet that is only accessible with specific clients, like hidden
services in TOR and I2P. These Dark Nets are used from organizations to hide their identity and
publish without censorship.
Most of the detected relevant multimedia contents contain information about activities and
events: one of the challenges of the solution is to automatically identify and cluster/classify such
activities and reconstruct the chain. However the solution is also about the automatic detection and
analysis of people and groups (and relationships), through the identification of identities and the
understanding of capability and intentions of individuals or organizations that may be engaged
in actions, with special focus on propaganda, training, and disinformation. Terroristic groups,
leaders and suspicious people will be detected, analyzed and monitored at different levels, including
sociological, criminological, and psychological, in order to identify behavioral patterns on which to
focus during the analysis processes. In this context it is crucial to identify and recognize the real
identity of people hidden behind the virtual accounts.
3 Expected Demo at ITA-SEC 2017
This paper is focused on the following solutions, approaches and methodologies that are also used
into H2020 EU funded project DANTE with a strongly activities of personalization, improving,
enhancing and consolidation of:
Behavioural algorithms
Extraction of relationships/facts
Multilanguage approach
Multimedia approach
3.1 Behavioural issues and Relationships
Live demo: analyzing a text coming from Internet (English language)
Some experiment coming from emotional and stylometric analysis of posts of relevant criminal
organizations:
223
224
225
226
The previous examples uses algorithms based on deep semantic rules. This scenario can be
expanded and enhanced with an HYBRID approach containing also machine learning techniques.
So semantics and machine learning in order to:
give smartness on creation of the knowledge base (ontology) of the final solution; take a
look for example at the numerous way of speaking and writing existing into common
channels of communication/social networks (see point 3.4)
give more parameters/indexes to improve and reach a new innovative stylometric
approaches. Following some experiments coming from EU DANTE project:
Starting from the output of Stylometric analysis we saw during the last live demo, so:
Readability
Vocabulary richness
Registers and slangs
Grammatical tenses
and so on…..
in order to understand if it is possible to recognize the true author of a message and/or a document.
All the analysed parameters have been used for training a Machine Learning system (Weka
algorithm) which is in charge of finding similar features among various input texts.
This is a supervised technique, so with the knowledge created from specific documents such as a
training set.
Training set is related to: 96 texts of 4 different authors (Michael Moore, Samantha Cristoforetti,
Michael White and noam Chomsky).
Below there is the Stylometric DNA linked to the four different authors (with different views):
227
.
Testing set is related to 11 new texts and the final result is:
228
3.2 Multilanguage approach
Having a Multilanguage approach is a must in the analysis of terrorist and criminal contents.
Different languages can be used in the same communication thread, cover terms done in another
language considering the main language of the text can be found and multiple dialects terms. These
are the Multilanguage issues an analyst has to face during his navigation into either Surface or Dark
Web.
Multilanguage approach can be faced into 2 different ways:
Deep semantic technology to analyze each specific language
Automatic translation to normalize multiple langue into “English” for example
Both approaches can be used too in order to propose an HYBRID approach (again as previously
mentioned about stylometric analysis); so an automatic translation for example, of Persian content
into English and analysis by a semantic technology strongly focused on the destination language (so
English.
Live demo: analyzing a text coming from Internet (Persian language)
229
3.3 Multimedia approach
There are different countries where local radios and broadcast sources has a relevant importance
compared with writings coming from social network. So Speech To Text (STT) technology, mainly
tuned on the specific sensor, has a key value into the final solution
Also in this case multilanguage issue is an important aspect to be considered; in particular,
referred to analyze local dialects.
Live demo: analyzing an audio/video coming from Internet (English language)
3.4 New challenges of behavioural algorithms and
multilanguage approach
The output of behavioural algorithms can be the input of the following scenarios:
Fake messages/authors and Disinformation (see previous example regarded Weka
machine learning algorithm)
Encoded messages (also related to multilangauge approach)
Mapping virtual identity with physical identity
230
Propaganda (radicalization process)
In order to reach these goals, also in relation of EU DANTE project, particular focus is given to
the managing and comprehension of slangs, acronyms and abbreviations included in the content and
also misspellings. This approach can be more relevant if we use multilanguage approaches and
transcription ones.
As the end, last but surely not least, new stylometric cool features requested from the Law
Enforcement Agencies(LEAs):
Understanding the gender of the writer/speaker
Understanding the age of the writer/speaker
Discovering the leaderships
Linking to mother tongue speaker/writer
4 References
Aa.Vv. The use of the Internet for terrorist purposes, 2012 report from UNODC (UNITED
NATIONS OFFICE ON DRUGS AND CRIME) Retrieved from
http://info.publicintelligence.net/UNODC-TerroristInternet.pdf
Robert Anderson, Jr., 2014, Cyber Security, Terrorism, and Beyond: Addressing Evolving Threats
to the Homeland, Statement Before the Senate Committee on Homeland Security and Governmental
Affairs Washington, D.C. Retrieved from https://www.fbi.gov/news/testimony/cyber-security-
terrorism-and-beyond-addressing-evolving-threats-to-the-homeland
Ben Saul, 2012, Terrorism, Bloomsbury Publishing
H.M. Virupakshiah, 2009, Terrorism Challenge Diplomacy, Concept Publishing Company
Aa.Vv, 2015, Cyber Counterterrorism, Cyber International Conflict, Virtual Cyber War Crimes,
Journal of Legal Technology Risk Management
Haim Assa, 2014, When Marx Meets Nietzsche in Cyberspace: Revolutionary Praxis and the Will
to Power in Twenty-First-Century Revolution, Contento de Semrik
231