<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Results at SwissText 2022</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claudio Giorgio Giancaterino</string-name>
          <email>claudio.giancaterino@intesasanpaolovita.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Intesa SanPaolo Vita</institution>
          ,
          <addr-line>Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lugano</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Workshop Proce dings</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>tion, Part of Speech Tagging, N-grams analysis</institution>
          ,
          <addr-line>Topic</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Natural Language Processing (NLP) will lead the Artificial Intelligence revolution in the Insurance industry over the next years. There are several opportunities to employ NLP in insurance activities, from claims processing to fraud detection and chatbots. In the marketing field, NLP can be used to monitor the sentiment of the feedback that people publish on different social networks, to better consider insured needs or to extract risk insights. Textual analysis and classification of claims can simplify claims processing, reducing treatment time and operational errors, and can provide help in fraud detection. The underwriting process can be improved by a better textual assessment. The workshop had the goal to show NLP techniques for fraud detection using the disaster Tweets data set from the Kaggle classification competition.</p>
      </abstract>
      <kwd-group>
        <kwd>Named Entity Recognition</kwd>
        <kwd>Part of Speech Tagging</kwd>
        <kwd>N-grams analysis</kwd>
        <kwd>Topic Modelling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        ance industry. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
      </p>
      <p>-Marketing: NLP can be used to monitor the sentiment
analysis of feedback to better consider insured needs, to
monitor risks insight on what people are thinking about a
particular product, to extract information about expected
trends to improve marketing strategy.</p>
      <p>-Underwriting: using Optical Character Recognition
(OCR) and NLP is possible to extract information from
medical reports and help underwriters in a better quote
of the insurance coverage. NLP can categorize patients’
diseases and retrieves correlation between some
symptoms and the likely cost of treatment for the Insurance
Company.</p>
      <p>-Reserving: the analysis of claim reports during the
ifrst notification of loss can improve the reserving process
for severe claims.</p>
      <p>-Claims processing: textual analysis can simplify the
trial reducing the time treatment of the process and
reducing operational mistakes.</p>
      <p>-Risk management: text classification is useful in the
risk assessment giving help in fraud detection.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Exploratory Data Analysis</title>
      <sec id="sec-2-1">
        <title>Before to start with the application of any Machine Learn</title>
        <p>ing model, is better to understand data involved in the
project, and Exploratory Data Analysis is a block between
data cleaning and data modeling with the goal to
undercheck relationships between variables of a data set with</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>1. Introduction</title>
      <sec id="sec-3-1">
        <title>The workshop started with an introduction of the Natural</title>
      </sec>
      <sec id="sec-3-2">
        <title>Language Processing (NLP) explaining use cases in the</title>
      </sec>
      <sec id="sec-3-3">
        <title>Insurance world.</title>
      </sec>
      <sec id="sec-3-4">
        <title>NLP can find a slot in broad Insurance fields, from Marketing to Underwriting, from Claims processing to Risk assessment, and also it can be applied in the traditional actuarial Reserving area.</title>
        <p>Kaggle classification competition. 1</p>
      </sec>
      <sec id="sec-3-5">
        <title>The workshop was organized in the manner to show</title>
        <p>an application of NLP techniques in the Insurance Fraud</p>
      </sec>
      <sec id="sec-3-6">
        <title>Detection by the disaster Tweets data set retrieved from</title>
      </sec>
      <sec id="sec-3-7">
        <title>The first approach was to apply an Exploratory Data</title>
      </sec>
      <sec id="sec-3-8">
        <title>Analysis by the use of the word cloud, statistics of tweets and the language used in the data set.</title>
      </sec>
      <sec id="sec-3-9">
        <title>After the pre-processing activity the work went ahead deeply into the text discovering Named Entity Recognition, Part of Speech Tagging, N-grams analysis, Topic Modelling and Word Embedding.</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2. Natural Language Processing in</title>
      <sec id="sec-4-1">
        <title>3.1. Activity</title>
        <p>the Insurance world</p>
        <sec id="sec-4-1-1">
          <title>Natural Language Processing is a branch of Artificial</title>
        </sec>
        <sec id="sec-4-1-2">
          <title>Intelligence with the aim to design some models allowing</title>
          <p>CEUR
htp:/ceur-ws.org
ISN1613-073</p>
          <p>CEUR</p>
          <p>Workshop Proceedings (CEUR-WS.org)
https://www.kaggle.com/competitions/nlp-getting-started</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>3.2. Results</title>
        <sec id="sec-4-2-1">
          <title>The data set has 7.613 rows and 5 columns with ”id”,</title>
          <p>”text”, ”target” and other 2 columns: ”keyword” and
”location”.</p>
          <p>Text of the tweets are written in English for the almost
whole data frame.</p>
          <p>The data set has a length of 7.613 tweets with an
average of 15 words per tweet, an average of 6 characters
word length, and roughly 5 stop words per tweet.</p>
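        <p>As a rough illustration (not the workshop notebook itself), statistics of this kind can be computed with pandas and NLTK; the file and column names below follow the Kaggle data set.</p>
        <preformat>
# Sketch: basic statistics on the disaster Tweets data set.
import pandas as pd
from nltk.corpus import stopwords          # requires nltk.download("stopwords")

df = pd.read_csv("train.csv")              # columns: id, keyword, location, text, target
stops = set(stopwords.words("english"))

tokens = df["text"].str.split()
words_per_tweet = tokens.str.len()
avg_word_len = tokens.apply(lambda ws: sum(len(w) for w in ws) / max(len(ws), 1))
stops_per_tweet = tokens.apply(lambda ws: sum(w.lower() in stops for w in ws))

print(len(df), words_per_tweet.mean(), avg_word_len.mean(), stops_per_tweet.mean())
        </preformat>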
        <p>Tweets largely come from the USA, followed by the UK and Canada. From the word cloud tool, “earthquake”, “deeds”, “reason”, “forest” and “fire” appear as the most common terms.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>4. Text Analysis</title>
      <sec id="sec-5-1">
        <title>4.1. Activity</title>
        <p>Text Analysis tools are used by Companies to extract valuable insights from text data.</p>
        <sec id="sec-5-1-1">
          <title>4.1.1. Named Entity Recognition and Part of Speech Tagging</title>
          <p>The first approach followed was to explore Named Entity Recognition (NER) and Part of Speech (POS) Tagging with the use of the “Spacy” library.</p>
          <p>A named entity is a proper noun that refers to a specific entity like a location, person, date, time, organization, etc. Named Entity Recognition systems are used to identify and segment named entities in text, and they can be found in applications such as question answering, information retrieval, and machine translation. In Insurance, NER applications can be found in customer care to classify customer complaints.</p>
          <p>POS Tagging is the process of marking tokens with a part of speech like nouns, adverbs, verbs, punctuation, etc. The purpose of a POS tagger is to assign grammatical information to each token, and in this way it is possible to generate a tree representation of the sentence. POS tagging works as a prerequisite for further NLP analysis such as chunking, syntax parsing, information extraction, machine translation, sentiment analysis, grammar analysis and word-sense disambiguation.</p>
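          <p>A minimal spaCy sketch of both steps on a sample tweet; the small English pipeline en_core_web_sm is assumed here, not necessarily the model used in the workshop.</p>
          <preformat>
# Sketch: NER and POS tagging with spaCy on a single tweet.
import spacy

nlp = spacy.load("en_core_web_sm")   # python -m spacy download en_core_web_sm

doc = nlp("Forest fire near La Ronge Sask. Canada")
for ent in doc.ents:                 # named entities: span text plus label (GPE, DATE, ...)
    print(ent.text, ent.label_)
for token in doc:                    # part-of-speech tag for every token
    print(token.text, token.pos_)
          </preformat>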
        <sec id="sec-5-1-1">
          <title>At this point is required a pre-processing step, known</title>
          <p>as text cleaning, to generate features, extracts patterns
in text and further analysis. The aim of this process
is preparing raw text for Natural Language Processing.</p>
          <p>Text cleaning is based on several steps, starting with text normalization, proceeding with the removal of Unicode characters and stop words, and ending with stemming/lemmatization.</p>
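          <p>A compact sketch of such a cleaning pipeline with NLTK follows; the exact steps and libraries used in the workshop may differ.</p>
          <preformat>
# Sketch: normalize, strip non-ASCII characters, drop stop words, lemmatize.
import re
from nltk.corpus import stopwords            # requires nltk.download("stopwords")
from nltk.stem import WordNetLemmatizer      # requires nltk.download("wordnet")

stops = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean(text):
    text = text.lower()                                   # normalization
    text = text.encode("ascii", "ignore").decode()        # remove Unicode characters
    text = re.sub(r"http\S+|[^a-z\s]", " ", text)         # strip URLs, punctuation, digits
    words = [w for w in text.split() if w not in stops]   # remove stop words
    return " ".join(lemmatizer.lemmatize(w) for w in words)

print(clean("Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all"))
          </preformat>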
          <p>The analysis went ahead with N-grams: a contiguous sequence of n items generated from a given sample of text, where the items can be words. N-grams can be used as features extracted from a text corpus for Machine Learning models; moreover, they are useful in the autocompletion of sentences or in speech recognition, helping to predict the next word that should occur in a sequence.</p>
          <p>In the workshop N-grams were produced with a Bag-of-Words (BoW) approach, retrieving the top occurrences of unigrams, bigrams and trigrams in the data set.</p>
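          <p>For instance, the top n-grams can be pulled out with a simple count-based vectorizer; scikit-learn is assumed here purely for illustration.</p>
          <preformat>
# Sketch: top-N n-grams from the tweets with a Bag-of-Words count matrix.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_csv("train.csv")

def top_ngrams(texts, n=1, top=10):
    vec = CountVectorizer(ngram_range=(n, n), stop_words="english")
    counts = vec.fit_transform(texts).sum(axis=0).A1   # total count of each n-gram
    vocab = vec.get_feature_names_out()
    return sorted(zip(vocab, counts), key=lambda x: -x[1])[:top]

print(top_ngrams(df["text"], n=2))   # top bigrams in the tweets
          </preformat>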
          <p>
            Bag-of-Words [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] belongs to a first level of techniques used to transform raw text into numerical features by vectorization. BoW describes the occurrence of words taken from a vocabulary obtained from a corpus text, labelling it in a binary vector.
          </p>
          <p>Each word is represented by a one-hot vector, a sparse vector in the size of the vocabulary. The Bag-of-Words feature vector is the sum of all one-hot vectors of the words.</p>
        </sec>
        <sec id="sec-5-1-3">
          <title>4.1.3. Word Embedding</title>
          <p>
            The second level of transforming raw text into numerical features is given by Word Embedding, with the goal to generate vectors encoding semantic meanings: individual words are represented by vectors in a predefined vector space [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. It allows words with similar meaning to have a similar representation. In the workshop Word Embedding was performed with Word2vec using the “gensim” library. Word2vec is a neural network that tries to maximize the probability of seeing a word in a context window, measured by the cosine similarity between two vectors. This task was explored with two architectures: the continuous bag-of-words (CBOW) model, which tries to predict a word from its context, and the continuous skip-gram model, which tries to predict the context from a word.
          </p>
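          <p>A minimal gensim sketch of both architectures; the hyper-parameters below are illustrative rather than the workshop settings.</p>
          <preformat>
# Sketch: Word2vec on tokenized tweets, CBOW vs. skip-gram, then similarity queries.
import pandas as pd
from gensim.models import Word2Vec

df = pd.read_csv("train.csv")
sentences = df["text"].str.lower().str.split().tolist()   # naively tokenized tweets

cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=5, sg=0)  # sg=0: CBOW
skip = Word2Vec(sentences, vector_size=100, window=5, min_count=5, sg=1)  # sg=1: skip-gram

print(cbow.wv.most_similar("earthquake", topn=5))
print(skip.wv.similarity("fire", "forest"))                # cosine similarity of two words
          </preformat>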
        </sec>
        <sec id="sec-5-1-2">
          <title>The activity of this chapter ended with Topic Modelling,</title>
          <p>a form of unsupervised learning that works
discovering hidden relationships in the text, more precisely the
purpose is to identify topics in a document.</p>
          <p>The purpose of the workshop was to discover new
available and best performing tools for NLP, and for this
activity was explored the use of BERTopic developed by
Maarten Grootendorst. 3</p>
          <p>
            There are four key components in BERTopic [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ], and
can be considered as an ensemble of models.
          </p>
          <p>It starts by generating document embeddings to represent the meaning of the sentences, using a pre-trained language model: BERT (Bidirectional Encoder Representations from Transformers).</p>
          <p>Given the huge dimension of the vectors generated, a general non-linear dimensionality reduction technique is applied: UMAP (Uniform Manifold Approximation and Projection).</p>
          <p>At this point the reduced embeddings are clustered with HDBSCAN (Hierarchical Density-Based Spatial Clustering), which finds clusters of variable densities by converting DBSCAN into a hierarchical clustering algorithm.</p>
          <p>To retrieve topics from the clustered documents, a modified version of TF-IDF (Term Frequency-Inverse Document Frequency) is applied: the class-based TF-IDF procedure, where the class represents the collection of documents merged into a single document for each cluster.</p>
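          <p>A minimal BERTopic sketch of this pipeline; by default the library wires a pre-trained embedding model, UMAP, HDBSCAN and the class-based TF-IDF together, and the number of topics below is illustrative.</p>
          <preformat>
# Sketch: document embeddings -> UMAP -> HDBSCAN -> class-based TF-IDF, via BERTopic.
import pandas as pd
from bertopic import BERTopic

df = pd.read_csv("train.csv")
docs = df["text"].tolist()                         # raw tweets

topic_model = BERTopic(nr_topics=10)               # reduce to roughly ten topics
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())         # size and top words of each topic
print(topic_model.get_topic(0))                    # most representative terms of topic 0
          </preformat>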
          <p>
            Input features for BERTopic have been generated by TF-IDF, an extension of Bag of Words where terms are weighted so that words carrying useful information are highlighted [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ].
          </p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>4.2. Results</title>
        <p>From the N-grams analysis the top occurrence words are: “people”, “video”, “crash”, “emergency”, and “disaster”. Looking at the bigrams they are: “suicide bomber”, “youtube video”, “northern california”, “california wildfire”, “bombe detonate”, and “natural disaster”.</p>
        <p>Interesting results come from Word Embedding, where the vocabulary is similar between the two architectures, with these relevant words: “earthquake”, “forest”, “evacuation”, “people”, “wildfire”, “california”, “flood”, “disaster”, “emergency”, “damage”.</p>
        <p>What changes is the similarity between words: the CBOW model usually provides a lower similarity between words than the one provided by the Skip-Gram model.</p>
        <p>The Word Embedding exploration ended by reducing the vector dimensions with Principal Component Analysis, giving the opportunity to visualize words in two dimensions.</p>
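        <p>A sketch of that reduction for the skip-gram vectors trained in the earlier Word2vec sketch, with scikit-learn and matplotlib assumed as the plotting stack:</p>
        <preformat>
# Sketch: project Word2vec vectors to 2-D with PCA and plot a few words.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = ["earthquake", "forest", "evacuation", "wildfire", "flood", "disaster"]
vectors = [skip.wv[w] for w in words]      # `skip` is the skip-gram model trained above

xy = PCA(n_components=2).fit_transform(vectors)
plt.scatter(xy[:, 0], xy[:, 1])
for (x, y), w in zip(xy, words):
    plt.annotate(w, (x, y))
plt.show()
        </preformat>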
        <p>After that, it was the turn of Topic Modelling with BERTopic. The tool was used in an easy way, without fine tuning parameters, using TF-IDF features as input and with the arbitrary choice of ten topics.</p>
        <p>
          Participants asked how to trust in the results, so the job has been completed with the coherence score evaluation of BERTopic and the comparison with the Latent Dirichlet Allocation (LDA) model [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], the common model employed in Topic Modelling.
        </p>
        <p>The same relevant words from BERTopic appear also in LDA, but with different probabilities. The evaluation has been done with the UMass coherence score, which calculates how often two words appear together in the corpus. The perfect coherence is at 0, and it usually decreases as the number of topics rises. The issue with the LDA model was that the number of trained documents in each chunk had to be reduced to make it converge. From the results, BERTopic shows a number closer to 0 (roughly -14) than the LDA model (roughly -18), though the result can be improved by tuning the model.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>5. Text Classification</title>
      <sec id="sec-7-1">
        <title>5.1. Activity</title>
        <p>The approach followed in this chapter is the classification task: given a target variable, the aim is to predict if a tweet can be considered a “disaster” or otherwise “not disaster”. Twitter has become an important emergency communication channel, because people with smartphones are able to announce an emergency they are observing in real-time. For this reason, more agencies and Insurance Companies are interested in monitoring Twitter. Moreover, it is not always clear whether a person’s words are actually announcing a disaster, so this task can be linked to a Fraud Detection task.</p>
        <p>
          The approach followed was to use Transfer Learning [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], a Machine Learning method where a model developed for a task is reused as the starting point for a model on a second task. In the traditional Supervised Learning approach, Machine Learning models are trained on labelled data sets and are expected to perform well on unseen data of the same task and domain. The traditional approach falls down when there is not enough labelled data to perform training for the task or domain of interest.
        </p>
        <p>The idea behind Transfer Learning is to try to store the knowledge gained in solving the source task in the source domain and apply it to another similar problem of interest; it is the same concept as learning by experience, so the aim is to exploit pre-trained models that can be fine-tuned on smaller task-specific data sets.</p>
        <p>
          Bidirectional Encoder Representation from Transformers (BERT) is one of the most popular state-of-the-art NLP approaches for Transfer Learning, published by Google in 2018 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>
          BERT is a bidirectional multi-layer Transformer model that exploits the attention mechanism [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. A basic Transformer uses an encoder-decoder architecture: the encoder learns the representation from the input sentence, and the decoder receives the representation and produces a prediction for the task.
        </p>
        <p>The attention mechanism was introduced to improve the performance of the encoder-decoder model for machine translation. The idea behind the attention mechanism was to permit the decoder to use the most relevant parts of the input sequence in a flexible manner, by a weighted combination of all of the encoded input vectors.</p>
        <p>
          With Transformers we have reached the third level of vectorization technique in NLP, the Contextual Embedding [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Both traditional Word Embedding (word2vec, GloVe) and Contextual Embedding (ELMo, BERT) aim to learn a continuous (vector) representation for each word in the documents.
        </p>
        <p>The Word Embedding method builds a global vocabulary using unique words in the documents, ignoring the meaning of words in different contexts. Hence, given a word, its embedding is always the same in whichever sentence it occurs, and for this reason the pre-trained Word Embeddings are static.</p>
        <p>Contextual Embedding methods are used to learn sequence-level semantics by considering the sequence of all words in the documents. The embeddings are obtained by passing the entire sentence to the pre-trained model, so the embedding generated for each word depends on the other words in a given sentence. The Transformer based models work on the attention mechanism, and attention is a way to look at the relation between a word and its neighbours; for this reason, pre-trained Contextual Embeddings are dynamic.</p>
        <p>BERT’s goal is to generate a language representation model. It uses a rich input embedding representation, derived from a sequence of tokens, which is converted into vectors; then three embedding layers are combined to obtain a fixed-length vector processed in the Neural Network. BERT is pre-trained using two Unsupervised Learning tasks: Masked LM (MLM) and Next Sentence Prediction (NSP).</p>
        <p>The usual workflow of BERT consists of two stages: pre-training and fine-tuning.</p>
        <p>
          The attention mechanism in the Transformer allows BERT to model many downstream tasks, such as sentiment analysis, question answering, paraphrase detection and more. In this workshop the “ktrain” [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] library has been used, a low-code library developed by Arun S. Maiya that provides a lightweight wrapper for “Keras”, making it easier to build, train, and deploy Deep Learning models.
        </p>
      </sec>
      <sec id="sec-7-2">
        <title>5.2. Results</title>
        <p>Thanks to the “ktrain” low-code library the implementation of BERT was easy: it was enough to split the data set into a train and test set, then send the train set as input into the “ktrain” pre-processing and Deep Learning model.</p>
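        <p>A minimal sketch of that workflow with ktrain; the split, maxlen and learning-rate/epoch choices below are illustrative, not the workshop settings.</p>
        <preformat>
# Sketch: fine-tune BERT on the disaster Tweets with the ktrain wrapper around Keras.
import pandas as pd
import ktrain
from ktrain import text
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")
x_tr, x_te, y_tr, y_te = train_test_split(df["text"].values, df["target"].values,
                                          test_size=0.2, random_state=0)

trn, val, preproc = text.texts_from_array(
    x_train=x_tr, y_train=y_tr, x_test=x_te, y_test=y_te,
    class_names=["not disaster", "disaster"],
    preprocess_mode="bert", maxlen=160)

model = text.text_classifier("bert", train_data=trn, preproc=preproc)
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=6)
learner.fit_onecycle(2e-5, 1)        # one epoch at a BERT-typical learning rate
learner.validate(class_names=["not disaster", "disaster"])   # accuracy, F1, confusion matrix
        </preformat>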
        <p>The Transformer based model shows, as expected, a good performance both on the train set and on the test set, with an accuracy greater than 82%. Given the imbalanced data set, the performance was also evaluated with the F1 score, with roughly the same results.</p>
        <p>Participants asked for an evaluation against other models, so the job has been completed with a common classification Machine Learning model, Logistic Regression, again with the “ktrain” library.</p>
        <p>Implementation of this model is easy, but the performance is poor: roughly 63% on the train set and roughly 58% on the validation set.</p>
        <p>Looking at the confusion matrix, the BERT model shows a large number of elements on the diagonal and a small number of elements off the diagonal, so a better matrix than the one of Logistic Regression.</p>
        <p>The last step was the inference of the model, testing the predictions on unseen test tweets; also in this situation BERT outperformed the Logistic Regression model, with all predictions correct.</p>
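        <p>A sketch of that last inference step, again via ktrain, building on the training sketch above; the second tweet is an invented example.</p>
        <preformat>
# Sketch: wrap the fine-tuned model in a predictor and classify new tweets.
predictor = ktrain.get_predictor(learner.model, preproc)
print(predictor.predict("Forest fire near La Ronge Sask. Canada"))
print(predictor.predict("I love the new fire emoji on my phone"))
# predict() returns one of the class_names defined above.
        </preformat>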
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <sec id="sec-6-1">
        <title>With this workshop there has been the opportunity to</title>
        <p>have an overview of Natural Language Processing
applications in the Insurance world, and with the disaster
Tweets data set there has been the opportunity to
discover NLP applications in Fraud Detection.</p>
        <p>In this work the development of Natural Language
Processing has been retracted, from Tokenization to Bag of
Words, from Word Embedding to Contextual Embedding.</p>
        <p>Transfer Learning has been explored, looking on its
potential that outperforms benchmark models both on
Topic Modelling and Classification prediction.
[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova,
Bert: Pre-training of deep bidirectional
transformers for language understanding, arXiv preprint
arXiv:1810.04805 (2018).
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit,
L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin,
Attention is all you need, Advances in neural
information processing systems 30 (2017).
[9] Q. Liu, M. J. Kusner, P. Blunsom, A
survey on contextual embeddings, arXiv preprint
arXiv:2003.07278 (2020).
[10] A. S. Maiya, ktrain: A low-code library for
augmented machine learning, J. Mach. Learn. Res 23
(2020) 1–6.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Uthayasooriyar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>A survey on natural language processing (nlp) and applications in insurance</article-title>
          ,
          <source>arXiv preprint arXiv:2010.00462</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferrario</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nägelin</surname>
          </string-name>
          ,
          <article-title>The art of natural language processing: classical, modern and contemporary approaches to text document classification</article-title>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Corrado,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Efficient estimation of word representations in vector space</article-title>
          ,
          <source>arXiv preprint arXiv:1301.3781</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grootendorst</surname>
          </string-name>
          ,
          <article-title>Bertopic: Neural topic modeling with a class-based tf-idf procedure</article-title>
          ,
          <source>arXiv preprint arXiv:2203.05794</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jelodar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Latent dirichlet allocation (lda) and topic modeling: models, applications, a survey</article-title>
          ,
          <source>Multimedia Tools and Applications</source>
          <volume>78</volume>
          (
          <year>2019</year>
          )
          <fpage>15169</fpage>
          -
          <lpage>15211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Malte</surname>
          </string-name>
          , P. Ratadiya,
          <article-title>Evolution of transfer learning in natural language processing</article-title>
          ,
          <source>arXiv preprint arXiv:1910.07370</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>J.</given-names> <surname>Devlin</surname></string-name>,
          <string-name><given-names>M.-W.</given-names> <surname>Chang</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Lee</surname></string-name>,
          <string-name><given-names>K.</given-names> <surname>Toutanova</surname></string-name>,
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>,
          <source>arXiv preprint arXiv:1810.04805</source>
          (<year>2018</year>).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>A.</given-names> <surname>Vaswani</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Shazeer</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Parmar</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Uszkoreit</surname></string-name>,
          <string-name><given-names>L.</given-names> <surname>Jones</surname></string-name>,
          <string-name><given-names>A. N.</given-names> <surname>Gomez</surname></string-name>,
          <string-name><given-names>Ł.</given-names> <surname>Kaiser</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Polosukhin</surname></string-name>,
          <article-title>Attention is all you need</article-title>,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>30</volume>
          (<year>2017</year>).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><given-names>Q.</given-names> <surname>Liu</surname></string-name>,
          <string-name><given-names>M. J.</given-names> <surname>Kusner</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Blunsom</surname></string-name>,
          <article-title>A survey on contextual embeddings</article-title>,
          <source>arXiv preprint arXiv:2003.07278</source>
          (<year>2020</year>).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><given-names>A. S.</given-names> <surname>Maiya</surname></string-name>,
          <article-title>ktrain: A low-code library for augmented machine learning</article-title>,
          <source>J. Mach. Learn. Res.</source>
          <volume>23</volume>
          (<year>2020</year>)
          <fpage>1</fpage>-<lpage>6</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>