NLP and Insurance – Workshop Results at SwissText 2022
Claudio Giorgio Giancaterino1
1 Intesa SanPaolo Vita, Milano, Italy

SwissText 2022: Swiss Text Analytics Conference, June 08–10, 2022, Lugano, Switzerland
✉ claudio.giancaterino@intesasanpaolovita.it (C. G. Giancaterino)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org)


Abstract
Natural Language Processing (NLP) will lead the Artificial Intelligence revolution in the Insurance industry over the next years. There are several opportunities to employ NLP in insurance activities, from claims processing to fraud detection and chatbots. In the marketing field, NLP can be used to monitor the sentiment of feedback that people publish on different social networks, to better understand insured needs or to extract insights about risks. Textual analysis and classification of claims can simplify claims processing, reducing treatment time and operational errors, and can help in fraud detection. The underwriting process can be improved by better textual assessment. The goal of the workshop was to show NLP techniques for fraud detection on the disaster Tweets data set from a Kaggle classification competition.


1. Introduction

The workshop started with an introduction to Natural Language Processing (NLP), explaining use cases in the Insurance world.
   NLP can find a slot in broad Insurance fields, from Marketing to Underwriting, from Claims processing to Risk assessment, and it can also be applied in the traditional actuarial Reserving area.
   The workshop was organized to show an application of NLP techniques to Insurance Fraud Detection using the disaster Tweets data set retrieved from a Kaggle classification competition.¹
   The first step was an Exploratory Data Analysis based on a word cloud, statistics of the tweets and the language used in the data set.
   After the pre-processing activity, the work went deeper into the text, covering Named Entity Recognition, Part of Speech Tagging, N-grams analysis, Topic Modelling and Word Embedding.
   The workshop ended with the classification task of predicting which tweets describe real disasters and which do not, using Transfer Learning applied in an easy way with the help of the "ktrain" library, and exploring the model inference on the test set.
   The job was focused on the use of BERT in both supervised and unsupervised learning tasks.²


2. Natural Language Processing in the Insurance world

Natural Language Processing is a branch of Artificial Intelligence whose aim is to design models that allow computers to understand natural language in order to perform some tasks. Some NLP applications are machine translation, question answering, and text summarization.
   In the following, some NLP opportunities in the Insurance industry [1]:
   - Marketing: NLP can be used to monitor the sentiment of feedback to better consider insured needs, to monitor risk insights on what people are thinking about a particular product, and to extract information about expected trends to improve the marketing strategy.
   - Underwriting: using Optical Character Recognition (OCR) and NLP it is possible to extract information from medical reports and help underwriters produce a better quote for the insurance coverage. NLP can categorize patients' diseases and retrieve correlations between symptoms and the likely cost of treatment for the Insurance Company.
   - Reserving: the analysis of claim reports at the first notification of loss can improve the reserving process for severe claims.
   - Claims processing: textual analysis can simplify the process, reducing treatment time and operational mistakes.
   - Risk management: text classification is useful in risk assessment, giving help in fraud detection.


3. Exploratory Data Analysis

3.1. Activity

Before starting with the application of any Machine Learning model, it is better to understand the data involved in the project. Exploratory Data Analysis is a step between data cleaning and data modelling, with the goal of understanding patterns, detecting mistakes, checking assumptions and checking relationships between the variables of a data set with the help of graphical charts and summary statistics.




¹ https://www.kaggle.com/competitions/nlp-getting-started
² https://github.com/claudio1975/SWISSTEXT_2022
Figure 1: wordcloud tweets.

3.2. Results

The data set has 7,613 rows and 5 columns: "id", "text", "target" and 2 other columns, "keyword" and "location".
   The text of the tweets is written in English for almost the whole data frame.
   The data set has a length of 7,613 tweets, with an average of 15 words per tweet, an average word length of 6 characters, and roughly 5 stop words per tweet.
   Tweets largely come from the USA, followed by the UK and Canada. From the word cloud, "earthquake", "deeds", "reason", "forest" and "fire" appear as the most common terms.
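As a minimal sketch of this exploratory step (assuming the competition's train.csv with its standard columns; not the exact workshop code), the per-tweet statistics and the word cloud could be reproduced along these lines:

```python
# Sketch of the Exploratory Data Analysis step on the Kaggle train.csv
# (columns: id, keyword, location, text, target).
import pandas as pd
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

df = pd.read_csv("train.csv")
print(df.shape)                                   # (7613, 5)

# Simple per-tweet statistics
words_per_tweet = df["text"].str.split().str.len()
avg_word_len = df["text"].str.split().explode().str.len().mean()
stops_per_tweet = df["text"].str.split().apply(
    lambda ws: sum(w.lower() in STOPWORDS for w in ws))
print(words_per_tweet.mean(), avg_word_len, stops_per_tweet.mean())

# Word cloud of the whole corpus (cf. Figure 1)
wc = WordCloud(stopwords=STOPWORDS, background_color="white")
wc.generate(" ".join(df["text"]))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
```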
4. Text Analysis

4.1. Activity

4.1.1. Named Entity Recognition and Part of Speech Tagging

Text Analysis tools are used by companies to extract valuable insights from text data.
   The first approach was to explore Named Entity Recognition (NER) and Part of Speech (POS) Tagging using the "spaCy" library.
   A named entity is a proper noun that refers to a specific entity such as a location, person, date, time, organization, etc. Named Entity Recognition systems are used to identify and segment named entities in text, and you can find them in applications such as question answering, information retrieval, and machine translation. In Insurance, NER applications can be found in customer care to classify customer complaints.
   POS Tagging is the process of marking tokens with a part of speech such as noun, adverb, verb, punctuation, etc. The purpose of a POS tagger is to assign grammatical information to each token; in this way it is possible to generate a tree representation of the sentence. POS tagging works as a prerequisite for further NLP analysis such as chunking, syntax parsing, information extraction, machine translation, sentiment analysis, grammar analysis and word-sense disambiguation.
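A small spaCy sketch of both steps on a single tweet (the en_core_web_sm pipeline is an assumption; any installed English model works the same way):

```python
# NER and POS tagging on one tweet with spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Forest fire near La Ronge Sask. Canada")

# Named entities with their labels
for ent in doc.ents:
    print(ent.text, ent.label_)

# Part-of-speech tag for each token
for token in doc:
    print(token.text, token.pos_, token.tag_)
```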
4.1.2. Text cleaning and N-grams

At this point a pre-processing step, known as text cleaning, is required to generate features, extract patterns in the text and enable further analysis. The aim of this process is to prepare raw text for Natural Language Processing. Text cleaning is based on several steps, starting with text normalization, proceeding with the removal of Unicode characters and stop words, and ending with stemming/lemmatization.
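A minimal cleaning sketch along those lines, using NLTK stop words and lemmatization as one possible choice (the workshop's exact steps may differ):

```python
# Simple text-cleaning pipeline: normalize, drop URLs and non-ASCII
# characters, remove stop words, lemmatize.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

STOP = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean(text: str) -> str:
    text = text.lower()                              # normalization
    text = re.sub(r"http\S+", " ", text)             # drop URLs
    text = text.encode("ascii", "ignore").decode()   # drop Unicode characters
    text = re.sub(r"[^a-z\s]", " ", text)            # keep letters only
    tokens = [lemmatizer.lemmatize(w) for w in text.split() if w not in STOP]
    return " ".join(tokens)

print(clean("Forest fire near La Ronge Sask. Canada https://t.co/xyz"))
```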
   The analysis went ahead with N-grams: a contiguous sequence of n items generated from a given sample of text, where the items can be words. N-grams can be used as features extracted from a text corpus for Machine Learning models; moreover, they are useful in autocompletion of sentences or in speech recognition, helping to predict the next word that should occur in a sequence.
   In the workshop, N-grams were computed with a Bag-of-Words (BoW) approach, retrieving the top-occurring unigrams, bigrams and trigrams in the data set.
   Bag-of-Words [2] belongs to a first level of techniques used to transform raw text into numerical features by vectorization. BoW describes the occurrence of words taken from a vocabulary obtained from a corpus of text by encoding it in a binary vector. Each word is represented by a one-hot vector, a sparse vector in the size of the vocabulary, and the Bag-of-Words feature vector is the sum of all one-hot vectors of the words.
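One way to reproduce the top n-gram counts with a Bag-of-Words vectorizer is sketched below; scikit-learn's CountVectorizer is assumed here and is not necessarily the workshop's exact tool:

```python
# Top unigrams/bigrams/trigrams via a Bag-of-Words count matrix.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

tweets = pd.read_csv("train.csv")["text"]          # raw (or pre-cleaned) tweets

def top_ngrams(corpus, n=1, k=10):
    vec = CountVectorizer(ngram_range=(n, n), stop_words="english")
    counts = vec.fit_transform(corpus)
    totals = counts.sum(axis=0).A1                  # total count per n-gram
    vocab = vec.get_feature_names_out()
    order = totals.argsort()[::-1][:k]
    return [(vocab[i], int(totals[i])) for i in order]

print(top_ngrams(tweets, n=1))   # unigrams
print(top_ngrams(tweets, n=2))   # bigrams
print(top_ngrams(tweets, n=3))   # trigrams
```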
4.1.3. Word Embedding

The second level of techniques to transform raw text into numerical features is Word Embedding, whose goal is to generate vectors that encode semantic meaning: individual words are represented by vectors in a predefined vector space [3]. It allows words with similar meanings to have similar representations. In the workshop, Word Embedding was performed with Word2vec using the "gensim" library. Word2vec is a neural network that tries to maximize the probability of seeing a word in a context window, with similarity between words measured by the cosine similarity between two vectors. This task was explored with two architectures: the continuous bag-of-words (CBOW) model, which tries to predict a word from its context, and the continuous skip-gram model, which tries to predict the context from a word.
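A minimal gensim sketch of both architectures follows; the parameter values are illustrative assumptions rather than the workshop settings:

```python
# Word2vec with gensim: sg=0 gives CBOW, sg=1 gives skip-gram.
import pandas as pd
from gensim.models import Word2Vec

sentences = [t.lower().split() for t in pd.read_csv("train.csv")["text"]]

cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=1)

# Nearest words to "earthquake" by cosine similarity (cf. Figures 2 and 3)
print(cbow.wv.most_similar("earthquake", topn=10))
print(skipgram.wv.most_similar("earthquake", topn=10))
```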
Figure 2: "earthquake" similar words from CBOW model.

Figure 3: "earthquake" similar words from Skip-Gram model.

4.1.4. Topic Modelling

The activity of this chapter ended with Topic Modelling, a form of unsupervised learning that works by discovering hidden relationships in the text; more precisely, the purpose is to identify topics in a document.
   One purpose of the workshop was to discover new, well-performing tools for NLP, and for this activity the use of BERTopic, developed by Maarten Grootendorst, was explored.³
   There are four key components in BERTopic [4], and it can be considered as an ensemble of models.
   It starts by generating document embeddings to represent the meaning of the sentences using a pre-trained language model: BERT (Bidirectional Encoder Representations from Transformers).
   Given the huge dimension of the vectors generated, a general non-linear dimensionality reduction technique is applied: UMAP (Uniform Manifold Approximation and Projection).
   At this point the reduced embeddings are clustered with HDBSCAN (Hierarchical Density-Based Spatial Clustering), which finds clusters of variable densities, turning DBSCAN into a hierarchical clustering.
   To retrieve topics from the clustered documents, a modified version of TF-IDF (Term Frequency-Inverse Document Frequency) is applied: the class-based TF-IDF procedure, where the class represents the collection of documents merged into a single document per cluster.
   Input features for BERTopic were generated by TF-IDF, an extension of Bag of Words where terms are weighted so that words carrying useful information are highlighted [2].

³ https://github.com/MaartenGr/BERTopic
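A hedged sketch of a typical BERTopic call is shown below; it relies on the library's default embedding/UMAP/HDBSCAN pipeline and does not reproduce the workshop's exact configuration or its TF-IDF input:

```python
# BERTopic with an arbitrary reduction to ten topics, default pipeline
# (sentence embeddings -> UMAP -> HDBSCAN -> class-based TF-IDF).
import pandas as pd
from bertopic import BERTopic

docs = pd.read_csv("train.csv")["text"].tolist()

topic_model = BERTopic(nr_topics=10, calculate_probabilities=False)
topics, _ = topic_model.fit_transform(docs)

print(topic_model.get_topic_info().head())   # overview of the topics found
print(topic_model.get_topic(0))              # top words of one topic (cf. Figure 4)
```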
Figure 4: Top words from a topic by BERTopic model.

4.2. Results

From the N-grams analysis the top-occurring words are: "people", "video", "crash", "emergency", and "disaster". Looking at the bigrams, they are: "suicide bomber", "youtube video", "northern california", "california wildfire", "bombe detonate", and "natural disaster".
   Interesting results come from Word Embedding, where the vocabulary is similar between the two architectures, with these relevant words: "earthquake", "forest", "evacuation", "people", "wildfire", "california", "flood", "disaster", "emergency", "damage".
   What changes is the similarity between words: the CBOW model usually provides a lower similarity between words than the one provided by the Skip-Gram model.
   The Word Embedding exploration ended by reducing the dimension of the vectors with Principal Component Analysis, giving the opportunity to visualize the words in two dimensions.
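A minimal sketch of that two-dimensional projection (re-training a quick CBOW model so the snippet is self-contained; word list and parameters are illustrative):

```python
# Project selected word vectors to 2-D with PCA for visualization.
import pandas as pd
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
from sklearn.decomposition import PCA

sentences = [t.lower().split() for t in pd.read_csv("train.csv")["text"]]
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=0)

words = [w for w in ["earthquake", "forest", "evacuation", "people", "wildfire",
                     "california", "flood", "disaster", "emergency", "damage"]
         if w in cbow.wv]
coords = PCA(n_components=2).fit_transform([cbow.wv[w] for w in words])

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), w in zip(coords, words):
    plt.annotate(w, (x, y))
plt.show()
```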
   After that, it was the turn of Topic Modelling with BERTopic. The tool was used in an easy way, without fine-tuning parameters, using TF-IDF features as input and with the arbitrary choice of ten topics.
   Participants asked how far to trust the results, so the job was completed with a coherence score evaluation of BERTopic and a comparison with the Latent Dirichlet Allocation (LDA) model [5], the model commonly employed in Topic Modelling.
   The same relevant words from BERTopic appear also in LDA, but with different probabilities. The evaluation was done with the UMass coherence score, which calculates how often two words appear together in the corpus. The perfect coherence is 0, and it usually decreases as the number of topics rises. The issue with the LDA model was that the number of trained documents in each chunk had to be reduced to make it converge. From the results, BERTopic shows a number closer to 0 (roughly -14) than the LDA model (roughly -18), though the result could be improved by tuning the model.
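A sketch of such a comparison with gensim's u_mass coherence follows; the LDA settings and the example topic words are illustrative, not the workshop's exact setup:

```python
# Compare topic coherence (u_mass) of an LDA model and of topic words
# coming from another model (e.g. BERTopic's top words per topic).
import pandas as pd
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

texts = [t.lower().split() for t in pd.read_csv("train.csv")["text"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=10, passes=5)
lda_umass = CoherenceModel(model=lda, corpus=corpus, dictionary=dictionary,
                           coherence="u_mass").get_coherence()

# For BERTopic, pass its top words per topic explicitly (illustrative topics)
bertopic_words = [["earthquake", "damage", "people"],
                  ["wildfire", "california", "forest"]]
bt_umass = CoherenceModel(topics=bertopic_words, corpus=corpus,
                          dictionary=dictionary, coherence="u_mass").get_coherence()

print(lda_umass, bt_umass)   # closer to 0 means more coherent
```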
5. Text Classification

5.1. Activity

The approach followed in this chapter is the classification task: given a target variable, the aim is to predict whether a tweet can be considered a "disaster" or, otherwise, "not disaster". Twitter has become an important emergency communication channel, because people with smartphones are able to announce an emergency they are observing in real time. For this reason, more agencies and Insurance Companies are interested in monitoring Twitter. Moreover, it is not always clear whether a person's words are actually announcing a disaster, so this task can be linked to a Fraud Detection task.
   The approach followed was to use Transfer Learning [6], a Machine Learning method where a model developed for one task is reused as the starting point for a model on a second task. In the traditional Supervised Learning approach, Machine Learning models are trained on labelled data sets and are expected to perform well on unseen data of the same task and domain. The traditional approach falls down when there is not enough labelled data to perform training for the task or domain of interest.
   The idea behind Transfer Learning is to store the knowledge gained in solving the source task in the source domain and apply it to another similar problem of interest; it is the same concept as learning by experience, so the aim is to exploit pre-trained models that can be fine-tuned on smaller tasks or specific data sets.
   Bidirectional Encoder Representations from Transformers (BERT) is one of the most popular state-of-the-art NLP approaches for Transfer Learning, published by Google in 2018 [7].
   BERT is a bidirectional multi-layer Transformer model that exploits the attention mechanism [8]. A basic Transformer uses an encoder-decoder architecture: the encoder learns the representation from the input sentence and the decoder receives the representation and produces a prediction for the task.
   The attention mechanism was introduced to improve the performance of the encoder-decoder model for machine translation. The idea behind the attention mechanism was to permit the decoder to use the most relevant parts of the input sequence in a flexible manner, through a weighted combination of all of the encoded input vectors.
   With Transformers we reach the third level of vectorization techniques in NLP, Contextual Embedding [9]. Both traditional Word Embedding (word2vec, GloVe) and Contextual Embedding (ELMo, BERT) aim to learn a continuous (vector) representation for each word in the documents.
   The Word Embedding method builds a global vocabulary using the unique words in the documents, ignoring the meaning of words in different contexts. Hence, given a word, its embedding is always the same in whichever sentence it occurs, and for this reason pre-trained Word Embeddings are static.
   Contextual Embedding methods are used to learn sequence-level semantics by considering the sequence of all words in the documents. The embeddings are obtained by passing the entire sentence to the pre-trained model, so the embedding generated for each word depends on the other words in the given sentence. Transformer-based models work on the attention mechanism, and attention is a way to look at the relation between a word and its neighbours; for this reason, pre-trained Contextual Embeddings are dynamic.
   BERT's goal is to generate a language representation model. It uses a rich input embedding representation, derived from a sequence of tokens, which is converted into vectors; three embedding layers are then combined to obtain a fixed-length vector processed in the Neural Network. BERT is pre-trained using two Unsupervised Learning tasks: Masked LM (MLM) and Next Sentence Prediction (NSP).
   The usual workflow of BERT consists of two stages: pre-training and fine-tuning.
   The attention mechanism in the Transformer allows BERT to model many downstream tasks, such as sentiment analysis, question answering, paraphrase detection and more. In this workshop the "ktrain" library [10] was used, a low-code library developed by Arun S. Maiya⁴ that provides a lightweight wrapper for "Keras", making it easier to build, train, and deploy Deep Learning models.

⁴ https://github.com/amaiya/ktrain
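A minimal ktrain sketch of this fine-tuning workflow is shown below; the train/validation split, sequence length, batch size and learning rate are illustrative assumptions:

```python
# Fine-tune BERT on the disaster-tweets classification task with ktrain.
import pandas as pd
from sklearn.model_selection import train_test_split
import ktrain
from ktrain import text

df = pd.read_csv("train.csv")
x_tr, x_va, y_tr, y_va = train_test_split(
    df["text"].values, df["target"].values, test_size=0.2, random_state=42)

(x_train, y_train), (x_val, y_val), preproc = text.texts_from_array(
    x_train=x_tr, y_train=y_tr, x_test=x_va, y_test=y_va,
    class_names=["not disaster", "disaster"],
    preprocess_mode="bert", maxlen=64)

model = text.text_classifier("bert", train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train),
                             val_data=(x_val, y_val), batch_size=16)
learner.fit_onecycle(2e-5, 1)        # one epoch at a BERT-style learning rate
```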
5.2. Results

Thanks to the "ktrain" low-code library, the implementation of BERT was easy: it was enough to split the data set into a train and test set, then send the train set as input into the "ktrain" pre-processing and Deep Learning model.
   The Transformer-based model shows, as expected, a good performance both on the train set and the test set, with an accuracy greater than 82%. Given the imbalanced data set, the performance was also evaluated with the F1 score, with quite the same results.

Figure 5: BERT model results.

   Participants asked for an evaluation against other models, and the job was completed with the most common classification Machine Learning model, Logistic Regression, again with the "ktrain" library.
   Implementation of this model is easy, but it performs poorly: roughly 63% on the train set and roughly 58% on the validation set.

Figure 6: Logistic Regression model results.

   Looking at the confusion matrices, the BERT model shows a large number of elements on the diagonal and a small number of elements off the diagonal, so a better matrix than the Logistic Regression.

Figure 7: Confusion matrix on test set with BERT.

Figure 8: Confusion matrix on test set with Logistic Regression.

   The last step was the inference of the model, testing the prediction of test tweets; also in this situation BERT outperformed the Logistic Regression model, with all predictions correct.
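A hedged sketch of this evaluation and inference step, continuing from the training sketch above; the 'logreg' model name for ktrain's Logistic Regression baseline and the example tweets are assumptions:

```python
# Evaluate the fine-tuned BERT learner, train a logistic regression
# baseline, then run inference on new tweets.
# (Continues from the training sketch: x_tr, y_tr, x_va, y_va, learner, preproc.)
learner.validate(class_names=["not disaster", "disaster"])   # confusion matrix + report

# Logistic Regression baseline with ktrain (standard preprocessing)
(x_train_s, y_train_s), (x_val_s, y_val_s), preproc_s = text.texts_from_array(
    x_train=x_tr, y_train=y_tr, x_test=x_va, y_test=y_va,
    class_names=["not disaster", "disaster"],
    preprocess_mode="standard", maxlen=64)
logreg = text.text_classifier("logreg", train_data=(x_train_s, y_train_s),
                              preproc=preproc_s)
learner_lr = ktrain.get_learner(logreg, train_data=(x_train_s, y_train_s),
                                val_data=(x_val_s, y_val_s), batch_size=32)
learner_lr.fit_onecycle(1e-3, 3)
learner_lr.validate(class_names=["not disaster", "disaster"])

# Inference on unseen tweets with the BERT predictor
predictor = ktrain.get_predictor(learner.model, preproc)
print(predictor.predict(["Forest fire near La Ronge Sask. Canada",
                         "I love fruits"]))
```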


6. Conclusions

This workshop gave the opportunity to have an overview of Natural Language Processing applications in the Insurance world and, with the disaster Tweets data set, the opportunity to discover NLP applications in Fraud Detection.
   In this work the development of Natural Language Processing has been retraced, from Tokenization to Bag of Words, from Word Embedding to Contextual Embedding.
   Transfer Learning has been explored, looking at its potential, which outperforms benchmark models both in Topic Modelling and in Classification prediction.


References

 [1] A. Ly, B. Uthayasooriyar, T. Wang, A survey on natural language processing (nlp) and applications in insurance, arXiv preprint arXiv:2010.00462 (2020).
 [2] A. Ferrario, M. Nägelin, The art of natural language processing: classical, modern and contemporary approaches to text document classification (March 1, 2020) (2020).
 [3] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).
 [4] M. Grootendorst, Bertopic: Neural topic modeling with a class-based tf-idf procedure, arXiv preprint arXiv:2203.05794 (2022).
 [5] H. Jelodar, Y. Wang, C. Yuan, X. Feng, X. Jiang, Y. Li, L. Zhao, Latent dirichlet allocation (lda) and topic modeling: models, applications, a survey, Multimedia Tools and Applications 78 (2019) 15169–15211.
 [6] A. Malte, P. Ratadiya, Evolution of transfer learning in natural language processing, arXiv preprint arXiv:1910.07370 (2019).
 [7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
 [8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in neural information processing systems 30 (2017).
 [9] Q. Liu, M. J. Kusner, P. Blunsom, A survey on contextual embeddings, arXiv preprint arXiv:2003.07278 (2020).
[10] A. S. Maiya, ktrain: A low-code library for augmented machine learning, J. Mach. Learn. Res. 23 (2020) 1–6.