=Paper=
{{Paper
|id=Vol-3361/ws1
|storemode=property
|title=NLP and Insurance – Workshop Results at SwissText 2022
|pdfUrl=https://ceur-ws.org/Vol-3361/workshop1.pdf
|volume=Vol-3361
|authors=Claudio Giorgio Giancaterino
|dblpUrl=https://dblp.org/rec/conf/swisstext/Giancaterino22
}}
==NLP and Insurance – Workshop Results at SwissText 2022==
Claudio Giorgio Giancaterino¹ (claudio.giancaterino@intesasanpaolovita.it)

¹ Intesa SanPaolo Vita, Milano, Italy

SwissText 2022: Swiss Text Analytics Conference, June 08–10, 2022, Lugano, Switzerland

===Abstract===

Natural Language Processing (NLP) will drive the next wave of Artificial Intelligence in the Insurance industry over the coming years. There are several opportunities to employ NLP in insurance activities, from claims processing to fraud detection and chatbots. In the marketing field, NLP can be used for sentiment analysis of the feedback that people publish on social networks, to better understand insured needs or to extract risk insights. Textual analysis and classification of claims can streamline claims processing, reducing handling time and operational errors, and can support fraud detection. The underwriting process can be improved by better textual assessment. The goal of the workshop was to show NLP techniques for fraud detection on the disaster Tweets data set from a Kaggle classification competition.

===1. Introduction===

The workshop started with an introduction to Natural Language Processing (NLP), explaining use cases in the Insurance world.

NLP can find a place in broad Insurance fields, from Marketing to Underwriting, from Claims processing to Risk assessment, and it can also be applied in the traditional actuarial Reserving area.

The workshop was organized to show an application of NLP techniques to Insurance Fraud Detection using the disaster Tweets data set retrieved from a Kaggle classification competition (https://www.kaggle.com/competitions/nlp-getting-started).

The first step was an Exploratory Data Analysis based on a word cloud, tweet statistics and the languages used in the data set.

After the pre-processing activity, the work went deeper into the text, exploring Named Entity Recognition, Part of Speech Tagging, N-grams analysis, Topic Modelling and Word Embedding.

The workshop ended with the classification task of predicting which tweets describe real disasters and which do not, using Transfer Learning applied in a simple way with the help of the “ktrain” library, and exploring model inference on the test set.

The work focused on the use of BERT in both supervised and unsupervised learning tasks. The workshop code is available on GitHub (https://github.com/claudio1975/SWISSTEXT_2022).

===2. Natural Language Processing in the Insurance world===

Natural Language Processing is a branch of Artificial Intelligence that aims to design models allowing computers to understand natural language in order to perform tasks. Some NLP applications are machine translation, question answering and text summarization. Some NLP opportunities in the Insurance industry follow [1]:

- Marketing: NLP can be used for sentiment analysis of feedback, to better understand insured needs, to monitor what people think about a particular product, and to extract information about expected trends in order to improve marketing strategy.
- Underwriting: using Optical Character Recognition (OCR) and NLP it is possible to extract information from medical reports and help underwriters quote the insurance coverage more accurately. NLP can categorize patients’ diseases and retrieve correlations between symptoms and the likely cost of treatment for the Insurance Company.
- Reserving: the analysis of claim reports at the first notification of loss can improve the reserving process for severe claims.
- Claims processing: textual analysis can simplify the workflow, reducing handling time and operational mistakes.
- Risk management: text classification is useful in risk assessment, helping with fraud detection.

===3. Exploratory Data Analysis===

====3.1. Activity====

Before starting with the application of any Machine Learning model, it is better to understand the data involved in the project. Exploratory Data Analysis is the step between data cleaning and data modelling, with the goal of understanding patterns, detecting mistakes, checking assumptions and examining relationships between the variables of a data set with the help of graphical charts and summary statistics.

Figure 1: word cloud of the tweets.

====3.2. Results====

The data set has 7,613 rows and 5 columns: “id”, “text”, “target”, plus two further columns, “keyword” and “location”.

The text of the tweets is written in English for almost the whole data frame.

The data set contains 7,613 tweets, with an average of 15 words per tweet, an average word length of 6 characters, and roughly 5 stop words per tweet.

Tweets largely come from the USA, followed by the UK and Canada. From the word cloud, “earthquake”, “deeds”, “reason”, “forest” and “fire” appear as the most common terms.
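The statistics above can be reproduced with a few lines of pandas together with the “wordcloud” package. The following is a minimal sketch, not the workshop's exact code: it assumes the Kaggle train.csv file is available locally and approximates word length and stop-word counts directly from whitespace tokenization.

<pre>
# Minimal EDA sketch (assumption: "train.csv" is the file from the Kaggle competition,
# with the columns "id", "keyword", "location", "text", "target" described above).
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

df = pd.read_csv("train.csv")
print(df.shape)                                   # expected: (7613, 5)

tokens = df["text"].str.split()
print("avg words per tweet:", tokens.str.len().mean())
print("avg word length:",
      df["text"].str.replace(r"\s+", "", regex=True).str.len().div(tokens.str.len()).mean())
print("avg stop words per tweet:",
      tokens.apply(lambda ws: sum(w.lower() in STOPWORDS for w in ws)).mean())

# Word cloud over the whole corpus (Figure 1 in the paper shows a similar picture)
wc = WordCloud(stopwords=STOPWORDS, background_color="white", width=800, height=400)
wc.generate(" ".join(df["text"]))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()
</pre>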
===4. Text Analysis===

====4.1. Activity====

=====4.1.1. Named Entity Recognition and Part of Speech Tagging=====

Text Analysis tools are used by companies to extract valuable insights from text data.

The first approach was to explore Named Entity Recognition (NER) and Part of Speech (POS) Tagging using the “spaCy” library.

A named entity is a proper noun that refers to a specific entity such as a location, person, date, time or organization. Named Entity Recognition systems are used to identify and segment named entities in text, and they can be found in applications such as question answering, information retrieval and machine translation. In Insurance, NER applications can be found in customer care, for instance to classify customer complaints.

POS Tagging is the process of marking tokens with their part of speech, such as noun, adverb, verb or punctuation. The purpose of a POS tagger is to assign grammatical information to each token, which makes it possible to generate a tree representation of the sentence. POS tagging works as a prerequisite for further NLP analysis such as chunking, syntax parsing, information extraction, machine translation, sentiment analysis, grammar analysis and word-sense disambiguation.

=====4.1.2. Text cleaning and N-grams=====

At this point a pre-processing step, known as text cleaning, is required to generate features, extract patterns from the text and enable further analysis. The aim of this process is to prepare raw text for Natural Language Processing. Text cleaning consists of several steps, starting with text normalization, proceeding with the removal of Unicode characters and stop words, and ending with stemming/lemmatization.

The analysis went ahead with N-grams, contiguous sequences of n items generated from a given sample of text, where the items can be words. N-grams can be used as features extracted from a text corpus for Machine Learning models; moreover, they are useful in sentence autocompletion or in speech recognition, helping to predict the next word in a sequence.

In the workshop, N-grams were computed with a Bag-of-Words (BoW) approach, retrieving the most frequent unigrams, bigrams and trigrams in the data set.

Bag-of-Words [2] belongs to a first level of techniques used to transform raw text into numerical features by vectorization. BoW describes the occurrence of words taken from a vocabulary obtained from a text corpus by encoding them in a binary vector. Each word is represented by a one-hot vector, a sparse vector the size of the vocabulary; the Bag-of-Words feature vector is the sum of all the one-hot vectors of the words.
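The NER, POS tagging, cleaning and N-gram steps can be sketched as follows. This is an illustrative outline rather than the workshop notebook: it assumes spaCy's small English model “en_core_web_sm” is installed, uses lemmatization with stop-word removal as the cleaning step, and counts bigrams with scikit-learn's CountVectorizer. The DataFrame `df` is the one loaded in the EDA sketch above.

<pre>
# NER / POS tagging with spaCy, then Bag-of-Words bigram counts with scikit-learn.
import spacy
from sklearn.feature_extraction.text import CountVectorizer

nlp = spacy.load("en_core_web_sm")

doc = nlp("Forest fire near La Ronge Sask. Canada")          # one tweet as an example
print([(ent.text, ent.label_) for ent in doc.ents])          # named entities
print([(tok.text, tok.pos_) for tok in doc])                 # part-of-speech tags

def clean(text):
    # normalization: lowercase lemmas, alphabetic tokens only, stop words removed
    return " ".join(tok.lemma_.lower() for tok in nlp(text)
                    if tok.is_alpha and not tok.is_stop)

corpus = [clean(t) for t in df["text"]]

# Bag-of-Words over bigrams: most frequent bigrams in the cleaned corpus
vec = CountVectorizer(ngram_range=(2, 2))
counts = vec.fit_transform(corpus).sum(axis=0).A1
top_bigrams = sorted(zip(vec.get_feature_names_out(), counts), key=lambda x: -x[1])[:10]
print(top_bigrams)
</pre>

Changing ngram_range to (1, 1) or (3, 3) yields the unigram and trigram rankings discussed later in the results.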
=====4.1.3. Word Embedding=====

The second level of transforming raw text into numerical features is Word Embedding, whose goal is to generate vectors encoding semantic meaning: individual words are represented by vectors in a predefined vector space [3]. This allows words with similar meaning to have a similar representation. In the workshop, Word Embedding was performed with Word2vec using the “gensim” library. Word2vec is a neural network that tries to maximize the probability of seeing a word within a context window, with similarity measured by the cosine similarity between two vectors. The task was explored with two architectures: the continuous bag-of-words (CBOW) model, which tries to predict a word from its context, and the continuous skip-gram model, which tries to predict the context from a word.

Figure 2: words similar to “earthquake” from the CBOW model.

Figure 3: words similar to “earthquake” from the Skip-Gram model.
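Both architectures are available in gensim through a single flag. The sketch below is a minimal illustration, assuming `corpus` is the list of cleaned tweets from the previous sketch; the vector size, window and minimum count are illustrative values, not the workshop's exact settings.

<pre>
# Word2vec with gensim: CBOW (sg=0) versus skip-gram (sg=1)
from gensim.models import Word2Vec

sentences = [text.split() for text in corpus]

cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=1)

# cosine-similarity neighbours of "earthquake" under both architectures
# (compare with Figures 2 and 3)
print(cbow.wv.most_similar("earthquake", topn=10))
print(skipgram.wv.most_similar("earthquake", topn=10))
</pre>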
=====4.1.4. Topic Modelling=====

The activity of this chapter ended with Topic Modelling, a form of unsupervised learning that discovers hidden relationships in the text; more precisely, the purpose is to identify the topics present in a set of documents.

One purpose of the workshop was to discover newly available and well performing tools for NLP, and for this activity BERTopic, developed by Maarten Grootendorst (https://github.com/MaartenGr/BERTopic), was explored.

There are four key components in BERTopic [4], and it can be considered an ensemble of models:
- It starts by generating document embeddings that represent the meaning of the sentences, using a pre-trained language model: BERT (Bidirectional Encoder Representations from Transformers).
- Given the large dimension of the generated vectors, a general non-linear dimensionality reduction technique is applied: UMAP (Uniform Manifold Approximation and Projection).
- The reduced embeddings are clustered with HDBSCAN (Hierarchical Density-Based Spatial Clustering), which finds clusters of variable density by turning DBSCAN into a hierarchical clustering.
- To retrieve topics from the clustered documents, a modified version of TF-IDF (Term Frequency-Inverse Document Frequency) is applied: the class-based TF-IDF procedure, where a class is the collection of documents merged into a single document per cluster.

Input features for BERTopic were generated by TF-IDF, which is an extension of Bag of Words where terms are weighted, so that words carrying useful information are highlighted [2].

Figure 4: top words from a topic found by the BERTopic model.
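In code, the whole pipeline is wrapped in a single class. The sketch below is a minimal usage example, not the workshop setup: it runs BERTopic's default sentence-embedding pipeline on the raw tweets (the workshop instead passed TF-IDF features as input), and the choice of ten topics mirrors the arbitrary choice described in the results.

<pre>
# BERTopic sketch (assumption: `df` is the DataFrame from the EDA sketch)
from bertopic import BERTopic

docs = df["text"].tolist()

topic_model = BERTopic(nr_topics=10)              # reduce to roughly ten topics
topics, probs = topic_model.fit_transform(docs)   # embeddings -> UMAP -> HDBSCAN -> c-TF-IDF

print(topic_model.get_topic_info().head(10))      # topic sizes and labels
print(topic_model.get_topic(0))                   # top words of the largest topic (cf. Figure 4)
</pre>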
====4.2. Results====

From the N-grams analysis, the most frequent unigrams are “people”, “video”, “crash”, “emergency” and “disaster”. Looking at the bigrams, they are “suicide bomber”, “youtube video”, “northern california”, “california wildfire”, “bombe detonate” and “natural disaster”.

Interesting results come from Word Embedding, where the vocabulary is similar between the two architectures, with these relevant words: “earthquake”, “forest”, “evacuation”, “people”, “wildfire”, “california”, “flood”, “disaster”, “emergency”, “damage”. What changes is the similarity between words: the CBOW model usually gives lower similarities between words than the Skip-Gram model.

The Word Embedding exploration ended by reducing the dimension of the vectors with Principal Component Analysis, which makes it possible to visualize the words in two dimensions.

After that it was the turn of Topic Modelling with BERTopic. The tool was used in a simple way, without fine-tuning parameters, using TF-IDF features as input and with the arbitrary choice of ten topics.

Participants asked how much the results can be trusted, so the job was completed with a coherence score evaluation of BERTopic and a comparison with the Latent Dirichlet Allocation (LDA) model [5], the model commonly employed in Topic Modelling.

The same relevant words found by BERTopic also appear in LDA, but with different probabilities. The evaluation was done with the UMass coherence score, which measures how often two words appear together in the corpus. Perfect coherence is 0, and the score usually decreases as the number of topics grows. An issue with the LDA model was that the number of documents trained in each chunk had to be reduced to make it converge. From the results, BERTopic shows a score closer to 0 (roughly -14) than the LDA model (roughly -18), though the result can be improved by tuning the model.

===5. Text Classification===

====5.1. Activity====

The approach followed in this chapter is the classification task: given a target variable, the aim is to predict whether a tweet can be considered a “disaster” or “not disaster”. Twitter has become an important emergency communication channel, because people with smartphones are able to announce an emergency they are observing in real time. For this reason, more agencies and Insurance Companies are interested in monitoring Twitter. Moreover, it is not always clear whether a person’s words are actually announcing a disaster, so this task can be linked to a Fraud Detection task.

The approach followed was Transfer Learning [6], a Machine Learning method where a model developed for one task is reused as the starting point for a model on a second task. In the traditional Supervised Learning approach, Machine Learning models are trained on labelled data sets and expected to perform well on unseen data from the same task and domain. The traditional approach falls down when there is not enough labelled data to train for the task or domain of interest.

The idea behind Transfer Learning is to store the knowledge gained in solving the source task in the source domain and apply it to another, similar problem of interest. It is the same concept as learning by experience, so the aim is to exploit pre-trained models that can be fine-tuned on smaller, task-specific data sets.

Bidirectional Encoder Representations from Transformers (BERT) is one of the most popular state-of-the-art NLP approaches for Transfer Learning, published by Google in 2018 [7].

BERT is a bidirectional multi-layer Transformer model that exploits the attention mechanism [8]. A basic Transformer uses an encoder-decoder architecture: the encoder learns a representation of the input sentence, and the decoder receives the representation and produces a prediction for the task.

The attention mechanism was introduced to improve the performance of the encoder-decoder model for machine translation. The idea behind it is to allow the decoder to use the most relevant parts of the input sequence in a flexible manner, through a weighted combination of all of the encoded input vectors.

With Transformers we reach the third level of vectorization techniques in NLP, Contextual Embedding [9]. Both traditional Word Embedding (word2vec, GloVe) and Contextual Embedding (ELMo, BERT) aim to learn a continuous vector representation for each word in the documents.

The Word Embedding method builds a global vocabulary from the unique words in the documents, ignoring the meaning of words in different contexts. Hence, given a word, its embedding is always the same in whichever sentence it occurs, and for this reason pre-trained Word Embeddings are static.

Contextual Embedding methods learn sequence-level semantics by considering the sequence of all words in the documents. The embeddings are obtained by passing the entire sentence to the pre-trained model, so the embedding generated for each word depends on the other words in the sentence. Transformer-based models work on the attention mechanism, and attention is a way to look at the relation between a word and its neighbours; for this reason, pre-trained Contextual Embeddings are dynamic.

BERT’s goal is to generate a language representation model. It uses a rich input embedding representation derived from a sequence of tokens, which is converted into vectors; three embedding layers are then combined to obtain a fixed-length vector that is processed by the Neural Network. BERT is pre-trained with two Unsupervised Learning tasks: Masked LM (MLM) and Next Sentence Prediction (NSP). The usual workflow of BERT consists of two stages: pre-training and fine-tuning.

The attention mechanism in the Transformer allows BERT to model many downstream tasks, such as sentiment analysis, question answering, paraphrase detection and more. In this workshop the “ktrain” library was used [10], a low-code library developed by Arun S. Maiya (https://github.com/amaiya/ktrain) that provides a lightweight wrapper for Keras, making it easier to build, train and deploy Deep Learning models.
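With ktrain, fine-tuning BERT on the tweets reduces to a few calls. The following is a minimal sketch under stated assumptions, not the workshop's exact notebook: `df` is the labelled DataFrame from the EDA sketch, and the 80/20 split, sequence length, batch size, learning rate and single epoch are illustrative choices.

<pre>
# Fine-tuning BERT for the disaster-tweet classification task with ktrain
import ktrain
from ktrain import text
from sklearn.model_selection import train_test_split

x_tr, x_val, y_tr, y_val = train_test_split(
    df["text"].values, df["target"].values, test_size=0.2, random_state=42)

trn, val, preproc = text.texts_from_array(
    x_train=x_tr, y_train=y_tr, x_test=x_val, y_test=y_val,
    class_names=["not disaster", "disaster"],
    preprocess_mode="bert", maxlen=64)

model = text.text_classifier("bert", train_data=trn, preproc=preproc)
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=32)
learner.fit_onecycle(2e-5, 1)                          # one epoch with the 1cycle policy
learner.validate(class_names=preproc.get_classes())    # classification report and confusion matrix

# inference on unseen tweets
predictor = ktrain.get_predictor(learner.model, preproc)
print(predictor.predict(["Forest fire near La Ronge Sask. Canada"]))
</pre>

The Logistic Regression baseline mentioned in the results can be trained through the same ktrain interface by passing “logreg” as the model name together with preprocess_mode="standard".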
====5.2. Results====

Thanks to the “ktrain” low-code library, implementing BERT was easy: it was enough to split the data set into a train and a test set, then feed the train set to the “ktrain” pre-processing and Deep Learning model.

The Transformer-based model shows, as expected, good performance on both the train and the test set, with an accuracy greater than 82%. Given the imbalanced data set, performance was also evaluated with the F1 score, with roughly the same results.

Participants asked for an evaluation against other models, so the job was completed with a common classification Machine Learning model, Logistic Regression, again with the “ktrain” library. Implementation of this model is easy, but its performance is poor: roughly 63% on the train set and roughly 58% on the validation set.

Looking at the confusion matrices, the BERT model shows a large number of elements on the diagonal and a small number off the diagonal, so a better matrix than the one of Logistic Regression.

The last step was model inference, testing the predictions on test tweets; in this situation too, BERT outperformed the Logistic Regression model, getting all predictions right.

Figure 5: BERT model results.

Figure 6: Logistic Regression model results.

Figure 7: Confusion matrix on the test set with BERT.

Figure 8: Confusion matrix on the test set with Logistic Regression.

===6. Conclusions===

This workshop gave an overview of Natural Language Processing applications in the Insurance world, and with the disaster Tweets data set it offered the opportunity to discover NLP applications in Fraud Detection.

In this work the development of Natural Language Processing has been retraced, from Tokenization to Bag of Words, from Word Embedding to Contextual Embedding.

Transfer Learning has been explored, showing its potential to outperform benchmark models on both Topic Modelling and Classification.

===References===

[1] A. Ly, B. Uthayasooriyar, T. Wang, A survey on natural language processing (NLP) and applications in insurance, arXiv preprint arXiv:2010.00462 (2020).

[2] A. Ferrario, M. Nägelin, The art of natural language processing: classical, modern and contemporary approaches to text document classification (March 1, 2020).

[3] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, arXiv preprint arXiv:1301.3781 (2013).

[4] M. Grootendorst, BERTopic: Neural topic modeling with a class-based TF-IDF procedure, arXiv preprint arXiv:2203.05794 (2022).

[5] H. Jelodar, Y. Wang, C. Yuan, X. Feng, X. Jiang, Y. Li, L. Zhao, Latent Dirichlet Allocation (LDA) and topic modeling: models, applications, a survey, Multimedia Tools and Applications 78 (2019) 15169–15211.

[6] A. Malte, P. Ratadiya, Evolution of transfer learning in natural language processing, arXiv preprint arXiv:1910.07370 (2019).

[7] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).

[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).

[9] Q. Liu, M. J. Kusner, P. Blunsom, A survey on contextual embeddings, arXiv preprint arXiv:2003.07278 (2020).

[10] A. S. Maiya, ktrain: A low-code library for augmented machine learning, J. Mach. Learn. Res. 23 (2020) 1–6.