Predicting Themes within Complex Unstructured Texts: A Case Study on Safeguarding Reports

Aleksandra Edwards1,2, David Rogers1,2, Jose Camacho-Collados1, Hélène de Ribaupierre1, and Alun Preece1,2
1 School of Computer Science and Informatics, Cardiff University, Cardiff, UK
2 Crime and Security Research Institute, Cardiff University, Cardiff, UK

Abstract. Text classification typically requires large amounts of labelled training data; however, the acquisition of high volumes of labelled datasets is often expensive or unfeasible, especially for highly-specialised domains for which both training data (documents) and access to subject-matter expertise (for labelling) are limited. Language models pre-trained on large text corpora provide state-of-the-art performance on most standard natural language processing (NLP) benchmarks, including text classification. However, their advantage over more traditional linear classifiers and domain-based approaches has not been investigated fully. In this paper, we investigate combinations of state-of-the-art deep learning and classification methods and provide insight into which combination of methods fits the needs of small, domain-specific, and terminologically-rich corpora. We focus on a real-world scenario related to a collection of safeguarding reports comprising learning experiences and reflections on tackling serious incidents involving children and vulnerable adults. Our aim is to automatically identify the main themes in a safeguarding report using three main types of classification. Our results show that for a very small amount of data a simple linear classifier outperforms state-of-the-art language models. Further, we show that the performance of classifiers is affected more by the size of the training data than by the amount of context given.

Keywords: text classification · small domain corpus · language models

1 Introduction

The performance of natural language processing (NLP) classification tasks is heavily reliant on the amount of training data available [15]. However, the acquisition of high volumes of labelled data can be an expensive, time- and resource-consuming process [1], especially when the text to be labelled is in a highly-specialised domain where only scarce domain experts can perform the manual labelling task [15]. Current pre-trained neural models such as BERT (Bidirectional Encoder Representations from Transformers) [3] have proved to provide state-of-the-art results on most standard NLP benchmarks [16], including text classification. However, the applicability of these language models to very small collections of highly specialised documents has not been fully explored or compared to more traditional methods. A limitation of pre-trained models is that there is still a need for task-specific datasets for these models to perform well in a specific domain [11]. Therefore, adapting these large but generic models to specific domains and tasks has become the new standard approach for many NLP problems [13]. For instance, the authors of [7] provide more extensive research on whether it is still helpful to tailor a pretrained model to the domain of a target task. However, that research is not focused on text classification and does not compare neural models to other types of machine learning models.
Further, recent research on few-shot classification [4] analysed the role of labelled and unlabelled data for classifiers by comparing a linear model (fastText coupled with domain-specific embeddings) against a fine-tuned BERT model using both domain-specific and generic corpora. However, the authors performed their analysis on generic datasets, assuming the presence of large amounts of unlabelled data which can be used for fine-tuning models on domain data. We build on this research [4] by comparing the performance of three types of classifiers in a real-world scenario related to the safeguarding domain, where there is a very limited amount of both labelled and unlabelled data. Previous work on performing NLP analysis on the safeguarding corpus emphasised the challenges of extracting knowledge from the documents using off-the-shelf text analysis tools due to the highly specialised lexical characteristics of the reports [5]. Further, there are no existing knowledge resources which fit the needs of the domain [5], which makes the use of semantic enrichment approaches for the domain difficult. We look at whether domain-trained embeddings are effective even when trained on a very limited corpus.

Our main contribution is a thorough analysis of which combination of embedding and language models and classification approaches fits the needs of a small, domain-specific and terminology-rich corpus. Further, we also look at how deep learning approaches are affected by training dataset size versus the amount of context given.

2 Case study: Safeguarding reports

The purpose of a safeguarding report is to identify and describe related events that precede a serious safeguarding incident, for example one involving a child or vulnerable adult, and to reflect on agencies' roles and the application of best practices. Each report contains key information about learning experiences and reflections on tackling serious incidents. The reports carry great potential to improve multi-agency work and help develop better safeguarding practices and strategies [5]. Analysing and understanding safeguarding reports is crucial for health and social care agencies; in particular, a key task is to identify common themes across a set of reports. Traditionally, this is done in social science by a process of manually annotating the reports with themes identified by subject-matter experts using a qualitative analysis tool such as NVivo. However, each report is lengthy and complex, so manual annotation is a time-consuming and potentially bias-prone process [5]. Furthermore, in our particular case, the safeguarding collection is expected to grow significantly in the near future, with the additional resourcing of 500 historical reports, making the manual annotation of these additional documents unfeasible. Therefore, we aim to automate the process of document annotation.

The thematic framework [12] used for performing document classification resulted from collaborative work between multiple subject-matter experts.
In this context, a theme refers to the main topic of discussion related to safeguarding incidents, specifically relevant to domestic homicide and mental health homicide.

3 The Dataset

At the time of development the corpus consisted of 27 full safeguarding reports. The annotations were carried out by a social science team following standard methodology in the field. They used a qualitative analysis tool (NVivo) to label parts of documents with thematic annotations from 5 top-level themes according to the thematic framework described in Section 2. The annotation was performed by labelling different-length passages of the reports with themes from the thematic framework. The majority of report contents were labelled, except appendices. The total number of sentences in the corpus was 3,421 (see Table 1), with an unbalanced distribution between the different themes, where sentences can be associated with multiple themes.

We evaluated model performance using training, development and test sets; both the training and development sets were randomly sampled from the 27 reports. Both the development and test sets were annotated at the passage level. The test set was extracted from safeguarding reports different from the original 27 documents and contained 100 randomly selected passages, each consisting of 3 sentences. Due to the limited number of reports available, we built and evaluated classifier models at the sentence level (i.e., the results presented in Section 4.2). Thus, each sentence was assigned the label of the passage to which it belongs. Further, we ensured that the training and development sets do not intersect by automatically selecting random non-overlapping partitions for the two subsets. We also performed analysis at the passage level, presented in Section 5.

Table 1. Data distribution of sentences per theme (Train / Dev / Test), with a description and a non-verbatim example for each theme

Contact with Agencies (1,281 / 335 / 219): Agency interactions with the people involved prior to the incident. Example: "The person injured his ankle and was seen at the GP surgery."
Indicative Behaviour (1,078 / 276 / 83): Types of behaviour that might indicate a risk to self and others, such as signs of aggression or previous offences. Example: "The perpetrator had a long history of alcohol misuse and criminality."
Indicative Circumstances (427 / 104 / 99): Personal circumstances prior to the incident that might indicate a risk to self and others, such as relationship problems or debt. Example: "Their relationship was based on the economic realities of subletting a flat informally."
Mental Health Issues (316 / 76 / 51): Indications of any mental health problems that anyone involved in the incident experienced. Example: "They were diagnosed with Attention Deficit Hyperactivity Disorder."
Reflections (780 / 203 / 78): Key lessons learned in reviewing the case. Example: "This highlights a challenge for agencies on what information to share when victims and perpetrators reside in different administrative areas."
Total: 2,736 / 685 / 300
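The sentence-level label propagation and the random non-overlapping train/development split described in this section can be summarised with a short sketch. This is a minimal illustration under an assumed data structure (a list of passages, each carrying its sentences and themes), not the project's actual tooling.

```python
# Minimal sketch of the data preparation described above; the passage
# structure and the 80/20 split ratio are illustrative assumptions.
import random

passages = [
    {"sentences": ["Sentence one.", "Sentence two."],
     "themes": ["Contact with Agencies"]},
    {"sentences": ["Sentence three."],
     "themes": ["Reflections", "Mental Health Issues"]},
]

# Each sentence inherits the (possibly multiple) themes of its passage.
examples = [(sent, p["themes"]) for p in passages for sent in p["sentences"]]

# Random non-overlapping train/development partition of the sentences.
random.seed(0)
random.shuffle(examples)
split = int(0.8 * len(examples))
train_set, dev_set = examples[:split], examples[split:]
assert not set(s for s, _ in train_set) & set(s for s, _ in dev_set)
```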
4 Classification Experiments

4.1 Methods

In our experiments, we perform multi-label classification to identify the main themes within documents. We compare three classifiers: a simple count-based classifier, a linear classifier based on word embeddings, and a state-of-the-art language model. We perform experiments with pre-trained and corpus-trained embeddings as well as different methods for building feature vectors. We use an n-gram feature representation and a Naive Bayes classifier as our baseline. Our method consists of four overall steps, described below.

Fig. 1. Method overview

Step 1: Pre-processing. We extracted terms from the corpus using FlexiTerm [14], an open-source software tool for automatic recognition of multi-word terms. We used the sentences pre-processed with the terminology extraction step for building sentence embeddings and for creating simple n-gram feature vectors.

Step 2: Feature Extraction (FE). We used fastText word embeddings [2] pre-trained with subword information on Common Crawl. We also used fastText for learning domain-specific embeddings because it captures the meaning of rare words better than other approaches. We used the skip-gram method for building word embeddings with 300 dimensions.

Step 3: Feature Integration (FI). We use several ways of combining the word embeddings into reduced sentence representations. In the first approach, we average the embeddings of each word in a sentence along each dimension. In the second approach, we assign TF-IDF weights to the words in a sentence and calculate the weighted average of the word embeddings along each dimension (where the contribution of a word is proportional to its TF-IDF weight). Finally, we use Bidirectional Encoder Representations from Transformers (BERT) [3]. A limitation of the word embedding model described above is that it produces a single vector for a word regardless of the context in which it appears. In contrast to the other embedding methods, BERT is designed to pre-train deep bidirectional representations from unlabelled text by jointly conditioning on both left and right context in all layers. This characteristic allows the model to learn the context of a word based on all of its surroundings, and thus it generates more contextually-aware word representations. There are two steps in the BERT framework: pre-training and fine-tuning. In this step of the methodology, we use the base pre-trained BERT model, trained on the BooksCorpus and English Wikipedia, for extracting contextualised sentence embeddings. The fine-tuning step consists of further training on the downstream tasks.

Step 4: Classifiers. We perform classification at the sentence level, where each sentence is assigned the theme of the passage it belongs to. Here, we take 'ground truth' to be the annotations made by the social scientist expert annotators who were involved in creating the thematic framework (see Section 2). As a baseline we use a Gaussian Naive Bayes (GNB) classifier based on frequency-based features, as available in the scikit-learn library [10], since it is considered a strong baseline for many text classification tasks [6]. A potential problem with linear classifiers is that they struggle with out-of-vocabulary (OOV) words, fine-grained distinctions and unbalanced datasets. The fastText classifier [8] addresses this problem by integrating a linear model with a rank constraint, allowing parameters to be shared among features and classes. Further, we fine-tune BERT for the classification task using a sequence classifier, a learning rate of 5e-5 and 4 epochs. In particular, we made use of Hugging Face's default Transformers implementation of BERT for classifying sentences [17].
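Steps 2 and 3 can be illustrated with a short sketch using the fasttext Python package and scikit-learn. The file name corpus.txt (one terminology-preprocessed sentence per line) and the simple whitespace tokenisation are assumptions; this is a sketch of the general technique rather than the authors' pipeline.

```python
# A minimal sketch of Steps 2-3: domain-trained fastText skip-gram embeddings
# (300 dimensions) combined into sentence vectors by plain and TF-IDF-weighted
# averaging. File names and tokenisation are illustrative assumptions.
import numpy as np
import fasttext
from sklearn.feature_extraction.text import TfidfVectorizer

# Train domain-specific embeddings on the terminology-preprocessed sentences.
ft = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=300)

sentences = open("corpus.txt", encoding="utf-8").read().splitlines()
tfidf = TfidfVectorizer()
tfidf.fit(sentences)
vocab = tfidf.vocabulary_

def mean_embedding(sentence: str) -> np.ndarray:
    """Unweighted average of the word vectors in a sentence."""
    tokens = sentence.lower().split()
    if not tokens:
        return np.zeros(ft.get_dimension())
    return np.mean([ft.get_word_vector(t) for t in tokens], axis=0)

def tfidf_embedding(sentence: str) -> np.ndarray:
    """Average of word vectors weighted by each word's TF-IDF score."""
    tokens = sentence.lower().split()
    row = tfidf.transform([sentence])
    weights = np.array([row[0, vocab[t]] if t in vocab else 0.0 for t in tokens])
    if not tokens or weights.sum() == 0:
        return np.zeros(ft.get_dimension())
    vecs = np.array([ft.get_word_vector(t) for t in tokens])
    return (weights[:, None] * vecs).sum(axis=0) / weights.sum()
```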
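For Step 4, the sketch below shows how the count-based GNB baseline and the fine-tuned BERT classifier might be wired up, assuming multi-hot theme labels and placeholder sentences. It illustrates the general setup (with one-vs-rest GNB for the multi-label case), not the authors' released code; the fastText classifier [8] would be trained analogously with fasttext.train_supervised.

```python
# Sketch of Step 4 under assumed placeholder data: a GNB baseline over
# 1,2-gram counts (one binary model per theme) and BERT fine-tuned with
# lr 5e-5 for 4 epochs using a multi-label (sigmoid + BCE) objective.
import numpy as np
import torch
from torch.utils.data import Dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import GaussianNB
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

THEMES = ["Contact with Agencies", "Indicative Behaviour",
          "Indicative Circumstances", "Mental Health Issues", "Reflections"]
train_texts = ["The person was seen at the GP surgery.",          # placeholders
               "This highlights a challenge for agencies.",
               "The perpetrator had a history of alcohol misuse."]
train_labels = np.array([[1, 0, 0, 0, 0],
                         [0, 0, 0, 0, 1],
                         [0, 1, 0, 0, 0]], dtype=float)

# Baseline: Gaussian Naive Bayes over 1,2-gram counts.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(train_texts).toarray()
gnb = OneVsRestClassifier(GaussianNB()).fit(X, train_labels.astype(int))

# Fine-tuned BERT: sequence classification head over the five themes.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(THEMES),
    problem_type="multi_label_classification")

class ThemeDataset(Dataset):
    def __init__(self, texts, labels):
        self.enc = tok(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i], dtype=torch.float)
        return item

args = TrainingArguments(output_dir="bert-themes", learning_rate=5e-5,
                         num_train_epochs=4, per_device_train_batch_size=16)
Trainer(model=model, args=args,
        train_dataset=ThemeDataset(train_texts, train_labels)).train()
```

At inference time, sigmoid scores over the theme logits would be thresholded (e.g., at 0.5) to obtain the multi-label predictions.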
4.2 Results

We evaluate the performance of the machine learning algorithms using precision, recall and F1-measure. The summary results are calculated using micro- and macro-averaged measures. Early experiments using Word2Vec embeddings [9] and an SVM classifier showed unsatisfactory performance compared to fastText embeddings and the GNB classifier; these results are therefore omitted from Table 2.

The results in Table 2 show that a simple terminology-based pre-processing step leads to a slight improvement over the baseline, with a micro-F1 of 0.59 in comparison to the baseline micro-F1 of 0.57. Despite the small amount of data, we found that corpus-trained embeddings provide a notable advantage over pre-trained embeddings in classifier performance.

Table 2. Summary classification results (FE = feature extraction, FI = feature integration; FT = pre-trained fastText embeddings, "FT domain" = corpus-trained fastText embeddings)

Classifier       FE         FI      Micro p/r/F1      Macro p/r/F1
Baseline         1,2-grams  count   .51 / .66 / .57   .48 / .65 / .54
Baseline         Terms      count   .52 / .68 / .59   .49 / .66 / .54
GNB              FT         mean    .38 / .54 / .44   .38 / .55 / .42
GNB              FT         TF-IDF  .27 / .48 / .34   .32 / .56 / .34
GNB              FT domain  mean    .47 / .60 / .53   .45 / .61 / .50
GNB              BERT       BERT    .43 / .60 / .50   .40 / .59 / .47
FT               FT domain  mean    .52 / .67 / .59   .48 / .62 / .54
FT               FT         mean    .52 / .64 / .57   .48 / .59 / .52
Fine-tuned BERT  BERT       BERT    .56 / .73 / .64   .52 / .68 / .59

The fastText classifier outperformed the GNB model, especially when domain-based embeddings were used. A non-verbatim example of a sentence where the fastText model based on corpus-trained embeddings performs better than the models using pre-trained embeddings is: 'The police received information that the subject was selling crack'. A potential reason for fastText classifying this sentence correctly, unlike the classifiers using pre-trained embeddings, is that the word 'crack' has the meaning of a drug in the reports. However, this is not the widely accepted meaning of the word and thus it cannot be interpreted correctly by pre-trained models. The GNB classifier based on the pre-trained BERT model outperforms the classifiers based on pre-trained embeddings; however, it does not lead to improvements over the domain-based models. Fine-tuned BERT is the best performing classifier with a micro-F1 of 0.64 and a macro-F1 of 0.59, a 0.07 improvement in micro-F1 over the baseline. The improvement achieved by fine-tuning BERT indicates the importance of adapting even the more context-aware pre-trained language models to the specific domain, especially when the domain contains highly specialised language. Further, the poor performance of classifiers based on pre-trained word models shows the lack of transferability of pre-trained embeddings to a highly specialised domain such as the safeguarding reports.
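One way to see why corpus-trained vectors help with domain senses such as 'crack' is to probe the nearest neighbours of the word in the domain-trained model and in a generic Common Crawl model. The sketch below assumes the fasttext package, a hypothetical corpus.txt and a downloaded cc.en.300.bin; it is illustrative only and not an experiment from the paper.

```python
# Illustrative probe: compare the neighbourhood of a domain-shifted word in
# corpus-trained vs generic pre-trained embeddings (assumed file names).
import fasttext
import fasttext.util

domain_model = fasttext.train_unsupervised("corpus.txt", model="skipgram", dim=300)

fasttext.util.download_model("en", if_exists="ignore")   # fetches cc.en.300.bin
generic_model = fasttext.load_model("cc.en.300.bin")

# In a safeguarding corpus, 'crack' would be expected to sit near drug-related
# terms, whereas the generic model mixes in the everyday senses of the word.
print(domain_model.get_nearest_neighbors("crack", k=10))
print(generic_model.get_nearest_neighbors("crack", k=10))
```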
The three best-performing classifiers give similar average results between the dev and test sets (see Table 3). Further, models tend to return higher results for some themes, especially 'Mental Health Issues', on the test set than on the dev set. A potential reason for this is that the test set was annotated in a similar manner to how the classification models operate, i.e., independently of the context of the entire documents. The BERT classifier returned results above 0.50 for the themes 'Contact with Agencies', 'Reflections' and 'Indicative Behaviour' on the dev and test datasets, with precision above 0.60 and recall above 0.70.

Table 3. Results per theme for the best performing classifiers (p / r / F1 on the dev and test sets)

Baseline
  Contact with Agencies      dev .65 / .70 / .68   test .86 / .47 / .61
  Indicative Behaviour       dev .56 / .63 / .59   test .46 / .57 / .51
  Indicative Circumstances   dev .33 / .64 / .44   test .52 / .51 / .51
  Mental Health Issues       dev .26 / .57 / .36   test .39 / .45 / .42
  Reflections                dev .58 / .69 / .63   test .47 / .76 / .58
  AVERAGE                    dev .48 / .65 / .54   test .54 / .55 / .52

FT
  Contact with Agencies      dev .55 / .75 / .63   test .79 / .74 / .76
  Indicative Behaviour       dev .58 / .73 / .65   test .45 / .70 / .54
  Indicative Circumstances   dev .41 / .56 / .47   test .49 / .36 / .42
  Mental Health Issues       dev .26 / .42 / .33   test .35 / .31 / .33
  Reflections                dev .58 / .64 / .61   test .51 / .64 / .56
  AVERAGE                    dev .48 / .62 / .54   test .52 / .55 / .53

BERT
  Contact with Agencies      dev .62 / .82 / .71   test .84 / .58 / .69
  Indicative Behaviour       dev .60 / .74 / .66   test .48 / .63 / .54
  Indicative Circumstances   dev .47 / .56 / .51   test .68 / .34 / .46
  Mental Health Issues       dev .31 / .51 / .39   test .47 / .46 / .46
  Reflections                dev .59 / .76 / .67   test .51 / .82 / .63
  AVERAGE                    dev .52 / .68 / .59   test .60 / .57 / .58

5 Error Analysis

In the preceding section we evaluated the performance of the classification approaches against the annotations generated by the creators of the thematic framework, whom we refer to as the expert annotators. By creating a classifier that uses the annotations generated by the expert annotators as 'ground truth', we aim to produce unified and comparable results across generations that are not susceptible to variations in annotations created by different human annotators interpreting the coding framework. Going further, we judge the predictive power of the models by comparing their performance against the annotations of expert validators: independent social scientists who did not participate in the creation of the thematic annotation framework. We aim to measure the ability of the learned models to conserve the knowledge of the expert annotators, compared with the task being performed manually by independent social scientists who were not creators of the framework (Section 5.1). In this way, we will be able to judge whether automated approaches are reliable for labelling the reports.

We perform three main types of analysis. First, we compare the performance of the classifiers against the annotations of expert validators. Secondly, we compare the performance of the classifiers for different sentence lengths to observe the classifiers' suitability for various sequence lengths; we also measure the effect of the training dataset size on the performance of the models (Section 5.2). Thirdly and finally, we look at the effect of the number of training instances versus the amount of context provided per instance on the performance of the classifiers (Section 5.3).

5.1 Expert Validators vs Classifiers

The initial thematic framework was developed by annotating passages of the documents rather than individual sentences. However, our classifiers are trained on sentences. In order to fairly judge the predictive power of the models against human annotators for annotating sentences and passages of the reports, we performed a study comparing the performance of the classification models against two independent expert validators at the sentence and passage level. For these purposes we used two datasets: one consisting of sentences and one consisting of passages. The sentence set consisted of a sample of 100 randomly chosen sentences, while the passage set consisted of 100 passages, each containing three sentences. The sentence set was extracted from the dev set while the passage set was extracted from the test set (see Table 1). We measured the inter-annotator agreement for predicting themes using Cohen's kappa (see Table 4). We also compare the average F1 measure per theme between the expert validators and the best performing classifier (BERT).
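As a concrete illustration of how such agreement figures can be computed, the sketch below uses scikit-learn's cohen_kappa_score and f1_score on per-theme binary annotations. The arrays are invented placeholders, and the interpretation of 'Expert F1' as the mean F1 of the two validators against the expert-annotator ground truth is one plausible reading of the table caption, not a statement of the authors' exact procedure.

```python
# Illustrative computation of the agreement metrics reported in Table 4,
# assuming binary per-theme annotations (placeholder values, not real data).
import numpy as np
from sklearn.metrics import cohen_kappa_score, f1_score

validator_a = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # e.g. 'Mental Health Issues'
validator_b = np.array([1, 0, 0, 1, 0, 1, 1, 0])
bert_pred   = np.array([1, 0, 1, 1, 0, 0, 0, 1])
gold        = np.array([1, 0, 1, 1, 0, 0, 1, 1])   # expert-annotator labels

kappa = cohen_kappa_score(validator_a, validator_b)
# One reading of 'Expert F1': the mean F1 of the two validators against
# the expert-annotator ground truth.
expert_f1 = np.mean([f1_score(gold, validator_a), f1_score(gold, validator_b)])
bert_f1 = f1_score(gold, bert_pred)
print(f"kappa={kappa:.2f}  expert F1={expert_f1:.2f}  BERT F1={bert_f1:.2f}")
```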
Table 4. Expert validator results (Cohen's kappa, average expert F1, BERT F1) per theme on the sentence and passage sets. 'Expert F1' refers to the average F1 measure between the two expert validators.

                           sentences                     passages
Theme                      Kappa  Expert F1  BERT F1     Kappa  Expert F1  BERT F1
Contact with Agencies      .48    .56        .71         .31    .71        .72
Indicative Behaviour       .36    .51        .66         .16    .56        .61
Indicative Circumstances   .32    .39        .48         .38    .54        .58
Mental Health Issues       .56    .42        .47         .67    .65        .56
Reflections                .27    .37        .65         .47    .52        .54
AVERAGE                    .40    .45        .61         .40    .60        .60

The Cohen's kappa scores showed moderate agreement between the validators, with an average score of 0.40 at both the sentence and passage level. The highest level of agreement is for the 'Mental Health Issues' theme. However, the average expert F1 for this theme is surprisingly low. The reason for the discrepancy between the Cohen's kappa score and the F1 measure is the occurrence of sentences which mention mental health problems such as 'depression'. Such sentences are labelled by the expert validators as 'Mental Health Issues'; however, their true label is different because of the surrounding context. Surprisingly, a large portion of these sentences were correctly classified by BERT. The average F1 score for the expert validators improves significantly for passage-level classification, with an average F1 of 0.60 in comparison to an average F1 of 0.45 for sentence-level annotations (see Table 4). This suggests that humans need more context, i.e., to see the sentences embedded in paragraphs, to classify sentences correctly, compared to deep learning models that can generalise better in these cases of limited context thanks to what they learned from the training set.

5.2 Effect of sentence length and training size

Experiments comparing the best-performing classifiers across different sentence lengths and training set sizes showed that BERT performed better than the baseline method for any sentence length. Further, BERT gave higher results than fastText and the baseline for shorter sentences. For long sentences, BERT and fastText had very similar performance, with a difference of less than 1% (see Fig. 2). The comparison of the classification models' performance across different training set sizes revealed that deep learning models (i.e., BERT) are highly influenced by the size of the training set in comparison to linear models such as the baseline and fastText (see Fig. 2). BERT performed worse than the baseline for the very small training set, while fastText gave performance similar to the baseline. However, BERT's performance almost doubled as more sentences were added to the training set, while GNB performance was not as heavily influenced by the size of the training data, especially for training sets with more than 1,000 sentences.

Fig. 2. Micro-F1 measure per sentence length (left; i.e., sentences with more than 3 tokens, etc.) and per training dataset size (right; i.e., training datasets with up to 341 sentences, etc.)

5.3 Sentences vs Passages

In this section, we extend the analysis from Section 5.1 by looking at the effect of context versus the number of training instances provided to the classifier models. In this experiment, we gradually increase the length of the training instances in order to judge the importance of the training size versus the context (in terms of passage length). We evaluate the models using sentences and passages, where each test passage consists of three sentences (see Fig. 3). The test sets for these experiments were extracted from the dev set, while the training sentences and passages were extracted from the training set.

Fig. 3. Micro-F1 measure per paragraph size of the training instances: test set consisting of sentences (left); test set consisting of paragraphs (right)
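Before turning to the results, a short sketch of how such an experiment can be set up: group consecutive sentences into fixed-length training passages (taking the union of their theme labels) and vary the number of training instances. The grouping function and the subset sizes are assumptions for illustration, not the authors' exact protocol.

```python
# Illustrative setup for the context-vs-training-size comparison: build
# passages of k consecutive sentences and subsample the training set at
# increasing sizes. Data structures and sizes are assumed.
from typing import List, Tuple

Example = Tuple[str, List[str]]   # (text, themes)

def to_passages(sentences: List[Example], k: int = 3) -> List[Example]:
    """Concatenate k consecutive sentences into one training instance,
    taking the union of their theme labels."""
    passages = []
    for i in range(0, len(sentences), k):
        chunk = sentences[i:i + k]
        text = " ".join(s for s, _ in chunk)
        themes = sorted({t for _, labels in chunk for t in labels})
        passages.append((text, themes))
    return passages

def training_subsets(train: List[Example], sizes=(341, 684, 1368, 2736)):
    """Yield nested training subsets of increasing size (sizes illustrative)."""
    for n in sizes:
        yield train[:n]

# Usage sketch: train each classifier on every subset, once on sentence-level
# instances and once on passage-level instances, and compare micro-F1 curves.
```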
Results showed that the performance of deep learning models is influenced more by the number of training instances than by the length of the training passages. Further, models trained at the sentence level with a higher volume of training data give better results when tested on small paragraphs than classifiers trained on passages but with less training data available. This signifies the importance of a higher volume of labelled data for reaching good classifier performance.

6 Conclusions and Future Work

Through this work, we explored the problem of predicting the main themes in safeguarding reports using supervised machine learning approaches. We analysed the performance of state-of-the-art classifiers, feature extraction and feature integration techniques, which allowed us to identify classification methods suitable for domain-specific documents. Results showed that the performance of state-of-the-art deep learning models is highly dependent on the size of the training data in comparison to linear models, as BERT's performance is worse than that of a simple Naive Bayes baseline and fastText for very small training datasets. Further, training word embeddings on the specific domain, even when the size of the corpus is very small, leads to much higher results in comparison to pre-trained embeddings. This shows the importance of targeting pre-trained models at the specific corpus despite its small size. The study comparing the expert validators' performance against the automated models showed that thematic analysis can be challenging even for subject-matter experts without prior knowledge of the thematic annotation framework. Further, humans need more knowledge about the context surrounding a sentence, compared to deep learning approaches. Experiments showed that BERT and fastText performance is affected more by the size of the training data than by the amount of context given. In this respect, sentence-level classification provides more training data and a more fine-grained distinction between themes, which in turn allows for an easier expansion of the models and faster annotation.

In the future, we want to improve theme detection for the safeguarding documents by using generative language models to artificially augment the sparse data of the corpus. We will use the additional data as a training set in order to improve classifier performance. Further, we plan to look into developing and using knowledge graphs for improving classification. This will help refine the query functionality of the application and help improve the identification of similar documents and common trends in the safeguarding collection.

References

1. Ali, Z.: Text classification based on fuzzy radial basis function. Iraqi Journal for Computers and Informatics 45(1), 11–14 (2019)
2. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, 135–146 (2017)
3. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2019)
4. Edwards, A., Camacho-Collados, J., De Ribaupierre, H., Preece, A.: Go simple and pre-train on domain-specific corpora: On the role of training data for text classification. In: Proceedings of the 28th International Conference on Computational Linguistics. pp. 5522–5529 (2020)
5. Edwards, A., Preece, A., De Ribaupierre, H.: Knowledge extraction from a small corpus of unstructured safeguarding reports. In: European Semantic Web Conference. pp. 38–42. Springer, Portorož, Slovenia (2019)
6. Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research 9, 1871–1874 (2008)
7. Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., Smith, N.A.: Don't stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964 (2020)
8. Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. pp. 427–431. Association for Computational Linguistics, Valencia, Spain (2017)
9. Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26, 3111–3119 (2013)
10. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
11. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
12. Robinson, A.L., Rees, A., Dehaghani, R.: Making connections: a multi-disciplinary analysis of domestic homicide, mental health homicide and adult practice reviews. The Journal of Adult Protection 21(1), 16–26 (2019)
13. Sainz, O., Rigau, G.: Ask2Transformers: Zero-shot domain labelling with pre-trained language models. arXiv preprint arXiv:2101.02661 (2021)
14. Spasić, I., Greenwood, M., Preece, A., Francis, N., Elwyn, G.: FlexiTerm: a flexible term recognition method. Journal of Biomedical Semantics 4(1), 27 (2013)
15. Türker, R., Zhang, L., Koutraki, M., Sack, H.: Knowledge-based short text categorization using entity and category embedding. In: European Semantic Web Conference. pp. 346–362. Springer (2019)
16. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461 (2018)
17. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Brew, J.: HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771 (2019)