A Document Tagging Support System for Nursing Care Experts⋆

Beat Tödtli1,*, Sebastian Müller1, Melanie Rickenmann1, Janine Vetsch1 and Simon Haug1

1 Eastern Switzerland University of Applied Sciences, Rosenbergstrasse 59, 9000 St. Gallen, Switzerland

Abstract
We present the findings of an interdisciplinary project that implemented a document tagging support system for nursing care experts. We evaluate its performance and provide lessons learned. This project was particularly marked by low inter-rater reliability of the document labels and by use case understanding issues, but also by the good performance of a simple, BERT-based binary relevance approach.

Keywords
document tagging support, document classification, inter-rater reliability, nursing care professional education

1. Introduction

Text mining and document classification are long-standing research areas [1] to which deep neural networks have made very significant contributions over the past years. In particular, bidirectional transformer models such as BERT appear to "learn" structural information about language [2] and provide unprecedented performance and ease of use [3]. Such models can be used for a broad range of application classes, such as document classification [4], regression tasks, document tagging, information retrieval, recommendation tasks, and many more [5]. While this is ground-breaking, it also opens up a wide range of applied machine learning research opportunities. The challenge there lies in gathering, structuring, consolidating and distilling experiences into domain-adapted methodologies and insights. Concrete application cases often raise peculiar challenges, as in the case study reported here.

We present the findings of a small-scale applied interdisciplinary project in the domain of tagging support for document labelling tasks. The project's goal was to build a document tagging support system for health experts tasked with tagging nursing care publications. As an applied machine learning project, it had a set of requirements and challenges that are quite different from those of standard text classification or information filtering tasks, on which we report here.

LWDA'23: Lernen, Wissen, Daten, Analysen. October 09–11, 2023, Marburg, Germany
* Corresponding author.
beat.toedtli@ost.ch (B. Tödtli)
https://www.ost.ch/de/person/beat-toedtli-1039 (B. Tödtli); https://www.ost.ch/de/person/sebastian-mueller-940 (S. Müller); https://www.ost.ch/de/person/melanie-rickenmann-1090 (M. Rickenmann); https://www.ost.ch/de/person/janine-vetsch-1046 (J. Vetsch)
0000-0003-3674-2340 (B. Tödtli); 0000-0001-9877-0086 (S. Müller); 0009-0004-7421-0905 (S. Haug)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073

2. The Application Case

2.1. Business Understanding

Researchers in nursing care are investigating how to promote continuous professional learning in the daily practice of nursing [6]. A tool that integrates scientific evidence into patient documentation at the point of care could provide nurses with easy access to evidence in their daily work. The researchers are therefore labelling a dataset of nursing care publications, associating one or several topic tags with each of them1.

1 We use the word "tag" or "category" instead of "label" to indicate that each document can have more than one tag (or category) associated with it.
This report is concerned with building a document tagging support system for these experts, to be used as a decision support aid. During the business understanding efforts it was realized rather late in the project that helping experts save time on inspecting a document was not as relevant as motivating them to rethink their tagging decisions.

2.2. Data Understanding

The data consisted of 1515 nursing care or medical publications, each associated with one or several of 24 different tags. The number of publications per tag (see Fig. 1) arose historically and reflects a data collection campaign that did not specify the relative frequency of the tag categories in the data set. After dropping tags with fewer than 20 associated documents and the class "others", 1293 documents in 18 relevant categories remained.

The tag frequency distribution was a data understanding indicator with major implications for the project: only a few tags were associated with each document, but a significant fraction of documents had more than one tag. 76% of all documents had one tag, 23% had two tags and 1% had three or more tags. Therefore, a multi-label classification approach [7] is adequate, since in 24% of the cases more than one tag is associated with a document. Furthermore, metrics such as precision@k need to be evaluated with respect to the multi-label case. In a UX workshop with the experts it was found desirable to present k = 3 selected document tags. Since most documents have fewer than three relevant tags, precision@3 values cannot reach 100%. Also, most information retrieval or recommender systems have a much smaller percentage of relevant items to retrieve or recommend, so our precision values will likely be much higher than in those settings.

Figure 1: The number of documents per tag category (bars, lower x-axis) and the median and interquartile range of the predicted (test) class probabilities ('x'- and '*'-marked points with bars, upper x-axis). The black, 'x'-marked bars correspond to documents containing the given tag, the blue, star-marked bars to documents not containing the given tag. Documents associated with only the label "others" and labels with fewer than 20 associated documents were not used for training. Noticeably, more training data is generally helpful.

2.3. Dealing with Imprecise Labels

A complication arose with the early realisation that there was some disagreement between experts on which labels to assign to a given document. This effect was expected based on the results of Xia and Yetisgen-Yildiz [8], since medical training alone does not ensure high inter-annotator agreement, and no NLP researcher had been involved in the annotation process until this project. The inter-expert labelling reliability was assessed in a small labelling campaign in which two experts labelled the same 60 documents. The co-occurrence matrix of the two experts' ratings indicated disagreements, even though for most documents (87%) at least one tag overlapped. It was also found that, averaged over all assigned tags, the intersection over union of the assigned tag sets was 72%. However, to assess inter-expert reliability, Krippendorff's alpha [9] is more appropriate here, as it takes inter-labeller (dis-)agreement by chance into account. It has further advantages in that it allows for missing values and multiple tags.
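A minimal sketch of how both agreement measures could be computed is shown below, using NLTK's AnnotationTask with the set-based MASI distance for the alpha value. The example tag sets, the choice of distance function and the per-document averaging of the intersection over union are illustrative assumptions and not the project's exact annotation data or procedure.

```python
# Illustrative sketch (not the project's exact procedure): agreement between
# two experts who each assigned a set of tags to the same documents.
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics.distance import masi_distance

# Hypothetical example data: one tag set per document and expert.
ratings_a = [{"oncology"}, {"geriatrics", "neurology"}, {"angiology"}]
ratings_b = [{"oncology", "geriatrics"}, {"neurology"}, {"angiology"}]

# Averaged intersection over union of the assigned tag sets.
iou = sum(len(a & b) / len(a | b) for a, b in zip(ratings_a, ratings_b)) / len(ratings_a)

# Krippendorff's alpha for set-valued labels, here with the MASI set distance.
triples = [("expert_a", str(i), frozenset(a)) for i, a in enumerate(ratings_a)]
triples += [("expert_b", str(i), frozenset(b)) for i, b in enumerate(ratings_b)]
alpha = AnnotationTask(data=triples, distance=masi_distance).alpha()

print(f"mean IoU: {iou:.2f}, Krippendorff's alpha: {alpha:.2f}")
```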
We find a value of α = 0.59, indicating substantial inter-rater agreement according to Landis and Koch [10], but not according to Krippendorff, who is reported to consider 0.8 an absolute minimum value for any serious purpose [11]. Based on these indications, an important task in a next iteration is to write annotation guidelines and to ensure consensus among the various nursing care experts about how to apply these guidelines [8].

3. Functional Prototype Construction

3.1. Data Preprocessing and Feature Engineering

The proposed system consists of a text extraction and feature engineering step, which returns the BERT sentence embedding of the paper abstract using the Hugging Face model "paraphrase-MiniLM-L6-v2"2, and a tag filtering step based on this vector, detailed further in Sec. 3.2.

2 See https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2

Figure 2: Screenshot of the application's tagging interface, where the three filtered tags are highlighted in red. The predicted tag probabilities are given in brackets.

As a feature selection decision, only BERT feature vectors of the publication abstracts were used. The abstracts were extracted manually by the experts and have the advantage of fitting well within the 512 subword token limit imposed by BERT.

3.2. Modelling: Multi-Label Classification for Document Tag Filtering

For multi-label classification, we used a binary relevance problem transformation method [12]. Let n_c = 18 be the number of different tags available before filtering. Binary probabilistic classifiers f_i: X → [0, 1] for each i ∈ {1, …, n_c} were trained using support vector classifiers with RBF kernels. Their predicted class probabilities for a document x, {f_i(x), i = 1, …, n_c}, were interpreted as tag relevance scores. As a user interface design choice and as a consequence of the distribution of the number of tags per document, the tags with the k = 3 highest scores were selected and presented in red together with the associated probabilities. Fig. 2 shows the user interface as it is currently implemented.
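To make this modelling step concrete, the sketch below outlines a binary relevance filter of the kind described above, combining sentence-transformers embeddings with one probabilistic RBF-SVC per tag. The hyperparameters, helper names and data handling are illustrative assumptions; the project's actual training code may differ.

```python
# Sketch of a binary relevance tag filter as described in Sec. 3.2.
# Assumptions: `abstracts` is a list of abstract strings, `tag_sets` a list
# of sets of tag names; hyperparameters are illustrative defaults.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

encoder = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")

def train_tag_filter(abstracts, tag_sets):
    """Train one probabilistic RBF-SVC per tag (binary relevance)."""
    X = encoder.encode(abstracts)                 # (n_docs, 384) embeddings
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(tag_sets)               # (n_docs, n_c) 0/1 matrix
    classifiers = []
    for i in range(Y.shape[1]):                   # one binary classifier per tag
        clf = SVC(kernel="rbf", probability=True)
        clf.fit(X, Y[:, i])
        classifiers.append(clf)
    return mlb, classifiers

def filter_tags(abstract, mlb, classifiers, k=3):
    """Return the k tags with the highest predicted relevance scores."""
    x = encoder.encode([abstract])
    scores = np.array([clf.predict_proba(x)[0, 1] for clf in classifiers])
    top = np.argsort(scores)[::-1][:k]
    return [(mlb.classes_[i], float(scores[i])) for i in top]
```

In this sketch, probability=True enables the SVC's probability estimates (via Platt scaling), which serve as the relevance scores that are ranked to obtain the three filtered tags.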
3.3. Evaluation: Statistical Results

We discuss the evaluation of the data mining prototype on the statistical level; the user-level evaluation is discussed in Sec. 4.2. Our main results are given in Tab. 1. The low precision@3 and MAP@3 values can be attributed to the fact that the number of relevant tags per document is mostly 1 or 2. The average precision@1 is higher, at 82%. From a practical point of view, the recall@3 value is presumably the most important one, as three tags can easily and quickly be judged by an expert. By looking only at these top-3 selections, however, the expert might miss relevant tags in 8% of the cases. This result was deemed sufficient for a first iteration, and classifier optimization efforts were therefore terminated, as early user experience feedback was deemed more important.

Table 1
Precision, recall and mean average precision values of the SVM-RBF tagging support system, averaged over all documents in the test set. They correspond to the best filtering model selected on the validation set.

Precision@3   Precision@1   Recall@3   MAP@3
42%           82%           92%        61%

However, the single-tag binary classifiers were far from perfect, as the precision-recall curves in Fig. 3 show. In fact, these curves show that the less frequent tags (with a prevalence of 2-4%) perform worse than the more frequent ones (appearing in 6-13% of the documents), suggesting that a larger data set will likely help to improve the performance further.

Figure 3: Precision-recall curves of the single-tag relevance score estimators for the 3 most frequent tags (neurology, oncology, geriatrics; solid lines) and the 3 least frequent tags (critical situations, orthopedics, angiology; dotted lines).

Although useful as a rough performance estimate, recall@3 must be considered a flawed metric, since the number of relevant tags varies mostly between 1 and 2 tags per document. Given that ~97% of all documents in the test data set had at most two associated labels, Tab. 2 considers the documents with one and two labels separately. For the subset of ~63% of documents with only one relevant tag, Tab. 2 lists the percentages of documents for which this relevant tag is not found, or is found at the first, second or third position of the filtered list. For the ~33% of documents with two relevant labels, the positions at which the two relevant tags were found are given analogously. Thus, for documents with one relevant tag, the relevant tag appeared among the three filtered tags in over 90% of the cases, whereas for documents with two relevant tags, at least one of them appeared among the top three in 97-99% of the cases and both of them in 75-81% of the cases (cf. Tab. 2). Using a naïve Bayes classifier instead of a support vector machine generally resulted in a minor performance reduction.

Table 2
Position of the relevant tags in the filtered list for documents with one and two relevant tags, respectively. Here "1+2" is to be read as "at the first and second position", "1 tag" as "one relevant tag found", and "None" as "no relevant tag selected".

                                 1 relevant tag                2 relevant tags
Classifier Type           None    1     2     3      None   1 tag   1+2   1+3   2+3
Support Vector Machine      7%   80%   10%   3%        1%    18%    64%   12%    5%
Naïve Bayes                 8%   76%   13%   3%        3%    20%    61%    7%    7%

4. Deployment

4.1. Software Architecture

All tagged publications are made accessible to nursing practitioners on a TYPO3-based website. The data is stored on a Linux server and managed in a MariaDB database. Experts administer publications in the backend of the website. The Python-based filter system is started once a day by a cron job, retrieves newly uploaded publications via a REST API and writes the filtered tags (also via the REST API) into a designated table of the MariaDB database, from where they are displayed to the tagging experts (see Fig. 2).
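As a rough illustration of this integration, the following sketch outlines such a daily filter job. The REST endpoints, field names and the filter_tags helper (from the modelling sketch above) are hypothetical placeholders; only the overall flow of fetching new publications, scoring them and writing the filtered tags back follows the description above.

```python
# Sketch of the daily filter job described in Sec. 4.1. Endpoints, field
# names and authentication are hypothetical placeholders, not the actual API.
import requests

API_BASE = "https://example.org/api"  # placeholder base URL

def run_daily_filter_job(mlb, classifiers):
    # Retrieve newly uploaded publications that have no filtered tags yet.
    new_pubs = requests.get(f"{API_BASE}/publications?filtered=false").json()
    for pub in new_pubs:
        top_tags = filter_tags(pub["abstract"], mlb, classifiers, k=3)
        # Write the three filtered tags and their scores back via REST.
        requests.post(
            f"{API_BASE}/publications/{pub['id']}/filtered_tags",
            json=[{"tag": tag, "score": score} for tag, score in top_tags],
        )
```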
4.2. Evaluation: Assessment of User Experience

The system has so far been in use for 6 months. Based on an interview conducted with the nursing care expert chiefly tasked with tagging documents, several crucial insights were established. The most problematic one was that the expert viewed the filtered tags as almost authoritative recommendations. The expert looked at the system's three filtered tags and checked their plausibility, instead of looking at the document and determining the relevant tags. Both the option of tagging the document with a non-filtered tag and the tag probability indications were ignored, mainly out of convenience and efficiency, in the belief that no tag could be relevant that was not highlighted. The expert also declared that he no longer looked at the abstract or the paper; instead, he relied solely on the filtering system and the title. These practices must be registered with alarm: despite usage instructions, involvement in this project and knowledge of the statistical evaluation results, the expert seemed happy to uncritically pass on the responsibility for the correctness of the tags to the tag filtering system.

5. Conclusion and Outlook

We built a document tagging support system to aid nursing care experts in building a labelled dataset for on-the-job professional education. Using a simple BERT-based feature engineering approach combined with a standard radial basis function support vector machine, recall@3 values of around 90% were reached. While the domain experts deemed this result good enough for deployment, issues regarding inter-rater reliability and scarce training data for several tags show potential for improvement. Correspondingly, Fig. 1 showed that not all tags were reliably recognized by the system. It is likely that inter-labeller agreement was the limiting factor in this regard, in contrast to many CRISP-DM projects, where data collection, feature engineering and optimizing the modelling step often pose the key challenges. Our results also demonstrate that building custom tagging support systems is already quite inexpensive. This again suggests that domain-adaptation efforts are reasonably likely to be successful when transformer-based sentence embeddings are used.

We cautiously try to generalise these insights. When trying to judge the probability of success of a tagging support system, important positive indications are that
• the domain-specific dataset is at least moderately sized. In the case of publications, 20-50 abstracts per category might suffice.
• the labels are easily distinguished based on the text only, possibly even by non-experts.
• experts judge the reasoning necessary to distinguish between categories to be simple.

During our project, we designed the system with the goal of challenging the experts to reconsider their tag selections with the aid of the tagging support system. The system was therefore designed around assisting expert labelling, rather than providing an expert opinion by itself. Despite corresponding efforts and instructions, however, the expert interviewed after using the system for two months had developed a significant inclination to uncritically adopt the system's (alleged) "recommendations" out of convenience. This UX challenge remains unsolved and offers great potential for future research.

5.1. Lessons to Consider

In summary, we list some lessons learned:
• Carefully think about how the system biases the experts' labelling decisions. The experts might need training to use the system correctly; otherwise, they may uncritically accept the system's biases in the tagging process while trying to become more efficient. Additionally, regularly check how the system is used after deployment.
• Do not start a tagging or labelling process without defining clear labelling instructions.
• Monitor the label correlations in the labelling process, and discuss label categories among experts to verify that they are discernible by all [12, 7].
• If in doubt, measure the inter-rater reliability early on. A low inter-rater reliability may invalidate the project goals, and the system performance will be limited by the label consistency of the training data.
• For the task of scientific document tagging support, abstracts are useful. The bulk of a document should only be considered in a second iteration, because the limited transformer input length poses additional, potentially expensive challenges.

References

[1] A. Bilski, A review of artificial intelligence algorithms in document classification, International Journal of Electronics and Telecommunications 57 (2011). URL: http://journals.pan.pl/Content/86895/PDF/35.pdf. doi:10.2478/v10177-011-0035-6.
[2] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
[3] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://aclanthology.org/2020.emnlp-demos.6. doi:10.18653/v1/2020.emnlp-demos.6.
[4] A. Adhikari, A. Ram, R. Tang, J. Lin, DocBERT: BERT for document classification, ArXiv abs/1904.08398 (2019). URL: http://arxiv.org/abs/1904.08398.
[5] R. Nogueira, K. Cho, Passage re-ranking with BERT, 2019. URL: https://arxiv.org/abs/1901.04085. doi:10.48550/ARXIV.1901.04085.
[6] R. Ranegger, S. Haug, J. Vetsch, D. Baumberger, R. Bürgin, Providing evidence-based knowledge on nursing interventions at the point of care: findings from a mapping project, BMC Medical Informatics and Decision Making 22 (2022) 308. URL: https://doi.org/10.1186/s12911-022-02053-8. doi:10.1186/s12911-022-02053-8.
[7] R. B. Pereira, A. Plastino, B. Zadrozny, L. H. Merschmann, Correlation analysis of performance measures for multi-label classification, Information Processing & Management 54 (2018) 359–369. URL: https://www.sciencedirect.com/science/article/pii/S0306457318300165. doi:10.1016/j.ipm.2018.01.002.
[8] F. Xia, M. Yetisgen-Yildiz, Clinical corpus annotation: challenges and strategies, in: Proceedings of the Third Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM 2012) in conjunction with the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012, pp. 21–27.
[9] K. Krippendorff, Content Analysis: An Introduction to Its Methodology (second edition), Sage Publications, 2004.
[10] J. R. Landis, G. G. Koch, The measurement of observer agreement for categorical data, Biometrics 33 (1977) 159–174. URL: http://www.jstor.org/stable/2529310.
[11] R. Artstein, M. Poesio, Survey article: Inter-coder agreement for computational linguistics, Computational Linguistics 34 (2008) 555–596. URL: https://aclanthology.org/J08-4004. doi:10.1162/coli.07-034-R2.
[12] J. Read, A. Bifet, G. Holmes, B. Pfahringer, Scalable and efficient multi-label classification for evolving data streams, Machine Learning 88 (2012) 243–272. URL: https://doi.org/10.1007/s10994-012-5279-6. doi:10.1007/s10994-012-5279-6.