
Bologna, Italy. Email: anne.rother@ovgu.de (A. Rother)

Comparing visual tools for pairwise comparisons of tabular data

Anne Rother

Matteo Polsinelli

Till Ittermann

Giuseppe Placidi

Myra Spiliopoulou

Otto-von-Guericke University Magdeburg, Magdeburg, Germany; University Medicine Greifswald, Greifswald, Germany; University of L'Aquila, L'Aquila, Italy; University of Salerno, Salerno, Italy

2025


AI-based diagnostics demand reliable medical record labeling. Despite the advances of few-shot and zero-shot learning, each specialized medical data collection demands at least some labels that agree with the feature space and the class distribution of the collection. However, the a posteriori classification of existing records by humans, on diagnoses that were not considered during the original data acquisition, demands effort and expert knowledge. To facilitate human labor and decrease the required level of expertise, we propose a workflow that encompasses pairwise comparisons of medical records and dedicated visualizations for the juxtaposition of record pairs in the original feature space. We evaluate the potential of new visualization schemes in controlled experiments with human volunteers and we juxtapose the results to those achieved with earlier, much simpler visualizations.

1. Introduction

Pairwise comparisons are used in machine learning to derive similarity functions that take local proximity between objects into account [1]. Pairwise comparisons are also used in crowdworking to capitalize on the fact that humans can discern similarities between objects with their eyes, in a way that AI still cannot imitate [2], [3], [4]. For example, when called to perform a pairwise comparison among the three faces in the upper part of Figure 1, humans are likely to ignore the whiskers, a feature of some importance when comparing the three faces in the lower part of the same figure. When it comes to high-dimensional medical records, though, human annotators need more assistance when deciding which features to concentrate on.

In this paper, we investigate the potential of different structured record visualizations in assisting humans in pairwise comparisons. We propose a workflow that encompasses a mechanism for triplet construction from a set of labeled medical records for a binary classification problem (person has the disease: Y/N), two visualizations for pairwise comparisons, an experiment design for the evaluation of these visualizations on volunteers, and a set of evaluation criteria to assess the potential of each method and its merit in comparison to simpler visualization mechanisms.

Our first contribution is the complete workflow, intended to assist human annotators who perform pairwise comparisons of structured medical records for the purpose of labeling. Our second contribution consists of the two presented visualizations, which are intended to highlight similarities and differences among records in the original feature space. Our last contribution is the evaluation approach, covering an experiment that involves human volunteers and a retrospective comparison to the results of an earlier experiment that used simpler visualizations.

The paper is organized as follows. In Section 2, we discuss related work on pairwise comparisons and on visualization of structured medical records, focusing on visualization methods for the original feature space. In Section 3, we present the elements of our approach, while in Section 4 we present the medical data we used, the experiment we performed with human volunteers and our evaluation criteria. Section 5 contains our results and a discussion of them. The last section summarizes the findings and provides an outlook.

2. Related Work

2.1. Pairwise comparisons

Pairwise comparisons have been studied intensively from the machine learning perspective, see e.g. [5], where the objective is to induce a distance function over the data space. The human-driven process of finding the two most similar objects inside a triplet is investigated in psychology, but there the objective is to acquire insights into human perception [6]. Insights into whether triplet comparisons performed by human annotators are indeed exploitable by machine learning algorithms are mostly limited to the comparison of images [7, 8]. Arguably, pairwise comparison in triplets of tabular data records, such as medical instances, is different from the comparison of image instances. Yao et al. used pairwise comparisons for the estimation of treatment effects in observational data [9]: they chose three pairs of instances, one consisting of the most proximal target instance and control instance, one built around the most remote target instance, and one built around the most remote control instance. They then introduced two counteracting metrics on the basis of loss functions, intended to bring similar instances close to each other, but not too close, in the representation space.

2.2. Measuring the difficulty of annotation and labeling tasks

The difficulty of pairwise comparisons of images has been investigated in [7, 10]. Similarly to our earlier works [11, 2] on pairwise comparisons of non-image data, Ahonen et al. [7] used sensors that measure electrodermal activity. Their results were not conclusive, in the sense that it did not become evident what makes a comparison difficult independently of the person who performs the comparison. The difficulty of pairwise comparisons of non-image objects is less investigated in general, despite the fact that non-image objects are of relevance in several application domains, including the annotation of clinical data.

However, there are several investigations on the difficulty of crowdworker tasks, including labeling tasks and more elaborate annotations. Traditionally, ‘difficulty’ (which is not observable) is modeled on the basis of observable quantities. One of them is ‘duration’, defined in [12] as the time needed to complete a specific task and used as an indicator of task difficulty for a specific crowdworker. Another important indicator is (dis)agreement among crowdworkers, pointing to task ambiguity [13] or to diverging interpretations of a task [14], i.e. to inherent task properties that are independent of a specific crowdworker’s skills and expertise. In [2] we focused on (dis)agreement as a potential indicator of difficulty: annotator (dis)agreement was not predictive, neither for difficulty nor for correctness. In that study, annotators performed pairwise comparisons on triplets that consisted of 10-dimensional medical instances from the cohort SHIP-2 of [15]. We found that for some instances proximity across certain dimensions was misleading, in the sense that annotators consistently decided that a pair of instances inside a triplet were more similar than they truly were.

2.3. Annotation of medical data

The annotation of medical data, be it images, diagnostic texts or structured instances, is a very important task, for which crowdworking has been applied increasingly and successfully in recent years [16, 17, 18]. In [18], Wazny lists 8 areas of medical applications where crowdsourcing is being used; among them is diagnosis, such as assigning scores to tumors. This corresponds to the creation of ground truth in existing datasets through labeling. However, medical annotations go beyond the assignment of labels or scores. For example, Joshi et al. recruited volunteers who identified the ‘location’ of emotional episodes in timestamped data, as well as the duration of these episodes [19]. Studies on the annotation of medical data follow different directions. They include the study of the potential of Virtual Reality (VR) technologies as in [20, 21], the generation of open access datasets [22], the role of annotated data collections in education [23], and ways of semi-automating the labeling/annotation process. Among the latter, the earlier work of Nissim et al. [24] highlighted the potential of active learning to reduce label acquisition cost. More recently, combinations of semi-supervision and crowdsourcing have become a popular subject of investigation, see e.g. [25, 26].

3. Workflow for record annotation through pairwise visual comparisons

3.1. A pie-based visualization

The proposed method was inspired by the solution proposed in [2], in which the experiment participant was shown two representations: a tile-based one and a line-based one. In the first, each triplet is composed of ten tiles, one for each risk factor, with the numerical values encoded as shades (Figure 2, left box). In the second, the position of the middle record value for some variables indicates its distance from the variable values of the other two records (Figure 2, right box). This solution has been shown to be effective, but it can be improved by a new visualization method that does not separate the features from one another.

The main idea is to use the pie-based visualization shown in Figure 3. Compared to the old visualization, this is more compact since each of the ten variables is represented as a slice of the pie. In this way, three pies are necessary to represent the three subjects A, B, and C of the experiment. The comparison between subjects is immediate, and the slices of the pie are position invariant, since the crowd worker is not biased by the particular arrangement of each variable (there is no ordering between them).

In both methods, the color palette is assigned by linearly distributing the colors over the Min-Max interval of the feature values, using a discrete number of colors for discrete features. A 5-value color scale is used for continuous features, a 2-value color scale for binary features, and a 3-value color scale for the ternary feature.
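The linear color assignment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `color_index`, the dictionary `BINS` and the example value ranges are hypothetical, and equal-width bins are assumed as the reading of "linearly distributing the colors".

```python
# Assumed bin counts per feature kind, following the text above.
BINS = {"binary": 2, "ternary": 3, "continuous": 5}


def color_index(value: float, m: float, M: float, bins: int) -> int:
    """Map a feature value to a color index in 0..bins-1, linearly over [m, M]."""
    if bins < 1:
        raise ValueError("bins must be positive")
    if M <= m:
        return 0  # degenerate range: a single color
    position = (value - m) / (M - m)      # 0.0 at the minimum, 1.0 at the maximum
    index = int(position * bins)          # equal-width bins (an assumption)
    return min(max(index, 0), bins - 1)   # clamp so that value == M lands in the last bin


# Hypothetical example: age 50 within an assumed observed range of 20..80 years.
print(color_index(50, 20, 80, BINS["continuous"]))
# Hypothetical example: a binary feature coded 0/1.
print(color_index(1, 0, 1, BINS["binary"]))
```

The clamp in the last line of the function is a design choice: without it, the maximum value would fall just outside the last bin.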

The resulting color-based triplet assignments are described in Algorithm 1 for the old method and in Algorithm 2 for the new method.

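Algorithm 1 (old method)
Input: tripletA, tripletB, tripletC
for each feature f:
    m ← minimum of f
    M ← maximum of f
    bins ← 2 (binary feature), 3 (ternary feature) or 5 (continuous feature)
    Palette ← createPalette(bins, m, M)
    color of f in A ← Palette color for the value of f in tripletA
    color of f in B ← Palette color for the value of f in tripletB
    color of f in C ← Palette color for the value of f in tripletC
show the colored triplet

Algorithm 2 (new method)
Input: tripletA, tripletB, tripletC
n ← 10
initialize one pie each for A, B and C
i ← 1
for each feature f:
    m ← minimum of f
    M ← maximum of f
    bins ← 2 (binary feature), 3 (ternary feature) or 5 (continuous feature)
    Palette ← createPalette(bins, m, M)
    slice i of A ← Palette color for the value of f in tripletA
    slice i of B ← Palette color for the value of f in tripletB
    slice i of C ← Palette color for the value of f in tripletC
    i ← i + 1
… ← …(5, 0, 1)
show the three pies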
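As an illustration of the pie-based assignment, the following sketch computes, for one record, an equal-angle slice per feature together with its color bin. It is a hedged reading of the procedure, not the authors' code: the function `pie_slices`, the tuple layout and the feature ranges in the example are all hypothetical, and equal-width color bins are assumed.

```python
from typing import Dict, List, Tuple

# (minimum, maximum, number of colors) per feature; the values used below are illustrative.
FeatureRange = Tuple[float, float, int]


def pie_slices(record: Dict[str, float],
               ranges: Dict[str, FeatureRange]) -> List[Tuple[str, float, float, int]]:
    """Return one slice per feature as (name, start angle, end angle, color bin)."""
    names = list(ranges)            # fixed order: same slice position in every pie
    angle = 360.0 / len(names)      # all slices have the same size
    slices = []
    for i, name in enumerate(names):
        m, M, bins = ranges[name]
        position = 0.0 if M <= m else (record[name] - m) / (M - m)
        color_bin = min(max(int(position * bins), 0), bins - 1)
        slices.append((name, i * angle, (i + 1) * angle, color_bin))
    return slices


# Hypothetical ranges and record with four of the ten risk factors.
ranges = {"age": (20, 80, 5), "sex": (0, 1, 2), "smoking": (0, 2, 3), "LDL": (1.0, 6.0, 5)}
record_b = {"age": 50, "sex": 1, "smoking": 1, "LDL": 1.0}
for piece in pie_slices(record_b, ranges):
    print(piece)
```

Because the slice order is taken from `ranges` rather than from the record, a given feature occupies the same slice in the pies of A, B and C, which is what makes slice-by-slice comparison possible.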

Figure 3 illustrates how easily one can conclude that instance B is similar to instance A: the right half-pie of both is identical, as are the slices representing LDL, CRP and Alcohol. In contrast, in Figure 2, which represents the same instances A, B and C, the comparison is less immediate because the crowdworker is led to analyze one variable at a time.

This is even more evident in Figures 4 and 5, which present a less obvious case. Instance B is still more similar to instance A, but the similarities are few and cannot be established by directly comparing each variable. An overall view is necessary, and for this reason the pie-based visualization is still superior.

The last example, presented in Figures 6 and 7, is very difficult to assess. Both instances A and C are good candidates; looking carefully at the pie-based visualization, it is possible to conclude that B is more similar to A, even if only slightly.

4. Our Evaluation Workflow

4.1. The triplets of the experiment

In this study, we investigate the potential of different visualization schemes for pairwise comparison of medical records.

As a follow-up of the experiment in [2], we asked two experts to annotate triplets under the new visualization, i.e. to assess whether an individual is more similar to a healthy or to a diseased individual, using hepatic steatosis as the outcome. Both experts conduct research on active learning, prediction and classification. They do not know the SHIP dataset. Each expert was asked to complete 30 annotation tasks plus 3 tasks of different levels of difficulty. Furthermore, they had to express the perceived difficulty of annotating each triplet by choosing one of the following four answers: “very certain”, “rather certain”, “rather uncertain”, and “very uncertain”.

For choosing the triplets we used the dataset as presented and described in [2]. There we randomly selected 90 records out of 852 individuals of SHIP-2. The individuals are categorized into the following three categories: “no hepatic steatosis” (liver fat fraction ≤ 5.0%, n = 501), “mild hepatic steatosis” (5.0% < liver fat fraction < 14%, n = 238), and “moderate to severe hepatic steatosis” (liver fat fraction ≥ 14%, n = 113) [27]. More specifically, we selected 45 individuals from the class “no hepatic steatosis” and 45 from the class “moderate to severe hepatic steatosis” and split each of these two subsamples into three groups of 15 individuals. For each subject, ten risk factors of hepatic steatosis are reported: age, sex, alanine-aminotransferase (ALAT), low-density lipoprotein (LDL) cholesterol, alcohol consumption, hypertension, beta-blocker intake, type 2 diabetes mellitus, smoking status, and C-reactive protein (CRP).
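The class thresholds and the selection step can be sketched as follows. This is an illustration under assumptions, not the procedure actually used: the record layout (a dictionary with a `"fat"` entry for the liver fat fraction in percent), the function names and the synthetic records are hypothetical; only the thresholds, the class sizes and the 45/15 split come from the text.

```python
import random
from typing import Dict, List


def steatosis_class(liver_fat_fraction: float) -> str:
    """Assign the class from the liver fat fraction (in percent)."""
    if liver_fat_fraction <= 5.0:
        return "no hepatic steatosis"
    if liver_fat_fraction < 14.0:
        return "mild hepatic steatosis"
    return "moderate to severe hepatic steatosis"


def select_records(records: List[Dict[str, float]], per_class: int = 45,
                   group_size: int = 15, seed: int = 0) -> Dict[str, List[List[Dict[str, float]]]]:
    """Draw per_class records from each of the two extreme classes and split them into groups."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    selected = {}
    for label in ("no hepatic steatosis", "moderate to severe hepatic steatosis"):
        pool = [r for r in records if steatosis_class(r["fat"]) == label]
        sample = rng.sample(pool, per_class)
        selected[label] = [sample[i:i + group_size] for i in range(0, per_class, group_size)]
    return selected


# Synthetic stand-in records that only reproduce the reported class sizes.
records = [{"fat": 2.0}] * 501 + [{"fat": 9.0}] * 238 + [{"fat": 20.0}] * 113
groups = select_records(records)
print({label: [len(g) for g in parts] for label, parts in groups.items()})
```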

4.2. Evaluation Criteria

Our scenario is a controlled pairwise comparison experiment, in which we want to find out which features catch the participants’ eye under each configuration and which configuration helps them most in finding the ‘good features’. The configurations are (a) our new color-based one and (b) the baseline used in [2].

To compare the new graphic model with the one of [2], we run the experiment with the same triplets and compute correctness. We use the average correctness as the performance indicator to evaluate the new graphic model for different degrees of task difficulty.

To evaluate the performance of both methods, we define the following evaluation criteria:
• Correct classifications
• Score, to compare the two visualizations: how often one was correct under each visualization
In addition, we present the uncertainty of experts for the new visualization.
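The two criteria above amount to simple counting, as the following sketch shows. The function names and the answer sequences are hypothetical and serve only to illustrate the computation; they are not data from the experiment.

```python
from typing import Dict, Sequence


def average_correctness(answers: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of triplets for which the annotator chose the correct side."""
    if len(answers) != len(truth) or not answers:
        raise ValueError("answers and truth must be non-empty and of equal length")
    return sum(a == t for a, t in zip(answers, truth)) / len(answers)


def score(per_visualization: Dict[str, Sequence[str]], truth: Sequence[str]) -> Dict[str, int]:
    """Number of correct answers under each visualization, for the same triplets."""
    return {name: sum(a == t for a, t in zip(answers, truth))
            for name, answers in per_visualization.items()}


# Hypothetical answers ("A" or "C": the instance judged more similar to B).
truth = ["A", "C", "C", "A"]
answers = {"old": ["A", "A", "C", "C"], "new": ["A", "C", "C", "C"]}
print(score(answers, truth))
print(average_correctness(answers["new"], truth))
```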

5. Findings

5.1. Findings with the proposed visualization

In Table 1 we show the annotations of the two experts. They differ in 6 tasks (marked in bold). Furthermore, the column “Uncertainty” shows the perceived difficulty per triplet.

As depicted, both experts are most often “rather certain” in their annotations: 14 and 12 times out of 30, respectively. They are “rather uncertain” in 9 and 10 triplets out of 30, and “very uncertain” in 4 and 6 triplets. The answer “very certain” was given least often: only 3 and 2 times out of 30 triplets.
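The counts reported in this paragraph can be turned into shares per answer, and into an overall split between certain and uncertain answers, with a few lines of code. The sketch below uses only the counts stated in the text; the variable and function names are illustrative.

```python
LEVELS = ("very certain", "rather certain", "rather uncertain", "very uncertain")

# Counts per expert out of 30 triplets, in the order of LEVELS, as reported in the text.
COUNTS = {"Expert 1": (3, 14, 9, 4), "Expert 2": (2, 12, 10, 6)}


def shares(counts):
    """Relative frequency of each answer."""
    total = sum(counts)
    return {level: n / total for level, n in zip(LEVELS, counts)}


def certain_share(counts):
    """Share of answers that are 'very certain' or 'rather certain'."""
    return (counts[0] + counts[1]) / sum(counts)


for expert, counts in COUNTS.items():
    print(expert, {k: round(v, 2) for k, v in shares(counts).items()},
          "certain overall:", round(certain_share(counts), 2))
```

Under these counts, Expert 1 is on the certain side in 17 of 30 triplets and Expert 2 in 14 of 30.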

It is remarkable that the annotations of the easy, medium and difficult triplets differ when these are annotated again. T11 represents the easy triplet; here the perceived difficulty changes slightly. For the triplet of medium difficulty, the annotation changes completely: under T18, both experts annotated incorrectly, but they annotated correctly when they annotated this task again. For the difficult triplet, the perceived difficulty changes from “very uncertain” to “rather uncertain”. The annotation remains the same, but incorrect.

5.2. Comparison to the baseline visualization

In Table 2 we juxtapose how the experts annotated the triplets under both visualizations. For better comparability, we removed the annotations of one expert for the old version; this expert is an epidemiologist and created the dataset.

The annotations differ in 14 out of 30 tasks; these are marked in bold. The new visualization was annotated slightly better than the old one: correctness is better for the easy triplets and similar for the medium and the difficult ones. The average correctness is 0.50 for the old visualization and 0.57 for the new one. This could also be related to the choice of experts. For the old visualization, a physician annotated the triplets and another expert knew the SHIP-2 dataset. In contrast, the two new experts for the new visualization have no medical background and do not know the data set. We are not trying to find the most globally influential variable, since the important variables vary per triplet. Therefore, each variable has the same position in each triplet.

Table 1: Correctness and uncertainty of the two experts per triplet (last rows: T31, T32, T33)

Correctness, Expert 1 and Expert 2: yes no yes yes yes yes no no yes yes yes yes yes no no no no no yes yes yes yes no no yes yes no no no no yes yes yes yes yes yes yes no no no yes yes yes yes no yes yes yes no yes yes yes no no no no no yes no no yes no no yes no no

Uncertainty, Expert 1: rather uncertain, rather uncertain, rather certain, very certain, rather certain, rather uncertain, rather uncertain, rather certain, very certain, very certain, rather certain, rather uncertain, rather certain, rather certain, rather certain, very uncertain, rather certain, very uncertain, rather certain, rather uncertain, rather uncertain, rather certain, rather uncertain, rather uncertain, rather certain, very uncertain, rather certain, rather uncertain, rather certain, very uncertain

Uncertainty, Expert 2: rather certain, rather uncertain, rather uncertain, very certain, rather certain, rather uncertain, rather uncertain, rather certain, very certain, rather uncertain, rather certain, very uncertain, rather certain, rather certain, rather certain, rather uncertain, very uncertain, very uncertain, rather certain, rather certain, rather certain, rather uncertain, rather uncertain, rather certain, rather certain, very uncertain, rather uncertain, very uncertain, rather certain, very uncertain, very uncertain, rather uncertain, rather certain, rather certain, very uncertain, rather certain

6. Conclusion and Future Work

In this work, we investigated the potential of different visualization schemes for medical records. We examined in an experiment whether a new visualization leads to better annotation in terms of correctness, and compared the results with expert annotations under a previous visualization. As a next step, we will investigate the role of stress as a confounder. We will also extend the experiment to non-experts and focus on uncertainty, in order to further improve the visualization and thus obtain better annotation results. Moreover, we will investigate which features affect correctness and how to combine our approach with semi-supervised pairwise comparisons.

6.1. Further possibilities for data annotation

In addition to various visualization methods, annotation can also take place on the basis of raw data, for example as tabular data (see Table 3). Table 3 shows a simple triplet. For the middle instance B, the annotator has to decide whether it is more similar to instance A or to instance C. Similar variables are marked in blue (B more similar to A) or orange (B more similar to C). In this example, the IRIS data set consists of only a few variables, so that the assessment is manageable. Annotators would look at how many matches there are per variable (the class is not visible) and then decide whether instance B is more similar to instance A or to instance C. A more difficult example is shown in Table 4. It is also based on the IRIS data set, but the assignment is made more difficult by the similarity of instances A and C. The variable “sepal length” is not discriminative in this example, so annotators could possibly ignore it in the decision-making process. In the triplet as a whole, instance B is slightly more similar to instance C than to instance A. As soon as one variable is given a higher weight, this could either strengthen the decision or lead to a different one. With data sets that contain more variables, such as the mushroom data set, it is very difficult to inspect individual variables separately. Our suggestion would be to hide the variables whose values are identical, so that a better assignment can take place. This and the optimal number of variables per triplet will be investigated in future experiments.
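The per-variable matching that annotators are described as doing on tabular triplets, and the suggestion to hide variables with identical values, can be sketched as follows. The function names and the measurement values are hypothetical (the values are IRIS-like but are not taken from Table 3 or Table 4), and absolute difference is assumed as the per-variable notion of closeness.

```python
from typing import Dict, List


def compare_triplet(a: Dict[str, float], b: Dict[str, float], c: Dict[str, float]) -> Dict[str, int]:
    """Count the variables on which B is closer to A, closer to C, or equally close."""
    votes = {"A": 0, "C": 0, "tie": 0}
    for name in b:
        dist_a = abs(b[name] - a[name])
        dist_c = abs(b[name] - c[name])
        if dist_a < dist_c:
            votes["A"] += 1
        elif dist_c < dist_a:
            votes["C"] += 1
        else:
            votes["tie"] += 1
    return votes


def informative_variables(a: Dict[str, float], b: Dict[str, float], c: Dict[str, float]) -> List[str]:
    """Variables to show: those whose values are not identical across A, B and C."""
    return [name for name in b if not (a[name] == b[name] == c[name])]


# Illustrative values only.
a = {"sepal length": 5.0, "sepal width": 3.5, "petal length": 1.4, "petal width": 0.2}
b = {"sepal length": 5.0, "sepal width": 3.4, "petal length": 1.5, "petal width": 0.2}
c = {"sepal length": 5.0, "sepal width": 2.5, "petal length": 4.5, "petal width": 1.5}
print(compare_triplet(a, b, c))
print(informative_variables(a, b, c))
```

Counting votes treats all variables as equally important; a weighted variant would correspond to the case, mentioned above, in which one variable is given a higher weight.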

Tables 3 and 4: example triplets from the IRIS data set (columns: instance A, B, C)

Funding

SHIP is part of the Community Medicine Research net of the University of Greifswald, Germany, supported by the Federal Ministry of Education and Research (grants no. 01ZZ9603, 01ZZ0103, and 01ZZ0403), the Ministry of Cultural Affairs as well as the Social Ministry of the Federal State of Mecklenburg-West Pomerania.

Declaration on Generative AI

The author(s) have not employed any Generative AI tools.

References

[1] Simard, Rönnqvist, Lebel, Lehoux, A method to classify data quality for decision making under uncertainty, ACM Journal of Data and Information Quality (2022).
[2] A. Rother, U. Niemann, Hielscher, H. Völzke, T. Ittermann, M. Spiliopoulou, Assessing the difficulty of annotating medical data in crowdworking with help of experiments, PloS one 16 (2021) e0254764.
[3] A. Rother, T. Ittermann, M. Spiliopoulou, Semi-supervised learning with pairwise instance comparisons for medical instance classification, in: International Symposium on Intelligent Data Analysis, Springer, 2025. To appear.
[4] A. Holzinger, Interactive machine learning for health informatics: when do we need the human-in-the-loop?, Brain Informatics 3 (2016) 119–131.
[5] Kleindessner, U. von Luxburg, Kernel functions based on triplet comparisons, in: Advances in neural information processing systems, 2017, pp. 6807–6817.
[6] Diersch, J. P. Valdes-Herrera, Tempelmann, T. Wolbers, Increased hippocampal excitability and altered learning dynamics mediate cognitive mapping deficits in human aging, Journal of Neuroscience 41 (2021) 3204–3221.
[7] Ahonen, Cowley, Torniainen, Ukkonen, Vihavainen, Puolamaki, S1: Analysis of electrodermal activity recordings in pair programming from 2 dyads, PLoS One. Retrieved from http://journals.plos.org/plosone/article/asset (2016).
[8] S. Sharifi Noorian, S. Qiu, U. Gadiraju, J. Yang, A. Bozzon, What should you know? a human-in-the-loop approach to unknown unknowns characterization in image recognition, in: Proceedings of the ACM Web Conference 2022, 2022, pp. 882–892.
[9] L. Yao, S. Li, Y. Li, M. Huai, J. Gao, A. Zhang, Representation learning for treatment effect estimation from observational data, Advances in Neural Information Processing Systems 31 (2018) 2633–2643.
[10] E. Amid, A. Ukkonen, Multiview triplet embedding: Learning attributes in multiple maps, in: International Conference on Machine Learning, 2015, pp. 1472–1480.
[11] N. Jambigi, T. Chanda, V. Unnikrishnan, M. Spiliopoulou, Assessing the difficulty of labelling an instance in crowdworking, in: 2nd Workshop on Evaluation and Experimental Design in Data Mining and Machine Learning@ ECML PKDD 2020, 2020.
[12] U. Gadiraju, G. Demartini, R. Kawase, S. Dietze, Crowd anatomy beyond the good and bad: Behavioral traces for crowd worker modeling and pre-selection, Computer Supported Cooperative Work (CSCW) 28 (2019) 815–841.
[13] M. Schaekermann, E. Law, K. Larson, A. Lim, Expert disagreement in sequential labeling: A case study on adjudication in medical time series analysis, in: SAD/CrowdBias@ HCOMP, 2018, pp. 55–66.
[14] S. Kairam, J. Heer, Parting crowds: Characterizing divergent interpretations in crowdsourced annotation tasks, in: Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, 2016, pp. 1637–1648.
[15] H. Völzke, J. Schössow, C. O. Schmidt, C. Jürgens, A. Richter, A. Werner, N. Werner, D. Radke, A. Teumer, T. Ittermann, et al., Cohort profile update: The study of health in pomerania (ship), International journal of epidemiology (2022).
[16] J. D. Tucker, S. Day, W. Tang, B. Bayus, Crowdsourcing in medical research: concepts and applications, PeerJ 7 (2019) e6762.
[17] C. Wang, L. Han, G. Stein, S. Day, C. Bien-Gund, A. Mathews, J. J. Ong, P.-Z. Zhao, S.-F. Wei, J. Walker, et al., Crowdsourcing in health and medical research: a systematic review, Infectious diseases of poverty 9 (2020) 1–9.
[18] K. Wazny, Applications of crowdsourcing in health: an overview, Journal of global health 8 (2018).
[19] A. A. Joshi, M. Chong, J. Li, S. Choi, R. M. Leahy, Are you thinking what i’m thinking? synchronization of resting fmri time-series across subjects, NeuroImage 172 (2018) 740–752.
[20] A. Huaulmé, F. Despinoy, S. A. H. Perez, K. Harada, M. Mitsuishi, P. Jannin, Automatic annotation of surgical activities using virtual reality environments, International journal of computer assisted radiology and surgery 14 (2019) 1663–1671.
[21] O. Legetth, J. Rodhe, S. Lang, P. Dhapola, M. Wallergård, S. Soneji, Cellexalvr: A virtual reality platform to visualize and analyze single-cell omics data, Iscience (2021) 103251.
[22] E. E. Kpokiri, R. John, D. Wu, N. Fongwen, J. Z. Budak, C. C. Chang, J. J. Ong, J. D. Tucker, Crowdsourcing to develop open-access learning resources on antimicrobial resistance, BMC infectious diseases 21 (2021) 1–7.
[23] M. van Deursen, L. Reuvers, J. D. Duits, G. de Jong, M. van den Hurk, D. Henssen, Virtual reality and annotated radiological data as effective and motivating tools to help social sciences students learn neuroanatomy, Scientific Reports 11 (2021) 1–10.
[24] N. Nissim, M. R. Boland, N. P. Tatonetti, Y. Elovici, G. Hripcsak, Y. Shahar, R. Moskovitch, Improving condition severity classification with an efficient active learning based framework, Journal of biomedical informatics 61 (2016) 44–54.
[25] W. Shi, V. S. Sheng, X. Li, B. Gu, Semi-supervised multi-label learning from crowds via deep sequential generative model, in: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1141–1149.
[26] P. A. Traganitis, G. B. Giannakis, Bayesian semi-supervised crowdsourcing, arXiv preprint arXiv:2012.11048 (2020).
[27] J.-P. Kühn, D. Hernando, A. Muñoz del Rio, M. Evert, S. Kannengiesser, H. Völzke, B. Mensel, R. Puls, N. Hosten, S. B. Reeder, Effect of multipeak spectral modeling of fat for liver iron and fat quantification: correlation of biopsy with mr imaging results, Radiology 265 (2012) 133–142.