
Portorož, Slovenia. Contact: C.Debruyne@uliege.be (C. Debruyne)

A Protocol for KG Construction Tasks Involving Users

Ademar Crotti Junior

Christophe Debruyne

Montefiore Institute, University of Liège, Belgium

2025


Knowledge graph construction (KGC) from (semi-)structured data is challenging, and facilitating user involvement is an issue frequently brought up within this community. We cannot deny the progress made with respect to (declarative) knowledge graph construction languages and the tools that help build such mappings. However, it is surprising that no two user studies report on similar protocols. This heterogeneity does not allow for comparing KGC languages, techniques, and tools. This paper first analyses studies involving users to identify points of comparison and gaps. These gaps include a lack of consistency in task design, participant selection, and evaluation metrics. Moreover, a systematic way of analyzing the data and reporting the findings is also lacking. We thus propose a user study protocol for KGC designed to address this challenge. Where possible, we draw elements from the literature that we deem fit for such a protocol. The protocol, as such, allows for the comparison of languages and techniques for the RDF Mapping Language (RML) core functionality, which is covered by most other state-of-the-art techniques and tools. We also propose how the protocol can be amended to compare extensions (of RML). This protocol provides an important step towards more comparable KGC user studies.

Keywords: KG Construction, User studies, Research Methods

1. Introduction

We preach to the choir that knowledge graphs are essential for meaningfully organizing and representing information in various domains. However, as knowledge graphs grow in complexity, efficient methods for their construction (or generation) are crucial. When dealing with the challenges of (semi-)structured data sources, such as the lack of explicit semantics, which need to be aligned with ontologies or vocabularies, creating such mappings becomes a knowledge engineering task where user involvement is crucial. Users bring the necessary domain expertise to ensure the mappings are appropriate.

Scholars have systematically analyzed the functionalities of Knowledge Graph Construction (KGC) tools and proposed benchmarks to analyze their behavior in different settings and their memory and CPU usage. It is thus surprising that user involvement and the perception of the users of these languages and tools have yet to be studied in such detail. Conducting such a study for all languages and tools is an infeasible undertaking for one group, but what is feasible is putting forward a protocol that scholars in the domain should adopt to report on user studies. This paper aims to achieve this goal by proposing a user study protocol for KGC.

This will enable researchers to compare different knowledge graph construction languages and techniques, leading to a better understanding of their strengths and weaknesses and, ultimately, to more effective tools for KGC.

This paper’s contributions are twofold. First, we review user studies in the KGC domain, which reveal the abovementioned challenges. The second contribution is the protocol. Section 2 provides an overview of the related work on (declarative) approaches to KGC by mapping data sources onto RDF datasets, focusing on those that explicitly report on user studies. The goal is to show that no two papers adopt the same protocol, which makes comparing studies impossible. Section 3 presents the protocol, which we have made available under a CC-BY-SA 4.0 license. The protocol provides detailed guidelines for recruiting participants and disclosing potential biases. The process covers informed consent, pre-questionnaires, familiarization activities, task execution, and post-questionnaires. The protocol comprises five tasks, of which the last two can be changed when comparing two groups, while the first three ensure a common base for comparison. The related work will show that reporting is often limited to simple metrics and averages. Still, we deem it important to analyze the relationships between task execution, perceived usefulness, and perceived cognitive load. To this end, Section 4 proposes the statistical means to use when adopting this protocol. In Section 5, we discuss the resource from various perspectives, such as the scientific and the technical, to elaborate on the soundness of our approach. This section also discusses some of the limitations. Section 6 then concludes the paper and proposes future directions.

2. Related Work

In [1], the authors presented an excellent survey on declarative KGC tools to help the community and practitioners choose which languages, tools, or techniques fit their needs. However, the article looks at those from a technical perspective, namely the functionalities offered by the different options. In [2], the authors proposed a benchmark to compare KGC tools and applied it to some well-known implementations such as RMLMapper¹, Morph-KGC [3], and SDM-RDFizer [4]. It is surprising to see that the perceptions of users and practitioners have yet to be examined in a systematic manner.

From a broader perspective, [5] describes three “personas” that engage with KGs: KG builders, KG analysts, and KG consumers, which were distilled from interviews with practitioners. As the name intuitively implies, the KG builder persona would be responsible for generating the KG from heterogeneous sources, but the persona is also in charge of ontology engineering. The authors state that builders could benefit from tools that help them ensure that the schema is respected (what the authors call an “enforcer”) as well as adequate visualization tools. While the paper does not explicitly mention KG construction and mappings as tasks, they fall under the “data integration” umbrella. The interviews indicate that there are challenges impeding uptake.

It seems that practitioners’ or users’ roles are sometimes neglected. This is certainly the case for KG construction, as we will demonstrate via our literature review. Our review looked at the following papers reporting on users, their experiences, and/or perceived usability: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]. Most of these studies looked at the creation of mappings. Exceptions are [13], reporting on studies on mapping understanding, and [16], reporting on a complex data flow that included mapping creation.² We compare the various aspects of these user studies in Tables 1, 2, and 3. From these tables, we can observe several important points:

• Some report on comparing mapping languages and/or tools (e.g., [11] and [12]), and others report on comparing mapping languages (e.g., [14] and [15]). Quite a few papers merely report on the perceived usability of their tool without any comparison. We argue that reporting on user studies only makes sense if there is a basis for comparison. Without a reference point (e.g., a comparable tool evaluated under a shared protocol) or even a common protocol to establish such points, it becomes difficult to interpret the significance of usability results. For instance, reporting that users found a tool “easy to use” or completed a task within a certain time frame lacks context unless these outcomes can be meaningfully compared. Comparative user studies are therefore essential to provide this community with insights and to guide the development of more effective knowledge graph construction languages and tools.
• Looking at the procedure, we see many recurring elements (sharing of (training) resources, pre-assessment surveys, introduction of tasks, surveys, etc.). No two procedures are the same, which limits our ability to compare the studies. Some studies reported asking about information such as gender and age but did not report on those in the data analysis.
• Most studies involved participants with expertise in IT, databases, and/or semantic web technologies. Many studies also report inviting MSc students in computer science or related fields. Self-reported prior knowledge and competencies are a recurring theme, but no two studies tackle this aspect comparably.
• The same heterogeneity can be observed for the tasks and datasets, where we do notice that most studies adopt datasets that do not require specific domain expertise (people, movies, places, etc.).
• Recurring themes in the data being analyzed are time (efficiency), accuracy, and perceived usability. Most rely on the System Usability Scale (SUS) [21] for perceived usability. A few studies rely on the Post-Study System Usability Questionnaire (PSSUQ) [22] to obtain information on perceived information quality, interface quality, and system usefulness. Such studies do provide a means to compare results. Few studies have reported on qualitative feedback from users and the mental workload of tools and mapping languages.
• Most studies merely report on averages, which can arguably make sense when authors only report on one group and tools or languages are not compared. Few studies employed techniques to analyze whether groups are (significantly) different or whether certain aspects had a (statistically significant) impact on efficiency, accuracy, etc.

¹ https://github.com/RMLio/rmlmapper-java

² Excluded from this survey are publications that do not report on users. For example, in [17], the authors reported involving participants, but no report on the participants’ experience was made. Other examples of papers mentioning users, participants, etc. without any detailed reporting include [18], [19], and [20].

From this survey, we can conclude that there is a critical need for homogeneous protocols, including tasks, for comparing advances in KG construction (KGC) approaches (mapping languages and tools alike). In the next section, we propose a protocol to address these issues and describe how it can be used.

3. The KG Construction User Study Protocol

This section presents the protocol for comparing KGC tools and languages. The structure of the protocol³ includes placeholders so that its text can be easily adapted for research ethics applications.

The protocol can be used to analyze the perceived usability and cognitive load of a mapping language or tool, as well as the accuracy achieved by users and the task execution time. When scholars use this protocol on only one group, the hypotheses are limited to comparing participants with different backgrounds or comparing the results with other experiments using that same protocol. Scholars adopting this protocol can easily extend it to compare two tools, techniques, or languages. This will be explained in Section 3.5.

The focus of the protocol is on facilitating the discovery of problems (i.e., formative testing), not on task-level measurements; [23] describes the difference between the two. The protocol nevertheless proposes task-level measurements and techniques for analyzing the data. When sample sizes are small, these can merely give insights.

3.1. Participant Selection

Adopters of the protocol should indicate how participants are recruited and from where. Adopters should disclose potential biases by providing details about factors influencing the study or participant behavior. Examples include the hierarchical relationship between research group leaders and researchers, as well as students recruited from classes. There is also a difference between voluntary participation and mandatory participation (e.g., in the context of a teaching activity), which may lead to non-probability samples. The absence of certain participants may lead to response bias, which is the possible impact on observations had those participants taken part in the experiment. In [24], the authors describe how disclosing participant selection is important to recognize potential biases.

Practitioners involved with UI and UX often state that five users are sufficient to discover most of the usability problems. That is more likely the case for problem discovery than for task-level measurements, which require larger sample sizes [23]. As [25] observed in an experiment, “increasing the number from 5 to 10 can result in a dramatic improvement in data confidence.” They also found that increasing the number to 20 practically guarantees all problems to be seen, but we recognize that recruiting participants is difficult. As such, we (strongly) recommend 10 participants per group.

³ https://github.com/chrdebru/kgc-user-study-protocol

Table fragment (procedures of the surveyed studies):
• Comparison of YARRRML and SPARQL Anything: 1-hour study; tutorial and material provided days before the experiment; self-assessment on Semantic Web and mapping languages used, and demographics; task; nonstandard evaluation questionnaire.
• De Brouwer et al., 2024 (RMLEditor and Matey; RML and YARRRML; developers use Matey, non-experts use RMLEditor): self-assessment on Semantic Web; briefing about technologies and tools; task; usability assessment and some specific questions.

3.2. Process

Participants begin by reviewing informed consent materials and completing a pre-questionnaire assessing their backgrounds, prior knowledge, and expectations. Next, they attend a presentation introducing the technology, review relevant documentation, and engage in a familiarization activity. The core of the study involves participants executing a defined task using the technology. Finally, participants complete a post-questionnaire evaluating their experience, including usability, efficiency, and perceived cognitive load. This structured process aims to gather comprehensive data on user interaction and perception of the technology.

Next to presenting and demonstrating the tool or mapping language, we also request participants to handle the environment. We deem this familiarization activity novel compared to the related work. This activity ensures participants are comfortable executing mappings within the provided (tool’s) environment. We guide participants through the practical aspects of using the tool’s interface, such as utilizing the command line in the terminal or identifying the correct buttons to click. This focused familiarization prevents the environment from becoming an obstacle, allowing us to assess the tool or language’s usability and gather unbiased feedback on its functionalities.

Furthermore, we ask authors to report on how responses were submitted (e.g., email, paper, form) and the anticipated duration of the experiment. While in-class experiments often have time constraints, other environments may be more flexible. Our protocol foresees 1 hour for the five tasks. If, for example, all steps are conducted in a classroom setting, the experiment would require 2 hours and 30 minutes in total (i.e., including the questionnaires and consulting the training material). Finally, authors should clarify whether participants can ask questions during the experiment. Help should be limited to aspects not core to the KG construction process and experiment. For example, helping participants navigate to the correct directory in a terminal or assessing whether there is a network issue is permitted, but providing help to execute a mapping is not. Ideally, studies should report on such requests for help (and their number of occurrences).

Table fragments (tasks, datasets, measures, and analyses of the surveyed studies, including Heyvaert et al., 2016 and Crotti Junior et al., 2017). Datasets include employees and projects, movies and directors (DBpedia), and the Northwind database. Tasks range from one mapping in three parts (two classes and linking them) to three tasks of low, medium, and high complexity, as well as mapping understanding. Measures include time, accuracy, completion rate, usability (SUS or PSSUQ), mental workload (MWL and NASA-TLX), mouse and keyboard activity, users’ preference and confidence, and qualitative feedback. Analyses range from averages, standard deviation, and min and max values to ANOVA, Kruskal-Wallis, Anderson-Darling, Welch, Wilcoxon, and Friedman tests, normality tests, Pearson and Spearman correlations, and reliability.

Figure 1: Entity-Relationship Diagram of the Universe of Discourse, with the entities Project (Project_ID, name, start_date, end_date), Employee (Employee_ID, first_name, last_name), and Task (Task_ID, description_en, description_fr), and the relationships “managed by” and “part of”.

3.3. Pre-questionnaire

Studies often inquire about participants and group them based on self-reported information on their background and proficiency with specific techniques. We aim to homogenize this by proposing an exhaustive list of current roles, formal training, and self-perceived competency levels in certain Semantic Web technologies. We also included three questions related to intrinsic motivation (enjoyment, curiosity, and value). These questions allow us to explore potential self-selection bias in voluntary participation by examining correlations between motivation levels, task performance, and perceived usability. Mandatory settings, in contrast, introduce the risk of low engagement among less intrinsically motivated participants; the same questions enable us to assess how motivation levels influence task performance and perceived usability in those settings.

3.4. Mapping tasks

Participants will be requested to complete five mapping tasks, some of which are interdependent. We can observe that different studies adopt different domains, some of which are specific (e.g., health care). We argue that participants should not have to question the domain used in the experiment. Therefore, we propose a domain that is sufficiently generic and accessible for participants to understand. Our protocol proposes mapping data about projects, project tasks, and employees who manage projects and are assigned tasks. Figure 1 depicts the Universe of Discourse (UoD) of the data to be transformed into RDF. We use an Entity-Relationship Diagram (ERD), but JSON and XML files can easily represent the data.⁴ To avoid confusing participants, we ensured that all attribute names are unambiguous. In this simple UoD, all relations are many-to-one, though this can be easily extended to many-to-many when transforming documents.

The tasks can be summarized as follows:

1. Generate instances of ex:Employee with their first and last names. The IRIs of employees are based on the name.
2. Generate instances of ex:Project with their name, start date, and end date. Both dates are of the type xsd:date, allowing us to assess the creation of typed literals. The IRIs of projects are based on the project’s ID.
3. Generate ex:managedBy properties from projects to employees.
4. Generate instances of ex:Task with their descriptions (in two languages). The IRIs of tasks are based on the task’s ID. The descriptions allow us to assess the creation of language tags.
5. Generate ex:of and ex:assignedTo properties from tasks to, respectively, projects and employees.

We point out that the mappings mainly focus on RML-core [20] functionality. Multi-valued expression maps are part of RML-core but are irrelevant for CSV files. One can easily adapt the JSON files to provide one or more task descriptions in different languages to test more complex language maps, for instance.

⁴ Examples are provided in the GitHub repository under the “assets” directory.

We draw attention to the fact that employees’ IRIs are based on their names, whereas the project data sources refer to employees via their IDs. This requires users to join the data in the two sources. In the case of RML, this requires users to “join” the two sources either at the level of the logical source or by using a referencing object map.
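To make the join requirement concrete, the following is a minimal sketch of how the expected triples for tasks 1 to 3 could be derived from two CSV sources. It is not part of the protocol: the namespace `http://example.org/`, the IRI templates, the property names (`ex:firstName`, `ex:lastName`, `ex:name`, `ex:startDate`, `ex:endDate`), the foreign-key column `managed_by`, and the sample rows are all hypothetical choices made for illustration; only the attribute names of Figure 1 and `ex:Employee`, `ex:Project`, and `ex:managedBy` come from the text.

```python
import csv
import io
from urllib.parse import quote

EX = "http://example.org/"  # hypothetical namespace bound to the ex: prefix
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"

# Made-up sample data; "managed_by" is an assumed foreign-key column name.
EMPLOYEES_CSV = """Employee_ID,first_name,last_name
1,Ada,Lovelace
2,Alan,Turing
"""
PROJECTS_CSV = """Project_ID,name,start_date,end_date,managed_by
10,Atlas,2025-01-01,2025-06-30,2
"""


def iri(value):
    """Wrap an absolute IRI in angle brackets (N-Triples style)."""
    return f"<{value}>"


def literal(value, datatype=None):
    """Serialise a plain or typed literal (N-Triples style)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"' + (f"^^<{datatype}>" if datatype else "")


def expected_triples(employees_csv, projects_csv):
    """Return the expected graph for tasks 1-3 as a set of triples."""
    triples = set()
    employee_iri_by_id = {}  # lookup table that realises the join

    # Task 1: employees, with IRIs based on their names.
    for row in csv.DictReader(io.StringIO(employees_csv)):
        name_part = quote(f"{row['first_name']}_{row['last_name']}")
        subject = iri(f"{EX}employee/{name_part}")
        employee_iri_by_id[row["Employee_ID"]] = subject
        triples.add((subject, iri(RDF_TYPE), iri(EX + "Employee")))
        triples.add((subject, iri(EX + "firstName"), literal(row["first_name"])))
        triples.add((subject, iri(EX + "lastName"), literal(row["last_name"])))

    for row in csv.DictReader(io.StringIO(projects_csv)):
        # Task 2: projects, with IRIs based on the ID and xsd:date literals.
        subject = iri(f"{EX}project/{quote(row['Project_ID'])}")
        triples.add((subject, iri(RDF_TYPE), iri(EX + "Project")))
        triples.add((subject, iri(EX + "name"), literal(row["name"])))
        triples.add((subject, iri(EX + "startDate"), literal(row["start_date"], XSD_DATE)))
        triples.add((subject, iri(EX + "endDate"), literal(row["end_date"], XSD_DATE)))
        # Task 3: the project refers to an employee ID, but the employee IRI
        # is name-based, so the ID must be resolved through the join.
        triples.add((subject, iri(EX + "managedBy"), employee_iri_by_id[row["managed_by"]]))

    return triples
```

A reference graph built this way could serve as the "expected graph" against which participants' outputs are compared; in an actual study the reference would of course be produced with the mapping language under evaluation.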

3.5. Variants

We stated that the tasks focus on RML-core functionality, which is also covered by other KGC languages and techniques such as ShExML [14] and SPARQL Anything [26]. While one can argue that the set of desired functionalities is limited, [20] covered the requirements of all RML extensions. Formulating tasks that cover practically all possible features and scenarios, not only in time but also in data source complexity, is not feasible. However, when a particular feature or scenario needs to be investigated, the protocol can be amended. This section describes how this protocol can be adapted and used to compare different tools, features, or aspects.

• One can provide the same tasks to two different groups when comparing languages, techniques, or tools. One can compare different mapping languages (e.g., ShExML vs. RML), compare editors vs. “bare bones” languages (e.g., RMLEditor vs. RML), compare languages and abstractions of languages (e.g., RML vs. YARRRML), and even different editors and languages.
• When the aim is to compare a tool or language’s support for an advanced KGC requirement such as named graphs, collections and containers, or RDF-star (among others), one can take this protocol as is for one group and, for the second group, only change the last two tasks so that they cover those requirements. This design ensures that both groups share a comparable foundation (tasks 1-3), allowing us to isolate and evaluate the impact of the new features introduced in the final tasks.
• We have proposed a generic and accessible domain for the protocol to ensure broad applicability. However, the protocol can be adapted to include similar tasks within a different (and domain-specific) context. This adaptation would enable researchers to assess whether a language or tool designed for a particular application domain performs better in its setting. In such cases, it is important to compare the tool or language across both the original (generic) domain and the adapted (domain-specific) version.

Participants are assumed to have access to prepared “resources” or “environments” so that they can focus on the tasks. In the case of RML, for instance, the logical sources would be provided in the tool or made available for participants to copy and paste. This setup also allows researchers to assess the languages and tools with respect to these aspects: one group receives the prepared artifacts, while the other is asked to formulate the logical sources themselves. We deem this a special case of comparing a baseline with an extension as described above, but where the five tasks remain unchanged.

3.6. Post-questionnaire

Both SUS and PSSUQ are used to measure usability. SUS is adequate for a rapid and general measure of a system’s usability. Still, the latter offers more advantages because it assesses three aspects of a system: system usefulness, information quality, and interface quality. Furthermore, there is a question about the system as a whole, which allows one to dampen the perception of individual aspects. The original PSSUQ survey uses 19 questions (as adopted by [10], for instance), but recent iterations have removed three redundant questions.

As for the perceived (mental) workload, we adopt both the Workload Profile (WP) [27] and the NASA Task Load Index (NASA-TLX) [28].

• WP adopts a theory in which participants have different capacities (dimensions) related to the stage, mode, input, and output of information processing. The eight dimensions are each quantified through subjective ratings: after task completion, participants rate the proportion of attentional resources used for performing a given task with a value from 0 to 100. A rating of 0 means that the task placed no demand, while 100 indicates that it required maximum attention. The WP of a participant is calculated as

$$\mathrm{WP} = \frac{1}{8} \sum_{i=1}^{8} d_i,$$

where $d_i$ is the rating given to dimension $i$.

• NASA-TLX has been validated in several domains [28] and combines six factors believed to influence the mental workload. Each factor is quantified with a subjective judgment and a weight computed via a paired comparison procedure. For each possible pair of the six factors, participants must decide which factor contributed the most to the mental workload during the task. The weights are the number of times each factor was selected. The possible weights range from 0 (irrelevant) to 5 (most important). The final score is computed as a weighted average, considering the subjective rating $r_i$ of each factor and the corresponding weight $w_i$:

$$\mathrm{TLX} = \frac{1}{15} \sum_{i=1}^{6} w_i \cdot r_i.$$

It is possible to calculate the scores by eliminating the weighting procedure, which yields the so-called Raw TLX.
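The two workload scores can be computed with a few lines of code. The sketch below is illustrative only: the function names and the sample ratings are made up, the six factor labels are the standard NASA-TLX subscales rather than something stated in this paper, and Raw TLX is taken here to be the unweighted mean of the six ratings.

```python
from collections import Counter
from itertools import combinations

# Standard NASA-TLX subscales (not listed in the text above).
TLX_FACTORS = (
    "mental demand", "physical demand", "temporal demand",
    "performance", "effort", "frustration",
)


def workload_profile(ratings):
    """WP score: mean of the eight dimension ratings, each between 0 and 100."""
    if len(ratings) != 8:
        raise ValueError("WP expects exactly eight dimension ratings")
    if any(not 0 <= r <= 100 for r in ratings):
        raise ValueError("WP ratings must lie between 0 and 100")
    return sum(ratings) / 8


def tlx_weights(winners):
    """Derive weights from the paired comparisons.

    `winners` maps each of the 15 factor pairs to the factor the participant
    judged to contribute more; a weight is the number of times a factor won.
    """
    pairs = list(combinations(TLX_FACTORS, 2))
    if set(winners) != set(pairs):
        raise ValueError("one choice is needed for each of the 15 pairs")
    for pair, winner in winners.items():
        if winner not in pair:
            raise ValueError(f"{winner!r} is not part of the pair {pair}")
    counts = Counter(winners.values())
    return {factor: counts.get(factor, 0) for factor in TLX_FACTORS}


def nasa_tlx(ratings, weights):
    """Weighted NASA-TLX score: sum of weight * rating, divided by 15."""
    if sum(weights.values()) != 15:
        raise ValueError("weights must sum to 15 (one point per pair)")
    return sum(weights[f] * ratings[f] for f in TLX_FACTORS) / 15


def raw_tlx(ratings):
    """Raw TLX: unweighted mean of the six factor ratings."""
    return sum(ratings[f] for f in TLX_FACTORS) / 6


# Made-up example: the participant always picks the first factor of a pair,
# which yields the weights 5, 4, 3, 2, 1, 0.
example_winners = {pair: pair[0] for pair in combinations(TLX_FACTORS, 2)}
example_ratings = dict(zip(TLX_FACTORS, (80, 10, 60, 40, 70, 30)))
example_weights = tlx_weights(example_winners)
```

Keeping the weighted and the raw score side by side makes it easy to report both, which may help when comparing with studies that skipped the paired comparison.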

Both instruments are used in industry and research. Note that the two instruments use different rating systems, which may confuse participants in a paper survey. Erroneous inputs can be prevented by adopting electronic forms. We choose not to harmonize the scales, which has been done in [10], for instance, to obtain results that can be compared to other studies faithfully adopting those instruments.

Studies should explicitly report the method used to calculate performance measures, such as accuracy. For instance:

• Accuracy should be determined by (1) graph isomorphism (did they generate the expected graph, which is true or false), and, for more nuanced numbers, (2) precision (the proportion of triples that are generated that are in the expected graph), (3) recall (the proportion of expected triples that are in the generated graph), and (4) F-measure combining precision and recall. This approach accounts for situations where a participant generates additional triples, for instance.
• Efficiency is a metric that measures how well resources are used, which is not only time. Most studies measured the time it took for tasks to be completed. We recommend that studies report on task execution time and the method used to measure it. One can manually record time or use software to record user interactions to time the tasks. Another approach is to request users to report on the time or use electronic forms that keep track of time.

While we strongly encourage placing a time limit on the tasks for the experiment to obtain comparable results, there are two important cases to track: did not finish and did not start. The former may indicate insufficient time left to finish a task or that the task was too difficult. The latter merely indicates that the user never started the task.

We propose to limit the protocol to these five metrics (four on accuracy and one related to task execution time). Studies are free to include other metrics, such as the number of times a mapping was executed (i.e., trial and error), but that would indirectly impact the task execution time.
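The four accuracy metrics can be sketched as follows. This is an illustration under two stated assumptions: graphs are represented as sets of triples without blank nodes, so that set equality can stand in for graph isomorphism (a proper isomorphism check is needed once blank nodes appear), and the F-measure is taken to be the balanced F1 score. The function name and the sample triples are hypothetical.

```python
def accuracy_metrics(generated, expected):
    """Compare a participant's graph with the expected graph.

    Both arguments are iterables of (subject, predicate, object) triples.
    Assumes there are no blank nodes, so exact set equality is used in place
    of a graph isomorphism check.
    """
    generated, expected = set(generated), set(expected)
    correct = len(generated & expected)  # triples present in both graphs

    precision = correct / len(generated) if generated else 0.0
    recall = correct / len(expected) if expected else 0.0
    # F1: harmonic mean of precision and recall (an assumed choice of F-measure).
    if precision + recall > 0:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0

    return {
        "exact_match": generated == expected,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Made-up example: three of four expected triples plus two superfluous ones.
expected_graph = {("ex:p1", "rdf:type", "ex:Project"),
                  ("ex:p1", "ex:name", '"Atlas"'),
                  ("ex:p1", "ex:managedBy", "ex:e2"),
                  ("ex:e2", "rdf:type", "ex:Employee")}
generated_graph = {("ex:p1", "rdf:type", "ex:Project"),
                   ("ex:p1", "ex:name", '"Atlas"'),
                   ("ex:e2", "rdf:type", "ex:Employee"),
                   ("ex:p1", "ex:managedBy", "ex:e1"),
                   ("ex:e1", "rdf:type", "ex:Employee")}
example_scores = accuracy_metrics(generated_graph, expected_graph)
```

In this made-up example the additional triples lower precision (3 of 5) while the missing triple lowers recall (3 of 4), which is the nuance that a plain true-or-false comparison would hide.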

As the experiment should not be too time-consuming, we avoided interviews to obtain qualitative feedback. We also avoid “think aloud” experiments as they can impact the cognitive load. Whether or not a study collects such feedback, we recommend that it reports on any additional instruments used. There is, however, a qualitative dimension to our protocol, as the PSSUQ does leave room for comments on each of the 16 questions.

4. Results and Analysis

As part of our protocol, we recommend a structured approach for reporting collected data. As mentioned, while many user studies focus primarily on presenting averages and standard deviations, we emphasize the importance of extending these reports to include statistical tests. This ensures robust comparisons between groups and tools, enhancing the general reliability and interpretability of the experiments. The following describes the recommended aspects and tests to be considered when reporting results and analysis.

Reliability and Internal Consistency. Reliability refers to the degree to which the items within a test or survey consistently measure the same construct. High internal consistency strengthens the statistical reliability of metrics, thereby enhancing the validity of group comparisons. We recommend using Cronbach’s Alpha to evaluate internal consistency. Higher alpha coefficients indicate greater shared covariance among items, suggesting they assess the same underlying concept. A Cronbach’s Alpha value of ≥ 0.7 is generally considered acceptable [29].

Data Normality. Data normality refers to how closely the distribution of the data aligns with a normal curve. While normality is not always required, t-tests and ANOVA assume that data follows such a distribution. ANOVA is relatively robust to non-normal data when sample sizes are large, but large samples are difficult to obtain in user studies. We, therefore, require studies to test for and report on normality. Researchers may, if they wish, use other statistical measures. As the sample size of a group will likely not exceed 50, we propose the Shapiro-Wilk test to assess normality. In this test, the sample is compared to a theoretical normal distribution.

Homogeneity. Some statistical tests, such as ANOVA, assume that the variances across groups are equal. This is known as the homogeneity of variances. Again, this assumption should be verified as part of the analysis. Levene’s Test is a standard method for evaluating this assumption.

Group Comparisons. To determine whether differences between groups are statistically significant, researchers should employ well-established statistical tests. The choice of test depends on the assumptions about the data. The recommended tests when the data is normally distributed (also known as parametric tests) are Welch’s t-test when comparing two groups and ANOVA when comparing more than two groups simultaneously. Both tests assume normality; ANOVA additionally assumes homogeneity of variance, whereas Welch’s t-test does not require equal variances. The recommended tests when the data is not normally distributed (also known as non-parametric tests) are the Wilcoxon rank-sum test (also known as the Mann-Whitney U test) for comparing two independent groups and the Kruskal-Wallis test for more than two groups.
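Two of the quantities named above are simple enough to sketch with the standard library alone: Cronbach’s Alpha for internal consistency and the Welch t statistic with its Welch-Satterthwaite degrees of freedom for comparing two groups. The sketch is illustrative, the function names and sample values are made up, and it deliberately stops short of p-values, which require the t distribution.

```python
from math import sqrt
from statistics import mean, variance


def cronbach_alpha(responses):
    """Cronbach's Alpha for a questionnaire.

    `responses` holds one row per participant and one column per item.
    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(responses[0])
    if k < 2 or any(len(row) != k for row in responses):
        raise ValueError("need at least two items and rows of equal length")
    item_variances = [variance(column) for column in zip(*responses)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def welch_t(group_a, group_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Does not assume equal variances; returns (t, df) without a p-value.
    """
    se_a = variance(group_a) / len(group_a)  # squared standard error, group A
    se_b = variance(group_b) / len(group_b)  # squared standard error, group B
    t = (mean(group_a) - mean(group_b)) / sqrt(se_a + se_b)
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(group_a) - 1) + se_b ** 2 / (len(group_b) - 1)
    )
    return t, df


# Made-up task execution times (minutes) for two groups of five participants.
example_t, example_df = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

In practice, a statistics package such as SciPy offers these tests together with p-values (for example `scipy.stats.shapiro`, `scipy.stats.levene`, `scipy.stats.ttest_ind` with `equal_var=False`, `scipy.stats.mannwhitneyu`, and `scipy.stats.kruskal`); the pure-Python version is only meant to show what is being computed.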

Correlation Analysis. Correlation methods assess the strength and direction of the relationship between variables, which can provide deeper insights into study outcomes. For instance, examining correlations between usability, accuracy, and mental workload can reveal relationships in user behavior. When data is normally distributed, we recommend Pearson’s correlation to measure the strength of a relationship, for example, between accuracy and usability. Otherwise, one should use Spearman’s correlation. Researchers must report on correlations between all relevant variables (e.g., usability and mental workload, usability and accuracy, etc.) to provide a comprehensive analysis.
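As an illustration of the difference between the two coefficients, the sketch below computes Pearson’s correlation directly and Spearman’s correlation as Pearson’s correlation of average ranks (ties share the mean of their positions). The function names and sample values are made up for this example.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    mean_x, mean_y = mean(x), mean(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def average_ranks(values):
    """Rank values from 1 upwards; tied values receive their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rank correlation: Pearson's correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: a monotonic but non-linear relationship.
usability = [1, 2, 3, 4, 5]
accuracy = [1, 4, 9, 16, 25]
```

For the made-up data, Spearman’s coefficient is exactly 1 because the relationship is monotonic, whereas Pearson’s coefficient is slightly below 1 because it is not linear.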

Transparency and Accessibility. To promote transparency and reproducibility, the data and the statistical tests performed should be publicly available online, provided that all personally identifiable information is removed and participant anonymity is preserved in accordance with ethical research standards. Access to data, analysis scripts, and detailed methodology facilitates validation and enhances the study’s credibility. Moreover, sharing such data would allow one to compare results across studies more easily, provided the conditions are similar.

5. Discussion

We presented a comparison of user studies in the KGC domain and observed that no two studies followed the same protocol. This makes it impossible to compare KGC languages, tools, and software. To this end, we analyzed the related work, distilled elements we appreciated, and proposed others to establish a common protocol. As such, we proposed a resource, a user study protocol, that provides the KGC community with a better way to present, compare, and scrutinize contributions.

When designing the protocol, we selected and refined elements from the state-of-the-art that we appreciated. Examples include the use of accuracy and task execution time as simple measures, the use of PSSUQ over SUS to obtain more fine-grained information on usability, usefulness, and information quality, and measuring the mental workload right after the task. Reporting on the correlations between perceived usability, task execution, and mental workload could yield interesting insights [12]. We also provided guidelines on which statistical techniques to use when reporting on user studies with this protocol, as most prior studies merely reported averages.

Relatively novel compared to the related work is our informed decision to adopt an accessible domain, focus on RML-core functionality as a basis, and formulate five tasks. In Section 3, we explained that, when comparing two groups, the last two tasks can be replaced for one group so that extensions or variants within the same language or tool can be compared. As such, the resource is sufficiently general for use in the KGC community, and the approach taken in designing this protocol may inspire others within the Semantic Web community.

The resources have not only been made available with a DOI on a long-term preservation platform, but they are also available on a GitHub repository. The latter allows peers to contribute to the project. The directory structure we use allows for variants of the protocol to be made available.

The protocol has not yet been used at this stage, but its first use is planned for the spring of 2025. However, many of its separate elements are drawn from existing studies. As such, those parts have already been validated in the community. We also aim to engage with the wider KGC community via the W3C working group to promote the adoption of this protocol across different institutions.

6. Conclusions

Prior KGC user studies used different protocols, which makes their results difficult to compare. This paper highlights this lack of standardized protocols and presents a new protocol to address the inconsistencies, focusing on participant selection, task design, and evaluation metrics. The protocol provides detailed guidelines for recruiting participants and disclosing potential biases. It also provides process guidelines for informed consent, pre-questionnaires, familiarization activities, task execution, and post-questionnaires. Five specific mapping tasks are proposed, which can be solved with the equivalent of the RML-Core specification. The protocol recommends using the Post-Study System Usability Questionnaire (PSSUQ) for usability and the NASA Task Load Index (NASA-TLX) for mental workload, among others. We designed the protocol in such a way that one can analyze one group or compare groups. To this end, we provided guidelines on which statistical instruments to use.

This protocol aims to provide a more comparable evaluation of KGC user studies, ultimately leading to more effective tools for knowledge graph construction. As such, the protocol is an essential artifact for future longitudinal and comparative studies. We recognize that the proposal has been constructed in a bottom-up fashion for this community, and future work should look into aligning our proposal with methods for comparing the usability of different (information) systems, such as [30]. Finally, future work also involves encouraging the adoption of this protocol by various KGC scholars. While ambitious, we hope that this protocol will form the basis of a new, open repository of KGC user studies.

Acknowledgments

We thank the reviewers for their many (many) thoughtful comments, which greatly improved the paper. Their feedback was invaluable.

Declaration on Generative AI

During the preparation of this work, the author(s) used Grammarly to improve grammar, check spelling, and reword. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.

References

[1] Van Assche, Delva, G. Haesendonck, P. Heyvaert, B. De Meester, A. Dimou, Declarative RDF graph generation from heterogeneous (semi-)structured data: A systematic literature review, J. Web Semant. 75 (2023) 100753.

[2] Van Assche, D. Chaves-Fraga, A. Dimou, KROWN: A benchmark for RDF graph materialisation, in: G. Demartini, Hose, Acosta, Palmonari, G. Cheng, H. Skaf-Molli, N. Ferranti, D. Hernández, A. Hogan (Eds.), The Semantic Web - ISWC 2024 - 23rd International Semantic Web Conference, Baltimore, MD, USA, November 11-15, 2024, Proceedings, Part III, volume 15233 of Lecture Notes in Computer Science, Springer, 2024, pp. 20–39.

[3] Arenas-Guerrero, D. Chaves-Fraga, Toledo, M. S. Pérez, Ó. Corcho, Morph-KGC: Scalable knowledge graph materialization with mapping partitions, Semantic Web 15 (2024) 1–20.

[4] Iglesias, Jozashoori, D. Chaves-Fraga, Collarana, M. Vidal, SDM-RDFizer: An RML Interpreter for the Efficient Creation of RDF Knowledge Graphs, in: M. d'Aquin, S. Dietze, C. Hauff, E. Curry, P. Cudré-Mauroux (Eds.), CIKM '20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, ACM, 2020, pp. 3039–3046.

[5] Li, Appleby, C. D. Brumar, Chang, Suh, Knowledge Graphs in Practice: Characterizing their Users, Challenges, and Visualization Opportunities, IEEE Transactions on Visualization and Computer Graphics 30 (2023) 584–594.

[6] Pinkel, Binnig, Haase, Martin, Sengupta, Trame, How to Best Find a Partner? An Evaluation of Editing Approaches to Construct R2RML Mappings, in: The Semantic Web: Trends and Challenges - 11th International Conference, ESWC 2014, Anissaras, Crete, Greece, May 25-29, 2014. Proceedings, volume 8465 of Lecture Notes in Computer Science, Springer, 2014, pp. 675–690.

[7] P. Heyvaert, A. Dimou, Herregodts, R. Verborgh, Schuurman, E. Mannens, R. Van de Walle, RMLEditor: A Graph-Based Mapping Editor for Linked Data Mappings, in: The Semantic Web. Latest Advances and New Domains - 13th International Conference, ESWC 2016, Heraklion, Crete, Greece, May 29 - June 2, 2016, Proceedings, volume 9678 of Lecture Notes in Computer Science, Springer, 2016, pp. 709–723.

[8] Á. Sicilia, Nemirovski, Nolle, Map-On: A web-based editor for visual ontology mapping, Semantic Web 8 (2017) 969–980.

[9] Bak, Blinkiewicz, Lawrynowicz, User-friendly Visual Creation of R2RML Mappings in SQuaRE, in: Proceedings of the Third International Workshop on Visualization and Interaction for Ontologies and Linked Data co-located with the 16th International Semantic Web Conference (ISWC 2017), Vienna, Austria, October 22, 2017, volume 1947 of CEUR Workshop Proceedings, CEUR-WS.org, 2017, pp. 139–150.

[10] A. Crotti Junior, C. Debruyne, D. O'Sullivan, Juma: An Editor that Uses a Block Metaphor to Facilitate the Creation and Editing of R2RML Mappings, in: The Semantic Web: ESWC 2017 Satellite Events - ESWC 2017 Satellite Events, Portorož, Slovenia, May 28 - June 1, 2017, Revised Selected Papers, volume 10577 of Lecture Notes in Computer Science, Springer, 2017, pp. 87–92.

[11] P. Heyvaert, A. Dimou, B. De Meester, Seymoens, Herregodts, R. Verborgh, Schuurman, E. Mannens, Specification and implementation of mapping rule visualization and editing: MapVOWL and the RMLEditor, J. Web Semant. 49 (2018) 31–50.

[12] A. Crotti Junior, C. Debruyne, Longo, D. O'Sullivan, On the Mental Workload Assessment of Uplift Mapping Representations in Linked Data, in: Human Mental Workload: Models and Applications - Second International Symposium, H-WORKLOAD 2018, Amsterdam, The Netherlands, September 20-21, 2018, Revised Selected Papers, volume 1012 of Communications in Computer and Information Science, Springer, 2018, pp. 160–179.

[13] A. Crotti Junior, A Jigsaw Puzzle Metaphor for Representing Linked Data Mappings, Ph.D. thesis, Trinity College Dublin, 2019.

[14] H. García-González, I. Boneva, S. Staworko, J. E. L. Gayo, J. M. C. Lovelle, ShExML: improving the usability of heterogeneous data mapping languages for first-time users, PeerJ Comput. Sci. 6 (2020) e318.

[15] P. Warren, P. Mulholland, E. Daga, L. Asprino, Path-based and triplification approaches to mapping data into RDF: User behaviours and recommendations, Semantic Web (2023) 1–27.

[16] M. De Brouwer, P. Bonte, D. Arndt, M. Vander Sande, A. Dimou, R. Verborgh, F. De Turck, F. Ongenae, Optimized continuous homecare provisioning through distributed data-driven semantic services and cross-organizational workflows, J. Biomed. Semant. 15 (2024) 9.

[17] P. Heyvaert, A. Dimou, R. Verborgh, E. Mannens, R. Van de Walle, Semantically annotating CEUR-WS workshop proceedings with RML, in: Semantic Web Evaluation Challenges - Second SemWebEval Challenge at ESWC 2015, Portorož, Slovenia, May 31 - June 4, 2015, Revised Selected Papers, volume 548 of Communications in Computer and Information Science, Springer, 2015, pp. 165–176.

[18] I. A. Ibrahim, T. Choudhury, J. Sargeant, R. Shah, M. J. Hossain, S. M. Islam, CEREI: an open-source tool for cost-effective renewable energy investments, SoftwareX 26 (2024) 101708.

[19] P. Heyvaert, B. De Meester, A. Dimou, R. Verborgh, Declarative Rules for Linked Data Generation at Your Fingertips!, in: The Semantic Web: ESWC 2018 Satellite Events - ESWC 2018 Satellite Events, Heraklion, Crete, Greece, June 3-7, 2018, Revised Selected Papers, volume 11155 of Lecture Notes in Computer Science, Springer, 2018, pp. 213–217.

[20] A. Iglesias-Molina, D. Chaves-Fraga, I. Dasoulas, A. Dimou, Human-Friendly RDF Graph Construction: Which One Do You Chose?, in: Web Engineering - 23rd International Conference, ICWE 2023, Alicante, Spain, June 6-9, 2023, Proceedings, volume 13893 of Lecture Notes in Computer Science, Springer, 2023, pp. 262–277.

[21] J. Brooke, SUS: a retrospective, J. Usability Studies 8 (2013) 29–40.

[22] J. R. Lewis, Psychometric evaluation of the PSSUQ using data from five years of usability studies, International Journal of Human-Computer Interaction 14 (2002) 463–488.

[23] J. R. Lewis, Sample sizes for usability tests: mostly math, not magic, Interactions 13 (2006) 29–33.

[24] J. W. Creswell, J. D. Creswell, Research design: Qualitative, quantitative, and mixed methods approaches, Sage publications, 2017.

[25] L. Faulkner, Beyond the five-user assumption: Benefits of increased sample sizes in usability testing, Behavior Research Methods, Instruments, & Computers 35 (2003) 379–383.

[26] E. Daga, L. Asprino, P. Mulholland, A. Gangemi, Facade-X: an opinionated approach to SPARQL anything, volume 53: Further with Knowledge Graphs of Studies on the Semantic Web, IOS Press, 2021, pp. 58–73.

[27] P. S. Tsang, V. L. Velazquez, Diagnosticity and multidimensional subjective workload ratings, Ergonomics 39 (1996) 358–381.

[28] S. G. Hart, NASA-task load index (NASA-TLX); 20 years later, in: Proceedings of the human factors and ergonomics society annual meeting, volume 50, 2006, pp. 904–908.

[29] R. A. Peterson, A meta-analysis of cronbach’s coefficient alpha, Journal of consumer research 21 (1994) 381–391.

[30] R. Kruger, J. Brosens, M. Hattingh, A methodology to compare the usability of information systems, in: Responsible Design, Implementation and Use of Information and Communication Technology: 19th IFIP WG 6.11 Conference on e-Business, e-Services, and e-Society, I3E 2020, Skukuza, South Africa, April 6–8, 2020, Proceedings, Part II, Springer-Verlag, Berlin, Heidelberg, 2020, pp. 452–463.