CrowdIQ: An Ontology for Crowdsourced Information Quality Assessments

Davide Ceolin1, Dafne van Kuppevelt2 and Ji Qi2

1 Centrum Wiskunde & Informatica, Science Park 123, 1098 XG, Amsterdam, The Netherlands
2 Netherlands eScience Center, Science Park 402, 1098 XH, Amsterdam, The Netherlands

EKAW'22: Companion Proceedings of the 23rd International Conference on Knowledge Engineering and Knowledge Management, September 26–29, 2022, Bozen-Bolzano, IT
davide.ceolin@cwi.nl (D. Ceolin); ORCID 0000-0002-3357-9130 (D. Ceolin)

Abstract

Fact-checking is a common journalistic practice adopted to verify the truthfulness of claims and information items. Because of the demanding nature of fact-checking, a significant amount of research has been devoted to the use of crowdsourcing to scale up this practice. Using laypeople to fact-check information gives access to a vast amount of human computation resources, but it introduces an issue of reliability: when these tasks are performed by laypeople instead of experts, their quality might be questioned. In this paper, we introduce an ontology for modeling crowdsourced datasets of information quality assessments. The ontology captures information about the items evaluated as well as important metadata such as the authors of the assessments. The goal of this model is to favor interoperability among different datasets of the same kind, and to support internal analyses of the datasets themselves in terms of bias and reliability of the collected assessments.

Keywords
Information Quality, Crowdsourcing, Data Modeling

1. Introduction

Fact-checking is a common journalistic practice employed to test the veracity of claims to be reported in newspapers. In order to fact-check information, journalists look up information sources, appraise their reliability, extract information from them, and then reason over the resulting information set. This sequence of operations is rather time-consuming, while the amount of claims that require fact-checking online is constantly growing. On the one hand, this means that the workload of specialists is high; on the other hand, their ability to intervene in real time is limited. Crowdsourcing has proven to be a useful tool for scaling up fact-checking. Crowd workers can, in fact, be instructed to produce expert-like assessments, and by collecting multiple assessments about the same item (wisdom of the crowds), quality can be assured. In order to maximize the usefulness of the fact-checking (and, more generally, information quality assessment) datasets obtained from crowdsourcing, it is important to annotate them with specific metadata. This paper describes a lightweight ontology to describe and annotate crowdsourced information quality assessments. The ontology allows characterizing information about the items being assessed, the author of the assessment, and information quality details. We leverage existing ontologies like the Data Quality Vocabulary (DQV) and Schema.org (Schema), and we select and specialize their elements to serve this specific niche of information. The paper is structured as follows. Section 2 presents related work.
Section 3 introduces the ontology, while Section 4 discusses example applications. Section 5 concludes.

2. Related Work

Relevant to this line of work is the Data Quality Vocabulary (DQV) [1], which allows defining quality dimensions and metrics for annotating the quality of data. Our model extends and specializes DQV in order to model specifically crowdsourced information quality assessments. In turn, this links to the extensive line of research on data quality modeling and measurement [2]. While we can observe similarities with the measurement and modeling of data quality, the information quality assessments we are interested in aim at assessing the information content of items, rather than their data serialization. For an overview of information quality and its philosophy, we refer the reader to the book edited by Floridi and Illari [3]. The Data Cube Vocabulary [4] is also a predecessor of our model, since it allows modeling multidimensional metadata. We see our model as a specialization of this vocabulary as well. The work presented in this paper is also relevant to the FAIR data principles, as it aims at favoring findability (especially principle F2) and interoperability (I1) of data. We refer the reader to the work of Poveda-Villalón et al. [5], who provide an extensive analysis of this topic.

3. The CrowdIQ Ontology

The ontology we present here aims at achieving the highest possible interoperability while allowing modeling of the necessary information resulting from the crowdsourcing tasks of interest. In the ontology, we identify three main "superclasses" that represent the core of the information we intend to model. We describe them as follows, while Figure 1 provides an overview.

Target Item. The target item represents the item to be evaluated. This ontology is meant to model quality assessments of online information items; therefore, this class specializes the DigitalDocument class of Schema.org.

Worker. The worker class is meant to characterize the author of a quality assessment, and is thus modeled as a specialization of the Person class of Schema.org. The instances of this class can be more or less richly populated depending on the level of anonymity granted to the crowd workers. Information to populate instances of the worker class can be provided either by the crowdsourcing platform or by the worker, who answers demographic questions in the crowdsourcing task.

IQ Assessment. This class models the information quality assessment that the worker provides about the target item. This class requires the specification of a quality dimension (e.g., precision, accuracy, truthfulness; see [3] for an overview of information quality dimensions) and of a metric. For example, the precision of an online document might be expressed using a 5-level Likert scale. We also record provenance information (time, platform of creation).

[Figure 1: Overview of the CrowdIQ Ontology. The diagram shows :target_item, :worker, and :iq_assessment as subclasses of schema:DigitalDocument, schema:Person, and dqv:QualityMeasurement, respectively. The target item is linked to its assessment via dqv:hasQualityMeasurement; the assessment is linked to the :worker (prov:wasAttributedTo), to a :platform (prov:atLocation), to a creation timestamp (prov:startedAtTime, xsd:dateTime), and to a quality dimension and metric (dqv:inDimension, dqv:isMeasurementOf). Namespace prefixes: daq <http://purl.org/eis/vocab/daq#>, dcat <http://www.w3.org/ns/dcat#>, dcterms <http://purl.org/dc/terms/>, dqv <http://www.w3.org/ns/dqv#>, duv <http://www.w3.org/ns/duv#>, prov <http://www.w3.org/ns/prov#>, schema <http://schema.org>, rdfs <http://www.w3.org/2000/01/rdf-schema#>.]
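To make the structure of Figure 1 concrete, the following sketch builds a single assessment record with rdflib. It is a minimal, hypothetical example rather than normative instance data: the CrowdIQ base namespace, the instance URIs, the literal values (worker id, document, timestamp, score), and the use of dqv:value for the score are our own assumptions for illustration.

```python
# Minimal sketch (not from the paper): one crowdsourced quality assessment
# expressed with the classes and properties shown in Figure 1.
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

DQV = Namespace("http://www.w3.org/ns/dqv#")
PROV = Namespace("http://www.w3.org/ns/prov#")
CIQ = Namespace("http://example.org/crowdiq#")   # hypothetical base namespace
EX = Namespace("http://example.org/data/")       # hypothetical instance namespace

g = Graph()
for prefix, ns in [("dqv", DQV), ("prov", PROV), ("ciq", CIQ), ("ex", EX)]:
    g.bind(prefix, ns)

doc = EX["document/42"]          # the target item being assessed
worker = EX["worker/w017"]       # the crowd worker who produced the assessment
assessment = EX["assessment/a001"]

g.add((doc, RDF.type, CIQ.target_item))
g.add((worker, RDF.type, CIQ.worker))
g.add((assessment, RDF.type, CIQ.iq_assessment))

# Link the target item to the assessment, and the assessment to its metadata.
g.add((doc, DQV.hasQualityMeasurement, assessment))
g.add((assessment, PROV.wasAttributedTo, worker))
g.add((assessment, PROV.atLocation, EX["platform/crowdframe"]))
g.add((assessment, PROV.startedAtTime,
       Literal("2022-05-01T10:00:00Z", datatype=XSD.dateTime)))
g.add((assessment, DQV.inDimension, CIQ.precision))    # example quality dimension
g.add((assessment, DQV.isMeasurementOf, CIQ.likert5))  # example 5-level Likert metric
g.add((assessment, DQV.value, Literal(4)))             # example score

print(g.serialize(format="turtle"))
```

Data converted from CSV following the metadata file discussed in Section 4.1 would be expected to take roughly this shape.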
4. Applications

4.1. Data Conversion Requirements

The proposed ontology is meant to produce and annotate Linked Data representations of crowdsourced information quality assessments. Given that these data are often produced in CSV format, we refer the reader to the CSV on the Web standard [6] for converting CSV files into RDF or JSON-LD. We also provide an example metadata file online1 for converting CSV files using our ontology.

4.2. Analysis workflow in KNIME

In the real-world research project "The Eye of the Beholder"2, we aim to build or leverage a platform for scholars to train machine learning pipelines to automatically assess the quality of online information. To train these pipelines, we make use of a dataset collected from a crowdsourcing platform called CrowdFrame [7]. The training data consists of three parts, namely worker information, document information, and assessment information. Details about the corresponding crowdsourcing experiment are available in [8]. Through investigation and comparison, we chose KNIME [9], a free and open-source data analysis platform. KNIME integrates various components for machine learning, data mining, and reporting. Users can visually create workflows with these components, execute part or all of them, and subsequently use interactive views to examine results, models, etc.

1 Available at https://github.com/EyeofBeholder-NLeSC/assessments-ontology/blob/main/metadata.json
2 https://www.esciencecenter.nl/projects/the-eye-of-the-beholder-transparent-pipelines-for-assessing-online-information-quality/

The desired tool is composed of several KNIME workflows working in tandem with each other. These workflows automate the process of data exploration, model training, result interpretation, and pipeline comparison. To achieve this goal, we designed the CrowdIQ ontology to define the data interfaces for these workflows, and we also need to validate the training data files against this ontology before feeding them to those workflows.

4.3. User stories

The requirements for integrating the ontology in KNIME are specified as follows.

Upload data from the crowdsourcing platform. Once the crowdsourcing task has been completed by enough workers, our user wants to load the resulting assessments into KNIME for further analysis. The platform provides our user with a URL of the metadata of the file (the URL of the data itself can be inferred from the metadata). The KNIME component that we develop takes this URL as an input. The component downloads the data and metadata, and interprets the data according to the CrowdIQ ontology. It executes some sanity checks: it returns an error when, for example, no documents or no workers are present. It outputs three tables: for workers, documents, and assessments, respectively. It also outputs a visualization of which attributes are realized in the data, e.g., which quality dimensions and worker attributes are present. A sketch of these checks is given at the end of this section.

Specify metadata. When the user wants to provide data from a platform that does not offer machine-readable metadata compatible with our ontology, the user needs to specify which fields relate to which attributes in the data. Instead of providing a URL to the metadata as above, the user should be able to specify in a graphical user interface, for each column of their CSV data, which attribute in the ontology it represents. Depending on the format of the data that the user provides, the CSV data can either consist of one table with all information combined, or of three different tables for workers, assessments, and documents, respectively.
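The sketch below illustrates the sanity checks described in the "Upload data from the crowdsourcing platform" user story. It is a hypothetical outline, not the actual KNIME component: the layout of the metadata file, the column names (worker_id, document_id, dimension, value, timestamp), and the way the combined table is split into three tables are assumptions about how a CrowdIQ-annotated CSV might look.

```python
# Hypothetical sketch of the "Upload data" component: fetch the CSVW metadata,
# load the referenced CSV, run basic sanity checks, and split it into three tables.
# Column names and the metadata layout are assumptions, not the project's actual schema.
import json
from urllib.parse import urljoin
from urllib.request import urlopen

import pandas as pd


def load_assessments(metadata_url: str):
    with urlopen(metadata_url) as resp:
        metadata = json.load(resp)

    # CSVW metadata points to the data file via its "url" key (possibly relative).
    data_url = urljoin(metadata_url, metadata["url"])
    data = pd.read_csv(data_url)

    # Sanity checks: fail early if there are no documents or no workers.
    if data["document_id"].nunique() == 0:
        raise ValueError("No documents found in the uploaded data.")
    if data["worker_id"].nunique() == 0:
        raise ValueError("No workers found in the uploaded data.")

    # Split the combined table into workers, documents, and assessments.
    workers = data.filter(regex="^worker_").drop_duplicates("worker_id")
    documents = data.filter(regex="^document_").drop_duplicates("document_id")
    assessments = data[["worker_id", "document_id", "dimension", "value", "timestamp"]]

    # Report which quality dimensions and worker attributes are present.
    print("Dimensions:", sorted(assessments["dimension"].unique()))
    print("Worker attributes:", list(workers.columns))
    return workers, documents, assessments
```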
4.4. Proposed solutions

To account for the user stories described in the previous section, we propose three types of validation to integrate in KNIME workflows. First, if the training data is stored in CSV format, a CSV-on-the-Web validation is essential. In this step, metadata annotations are added to the input CSV files (manually, or by referring to an existing metadata file) to interpret the data files (encoding, data types of fields, etc.). At the same time, these files are converted into a linked data format. Next, the obtained linked data should be tested for consistency with the ontology. This is done through an ontology-based validation that maps the data fields to the expected classes/properties defined in the ontology and further cleans the data. Finally, the resulting data should also be validated against the constraints defined in the ontology. This can include various checks such as missing/duplicate properties, missing or improper type arcs, and inconsistent value ranges.

In addition, we propose to extend the CrowdFrame platform to output a metadata file describing the output CSV according to the CSV on the Web standard. In the interface for creating a new crowdsourcing task, the user should be able to specify which fields correspond to which attributes in the ontology, for those fields for which this cannot be inferred automatically.

A prototype of a KNIME pipeline that reads CSV and validates it against the ontology can be found on GitHub3. It wraps the csvw library4 in a Python scripting node.
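The GitHub prototype wraps the csvw library, and we do not reproduce its code here. As an illustration of the third validation step (checking the converted linked data against the ontology's constraints), the sketch below uses rdflib directly; rdflib is not the tool named for the prototype, and the CrowdIQ namespace, the checked property names, and the expected 1-5 value range are assumptions derived from Figure 1 and the Likert-scale example in Section 3.

```python
# Illustrative sketch (not the GitHub prototype) of ontology-based constraint checks
# on the converted linked data: missing/duplicate properties, missing type arcs,
# and out-of-range values.
from rdflib import Graph, Namespace, RDF

DQV = Namespace("http://www.w3.org/ns/dqv#")
PROV = Namespace("http://www.w3.org/ns/prov#")
CIQ = Namespace("http://example.org/crowdiq#")  # hypothetical base namespace


def validate_assessments(g: Graph) -> list:
    errors = []
    assessments = set(g.subjects(RDF.type, CIQ.iq_assessment))
    if not assessments:
        errors.append("No iq_assessment instances found (missing type arcs?).")

    for a in assessments:
        workers = list(g.objects(a, PROV.wasAttributedTo))
        if len(workers) == 0:
            errors.append(f"{a}: missing prov:wasAttributedTo (no worker).")
        elif len(workers) > 1:
            errors.append(f"{a}: duplicate prov:wasAttributedTo statements.")

        if (a, DQV.inDimension, None) not in g:
            errors.append(f"{a}: missing dqv:inDimension.")

        for value in g.objects(a, DQV.value):
            v = value.toPython()
            if not (isinstance(v, int) and 1 <= v <= 5):
                errors.append(f"{a}: dqv:value {value} outside the expected 1-5 range.")
    return errors


if __name__ == "__main__":
    graph = Graph().parse("assessments.ttl", format="turtle")  # output of the CSVW conversion
    for err in validate_assessments(graph):
        print(err)
```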
5. Conclusion

In this paper, we introduce CrowdIQ, an ontology for modeling crowdsourced information quality assessments. The ontology aims at achieving minimal commitment, while specializing existing models to represent common information about crowdsourced information quality tasks. The ontology is described along with a possible utilization scenario. The goals of this model are twofold: to allow interoperability among existing crowdsourced datasets, and to allow inspecting each dataset separately in order to assess its reliability and bias. Future work will aim at extending the model further and at providing pipeline automation components that leverage the model's expressivity.

Acknowledgements

Supported by the Netherlands eScience Center project "The Eye of the Beholder" (project nr. 027.020.G15).

References

[1] E. Hyvonen, R. Albertoni, A. Isaac, Introducing the Data Quality Vocabulary (DQV), Semantic Web 12 (2021) 81–97. URL: https://doi.org/10.3233/SW-200382. doi:10.3233/SW-200382.
[2] V. C. Storey, R. Y. Wang, Modeling quality requirements in conceptual database design, in: I. N. Chengalur-Smith, L. Pipino (Eds.), IQ, MIT, 1998, pp. 64–87.
[3] L. Floridi, P. Illari (Eds.), The Philosophy of Information Quality, Springer, 2014.
[4] R. Cyganiak, D. Reynolds, The RDF Data Cube Vocabulary, W3C Recommendation, 2014.
[5] M. Poveda-Villalón, P. Espinoza-Arias, D. Garijo, O. Corcho, Coming to terms with FAIR ontologies, in: Knowledge Engineering and Knowledge Management: 22nd International Conference, EKAW 2020, Springer-Verlag, Berlin, Heidelberg, 2020, pp. 255–270.
[6] J. Tennison, CSV on the Web: A Primer, https://www.w3.org/TR/tabular-data-primer/, 2016.
[7] M. Soprano, K. Roitero, F. Bombassei De Bona, S. Mizzaro, Crowd_Frame: A simple and complete framework to deploy complex crowdsourcing tasks off-the-shelf, in: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, WSDM '22, ACM, 2022, pp. 1605–1608.
[8] M. Soprano, K. Roitero, D. La Barbera, D. Ceolin, D. Spina, S. Mizzaro, G. Demartini, The many dimensions of truthfulness: Crowdsourcing misinformation assessments on a multidimensional scale, Information Processing & Management 58 (2021) 102710.
[9] M. R. Berthold, N. Cebron, F. Dill, T. R. Gabriel, T. Kötter, T. Meinl, P. Ohl, C. Sieb, K. Thiel, B. Wiswedel, KNIME: The Konstanz Information Miner, in: Studies in Classification, Data Analysis, and Knowledge Organization (GfKL 2007), Springer, 2007.

3 https://github.com/EyeofBeholder-NLeSC/assessments-ontology
4 https://github.com/cldf/csvw