<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Subjective Quality Evaluation: What Can be Learnt From Cognitive Science?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Simon Hviid Del Pin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Seyed Ali Amirshahi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Norwegian University of Science and Technology</institution>
          ,
          <addr-line>Gjøvik</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Subjective ratings given by observers are a critical part of research in image and video quality assessment. As in any other field of science, with subjective data collection, researchers may lack the expertise needed to address the different issues they face. In this study, we review different approaches and find potential pitfalls that generally seem overlooked in quality research. We identified six relevant pitfalls relating to recruitment, instructions, experimental design, and data analysis that could be addressed by studies done in the field of cognitive science. Combining accessed datasets from quality research with newly collected data, we statistically demonstrated four of the six pitfalls: observers used the scale non-linearly; ratings can change throughout the experiment; features can influence individual observers differently; and allowing observers to decide how many ratings they give can lead to biases. We need additional data to investigate the two pitfalls related to instructions and recruitment. Our findings suggest that pitfalls which might not be initially clear to researchers in the field of image and video processing can still have an empirically demonstrable influence on the data. While this article will not solve every issue, it suggests improvements that researchers can readily employ.</p>
      </abstract>
      <kwd-group>
        <kwd>Image quality assessment</kwd>
        <kwd>subjective data collection</kwd>
        <kwd>cognitive science</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Quality judgments from human observers are crucial to researchers interested in evaluating the
quality of images and/or videos. A typical process for such researchers is to ask observers to
rate the quality of videos or images. Researchers take these ratings as “ground truth data” which
can not only be used to train and test different models to predict observers’ judgement, but can
also open new doors for evaluating and understanding different aspects of the human visual
system. However, those involved in quality research may lack the expertise in different aspects
of subjective data collection, such as instructing human observers or collecting and interpreting
the resulting data. Realistically, it is nearly impossible to directly measure the experiences
of observers, and no experience can be seen by itself as “right” or “wrong”. Therefore, there
are real risks that researchers do not know about possible pitfalls they can face. In this study,
our objective is to show some of the relevant pitfalls that are highlighted in cognitive science
research. We will aim to demonstrate these pitfalls empirically for quality experiments and
provide guidelines on how researchers can avoid them.</p>
      <p>[Figure 1: the latent quality space, spanning from “Worst Imaginable” to “Best Imaginable”.]</p>
      <p>In this paper, we start with an introduction in Section 1. Section 2 is dedicated to introducing
the methods used in our study for data collection, followed in Section 3 by a discussion of our
findings. Finally, in Section 4 we provide a summary and the conclusion of the work.</p>
      <sec id="sec-1-1">
        <title>1.1. Rating quality is a non-objective decision-making process</title>
        <p>
          Researchers often employ scales as a systematic method for observers to report their experiences.
When using scales, a tacit (i.e. unspoken) assumption is that there is a latent space of quality that
each observer can experience. Two extremes could define this space: on one end is the “Worst
Imaginable” and on the other end the “Best Imaginable” quality experiences. The assumption
in current studies is that the scale is divided into equally spaced categories. In the field of
image and video quality assessment, the typical recommendation refers to 5 or 9 discrete points
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. For example, when people are asked to use the typical five-point Absolute Category Rating
(five-point ACR), we assume that they rate everything below a certain threshold as “Bad” in the
given latent space. Anything that is above this threshold but below the threshold of “Fair” will
be labeled “Poor”, etc. (Figure 1).
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Remember that your scale may be non-linearly understood</title>
        <p>
          In current studies, a common (if not standard) approach to quantifying the quality of an
image/video independent of the tasks observers have been given is the use of the Mean Opinion
Score (MOS) [
          <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>
          ]. MOS takes the numeric average of all subjective scores given to an
image by different observers. This practice tacitly assumes that the different categories
introduced to the observers are linear, and so all have the same distance. However, decades of
research deem this assumption unreasonable. For example, Jones &amp; McManus [6] investigated
how people understood certain terms on a scale from “Worst Imaginable” to “Best Imaginable”.
Focusing on the five terms of the five-point ACR, they show that the observers did not see
these terms as equally spaced (Figure 2). The study from Jones &amp; McManus [6] thus shows
that the use of different words and phrases to introduce the different categories could influence
the judgement of observers. For example, they show that there is only a small perceived difference
between “Bad” and “Poor”. An observer we tested for this article explicitly agreed with this
sentiment: (“[I]t was difficult to choose between Bad and Poor as they are so arbitrary[.]”
[Observer in our post-experiment questionnaire]).
        </p>
        <p>[Figure 2: the perceived positions of the terms “Bad”, “Poor”, “Fair”, “Good”, and “Excellent” between “Worst Imaginable” and “Best Imaginable”.]</p>
        <p>It is plausible that research can lead to a scale with terms that are closer to equidistant. By
using a scale that is closer to the terms that observers themselves would prefer and see as linear,
we could address two critical issues: first, the current non-linear nature of the scale and, second,
any confusion that specific terms could cause the observers. Researchers have constructed
such scales in cognitive research on, for example, the clarity of briefly flashed figures and the
sense of control [7, 8]. In addition to being closer to the participants’ experience, a benefit of
such a scale could be that the terms are less arbitrary to the observers. This could lead to more
stable thresholds in the latent space. Thus, investigating which words to include in scales could
be a fruitful endeavor for future studies.</p>
      </sec>
      <sec id="sec-1-3">
        <title>1.3. Viewing ratings as an active decision process</title>
        <p>Studies in cognitive science have shown that scale usage is a complex decision process influenced
by even minor details. A convincing line of research comes from Siedlecka et al. [9, 10]. In one
study, they investigated anagrams, i.e. scrambled letters that may make up a certain word. For
instance, the letters ASRONTE can correctly be unscrambled to SENATOR or incorrectly to
TOASTER. In their experiment, the participants rated their confidence in the accuracy with
which they had unscrambled the letters from “1: I am guessing” to “4: I am very confident”. They
also saw a word and judged whether that word represented a correct unscramble. Importantly,
the researchers varied the order of these events, meaning that people could see the word before
or after giving their input. The order of pressing yes/no and the rating on the scale was also
manipulated. The results showed that the procedural order was related to how the observers
used the scale. For instance, observers would use the extreme parts of the scale more if they
saw the proposed word first.</p>
        <p>In follow-up experiments, the researchers showed that pressing even an unrelated button
before using the scale could influence how the scale was used [9]. In another paper, people
chose a color displayed on a color wheel and rated how clearly they saw it [11]. People used
the scale differently if they first tried to match the color on a wheel compared to when the
task was absent. Although understanding the cognitive decision process is an ongoing field
of research [10] and beyond this article, it may be worth remembering that observers are not
merely instruments that output measurements. They are people who make complex decisions.
This means that procedural details, instructions, and scale definitions are crucial.</p>
      </sec>
      <sec id="sec-1-4">
        <title>1.4. Instructions matter - Importance of sharing instructions</title>
        <p>When looking through the literature in the field, it is often ambiguous what exact instructions
observers were given before the experiment began and even what specific question they were
answering when giving their rating. Merely writing that observers rated the quality on a
five-point ACR is not sufficient and directly against the recommendations of the International
Telecommunication Union (ITU) [12]. Without in-depth investigation, we cannot know if small
variations in instructions or questions matter, but we may again draw on publications from
cognitive research. For example, Sandberg et al. [13] tested whether asking three very similar
questions influenced how observers used the ratings. They had observers select the object (e.g.
a triangle or circle) which was briefly displayed. They then asked the observers “how clearly
they saw the object” or “how sure they were of their choice”. The questions yielded different
responses, but if one simply analyzed the responses as one to four, this nuance could easily be
missed. Similarly, we cannot know if asking “What was the technical quality of this image?”
may lead to different responses than “What was the overall quality?”. We strongly encourage
researchers to share this aspect of their methods and reviewers to demand such sections before
accepting papers. To practice what we preach, we have of course made our specific instructions
available in the repository together with our statistical code.<sup>1</sup></p>
        <p><sup>1</sup>The raw data, statistical code, and instructions can be found at https://osf.io/6qvwm/</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Methods and Data</title>
      <p>To further investigate the issues raised, we used a publicly available dataset in the field of image
quality assessment [14] (Dataset 1 in Table 1). The dataset was collected through an experiment
conducted locally under controlled conditions and used 10 distorted reference images. As one
of our experiments in this study, we recreated an online experiment using the same reference
images (Experiment 2 in Table 1). The other experiment had 235 reference images from the
KonIQ-10k IQA database [15]. Using Pavlovia, we then performed another set of subjective
experiments with 250 trials per observer on the mentioned images (Experiment 1 in Table 1).
Apart from using different sets of stimuli, our experiments had identical instructions and
experimental paradigms.</p>
      <sec id="sec-2-1">
        <title>2.1. Testing if the scale is non-linearly used in quality experiments</title>
        <p>We first analyzed whether observers in the three experiments used the scale equidistantly
or with flexible thresholds. If you take the means of the ratings, you (tacitly) expect
equidistant usage. Flexible thresholds assume that the distance between each scale point is not
equal. This model will by definition have more degrees of freedom and we, therefore, wish
to investigate whether it also yields correspondingly better predictions. We created all models in
brms [16], a package for The R Project for Statistical Computing. We constructed equidistant
and flexible models in brms (for more information on this process, see [17]). We then compared
the two models to see if the flexible model had better predictive power. We used the R package
loo, which uses PSIS-LOO to approximate a leave-one-out cross-validation. PSIS-LOO has been
shown to be a robust and computationally efficient method for picking models [18]. This type
of comparison considers not only the absolute outcome of the model prediction but also gives
an estimate of how likely it is that the same model would perform better on future samples
from the same population. If two models are within two standard errors (SEs), we cannot be
sure which is the better one. In this case, the parsimonious choice would be the
simpler model. Using two SEs roughly corresponds to having a 95% probability that the complex
model is better (for a more thorough argument on this, see [19]). Comparing the equidistant
and flexible models showed that the flexible model was about five SEs better. Therefore, we
conclude that the flexible model best describes the data (Figure 3).</p>
        <p>Our results, along with what we have already emphasized in the literature, indicate that
observers neither understand nor use the five ACR categories as equally spaced. Recent research
shows that treating nonlinear data as linear can lead to false conclusions and increase the risk
of Type I and Type II errors [20]. This means that using a metric model both increases your risk
of missing a real effect and falsely concluding that a non-existent effect is real.</p>
        <p>We present code for a statistical method which does not require ratings to be quality scores
but merely ordered (that is, knowing that “Good” is above “Fair” but not to what extent). The
method requires more computational power, especially as the size of the dataset increases,
and may thus not be practical for all situations. We also point out that there is currently a
significant amount of research on statistical models of ordinal data [17, 20, 21].
The code we present in this paper may therefore not be the same as what we would recommend
a few years from now. Nevertheless, we believe that the current methods are mature enough to
be widely applied.</p>
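        <p>As a minimal, self-contained sketch of the flexible-versus-equidistant comparison (this is not our brms/PSIS-LOO code from the repository; the data, parameter values, and the simpler AIC-based comparison below are purely illustrative), one can fit an ordered-probit model with both threshold structures by maximum likelihood:</p>

```python
# Minimal sketch of the flexible-vs-equidistant comparison (NOT the authors'
# brms/PSIS-LOO code): fit an ordered-probit model to simulated 5-point ACR
# ratings with equidistant and with flexible cutpoints, then compare via AIC.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

# Simulate a latent quality signal discretized with deliberately UNEQUAL cutpoints.
latent = rng.normal(0.0, 1.0, size=2000)
true_cuts = np.array([-1.5, -1.0, 0.2, 1.4])
ratings = np.digitize(latent, true_cuts)  # 0..4, i.e. "Bad".."Excellent"

def neg_loglik(cuts):
    """Ordered-probit negative log-likelihood for a standard-normal latent."""
    edges = np.concatenate(([-np.inf], cuts, [np.inf]))
    p = norm.cdf(edges[ratings + 1]) - norm.cdf(edges[ratings])
    return -np.sum(np.log(np.clip(p, 1e-12, None)))

def nll_equidistant(theta):  # 2 params: first cutpoint and one shared spacing
    first, log_step = theta
    return neg_loglik(first + np.exp(log_step) * np.arange(4))

def nll_flexible(theta):  # 4 params: first cutpoint and 3 free spacings
    cuts = theta[0] + np.concatenate(([0.0], np.cumsum(np.exp(theta[1:]))))
    return neg_loglik(cuts)

equi = minimize(nll_equidistant, x0=np.array([-1.0, 0.0]), method="Nelder-Mead")
# The equidistant model is nested in the flexible one, so starting the flexible
# fit from the equidistant optimum means it can only fit as well or better.
flex = minimize(nll_flexible, x0=np.array([equi.x[0]] + [equi.x[1]] * 3),
                method="Nelder-Mead")

aic_equi = 2 * 2 + 2 * equi.fun
aic_flex = 2 * 4 + 2 * flex.fun
print(f"AIC equidistant: {aic_equi:.1f}")
print(f"AIC flexible:    {aic_flex:.1f}")
```

        <p>Because the equidistant model is nested in the flexible one, the comparison asks whether the improvement in fit justifies the extra cutpoint parameters, which is the same question PSIS-LOO answers in a more principled, predictive way.</p>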
      </sec>
      <sec id="sec-2-2">
        <title>2.2. How many trials do observers need to learn the task?</title>
        <p>
          Cognitive researchers encourage beginning experiments with 40–50 trials used purely to
let participants learn the task and the scale [22]. We rarely see this practice in image quality
experiments conducted in the field of image processing and computer vision. In the few cases
where it is done, there is no standard number; depending on the size of the dataset, the
estimated duration of the experiment, etc., this could range from a handful
[14] to a large but randomly selected number of images [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] as warm-up trials. Additionally,
the warm-up trials could range from having the observer evaluate the quality of an image to
simply showing randomly selected images from the dataset for a few seconds. Therefore, we
investigated whether ratings change throughout the experiment and, if so, whether it would be
reasonable to follow the practice from cognitive science.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Analysis of scale usage throughout the experiment</title>
        <p>To investigate whether scale usage changes throughout the experiments, we modeled the ratings
and response time as a function of the trial number. We defined the first model we created as
Rating ∼ s(Trial) + (1|Participant),
(1)
where s(Trial) represents a spline over the trials and (1|Participant) allows for an intercept per
participant. This models the effect of trials on the ratings on a population level while each
individual has their own intercept. We thus assume each observer is similarly influenced throughout
the experiment but observers differ in how they rate. This allows for cases where one observer typically
rates 4 and another typically rates 3, but both their ratings fall similarly throughout the
experiment. In addition, we model the ratings as a spline [23], which allows for a nonlinear effect. This
is beneficial if the ratings are not linearly influenced throughout the experiment. We confirmed
the model was over five SEs better than a null model not including trials to predict ratings. Due
to space limitations, we only present the model calculated for Dataset 1 (Figure 4).</p>
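        <p>A rough fixed-effects stand-in for model (1) can be sketched as follows. This is illustrative only: the actual analysis is a Bayesian spline model in brms, whereas here a cubic B-spline basis plus per-participant dummy intercepts is fit by ordinary least squares on simulated data with invented parameter values:</p>

```python
# A rough fixed-effects stand-in for model (1) -- illustrative only. The paper's
# analysis is a Bayesian spline model in brms; here a cubic B-spline basis plus
# per-participant dummy intercepts is fit by ordinary least squares on SIMULATED data.
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(1)
n_participants, n_trials = 20, 100
trial = np.tile(np.arange(n_trials), n_participants).astype(float)
pid = np.repeat(np.arange(n_participants), n_trials)

# Simulate: ratings drift down over roughly the first 50 trials, and each
# observer has their own baseline (the (1|Participant) term).
baseline = rng.normal(3.5, 0.4, n_participants)
trend = -0.5 * (1 - np.exp(-trial / 25.0))
rating = baseline[pid] + trend + rng.normal(0.0, 0.3, trial.size)

# Cubic B-spline basis over trials; the interior knots are a design choice.
knots = np.concatenate(([0.0] * 4, [25.0, 50.0, 75.0], [float(n_trials)] * 4))
n_basis = len(knots) - 4  # 7 basis functions for a cubic spline
basis = BSpline.design_matrix(trial, knots, 3).toarray()

# Participant dummies; drop one column because the spline basis sums to one.
dummies = np.eye(n_participants)[pid]
X = np.hstack([basis, dummies[:, 1:]])
coef, *_ = np.linalg.lstsq(X, rating, rcond=None)

# The spline part of the fit recovers the population-level trend s(Trial).
fitted_trend = basis @ coef[:n_basis]
early = fitted_trend[trial < 10].mean()
late = fitted_trend[trial > 80].mean()
print(f"fitted trend, early trials: {early:.2f}  late trials: {late:.2f}")
```

        <p>In this toy setup the fitted trend falls from the early to the late trials, mirroring the drop we observed after roughly 50 trials.</p>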
        <p>Our model showed that after roughly 50 trials observers in general converge to an average
quality value slightly lower than the one they typically start with. When we performed the same
analysis on Experiment 2, which used the same reference images, we again saw that
ratings dropped throughout the experiment. In the case of Experiment 1 with 235 references,
the spline model was less than two SEs better than our null model. This indicates that there
was no consistent change in ratings throughout the experiment.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Analysis of response time throughout the experiment</title>
        <p>We created similar models to capture the response time throughout the experiment. The first
model we created was defined as
ResponseTime ∼ s(Trial) + (1|Participant),
(2)
where s(Trial) represents a spline over the trials and (1|Participant) allows for an intercept per
participant. We thus assume there is an effect of trials on the response time on a population level
while each individual can be faster or slower. In the case of Dataset 1 (Figure 5), our analysis
shows that response time decreased with more trials, especially in roughly the first 50 trials. We
found similar effects for Experiment 1 and Experiment 2. Taken together, our analysis shows
that ratings decreased similarly to response times. An exception was Experiment 1 with 235
reference images. These results indicate that significant learning occurs, particularly in the first
part of the experiments. It seems that the participants “learn” repeated reference images, raising
concerns about current practices in the field. As a concrete example, one could speculate that
the observer learns they should focus on a particular flower in a reference image to discern if it
is of good quality.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. The importance of counterbalancing your conditions</title>
        <p>The fact that ratings systematically differed throughout the experiments that had repeating
reference images highlights the importance of randomizing all possible aspects of the experiment.
An experiment in which conditions are not properly balanced could erroneously show differences
in image quality simply because of the order in which they are shown to participants. Imagine,
for instance, an observer taking part in an experiment in which they are first shown 40 images
compressed with a novel algorithm and then 40 images compressed with the current benchmark.
Without a warm-up phase, the observer most probably will rate the novel algorithm higher
even if it was no better than the benchmark approach. Even in less obvious cases, we recommend
counterbalancing. As Brooks [24] puts it, “Reactions of neural, psychological, and social systems
are rarely, if ever, independent of previous inputs and states”.</p>
      </sec>
      <sec id="sec-2-6">
        <title>2.6. Modelling Individual diferences</title>
        <p>As argued before, there may be individual differences in how observers understand the scale.
This problem may be compounded if there are also personal differences in how image features
influence ratings. For instance, you could imagine that both colourfulness and sharpness
positively affect ratings. However, it may be the case that there are individual differences in how
much these features influence each individual’s evaluation. Sharpness may be more important
to one observer, whereas colourfulness is more important to another. Or maybe beyond a certain
level of colourfulness, more colour does not matter. Again, this threshold may differ from one
observer to the other. To investigate this, we analysed how sharpness influenced the ratings
of individual observers in Experiment 1. Preliminary analysis showed that ratings rise with
sharpness, but level off or even fall at higher values. We approximated this as a second-order
polynomial. The reason we did not use splines is that they are not computationally feasible to
estimate for each individual. We defined the simple model to assume that each individual can
rate higher or lower, but that sharpness will influence all observers equally, as
Rating ∼ Sharpness + Sharpness² + (1|Participant),
(3)
where Sharpness + Sharpness² represents a second-order polynomial over sharpness and
(1|Participant) allows for an intercept per participant. We thus assume there is an effect of sharpness on
the ratings on a population level while each individual can rate higher or lower. The complex
model was defined to assume that each individual is influenced in their own way by sharpness:
Rating ∼ Sharpness + Sharpness² + (Sharpness + Sharpness²|Participant),
(4)
where Sharpness + Sharpness² again represents a second-order polynomial over sharpness and
(Sharpness + Sharpness²|Participant)
additionally allows for sharpness to have a unique effect per participant.</p>
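        <p>The contrast between models (3) and (4) can be sketched with a least-squares analogue. This is again illustrative, not our actual analysis: the real comparison used brms with PSIS-LOO, whereas this toy version uses participant dummies and a BIC-style penalty on simulated data with invented parameter values:</p>

```python
# Illustrative contrast between models (3) and (4): does letting each observer
# have their OWN quadratic sharpness effect improve fit? A least-squares
# stand-in (participant dummies + BIC) for the paper's brms/PSIS-LOO analysis,
# run on simulated data.
import numpy as np

rng = np.random.default_rng(2)
n_obs, n_ratings = 15, 120
sharp = rng.uniform(0.0, 1.0, size=(n_obs, n_ratings))

# Each simulated observer gets their own intercept, slope, and curvature.
intercept = rng.normal(2.5, 0.3, n_obs)
slope = rng.normal(2.0, 0.8, n_obs)
curve = rng.normal(-1.5, 0.6, n_obs)
rating = (intercept[:, None] + slope[:, None] * sharp + curve[:, None] * sharp**2
          + rng.normal(0.0, 0.25, size=sharp.shape))

y, s = rating.ravel(), sharp.ravel()
pid = np.repeat(np.arange(n_obs), n_ratings)
D = np.eye(n_obs)[pid]  # participant dummy columns

def rss(X):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ coef) ** 2)

# Model (3): per-participant intercept, shared quadratic sharpness effect.
X_simple = np.hstack([D, s[:, None], (s**2)[:, None]])
# Model (4): per-participant intercept, slope, AND curvature.
X_complex = np.hstack([D, D * s[:, None], D * (s**2)[:, None]])

n = y.size
bic_simple = n * np.log(rss(X_simple) / n) + X_simple.shape[1] * np.log(n)
bic_complex = n * np.log(rss(X_complex) / n) + X_complex.shape[1] * np.log(n)
print(f"BIC, shared effect:      {bic_simple:.1f}")
print(f"BIC, individual effects: {bic_complex:.1f}")
```

        <p>When individual quadratic effects are genuinely present, the per-observer model wins despite its much larger parameter count, which is the pattern we report for our observers.</p>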
        <p>Our analysis showed that the complex model was over five SEs better than the simple model.
Our observers thus differed in how they are influenced by sharpness. This result is
highly relevant whenever individual ratings are important, but may also be relevant when, for
instance, our goal is to model how observers would generally rate an image. Take, for instance,
the research from Götz-Hahn et al. [25], in which they find that to achieve maximum predictive
power in a large image dataset, it is optimal for each image to be rated by just five observers. In
other words, they would rather have many images rated a few times than fewer images
rated many times. Though speculative, maybe they would gain even more predictive power if
the rating profiles of the five observers were further investigated. With only five ratings per
image, it may, for instance, be useful to know whether a specific type of distortion or even specific content
particularly influences observers.</p>
      </sec>
      <sec id="sec-2-7">
        <title>2.7. Recruitment and external validity</title>
        <p>Before running an experiment, consider who should be recruited as observers. Naturally,
different observers could represent one or multiple groups of people, and so their subjective scores
would naturally represent those group(s). Researchers rarely state this explicitly in image/video
quality research. While in theory most researchers aim to have observers who ideally represent
“all internet/computer users” or some similarly wide group, in most cases observers
can better be described as “the ones available on campus” or “the first 100 people that responded
on the online platform”. In the previous section, we showed that there are indeed individual
differences in how people understand and use the scales. When pollsters conduct surveys,
they often spend considerable energy addressing the degree to which their respondents represent
the entire population of voters. If you, for instance, want to know who will win the next US
election, it may be more valuable to ask 100 people from a wide range of backgrounds than to ask
500 from a liberal arts college. Likewise, the preferences of young, educated observers, who in
most cases are working in the field of computer science (if not image processing and computer
vision), are overrepresented in current studies. As the collection of subjective experiments using
online platforms has increased, it may be relevant to focus not only on the number of people
but also on who these people represent. As yet, the magnitude of this problem seems unknown.
We simply do not know how much bias we have in our data.</p>
      </sec>
      <sec id="sec-2-8">
        <title>2.8. Observers picking the number of trials themselves</title>
        <p>Platforms such as Mechanical Turk and Appen allow participants to decide how many trials to
complete. This may inflate variance in ratings, with some observers stopping before they
“learn” the task (warm-up) and others contributing many trials after that point. To investigate how
observers empirically behave, we investigated the publicly available KONVID-150k dataset [25].
The dataset represents an experiment on Appen in which observers could choose to quit after
each block of 15 videos (one to three of the videos being tests of the observer’s attention). We
see the dataset contains 1257 observers for a total of 233,168 observations with a great variance
in how many trials each observer completed. The median number of completed trials was just
84, and the maximum was 1596. The 640 observers who had given 84 or fewer
ratings made up about half the observers but only 17.7% of the total observations.
Likewise, the 47 observers who had made over 1000 observations made up 3.7% of the total
group but 26.5% of the total observations. We thus see that a minority makes up a
disproportionately large portion of the total ratings. The question, however, is how problematic
that is. We performed an exploratory analysis (Figure 6) which arbitrarily compared those who
completed 60 or fewer trials (60 trials was the first quartile of total ratings) with those who had
rated 600 or more images (an order of magnitude more ratings). To avoid differences in learning,
we only tested the first 60 trials for all observers. The model with a difference between the two
groups was more than eight SEs better than the model which did not include groups. We thus
see a difference in how the two groups rated, but it is not clear why. Perhaps observers with
certain preferences or understandings of the scale are more likely to continue? We cannot say
what makes some observers complete more than a thousand trials whereas others complete fewer
than 60.</p>
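        <p>The kind of imbalance described above is easy to quantify once per-observer trial counts are available. The numbers below are invented (a heavy-tailed toy distribution, not the actual KONVID-150k counts), but they illustrate how a small share of observers can dominate the pooled ratings:</p>

```python
# Toy illustration of self-selected trial counts: with a heavy-tailed dropout
# pattern, a small minority of "marathon" raters dominates the pooled ratings.
# The counts are invented, NOT the actual KONVID-150k numbers.
import random

random.seed(3)
# Each observer completes some number of 15-video blocks before quitting;
# a Pareto draw gives many early quitters and a few very persistent raters.
blocks = [min(100, int(random.paretovariate(1.2))) for _ in range(1000)]
trials_per_observer = [15 * b for b in blocks]

total = sum(trials_per_observer)
top = sorted(trials_per_observer, reverse=True)
top_5pct = top[: len(top) // 20]  # the 50 most persistent observers
share = sum(top_5pct) / total
print(f"top 5% of observers contribute {share:.0%} of all ratings")
```

        <p>Running the same bookkeeping on a real dataset (counting ratings per observer ID and the share held by the most prolific raters) is how we obtained the figures reported above.</p>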
        <p>To address this issue we recommend giving each observer the same number of trials. Not
doing so lets a minority of raters exert a large influence on the entire dataset. In the present
example, we see that the observers who endured 600 ratings rated lower than the ones who left
before trial 61. If one were training an algorithm to predict how the average observer would rate
a video/image, one could therefore end up with lower estimates than the population as a whole
would give. While this section could be read as a critique of the KONVID-150k dataset, we
wish to commend the researchers for making their dataset with individual ratings publicly
available. There could easily be similar issues with other datasets, but all too often such
datasets only release the MOS, rendering this type of analysis impossible.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Discussion</title>
      <sec id="sec-3-1">
        <title>3.1. Evidence for the pitfalls presented in this paper</title>
        <p>This paper has presented several pitfalls in how subjective datasets are collected in the field
of image/video quality assessment and has tried to address them through research performed
in the field of cognitive science. We aimed to demonstrate these pitfalls empirically, either by
performing novel statistical work on existing datasets or by collecting new data
to analyze (Table 2). Looking through the cognitive literature, we found six potential
pitfalls related to subjective quality ratings. We could demonstrate three pitfalls statistically
in both accessed and newly collected data. One of them, the voluntary number of trials, could
only be demonstrated in accessed data, as both our experiments had a fixed number of trials.
Finally, recruitment/external validity and influence from instructions remain pitfalls that at
present are not demonstrated empirically. Both of these effects could be further investigated
in future research. Recruitment/external validity could require a relatively high number of
observers to be demonstrated, especially since we do not know to what extent, say, a group of
college students rates differently from a representative sample of YouTube users. Thus, it may
well be that this effect is most relevant to those conducting large-scale research. However, it
would not be impossible to compare a convenience sample (such as the first 200 people who
volunteer to participate) to a representative sample of people that closely matches a target
demographic. Such research needs to be conducted before we can know whether it has practical
relevance. The influence of instructions seems more tangible to demonstrate empirically.
After all, Sandberg et al. [13] only needed 36 observers in their demonstration. A bigger problem
is that instructions are not always available. This leaves the field in a situation where it
would be relatively trivial to test different instructions but with no direct way to access them.
Once again, we can only recommend that we share our exact instructions and expect the same
from our colleagues.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. No current scale is without controversy</title>
        <p>Although this paper has focused on the five-point ACR, do not assume that simply shifting to
another scale will resolve the current issues. For instance, one could be tempted to use a slider
to avoid using terms which observers understand differently. However, this is also not without
problems. In a review of different response scale characteristics, DeCastellarnau [26] shows the
overwhelming number of options in building a scale. With a slider, the scale
takes longer to use, and observers often divide it into sections of five and thus
still use it discretely rather than continuously. Moreover, recent research has shown that some
cultures understand “Fair” as average, whereas other cultures understand it as less than average
[27]. Therefore, it may be problematic if a scale is developed or tested primarily in a certain
cultural context. Taken together, the only way we can know that a scale is useful is when it has
been thoroughly tested under different experimental conditions and even cultures. A simple
“hunch” to overcome the issues of the scale which we have presented in this paper will probably
be insufficient.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>Taken together, this paper has empirically demonstrated several pitfalls and further highlighted
some that future research could investigate. While such studies are relevant in themselves,
we also hope that this paper is directly useful to researchers in the field, and we therefore
end with recommendations focusing on the pitfalls that we have demonstrated empirically.
Note that these are general recommendations for removing confounding information from future
studies, not hard rules that must be followed in all cases.</p>
      <p>We recommend that all observers rate the same number of images or videos, if possible. Allowing
observers to select the number of trials for themselves lets people with certain traits
make up a large part of the collected data. Allowing observers to give only a few ratings also
makes it harder to estimate their individual rating profiles. We also recommend that experiments
either include at least 35 warm-up trials that are discarded or use a statistical model
that accounts for warm-up effects. This seems particularly relevant if the stimuli consist of a few
references that are repeated. We appreciate that discarding 35 trials may not be possible in all
cases and therefore share code for a model that can be applied in future experiments. Remember
to balance your experiment properly. This should be done whenever possible, but
particularly if you cannot follow the previous recommendations; omitting to do so may lead to
false conclusions.</p>
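<p>As a minimal illustration of the bookkeeping these recommendations imply (a sketch, not the code we share; the column names observer_id, trial_index, and rating are hypothetical), the following Python snippet discards each observer's warm-up trials and checks that all observers contributed equally:</p>

```python
# Hypothetical sketch: discard warm-up trials and verify balanced trial counts.
# Column names (observer_id, trial_index) are assumptions, not from our shared code.
import pandas as pd

def discard_warmup(df: pd.DataFrame, n_warmup: int = 35) -> pd.DataFrame:
    """Drop each observer's first n_warmup trials, in presentation order."""
    df = df.sort_values(["observer_id", "trial_index"])
    return df.groupby("observer_id", group_keys=False).apply(
        lambda g: g.iloc[n_warmup:]
    )

def is_balanced(df: pd.DataFrame) -> bool:
    """True if every observer rated the same number of stimuli."""
    return df.groupby("observer_id").size().nunique() == 1
```

<p>If discarding 35 trials is too costly, the same grouping logic can instead supply the trial index to a statistical model as a warm-up covariate.</p>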
      <p>Finally, we recommend that researchers consider whether they are interested in the scale ratings
themselves or rather in what the ratings are supposed to represent. Depending on your specific research
question, using the means of ratings may be sufficient. In other cases, remind yourself that
ratings arise from a nonlinear decision process. We provide code that can test whether the data contain
non-linear ratings and take this into account while modeling other aspects. We note that such
models are computationally heavier and may not be practical for very large datasets.</p>
      <p>We hope that this paper has not only pointed out the methodological issues that are often
seen in the field today, but also shown the relevance of cognitive research to measuring quality.
We believe that future research in this overlap between the fields can lead to more robust data
that represents the quality that the observers are actually experiencing.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <collab>ITU-T</collab>
          ,
          <article-title>Recommendation P.910: Subjective video quality assessment methods for multimedia applications</article-title>
          , International Telecommunication Union, Geneva, Switzerland (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Amirshahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Denzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Redies</surname>
          </string-name>
          ,
          <article-title>JenAesthetics: a public dataset of paintings for aesthetic research</article-title>
          , in: Poster workshop at the
          <source>European Conference on Computer Vision</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Amirshahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. U.</given-names>
            <surname>Hayn-Leichsenring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Denzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Redies</surname>
          </string-name>
          ,
          <article-title>Evaluating the rule of thirds in photographs and paintings</article-title>
          ,
          <source>Art &amp; Perception</source>
          <volume>2</volume>
          (
          <year>2014</year>
          )
          <fpage>163</fpage>
          -
          <lpage>182</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Amirshahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. U.</given-names>
            <surname>Hayn-Leichsenring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Denzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Redies</surname>
          </string-name>
          ,
          <article-title>JenAesthetics subjective dataset: Analyzing paintings by subjective scores</article-title>
          ,
          <source>Lecture Notes in Computer Science</source>
          <volume>8925</volume>
          (
          <year>2015</year>
          )
          <fpage>3</fpage>
          -
          <lpage>19</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Amirshahi</surname>
          </string-name>
          ,
          <article-title>Colourlab image database: Geometric distortions</article-title>
          , in:
          <source>Color and Imaging Conference</source>
          , volume 2021, Society for Imaging Science and Technology,
          <year>2021</year>
          , pp.
          <fpage>258</fpage>
          -
          <lpage>263</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] B. L. Jones, P. R. McManus, Graphic scaling of qualitative terms, SMPTE Journal 95 (1986) 1166–1171.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] M. Y. Dong, K. Sandberg, B. M. Bibby, M. N. Pedersen, M. Overgaard, The development of a sense of control scale, Frontiers in Psychology 6 (2015) 1733.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] T. Z. Ramsøy, M. Overgaard, Introspection and subliminal perception, Phenomenology and the Cognitive Sciences 3 (2004) 1–23.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] M. Siedlecka, J. Hobot, Z. Skóra, B. Paulewicz, B. Timmermans, M. Wierzchoń, Motor response influences perceptual awareness judgements, Consciousness and Cognition 75 (2019) 102804.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] M. Siedlecka, M. Koculak, B. Paulewicz, Confidence in action: Differences between perceived accuracy of decision and motor response, Psychonomic Bulletin &amp; Review 28 (2021) 1698–1706.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] Z. Skóra, K. Ciupińska, S. H. Del Pin, M. Overgaard, M. Wierzchoń, Investigating the validity of the Perceptual Awareness Scale: the effect of task-related difficulty on subjective rating, Consciousness and Cognition 95 (2021) 103197.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] ITU-T, Recommendation P.800.1: Mean Opinion Score (MOS) terminology, International Telecommunication Union, 2006.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] K. Sandberg, B. Timmermans, M. Overgaard, A. Cleeremans, Measuring consciousness: is one measure better than the other?, Consciousness and Cognition 19 (2010) 1069–1078.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] O. Cherepkova, S. A. Amirshahi, M. Pedersen, Analyzing the variability of subjective image quality ratings for different distortions, in: International Conference on Image Processing Theory, Tools and Applications (IPTA), 2022.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] V. Hosu, H. Lin, T. Sziranyi, D. Saupe, KonIQ-10k: An ecologically valid database for deep learning of blind image quality assessment, IEEE Transactions on Image Processing 29 (2020) 4041–4056.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] P.-C. Bürkner, brms: An R package for Bayesian multilevel models using Stan, Journal of Statistical Software 80 (2017) 1–28.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] P.-C. Bürkner, M. Vuorre, Ordinal regression models in psychology: A tutorial, Advances in Methods and Practices in Psychological Science 2 (2019) 77–101.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] A. Vehtari, A. Gelman, J. Gabry, Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC, Statistics and Computing 27 (2017) 1413–1432.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] S. H. Del Pin, Z. Skóra, K. Sandberg, M. Overgaard, M. Wierzchoń, Comparing theories of consciousness: object position, not probe modality, reliably influences experience and accuracy in object recognition tasks, Consciousness and Cognition 84 (2020) 102990.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[20] T. M. Liddell, J. K. Kruschke, Analyzing ordinal data with metric models: What could possibly go wrong?, Journal of Experimental Social Psychology 79 (2018) 328–348.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] B. Paulewicz, A. Blaut, The general causal cumulative model of ordinal response, PsyArXiv preprint (2022).</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] M. Overgaard, K. Sandberg, The Perceptual Awareness Scale: recent controversies and debates, Neuroscience of Consciousness 2021 (2021) niab044.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] P.-C. Bürkner, Advanced Bayesian multilevel modeling with the R package brms, arXiv preprint arXiv:1705.11123 (2017).</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] J. L. Brooks, Counterbalancing for serial order carryover effects in experimental condition orders, Psychological Methods 17 (2012) 600.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] F. Götz-Hahn, V. Hosu, H. Lin, D. Saupe, KonVid-150k: A dataset for no-reference video quality assessment of videos in-the-wild, IEEE Access 9 (2021) 72139–72160.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] A. DeCastellarnau, A classification of response scale characteristics that affect data quality: a literature review, Quality &amp; Quantity 52 (2018) 1523–1559.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] T. Yan, M. Hu, Examining translation and respondents’ use of response scales in 3MC surveys, Advances in Comparative Survey Methods: Multinational, Multiregional, and Multicultural Contexts (3MC) (2018) 501–518.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>