=Paper=
{{Paper
|id=Vol-3271/Paper14_CVCS2022
|storemode=property
|title=Subjective Quality Evaluation: What Can be Learnt From Cognitive Science?
|pdfUrl=https://ceur-ws.org/Vol-3271/Paper14_CVCS2022.pdf
|volume=Vol-3271
|authors=Simon Hviid Del Pin,Seyed Ali Amirshahi
|dblpUrl=https://dblp.org/rec/conf/cvcs/PinA22
}}
==Subjective Quality Evaluation: What Can be Learnt From Cognitive Science?==
Simon Hviid Del Pin, Seyed Ali Amirshahi
Norwegian University of Science and Technology, Gjøvik, Norway

Abstract

Subjective ratings given by observers are a critical part of research in image and video quality assessment. As in any other field of science, researchers collecting subjective data may lack the expertise needed to address the different issues they face. In this study, we review common approaches and identify six potential pitfalls, relating to recruitment, instructions, experimental design, and data analysis, that generally seem overlooked in quality research but have been addressed by studies in cognitive science. Combining accessed datasets from quality research with newly collected data, we statistically demonstrate four of the six pitfalls: observers use the scale non-linearly; ratings can change throughout the experiment; image features can influence individual observers differently; and allowing observers to decide how many ratings they give can lead to biases. We need additional data to investigate the two pitfalls related to instructions and recruitment. Our findings suggest that pitfalls which might not be initially obvious to researchers in the field of image and video processing can still have an empirically demonstrable influence on the data. While this article will not solve every issue, it suggests improvements that researchers can readily employ.

Keywords: Image quality assessment, subjective data collection, cognitive science

1. Introduction

Quality judgements from human observers are crucial to researchers interested in evaluating the quality of images and/or videos. A typical process for such researchers is to ask observers to rate the quality of videos or images. Researchers take these ratings as "ground truth data" which can not only be used to train and test different models to predict observers' judgements, but can also open new doors for evaluating and understanding different aspects of the human visual system. However, those involved in quality research may lack expertise in different aspects of subjective data collection, such as instructing human observers or collecting and interpreting the resulting data. Realistically, it is nearly impossible to directly measure the experiences of observers, and no experience can by itself be seen as "right" or "wrong". There is therefore a real risk that researchers are unaware of possible pitfalls they may face. In this study, our objective is to present some of the relevant pitfalls highlighted in cognitive science research. We aim to demonstrate these pitfalls empirically for quality experiments and provide guidelines on how researchers can avoid them.

Figure 1: A depiction of a latent space representing the experience of quality for an observer. The extremes of this space can be called "Worst Imaginable" and "Best Imaginable", respectively. The observer establishes thresholds used to assign the quality to a category; the thresholds may not be equally spaced.
In this paper, we start with an introduction in Section 1. Section 2 introduces the methods used in our study for data collection, followed by a discussion of our findings in Section 3. Finally, in Section 4 we provide a summary and the conclusions of the work.

1.1. Rating quality is a non-objective decision-making process

Researchers often employ scales as a systematic method for observers to report their experiences. When using scales, a tacit (i.e. unspoken) assumption is that there is a latent space of quality that each observer can experience. Two extremes could define this space: on one end the "Worst Imaginable" and on the other end the "Best Imaginable" quality experience. The assumption in current studies is that the scale is divided into equally spaced categories. In the field of image and video quality assessment, the typical recommendation is 5 or 9 discrete points [1]. For example, when people are asked to use the typical five-point Absolute Category Rating (five-point ACR), we assume that they rate everything below a certain threshold in the latent space as "Bad". Anything above this threshold but below the threshold of "Fair" will be labelled "Poor", and so on (Figure 1).

1.2. Remember that your scale may be non-linearly understood

In current studies, a common (if not standard) approach to quantifying the quality of an image/video, independent of the tasks observers have been given, is the use of the Mean Opinion Score (MOS) [2, 3, 4, 5]. The MOS takes the numeric average of all subjective scores given to an image by different observers. This practice tacitly assumes that the different categories introduced to the observers are linear, and so all lie the same distance apart. However, decades of research deem this assumption unreasonable. For example, Jones & McManus [6] investigated how people understood certain terms on a scale from "Worst Imaginable" to "Best Imaginable". Focusing on the five terms of the five-point ACR, they show that the observers did not see these terms as equally spaced (Figure 2).

Figure 2: Perceived intervals between terms used for quality category scales. Figure recreated with data presented in Jones & McManus [6]. 37 participants were asked to place 15 words as points on a line between "Worst Imaginable" and "Best Imaginable". Results show that, compared to an idealised regression (blue line), the five words of the ACR are not equally spaced: "Poor" is a small step above "Bad", whereas "Excellent" is a relatively steep step above "Good".

The study from Jones & McManus [6] thus shows that the choice of words and phrases used to introduce the different categories could influence the judgement of observers. For example, they show that there is a small perceived difference between "Bad" and "Poor". An observer we tested for this article explicitly agreed with this sentiment: "[I]t was difficult to choose between Bad and Poor as they are so arbitrary[.]" (observer in our post-experiment questionnaire). It is plausible that research can lead to a scale with terms that are closer to equidistant. By using a scale that is closer to the terms that observers themselves would prefer and see as linear, we could address two critical issues.
First, the currently non-linear nature of the scale and, second, any confusion that specific terms could cause the observers. Researchers have constructed such scales in cognitive research on, for example, the clarity of briefly flashed figures and the sense of control [7, 8]. In addition to being closer to the participants' experience, a benefit of such a scale could be that the terms are less arbitrary to the observers. This could lead to more stable thresholds in the latent space. Thus, investigating which words to include in scales could be a fruitful endeavour for future studies.

1.3. Viewing ratings as an active decision process

Studies in cognitive science have shown that scale usage is a complex decision process influenced by even minor details. A convincing line of research comes from Siedlecka et al. [9, 10]. In one study, they investigated anagrams, i.e. scrambled letters that may make up a certain word. For instance, the letters ASRONTE can correctly be unscrambled to SENATOR or incorrectly to TOASTER. In their experiment, the participants rated their confidence in the accuracy with which they had unscrambled the letters, from "1: I am guessing" to "4: I am very confident". They also saw a word and judged whether that word represented a correct unscrambling. Importantly, the researchers varied the order of these events, meaning that people could see the word before or after giving their input. The order of pressing yes/no and giving the rating on the scale was also manipulated. The results showed that the procedural order was related to how the observers used the scale. For instance, observers used the extreme parts of the scale more if they saw the proposed word first. In follow-up experiments, the researchers showed that pressing even an unrelated button before using the scale could influence how the scale was used [9]. In another paper, people chose a colour displayed on a colour wheel and rated how clearly they saw it [11]. People used the scale differently when they first tried to match the colour on a wheel compared to when that task was absent. Although understanding the cognitive decision process is an ongoing field of research [10] and beyond the scope of this article, it is worth remembering that observers are not merely instruments that output measurements. They are people who make complex decisions. This means that procedural details, instructions, and scale definitions are crucial.

1.4. Instructions matter - the importance of sharing instructions

When looking through the literature in the field, it is often ambiguous what exact instructions observers were given before the experiment began, and even what specific question they were answering when giving their rating. Merely writing that observers rated the quality on a five-point ACR is not sufficient and goes directly against the recommendations of the International Telecommunication Union (ITU) [12]. Without in-depth investigation, we cannot know whether small variations in instructions or questions matter, but we may again draw on publications from cognitive research. For example, Sandberg et al. [13] tested whether asking three very similar questions influenced how observers used the ratings. They had observers select which object (e.g. a triangle or a circle) had been briefly displayed. They then asked the observers "how clearly they saw the object" or "how sure they were of their choice". The questions yielded different responses, but if one simply analysed the responses as numbers from one to four, this nuance could easily be missed.
Similarly, we cannot know whether asking "What was the technical quality of this image?" leads to different responses than "What was the overall quality?". We strongly encourage researchers to share this aspect of their methods, and reviewers to demand such sections before accepting papers. To practice what we preach, we have of course made our specific instructions available in the repository together with our statistical code.¹

¹ The raw data, statistical code, and instructions can be found at https://osf.io/6qvwm/

Table 1: An overview of the datasets analysed in this paper. Name indicates how we refer to them throughout the paper.

Name | # observers | # reference images | Local/Online | Reference
Dataset 1 | 24 | 10 | Local | [14]
Experiment 1 | 40 | 235 | Online | This study
Experiment 2 | 31 | 10 | Online | This study

2. Methods and Data

To further investigate the issues raised, we used a publicly available dataset in the field of image quality assessment [14] (Dataset 1 in Table 1). The dataset was collected through an experiment conducted locally under controlled conditions and used 10 distorted reference images. As one of our experiments in this study, we recreated an online experiment using the same reference images (Experiment 2 in Table 1). The other experiment used 235 reference images from the KonIQ-10k IQA database [15]. Using Pavlovia, we performed this set of subjective experiments with 250 trials per observer on the mentioned images (Experiment 1 in Table 1). Apart from using different sets of stimuli, our two experiments had identical instructions and experimental paradigms.

2.1. Testing if the scale is non-linearly used in quality experiments

We first analysed whether observers in the three experiments used the scale equidistantly or with flexible thresholds. If you take the means of the ratings, you (tacitly) expect equidistant usage. Flexible thresholds assume that the distance between each scale point is not equal. This model will by definition have more degrees of freedom, and we therefore wish to investigate whether it also yields correspondingly better predictions. We created all models in brms [16], a package for the R statistical computing environment. We constructed equidistant and flexible models in brms (for more information on this process, see [17]). We then compared the two models to see if the flexible model had better predictive power. We used the R package loo, which uses PSIS-LOO to approximate leave-one-out cross-validation. PSIS-LOO has been shown to be a robust and computationally efficient method for model selection [18]. This type of comparison considers not only the absolute outcome of the model prediction but also gives an estimate of how likely it is that the same model would perform better on future samples from the same population. If two models are within two standard errors (SEs) of each other, we cannot be sure which is the better one; in this case, the parsimonious choice is the simpler model. Using two SEs roughly corresponds to a 95% probability that the complex model is better (for a more thorough argument, see [19]). Comparing the equidistant and flexible models showed that the flexible model was about five SEs better. Therefore, we conclude that a flexible model best describes the data (Figure 3).

Figure 3: Use of ratings modelled with a flexible or an equidistant scale usage. The bars represent the observed ratings and the points represent the estimates from both models. The flexible model accurately captures the usage of all ratings, whereas the equidistant model overestimates the usage of ratings 3 and 5 and underestimates the usage of ratings 2 and 4.

Our results, along with what we have already emphasised from the literature, indicate that observers neither understand nor use the five ACR categories as equally spaced.
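As an illustration, the sketch below fits an equidistant and a flexible cumulative (ordinal) model in brms and compares them with PSIS-LOO. It is a minimal sketch rather than our exact analysis (which is available in the repository); the data frame d and its columns rating and id are placeholders.

<pre>
# Minimal sketch: equidistant vs. flexible thresholds for ordinal ratings.
# Assumes a data frame `d` with columns `rating` (1-5) and `id` (observer).
library(brms)

# Equidistant thresholds: the distances between category boundaries are forced to be equal.
fit_equi <- brm(rating ~ 1 + (1 | id), data = d,
                family = cumulative("probit", threshold = "equidistant"))

# Flexible thresholds: each category boundary is estimated freely.
fit_flex <- brm(rating ~ 1 + (1 | id), data = d,
                family = cumulative("probit", threshold = "flexible"))

# PSIS-LOO comparison; a difference of more than roughly two SEs
# favours the better-predicting model.
loo_compare(loo(fit_equi), loo(fit_flex))
</pre>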
Recent research shows that treating nonlinear data as linear can lead to false conclusions and increases the risk of both Type I and Type II errors [20]. This means that using a metric model both increases the risk of missing a real effect and the risk of falsely concluding that a non-existent effect is real. We present code for a statistical method which does not require ratings to be quality scores but merely ordered (that is, knowing that "Good" is above "Fair" but not by how much). The method requires more computational power, especially as the size of the dataset increases, and may thus not be practical in all situations. We also point out that a significant amount of research is currently being done on statistical models for ordinal data [17, 20, 21]. The code we present in this paper may therefore not be the same as what we would recommend a few years from now. Nevertheless, we believe that the current methods are mature enough to be widely applied.

2.2. How many trials do observers need to learn the task?

Cognitive researchers encourage beginning experiments with 40-50 trials used purely to let participants learn the task and the scale [22]. We rarely see this practice in image quality experiments conducted in the field of image processing and computer vision. In the few cases where it is done, there is no standard number: depending on the size of the dataset, the estimated duration of the experiment, and so on, the warm-up can range from a handful of images [14] to a large but randomly selected set [4]. Additionally, warm-up trials can range from having the observer evaluate the quality of an image to simply showing randomly selected images from the dataset for a few seconds. Therefore, we investigated whether ratings change throughout the experiment and, if so, whether it would be reasonable to follow the practice from cognitive science.

Figure 4: Expected ratings as a function of trial number for Dataset 1. Ratings drop over roughly the first 50 trials and then remain relatively stable for the rest of the experiment.

2.3. Analysis of scale usage throughout the experiment

To investigate whether scale usage changes throughout the experiments, we modelled the ratings and response times as a function of the trial number. We defined the first model as

rating ~ s(trials) + (1 | id),    (1)

where s(trials) represents a spline over trials and (1 | id) allows for an intercept per participant. This models the effect of trials on the ratings at the population level while each individual has their own intercept. We thus assume each observer is similarly influenced throughout the experiment but differs in how they rate. This allows for cases where one observer typically rates 4 and another typically rates 3, but their ratings both fall similarly throughout the experiment. In addition, we model the trial effect as a spline [23], which allows for a nonlinear effect. This is beneficial if the ratings are not linearly influenced throughout the experiment.
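A sketch of how Equation 1 could be specified in brms follows. The data frame d and its columns rating, trials, and id are placeholders, and the ordinal family is shown here in line with Section 2.1; the exact specification used in our analysis is available in the repository.

<pre>
# Minimal sketch of Equation 1: a population-level smooth effect of trial
# number on the ratings, with a random intercept per observer.
# Assumes a data frame `d` with columns `rating`, `trials`, and `id`.
library(brms)

fit_trials <- brm(rating ~ s(trials) + (1 | id), data = d,
                  family = cumulative("probit"))

# Null model without a trial effect, used for the PSIS-LOO comparison below.
fit_null <- brm(rating ~ 1 + (1 | id), data = d,
                family = cumulative("probit"))

loo_compare(loo(fit_trials), loo(fit_null))

# The response-time model (Equation 2, Section 2.4) reuses the same structure,
# e.g. brm(response_time ~ s(trials) + (1 | id), data = d).
</pre>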
We confirmed that this model was over five SEs better than a null model that did not include trials as a predictor of ratings. Due to space limitations, we only present the model calculated for Dataset 1 (Figure 4). Our model showed that after roughly 50 trials, observers generally converge to an average quality value slightly lower than the one they start with. When we performed the same analysis on Experiment 2, which used the same set of reference images, we again saw that ratings dropped throughout the experiment. In the case of Experiment 1 with 235 references, the spline model was less than two SEs better than our null model. This indicates that there was no consistent change in ratings throughout that experiment.

Figure 5: Response times throughout the experiment for Dataset 1. There is a large drop over roughly the first 50 trials and a gradual drop over the rest of the experiment.

2.4. Analysis of response time throughout the experiment

We created similar models to capture the response time throughout the experiment. The first model was defined as

response_time ~ s(trials) + (1 | id),    (2)

where s(trials) represents a spline over trials and (1 | id) allows for an intercept per participant. We thus assume there is an effect of trials on the response time at the population level, while each individual can be faster or slower. In the case of Dataset 1 (Figure 5), our analysis shows that response time decreased with more trials, especially over roughly the first 50 trials. We found similar effects for Experiment 1 and Experiment 2. Taken together, our analysis shows that ratings decreased similarly to response times, with the exception of Experiment 1 and its 235 reference images. These results indicate that significant learning occurs, particularly in the first part of the experiments. It seems that the participants "learn" repeated reference images, raising concerns about current practices in the field. As a concrete example, one could speculate that the observer learns that they should focus on a particular flower in a reference image to discern whether it is of good quality.

2.5. The importance of counterbalancing your conditions

The fact that ratings systematically differed throughout the experiments with repeating reference images highlights the importance of randomising all possible aspects of the experiment. An experiment in which conditions are not properly balanced could erroneously show differences in image quality simply because of the order in which the images are shown to participants. Imagine, for instance, an observer taking part in an experiment in which they are first shown 40 images compressed with a novel algorithm and then 40 images compressed with the current benchmark. Without a warm-up phase, the observer will most probably rate the novel algorithm higher, even if it is no better than the benchmark approach. Even in less obvious cases, we recommend counterbalancing. As Brooks [24] puts it, "Reactions of neural, psychological, and social systems are rarely, if ever, independent of previous inputs and states".

2.6. Modelling individual differences

As argued before, there may be individual differences in how observers understand the scale. This problem may be compounded if there are also personal differences in how image features influence ratings. For instance, you could imagine that both colourfulness and sharpness positively affect ratings.
However, it may be the case that there are individual differences in how much these features influence each individual's evaluation. Sharpness may be more important to one observer, whereas colourfulness is more important to another. Or perhaps beyond a certain level of colourfulness, more colour does not matter, and again, this threshold may differ from one observer to another. To investigate this, we analysed how sharpness influenced the ratings of individual observers in Experiment 1. A preliminary analysis showed that ratings rise with sharpness but level off or even fall at higher values. We approximated this relationship as a second-order polynomial; the reason we did not use splines is that they are not computationally feasible to estimate for each individual. We defined the simple model, which assumes that each individual can rate higher or lower but that sharpness influences all observers equally, as

rating ~ sharpness + I(sharpness^2) + (1 | id),    (3)

where sharpness + I(sharpness^2) represents a second-order polynomial over sharpness and (1 | id) allows for an intercept per participant. We thus assume there is an effect of sharpness on the ratings at the population level, while each individual can rate higher or lower. The complex model, which assumes that each individual is influenced by sharpness in their own way, was defined as

rating ~ sharpness + I(sharpness^2) + (sharpness + I(sharpness^2) | id),    (4)

where (sharpness + I(sharpness^2) | id) additionally allows sharpness to have a unique effect per participant. Our analysis showed that the complex model was over five SEs better than the simple model. Our observers thus differed in how they were influenced by sharpness. This result is highly relevant whenever individual ratings are important, but it may also be relevant when, for instance, the goal is to model how observers would generally rate an image. Take, for instance, the research from Götz-Hahn et al. [25], which finds that, to maximise predictive power on a large image dataset, it is optimal for each image to be rated by just five observers. In other words, they would rather have many images rated a few times than fewer images rated many times. Though speculative, they might gain even more predictive power if the rating profiles of the five observers were further investigated. With only five ratings per image, it may, for instance, be useful to know whether a specific type of distortion or even specific content particularly influences observers.
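The models in Equations 3 and 4 could be specified in brms as in the sketch below. This is a minimal sketch, not our exact analysis code (which is available in the repository); the data frame d and its columns rating, sharpness, and id are placeholders, and the ordinal family follows Section 2.1.

<pre>
# Minimal sketch of the simple (Equation 3) and complex (Equation 4) models.
# Assumes a data frame `d` with columns `rating`, `sharpness`, and `id`.
library(brms)

# Equation 3: population-level second-order polynomial of sharpness,
# with only a random intercept per observer.
fit_simple <- brm(rating ~ sharpness + I(sharpness^2) + (1 | id),
                  data = d, family = cumulative("probit"))

# Equation 4: the polynomial effect of sharpness additionally varies per observer.
fit_complex <- brm(rating ~ sharpness + I(sharpness^2) +
                     (sharpness + I(sharpness^2) | id),
                   data = d, family = cumulative("probit"))

# PSIS-LOO comparison of the two models.
loo_compare(loo(fit_simple), loo(fit_complex))
</pre>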
2.7. Recruitment and external validity

Before running an experiment, consider who should be recruited as observers. Naturally, different observers could represent one or multiple groups of people, and their subjective scores would then represent those groups. Researchers rarely state this explicitly in image/video quality research. While in theory most researchers aim to have observers who ideally represent "all internet/computer users" or some similarly broad group, in most cases the observers are better described as "the ones available on campus" or "the first 100 people who responded on the online platform". In the previous section, we showed that there are indeed individual differences in how people understand and use the scales. When pollsters conduct surveys, they often expend considerable energy on the degree to which their respondents represent the entire population of voters. If you, for instance, want to know who will win the next US election, it may be more valuable to ask 100 people from a wide range of backgrounds than 500 people from a liberal arts college. Likewise, the preferences of young, educated observers, who in most cases work in the field of computer science (if not image processing and computer vision), are over-represented in current studies. As the collection of subjective data using online platforms has increased, it may be relevant to focus not only on the number of people but also on whom these people represent. As yet, the magnitude of this problem seems unknown. We simply do not know how much bias we have in our data.

2.8. Observers picking the number of trials themselves

Platforms such as Mechanical Turk and Appen allow participants to decide how many trials to complete. This may inflate the variance in ratings because some observers stop before they "learn" the task (warm-up) while others contribute many trials after that point. To investigate how observers behave in practice, we examined the publicly available KONVID-150k dataset [25]. The dataset represents an experiment on Appen in which observers could choose to quit after each block of 15 videos (one to three of the videos in each block testing the observer's attention). The dataset contains 1257 observers and a total of 233,168 observations, with great variance in how many trials each observer completed. The median number of completed trials was just 84, and the maximum was 1596. The 640 observers who gave 84 or fewer ratings made up about half of the observers but only 17.7% of the total observations. Likewise, the 47 observers with more than 1000 observations made up 3.7% of the group but contributed 26.5% of the total observations. A minority thus accounts for a disproportionately large share of the total ratings. The question, however, is how problematic this is. We performed an exploratory analysis (Figure 6) that compared those who completed 60 or fewer trials (60 trials was the first quartile of total ratings) with those who had rated 600 or more images (an order of magnitude more). To avoid differences in learning, we only analysed the first 60 trials of all observers. The model with a difference between the two groups was more than eight SEs better than the model without groups. We thus see a difference in how the two groups rated, but it is not clear why. Perhaps observers with certain preferences or a certain understanding of the scale are more likely to continue. We cannot say what makes some observers complete more than a thousand trials whereas others complete fewer than 60.
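The exploratory comparison above could be carried out along the lines of the sketch below. The data frame d and its column names are placeholders rather than the actual field names of the KONVID-150k release, and our exact analysis code is available in the repository.

<pre>
# Minimal sketch of the exploratory group comparison.
# Assumes a data frame `d` with one row per rating and columns `id` (observer),
# `trial` (trial number within observer), and `rating`.
library(dplyr)
library(brms)

# Label each observer by the total number of trials they completed.
d <- d %>%
  add_count(id, name = "n_trials") %>%
  mutate(group = case_when(n_trials <= 60  ~ "few",
                           n_trials >= 600 ~ "many",
                           TRUE            ~ NA_character_)) %>%
  filter(!is.na(group), trial <= 60)  # only the first 60 trials of each observer

# Compare a model with a group effect against one without it.
fit_group <- brm(rating ~ group + (1 | id), data = d,
                 family = cumulative("probit"))
fit_nogroup <- brm(rating ~ 1 + (1 | id), data = d,
                   family = cumulative("probit"))
loo_compare(loo(fit_group), loo(fit_nogroup))
</pre>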
Figure 6: Response distributions for the 415 observers who completed 60 or fewer trials (left) and the 86 observers who completed 600 or more trials (right). The observers who stopped before trial 61 used rating four the most, whereas the observers who completed 600 trials used rating three the most. The observers who endured 600 trials were also more likely to use rating one and less likely to use rating five. This analysis only includes the first 60 trials of all observers.

To address this issue, we recommend giving each observer the same number of trials. Not doing so lets a minority of raters exert a large influence on the entire dataset. In the present example, we see that the observers who endured 600 ratings rated lower than those who left before trial 61. If one were training an algorithm to predict how the average observer would rate a video or image, one could therefore end up with lower estimates than the population as a whole would give. Whereas this section could be read as a critique of the KONVID-150k dataset, we wish to commend the researchers for making their dataset with individual ratings publicly available. There could easily be similar issues with other datasets, but all too often such datasets only release the MOS, rendering this type of analysis impossible.

3. Discussion

3.1. Evidence for the pitfalls presented in this paper

This paper has presented several pitfalls in how subjective datasets are collected in the field of image/video quality assessment and has tried to address them through research performed in the field of cognitive science. We aimed to demonstrate these pitfalls empirically, either by performing novel statistical analyses on existing datasets or by additionally collecting new data to analyse (Table 2). Looking through the cognitive literature, we found six potential pitfalls related to subjective quality ratings. We could demonstrate three pitfalls statistically in both accessed and newly collected data. One of them, the voluntary number of trials, could only be demonstrated in accessed data, as both of our experiments had a fixed number of trials.

Table 2: An overview of the pitfalls listed in this paper and the degree to which we have demonstrated their relevance to quality research.

Pitfall | Evidence from cognitive literature | Accessed data | Collected data
Non-linear ratings | X | X | X
Warm-up | X | X | X
Individual effects | X | X | X
Voluntary number of trials | X | X | N/A
Recruitment/external validity | X | N/A | N/A
Influence from instructions | X | N/A | N/A

Finally, recruitment/external validity and the influence of instructions remain pitfalls that are at present not demonstrated empirically. Both of these effects could be further investigated in future research. Recruitment/external validity could require a relatively high number of observers to demonstrate, especially since we do not know to what extent, say, a group of college students rates differently from a representative sample of YouTube users. Thus, it may well be that this effect is most relevant to those conducting large-scale research. However, it would not be impossible to compare a convenience sample (such as the first 200 people who volunteer to participate) with a representative sample that closely matches a target demographic. Such research needs to be conducted before we can know whether this pitfall has practical relevance. The influence of instructions seems more tangible to demonstrate empirically. After all, Sandberg et al. [13] needed only 36 observers in their demonstration. A bigger problem is that instructions are not always available. This leaves the field in a situation where it would be relatively trivial to test different instructions but with no direct way to access them. Once again, we can only recommend that researchers share their exact instructions and expect the same from their colleagues.

3.2. No current scale is without controversy

Although this paper has focused on the five-point ACR, do not assume that simply shifting to another scale will resolve the issues discussed here. For instance, one could be tempted to use a slider to avoid using terms which observers understand differently. However, this is also not without problems.
In a review of different response scale characteristics, DeCastellarnau [26] shows the overwhelming number of options available when building a scale. For sliders specifically, there are issues in that the scale takes longer to use and that observers will often divide it into about five sections, thus still using it discretely rather than continuously. Moreover, recent research has shown that some cultures understand "Fair" as average, whereas other cultures understand it as less than average [27]. Therefore, it may be problematic if a scale is developed or tested primarily in a certain cultural context. Taken together, the only way we can know that a scale is useful is when it has been thoroughly tested under different experimental conditions and even across cultures. A simple "hunch" for overcoming the scale issues presented in this paper will probably be insufficient.

4. Conclusions

Taken together, this paper has demonstrated several pitfalls empirically and highlighted further ones that future research could investigate. Whereas such studies are relevant in themselves, we also hope that this paper is directly useful to researchers in the field, and we therefore end with recommendations focusing on the pitfalls that we have demonstrated empirically. Note that these are general recommendations for removing confounds from future studies, not hard rules that must be followed in all cases.

We recommend that all observers rate the same number of images or videos, if possible. Allowing observers to select the number of trials for themselves allows people with certain traits to make up a large part of the collected data. Allowing observers to give only a few ratings also makes it harder to estimate their individual rating profiles.

We also recommend that experiments either include at least 35 warm-up trials that are discarded or use a statistical model that accounts for warm-up effects. This seems particularly relevant if the stimuli consist of a few references that are repeated. We appreciate that it may not be possible to discard 35 trials in all cases and therefore share code for a model that can be applied in future experiments.

Keep in mind to properly counterbalance your experiment. This should be the case whenever possible, but particularly if you cannot follow the previous recommendations. Omitting to do so may lead to false conclusions.

Finally, we recommend that researchers consider whether they are interested in the scale ratings themselves or rather in what they are supposed to represent. Depending on your specific research question, using the means of ratings may be sufficient. In other cases, remind yourself that ratings represent a nonlinear decision process. We provide code that can test whether the data contain non-linear ratings and take this into account while modelling other aspects. We note that such models are more computationally heavy and may not be applicable to very large datasets.

We hope that this paper has not only pointed out methodological issues that are often seen in the field today but has also shown the relevance of cognitive research to measuring quality. We believe that future research in the overlap between these fields can lead to more robust data that represent the quality the observers are actually experiencing.
References

[1] ITU-T Recommendation P.910: Subjective video quality assessment methods for multimedia applications, International Telecommunication Union, Geneva, Switzerland, 2021.
[2] S. A. Amirshahi, J. Denzler, C. Redies, JenAesthetics—a public dataset of paintings for aesthetic research, in: Poster Workshop at the European Conference on Computer Vision, 2013.
[3] S. A. Amirshahi, G. U. Hayn-Leichsenring, J. Denzler, C. Redies, Evaluating the rule of thirds in photographs and paintings, Art & Perception 2 (2014) 163–182.
[4] S. A. Amirshahi, G. U. Hayn-Leichsenring, J. Denzler, C. Redies, JenAesthetics subjective dataset: Analyzing paintings by subjective scores, Lecture Notes in Computer Science 8925 (2015) 3–19.
[5] M. Pedersen, S. A. Amirshahi, Colourlab image database: Geometric distortions, in: Color and Imaging Conference, volume 2021, Society for Imaging Science and Technology, 2021, pp. 258–263.
[6] B. L. Jones, P. R. McManus, Graphic scaling of qualitative terms, SMPTE Journal 95 (1986) 1166–1171.
[7] M. Y. Dong, K. Sandberg, B. M. Bibby, M. N. Pedersen, M. Overgaard, The development of a sense of control scale, Frontiers in Psychology 6 (2015) 1733.
[8] T. Z. Ramsøy, M. Overgaard, Introspection and subliminal perception, Phenomenology and the Cognitive Sciences 3 (2004) 1–23.
[9] M. Siedlecka, J. Hobot, Z. Skóra, B. Paulewicz, B. Timmermans, M. Wierzchoń, Motor response influences perceptual awareness judgements, Consciousness and Cognition 75 (2019) 102804.
[10] M. Siedlecka, M. Koculak, B. Paulewicz, Confidence in action: Differences between perceived accuracy of decision and motor response, Psychonomic Bulletin & Review 28 (2021) 1698–1706.
[11] Z. Skóra, K. Ciupińska, S. H. Del Pin, M. Overgaard, M. Wierzchoń, Investigating the validity of the Perceptual Awareness Scale – the effect of task-related difficulty on subjective rating, Consciousness and Cognition 95 (2021) 103197.
[12] International Telecommunication Union, ITU-T Recommendation P.800.1: Mean opinion score (MOS) terminology, 2006.
[13] K. Sandberg, B. Timmermans, M. Overgaard, A. Cleeremans, Measuring consciousness: Is one measure better than the other?, Consciousness and Cognition 19 (2010) 1069–1078.
[14] O. Cherepkova, S. A. Amirshahi, M. Pedersen, Analyzing the variability of subjective image quality ratings for different distortions, in: International Conference on Image Processing Theory, Tools and Applications (IPTA), 2022.
[15] V. Hosu, H. Lin, T. Sziranyi, D. Saupe, KonIQ-10k: An ecologically valid database for deep learning of blind image quality assessment, IEEE Transactions on Image Processing 29 (2020) 4041–4056.
[16] P.-C. Bürkner, brms: An R package for Bayesian multilevel models using Stan, Journal of Statistical Software 80 (2017) 1–28.
[17] P.-C. Bürkner, M. Vuorre, Ordinal regression models in psychology: A tutorial, Advances in Methods and Practices in Psychological Science 2 (2019) 77–101.
[18] A. Vehtari, A. Gelman, J. Gabry, Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC, Statistics and Computing 27 (2017) 1413–1432.
[19] S. H. Del Pin, Z. Skóra, K. Sandberg, M. Overgaard, M. Wierzchoń, Comparing theories of consciousness: Object position, not probe modality, reliably influences experience and accuracy in object recognition tasks, Consciousness and Cognition 84 (2020) 102990.
[20] T. M. Liddell, J. K. Kruschke, Analyzing ordinal data with metric models: What could possibly go wrong?, Journal of Experimental Social Psychology 79 (2018) 328–348.
[21] B. Paulewicz, A. Blaut, The general causal cumulative model of ordinal response, PsyArXiv preprint (2022).
[22] M. Overgaard, K. Sandberg, The perceptual awareness scale—recent controversies and debates, Neuroscience of Consciousness 2021 (2021) niab044.
[23] P.-C. Bürkner, Advanced Bayesian multilevel modeling with the R package brms, arXiv preprint arXiv:1705.11123 (2017).
[24] J. L. Brooks, Counterbalancing for serial order carryover effects in experimental condition orders, Psychological Methods 17 (2012) 600.
[25] F. Götz-Hahn, V. Hosu, H. Lin, D. Saupe, KonVid-150k: A dataset for no-reference video quality assessment of videos in-the-wild, IEEE Access 9 (2021) 72139–72160.
[26] A. DeCastellarnau, A classification of response scale characteristics that affect data quality: A literature review, Quality & Quantity 52 (2018) 1523–1559.
[27] T. Yan, M. Hu, Examining translation and respondents' use of response scales in 3MC surveys, in: Advances in Comparative Survey Methods: Multinational, Multiregional, and Multicultural Contexts (3MC), 2018, pp. 501–518.