=Paper=
{{Paper
|id=Vol-3300/short_1345
|storemode=property
|title=Imputing Missing Answers in the World Values Survey
|pdfUrl=https://ceur-ws.org/Vol-3300/short_1345.pdf
|volume=Vol-3300
|authors=Arsen Matej Golubovikj,Branko Kavšek,Marko Tkalčič
|dblpUrl=https://dblp.org/rec/conf/hci-si/GolubovikjKT22
}}
==Imputing Missing Answers in the World Values Survey==
Arsen Matej Golubovikj¹, Branko Kavšek¹,² and Marko Tkalčič¹
¹ University of Primorska, Faculty of Mathematics, Natural Sciences and Information Technologies, Glagoljaška 8, SI-6000 Koper, Slovenia
² Jožef Stefan Institute, Department for Artificial Intelligence, Jamova 39, SI-1000 Ljubljana, Slovenia
Abstract
Questionnaire surveys are useful for many areas of science, in particular the social sciences. Such surveys are often the primary means of gathering data directly from participants; however, they are prone to missing data, which can have many causes: (i) an error by survey administrators, (ii) participants not responding to certain questions, or (iii) other reasons, such as acts of nature. In order to keep the full survey sample, researchers must often use imputation to deal with the missing-data problem. Methods for imputation can sometimes offer reasonable estimates for the missing data; however, in the case of surveys, (i) imputation can add high noise to the data, and (ii) imputation becomes unreliable when more than 40% of the data is missing. This work attempts to address these issues by evaluating whether the usage of matrix completion methods stemming from collaborative filtering (CF) in recommender systems can yield more accurate imputations of survey data. The rationale for the usage of these methods is (i) the similarity between the problem framing, methods, and data representation used in CF and in survey imputation, and (ii) the effectiveness of CF-based methods in recommender systems. We use data from the World Values Survey, a valuable social-science dataset of high volume and veracity, to compare (i) one simple approach to imputation, (ii) two established imputation approaches, and (iii) two CF matrix completion techniques. The results show that our chosen CF matrix completion techniques perform overall comparably to, but not better than, existing imputation techniques in the case of survey imputation. The matrix completion techniques, however, might prove useful in niche situations, such as the imputation of answers to non-ordinal questions. Since the right technique for imputation often depends on the problem, these results invite the consideration of CF-based techniques in future research on survey imputation.
Keywords
imputation, survey, matrix completion, collaborative filtering
1. Introduction
For many areas of science, in particular social sciences, questionnaires are an essential tool for
gathering data. The process of collecting data through questionnaires, called a survey [4], has
advantages, such as getting data directly from the participants, but also downsides, such as
missing data values [16]. There are many reasons why data might be missing: (i) the survey administrator(s) made an error [12], (ii) participants might not answer all questions, i.e. item non-response [12], or (iii) other reasons, e.g. acts of nature. No matter their cause, missing values
in questionnaire-acquired data must be dealt with before researchers can make inferences from
the data [2].
Human-Computer Interaction Slovenia 2022, November 29, 2022, Ljubljana, Slovenia
matej.golubovik@gmail.com (A. M. Golubovikj); branko.kavsek@famnit.upr.si (B. Kavšek);
marko.tkalcic@famnit.upr.si (M. Tkalčič)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
A common approach to dealing with missing values is to delete all entries which contain
them [10]. The advantage of this deletion is its simplicity [10], however, it forces the researcher
to operate on a partial dataset, which might produce misleading results [10]. To operate on the
whole data, missing values must often be imputed, i.e. filled in with replacement values. Often-used techniques for imputation in surveys include: (i) simple imputation [12], which replaces missing data in a variable with its average or most frequent value; (ii) hot-deck imputation [12], which exploits the similarities between entries in the data to find suitable replacements; and (iii) model-based approaches [12], which model each variable based on the available data and fill in missing values using the model for each variable.
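As a minimal illustration of the first of these techniques, a mean/mode simple imputer can be sketched as follows. This is a toy example in plain numpy, not the pipeline used in this paper; the data values are made up.

```python
import numpy as np

def simple_impute(col, kind="mean"):
    """Fill NaNs in a 1-D answer vector with the column mean (for ordinal
    answers) or the most frequent observed value (for categorical answers)."""
    col = np.asarray(col, dtype=float).copy()
    observed = col[~np.isnan(col)]
    if kind == "mean":
        fill = observed.mean()
    else:  # "mode": most frequent observed answer
        values, counts = np.unique(observed, return_counts=True)
        fill = values[np.argmax(counts)]
    col[np.isnan(col)] = fill
    return col

answers = np.array([1.0, 2.0, np.nan, 2.0])
print(simple_impute(answers, "mean"))  # NaN replaced by the mean, 5/3
print(simple_impute(answers, "mode"))  # NaN replaced by the mode, 2.0
```

Hot-deck and model-based imputation replace the single column statistic with, respectively, values borrowed from similar entries and predictions from a per-variable model.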
Existing imputation techniques have advantages, such as allowing the user to operate on the full data; however, they can suffer from (i) the introduction of high noise to the data [10] and (ii), in the survey case, ineffectiveness when more than 40% of the data is missing [12] (high missingness). In our work, we address these issues by evaluating whether the usage of alternative imputation methods that are commonly used in recommender systems (RS) can yield more accurate imputations of missing values, in cases of both low and high missingness. The rationale for the usage of these methods is (i) the similarity of the problem framing between questionnaires and RS, and (ii) the effectiveness of these methods in recommender systems.
These similarities in problem framing are most noticeable in collaborative filtering (CF) for recommender systems. CF operates on a user-to-item ratings matrix that stores the opinions of human users about given items, usually expressed as a scalar value called a rating (e.g. a 1-5 Likert scale, where 1 is a very negative and 5 a very positive rating). Due to the large volume of items in such systems, users are usually familiar with only a fraction of the items; consequently, many of the entries in the ratings matrix are empty [1], i.e. missing. Recommendation is then done by filling in these missing entries using solely the data from this matrix, through a process called matrix completion [1]; items with high predicted ratings, i.e. opinions, are then recommended to the user. If we represent the questionnaire data as a matrix, where rows represent participants and columns represent questions, the problem of filling in missing data becomes similar to the problem of matrix completion.
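As a toy illustration of this framing (the values are made up), a survey in matrix form has exactly the shape of a CF ratings matrix, with unanswered questions as missing entries:

```python
import numpy as np

# Hypothetical toy survey: rows are participants, columns are questions,
# and np.nan marks an unanswered question -- structurally the same as the
# user-to-item ratings matrix used in collaborative filtering.
R = np.array([
    [1.0, 3.0, np.nan],
    [2.0, np.nan, 5.0],
    [np.nan, 4.0, 4.0],
    [1.0, 2.0, np.nan],
])
print(f"{np.isnan(R).mean():.0%} of entries are missing")  # prints "33% ..."
```

Matrix completion fills in the NaN entries of `R` using only the observed entries, which is exactly the survey-imputation task restated.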
This paper focuses on the comparison between matrix completion techniques and classical survey imputation techniques in the task of filling in missing answers in the World Values Survey [6], a highly valued dataset in the field of social science [11].
2. Related Work
A number of studies have utilized matrix completion and collaborative filtering outside of the field of recommender systems, in fields including medicine [7], bioinformatics [14], image processing [5], infrastructure [9], and security [13]. Many of these fields report favorable results from the use of collaborative filtering for their specific problems, especially when large amounts of data are missing. Moreover, the works of Saha et al. [14] and Li et al. [9] have successfully utilized matrix completion in the imputation of DNA microarray and highway traffic-related data, respectively.
Two works examine the use of matrix completion in a broad imputation scenario. (i) Wang et al. propose an ensemble-based imputation method that includes an item-to-item collaborative technique in the ensemble; they show that their ensemble method outperforms k-nearest neighbors (KNN) imputation on common datasets from the UCI (University of California Irvine) data repository, but do not evaluate the performance of the item-to-item collaborative technique on its own. (ii) Chi and Li [3] examine the use of low-rank matrix completion in the general role of imputation; they use synthetic data to show that low-rank matrix completion techniques can operate under the statistical assumptions for missing data utilized in imputation.
In the case of survey imputation, the use of matrix completion has also been highlighted in some cases. Vozalis et al. [17] test the usage of a user-based collaborative filtering technique for the imputation of a small transportation survey consisting of univariate question answers on the Likert (1-5) scale. They report a Mean Absolute Error (MAE) of 0.846 for this technique when imputing data with 20% missing answers. Similarly, Oliveira et al. [17] compare matrix factorization and item-to-item collaborative filtering techniques for the purpose of predicting univariate Likert-scale questionnaire responses in a large company survey. They find that, on 20% missing data, these techniques can distinguish between a positive and a negative response with an Area Under the Curve (AUC) score of at least 0.80 on the given data.
Although there has been research using matrix completion on survey data, to the best of our
knowledge, there have been no attempts to compare the effectiveness of matrix completion
techniques and classical survey imputation techniques. In this work, we fill this gap by directly
comparing both approaches on the scenario of World Values Survey data.
3. Data Overview
For the purpose of comparing the effectiveness of matrix completion and classical imputation
techniques in the case of missing survey data, we utilize data from the World Values Survey
(WVS). The WVS is an international research program devoted to the scientific and academic
study of social, political, economic, religious, and cultural values of people in the world [6]. In
our testing, we use a subset of the data from the WVS’s 7th wave (7th iteration) of the survey,
conducted across 57 countries in the years 2017-2021. The subset used in our testing contains the answers of 84,638 participants to 274 survey questions, covering topics such as: (i) ethical values, (ii) social values and perceptions, (iii) political values, and (iv) stances on various social and political questions.
The questions used in the WVS are closed questions, meaning that the participants respond
using a list of provided answers rather than articulating the answers themselves. Responses are
recorded as a number which denotes the participant's choice from the list. The numbers used to record responses in the WVS range from 1 to the number of answers, e.g. for five answers the range is 1 to 5. Among the questions in our subset, we find 8 answer ranges: "1-2", "1-3", "1-4", "1-5", "1-7", "1-8", "1-10" and "1-11". Based on their range, the question answers in our subset can be divided into three categories: (i) Dichotomous - questions with binary (e.g. Yes/No) answers; questions in the "1-2" range fall into this category; (ii) Nominal-Polytomous - questions with a set of more than two answers with no inherent ordering; in our subset, questions on the "1-3" range fall into this category; and (iii) Ordinal-Polytomous - questions with a set of more than two answers which themselves contain an ordering; in our subset, all other ranges ("1-4" and up) fall into this category.
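This division can be expressed as a small helper function. The range strings below are our assumption about how the answer ranges are encoded; the mapping itself follows the categories just described and is specific to the ranges found in this WVS subset.

```python
def question_category(answer_range: str) -> str:
    """Map an answer range such as "1-5" to one of the three question
    categories described above (specific to the ranges in our subset)."""
    upper = int(answer_range.split("-")[1])
    if upper == 2:
        return "dichotomous"
    if upper == 3:
        return "nominal-polytomous"  # in this subset, "1-3" answers are nominal
    return "ordinal-polytomous"      # "1-4" and up

ranges = ["1-2", "1-3", "1-4", "1-5", "1-7", "1-8", "1-10", "1-11"]
print({r: question_category(r) for r in ranges})
```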
For reasons mentioned in the methodology section of this paper, we also retain data on the
participant’s country of origin in our testing subset.
4. Methodology
The flow of our methodology from the data preparation step to the final comparisons between
approaches is presented in Figure 1.
The data utilized in our testing is described in Section 3. Among the three types of survey question answers observed in Section 3, i.e. (i) Dichotomous, (ii) Nominal-Polytomous, and (iii) Ordinal-Polytomous questions, we find two imputation tasks, namely a regression task and a classification task. Ordinal-polytomous answers are handled using regression, while classification is used to handle answers to dichotomous and nominal-polytomous questions. In both tasks, we compare the effectiveness of matrix completion and classical imputation approaches on the testing data for the specific task. The approaches remain the same in both tasks; they are only adjusted to fit the problem (classification or regression).
Three classical imputation approaches are considered: (i) simple imputation, which serves as a baseline; it imputes the mean value in the regression case and the mode value in the classification case; (ii) k-nearest neighbors (KNN) imputation, a hot-deck approach; in the regression case it uses weighted mean resolution to impute from the neighborhood, while mode resolution is used in the classification case; and (iii) model-based imputation, which performs initial simple imputation and then imputes utilizing one regressor per feature; it uses linear regression with initial mean imputation for the regression task and a Bayesian ridge regressor with initial mode imputation in the classification task.
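For the regression task, the second and third approaches can be sketched with scikit-learn's `KNNImputer` and `IterativeImputer` as plausible stand-ins. The paper's exact implementation and hyperparameters are not specified here, so the settings below (neighborhood size, iteration count) are illustrative assumptions, not the authors' configuration.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

X = np.array([[1.0, 2.0, np.nan],
              [2.0, np.nan, 3.0],
              [1.0, 2.0, 3.0],
              [3.0, 3.0, 1.0]])

# (ii) hot-deck style KNN imputation with distance-weighted mean resolution
knn = KNNImputer(n_neighbors=2, weights="distance")
X_knn = knn.fit_transform(X)

# (iii) model-based imputation: one linear regressor per feature after an
# initial mean fill, applied in round-robin fashion
model = IterativeImputer(estimator=LinearRegression(),
                         initial_strategy="mean", max_iter=5, random_state=0)
X_model = model.fit_transform(X)
print(X_knn.shape, X_model.shape)
```

The classification variants would swap the resolution to mode / most frequent class, as described above.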
The classical imputation approaches are compared to two matrix completion techniques: (i)
item-to-item CF, which, similarly to KNN, uses weighted mean resolution among similar items
in the regression task and mode resolution among similar items in the classification case; (ii)
non-negative matrix factorization, refined by a Decision Tree regressor and classifier in the
regression and classification tasks respectively.
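A minimal sketch of the first matrix completion technique, item-to-item CF with weighted mean resolution, is given below. The similarity measure (cosine over co-answered rows), neighborhood size, and fallback rule are our illustrative design choices, not necessarily the ones used in this paper.

```python
import numpy as np

def item_item_predict(R, user, item, k=2):
    """Predict a participant's missing answer to `item` from their answers
    to the k most similar questions (similarity-weighted mean)."""
    n_items = R.shape[1]
    sims = np.zeros(n_items)
    for j in range(n_items):
        if j == item:
            continue
        both = ~np.isnan(R[:, item]) & ~np.isnan(R[:, j])
        if both.sum() >= 2:  # need co-answered rows to compare questions
            a, b = R[both, item], R[both, j]
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            sims[j] = (a @ b) / denom if denom else 0.0
    answered = ~np.isnan(R[user])
    neighbours = [j for j in np.argsort(-sims) if answered[j] and sims[j] > 0][:k]
    if not neighbours:
        return float(np.nanmean(R[:, item]))  # fall back to the item mean
    w = sims[neighbours]
    return float(w @ R[user, neighbours] / w.sum())

R = np.array([[5.0, 4.0, 1.0],
              [4.0, 5.0, 2.0],
              [1.0, 1.0, 5.0],
              [4.0, np.nan, 1.0]])
print(item_item_predict(R, user=3, item=1))
```

For classification-task questions, the weighted mean over the neighborhood would be replaced by a (weighted) most frequent class.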
The non-negative matrix factorization is refined by a decision tree in the following way. Let Q be a column of the original matrix and let Q′ be the estimate of Q in the matrix resulting from matrix factorization. For each pair Q and Q′, we use the available data in Q to train a decision tree which predicts Q from Q′, and we use this model to predict the remaining missing answers in Q from Q′.
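This refinement can be sketched as follows. Since scikit-learn's `NMF` does not accept missing entries, the sketch fills NaNs with column means before factorizing; that initial fill, along with the rank and tree settings, is our assumption rather than a detail stated in the paper.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.tree import DecisionTreeRegressor

def nmf_tree_impute(R, rank=2, seed=0):
    """Factorize the (mean-filled) matrix, then, per question Q, train a
    tree mapping the NMF estimate Q' back to the observed answers in Q
    and use it to predict the missing answers in Q."""
    mask = np.isnan(R)
    filled = np.where(mask, np.nanmean(R, axis=0), R)  # assumed NaN handling
    model = NMF(n_components=rank, init="random", random_state=seed, max_iter=500)
    W = model.fit_transform(filled)
    R_hat = W @ model.components_          # Q' for every column
    out = R.copy()
    for j in range(R.shape[1]):
        obs = ~mask[:, j]
        if mask[:, j].any() and obs.sum() >= 2:
            tree = DecisionTreeRegressor(random_state=seed)
            tree.fit(R_hat[obs, j].reshape(-1, 1), R[obs, j])
            out[mask[:, j], j] = tree.predict(R_hat[mask[:, j], j].reshape(-1, 1))
    return out

R = np.array([[1.0, 2.0, np.nan],
              [2.0, np.nan, 3.0],
              [1.0, 2.0, 3.0],
              [3.0, 3.0, 1.0]])
print(nmf_tree_impute(R))
```

In the classification task, `DecisionTreeRegressor` would be replaced by a decision tree classifier, as described above.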
To compare the five approaches described above, for each task, we simulate varying degrees
of missingness in the data, from 10% to 50%, and evaluate their performance in imputing the data.
For regressors, we evaluate the Mean Absolute Error (MAE) and Mean Squared Error (MSE), while classifiers are evaluated with their accuracy, precision, recall, and F1 scores. The simulation and evaluation are done through an augmented cross-validation technique. A comparison between ordinary and augmented cross-validation is given in Figure 2.
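One iteration of this augmented procedure can be sketched as follows: each question (column) gets its own random row order, and one fold of that question's answers is made missing, so the held-out answers are scattered across the dataset rather than concentrated in the same rows. The fold sizes and seed below are illustrative.

```python
import numpy as np

def mask_fold(X, fold, n_folds=5, seed=42):
    """Hide one CV fold per question: every column is split into n_folds
    using its own random row order, and the chosen fold is set to NaN."""
    rng = np.random.default_rng(seed)
    X_masked = X.astype(float).copy()
    n = X.shape[0]
    for j in range(X.shape[1]):
        order = rng.permutation(n)          # per-question random row order
        folds = np.array_split(order, n_folds)
        X_masked[folds[fold], j] = np.nan   # this question's test fold
    return X_masked

X = np.arange(50, dtype=float).reshape(10, 5)
Xm = mask_fold(X, fold=4)
print(np.isnan(Xm).sum(axis=0))  # 1/5 of each question's answers hidden
```

The imputers are then run on `Xm`, and their predictions are scored against the hidden entries of `X`.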
To tailor our imputation to the data, and hence produce more robust results, we perform all imputations per country separately. We impute per country because, if the survey data contains clusters, such as those arising from demographics, better imputation results are achieved when data is imputed for each cluster separately [8, 15]; moreover, in international surveys, an often-taken
Figure 1: Flow of the imputation and evaluation procedures
Figure 2: Differences between our augmented cross-validation procedure and ordinary cross-validation
and effective approach is imputing answers for each country separately [18].
5. Results
5.1. Regression
Table 1 shows the Mean Absolute Error and Mean Squared Error of our regression approaches. The final errors are calculated by taking the average across all questions, all countries, and finally across all magnitudes of missingness tested (from 10% to 50%). For the regression task, all values are scaled to the 1 to 3 range. We keep this scale in our final results so that the average error remains interpretable in the multivariate scenario. The best-performing imputer is marked in bold.
| Metric | Matrix Factorization (w/ Decision Tree) | Item-to-item CF | KNN Imputation | Regression-Based Imputation | Mean Imputation (whole dataset) | Mean Imputation (per country) |
|--------|-----------------------------------------|-----------------|----------------|-----------------------------|---------------------------------|-------------------------------|
| MAE    | 0.4120 | 0.3809 | 0.3933 | **0.3602** | 0.5076 | 0.4488 |
| MSE    | 0.3713 | 0.2677 | 0.2591 | **0.2302** | 0.3758 | 0.3171 |
Table 1
MAE and MSE scores for each regression imputation method considered, calculated by taking the average across all questions, all countries, and all magnitudes of missingness tested (10%, 20%, 30%, 40%, and 50%). The errors are presented on a scale of 1 to 3 (we would obtain similar results on the -1 to 1 scale as well). The best imputer is marked in bold.
5.2. Classification
The results for Accuracy, F1 Score, Precision, and Recall in the classification task are given in
Table 2. As in the regression case, the evaluation statistics presented are the average
across all questions that fall under this task, as well as over all countries and magnitudes of
missingness tested. The best-performing technique for each evaluation statistic is marked in
bold.
From Table 2 we can see that the mode per country is a powerful predictor in the case of
the classification task. This implies that the data is unbalanced, hence, the F1 score, Precision,
and Recall are better indicators of the performance in this imputation task. Since the F1 score
is a balanced measure between Precision and Recall, we will use it as the prime metric for
comparison in the case of classification.
| Metric | Matrix Factorization (w/ Decision Tree) | Item-to-item CF | KNN Imputation | Regression-Based Imputation | Mode Imputation (whole dataset) | Mode Imputation (per country) |
|-----------|-----------------------------------------|-----------------|----------------|-----------------------------|---------------------------------|-------------------------------|
| Accuracy  | 0.6940 | 0.6701 | **0.7344** | 0.6175 | 0.6680 | 0.7077 |
| F1        | **0.4745** | 0.4116 | 0.4391 | 0.3483 | 0.3261 | 0.3413 |
| Precision | **0.5345** | 0.4599 | 0.4893 | 0.4427 | 0.2761 | 0.2985 |
| Recall    | **0.4952** | 0.4180 | 0.4612 | 0.3574 | 0.4082 | 0.4133 |
Table 2
Accuracy, F1, Precision, and Recall scores for each classification imputation method considered, calculated by taking the average across all questions, all countries, and across all magnitudes of missingness
tested (10%, 20%, 30%, 40% and 50%). For each score, the best imputer is marked in bold.
6. Discussion and Conclusion
The results show that our chosen CF matrix completion techniques perform overall comparably to, but not better than, existing imputation techniques in the case of survey imputation. The matrix completion techniques, however, might prove useful in the niche situations highlighted in the results. Item-to-item collaborative filtering performs comparably to the KNN technique in both imputation tasks, only failing to match it at high ratios of missing data in the classification case. On the other hand, item-to-item fails to match model-based imputation in the regression task, but performs better than it in the classification task. Moreover, the results show that the matrix factorization technique offers poor performance in terms of MSE in the regression case, failing to match both existing imputation techniques; in the case of classification, however, it outperforms all tested techniques in terms of F1 score on unbalanced data.
In comparison with related work, we achieve results similar to those of Vozalis et al. [17] for MAE in terms of matrix completion: their MAE of 0.846 on univariate 1 to 5 data is comparable to our MAE of 0.40 on the 1 to 3 scale, achieved on multivariate data. This raises the question of whether the scale affects matrix completion techniques, since collaborative filtering techniques in recommender systems usually operate on ratings that are all on the same scale. Could alterations of these techniques to fit multivariate data be more beneficial in future work on survey imputation?
We also note that the nature of the data might affect the results. For example, the model-based imputer performs initial mean imputation before building its models; therefore, its high performance in the regression task may be due to the power of mean imputation on our data. Future work might compare matrix completion and classical imputation techniques on a larger range of survey data.
Future work on this subject should also consider these techniques in different scenarios, as well as examine the effects that these techniques have on statistical inference. Moreover, our study included only simple matrix completion techniques; CF techniques are vast and varied, and other techniques might succeed where ours have fallen short. The consideration of such techniques in the study of imputation may also prove fruitful in future work.
References
[1] C. C. Aggarwal. Recommender Systems. Springer International Publishing, 2016. isbn:
978-3-319-29657-9. doi: 10.1007/978-3-319-29659-3.
[2] S. Buuren. Flexible Imputation of Missing Data. 2nd ed. New York: Chapman and Hall/CRC,
2018. isbn: 978-0-429-49225-9. doi: 10.1201/9780429492259.
[3] E. C. Chi and T. Li. “Matrix completion from a computational statistics perspective”. In:
WIREs Comp Stat 11.5 (2019). issn: 1939-5108, 1939-0068. doi: 10.1002/wics.1469.
[4] R. M. Groves, ed. Survey methodology. 2nd ed. Wiley series in survey methodology. Wiley,
2009. 461 pp. isbn: 978-0-470-46546-2.
[5] N. Gupta, K. Goyal, and H. Khatter. “Optimal reduction of noise in image processing
using collaborative inpainting filtering with Pillar K-Mean clustering”. In: The Imaging
Science Journal 67.2 (2019), pp. 100–114. issn: 1368-2199, 1743-131X. doi: 10.1080/13682199.
2018.1560958.
[6] C. Haerpfer et al. World Values Survey Wave 7 (2017-2022) Cross-National Data-Set. In collab.
with K. Kizilova et al. Version 4.0.0, dataset. 2022. doi: 10.14281/18241.18.
[7] F. Hao and R. H. Blair. “A comparative study: Classification vs. user-based collaborative
filtering for clinical prediction”. In: BMC Medical Research Methodology 16.1 (2016). issn:
1471-2288. doi: 10.1186/s12874-016-0261-9.
[8] N. Karmitsa et al. “Missing Value Imputation via Clusterwise Linear Regression”. In: IEEE
Transactions on Knowledge and Data Engineering 34.4 (2022). Conference Name: IEEE
Transactions on Knowledge and Data Engineering, pp. 1889–1901. issn: 1558-2191. doi:
10.1109/TKDE.2020.3001694.
[9] L. Li et al. “Missing value imputation for traffic-related time series data based on a multi-
view learning method”. In: IEEE Transactions on Intelligent Transportation Systems 20.8
(2019), pp. 2933–2943. issn: 1524-9050. doi: 10.1109/TITS.2018.2869768.
[10] R. J. A. Little and D. B. Rubin. Statistical Analysis with Missing Data. 3rd ed. John Wiley &
Sons, 2019.
[11] S. G. Ludeke and E. G. Larsen. “Problems with the Big Five assessment in the World
Values Survey”. In: Personality and Individual Differences 112 (2017), pp. 103–105. issn:
0191-8869. doi: 10.1016/j.paid.2017.02.042.
[12] A. Mirzaei et al. “Missing data in surveys: Key concepts, approaches, and applications”. In:
Research in Social and Administrative Pharmacy 18.2 (2022), pp. 2308–2316. issn: 15517411.
doi: 10.1016/j.sapharm.2021.03.009.
[13] R. M. Rodríguez et al. “Using collaborative filtering for dealing with missing values
in nuclear safeguards evaluation”. In: International Journal of Uncertainty, Fuzziness
and Knowlege-Based Systems 18.4 (2010), pp. 431–449. issn: 0218-4885. doi: 10.1142/
S0218488510006635.
[14] S. Saha et al. “Missing value imputation in DNA microarray gene expression data: A
comparative study of an improved collaborative filtering method with decision tree based
approach”. In: International Journal of Computational Science and Engineering 18.2 (2019),
pp. 130–139. issn: 1742-7185. doi: 10.1504/IJCSE.2019.097954.
[15] J. Shao and H. Wang. “Sample Correlation Coefficients Based on Survey Data Under
Regression Imputation”. In: Journal of the American Statistical Association 97.458 (2002),
pp. 544–552. doi: 10.1198/016214502760047078.
[16] G. N. Singh et al. “Some imputation methods for missing data in sample surveys”. In:
Hacettepe Journal of Mathematics and Statistics 45.6 (2016), pp. 1865–1880. issn: 2651-477X.
doi: 10.15672/HJMS.20159714095.
[17] M. Vozalis, S. Basbas, and I. Politis. “Applying Collaborative Filtering Techniques In
Transportation Surveys”. In: 1st International Conference on Engineering and Applied
Sciences Optimization. 2014, pp. 1630–1638.
[18] M. Weber and M. Denk. Imputation of Cross-Country Time Series: Techniques and Evalua-
tion. 2010.