<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>AOABB</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Towards Guiding Data Imputation for Scientific Data Analytics Via SHAP-like Scores</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Omer Abramovich</string-name>
          <email>omera1@mail.tau.ac.il</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hadas Stiebel-Kalish</string-name>
          <email>hadaskalish@tauex.tau.ac.il</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Deutch</string-name>
          <email>danielde@post.tau.ac.il</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Blavatnik School of Computer Science, Tel Aviv University</institution>
          ,
          <addr-line>Tel Aviv</addr-line>
          ,
          <country country="IL">Israel</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Gray Faculty of Health Life Sciences, Tel Aviv University</institution>
          ,
          <addr-line>Tel Aviv</addr-line>
          ,
          <country country="IL">Israel;</country>
          <institution>Eye Laboratory, Felsenstein Medical Research Center, Tel Aviv, Israel; Department of Ophthalmology, Rabin Medical Center</institution>
          ,
          <addr-line>Petah Tikva</addr-line>
          ,
          <country country="IL">Israel</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>0000</year>
      </pub-date>
      <volume>2</volume>
      <abstract>
        <p>Missing data is prevalent in scientific databases, arising from measurement failures, partially manually filled data, and other causes. Missing data may adversely affect scientific analytics and needs to be dealt with, in a process referred to as data imputation. Standard techniques for data imputation either incur loss of data, introduce errors, or require significant manual labor to complete the values. In this paper we put forward a novel approach that guides the data imputation process by adapting and extending a recently emerging technique for explainable computation, namely attribution. Attribution involves assigning importance scores to individual data items based on their contribution to the computation result, and concrete notions of such scores have been extensively studied in recent years based on measures from game theory such as Shapley and Banzhaf values. Our observation is that these attribution scores are valuable not only for explaining the computation but also, in the presence of missing values, for guiding data imputation in the sense of deciding which data to complete and what resources to allocate for it. We present the approach as well as a concrete use case in the domain of medical data, and highlight multiple directions for future research.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>This paper describes a novel approach for guiding data imputation in the context of scientific analysis.
We start by recalling the problem of missing data in scientific analysis, highlighting the main challenges.
We then briefly overview existing approaches for data imputation and their limitations. We follow with
a description of notions of attribution and importance scores and then describe our high-level approach
towards a solution.</p>
      <p>Missing Data in Scientific Analytics. Scientific studies typically involve the analysis of data that
comes from various sources and is often incomplete or difficult to obtain. Missing data (NULLs) can
significantly affect the computation and the analytics results on the one hand, but may be difficult or even
impossible to complete on the other.</p>
      <p>Example 1.1. We will use as an example the case of retrospective medical data analysis, where data in
retrospective studies often contains missing information. Leading causes of incomplete data include:
patient failure to show up for appointments, physician work-overload leading to incomplete data
notation, and lack of healthcare resource availability.</p>
      <p>
        Data Imputation Techniques. To overcome the challenges posed by missing data and to nevertheless
perform analytics, multiple approaches have been developed, which generally fall under the term data
imputation [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. There is a large body of techniques for data imputation, including listwise deletion
(namely, ignoring/deleting the entire tuple), manual curation of the missing values, and algorithmic and
ML-based completion mechanisms. Each of these techniques is inherently imperfect: listwise deletion
comes at the cost of losing information (in particular with respect to values in other attributes of
the same tuple), which may be prohibitively wasteful; automatic imputation (e.g. using regression,
K-nearest neighbors, etc.) is inherently imprecise; and manual labor is costly.
      </p>
      <p>Example 1.2. A particular use case of interest is that of rare disease analysis. By definition, such
analysis needs to deal with a scarcity of data even in the absence of NULLs, and as such listwise deletion
of tuples in which a particular value is absent is highly problematic. Since there is so little data, the
analysis is also highly sensitive to each individual data value, and as such errors in automated data
imputation may have significant effects on the analysis results. Manual curation and case-by-case data
completion through contacting the medical staff may be possible, but is labor-intensive.</p>
      <p>Ideally, we would like a way to identify which of the cells where data is missing are most important
for the purposes of the analysis, and focus the imputation efforts on these cells.</p>
      <p>
        Attribution for Explainable Analytics. In the context of Machine Learning models, and recently also
in that of database query evaluation, a prominent approach is that of attribution. Namely, one assigns a
score to each feature (in Machine Learning) or input tuple (in databases) reflecting its contribution to
prediction/query results. The measures used are typically ones that have been proposed in the context
of game theory, namely Shapley [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Banzhaf [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] values. In a nutshell, these measures aggregate, in
different ways, the marginal contribution of each player to every subset of players (coalition). To apply
them in different contexts such as database queries and ML models, one needs to define the players and
the game value function. For instance, for database queries, the tuples are players and the query result
is the game value function; for ML prediction, the players are features and the prediction result is the
game value function. Similar applications of these measures have been successfully employed in other
contexts.
      </p>
      <p>Our Approach. We propose a novel approach for leveraging attribution for targeted data imputation,
which we next describe at a high level. In query evaluation, as explained above, attribution helps us
identify which tuples are most important for query results. The high-level idea is that these attribution
scores are potentially highly useful for data imputation: as explained above, the “budget” for high-quality
data imputation is usually limited, because it typically requires extensive manual labor. Via attribution,
we can identify which of the tuples with missing values are most useful for the
analysis, and allocate resources for imputing these values.</p>
      <p>Example 1.3. As an extreme example, some tuples (e.g. particular examination results) may be filtered
out early on in the analysis and have no effect on the query result. Asking experts to impute NULLs
in these tuples would be wasteful. Even if a particular tuple t is not filtered out, there may be many
alternative tuples to t that yield the same result, in which case the contribution of t is less significant
than in the absence of such alternatives.</p>
      <p>An analysis of the above flavor focuses on database tuples. By contrast, data imputation typically
focuses on cells where NULLs currently stand instead of values. When we guide cell imputation based
on the importance of tuples, we lose granularity.</p>
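      <p>Example 1.4. A tuple as a whole may be important, but the query may then project out its NULL
attribute, in which case imputing this NULL is of no use. As another example, if the query involves arithmetic, it may e.g. compute a weighted
average of attributes and use it for decision making, so that a NULL in a heavily weighted attribute matters more than one in a lightly weighted attribute.</p>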
      <p>
        This requires a fine-grained analysis that is cell-based. To our knowledge, cell-based attribution has
not been studied in the context of query evaluation. However, it is closely related to attribution methods
commonly used in machine learning, in particular feature importance techniques such as SHAP [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
The technical challenge in Shapley- (or Banzhaf-) based analysis where players are features/cells is that
the game value function cannot be computed directly: a model may not be run on a subset of features
and a query may not be executed on a subset of cells. SHAP addresses this by defining the game value
function for a subset of features as the model’s prediction when missing features are filled in using a
background distribution. We explore in this paper ways to adapt this idea to the setting of cell-based
explanations of query results.
      </p>
      <p>Contributions and scope of this paper. This short paper puts forward the novel approach of using
attribution to guide data imputation; it presents a cell-based attribution model and a sampling-based
algorithm for attribution computation; and it outlines a potential concrete use case in the domain of
retrospective medical data analysis with missing data.</p>
      <p>An implementation of our proposed approach, its deployment for the particular use-case we describe
and others, and experimentation are left as future work and are beyond the scope of this paper.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries</title>
      <p>We next overview necessary preliminaries on databases, queries, data imputation and attribution.</p>
      <p>Database and Queries. A relational database D consists of a finite set of relations {ℛ<sub>1</sub>, . . . , ℛ<sub>n</sub>}.
Each relation ℛ has a fixed schema schema(ℛ) = ⟨A<sub>1</sub>, . . . , A<sub>k</sub>⟩. A database instance is a mapping of
each relation ℛ to a finite set of tuples over its schema. A query q maps a database instance D to a
Boolean or numeric value q(D). For queries that return a relation, we instead consider the residual
query q<sub>t</sub>, which asks whether a tuple t appears in the output.</p>
      <p>Example 2.1. Figure 1 presents a database of medical test information for different patients. The tests
are stored in two relations, namely blood tests and eye tests. In addition, the database stores the weights
of four linear models in the ClassificationCoefficients relation. The figure also presents a SQL query q
that encodes the linear classifier and returns the patients who “pass” it.</p>
      <p>Missing Values. Databases often contain missing values, for instance when patients fail to provide
some information or when some test result is undocumented. These missing values are captured by
NULLs. Analysts may treat NULLs by applying the standard three-valued logic of SQL evaluation,
essentially treating the NULLs as unknown values. Alternatively, one may impute NULLs. An
imputation function ℐ maps a database instance D with NULLs to a complete instance ℐ(D) without
NULLs. Common approaches include listwise deletion (i.e. deleting the tuple), completing NULLs
using the mean/median/most common value occurring in this attribute, performing regression, iterative
imputation approaches, imputation based on some learned distribution, and others.</p>
      <p>Example 2.2. Consider again the database in Figure 1. Consider an imputation function ℐ that fills
each numerical (categorical) cell with the average (most common) value in the corresponding attribute.
Consider the tuple in relation ‘EyeTest’ annotated by 3, with a missing ‘Vision’ attribute. ℐ completes
this value to be the mean value for the attribute, namely 0.73. Consider now the tuple in relation
‘BloodTest’ annotated by 4 with a missing ‘Type’ attribute; ℐ may fill in this value with the most
common value, ‘A’. Eventually, ℐ(D) returns a database D̂ without missing values.</p>
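      <p>The mean/most-common-value imputation function of the example above can be sketched as follows. This is a minimal illustration and not an implementation from this paper: the function name impute_mean_mode, the row representation (a list of dictionaries with None standing for NULL) and the toy values are assumptions made for the sketch, and the values are not those of Figure 1.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from statistics import mean


def impute_mean_mode(rows):
    """Return a copy of rows (a list of dicts) in which every None is filled in.

    Numeric columns receive the mean of their observed values; all other
    columns receive their most common observed value. The sketch assumes that
    every column has at least one observed (non-None) value.
    """
    columns = list(rows[0].keys())
    fill = {}
    for col in columns:
        observed = [r[col] for r in rows if r[col] is not None]
        if all(isinstance(v, (int, float)) for v in observed):
            fill[col] = mean(observed)
        else:
            fill[col] = Counter(observed).most_common(1)[0][0]
    return [{c: (fill[c] if r[c] is None else r[c]) for c in columns} for r in rows]


# Hypothetical toy rows; these are NOT the values of Figure 1.
eye_test = [
    {"Patient": "p1", "Vision": 0.5, "Field": 0.75},
    {"Patient": "p2", "Vision": 1.0, "Field": 0.25},
    {"Patient": "p3", "Vision": None, "Field": 0.5},
]
completed = impute_mean_mode(eye_test)
print(completed[2]["Vision"])  # mean of the two observed Vision values
```
]]></preformat>
      <p>Any other imputation method (regression, iterative imputation, a learned distribution) could be substituted, as long as it maps an instance with NULLs to a complete instance.</p>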
      <p>
        Shapley Values. Given a finite set of players N, a cooperative game is defined by a characteristic
function v : P(N) → ℝ, where P(N) is the power set of N, such that v(∅) = 0. The value v(S) represents the total gain created by the
subset of players S (a "coalition"). The Shapley value [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] then aims to measure the fair share of each
individual player a ∈ N in the total gain v(N) of the cooperative game (N, v).
      </p>
      <p>Figure 1 (query q):
SELECT DISTINCT Patient FROM BloodTest JOIN EyeTest USING(Patient) JOIN ClassificationCoefficients C USING(Type)
WHERE (Hem · C.Hem + Chol · C.Chol + Vision · C.Vision + Field · C.Field) ≥ 1</p>
      <p>The Shapley value of a player a is defined as the expectation of the marginal contribution of
a to the coalition of players that precede it in a uniformly random order of the players:
<disp-formula><tex-math>\mathrm{Shapley}(N, v, a) \stackrel{\mathrm{def}}{=} \sum_{S \subseteq N \setminus \{a\}} \frac{|S|! \cdot (|N| - |S| - 1)!}{|N|!} \Big( v(S \cup \{a\}) - v(S) \Big)</tex-math></disp-formula>
Shapley values capture the marginal contribution of a for each set S, multiplied by |S|! · (|N| − |S| − 1)!,
which is the number of permutations of N such that a comes after S and before (N ∖ {a}) ∖ S.</p>
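      <p>To make the definition concrete, the following minimal sketch computes the Shapley value exactly by enumerating all coalitions. The function name shapley and the three-player game are hypothetical and serve only as an illustration; the enumeration is exponential in the number of players and is meant for small examples.</p>
      <preformat><![CDATA[
```python
from itertools import combinations
from math import factorial


def shapley(players, v, a):
    """Exact Shapley value of player a in the cooperative game (players, v).

    v maps a frozenset of players to a number, with v(frozenset()) == 0.
    """
    others = [p for p in players if p != a]
    n = len(players)
    total = 0.0
    for k in range(len(others) + 1):
        # Weight |S|! (|N| - |S| - 1)! / |N|! shared by all coalitions of size k.
        weight = factorial(k) * factorial(n - k - 1) / factorial(n)
        for subset in combinations(others, k):
            s = frozenset(subset)
            total += weight * (v(s | {a}) - v(s))
    return total


# Hypothetical game: a coalition gains 1 if it contains "x"
# together with at least one of "y" and "z".
def v(s):
    return 1.0 if "x" in s and ("y" in s or "z" in s) else 0.0


players = ["x", "y", "z"]
values = {p: shapley(players, v, p) for p in players}
print(values)
```
]]></preformat>
      <p>In this game the player "x" is required by every winning coalition and therefore receives the largest share, and the three values sum to the total gain of the full coalition.</p>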
    </sec>
    <sec id="sec-3">
      <title>3. Attribution Model for Cells in Query Answering</title>
      <p>We next present a simple attribution model for cells, based on SHAP-like scores.</p>
      <sec id="sec-3-1">
        <title>3.1. SHAP-like scores for cells</title>
        <p>To quantify the contribution of individual missing values to a query result using Shapley values, we
model the database cells (i.e., individual attribute values) as players in a cooperative game. The game
value is defined as the output of the query q. The challenge is that, in contrast to the classic setting
where players are tuples (so that a subset of tuples is simply a sub-database, on which the query semantics
is readily defined), here it is unclear what it means to evaluate the query on a subset of cells. To
this end, we define the semantics as follows. Whenever the Shapley value formula requires the query
result for a specific subset of cells, we treat all cells outside this subset as NULLs. We then apply a
(black-box) imputation function to fill in these missing values, and finally evaluate the query on the
completed database.</p>
        <p>
          Concretely, let D be a database and let ℛ be a relation with attributes Schema(ℛ) = {A<sub>1</sub>, A<sub>2</sub>, . . . , A<sub>k</sub>}.
Let t ∈ D be a specific tuple in ℛ. We denote the value of the cell corresponding to A<sub>i</sub> in t by t.A<sub>i</sub>. We
partition the database D into endogenous cells and exogenous cells, denoted by D<sub>n</sub> and D<sub>x</sub> respectively,
such that D = D<sub>n</sub> ∪ D<sub>x</sub>. Endogenous cells are those whose importance we want to measure,
whereas exogenous cells are treated as constants of the database. This is analogous to the concept of
endogenous and exogenous tuples presented in prior work [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ]. We then define:
        </p>
        <p>Definition 3.1 (Cell Shapley Value). Let D = D<sub>n</sub> ∪ D<sub>x</sub> be a database, q be a query, t.A be an endogenous
cell in a tuple t, and ℐ be an imputation function. The Shapley value of t.A w.r.t. D, q, ℐ is:
<disp-formula><label>(1)</label><tex-math>\mathrm{Shapley}(q, D_n, D_x, \mathcal{I}, t.A) = \sum_{S \subseteq D_n \setminus \{t.A\}} c_{|S|} \cdot \Big[ q\big(\mathcal{I}(S \cup \{t.A\} \cup D_x)\big) - q\big(\mathcal{I}(S \cup D_x)\big) \Big]</tex-math></disp-formula>
where c<sub>|S|</sub> = |S|! · (|D<sub>n</sub>| − |S| − 1)! / |D<sub>n</sub>|! is the Shapley coefficient. For non-deterministic imputation functions, the
Shapley value is defined as the expectation of Equation 1 over the imputation distribution.</p>
        <p>Example 3.2. Consider the database D and the residual query q<sub>1</sub> of the query q from Figure 1,
and consider computing the Shapley value of the cell 1.. To this end, consider an imputation
function that replaces each missing numerical value with the mean of its attribute, and each missing categorical value with its most common
value. We demonstrate the computation for the subset of endogenous cells S = {1.Chol, 1. }.
The imputation function completes 1.Type with the most common value (‘A’) and 1.Vision with the
mean (0.6). The resulting query value is q<sub>1</sub>(ℐ(S ∪ D<sub>x</sub>)) = 0. When adding 1. to S, we get
that q<sub>1</sub>(ℐ(S ∪ {1.} ∪ D<sub>x</sub>)) = 1. The marginal contribution (1 − 0 = 1) is then weighted by
the Shapley coefficient. The final value is the weighted average of these contributions across all such
subsets of endogenous cells.</p>
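      <p>The cell-level semantics can be sketched on a single tuple as follows: cells outside the chosen subset are masked as NULL, a black-box imputation function fills them in, and the query is evaluated on the completed tuple. This is a minimal sketch under stated assumptions and not the implementation of this paper: the names cell_shapley, impute and query, the weights, the column means and the patient values are all hypothetical and are not taken from Figure 1.</p>
      <preformat><![CDATA[
```python
from itertools import combinations
from math import factorial


def cell_shapley(cells, target, impute, query):
    """Cell Shapley value of `target`, by exhaustive enumeration of subsets.

    cells:  dict mapping each endogenous cell name to its value.
    impute: black-box imputation; maps a dict with None entries to a complete dict.
    query:  maps a complete dict to a number (the query result).
    """
    others = [c for c in cells if c != target]
    n = len(cells)

    def value(kept):
        # Cells outside the kept subset are treated as NULL and then imputed.
        masked = {c: (cells[c] if c in kept else None) for c in cells}
        return query(impute(masked))

    total = 0.0
    for k in range(len(others) + 1):
        coeff = factorial(k) * factorial(n - k - 1) / factorial(n)
        for subset in combinations(others, k):
            s = set(subset)
            total += coeff * (value(s | {target}) - value(s))
    return total


# Hypothetical numbers, chosen for illustration only (not taken from Figure 1).
weights = {"Hem": 0.5, "Chol": 0.25, "Vision": 0.5}
column_means = {"Hem": 0.5, "Chol": 1.0, "Vision": 0.5}
patient = {"Hem": 1.5, "Chol": 3.0, "Vision": 0.5}


def impute(masked):
    # Mean imputation: every NULL cell receives the mean of its column.
    return {c: (column_means[c] if x is None else x) for c, x in masked.items()}


def query(row):
    # Returns 1 if the patient passes the linear classifier, and 0 otherwise.
    return 1.0 if sum(row[c] * weights[c] for c in row) >= 1.0 else 0.0


scores = {c: cell_shapley(patient, c, impute, query) for c in patient}
print(scores)
```
]]></preformat>
      <p>In this toy instance, keeping either the Hem or the Chol value is enough for the patient to pass, so the two cells share the credit equally, whereas the Vision value coincides with its imputed mean and receives a score of zero.</p>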
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Importance scores for NULL-valued cells</title>
        <p>We next extend our definition of cell-based attribution to further quantify the importance of cells with
missing (i.e. NULL) values. The definition is geared towards prioritizing cells to be imputed. Intuitively,
a NULL should be prioritized if its possible underlying values lead to widely divergent results. When
only one NULL affects the result, we could compute statistics about the query result when its value
is drawn from the distribution, compared to its value under the imputation function. However, when
many NULLs exist, comparing the effects of those NULLs becomes a more complex task. To analyze such
cases, we assume the existence of a joint distribution Ω governing the missing values. Such a distribution
could be acquired using domain knowledge, or learned from existing database values; if no such
information is available, one may use the uniform distribution, in which case the attribution is still
informative yet is based solely on the structure of the query.</p>
        <p>We are now ready to define the ShapVar score of NULL-valued cells as the variance of their Shapley
values over the distribution Ω.</p>
        <p>Definition 3.3 (ShapVar score). Let D be a database, q a query, ℐ an imputation function, t.A a
NULL-valued cell in D, and Ω a joint distribution over all NULL-valued cells in D. We define:
<disp-formula><tex-math>\mathrm{ShapVar}(t.A, D, q) = \mathrm{Var}_{\hat{D},\, \hat{t}.A \sim \Omega(D)} \Big( \mathrm{Shapley}(q, \hat{D}, \mathcal{I}, \hat{t}.A) \Big)</tex-math></disp-formula>
where Var is the variance, D̂ is a complete database instance sampled from Ω, t̂.A is the value of the cell t.A in D̂, and Shapley(·) is the
cell-level score defined in Equation 1 when only the NULL cells are endogenous.</p>
        <p>
          Example 3.4. Consider again the database and query from Fig. 1, this time with the residual query
q<sub>2</sub> for patient 2. Assume the NULL value in cell 2.Field is distributed uniformly over [0, 1], and
suppose the imputation function assigns its mean value. For every possible value of 2.Field, patient
2 passes the classification test. Thus, replacing the imputed value with any realization yields zero
marginal contribution, and hence ShapVar = 0.
        </p>
        <p>Now consider the classification of patient 3. There are two relevant NULL values, 3.Chol and
3.Vision. Assume their distributions are independent and uniform in the ranges [0.5, 2] and [0.5, 1],
respectively. We can numerically compute the ShapVar score by sampling from this distribution and
computing the Shapley values. We get that the ShapVar scores of 3.Chol and 3.Vision are 0.18 and
0.02, respectively. Intuitively, this is because 3.Chol ranges over broader values and has a higher
coefficient in the classification test.</p>
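      <p>The numerical procedure described above (sample a completion of the NULL cells, compute the cell Shapley values with only the NULL cells as players, and take the variance over the samples) can be sketched as follows. The two uniform ranges follow the example; everything else is an assumption made for this sketch: the function names, the cell labels, the weights, the known cell values, the midpoint imputation and the threshold are hypothetical, so the output is not expected to reproduce the scores reported above.</p>
      <preformat><![CDATA[
```python
import random
from itertools import combinations
from math import factorial
from statistics import pvariance


def cell_shapley(sampled, target, imputed, query, known):
    """Shapley value of the NULL cell `target`; the players are the NULL cells only."""
    others = [c for c in sampled if c != target]
    n = len(sampled)

    def value(kept):
        # Kept cells take their sampled value, the other NULL cells are imputed.
        row = dict(known)
        row.update({c: (sampled[c] if c in kept else imputed[c]) for c in sampled})
        return query(row)

    total = 0.0
    for k in range(len(others) + 1):
        coeff = factorial(k) * factorial(n - k - 1) / factorial(n)
        for subset in combinations(others, k):
            s = set(subset)
            total += coeff * (value(s | {target}) - value(s))
    return total


def shapvar(ranges, imputed, query, known, samples=2000, seed=0):
    """Estimate the variance of each cell's Shapley value over sampled completions."""
    rng = random.Random(seed)
    per_cell = {c: [] for c in ranges}
    for _ in range(samples):
        # Independent uniform draws, one per NULL cell.
        sampled = {c: rng.uniform(lo, hi) for c, (lo, hi) in ranges.items()}
        for c in ranges:
            per_cell[c].append(cell_shapley(sampled, c, imputed, query, known))
    return {c: pvariance(vals) for c, vals in per_cell.items()}


# Hypothetical setting: weights, known values and threshold are illustrative only.
weights = {"Hem": 0.4, "Chol": 0.4, "Vision": 0.2, "Field": 0.2}
known = {"Hem": 0.5, "Field": 0.5}
ranges = {"Chol": (0.5, 2.0), "Vision": (0.5, 1.0)}
imputed = {c: (lo + hi) / 2 for c, (lo, hi) in ranges.items()}  # mean of each range


def query(row):
    # Returns 1 if the patient passes the linear classifier, and 0 otherwise.
    return 1.0 if sum(row[c] * weights[c] for c in row) >= 1.0 else 0.0


scores = shapvar(ranges, imputed, query, known)
print(scores)
```
]]></preformat>
      <p>Under these assumed numbers the cell with the wider range and the larger coefficient obtains the larger score, which matches the intuition stated in the example.</p>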
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Computational Aspects</title>
        <p>
          Computing the cell Shapley value poses significant computational challenges. Naively, its definition
requires considering exponentially many subsets of cells, and evaluating the imputation function and
query on each such subset. Even when query evaluation and imputation are efficient, this combinatorial
explosion renders exact computation infeasible in general. Nevertheless, prior work on Shapley value
computation in databases and ML suggests that computation may be tractable in many practical settings.
For database tuples, techniques based on lineage, factorization, and knowledge compilation have been
used to avoid explicit enumeration of subsets [
          <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
          ]. For SHAP scores, sampling and model-based
approximations were shown to work well in practice [
          <xref ref-type="bibr" rid="ref6">12, 6, 13</xref>
          ]. These complementary lines of work
indicate that similar strategies could be adapted to the cell setting. In particular:
        </p>
        <p>
Proposition 3.5. Let q be a query and ℐ an imputation function, both computable in polynomial time
with respect to the size of the database D. Then, the cell Shapley value Shapley(q, D, ℐ, t.A) admits a
Fully Polynomial-time Randomized Approximation Scheme (FPRAS).
        </p>
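      <p>One standard sampling strategy, shown here only as an illustration, estimates a Shapley value by averaging the marginal contribution of the target cell over uniformly random orders of the endogenous cells. This sketch is not claimed to be the scheme behind the proposition above: it is the generic permutation-sampling estimator, it uses one query evaluation pair per sample, and the names sampled_shapley and value, as well as all numbers, are hypothetical.</p>
      <preformat><![CDATA[
```python
import random


def sampled_shapley(players, value, target, samples=4000, seed=0):
    """Monte Carlo estimate of the Shapley value of `target`.

    Averages the marginal contribution of `target` to the set of players
    that precede it in a uniformly random order.
    """
    rng = random.Random(seed)
    order = list(players)
    total = 0.0
    for _ in range(samples):
        rng.shuffle(order)
        before = frozenset(order[:order.index(target)])
        total += value(before | {target}) - value(before)
    return total / samples


# Hypothetical single-tuple setting; the numbers are illustrative only.
weights = {"Hem": 0.5, "Chol": 0.25, "Vision": 0.5, "Field": 0.25}
actual = {"Hem": 1.5, "Chol": 3.0, "Vision": 0.5, "Field": 1.0}
means = {"Hem": 0.5, "Chol": 1.0, "Vision": 0.5, "Field": 1.0}


def value(kept):
    # Cells outside `kept` are replaced by their column mean (mean imputation),
    # then the linear classifier is evaluated on the completed tuple.
    row = {c: (actual[c] if c in kept else means[c]) for c in actual}
    return 1.0 if sum(row[c] * weights[c] for c in row) >= 1.25 else 0.0


estimates = {c: sampled_shapley(list(actual), value, c) for c in actual}
print(estimates)
```
]]></preformat>
      <p>In this toy instance the exact value for Hem and for Chol is 0.5, since either cell alone makes the patient pass, and the estimates concentrate around it as the number of samples grows. The number of samples trades accuracy for the number of imputation and query evaluations.</p>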
        <p>We leave for future work the development of approximation schemes for ShapVar and the
identification of tractable fragments for exact computation.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Use Case: Rare Disease Treatment Planning</title>
      <p>More than ten thousand conditions fit the definition of a rare disease, affecting approximately
300–400 million people worldwide. Due to small sample sizes and logistic constraints, perfect data or
randomized controlled trials are unlikely to be available for most of these conditions; thus, actionable
information is most likely to be obtained from existing retrospective data and meticulous analysis
of treatments across patients [14]. However, data collected from retrospective studies often contains
missing information, due to e.g. patients’ failure to show up for appointments, physician work-overload
leading to incomplete data notation, and lack of resource availability. These imperfections, compounded
with the fact that the disease is rare, lead to a paucity of quality data. In turn, missing data poses
significant challenges that complicate data analysis and the derivation of conclusions such as treatment
guidelines.</p>
      <p>Existing data imputation techniques do not adequately address the challenges posed by missing data
in this context. Automated data imputation is based on statistical estimation of the missing values and
is inherently imprecise. Coupled with the small amount of data and the high stakes associated with
the analysis, statistical errors are especially harmful. Manual curation, on the other hand, is highly
laborious, especially in the context of retrospective studies, where reaching out to the patients for
questions regarding the missing data may be costly or even impossible.</p>
      <p>A particular medical question of interest is that of deciding the optimal treatment for optic neuritis
(ON), a rare, yet potentially blinding, inflammatory disorder of the optic nerve. ON most commonly
presents in the context of multiple sclerosis (MS), Neuromyelitis Optica Spectrum Disorder (NMOSD),
and Myelin Oligodendrocyte Glycoprotein Antibody-Associated Disease (MOGAD). Most MS-ON cases
respond well to intravenous methylprednisolone (IVMP) with excellent visual outcomes, whereas
NMOSD-ON and MOGAD-ON are frequently associated with severe, permanent visual loss [15, 16, 17].
The common treatment algorithm for ON is based on retrospective case series, often containing missing
data. Such missing data leads to treatment-algorithm distortions by preventing the correct prediction of
which MS-ON patients will not have a good visual outcome, which NMOSD-ON patients respond rapidly to
steroids and may not require immunoadsorption, etc.</p>
      <p>This particular use case is expected to benefit significantly from our approach of importance-guided
data imputation, which allocates costly imputation efforts only to data cells that materially affect the
analysis. As a simple example, if a particular patient is excluded from the analysis, then naturally there
is no need to contact the clinic to complete missing information regarding their treatment. Importantly,
different patients included in the analysis differ in their level of influence on the analysis result; for
example, a patient with MS-ON with missing data will affect the analysis less than a patient with
NMOSD-ON with missing data, because NMOSD-ON is a very rare condition, while MS-ON is
more common and a wealth of data is available for it.</p>
      <p>We intend to deploy our solutions using a retrospective clinical database of 70 patients treated for
acute demyelinating ON with documented 3–6 month visual outcomes. To assess factors affecting final
visual outcome, we will collect: (1) visual acuity at nadir; (2) time from vision loss to IVMP initiation;
(3) visual acuity at 5 ± 2 days; (4) visual acuity at 30 ± 7 days; (5) visual acuity at 3–6 months (main
outcome); (6) antibody status at 2 weeks, when escalation therapy decisions are made. Additional
covariates include age, sex, and escalation therapy use. The database contains missing values, such as
visual acuity at critical time points and delayed antibody test results. We expect the study to (1) provide
insight into the quality of our measures and algorithms and (2) support improved approaches to ON
treatment. Implementation and experimentation are ongoing.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>We have presented a novel approach for guiding data imputation using attribution in query answering.
While attribution has been used for explanations, we demonstrate a novel application of it, for imputation.
In this context, we have identified a need to extend attribution solutions to account for the granularity
of cells rather than tuples, and have introduced a novel attribution model to this effect. We have then
presented a concrete use case for data imputation in the context of medical data analysis where some
data is missing. This is ongoing work and we plan to implement algorithms for cell-based attribution as
well as a solution that leverages these algorithms for guiding data imputation. We further intend to
deploy these solutions in the context of medical data analysis in line with the outlined use-case.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This study was supported by the Clair and Amedee Maratier Institute for the study of Blindness
and Visual Disorders, Faculty of Medical &amp; Health Sciences, Tel-Aviv University; the Israeli Science
Foundation (Grant 1476/24); the Len Blavatnik and the Blavatnik Family foundation; and the Deutsch
foundation.</p>
      <p>[12] E. Štrumbelj, I. Kononenko, Explaining prediction models and individual predictions with feature
contributions, Knowledge and Information Systems 41 (2014) 647–665.</p>
      <p>[13] R. Okhrati, A. Lipani, A multilinear sampling algorithm to estimate Shapley values, in: 2020 25th
International Conference on Pattern Recognition (ICPR), IEEE, 2021, pp. 7992–7999.</p>
      <p>[14] T. R. Frieden, Evidence for health decision making—beyond randomized, controlled trials, New
England Journal of Medicine 377 (2017) 465–475.</p>
      <p>[15] N. Moheb, J. J. Chen, The neuro-ophthalmological manifestations of NMOSD and MOGAD—a
comprehensive review, Eye 37 (2023) 2391–2398.</p>
      <p>[16] J. J. Chen, E. P. Flanagan, S. J. Pittock, N. C. Stern, N. Tisavipat, M. T. Bhatti, K. D. Chodnicki,
D. A. Tajfirouz, S. Jamali, A. Kunchok, et al., Visual outcomes following plasma exchange for
optic neuritis: an international multicenter retrospective analysis of 395 optic neuritis attacks,
American Journal of Ophthalmology 252 (2023) 213–224.</p>
      <p>[17] J. S. Graves, F. C. Oertel, A. Van der Walt, S. Collorone, E. S. Sotirchos, G. Pihl-Jensen, P. Albrecht,
E. A. Yeh, S. Saidha, J. Frederiksen, et al., Leveraging visual outcome measures to advance therapy
development in neuroimmunologic disorders, Neurology: Neuroimmunology &amp;
Neuroinflammation 9 (2021) e1126.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Rubin</surname>
          </string-name>
          ,
          <article-title>Inference and missing data</article-title>
          ,
          <source>Biometrika</source>
          <volume>63</volume>
          (
          <year>1976</year>
          )
          <fpage>581</fpage>
          -
          <lpage>592</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jordon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van der Schaar</surname>
          </string-name>
          ,
          <article-title>GAIN: Missing data imputation using generative adversarial nets</article-title>
          ,
          <source>in: International Conference on Machine Learning, PMLR</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>5689</fpage>
          -
          <lpage>5698</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Shapley</surname>
          </string-name>
          ,
          <article-title>A value for n-person games</article-title>
          ,
          <source>Contributions to the Theory of Games</source>
          <volume>2</volume>
          (
          <year>1953</year>
          )
          <fpage>307</fpage>
          -
          <lpage>317</lpage>
          . URL: http://www.library.fa.ru/files/Roth2.pdf#page=39.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Penrose</surname>
          </string-name>
          ,
          <article-title>The elementary statistics of majority voting</article-title>
          ,
          <source>J. Royal Stats. Soc</source>
          .
          <volume>109</volume>
          (
          <year>1946</year>
          )
          <fpage>53</fpage>
          -
          <lpage>57</lpage>
          . URL: http://www.jstor.org/stable/2981392.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Banzhaf</surname>
          </string-name>
          <string-name>
            <surname>III</surname>
          </string-name>
          ,
          <article-title>Weighted voting doesn't work: A mathematical analysis</article-title>
          ,
          <source>Rutgers Law Review</source>
          <volume>19</volume>
          (
          <year>1965</year>
          )
          <fpage>317</fpage>
          -
          <lpage>343</lpage>
          . URL: https://heinonline.org/HOL/LandingPage?handle=hein.journals/rutlr19&amp;div=19&amp;id=&amp;page=.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Lundberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-I.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>A unified approach to interpreting model predictions</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Livshits</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bertossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kimelfeld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sebag</surname>
          </string-name>
          ,
          <article-title>The shapley value of tuples in query answering</article-title>
          ,
          <source>Logical Methods in Computer Science</source>
          <volume>17</volume>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.</given-names>
            <surname>Abramovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Deutch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Frost</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Olteanu</surname>
          </string-name>
          ,
          <article-title>Banzhaf values for facts in query answering</article-title>
          ,
          <source>Proc. ACM Manag. Data</source>
          <volume>2</volume>
          (
          <year>2024</year>
          ). URL: https://doi.org/10.1145/3654926. doi:10.1145/3654926.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Deutch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Frost</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kimelfeld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Monet</surname>
          </string-name>
          ,
          <article-title>Computing the shapley value of facts in query answering</article-title>
          ,
          in:
          <source>Proceedings of the 2022 International Conference on Management of Data</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1570</fpage>
          -
          <lpage>1583</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>O.</given-names>
            <surname>Abramovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Deutch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Frost</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Olteanu</surname>
          </string-name>
          ,
          <article-title>Banzhaf values for facts in query answering</article-title>
          ,
          <source>Proceedings of the ACM on Management of Data</source>
          <volume>2</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>O.</given-names>
            <surname>Abramovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Deutch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Frost</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Olteanu</surname>
          </string-name>
          ,
          <article-title>Advancing fact attribution for query answering: Aggregate queries and novel algorithms</article-title>
          ,
          <source>Proc. VLDB Endow</source>
          .
          <volume>18</volume>
          (
          <year>2025</year>
          )
          <fpage>3996</fpage>
          -
          <lpage>4008</lpage>
          . URL: https://doi.org/10.14778/3749646.3749670. doi:10.14778/3749646.3749670.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>