CEUR Workshop Proceedings, Vol-2414, paper 7. PDF: https://ceur-ws.org/Vol-2414/paper7.pdf. DBLP: https://dblp.org/rec/conf/sigir/SopranoRM19
   HITS Hits Readersourcing: Validating Peer
   Review Alternatives Using Network Analysis

             Michael Soprano, Kevin Roitero, and Stefano Mizzaro

 Dept. of Mathematics, Computer Science, and Physics. University of Udine, Italy.
        michael.soprano@outlook.com, roitero.kevin@spes.uniud.it,
                             mizzaro@uniud.it


      Abstract. Peer review is a well known mechanism exploited within the
      scholarly publishing process to ensure the quality of scientific literature.
      Such a mechanism, despite being well established and reasonable, is not
      free from problems, and alternative approaches to peer review have been
      developed. Such approaches exploit the readers of scientific publications
      and their opinions, and thus outsource the peer review activity to the
      scholar community; an example of this approach has been formalized
      in the Readersourcing model [5]. Our contribution is two-fold: (i) we
      propose a stochastic validation of the Readersourcing model, and (ii)
      we employ network analysis techniques to study the bias of the model,
      and in particular the interactions between readers and papers and their
      goodness and effectiveness scores. Our results show that by using network
      analysis interesting model properties can be derived.


  1    Introduction
    Peer review is an a priori mechanism exploited within the scholarly publishing
process to ensure the quality of scientific literature; an article written by some
authors undergoes peer review when it is judged and rated by colleagues of the
same degree of competence. Such a mechanism, despite being well established and
reasonable, is not free from problems; indeed, it is characterized by various issues
related to the process itself and to the malicious behavior of some stakeholders [8].
    In the literature one can find alternative approaches to peer review, which exploit
readers of scientific publications and their opinions as a “review force”, thereby
outsourcing the peer review activity itself to the community of readers. One of
these approaches has been proposed by Mizzaro [5] and called Readersourcing,
as a portmanteau for “crowdsourcing” and “readers”, and it is based on a model
proposed in a previous work [4]. Another similar model is TrueReview [2]. So-
prano and Mizzaro [8] describe a general ecosystem called Readersourcing 2.0
which provides an implementation for such models.
    The aim of the Readersourcing model is to define a way to measure the overall
quality of a published article as well as the reputation of a scholar as a reader
/ assessor; moreover, from these measures it is possible to derive the reputation
of a scholar as an author. In other terms, the main issue to address is how the
numerical judgments given to publications should be aggregated into indexes
of quality and, from these indexes, how to compute indexes of reputation for
the readers and, eventually, indexes of how much an author is able to publish
papers which are positively rated by their readers. Therefore, each entity (i.e.,
publications, authors, and readers) is assigned one or more scores which measure
how good (skilled) it is.
    Network analysis is a discipline which studies the features and properties of
(usually large) networks or graphs. Its algorithms can be quite general and,
therefore, applicable to different domains. Mizzaro and Robertson [6] exploit link
analysis techniques such as the HITS algorithm proposed by Kleinberg [3] to ad-
dress a research question related to the effectiveness evaluation of Information
Retrieval (IR) systems. The evaluation of IR systems is performed within dif-
ferent initiatives, such as TREC (Text REtrieval Conference). Before the actual
conference, TREC provides a test collection made of documents and topics (i.e.,
representations of information needs); such a test collection is used as a bench-
mark to compare the performance of different IR systems. Participants use their
systems to retrieve, and submit to TREC, a list of documents for each topic.
System effectiveness is then measured by well established metrics like Mean Av-
erage Precision (MAP) and a final ranking is built. Mizzaro and Robertson [6]
study the interactions between the difficulty of topics and the final rank of IR
systems. In particular, they investigate the correlation between topic ease and
the ability to predict system effectiveness and they find that to be effective, a
system has to perform well on easy topics. Such a finding is quite undesirable,
since difficult topics are more useful to allow IR to evolve. Roitero et al. [7] extend the
work of Mizzaro and Robertson [6] by performing a more detailed analysis on
three different datasets: they confirm that the original result is valid and general
across datasets; they find that when only the most effective IR systems are con-
sidered there is no evidence that the ranking is affected only by easy topics; and
they prove that such results are robust across different effectiveness metrics.
    In this paper we take advantage of the methodology proposed by Mizzaro
and Robertson [6] and extended by Roitero et al. [7] to address a similar research
question related to the Readersourcing model. More in detail, we intend to study
the interactions between the skill of a reader and the quality of a paper, where
such quantities are computed by Readersourcing models. This paper is structured
as follows. Section 2 details the related work; Section 3 presents our adaptation of
the methodology to Readersourcing; Section 4 describes the experiments
performed; Section 5 discusses the results. Finally, Section 6 concludes the paper.

    2     Background
    In an attempt to make this paper self-contained, in this section we
summarize the two major related work areas on which our analysis builds.
Section 2.1 summarizes the Readersourcing model proposed by Mizzaro [5],
while Section 2.2 summarizes the methodology proposed by Mizzaro and
Robertson [6] to investigate the correlation between topic ease and the ability
to predict system effectiveness within the effectiveness evaluation of IR systems.
    2.1   The Readersourcing Model
    In the Readersourcing model three entities are identified: papers, readers,
and authors. The score of an author is simply defined as a weighted average of
the scores of his or her papers; we do not analyze it in detail in this paper, where we focus on
[Figure 1 shows two reader-paper matrices side by side: (a) the RP matrix with judgments (RPJ), whose cell (ri , pj ) contains the judgment jri ,pj , and (b) the RP matrix with goodness values (RPG), whose cell contains g(jri ,pj ); both are bordered by the score/steadiness column vectors Sr and σr (one entry per reader) and row vectors Sp and σp (one entry per paper).]

Fig. 1: Reader-paper matrices (RP) with judgments (goodness values), scores,
and steadiness at a fixed timestamp.


readers and papers. A generic reader is asked to give a numerical judgment
to each paper he reads. Such judgments are used to compute a quality score for
each paper. Likewise, each reader is characterized by a score which measures
his or her skill/reputation. Each judgment is assigned a measure of its goodness
with respect to the other judgments given to the same paper. Moreover, papers
and readers are assigned a steadiness value which affects the update of the scores;
a high (low) steadiness value leads to a faster (slower) change of the scores themselves.
    Scores are dynamic and they change depending on user behaviour. For exam-
ple, if an author with a low score publishes a paper positively rated by readers,
his score increases; if a reader expresses a judgment which is judged as untruthful
and/or biased because “distant” from other judgments (for a given paper) his
score decreases, and so on. Therefore, there is a temporal dimension to consider,
since the internal state of the model evolves as time passes. In the following, we
assume that such a state is “frozen” at a fixed timestamp, where no new judgments
can be expressed and no new papers can be added. Figure 1 shows a
representation of the model as a reader-paper matrix (RP) with m rows (i.e., readers) and
n columns (i.e., papers) which can be represented in two ways. In the former
(RPJ), each cell contains the numerical judgment given by reader r to paper p,
while in the latter (RPG) each cell contains a measure of the goodness of the
numerical judgment given by reader r to paper p. In both representations, each
reader (paper) has a related score and steadiness pair, which are represented
by the Sr and σr column (and Sp and σp row) vectors. These are computed
according to the formulas defined by the Readersourcing model [4].
  2.2      The HITS hits TREC methodology
    The output of a TREC-like initiative can be represented as a system-topic
matrix (ST) with m rows (i.e., systems) and n columns (i.e., topics). Each cell
contains an effectiveness measure of each system with respect to each topic ac-
cording to some metric such as Average Precision (AP). Each row is averaged
to compute Mean Average Precision (MAP), which is a measure of system
effectiveness with respect to all topics. Each column is averaged to compute
Average Average Precision (AAP), which is a measure of topic ease.
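As a minimal sketch of these two averages, assuming a small hypothetical ST matrix of AP values (the numbers are illustrative, not from any real TREC run):

```python
import numpy as np

# Hypothetical AP values: 3 systems (rows) x 4 topics (columns).
ST = np.array([
    [0.40, 0.10, 0.70, 0.20],
    [0.50, 0.30, 0.60, 0.40],
    [0.20, 0.20, 0.50, 0.10],
])

MAP = ST.mean(axis=1)  # row means: system effectiveness over all topics
AAP = ST.mean(axis=0)  # column means: topic ease over all systems
```

Here `MAP` has one entry per system and `AAP` one entry per topic, matching the row/column averaging described above.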
    The ST matrix is then normalized in two ways. Let us call AAP and MAP
the AAP column and the MAP row of ST. In the former normalization, each
AP(si , tj ) value is transformed into an APA (si , tj ) value (Normalized AP
according to AAP) by subtracting the corresponding AAP value from it. In the
latter, each AP(si , tj ) value is transformed into an APM (si , tj ) value
(Normalized AP according to MAP) by subtracting the corresponding MAP value
from it. The normalized matrices STA and STM are exploited
to study the interactions between topic ease and system effectiveness. More in
detail, these two matrices can be merged into a single adjacency matrix which
represents a complete weighted bipartite system-topic graph.
     Each link s → t with weight APM between a system s and a topic t of system-
topic matrix (ST) represents how much s “thinks” that t is easy (or “un-easy”,
i.e., difficult, with APM < 0). Each link s ← t with weight APA represents how
much t thinks that s is effective (or “un-effective”, with APA < 0).
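The two normalizations and their merge into a single adjacency matrix can be sketched as follows; the ST values and the node ordering (systems first, then topics) are illustrative assumptions:

```python
import numpy as np

# Hypothetical AP values: 3 systems (rows) x 4 topics (columns).
ST = np.array([
    [0.40, 0.10, 0.70, 0.20],
    [0.50, 0.30, 0.60, 0.40],
    [0.20, 0.20, 0.50, 0.10],
])

ST_A = ST - ST.mean(axis=0)                  # AP_A: subtract AAP (per topic)
ST_M = ST - ST.mean(axis=1, keepdims=True)   # AP_M: subtract MAP (per system)

m, n = ST.shape
# Nodes 0..m-1 are systems, m..m+n-1 are topics.
A = np.zeros((m + n, m + n))
A[:m, m:] = ST_M     # link s -> t with weight AP_M: "s thinks t is easy"
A[m:, :m] = ST_A.T   # link t -> s with weight AP_A: "t thinks s is effective"
```

Negative entries encode the "un-easy"/"un-effective" links described above.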
    Mizzaro and Robertson [6] exploit the complete weighted bipartite graph to
compute hubness and authority values by using an extended version of the HITS
algorithm proposed by Kleinberg [3] which allows negative link weights. As
explained by Mizzaro and Robertson [6], the authority
value of a topic t of the system-topic matrix (ST) represents its easiness; when
considered for a system s, it represents its effectiveness. The hubness value of a
topic t represents its ability to recognize effective systems; when considered for
a system s, it represents its ability to recognize easy topics.
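One plausible power-iteration sketch of such an extension, which tolerates negative link weights by normalizing with the L2 norm, is shown below; the helper name `signed_hits` is ours, and the exact variant used by Mizzaro and Robertson may differ:

```python
import numpy as np

def signed_hits(A, iters=50):
    """Hub/authority scores via power iteration on adjacency matrix A.
    A sketch that tolerates negative weights by normalizing with the
    L2 norm; not necessarily the exact extension of [6]."""
    size = A.shape[0]
    hub = np.ones(size)
    auth = np.ones(size)
    for _ in range(iters):
        auth = A.T @ hub   # authority: weighted in-links from hubs
        hub = A @ auth     # hubness: weighted out-links to authorities
        auth /= np.linalg.norm(auth) or 1.0
        hub /= np.linalg.norm(hub) or 1.0
    return hub, auth

# Tiny example: node 0 points to both authorities, node 1 to only one.
A = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
], dtype=float)
hub, auth = signed_hits(A)
```

In this toy graph node 0 obtains the higher hubness and node 2 the higher authority, as expected.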

    3   HITS Hits Readersourcing
    We intend to study the interactions between reader skill and paper quality
where such quantities are computed by the Readersourcing model proposed by
Mizzaro [5] (Section 2.1) by taking advantage of the methodology proposed by
Mizzaro and Robertson [6] (Section 2.2).
    The starting point is a slightly different version of the RPJ matrix shown in
Figure 1 (left), which is shown in Figure 2 (left). Let us consider the judgment
matrix RPJ* . The only difference with respect to RPJ is that RPJ* has one
additional row and one additional column. The former is called MJp and its values are used to
normalize each column of RPJ* (like AAP in the original methodology), while
the latter is called MJr and its values are used to normalize each row of RPJ*
(like MAP). The goodness matrix RPG* is built similarly. This formalization is
useful since it allows us to analyze different combinations of MJr and MJp (MGr
and MGp ) with judgment (goodness) matrices.
    Once the set of MJr and MJp (MGr and MGp ) have been computed, the
RPJ* matrix shown in Figure 2 (left) and the RPG* one are normalized in
two ways. In the former normalization, each jri ,pj /g(jri ,pj ) value is transformed
into a jari ,pj /ga(jri ,pj ) value (Normalized Judgment/Goodness according to MJp ) by
subtracting MJp (MGp ) from RPJ* (RPG* ). In the latter, each jri ,pj /g(jri ,pj )
value is transformed into a jmri ,pj /gm(jri ,pj ) (Normalized Judgment/Goodness
according to MJr ) value by subtracting MJr (MGr ) from RPJ* (RPG* ).
    The normalized matrices RPJ*A and RPJ*M (RPG*A and RPG*M ) are then
used to build a complete weighted bipartite reader-paper graph. Such a graph
represents relationships between readers and papers which depend on the chosen
set of MJp and MJr and it is used to compute hubness and authority values as
done by Mizzaro and Robertson [6].
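Under the assumption that MJr and MJp are plain row and column means (one of the possible combinations mentioned above), the construction can be sketched as:

```python
import numpy as np

# Hypothetical judgments on a [0, 1] scale: 3 readers x 2 papers.
RPJ = np.array([
    [0.8, 0.2],
    [0.9, 0.4],
    [0.7, 0.3],
])

MJ_r = RPJ.mean(axis=1)          # mean judgment expressed by each reader
MJ_p = RPJ.mean(axis=0)          # mean judgment received by each paper
RPJ_A = RPJ - MJ_p               # normalized according to MJ_p
RPJ_M = RPJ - MJ_r[:, None]      # normalized according to MJ_r

m, n = RPJ.shape
# Adjacency of the complete weighted bipartite reader-paper graph:
A = np.zeros((m + n, m + n))
A[:m, m:] = RPJ_M     # r -> p: how much r "thinks" p attracts high judgments
A[m:, :m] = RPJ_A.T   # p -> r: how much p "thinks" r judges generously
```

The RPG* construction is identical, with g(jri ,pj ) values in place of the judgments.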
[Figure 2 shows three matrices: the RPJ* matrix, i.e., the judgments jri ,pj bordered by the MJr column and the MJp row (left); the matrix of normalized values jari ,pj , in which the MJp row becomes all zeros (middle); and the matrix of normalized values jmri ,pj , in which the MJr column becomes all zeros (right).]

Fig. 2: Reader-paper matrix (RPJ* ) with judgments, MJr , and MJp (left),
judgment-paper matrix normalized according to MJp (RPJ*A ) (middle), and
judgment-paper matrix normalized according to MJr (RPJ*M ) (right).

[Figure 3 shows how the adjacency matrix is assembled: the RP∗M block occupies the reader-rows/paper-columns position, the transpose of RP∗A occupies the paper-rows/reader-columns position, and all remaining blocks are zero.]

Fig. 3: (a) Construction of the adjacency matrix; RP∗A T is the transpose of RP∗A .
(b-c) Relationships between readers and papers of the RP∗ matrix: r → p with
weight M VAL (b), and r ← p with weight A VAL (c).


  4        Experiments
    In our experiments we hypothesize a scenario in which there is a publishing
system; authors submit their papers to such a system and readers are able to
rate the papers. We run some stochastic simulation experiments, in which readers
express stochastic judgments on papers according to some predefined setting, and
we measure the outcome. There are 5,000 readers, 10,000 papers, and 134,000
judgments. We simulate one month of activity. Readers are partitioned into five
groups GRi of equal size. The members of each group rate a certain amount of
papers, as shown in Table 1 (left). Papers are partitioned into five groups GPi
of different size to simulate the internal state of a publishing system.
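A sketch of this reader partition, using the rating amounts of Table 1 (left); note that five equal groups of 1,000 readers rating 2 + 4 + 8 + 30 + 90 papers each yield exactly the 134,000 judgments stated above:

```python
import numpy as np

rng = np.random.default_rng(7)

n_readers = 5000
# Five equally sized reader groups with monthly rating amounts (Table 1, left).
RATINGS_PER_MONTH = {"GR1": 2, "GR2": 4, "GR3": 8, "GR4": 30, "GR5": 90}

# Assign 1,000 readers to each group, in random order.
groups = np.repeat(list(RATINGS_PER_MONTH), n_readers // len(RATINGS_PER_MONTH))
rng.shuffle(groups)
ratings = np.array([RATINGS_PER_MONTH[g] for g in groups])
```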
    For each reader a sample of papers is picked, whose size depends on the
reader's group. Every paper is simulated by a beta distribution defined by two
parameters α and β; its support is the [0, 1] interval. The beta distribution
probability density function can assume five shapes, represented in Figure 4,
depending on the chosen set of α and β parameters. The beta distribution allows
us to represent five different distributions of user behavior across papers, thus
representing five different kinds of papers. Each of the distribution shapes (shown
    Group Frequency    Amount        Group   %   Parameters                              Shape
    GR1   1 x 2 Weeks       2        GP1     5%  (α = 1) ∧ (β = 1)                       flat
    GR2   1 x Week          4        GP2    30%  (α = β) ∧ (α > 1) ∧ (β > 1)             bell-shaped
    GR3   2 x Week          8        GP3    20%  (0 < α < 1) ∧ (0 < β < 1)               U-shaped
    GR4   1 x Day          30        GP4    30%  (α > 1 ∧ β = 1) ∨ (α = 1 ∧ β > 1)       J-shaped
    GR5   3 x Day          90        GP5    15%  (α > 1 ∧ β > 1) ∧ (α ≠ β)               skewed-bell

Table 1: Amount of rated papers for each group of readers in one month (left),
and amount of papers for each group with beta distribution parameters (right).


[Figure 4 plots the probability density functions of the five beta distributions over the [0.0, 1.0] interval: GP1 - flat, GP2 - bell-shaped, GP3 - U-shaped, GP4 - J-shaped, and GP5 - skewed-bell.]

Fig. 4: Beta distributions used to generate the simulation data.


in Figure 4) gives rise to a different simulation of the judgment agreement over
the paper: the flat distribution (GP1 ) simulates a completely random judging
behavior; the bell-shaped distribution (GP2 ) simulates a judgment distribution
centered around a data point in the centre of the judgment scale, simulating
a case of high agreement; the U-shaped distribution (GP3 ) simulates the case
of maximum disagreement, where two polarized behaviors act at the opposite
boundaries of the judgment scale; the J-shaped distribution (GP4 ) and the
skewed-bell distribution (GP5 ) simulate, like the bell-shaped distribution, a
case of high agreement, but near the scale boundaries. The usage of the beta
distribution to capture and mimic different levels of agreement, as well as the
relationship between agreement and scale boundaries, has been formally
discussed in detail by Checco et al. [1]. The beta distributions for each paper are
generated in the following way: the set of all papers is partitioned into five groups
GPi (one for each configuration) where each group contains a fixed percentage of
papers. To each of these papers an instance of the beta distribution is assigned,
whose α and β parameters depend on the paper group. Table 1 (right) shows
such paper groups and parameters. Therefore, the stochastic judgment given to
a paper by a reader is generated by sampling a value from the corresponding
beta distribution.
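The judgment generation can be sketched as follows; the concrete (α, β) pairs are illustrative assumptions that satisfy the constraints of Table 1 (right), not the parameters actually drawn in the simulation:

```python
import numpy as np

rng = np.random.default_rng(42)

# One illustrative (alpha, beta) pair per paper group of Table 1 (right).
GROUP_PARAMS = {
    "GP1": (1.0, 1.0),   # flat: random judging behavior
    "GP2": (5.0, 5.0),   # bell-shaped: high agreement, mid scale
    "GP3": (0.5, 0.5),   # U-shaped: maximum disagreement
    "GP4": (3.0, 1.0),   # J-shaped: agreement at a boundary
    "GP5": (4.0, 2.0),   # skewed-bell: agreement near a boundary
}
GROUP_SHARE = {"GP1": 0.05, "GP2": 0.30, "GP3": 0.20, "GP4": 0.30, "GP5": 0.15}

# Assign each simulated paper to a group, then sample one judgment per read.
n_papers = 1000
paper_group = rng.choice(list(GROUP_SHARE), size=n_papers,
                         p=list(GROUP_SHARE.values()))
judgments = np.array([rng.beta(*GROUP_PARAMS[g]) for g in paper_group])
```

Each sampled value lies in [0, 1] and plays the role of one stochastic judgment.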
    The simulation produces a list of tuples ht, r, p, a, si: at timestamp t reader r
judges paper p written by author a with a score equal to s. Such a list is provided
as input data to an implementation of the Readersourcing model and from its
output the final RPJ (RPG) matrix (Figure 1) is built. For each RPJ (RPG)
matrix the corresponding RPJ* (RPG* ) matrix is built, where each of them is a
judgment (goodness) matrix characterized by a related set of MJp and MJr (MGp
           MJp     MGp     Sp      σp                MJr     MGr     Sr      σr
    MJp            0.12    0.94†  −0.01        MJr           0.07    0.06×  −0.02
    MGp    0.25            0.12⋆  −0.04        MGr   0.11            0.87‡   0.02
    Sp     0.97†   0.25⋆          −0.01        Sr    0.11×   0.98‡           0.03
    σp    −0.01   −0.046  −0.01                σr   −0.01    0.03    0.032

Table 2: Correlations among paper measures (left) and reader measures (right):
Pearson's ρ in the lower triangular part, and Kendall's τ in the upper triangular
part.


and MGr ) values. The RP∗ matrices are then normalized to build adjacency
matrices which are then used to compute hubness and authority values. In the
following section we will discuss the meaning of the resulting relations (i.e., the
links of the complete weighted bipartite graph) and hubness/authority values.
  5     Results
   We now detail the results of our experiments: Section 5.1 focuses on the mea-
sures defined in the Readersourcing model and analyzes the correlations between
them; Section 5.2 discusses the outcome of HITS applied to our simulations.
  5.1    Correlation Between Readersourcing Measures
    The Readersourcing model produces both score and steadiness values for both
readers and papers (i.e., Sr , σr , Sp , and σp ). We also compute: the mean judgment
received by a paper (MJp ), the mean goodness of the judgments received by a
paper (MGp ), the mean judgment expressed by a reader (MJr ), and the mean
goodness of the judgments expressed by a reader (MGr ). Table 2 shows the
correlation values for the paper (left) and reader (right) scores, from which we
can draw several remarks.
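Such correlation pairs can be computed, for instance, with SciPy; the data below are hypothetical stand-ins generated to be tightly coupled, not the actual simulation output:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical stand-ins for two measures compared in Table 2: a paper's
# mean judgment and its score, coupled here up to small noise.
mean_judgment = rng.uniform(0.2, 0.9, size=500)
paper_score = mean_judgment + rng.normal(0.0, 0.02, size=500)

rho, _ = stats.pearsonr(mean_judgment, paper_score)    # lower triangle of Table 2
tau, _ = stats.kendalltau(mean_judgment, paper_score)  # upper triangle of Table 2
```

With such tightly coupled inputs both coefficients come out close to one, mirroring the MJp /Sp cell of Table 2.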
    Let us focus on the correlation between the mean judgment of a paper and the
paper score (i.e., MJp and Sp of the left table, highlighted with † ), and on that
between the mean goodness of a reader and the reader score (i.e., MGr and Sr of
the right table, highlighted with ‡ ). The first correlation highlights some potential
bias in how we generate the simulated data: there is a lack of variance in the
judgments of the readers of a given paper. In other words, for each paper the vast
majority of the readers that rated it show high agreement in their scores. If we
look at Table 1 (right) we see that the beta distributions that induce high
agreement between readers (GP2 , GP4 , and GP5 ) represent 75% of the total
scores. We leave for future work the analysis of a different group distribution in
the statistical simulation. The second correlation strengthens, and is a
consequence of, the previous remark: there is a lack of variance in the quality of
the readers of a given paper; once a reader has expressed a judgment on a paper,
all the other readers of the same paper tend to express judgments of similar quality.
    Conversely, when looking at the dual scenario (i.e., MGp and Sp of the left
table, highlighted with ⋆ , and MJr and Sr of the right table, highlighted with × ),
we see that a sort of dual symmetry is present: neither the mean goodness of a
paper nor the mean judgment of a reader is correlated with, respectively, the
paper and the reader scores. This suggests that: (i) the readers vote using the
whole judgment scale, and (ii) the papers receive judgments that span the whole
judgment scale. This shows that the current simulation setting is able to cover
the whole judgment scale.
    The correlations between all other measures are low and not interesting.
    5.2   HITS Algorithm and Hubness
    In this section we detail the results of the HITS algorithm when run on the
normalized RP∗ matrices, both when considering the judgments (i.e., RPJ* )
and the goodness (i.e., RPG* ). As detailed in previous work [6, 7], the most
interesting index we obtain from running HITS is the hubness of readers and
papers; when we consider the judgment matrix, the hubness of a reader measures
his or her capability to recognize papers that tend to obtain high judgments,
while the hubness of a paper measures the paper's capability to recognize readers
that tend to give high judgments (or, in other words, readers that are biased
towards giving high judgments). Symmetrically, when we consider the goodness
matrix, the hubness of a paper measures its capability to recognize readers that
tend to give judgments of high quality (or, in other words, high-quality readers),
while the hubness of a reader measures the reader's capability to recognize papers
that tend to receive judgments of high quality (i.e., papers that tend to be
judged by high-quality readers).
    We start with the judgments, i.e., the RPJ* matrix. Figure 5 shows some
scatterplots. All the y-axes report the hubness computed by HITS. In the plots
on the left column, the x-axes report the model measures that refer to a paper,
while in the plots on the right column, the x-axes report the model measures
that refer to a reader; thus, in the scatterplots on the left each point is a paper,
while in the scatterplots on the right each point is a reader. Each scatterplot also
shows the respective Pearson’s ρ and Kendall’s τ correlations. The meaning of
the correlations of each plot in the figure can be detailed as follows.

(a) The higher the score of a paper, the higher its capability of recognizing
    readers that tend to give high judgments. This correlation is expected to
    be high due to how the Readersourcing model is formalized; intuitively, if the
    score of a paper is high then the paper will be good at recognizing readers
    that tend to give high judgments.
(b) Since the correlation is really low, and close to zero, whatever the score of a
    reader (high/low, i.e., high-/low-quality reader), he or she has the same capability
    to recognize papers that tend to obtain high (and low) judgments. This is
    a good property of the Readersourcing model: a reader can be of high (or
    low) quality independently of whether he or she expressed judgments on papers
    that have an average judgment that is either high or low. In other words, if
    a reader expresses a high-quality judgment on a paper, his or her score as a
    reader will increase no matter what the judgment score is. Ideally, for a model
    to be completely fair, this correlation value should be exactly zero.
(c) The higher the mean judgment of a paper, the higher its capability of rec-
    ognizing readers that tend to give high judgments. Also in this case, as for
    Figure 5(a), the high correlation value is expected and less interesting. Nev-
    ertheless, since the correlation value is exactly one, it can be also interpreted
[Figure 5 shows six scatterplots with hubness on the y-axis: (a) hubness vs. paper score (ρ = 1.0, τ = 0.94), (b) hubness vs. reader score (ρ = 0.11, τ = 0.06), (c) hubness vs. mean judgment of a paper (ρ = 1.0, τ = 1.0), (d) hubness vs. mean judgment of a reader (ρ = 1.0, τ = 0.98), (e) hubness vs. mean goodness of a paper (ρ = 0.25, τ = 0.12), and (f) hubness vs. mean goodness of a reader (ρ = 0.11, τ = 0.07), all with p < .01; additional panels report correlations close to zero (ρ = −0.01 with p > .05, and ρ = −0.03 with p < .05).]
           0.00012                                                                                                       2.5
 hubness




           0.00010
                                                                                                               hubness




                                                                                                                         2.0
           0.00008                                                                                                       1.5
           0.00006                                                                                                       1.0
           0.00004                                                                               (g)                     0.5                                                                     (h)
                           2.5         5.0         7.5      10.0 12.5        15.0         17.5          20.0             0.0
                                                         paper_steadiness                                                        0            200            400         600             800
                                                                                                                                                          reader_steadiness


Fig. 5: Hubness vs. Sp , MJp , MGp , and σp (left column) and vs. Sr , MJr MGr ,
and σr (right column), computed on RPJ* matrix

    as a bias in how we generate the data: in fact, this plot shows that the
    variance of the judgments expressed by the readers on each paper is on
    average very low, despite the beta distributions we use to generate the
    data. This is also confirmed by the next plot.
(d) The higher the mean judgment expressed by a reader, the higher his/her
    ability to recognize papers that get high scores. As for the previous
    plot, the correlation value is exactly one in this case too. While the
    high correlation of the previous plot is expected, this one is not: a
    high mean judgment by a reader should not necessarily mean that the
    papers s/he judged get on average high scores, since the other readers
    judging the same papers could give lower judgments. This is a further
    indication of a possible bias in how we generate the data. We leave for
    future work the use of more sophisticated statistical methods to generate
    the data, such as vine copulas, which would allow considering both the
    paper and the reader distributions at the same time.
(e) Since the correlation is very low, whatever the goodness of the judgments
    received by a paper (i.e., high or low mean goodness), the paper has the
    same capability to recognize readers that tend to give high judgments.
    This is a good property of the Readersourcing model: a paper can be
    either good or bad (i.e., have a high or low mean goodness) independently
    from having been judged by readers biased towards high or low scores. In
    other words, the model formalization of the goodness measure of a paper
    is robust to a possible reader bias on the judgment scale.
(f) Since the correlation is very low, whatever the mean goodness of the
    judgments expressed by a reader (i.e., high or low mean goodness), s/he
    has the same capability to recognize papers that tend to get high scores.
    As for the previous plot, this correlation too indicates a good model
    property: a reader can be either good or bad (i.e., have a high or low
    mean goodness) independently from having judged papers that get on
    average high or low scores. In other words, the model formalization of
    the goodness measure of a reader is robust to the possible behavior of
    the other readers that express judgments on the same papers.
(g) Since the correlation is zero, whatever the steadiness of a paper, it has the
    same capability to recognize readers that tend to give high judgments. This
    reflects a good property of the model: the formalization of the steadiness
    measure of a paper is robust to the fact that the paper will get scores in the
    upper or lower part of the judgment scale.
(h) As for the previous plot, since the correlation is zero, whatever the
    steadiness of a reader, s/he has the same capability to recognize papers
    that tend to get high scores. Symmetrically to the previous plot, this
    hints that the steadiness measure of a reader is robust to the judgment
    behavior of the other readers that express judgments on the same papers.
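The generation bias discussed in items (c) and (d) can be probed with a few
lines of code. The sketch below is a simplified stand-in for our stochastic
simulation, not the actual generation protocol: it draws each judgment from a
beta distribution centred on a hypothetical per-paper quality (all sizes,
seeds, and parameters are illustrative assumptions) and then measures the
reader-mean vs. judged-paper-mean relation that plot (d) visualizes.

```python
import numpy as np

rng = np.random.default_rng(42)
n_readers, n_papers = 100, 50

# Hypothetical setup: each paper has a Beta-distributed "true quality" in [0, 1].
paper_quality = rng.beta(2, 2, size=n_papers)

# Each reader judges a random subset of papers; each judgment is drawn from a
# Beta distribution centred on the judged paper's quality.
judgments = np.full((n_readers, n_papers), np.nan)
for r in range(n_readers):
    for p in rng.choice(n_papers, size=10, replace=False):
        a = 1 + 8 * paper_quality[p]
        b = 1 + 8 * (1 - paper_quality[p])
        judgments[r, p] = rng.beta(a, b)

# Mean judgment per reader vs. mean score of the papers s/he judged.
mean_by_reader = np.nanmean(judgments, axis=1)
paper_means = np.nanmean(judgments, axis=0)
mean_paper_score_seen = np.array([
    np.nanmean(paper_means[~np.isnan(judgments[r])]) for r in range(n_readers)
])
rho = np.corrcoef(mean_by_reader, mean_paper_score_seen)[0, 1]
print(f"reader-mean vs. judged-paper-mean correlation: {rho:.2f}")
```

With a generator of this kind the two quantities are strongly coupled by
construction, since every judgment is anchored to the same per-paper quality;
this is exactly the sort of artifact that jointly modeling the paper and
reader marginals (e.g., via vine copulas) could avoid.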
    We now turn to discuss Figure 6, which shows the same plots as Figure 5,
but obtained by running the HITS algorithm on the goodness matrix RPG*.
(a) Due to the low correlation, whether a paper has a high or low score, it
    has the same capability to recognize readers that tend to express high
    quality judgments (i.e., judgments with high goodness). This highlights a
    good property of the Readersourcing model: the ability of a paper to
    recognize good readers is independent from the quality of the paper
    itself. In an ideal model, the correlation value of this plot should be
    zero.
(b) The higher the score of a reader, the higher his/her capability to
    recognize papers that tend to get high quality judgments. This high
    correlation highlights a possible bias in how we generate the
    simulations: in fact, if a high quality reader judges a paper, all the
    other readers that judge the same paper will tend to be of high quality.
    As for Figure 5(d), we leave for future work the use of more
    sophisticated models for the statistical generation of judgments.
(c) This plot is the same as Figure 6(a); this tells us two things: the score
    and the mean judgment of a paper are almost perfectly correlated (see the
    0.97 value in Table 2) and, as for Figure 6(a), the ability of a paper to
    recognize good readers is independent from the mean judgment of the
    paper.
(d) Due to the very low correlation, whether a reader has a high or low mean
    judgment, s/he has the same capability to recognize papers that tend to
    get high quality judgments. As for the previous plot, this highlights a
    good property of the model: the ability of a reader to recognize papers
    with high quality judgments is independent from where the reader's
    judgments lie on the judgment scale.
(e) The higher the mean goodness of the judgments received by a paper, the
    higher its capability to recognize readers that tend to express high
    quality judgments. In this case the correlation is exactly one; this is
    expected, and it follows from how the Readersourcing model is defined.
(f) The higher the mean goodness of the judgments expressed by a reader, the
    higher his/her capability to recognize papers that tend to get high
    quality judgments. Like the previous plot, this is a natural consequence
    of how the Readersourcing model is defined.
(g) Due to the low correlation, whether a paper has a high or low steadiness,
    it has the same capability of recognizing readers that tend to express high
    quality judgments.
(h) Due to the correlation close to zero, whether a reader has a high or low
    steadiness, s/he has the same capability to recognize papers that tend to
    get high quality judgments. Also in this case, this highlights a good
    property of the model: the ability of a reader to recognize papers that
    receive high quality judgments is independent from his/her steadiness
    value.
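The hubness scores analyzed throughout this section come from Kleinberg's
HITS algorithm [3], which can be sketched as a simple power iteration over a
weighted bipartite matrix. The implementation below is a generic sketch on a
randomly generated toy matrix, not the code or data used for our experiments;
it assumes a reader-to-paper orientation, whereas the exact graph
construction for Figures 5 and 6 follows the RPJ*/RPG* matrices described
earlier.

```python
import numpy as np

def hits(M, iters=100):
    """Kleinberg's HITS by power iteration on a non-negative weighted
    bipartite matrix M (rows = readers, columns = papers). Returns hub
    scores for the rows and authority scores for the columns, each
    normalized to unit Euclidean norm."""
    hubs = np.ones(M.shape[0])
    auths = np.ones(M.shape[1])
    for _ in range(iters):
        auths = M.T @ hubs             # authority: weighted sum of incoming hub scores
        auths /= np.linalg.norm(auths)
        hubs = M @ auths               # hubness: weighted sum of authorities reached
        hubs /= np.linalg.norm(hubs)
    return hubs, auths

# Toy judgment matrix: entry (r, p) is reader r's judgment of paper p,
# with 0 where the paper was not judged. Purely illustrative data.
rng = np.random.default_rng(0)
judged = rng.random((6, 4)) < 0.7
RPJ_toy = rng.random((6, 4)) * judged
hubs, auths = hits(RPJ_toy)
```

The mutual reinforcement between the two score vectors is what makes the
correlations in Figures 5 and 6 informative: a row's hubness depends on the
authority of the columns it weights, and vice versa.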

  6    Conclusions and Future Work
    We have provided a two-fold contribution: (i) we proposed an experimental
validation of the Readersourcing model carried out through a stochastic
simulation, and (ii) we explored model properties using network analysis
techniques.
    This paper leaves plenty of space for future work, such as the use of
other stochastic models and the analysis of other approaches that propose
alternatives to peer review [2].




Fig. 6: Hubness vs. Sp, MJp, MGp, and σp (left column) and vs. Sr, MJr, MGr,
and σr (right column), computed on the RPG* matrix
                               References


[1] Checco, A., Roitero, K., Maddalena, E., Mizzaro, S., Demartini, G.: Let’s
    agree to disagree: Fixing agreement measures for crowdsourcing. In: 5th
    HCOMP (2017)
[2] De Alfaro, L., Faella, M.: TrueReview: A Platform for Post-Publication Peer
    Review. CoRR (2016), http://arxiv.org/abs/1608.07878
[3] Kleinberg, J.M.: Authoritative sources in a hyperlinked environment. J. ACM
    46(5), 604–632 (Sep 1999), http://doi.acm.org/10.1145/324133.324140
[4] Mizzaro, S.: Quality control in scholarly publishing: A new proposal. JASIST
    54(11), 989–1005 (2003), https://doi.org/10.1002/asi.22668
[5] Mizzaro, S.: Readersourcing - A Manifesto. JASIST 63(8), 1666–1672 (2012),
    https://onlinelibrary.wiley.com/doi/abs/10.1002/asi.22668
[6] Mizzaro, S., Robertson, S.: HITS Hits TREC: Exploring IR Evaluation Re-
    sults with Network Analysis. In: Proceedings of 30th ACM SIGIR. pp. 479–
    486 (2007)
[7] Roitero, K., Maddalena, E., Mizzaro, S.: Do easy topics predict effectiveness
    better than difficult topics? In: ECIR. pp. 605–611. Springer (2017)
[8] Soprano, M., Mizzaro, S.: Crowdsourcing peer review: As we may do. In: Dig-
    ital Libraries: Supporting Open Science. pp. 259–273. Springer International
    Publishing (2019)