<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>O2D2: Out-Of-Distribution Detector to Capture Undecidable Trials in Authorship Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benedikt Boenninghoff</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert M. Nickel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dorothea Kolossa</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bucknell University</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ruhr University Bochum</institution>
          ,
          <addr-line>Germany</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>The PAN 2021 authorship verification (AV) challenge is part of a three-year strategy, moving from a cross-topic/closed-set AV task to a cross-topic/open-set AV task over a collection of fanfiction texts. In this work, we present a novel hybrid neural-probabilistic framework that is designed to tackle the challenges of the 2021 task. Our system is based on our 2020 winning submission, with updates to significantly reduce sensitivities to topical variations and to further improve the system's calibration by means of an uncertainty adaptation layer. Our framework additionally includes an out-of-distribution detector (O2D2) for defining non-responses. Our proposed system outperformed all other systems that participated in the PAN 2021 AV task.</p>
      </abstract>
      <kwd-group>
        <kwd>Authorship Verification</kwd>
        <kwd>Out-Of-Distribution Detection</kwd>
        <kwd>Open-Set</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>[Figure 1: System overview. Blocks: pair of documents; deep metric learning; linguistic embedding vectors (LEVs); Bayes factor scoring; uncertainty modeling; uncertainty adaptation; out-of-distribution detector (O2D2) with non-response decision; final decision. Example values shown: posterior probabilities 0.3/0.7, confusion matrix [0.8 0.1; 0.2 0.9], O2D2 output 0.1/0.9, calibrated posterior probabilities 0.31/0.69, final decision 0.69 or 0.5.]</p>
      <p>[…] the fact that the test dataset, which is not publicly available, only contains trials from a subset
of the authors and fandoms provided in the training data.</p>
      <p>
        To increase the level of difficulty, the current PAN AV challenge moved from a closed-set
task to an open-set task in 2021, while the training dataset is identical to that of the previous
year [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In this scenario, the new test data contains only authors and fandoms that were not
included in the training data. We thus expect a covariate shift between training and testing data,
i.e. the distribution of our neural stylometric representations extracted from the training data is
expected to be different from the distribution of the test data representations. It was implicitly
shown in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and our experiments confirm this analysis, that such a covariate shift, due to topic
variability, is a major cause of errors.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. System Overview</title>
      <p>
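        The overall structure of our revised system<sup>1</sup> is shown in Fig. 1. It expands our winning system
from 2020 as follows: Suppose we have a pair of documents <inline-formula><tex-math>\mathcal{D}_1</tex-math></inline-formula> and <inline-formula><tex-math>\mathcal{D}_2</tex-math></inline-formula> with an associated
ground-truth hypothesis <inline-formula><tex-math>\mathcal{H}_a</tex-math></inline-formula> for <inline-formula><tex-math>a \in \{0, 1\}</tex-math></inline-formula>. The value of <inline-formula><tex-math>a</tex-math></inline-formula> indicates whether the two documents
were written by the same author (<inline-formula><tex-math>a = 1</tex-math></inline-formula>) or by different authors (<inline-formula><tex-math>a = 0</tex-math></inline-formula>). Our task can formally
be expressed as a mapping <inline-formula><tex-math>f: \{\mathcal{D}_1, \mathcal{D}_2\} \rightarrow p \in [0, 1]</tex-math></inline-formula>. The estimated label <inline-formula><tex-math>\hat{a}</tex-math></inline-formula> is obtained from a
threshold test applied to the output prediction <inline-formula><tex-math>p</tex-math></inline-formula>. In our case, we choose <inline-formula><tex-math>\hat{a} = 1</tex-math></inline-formula> if <inline-formula><tex-math>p &gt; 0.5</tex-math></inline-formula> and
<inline-formula><tex-math>\hat{a} = 0</tex-math></inline-formula> if <inline-formula><tex-math>p &lt; 0.5</tex-math></inline-formula>. The PAN 2020/21 shared tasks also permit the return of a non-response (in
addition to <inline-formula><tex-math>\hat{a} = 1</tex-math></inline-formula> and <inline-formula><tex-math>\hat{a} = 0</tex-math></inline-formula>) in cases of high uncertainty [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], e.g. when <inline-formula><tex-math>p</tex-math></inline-formula> is close to 0.5. In this
work, we therefore define three hypotheses:
        <list list-type="simple">
          <list-item><p><inline-formula><tex-math>\mathcal{H}_0</tex-math></inline-formula>: The two documents were written by two different persons,</p></list-item>
          <list-item><p><inline-formula><tex-math>\mathcal{H}_1</tex-math></inline-formula>: The two documents were written by the same person,</p></list-item>
          <list-item><p><inline-formula><tex-math>\mathcal{H}_2</tex-math></inline-formula>: Undecidable, trial does not suffice to establish authorship.</p></list-item>
        </list>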
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we introduced the concept of linguistic embedding vectors (LEVs). To obtain these,
we perform neural feature extraction followed by deep metric learning (DML) to encode the
stylistic characteristics of a pair of documents into a pair of fixed-length and topic-invariant
stylometric representations. Given the LEVs, a Bayes factor scoring (BFS) layer computes the
posterior probability for a trial.<fn id="fn1"><label>1</label><p>The source code is accessible online: https://github.com/boenninghoff/pan_2020_2021_authorship_verification</p></fn> This discriminative two-covariance model was introduced in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
As a new component, we propose an uncertainty adaptation layer (UAL). This idea is adopted
from [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], aiming to find and correct wrongly classified trials of the BFS layer, to model its noise
behavior, and to return re-calibrated posteriors.
      </p>
      <p>
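        For the decision whether to accept <inline-formula><tex-math>\mathcal{H}_0</tex-math></inline-formula>/<inline-formula><tex-math>\mathcal{H}_1</tex-math></inline-formula>, or to return a non-response, i.e. <inline-formula><tex-math>\mathcal{H}_2</tex-math></inline-formula>, it is desirable
that the value of the posterior <inline-formula><tex-math>p</tex-math></inline-formula> reliably reflects the uncertainty of the decision-making process.
We may roughly distinguish two different types of uncertainty [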
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]: In AV, aleatoric or data
uncertainty is associated with properties of the document pairs. Examples are topical variations
or the intra- and inter-author variabilities. Aleatoric uncertainty generally cannot be reduced,
but it can be addressed (to a certain extent) by returning a non-response (i.e. hypothesis ℋ2)
if it is too large to allow for a reliable decision. To accomplish this, and inspired by [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], we
incorporate a feed-forward network for out-of-distribution detection (O2D2), which is trained
on a dataset that is different, i.e. disjoint w.r.t. authors and fandoms, from the training set used
to optimize the DML, BFS and UAL components.
      </p>
      <p>
        Additionally, epistemic or model uncertainty characterizes uncertainty in the model
parameters. Examples are unseen authors or topics. Epistemic uncertainty can be reduced through a
substantial increase in the amount of training data, i.e. an increase in the number of training
pairs. We capture epistemic uncertainty in our work through the proposed O2D2 approach
and also by extending our model to an ensemble. We expect all models to behave similarly for
known authors or topics, but the output predictions may be widely dispersed for pairs under
covariate shift [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>The training procedure consists of two stages: In the first stage, we simultaneously train the
DML, BFS and UAL components. In the second stage, we learn the parameters of the O2D2
model.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset Splits for the PAN 2021 AV Task</title>
      <p>
        The text preprocessing strategies, including tokenization and pair re-sampling, are
comprehensively described in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The fanfiction datasets for the PAN 2020/21 AV tasks are described
in [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. In the following, we report on the various dataset splits that we employed for our PAN
2021 submission.
      </p>
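      <p>Each document pair is characterized by a tuple <inline-formula><tex-math>(a, f)</tex-math></inline-formula>, where <inline-formula><tex-math>a \in \{0, 1\}</tex-math></inline-formula> denotes the authorship
similarity label and <inline-formula><tex-math>f \in \{0, 1\}</tex-math></inline-formula> describes the equivalent for the fandom. We assign each
document pair to one of the following author-fandom subsets<sup>2</sup> SA_SF, SA_DF, DA_SF, and
DA_DF given its label tuple <inline-formula><tex-math>(a, f)</tex-math></inline-formula>.</p>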
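      <p>The subset assignment can be illustrated with a minimal Python sketch. The function name author_fandom_subset is hypothetical, and the sketch assumes that a label value of 1 means "same" and 0 means "different" for both the author and the fandom label; it is not taken from the released code.</p>
      <preformat>
```python
def author_fandom_subset(a: int, f: int) -> str:
    """Map a label tuple (a, f) to one of SA_SF, SA_DF, DA_SF, DA_DF.

    Assumption: 1 means "same", 0 means "different" for both labels.
    """
    if a not in (0, 1) or f not in (0, 1):
        raise ValueError("a and f must each be 0 or 1")
    author = "SA" if a == 1 else "DA"
    fandom = "SF" if f == 1 else "DF"
    return f"{author}_{fandom}"


# Enumerate all four possible label tuples.
pairs = [(1, 1), (1, 0), (0, 1), (0, 0)]
subsets = [author_fandom_subset(a, f) for a, f in pairs]
print(subsets)
```
      </preformat>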
      <p>
        As shown in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], one of the difficulties of working with the provided small/large PAN datasets
is that each author generally contributes only a small number of documents. As a result,
we observe a high degree of overlap in the re-sampled subsets of same-author trials. We
decided to work only with the large dataset this year and split the documents into three disjoint
(w.r.t. authorship and fandom) sets. Overlapping documents, where author and fandom belong
to different sets, are removed. The splits are summarized in Fig. 2 and Table 1. Altogether,
the following datasets have been involved in the PAN 2021 shared task, to train the model
components, to tune the hyper-parameters and for testing:<fn id="fn2"><label>2</label><p>SA = same author, DA = different authors, SF = same fandom, DF = different fandoms.</p></fn>
      </p>
      <p>[Figure 2: Disjoint dataset splits. Recoverable labels: 49,654 docs, 39,547 authors, 200 fandoms; 46,373 docs, 37,883 authors, 200 fandoms.]</p>
      <list list-type="bullet">
        <list-item>
          <p>The training set is identical to the one used in [<xref ref-type="bibr" rid="ref11">11</xref>] and was employed for the first
stage, i.e., to train the DML, BFS and UAL components simultaneously. During training
we re-sampled the pairs epoch-wise such that all documents contribute equally to the
neural network training in each epoch. The numbers of training pairs provided in Table 1
therefore vary in each epoch.</p>
        </list-item>
        <list-item>
          <p>The calibration set has been used for the second stage, i.e., to train (calibrate) the O2D2
model. During training, we again re-sampled the pairs in each epoch and limited the total
number of pairs in the different-authors subsets to partly balance the dataset.</p>
        </list-item>
        <list-item>
          <p>The purpose of the validation set is to tune the hyper-parameters of the O2D2 stage
and to report the final evaluation metrics for all stages in Section 5.</p>
        </list-item>
        <list-item>
          <p>The development set is identical to the evaluation set in [<xref ref-type="bibr" rid="ref11">11</xref>] and was used to tune the
hyper-parameters during the training of the first stage. This dataset contains pairs from
the calibration and validation sets. However, due to the pair re-sampling strategy in [<xref ref-type="bibr" rid="ref11">11</xref>],
documents may appear in different subsets and varied document pairs may be sampled.
It thus does not represent a union of the calibration and validation sets.</p>
        </list-item>
        <list-item>
          <p>Finally, the PAN 2021 evaluation set, which is not publicly available, has been used
to test our submission and to compare it with the proposed frameworks of all other
participants.</p>
        </list-item>
      </list>
      <p>Note that both the validation and the development sets in Table 1 contain only SA_DF and DA_SF
pairs, for reasons discussed in Section 5. The pairs of these sets are sampled once and then kept
fixed.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodologies</title>
      <p>
        In this section, we briefly describe all components of our neural-probabilistic model. Sections 4.1
through 4.4 repeat information that is already provided in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] to provide proper context.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Neural Feature Extraction and Deep Metric Learning</title>
        <p>Feature extraction and deep metric learning are realized in the form of a Siamese network,
feeding both input documents through exactly the same function.</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Neural Feature Extraction:</title>
          <p>The system passes token and character embeddings into a two-tiered bidirectional LSTM
network with attentions,</p>
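          <p>
            <disp-formula id="eq-1"><label>(1)</label><tex-math>\boldsymbol{x}_i = \mathrm{NeuralFeatureExtraction}_{\boldsymbol{\theta}}\bigl(\boldsymbol{E}^{w}_i, \boldsymbol{E}^{c}_i\bigr), \quad i \in \{1, 2\},</tex-math></disp-formula>
where <inline-formula><tex-math>\boldsymbol{\theta}</tex-math></inline-formula> contains all trainable parameters, <inline-formula><tex-math>\boldsymbol{E}^{w}_i</tex-math></inline-formula> represents word embeddings and <inline-formula><tex-math>\boldsymbol{E}^{c}_i</tex-math></inline-formula> represents
character embeddings. A comprehensive description is given in [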
            <xref ref-type="bibr" rid="ref12">12</xref>
            ].
          </p>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Deep Metric Learning:</title>
          <p>
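            We feed the document embeddings <inline-formula><tex-math>\boldsymbol{x}_i</tex-math></inline-formula> in Eq. (1) into a metric learning layer, <inline-formula><tex-math>\boldsymbol{y}_i = \tanh\bigl(\boldsymbol{W}^{\mathrm{DML}} \boldsymbol{x}_i + \boldsymbol{b}^{\mathrm{DML}}\bigr)</tex-math></inline-formula>,
which yields the two LEVs <inline-formula><tex-math>\boldsymbol{y}_1</tex-math></inline-formula> and <inline-formula><tex-math>\boldsymbol{y}_2</tex-math></inline-formula> via the trainable parameters <inline-formula><tex-math>\boldsymbol{\psi} = \{\boldsymbol{W}^{\mathrm{DML}}, \boldsymbol{b}^{\mathrm{DML}}\}</tex-math></inline-formula>.
We then compute the squared Euclidean distance between both LEVs, <inline-formula><tex-math>d(\boldsymbol{y}_1, \boldsymbol{y}_2) = \lVert \boldsymbol{y}_1 - \boldsymbol{y}_2 \rVert_2^2</tex-math></inline-formula>. In [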
            <xref ref-type="bibr" rid="ref11">11</xref>
            ],
we introduced a new probabilistic version of the contrastive loss: Given the Euclidean distance
of the LEVs, we apply a kernel function
          </p>
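          <p><disp-formula id="eq-2"><label>(2)</label><tex-math>p_{\mathrm{DML}}(\mathcal{H}_1 | \boldsymbol{y}_1, \boldsymbol{y}_2) = \exp\bigl(-\gamma\, d(\boldsymbol{y}_1, \boldsymbol{y}_2)^{\alpha}\bigr),</tex-math></disp-formula>
where <inline-formula><tex-math>\gamma</tex-math></inline-formula> and <inline-formula><tex-math>\alpha</tex-math></inline-formula> can be seen as either hyper-parameters or trainable variables. The loss is then
given by
<disp-formula id="eq-3"><label>(3)</label><tex-math>\mathcal{L}^{\mathrm{DML}}_{\boldsymbol{\theta}, \boldsymbol{\psi}} = a \cdot \max\bigl\{\tau_s - p_{\mathrm{DML}}(\mathcal{H}_1 | \boldsymbol{y}_1, \boldsymbol{y}_2), 0\bigr\}^2 + (1 - a) \cdot \max\bigl\{p_{\mathrm{DML}}(\mathcal{H}_1 | \boldsymbol{y}_1, \boldsymbol{y}_2) - \tau_d, 0\bigr\}^2,</tex-math></disp-formula>
where we set <inline-formula><tex-math>\tau_s = 0.91</tex-math></inline-formula> and <inline-formula><tex-math>\tau_d = 0.09</tex-math></inline-formula>.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Deep Bayes Factor Scoring</title>
        <p>
          We assume that the LEVs stem from a Gaussian generative model that can be decomposed as
<inline-formula><tex-math>\boldsymbol{y} = \boldsymbol{s} + \boldsymbol{n}</tex-math></inline-formula>, where <inline-formula><tex-math>\boldsymbol{n}</tex-math></inline-formula> characterizes a noise term. We assume that the writing characteristics
of the author lie in a latent stylistic variable <inline-formula><tex-math>\boldsymbol{s}</tex-math></inline-formula>. The probability density functions for <inline-formula><tex-math>\boldsymbol{s}</tex-math></inline-formula> and
<inline-formula><tex-math>\boldsymbol{n}</tex-math></inline-formula> are modeled as Gaussian distributions. We outlined in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] how to compute the likelihoods
for both hypotheses. The verification score for a trial is then given by the log-likelihood ratio:
<inline-formula><tex-math>\mathrm{score}(\boldsymbol{y}_1, \boldsymbol{y}_2) = \log p(\boldsymbol{y}_1, \boldsymbol{y}_2 | \mathcal{H}_1) - \log p(\boldsymbol{y}_1, \boldsymbol{y}_2 | \mathcal{H}_0)</tex-math></inline-formula>. Assuming <inline-formula><tex-math>p(\mathcal{H}_1) = p(\mathcal{H}_0) = \tfrac{1}{2}</tex-math></inline-formula>, the
probability for a same-author trial is calculated as [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]:
<disp-formula id="eq-4"><label>(4)</label><tex-math>p_{\mathrm{BFS}}(\mathcal{H}_1 | \boldsymbol{y}_1, \boldsymbol{y}_2) = \frac{p(\boldsymbol{y}_1, \boldsymbol{y}_2 | \mathcal{H}_1)}{p(\boldsymbol{y}_1, \boldsymbol{y}_2 | \mathcal{H}_1) + p(\boldsymbol{y}_1, \boldsymbol{y}_2 | \mathcal{H}_0)} = \mathrm{Sigmoid}\bigl(\mathrm{score}(\boldsymbol{y}_1, \boldsymbol{y}_2)\bigr).</tex-math></disp-formula>
        </p>
        <p>
          We reduce the dimension of the LEVs via <inline-formula><tex-math>\boldsymbol{y}^{\mathrm{BFS}}_i = \tanh\bigl(\boldsymbol{W}^{\mathrm{BFS}} \boldsymbol{y}_i + \boldsymbol{b}^{\mathrm{BFS}}\bigr)</tex-math></inline-formula> to ensure numerically
stable inversions of the matrices [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. We rewrite Eq. (4) as
<disp-formula id="eq-5"><label>(5)</label><tex-math>p_{\mathrm{BFS}}(\mathcal{H}_1 | \boldsymbol{y}_1, \boldsymbol{y}_2) = \mathrm{Sigmoid}\bigl(\mathrm{score}(\boldsymbol{y}^{\mathrm{BFS}}_1, \boldsymbol{y}^{\mathrm{BFS}}_2)\bigr)</tex-math></disp-formula>
and incorporate Eq. (5) into the binary cross entropy,
<disp-formula id="eq-6"><label>(6)</label><tex-math>\mathcal{L}^{\mathrm{BFS}}_{\boldsymbol{\phi}} = a \cdot \log\bigl\{p_{\mathrm{BFS}}(\mathcal{H}_1 | \boldsymbol{y}_1, \boldsymbol{y}_2)\bigr\} + (1 - a) \cdot \log\bigl\{1 - p_{\mathrm{BFS}}(\mathcal{H}_1 | \boldsymbol{y}_1, \boldsymbol{y}_2)\bigr\},</tex-math></disp-formula>
where all trainable parameters are denoted with <inline-formula><tex-math>\boldsymbol{\phi} = \{\boldsymbol{W}^{\mathrm{BFS}}, \boldsymbol{b}^{\mathrm{BFS}}, \boldsymbol{W}, \boldsymbol{B}, \boldsymbol{\mu}\}</tex-math></inline-formula>.
        </p>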
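        <p>As a quick numerical check of Eq. (4), the following plain-Python sketch confirms that, under equal priors, normalizing the two likelihoods gives the same posterior as applying the sigmoid to the log-likelihood ratio. The log-likelihood values and the function names are hypothetical and chosen only for illustration; the actual likelihoods come from the two-covariance model.</p>
        <preformat>
```python
import math


def sigmoid(x: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def bfs_posterior(log_lik_h1: float, log_lik_h0: float) -> float:
    """Posterior of the same-author hypothesis from two log-likelihoods,
    assuming equal priors p(H1) = p(H0) = 1/2."""
    score = log_lik_h1 - log_lik_h0  # log-likelihood ratio
    return sigmoid(score)


# Hypothetical log-likelihoods of one trial under H1 and H0.
log_p1, log_p0 = -12.0, -13.5

# Direct normalization of the two likelihoods ...
direct = math.exp(log_p1) / (math.exp(log_p1) + math.exp(log_p0))
# ... agrees with the sigmoid of the score.
via_score = bfs_posterior(log_p1, log_p0)
print(direct, via_score)
```
        </preformat>
        <p>Working with the score rather than with the raw likelihoods avoids exponentiating very small numbers, which is one practical reason for the sigmoid form.</p>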
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Uncertainty Modeling and Adaptation</title>
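        <p>Now, we treat the posteriors of the BFS component as noisy outcomes and rewrite Eq. (5) as
<inline-formula><tex-math>p_{\mathrm{BFS}}(\widehat{\mathcal{H}}_1 | \boldsymbol{y}_1, \boldsymbol{y}_2)</tex-math></inline-formula> to emphasize that this represents an estimated posterior. We first have to
find a single representation for both LEVs, which is done by <inline-formula><tex-math>\boldsymbol{y}^{\mathrm{UAL}} = \tanh\bigl(\boldsymbol{W}^{\mathrm{UAL}} (\boldsymbol{y}_1 - \boldsymbol{y}_2)^{\circ 2} + \boldsymbol{b}^{\mathrm{UAL}}\bigr)</tex-math></inline-formula>,
where <inline-formula><tex-math>(\cdot)^{\circ 2}</tex-math></inline-formula> denotes the element-wise square. Next, we compute a 2 × 2 confusion matrix as follows:
<disp-formula id="eq-7"><label>(7)</label><tex-math>p(\mathcal{H}_j | \widehat{\mathcal{H}}_i, \boldsymbol{y}_1, \boldsymbol{y}_2) = \frac{\exp\bigl(\boldsymbol{w}_{ji}^{T} \boldsymbol{y}^{\mathrm{UAL}} + b_{ji}\bigr)}{\sum_{j' \in \{0,1\}} \exp\bigl(\boldsymbol{w}_{j'i}^{T} \boldsymbol{y}^{\mathrm{UAL}} + b_{j'i}\bigr)}</tex-math></disp-formula>
for <inline-formula><tex-math>i, j \in \{0, 1\}</tex-math></inline-formula>. The term <inline-formula><tex-math>p(\mathcal{H}_j | \widehat{\mathcal{H}}_i, \boldsymbol{y}_1, \boldsymbol{y}_2)</tex-math></inline-formula> defines the conditional probability of the true hypothesis
<inline-formula><tex-math>\mathcal{H}_j</tex-math></inline-formula> given the hypothesis <inline-formula><tex-math>\widehat{\mathcal{H}}_i</tex-math></inline-formula> assigned by the BFS. We can then define the final output predictions as:
<disp-formula id="eq-8"><label>(8)</label><tex-math>p_{\mathrm{UAL}}(\mathcal{H}_j | \boldsymbol{y}_1, \boldsymbol{y}_2) = \sum_{i \in \{0,1\}} p(\mathcal{H}_j | \widehat{\mathcal{H}}_i, \boldsymbol{y}_1, \boldsymbol{y}_2) \cdot p_{\mathrm{BFS}}(\widehat{\mathcal{H}}_i | \boldsymbol{y}_1, \boldsymbol{y}_2).</tex-math></disp-formula></p>
        <p>
          The loss consists of two terms, the negative log-likelihood of the ground-truth hypothesis and a
regularization term,
<disp-formula id="eq-9"><label>(9)</label><tex-math>\mathcal{L}^{\mathrm{UAL}}_{\boldsymbol{\lambda}} = -\log p_{\mathrm{UAL}}(\mathcal{H}_a | \boldsymbol{y}_1, \boldsymbol{y}_2) + \beta \sum_{i \in \{0,1\}} \sum_{j \in \{0,1\}} p(\mathcal{H}_j | \widehat{\mathcal{H}}_i, \boldsymbol{y}_1, \boldsymbol{y}_2) \cdot \log p(\mathcal{H}_j | \widehat{\mathcal{H}}_i, \boldsymbol{y}_1, \boldsymbol{y}_2),</tex-math></disp-formula>
with trainable parameters denoted by <inline-formula><tex-math>\boldsymbol{\lambda} = \{\boldsymbol{W}^{\mathrm{UAL}}, \boldsymbol{b}^{\mathrm{UAL}}, \boldsymbol{w}_{ji}, b_{ji} \,|\, j, i \in \{0, 1\}\}</tex-math></inline-formula>. The
regularization term, controlled by <inline-formula><tex-math>\beta</tex-math></inline-formula>, follows the maximum entropy principle to penalize the confusion
matrix for returning over-confident posteriors [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>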
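        <p>The adaptation step, which mixes the BFS posteriors through the confusion matrix, can be traced with a small Python sketch. The numbers are the example values that appear in the system overview of Fig. 1 (BFS posteriors 0.3/0.7 and a confusion matrix with columns 0.8/0.2 and 0.1/0.9); the function name adapt_posteriors and the row/column layout of the matrix are assumptions made for illustration.</p>
        <preformat>
```python
from typing import List


def adapt_posteriors(confusion: List[List[float]],
                     bfs_posteriors: List[float]) -> List[float]:
    """Re-calibrate BFS posteriors with a 2 x 2 confusion matrix.

    Assumed layout: confusion[j][i] = p(H_j | H_hat_i), i.e. each column
    sums to one; bfs_posteriors[i] = p_BFS(H_hat_i).
    """
    return [
        sum(confusion[j][i] * bfs_posteriors[i] for i in range(2))
        for j in range(2)
    ]


# Example values as shown in the system overview figure.
confusion = [[0.8, 0.1],
             [0.2, 0.9]]
bfs = [0.3, 0.7]

p_ual = adapt_posteriors(confusion, bfs)
print([round(p, 2) for p in p_ual])  # [0.31, 0.69]
```
        </preformat>
        <p>Because every column of the confusion matrix sums to one, the adapted posteriors remain a valid probability distribution over the two hypotheses.</p>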
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Combined Loss Function:</title>
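        <p>All components are optimized independently w.r.t. the following combined loss:
<disp-formula id="eq-10"><label>(10)</label><tex-math>\mathcal{L}_{\boldsymbol{\theta}, \boldsymbol{\psi}, \boldsymbol{\phi}, \boldsymbol{\lambda}} = \mathcal{L}^{\mathrm{DML}}_{\boldsymbol{\theta}, \boldsymbol{\psi}} + \mathcal{L}^{\mathrm{BFS}}_{\boldsymbol{\phi}} + \mathcal{L}^{\mathrm{UAL}}_{\boldsymbol{\lambda}}.</tex-math></disp-formula></p>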
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Out-of-Distribution Detector (O2D2)</title>
        <p>
          Following [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], we incorporate a second neural network to detect undecidable trials. We treat
the training procedure as a binary verification task. Given the learned DML, BFS and UAL
components, the estimated authorship labels are obtained via
        </p>
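        <p><disp-formula id="eq-11"><label>(11)</label><tex-math>\hat{a} = \arg\max \bigl[ p_{\mathrm{UAL}}(\mathcal{H}_0 | \boldsymbol{y}_1, \boldsymbol{y}_2),\; p_{\mathrm{UAL}}(\mathcal{H}_1 | \boldsymbol{y}_1, \boldsymbol{y}_2) \bigr].</tex-math></disp-formula></p>
        <p>Now, we can define the binary O2D2 labels as follows:
<disp-formula id="eq-12"><label>(12)</label><tex-math>l_{\mathrm{O2D2}} = \begin{cases} 1, &amp; \text{if } a \neq \hat{a} \text{ or } 0.5 - \varepsilon \leq p_{\mathrm{UAL}}(\mathcal{H}_1 | \boldsymbol{y}_1, \boldsymbol{y}_2) \leq 0.5 + \varepsilon, \\ 0, &amp; \text{otherwise.} \end{cases}</tex-math></disp-formula>
The model-dependent hyper-parameter <inline-formula><tex-math>\varepsilon \in [0.05, 0.15]</tex-math></inline-formula> is optimized on the validation set w.r.t.
the PAN 2021 metrics. The input of O2D2, denoted as <inline-formula><tex-math>\boldsymbol{y}_{\mathrm{O2D2}}</tex-math></inline-formula>, is a concatenated vector of the LEVs,
i.e. <inline-formula><tex-math>(\boldsymbol{y}_1 - \boldsymbol{y}_2)^{\circ 2}</tex-math></inline-formula> and <inline-formula><tex-math>(\boldsymbol{y}_1 + \boldsymbol{y}_2)^{\circ 2}</tex-math></inline-formula>, and the confusion matrix. This vector is fed into a three-layer
feed-forward network, which yields the prediction <inline-formula><tex-math>p_{\mathrm{O2D2}}(\mathcal{H}_2 | \boldsymbol{y}_1, \boldsymbol{y}_2)</tex-math></inline-formula>.</p>
        <p>All trainable parameters are summarized in <inline-formula><tex-math>\boldsymbol{\Gamma} = \{\boldsymbol{W}^{\mathrm{O2D2}}_i, \boldsymbol{b}^{\mathrm{O2D2}}_i \,|\, i \in \{1, 2, 3\}\}</tex-math></inline-formula>. The obtained
prediction for hypothesis <inline-formula><tex-math>\mathcal{H}_2</tex-math></inline-formula> is inserted into the cross-entropy loss,
<disp-formula id="eq-14"><label>(14)</label><tex-math>\mathcal{L}^{\mathrm{O2D2}}_{\boldsymbol{\Gamma}} = l_{\mathrm{O2D2}} \cdot \log\bigl\{p_{\mathrm{O2D2}}(\mathcal{H}_2 | \boldsymbol{y}_1, \boldsymbol{y}_2)\bigr\} + (1 - l_{\mathrm{O2D2}}) \cdot \log\bigl\{1 - p_{\mathrm{O2D2}}(\mathcal{H}_2 | \boldsymbol{y}_1, \boldsymbol{y}_2)\bigr\}.</tex-math></disp-formula></p>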
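        <p>The labeling rule for the O2D2 training targets can be sketched in a few lines of Python. The function name o2d2_label and the default margin of 0.1 are hypothetical; the text only states that the margin is tuned within [0.05, 0.15] on the validation set. Ties at exactly 0.5 are resolved towards the different-authors label here, which is likewise an assumption.</p>
        <preformat>
```python
def o2d2_label(a: int, p_ual_h1: float, eps: float = 0.1) -> int:
    """Binary O2D2 target: 1 marks a trial as undecidable, 0 as decidable.

    a        -- ground-truth authorship label (1 = same author)
    p_ual_h1 -- calibrated posterior of the same-author hypothesis
    eps      -- half-width of the uncertainty band around 0.5 (assumed value)
    """
    # Arg max over the two posteriors p(H0) = 1 - p(H1) and p(H1).
    a_hat = 1 if p_ual_h1 > 0.5 else 0
    misclassified = a != a_hat
    near_boundary = p_ual_h1 >= 0.5 - eps and 0.5 + eps >= p_ual_h1
    return 1 if (misclassified or near_boundary) else 0


# A confident and correct trial is decidable ...
print(o2d2_label(a=1, p_ual_h1=0.9))
# ... a confident but wrong trial is labeled undecidable ...
print(o2d2_label(a=0, p_ual_h1=0.9))
# ... and so is a correct trial that falls inside the uncertainty band.
print(o2d2_label(a=1, p_ual_h1=0.55))
```
        </preformat>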
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Ensemble Inference</title>
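        <p>As a last step, an ensemble is constructed from <inline-formula><tex-math>m</tex-math></inline-formula> trained models, <inline-formula><tex-math>\mathcal{M}_1, \ldots, \mathcal{M}_m</tex-math></inline-formula>, with <inline-formula><tex-math>m</tex-math></inline-formula> being
an odd number. Since all models are randomly initialized and trained on different re-sampled
pairs in each epoch, we expect to obtain a slightly different set of weights/biases, which in turn
produces different posteriors, especially for pairs under covariate shift. We propose a majority
voting for the non-responses. More precisely, the ensemble returns a non-response if
<disp-formula id="eq-15"><label>(15)</label><tex-math>\sum_{i=1}^{m} \mathbb{1}\bigl[ p_{\mathrm{O2D2}}(\mathcal{H}_2 | \boldsymbol{y}_1, \boldsymbol{y}_2, \mathcal{M}_i) \geq 0.5 \bigr] &gt; \Bigl\lfloor \frac{m}{2} \Bigr\rfloor,</tex-math></disp-formula>
where <inline-formula><tex-math>\mathbb{1}[\cdot]</tex-math></inline-formula> denotes the indicator function. Otherwise, we define a subset of confident models,
<inline-formula><tex-math>\mathcal{M}_c = \{\mathcal{M}_i \,|\, p_{\mathrm{O2D2}}(\mathcal{H}_2 | \boldsymbol{y}_1, \boldsymbol{y}_2, \mathcal{M}_i) &lt; 0.5\}</tex-math></inline-formula>, and return the averaged posteriors of its elements,
<disp-formula id="eq-16"><label>(16)</label><tex-math>\mathbb{E}\bigl[ p_{\mathrm{UAL}}(\mathcal{H}_1 | \boldsymbol{y}_1, \boldsymbol{y}_2) \bigr] = \frac{1}{|\mathcal{M}_c|} \sum_{\mathcal{M} \in \mathcal{M}_c} p_{\mathrm{UAL}}(\mathcal{H}_1 | \boldsymbol{y}_1, \boldsymbol{y}_2, \mathcal{M}).</tex-math></disp-formula>
Our submitted system consisted of an ensemble with <inline-formula><tex-math>m = 21</tex-math></inline-formula> trained models.</p>
        <p>[Figure 3: Epoch-wise accuracies during training. Curves: Authorship Verification, Fandom Verification.]</p>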
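        <p>The two-step ensemble rule (majority vote on non-responses, then averaging over the confident models) can be sketched as follows. The function name ensemble_predict and the list-based interface are hypothetical, and the example uses three models with made-up outputs instead of the 21 models of the submission. A non-response is encoded as the value 0.5, in line with the 0.5-values of non-responses mentioned in Section 5.2.</p>
        <preformat>
```python
from typing import List


def ensemble_predict(p_o2d2: List[float], p_ual: List[float]) -> float:
    """Combine per-model outputs into one ensemble prediction.

    p_o2d2[i] -- model i's probability that the trial is undecidable
    p_ual[i]  -- model i's calibrated same-author posterior
    Returns 0.5 for a non-response, else the mean posterior of the
    confident models.
    """
    m = len(p_o2d2)
    if m % 2 == 0 or m != len(p_ual):
        raise ValueError("need an odd number of models and matching lists")

    # Majority vote: count models that flag the trial as undecidable.
    votes = sum(1 for p in p_o2d2 if p >= 0.5)
    if votes > m // 2:
        return 0.5  # non-response

    # Average only over models that consider the trial decidable.
    confident = [q for p, q in zip(p_o2d2, p_ual) if 0.5 > p]
    return sum(confident) / len(confident)


# One of three models abstains: average the remaining two posteriors.
answered = ensemble_predict([0.2, 0.7, 0.1], [0.8, 0.55, 0.9])
# Two of three models abstain: the ensemble returns a non-response.
abstained = ensemble_predict([0.9, 0.6, 0.1], [0.8, 0.55, 0.9])
print(answered, abstained)
```
        </preformat>
        <p>With an odd number of models the vote cannot tie, so the set of confident models is never empty when an answer is returned.</p>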
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>
        The PAN evaluation metrics and procedure are described in [
        <xref ref-type="bibr" rid="ref14 ref4 ref5">4, 5, 14</xref>
        ]. To capture the calibration
capacity, we also provide the accuracy (acc), confidence score (conf), expected calibration error
(ECE) and maximum calibration error (MCE) [
        <xref ref-type="bibr" rid="ref15">15</xref>
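        ]. All confidence values lie within the interval
[0.5, 1], since we are solving a binary classification task. Hence, to obtain confidence scores, the
posterior values are transformed w.r.t. the estimated authorship label, showing <inline-formula><tex-math>p(\mathcal{H}_1 | \boldsymbol{y}_1, \boldsymbol{y}_2)</tex-math></inline-formula> if
<inline-formula><tex-math>\hat{a} = 1</tex-math></inline-formula> and <inline-formula><tex-math>1 - p(\mathcal{H}_1 | \boldsymbol{y}_1, \boldsymbol{y}_2)</tex-math></inline-formula> if <inline-formula><tex-math>\hat{a} = 0</tex-math></inline-formula>. For both metrics, the confidence interval is discretized into
a fixed number of bins. The ECE then reflects the average absolute error between confidence
and accuracy of all bins, while the MCE returns the maximum absolute error. For acc and conf,
we perform weighted macro-averaging w.r.t. the number of trials in each bin.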
      </p>
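      <p>A minimal Python sketch of the two calibration errors is given below. It assumes equal-width bins over [0.5, 1] and weights each bin by its share of trials when computing the ECE, which is the common definition; the function name calibration_errors, the number of bins, and the toy data are hypothetical and not taken from the evaluation code.</p>
      <preformat>
```python
from typing import List, Tuple


def calibration_errors(posteriors: List[float], labels: List[int],
                       n_bins: int = 5) -> Tuple[float, float]:
    """Return (ECE, MCE) for same-author posteriors and binary labels."""
    bins: List[List[Tuple[float, float]]] = [[] for _ in range(n_bins)]
    for p, a in zip(posteriors, labels):
        a_hat = 1 if p > 0.5 else 0
        conf = p if a_hat == 1 else 1.0 - p      # confidence in [0.5, 1]
        correct = 1.0 if a_hat == a else 0.0
        # Equal-width bins over [0.5, 1]; conf = 1.0 goes to the last bin.
        idx = min(int((conf - 0.5) / 0.5 * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))

    n = len(posteriors)
    ece, mce = 0.0, 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        acc = sum(k for _, k in b) / len(b)
        gap = abs(acc - mean_conf)
        ece += len(b) / n * gap                   # weighted by bin size
        mce = max(mce, gap)                       # worst bin
    return ece, mce


# Toy data: four trials with confidence 0.9, three of them correct.
ece, mce = calibration_errors([0.9, 0.9, 0.1, 0.1], [1, 1, 0, 1])
print(round(ece, 2), round(mce, 2))
```
      </preformat>
      <p>In the toy example all trials share one bin with accuracy 0.75 and confidence 0.9, so both errors equal the gap of 0.15 and the model would count as over-confident.</p>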
      <p>
        Inspired by the promising results in domain-adversarial training of neural networks in [
        <xref ref-type="bibr" rid="ref16 ref17">16,
17</xref>
        ], we also experimented with an adversarial fandom verifier: Starting with the document
embeddings in Eq. (1), we fed this vector into the author verification system (including DML,
BFS and UAL) and into an additional fandom verifier, which is placed parallel to the author
verification system. It has the same architecture but includes a gradient reversal layer and
different trainable parameters. However, in these experiments, we did not achieve any significant
improvements by domain-adversarial training. Therefore, we independently optimized the
fandom verifier by stopping the flow of the gradients from the fandom verifier to the authorship
verification components, so that the training of the fandom verifier does not affect the target
system at all. Fig. 3 shows the obtained epoch-wise accuracies during training. It can be seen
that the fandom accuracy stays around 55%, which indicates that the training strategy yields
nearly topic-invariant stylometric representations, even without domain-adversarial training.
      </p>
      <sec id="sec-5-1">
        <title>5.1. Results on the Calibration Dataset</title>
        <p>We first evaluated the UAL component on the calibration set (without non-responses) and
calculated the respective PAN metrics for different combinations of the author-fandom subsets.
Results are shown in Table 2. To guarantee that the calculated metrics are not biased by an
imbalanced dataset, we reduced the number of pairs to the smallest number of pairs of all
subsets. Thus, all results in Table 2 were computed from 2 × 2,100 pairs. Unsurprisingly, the best
performance was obtained for the least challenging SA_SF + DA_DF pairs and the worst
performance was seen for the most challenging SA_DF + DA_SF pairs. We continued to optimize our
system w.r.t. this most challenging subset combination in particular, even though we specifically
expect to see SA_DF + DA_DF pairs in the PAN 2021 evaluation set.</p>
        <p>Table 3 additionally provides the corresponding calibration metrics. Analogously to the PAN
metrics, the ECE consistently increases from the least to the most challenging data scenarios.
Interestingly, our system is under-confident for SA_SF pairs, i.e. conf &lt; acc. The predictions
then change to be over-confident (conf &gt; acc) for SA_DF pairs.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results on the Validation Dataset</title>
        <p>Next, we separately provide experimental results for all system components on the validation
dataset, since O2D2 has been trained on the calibration dataset. The first four rows in Tables 4
and 5 summarize the PAN metrics and the corresponding calibration measures averaged over
all ensemble models.</p>
        <p>The overall score of the UAL component in the third row of Table 4 is on par with the DML
and BFS components and slightly lower compared to the corresponding UAL score measured
on the calibration dataset in Table 2. Nevertheless, we do not observe significant differences in
the metrics for both datasets, which shows the robustness and generalization of our system.</p>
        <p>Going from the third to the fourth row in Table 4, it can be observed that the overall score,
boosted by c@1 and F1, significantly increases from 92.5 to 93.2. Hence, the model performs
better if we take undecidable trials into account. However, the f_05_u score decreases, since it
treats non-responses as false negatives. The percentage of undecidable trials generally ranges
from 8% to 11%.</p>
        <p>[Figure 4: Posterior histograms, panels (a) to (d); x-axis: output predictions.]</p>
        <p>In Table 5, we see that both the BFS and UAL components notably improve the ECE and
MCE metrics. However, an insertion of non-responses via O2D2 significantly increases the MCE.
This can be explained by the posterior histograms in Fig. 4. The plots (a) and (b) show the
histograms for SA_DF and DA_SF pairs without applying O2D2 to define non-responses. In
contrast, plots (c) and (d) present the corresponding histograms including the 0.5-values of
non-responses. The effect of O2D2 is that most of the trials whose posteriors fall within the
interval [0.3, 0.7] are eventually declared as undecidable. Hence, the system correctly predicts
nearly all of the remaining, confidently assigned trials around 0.7/0.8 for same-author pairs
or 0.2/0.3 for different-author pairs. As a result, we see a large gap (i.e. conf &lt;&lt; acc) between
the confidence score and the averaged accuracy in these bins.</p>
        <p>The last two rows in Tables 4 and 5 show the performance of the ensemble, first without and
then with non-responses, to show the effect of O2D2. On the validation set, our ensemble with
O2D2 returns non-responses in 9% of the test cases. Comparing the last two rows, we obtain
the highest overall score with our proposed framework, which ultimately constitutes our final
submission.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Results on the PAN 2021 Evaluation Dataset</title>
        <p>To conclude this section, we present our results on the official PAN 2021 evaluation set. The
performance for both the early-bird and the final submission can be found in Table 6. We also
provide the reported result on the PAN 2020 evaluation set for the predecessor model.</p>
        <p>Unsurprisingly, the early-bird overall score (single model) on the PAN 2021 evaluation set is
slightly higher, since it contains DA_DF pairs instead of DA_SF pairs. The main difference is,
unexpectedly, given by the f_05_u score, which increases from 89.3% to 94.6%. In our opinion,
this is caused by returning a lower number of non-responses, which would also explain the
lower values for c@1 and F1.</p>
        <p>Comparing the early-bird (2nd row) with the final submission (4th row), we can further
significantly increase the overall score by 1.5%. We assume that the ensemble now returns a
higher number of non-responses, which results in a slightly lower f_05_u score. Conversely,
we can observe improved values for the c@1, F1 and Brier scores.</p>
        <p>The last row displays the achieved PAN 2020 results. As can be seen, our final submission
ends up with a higher overall score (plus 2%) by significantly improving all single metrics,
although the PAN competition moved from a closed-set to an open-set shared task, illustrating the
efficiency of the proposed extensions.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this work, we presented O2D2, which captures undecidable trials and supports our hybrid
neural-probabilistic end-to-end framework for authorship verification. We made use of the
early-bird submission to receive a preliminary assessment of how the framework behaves
on the novel open-set evaluation. Finally, based on the presented results, we submitted an
O2D2-supported ensemble to the shared task, which clearly outperformed our own system from
2020 as well as the new submissions to the PAN 2021 AV task.</p>
      <p>These results support our hypothesis that modeling aleatoric and epistemic uncertainty and
using them for decision support is a beneficial strategy, not just for responsible ML, which
needs to be aware of the reliability of its proposed decisions, but also, importantly, for achieving
optimal performance in real-life settings, where distributional shift is almost always hard to
avoid.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This work was in significant parts performed on an HPC cluster at Bucknell University through
the support of the National Science Foundation, Grant Number 1659397. Project funding was
provided by the state of North Rhine-Westphalia within the Research Training Group "SecHuman
- Security for Humans in Cyberspace" and by the Deutsche Forschungsgemeinschaft (DFG)
under Germany’s Excellence Strategy - EXC 2092 CaSa - 390781972.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Boenninghoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rupp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nickel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kolossa</surname>
          </string-name>
          ,
          <article-title>Deep Bayes Factor Scoring for Authorship Verification</article-title>
          ,
          <source>in: CLEF 2020, Notebook Papers</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. L. D. L. P.</given-names>
            <surname>Sarracén</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kestemont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Manjavacas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Markov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wolska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Zangerle</surname>
          </string-name>
          ,
          <article-title>Overview of PAN 2021: Authorship Verification, Profiling Hate Speech Spreaders on Twitter, and Style Change Detection</article-title>
          ,
          <source>in: 12th International Conference of the CLEF Association (CLEF</source>
          <year>2021</year>
          ), Springer,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <article-title>A Survey of Modern Authorship Attribution Methods</article-title>
          ,
          <source>Journal of the American Society for Information Science and Technology</source>
          <volume>60</volume>
          (
          <year>2009</year>
          )
          <fpage>538</fpage>
          -
          <lpage>556</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kestemont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Manjavacas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Markov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of the Cross-Domain Authorship Verification Task at PAN 2020</article-title>
          , in:
          <source>CLEF 2020, Notebook Papers</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kestemont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Manjavacas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of the Authorship Verification Task at PAN 2021</article-title>
          , in:
          <source>CLEF 2021 Labs and Workshops, Notebook Papers</source>
          , CEUR-WS.org,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cumani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Brümmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Burget</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Laface</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Plchot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasilakakis</surname>
          </string-name>
          ,
          <article-title>Pairwise Discriminative Speaker Verification in the I-Vector Space</article-title>
          ,
          <source>IEEE Trans. Audio, Speech, Lang. Process.</source>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Feng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>Learning with Noise: Enhance Distantly Supervised Relation Extraction with Dynamic Transition Matrix</article-title>
          ,
          <source>in: 55th Annual Meeting of the ACL</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>430</fpage>
          -
          <lpage>439</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kendall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gal</surname>
          </string-name>
          ,
          <article-title>What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          ,
          Curran Associates, Inc.,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <article-title>Calibrating Deep Neural Network Classifiers on Out-of-Distribution Datasets</article-title>
          , ArXiv abs/2006.08914 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Lakshminarayanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pritzel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Blundell</surname>
          </string-name>
          ,
          <article-title>Simple and Scalable Predictive Uncertainty Estimation Using Deep Ensembles</article-title>
          ,
          <source>in: 31st NeurIPS</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>6405</fpage>
          -
          <lpage>6416</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Boenninghoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kolossa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Nickel</surname>
          </string-name>
          ,
          <article-title>Self-Calibrating Neural-Probabilistic Model for Authorship Verification Under Covariate Shift</article-title>
          ,
          in:
          <source>12th International Conference of the CLEF Association (CLEF 2021)</source>
          , Springer,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Boenninghoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hessler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kolossa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Nickel</surname>
          </string-name>
          ,
          <article-title>Explainable Authorship Verification in Social Media via Attention-based Similarity Learning</article-title>
          ,
          <source>in: IEEE International Conference on Big Data</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Pereyra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chorowski</surname>
          </string-name>
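          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hinton</surname>
          </string-name>
          ,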
          <article-title>Regularizing Neural Networks by Penalizing Confident Output Distributions</article-title>
          ,
          <year>2017</year>
          . arXiv:1701.06548.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gollub</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
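          ,
          <article-title>TIRA Integrated Research Architecture</article-title>
          , in:
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Peters</surname>
          </string-name>
          (Eds.),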
          <source>Information Retrieval Evaluation in a Changing World, The Information Retrieval Series</source>
          , Springer, Berlin Heidelberg New York,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Pleiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <article-title>On Calibration of Modern Neural Networks</article-title>
          , in:
          <source>34th International Conference on Machine Learning</source>
          , volume
          <volume>70</volume>
          , PMLR,
          <year>2017</year>
          , pp.
          <fpage>1321</fpage>
          -
          <lpage>1330</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ganin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ustinova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ajakan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Germain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Laviolette</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marchand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lempitsky</surname>
          </string-name>
          ,
          <article-title>Domain-Adversarial Training of Neural Networks</article-title>
          ,
          <source>J. Mach. Learn. Res.</source>
          <volume>17</volume>
          (
          <year>2016</year>
          )
          <fpage>2096</fpage>
          -
          <lpage>2030</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bischof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Deckers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schliebs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <article-title>The Importance of Suppressing Domain Style in Authorship Analysis</article-title>
          , CoRR abs/2005.14714 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>