<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>J. Zhang and L. Zhu. “Citation recommendation using semantic representation of cited
papers' relations and content”. In: Expert Systems with Applications</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Querying the Past: Automatic Source Attribution with Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ryan Muther</string-name>
          <email>muther.r@northeastern.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mathew Barber</string-name>
          <email>mathew.barber@aku.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Smith</string-name>
          <email>dasmith@northeastern.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aga Khan University Institute for the Study of Muslim Civilisations</institution>
          ,
          <addr-line>10 Handyside St, London, GVQG 23</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Northeastern University</institution>
          ,
          <addr-line>360 Huntington Ave, Boston, MA 02115</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>187</volume>
      <issue>2022</issue>
      <fpage>344</fpage>
      <lpage>355</lpage>
      <abstract>
        <p>This paper explores new methods for locating the sources used to write a text by fine-tuning a variety of language models to rerank candidate sources. These methods promise to shed new light on traditions with complex citational practices, such as in medieval Arabic, where citations are ambiguous and the boundaries of quotation are poorly defined. After retrieving candidate sources using a baseline BM25 retrieval model, a variety of reranking methods are tested to see how effective they are at the task of source attribution. We conduct experiments on two datasets, English Wikipedia and medieval Arabic historical writing, and employ a variety of retrieval- and generation-based reranking models. In particular, we seek to understand how the degree of supervision required affects the performance of various reranking models. We find that semi-supervised methods can be nearly as effective as fully supervised methods while avoiding potentially costly span-level annotation of the target and source documents.</p>
      </abstract>
      <kwd-group>
        <kwd>information retrieval</kwd>
        <kwd>citation modeling</kwd>
        <kwd>source attribution</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>When reading a text, it is often useful to know which sources were used to write it.
Knowledge of the sources used to write a news article, for example, can inform a reader of bias in
how information in the article is reported. In historical domains, the sources used to write a
document can provide insight into both how the author worked and what materials they had
access to. We define the problem of determining the sources used to write a piece of text as
that of source attribution.</p>
      <p>Researchers in natural language processing most often study source attribution in scientific
papers, inferring links to referenced articles based on citations. Part of why this can be done
so well is that modern citations follow a standardized format, often generated by required
typesetting packages, that can be parsed by regular expressions or other simple methods. This
comparative ease of data creation in turn allows the creation of large datasets for training fully
supervised models for source attribution using the bibliographic information recovered from
the citations. These models tend to work best when there is a 1:1 correspondence between
first printings of a work and papers. In more ambiguous domains, where potential sources
can be redundant, these bibliographic models can often fail to disambiguate which of multiple
possible editions of a work is the correct source.</p>
      <p>
        Citations in medieval Arabic historiography are a particularly complex case. From the
seventh until around the twelfth century, the predominant form of citation was the isnad, a chain
of authorities linking the author of a text back to the original (often presumed oral) source for
a narrative [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Isnads gradually gave way to the citation of individual authors, occasionally
with reference to the title of an author’s text. All of these forms of citation are highly
ambiguous. While they can often be precisely traced to source texts, the process of manually doing so
is laborious and subject to error.
      </p>
      <p>
        For example, the historian al-Maqrizi (d. 1442CE/845AH) wrote multiple large texts, which
are full of citations both to lost and extant works. As Bauden has shown, these citations
reveal how al-Maqrizi worked, the sources to which he had access, and his attitudes towards
source attribution [
        <xref ref-type="bibr" rid="ref2 ref4">3, 4, 2</xref>
        ]. Al-Maqrizi is not exceptional: for examples of other large,
richly cited texts see [
        <xref ref-type="bibr" rid="ref10 ref13 ref5">10, 13, 5</xref>
        ]. If we are to understand how the Arabic historiographical tradition
approaches quotation and citation, we need methods that can disambiguate vast numbers of
citations across a range of texts and citation styles.
      </p>
      <p>In order to retrieve sources in settings where the citations are harder to locate and more
ambiguous, creating fully annotated data in large amounts can be time-consuming and require
significant domain expertise. To circumvent this, we experiment with different levels of
supervision in the models we use to retrieve potential sources. As we will see, semi-supervised
methods can perform comparably to more annotation-intensive fully supervised models.</p>
      <p>Looking only at the text itself, there are two principal forms of information about the sources:
text reuse and citation. Text reuse is when an author directly copies material from their
source, possibly involving more complex transformations. This is common practice in highly
intertextual domains like historical Arabic writing. Citation involves the author explicitly telling
the reader which source is being used, as one often sees in modern scientific writing or Wikipedia
entries. Citations can have varying degrees of specificity, ranging from simply the author(s)
and year, as one sees in some fields of scientific literature, to a more full-fledged citation
including a title and page number, as is more common in many fields in the humanities. In some cases,
the citation may take the form of a unique identifier, such as a URL or Wikipedia headword.</p>
      <p>Each of these forms of citation and reuse can be viewed as part of a broader spectrum of
relationships between a text and its sources. At one extreme, we have Wikipedia, where
the simplest form of citation to another Wikipedia article is a link to the cited article by only its
headword. At the opposite extreme, there is the highly intertextual classical Arabic domain,
where source attribution is more easily performed by recognizing the source text than by relying on
any attribution on the part of the author. We view both text reuse and citations as species
of queries that allow a reader to combine information in the text, domain knowledge, and
bibliographic knowledge to track down the intended source.</p>
      <p>Different architectures can be used to model the process of source attribution from the
perspective of the author or that of the reader. From the perspective of the author, one could
imagine a process in which they select a source (retrieval) and then use the text of that source
as the basis for their own writing (generation). This process is similar to that used by the
recently proposed retrieval-augmented generation models in the work by Lewis et al. [8]. From
the perspective of the reader, the source retrieval problem is more like one of retrieval alone,
as the reader doesn’t need to create the target themselves, but can use it in the construction of
a query to find sources.</p>
      <p>We operationalize the problem by turning it into a two-stage retrieval and reranking problem.
We first use a baseline retrieval model to retrieve candidate sources for a given target document.
A second model is then used to rerank the possible sources.</p>
      <p>This paper is organized as follows: §2 covers related work; §3 provides an overview of the
datasets we experiment with; §4 covers the forms of models we use; §5 describes our
experiments and results; and §6 provides a discussion of the results as well as potential avenues for
future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        This paper is closely related to work by others on combined retrieval and generation methods
for question answering, citation suggestion, and literary evidence retrieval, similar to that done
by Lewis, Perez, Piktus, Petroni, Karpukhin, Goyal, Küttler, Lewis, Yih, Rocktäschel, Riedel,
and Kiela [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In contrast to their work, however, we are more focused on improving the model’s
retrieval performance rather than its generation performance. Also of interest here is the work
of Mao, He, Liu, Shen, Gao, Han, and Chen [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] on generation-augmented retrieval. They focus
more on generating better queries by applying generative models than on using the generative
model as a reranker.
      </p>
      <p>The problem of source attribution is similar to that of citation recommendation, which is
usually thought of as a tool for writers to find relevant citations. Zhang and Zhu [20] evaluate
various forms of citation prediction model based on the similarity between the citation context
and the citing paper to predict citations in PLOS ONE. While this work is valuable for helping
authors of scientific publications, it is limited in scope due to the way scientific papers tend to
engage with their sources at the coarse paper level. In more humanistic domains, authors often
engage with the text of their sources directly and may reference multiple parts of a source,
making our source attribution problem a more granular version of the commonly studied citation
recommendation problem.</p>
      <p>
        The most similar problem studied elsewhere is literary evidence retrieval as proposed by
Thai, Chang, Krishna, and Iyyer [
        <xref ref-type="bibr" rid="ref9">19</xref>
        ]. In RELIC, the goal is to retrieve the correct quoted
passage from a known text based on its context in a work of literary analysis. The objective here
is similar, but the relationship between the citing and cited texts may be more complex than
direct quotation and the source text is not necessarily known, complicating source retrieval.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets and Tasks</title>
      <p>We work with two datasets in this paper: Wikipedia links to other Wikipedia pages
(WikiLink) and two classical Arabic texts taken from the OpenITI corpus of digitized Arabic texts
[16]. These datasets in particular were chosen as they represent different points in the
spectrum of relationships between texts and sources. For the Wikipedia link task, the use of
the source requires very little modification and could be reduced to copying the headword of
the source article. With al-Maqrizi, the relationship between the text and its sources is more
complicated, with the source text often uncited or cited in ways that are difficult to recognize
automatically, lacking any sort of standardized form like that found in modern genres.
Additionally, the source may often be heavily edited or paraphrased, further complicating the
source-target relationship.</p>
      <p>Table 1 shows the sizes of training and test sets. The texts used in the classical Arabic
experiments are al-Maqrizi (d. 1442CE/845AH)’s al-Mawaʽiz wa-l-Ittiʽbar bi-Dhikr al-Khitat
wa-l-Athar - a topographical history of Egypt, often referred to as the Khitat - and one of al-Maqrizi’s
sources, Ibn ʽAbd al-Hakam (d. 871CE/257AH)’s Futuh Misr wa-l-Maghrib - a history of the
Muslim conquest of Egypt. Based on passim analysis of the OpenITI Corpus (passim release
2022.2.7), the Futuh Misr is the second-most reused text by al-Maqrizi in his Khitat, with around
20,000 word tokens shared between the two books.</p>
      <p>These reused passages of the Futuh Misr are often cited in conjunction with other sources.
The following excerpt is from the Khitat, found in a section entitled ’The canals that intersect
the Nile’ (citations have been underlined):</p>
      <p>It is known that once the Nile has finished rising, canals and channels are cut from
it... [summary of the canals and their names]
Ibn ʽAbd al-Hakam said, [quoting] from Abu Riham al-Samaʽi: Old Cairo [Misr] had
stone and other kinds of bridges, [built] by decree and design, such that water
passed under its houses and [through] its courtyards... [a description of some of
the canals dug in Egypt]. The Sakha canal was dug by Tudarus b. Sa b. Qubtim b.
Misrayim b. Baysar b. Ham b. Nuh. He was the first of the ancient Coptic kings,
who ruled Egypt in the first age.</p>
      <p>Ibn Wasif Shah said: King Tudarus was the first King to rule the full extent of it
[Egypt] after his father Sa... [there follows a short biography of Tudarus]
The Sardus Canal: Haman dug it. Ibn Wasif Shah said: King Talma b. Qumas sat
on the throne... [Brief biography of Talma]
Others mention:... When he [Talma] became King he spent money freely, brought
close those who were loyal to him, and killed those who opposed him. His rule
was moderate. He appointed Haman as a successor... [more description of Talma’s
rule, including the canals that he commissioned]
Ibn ʽAbd al-Hakam said, [quoting] from ʽAbdullah b. ʽAmru b. al-ʽAs (God be pleased
with them both): The Pharaoh tasked Haman with digging the Sardus canal... [12,
pp. 186-188]</p>
      <p>This quotation is characteristic of the kind of writing found in the Khitat, where multiple
sources are threaded together to describe topographical features or landmarks and outline their
history, including significant tangents. With no quotation markers, it can be difficult to
separate quoted sources from the author’s commentary. Al-Maqrizi introduces the section, giving
an overview of the canals. He then quotes Ibn ʽAbd al-Hakam for a general historical
introduction, at the conclusion of which King Tudarus is mentioned, connected to the Sakha canal.
This leads him to quote Tudarus’ biography from another author, Ibn Wasif Shah. Al-Maqrizi
then moves on to the next canal, the Sardus Canal, which he states was dug by Haman. Having
introduced it in his own words, al-Maqrizi re-cites Ibn Wasif Shah - using him once again for a
biography of a pre-Islamic ruler. At the close of that biography, there is an ambiguous citation
- ’Others mention’ - for a description of Talma’s connection to Haman. Al-Maqrizi then cites
Ibn ʽAbd al-Hakam to describe the digging of the Sardus canal.</p>
      <p>As should be clear from this example, citation and quotation in the Khitat can be quite
ambiguous, especially as the author moves between sources. For example, ’Others mention’ could
be a citation used by Ibn Wasif Shah, or it could be al-Maqrizi’s own citation. Retrieval of
potential source texts would allow us to resolve ambiguous references such as this.</p>
      <p>The citations themselves are references to authors - ’so-and-so said’. References to book
titles occur, but much less frequently, for example:
al-Asʽad b. Mammati said in his book Qawanin al-Dawawin: the Alexandria canal
has a number of channels... [12, p. 189]
The use of author names, rather than book titles, increases the level of ambiguity. Authors
might be referred to by different names, and - more crucially - authors often wrote more than
one book. We need, therefore, to be able to resolve a citation and quotation to the original
source text. Text reuse detection can only partially solve this problem, as the same passage
might be quoted by multiple authors (and might be used by the same author in multiple works).
Moreover, it is necessary to separate the target author’s source text from the texts that are being
quoted by the source text.</p>
      <p>
        Addressing source attribution in the Khitat promises to reveal more about al-Maqrizi’s sources
and his use of them. Al-Maqrizi produced a large oeuvre, including 9 works (in the OpenITI
Corpus) that exceed 100,000 tokens in length, much of which he copied from early source texts
(both extant and lost). Of these works, the Khitat contains the largest number of citations; in
other works, he more frequently quotes from sources without citation. As al-Maqrizi shares
so many of his sources between his works, identifying citations within the Khitat and their
corresponding sources promises to unlock the identity of sources quoted in his other works.
For a small-scale case study, see Barber’s examination of al-Maqrizi’s quotations from the lost
Fatimid biography, the Sirat al-Yazuri [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Given the size of al-Maqrizi’s works and the breadth
of his source usage, computational methods are essential if we are to move beyond small case
studies. This examination of the Khitat and its use of the Futuh Misr is, therefore, an essential
preliminary step in understanding these kinds of complex citation and source attribution
questions.
      </p>
      <p>
        The texts have been annotated by a domain expert on al-Maqrizi to create a dataset of 181
regions of shared text between the two works, some with and some without direct attribution
in the form of citations by al-Maqrizi. When the other text is directly cited, the citation is
separately marked in the annotations. The annotations are created based on the output of the
text reuse detection algorithm passim [
        <xref ref-type="bibr" rid="ref8">18</xref>
        ], which operates by aligning sections of texts with a
high number of shared character n-grams to find regions of shared material, which the domain
expert refined to create the dataset of source-target pairs that we use in our experiments. Since
the works in the corpus are so long, rather than aligning full texts, we cut the works up into
300-token chunks and align those. The alignments created by our annotator are at the chunk level,
where chunk X of Futuh Misr is a source for chunk Y of Khitat. The goal of the experiments
with this dataset is to retrieve the proper source chunk for a given target chunk.
      </p>
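<p>The chunking step above can be sketched in a few lines; this is a minimal illustration (the helper name is ours, and the 300-token size is the one stated in the text), not the exact preprocessing code used in the experiments:</p>

```python
def chunk_text(text, chunk_size=300):
    """Split a work into consecutive chunks of at most chunk_size word tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]

# A 650-token document yields chunks of 300, 300, and 50 tokens.
doc = " ".join(f"w{i}" for i in range(650))
chunks = chunk_text(doc)
```

<p>Alignments then relate chunk indices of the source work to chunk indices of the target work.</p>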
      <p>
        To create the dataset of Wikipedia citations, we collected 150,000 links from Wikipedia
articles to other Wiki articles where the link’s anchor text was the name of the cited page from
Singh, West, and Colavizza [
        <xref ref-type="bibr" rid="ref7">17</xref>
        ]. To better handle long articles, retrieval and reranking are done
at the section level, and any retrieved section from the correct source page is counted as
relevant for evaluation purposes. For this dataset, the goal is to be able to retrieve a section from
the cited page using the sentence with the link from the citing page.
      </p>
      <p>While one could accomplish the source attribution task in this simplified Wikipedia setting
by using a simple lookup table of headwords from other Wiki pages, this task should not be
discounted as uninteresting as a form of citation. The goal of the experiments on this data
is to demonstrate that the kinds of models used in contexts where the relationship between
the source and the target is more complicated than simple copying are also usable for this
simplified domain.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Models</title>
      <p>For our experiments we compare several kinds of models: a baseline retrieval model, as well
as several forms of reranking models applied to the results of the baseline retrieval model:
embedding similarity and a generative target text model. We use each form of model to rerank
candidate sources retrieved by a baseline BM25 retrieval model. Each of these models is meant
to test the usefulness of different architectures in solving the source attribution task. The
embedding similarity model is a baseline for how well untuned embedding models can solve
this task. The generative models are used to examine how effectively generative models can
learn to copy material from the source to the target as well as how the text is transformed in
moving from the source to the target.</p>
      <sec id="sec-4-1">
        <title>4.1. Baseline Retrieval Model</title>
        <p>The starting point of all of our experiments is a BM25 retrieval model used to retrieve possible
sources, for which we use pyserini’s implementation [9]. To allow the retrieval model to
leverage information present in citations, we augment the source documents with bibliographic data,
which is often otherwise not present in the source documents. In the case of the Wikipedia
links, this takes the form of the article title. For the al-Maqrizi dataset, we augment sources with
the author’s name and the title of the source text, some combination of which is frequently
used by al-Maqrizi to indicate his sources when citations are present.</p>
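<p>The experiments use pyserini's BM25 index; as a self-contained illustration only (not the pyserini implementation, and with invented toy documents), the scoring and the effect of bibliographic augmentation can be sketched with the textbook Okapi BM25 formula:</p>

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Okapi BM25 over whitespace-tokenized documents (textbook formula)."""
    toks = [d.split() for d in docs]
    N = len(docs)
    avgdl = sum(len(t) for t in toks) / N
    df = Counter()
    for t in toks:
        df.update(set(t))          # document frequency of each term
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in query.split():
            if w not in tf:
                continue
            idf = math.log(1 + (N - df[w] + 0.5) / (df[w] + 0.5))
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

# Prepending bibliographic metadata (author name, title) to a source chunk
# lets a citation like "ibn abd al-hakam said" match that chunk.
sources = ["ibn abd al-hakam futuh misr the canal was dug in old cairo",
           "some unrelated chunk about taxes"]
query = "ibn abd al-hakam said the canal"
scores = bm25_scores(query, sources)
```

<p>Here the augmented chunk outscores the unrelated one because the citation's author name appears only in the augmentation.</p>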
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Embedding Similarity</title>
        <p>
          The simplest form of reranker orders documents by descending cosine similarity between the
representations of the source and target under a BERT embedding model trained on English and Arabic
by Lan et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. As one might infer, the similarity is calculated as in Equation 1, where y
and s are the BERT embeddings of the target and source respectively.
        </p>
        <p>sim(y, s) = cos(y, s)   (1)
This places source documents with more similar embeddings to the target document higher in
the ranked list. The intuition behind this approach is that the sources used to write a text will
be topically similar, meaning that the source documents may be nearby in embedding space.
In practice, however, this intuition may not lead to better source retrieval performance in all
domains, as experiments in Section 5 will show.</p>
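<p>A minimal sketch of this cosine reranking, with small toy vectors standing in for the BERT embeddings, might be:</p>

```python
import numpy as np

def rerank_by_cosine(target_vec, source_vecs):
    """Return candidate indices sorted by descending cosine similarity (Eq. 1)."""
    t = target_vec / np.linalg.norm(target_vec)
    S = source_vecs / np.linalg.norm(source_vecs, axis=1, keepdims=True)
    sims = S @ t                      # cosine of each candidate with the target
    return list(np.argsort(-sims)), sims

# Toy 3-d "embeddings": candidate 1 points the same way as the target.
target = np.array([1.0, 0.0, 1.0])
cands = np.array([[0.0, 1.0, 0.0],
                  [2.0, 0.0, 2.0],
                  [1.0, 1.0, 0.0]])
order, sims = rerank_by_cosine(target, cands)
```

<p>Candidate 1, parallel to the target, is ranked first; the orthogonal candidate 0 falls to the bottom of the list.</p>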
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Generator-Only</title>
        <p>
          The generator-only models rerank sources using the likelihood from a BART-based generation
model [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] of some portion of the target (citing) document conditioned on the source (cited)
document and the unmasked sections of the target. Unlike the embedding similarity method
described in the previous section that only requires supervision at the document level, this
method also requires the span of interest in the target to be annotated. The span of interest is
the section of the target that we are interested in attributing to a particular source. At training
time, this section of the target is replaced with &lt;MASK&gt; tokens and the BART model is trained
to predict the masked span conditioned on the concatenation of the masked target and source by
minimizing the negative log likelihood of the masked span as in Equation 2:
loss = -log P(y_m | y_o, s)   (2)
where y_m is the masked span in the target, y_o is the observed portion of the target, and s is
the source document. At inference time, the retrieved sources are reranked using the same loss,
moving source documents that are more useful for generating the target text up in the ranked
list of sources. For the al-Maqrizi dataset, since some annotated spans are often quite long,
with the longest being 300 words (an entire input document), we truncate the masked sections
to be at most 100 word pieces in length, leaving some of the target document to condition the
generator.
        </p>
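<p>The inference-time reranking can be sketched as follows; the nll argument stands in for the fine-tuned BART model's negative log-likelihood of the masked span (Equation 2), and the toy word-overlap scorer used here is only a hypothetical proxy for it:</p>

```python
def rerank_by_generation_loss(masked_target, span, candidates, nll):
    """Rerank candidate sources by the generator's loss on the masked span:
    lower negative log-likelihood means a more useful source."""
    losses = [nll(span, masked_target, src) for src in candidates]
    return sorted(range(len(candidates)), key=lambda i: losses[i])

# Stand-in scorer: a real system would compute -log P(span | target, source)
# with the fine-tuned BART; here overlap with the source serves as a proxy.
def toy_nll(span, masked_target, source):
    overlap = len(set(span.split()) & set(source.split()))
    return -float(overlap)

target = "the sakha canal was dug by <MASK>"
span = "tudarus the coptic king"
cands = ["ibn wasif shah said king tudarus ruled egypt the coptic",
         "a chunk about taxes in alexandria"]
order = rerank_by_generation_loss(target, span, cands, toy_nll)
```

<p>The candidate that best explains the masked span is moved to the top of the ranked list.</p>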
        <p>Additionally, to test the feasibility of less annotation-heavy semi-supervised models, we
also experiment with a semi-supervised version of BART where, rather than using the
known-correct source, we substitute the top-ranking retrieved source s' from the baseline retrieval
model, giving us
loss = -log P(y_m | y_o, s')   (3)
as a loss function. In theory, a sufficiently strong baseline retrieval model will give BART
enough correct documents to learn from, while learning to ignore erroneously retrieved
irrelevant sources. This form of model, of course, still requires annotation at the span level in the
target, but frees the annotator from locating the correct source for the target span of interest,
which is often the more time-intensive portion of the task.</p>
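<p>Constructing the semi-supervised training pairs described above might be sketched as follows (the function and variable names are illustrative, not from the paper's code):</p>

```python
def build_semisupervised_pairs(targets, spans, retrieved):
    """Pair each masked target with the top retrieved source s' from the
    baseline model, instead of a human-verified source (training data for Eq. 3)."""
    pairs = []
    for tgt, span, ranked in zip(targets, spans, retrieved):
        masked = tgt.replace(span, "<MASK>")     # hide the annotated span
        pairs.append((masked, span, ranked[0]))  # s' = top-ranked candidate
    return pairs

targets = ["ibn wasif shah said king tudarus ruled egypt"]
spans = ["king tudarus ruled egypt"]
retrieved = [["source chunk about tudarus", "unrelated chunk"]]
pairs = build_semisupervised_pairs(targets, spans, retrieved)
```

<p>Only the span annotation is needed; the paired source comes from the baseline retriever rather than the annotator.</p>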
        <p>We can, of course, also relax this assumption that the human annotator needs to mark the
span of interest as well by selecting a span of interest using some automatic method. For
Wikipedia, this is done by construction as the source links are themselves chosen automatically
when constructing the dataset. For al-Maqrizi, such a dataset can be constructed using the raw
passim alignments that the annotator used as the basis to construct the dataset in the first place.
Rather than using the human-annotated target spans and veri昀椀ed source documents, the model
can be trained on model-retrieved target spans and source documents.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>We divide our experiments into two sections: retrieval-oriented and generation-oriented. The
retrieval experiments are meant to explore how effective different forms of models with varying
degrees of supervision are at solving the problem of source attribution, and are evaluated using
Recall@10 and Mean Reciprocal Rank. The generation experiments aim to understand the
degree to which generative models are capable of learning to copy from source documents,
rather than relying on the ability to fill in masked text using only the surrounding context
and the understanding of language gained during pretraining. To this end, we evaluate the
accuracy of the generator when tasked with predicting the masked text, rather than employing
the generator as a reranker as in the retrieval experiments. This is of particular interest in the
domain of Wikipedia, as it was part of the model’s pretraining dataset, so data leakage from
the pretraining data may have occurred.</p>
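<p>The two retrieval metrics can be computed as follows; this is a minimal sketch over toy ranked lists, not the evaluation code used in the experiments:</p>

```python
def recall_at_k(ranked_lists, relevant, k=10):
    """Fraction of queries whose relevant document appears in the top k."""
    hits = sum(1 for ranks, rel in zip(ranked_lists, relevant) if rel in ranks[:k])
    return hits / len(ranked_lists)

def mrr(ranked_lists, relevant):
    """Mean reciprocal rank of the first relevant document (0 if missing)."""
    total = 0.0
    for ranks, rel in zip(ranked_lists, relevant):
        if rel in ranks:
            total += 1.0 / (ranks.index(rel) + 1)
    return total / len(ranked_lists)

# Two queries: the relevant doc is ranked 1st for query A and 3rd for query B.
ranked = [["s1", "s2", "s3"], ["s9", "s8", "s7"]]
gold = ["s1", "s7"]
r2 = recall_at_k(ranked, gold, k=2)   # only query A's source is in the top 2
m = mrr(ranked, gold)                 # (1/1 + 1/3) / 2
```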
      <sec id="sec-5-1">
        <title>5.1. Retrieval Experiments</title>
        <p>We will now describe our experiments on the Wikipedia and al-Maqrizi datasets. As a
baseline, we use a simple bag-of-words BM25 retrieval model as implemented in pyserini [9]. For
Wikipedia, the baseline retrieval model attains a recall at 10 of .64 and MRR of .478. If we then
use these retrieval results as input to a BART-based (i.e. generation only) reranking model,
which has been trained to generate the link text conditioned on the masked target text and
complete source text, the recall at 10 increases to .97 and the MRR to .927. The purely
retrieval-based model, untuned BERT, displays vastly worse performance than the baseline model.
Training with partial supervision performs almost as well as a fully supervised model with .94 recall
at 10 and .895 MRR, despite the worse performance of the baseline retrieval model.</p>
        <p>On the al-Maqrizi dataset, the baseline retrieval model attains a recall at 10 of .84 and an
MRR of .680. Reranking with BERT without any further pretraining for the task actually
degrades performance, decreasing recall at 10 to .36 and MRR to .30. Again, similar to Wikipedia,
reranking with a purely generative approach significantly improves performance, reaching .95
recall at 10 and an MRR of .948. Interestingly, as with the Wikipedia dataset, most of this
performance increase is maintained if we switch the training data from fully supervised training
to a semi-supervised setup with only span-level supervision in the target documents, with a
very minor decrease in MRR to .947. This continues to hold true even when one relaxes the
constraint that the spans to generate also be human-annotated, as one can see from the
BART-Passim model, where the model outperforms the retrieval baseline both in terms of MRR and
Recall@10. Due to the extreme length of the masked sections in this dataset, a similar
evaluation of the predictive accuracy of BART without conditioning on the source document would
likely be uninformative: the odds of correctly predicting an entire span of 100 subword
tokens exactly are much lower, making the exact-match evaluation performed
on the Wikipedia data in the next section much less informative.</p>
        <p>The downside of this approach is that, in addition to telling the model to generate some text
which may be derived from the source, both these approaches also force the model to generate
some or all of the unrelated content in the sentence not derived from the source. This
has two main effects on performance. First, all pages that are not useful for explaining part
of a text become less likely to be sources according to the trained reranker. Second, source
candidates that do happen to explain part of the unrelated text are erroneously considered more
likely to be sources. One way to get around this would be to use a text reuse detection model to
identify passages of direct reuse in the potential source set, and to bias the masking
toward tokens marked as reuse over those not marked as reuse, potentially making
the reconstruction loss more useful as a means of determining whether a potential source is
useful or not.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Generation Experiments</title>
<p>The improved performance on Wikipedia link prediction may come from the generative ability
imparted by BART’s pretraining rather than from a learned ability of the fine-tuned model to
copy from sources. Furthermore, for the Wikipedia experiments, the training and test sets are
themselves part of what BART was originally pretrained on, making data leakage between the
pretraining dataset and this downstream task possible. To verify that this is not artificially inflating
the models’ source reranking performance, we measure the predictive accuracy of three forms
of BART-only models at the task of filling in the masked link text on the Wikipedia dataset.
The results of these experiments can be seen in Table 3. First, we test completely untuned
BART-base using the text of the target only. Second, we use the same test input, but fine-tune
the model on the link prediction task. Finally, we do the full conditioning on the masked
target and source section as described above. The untuned model was completely unable to
predict the link text (0% accuracy), while the target-only model achieved 17% accuracy. The
combined source-target model, which is the fully supervised setup from the previous section,
achieved 71.4% accuracy. This shows that the training process is indeed teaching the model to
copy from the source document, rather than the reranking ability being a side effect of the presence
of Wikipedia in BART’s pretraining data. This predictive performance is largely maintained
when we use the highest-ranked BM25-retrieved source in place of the known-correct source,
with a very minor decrease to 70.9% accuracy.</p>
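<p>The input construction for the combined source-target condition can be illustrated with a small sketch. The separator token and the example texts are assumptions for illustration, not the exact preprocessing used in the experiments.</p>
<preformat>
def build_bart_input(source_text, target_text, link_text, mask_token="&lt;mask&gt;"):
    """Mask the link surface text in the target and append the candidate
    source section, roughly mirroring the conditioning setup described above.
    Returns the model input string and the gold label (the masked link text).
    """
    masked_target = target_text.replace(link_text, mask_token, 1)
    return f"{masked_target} &lt;/s&gt; {source_text}", link_text

inp, label = build_bart_input(
    source_text="The Aswan Dam is a dam built across the Nile at Aswan.",
    target_text="Construction of the Aswan Dam began in 1960.",
    link_text="Aswan Dam",
)
</preformat>
<p>In the target-only condition the source half of the input is simply omitted, so the model must recover the link text from the target context alone.</p>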
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion and Future Work</title>
      <p>The experiments discussed in the previous section provide insight into how well different forms
of large language model can solve the problem of source attribution, as well as the
importance of task-specific fine-tuning. It is clear that generative models like BART are capable
of learning more complex source-target relationships than simple copying, as evidenced by
BART’s strong performance on both datasets. However, the annotation required to train
such a model makes it unattractive as a general solution to the problem, though it may be
useful for small specialized domains like that of al-Maqrizi in this case. In contrast, the poor
performance of the untuned BERT model indicates that some additional training is necessary.</p>
      <p>There are several interesting avenues for potential future research on this topic. Further
work on unsupervised methods like retrieval-augmented generation may be appealing with
access to better hardware capable of running a more complete version of the model. The
experiments also show the potential benefits of applying semi-supervised methods to this problem,
avoiding costly annotation at the cost of what appears to be a small loss in performance. It would be worth
evaluating this approach on larger datasets to see whether the conclusions we draw from the al-Maqrizi
test set generalize. Indeed, one benefit of moving to a semi-supervised approach, in which neither
the spans nor the document pairs require human annotation, is that all of the human-annotated
data can be used for evaluation. We avoid doing so here in the interest of fair comparison
with the forms of model that do require human-annotated training data. It would
also be worthwhile to examine the performance of a fine-tuned BERT model trained to embed sources
and targets more closely as a potential reranker.</p>
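<p>Such an embedding-based reranker might look like the following sketch, assuming precomputed encoder embeddings for the target and each candidate source; the vectors and candidate names are illustrative, not real model output.</p>
<preformat>
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rerank(target_vec, candidates):
    """Order candidate source ids by embedding similarity to the target.
    `candidates` maps a source id to its (hypothetical) BERT embedding."""
    return sorted(candidates, key=lambda cid: cosine(target_vec, candidates[cid]), reverse=True)

target = [0.9, 0.1, 0.0]
cands = {"src_a": [0.8, 0.2, 0.1], "src_b": [0.0, 1.0, 0.0]}
ranking = rerank(target, cands)
</preformat>
<p>Fine-tuning would then pull embeddings of true source-target pairs together, so that this similarity ordering approximates the reranker's scores.</p>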
      <p>Additionally, as al-Maqrizi and Wikipedia represent very extreme notions of what
source-target relationships look like, it would be valuable to find a third dataset where citation is
more formally structured than in al-Maqrizi but less so than in Wikipedia,
ideally without the OCR errors present in the Internet Archive dataset. For instance, one could
imagine looking at the work of the 19th-century philosopher J. S. Mill and his sources, which not
only have citations more along the lines of what one sees in modern writing, but would also
allow one to examine the utility of these methods in a cross-lingual setting, as he often cites
sources in languages other than English that are also digitized [14]. One could also attempt to
look at Wikipedia’s citations to other sources such as Google Books or the Internet Archive.
As the human record becomes more tractable to computation, models of source attribution
promise not only to improve search but also to help us understand the reading and writing
methods of the past.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The authors would like to thank the anonymous reviewers for their helpful feedback. This
paper is part of a project that has received funding from the European Research Council (ERC)
under the European Union’s Horizon 2020 research and innovation program (Grant agreement
No. 772989).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] M. Barber. “Fatimid historiography and its survival. A case study of the vizierate of al-Yāzūrī (r. 442-450/1050-1058)”. PhD thesis. University of Edinburgh, 2021.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] F. Bauden. “Maqriziana XII: Evaluating the Sources for the Fatimid Period: Ibn al-Maʾmūn al-Baṭāʾiḥī's History and Its Use by al-Maqrīzī (with a Critical Edition of His Résumé for the Years 501-515)”. In: Ismaili and Fatimid Studies in Honor of Paul E. Walker. Ed. by B. D. Craig. Middle East Documentation Center, 2010, pp. 33-85.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] F. Bauden. “Maqriziana II: Discovery of an Autograph Manuscript of al-Maqrīzī: Towards a Better Understanding of His Working Method, Analysis”. In: Mamluk Studies Review 12 (2008), pp. 51-118.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] F. Bauden. “Maqriziana IX: Should al-Maqrīzī Be Thrown Out with the Bath Water? The Question of His Plagiarism of al-Awḥadī's Khiṭaṭ and the Documentary Evidence”. In: Mamluk Studies Review 24 (2010), pp. 159-232.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] F. Bora. Writing History in the Medieval Islamic World: The Value of Chronicles as Archives. Bloomsbury, 2019.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] W. Lan, Y. Chen, W. Xu, and A. Ritter. “An Empirical Study of Pre-trained Transformers for Arabic Information Extraction”. In: arXiv:2004.14519 [cs] (2020). url: http://arxiv.org/abs/2004.14519.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer. “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension”. In: CoRR abs/1910.13461 (2019). arXiv: 1910.13461. url: http://arxiv.org/abs/1910.13461.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] P. S. H. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”. In: CoRR abs/2005.11401 (2020). arXiv: 2005.11401. url: https://arxiv.org/abs/2005.11401.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] J. Lin, X. Ma, S.-C. Lin, J.-H. Yang, R. Pradeep, and R. Nogueira. “Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations”. In: Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 2021, pp. 2356-2362.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] D. Little. An Introduction to Mamlūk Historiography: An Analysis of Arabic Annalistic and Biographical Sources for the Reign of al-Malik an-Nāṣir Muḥammad ibn Qalāʾūn. F. Steiner, 1970.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] Y. Mao, P. He, X. Liu, Y. Shen, J. Gao, J. Han, and W. Chen. Generation-Augmented Retrieval for Open-Domain Question Answering. 2021. arXiv: 2009.08553 [cs.CL].
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] al-Maqrizi. al-Mawa'iz wa-l-I'tibar fi Dhikr al-Khitat wa-l-Athar. Ed. by A. F. Sayyid. Vol. 1. Furqan, 2013.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] E. Muhanna. The World in a Book: Al-Nuwayri and the Islamic Encyclopedic Tradition. Princeton University Press, 2018.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] H. O'Neill, A. Welsh, D. A. Smith, G. Roe, and M. Terras. “Text mining Mill: Computationally detecting influence in the writings of John Stuart Mill from library records”. In: Digital Scholarship in the Humanities (2021).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] C. F. Robinson. “Islamic Historical Writing, Eighth through the Tenth Centuries”. In: The Oxford History of Historical Writing: Volume 2: 400-1400. Ed. by S. Foot and C. F. Robinson. Oxford University Press, 2012, pp. 238-264.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] M. Romanov and M. Seydi. OpenITI: a Machine-Readable Corpus of Islamicate Texts. Version 2019.1.1. Zenodo, 2019. doi: 10.5281/zenodo.3082464. url: https://doi.org/10.5281/zenodo.3082464.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] H. Singh, R. West, and G. Colavizza. “Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia”. In: CoRR abs/2007.07022 (2020). arXiv: 2007.07022. url: https://arxiv.org/abs/2007.07022.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18] D. A. Smith, R. Cordell, E. M. Dillon, N. Stramp, and J. Wilkerson. “Detecting and modeling local text reuse”. In: IEEE/ACM Joint Conference on Digital Libraries. 2014, pp. 183-192. doi: 10.1109/jcdl.2014.6970166.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19] K. Thai, Y. Chang, K. Krishna, and M. Iyyer. RELiC: Retrieving Evidence for Literary Claims. 2022. arXiv: 2203.10053 [cs.CL].
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>