<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Towards a Platform for AI-Assisted Papyrology</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Matthew I. Swindall</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Graham West</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>James H. Brusuelas</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alex C. Williams</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John F. Wallin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Amazon, AWS AI</institution>
          ,
          <addr-line>440 Terry Ave N, Seattle, WA 98109</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Middle Tennessee State University</institution>
          ,
          <addr-line>1301 East Main St, Murfreesboro, TN 37132</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Kentucky</institution>
          ,
          <addr-line>410 Administration Dr., Lexington, KY 40506</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>We propose an AI-powered platform to assist experts in transcribing, dating, identifying, and editing ancient manuscripts. In this paper, we discuss our ongoing work on AI-assisted Greek papyrology and our vision for a broader application that is intuitive for scholars of the ancient world. We envision this platform as an all-in-one system for AI-assisted papyrology that can be extended to additional languages and media.</p>
      </abstract>
      <kwd-group>
        <kwd>Digital Humanities</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Papyrology</kwd>
        <kwd>Generative AI</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Transfer Learning</kwd>
        <kwd>Handwritten Text Recognition</kwd>
        <kwd>Blockchain &amp; Smart Contracts</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.1. The Ancient Lives Project</title>
        <p>In 2011, a Zooniverse.org collaboration called the Ancient Lives project began crowdsourcing
the transcription of papyrus fragments housed at the University of Oxford, such as the one
shown in Figure 1. The project resulted in millions of annotations. These highly damaged
fragments are challenging for most modern handwritten text recognition (HTR) methods.</p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. The AL-ALL and AL-PUB Datasets</title>
        <p>
          The AL-ALL dataset, which was derived from the crowdsourced annotations collected during
the Ancient Lives project, consists of 419,445 Greek characters, representing all 24 characters of
the Greek alphabet, cropped from images of papyrus fragments. Due to ongoing papyrological
research, only 205,797 character images from published papyri were made available as the
AL-PUB dataset, shown in Figure 2 and available at https://data.cs.mtsu.edu/al-pub/. As
demonstrated in Swindall et al. 2021 [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], this dataset has been instrumental in the development of
deep learning methods for Greek character classification, especially for images of manuscripts
that exhibit severe damage and decay.
        </p>
      </sec>
      <sec id="sec-1-3">
        <title>1.3. Synthetic Characters with GANs</title>
        <p>
          One of the greatest challenges in crowdsourcing datasets is sampling bias. This was especially
true for the AL-ALL and AL-PUB datasets. In Swindall et al. 2022 [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], StyleGAN2 was trained
on samples from AL-ALL to generate synthetic images of Greek characters on papyrus. The two
smallest samples in AL-ALL were doubled by adding these synthetic images. This created the
ALSYNTH dataset, which was used to train new classification models. The new models showed no
change in overall accuracy, but demonstrated considerable increases in per-character accuracy
for the augmented samples. This work demonstrates the usefulness of synthetically augmenting
image datasets to reduce the effects of sampling bias. In addition, synthetic character images
may be immensely useful for graphical reconstruction of papyri and stylistic comparisons.
        </p>
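        <p>The augmentation step can be illustrated with a minimal sketch. It is not the code used in this work: the per-character counts below are placeholder values rather than the real AL-ALL distribution, and <monospace>plan_augmentation</monospace> is a hypothetical helper. It finds the two smallest classes and reports how many synthetic images would be needed to double each of them.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def plan_augmentation(class_counts, n_smallest=2, factor=2):
    """Return {label: number of synthetic images to add} for the smallest classes."""
    # Rank labels by count (ascending); ties are broken alphabetically.
    ranked = sorted(class_counts.items(), key=lambda kv: (kv[1], kv[0]))
    plan = {}
    for label, count in ranked[:n_smallest]:
        # Multiplying a class by `factor` requires count * (factor - 1) new images.
        plan[label] = count * (factor - 1)
    return plan


# Placeholder counts, NOT the real AL-ALL distribution.
toy_counts = Counter({"alpha": 900, "psi": 40, "xi": 55, "omicron": 700})
plan = plan_augmentation(toy_counts)
augmented = toy_counts + Counter(plan)
print(plan)
print(augmented["psi"], augmented["xi"])
```
]]></preformat>
        <p>In this toy example only the two rarest classes grow, while the larger classes are left untouched, which mirrors the targeted nature of the augmentation described above.</p>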
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Our AI Tools</title>
      <sec id="sec-2-1">
        <title>2.1. Automated Transcription</title>
        <p>
          Utilizing models trained on AL-ALL, several machine learning tools have been developed that
form a handwritten text recognition (HTR) pipeline. This pipeline expedites the process of
producing a diplomatic transcription, which constitutes an un-edited typescript of the text
visible in a given manuscript. The first tool, a character segmentation model, used transfer
learning to re-task YOLOv5 with locating characters within papyrus images. The second tool
is a character classification model, with a validation accuracy over 94%, which is based on a
ResNet architecture used in Swindall et al. 2021 [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] and trained on AL-ALL. The third tool
is an unsupervised line-sequencing algorithm, which utilizes mean-shift clustering to group
characters into lines based on their vertical coordinates.
        </p>
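        <p>The line-sequencing idea can be sketched as follows. This is a simplified illustration under stated assumptions, not our production code: it uses a hand-written flat-kernel mean shift in one dimension, and the bandwidth, the box format (x, y, label), and the toy coordinates are all hypothetical. Characters whose vertical coordinates converge to the same mode are treated as one line, and each line is then read from left to right.</p>
        <preformat><![CDATA[
```python
def mean_shift_1d(values, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift in 1-D: move each point to the mean of its neighbours."""
    modes = list(values)
    if not modes:
        return []
    for _ in range(max_iter):
        shifted = []
        for m in modes:
            # Neighbours are all original points within `bandwidth` of the current mode.
            window = [v for v in values if abs(v - m) <= bandwidth] or [m]
            shifted.append(sum(window) / len(window))
        converged = max(abs(a - b) for a, b in zip(modes, shifted)) < tol
        modes = shifted
        if converged:
            break
    return modes


def sequence_lines(chars, bandwidth):
    """Group (x, y, label) boxes into lines, top to bottom, each read left to right."""
    modes = mean_shift_1d([y for _, y, _ in chars], bandwidth)
    # Merge near-identical modes into one centre per line.
    centres = []
    for m in sorted(modes):
        if not centres or m - centres[-1] > bandwidth / 2:
            centres.append(m)
    lines = [[] for _ in centres]
    for (x, _, label), m in zip(chars, modes):
        idx = min(range(len(centres)), key=lambda i: abs(centres[i] - m))
        lines[idx].append((x, label))
    return ["".join(label for _, label in sorted(line)) for line in lines]


# Hypothetical detections in image coordinates (y grows downward), in arbitrary order.
boxes = [(30, 15, "α"), (12, 52, "τ"), (10, 12, "κ"), (33, 55, "ο"), (50, 11, "ι")]
result = sequence_lines(boxes, bandwidth=10)
print(result)
```
]]></preformat>
        <p>For the toy input the sketch returns two lines. In practice, a library implementation such as the MeanShift estimator in scikit-learn could replace the hand-written helper, and the bandwidth would need to be tuned to the character height of each image.</p>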
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Manuscript Dating</title>
        <p>A routine task in papyrology is accurately dating manuscripts. In the case of documentary
papyri (accounts, letters, leases, etc.), the scribe usually dates the manuscript, though the date is
often lost due to damage. Literary papyri (ancient books) never contain a date unless portions
of them were reused for documentary purposes. In the absence of a date, papyrologists must infer
it by comparing the handwriting with that of other dated manuscripts. To automate this process, a
pipeline of models was created that classifies a fragment into classes, each representing
a period of two centuries (e.g., 400 BCE - 201 BCE, 200 BCE - 1 BCE, etc.), with a range of
dates spanning from 400 BCE to 600 CE. To create this pipeline, images of documentary papyri
with known dates were run through our HTR pipeline, thus obtaining dating information
at the level of individual characters. Models were then trained via transfer learning on the
ResNet classification model to attribute dates to individual characters. However, due to the
high variability of handwriting styles, individual character dates can be unreliable. To address
this, a Gaussian Process model was created which assigns a date to an entire fragment based on
the predicted dates of its constituent characters. When trained on fragments with 25 or more
characters, this model achieves a precision and recall of 75%-80%. Currently, we are investigating
possible ways of increasing the temporal resolution without diminishing prediction quality.</p>
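        <p>The class scheme and the step from character-level to fragment-level dates can be sketched as follows. Several details are assumptions made for illustration: years are signed integers (negative for BCE, with no year zero), the class labels beyond the first two are inferred from the stated 400 BCE to 600 CE range, and a simple majority vote stands in for the Gaussian Process model, which is not reproduced here.</p>
        <preformat><![CDATA[
```python
from collections import Counter

# Two-century classes; labels after the first two are inferred, not quoted.
CLASS_LABELS = [
    "400 BCE - 201 BCE",
    "200 BCE - 1 BCE",
    "1 CE - 200 CE",
    "201 CE - 400 CE",
    "401 CE - 600 CE",
]


def year_to_class(year):
    """Map a signed year (negative = BCE, no year zero) to a class index."""
    if year == 0 or not -400 <= year <= 600:
        raise ValueError(f"year out of range: {year}")
    if year < 0:
        return (year + 400) // 200      # -400..-201 -> 0, -200..-1 -> 1
    return 2 + (year - 1) // 200        # 1..200 -> 2, 201..400 -> 3, 401..600 -> 4


def fragment_class(char_classes):
    """Majority vote over per-character predictions (stand-in for the Gaussian Process)."""
    if not char_classes:
        raise ValueError("no character predictions")
    return Counter(char_classes).most_common(1)[0][0]


# Hypothetical per-character predictions for one fragment.
predicted = [2, 2, 3, 2, 1, 2, 3]
label = CLASS_LABELS[fragment_class(predicted)]
print(label)
```
]]></preformat>
        <p>A majority vote ignores how confident each character-level prediction is and how far apart the competing classes lie in time, which is part of the motivation for a probabilistic model at the fragment level.</p>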
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Future Work</title>
      <sec id="sec-3-1">
        <title>3.1. Natural Language Processing</title>
        <p>
          Digital epigraphy, which produces digital editions of ancient inscriptions, continues to be a
promising area of natural language processing (NLP) research. Efforts such as Pythia [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] and
masked language modeling [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] have demonstrated that human-level proficiency is probable
for future NLP models. Additional challenges posed by Greek papyri include the lack of
word division and punctuation, as well as the physical damage to the fragment resulting in
missing characters. To combat these issues, a multi-phase approach may be necessary, including
identifying where characters are missing, predicting how many characters may be missing, and
predicting what the missing characters are likely to be.
        </p>
        <p>
          Beyond textual reconstruction, we believe it may be possible to use computational and
deep learning methods for tasks including document identification, provenance, and detection
of classification errors for existing digital editions. For example, Williams et al. 2014 [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]
demonstrated the ability of genetic sequencing algorithms (especially for fragmented texts
and texts with a history of textual variation) to compare transcriptions to a corpus of known
texts for identification (author, work, etc.). This approach, paired with additional tools, may be
invaluable for the AI-assisted study of ancient texts.
        </p>
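        <p>To make the alignment-based identification idea concrete, the sketch below scores a damaged transcription against a small corpus using Smith-Waterman local alignment, a classic algorithm from genetic sequence analysis. It is chosen here purely for illustration and may differ from the algorithms used in the cited work. The scoring parameters, the two-entry corpus (normalized opening lines of the Iliad and the Odyssey without accents or word division), and the query are assumptions.</p>
        <preformat><![CDATA[
```python
def smith_waterman(query, text, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between two strings."""
    prev = [0] * (len(text) + 1)
    best = 0
    for qc in query:
        curr = [0]
        for j, tc in enumerate(text, start=1):
            diag = prev[j - 1] + (match if qc == tc else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def identify(transcription, corpus):
    """Rank corpus entries by local alignment score, best first."""
    scores = {name: smith_waterman(transcription, text) for name, text in corpus.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy corpus: normalized text with no accents, spaces, or punctuation.
corpus = {
    "Iliad 1.1": "μηνιναειδεθεαπηληιαδεωαχιληος",
    "Odyssey 1.1": "ανδραμοιεννεπεμουσαπολυτροπον",
}
# Hypothetical fragment transcription; "ξ" stands in for a misread character.
ranking = identify("ειδεθξαπηλ", corpus)
print(ranking[0][0])
```
]]></preformat>
        <p>Because the alignment is local, the fragment does not need to cover the whole line, and a single misread character lowers the score without preventing the match. In the toy example the Iliad entry ranks first.</p>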
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Born-Digital Edition Management with Blockchain &amp; Proteus</title>
        <p>
          With the increasing development of AI tools to assist in ancient manuscript research, it will
be necessary to modernize the existing infrastructure for creating and managing born-digital
editions of ancient manuscripts. Our Proteus platform, described in Williams et al. 2015 [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and Brusuelas
and Meccariello 2023 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], is a unique environment dedicated to creating, peer-reviewing, and
managing born-digital editions of papyri. One of the challenges experienced with Proteus was
the complexity of managing, in a database system, not only large volumes of editions but
also the different scholarly reconstructions (or versions) of the same papyrus fragment.
        </p>
        <p>
          Currently, a solution is in development that will use blockchain and smart contract
technologies for the management and storage of digital editions. In this proposed system,
illustrated in Figure 4, a smart contract is created for each new, original edition. Rather than storing
all data in complex databases, the smart contract records on the blockchain only the location
of the data. Editors of critical editions can submit their edition to the smart contract, which
then stores the location of the critical edition’s data. The editions themselves can be stored in a
number of ways: on a local server where the blockchain is hosted, on a public blockchain, or
in a distributed file storage platform such as the InterPlanetary File System (https://ipfs.tech/).
Beyond offering a less complex method of edition management, blockchain and smart contracts
offer an avenue to a more transparent and decentralized peer-review ecosystem, as discussed in
Tenorio-Fornés et al. 2021 [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
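        <p>The data flow of the proposed system can be modelled with a small in-memory sketch. This is not smart contract code and does not run on any blockchain; it only illustrates the idea that one contract per original edition keeps an append-only list of locations. The class names, the example identifiers and locations, and the stored SHA-256 digest (added here as an assumed integrity check) are all hypothetical.</p>
        <preformat><![CDATA[
```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class EditionRecord:
    editor: str
    location: str  # where the edition data lives (server path, IPFS address, ...)
    digest: str    # SHA-256 of the edition data; an assumed integrity check


@dataclass
class EditionContract:
    """Toy stand-in for one smart contract per original edition."""
    papyrus_id: str
    records: list = field(default_factory=list)

    def submit(self, editor, location, data):
        """Register a new edition and return its version index."""
        record = EditionRecord(editor, location, hashlib.sha256(data).hexdigest())
        # Append-only: earlier reconstructions are never overwritten.
        self.records.append(record)
        return len(self.records) - 1

    def locate(self, version):
        """Return the stored location of a given version."""
        return self.records[version].location


# Hypothetical identifiers and locations, for illustration only.
contract = EditionContract("example-papyrus-1")
v0 = contract.submit("Editor A", "https://example.org/editions/1", b"original edition")
v1 = contract.submit("Editor B", "https://example.org/editions/2", b"critical edition")
print(v0, v1, contract.locate(v1))
```
]]></preformat>
        <p>Keeping every submission as a separate record is what allows competing scholarly reconstructions of the same fragment to coexist, with the edition data itself held wherever the recorded location points.</p>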
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. An AI-Driven Platform for Papyrology</title>
      <p>Although we have developed a suite of AI-enabled methods to study papyrology as a
proof-of-concept application, these tools remain out of reach for many scholars in the field. We envision
the creation of a holistic platform which incorporates a host of tools that assist in transcribing,
dating, identifying, and editing manuscripts. Figure 3 shows an example transcription. Our
approach is likely transferable to other kinds of manuscripts and languages. Instead of a platform
limited to Greek papyrology, we envision one that can be interoperable with other language
and manuscript datasets from the ancient world.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sommerschield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Assael</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stefanak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Senior</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bodel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Prag</surname>
          </string-name>
          , I. Androutsopoulos, N. de Freitas,
          <article-title>Machine Learning for Ancient Languages: A Survey</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>49</volume>
          (
          <year>2023</year>
          )
          <fpage>703</fpage>
          -
          <lpage>747</lpage>
          . URL: https://doi.org/10.1162/coli_a_00481. doi:
          <pub-id pub-id-type="doi">10.1162/coli_a_00481</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Swindall</surname>
          </string-name>
          , G. Croisdale,
          <string-name>
            <given-names>C. C.</given-names>
            <surname>Hunter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Keener</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Brusuelas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Krevans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sellew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fortson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Wallin</surname>
          </string-name>
          ,
          <article-title>Exploring learning approaches for ancient greek character recognition with citizen science data</article-title>
          ,
          <source>in: 2021 17th International Conference on eScience (eScience)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>128</fpage>
          -
          <lpage>137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Swindall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Player</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Keener</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brusuelas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nicolardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>D'Angelo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Vergara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McOsker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wallin</surname>
          </string-name>
          ,
          <article-title>Dataset augmentation in papyrology with generative models: A study of synthetic ancient greek character images</article-title>
          ,
          <year>2022</year>
          , pp.
          <fpage>4948</fpage>
          -
          <lpage>4954</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.24963/ijcai.2022/687</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Assael</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Sommerschield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Prag</surname>
          </string-name>
          ,
          <article-title>Restoring ancient text using deep learning: a case study on Greek epigraphy</article-title>
          , in: K. Inui,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wan</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>6368</fpage>
          -
          <lpage>6375</lpage>
          . URL: https://aclanthology.org/D19-1668. doi:
          <pub-id pub-id-type="doi">10.18653/v1/D19-1668</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>K.</given-names>
            <surname>Lazar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Saret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yehudai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Horowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Wasserman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Stanovsky</surname>
          </string-name>
          ,
          <article-title>Filling the gaps in Ancient Akkadian texts: A masked language modelling approach</article-title>
          , in:
          <string-name>
            <given-names>M.-F.</given-names>
            <surname>Moens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Specia</surname>
          </string-name>
          , S. W.-t. Yih (Eds.),
          <source>Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Online and Punta Cana, Dominican Republic,
          <year>2021</year>
          , pp.
          <fpage>4682</fpage>
          -
          <lpage>4691</lpage>
          . URL: https://aclanthology.org/2021.emnlp-main.384. doi:
          <pub-id pub-id-type="doi">10.18653/v1/2021.emnlp-main.384</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. D.</given-names>
            <surname>Carroll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Wallin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Brusuelas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fortson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-F.</given-names>
            <surname>Lamblin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Identification of ancient greek papyrus fragments using genetic sequence alignment algorithms</article-title>
          ,
          <source>in: 2014 IEEE 10th International Conference on e-Science</source>
          , volume
          <volume>2</volume>
          ,
          <year>2014</year>
          , pp.
          <fpage>5</fpage>
          -
          <lpage>10</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/eScience.2014.14</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Santarsiero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Meccariello</surname>
          </string-name>
          , G. Verhasselt,
          <string-name>
            <given-names>H. D.</given-names>
            <surname>Carroll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. F.</given-names>
            <surname>Wallin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Obbink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Brusuelas</surname>
          </string-name>
          ,
          <article-title>Proteus: A platform for born digital critical editions of literary and subliterary papyri</article-title>
          ,
          <source>in: 2015 Digital Heritage</source>
          , volume
          <volume>2</volume>
          ,
          <year>2015</year>
          , pp.
          <fpage>453</fpage>
          -
          <lpage>456</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/DigitalHeritage.2015.7419546</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Brusuelas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Meccariello</surname>
          </string-name>
          ,
          <article-title>Proteus: A platform for born-digital, critical editions of literary and subliterary papyri</article-title>
          ,
          <source>in: Textual History of the Bible, Volume 3D: A Companion to Textual Criticism</source>
          , Brill,
          <year>2023</year>
          , pp.
          <fpage>507</fpage>
          -
          <lpage>512</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Ámbar</given-names>
            <surname>Tenorio-Fornés</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Tirador</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Sánchez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <article-title>Decentralizing science: Towards an interoperable open peer review ecosystem using blockchain</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>58</volume>
          (
          <year>2021</year>
          )
          <elocation-id>102724</elocation-id>
          . URL: https://www.sciencedirect.com/science/article/pii/S0306457321002089. doi:
          <pub-id pub-id-type="doi">10.1016/j.ipm.2021.102724</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>