Restoration of Archival Images Using Neural Networks

Raphaela Heil¹, Fredrik Wahlberg²
¹ Dept. of Information Technology, Uppsala University, Lägerhyddsvägen 1, 752 37 Uppsala, Sweden
² Dept. of Linguistics and Philology, Uppsala University, Thunbergsvägen 3H, 751 26 Uppsala, Sweden

Abstract
Substantial parts of the image material of today's digital archives are of low quality, creating problems for automated processing using machine learning. These quality issues can stem from a multitude of reasons, ranging from damaged originals to the reproduction hardware. Modern machine learning has made automatic "restoration" or "colourization" readily available. Curators and scholars might want to "improve" or "restore" the original's quality to create engagement with the artefacts. However, a fundamental problem of the "restoration" process is that information must always be added to the original, creating reproductions with a synthesized, extended realism. In this paper, we will discuss the nature of the "restoration" or "colourization" process in two parts. Firstly, we will focus on how the restoration algorithms work, discussing the nature of digital imagery and some intrinsic properties of "enhancement". Secondly, we propose a system, based on modern machine learning, that can automatically "improve" the quality of digital reproductions of handwritten medieval manuscripts to allow for large-scale computerized analysis. Furthermore, we provide code for the proposed system. Lastly, we end the paper by discussing when and if "restoration" can, and should, be used.

Keywords
digitization, digital restoration, machine learning, image processing

The 6th Digital Humanities in the Nordic and Baltic Countries Conference (DHNB 2022), Uppsala, Sweden, March 15-18, 2022
raphaela.heil@it.uu.se (R. Heil); fredrik.wahlberg@lingfil.uu.se (F. Wahlberg)
ORCID: 0000-0002-5010-9149 (R. Heil); 0000-0002-5306-1283 (F. Wahlberg)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

As our archives are housing ever-increasing volumes of digital material, the need for automated processing is ever more evident. Automated processing for text search in photographed text [1], scribal attribution [2], and production year estimation [3] is becoming increasingly useful. A common issue posing a significant problem to these endeavours is the quality of some digital reproductions, where old cameras or damaged originals can prove to be impossible obstacles. Digitized historical manuscripts sometimes contain a variety of reproduction deteriorations, such as stains, discolouring and compression artefacts. Some examples of such degradations to the image material are shown in Figure 1. While some types of deteriorations may not impact the work of a trained scholar, they pose a significant challenge regarding legibility and interpretability for both laypersons and computerized processing.

Figure 1: Examples of types of degradations: a) JPEG Compression, b) Detail view of (a) with compression artefacts causing discolourations on the page surrounding the ink stroke, c) Greyscale photography, d) Microform, e) Warping.
Generally, artefacts in digitized manuscripts can be divided into two distinct categories: deteriorations inherent to the material and deteriorations introduced during the digitization process. The former, deteriorations of type I, can often be attributed to the manuscript's age, archival conditions, or handling over time. This type of deterioration can typically be found in all kinds of digitized historical material. In contrast to these, deteriorations of type II are rooted in the way a manuscript was digitized. Differing, or lacking, image capture protocols can result in issues such as inconsistent illumination or viewpoint, as well as deteriorations caused by the storage process, such as image compression artefacts and colour skewing. Type II is often found in older digitizations, due to technological and procedural constraints. It can, however, also occur when using modern imaging equipment, such as mobile phones. This classification is not an absolute dichotomy, however, as in the case of some manuscripts on parchment that would need to be flattened for proper photography. This would ruin the original in some cases, where the spine was never designed to allow the book to be fully opened as the parchment aged.

While the effect of both types of deteriorations can be reduced using classical image processing approaches, these methods are usually highly specialized and require frequent human intervention. They therefore have to be combined into complex pipelines, together with detection of the degradation type, and be specifically designed for subgroups of images within the same book. These pipelines seldom generalize well and require both significant on-site expertise and frequent quality inspections. This is both laborious and expensive. One solution for mitigating some of the most common roadblocks on the path to automatic processing is specialized cleaning using machine learning [4], which we will focus on in this work.

Another area where "enhancing" digital material is useful is for creating engagement with museum audiences [5]. This can include automatic colourization [6], increasing image resolution [7], or adding video frames [8]. A modern camera does not suffer from the blurriness coming from long exposure times, uneven frame timing, or colour imbalances from chemically developing the film¹. These limitations can largely be overcome using modern machine learning techniques for video and image enhancement. Some audiences can perceive "enhanced" video as much more engaging, as it looks like something that could have been filmed in the contemporary world they experience. Old battlefields or everyday streets come to life, giving the onlooker a sense of closeness to a lost era.

¹ It does, however, suffer from rolling shutter effects, explaining why some mobile phone videos of helicopters don't show the rotors moving, but that's outside our scope.

A problem for the analysis and "enhancement" described above is that wherever information is missing (colour, video frames, manuscript holes, etc.), it will have to be "guessed" and filled in. This is nothing new to scholars studying any material, as an educated guess can be cultivated through careful study, which is why we'd rather put our trust in a philologist or historian than a layman when it comes to interpreting an incomplete manuscript (or even a complete one). We use the provocative term "guess" here to illustrate the core issue: do you trust the "educated guess" of a computer?
This is the fundamental limitation of any computerized "enhancement" technique, as we are forced to put our trust in an opaque process, driven by what often seems like some esoteric mathematical magic. We will argue below that this trust should be a function of the application area and not an automatic choice following from technological positivism. Sometimes adding information to the reproduction of the original creates a "fake"; sometimes it improves your analysis. This is also why we choose to put words implying improvement in quotes throughout this paper.

2. Restoration using machine learning

To better explain the process of "restoring" images, and its limitations, we want to start by discussing how images are created and represented in the machine. After this, we can go on to more fully critique the processes and machine learning models of "restoration".

2.1. The nature of a digital image

In the machine, a digital image is represented as coloured areas, called pixels, usually organized in a quadratic grid structure². If you zoom in on any photo, you will usually be able to see this grid structure³. A digital camera creates an image by recording the energy from photons (light particles) hitting the image sensor. These photons create a current in the sensor depending on their number and frequency. This process is very similar to the light detection in an eye. Just like in a human eye, there is specialized hardware that detects light of different frequency bands, i.e. colours. The layout of the sensor matches the layout of the pixels in our digital photo.

In the machine, the colours of an image are (usually) represented by intensities in three colour bands for each pixel. These are red, green, and blue. The choice of the colour representation is a trade-off between what a human can see and the technical limitations of computer screens. As there are a lot of colours in the world that humans can't see, there is no need for us to fill our camera storage with this information. The colour band intensities are, for historical reasons, most often represented as three integer numbers between 0 and 255. This allows us to, for example, encode a pixel in cornflower blue, with colour intensities [100, 149, 237], as the binary number 011001001001010111101101. This binary number is what is actually stored in the computer's memory. An illustration of the colour bands is shown in Figure 2, where the three colours have been separated into images for the respective colour bands. The left-most image in the figure is a detail of a cherry blossom, created by putting the three images to the right on top of each other. If you zoom in on the respective colour channel images, you can read (as an illustration) the intensity values for each pixel.

² While other types of image grids exist, they are very rare.
³ This statement is not entirely true, as some devices interpolate "intelligently" by adding pixels to your photos when zooming in.

Figure 2: An illustration of colour channels in a digital image, where colour is digitally represented as a mixture of three base colours. The left-most image is a detail from an image of a cherry blossom. The following three images are the Red, Green, and Blue (RGB) colour channels of the same image. If you zoom in on the images to the right, you can see the intensity values, between 0 and 255, for each pixel and colour channel. This so-called RGB encoding of colour is by far the most common type of colour encoding today.
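As a hands-on illustration of the encoding just described, the short Python sketch below (our own example, not part of the system presented later in this paper) packs the cornflower-blue pixel into the 24-bit binary form quoted above and shows that a digital image is simply a grid of such triplets.

```python
import numpy as np

# The cornflower-blue example from the text: one 8-bit intensity per colour band.
r, g, b = 100, 149, 237

# Pack the three bands into a single 24-bit number, eight bits per band.
packed = (r << 16) | (g << 8) | b
print(f"{packed:024b}")  # -> 011001001001010111101101, as quoted in the text

# An image is a grid of such triplets: height x width x colour bands.
# Here, a tiny 2x2-pixel "image" with a cornflower-blue top-left pixel.
image = np.array([[[100, 149, 237], [255, 255, 255]],
                  [[0, 0, 0], [255, 0, 0]]], dtype=np.uint8)
print(image.shape)   # (2, 2, 3)
print(image[0, 0])   # [100 149 237]
```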
Another important point illustrated by the cherry blossom image is that it is often very hard to identify the content of a digital image from a close-up. While avoiding getting too philosophical, it is beneficial to remember that the digital image only serves as an inspiration for the image conjured up in our brains. While humans have this ability for working with images and interpreting the three-dimensional world from a two-dimensional reproduction, the computer has no such evolutionary history or obvious way to learn. Luckily for us, computerized image analysis, a subfield of computer science and mathematics, has been working for decades on building a stack of algorithms which lets us go from the binary ones and zeros in memory to the image manipulation of modern processing software.

Older camera film only allowed for different greyscale colour schemes. When colourizing such images today, it can be hard to know what the original colour was. As humans, we can often infer the colour from context. We would recognize the Union Jack in a World War 1 image and fill in the colours in our minds. When doing a digital conversion to greyscale from the three RGB colour channels, we mix the colour intensities as 0.299·r + 0.587·g + 0.114·b, where the letters (r, g, b) are intensity values for their respective colour channel. This gives a greyscale intensity corresponding to the original colour. There are, however, many triplets of r, g, and b that give the same greyscale intensity. In Figure 3, we have taken the Swedish flag (top left) and converted it to greyscale (top right). We then did the same thing with the Scania flag (bottom left), which is a mix of the Swedish and Danish flags. It turns out that the Swedish and the Scania flags look the same in greyscale (top and bottom right). This problem becomes almost impossible to solve when we have fewer contextual clues, e.g., embroidery on traditional clothing or petals of extinct flowers. To overcome this problem, a machine learning model for automatic colourization would have to infer colour information from image content. This opens up new challenges, like identifying objects in the image and correlating how they look to some training material. Luckily, this is precisely what has been done.

Figure 3: An illustration of how very different colours can look the same in greyscale. The Swedish flag (top left) and the Scania flag (bottom left), a mix between the Danish and Swedish flags, look the same in greyscale (right).
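To make the ambiguity tangible, the sketch below applies the weighted conversion from above to two very different colours, a saturated blue and a saturated red, that end up with the same grey value. The specific RGB triplets are our own illustrative picks, not measured flag colours, and the snippet is not part of the system described later.

```python
def to_grey(r: int, g: int, b: int) -> int:
    """Standard luminance-weighted conversion used in the text."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)

blue = (0, 107, 169)   # an illustrative saturated blue
red = (200, 30, 40)    # an illustrative saturated red

print(to_grey(*blue))  # 82
print(to_grey(*red))   # 82 -- two very different colours, one grey value

# Going back from grey to colour is therefore ill-posed: every grey value
# corresponds to many possible (r, g, b) triplets, and a colourization
# model has to "guess" which one the original actually had.
```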
2.2. Generative machine learning models

In the field of machine learning, the word "model" is an abstract category of algorithms and processes that describe some set of data. In its simplest form, a model can be a spam filter that finds particular words or phrases, and then classifies an email as junk mail. A popular distinction between artificial intelligence (AI) and machine learning (ML) is that the former is a cleverly designed algorithm, while the latter is cleverly designed to learn from a set of training data. This means that while an old, AI-based spam filter would be given specific search terms and phrases, a spam filter built on ML is trained on a set of emails that a human has labelled as junk mail or not. The ML-based spam filter would then try to generalize from training data collected "in the wild". This gives the ML model the advantage of being able to determine its own set of classification rules that are relevant to the task, without having to rely on extensive human efforts.

A fundamental characteristic of an ML model is that it must be trained, which can be quite expensive in terms of data and processing power. Generally, machine learning models are grouped into two major categories: discriminative and generative models. The former deliver categorizations, or classifications, given some data. The initial spam filter example is a discriminative model, as it does not describe what an email might look like, but only makes a decision whether an email should end up in your junk folder. It can only go in one direction: from email input to categorization. This is in contrast to generative models, which, as the name indicates, might be able to generate a full, authentic-looking email that would be classifiable as either spam or not spam. Various strategies have been proposed to create generative models; however, the approach that is relevant to this paper is adversarial training [9].

Generative adversarial networks (GANs) are a group of models that aim to generate synthetic data, e.g. images, that would plausibly fit into a collection of real examples, also called the domain of the task. The domain of a spam filter is emails, but the domain where GANs excel is images of natural scenes. Generally, these models consist of two main components, a generator and an adversarial discriminator, that are placed in competition with each other during the learning phase. In this competition, on the one side, the generator tries to create synthetic, albeit plausible, data, e.g. images, from the target domain. On the other side, the discriminator is tasked with identifying whether a presented piece of data is real, i.e. taken from the target domain, or fake, i.e. produced by the generator. Both competitors try to "outsmart" each other by improving their performance on the respective task. In the ideal case, the generator will reach a generative quality at which the discriminator cannot distinguish real from fake. The generator can then be used to create fake images, and is often so good at doing this that humans are fooled. Note that the training data doesn't need to be strongly curated; one only needs to make sure that the GAN has enough images from the intended domain. As an illustration of this, Figure 4 shows some synthetic images generated with the methodology above. All of the depicted images of people and animals were generated using a model called StyleGAN2 [10, 11].

Figure 4: Four entirely synthetic images of people and animals, generated by a generative adversarial network (StyleGAN2 [11]). Images retrieved from https://thispersondoesnotexist.com/, https://thishorsedoesnotexist.com/ and https://thiscatdoesnotexist.com/, respectively.
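To make the adversarial training described above concrete, the following sketch outlines one training step of a generic GAN in PyTorch. It is a simplified illustration under our own assumptions (the discriminator ends in a sigmoid and outputs one probability per image, and the generator exposes a latent_dim attribute); it is not the exact setup used for the manuscript images later in this paper.

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, opt_g, opt_d, real_images):
    """One adversarial step: the discriminator learns to tell real from
    generated images, the generator learns to fool the discriminator."""
    batch_size = real_images.size(0)
    noise = torch.randn(batch_size, generator.latent_dim)  # latent_dim is an assumed attribute
    real_label = torch.ones(batch_size, 1)
    fake_label = torch.zeros(batch_size, 1)

    # Discriminator update: push real images towards 1, generated ones towards 0.
    fake_images = generator(noise).detach()  # no generator gradients in this step
    d_loss = (F.binary_cross_entropy(discriminator(real_images), real_label)
              + F.binary_cross_entropy(discriminator(fake_images), fake_label))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator update: try to make the discriminator answer "real" for generated images.
    g_loss = F.binary_cross_entropy(discriminator(generator(noise)), real_label)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

    return d_loss.item(), g_loss.item()
```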
The outlined general idea of GANs can be put into a real-world context by considering the analogy of an art student and a critic, where the former takes the role of the generator, creating paintings, while the latter represents the discriminator, determining whether an artefact is worthy of a spot in the town's art gallery⁴. At the start of the training phase (in our analogy), both our student and critic start out by being completely unskilled at their respective crafts. Hence, the critic's job is fairly easy in the beginning. To make this analogy work, our critic must be excellent at giving feedback to the student on how they can improve. The critic will now pick up paintings randomly from either some collection of acceptable art, the domain, or some paintings made by the student. After careful deliberation, our critic will decide whether this is acceptable art. If the randomly picked painting was made by our student, the critic will give them feedback on how to create art that looks more like the paintings in the domain. This process will then continue until our critic, who is constantly getting better at their work, can no longer distinguish the domain paintings from the artwork created by our student. If this were more than an analogy, an objection to this setup might be that the student isn't really encouraged to be creative, i.e. actually create art. This is true for the machine learning model too. The model doesn't learn creativity, it learns to imitate.

⁴ Goodfellow et al. describe these roles as counterfeiters and police, respectively [9].

It should be noted that machine learning is an applied field in the sense that models are researched and trained in order to solve some task. It is widely accepted that "all models are wrong, but some are useful" [12]. When a model learns to imitate some training data, the type of data that is imitated is of crucial importance to "enhancement" using the adversarial generative techniques described above. Let's say we have a photo, taken in the 1930s. It is blurry and in greyscale. One way of "enhancing" the quality of this photo is to make the generative model find an image, with both high resolution and colour, that matches our original. How do we determine the best match? We can think of this as the model generating images, which we then convert to blurry greyscale and compare, pixel by pixel, to the original. Note that some operations that are very time-consuming for a human can be well adapted to a computer's capabilities. After some searching we could always find multiple images that, after conversion, would match the original. This is unavoidable, as the original simply doesn't contain the information needed to "enhance" detail, e.g., whether a shape in the background is a person or just a trick of the light. Though the "enhanced" photograph can be very convincing, as we show in Figure 4, it should not be taken as evidence that it shows anything more than what the model has conjured up, as is often implied in American crime dramas. However, if the model has been trained on images very much like the original, perhaps through high-quality modern re-enactment or some timeless physical phenomenon, it is likely to synthesize something highly plausible.

3. Cycle-consistent Generative Adversarial Networks (CycleGANs)

In order to "restore" degraded archival manuscript pages, we propose to employ a type of GAN commonly referred to as cycle-consistent generative adversarial networks (CycleGANs) [13]. These models have found prior applications in a variety of image translation tasks, such as the presentation of photos in the style of famous classical painters, e.g. Van Gogh [13], the transformation of images like portraits and animal photos to traditional Japanese flower arrangements [14] and the removal of strike-through artefacts from handwritten words [15]. Additionally, CycleGANs have been used to remove certain degradations, such as stains and watermarks, from printed documents [16]. The general structure of a CycleGAN is illustrated in Figure 5, using the task of image restoration as an example. Concretely, this approach consists of two regular GANs that are trained in conjunction with each other.
As shown in the illustration, each of the two generators, named 'restore' and 'degrade', is concerned with generating restored, respectively degraded, images. Discriminator A assesses whether a given clean image is a genuine high-quality one or was created by the generator, while discriminator B determines whether a presented degraded image is genuine or generated.

To demonstrate the flow of images through the system, we trace the path of a degraded image in the following (Figure 5, left). Initially, the degraded image is fed into the generator 'restore', which uses it as a basis to produce a restored image. This image is then assessed by discriminator A. In addition to this, it is processed by generator 'degrade', which returns it to a degraded state. Besides the discriminator's assessment, feedback regarding the generation quality is provided to 'restore' by comparing the original input image with the restored and subsequently degraded image. The latter comparison aims for a high similarity, which is referred to as cycle-consistency, hence the name of the model.

Figure 5: General structure of a CycleGAN. Dataflow is demonstrated via an example of image "restoration" (left) and degradation (right). Starting points of each cycle, i.e. input images, are marked with a red border, arrows indicate input and output of data, and 'restore' and 'degrade' represent the two generators, performing the respective transformations.

In parallel to this pass, a second cycle takes place, starting from a clean image. This process is shown on the right in Figure 5, following the clean input image through the generator 'degrade' to a degraded output and back through generator 'restore' to a restored image, to close the cycle via a comparison of the latter with the input. The same feedback mechanisms as in the first cycle are applied to the respective generator, using the respective discriminator.

Various implementations for the generators and discriminators exist and are often, to some degree, determined by the task at hand. In this work, we employ dense U-Nets [17], via the implementation from [18], for the generators, and the traditional discriminators proposed by [13].
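To make the two cycles described above concrete, the sketch below shows how the generator-side losses of such a model could be written in PyTorch. It is our own simplified illustration: discriminator updates and identity terms are omitted, and the loss formulation (a least-squares adversarial loss, an L1 cycle loss, and a weight lambda_cycle) is an assumption for the sketch rather than the exact published setup, which is linked in Appendix A.

```python
import torch
import torch.nn.functional as F

def generator_losses(restore, degrade, disc_a, disc_b,
                     degraded_img, clean_img, lambda_cycle=10.0):
    """Generator-side losses for one CycleGAN step.

    restore, degrade: the two generators described above.
    disc_a: discriminator A, judging clean-looking (restored) images.
    disc_b: discriminator B, judging degraded-looking images.
    """
    # Cycle 1 (Figure 5, left): degraded -> restored -> re-degraded.
    restored = restore(degraded_img)
    pred_a = disc_a(restored)
    adv_restore = F.mse_loss(pred_a, torch.ones_like(pred_a))  # fool discriminator A
    cycle_1 = F.l1_loss(degrade(restored), degraded_img)       # cycle-consistency

    # Cycle 2 (Figure 5, right): clean -> degraded -> re-restored.
    degraded = degrade(clean_img)
    pred_b = disc_b(degraded)
    adv_degrade = F.mse_loss(pred_b, torch.ones_like(pred_b))  # fool discriminator B
    cycle_2 = F.l1_loss(restore(degraded), clean_img)          # cycle-consistency

    return adv_restore + adv_degrade + lambda_cycle * (cycle_1 + cycle_2)
```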
3.1. Data

In order to train and evaluate the proposed approach, we randomly selected 17 manuscripts from Manuscripta⁵, a digital collection of medieval and early modern manuscripts, provided by the National Library of Sweden. The selected manuscripts are dated between ca. 1300 and ca. 1526 and contain texts in Old Swedish. A total of 150 pages were randomly selected from these manuscripts and split into 75 pages for the training, 25 for the validation and 50 for the test set. While preparing these splits, we ensured that images from one manuscript would only be present in exactly one split, in order to avoid information leakage. With the aim of closely representing real archival image conditions, we consider the following artificial degradations (cf. Figure 1) in our experiments:

JPEG Compression: Page images are converted to a JPEG representation with varying compression levels, randomly sampled from the range [65, 85].

Greyscale: The standard, weighted greyscale transformation is applied to the input image.

Microform: Firstly, the given image is converted to greyscale, following the same weighted approach as above. Subsequently, the contrast is adjusted via a sigmoid correction [19]. Lastly, the illumination of the microform during digitisation is simulated by superimposing a mask which darkens the image towards the edges.

Warping: A small elastic transformation [20], with alpha and sigma randomly drawn from [15, 25] and [4, 6], respectively, is applied, resulting in warping and slight displacements of the image content.

⁵ https://www.manuscripta.se/

Following the taxonomy of degradations outlined in the introduction, Warping can be categorised as type I, while the other three are examples of type II. Regarding the altered images, it should be noted that JPEG Compression and Warping produce RGB images, while Greyscale and Microform result in single-channel representations. For ease of training, outputs from the latter two transformations are repeated three times and stacked, to result in a three-channel format. Implementations for JPEG Compression, Greyscale and Warping were provided by [21], while Microform is a custom implementation. The code can be found in the accompanying repository (cf. Appendix A).

For the preparation of the validation set, each page was altered individually by each of the four outlined approaches. Subsequently, one random patch of size 256 by 256 pixels was cropped from each page, for each of the four augmentations. This results in a set of 100 image patches, each of which is stored together with the corresponding patch from the unaltered page, for comparison. The test set was prepared in a similar fashion; however, instead of a single patch per page and augmentation, ten random, disjoint, i.e. non-overlapping, patches were selected, resulting in a total of 2000 patches.

3.2. Neural Network Training Protocol

We train the outlined CycleGAN for a total of 60 epochs, using the Adam [22] optimizer with a learning rate of 0.001. Each epoch entails the sampling with replacement of 300 pages from the training set. 50 percent of these are artificially deteriorated, using a randomly selected approach from the ones outlined above, while the other 50 percent are left unaltered. One randomly located, square patch of width 256px is cropped from each of the 300 pages. Deteriorated patches are input into generator 'restore' and discriminator B, while clean ones are supplied to generator 'degrade' and discriminator A.

Following each training epoch, we assess the restoration performance on images from the validation set, via the Root-Mean-Square Error (RMSE) and the Structural Similarity Index Measure (SSIM) [23]. The model checkpoint exhibiting the best validation performance is retained and used for evaluation on the test set.

Table 1: Quantitative results for the "restoration" task. Both SSIM and RMSE range between zero and one. For the former, higher values are better, while for the latter, values closer to zero are better.

Degradation    Degraded SSIM    Restored SSIM    Degraded RMSE    Restored RMSE
Microform      0.8046           0.8301           0.3717           0.2849
Greyscale      0.9909           0.9369           0.0921           0.1033
Warping        0.8928           0.8502           0.0702           0.1144
JPEG           0.9484           0.9099           0.0353           0.0857
Overall        0.9092           0.8818           0.1423           0.1471
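The values above are computed per test patch, comparing each degraded or "restored" patch against its unaltered counterpart. A minimal sketch of how such per-patch scores can be obtained with NumPy and scikit-image is given below; this is our own illustration, assuming float images in [0, 1] and a recent scikit-image version, and not the published evaluation code (cf. Appendix A).

```python
import numpy as np
from skimage.metrics import structural_similarity

def patch_scores(output: np.ndarray, reference: np.ndarray) -> tuple[float, float]:
    """SSIM and RMSE for one patch pair; both arrays: float in [0, 1], shape (H, W, 3)."""
    ssim = structural_similarity(reference, output, channel_axis=-1, data_range=1.0)
    rmse = float(np.sqrt(np.mean((reference - output) ** 2)))
    return ssim, rmse
```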
3.3. Evaluation

In order to evaluate the "restoration" performance of our proposed approach, the chosen model checkpoint was applied to all patches in the test set. A hand-picked selection of model outputs, one per degradation type, is shown in Figure 6 (bottom row), together with the original state of the patch (top row) and the altered version that was provided as input to the 'restore' generator (middle row).

Figure 6: Hand-picked "restoration" samples, demonstrating the performance of the model. The first row shows the original image patch, the second the degraded version that is input into the generator 'restore' and the last row shows the generator's output. The last column ("JPEG Zoom") shows a close-up of the box, marked in white in the previous column ("JPEG").

As can be seen from the samples, the generator successfully transforms the microform and greyscale patches into coloured images. The range of displayed colours is slightly diminished, one could say muted, as compared to the original images; however, none of the patches display extreme or unexpected colours. For the warped patch, a de-warping effect is not immediately apparent; however, it can be noted that the generator has adapted the colours slightly. A similar colour effect can be observed for the JPEG-compressed patch. In contrast to the warped one, however, a "restoration" effect, in the form of smoothing and a reduction of blocking and ringing artefacts, is visible. Overall, the above observations hold for the majority of the test patches of types microform, greyscale and warping. Results for the JPEG compression are more diverse in quality, but some level of improvement or smoothing is generally observable.

To provide a more comprehensive overview of the model's "restoration" performance, Figure 7 illustrates two samples for which the generator provides a convincing "reconstruction" of the background colour but fails to correctly represent the use of red ink, visible in the original patch. These examples tangibly demonstrate how the extent of the training data can influence the model's "restoration" performance. As red ink is used sparingly, only to highlight selected words or phrases, it is not represented as frequently in the training data as regular, black ink. The model will therefore exhibit a strong tendency to colour darker areas, generally corresponding to some form of ink in the microform and greyscale images, in black or darker greys instead of other potential ink colours. Training the model on a dataset with more diverse shades of ink would potentially mitigate this issue.

Besides the qualitative evaluation, we also present a brief quantitative analysis. Table 1 shows the RMSE and SSIM values for the altered and the "restored" patches, each calculated with respect to the ground truth. Notably, the measures only improve for "restored" patches in the case of microform degradations. In all other cases, the performance appears to drop by one to five percentage points. Considering that a large portion of the qualitative results are perceived as reasonable restorations, this raises the question of how the qualitative and quantitative results can be consolidated and put into context with each other. A major aspect to consider, when it comes to the evaluated metrics, is that they require a concrete reference image, or ground truth, to compare with. Herein, however, lies the crux of the restoration problem: the true original data is not available. One can only attempt to obtain an approximation of the truth by making "educated guesses" based on existing clean and artificially altered data. It can therefore be argued that, while these metrics can be useful to get a general idea of the similarity, and thereby of a certain degree of quality in the context of artificial data, they should not be considered in isolation.
Instead, they should be combined with the qualitative results and examined by experienced scholars in an appropriate context.

Figure 7: Hand-picked "restoration" samples, demonstrating cases where the model proposes a "restoration" that differs substantially from the expected colourization.

4. Discussion & Conclusion

With the continued increase in computational power and capabilities of machine learning approaches, many opportunities arise in the humanities. These methods can not only facilitate novel research questions, but also provide opportunities for engaging with the public. As demonstrated above, machine learning (ML) can act as a convenient tool, but should not be used carelessly or trusted blindly. In order to "restore" information, any model will have to make an educated guess, based on the data it is presented with during training. Through this, the "restorations" run the risk of becoming copies of the training data, instead of approximations of the true original form, which is what we are aiming for in the case of image enhancements. The incorporation of undesired pieces of information in a "restoration" can have a considerable detrimental effect when research questions are to be answered using the processed data. A wrong representation or interpretation will affect the results and conclusions. This effect also has to be considered when data is being refined for consumption by laypersons, for example in the form of an exhibition or teaching material. It is crucial to ensure that the "reconstructions" allow observers to get an appropriate idea of the information and its implications. The initial example of the colour "restoration" of the greyscale flag (cf. Figure 3) serves as a cautionary tale here: whether the Swedish or the Scanian flag is presented to an audience, whatever level of expertise they may hold, could considerably affect their interpretation of the context and any conclusions they draw from it.

As argued above, generative modelling in machine learning can create digital images of impressive fake realities. This is done through a process of learning to imitate a domain of real images, e.g., human faces or landscapes. For museums, "enhancing" historical photographs or film has the potential to create engagement with older image material. As with any generative model, the frames that are filled in, or the pixels that are added, take their inspiration from both the original and the image material in the training set of the machine learning method. Hence, the historical material is merged with modern material to create a synthetic high-quality image. As such, small details in a historical photograph will be "filled in" in its high-quality counterpart by data from the training images. How can we be confident that the computer made the correct choice when filling in this information? If the training data is modern, a square object in someone's hand might get interpreted as a mobile phone, whereas originally, it may have been a cigarette case. When a model is used for sensitive material, the information that it is trained on therefore needs to be curated carefully.

Generative modelling for image "enhancement" is likely to get better and less expensive in the coming years. Making historical film look like it was taken with modern equipment, as Peter Jackson did with World War 1 footage in "They Shall Not Grow Old", is likely to become ubiquitous in museums. As computer scientists, we want to caution against using such methods uncritically.
The temptation of technological positivism is strong, with a substantial hype around "artificial intelligence" (what we here call machine learning). It is always warranted to ask which training data we are "enhancing" from. What does it include, and more importantly, what types of images or people are not included in the training data? The choice of using machine learning "enhancement", and which data to base it on, must follow from the intended application area. If it improves text recognition, "enhancement" can be used fairly safely, as the risk of accidentally adding plausible words is very low. If you are going to improve historical photography, however, make sure the model was trained on material similar to yours. If you want to make out details that weren't really visible in the original, you are not working with the original any more, and your results will be based on a synthetic reality. This last case is the most important to keep in mind, as there are a lot of imaging applications where super-resolution is viable. This does not apply to historical imagery, as the material is much more diverse than CT scans or cartoons. As is often the case with new technology, the potential is great and intriguing. However, in a world of believable synthetics and deepfakes, the need for a trained human eye and careful curation has never been greater.

Acknowledgments

The computations were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC) at Chalmers Centre for Computational Science and Engineering (C3SE), partially funded by the Swedish Research Council through grant agreement no. 2018-05973. This work is partially supported by Riksbankens Jubileumsfond (RJ) under the project "New Eyes on Sweden's Medieval Scribes", Dnr NHS14-2068:1. A special thanks to the PI, Professor Lasse Mårtensson.

References

[1] T. Wilkinson, J. Lindstrom, A. Brun, Neural Ctrl-F: Segmentation-free query-by-string word spotting in handwritten manuscript collections, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[2] A. Brink, J. Smit, M. Bulacu, L. Schomaker, Writer identification using directional ink-trace width measurements, Pattern Recognition 45 (2012) 162–171. URL: https://www.sciencedirect.com/science/article/pii/S0031320311002810. doi:10.1016/j.patcog.2011.07.005.
[3] S. Boldsen, F. Wahlberg, Survey and reproduction of computational approaches to dating of historical texts, in: Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa), Linköping University Electronic Press, Sweden, Reykjavik, Iceland (Online), 2021, pp. 145–156. URL: https://aclanthology.org/2021.nodalida-main.15.
[4] R. Heil, E. Vats, A. Hast, Paired image to image translation for strikethrough removal from handwritten words, 2022. arXiv:2201.09633, under review at DAS 2022.
[5] H. Sommer, Assessing millennial engagement in museum spaces, in: Theory and Practice 1, The Museum Scholar, 2018. URL: http://articles.themuseumscholar.org/tp_vol1sommer.
[6] G. Larsson, M. Maire, G. Shakhnarovich, Learning representations for automatic colorization, 2017. arXiv:1603.06668.
[7] C. Dong, C. C. Loy, K. He, X. Tang, Image super-resolution using deep convolutional networks, CoRR abs/1501.00092 (2015). URL: http://arxiv.org/abs/1501.00092. arXiv:1501.00092.
[8] S. Niklaus, F. Liu, Context-aware synthesis for video frame interpolation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[9] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial networks, 2014. arXiv:1406.2661.
[10] T. Karras, S. Laine, T. Aila, A style-based generator architecture for generative adversarial networks, 2019. arXiv:1812.04948.
[11] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, T. Aila, Analyzing and improving the image quality of StyleGAN, 2020. arXiv:1912.04958.
[12] G. E. P. Box, Science and statistics, Journal of the American Statistical Association 71 (1976) 791–799. doi:10.1080/01621459.1976.10480949.
[13] J.-Y. Zhu, T. Park, P. Isola, A. A. Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[14] C. H. Mai, R. Nakatsu, N. Tosa, Developing Japanese ikebana as a digital painting tool via AI, in: N. J. Nunes, L. Ma, M. Wang, N. Correia, Z. Pan (Eds.), Entertainment Computing – ICEC 2020, Springer International Publishing, Cham, 2020, pp. 297–307.
[15] R. Heil, E. Vats, A. Hast, Strikethrough removal from handwritten words using CycleGANs, in: J. Lladós, D. Lopresti, S. Uchida (Eds.), Document Analysis and Recognition – ICDAR 2021, Springer International Publishing, Cham, 2021, pp. 572–586.
[16] M. Sharma, A. Verma, L. Vig, Learning to clean: A GAN perspective, in: G. Carneiro, S. You (Eds.), Computer Vision – ACCV 2018 Workshops, Springer International Publishing, Cham, 2019, pp. 174–185.
[17] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, Y. Bengio, The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 1175–1183. doi:10.1109/CVPRW.2017.156.
[18] N. Pielawski, OctoPyTorch: Segmentation Neural Networks, 2021. URL: https://github.com/npielawski/octopytorch.
[19] G. J. Braun, M. D. Fairchild, Image lightness rescaling using sigmoidal contrast enhancement functions, Journal of Electronic Imaging 8 (1999) 380–393.
[20] P. Simard, D. Steinkraus, J. Platt, Best practices for convolutional neural networks applied to visual document analysis, in: ICDAR, 2003. doi:10.1109/ICDAR.2003.1227801.
[21] A. B. Jung, K. Wada, J. Crall, S. Tanaka, J. Graving, C. Reinders, S. Yadav, J. Banerjee, G. Vecsei, A. Kraft, Z. Rui, J. Borovec, C. Vallentin, S. Zhydenko, K. Pfeiffer, B. Cook, I. Fernández, F.-M. De Rainville, C.-H. Weng, A. Ayala-Acevedo, R. Meudec, M. Laporte, et al., imgaug, https://github.com/aleju/imgaug, 2020. Online; accessed 14-Feb-2022.
[22] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Y. Bengio, Y. LeCun (Eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL: http://arxiv.org/abs/1412.6980.
[23] Z. Wang, A. Bovik, H. Sheikh, E. Simoncelli, Image quality assessment: from error visibility to structural similarity, IEEE Transactions on Image Processing 13 (2004) 600–612. doi:10.1109/TIP.2003.819861.

A. Online Resources

The code used to train and evaluate the proposed CycleGAN is available here: https://zenodo.org/record/6592707.