=Paper=
{{Paper
|id=Vol-3290/short_paper5571
|storemode=property
|title=Automatic Identification and Classification of Portraits in a
Corpus of Historical Photographs
|pdfUrl=https://ceur-ws.org/Vol-3290/short_paper5571.pdf
|volume=Vol-3290
|authors=Taylor Arnold,Lauren Tilton,Justin Wigard
|dblpUrl=https://dblp.org/rec/conf/chr/ArnoldTW22
}}
==Automatic Identification and Classification of Portraits in a
Corpus of Historical Photographs==
Automatic Identification and Classification of Portraits in a Corpus of Historical Photographs Taylor Arnold1,∗,† , Lauren Tilton2,† and Justin Wigard2,† 1 Linguistics Program, Carole Weinstein International Center, 211 Richmond Way, Richmond, VA 23173, U.S.A 2 Rhetoric & Communication Studies, 231 Richmond Way, Richmond, VA 23173, U.S.A Abstract There have been recent calls for an increased focus on the application of computer vision to the study and curation of digitised cultural heritage materials. In this short paper, we present an approach to bridge the gap between existing algorithms and humanistically driven annotations through a case study in which we create an algorithm to detect and and classify portrait photography. We apply this method to a collection of about 40,000 photographs and present a preliminary analysis of the constructed data. The work is part of the larger ongoing study that applies computer vision to the computational analysis of over a million U.S. documentary photographs from the early twentieth century. Keywords computer vision, cultural heritage, photography, public humanities 1. Introduction 1.1. Motivation In this paper we present work that adapts and applies computer vision algorithms to aid in the discovery and use of historic digital photography [2, 9]. Rather than treating cultural heritage images as a monolith, whereby computational approaches are o昀琀en developed and applied without attention to the form of cultural heritage in technical scholarship, we pursue technical research with computer vision that considers the speci昀椀city of photography as a medium, social practice, and source of evidence for humanistic inquiry [4]. We present work on a photography collection from the early 20th century, totaling nearly 40,000 photographs and held by the U.S. Library of Congress (LoC). The number of photographs in the LoC collections provides not only a scale that bene昀椀ts from the use of computer vision, but speaks to the necessity of experimenting and developing approaches to computer vision for access and discovery of images. Our work is designed to support the LoC’s work to “expand access” and “increase discoverability” to their collections. CHR 2022: Computational Humanities Research Conference, December 12 – 14, 2022, Antwerp, Belgium ∗ Corresponding author. † These authors contributed equally. £ tarnold2@richmond.edu (T. Arnold); ltilton@richmond.edu (L. Tilton); jwigard@richmond.edu (J. Wigard) ç https://statsmaths.github.io (T. Arnold); https://anon@anon.org/ (L. Tilton); http://justinwigard.com/ (J. Wigard) ȉ 0000-0003-0576-0669 (T. Arnold); 0000-0003-4629-8888 (L. Tilton); 0000-0003-0124-5934 (J. Wigard) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) 25 Our project builds on recent work at the intersection of machine learning and photogra- phy [7, 10]. Corrin, Davis, Lincoln, and Weingart’s CAMPI paper o昀昀ers a report on the use of computer vision for processing digital photograph collections [6]. The scope includes the assessment of back-end metadata generation for visual similarity search, duplicate or close- match detection, and utilizing image similarity for tagging. Our work looks at the feasibility of apply a larger set of algorithms for a broader approach to metadata classi昀椀cation. This new approach gives attention to the speci昀椀city of the features of photography such as composition and types of photography. The case study described in this paper is part of the ADDI project, which makes interventions in several areas.1 The project investigates the accuracy and appropriateness of di昀昀erent types of computer vision algorithms applied to historic photography. Because current algorithms are designed and trained on 21st century color images, potential challenges for these algorithms include identifying and labeling historic objects as well as analyzing black and white images, particularly legacy preservation digital formats [9]. The project models the feasibility of using algorithmically generated metadata features for automated search and discovery with attention to the ethical considerations of using computer vision. We investigate the use of algorithmi- cally extracted features to directly create automated metadata to facilitate discovery. Speci昀椀c attention will be to given metadata categories that reveal features of photographs such as com- position. Questions include the usefulness of automated features for search and discovery and how to display automated features to general publics using best practices from the 昀椀eld of human-computer interaction (HCI). Finally, the project assesses the necessary technical archi- tecture for running and storing the results of computer vision algorithms within a cloud-based architecture. In this paper, we present an analysis that models these four interventions applied to a speci昀椀c set of photographs. The George Grantham Bain Collection consists of nearly 40,000 black-and-white photographs taken by one of the earliest news picture agencies in the United States [11]. Most images come from the 昀椀rst two decades of the 20th century. They document a wide range of activities, including quotidian scenes of shops, gas stations, and lunch counters, major political rallies, University football games, weddings, and funerals. One type of image that we found to be particularly prominent when browsing through the collection are formal portrait photographs. There are clearly many of these in the collection, but they are not directly identi昀椀ed by meta- data 昀椀elds or a consistent description in the photograph titles. The goal in this paper is to determine how we can use computer vision annotations to identify and describe the the por- trait photographs found within the Bain collection. 2. Detecting studio portraits 2.1. Locating people As a starting point, we investigated the results of the application of a region segmentation algo- rithm to every image in the collection [1, 3].2 This algorithm attempts to associate each pixel 1 Information about the larger project can be found at https://github.com/distant-viewing/addi. 2 We used a model trained on the COCO segmentation dataset using the R50-FPN architecture from the Detectron2 model zoo (137260431). 26 Figure 1: Distribution showing the number of detected people in each photograph in the Bain collec- tion. Notice that the modal number of people is one; these correspond to (a superset of) the portraits in the collection. in an image with a object type or background region (such as the ground or sky). Though tech- nically not an “object,” one of the object types detected by the algorithm are people. Figure 1 shows the number of images in the Bain collection based on the number of people detected by the region segmentation algorithm. Two interesting things stand out in the results. First, notice that there are very few images that contain no people. This leads us to assume that there are not many images that contain only the built or natural environment. It also indicates potential patterns about how Bain visually de昀椀ned news through the centering of people. Sec- ondly, we see that images with a single person can be clearly identi昀椀ed. There are far more images with one person compared to any other number of people. These, likely, are where most of the portrait photographs can be found. 2.2. Bimodal distribution To investigate further, we need to understand more about the people detected in the images. We can do this by looking at the proportion of the image frame that is taken up by people. Speci昀椀cally, we will look at a density plot of the proportions of the images taken up by people based on whether there is only one person or multiple people. Images without people are excluded. Our primary interest is the shape of the images with one person; the other images will help as a point of comparison. A density plot is shown in Figure 2. Interestingly, we see that for images with only one person, the proportion of frame image taken up by the person concentrates around two di昀昀erent values. One is around about 20% of the image and the other is around 50% of the image. As a comparison, notice that the density curve for images with two or more people has a sharp peak around 3%, with a steady decrease for larger proportions. To understand more, we will look closer at the images with a single person from each of the 27 Figure 2: Density plot showing the percentage of the frame taken up by people based on whether there are one or more than one person detected. There is a bimodality of the single-person distribution, showing modes around 18% and around 47% of the frame. modes of the density plot. Note that the process of moving back and forth between aggregating the data and looking at individual images is a common and fruitful mode of analysis throughout our work. Figure 3 shows 20 random photographs that have a single detected individual that takes up between 15% and 20% of the image frame. For comparison, Figure 4 shows 20 random photographs that have a single detected individual that takes up between 45% and 50% of the image frame. Looking at the images, we can now understand the di昀昀erence between the two types of photographs containing a single detected person. In the 昀椀rst mode, most of the images feature the entire body of a single person shown in an interesting place; for example, a baseball player on the 昀椀eld, a man in a radio station, or a woman in front of a sewing machine. In contrast, the second mode primarily contains portraits of people shown from the chest upwards. The majority of these images appear to be shot in a studio setting and a neutral background. The person in the frame is o昀琀en dressed in a formal wear or an o昀케cial uniform. They seem to be looking directly at, or just slightly o昀昀, the camera. It seems, then, that we can use the number of people (1) and the percentage of the frame taken up by a person (around 50%) to determine if an image in this collection is a studio portrait. However, note that these general patterns of the two modes are generally accurate but not perfect. One image in the second mode is an interestingly framed image of a man playing the piano; two of the images in the 昀椀rst mode are full-length shots taken in a studio setting. So, in using these derived annotations, we should keep in mind that there will be some errors. This does not stop us from using the results in aggregate or as a general method for search and discovery. However, if one wanted to add this information directly into the archival metadata tag directly, we will want to be clear that the categorization is algorithmically generated. 28 Figure 3: Twenty randomly selected images from the first mode of the single person images, in which around 18% of the frame is taken up by the person. The analysis also opens up interesting questions about news and visual culture. Are their social roles that are documented in certain ways compared to others? Our initial data suggests that activities such as dance and sports (which we could generalize to a category called per- formers) as well as women more generally are o昀琀en photographed with visual information to clearly communicate to the audience the role of the person. In other words, the scene is their skill and helps the viewer understand why they are being featured. On the other hand, the studio portrait with a neutral or decorative background draws the eye to the face and clothes. There is little extra information to indicate exactly who the person is. Like the portraits that line government buildings with a name engraved in brass, the style of the portrait is designed to convey the person’s prominence. It appears that certain roles in society such as military of- 昀椀cials and politicians are being granted the visual power and cultural prominence of the close up. There is signi昀椀cantly more analysis to bolster this initial observation, but the initial di昀昀er- ences are opening up questions and potential (historical) patterns regarding the relationship of framing, social position, and power [8]. 29 Figure 4: Twenty randomly selected images from the first mode of the single person images, in which around 47% of the frame is taken up by the person. 2.3. Application Now that we have identi昀椀ed the set of portrait photographs, what can we do with them? From an access perspective, we could identify these photographs and create an exhibit or digital public project focused speci昀椀cally on these images. As a form of analysis, we might try to identify how other archival metadata compares to portrait photographs. As one example, we can look at the distribution of photographs from the Bain collection by time based on whether an image is a portrait or not. Figure 5 shows a density plot of these results. Looking at the density plot shows that the dates of photographs seem to cluster around “round” dates, such as 1900, 1910, 1915, and 1920. This is likely an artifact of the data collection rather than an interesting feature of the data itself. Unfortunately, a signi昀椀cant portion of the collection was lost in a 昀椀re. Speci昀椀cally on the topic of the portrait photography, what is most interesting is that what we have tagged as studio portrait photographs seem to be equally distributed across the same time periods of the rest of the collection. So, the set of portrait photographs are an important element of the Bain collection through the early 20th century rather than being a feature of only a few years. 30 Figure 5: Distribution of the one-person photographs by the date listed in the archival data. Notice that there are bumps around particular clusters but that the mixture of the two modes seems to be relatively constant. 3. Portrait classication 3.1. Orientation using pose detection Another way of looking at the portrait photography in the Bain Collection is by using a pose detection algorithm to estimate the orientation of people looking at the camera [5].3 This will help us understand if people are rotated to the le昀琀, right, or squarely looking into the camera. The pose detection algorithm can help with this by allowing us to compare the position of di昀昀er- ent body parts relative to one another. As a starting point, we compared the distance between one’s nose with their right and le昀琀 ears. Calculating which ear is closer in two-dimensional space to the nose is a way of detecting how the face is framed relative to the camera. Applying this algorithm to the portraits in the Bain Collection shows that there seems to be no particular preference for poses to the le昀琀 or right. In order to understand these results, let’s look at some of the images based on their pose. Figure 6 shows 20 randomly selected images that appear to be posed to the right and for com- parison, 昀椀gure 7 shows 20 randomly selected images that appear to be posed to the le昀琀. Looking at these images we can see that they do seem to correctly identify the orientation of people’s faces. However, our algorithm only uses the location of the ears and therefore is unable to to detect which way people’s actual eyes are being directed. One trope we see in the above images is that may poses have one’s face directed to one side of the frame, but their eyes cutting across the frame in the other direction. Further exploration of this compositional pattern is necessary for it also has the potential to connect back to our earlier observations. We can see again that 3 We used a model trained on the COCO Person Keypoint dataset using the R50-FPN architecture from the Detec- tron2 model zoo (137261548). 31 Figure 6: Twenty randomly selected portraits posed to the right, according to the post detection algo- rithm. the portraits in this style are primarily men and many appear to be White, although we want to proceed with caution about assuming race by doing additional research. The analysis opens up more questions about the role of portraiture, gender, and race in early 20th century visual culture. 3.2. Orientation using face keypoints We can repeat the same process using the locations of the eyes themselves relative to the loca- tion of the nose key points. Similarly, this algorithm does not display any strong preference for poses to the le昀琀 or right. Looking at examples can once again be helpful. Figure 8 shows 20 randomly chosen examples of poses based on the eyes to the right. And in the 昀椀nal set, Figure 9 shows 20 randomly chosen examples of poses based on the eyes to the le昀琀. Looking at these results, we see that the eye-based calculation does 昀椀nd poses which are more strongly oriented to one side of the image or another. In all of the example cases, we see that the entire person is oriented in the expected directly. Still, several examples show people who are looking o昀昀 across the camera with their pupils. This highlights that orienting a person one way while gazing across the frame of the image is a common element of these studio portrait photographs. Further close and computational analysis, as well as a re昀椀nement of the classi昀椀cation of portrait photography, in order to better understand this phenomenon is 32 Figure 7: Twenty randomly selected portraits posed to the le昀琀, according to the post detection algo- rithm. a planned topic for future work. 4. Future Directions In this short paper we present prelimenary results from our work to identify and classify for- mal elements of photography using existing computer vision algorithms. Despite being used on historic, black-and-white photographs, the algorithms used in our application appear to perform well in terms of both precision and recall. However, signi昀椀cant work is required to construct rules for mapping low-level computer vision annotations into meaningful categories that are of interest to scholars of visual culture, archivists, and others looking to increase access and discover of digitised photographic corpora. Our application here shows the feasibility of this task on the speci昀椀c case of portrature detection and classi昀椀cation in one relatively large collection. Future work will be able to identify and classify images based on additional for- mal features of photography and ensure that the techniques can be adapted uniformally across many collections and time periods. 33 Figure 8: Twenty randomly selected portraits posed to the right, according to the face keypoints. Acknowledgments The work in this paper was funded in part by a grants from the Library of Congress’ Computing in the Cloud Initiative (BAA #LCCIO20D0112), the National Endowment for the Humanities (HAA-261239-18), and the Mellon Foundation. References [1] T. Arnold and L. Tilton. “Distant viewing Toolkit: A python package for the analysis of visual culture”. In: Journal of Open Source So昀琀ware 5.45 (2020), p. 1800. [2] T. Arnold and L. Tilton. “Distant viewing: analyzing large visual corpora”. In: Digital Scholarship in the Humanities 34.Supplement_1 (2019), pp. i3–i16. [3] H. Caesar, J. Uijlings, and V. Ferrari. “Coco-stu昀昀: Thing and stu昀昀 classes in context”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018, pp. 1209–1218. [4] C. Dijkshoorn, L. Jongma, L. Aroyo, J. Van Ossenbruggen, G. Schreiber, W. Ter Weele, and J. Wielemaker. “The Rijksmuseum collection as linked data”. In: Semantic Web 9.2 (2018), pp. 221–230. 34 Figure 9: Twenty randomly selected portraits posed to the le昀琀, according to the face keypoints. [5] K. He, G. Gkioxari, P. Dollár, and R. Girshick. “Mask R-CNN”. In: Proceedings of the IEEE international conference on computer vision. 2017, pp. 2961–2969. [6] M. Lincoln, J. Corrin, E. Davis, and S. B. Weingart. “CAMPI: Computer-Aided Metadata Generation for Photo archives Initiative”. In: (2020). [7] T. Smits and S. Asser. “The Great Unseen. Photojournalism and the archive: from ana- logue to digital”. In: TMG Journal for Media History 25.1 (2022), pp. 1–17. [8] J. Tagg. The disciplinary frame: Photographic truths and the capture of meaning. U of Min- nesota Press, 2009. [9] M. Wevers and T. Smits. “The visual digital turn: Using neural networks to study histor- ical images”. In: Digital Scholarship in the Humanities (2019). [10] M. Wevers, N. Vriend, and A. de Bruin. “What to do with 2.000.000 Historical Press Pho- tos? The Challenges and Opportunities of Applying a Scene Detection Algorithm to a Digitised Press Photo Collection”. In: TMG Journal for Media History 25.1 (2022), pp. 1– 24. [11] D. Yotova. “The Bain Collection: Created and maintained by the Library of Congress: http://www. loc. gov/pictures/collection/ggbain/. Reviewed September 2016”. In: Ameri- can Journalism 33.4 (2016), pp. 488–490. 35