<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Applied Face Recognition in the Humanities</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Martin</forename><surname>Bullin</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Media Informatics</orgName>
								<orgName type="institution">University of Bamberg</orgName>
								<address>
									<addrLine>An der Weberei 5</addrLine>
									<postCode>96047</postCode>
									<settlement>Bamberg</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Andreas</forename><surname>Henrich</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Media Informatics</orgName>
								<orgName type="institution">University of Bamberg</orgName>
								<address>
									<addrLine>An der Weberei 5</addrLine>
									<postCode>96047</postCode>
									<settlement>Bamberg</settlement>
									<country key="DE">Germany</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Applied Face Recognition in the Humanities</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">13359273B20B970F2163641721F809A1</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:20+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Face Recognition</term>
					<term>Face Detection</term>
					<term>Scene Classification</term>
					<term>Personality Mapping</term>
					<term>Personality Recognition</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Several research areas in the humanities and social sciences could potentially benefit from state-of-the-art machine learning technologies. In the field of computer vision, face detection (FD) and face recognition (FR) algorithms in particular could provide support on several levels. We discuss two application scenarios where the deployment of trained networks can be used to generate further information. We show how FD can be used to recognize scene types such as dialogue and speech in non-photographic images such as those of the Emblematica collection. The second part shows another application scenario where FR can be used to combine or link images in research datasets with authority records by finding personalities from the Wikidata dataset.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Searching collections of historical images and documents usually places high demands on labeling the collections with appropriate metadata. In this paper, we present first results of experiments in which we use face detection and recognition methods, firstly, to support a classification of the depicted scenes based on the detected faces and gaze directions and, secondly, to assign authority records (more precisely, Wikidata items) to the depicted persons with as high accuracy as possible.</p><p>The first approach employs face detection algorithms for scene classification. Given a dataset, potentially comprising all images extracted from a particular book or collection, the objective is to classify the scene type present in each image. In our specific scenario, we focused on the scene types Dialogue, Speech, and Other. The second proposed approach aims to simplify the process of labeling images by using metadata from a reliable reference source like Wikidata. This reference source contains portrait images connected to authority records that can be identified using face recognition. By utilizing these available resources, the goal is to make the unambiguous labeling of the images in a research dataset easier and more efficient.</p><p>The remainder of the paper is organized as follows: In section 2 we address related work and give more information on the background of our work. The use of face detection techniques for scene type classification is assessed in section 3. Section 4 considers the use of face recognition techniques for labeling unseen data based on a reference source like Wikidata, and section 5 concludes the paper and presents potential future work.</p><p>LWDA'23: Lernen, Wissen, Daten, Analysen. October 09-11, 2023, Marburg, Germany. Contact: martin.bullin@uni-bamberg.de (M. Bullin); andreas.henrich@uni-bamberg.de (A. Henrich). Web: https://www.uni-bamberg.de/minf/team/bullin/ (M. Bullin); https://www.uni-bamberg.de/minf/team/henrich/ (A. Henrich). ORCID: 0000-0001-9498-3615 (M. Bullin); 0000-0002-5074-3254 (A. Henrich).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work and Background</head><p>In the upcoming subsections, an exploration of computer vision will be undertaken, focusing on three crucial areas: face detection, face recognition, and scene classification. Face detection algorithms are designed to locate and identify human faces within images or video frames. Face recognition techniques enable the identification and verification of individuals based on their unique facial features. Lastly, scene classification methods, which will be further discussed in subsection 2.2, aim at the categorization of images or video frames into various scene types.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Face Detection and Face Recognition</head><p>In the domain of face detection and its subfield, face recognition, a wide range of tools is available. Wang et al. conducted a comprehensive survey <ref type="bibr" target="#b0">[1]</ref>, wherein they evaluated 15 different methods. The results demonstrated that all of these methods achieved accuracies exceeding 99% on simple datasets, and that at least the superior methods attained accuracies above 90% across all datasets. Since our investigations focus on the application of state-of-the-art methods, the selection of tools was guided first by their ease of use, followed by their cutting-edge performance and significance within the research community.</p><p>A widely used tool in the field is InsightFace<ref type="foot" target="#foot_0">1</ref> , which encompasses various CNN-based components such as RetinaFace for face detection and ArcFace for face recognition <ref type="bibr" target="#b1">[2]</ref>. With over 15k stars on GitHub, this project has gained significant popularity and is available as a Python library. Its utilization, particularly for face detection and during the deployment phase, is relatively straightforward. However, the training process for the standard implementation presents greater complexity, as it requires the dataset to be in the MXNet binary format. Hence, in section 3, where face detection was necessary, InsightFace, particularly RetinaFace, was employed. Conversely, it could not be employed in section 4, which involved learning new faces for the MWW dataset, a collection of old engravings with portraits. The tool employed in this application scenario was the face_recognition library 2 . Unlike InsightFace, this library offers ease of use for both the deployment and training aspects of face recognition tasks. In its default mode it works without CNN models and can thus run on the CPU alone. It is built on the dlib C++ library and claims to achieve a remarkable accuracy of 99.38% on the Labeled Faces in the Wild 3 dataset. With 50k stars, the corresponding repository enjoys even greater popularity than InsightFace. Figure <ref type="figure" target="#fig_0">1</ref> gives examples of the face detection quality of both libraries; it will be discussed in more detail in section 4.3. </p></div>
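To make the embedding-based matching concrete: such libraries map each face to a fixed-length encoding (128-dimensional in face_recognition; shortened here) and treat two faces as the same person when the Euclidean distance between their encodings falls below a threshold (0.6 is the face_recognition default). The following is a minimal sketch of that idea, not the library's actual API; all names and the toy vectors are our own illustration.

```python
import numpy as np

def match_face(query_encoding, known_encodings, threshold=0.6):
    """Return (index, distance) pairs for gallery faces closer to the
    query than the threshold, sorted from most to least similar."""
    known = np.asarray(known_encodings)
    distances = np.linalg.norm(known - query_encoding, axis=1)
    order = np.argsort(distances)
    return [(int(i), float(distances[i])) for i in order
            if distances[i] < threshold]

# Toy 4-d "encodings" stand in for the real 128-d vectors.
gallery = np.array([[0.0, 0.0, 0.0, 0.0],
                    [1.0, 1.0, 1.0, 1.0],
                    [0.1, 0.0, 0.1, 0.0]])
query = np.array([0.08, 0.0, 0.08, 0.0])
print(match_face(query, gallery))  # gallery face 2 is the closest match
```

Note how the second gallery face is filtered out entirely: its distance exceeds the threshold, so it is never reported as a candidate.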
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Scene Classification</head><p>In the field of "Scene Classification", it is noteworthy that no publications specifically addressing this particular use case were found. Most publications focus on annotated standard datasets like MIT Indoor 67, with predefined classes such as indoor and outdoor and multiple subclasses such as store or home <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4,</ref><ref type="bibr" target="#b4">5,</ref><ref type="bibr" target="#b5">6]</ref>. The majority of these publications and subsequent studies employ deep learning models to classify these standard scene classes <ref type="bibr" target="#b6">[7]</ref>. However, only one publication, by Wevers et al. <ref type="bibr" target="#b7">[8]</ref>, introduces classes that are specifically relevant to the dialogue and speech scenario. Further research could explore testing our approach against the performance of these existing models on their dataset, limited to the defined classes.</p><p>Another research field delves into the specific topic of dialogue detection, with a particular emphasis on this singular class. In the pioneering work of Kotti et al. <ref type="bibr" target="#b8">[9]</ref>, audio-assisted dialogue detection in movies was explored, marking one of the early instances of deploying neural networks in this context. The exploration of our proposed approach with video data could shed light on its suitability and efficacy within this specific research domain.</p><p>Another topic is described by Impett, who presents an approach for clustering the gestures found in art history <ref type="bibr" target="#b9">[10]</ref>. In future work, the mentioned automatic human pose and gesture estimation could be compared with, as well as combined with, the proposed approach.</p><p>An approach that already bridges the domains of dialogue detection and classification with face recognition is presented in the work by Ito et al., which additionally uses sound <ref type="bibr" target="#b10">[11]</ref>. The authors aim to recognize smiles and laughter by combining speech processing and face recognition techniques. Similar to our proposed approach, they utilize facial landmarks generated from the face as part of their methodology. This prior work serves as a relevant reference in demonstrating the feasibility of integrating facial landmark analysis into the context of dialogue classification.</p><p>Our study aims to address the research gap in scene recognition, specifically focusing on the recognition of dialogue and speech. To achieve this, the study leverages state-of-the-art deep learning approaches in combination with a rule-based classification.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Face Detection Deployment for Scene Type Detection</head><p>Face detection has advanced significantly, leading to the development of various applications such as InsightFace-REST <ref type="foot" target="#foot_2">4</ref> . These applications encompass a range of functionalities, including face detection, face recognition, age estimation, and even specialized tasks like mask detection during the COVID-19 pandemic. The algorithms utilized in these applications often rely on facial landmarks and bounding boxes to perform their tasks effectively.</p><p>Facial landmarks, represented by red and green dots in the lower part of Figure <ref type="figure" target="#fig_0">1</ref>, are key points on the face used in face detection and recognition algorithms. They typically include landmarks corresponding to the eyes, nose, and mouth regions. By leveraging the positions and relationships between these landmarks, algorithms can accurately detect and recognize faces.</p><p>In this study, the generated facial landmarks and bounding boxes were employed to infer the scene type depicted in the underlying image. Various scene types, as previously mentioned, were introduced for classification purposes. The scene type Dialogue specifically refers to an interaction where two individuals are engaged in a conversation, facing and looking at each other. Another scene type, Speech, is defined by the presence of a speaker making eye contact with at least one other person, with the number of people observing the speaker surpassing a customizable threshold. Images that fit neither of these specific criteria are categorized as Other.</p><p>This study highlights how existing deep learning solutions can be leveraged to address problem scenarios that extend beyond their primary purposes. In this particular case, the focus was on utilizing a face detection algorithm to infer the scene types depicted in images. By exporting the results and processing them with a script, the study demonstrates how the different scene classes can be effectively described and captured.</p><p>In the forthcoming sections, the concept of the study will be elaborated upon in detail, outlining the proposed methodology. Subsequently, a brief introduction to the examined datasets will be presented. Finally, the qualitative evaluation conducted on the provided datasets will be discussed, highlighting the findings and outcomes of the study.</p></div>
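A detector of this kind returns, per face, a bounding box and landmark coordinates for the eyes, nose, and mouth; that is all the later scene classification consumes. As a minimal illustration of how a gaze direction can be read off these outputs (the precise rule is described in section 3.1; the function name, argument layout, and the relative not-clear margin are our own assumptions), consider:

```python
def gaze_direction(left_eye_x, right_eye_x, box_left, box_right, margin=0.05):
    """Classify gaze as 'left', 'right', or 'not-clear' from the point
    between the eyes relative to the vertical bounding box borders.
    margin is the minimal relative asymmetry required; a small value
    keeps the classification close to a binary left/right decision."""
    mid_x = (left_eye_x + right_eye_x) / 2.0
    width = box_right - box_left
    # Relative position inside the box: 0 = left border, 1 = right border.
    rel = (mid_x - box_left) / width
    if rel < 0.5 - margin:
        return "left"    # eye midpoint closer to the left border
    if rel > 0.5 + margin:
        return "right"   # eye midpoint closer to the right border
    return "not-clear"

print(gaze_direction(30, 40, 0, 100))  # midpoint 35 -> 'left'
```

A head turned to the left shifts the visible eye pair toward the left border of the detected box in a 2D image, which is what this proximity test exploits.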
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Concept</head><p>Acknowledging that the application scenario is hypothetical and emphasizing the exploration of possibilities in addressing new problem scenarios, it is important to note that the generated classes, namely Dialogue and Speech, are also constructed and leave space for individual interpretations and refinements. While a simple solution could involve counting the number of faces in an image and using a threshold to determine whether it is a dialogue or a speech (i.e., two faces indicating a dialogue and more than two faces indicating a speech), our approach considers additional features. We used the Oxford English Dictionary (OED) definitions as a basis for Dialogue and Speech. According to the OED, a Dialogue is defined as a "Conversation between two or more characters in a literary work; the words spoken by the actors in a play, film, etc. " <ref type="foot" target="#foot_3">5</ref> In the context of this research, the focus was specifically on two-way conversations, where two individuals are present and maintain visual contact by looking at each other. On the other hand, Speech is defined as an "address or discourse of a more or less formal character delivered to an audience or assembly. " <ref type="foot" target="#foot_4">6</ref> Our approach takes into consideration factors such as the potential eye contact between more than two people, the number of people looking at a single person, and the person in focus looking in the direction of most individuals present.</p><p>To determine the main auxiliary variables, namely gaze-direction, eye-contact, and looks-at, for the detected faces generated by the face detection network, the following process was implemented: Gaze-direction: The gaze-direction for each face is classified as either left, right, or not-clear. 
The threshold for not-clear is set to the minimum in order to start with a high recall, while keeping the possibility of moving from a binary classification into left and right to a more precise one that includes not-clear. This determination is based on the relative position of the point between the two eyes with respect to the vertical bounding box borders: the gaze-direction is assigned as the direction of the closer border. Looks-at: The gaze-direction information is then used to check whether a person is looking at another person. The looks-at variable is set to true if a person is located beside another person and the gaze-direction indicates that he or she is focusing on one of the keypoints of the other person. This calculation is based on the viewing angle, which is defined by the vector connecting the eyes and a default angle of 180 ∘ "around" this vector. The choice of a relatively high default value, exceeding the biological norm, was made due to the limitations of the 2D representation of images and the unconventional nature of the datasets. Eye-contact: The eye-contact variable summarizes whether both person 1 and person 2 have the "looks-at" flag for each other, indicating mutual eye contact.</p><p>Using the collected pieces of information for all faces in the image, a categorization process was implemented as a simple Python script. This script facilitates the classification and analysis of the facial features and interactions within the image using simple heuristics. To categorize the scenes into Dialogue, Speech, or Other, the following steps are undertaken: Derivation of centrality from a looks-at statistic: The list of looks-at pairs is split into all occurrences of the individuals involved. Assume a scene where person 𝑎 is looking at person 𝑏 and at person 𝑐, and person 𝑐 is also looking at person 𝑎. 
Then, from the list of looks-at pairs [𝑎𝑏, 𝑎𝑐, 𝑐𝑎], the list of all occurrences of individuals [𝑎, 𝑏, 𝑎, 𝑐, 𝑐, 𝑎] would be derived. This helps to identify the most prominent or central person in the image by assessing the frequency of their appearance in the next step. Frequency counter: A frequency counter is created to determine the occurrence frequency of each person in the list of all occurrences determined before. Referring to the example, this would look like [𝑎:3, 𝑏:1, 𝑐:2]. This provides insights into the relative prominence of individuals within the scene. Threshold calculation: A threshold value is computed by multiplying the adjustable threshold factor (0.7 in our example) by the number of looks minus one. This adjustment accounts for the fact that the speaker must look at someone. In the given example this would be 0.7 ⋅ (3 − 1) = 1.4, so at least 2 looks would have to go from or to the speaker.</p><p>After these preparation steps, the classification is performed by the following checks:</p><p>Dialogue classification: The first check verifies whether exactly two persons have mutual eye contact and only these two persons look at someone, leading to the classification Dialogue.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Speech classification:</head><p>The next check ensures that at least one eye-contact is present and that the number of looks of the most looked-at person exceeds the calculated threshold. If so, the class Speech is applied.</p><p>Other classification: In all other cases, the class Other is assigned.</p></div>
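The centrality derivation, frequency counter, threshold, and the three classification checks above can be condensed into a short script of the following shape. This is our own sketch of the heuristic: the representation of looks-at pairs as tuples and of eye contacts as a set of frozensets is an assumption, and the geometric derivation of these inputs from the detector output is omitted.

```python
from collections import Counter

def classify_scene(looks_at, eye_contacts, threshold_factor=0.7):
    """Classify a scene as 'Dialogue', 'Speech', or 'Other'.

    looks_at:     list of (viewer, viewed) pairs, e.g. [('a','b'), ('a','c'), ('c','a')]
    eye_contacts: set of frozensets {p1, p2} holding mutual looks-at flags
    """
    # Centrality: flatten the pairs into all occurrences of individuals,
    # e.g. [('a','b'), ('a','c'), ('c','a')] -> ['a','b','a','c','c','a'].
    occurrences = [person for pair in looks_at for person in pair]
    freq = Counter(occurrences)                   # e.g. {'a': 3, 'c': 2, 'b': 1}
    lookers = {viewer for viewer, _ in looks_at}  # persons who look at someone

    # Dialogue: exactly one mutual eye contact, and only those two
    # persons look at someone.
    if len(eye_contacts) == 1 and lookers == set(next(iter(eye_contacts))):
        return "Dialogue"

    # Speech: at least one eye contact, and the most central person
    # exceeds threshold_factor * (number of looks - 1);
    # for the worked example this is 0.7 * (3 - 1) = 1.4.
    if eye_contacts and freq:
        threshold = threshold_factor * (len(looks_at) - 1)
        if max(freq.values()) > threshold:
            return "Speech"

    return "Other"
```

For instance, two people looking only at each other yield Dialogue, while one mutual contact plus several additional onlookers of the same person pushes the frequency count over the threshold and yields Speech.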
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Data</head><p>To address the lack of annotated datasets for the task at hand, three different datasets were considered for evaluation:</p><p>Emblematica: The Emblematica Online Collection<ref type="foot" target="#foot_5">7</ref> is a challenging dataset in this study. It consists of 1,388 facsimiles of emblem books from various libraries. Emblems within this collection typically comprise different components, such as a heading, one or more mottos, and a pictura, which will be classified by the approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>MWW Portraits:</head><p>The MWW Portraits collection<ref type="foot" target="#foot_6">8</ref> contains approximately 32,000 prints, including 6,000 duplicates, dating from the 16th to the mid-19th century. It encompasses various printmaking techniques and primarily features scholars, bourgeoisie, and portraits of theologians.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Jürg Straumann Artwork:</head><p>The dataset comprises digitized artworks by Jürg Straumann <ref type="foot" target="#foot_7">9</ref> .</p><p>These three datasets were selected because they represent non-photographic images related to the humanities and offer a diverse range of potential application scenarios due to their variations in content and characteristics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Evaluation</head><p>Due to the absence of labeled datasets, a quantitative evaluation could only be performed on a subset of the results. Additionally, it is important to note that the interpretation of the results may vary, as there is no definitive ground truth. It is crucial to consider the specific application scenario when assessing the correctness of the results.</p><p>For instance, in Figure <ref type="figure" target="#fig_2">2</ref> (a), an image depicts two dialogues: two individuals in the background engaged in a conversation and two individuals in the foreground. Although the image was classified as a dialogue by the system, the prediction might be seen as incorrect. However, if the main objective of the application scenario is to identify images with dialogues, allowing for the presence of more than two individuals, the classification may also be considered appropriate.</p><p>In such cases, it may be worthwhile to adjust the decision process of the algorithm by implementing a more suitable approach. This could involve reducing the threshold for the face probability, which would increase the recall but potentially lower the precision. It is important to strike a balance between maximizing recall and maintaining an acceptable level of precision, considering the potential trade-off of classifying non-face regions as faces.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Resulting image classification counts of the approach for the three investigated datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The results presented in Table <ref type="table">1</ref> indicate that the system classifies fewer than 50% of the images into one of the meaningful classes. This can be attributed to several factors.</p><p>Firstly, the incompleteness of the class set plays a significant role. Since the classes were generated for the purpose of this study and are not derived from a comprehensive or predefined set, there may be instances where the classes do not fully encompass the variations and nuances present in the images.</p><p>Secondly, the algorithm itself is designed to prioritize accuracy, aiming to provide reasonably precise classifications. This focus on precision may result in a higher number of images being classified as Other due to the conservative nature of the algorithm.</p><p>In the following, exemplary qualitative assessments of the results for the three datasets are given.</p><p>Emblematica When examining specific examples, such as those depicted in Figure <ref type="figure" target="#fig_2">2</ref>, the results appear to align quite well with the defined classes. For instance, when analyzing the first 30 images in alphabetical order of the image names classified as dialogues, 27 of them accurately reflect the definition, resulting in a true positive rate of 90%.</p><p>Applying the same evaluation method to the class Speech, the accuracy rate varies depending on the interpretation of Speech. If Speech is understood as the presence of a main actor with people looking at him or her, the accuracy rate is approximately 83% for the first 30 images. However, if Speech is interpreted according to the defined criteria, the accuracy rate may be lower at around 67%. Similarly, when considering the Other images, the results are comparable to those of the Speech class. 
Most of these images do not neatly fit into either of the two defined classes, and their classification may depend on the interpretation of Speech.</p><p>MWW The nature of portraits typically implies that only one person is prominently depicted, which aligns with the expectation that a majority of the images would be classified as Other. This is evident from the results presented in Table <ref type="table">1</ref>, where a larger proportion of the MWW Portraits dataset (69%) is classified as Other compared to the Emblematica dataset (52%).</p><p>Interestingly, the results for Dialogue and Speech in the MWW Portraits dataset still show fitting classifications, albeit dependent on the specific definition used. To provide a comprehensive view, Figure <ref type="figure" target="#fig_3">3</ref> displays exemplary results for all classes. This allows observers to form their own interpretation based on their own understanding and definition of the classes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Straumann</head><p>In the case of the Straumann artwork dataset, the face detection algorithm performs well, but the detection probabilities are often relatively low. A stringent reduction of the dataset takes place when the images are filtered by the face probability threshold. It is important to note that this reduction process was applied to all datasets for the purpose of comparison and to ensure consistent evaluation. Despite the reduction, the remaining images in the dialogue class, as exemplarily depicted in Figure <ref type="figure" target="#fig_3">3</ref>, demonstrate a good fit, indicating accurate classification. However, the results for the speech class are not as precise, which can be attributed to the unique nature of the artwork dataset. Interpreting scenes in artwork can be challenging, particularly for individuals without art-related expertise. Nevertheless, the correct classifications obtained for this specialized dataset are noteworthy, highlighting the potential for accurate scene recognition even in complex and nuanced scenarios. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Face Recognition Deployment for Labeling Unseen Data</head><p>This section focuses on the practical implementation of face recognition techniques for labeling unseen data and mapping datasets of persons with images that may have different naming conventions but represent the same individuals. The section is structured into three subsections:</p><p>The concept section provides an overview of the underlying principles and methodologies. In the data section, the training dataset, namely Wikidata, is discussed. Lastly, the evaluation section outlines the metrics and methodologies used to assess the performance of the proposed approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Concept</head><p>Due to the very good results of the face detection algorithm used in section 3, the corresponding face recognition algorithm would have been the first choice for this section as well. However, the high effort required to preprocess the images into the training format led us to choose the face recognition network developed by Adam Geitgey<ref type="foot" target="#foot_8">10</ref> as outlined in Section 2, which is superior in terms of ease of training and deployment. Based on the crawled person dataset explained later and shown on the left side of Figure <ref type="figure" target="#fig_4">4</ref>, the feature representation for each image was calculated, as shown on the right side of the figure. This representation of persons is then used to identify the best-matching faces in the dataset when presented with query images, as can be seen in Figure <ref type="figure" target="#fig_5">5</ref>. The comparison returns a list of calculated distances; the smaller the distance, the more similar the images tend to be. In the example image, the ice hockey player Kevin Gloor with the Wikidata ID Q1740152<ref type="foot" target="#foot_9">11</ref> has the smallest distance to the search image of Chester Bennington.</p></div>
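The gallery lookup described above can be sketched as follows, assuming the per-image feature representations have already been computed. The 3-dimensional toy vectors stand in for the real encodings, and apart from Q1740152 (which appears in Figure 5) the gallery entries are purely illustrative; function and variable names are our own.

```python
import numpy as np

def rank_candidates(query_encoding, gallery, k=10):
    """Given a query face encoding and a gallery mapping Wikidata IDs
    to encodings, return the k best-matching IDs with their distances,
    smallest distance (most similar) first."""
    ids = list(gallery)
    mat = np.array([gallery[i] for i in ids])
    distances = np.linalg.norm(mat - query_encoding, axis=1)
    order = np.argsort(distances)[:k]
    return [(ids[i], float(distances[i])) for i in order]

# Toy 3-d encodings instead of the real high-dimensional vectors;
# the gallery would hold one (or more) encodings per crawled person.
gallery = {"Q1740152": np.array([0.9, 0.1, 0.0]),
           "Q5284":    np.array([0.1, 0.8, 0.1]),
           "Q317521":  np.array([0.2, 0.2, 0.9])}
print(rank_candidates(np.array([0.15, 0.75, 0.1]), gallery, k=2))
```

Because the result is a full ranking rather than a yes/no decision, a mismatch like the Gloor/Bennington example above simply means the correct identity was absent or further down the list.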
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Data</head><p>The Wikidata<ref type="foot" target="#foot_10">12</ref> platform serves as a valuable starting point for this project. As stated on their website, Wikidata is a free and open knowledge base that allows both humans and machines to read and edit its structured data. Utilizing the whole Wikidata dump, a dataset comprising 1,109,006 human entities with 1,141,894 associated images was generated. The images for each person were crawled in descending order of the number of images available in Wikidata, to increase the chance of hitting a matching image if the person is included in the dataset.</p><p>The process began with the complete Wikidata JSON dump (the left database symbol in Figure <ref type="figure" target="#fig_4">4</ref>). This compressed file of about 110 GiB was filtered to include only the ID and image URL of entities representing humans, identified by the Q5 ID, with the P18 property denoting the presence of an image URL. Through the original Wikidata IDs retained in the resulting 80 MiB file, the process could later be extended, for instance by filtering for a specific time period using the original Wikidata dump.</p></div>
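The filtering step can be sketched per dump line as below. The claim structure (P31 → Q5 marking humans, P18 carrying the image file name) follows the documented Wikidata JSON dump format, where each entity sits on its own comma-terminated line; the helper name and the minimal error handling are our own assumptions.

```python
import json

def extract_human_with_image(line):
    """Filter one line of the Wikidata JSON dump: return (entity ID,
    image file name) for human entities (instance of Q5, property P31)
    that carry an image (property P18), else None."""
    line = line.strip().rstrip(",")   # dump lines end with a comma
    if line in ("[", "]", ""):        # skip the surrounding array brackets
        return None
    entity = json.loads(line)
    claims = entity.get("claims", {})
    instance_of = {
        c["mainsnak"]["datavalue"]["value"]["id"]
        for c in claims.get("P31", [])
        if c["mainsnak"].get("datavalue")
    }
    if "Q5" not in instance_of:
        return None
    for c in claims.get("P18", []):
        if c["mainsnak"].get("datavalue"):
            # P18 holds a Wikimedia Commons media file name as a string.
            return entity["id"], c["mainsnak"]["datavalue"]["value"]
    return None
```

Streaming the decompressed dump through this function line by line keeps memory usage flat, which matters at the ~110 GiB scale mentioned above.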
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Evaluation</head><p>The analysis primarily focused on a qualitative assessment to evaluate the performance and limitations of the deployed network. This involved testing both images gathered from the internet and non-photographic images sourced from the MWW portrait dataset<ref type="foot" target="#foot_11">13</ref> to pseudo-quantitatively examine different types of images.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Photo Detection: Bill Gates</head><p>To assess the qualitative precision and recall of the face recognition architecture, a person from the crawled dataset with multiple images in the Wikidata dataset was selected for analysis. In this case, Bill Gates was chosen. The four images depicting him in Wikidata can be found in Figure <ref type="figure" target="#fig_7">6</ref> on the left side. To challenge the face recognition network, ten images with various differences from the ground truth images were selected as query images<ref type="foot" target="#foot_12">14</ref> . The differences included wearing sunglasses, wearing no glasses, different mouth positions during speech, high contrast due to sunlight, a gray-scale image, and even an image of Bill Gates Sr. to check whether the similarity of relatives can be detected as well.</p><p>The network successfully identified Bill Gates among the top 10 similar faces for five out of the ten query images. Additionally, for eight out of the ten images, the network detected Bill Gates within the top 100 similar faces. Only two images posed greater challenges for the network: the first was the image of Bill Gates Sr., where difficulty was expected given that the facial similarity stems only from the familial relationship; the second was the image in which he is depicted with Melinda under high-contrast conditions. Interestingly, Melinda herself was recognized with the smallest distance among all faces in that particular image. The images of Bill Gates with sunglasses and without glasses yielded similar results, with the FR network identifying faces containing different types of glasses or no glasses at all. The gray-scale image predominantly generated results containing other gray-scale images, while the image of Bill Gates Sr. resulted in visually similar depictions of older men. 
These outcomes align with common assumptions about the network's ability to recognize facial features and characteristics despite variations in accessories, image color, and familial resemblance. These results highlight the capabilities and limitations of the FR network in accurately identifying and matching faces.</p></div>
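The top-k checks used in this evaluation (a hit within the top 10 or top 100 candidates, aggregated over the query images) can be expressed with a small helper of the following shape; the function names and the toy rankings are our own illustration.

```python
def hit_at_k(ranked_ids, true_id, k):
    """True if the correct identity appears among the first k candidates
    of one ranked result list."""
    return true_id in ranked_ids[:k]

def hit_rate(all_rankings, true_id, k):
    """Fraction of query images whose ranking contains the person
    within the top k candidates."""
    hits = sum(hit_at_k(ranking, true_id, k) for ranking in all_rankings)
    return hits / len(all_rankings)

# Three hypothetical queries for the same person "Q42", each returning
# a ranked list of candidate IDs (most similar first).
rankings = [["Q42", "Q2"], ["Q2", "Q42"], ["Q2", "Q3"]]
print(hit_rate(rankings, "Q42", 1))  # only the first ranking is a top-1 hit
```

Reporting several values of k at once, as done above with the top 10 and top 100, separates "the network never finds the person" from "the person is found, but ranked too low".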
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Non-photographic Dataset: MWW Portraits</head><p>To evaluate the performance of the face recognition system on the MWW portrait dataset, the ten most prominent persons (those most frequent in the MWW portrait dataset) were identified, and the corresponding depicted images were manually mapped to their respective Wikidata IDs, resulting in eight personalities with over 450 test images for evaluation. Unfortunately, despite expanding the search space to the 500 closest matching faces when searching the Wikidata images, not a single portrait image was correctly classified. The performance issues of the chosen face recognition algorithm on the MWW portraits can also be seen in Figure <ref type="figure" target="#fig_0">1</ref>, where it is compared to the InsightFace approach, leading to the conclusion that alternative face recognition algorithms may offer better results in recognizing faces from non-photographic datasets. The complexity involved in generating training datasets with the original implementation, which had led to the use of this FR algorithm, could be addressed by third-party implementations of InsightFace, yielding an easier-to-train solution.</p><p>An alternative approach could involve retraining or replacing the relevant components of the face_recognition library. Further investigation into the dlib library reveals that feature comparison is performed using a stored neural network. Because the documentation does not provide detailed insights into this process, and numerous alternatives exist, exploring other options could more easily yield improved results.</p><p>The results presented in this section suggest that the recognition of people in photos works well enough to support an annotation process. However, the library used was not convincing on drawings and engravings, for example. 
Unfortunately, the original InsightFace library used in Section 3 could not be used in this scenario for the technical reasons mentioned above. However, the results of InsightFace shown in Figure <ref type="figure" target="#fig_0">1</ref>, and in the scenario considered in Section 3, already demonstrate clear potential for non-photographic images given suitable libraries.</p></div>
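One concrete difference when swapping in an InsightFace-style model is the comparison metric: dlib and face_recognition compare 128-dimensional encodings by Euclidean distance, whereas ArcFace-style embeddings, as produced by InsightFace models, are typically compared by cosine similarity. A sketch of that comparison, with synthetic 512-dimensional vectors standing in for real embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity, the comparison typically used for ArcFace-style
    embeddings such as those produced by InsightFace models."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# With the insightface package, embeddings would come from something like
#   app = insightface.app.FaceAnalysis(); app.prepare(ctx_id=-1)
#   embedding = app.get(image)[0].embedding  # assumes one detected face
# Synthetic 512-d vectors stand in for real embeddings here.
rng = np.random.default_rng(1)
person_a = rng.normal(size=512)
person_a_again = person_a + rng.normal(scale=0.1, size=512)  # another depiction
person_b = rng.normal(size=512)
same = cosine_similarity(person_a, person_a_again)
different = cosine_similarity(person_a, person_b)
print(same > different)  # the matching pair scores higher
```

Because the two metrics produce incomparable score ranges, any replacement of the comparison component would also require re-tuning the match threshold on a validation set.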
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Future Research</head><p>The research scenarios presented in the previous sections were designed to demonstrate the applicability of face detection and recognition methods in applications from the humanities. During the investigations, it turned out that the potential usefulness of the methods can additionally be demonstrated by a scenario that emerged in discussions with researchers from the social sciences: the analysis of images crawled from school websites. In this scenario, the first stage would involve checking the images for the presence of faces in order to filter for relevant images. Subsequently, face recognition could be employed to identify unique faces and, with additional effort, obtain frequencies for age, gender, and potentially ethnicity. The images could then be classified into categories such as "class pictures", "portraits" or "events" using an approach similar to the one applied in Section 3, providing further insights into the dataset and the way schools present themselves on the web.</p><p>This scenario underscores the importance of user-friendly tools for researchers in the humanities and social sciences. Furthermore, it emphasizes the need for continuous improvement and refinement of the methodologies employed.</p><p>In the following sections, we outline some potential enhancements that could be implemented to improve the two methodologies.</p></div>
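In the school-website scenario, the step from detected faces to frequency statistics requires deduplicating the same person across many images. A minimal sketch of greedy deduplication over face encodings; the 0.6 threshold is the conventional face_recognition match threshold, and the synthetic vectors are placeholders for real encodings:

```python
import numpy as np

def count_unique_faces(encodings, threshold=0.6):
    """Greedy deduplication: a face counts as a new person if its encoding is
    farther than `threshold` from every representative collected so far."""
    representatives = []
    for enc in encodings:
        if all(np.linalg.norm(enc - rep) > threshold for rep in representatives):
            representatives.append(enc)
    return len(representatives)

# Synthetic stand-ins: three distinct "persons", two of them seen twice.
rng = np.random.default_rng(2)
base = rng.normal(size=(3, 128))  # three well-separated identities
encodings = [base[0], base[0] + rng.normal(scale=0.01, size=128),
             base[1], base[1] + rng.normal(scale=0.01, size=128),
             base[2]]
print(count_unique_faces(encodings))  # 3
```

Greedy matching is order-dependent; for a real study, a proper clustering method over the encodings would give more stable person counts.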
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Face Recognition for Data Labeling</head><p>One potential avenue for future research is to explore alternative face recognition tools and techniques. Specifically, investigating networks that are trained on multiple images of the same person could improve the usability and accuracy of the face recognition process. Additionally, incorporating methods developed under the InsightFace project, which have demonstrated better performance on non-photographic datasets, particularly for face detection, could further enhance the applicability of the face recognition component.</p><p>Furthermore, for a real-world application scenario involving the integration of metadata from Wikidata, a filtering mechanism based on additional criteria, such as the years in which the dataset's content was generated, could be beneficial.</p></div>
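The metadata filter mentioned above could, for instance, discard candidate persons whose documented lifetime does not overlap the creation year of the image. A sketch under the assumption that birth and death years (e.g. from the Wikidata properties P569 and P570) are available as plain integers; the candidate records below are hypothetical:

```python
def plausible_candidates(candidates, creation_year):
    """Keep candidates whose lifetime overlaps the image's creation year.
    Missing years are treated permissively; posthumous depictions would
    need a looser rule than this simple overlap check."""
    keep = []
    for person in candidates:
        born = person.get("birth_year")
        died = person.get("death_year")
        if born is not None and born > creation_year:
            continue  # not yet born when the image was created
        if died is not None and died < creation_year:
            continue  # no longer alive (excludes posthumous depictions)
        keep.append(person)
    return keep

candidates = [  # hypothetical records, not actual Wikidata entries
    {"id": "Q1", "birth_year": 1749, "death_year": 1832},
    {"id": "Q2", "birth_year": 1955, "death_year": None},
    {"id": "Q3", "birth_year": 1889, "death_year": 1951},
]
print([p["id"] for p in plausible_candidates(candidates, 1800)])  # ['Q1']
```

Such a filter would shrink the candidate list before the face comparison runs, which both speeds up retrieval and removes anachronistic false matches.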
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Scene Recognition</head><p>A potential improvement for scene recognition could be to leverage the generated ground truth and perform a manual annotation process to specifically train a neural network. Using this ground truth data, it should be possible to train a network that achieves even better classification performance.</p><p>Additionally, there is the opportunity to optimize the approach by further refining the heuristic algorithm for determining the classes. Taking into account the relative sizes of faces in an image could provide additional contextual information for scene classification. Furthermore, refining the algorithm's ability to accurately determine the gaze direction of persons could contribute to more precise scene interpretations.</p><p>Overall, the topics explored in this study and the results obtained offer valuable insights and inspire potential research questions. It is noteworthy that, with a network capable of comparing an input image to over a million images, the detection of a person within the top-100 results is achieved in 8 out of 10 instances. This shows the potential for exciting application scenarios.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1:</head><label>1</label><figDesc>Figure 1: Examples of the MWW Portrait dataset and Bill Gates images with the detected faces generated by the two face detection algorithms used (face_recognition above with blue markings, InsightFace below with red markings).</figDesc><graphic coords="3,351.21,135.30,50.00,50.00" type="bitmap" /></figure>
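The refinement suggested in Section 5.2, taking the relative sizes of faces into account, could be sketched as follows. The class names and thresholds are purely illustrative assumptions, not the ones used in this study; the face boxes follow the (top, right, bottom, left) order returned by face_recognition.face_locations:

```python
def classify_scene(face_boxes, image_width, image_height):
    """Toy heuristic: combine the number of detected faces with the relative
    size of the largest face. Thresholds and class names are illustrative."""
    if not face_boxes:
        return "no_person"
    image_area = image_width * image_height
    # (top, right, bottom, left) box order, as in face_recognition.face_locations
    areas = [((bottom - top) * (right - left)) / image_area
             for (top, right, bottom, left) in face_boxes]
    largest = max(areas)
    if len(face_boxes) == 1:
        return "portrait" if largest > 0.05 else "person_in_scene"
    return "dialog" if largest > 0.02 else "group_scene"

# One large face covering ~25% of a 1000x1000 image:
print(classify_scene([(200, 700, 700, 200)], 1000, 1000))  # portrait
```

The illustrative thresholds would have to be fitted to the annotated ground truth mentioned above, and gaze direction could enter as an additional feature in the multi-face branch.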
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2:</head><label>2</label><figDesc>Figure 2: Exemplary results of the scene classification algorithm for both target classes from the Emblematica dataset.</figDesc><graphic coords="7,258.05,527.10,79.17,79.17" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3:</head><label>3</label><figDesc>Figure 3: Exemplary results of the scene classification algorithm for all classes from the MWW Portrait (a-g) and Straumann Artwork (h-m) datasets.</figDesc><graphic coords="8,435.11,187.33,66.67,50.33" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 4:</head><label>4</label><figDesc>Figure 4: Training process for face recognition trained on Wikidata.</figDesc><graphic coords="9,89.29,84.19,416.69,124.68" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_5"><head>Figure 5:</head><label>5</label><figDesc>Figure 5: Recognition process for face recognition trained on Wikidata, showing the results.</figDesc><graphic coords="10,89.29,84.19,416.67,73.50" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_7"><head>Figure 6:</head><label>6</label><figDesc>Figure 6: Ground truth images (from Wikidata) and query images of Bill Gates for the evaluation of the FR algorithm.</figDesc><graphic coords="11,89.29,135.22,50.00,50.00" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">https://insightface.ai/ (all URLs accessed in July 2023)</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://pypi.org/project/face-recognition/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3">http://vis-www.cs.umass.edu/lfw/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">https://github.com/SthPhoenix/InsightFace-REST</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">https://www.oed.com/search?searchType=dictionary&amp;q=dialog&amp;_searchBtn=Search</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">https://www.oed.com/view/Entry/186128</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_5">https://emblematica.grainger.illinois.edu/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_6">https://vfr.mww-forschung.de/portraetsammlung</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_7">https://juergstraumann.ch</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_8">https://github.com/ageitgey/face_recognition</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_9">https://www.wikidata.org/wiki/Q1740152</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_10">https://www.wikidata.org/wiki/Wikidata:Main_Page</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="13" xml:id="foot_11">https://vfr.mww-forschung.de/portraetsammlung</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="14" xml:id="foot_12">https://www.google.com/search?q=bill+gates&amp;tbm=isch</note>
		</body>
		<back>
		</back>
	</text>
</TEI>
