Are 1,000 Features Worth A Picture? Combining Crowdsourcing and Face Recognition to Identify Civil War Soldiers

Vikram Mohanty, David Thames, Kurt Luther
Department of Computer Science and Center for Human-Computer Interaction, Virginia Tech, Arlington, VA, USA

Abstract

We introduce a web-based platform called Civil War Photo Sleuth for helping users identify unknown soldiers in portraits from the American Civil War era. Our system employs a novel person identification pipeline that leverages the complementary strengths of crowdsourced human vision and face recognition algorithms.

Introduction

The American Civil War (1861–1865) was the first major conflict to be extensively photographed, with the images widely displayed and sold in large quantities. Around 4,000,000 soldiers fought in the war, and most of them were photographed at least once. After 150 years, thousands of these photographs have survived, but most of the soldiers' identities have been lost. Identifying people in historical photos is important for preserving material culture (Martinez 2012), correcting the historical record (Schmidt 2016), and recognizing the contributions of marginalized groups (Fortin 2018), among other reasons.

The current research methods employed by historians, genealogists, and collectors largely involve manually scanning through hundreds of low-resolution photographs, military records, and reference books, and can often be tedious and frustrating. To help these researchers identify Civil War portraits, we built a web platform called Civil War Photo Sleuth.

Identifying a Civil War soldier photo, like many person identification tasks, requires a combination of face recognition and analysis of visual clues about the person's body, clothing, accoutrements, and other contextual details. Automated face recognition algorithms are helpful for this process, but not sufficient, for several reasons. Studies of these algorithms often compare them against a human baseline, and many show humans outperforming the algorithms (Blanton et al. 2016; Kemelmacher-Shlizerman et al. 2016; Zhao et al. 2003). Furthermore, historical photos create real-world challenges for algorithms because they are often achromatic, low resolution, and faded or damaged, impeding the detection of facial landmarks. These algorithms also ignore relevant facial features like scars or other skin characteristics, as well as distinctive non-facial features like ear shape or facial hair styles.

Still, these algorithms can often narrow down the search space from a large candidate pool to a smaller one that contains the correct matching photo (i.e., no false negatives) at the cost of many similar-looking photos (i.e., many false positives). This brings us to the "last mile" of person identification, i.e., helping a user pick the correct match from a set of very similar-looking photos suggested by the algorithm.

This paper attempts to address the "last mile" problem by leveraging the strengths of the human vision system, via crowdsourcing, to complement those of face recognition algorithms. Specifically, we address the following questions:

• How well can crowds identify a person from a set of very similar-looking photos?

• How can the complementary strengths of crowds and face recognition algorithms be combined to support person identification?

• How can an interface design help a user interpret the complementary information provided by crowds and algorithms to correctly identify a person?

In this project, we scope these questions to focus on identifying American Civil War soldiers using the Civil War Photo Sleuth platform.

Related Work

Our work draws on concepts from both artificial intelligence and cognitive science to create a novel person identification pipeline. Kumar et al. (2011) introduce the concept of describable visual attributes for face recognition. We use these attributes, which have the advantage of being generalizable and human-interpretable, to help novice crowds systematically distinguish facial features.

The Feature Contrast Model (FCM) (Tversky 1977) proposes that similarity between two objects increases with the addition of common features and the deletion of distinctive features (i.e., features belonging to one object and not the other).
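For concreteness, Tversky's contrast model is commonly stated as follows, where A and B are the feature sets of the two objects, f is a salience measure (often simple set cardinality), and θ, α, β are nonnegative weights. This is the standard formulation from Tversky (1977), not notation introduced in this paper:

```latex
S(a, b) = \theta\, f(A \cap B) \;-\; \alpha\, f(A \setminus B) \;-\; \beta\, f(B \setminus A),
\qquad \theta, \alpha, \beta \ge 0
```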
In addition, the extension effect suggests that features shared by some objects in the candidate pool, but not all, have higher diagnostic value and increase the similarity between the objects having these features (Tversky 1977). Gentner and Markman (1997) show that when comparing two objects, differences are more salient in high-similarity pairs than in low-similarity ones. If the differences are alignable, meaning that matching relations have matching arguments and any element in one representation corresponds to only one matching element in the other representation, then they decrease similarity more than non-alignable differences do. Our novel pipeline leverages these cognitive science concepts to help crowds and algorithms work together.

Flock (Cheng and Bernstein 2015) is an interactive machine learning platform that uses crowdsourcing to nominate features and labels for training hybrid crowd-machine learning classifiers. Tropel (Patterson et al. 2015) creates visual classifiers from limited training examples using crowdsourcing. A person identification task, however, cannot be treated as a multi-label classification problem because of scalability and complexity issues. Since both Flock and Tropel require a user to define the prediction task and example data with labels, we cannot directly apply these approaches to person identification.

Civil War Photo Sleuth

Base System

The proposed person identification pipeline is built on the foundation of Civil War Photo Sleuth (http://www.civilwarphotosleuth.com/), a website we developed for sharing and discussing Civil War-era portraits. The site is equipped with basic features such as the ability to upload and tag photos, and it allows a user to connect photos to profiles of Civil War soldiers with detailed military records. Currently, the site hosts over 15,000 identified Civil War soldier portraits and military service records aggregated from multiple public sources, such as the US Military History Institute (USAHEC 2018) and the US National Park Service Soldiers & Sailors Database (NPS 2018).

User Tags

The person identification process begins with the user uploading an unidentified portrait to Civil War Photo Sleuth, which simultaneously adds the photo to the reference database to support future photo identifications. Thereafter, the user adds tags based on visual clues about the uniform, insignia, equipment, and weapons. Our initial user base is targeted towards history enthusiasts with some familiarity with these categories.

Filter Suggestions

The system then draws upon encoded domain knowledge of Civil War portraits to generate search filters based on the user-provided tags, as sketched below. For example, if the user tagged a hat insignia of crossed swords and a shell jacket, the system will recommend "Cavalry," and tagging the coat color as dark would add another search filter for "Union Army." These filters leverage military records to significantly narrow down the search results pool.
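The paper does not give implementation details for these tag-to-filter rules; the following is a minimal sketch of how such encoded domain knowledge might look. The rule table, its entries beyond the paper's Cavalry/Union example, and the suggest_filters helper are illustrative assumptions, not the deployed system.

```python
# Illustrative sketch only: the rule table and helper below are our
# assumptions, not Civil War Photo Sleuth's actual implementation.

# Each rule maps a set of user-provided visual tags to a search filter
# backed by military records.
FILTER_RULES = [
    ({"crossed swords insignia", "shell jacket"}, "branch: Cavalry"),
    ({"dark coat"}, "side: Union Army"),
    ({"crossed cannons insignia"}, "branch: Artillery"),  # hypothetical rule
]

def suggest_filters(user_tags):
    """Return every search filter whose trigger tags are all present."""
    tags = set(user_tags)
    return [f for trigger, f in FILTER_RULES if trigger <= tags]

print(suggest_filters(["crossed swords insignia", "shell jacket", "dark coat"]))
# ['branch: Cavalry', 'side: Union Army']
```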
Face Recognition

The current prototype employs Microsoft's Face Recognition API (Microsoft 2018) to scour through this filtered search pool and generate a set of candidates with faces highly similar to the unknown soldier (see Figure 1). Our initial tests show that when the similarity confidence threshold parameter is set to 0.50 with Civil War portraits, the API yields poor precision but near-perfect recall. This implies that the correct result is almost always present in the search results, bringing us to the "last mile" problem.

[Figure 1: Photo Sleuth software prototype showing real face recognition results from the Microsoft Face API. Note that due to image quality differences, even an exact duplicate (Thomas Whiting) is not the top result.]
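As a sketch of this recall-oriented filtering step, suppose a face_similarity(a, b) function wraps whatever face recognition service is in use; it is an assumed stand-in for the similarity confidence the Microsoft Face API returns, not the actual SDK call. Keeping every candidate at or above the 0.50 threshold trades precision for near-perfect recall, producing the "last mile" pool:

```python
# Sketch of the high-recall filtering step. `face_similarity` is an
# assumed interface standing in for a face recognition service call.
SIMILARITY_THRESHOLD = 0.50  # yields near-perfect recall on Civil War portraits

def last_mile_pool(unknown_photo, candidates, face_similarity):
    """Keep every candidate the recognizer scores at or above the threshold.

    The result typically still contains many look-alikes (false
    positives), which the crowdsourcing pipeline then disambiguates.
    """
    scored = [(c, face_similarity(unknown_photo, c)) for c in candidates]
    return sorted(
        [(c, s) for c, s in scored if s >= SIMILARITY_THRESHOLD],
        key=lambda pair: pair[1],
        reverse=True,
    )
```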
Novel Crowdsourced Pipeline

Our crowdsourcing pipeline helps the user find the correct match from a pool of similar-looking people by performing fine-grained photo analysis.

Feature Selection

We perform fine-grained pairwise analysis by capturing information about features according to the cognitive science models described above. Based on these models, we classify the features into two types:

1. Alignable Differences. Building on the idea that differences are salient in this similar-looking candidate pool, the system captures information for a pre-determined feature list. We modified a subset of the attributes used by Kumar et al. (2011) for the purposes of this project, and we term these attributes high-level features. Examples include "hair," "eyebrow," "eyes," "nose," "mouth," and "chin/jaw." For each of these high-level features, we add possible common low-level tags in an ad hoc manner; e.g., "hair" can be "receding," "straight," "short," etc.

2. Unique Similarities. Since unique features of high diagnostic value increase similarity between objects, the system allows users to input features that may be uniquely distinctive for the unknown photo. The pipeline then captures information about the presence of these features in the search pool. Examples include "no right hand," "muttonchops facial hair style," "baldness," etc.

Crowd Interface

The system launches crowdsourcing tasks on Amazon Mechanical Turk such that three crowd workers consider each pair of photos. The crowd interface shows the worker the unknown photo alongside another photo from the search pool and asks which of the high-level features are similar or different in the two photos. For the features selected as different, the interface asks for the low-level tags associated with each photo. For example, if a crowd worker selects "hair" as different, the system asks which of the low-level tags for "hair" (e.g., "curly," "receding," "straight," "full," "long," etc.) apply to each photo. Since the worker judged "hair" to be different, at least one of the tags should differ between the photos, justifying the decision. The worker repeats this comparison for all of the differing features.

The system then asks the crowd worker about the presence of the uniquely distinctive features in the other photo. After comparing all the features, the crowd worker makes an overall judgment about the similarity of the people in the two photos on a four-point Likert scale.
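The paper states that the three workers' responses are aggregated into crowd scores but does not specify the aggregation rule. One plausible sketch, under our own assumptions, averages the four-point Likert judgments and tallies the features marked as different; the field names and scoring scheme below are illustrative, not the system's actual design.

```python
# Sketch of aggregating three workers' pairwise responses into a crowd
# score. The averaging scheme and response fields are our assumptions.
from statistics import mean

def aggregate_pair(responses):
    """responses: one dict per worker, e.g.
    {"likert": 3,                       # 1 (different) .. 4 (same person)
     "different_features": {"hair", "nose"},
     "unique_feature_present": True}    # distinctive feature seen in candidate
    """
    score = mean(r["likert"] for r in responses)      # overall crowd judgment
    diff_counts = {}                                  # per-feature difference tallies
    for r in responses:
        for feat in r["different_features"]:
            diff_counts[feat] = diff_counts.get(feat, 0) + 1
    unique_votes = sum(r["unique_feature_present"] for r in responses)
    return {"crowd_score": score,
            "difference_distribution": diff_counts,
            "unique_similarity_votes": unique_votes}
```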
Search Interface

The search interface shows the user (1) the final aggregated crowd scores next to each photo in the original search pool, and (2) the search results sorted by these scores. The user can also perform a fine-grained analysis of one photo at a time by checking the distribution of aggregated differences, along with the tags provided by the crowd workers. The user also sees the presence or absence of the high-diagnostic-value features.

The system additionally provides the option of filtering search results by the high-level features, and of sorting by the smallest differences compared to the unknown soldier. This accords with the theory that, within a set of very similar objects, the "more similar" ones are those with few alignable differences and shared features of high diagnostic value.
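As a sketch of this ranking logic, assuming each candidate carries the aggregate produced in the earlier sketch, one could sort by crowd score and break ties by fewest alignable differences and then by unique-similarity votes. The tie-breaking priority is our assumption; the paper lists the available sort and filter options without fixing an order.

```python
# Sketch of the search interface's ranking and filtering, assuming the
# aggregate dict from the previous sketch. The tie-breaking order is
# our assumption.
def rank_candidates(candidates):
    """candidates: list of dicts with keys 'crowd_score',
    'difference_distribution', and 'unique_similarity_votes'."""
    return sorted(
        candidates,
        key=lambda c: (
            -c["crowd_score"],                           # higher crowd score first
            sum(c["difference_distribution"].values()),  # fewer differences next
            -c["unique_similarity_votes"],               # then unique similarities
        ),
    )

def filter_by_feature(candidates, feature):
    """Hide candidates the crowd marked as differing on `feature`."""
    return [c for c in candidates if feature not in c["difference_distribution"]]
```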
Preliminary Results

We conducted a pilot study to measure how crowds perform on a pairwise photo comparison task. Aggregated crowd scores for four unknown soldiers suggested that the initial search pool of six photos per soldier (the correct matching photo plus five similar-looking photos) could be narrowed down to a smaller pool of three photos per soldier, with the correct matching photo scoring highest in two of those cases. These results support our hypothesis that crowds can further filter the initial pool of similar-looking photos.

We further validated the use of a prior high-level feature list by asking crowds to nominate features that justified their comparison decisions in a pairwise analysis. We collected 216 feature responses, and our post hoc analysis found that they fell into 17 feature categories. Considering only the facial features among these categories, they overlapped with our high-level feature list. This justifies the use of a prior, system-provided feature list, since there is no apparent loss of information. A prior feature list also offers a speed advantage, as we can employ a "yes/no" line of questioning rather than free-form text inputs for capturing feature-related information.

Future Work

We are currently planning several studies to address the original research questions. The first study will examine how well the aggregated crowd scores work. We will compare them with the ground truth and check the average rank of the correct matching photo when the search results are sorted by these scores. Further, we will evaluate the performance of the crowd scores by testing whether a threshold can be established that narrows down the original search pool, and we will measure how the "crowd + face recognition" system differs from the original search pool.

A second study will examine whether the crowd decisions and the feature responses are correlated. Here we plan to assess the effectiveness of alignable differences and unique similarities in contributing to the final decision, and to correlate them with the ground truth. We will also perform a qualitative evaluation of the responses.

A third study will evaluate the user's interaction with the overall system. We will compare the user's success rate in correctly identifying matches using only the face recognition search results versus the "crowd + face recognition" system. In addition, we will compare the percentage of search results the user has to scour through in each system before making a final decision. Further, we will evaluate the effectiveness of the feature information by checking how often the user refers to the fine-grained pairwise analysis; here we will count the cases in which the user relies on the distribution of differences to make a final decision, and how often that decision is correct. We plan a similar evaluation for the presence of unique similarities.

Finally, we will evaluate how our proposed system compares against the user's current, manual identification methods, in terms of the time taken and the success rate for correctly finding a match.

Conclusion

Civil War Photo Sleuth's hybrid crowd + face recognition pipeline attempts to address the "last mile" problem in person identification, on a dataset of historical photographs that presents both cultural value and technical challenges. Since the pipeline has the flexibility of being data-agnostic, our hybrid approach may also generalize to other domains where person identification is relevant, such as journalism and criminal investigation. At the same time, our work opens doors to exploring new ways to leverage the strengths of the human vision system to complement the power of an AI system in complex image analysis tasks.

Acknowledgements

We wish to thank Ron Coddington and Paul Quigley for historical expertise, and Sneha Mehta, Nam Nguyen, and Abby Jetmundsen for early prototyping. This research was supported by NSF CAREER-1651969.

References

Blanton, A.; Allen, K. C.; Miller, T.; Kalka, N. D.; and Jain, A. K. 2016. A comparison of human and automated face verification accuracy on unconstrained image sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 161–168.

Cheng, J., and Bernstein, M. S. 2015. Flock: Hybrid crowd-machine learning classifiers. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, 600–611. ACM.

Fortin, J. 2018. She Was the Only Woman in a Photo of 38 Scientists, and Now She's Been Identified. The New York Times.

Gentner, D., and Markman, A. B. 1997. Structure mapping in analogy and similarity. American Psychologist 52(1):45.

Kemelmacher-Shlizerman, I.; Seitz, S. M.; Miller, D.; and Brossard, E. 2016. The MegaFace benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4873–4882.

Kumar, N.; Berg, A.; Belhumeur, P. N.; and Nayar, S. 2011. Describable visual attributes for face verification and image search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(10):1962–1977.

Martinez, R. 2012. Unknown No More: Identifying a Civil War Soldier. NPR.org.

Microsoft. 2018. Face API - Facial Recognition Software | Microsoft Azure. https://azure.microsoft.com/en-us/services/cognitive-services/face/.

NPS. 2018. Soldiers and Sailors Database - The Civil War (U.S. National Park Service). https://www.nps.gov/subjects/civilwar/soldiers-and-sailors-database.htm.

Patterson, G.; Van Horn, G.; Belongie, S. J.; Perona, P.; and Hays, J. 2015. Tropel: Crowdsourcing detectors with minimal training. In HCOMP, 150–159.

Schmidt, M. S. 2016. 'Flags of Our Fathers' Author Now Doubts His Father Was in Iwo Jima Photo. The New York Times.

Tversky, A. 1977. Features of similarity. Psychological Review 84(4):327.

USAHEC. 2018. MOLLUS-MASS Civil War Photograph Collection. http://cdm16635.contentdm.oclc.org/cdm/landingpage/collection/p16635coll12.

Zhao, W.; Chellappa, R.; Phillips, P. J.; and Rosenfeld, A. 2003. Face recognition: A literature survey. ACM Computing Surveys (CSUR) 35(4):399–458.