Are 1,000 Features Worth A Picture? Combining Crowdsourcing and Face Recognition to Identify Civil War Soldiers

Vikram Mohanty, David Thames, Kurt Luther
Department of Computer Science and Center for Human-Computer Interaction, Virginia Tech, Arlington, VA, USA

Abstract

We introduce a web-based platform called Civil War Photo Sleuth for helping users identify unknown soldiers in portraits from the American Civil War era. Our system employs a novel person identification pipeline that leverages the complementary strengths of crowdsourced human vision and face recognition algorithms.

Introduction

The American Civil War (1861–1865) was the first major conflict to be extensively photographed, with the images widely displayed and sold in large quantities. Around 4,000,000 soldiers fought in the war, and most of them were photographed at least once. After 150 years, thousands of these photographs have survived, but most of the soldiers' identities have been lost. Identifying people in historical photos is important for preserving material culture (Martinez 2012), correcting the historical record (Schmidt 2016), and recognizing the contributions of marginalized groups (Fortin 2018), among other reasons.

The current research methods employed by historians, genealogists, and collectors largely involve manually scanning through hundreds of low-resolution photographs, military records, and reference books, and can often be tedious and frustrating. To help these researchers identify Civil War portraits, we built a web platform called Civil War Photo Sleuth.

Identifying a Civil War soldier photo, like many person identification tasks, requires a combination of face recognition and analysis of visual clues about the person's body, clothing, accoutrements, and other contextual details. Automated face recognition algorithms are helpful for this process, but not sufficient, for several reasons. Studies of these algorithms often compare them against a human baseline, and many show humans outperforming the algorithms (Blanton et al. 2016; Kemelmacher-Shlizerman et al. 2016; Zhao et al. 2003). Furthermore, historical photos create real-world challenges for algorithms because they are often achromatic, low resolution, and faded or damaged, impeding the detection of facial landmarks. These algorithms also ignore relevant facial features like scars or other skin characteristics, as well as distinctive non-facial features like ear shape or facial hair styles.

Still, these algorithms can often narrow down the search space from a large candidate pool to a smaller one that contains the correct matching photo (i.e., no false negatives) at the cost of many similar-looking photos (i.e., many false positives). This brings us to the "last mile" of person identification, i.e., helping a user pick the correct match from a set of very similar-looking photos suggested by the algorithm.

This paper attempts to address the "last mile" problem by leveraging the strengths of the human vision system, via crowdsourcing, to complement those of face recognition algorithms. Specifically, we address the following questions:

• How well can crowds identify a person from a set of very similar-looking photos?

• How can the complementary strengths of crowds and face recognition algorithms be combined to support person identification?

• How can an interface design help a user interpret the complementary information provided by crowds and algorithms to correctly identify a person?

In this project, we scope these questions to focus on identifying American Civil War soldiers using the Civil War Photo Sleuth platform.

Related Work

Our work draws on concepts from both artificial intelligence and cognitive science to create a novel person identification pipeline. Kumar et al. (2011) introduce the concept of describable visual attributes for face recognition. We use these attributes, which have the advantage of being generalizable and human-interpretable, to help novice crowds systematically distinguish facial features.

The Feature Contrast Model (FCM) (Tversky 1977) proposes that similarity between two objects increases with the addition of common features and the deletion of distinctive features (i.e., features belonging to one object and not the other).
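For concreteness, Tversky's contrast model is commonly stated as follows, where A and B are the feature sets of the two objects, f is a salience measure (often simple set cardinality), and θ, α, β are nonnegative weights. This is the standard formulation from Tversky (1977), not notation introduced in this paper:

```latex
S(a, b) = \theta\, f(A \cap B) \;-\; \alpha\, f(A \setminus B) \;-\; \beta\, f(B \setminus A),
\qquad \theta, \alpha, \beta \ge 0
```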
In addition, the extension effect suggests that features shared by some objects in the candidate pool, but not all, have higher diagnostic value and increase the similarity between the objects having these features (Tversky 1977). Gentner and Markman (1997) show that when comparing two objects, differences are more salient in high-similarity pairs than in low-similarity ones. If the differences are alignable, meaning that matching relations have matching arguments and any element in one representation corresponds to only one matching element in the other representation, then they decrease similarity more than non-alignable differences do. Our novel pipeline leverages these cognitive science concepts to help crowds and algorithms work together.

Flock (Cheng and Bernstein 2015) is an interactive machine learning platform that uses crowdsourcing to nominate features and labels for training hybrid crowd-machine learning classifiers. Tropel (Patterson et al. 2015) creates visual classifiers from limited training examples using crowdsourcing. A person identification task, however, cannot be treated as a multi-label classification problem because of scalability and complexity issues. Since both Flock and Tropel require a user to define the prediction task and example data with labels, we cannot directly apply these approaches to person identification.

Civil War Photo Sleuth

Base System

The proposed person identification pipeline is built on the foundation of Civil War Photo Sleuth (http://www.civilwarphotosleuth.com/), a website we developed for sharing and discussing Civil War-era portraits. The site is equipped with basic features such as the ability to upload and tag photos, and it allows a user to connect photos to profiles of Civil War soldiers with detailed military records. Currently, the site hosts over 15,000 identified Civil War soldier portraits and military service records aggregated from multiple public sources, such as the US Military History Institute (USAHEC 2018) and the US National Park Service Soldiers & Sailors Database (NPS 2018).

User Tags

The person identification process begins with the user uploading an unidentified portrait to Civil War Photo Sleuth, which simultaneously adds the photo to the reference database to support future photo identifications. Thereafter, the user adds tags based on visual clues about the uniform, insignia, equipment, and weapons. Our initial user base is targeted towards history enthusiasts with some familiarity with these categories.

Filter Suggestions

The system then draws upon encoded domain knowledge of Civil War portraits to generate search filters based on the user-provided tags, as sketched below. For example, if the user tagged a hat insignia of crossed swords and a shell jacket, the system will recommend "Cavalry," and tagging the coat color as dark would add another search filter for "Union Army." These filters leverage military records to significantly narrow down the search results pool.
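The paper does not give implementation details for these tag-to-filter rules; the following is a minimal sketch of how such encoded domain knowledge might look. The rule table, its entries beyond the paper's Cavalry/Union example, and the suggest_filters helper are illustrative assumptions, not the deployed system.

```python
# Illustrative sketch only: the rule table and helper below are our
# assumptions, not Civil War Photo Sleuth's actual implementation.

# Each rule maps a set of user-provided visual tags to a search filter
# backed by military records.
FILTER_RULES = [
    ({"crossed swords insignia", "shell jacket"}, "branch: Cavalry"),
    ({"dark coat"}, "side: Union Army"),
    ({"crossed cannons insignia"}, "branch: Artillery"),  # hypothetical rule
]

def suggest_filters(user_tags):
    """Return every search filter whose trigger tags are all present."""
    tags = set(user_tags)
    return [f for trigger, f in FILTER_RULES if trigger <= tags]

print(suggest_filters(["crossed swords insignia", "shell jacket", "dark coat"]))
# ['branch: Cavalry', 'side: Union Army']
```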
Face Recognition

The current prototype employs Microsoft's Face Recognition API (Microsoft 2018) to scour through this filtered search pool and generate a set of candidates with faces highly similar to the unknown soldier (see Figure 1). Our initial tests show that when the similarity confidence threshold parameter is set to 0.50 with Civil War portraits, the API yields poor precision but near-perfect recall. This implies that the correct result is almost always present in the search results, bringing us to the "last mile" problem.

[Figure 1: Photo Sleuth software prototype showing real face recognition results from the Microsoft Face API. Note that due to image quality differences, even an exact duplicate (Thomas Whiting) is not the top result.]
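As a sketch of this recall-oriented filtering step, suppose a face_similarity(a, b) function wraps whatever face recognition service is in use; it is an assumed stand-in for the similarity confidence the Microsoft Face API returns, not the actual SDK call. Keeping every candidate at or above the 0.50 threshold trades precision for near-perfect recall, producing the "last mile" pool:

```python
# Sketch of the high-recall filtering step. `face_similarity` is an
# assumed interface standing in for a face recognition service call.
SIMILARITY_THRESHOLD = 0.50  # yields near-perfect recall on Civil War portraits

def last_mile_pool(unknown_photo, candidates, face_similarity):
    """Keep every candidate the recognizer scores at or above the threshold.

    The result typically still contains many look-alikes (false
    positives), which the crowdsourcing pipeline then disambiguates.
    """
    scored = [(c, face_similarity(unknown_photo, c)) for c in candidates]
    return sorted(
        [(c, s) for c, s in scored if s >= SIMILARITY_THRESHOLD],
        key=lambda pair: pair[1],
        reverse=True,
    )
```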
Novel Crowdsourced Pipeline

Our crowdsourcing pipeline helps the user find the correct match from a pool of similar-looking people by performing fine-grained photo analysis.

Feature Selection

We perform fine-grained pairwise analysis by capturing information about features according to the cognitive science models described above. Based on these models, we classify the features into two types:

1. Alignable Differences. Building on the idea that differences are salient in this similar-looking candidate pool, the system captures information for a pre-determined feature list. We modified a subset of the attributes used by Kumar et al. (2011) for the purposes of this project, and we term these attributes high-level features. Examples include "hair," "eyebrow," "eyes," "nose," "mouth," and "chin/jaw." For each of these high-level features, we add possible common low-level tags in an ad hoc manner; e.g., "hair" can be "receding," "straight," "short," etc.

2. Unique Similarities. Since unique features of high diagnostic value increase similarity between objects, the system allows users to input features that may be uniquely distinctive for the unknown photo. The pipeline then captures information about the presence of these features in the search pool. Examples include "no right hand," "muttonchops facial hair style," "baldness," etc.

Crowd Interface

The system launches crowdsourcing tasks on Amazon Mechanical Turk such that three crowd workers consider each pair of photos. The crowd interface shows the worker the unknown photo alongside another photo from the search pool and asks which of the high-level features are similar or different in the two photos. For the features selected as different, the interface asks for the low-level tags associated with each photo. For example, if a crowd worker selects "hair" as different, the system asks which of the low-level tags for "hair" (e.g., "curly," "receding," "straight," "full," "long," etc.) apply to each photo. Since the worker judged "hair" to be different, at least one of the tags should differ between the photos, justifying the decision. The worker repeats this comparison for all of the differing features.

The system then asks the crowd worker about the presence of the uniquely distinctive features in the other photo. After comparing all the features, the crowd worker makes an overall judgment about the similarity of the people in the two photos on a four-point Likert scale.
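The paper states that the three workers' responses are aggregated into crowd scores but does not specify the aggregation rule. One plausible sketch, under our own assumptions, averages the four-point Likert judgments and tallies the features marked as different; the field names and scoring scheme below are illustrative, not the system's actual design.

```python
# Sketch of aggregating three workers' pairwise responses into a crowd
# score. The averaging scheme and response fields are our assumptions.
from statistics import mean

def aggregate_pair(responses):
    """responses: one dict per worker, e.g.
    {"likert": 3,                       # 1 (different) .. 4 (same person)
     "different_features": {"hair", "nose"},
     "unique_feature_present": True}    # distinctive feature seen in candidate
    """
    score = mean(r["likert"] for r in responses)      # overall crowd judgment
    diff_counts = {}                                  # per-feature difference tallies
    for r in responses:
        for feat in r["different_features"]:
            diff_counts[feat] = diff_counts.get(feat, 0) + 1
    unique_votes = sum(r["unique_feature_present"] for r in responses)
    return {"crowd_score": score,
            "difference_distribution": diff_counts,
            "unique_similarity_votes": unique_votes}
```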
Search Interface

The search interface shows the user (1) the final aggregated crowd scores next to each photo in the original search pool, and (2) the search results sorted by these scores. The user can also perform a fine-grained analysis of one photo at a time by checking the distribution of aggregated differences, along with the tags provided by the crowd workers. The user also sees the presence or absence of the high-diagnostic-value features.

The system additionally provides the option of filtering search results by the high-level features, and of sorting by the smallest differences compared to the unknown soldier. This accords with the theory that, within a set of very similar objects, the "more similar" ones are those with few alignable differences and shared features of high diagnostic value.
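As a sketch of this ranking logic, assuming each candidate carries the aggregate produced in the earlier sketch, one could sort by crowd score and break ties by fewest alignable differences and then by unique-similarity votes. The tie-breaking priority is our assumption; the paper lists the available sort and filter options without fixing an order.

```python
# Sketch of the search interface's ranking and filtering, assuming the
# aggregate dict from the previous sketch. The tie-breaking order is
# our assumption.
def rank_candidates(candidates):
    """candidates: list of dicts with keys 'crowd_score',
    'difference_distribution', and 'unique_similarity_votes'."""
    return sorted(
        candidates,
        key=lambda c: (
            -c["crowd_score"],                           # higher crowd score first
            sum(c["difference_distribution"].values()),  # fewer differences next
            -c["unique_similarity_votes"],               # then unique similarities
        ),
    )

def filter_by_feature(candidates, feature):
    """Hide candidates the crowd marked as differing on `feature`."""
    return [c for c in candidates if feature not in c["difference_distribution"]]
```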
Preliminary Results

We conducted a pilot study to measure how crowds perform on a pairwise photo comparison task. Aggregated crowd scores for four unknown soldiers suggested that the initial search pool of six photos per soldier (the correct matching photo plus five similar-looking photos) could be narrowed down to a smaller pool of three photos per soldier, with the correct matching photo scoring highest in two of those cases. These results support our hypothesis that crowds can further filter the initial pool of similar-looking photos.

We further validated the use of a prior high-level feature list by asking crowds to nominate features that justified their comparison decisions in a pairwise analysis. We collected 216 feature responses, and our post hoc analysis found that they fell into 17 feature categories. Considering only the facial features among these categories, they overlapped with our high-level feature list. This justifies the use of a prior, system-provided feature list, since there is no apparent loss of information. A prior feature list also offers a speed advantage, as we can employ a "yes/no" line of questioning rather than free-form text inputs for capturing feature-related information.

Future Work

We are currently planning several studies to address the original research questions. The first study will examine how well the aggregated crowd scores work. We will compare them with the ground truth and check the average rank of the correct matching photo when the search results are sorted by these scores. Further, we will evaluate the performance of the crowd scores by testing whether a threshold can be established that narrows down the original search pool, and we will measure how the "crowd + face recognition" system differs from the original search pool.

A second study will examine whether the crowd decisions and the feature responses are correlated. Here we plan to assess the effectiveness of alignable differences and unique similarities in contributing to the final decision, and to correlate them with the ground truth. We will also perform a qualitative evaluation of the responses.

A third study will evaluate the user's interaction with the overall system. We will compare the user's success rate in correctly identifying matches using only the face recognition search results versus the "crowd + face recognition" system. In addition, we will compare the percentage of search results the user has to scour through in each system before making a final decision. Further, we will evaluate the effectiveness of the feature information by checking how often the user refers to the fine-grained pairwise analysis; here we will count the cases in which the user relies on the distribution of differences to make a final decision, and how often that decision is correct. We plan a similar evaluation for the presence of unique similarities.

Finally, we will evaluate how our proposed system compares against the user's current, manual identification methods, in terms of the time taken and the success rate for correctly finding a match.

Conclusion

Civil War Photo Sleuth's hybrid crowd + face recognition pipeline attempts to address the "last mile" problem in person identification, on a dataset of historical photographs that presents both cultural value and technical challenges. Since the pipeline has the flexibility of being data-agnostic, our hybrid approach may also generalize to other domains where person identification is relevant, such as journalism and criminal investigation. At the same time, our work opens doors to exploring new ways to leverage the strengths of the human vision system to complement the power of an AI system in complex image analysis tasks.

Acknowledgements

We wish to thank Ron Coddington and Paul Quigley for historical expertise, and Sneha Mehta, Nam Nguyen, and Abby Jetmundsen for early prototyping. This research was supported by NSF CAREER-1651969.

References

Blanton, A.; Allen, K. C.; Miller, T.; Kalka, N. D.; and Jain, A. K. 2016. A comparison of human and automated face verification accuracy on unconstrained image sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 161–168.

Cheng, J., and Bernstein, M. S. 2015. Flock: Hybrid crowd-machine learning classifiers. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, 600–611. ACM.

Fortin, J. 2018. She Was the Only Woman in a Photo of 38 Scientists, and Now She's Been Identified. The New York Times.

Gentner, D., and Markman, A. B. 1997. Structure mapping in analogy and similarity. American Psychologist 52(1):45.

Kemelmacher-Shlizerman, I.; Seitz, S. M.; Miller, D.; and Brossard, E. 2016. The MegaFace benchmark: 1 million faces for recognition at scale. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4873–4882.

Kumar, N.; Berg, A.; Belhumeur, P. N.; and Nayar, S. 2011. Describable visual attributes for face verification and image search. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(10):1962–1977.

Martinez, R. 2012. Unknown No More: Identifying a Civil War Soldier. NPR.org.

Microsoft. 2018. Face API - Facial Recognition Software | Microsoft Azure. https://azure.microsoft.com/en-us/services/cognitive-services/face/.

NPS. 2018. Soldiers and Sailors Database - The Civil War (U.S. National Park Service). https://www.nps.gov/subjects/civilwar/soldiers-and-sailors-database.htm.

Patterson, G.; Van Horn, G.; Belongie, S. J.; Perona, P.; and Hays, J. 2015. Tropel: Crowdsourcing detectors with minimal training. In HCOMP, 150–159.

Schmidt, M. S. 2016. 'Flags of Our Fathers' Author Now Doubts His Father Was in Iwo Jima Photo. The New York Times.

Tversky, A. 1977. Features of similarity. Psychological Review 84(4):327.

USAHEC. 2018. MOLLUS-MASS Civil War Photograph Collection. http://cdm16635.contentdm.oclc.org/cdm/landingpage/collection/p16635coll12.

Zhao, W.; Chellappa, R.; Phillips, P. J.; and Rosenfeld, A. 2003. Face recognition: A literature survey. ACM Computing Surveys (CSUR) 35(4):399–458.