=Paper= {{Paper |id=Vol-2203/157 |storemode=property |title=Personalized Recommendations in Police Photo Lineup Assembling Task |pdfUrl=https://ceur-ws.org/Vol-2203/157.pdf |volume=Vol-2203 |authors=Ladislav Peska,Hana Trojanova |dblpUrl=https://dblp.org/rec/conf/itat/PeskaT18 }} ==Personalized Recommendations in Police Photo Lineup Assembling Task== https://ceur-ws.org/Vol-2203/157.pdf
S. Krajči (ed.): ITAT 2018 Proceedings, pp. 157–160
CEUR Workshop Proceedings Vol. 2203, ISSN 1613-0073, © 2018 Ladislav Peška and Hana Trojanová




Personalized Recommendations in Police Photo Lineup Assembling Task

Ladislav Peska
Department of Software Engineering
Faculty of Mathematics and Physics, Charles University, Prague
Czech Republic
peska@ksi.mff.cuni.cz

Hana Trojanova
Department of Psychology
Faculty of Arts, Charles University, Prague
Czech Republic
trojhanka@gmail.com

Abstract. In this paper, we aim to present a novel application domain for recommender systems: police photo lineups. Photo lineups play a significant role in the eyewitness identification, prosecution and subsequent conviction of suspects. Unfortunately, there are many cases where lineups have led to the conviction of innocent persons. One of the key factors contributing to incorrect identification is an unfairly assembled (biased) lineup, i.e., one in which the suspect differs significantly from all other candidates. Although the process of assembling a fair lineup is both highly important and time-consuming, only a handful of tools are available to simplify the task.

We describe our work towards using recommender systems for the photo lineup assembling task. Initially, two non-personalized recommending methods were evaluated: one based on the visual descriptors of persons and the other on their content-based attributes. Next, some personalized hybrid techniques combining both methods based on the feedback from forensic technicians were evaluated. Some of the personalized techniques significantly improved the results of both non-personalized techniques w.r.t. nDCG and recall@top-k.

1 Introduction

Evidence from eyewitnesses often plays a significant role in criminal proceedings. A very important part is the lineup, i.e., eyewitness identification of the perpetrator. Lineups may lead to the prosecution and subsequent conviction of the perpetrator. Yet there are cases where lineups have played a role in the conviction of an innocent suspect. This forensic method consists of the recognition of persons or things and is thus linked with a wide range of psychological processes such as perception, memory, and decision making. Those processes can be influenced by the lineup itself. In order to prevent witnesses from making incorrect identifications, the lineup assembling task is among the top research topics of the psychology of eyewitness identification [1, 4, 6, 9, 10].

The sources of error in eyewitness identifications are numerous. Some variables affecting error probability are on the side of the witness (e.g., level of attention, age or ethnicity) and the event (e.g., distance, lighting, time of the day) and in general cannot be controlled [6, 9]. Controllable variables include the method of questioning, the identification procedure, the interaction with investigators, and similar [9, 10].

One of the principal recommendations for inhibiting errors in identification is to assemble lineups according to the lineup fairness principle [1, 5]. Lineup fairness is usually assessed on the basis of data obtained from "mock witnesses", i.e., people who have not seen the offender but have received a short description of him/her. Lineup fairness measures the bias against the suspect: an assembled lineup is considered fair if mock witnesses are unable to identify the suspect based only on a brief textual description. See Figure 1 for an example of a highly biased lineup.

Figure 1: Example of an extremely biased lineup. A lineup usually consists of four to eight persons and the witness is instructed that the suspect may or may not be among them. However, in this case, the suspect can be easily identified even by a mock witness knowing only a short description such as “Vietnamese male, 50-70 years old.”

Assembling photo lineups, i.e., finding candidates for filling the lineup for a particular suspect, according to the lineup fairness principle, is a challenging and time-consuming task involving the exploration of large datasets of candidates. In recent years, some research projects [4, 11] as well as commercial activities, e.g., elineup.org, have aimed to simplify the process of eyewitness identifications. However, they mostly focus on the lineup administration and do not support intelligent lineup assembling.

From the point of view of recommender systems, lineup assembling is a quite specific task for several reasons. Users of the system are respected experts, who assemble lineups regularly, although usually not on a daily basis. Therefore, we can expect a steady flow of feedback from long-term users. Also, each lineup assembling task is highly unique, i.e., the same suspect hardly ever appears in multiple lineups. Thus, some popular approaches incorporating collaborative filtering [2] or “the wisdom of the crowd” cannot be applied in this scenario. Last, but not least, the relevance judgement is largely based on the visual appearance and/or similarity of the suspect and the lineup candidate.

In this paper, we describe our work in progress towards designing recommender systems that aid users in assembling fair lineups. In our previous work, we evaluated two non-personalized, item-based recommending strategies [8]. Based on the initial evaluation of the non-personalized methods, we propose a content-based personalized approach combining both non-personalized techniques, aiming to re-rank the list of proposed candidates according to the long-term preferences of the user.

More specifically, the main contributions of this paper are:
  - A proposed and evaluated hybrid personalized recommendation method.
  - A dataset of assembled lineups with both positive and negative training examples.

To the best of our knowledge, our work is the first application of recommender systems principles to the lineup assembling task.

2 Item-based Recommendations

2.1 Dataset of Lineup Candidates

Although there are several commercial lineup databases¹, such datasets must be applied carefully due to the problem of localization. Not only do the racial groups differ greatly, e.g., between North America (where the datasets are mostly based) and Central Europe, but other aspects such as common clothing patterns, haircuts or make-up trends also vary greatly between countries and continents. The underlying datasets should follow the same localization as the suspect in order to inhibit the bias of detecting strangers or having the incorrect ethnicity in a lineup. We evaluated the proposed methods in the context of the Czech Republic. Although the majority of the population is Caucasian, mostly of Czech, Slovak, Polish and German nationality, there are large Vietnamese and Romany minorities, which make lineup assembling more challenging. We collected the dataset of candidate persons from the wanted and missing persons application² of the Police of the Czech Republic. In total, we collected data about 4,423 missing or wanted males. All records contained a photo, nationality, age and appearance characteristics such as (facial) hair color and style, eye color, figure shape, tattoos and more. More information about the dataset may be found in [8].

¹ e.g., http://elineup.org
² aplikace.policie.cz/patrani-osoby/Vyhledavani.aspx

2.2 Item-Based Recommending Strategies for Lineup Assembling

In our previous work [8], we proposed two non-personalized recommending strategies, where the list of proposed candidates is based on the similarity between the suspect and lineup candidates. We use the underlying assumption that lineup fairness can be approximated through the similarity of the suspect and fillers, i.e., by filling lineups with candidates similar to the suspect, we ensure that lineups remain unbiased.

The Content-based Recommendation Strategy (CB-RS) leverages the collected content-based attributes of candidates. We employed the Vector Space Model [3] with binarized features, TF-IDF weighting and cosine similarity. The CB-RS strategy was intended to be closely similar to attribute-based searching, which is commonly available in lineup assembling tools.

Recommendation based on visual features (Visual-RS) leverages the similarity of visual descriptors obtained from a pre-trained CNN (a VGG network for facial recognition problems, VGG-Face [7], in our case). More information is available in the previous work [8].

2.3 Evaluation of Item-Based Recommenders

To make this paper self-contained, let us briefly describe the results of the non-personalized recommendation strategies.

The evaluation was based on a user study of domain experts, i.e., forensic technicians, whose task was to select the best lineup candidates out of the ones recommended by both techniques. More specifically, 30 persons were selected from the dataset to play the role of suspects. For each suspect, both non-personalized recommendation strategies proposed their top-20 candidates, which were merged into a single list³ and displayed together with the suspect to the domain experts. The domain experts selected the most suitable candidates; these were considered as positively preferred. Participants were instructed to maintain lineup fairness principles. They were allowed to produce incomplete lineups if no more suitable candidates were available, or to select more candidates if they were equally eligible.

³ The ordering of candidates proposed by each method was maintained, i.e., the randomness was applied on the decision whether the next list item will be filled by the CB-RS or the Visual-RS method.

Table 1: Evaluation results depicting the volume of selected candidates, the differences in volumes of selected candidates (p-value of paired t-test), the level of agreement among participants (Krippendorff’s alpha) and the average rank of the selected candidates. Note that candidates proposed by both strategies were excluded from the results.

             Selected candidates   Level of agreement   Average rank
Visual-RS    466 / 58%             0.178                8.2
CB-RS        298 / 37%             0.138                8.9

P-value (paired t-test on the volumes of selected candidates): 1.2e-8

The evaluation was performed by seven forensic technicians from the Czech Republic, with 202 assembled lineups and 800 selected candidates in total. Table 1 illustrates the overall results of the user study. One can observe that although Visual-RS clearly outperformed CB-RS, the candidates recommended by CB-RS were also selected quite often. Together with the surprisingly small intersection (1.83%) between the lists of recommended candidates and the relatively high level of disagreement among participants on the selected candidates, the results indicate that some merged, personalized strategy is plausible. Furthermore, as the mean rank of selected candidates was




relatively high for both methods (8, resp. 9 out of 20), there is room for some re-ranking approach.

3 Personalized Recommendations

Based on the evaluation of non-personalized, item-based recommending techniques, we hypothesized that the proposed recommendations can be further improved by employing some content-based personalized techniques. We approach this task through state-of-the-art machine learning methods as follows.

Suppose that for an arbitrary user $u$, his/her previous interactions with the system are in the form of triples $F_u : \{r_u(i, j)\}$, where $i$ is the suspect of some previously created lineup, $j$ is a recommended candidate, and $r_u = 1$ if $j$ was selected to the lineup and $r_u = 0$ otherwise. Furthermore, both $i$ and $j$ can be represented by three sets of attributes:
  - $A_{cb}$ are the TF-IDF values of content-based attributes of each object.
  - $A_{vis}$ represents the visual descriptor based on the VGG-Face network.
  - The union of both sets: $A_{cb} \cup A_{vis}$.

Suppose that the equations below represent the scoring functions of the non-personalized recommending strategies.

$$ s_{cb}(i, j) = \frac{1}{1 + \sum_{a \in A_{cb}} |a_i - a_j|}, \qquad s_{vis}(i, j) = \frac{1}{1 + \sum_{a \in A_{vis}} |a_i - a_j|} $$

Now, let us define a personalized classification / regression task⁴ with the train set examples constructed as follows. For each $f \in F_u$, the output variable is $y = r$ and the list of input variables $\mathbf{x}_A$ is constructed as the absolute difference of the suspect’s and the candidate’s attributes for a set of attributes $A$: $\forall a \in A: x_a := |a_i - a_j|$.

⁴ Please note that although classification is a natural choice due to the binary output variable, the final output of the method should be a ranking of candidates. Thus, we also evaluate several regression-based machine learning methods, and in the case of a classification method, we use the positive class probability score for ranking.

Given an arbitrary classification method $M$, the model of user preferences $m_{u,A}$ is trained by applying method $M$ on the per-user train set $\{(\mathbf{x}_A, y)\}$. When the user starts a new lineup task with some new suspect $\bar{i}$, the lineup candidates are ranked according to their probability of being selected in the lineup:

$$ r_j := P\big(r_u(\bar{i}, j) = 1 \mid m_{u,A}\big). $$

We would like to note that such a recommendation scenario is quite challenging, as we do not have any feedback from the current lineup and need to rely solely on the long-term user preferences (note the relation to the page zero problem or homepage recommendation problem). On the other hand, quite complex learning methods can be used, because the time span between two consecutive lineup assembling tasks performed by the same forensic technician tends to be rather large.

The following preference learning methods were evaluated⁵:
  - Non-personalized similarity based on the $L_1$ distances (baseline)
  - Linear regression (denoted as LM in the evaluation)
  - Lasso regression (Lasso)
  - Decision tree (Dec. tree)
  - Gradient boosted tree (GBT)

⁵ We use the methods’ implementations from the scikit-learn package, http://scikit-learn.org.

As the initial evaluation of the proposed method was only partially successful (the machine learning methods were able to significantly improve the baseline only in the case of the $A_{cb}$ attribute set), we further proposed a hybrid approach integrating two components:
  - Predictions of a selected machine learning method on the $A_{cb}$ attribute set.
  - Predictions based on a non-personalized $L_1$ distance metric applied on the $A_{vis}$ attribute set.

Both prediction techniques are aggregated via probabilistic sum, i.e., $r_j := r_j^{cb} + r_j^{vis} - r_j^{cb} \times r_j^{vis}$. This approach is denoted as hybrid in the evaluation.

3.2 Evaluation of Personalized Recommendations

The main goal of the evaluation of personalized recommendations is to clarify whether the long-term user preferences, i.e., those collected during previous lineup assembling tasks, can be utilized to improve the list of recommended candidates for the current lineup.

In order to confirm this hypothesis, we performed an off-line evaluation on the dataset of assembled lineups collected during the evaluation of item-based recommendations. The resulting dataset contained 7659 records in total (800 positive and 6859 negative), i.e., on average 1094 records per user. The proposed methods were evaluated based on a 10-fold cross-validation protocol applied on the lineups. Hyperparameters of the methods were learned via grid search on an internal leave-one-lineup-out protocol.

For each tested lineup, each recommending method re-ranks the objects originally displayed to the forensic technicians according to the computed relevance $r_j$ (selected candidates should appear on top of the list). We measure normalized discounted cumulative gain (nDCG), recall at top-10 and recall at top-5 (rec@10, rec@5 resp.) of the list and report the average results for all evaluated users and lineups.

Table 2 depicts the results of the off-line evaluation. We can observe that both the linear model and gradient boosted trees improved over the baseline method in the case of the $A_{cb}$ attribute set. Therefore, we evaluated the hybrid approach with both methods. Both hybrid methods outperformed the best baselines w.r.t. the nDCG and rec@5 metrics, while the GBT hybrid provides the best performance w.r.t. all evaluated metrics.
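The feature construction, the L1-based baseline score and the probabilistic-sum hybrid described in Section 3 can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the function names (`pair_features`, `l1_similarity`, `probabilistic_sum`, `hybrid_ranking`) and the toy data are assumptions made for illustration, and `personalized_score` is a hypothetical stand-in for any trained per-user model that maps a feature vector to a selection probability (for example, the positive-class probability of a scikit-learn classifier).

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def pair_features(suspect: Vector, candidate: Vector) -> List[float]:
    """Absolute per-attribute differences, x_a = |a_i - a_j|."""
    if len(suspect) != len(candidate):
        raise ValueError("attribute vectors must have the same length")
    return [abs(a_i - a_j) for a_i, a_j in zip(suspect, candidate)]


def l1_similarity(suspect: Vector, candidate: Vector) -> float:
    """Non-personalized score s(i, j) = 1 / (1 + sum_a |a_i - a_j|)."""
    return 1.0 / (1.0 + sum(pair_features(suspect, candidate)))


def probabilistic_sum(r_cb: float, r_vis: float) -> float:
    """Aggregate two scores in [0, 1]: r_cb + r_vis - r_cb * r_vis."""
    return r_cb + r_vis - r_cb * r_vis


def hybrid_ranking(
    suspect_cb: Vector,
    suspect_vis: Vector,
    candidates: Dict[str, Tuple[Vector, Vector]],
    personalized_score: Callable[[List[float]], float],
) -> List[str]:
    """Rank candidate ids by the hybrid relevance, best first."""
    scored = []
    for cand_id, (cand_cb, cand_vis) in candidates.items():
        # Personalized part: model score on content-based difference features.
        r_cb = personalized_score(pair_features(suspect_cb, cand_cb))
        # Non-personalized part: L1-based similarity of visual descriptors.
        r_vis = l1_similarity(suspect_vis, cand_vis)
        scored.append((probabilistic_sum(r_cb, r_vis), cand_id))
    # Highest score first; ties are broken by id to keep the order deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [cand_id for _, cand_id in scored]


# Toy data (hypothetical): candidate id -> (content-based vector, visual vector).
toy_candidates = {
    "cand_a": ([1.0, 0.0], [0.2, 0.2]),
    "cand_b": ([0.0, 1.0], [0.9, 0.9]),
}
# Stand-in for a trained per-user model; a real one would be learned from feedback.
toy_model = lambda x: 1.0 / (1.0 + sum(x))
ranking = hybrid_ranking([1.0, 0.0], [0.2, 0.2], toy_candidates, toy_model)
print(ranking)  # ['cand_a', 'cand_b']
```

Because the probabilistic sum never falls below either of its inputs when both lie in [0, 1], a candidate scores highly if either component considers it suitable, which is one plausible reading of why this aggregation was chosen.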
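The ranking metrics used in the off-line evaluation can be sketched as follows for a single re-ranked lineup with binary relevance (1 = candidate was selected, 0 = not selected). This is only an illustration under stated assumptions: the paper does not specify the exact nDCG variant, so the common logarithmic discount 1 / log2(position + 1) is assumed here, and the function names are hypothetical.

```python
import math
from typing import Sequence


def dcg(relevance: Sequence[int]) -> float:
    """Discounted cumulative gain; list positions are 0-based, ranks 1-based."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevance))


def ndcg(relevance: Sequence[int]) -> float:
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0


def recall_at_k(relevance: Sequence[int], k: int) -> float:
    """Share of all selected candidates that appear in the top-k positions."""
    total = sum(relevance)
    return sum(relevance[:k]) / total if total > 0 else 0.0


# Relevance of candidates in the order produced by some recommending method.
ranked = [1, 0, 1, 0, 0]
print(recall_at_k(ranked, 1))  # 0.5
print(recall_at_k(ranked, 3))  # 1.0
print(ndcg(ranked))
```

Averaging these per-lineup values over all tested lineups and users would give figures comparable in kind to those reported in Table 2.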




Table 2: Results of the personalized recommendation methods. Note that $A_{vis}$ based machine learning approaches did not improve the baseline and were omitted for the sake of space.

Method       Attributes              nDCG     rec@10   rec@5
Baseline     $A_{cb}$                0.4088   0.1796   0.0805
Baseline     $A_{vis}$               0.4990   0.3837   0.1725
Baseline     $A_{cb} \cup A_{vis}$   0.4201   0.2432   0.1090
LM           $A_{cb}$                0.4605   0.2949   0.1413
Lasso        $A_{cb}$                0.3816   0.1255   0.0484
Dec. tree    $A_{cb}$                0.3842   0.0871   0.0611
GBT          $A_{cb}$                0.4563   0.2728   0.1451
LM hybrid    $A_{cb} \cup A_{vis}$   0.4995   0.3693   0.2003
GBT hybrid   $A_{cb} \cup A_{vis}$   0.5205   0.3843   0.2042

4 Conclusions

The main aim of this work in progress was to analyze the applicability of recommender systems principles to the problem of photo lineup assembling. Although the photo lineup assembling task is both important and time-consuming, state-of-the-art tools do not provide an intelligent search API beyond simple attribute search and, to the best of our knowledge, apart from our work, there are no papers utilizing recommending principles in the lineup assembling task.

After the initial evaluation of item-based recommending algorithms, we proposed several variants of content-based personalized recommending algorithms utilizing long-term preferences of the user. The off-line evaluation confirmed that long-term preferences can be used to improve the final ranking of candidates, however, only in the case of content-based attributes.

The proposed approaches remained ineffective in the case of visual descriptors, so one direction of our future work is to further analyze this problem and to provide solutions that are also suitable for visual descriptors. Siamese networks merging both content-based and visual descriptors seem particularly suitable for the task. Another option is to use visual descriptors as a base for short-term user preferences, i.e., the ones expressed in the current lineup, and to refine the recommended objects based on the already selected candidates.

The textual description of the suspect also plays an important role in lineup assembling, as forensic technicians often try to select candidates that match the mentioned, highly specific features, e.g., scars, skin defects, a specific haircut etc. Another direction of our future work would aim to incorporate searching for these specific features in a “guided recommendation” API. Selecting specific regions of interest within the suspect’s photo seems to be a suitable initial strategy.

Finally, the long-term goal of our work is to move from the recommendation of candidates to the recommendation of assembled lineups and to provide ready-to-use software for forensic technicians.

Acknowledgments

This work was supported by the Czech grants GAUK-232217 and Q48.

References

[1] Brigham, J.C. 1999. Applied issues in the construction and expert assessment of photo lineups. Applied Cognitive Psychology. (1999). DOI:https://doi.org/10.1002/(SICI)1099-0720(199911)13:1+3.3.CO;2-W.

[2] Hu, Y. et al. 2008. Collaborative Filtering for Implicit Feedback Datasets. Proceedings of the 2008 Eighth IEEE International Conference on Data Mining (Washington, DC, USA, 2008), 263–272.

[3] Lops, P. et al. 2011. Content-based Recommender Systems: State of the Art and Trends. Recommender Systems Handbook. F. Ricci et al., eds. Springer US. 73–105.

[4] MacLin, O.H. et al. 2005. PC_eyewitness and the sequential superiority effect: Computer-based lineup administration. Law and Human Behavior. (2005). DOI:https://doi.org/10.1007/s10979-005-3319-5.

[5] Mansour, J.K. et al. 2017. Evaluating lineup fairness: Variations across methods and measures. Law and Human Behavior. (2017). DOI:https://doi.org/10.1037/lhb0000203.

[6] Meissner, C.A. and Brigham, J.C. 2001. Thirty Years of Investigating the Own-Race Bias in Memory for Faces: A Meta-Analytic Review. Psychology, Public Policy, and Law.

[7] Parkhi, O.M. et al. 2015. Deep Face Recognition. Proceedings of the British Machine Vision Conference 2015 (2015).

[8] Peska, L. and Trojanova, H. 2017. Towards recommender systems for police photo lineup. ACM International Conference Proceeding Series (2017).

[9] Shapiro, P.N. and Penrod, S. 1986. Meta-Analysis of Facial Identification Studies. Psychological Bulletin.

[10] Steblay, N. et al. 2003. Eyewitness Accuracy Rates in Police Showup and Lineup Presentations: A Meta-Analytic Comparison. Law and Human Behavior (2003).

[11] Valentine, T.R. et al. 2007. How can psychological science enhance the effectiveness of identification procedures? An international comparison. Public Interest Law Reporter. (2007). DOI:https://doi.org/10.1017/CBO9781107415324.004.