S. Krajči (ed.): ITAT 2018 Proceedings, pp. 157–160
CEUR Workshop Proceedings Vol. 2203, ISSN 1613-0073, © 2018 Ladislav Peška and Hana Trojanová

Personalized Recommendations in Police Photo Lineup Assembling Task

Ladislav Peska
Department of Software Engineering, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic
peska@ksi.mff.cuni.cz

Hana Trojanova
Department of Psychology, Faculty of Arts, Charles University, Prague, Czech Republic
trojhanka@gmail.com

Abstract. In this paper, we present a novel application domain for recommender systems: police photo lineups. Photo lineups play a significant role in eyewitness identification and in the prosecution and subsequent conviction of suspects. Unfortunately, there are many cases where lineups have led to the conviction of innocent persons. One of the key factors contributing to incorrect identification is an unfairly assembled (biased) lineup, i.e., one in which the suspect differs significantly from all other candidates. Although the process of assembling a fair lineup is both highly important and time-consuming, only a handful of tools are available to simplify the task.

We describe our work towards using recommender systems for the photo lineup assembling task. Initially, two non-personalized recommending methods were evaluated: one based on the visual descriptors of persons and the other on their content-based attributes. Next, several personalized hybrid techniques combining both methods on the basis of feedback from forensic technicians were evaluated. Some of the personalized techniques significantly improved the results of both non-personalized techniques w.r.t. nDCG and recall@top-k.

1 Introduction

Evidence from eyewitnesses often plays a significant role in criminal proceedings. A very important part is the lineup, i.e., eyewitness identification of the perpetrator. Lineups may lead to the prosecution and subsequent conviction of the perpetrator. Yet there are cases where lineups have played a role in the conviction of an innocent suspect. This forensic method consists of the recognition of persons or things and is thus linked with a wide range of psychological processes such as perception, memory, and decision making. Those processes can be influenced by the lineup itself. Because of the need to prevent witnesses from making incorrect identifications, the lineup assembling task is among the top research topics of the psychology of eyewitness identification [1, 4, 6, 9, 10].

The sources of error in eyewitness identifications are numerous. Some variables affecting error probability are on the side of the witness (e.g., level of attention, age or ethnicity) and the event (e.g., distance, lighting, time of the day) and in general cannot be controlled [6, 9]. Controllable variables include the method of questioning, the identification procedure, interaction with investigators, and similar [9, 10].

One of the principal recommendations for inhibiting errors in identification is to assemble lineups according to the lineup fairness principle [1, 5]. Lineup fairness is usually assessed on the basis of data obtained from "mock witnesses", i.e., people who have not seen the offender but received a short description of him/her. Lineup fairness measures the bias against the suspect and defines the assembled lineup as fair if mock witnesses are unable to identify the suspect based only on a brief textual description. See Figure 1 for an example of a highly biased lineup.

Figure 1: Example of an extremely biased lineup. A lineup usually consists of four to eight persons and the witness is instructed that the suspect may or may not be among them. However, in this case, the suspect can be easily identified even by a mock witness knowing only a short description such as "Vietnamese male, 50-70 years old."

Assembling photo lineups, i.e., finding candidates for filling the lineup for a particular suspect, according to the lineup fairness principle is a challenging and time-consuming task involving the exploration of large datasets of candidates. In recent years, some research projects [4, 11] as well as commercial activities, e.g., elineup.org, aimed to simplify the process of eyewitness identification. However, they mostly focused on lineup administration and do not support intelligent lineup assembling.

From the point of view of recommender systems, lineup assembling is a quite specific task for several reasons. Users of the system are respected experts who assemble lineups regularly, although usually not on a daily basis. Therefore, we can expect a steady flow of feedback from long-term users. Also, each lineup assembling task is highly unique, i.e., the same suspect hardly ever appears in multiple lineups. Thus, some popular approaches incorporating collaborative filtering [2] or "the wisdom of the crowd" cannot be applied in this scenario. Last, but not least, the relevance judgement is based largely on the visual appearance and/or similarity of the suspect and the lineup candidate.

In this paper, we describe our work in progress towards designing recommender systems aiding users in assembling fair lineups. In our previous work, we evaluated two non-personalized, item-based recommending strategies [8]. Based on the initial evaluation of the non-personalized methods, we propose a content-based personalized approach combining both non-personalized techniques, aiming to re-rank the list of proposed candidates according to the long-term preferences of the user.

More specifically, the main contributions of this paper are:
• A proposed and evaluated hybrid personalized recommendation method.
• A dataset of assembled lineups with both positive and negative training examples.
To the best of our knowledge, our work is the first application of recommender systems principles to the lineup assembling task.

2 Item-based Recommendations

2.1 Dataset of Lineup Candidates

Although there are several commercial lineup databases¹, such datasets need to be applied carefully due to the problem of localization. Not only do the racial groups differ considerably, e.g., between North America (where the datasets are mostly based) and Central Europe, but other aspects such as common clothing patterns, haircuts or make-up trends also vary greatly across countries and continents. The underlying datasets should follow the same localization as the suspect in order to inhibit the bias of detecting strangers or having the incorrect ethnicity in a lineup. We evaluated the proposed methods in the context of the Czech Republic. Although the majority of the population is Caucasian, mostly of Czech, Slovak, Polish and German nationality, there are large Vietnamese and Romany minorities, which make lineup assembling more challenging. We collected the dataset of candidate persons from the wanted and missing persons application² of the Police of the Czech Republic. In total, we collected data about 4,423 missing or wanted males. All records contained a photo, nationality, age and appearance characteristics such as (facial) hair color and style, eye color, figure shape, tattoos and more. More information about the dataset may be found in [8].

¹ e.g., http://elineup.org
² aplikace.policie.cz/patrani-osoby/Vyhledavani.aspx

2.2 Item-Based Recommending Strategies for Lineup Assembling

In our previous work [8], we proposed two non-personalized recommending strategies, where the list of proposed candidates is based on the similarity between the suspect and lineup candidates. We use the underlying assumption that lineup fairness can be approximated through the similarity of the suspect and fillers, i.e., by filling lineups with candidates similar to the suspect, we ensure that lineups remain unbiased.

Content-based Recommendation Strategy (CB-RS) leverages the collected content-based attributes of candidates. We employed the Vector Space Model [3] with binarized features, TF-IDF weighting and cosine similarity. The CB-RS strategy was intended to closely resemble attribute-based searching, which is commonly available in lineup assembling tools.

Recommendation Based on Visual Features (Visual-RS) leverages the similarity of visual descriptors received from a pre-trained CNN (a VGG network for facial recognition problems, VGG-Face [7], in our case). More information is available in the previous work [8].

2.3 Evaluation of Item-Based Recommenders

To make this paper self-contained, let us briefly describe the results of the non-personalized recommendation strategies. The evaluation was based on a user study of domain experts, i.e., forensic technicians, whose task was to select the best lineup candidates out of the ones recommended by both techniques. More specifically, 30 persons were selected from the dataset to play the role of suspects. For each suspect, both non-personalized recommendation strategies proposed their top-20 candidates, which were merged into a single list³ and displayed together with the suspect to the domain experts. Domain experts selected the most suitable candidates; these were considered as positively preferred. Participants were instructed to maintain lineup fairness principles; they were allowed to produce incomplete lineups if no more suitable candidates were available, or to select more candidates if they were equally eligible.

³ The ordering of candidates proposed by each method was maintained, i.e., the randomness was applied on the decision whether the next list item will be filled by the CB-RS or the Visual-RS method.

The evaluation was performed by seven forensic technicians from the Czech Republic, with 202 assembled lineups and 800 selected candidates in total. Table 1 illustrates the overall results of the user study. One can observe that although Visual-RS clearly outperformed CB-RS, the candidates recommended by CB-RS were also selected quite often. Together with the surprisingly small intersection (1.83%) between the lists of recommended candidates and the relatively high level of disagreement among participants on the selected candidates, the results indicate that some merged, personalized strategy is plausible. Furthermore, as the mean rank of selected candidates was relatively high for both methods (8, resp. 9 out of 20), there is room for some re-ranking approach.

Table 1: Evaluation results depicting the volume of selected candidates, the differences in volumes of selected candidates (p-value of paired t-test), the level of agreement among participants (Krippendorff's alpha) and the average rank of the selected candidates. Note that candidates proposed by both strategies were excluded from results. The single p-value applies to the comparison of both rows.

Method      Selected candidates   P-value   Level of agreement   Average rank
Visual-RS   466 / 58%             1.2e-8    0.178                8.2
CB-RS       298 / 37%                       0.138                8.9

3 Personalized Recommendations

Based on the evaluation of non-personalized, item-based recommending techniques, we hypothesized that the proposed recommendations can be further improved by employing some content-based personalized techniques. We approach this task through state-of-the-art machine learning methods as follows.

Suppose that for an arbitrary user $u$, his/her previous interactions with the system are in the form of triples $F_u : \{r_u(i, j)\}$, where $i$ is the suspect of some previously created lineup, $j$ is a recommended candidate and $r_u = 1$ if $j$ was selected to the lineup and $r_u = 0$ otherwise. Furthermore, both $i$ and $j$ can be represented by three sets of attributes:
• $A_{cb}$ are TF-IDF values of content-based attributes of each object.
• $A_{vis}$ represents the visual descriptor based on the VGG-Face network.
• The union of both sets: $A_{cb} \cup A_{vis}$
Suppose that the equations below represent the scoring functions of the non-personalized recommending strategies.

\[
s_{cb}(i, j) = \frac{1}{1 + \sum_{a \in A_{cb}} |a_i - a_j|}, \qquad
s_{vis}(i, j) = \frac{1}{1 + \sum_{a \in A_{vis}} |a_i - a_j|}
\]

Now, let us define a personalized classification / regression task⁴ with the train set examples constructed as follows. For each $f \in F_u$, the output variable is $y = r$ and the list of independent variables $\mathbf{x}_A$ is constructed as the absolute difference of the suspect's and the candidate's attributes for a set of attributes $A$: $\forall a \in A: x_a := |a_i - a_j|$.

⁴ Please note that although classification is a natural choice due to the binary output variable, the final output of the method should be a ranking of candidates. Thus, we also evaluate several regression-based machine learning methods and, in the case of a classification method, we use the positive class probability score for ranking.

Given an arbitrary classification method $M$, the model of user preferences $m_{u,A}$ is trained by applying method $M$ on the per-user train set $\{(\mathbf{x}_A, y)\}$. When the user starts a new lineup task with some new suspect $\bar{i}$, the lineup candidates are ranked according to their probability of being selected in the lineup:

\[
r_j := P(r_u(\bar{i}, j) = 1 \mid m_{u,A}).
\]

We would like to note that such a recommendation scenario is quite challenging, as we do not have any feedback from the current lineup and need to rely solely on the long-term user preferences (note the relation to the page zero problem or homepage recommendation problem). On the other hand, quite complex learning methods can be used, because the time-span between two consecutive lineup assembling tasks performed by the same forensic technician tends to be rather large.

The following preference learning methods were evaluated⁵:
• Non-personalized similarity based on the $L_1$ distances (baseline)
• Linear regression (denoted as LM in the evaluation)
• Lasso regression (Lasso)
• Decision tree (Dec. tree)
• Gradient boosted tree (GBT)

⁵ We use the methods' implementation from the scikit-learn package, http://scikit-learn.org.

As the initial evaluation of the proposed method was only partially successful (machine learning methods were able to significantly improve the baseline only in the case of the $A_{cb}$ attribute set), we further proposed a hybrid approach integrating two components:
• Predictions of a selected machine learning method on the $A_{cb}$ attribute set.
• Predictions based on the non-personalized $L_1$ distance metric applied on the $A_{vis}$ attribute set.
Both prediction techniques are aggregated via the probabilistic sum, i.e., $r_j := r_j^{cb} + r_j^{vis} - r_j^{cb} \times r_j^{vis}$. This approach is denoted as hybrid in the evaluation.

3.2 Evaluation of Personalized Recommendations

The main goal of the evaluation of personalized recommendations is to clarify whether the long-term user preferences, i.e., those collected during previous lineup assembling tasks, can be utilized to improve the list of recommended candidates for the current lineup.

In order to confirm this hypothesis, we performed an off-line evaluation on the dataset of assembled lineups collected during the evaluation of item-based recommendations. The resulting dataset contained 7659 records in total (800 positive and 6859 negative), i.e., on average 1094 records per user. The proposed methods were evaluated with a 10-fold cross-validation protocol applied on the lineups. Hyperparameters of the methods were learned via grid search on an internal leave-one-lineup-out protocol.

For each tested lineup, each recommending method re-ranks the objects originally displayed to the forensic technicians according to the computed relevance $r_j$ (selected candidates should appear on top of the list). We measure normalized discounted cumulative gain (nDCG), recall at top-10 and recall at top-5 (rec@10, rec@5 resp.) of the list and report the average results over all evaluated users and lineups.

Table 2 depicts the results of the off-line evaluation. We can observe that both the linear model and gradient boosted trees improved over the baseline method in the case of the $A_{cb}$ attribute set. Therefore, we evaluated the hybrid approach with both methods. Both hybrid methods outperformed the best baseline w.r.t. the nDCG and rec@5 metrics, while the GBT hybrid provides the best performance w.r.t. all evaluated metrics.

Table 2: Results of the personalized recommendation methods. Note that $A_{vis}$ based machine learning approaches did not improve the baseline and were omitted for the sake of space.

Method       Attributes              nDCG     rec@10   rec@5
Baseline     $A_{cb}$                0.4088   0.1796   0.0805
Baseline     $A_{vis}$               0.4990   0.3837   0.1725
Baseline     $A_{cb} \cup A_{vis}$   0.4201   0.2432   0.1090
LM           $A_{cb}$                0.4605   0.2949   0.1413
Lasso        $A_{cb}$                0.3816   0.1255   0.0484
Dec. tree    $A_{cb}$                0.3842   0.0871   0.0611
GBT          $A_{cb}$                0.4563   0.2728   0.1451
LM hybrid    $A_{cb} \cup A_{vis}$   0.4995   0.3693   0.2003
GBT hybrid   $A_{cb} \cup A_{vis}$   0.5205   0.3843   0.2042

4 Conclusions

The main aim of this work in progress was to analyze the applicability of recommender systems principles to the problem of photo lineup assembling. Although photo lineup assembling is both an important and a time-consuming task, state-of-the-art tools do not provide an intelligent search API beyond simple attribute search and, to the best of our knowledge, apart from our work, there are no papers utilizing recommending principles in the lineup assembling task.

After the initial evaluation of item-based recommending algorithms, we proposed several variants of content-based personalized recommending algorithms utilizing the long-term preferences of the user. The off-line evaluation confirmed that long-term preferences can be used to improve the final ranking of candidates, however, only in the case of content-based attributes.

The proposed approaches remained ineffective in the case of visual descriptors, so one direction of our future work is to further analyze this problem and provide solutions that are also suitable for visual descriptors. Siamese networks merging both content-based and visual descriptors seem particularly suitable for the task. Another option is to use visual descriptors as a base for short-term user preferences, i.e., the ones expressed in the current lineup, and to refine the recommended objects based on the already selected candidates.

The textual description of the suspect also plays an important role in lineup assembling, as forensic technicians often try to select candidates that match the mentioned, highly specific features, e.g., scars, skin defects, a specific haircut etc. Another direction of our future work is to incorporate searching for these specific features in a "guided recommendation" API. Selecting specific regions of interest within the suspect's photo seems to be a suitable initial strategy.

Finally, the long-term goal of our work is to move from the recommendation of candidates to the recommendation of assembled lineups and to provide ready-to-use software for forensic technicians.

Acknowledgments

This work was supported by the Czech grants GAUK-232217 and Q48.

References

[1] Brigham, J.C. 1999. Applied issues in the construction and expert assessment of photo lineups. Applied Cognitive Psychology. (1999). DOI:https://doi.org/10.1002/(SICI)1099-0720(199911)13:1+3.3.CO;2-W.
[2] Hu, Y. et al. 2008. Collaborative Filtering for Implicit Feedback Datasets. Proceedings of the 2008 Eighth IEEE International Conference on Data Mining (Washington, DC, USA, 2008), 263–272.
[3] Lops, P. et al. 2011. Content-based Recommender Systems: State of the Art and Trends. Recommender Systems Handbook. F. Ricci et al., eds. Springer US. 73–105.
[4] MacLin, O.H. et al. 2005. PC_eyewitness and the sequential superiority effect: Computer-based lineup administration. Law and Human Behavior. (2005). DOI:https://doi.org/10.1007/s10979-005-3319-5.
[5] Mansour, J.K. et al. 2017. Evaluating lineup fairness: Variations across methods and measures. Law and Human Behavior. (2017). DOI:https://doi.org/10.1037/lhb0000203.
[6] Meissner, C.A. and Brigham, J.C. 2001. Thirty Years of Investigating the Own-Race Bias in Memory for Faces: A Meta-Analytic Review. Psychology, Public Policy, and Law.
[7] Parkhi, O.M. et al. 2015. Deep Face Recognition. Proceedings of the British Machine Vision Conference 2015 (2015).
[8] Peska, L. and Trojanova, H. 2017. Towards recommender systems for police photo lineup. ACM International Conference Proceeding Series (2017).
[9] Shapiro, P.N. and Penrod, S. 1986. Meta-Analysis of Facial Identification Studies. Psychological Bulletin.
[10] Steblay, N. et al. 2003. Eyewitness Accuracy Rates in Police Showup and Lineup Presentations: A Meta-Analytic Comparison. Law and Human Behavior (2003).
[11] Valentine, T.R. et al. 2007. How can psychological science enhance the effectiveness of identification procedures? An international comparison. Public Interest Law Reporter. (2007). DOI:https://doi.org/10.1017/CBO9781107415324.004.
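
As an illustration of the hybrid aggregation described in Section 3, the following minimal Python sketch combines a personalized content-based score with the non-personalized L1-based visual score via the probabilistic sum and re-ranks the candidates. It is a sketch under stated assumptions, not the implementation used in the paper: the function names, the toy vectors and the stand-in content-based probabilities are hypothetical, the visual component is assumed to be the L1-based scoring function on the visual descriptors, and both scores are assumed to lie in [0, 1].

```python
from typing import Dict, List, Sequence


def l1_similarity(suspect: Sequence[float], candidate: Sequence[float]) -> float:
    """Non-personalized score s(i, j) = 1 / (1 + sum_a |a_i - a_j|)."""
    distance = sum(abs(a_i - a_j) for a_i, a_j in zip(suspect, candidate))
    return 1.0 / (1.0 + distance)


def probabilistic_sum(r_cb: float, r_vis: float) -> float:
    """Aggregate two scores from [0, 1]: r = r_cb + r_vis - r_cb * r_vis."""
    return r_cb + r_vis - r_cb * r_vis


def hybrid_ranking(cb_scores: Dict[str, float],
                   suspect_vis: Sequence[float],
                   candidates_vis: Dict[str, Sequence[float]]) -> List[str]:
    """Return candidate ids ordered by the hybrid score, best first."""
    scores = {
        cid: probabilistic_sum(cb_scores[cid], l1_similarity(suspect_vis, vis))
        for cid, vis in candidates_vis.items()
    }
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


# Toy data, entirely made up for illustration (not from the paper's dataset).
suspect_vis = [0.5, 1.0]
candidates_vis = {"a": [0.5, 1.0], "b": [1.5, 1.0], "c": [0.5, 4.0]}
# Hypothetical stand-in for the per-user model output P(r_u = 1 | m_u,A)
# learned on the content-based attributes (e.g., by a classifier).
cb_scores = {"a": 0.125, "b": 0.5, "c": 0.875}

ranking = hybrid_ranking(cb_scores, suspect_vis, candidates_vis)
print(ranking)
```

In this toy setting, candidate "c" has a weak visual score but a strong content-based score, so the probabilistic sum places it above "b"; the aggregation rewards a candidate that is rated highly by either component.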