A Study on Lemma vs Stem for Legal Information Retrieval Using R Tidyverse. IMS UniPD @ AILA 2020 Task 1

Giorgio Maria Di Nunzio (a,b)
(a) Department of Information Engineering, University of Padova, Italy
(b) Department of Mathematics, University of Padova, Italy
giorgiomaria.dinunzio@unipd.it · http://github.com/gmdn · ORCID 0000-0001-9709-6392

Forum for Information Retrieval Evaluation 2020, December 16-20, 2020, Hyderabad, India
© 2020 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

In this paper, we describe the results of the participation of the Information Management Systems (IMS) group in AILA 2020 Task 1, precedent and statute retrieval. In particular, we participated in both subtasks: precedent retrieval (task a) and statute retrieval (task b). The goal of our work was to compare and evaluate the efficacy of a simple reproducible approach based on the use of either lemmas or stems with a tf-idf vector space model and a plain BM25 model. The results vary significantly from one subtask/evaluation measure to another. For the statute retrieval subtask, our approach performed well, being second only to a participant that used BERT to represent documents.

Keywords: Legal IR, BM25, TF-IDF, Text Pipelines, R Tidyverse

1. Introduction

The FIRE Artificial Intelligence for Legal Assistance (AILA) is an evaluation challenge consisting of a series of shared tasks aimed at developing datasets and methods for solving a variety of legal problems by means of search engine approaches [1]. This year, AILA proposed two different legal document tasks: a precedent and statute retrieval task, and a semantic segmentation task. In this paper, we report the results of our participation in the first task, Precedent and Statute Retrieval [2]. This task investigates the problem of identifying the relevant statutes and prior cases given the description of a situation (i.e., the query, in traditional IR terms). The contribution of our experiments to this task can be summarized as follows:

• the implementation of a reproducible pipeline for text analysis;
• an evaluation of basic rankers based on different lexical levels with a tf-idf approach and a BM25 approach.

This work follows a series of reproducible experiments that originated in the CLEF eHealth Task [3], and the source code for replicating all our experiments will be available online at http://github.com/gmdn.

The remainder of the paper introduces the methodology and gives a brief summary of the experimental settings that we used to create the official runs submitted for this task.

2. Method

In this section, we summarize the pipeline for text pre-processing that we have developed over the last years [4]. In general, our method follows the principles described by [5], where the idea is to mine textual information from large text collections in an efficient and effective way by means of organized workflows named pipelines. Pipelines are an effective way to manage the sequential process of text analysis by splitting the source code into steps, where the output of one step is the input of the subsequent step. The R programming language has an interesting set of packages that follow this idea, collectively named tidyverse (https://www.tidyverse.org), which we use in our experiments.
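As a minimal illustration of this idea (a toy sketch with invented data, not the code of the official runs), the following snippet chains two simple steps with the tidyverse pipe operator, so that the output of each step becomes the input of the next:

library(dplyr)
library(tibble)

# a toy corpus: one row per document
corpus <- tibble(doc_id = c("d1", "d2"),
                 text   = c("the petitioner prays for his release",
                            "the judgment was delivered by the court"))

corpus %>%
  mutate(n_words = lengths(strsplit(text, " "))) %>%  # step 1: count the words of each document
  arrange(desc(n_words))                              # step 2: order documents by word count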
Apart from being a tidy way of organizing software, an important advantage of working with pipelines is that this practice promotes the shareability and reproducibility of research workflows, which is one of the main pillars of the European Open Science Cloud (EOSC, https://www.eosc-portal.eu).

2.1. Pipeline for Data Cleaning

In order to produce the clean dataset, we followed the same pipeline for data ingestion and preparation in all the experiments (a code sketch of these steps is given at the end of this section):

• split text into words (https://www.tidytextmining.com);
• remove stopwords;
• remove words with fewer than two characters;
• lemmatize/stem words (https://cran.r-project.org/web/packages/corpus/vignettes/stemmer.html);
• compute tf-idf for each word;
• compute the relevance score (BM25) for each word.
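The following sketch shows one possible implementation of the cleaning steps with the tidytext package; the stemmer (SnowballC), the lemmatization hint, and the toy documents are illustrative assumptions and may differ in details from the exact code of the official runs:

library(dplyr)
library(tibble)
library(tidytext)    # unnest_tokens(), stop_words, bind_tf_idf()
library(SnowballC)   # wordStem()

# toy documents standing in for the casedocs/statutes of the collection
docs <- tibble(doc_id = c("casedoc_1", "casedoc_2"),
               text   = c("Petitioner Masud Khan prays for his release",
                          "The Judgment was delivered by the court"))

doc_terms <- docs %>%
  unnest_tokens(word, text) %>%               # split text into words
  anti_join(stop_words, by = "word") %>%      # remove stopwords
  filter(nchar(word) >= 2) %>%                # remove words with fewer than two characters
  mutate(term = wordStem(word)) %>%           # stem words (a lemmatizer, e.g. textstem, could be used instead)
  count(doc_id, term, name = "n") %>%         # term frequency of each word in each document
  bind_tf_idf(term, doc_id, n)                # tf, idf and tf-idf for each word

The resulting table of per-document term frequencies and tf-idf weights is also the starting point for the BM25 scoring step, sketched at the end of Section 3.2.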
3. Experiments

In this section, we briefly describe the settings of the official runs that we submitted for this task and the preliminary results provided by the organizers before the workshop.

3.1. Dataset

The datasets of the two subtasks consisted of:

1. Subtask a (precedent retrieval):
   • Corpus: 3,257 casedocs;
   • Queries: 50 descriptions of situations.
2. Subtask b (statute retrieval):
   • Corpus: 197 statutes;
   • Queries: 50 descriptions of situations (same as subtask a).

3.2. Run Settings

For subtask a, we split each casedoc into the following parts (we add an example for casedoc number 1):

• who: Masud Khan v State Of Uttar Pradesh;
• where: Supreme Court of India;
• when: 26 September 1973;
• what: Writ Petition No. 117 of 1973;
• delivered: The Judgment was delivered by : A. Alagiriswami, J.;
• text: 1. Petitioner Masud Khan prays for his release on the ground that he, an Indian citizen . . .

In the runs of subtask a, we used only the last field (which we named 'text') for the retrieval of precedents. The goal of our experiments is to compare the effectiveness of the different lexical choices (stem or lemma) with a baseline (BM25). We submitted three runs for each subtask, following the same procedure for both:

• bm25_lemma: this run uses a BM25 retrieval model with lemmas;
• tfidf_lemma: this run uses a tf-idf document representation and a cosine similarity score on lemmas;
• tfidf_stem: this run uses a tf-idf document representation and a cosine similarity score on stems.
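As an illustration of the two scoring functions used in the runs above (a sketch under assumptions, not the exact submitted code), the snippet below ranks the toy documents of Section 2.1 for one query, first with a cosine similarity over the tf-idf vectors and then with a plain BM25 weighting. The query text is invented, and the parameter values k1 = 1.2 and b = 0.75 are common defaults assumed here. The snippet continues from the doc_terms table built at the end of Section 2.1 (same packages loaded there); the same code applies to lemmas once lemmatized terms replace the stems.

# a toy query, cleaned with the same pipeline as the documents
query_terms <- tibble(doc_id = "query_1",
                      text   = "release of an Indian citizen") %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  filter(nchar(word) >= 2) %>%
  mutate(term = wordStem(word)) %>%
  count(doc_id, term, name = "q_n")

# --- tf-idf document representation + cosine similarity (tfidf_* runs) ---
idf_table <- doc_terms %>% distinct(term, idf)          # collection idf of each term

query_vec <- query_terms %>%
  inner_join(idf_table, by = "term") %>%                # keep only terms seen in the collection
  mutate(q_w = q_n * idf)                               # query weight: query tf times idf
q_norm <- sqrt(sum(query_vec$q_w^2))

doc_norms <- doc_terms %>%
  group_by(doc_id) %>%
  summarise(d_norm = sqrt(sum(tf_idf^2)), .groups = "drop")

cosine_ranking <- doc_terms %>%
  inner_join(select(query_vec, term, q_w), by = "term") %>%
  group_by(doc_id) %>%
  summarise(dot = sum(tf_idf * q_w), .groups = "drop") %>%
  inner_join(doc_norms, by = "doc_id") %>%
  mutate(cosine = dot / (d_norm * q_norm)) %>%
  arrange(desc(cosine))

# --- plain BM25 (bm25_lemma run) ---
k1 <- 1.2
b  <- 0.75
N  <- n_distinct(doc_terms$doc_id)                      # number of documents

doc_len <- doc_terms %>%
  group_by(doc_id) %>%
  summarise(dl = sum(n), .groups = "drop") %>%
  mutate(avgdl = mean(dl))                              # average document length

df_table <- doc_terms %>%
  group_by(term) %>%
  summarise(df = n_distinct(doc_id), .groups = "drop")  # document frequency of each term

bm25_ranking <- doc_terms %>%
  semi_join(query_terms, by = "term") %>%               # keep only query terms
  inner_join(doc_len, by = "doc_id") %>%
  inner_join(df_table, by = "term") %>%
  mutate(idf_bm25 = log((N - df + 0.5) / (df + 0.5) + 1),
         score = idf_bm25 * n * (k1 + 1) / (n + k1 * (1 - b + b * dl / avgdl))) %>%
  group_by(doc_id) %>%
  summarise(bm25 = sum(score), .groups = "drop") %>%
  arrange(desc(bm25))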
3.3. Results

The organizers of this task provided the results (averaged across topics) achieved by a number of baselines as well as by the runs of each participant. In Table 1 and Table 2, we show a summary of these results. A preliminary analysis shows that, in terms of standard evaluation measures such as MAP, BPREF, reciprocal rank, and P@10, the run using BM25 with lemmas performed worse than the other two runs.

For subtask a (Table 1), the performance of our runs was poor compared to the median values for all the measures. This result, compared with the performance on the other subtask, requires a failure analysis to understand how the choice of just one field affected the performance, as well as whether additional information in the word statistics may be useful (see, for example, team HLJIT2019-AILA at AILA 2019 [6]).

For subtask b (Table 2), the run using a vector space model with tf-idf on stems was the second best overall. It is interesting to see that, despite the type of query (a complex description of a situation) and the result in the previous subtask, this approach performed very well compared to the other systems.

Table 1: Summary of the results for subtask 1-a, precedents retrieval. Runs are ordered by MAP.

Participant        Run_ID              MAP   BPREF  recip_rank  P@10
UB                 UB-3                0.16  0.11   0.24        0.08
double_liu_2020    double_liu_2020_3   0.14  0.10   0.19        0.07
fs_hu              fs_hu_task1a        0.14  0.09   0.20        0.10
double_liu_2020    double_liu_2020_1   0.13  0.07   0.20        0.07
TUW_informatics    basic               0.13  0.07   0.19        0.07
fs_hit_1           fs_hit_1_task1a_01  0.13  0.09   0.19        0.07
LAWNICS            LAWNICS_2           0.13  0.09   0.16        0.10
TUW_informatics    word_count          0.13  0.07   0.19        0.06
SSNCSE_NLP         task_1a_1           0.13  0.09   0.20        0.08
fs_hit_2           fs_hit_2_task1a_01  0.12  0.07   0.19        0.07
double_liu_2020    double_liu_2020_2   0.12  0.06   0.20        0.08
UB                 UB-1                0.12  0.07   0.20        0.09
fs_hit_1           fs_hit_1_task1a_02  0.12  0.07   0.21        0.09
UB                 UB-2                0.12  0.08   0.20        0.07
TUW_informatics    false_friends       0.11  0.07   0.19        0.05
LAWNICS            LAWNICS_1           0.11  0.08   0.16        0.08
Uottawa_NLP        run3_TFIDF          0.08  0.04   0.12        0.05
fs_hit_1           fs_hit_1_task1a_03  0.07  0.03   0.11        0.07
SSNCSE_NLP         task_1a_2           0.07  0.04   0.10        0.05
IMS_UNIPD          tfidf_lemma         0.06  0.03   0.11        0.02
IMS_UNIPD          tfidf_stem          0.06  0.03   0.11        0.03
IMS_UNIPD          bm25_lemma          0.04  0.01   0.15        0.03
fs_hit_2           fs_hit_2_task1a_02  0.01  0.00   0.04        0.02
Uottawa_NLP        run1_Glove          0.01  0.00   0.02        0.00
fs_hit_2           fs_hit_2_task1a_03  0.01  0.00   0.04        0.02
Uottawa_NLP        run2_Doc2Vec        0.00  0.00   0.01        0.00

Table 2: Summary of the results for subtask 1-b, statutes retrieval. Runs are ordered by MAP.

Participant        Run_ID              MAP   BPREF  recip_rank  P@10
scnu               scnu_1              0.39  0.31   0.56        0.18
SSNCSE_NLP         task_1b_2           0.34  0.14   0.34        0.07
IMS_UNIPD          tfidf_stem          0.34  0.28   0.53        0.17
IMS_UNIPD          tfidf_lemma         0.32  0.26   0.53        0.17
UB                 UB-2                0.31  0.26   0.58        0.15
UB                 UB-1                0.31  0.26   0.57        0.14
SSN_NLP            R1                  0.30  0.25   0.48        0.15
LAWNICS            LAWNICS_1           0.30  0.28   0.46        0.13
TUW_informatics    basic               0.26  0.20   0.49        0.13
TUW_informatics    word_count          0.26  0.21   0.39        0.14
Uottawa_NLP        run3_TFIDF          0.25  0.19   0.31        0.12
fs_hu              fs_hu_task1b        0.23  0.20   0.36        0.08
TUW_informatics    false_friends       0.23  0.19   0.38        0.10
IMS_UNIPD          bm25_lemma          0.23  0.16   0.46        0.15
fs_hit_1           fs_hit_1_task1b_03  0.21  0.16   0.34        0.13
fs_hit_2           fs_hit_2_task1b_01  0.20  0.16   0.35        0.10
LAWNICS            LAWNICS_2           0.20  0.15   0.48        0.09
fs_hit_2           fs_hit_2_task1b_03  0.19  0.13   0.28        0.10
UB                 UB-3                0.19  0.15   0.25        0.09
fs_hit_2           fs_hit_2_task1b_02  0.18  0.12   0.25        0.12
fs_hit_1           fs_hit_1_task1b_01  0.17  0.09   0.22        0.12
fs_hit_1           fs_hit_1_task1b_02  0.17  0.09   0.22        0.12
Uottawa_NLP        run1_Glove          0.15  0.08   0.34        0.10
scnu               scnu_3              0.13  0.05   0.15        0.12
SSNCSE_NLP         task_1b_1           0.12  0.07   0.27        0.07
nlpninjas          nlpninjas_st1       0.09  0.02   0.12        0.07
Uottawa_NLP        run2_Doc2Vec        0.04  0.01   0.07        0.02
scnu               scnu_2              0.03  0.00   0.02        0.00

4. Final Remarks and Future Work

The aim of our participation in the FIRE AILA 2020 Task 1 was to test the effectiveness of a reproducible baseline without any learning strategy. The initial results show a completely different scenario depending on which subtask is considered. The approach seems very promising, but a failure analysis and a topic-by-topic comparison are needed to understand when and how the different combinations in the retrieval pipeline are significantly better or worse than the other models.

References

[1] P. Bhattacharya, P. Mehta, K. Ghosh, S. Ghosh, A. Pal, A. Bhattacharya, P. Majumder, Overview of the FIRE 2020 AILA track: Artificial Intelligence for Legal Assistance, in: Proceedings of FIRE 2020 - Forum for Information Retrieval Evaluation, 2020.

[2] L. Goeuriot, H. Suominen, L. Kelly, Z. Liu, G. Pasi, G. S. Gonzales, M. Viviani, C. Xu, Overview of the CLEF eHealth 2020 task 2: Consumer health search with ad hoc and spoken queries, in: Working Notes of Conference and Labs of the Evaluation (CLEF) Forum, CEUR Workshop Proceedings, 2020.

[3] G. M. Di Nunzio, A study on a stopping strategy for systematic reviews based on a distributed effort approach, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction - 11th International Conference of the CLEF Association, CLEF 2020, Thessaloniki, Greece, September 22-25, 2020, Proceedings, 2020, pp. 112-123. URL: https://doi.org/10.1007/978-3-030-58219-7_10. doi:10.1007/978-3-030-58219-7_10.

[4] G. M. Di Nunzio, Classification of Animal Experiments: A Reproducible Study. IMS Unipd at CLEF eHealth Task 1, in: L. Cappellato, N. Ferro, D. E. Losada, H. Müller (Eds.), Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019, volume 2380 of CEUR Workshop Proceedings, CEUR-WS.org, 2019. URL: http://ceur-ws.org/Vol-2380/paper_104.pdf.

[5] H. Wachsmuth, Text Analysis Pipelines - Towards Ad-hoc Large-Scale Text Mining, volume 9383 of Lecture Notes in Computer Science, Springer, 2015. URL: https://doi.org/10.1007/978-3-319-25741-9. doi:10.1007/978-3-319-25741-9.

[6] Z. Zhao, H. Ning, L. Liu, C. Huang, L. Kong, Y. Han, Z. Han, FIRE2019@AILA: Legal information retrieval using improved BM25, in: Working Notes of FIRE 2019 - Forum for Information Retrieval Evaluation, Kolkata, India, December 12-15, 2019, 2019, pp. 40-45. URL: http://ceur-ws.org/Vol-2517/T1-7.pdf.