Legal Assistance using Word Embeddings

S. Kayalvizhi, D. Thenmozhi and Chandrabose Aravindan

Department of CSE, SSN College of Engineering, Chennai
{kayalvizhis,theni_d,aravindanc}@ssn.edu.in

Abstract. Legal counsellors routinely consult prior cases and statutes to ensure justice. Reviewing all prior cases is time-consuming because the collection is vast. Artificial intelligence can be used to select the most relevant documents. Existing systems have used different word embeddings, machine learning and deep learning methods to select the relevant ones. In our approach, different vectorization methods such as Word2Vec, GloVe, Tf-Idf and count vectorizer are used, and similarity is then calculated using Jaccard similarity and cosine similarity to rank the prior cases and statutes. The work is evaluated on the AILA@FIRE-2019 dataset, which provides two tasks: finding the relevant prior cases and finding the relevant statutes.

Keywords: Artificial Intelligence · Machine learning · Cosine similarity · Vectorization · Word Embeddings

1 Introduction

Anyone who needs legal assistance has to search for the prior cases and statutes relevant to the current case. The search is laborious because the number of such documents is very large. Using artificial intelligence for search and retrieval seems to be an effective idea compared with manual retrieval. Different machine learning and deep learning methods, including Doc2Vec, Tf-Idf and LSTM, have been used for retrieving prior cases [2, 5, 6, 7, 8].

AILA@FIRE-2019 [3] announced two tasks: identifying relevant prior cases and identifying relevant statutes. Identifying relevant prior cases refers to retrieving the prior cases that are similar to a given query case, and identifying relevant statutes refers to retrieving the statutes that are relevant to a given query.

2 Proposed Methodology

2.1 Dataset Description

The AILA@FIRE-2019 dataset has 50 queries, for which the related statutes and prior cases have to be found among the given 197 statutes and 3000 prior cases. A gold standard is provided for the first 10 queries, which can be taken as the training set, and the remaining 40 queries are considered the test set.

For retrieving the relevant cases and statutes, the documents are first vectorized using Word2Vec, Tf-Idf, count vectorizer and an ensemble of GloVe and Word2Vec. After vectorization, the documents are ranked by their similarity to the query, computed using cosine similarity and Jaccard similarity.

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). FIRE 2019, 12-15 December 2019, Kolkata, India.

2.2 Task 1: Identifying relevant cases

Word2Vec: The query document and the case document are first vectorized using a Word2Vec model of dimension 300. After vectorizing the whole document, the max-min vector of its word vectors is taken to represent the document as a single vector. Let 'a' be the max-min vector of the query and 'b' the max-min vector of the case document. The case documents are ranked by the cosine similarity [4] between a and b.

Ensembling Word2Vec and GloVe: In this method, the queries and documents are vectorized using both Word2Vec and GloVe, and the resulting vectors are concatenated. For a single query and case document, the case document is vectorized using GloVe to form the vector 'a1' and using the Word2Vec model to form the vector 'a2'. The two vectors a1 and a2 are concatenated to form the vector 'a'. The same process is applied to the query document to form the vector 'b'. The case documents are then ranked by the cosine similarity [4] between a and b.
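
The two Task 1 runs described above can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the paper: "max-min vector" is read as the element-wise maximum and minimum over a document's word vectors, concatenated; the embedding models are replaced by random stand-in vectors over a toy vocabulary; the GloVe dimension of 100 and all function names are hypothetical.

```python
import math
import random

rng = random.Random(0)
VOCAB = ["court", "appeal", "murder", "evidence", "land", "tax", "contract", "bail"]


def random_model(dim):
    """Hypothetical stand-in for a trained embedding model (random vectors)."""
    return {word: [rng.gauss(0.0, 1.0) for _ in range(dim)] for word in VOCAB}


word2vec = random_model(300)  # the paper uses a 300-dimensional Word2Vec model
glove = random_model(100)     # the GloVe dimension here is an assumption


def max_min_vector(tokens, model):
    """Assumed reading of "max-min vector": element-wise max and min over the
    word vectors of a document, concatenated into one fixed-length vector."""
    vectors = [model[t] for t in tokens if t in model]
    columns = list(zip(*vectors))  # one tuple per embedding dimension
    return [max(c) for c in columns] + [min(c) for c in columns]


def ensemble_vector(tokens):
    """Concatenate the GloVe-based and Word2Vec-based document vectors."""
    return max_min_vector(tokens, glove) + max_min_vector(tokens, word2vec)


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_cases(query_tokens, cases, vectorize):
    """Return case ids sorted by decreasing cosine similarity to the query."""
    q = vectorize(query_tokens)
    scores = {cid: cosine_similarity(q, vectorize(toks)) for cid, toks in cases.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy query and case documents (invented for illustration only).
query = "murder evidence appeal court".split()
cases = {
    "C1": "land tax contract".split(),
    "C2": "murder evidence appeal court bail".split(),
    "C3": "tax appeal land".split(),
}

w2v_ranking = rank_cases(query, cases, lambda toks: max_min_vector(toks, word2vec))
ens_ranking = rank_cases(query, cases, ensemble_vector)
print("Word2Vec run:", w2v_ranking)
print("Ensemble run:", ens_ranking)
```

In practice the stand-in dictionaries would be replaced by pretrained Word2Vec and GloVe lookups, and the documents would be tokenized and cleaned before pooling; neither step is specified in the paper.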

Tf-Idf vectorizer: In this method, the documents and queries are all vectorized using a Tf-Idf vectorizer. The entire corpus (queries and cases) is used to build the vocabulary on which the vectorizer is fitted. The query vector forms 'a' and the document vector forms 'b'. The case documents are ranked by the cosine similarity [4] between a and b.

2.3 Task 2: Identifying relevant statutes

Tf-Idf vectorizer: In this method, the documents and queries are all vectorized using a Tf-Idf vectorizer. The entire corpus (queries and statutes) is used to build the vocabulary on which the vectorizer is fitted. The query vector forms 'a' and the statute vector forms 'b'. The statute documents are ranked by the cosine similarity [4] between a and b.

Jaccard similarity: In this method, the vocabulary of the query document forms 'a' and the vocabulary of the statute document forms 'b'. The statute documents are ranked by the Jaccard similarity [1] between a and b.

Count vectorizer: In this method, the documents and queries are all vectorized using a count vectorizer, i.e., by the count of each word in the documents (query and statute). The frequent words of the query document form its dictionary, whose vector forms 'a'. The frequent words of the statute document form its dictionary, whose vector forms 'b'. The statutes are ranked by the cosine similarity [4] between a and b.

3 Results

Table 1 shows the results of Task 1, identifying the relevant cases for a given case, and Table 2 shows the results of Task 2, identifying the relevant statutes. Different evaluation metrics such as MAP, P@10, BPREF and the reciprocal rank (1/rank) of the first relevant document have been used to evaluate the performance; the MAP scores are reported here.
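
To make the Jaccard ranking of Section 2.3 and the MAP metric mentioned above concrete, the following is a minimal sketch on invented toy data. Lowercasing, whitespace tokenization, the statute and query texts, the gold labels and all function names are assumptions made for illustration; the track's official evaluation may differ in detail.

```python
def jaccard_similarity(a, b):
    """|a ∩ b| / |a ∪ b| for two vocabularies given as sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def vocabulary(text):
    # Assumption: lowercase + whitespace split; the paper does not specify tokenization.
    return set(text.lower().split())


def rank_by_jaccard(query_text, statutes):
    """Return statute ids sorted by decreasing Jaccard similarity to the query."""
    q = vocabulary(query_text)
    scores = {sid: jaccard_similarity(q, vocabulary(text)) for sid, text in statutes.items()}
    return sorted(scores, key=scores.get, reverse=True)


def average_precision(ranking, relevant):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(rankings, gold):
    """Average of the per-query average precision values."""
    return sum(average_precision(rankings[q], gold[q]) for q in rankings) / len(rankings)


# Toy statutes, queries and gold labels (invented, not from AILA@FIRE-2019).
statutes = {
    "S1": "punishment for murder",
    "S2": "transfer of land and property",
    "S3": "punishment for theft of property",
}
queries = {
    "Q1": "the accused was charged with murder and sought reduced punishment",
    "Q2": "dispute over transfer of ancestral land",
}
gold = {"Q1": {"S1"}, "Q2": {"S2"}}

rankings = {qid: rank_by_jaccard(text, statutes) for qid, text in queries.items()}
map_score = mean_average_precision(rankings, gold)
print(rankings, map_score)
```

Because Jaccard similarity ignores word frequency, long queries with many unrelated words dilute the score, which is one reason a weighted representation such as Tf-Idf may rank statutes differently.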

From the results, Word2Vec vectorization seems to perform better than the other vectorization methods for identifying the relevant cases, and Tf-Idf vectorization seems to be better for identifying the relevant statutes.

Teams and Runs              MAP score
SSN_NLP Run 1               0.0405
HLJIT2019                   0.1492
Jiaming Gao                 0.1382
Baban Gain                  0.0984
TRDDC Pune                  0.0956
Yunqiu Shao                 0.0689
Sara Renjit                 0.0481
Soumil Mandal - JU_SRM      0.0478
Kavya S Ganesh              0.0416

Table 1. Final evaluation for Test Data - TASK 1

Teams and Runs              MAP score
SSN_NLP Run 1               0.077
SSN_NLP Run 2               0.061
SSN_NLP Run 3               0.051
Yunqiu Shao                 0.156
UBLTM                       0.102
Sara Renjit                 0.096
Soumil Mandal - JU_SRM      0.083
HLJIT2019                   0.081
Kavya S Ganesh              0.068
Baban Gain - IITP           0.036

Table 2. Final evaluation for Test Data - TASK 2

4 Conclusion

Artificial intelligence helps in many ways to identify the relevant documents among the entire set of prior documents in the legal domain. Different vectorization methods for computing similarity are experimented with on the AILA@FIRE-2019 dataset. Word2Vec, an ensemble of Word2Vec and GloVe, the Tf-Idf vectorizer and the count vectorizer are used for vectorization. After vectorization, cosine similarity is calculated to rank the documents. Among these, Word2Vec seems to perform better than the other vectorization methods for Task 1, identifying the relevant cases, and Tf-Idf seems to perform better than the other vectorization methods for Task 2, identifying the relevant statutes. The performance can be further improved by other machine learning and deep learning methods.

Acknowledgement

We would like to thank the Science and Engineering Research Board (SERB), Department of Science and Technology, for funding the GPU system (EEQ/2018/000262) on which this work has been carried out.

References

1. Achananuparp, P., Hu, X., Shen, X.: The evaluation of sentence similarity measures. In: International Conference on Data Warehousing and Knowledge Discovery. pp. 305–316.
Springer (2008)
2. BarathiGaneshH., B., Kumar, M.A., Soman, K.P.: Distributional semantic representation for text classification and information retrieval. In: FIRE (2016)
3. Bhattacharya, P., Ghosh, K., Ghosh, S., Pal, A., Mehta, P., Bhattacharya, A., Majumder, P.: Overview of the FIRE 2019 AILA track: Artificial Intelligence for Legal Assistance. In: Proceedings of FIRE 2019 - Forum for Information Retrieval Evaluation (December 2019)
4. Huang, A.: Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand. vol. 4, pp. 9–56 (2008)
5. Locke, D., Zuccon, G.: Automatic cited decision retrieval: Working notes of ielab for FIRE legal track precedence retrieval task. In: FIRE (2017)
6. Sandeep, G.V., Bharadwaj, S.: An extraction based approach to keyword generation and precedence retrieval: BITS Pilani - Hyderabad. In: FIRE (2017)
7. Thenmozhi, D., Kannan, K., Aravindan, C.: A text similarity approach for precedence retrieval from legal documents.
8. Tian, L., Ning, H., Kong, L., Han, Z., Xiao, R., Qi, H.: Hljit2017@irled-fire2017: Information retrieval from legal documents. In: FIRE (2017)