=Paper=
{{Paper
|id=Vol-2036/T3-2
|storemode=property
|title=Distributed Representation in Information Retrieval - AMRITA_CEN_NLP@IRLeD 2017
|pdfUrl=https://ceur-ws.org/Vol-2036/T3-2.pdf
|volume=Vol-2036
|authors=Barathi Ganesh HB,Reshma U,Anand Kumar M,Soman KP
|dblpUrl=https://dblp.org/rec/conf/fire/BUMP17
}}
==Distributed Representation in Information Retrieval - AMRITA_CEN_NLP@IRLeD 2017==
Barathi Ganesh HB, Reshma U, Anand Kumar M and Soman KP

Center for Computational Engineering and Networking, Amrita University, Coimbatore, India

barathiganesh.hb@gmail.com, reshma.anata@gmail.com, m_anandkumar@cb.amrita.edu, soman_kp@amrita.edu

ABSTRACT

In this contemporary research era, the science of retrieving required information from a stored database is extending its applications into the legal and life science domains. With the exponential growth of digital data available in the legal domain as electronic media, there is a great demand for efficient and effective ways to retrieve the required information from a stored document collection. This paper details the approach we experimented with in the Information Retrieval from Legal Documents (IRLeD 2017) task. The task includes two subtasks: subtask 1 deals with information extraction and subtask 2 deals with document retrieval. Text representation being a core component of any text analytics solution, we experimented on the provided dataset to observe the performance of distributed representations of text in the Information Retrieval task. The distributed representation of text attained 3rd position in subtask 1 and a satisfactory score in subtask 2.

CCS CONCEPTS

• Information systems → Information retrieval; • Computing methodologies → Natural language processing

KEYWORDS

Information Extraction, Information Retrieval, Text Representation, Distributed Representation, Legal Documents, Catchphrase Extraction, Doc2Vec

1 INTRODUCTION

The success of Optical Character Recognition (OCR) and the availability of digital documents in the legal domain encourage researchers to automate the processes involved in that domain. Among these processes, Information Retrieval (IR) is fundamental [4]: in the legal domain it helps in retrieving the prior cases related to a current case (precedence retrieval), and it can act as a supporting reference for legal practitioners [3].

In legal documents, the frequency of content words (domain-dependent informative words) is higher than that of functional words (commonly used uninformative words). The complex structure of these legal documents reduces the effectiveness of both representation and retrieval. Thus, storing these documents with metadata instead of the raw data will enhance the performance of the retrieval process. One such form of metadata is the catchphrases (lists of legal terms), which can be extracted through an Information Extraction (IE) task [3].

As stated earlier, this shared task involves two subtasks. Subtask 1 deals with catchphrase extraction from the legal documents, and subtask 2 deals with retrieval of documents related to the current case document from the prior case documents [3].

Text representation is a principal component of any text analytics problem, and it is directly proportional to the performance of the system. Most current retrieval systems follow frequency-based representation methods [1]. This is ineffective when we need to retrieve documents with respect to their context. Representing the context of a document is ineffective both in count-based representation methods (Document-Term Matrix and Term Frequency - Inverse Document Frequency Matrix) and in distributional representation methods (count-based representation followed by matrix factorization) [1].

The cons stated above led us to observe the performance of distributed representation in IR and IE. Here, document to vector (Doc2Vec) is used to obtain the distributed representation of the documents and the phrases. For this experiment, the dataset was provided by the Information Retrieval from Legal Documents (IRLeD) shared task organizers (https://sites.google.com/view/fire2017irled/track-description?authuser=0). After representation, we utilized cosine distance for ranking the retrieved documents as well as the extracted phrases. The remainder of the paper discusses distributed representation in Section 2; the experiments and observations are detailed in Section 3.

2 DISTRIBUTED REPRESENTATION

Though the count-based methods and distributional representation methods have the ability to include a word's context through n-grams, they suffer from the selection of n-gram phrases, sparsity, and the curse of dimensionality [5]. To overcome the above stated cons, distributed representation is used to compute a fixed-size dense vector representation of texts [2]. This representation method has the capability of representing the context of variable-length text in a fixed-size dense vector. The dimension of the vector is configurable and typically ranges from a hundred to a thousand.

Word to Vectors (Word2Vec) is a framework for learning word vectors, shown in Fig. 1a. The architecture is similar to an auto-encoder, where the input is the one-hot encoded context words and the output is the one-hot encoded target word to be predicted. The intermediate learned weights map the context to the target to be predicted [2]. In Fig. 1a, the context of three words is mapped to predict the fourth word by learning the matrix W. The column vectors of the matrix W are known as word embeddings (dense word vectors).

Doc2Vec is a framework for learning document or sentence vectors, shown in Fig. 1b. The architecture is similar to the Word2Vec architecture shown in Fig. 1a. The only change is the introduction of a matrix D along with the matrix W to map the context words to the target words to be predicted [2]. Here the concatenation or average of column vectors from D and W is used to predict the target word. In Word2Vec the word itself acts as the symbol to retrieve the corresponding vector from the matrix W, but in Doc2Vec a symbolic label is assigned to each document to retrieve the corresponding vector from the matrix D.

[Figure 1: Distributed Representation. (a) Learning word vectors; (b) Learning phrase/document vectors]
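As an illustration of this setup, the following is a minimal sketch of training such document vectors with the Gensim library the authors cite, using the Distributed Memory mode and the parameter values reported in Section 3 (dimension 50, window 5, minimum count 1) and file names as document labels. The corpus contents and the epoch count are placeholders, not values from the paper; catchphrases would be tagged the same way as pseudo-documents.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Hypothetical corpus: maps a document label (its file name) to its raw text.
corpus = {
    "case_0001.txt": "the appellant filed a suit before the high court",
    "case_0002.txt": "the tribunal allowed the claim of the workman",
}

# Each document is tagged with its file name, as described in Section 3.
tagged = [TaggedDocument(words=text.lower().split(), tags=[name])
          for name, text in corpus.items()]

# Distributed Memory model (dm=1) with the parameters reported in Section 3:
# vector dimension 50, window size 5, minimum count 1.
# The epoch count is an assumption, not taken from the paper.
model = Doc2Vec(tagged, dm=1, vector_size=50, window=5, min_count=1, epochs=40)

# The learned vector for a document is retrieved via its tag
# (model.dv in Gensim 4.x; model.docvecs in the 3.x releases).
vec = model.dv["case_0001.txt"]
```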
3 EXPERIMENTS AND OBSERVATIONS

Datasets for both subtasks were provided by the Information Retrieval from Legal Documents (IRLeD) shared task organizers [3]. In subtask 1, we were provided with 100 legal case documents and their corresponding catchphrases for training. The objective is to extract catchphrases for 300 test documents and rank them with respect to their relevance to the corresponding documents. The given training and test documents (400 in total) are represented as vectors using Doc2Vec, as explained in Section 2. In both subtasks we utilized the Distributed Memory model for computing the document vectors. The file names of the documents are taken as their labels. Similarly to the documents, each catchphrase in the training documents is treated as a document itself and represented as a vector by assigning it a unique label. There are in total 98 unique catchphrases available in the given training set. This can be represented as

d = {d_1, d_2, ..., d_400}                 (1)
D = doc2vec({d_1, d_2, ..., d_400})        (2)
c = {c_1, c_2, ..., c_98}                  (3)
C = doc2vec({c_1, c_2, ..., c_98})         (4)

In the above equations, d_i denotes a raw document and c_i a raw catchphrase; D represents the document matrix, C represents the catchphrase matrix, and their columns D_i and C_i are the document vectors and catchphrase vectors. After representation, we computed the cosine distance between the catchphrase vectors and the document vectors. Based on these cosine distances we ranked the catchphrases for the final submission. The results are shown in Table 1.

Table 1: Results for subtask 1

Mean R Precision   Mean Precision@10   Mean Recall@100   Mean Average Precision   Overall Recall
0.168037101        0.1443333333        0.5352269964      0.1995772889             0.652431732
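To make the ranking step concrete, here is a minimal sketch of scoring all 98 catchphrases against a single document with the SciPy cosine distance the authors cite (defined as Eq. (9) below). The random matrices are placeholders standing in for the Doc2Vec outputs of Eqs. (2) and (4).

```python
import numpy as np
from scipy.spatial.distance import cosine  # computes 1 - u.v / (||u|| ||v||), i.e. Eq. (9)

rng = np.random.default_rng(0)
D = rng.random((400, 50))  # placeholder for the 400 x 50 document matrix of Eq. (2)
C = rng.random((98, 50))   # placeholder for the 98 x 50 catchphrase matrix of Eq. (4)

# For one document, rank all catchphrases: a smaller cosine distance
# means the phrase lies closer to the document in the embedding space.
doc_vector = D[0]
distances = np.array([cosine(doc_vector, phrase) for phrase in C])
ranked_phrase_ids = np.argsort(distances)  # most relevant catchphrases first
```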
For a few applications, the basic count-based methods perform better than the advanced representation methods. In order to observe this, we also experimented with the same approach using the Document-Term Matrix.

In subtask 2, the objective is to retrieve the relevant documents from the prior case documents by taking the current documents as queries. We were provided with 2000 prior case documents and 200 current case documents. Both document sets are represented as matrices through Doc2Vec. This can be represented as

prior = {p_1, p_2, ..., p_2000}                (5)
Prior = doc2vec({p_1, p_2, ..., p_2000})       (6)
current = {c_1, c_2, ..., c_200}               (7)
Current = doc2vec({c_1, c_2, ..., c_200})      (8)

In the above equations, Prior represents the prior document matrix, Current represents the current document matrix, and their columns P_i and C_i are the prior and current document vectors. As in subtask 1, the cosine distances between the current and prior document vectors are measured and ranked. We used the cosine distance from the Python SciPy package (https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.cosine.html). The measured cosine distance is given below:

distance = 1 - (u · v) / (||u||_2 ||v||_2)     (9)

In order to compare the different representation methods, we experimented with the same approach using the Term Frequency - Inverse Document Frequency Matrix and the Document-Term Matrix, each followed by a Singular Value Decomposition. While computing the SVD, the reduced dimension is 200. The obtained results are shown in Table 2.

Table 2: Results for subtask 2

Mean Average Precision   Mean Reciprocal Rank   Precision@10   Recall@100
0.0058                   0.0145                 0.0025         0.058

The Document-Term Matrix, the Term Frequency - Inverse Document Frequency Matrix and the Singular Value Decomposition are computed using the scikit-learn Python library (https://scikit-learn.org), and Doc2Vec is computed using the Gensim Python library (https://radimrehurek.com/gensim/).
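As a rough illustration of this comparison baseline, the sketch below builds the TF-IDF matrix and a truncated SVD with scikit-learn, then ranks prior cases for a current case by cosine distance. The toy corpus is a placeholder for the 2000 prior and 200 current case documents, and the component count is capped so the example runs on the toy data (the paper reduces to 200 dimensions on the full collection).

```python
import numpy as np
from scipy.spatial.distance import cosine
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the 2000 prior and 200 current case documents.
prior_texts = ["the court dismissed the appeal with costs",
               "the tribunal allowed the claim of the workman"]
current_texts = ["appeal against the order of the tribunal"]

# Term Frequency - Inverse Document Frequency matrix over the whole collection.
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(prior_texts + current_texts)

# SVD-reduced representation; the paper uses 200 dimensions, capped here
# so the decomposition stays valid on the tiny toy matrix.
svd = TruncatedSVD(n_components=min(200, min(X.shape) - 1))
X_red = svd.fit_transform(X)
prior_vecs = X_red[:len(prior_texts)]
current_vecs = X_red[len(prior_texts):]

# Rank prior cases for each current case by cosine distance (Eq. 9).
for query in current_vecs:
    order = np.argsort([cosine(query, p) for p in prior_vecs])
    print(order)  # indices of prior cases, most relevant first
```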
In both tasks, Doc2Vec is computed with the parameters: dimension = 50, minimum count = 1, window size = 5, model = Distributed Memory. For a real-time application, the recall should be high. In subtask 1, though the system attains a lower precision, it attains the highest overall recall compared with the other participating systems.

4 CONCLUSIONS

The documents and the phrases provided by the organizers are represented as matrices using a distributed representation method. In subtask 1, n-grams are extracted and their relevance to the documents is measured using cosine distance. Similarly, in subtask 2, the relevance between current and prior documents is ranked based on cosine distance.

This approach attained 3rd position in subtask 1 with a Mean Average Precision of 0.199, and also obtained the highest overall recall (0.652) among the participating systems. It attained a Mean Average Precision of 0.0058 in subtask 2. The absence of gold data in the training phase constrained us to tune the system using only the Doc2Vec hyperparameters. Hence, future work will focus on developing a performance measurement method for unsupervised retrieval systems.

REFERENCES

[1] Barathi Ganesh HB, Anand Kumar M, and Soman KP. 2016. Distributional Semantic Representation for Text Classification and Information Retrieval. (2016).
[2] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14). 1188–1196.
[3] Arpan Mandal, Kripabandhu Ghosh, Arnab Bhattacharya, Arindam Pal, and Saptarshi Ghosh. 2017. Overview of the FIRE 2017 track: Information Retrieval from Legal Documents (IRLeD). In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation (CEUR Workshop Proceedings). CEUR-WS.org.
[4] Mandar Mitra and BB Chaudhuri. 2000. Information retrieval from documents: A survey. Information Retrieval 2, 2-3 (2000), 141–163.
[5] Peter D Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37 (2010), 141–188.