            Distributed Representation in Information Retrieval -
                      AMRITA_CEN_NLP@IRLeD 2017
                           Barathi Ganesh HB, Reshma U, Anand Kumar M and Soman KP
                                   Center for Computational Engineering and Networking
                                                    Amrita University
                                                    Coimbatore, India
barathiganesh.hb@gmail.com, reshma.anata@gmail.com, m_anandkumar@cb.amrita.edu, soman_kp@amrita.edu

ABSTRACT
In the contemporary research era, the science of retrieving required information from a stored database is extending its applications into the legal and life science domains. With the exponential growth of the digital data available in the legal domain as electronic media, there is a great demand for efficient and effective ways to retrieve the required information from a stored document collection. This paper details the approach we experimented with in the Information Retrieval from Legal Documents (IRLeD 2017) task. The task includes two subtasks: subtask 1 deals with information extraction and subtask 2 deals with document retrieval. Text representation being a core component of any text analytics solution, we experimented on the provided dataset to observe the performance of distributed representation of text in the Information Retrieval task. The distributed representation of text attained 3rd position in subtask 1 and a satisfactory score in subtask 2.

CCS CONCEPTS
• Information systems → Information retrieval; • Computing methodologies → Natural language processing;

KEYWORDS
Information Extraction, Information Retrieval, Text Representation, Distributed Representation, Legal Documents, Catchphrase Extraction, Doc2Vec

1 INTRODUCTION
The success of Optical Character Recognition (OCR) and the availability of digital documents in the legal domain encourage researchers to automate the processes involved in that domain. Among these processes, Information Retrieval (IR) is fundamental [4]; in the legal domain it helps in retrieving the prior cases related to a current case (precedence retrieval), and it can act as a supporting reference for legal practitioners [3].

In legal documents, the frequency of content words (domain-dependent informative words) is higher than that of functional words (commonly used uninformative words). The complex structure of these legal documents reduces the effectiveness of both representation and retrieval. Thus, storing these documents with metadata instead of only the raw data will enhance the performance of the retrieval process. One such kind of metadata is the catchphrases (lists of legal terms), and these can be extracted through an Information Extraction (IE) task [3].

As mentioned earlier, this shared task involves two subtasks: subtask 1 deals with catchphrase extraction from the legal documents, and subtask 2 deals with retrieving the documents related to the current case document from the prior case documents [3].

Text representation is a principal component of any text analytics problem, and it is directly proportional to the performance of the system. Most current retrieval systems follow frequency-based representation methods [1]. This is ineffective when we need to retrieve documents with respect to their context. Representing the context of a document is ineffective both in count-based representation methods (Document-Term Matrix and Term Frequency-Inverse Document Frequency matrix) and in Distributional Representation methods (count-based representation followed by matrix factorization) [1].

The cons stated above motivated us to observe the performance of distributed representation in IR and IE. Here, document to vector (doc2vec) is used to obtain the distributed representation of the documents and the phrases. For this experiment, the dataset was provided by the Information Retrieval from Legal Documents (IRLeD) shared task organizers^1. After representation, we utilized cosine distance for ranking the retrieved documents as well as the extracted phrases. The remainder of the paper discusses distributed representation in Section 2; the experiments and observations are detailed in Section 3.

2 DISTRIBUTED REPRESENTATION
Though count-based methods and Distributional Representation methods have the ability to include a word's context through n-grams, they suffer from the selection of n-gram phrases, sparsity and the curse of dimensionality [5]. To overcome these cons, distributed representation is used to compute a fixed-size dense vector representation of texts [2]. This representation method has the capability of representing the context of variable-length text as a fixed-size dense vector. The dimension of the vector is a free parameter, and its value typically ranges from a hundred to a thousand.

Word to Vector (Word2Vec) is a framework for learning word vectors and is shown in Fig. 1a. The architecture is similar to an auto-encoder, where the input is the one-hot encoded context words and the output is the one-hot encoded target word to be predicted. The intermediate learned weights map the context to the target to be predicted [2]. In Fig. 1a, the context of three words is mapped to predict the fourth word by learning the matrix W. The column vectors of the matrix W are known as word embeddings (dense word vectors).
[Figure 1: Distributed Representation. (a) Learning word vectors. (b) Learning phrase/document vectors.]
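As a minimal sketch of this setup with the Gensim library (the paper's vectors are computed with Gensim, as noted in Section 3; the toy corpus and the Gensim 4.x argument names such as vector_size are our assumptions):

    from gensim.models import Word2Vec

    # Toy corpus (hypothetical): each document is a list of tokens.
    sentences = [
        ["the", "court", "granted", "the", "appeal"],
        ["the", "appeal", "was", "dismissed", "by", "the", "court"],
    ]

    # CBOW mode (sg=0): context words predict the target word, as in Fig. 1a.
    model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=0)

    word_vector = model.wv["appeal"]  # a dense word embedding (a column of W)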

Doc2Vec is a framework for learning document or sentence vectors and is shown in Fig. 1b. The architecture is similar to the Word2Vec architecture shown in Fig. 1a. The only change is the introduction of a matrix D alongside the matrix W to map the context words to the target words to be predicted [2]. Here, the concatenation or the average of column vectors from D and W is used to predict the target word. In Word2Vec the word itself acts as the symbol for retrieving the corresponding vector from the matrix W, whereas in Doc2Vec a symbolic label is assigned to each document for retrieving the corresponding vector from the matrix D.
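As a sketch of how such document labels are assigned in practice, the following uses Gensim's Doc2Vec in Distributed Memory mode with the parameter values reported in Section 3 (the file-name tags, the toy corpus and the number of epochs are illustrative assumptions):

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # Each document carries a symbolic label (here a hypothetical file name),
    # which is later used to look up its vector in the matrix D.
    corpus = [
        TaggedDocument(words=["the", "court", "granted", "bail"], tags=["case_001.txt"]),
        TaggedDocument(words=["the", "appeal", "was", "dismissed"], tags=["case_002.txt"]),
    ]

    # Distributed Memory model (dm=1) with dimension 50, window 5, minimum count 1.
    model = Doc2Vec(corpus, dm=1, vector_size=50, window=5, min_count=1, epochs=40)

    doc_vector = model.dv["case_001.txt"]  # the learned document vector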
3 EXPERIMENTS AND OBSERVATIONS
The datasets for both subtasks were provided by the Information Retrieval from Legal Documents (IRLeD) shared task organizers [3]. In subtask 1, we were provided with 100 legal case documents and their corresponding catchphrases for training. The objective is to extract catchphrases for 300 test documents and to rank them with respect to their relevance to the corresponding documents. The given training and test documents (400 in total) are represented as vectors using Doc2Vec, as explained in Section 2. In both subtasks we utilized the Distributed Memory model for computing the document vectors. The file names of the documents are taken as their labels. Similar to the documents, each catchphrase in the training documents is treated as a document itself and represented as a vector by assigning it a unique label. In total, 98 unique catchphrases are available in the given training set. This can be represented as,

    $d = \{d_1, d_2, \ldots, d_{400}\}$    (1)
    $D = \mathrm{doc2vec}(\{D_1, D_2, \ldots, D_{400}\})$    (2)
    $c = \{c_1, c_2, \ldots, c_{98}\}$    (3)
    $C = \mathrm{doc2vec}(\{C_1, C_2, \ldots, C_{98}\})$    (4)

In the above equations, D represents the document matrix, C represents the catchphrase matrix, D_i represents a document vector and C_i represents a catchphrase vector. After representation, we computed the cosine distance between the catchphrase vectors and the document vectors, and ranked the catchphrases by this distance for the final submission. The results are shown in Table 1. For a few applications the basic count-based methods perform better than the more advanced representation methods; in order to observe this, we also experimented with the same approach using a document-term matrix.

    Mean R-Precision | Mean Precision@10 | Mean Recall@100 | Mean Average Precision | Overall Recall
    0.168037101      | 0.1443333333      | 0.5352269964    | 0.1995772889           | 0.652431732

    Table 1: Results for subtask 1
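The ranking step described above can be sketched as follows, assuming a Doc2Vec model trained as in Section 2 with one tag per document and one tag per catchphrase (the tag names and the helper function are hypothetical):

    from scipy.spatial.distance import cosine  # implements Eq. (9) below

    def rank_catchphrases(model, doc_tag, phrase_tags):
        """Rank catchphrases for one document by cosine distance (smallest first)."""
        d = model.dv[doc_tag]
        scored = [(tag, cosine(d, model.dv[tag])) for tag in phrase_tags]
        return sorted(scored, key=lambda pair: pair[1])

    # Example: rank the 98 training catchphrases against one document.
    # ranking = rank_catchphrases(model, "case_001.txt", [f"phrase_{i}" for i in range(98)])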
In subtask 2, the objective is to retrieve the relevant documents from the prior case documents by taking the current case documents as queries. We were provided with 2000 prior case documents and 200 current case documents. Both document sets are represented as matrices through Doc2Vec. This can be represented as,

    $prior = \{p_1, p_2, \ldots, p_{2000}\}$    (5)
    $Prior = \mathrm{doc2vec}(\{P_1, P_2, \ldots, P_{2000}\})$    (6)
    $current = \{c_1, c_2, \ldots, c_{200}\}$    (7)
    $Current = \mathrm{doc2vec}(\{C_1, C_2, \ldots, C_{200}\})$    (8)

In the above equations, Prior represents the prior-document matrix, Current represents the current-document matrix, P_i represents a prior document vector and C_i represents a current document vector. As in subtask 1, the cosine distance between the current and prior document vectors is measured and used for ranking. We used the cosine-distance implementation from the Python SciPy package^2. The measured cosine distance is given below:

    $\mathrm{distance} = 1 - \dfrac{u \cdot v}{\lVert u \rVert_2 \, \lVert v \rVert_2}$    (9)
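A sketch of this pairwise ranking, with random matrices standing in for the Doc2Vec outputs of Eqs. (6) and (8):

    import numpy as np
    from scipy.spatial.distance import cdist

    rng = np.random.default_rng(0)
    Prior = rng.random((2000, 50))    # stand-in for the prior-document matrix, Eq. (6)
    Current = rng.random((200, 50))   # stand-in for the current-document matrix, Eq. (8)

    # Pairwise cosine distances (Eq. 9): row i holds the distances from
    # current document i to all 2000 prior documents.
    distances = cdist(Current, Prior, metric="cosine")
    rankings = np.argsort(distances, axis=1)  # prior-document indices, nearest first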
In order to compare different representation methods, we experimented with the same approach using a Term Frequency-Inverse Document Frequency matrix and a Document-Term Matrix followed by Singular Value Decomposition (SVD). While computing the SVD, the reduced dimension is 200. The obtained results are shown in Table 2.

    Mean Average Precision | Mean Reciprocal Rank | Precision@10 | Recall@100
    0.0058                 | 0.0145               | 0.0025       | 0.058

    Table 2: Results for subtask 2



The Document-Term Matrix, the Term Frequency-Inverse Document Frequency matrix and the Singular Value Decomposition are computed using the scikit-learn Python library^3. The Doc2Vec representation is computed using the Gensim Python library^4. In both tasks, Doc2Vec is computed with the parameters: dimension = 50, minimum count = 1, window size = 5, model = Distributed Memory. Recall should be high for real-time applications. In subtask 1, though the system attains a lower precision, it is able to attain the highest overall recall compared with the other participating systems.
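As a minimal sketch of the count-based baselines with scikit-learn (the two-document toy corpus is illustrative, and the SVD dimension is set to 2 here because the toy vocabulary is tiny; the paper reduces the full corpus to 200 dimensions):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["the court granted bail", "the appeal was dismissed by the court"]

    dtm = CountVectorizer().fit_transform(docs)      # Document-Term Matrix
    tfidf = TfidfVectorizer().fit_transform(docs)    # TF-IDF matrix

    # SVD-reduced representation of the TF-IDF matrix.
    svd = TruncatedSVD(n_components=2)
    reduced = svd.fit_transform(tfidf)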

4 CONCLUSIONS
The documents and the phrases provided by the organizers are represented as matrices using a distributed representation method. In subtask 1, n-grams are extracted and their relevance to the documents is measured using cosine distance. Similarly, in subtask 2, the relevance between current and prior documents is ranked based on cosine distance.

This approach attained 3rd position in subtask 1 with a Mean Average Precision of 0.199, and it also obtained the highest overall recall (0.652) among the participating systems. It attained a Mean Average Precision of 0.0058 in subtask 2. The absence of gold data during the training phase constrained us to tune the system only through the doc2vec hyper-parameters. Hence, future work will focus on developing a performance measurement method for unsupervised retrieval systems.

REFERENCES
[1] Barathi Ganesh HB, Anand Kumar M, and Soman KP. 2016. Distributional Semantic Representation for Text Classification and Information Retrieval. (2016).
[2] Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning (ICML-14). 1188–1196.
[3] Arpan Mandal, Kripabandhu Ghosh, Arnab Bhattacharya, Arindam Pal, and Saptarshi Ghosh. 2017. Overview of the FIRE 2017 track: Information Retrieval from Legal Documents (IRLeD). In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation (CEUR Workshop Proceedings). CEUR-WS.org.
[4] Mandar Mitra and BB Chaudhuri. 2000. Information retrieval from documents: A survey. Information Retrieval 2, 2-3 (2000), 141–163.
[5] Peter D Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37 (2010), 141–188.

1 https://sites.google.com/view/fire2017irled/track-description?authuser=0
2 https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.cosine.html
3 scikit-learn.org
4 https://radimrehurek.com/gensim/
