<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Track: Legal domain search with minimal domain knowledge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tobias Fink</string-name>
          <email>tobias.fink@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gabor Recski</string-name>
          <email>gabor.recski@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Allan Hanbury</string-name>
          <email>hanbury@ifs.tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>TU Wien, Faculty of Informatics, Research Unit E-commerce</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <abstract>
        <p>We tackle Task 1 in the AILA 2020 shared task, where the goal is to retrieve precedent and statute documents related to a case document query in the Indian legal domain. We use BM25 with simple hyperparameter tuning and preprocessing for both precedent and statute retrieval and achieve a Mean Average Precision (MAP) of 0.1294 and 0.2619, respectively. We also experiment with removing frequent terms from the query as well as removing terms that produce high scores only in irrelevant documents, but both methods fail to improve the baseline results.</p>
      </abstract>
      <kwd-group>
        <kwd>information retrieval</kwd>
        <kwd>legal domain</kwd>
        <kwd>BM25</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In domain-specific information retrieval (IR), each domain comes with its own challenges and its
own language. Gaining an understanding of the domain-specific language and knowing which
words and phrases can help to distinguish documents is important for IR, but unfortunately
the intricacies of such a language are often difficult to understand and only known by domain
experts. For example, in the case law system, there is the need to retrieve precedents and
relevant statutes for legal documents, such as cases. However, due to the length of such a document,
it can contain passages about several topics, not all of which are helpful in distinguishing
between documents, and it might contain terms of which only some are related to relevant facts
and rules.</p>
      <p>
        In Task 1 of the FIRE 2020 AILA track [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the goal is to retrieve relevant precedent cases judged
by the Indian Supreme Court and statutes from Indian law for queries consisting of legal
case documents. A training set consisting of legal document queries as well as relevant and
irrelevant precedent and statute documents is provided. It can be challenging for a non-expert to
understand why a document is relevant or irrelevant for a particular query, because sometimes
relevant and irrelevant documents seemingly deal with similar topics. This is more pronounced
with longer documents, such as precedent case documents.
      </p>
      <p>
        To tackle this task using only the relevance information and the text data of the provided
training set, we use the well-known BM25 document ranking algorithm [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] (implemented by the
open-source Lucene-based search tool elasticsearch, https://github.com/elastic/elasticsearch) and determine hyperparameters for BM25
based on a random search on the training set. Further, we use simple heuristics to detect query
terms that could be harmful to the desired search outcome and remove them from the query.
The heuristics decide whether a term should be removed based on the frequency of each term
in the corpus and the BM25 scores of each query term across a set of relevant and irrelevant
documents.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Dataset and Related Work</title>
      <p>
        The dataset is partially taken from last year’s AILA2019 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and consists of 3,257 prior case
documents (2,914 old, 343 new) and 197 statute documents, as well as 50 training queries for which
the relevance of precedents and statutes is known and 10 test queries. The queries consist of a
paragraph of raw text and are as such rather different from search queries typically entered into
web search engines, which are usually much shorter. The mean (standard deviation) of relevant
documents per query is 3.9 (3.82) and 4.42 (0.67) for precedents and statutes, respectively.
For AILA 2019, there were many submissions successfully employing BM25 in some form
or another. One of the top performers in both precedent and statute retrieval, Zhao et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
employ a new relevance score created by calculating BM25 on a filtered query document and an
unfiltered query document and adding the two scores. The filtering is done by ranking the query
terms according to their IDF-scores and taking the top 50 highest scoring terms. Additionally,
they also experiment with a Word2Vec based similarity function, which works well for statute
retrieval but not precedent retrieval. Similarly, Gao et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] submitted runs using TF-IDF
or Textrank to first extract the top 60 to 80 words from the query and using a Vector Space
Model (VSM), BM25 and a Language Model (LM) for retrieval. For Task 1 the TF-IDF based
query extraction paired with the VSM achieves 2nd place, followed by TF-IDF paired with BM25
achieving 4th place. They did not submit any runs for Task 2. For Task 1, Shao et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] extract
sentences containing the phrase “high court” as key sentences and utilize VSM, LM and a VSM
+ Mixed-Size Bigram Model combination but only achieve rank 10 for this task. For Task 2, they
use the entire description and utilize VSM, LM and a VSM + BM25 combination. They achieve
rank 1 (VSM), rank 2 (VSM+BM25) and rank 3 (LM) for statute retrieval. While in these works
the BM25 hyperparameters k1 and b are set to static values, the choice of which values should
be selected is also a topic of research in IR. For example, in Lipani et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], b is instead calculated
based on the mean average term frequency of a collection. Due to the overall good performance
of BM25 based methods, we also opt to experiment with this retrieval method.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>To retrieve data from the corpus, we create two indices using elasticsearch (https://www.elastic.co/elastic-stack), one for precedents
and one for statutes. While elasticsearch has its own stack of text pre-processing and analysis
tools, we opt to perform our text preprocessing outside of elasticsearch, since some of the
desired functionality, like lemmatization, is not supported by elasticsearch. Instead we use the
open-source natural language processing library spacy to tokenize the document text. We
further clean the text by removing punctuation tokens, numbers and typical English stopwords.
Finally, the tokens for each document are lemmatized, lowercased and then added to a single
indexed field. We generate our queries by applying the same procedure to the query documents,
but since the query documents can be very long and occasionally exceed the elasticsearch max
clause limit of 1024, we remove all duplicate tokens from the resulting list of tokens.</p>
      <sec id="sec-3-1">
        <title>3.1. Ranking Method</title>
        <p>We score the documents using the commonly used Okapi BM25 ranking function (as
implemented in elasticsearch), which is calculated using the following formula:
BM25(D, Q) = Σ_{q ∈ Q} IDF(q) ⋅ (f(q, D) ⋅ (k1 + 1)) / (f(q, D) + k1 ⋅ (1 − b + b ⋅ |D| / avgdl)) (1)
where D is the document to be scored, Q is the query, q is a query term/token occurring in the
query, f(q, D) is the term frequency of token q in document D, |D| is the length of the document
in tokens and avgdl is the average document length for documents in the collection. Further,
k1 and b are hyperparameters and IDF(q) is the inverse document frequency calculated by this
formula:
IDF(q) = ln((N − n(q) + 0.5) / (n(q) + 0.5) + 1) (2)
where N is the total number of documents and n(q) is the number of documents containing
query term q.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Hyperparameter Search</title>
        <p>Since we do not know what the best hyperparameter values for k1 and b are for our two tasks,
we decided to experiment with the selection of the values. Instead of taking the often used
values of k1 = 1.2 and b = 0.75, we do a random search to determine the values that best fit the
collection. We repeatedly select random values from an interval of [1.2, 2.0] for k1 and [0.0, 1.0]
for b, run our 50 training queries and evaluate the results. We take the values that resulted in
the best performance after 30 repetitions as our final values. We use the mean average precision
(MAP) metric to quantify the performance of an iteration, shown in the following formula:
MAP = (1 / |Q|) ⋅ Σ_{q ∈ Q} AP(q) (3)
where Q is the set of training queries, q is a single query and |Q| is the number of queries. Further,
the average precision AP(q) of a query q is calculated with the following formula:
AP(q) = (1 / |R_q|) ⋅ Σ_{k=1}^{n} P@k(q) ⋅ rel_q(k) (4)
where n is the number of retrieved documents, P@k(q) is the Precision @ k for query q, |R_q| is
the number of relevant documents for query q and rel_q(k) is 1 if the document at rank k is relevant,
otherwise 0.</p>
      </sec>
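      <p>For illustration, the BM25 ranking function and its IDF component can be sketched in a few lines of plain Python. This is a minimal toy implementation of the formulas, not the Lucene implementation used by elasticsearch (which adds its own analysis and optimizations); the toy corpus and all names are our own.</p>

```python
import math

def idf(term, docs):
    # inverse document frequency: ln((N - n(q) + 0.5) / (n(q) + 0.5) + 1)
    n_q = sum(1 for d in docs if term in d)
    return math.log((len(docs) - n_q + 0.5) / (n_q + 0.5) + 1)

def bm25(doc, query, docs, k1=1.2, b=0.75):
    # sum of per-term scores over the (deduplicated) query terms
    avgdl = sum(len(d) for d in docs) / len(docs)
    score = 0.0
    for q in set(query):
        f = doc.count(q)  # term frequency of q in doc
        norm = f + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf(q, docs) * f * (k1 + 1) / norm
    return score

# toy corpus: each document is a list of preprocessed tokens
docs = [["court", "appeal", "murder"],
        ["court", "tax", "income"],
        ["land", "acquisition", "act"]]
query = ["murder", "appeal"]
ranked = sorted(range(len(docs)), key=lambda i: bm25(docs[i], query, docs), reverse=True)
```

Ranking the toy corpus this way places the document containing both query terms first.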
      <sec id="sec-3-3">
        <title>3.3. Finding problematic terms</title>
        <p>As we are unfamiliar with the Indian legal domain and consequently do not know the typical
keywords and phrases of the domain, we attempt to gain some insight into the domain using
the relevance judgements that we have. We looked at the BM25 scores assigned to individual
query terms q (see Formula 1) in relevant and irrelevant documents and noticed a few issues:
• If there are enough query terms with a high term frequency and a high document
frequency, like “court”, this can cause an irrelevant document to be ranked higher than a
relevant one.
• There are query terms that have a high score in irrelevant documents, but not in relevant
ones, because they are either less frequent in relevant documents or do not occur there.
• Some documents that are relevant for poor-performing queries appear to be
suppressed by irrelevant documents. In these documents most high-scoring terms also
appear in irrelevant documents and have a higher term frequency (relative to document
length) there, while at the same time the documents contain no high-scoring terms that are
unique to them.</p>
        <p>Based on these findings, we develop a heuristic to detect query terms that would cause irrelevant
documents to be ranked higher than relevant ones. These “additional stopwords” detected by
the heuristic are then removed from the query and every remaining term of the query is treated
as a search term. We experiment with the following approaches for detecting these “additional
stopwords” in the query:
Word Count: We filter out the most frequent words in the corpus. We preprocess precedents
and statutes and count how often each term occurs in each respective corpus. Using this
information, we add the 200 most frequent terms to our list of “additional stopwords”. This is done
for precedent and statute documents separately.</p>
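        <p>The Word Count heuristic reduces to a corpus-wide frequency count. The following sketch is our own illustrative code, with a toy corpus and a top-n of 2 instead of the 200 used in our runs:</p>

```python
from collections import Counter

def frequent_terms(corpus, n):
    # count how often each term occurs across all preprocessed documents
    counts = Counter(token for doc in corpus for token in doc)
    # the n most frequent terms become "additional stopwords"
    return {term for term, _ in counts.most_common(n)}

# toy corpus of preprocessed (tokenized, lowercased, lemmatized) documents
corpus = [["court", "appeal", "court"],
          ["court", "section", "act"],
          ["act", "court", "appeal"]]
extra_stopwords = frequent_terms(corpus, 2)   # our runs use n = 200
query = ["court", "murder", "appeal"]
filtered = [t for t in query if t not in extra_stopwords]
```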
        <p>False Friends: We measure the BM25 scores assigned to individual query terms q (see Formula
1) of our training queries and compare the results for relevant and irrelevant documents. Using
the static hyperparameters k1 = 1.2 and b = 0.75, we calculate a ranking for each training
query. Then we retrieve the scores of each term for each relevant document and for the first
100 irrelevant documents, using the elasticsearch explain functionality. For each query term q
across all training queries, we calculate a classification from the maximum score of that token
over all retrieved relevant documents and, separately, over all retrieved irrelevant documents.
The classification of a token is ’Not Found’ if the query token was not found in the retrieved
documents, ’Low’ if the maximum score was at or below the threshold t and ’High’ if the
maximum score was above t. We add those tokens that are classified ’High’ for irrelevant
documents and ’Low’ or ’Not Found’ for relevant documents to our stopword list. Based on a
separate grid search experiment, we set the parameter t = 1.5.</p>
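        <p>The labeling step of the False Friends heuristic can be sketched as follows, assuming the per-term maximum BM25 scores have already been collected (e.g. from the elasticsearch explain output); the function names and toy scores are illustrative:</p>

```python
def label(max_score, t=1.5):
    # 'Not Found' if the term never occurred, otherwise 'High'/'Low' relative to t
    if max_score is None:
        return "Not Found"
    return "High" if max_score > t else "Low"

def false_friends(term_scores, t=1.5):
    # term_scores maps term -> (max score in relevant docs, max score in
    # irrelevant docs); None means the term was not found there
    stopwords = set()
    for term, (rel_max, irr_max) in term_scores.items():
        if label(irr_max, t) == "High" and label(rel_max, t) in ("Low", "Not Found"):
            stopwords.add(term)
    return stopwords

# toy scores: 'session' scores highly only in irrelevant documents
scores = {"murder": (3.2, 0.4), "session": (0.7, 2.9), "court": (None, 1.2)}
extra = false_friends(scores)
```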
        <p>However, tests with these methods on the training data using cross-validation showed that they
did not consistently improve the retrieval results. Due to a lack of further development time,
we submitted these methods as runs to measure their performance on the test set.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>We submitted three runs for precedents and statutes each. The basic run only performs
the hyperparameter search, while the word_count and false_friends run both include the
hyperparameter search and their respective method of detecting additional stopwords. The
results of these runs are shown in Table 1 and show that among our runs the basic method
generally achieves the best results. Compared to the other groups, our best method can be
found around the middle of the ranking. The best overall precedent retrieval MAP was 0.1573
(run UB-3) and the best overall statute retrieval MAP was 0.3851 (run scnu_1).</p>
      <p>This tells us that using BM25 with some preprocessing and hyperparameter tuning is still a good
start when trying to work with a new domain. However, our method of removing additional
stopwords from the query proved detrimental and other methods of extracting keywords from
document queries should be considered. Removing the most frequent terms from a query either
does not retrieve more relevant documents or makes relevant documents more difficult
to retrieve, hinting that these tokens still carry some useful information even if they are very
frequent. Also, the way we remove terms based on their BM25 scores might be very prone to
overfitting on the training set. It might still be possible that these removed terms are important
for unknown relevant documents of unknown queries. A better way to work with such a
’High/Low’ classification might be to assign higher weights to (boost) query terms that we
know score highly in relevant documents.</p>
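      <p>As a hypothetical sketch of this boosting idea (not something we submitted), one could build an elasticsearch bool query in which terms known to score highly in relevant documents receive a term-level boost; the field name text and the boost value are assumptions:</p>

```python
def boosted_query(terms, boosted_terms, boost=2.0, field="text"):
    # build an elasticsearch bool/should query; terms known to score highly
    # in relevant documents are weighted up via the term-level boost parameter
    should = []
    for term in terms:
        clause = {"term": {field: {"value": term}}}
        if term in boosted_terms:
            clause["term"][field]["boost"] = boost
        should.append(clause)
    return {"query": {"bool": {"should": should}}}

query = boosted_query(["court", "murder"], {"murder"})
```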
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>We conclude that BM25 can be a good starting point when working with an unfamiliar domain.
In the Indian legal domain and with little hyperparameter tuning, it achieves a MAP about 18%
lower than the top result on precedent retrieval and about 32% lower than the top result on
statute retrieval. We attempted to utilize word counts and the BM25 query token scores of
training queries to detect unimportant or harmful tokens as additional stopwords. However,
removing either from the query document did not improve results.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>Project partly supported by BRISE-Vienna (UIA04-081), a European Union Urban Innovative
Actions project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <article-title>Overview of the FIRE 2020 AILA track: Artificial Intelligence for Legal Assistance</article-title>
          ,
          <source>in: Proceedings of FIRE 2020 - Forum for Information Retrieval Evaluation</source>
          , Hyderabad, India,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gatford</surname>
          </string-name>
          , Okapi at TREC-3, Nist Special Publication Sp
          <volume>109</volume>
          (
          <year>1995</year>
          )
          <fpage>109</fpage>
          . Publisher: National Institute of Standards &amp; Technology.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <article-title>FIRE 2019 AILA Track: Artificial Intelligence for Legal Assistance</article-title>
          ,
          <source>in: Proceedings of the 11th Forum for Information Retrieval Evaluation</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>4</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <article-title>FIRE2019@ AILA: Legal Information Retrieval Using Improved BM25</article-title>
          ., in: Working Notes of FIRE 2019 -
          <article-title>Annual Meeting of the Forum for Information Retrieval Evaluation</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          , volume
          <volume>2517</volume>
          , Kolkata, India,
          <year>2019</year>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ning</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kong</surname>
          </string-name>
          , H. Qi,
          <article-title>FIRE2019@ AILA: Legal Retrieval Based on Information Retrieval Model</article-title>
          ., in: Working Notes of FIRE 2019 -
          <article-title>Annual Meeting of the Forum for Information Retrieval Evaluation</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          , volume
          <volume>2517</volume>
          , Kolkata, India,
          <year>2019</year>
          , pp.
          <fpage>64</fpage>
          -
          <lpage>69</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ye</surname>
          </string-name>
          , THUIR@ AILA 2019:
          <article-title>Information Retrieval Approaches for Identifying Relevant Precedents and Statutes</article-title>
          ., in: Working Notes of FIRE 2019 -
          <article-title>Annual Meeting of the Forum for Information Retrieval Evaluation</article-title>
          ,
          <source>CEUR Workshop Proceedings</source>
          , volume
          <volume>2517</volume>
          , Kolkata, India,
          <year>2019</year>
          , pp.
          <fpage>46</fpage>
          -
          <lpage>51</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lipani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lupu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Aizawa</surname>
          </string-name>
          ,
          <article-title>Verboseness fission for BM25 document length normalization</article-title>
          ,
          <source>in: Proceedings of the 2015 International Conference on the Theory of Information Retrieval</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>385</fpage>
          -
          <lpage>388</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>