<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On the importance of legal catchphrases in precedence retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Edwin Thuma</string-name>
          <email>thumae@mopipi.ub.bw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nkwebi P. Motlogelwa</string-name>
          <email>motlogel@mopipi.ub.bw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Botswana, Department of Computer Science</institution>
          ,
          <addr-line>Gaborone</addr-line>
          ,
          <country country="BW">Botswana</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>This paper presents our working notes for FIRE 2017, Information Retrieval from Legal Documents, Task 2 (Precedence Retrieval). Common law systems around the world recognize the importance of precedence in law. In making decisions, judges are obliged to consult prior cases that have already been decided to ensure that there is no divergence in the treatment of similar situations in different cases. Our approach was to investigate the effectiveness of using legal catchphrases in precedence retrieval. To improve retrieval performance, we incorporated term dependency in our retrieval. In addition, we investigated the effects of deploying query expansion on retrieval performance. Our results show an improvement in retrieval performance when we incorporate term dependence in scoring and ranking prior cases. However, we see a degradation in retrieval performance when we deploy query expansion.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Common Law Systems around the world recognize the importance
of precedence in Law. In making decisions, Judges are obliged to
align their decisions to relevant prior cases. Thus, when lawyers
prepare for cases, they research extensively on prior cases. In
addition, Judges also consult prior cases that had already been
decided to ensure that a similar situation is treated similarly in every
case [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This can be overwhelming due to the enormous number
of prior cases and the length of each. Task 2 of the Information
Retrieval from Legal Documents track (precedence retrieval) explores
techniques and tools that could ease this task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In general,
precedence retrieval will retrieve a ranked list of prior cases that are
related to a certain current case.
      </p>
      <p>In this work, we investigate the importance of legal catchphrases
as queries in precedence retrieval. These legal catchphrases are
extracted from current cases. To achieve this, we used a training set
of documents provided for Task 1 (catchphrase extraction), where
case documents have corresponding gold standard catchphrases. We
used the Term Frequency-Inverse Document Frequency (TF-IDF) term
weighting model to identify similarity between documents in the
training set and current cases. Queries were formulated using
legal catchphrases from the most relevant documents in the training
set.</p>
      <p>
        For retrieval, we deployed the parameter-free DPH term
weighting model to score and rank prior cases. Moreover, we investigate whether
taking the dependence of query terms into consideration when
scoring and ranking prior cases could improve the retrieval
performance. Previous work has shown that incorporating term
dependency in scoring and ranking documents could significantly
improve the retrieval performance [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In addition, we deployed query
expansion, where the original queries are reformulated by adding
new terms, to investigate its impact on retrieval performance.
Previous research has shown that query expansion could improve
retrieval effectiveness [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>This paper is structured as follows. Section 2 provides
background on the algorithms used. Section 3 describes the experimental
setup. In Section 4, we describe the methodologies used for the 3
runs submitted by team UB_Botswana_Legal for Task 2. Section 5
presents the results and discussion.</p>
    </sec>
    <sec id="sec-2">
      <title>BACKGROUND</title>
      <p>In this section, we present a brief but essential
background on the different algorithms used in our experimental
investigation and evaluation. We start by describing the TF-IDF term
weighting model in Section 2.1. We then describe the DPH term
weighting model in Section 2.2. Lastly, we describe the Bose-Einstein 1
(Bo1) model for query expansion in Section 2.3.</p>
    </sec>
    <sec id="sec-3">
      <title>TF-IDF term weighting model</title>
      <p>
        In our experimental setup, we used TF-IDF [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to score and rank
documents. Generally, TF-IDF calculates the weight of each term
t as the product of its term frequency (tf) weight in document d
and its inverse document frequency (idf).
      </p>
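      <p>
        <disp-formula>
          <label>(1)</label>
          <tex-math>\mathrm{score}_{\text{TF-IDF}}(d, Q) = \sum_{t \in Q} \bigl(1 + \log(tf)\bigr) \cdot \log\frac{N}{df_t}</tex-math>
        </disp-formula>
        where <inline-formula><tex-math>tf</tex-math></inline-formula> is the term frequency of term t in document d;
        <inline-formula><tex-math>df_t</tex-math></inline-formula> is the document frequency of term t, that is, the number of
        documents in the collection in which t occurs; and
        <inline-formula><tex-math>idf_t = \log\frac{N}{df_t}</tex-math></inline-formula> is the inverse document frequency of term t
        in a collection of N documents.
      </p>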
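      <p>As a concrete illustration of Equation 1, the following minimal Python sketch scores tokenised documents against a query. The function name tf_idf_score, the use of the natural logarithm, and the three-document toy collection are assumptions made for this example; it is not the Terrier implementation.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def tf_idf_score(query_terms, doc_terms, doc_freq, num_docs):
    """Score one tokenised document against a query, following Equation 1."""
    tf = Counter(doc_terms)
    score = 0.0
    for t in query_terms:
        # A term missing from the document or the collection contributes nothing.
        if tf[t] == 0 or doc_freq.get(t, 0) == 0:
            continue
        score += (1.0 + math.log(tf[t])) * math.log(num_docs / doc_freq[t])
    return score


# Toy collection of three tokenised documents (illustrative only).
docs = [
    ["appeal", "court", "appeal", "contract"],
    ["contract", "breach", "damages"],
    ["court", "bail", "custody"],
]
# Document frequency: number of documents containing each term.
doc_freq = Counter(t for d in docs for t in set(d))
query = ["appeal", "contract"]
scores = [tf_idf_score(query, d, doc_freq, len(docs)) for d in docs]
```
]]></preformat>
      <p>In this toy example, the first document ranks highest because it contains both query terms, and the third document scores zero because it contains neither.</p>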
    </sec>
    <sec id="sec-4">
      <title>DPH Term Weighting Model</title>
      <p>
        Our baseline system used the parameter-free DPH term
weighting model from the Divergence from Randomness (DFR)
framework [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The DPH term weighting model calculates the score of a
document d for a given query Q as follows:
      </p>
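      <p>
        <disp-formula>
          <label>(2)</label>
          <tex-math>\mathrm{score}_{\mathrm{DPH}}(d, Q) = \sum_{t \in Q} qtf \cdot norm \cdot \left( tf \cdot \log\left( \frac{tf \cdot avg\_l}{l} \cdot \frac{N}{tfc} \right) + 0.5 \cdot \log\bigl( 2 \pi \cdot tf \cdot (1 - t_{MLE}) \bigr) \right)</tex-math>
        </disp-formula>
        where qtf, tf and tfc are the frequencies of the term t in the query
        Q, in the document d and in the collection C, respectively. N is the
        number of documents in the collection C, avg_l is the average length of
        documents in the collection C, and l is the length of the document
        d. Finally, <inline-formula><tex-math>t_{MLE} = \frac{tf}{l}</tex-math></inline-formula> and
        <inline-formula><tex-math>norm = \frac{(1 - t_{MLE})^2}{tf + 1}</tex-math></inline-formula>.
      </p>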
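      <p>Equation 2 can be sketched in a few lines of Python. The function name dph_score, the dictionary-based inputs, and the example statistics are hypothetical. The base-2 logarithm is also an assumption, since Equation 2 does not state the base. This is a sketch for illustration, not the Terrier code.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def dph_score(query_tf, doc_tf, coll_tf, doc_len, avg_doc_len, num_docs):
    """DPH score of one document (Equation 2).

    query_tf, doc_tf and coll_tf map each term to its frequency in the
    query, the document and the collection, respectively.
    """
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        tfc = coll_tf.get(term, 0)
        if tf == 0 or tfc == 0:
            continue  # the term does not match this document
        t_mle = tf / doc_len
        if t_mle >= 1.0:
            continue  # the second logarithm is undefined when 1 - t_mle is 0
        norm = (1.0 - t_mle) ** 2 / (tf + 1.0)
        score += qtf * norm * (
            tf * math.log2((tf * avg_doc_len / doc_len) * (num_docs / tfc))
            + 0.5 * math.log2(2.0 * math.pi * tf * (1.0 - t_mle))
        )
    return score


# Hypothetical statistics: one query term occurring 3 times in a 100-token
# document, 30 times in a collection of 2000 documents of average length 120.
example = dph_score({"bail": 1}, {"bail": 3, "court": 2}, {"bail": 30},
                    doc_len=100, avg_doc_len=120.0, num_docs=2000)
```
]]></preformat>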
    </sec>
    <sec id="sec-6">
      <title>Bose-Einstein 1 (Bo1) model for Query Expansion</title>
      <p>
        In our experimental investigation and evaluation, we used the
Terrier4.0 Divergence from Randomness (DFR) Bose-Einstein 1 (Bo1) model
to select the most informative terms from the topmost documents
after a first pass document ranking. The DFR Bo1 model calculates
the information content of a term t in the top-ranked documents
as follows [
        <xref ref-type="bibr" rid="ref1">1</xref>
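        ]:
        <disp-formula>
          <label>(3)</label>
          <tex-math>w(t) = tf_x \cdot \log_2 \frac{1 + P_n(t)}{P_n(t)} + \log_2\bigl(1 + P_n(t)\bigr)</tex-math>
        </disp-formula>
        <disp-formula>
          <label>(4)</label>
          <tex-math>P_n(t) = \frac{tfc}{N}</tex-math>
        </disp-formula>
        where <inline-formula><tex-math>P_n(t)</tex-math></inline-formula> is the probability of t in the whole collection,
        <inline-formula><tex-math>tf_x</tex-math></inline-formula> is the frequency of the query term in the top x ranked documents,
        tfc is the frequency of the term t in the collection, and N is the
        number of documents in the collection.
      </p>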
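      <p>The Bo1 weight of Equations 3 and 4 can be sketched as follows. The function name bo1_weight and the example frequencies are assumptions made for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def bo1_weight(tf_x, tf_c, num_docs):
    """Bo1 information content of a term (Equations 3 and 4).

    tf_x: frequency of the term in the top-ranked documents
    tf_c: frequency of the term in the whole collection
    num_docs: number of documents in the collection
    """
    p_n = tf_c / num_docs  # Equation 4
    return tf_x * math.log2((1.0 + p_n) / p_n) + math.log2(1.0 + p_n)


# Hypothetical frequencies: both terms occur 5 times in the feedback
# documents, but one is rare in the collection and the other is common.
rare = bo1_weight(tf_x=5, tf_c=20, num_docs=2000)
common = bo1_weight(tf_x=5, tf_c=4000, num_docs=2000)
```
]]></preformat>
      <p>A term that is frequent in the top-ranked documents but rare in the collection receives the larger weight, which is why such terms are preferred as expansion terms.</p>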
    </sec>
    <sec id="sec-7">
      <title>EXPERIMENTAL SETUP</title>
    </sec>
    <sec id="sec-8">
      <title>Document Collection</title>
      <p>
        In this work, we use the document collection provided by the
organizers of the Information Retrieval from Legal Documents track. It
comprises 200 documents representing current cases and 2000
documents representing prior cases [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. For each current case, the
objective is to retrieve a ranked list of relevant prior cases, together
with a score for each prior case, such that the most relevant cases
appear at the top of the list and the least relevant at the bottom.
      </p>
    </sec>
    <sec id="sec-10">
      <title>Precedence Retrieval Experimental Platform</title>
      <p>For all our experimental evaluation, we used Terrier-4.2, an open
source Information Retrieval (IR) platform. Documents were
pre-processed before indexing: the text was tokenised, each token was stemmed
using the full Porter stemming algorithm, and stopwords were removed
using the Terrier stopword list.</p>
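      <p>The pre-processing steps can be illustrated with a small Python sketch. The stopword list below is a tiny hypothetical sample and not the Terrier list. The stemmer is left as a pluggable function that defaults to the identity, because a full Porter stemmer is beyond the scope of a short example. The sketch also removes stopwords before stemming, which is an assumption about ordering.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Tiny illustrative stopword list; the Terrier list is much longer.
STOPWORDS = {"the", "of", "and", "in", "to", "a", "is", "was", "by", "for"}


def identity_stem(token):
    """Placeholder stemmer; swap in a Porter stemmer for real use."""
    return token


def preprocess(text, stem=identity_stem, stopwords=STOPWORDS):
    """Lowercase and tokenise the text, drop stopwords, then stem each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(tok) for tok in tokens if tok not in stopwords]


tokens = preprocess("The appellant was convicted by the High Court of Botswana.")
# tokens == ['appellant', 'convicted', 'high', 'court', 'botswana']
```
]]></preformat>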
    </sec>
    <sec id="sec-11">
      <title>METHODOLOGY query formulation</title>
      <sec id="sec-11-1">
        <title>Query Generation for the Different Runs</title>
        <p>For all the runs in this task, we used the Terrier-4.2 IR platform
to index the 100 case documents provided in Task 1, which have
corresponding catchphrases. During indexing, each case document
was first tokenised and stopwords were removed using the Terrier
stopword list. Each token was then stemmed using the full Porter
stemming algorithm.</p>
        <p>For each current case provided in Task 2, we used the TF-IDF term
weighting model in Terrier-4.2 to score and rank the indexed case
documents. Each current case document was first pre-processed using the
same pre-processing steps undertaken during indexing. After
retrieving the top 40 case documents, we formulated queries for each
current case using the gold standard catchphrases that appear both in
these ranked case documents and in the current case
document used for retrieval.</p>
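        <p>The query formulation step can be sketched as follows. The function name formulate_query, the document identifiers, and the example catchphrases are hypothetical. Matching a catchphrase by a plain substring test is a simplifying assumption made for this sketch.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def formulate_query(current_case_text, ranked_training_docs, catchphrases, top_k=40):
    """Collect the gold standard catchphrases of the top_k ranked training
    documents that also occur in the text of the current case."""
    text = current_case_text.lower()
    query = []
    for doc_id in ranked_training_docs[:top_k]:
        for phrase in catchphrases.get(doc_id, []):
            phrase = phrase.lower()
            # Keep a phrase only if the current case contains it, and only once.
            if phrase in text and phrase not in query:
                query.append(phrase)
    return query


# Hypothetical TF-IDF ranking of training documents for one current case.
ranked = ["t7", "t2", "t9"]
gold = {
    "t7": ["breach of contract", "damages"],
    "t2": ["bail", "damages"],
    "t9": ["custody"],
}
case_text = "The plaintiff claims damages for breach of contract."
query = formulate_query(case_text, ranked, gold, top_k=2)
# query == ['breach of contract', 'damages']
```
]]></preformat>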
      </sec>
    </sec>
    <sec id="sec-12">
      <title>UB_Botswana_Legal_Task2_R1</title>
      <p>Using the formulated queries, we deployed the parameter-free DPH
Divergence from Randomness term weighting model in the
Terrier-4.2 IR platform as our baseline system to score and rank the prior
cases.</p>
    </sec>
    <sec id="sec-13">
      <title>UB_Botswana_Legal_Task2_R2</title>
      <p>
        We used UB_Botswana_Legal_Task2_R1 as the baseline system. In
addition, we deployed the Sequential Dependence (SD) variant of
the Markov Random Fields for term dependence. Sequential
Dependence only assumes a dependence between neighbouring query
terms [
        <xref ref-type="bibr" rid="ref4 ref6">4, 6</xref>
        ]. In this work, we used the default window size of 2
provided in Terrier-4.2.
      </p>
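      <p>To make the sequential dependence assumption concrete, the sketch below forms pairs of neighbouring query terms and counts how often each pair occurs, in order, within a window of tokens in a document. The function names are hypothetical, and the ordered-window counting is a simplification for illustration. It does not reproduce the proximity scoring implemented in Terrier.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def adjacent_pairs(query_terms):
    """Pairs of neighbouring query terms, the only dependencies SD assumes."""
    return list(zip(query_terms, query_terms[1:]))


def count_in_window(doc_terms, pair, window=2):
    """Count positions where the first term of the pair is followed by the
    second term within `window` consecutive tokens."""
    first, second = pair
    count = 0
    for i, tok in enumerate(doc_terms):
        if tok == first and second in doc_terms[i + 1:i + window]:
            count += 1
    return count


pairs = adjacent_pairs(["breach", "of", "contract"])
doc = ["breach", "of", "contract", "and", "breach", "of", "trust"]
counts = {pair: count_in_window(doc, pair) for pair in pairs}
# counts == {('breach', 'of'): 2, ('of', 'contract'): 1}
```
]]></preformat>
      <p>With a window size of 2, only directly adjacent occurrences are counted, which suits catchphrases that are mostly bi-grams and tri-grams.</p>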
    </sec>
    <sec id="sec-14">
      <title>UB_Botswana_Legal_Task2_R3</title>
      <p>
        We used UB_Botswana_Legal_Task2_R1 as the baseline system. In
addition, we deployed a simple pseudo-relevance feedback on the
local collection. We used the Bo1 model for query expansion to
select the 10 most informative terms from the top 3 ranked
documents after the first pass retrieval (on the local collection) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We
then performed a second-pass retrieval on this local collection with
the new expanded query.
      </p>
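      <p>The expansion step can be sketched as follows: terms from the top-ranked documents are weighted with Bo1, and the highest-weighted terms are appended to the original query. The function name expand_query and the toy statistics are assumptions. The sketch omits the re-weighting of query terms that a full implementation would apply, and it assumes every feedback term has a collection frequency.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def expand_query(query_terms, ranked_docs, coll_tf, num_docs, top_docs=3, num_terms=10):
    """Append the num_terms highest Bo1-weighted terms found in the
    top_docs first-pass documents to the original query."""
    tf_x = Counter(t for doc in ranked_docs[:top_docs] for t in doc)
    weights = {}
    for term, freq in tf_x.items():
        p_n = coll_tf[term] / num_docs
        weights[term] = freq * math.log2((1.0 + p_n) / p_n) + math.log2(1.0 + p_n)
    # Sort by descending weight; ties are broken alphabetically.
    best = sorted(weights, key=lambda t: (-weights[t], t))[:num_terms]
    return list(query_terms) + [t for t in best if t not in query_terms]


# Hypothetical first-pass ranking (tokenised) and collection frequencies.
ranked_docs = [
    ["bail", "custody", "court"],
    ["bail", "remand", "court"],
    ["bail", "court", "surety"],
    ["tax", "court"],
]
coll_tf = {"bail": 40, "custody": 15, "court": 1500, "remand": 10, "surety": 5, "tax": 60}
expanded = expand_query(["bail"], ranked_docs, coll_tf, num_docs=2000, num_terms=3)
# expanded == ['bail', 'surety', 'remand']
```
]]></preformat>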
    </sec>
    <sec id="sec-15">
      <title>RESULTS AND DISCUSSION</title>
      <p>This work set out to investigate the importance of legal catchphrases
in precedence retrieval. The results of our submissions, shown in Table 1,
were evaluated by the organizing committee of this task. Since
most of the catchphrases were bi-grams and tri-grams, our
use of the sequential dependence variant of the Markov
Random Fields for term dependence led to improvements in retrieval
performance in terms of Mean Average Precision and Precision@10.
Our attempt to improve retrieval performance using query
expansion resulted in a degradation in retrieval performance. We
suspect this might have been due to query drift.</p>
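      <p>For reference, the evaluation measures named above can be sketched as follows. The function names and the toy ranking are assumptions for illustration. The official scores were computed by the task organizers, not with this code.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k retrieved prior cases that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant case occurs."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """Inverse of the rank of the first relevant case, or 0 if none is found."""
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


# Toy ranking for one current case; averaging over all current cases
# gives Mean Average Precision and Mean Reciprocal Rank.
ranked = ["p3", "p7", "p1", "p9"]
relevant = {"p7", "p9"}
ap = average_precision(ranked, relevant)   # (1/2 + 2/4) / 2 = 0.5
rr = reciprocal_rank(ranked, relevant)     # 1/2 = 0.5
p2 = precision_at_k(ranked, relevant, k=2) # 1/2 = 0.5
```
]]></preformat>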
      <p>Table 1. Columns: Run ID; Mean Average Precision; Mean Reciprocal Rank; Precision@10; Recall@100. Values: 0.3478; 0.3506.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Amati</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>Probabilistic Models for Information Retrieval based on Divergence from Randomness</article-title>
          . University of Glasgow, UK,
          <source>PhD Thesis</source>
          (
          <year>June 2003</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>198</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Amati</surname>
          </string-name>
          , E. Ambrosi,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gaibisso</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Gambosi</surname>
          </string-name>
          .
          <year>2007</year>
          . FUB, IASICNR and University of Tor Vergata at
          <article-title>TREC 2007 Blog Track</article-title>
          .
          <source>In Proceedings of the 16th Text REtrieval Conference</source>
          (TREC-
          <year>2007</year>
          ).
          <article-title>Text REtrieval Conference (TREC), Gaithersburg</article-title>
          , Md., USA.,
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Arpan</given-names>
            <surname>Mandal</surname>
          </string-name>
          , Kripabandhu Ghosh, Arnab Bhattacharya, Arindam Pal, and
          <string-name>
            <given-names>Saptarshi</given-names>
            <surname>Ghosh</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Overview of the FIRE 2017 track: Information Retrieval from Legal Documents (IRLeD)</article-title>
          .
          <source>In Working notes of FIRE</source>
          <year>2017</year>
          <article-title>- Forum for Information Retrieval Evaluation (CEUR Workshop Proceedings</article-title>
          ).
          <source>CEUR-WS.org.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Donald</given-names>
            <surname>Metzler</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>A Markov Random Field Model for Term Dependencies</article-title>
          .
          <source>In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '05)</source>
          . ACM, New York, NY, USA,
          <fpage>472</fpage>
          -
          <lpage>479</lpage>
          . https://doi.org/10.1145/1076034.1076115
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Juan</given-names>
            <surname>Ramos</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Using TF-IDF to Determine Word Relevance in Document Queries</article-title>
          . (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Edwin</given-names>
            <surname>Thuma</surname>
          </string-name>
          , Nkwebi Peace Motlogelwa, and
          <string-name>
            <given-names>Tebo</given-names>
            <surname>Leburu-Dingalo</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>UBBotswana Participation to CLEF eHealth IR Challenge 2017: Task 3 (IRTask1 : Adhoc Search)</article-title>
          .
          <source>In Working Notes of CLEF 2017 - Conference and Labs of the Evaluation Forum</source>
          , Dublin, Ireland,
          <source>September 11-14</source>
          ,
          <year>2017</year>
          . http://ceur-ws.org/Vol-1866/paper_73.pdf
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>