<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>HLJIT2017@IRLeD-FIRE2017: Information Retrieval From Legal Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Liuyang Tian</string-name>
          <email>tianliuyang2016@outlook.com</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hui Ning</string-name>
          <email>ninghui@hrbeu.edu.cn</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhongyuan Han</string-name>
          <email>Hanzhongyuan@gmail.com</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruiming Xiao</string-name>
          <email>xiaoruiming11@outlook.com</email>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leilei Kong*</string-name>
          <email>kongleilei1979@gmail.com</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haoliang Qi</string-name>
          <email>haoliang.qi@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computer Science and Technology, Heilongjiang Institute of Technology</institution>
          ,
          <addr-line>Harbin</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>State Key Laboratory of Digital Publishing Technology</institution>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>School of Computer Science and Technology, Harbin Engineering University</institution>
          ,
          <addr-line>Harbin</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>School of Computer Science and Technology, Harbin University of Science and Technology</institution>
          ,
          <addr-line>Harbin</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>School of Computer Science and Technology, Heilongjiang Institute of Technology</institution>
          ,
          <addr-line>Harbin</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper details our approach to the Catchphrase Extraction and Precedence Retrieval tasks of the Information Retrieval from Legal Documents track (IRLeD) at the Forum for Information Retrieval Evaluation 2017 (FIRE 2017). For Catchphrase Extraction, classification-based and ranking-based methods were exploited, and various types of features were tried. For Precedence Retrieval, the language model, BM25 and the vector space model were employed. Comparisons with other submissions for the same tasks show the presented methods to be among the top performers.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval from Legal Documents</kwd>
        <kwd>Catchphrase Extraction</kwd>
        <kwd>Precedence Retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>2 Methods of Catchphrase Extraction</title>
      <p>(1)</p>
      <p>where <inline-formula><tex-math>x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})^T</tex-math></inline-formula>, <inline-formula><tex-math>i = 1, 2, \ldots, n</tex-math></inline-formula>, and <italic>y<sub>i</sub></italic> is the label denoting whether the word <italic>w<sub>i</sub></italic> is a catchphrase of <italic>d<sub>i</sub></italic>. Our goal is then to learn a model that decides, for a new <italic>d<sub>i</sub></italic>, whether a word is a catchphrase. Classification-based and ranking-based methods are each exploited to learn this model.</p>
    </sec>
    <sec id="sec-2">
      <title>2.1 Classification Method: Bagging</title>
      <p>Bagging is based on bootstrap sampling. First, M rounds of bootstrap sampling produce M sampling sets, each containing N training samples. Then a base learner is trained on each sampling set. Finally, the M base learners are combined; for multi-class classification, their predictions are combined by simple voting [2]. Decision tree and random forest [3] are adopted as our base classifiers, denoted Bagging(DTC) and Bagging(RFC) respectively.</p>
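      <p>The bootstrap-and-vote procedure can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the system used in the experiments: the function names and the toy one-feature base learner <monospace>train_stump</monospace> are assumptions standing in for the decision tree and random forest base classifiers.</p>
      <preformat>
```python
import random
from collections import Counter


def train_stump(sample):
    """Toy base learner (an assumption): threshold feature 0 at the sample mean."""
    threshold = sum(x[0] for x, _ in sample) / len(sample)
    overall = Counter(y for _, y in sample).most_common(1)[0][0]
    above = Counter(y for x, y in sample if x[0] > threshold)
    below = Counter(y for x, y in sample if not x[0] > threshold)
    label_above = above.most_common(1)[0][0] if above else overall
    label_below = below.most_common(1)[0][0] if below else overall
    return lambda x: label_above if x[0] > threshold else label_below


def bagging_fit(data, train_fn, rounds, seed=0):
    """Train one base learner per bootstrap sample (N draws with replacement)."""
    rng = random.Random(seed)
    models = []
    for _ in range(rounds):
        sample = [rng.choice(data) for _ in data]
        models.append(train_fn(sample))
    return models


def bagging_predict(models, x):
    """Combine the base learners by simple majority voting."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Each item is (feature_vector, label); label 1 marks a catchphrase word.
data = [((0.1,), 0), ((0.2,), 0), ((0.3,), 0),
        ((0.8,), 1), ((0.9,), 1), ((1.0,), 1)]
models = bagging_fit(data, train_stump, rounds=25)
prediction = bagging_predict(models, (0.95,))
```
      </preformat>
      <p>Any callable that maps a training sample to a predictor can be passed as <monospace>train_fn</monospace>, which is how a decision tree or random forest learner would be plugged in.</p>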
    </sec>
    <sec id="sec-3">
      <title>2.2 Ranking Method: RankSVM</title>
      <p>RankSVM is a pairwise learning-to-rank method that uses the SVM model to solve the ranking problem on document pairs. To rank the candidate catchphrases, we trained a ranking function on the training corpus using Ranking SVM<sup>2</sup>.</p>
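      <p>The pairwise idea can be illustrated by the transformation that turns labelled candidates into classification examples over feature differences. The sketch below is a simplified illustration under that assumption; the function name and the integer toy features are hypothetical, and the actual training in the experiments was done with the Ranking SVM tool, not with this code.</p>
      <preformat>
```python
def pairwise_examples(candidates):
    """Build pairwise training examples from one document's candidates.

    candidates: list of (feature_vector, label), where a larger label means
    the word should be ranked higher. For every pair with different labels,
    the difference vector gets target +1 and its negation gets target -1.
    """
    pairs = []
    for xi, yi in candidates:
        for xj, yj in candidates:
            if yi > yj:
                diff = [a - b for a, b in zip(xi, xj)]
                pairs.append((diff, 1))
                pairs.append(([-v for v in diff], -1))
    return pairs


# One catchphrase word (label 1) and two non-catchphrase words (label 0).
candidates = [([3, 2], 1), ([1, 1], 0), ([2, 0], 0)]
pairs = pairwise_examples(candidates)
```
      </preformat>
      <p>A linear classifier trained on such difference vectors yields a weight vector whose dot product with a candidate's features serves as its ranking score.</p>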
    </sec>
    <sec id="sec-4">
      <title>2.3 Features</title>
      <p>We construct the features from five aspects: statistical features, position features, syntactic features, mutual information and prior probabilities.</p>
      <p><bold>Statistical Features</bold> We use the term frequency (TF), inverse document frequency (IDF), TF*IDF and BM25 score of a word in document <italic>d<sub>i</sub></italic> as the statistical features.</p>
      <sec id="sec-4-1">
        <title>Position Features</title>
        <p>1) The first occurrence of the word.</p>
        <disp-formula><label>(2)</label><tex-math>\mathrm{FirstOccur\_score}(w) = \frac{\sum_{i=1}^{num} \mathrm{FirstOccur}(w, d_i)}{num}</tex-math></disp-formula>
        <disp-formula><label>(3)</label><tex-math>\mathrm{FirstOccur}(w, d) = \frac{\mathrm{precede}(w, d)}{\mathrm{len}(d)}</tex-math></disp-formula>
        <p>where num is the number of legal documents, precede(w, d) is the number of words preceding the first occurrence of the word w in the legal document d, and len(d) is the number of words in d.</p>
        <p>2) Sentence-initial or sentence-final position.</p>
        <disp-formula><label>(4)</label><tex-math>\mathrm{InFirstLast\_score}(w) = \frac{\sum_{i=1}^{num} \mathrm{InFirstLast}(w, d_i)}{num}</tex-math></disp-formula>
        <disp-formula><label>(5)</label><tex-math>\mathrm{FirstOccur}(w, d) = \frac{\mathrm{precede}(w, d)}{\mathrm{len}(d)}</tex-math></disp-formula>
        <p>where num is the number of legal documents and countFL(w, d) is the number of occurrences of the word w at the beginning or the end of a sentence in the legal document d.</p>
        <p><bold>POS features</bold> We also use the part of speech of a word as a feature. For each sentence, we obtain the POS of each word using the Stanford POS Tagger [6]. We choose a subset of POS tags (i.e. NNS, NNPS, NNP, NN, VBZ, VBP, VBN, VBG, VBD, VB, TO, JJ, RB) as our features.</p>
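      <p>The first-occurrence feature can be sketched as follows: the relative position of a word's first occurrence in each document, averaged over the collection. The function names are hypothetical, documents are assumed to be token lists, and returning 1.0 for a document that does not contain the word is an assumption, since that case is not specified in the text.</p>
      <preformat>
```python
def first_occur(word, doc_tokens):
    """precede(w, d) / len(d): fraction of tokens before the first occurrence."""
    if word in doc_tokens:
        return doc_tokens.index(word) / len(doc_tokens)
    return 1.0  # assumption: an absent word is treated as occurring at the end


def first_occur_score(word, docs):
    """Average of first_occur over all num documents."""
    return sum(first_occur(word, d) for d in docs) / len(docs)


docs = [
    ["the", "court", "held"],
    ["appeal", "dismissed", "by", "court"],
]
score = first_occur_score("court", docs)
```
      </preformat>
      <p>Smaller values indicate words that tend to appear early in the documents.</p>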
        <p><bold>Mutual Information</bold> High-quality keywords should be semantically related. If the relevance between a word w and a keyword k is low, then w may not be suitable as a keyword [7]. The degree of correlation between words is measured by the average mutual information.</p>
        <disp-formula><label>(6)</label><tex-math>MI(w_1, w_2, \ldots, w_k) = \frac{1}{N} \sum_{i, j = 1, 2, \ldots, k;\ i \neq j} MI(w_i, w_j)</tex-math></disp-formula>
        <disp-formula><label>(7)</label><tex-math>MI(w_1, w_2) = \sum \sum p(w_1, w_2) \log \frac{p(w_1, w_2)}{p(w_1)\, p(w_2)}</tex-math></disp-formula>
        <p>where N is the number of word pairs (<italic>w<sub>i</sub></italic>, <italic>w<sub>j</sub></italic>), and MI(<italic>w<sub>i</sub></italic>, <italic>w<sub>j</sub></italic>) is the mutual information of <italic>w<sub>i</sub></italic> and <italic>w<sub>j</sub></italic>.</p>
        <p><sup>2</sup> http://www.cs.cornell.edu/People/tj/svm_light/svm_rank.html</p>
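        <p>As an illustration of the average mutual information, the sketch below estimates the probabilities from document-level co-occurrence: each word is treated as a binary event (present in a document or not), and the double sum runs over the four presence/absence combinations. That estimation scheme and the function names are assumptions; the text does not state how the probabilities were obtained.</p>
        <preformat>
```python
import math
from itertools import combinations


def mutual_information(w1, w2, docs):
    """MI between the binary events 'w1 occurs in a doc' and 'w2 occurs in a doc'."""
    n = len(docs)
    total = 0.0
    for a in (True, False):
        for b in (True, False):
            joint = sum(1 for d in docs if (w1 in d) == a and (w2 in d) == b) / n
            pa = sum(1 for d in docs if (w1 in d) == a) / n
            pb = sum(1 for d in docs if (w2 in d) == b) / n
            if joint > 0:
                total += joint * math.log(joint / (pa * pb))
    return total


def average_mi(words, docs):
    """Average pairwise MI over all unordered word pairs (MI is symmetric)."""
    pairs = list(combinations(words, 2))
    return sum(mutual_information(a, b, docs) for a, b in pairs) / len(pairs)


# Documents represented as sets of words.
docs = [{"a", "b"}, {"a", "b"}, {"c"}, {"c"}]
avg = average_mi(["a", "b", "c"], docs)
```
        </preformat>
        <p>Words that always appear together (or never together) in this toy collection receive the maximum value log 2, while independent words receive 0.</p>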
      </sec>
      <sec id="sec-4-2">
        <title>Prior Probabilities</title>
        <p>Using the gold standard, we build a prior keyword set P. P contains the keywords whose IDF is greater than two.</p>
        <disp-formula><label>(8)</label><tex-math>\mathrm{Prior\_score}(w, P) = \begin{cases} 1, \; w \in P \\ 0, \; w \notin P \end{cases}</tex-math></disp-formula>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>3 Methods of Precedence Retrieval</title>
      <p>The task of Precedence Retrieval is to retrieve the prior cases for a given current case. We view the current case as the query and the prior cases as the documents, and use three classical information retrieval models to solve the problem of precedence retrieval.</p>
    </sec>
    <sec id="sec-6">
      <title>3.1 Language Model method</title>
      <p>For the query model and the document model, we use the language model based on Dirichlet prior smoothing [8]. The relevance between query and document is computed as follows.</p>
      <disp-formula><label>(9)</label><tex-math>\mathrm{score}(q, d) = \log p(q \mid d) = \sum_{w_i \in q,\ w_i \in d} c(w_i, q) \log\left[1 + \frac{c(w_i, d)}{\mu\, p(w_i \mid C)}\right] + n \log \frac{\mu}{|d| + \mu}</tex-math></disp-formula>
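      <p>A minimal sketch of Dirichlet-smoothed query-likelihood scoring in its usual rank-equivalent form follows: a sum over query words that occur in the document, plus a document-length term. The function name, the token-list representation and the <monospace>collection_prob</monospace> dictionary (the collection language model p(w | C)) are assumptions made for illustration.</p>
      <preformat>
```python
import math
from collections import Counter


def dirichlet_score(query, doc, collection_prob, mu):
    """Rank-equivalent Dirichlet-smoothed query log-likelihood.

    query, doc: lists of tokens.
    collection_prob: dict mapping a word to p(w | C), assumed to be positive
    for every query word that occurs in the document.
    mu: Dirichlet prior parameter.
    """
    q_counts = Counter(query)
    d_counts = Counter(doc)
    score = 0.0
    for w, cq in q_counts.items():
        cd = d_counts.get(w, 0)
        if cd > 0:  # only words shared by query and document contribute
            score += cq * math.log(1.0 + cd / (mu * collection_prob[w]))
    # Length normalisation term: n * log(mu / (len(d) + mu)), n = query length.
    score += len(query) * math.log(mu / (len(doc) + mu))
    return score


query = ["court", "appeal", "court"]
doc = ["court", "held", "appeal", "dismissed", "court", "order"]
p_c = {"court": 0.05, "appeal": 0.02}
s = dirichlet_score(query, doc, p_c, mu=10000.0)
```
      </preformat>
      <p>This form differs from the full log-likelihood only by a term that depends on the query and the collection, so the ranking of documents for a given query is unchanged.</p>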
    </sec>
    <sec id="sec-7">
      <title>3.2 Probability Model</title>
      <p>The second retrieval model we chose is the BM25 model [9]. The relevance score is computed as follows.</p>
      <disp-formula><label>(10)</label><tex-math>\mathrm{BM25} = \sum_{w_i \in q} \log\left( \frac{tf(w_i, d) \cdot idf(w_i) \cdot (k_1 + 1)}{tf(w_i, d) + k_1 \left(1 - b + b \cdot \frac{len(d)}{avdl}\right)} \right)</tex-math></disp-formula>
      <p>where q is the query set, d is the candidate document, avdl is the average document length, and <italic>k<sub>1</sub></italic> and b are adjustment parameters.</p>
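      <p>For illustration, the following sketch scores a document with Okapi BM25 in its commonly used form, with term-frequency saturation controlled by k1 and length normalisation controlled by b. The function name, the token-list representation and the particular IDF smoothing are assumptions; the defaults k1 = 1.8 and b = 0.7 are the values reported in the experimental settings.</p>
      <preformat>
```python
import math
from collections import Counter


def bm25_score(query, doc, docs, k1=1.8, b=0.7):
    """Okapi BM25 score of doc for query over the collection docs."""
    avdl = sum(len(d) for d in docs) / len(docs)  # average document length
    tf = Counter(doc)
    n = len(docs)
    score = 0.0
    for w in set(query):  # the query is treated as a set of terms
        df = sum(1 for d in docs if w in d)
        if tf[w] == 0 or df == 0:
            continue
        # Assumption: a smoothed, always-positive IDF variant.
        idf = math.log(1.0 + (n - df + 0.5) / (df + 0.5))
        norm = tf[w] + k1 * (1.0 - b + b * len(doc) / avdl)
        score += idf * tf[w] * (k1 + 1.0) / norm
    return score


docs = [
    ["court", "appeal"],
    ["court", "order"],
    ["tax", "appeal", "appeal"],
]
s0 = bm25_score(["appeal"], docs[0], docs)
s2 = bm25_score(["appeal"], docs[2], docs)
```
      </preformat>
      <p>In this toy collection the third document mentions the query term twice and therefore scores higher than the first, despite being longer.</p>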
    </sec>
    <sec id="sec-8">
      <title>3.3 Vector Space Model</title>
      <p>We also used Lucene [10], which implements the vector space model, to estimate the relevance between query and document. The formula is as follows:</p>
      <disp-formula><label>(11)</label><tex-math>\mathrm{score}(q, d) = \mathrm{coord}(q, d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \text{ in } q} \left( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^2 \cdot t.\mathrm{getBoost}() \cdot \mathrm{norm}(t, d) \right)</tex-math></disp-formula>
      <p>Here, t is a term, i.e. a word together with the field in which it occurs; coord(q, d) is a factor that gives a higher score to documents containing more of the query terms; tf(t in d) is the frequency of t in document d; idf(t) is the inverse document frequency of t; and norm(t, d) is the normalization factor.</p>
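      <p>The scoring formula can be mimicked outside Lucene with a short sketch. The component definitions used here (tf as the square root of the frequency, idf as 1 + ln(numDocs / (docFreq + 1)), coord as the fraction of query terms matched, queryNorm as the inverse square root of the summed squared weights, and norm as the inverse square root of the document length) follow Lucene's classic default similarity as commonly documented; they are assumptions about the configuration, and the function name is hypothetical. A single field and a uniform boost are assumed, and the query must be non-empty.</p>
      <preformat>
```python
import math
from collections import Counter


def classic_score(query_terms, doc, docs, boost=1.0):
    """Sketch of a Lucene-style classic TF-IDF score for one field."""
    num_docs = len(docs)

    def idf(t):
        df = sum(1 for d in docs if t in d)
        return 1.0 + math.log(num_docs / (df + 1.0))

    terms = list(dict.fromkeys(query_terms))  # unique terms, order preserved
    sum_sq = sum((idf(t) * boost) ** 2 for t in terms)
    query_norm = 1.0 / math.sqrt(sum_sq)
    freq = Counter(doc)
    overlap = sum(1 for t in terms if freq[t] > 0)
    coord = overlap / len(terms)
    length_norm = 1.0 / math.sqrt(len(doc))
    total = sum(
        math.sqrt(freq[t]) * idf(t) ** 2 * boost * length_norm
        for t in terms
        if freq[t] > 0
    )
    return coord * query_norm * total


docs = [
    ["court", "appeal", "order"],
    ["court", "tax", "order"],
    ["land", "tax", "order"],
]
query = ["court", "appeal"]
scores = [classic_score(query, d, docs) for d in docs]
```
      </preformat>
      <p>The first document matches both query terms and scores highest, the second matches one term, and the third matches none and scores zero.</p>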
    </sec>
    <sec id="sec-9">
      <title>4 Experimental Results</title>
    </sec>
    <sec id="sec-10">
      <title>4.1 Results of Catchphrase Extraction</title>
      <sec id="sec-10-1">
        <title>4.1.1 Dataset</title>
        <p>In this task, a set of legal documents (Indian Supreme Court decisions) is provided. For a few of these documents (the training set), the catchphrases (gold standard) are provided. Catchphrases are short phrases taken from the text of the document. They were obtained from the well-known legal search system Manupatra (www.manupatra.co.in), which employs legal experts to annotate case documents with catchphrases. The rest of the documents are used as the test set. The Catchphrase Extraction dataset contains 100 training examples and 300 testing examples.</p>
      </sec>
      <sec id="sec-10-2">
        <title>4.1.2 Experimental Settings</title>
        <p>First, we build a candidate catchphrase set. Statistics on the training corpus of catchphrase extraction show that nouns account for 68.85%, verbs for 9.13%, adjectives for 12.49%, prepositions for 5.90%, adverbs for 1.43% and other types for 2.20%. Based on this distribution, we choose nouns, verbs, adjectives and adverbs.</p>
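        <p>Candidate selection by part of speech can be sketched as a simple filter over tagged tokens. The sketch assumes Penn Treebank tags (as produced by the Stanford POS Tagger) supplied as (word, tag) pairs; the function name and the prefix-based matching are illustrative assumptions.</p>
        <preformat>
```python
# Penn Treebank tag prefixes for nouns, verbs, adjectives and adverbs.
KEPT_PREFIXES = ("NN", "VB", "JJ", "RB")


def candidate_words(tagged_tokens):
    """Keep only the words whose POS tag marks a noun, verb, adjective or adverb."""
    return [word for word, tag in tagged_tokens if tag.startswith(KEPT_PREFIXES)]


tagged = [
    ("The", "DT"), ("Court", "NNP"), ("dismissed", "VBD"),
    ("the", "DT"), ("appeal", "NN"), ("of", "IN"), ("quickly", "RB"),
]
candidates = candidate_words(tagged)
```
        </preformat>
        <p>Determiners and prepositions are dropped, leaving the content words as candidates.</p>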
        <p>On the candidate set, we use different combinations of features for the classification and ranking methods. The bagging models use TF*IDF, BM25, position features, POS, mutual information and prior probabilities as the feature set. RankSVM uses TF*IDF, POS and prior probabilities as the feature set.</p>
        <p>The parameter settings of each method are detailed in Table 1.</p>
      </sec>
      <sec id="sec-10-3">
        <title>4.1.3 Results</title>
        <p>We submitted three groups of results: Bagging(DTC) (denoted HLJIT2017_IRLeD_Task1_1), Bagging(RFC) (denoted HLJIT2017_IRLeD_Task1_2) and RankSVM (denoted HLJIT2017_IRLeD_Task1_3). The experimental results are shown in Table 2.</p>
        <p>Note that our three submitted results are close on the MAP metric but differ on the other evaluation metrics. We surmise that this is mainly due to the order of the submitted catchphrases: the results of the two classification methods are sorted alphabetically, while the results of the ranking method are submitted in descending order of score.</p>
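        <p>Rank-sensitive metrics depend on the order in which catchphrases are submitted. The sketch below computes average precision, the per-document quantity that MAP averages, for the same set of words in two different orders. The function name and the toy data are hypothetical.</p>
        <preformat>
```python
def average_precision(ranked, relevant):
    """Average precision of a ranked list against a set of relevant items."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / len(relevant) if relevant else 0.0


gold = {"appeal", "bail"}
ap_relevant_first = average_precision(["bail", "appeal", "court"], gold)
ap_relevant_last = average_precision(["court", "appeal", "bail"], gold)
```
        </preformat>
        <p>The two lists contain the same words, yet the one that places the gold catchphrases first obtains the higher average precision.</p>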
        <p>In addition, we select only single words, not phrases, as catchphrases. We tried some rule-based methods to construct phrases from the words extracted from the legal documents, but they did not improve performance. This may explain the low scores on MRP, MP@10, MR@100 and Overall Recall.</p>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>4.2 Results of Precedence Retrieval</title>
      <sec id="sec-11-1">
        <title>4.2.1 Dataset</title>
        <p>Task 2 provides two data sets: 200 current cases (Query_docs), formed by removing the links to the prior cases they cite, and 2000 prior cases, consisting of the cases cited by those in Query_docs along with some random cases (not among Query_docs).</p>
      </sec>
      <sec id="sec-11-2">
        <title>4.2.2 Experimental Settings</title>
        <p>We apply the same pre-processing for query extraction and index building. To discard some of the noise in the documents, we filter each document by part of speech, keeping only nouns, verbs and adjectives; Porter stemming, lowercasing and stop-word removal are also applied. For the language model, we set the parameters mu = 10000 and lambda = 0.5, and for BM25 we set k1 = 1.8 and b = 0.7.</p>
      </sec>
      <sec id="sec-11-3">
        <title>4.2.3 Results</title>
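        <p>In Task 2, we submitted three groups of results: Language Model (HLJIT2017_IRLeD_Task2_1), Vector Space Model (HLJIT2017_IRLeD_Task2_2) and BM25 (HLJIT2017_IRLeD_Task2_3). The results are shown in Table 3.</p>
        <table-wrap>
          <label>Table 3</label>
          <table>
            <tbody>
              <tr><td>MAP</td><td>0.1784</td></tr>
              <tr><td>Mean reciprocal rank</td><td>0.4074</td></tr>
              <tr><td>Precision@10</td><td>0.2180</td><td>0.1290</td><td>0.1665</td></tr>
              <tr><td>Recall@100</td><td>0.6810</td><td>0.5950</td><td>0.6710</td></tr>
            </tbody>
          </table>
        </table-wrap>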
        <p>The experimental results show that the language model performs much better than the other two models. The MAP of the language model is 0.3291, that of the vector space model is 0.2479, and that of the probability model is the lowest at 0.1784.</p>
        <p>Some methods that can improve the performance of the language model, such as query expansion and document expansion, have not yet been applied in this evaluation and are left for future work. In addition, the document model was smoothed using only the given set of legal documents; such a small collection leads to sparse word statistics, and the parameters of the document and query models were not tuned to their optimum. We believe that addressing these points will improve the performance of the retrieval model, and we will try them in future research.</p>
        <p>We have described our approach to the Catchphrase Extraction and Precedence Retrieval problems of the FIRE 2017 IRLeD track.</p>
        <p>For the task of Catchphrase Extraction, we tried classification-based and ranking-based methods, and integrated various types of features into our models. The experiments show that the ranking-based model achieved better performance.</p>
        <p>For the task of Precedence Retrieval, we have only tried several basic retrieval models: the language model, the probability model (BM25) and the vector space model. The experiments show that the language model outperforms the vector space model and BM25.</p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>Acknowledgments</title>
      <p>This work is supported by the National Natural Science
Foundation of China (No.61772177) and the Special subject
of State Key Laboratory of Digital Publishing Technology
(The research on Plagiarism Detection-From Heuristic to
Machine Learning).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mandal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bhattacharya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pal</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          .
          <article-title>Overview of the FIRE 2017 track: Information Retrieval from Legal Documents (IRLeD)</article-title>
          .
          <source>In Working notes of FIRE 2017 - Forum for Information Retrieval Evaluation</source>
          , Bangalore, India, December 8-10,
          <year>2017</year>
          , CEUR Workshop Proceedings. CEUR-WS.org,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Du</surname>
            <given-names>P</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xia</surname>
            <given-names>J</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            <given-names>W</given-names>
          </string-name>
          , et al.
          <article-title>Multiple classifier system for remote sensing image classification: A review[J]</article-title>
          .
          <source>Sensors</source>
          ,
          <year>2012</year>
          ,
          <volume>12</volume>
          (
          <issue>4</issue>
          ):
          <fpage>4764</fpage>
          -
          <lpage>4792</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Liaw</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiener</surname>
            <given-names>M</given-names>
          </string-name>
          .
          <article-title>Classification and regression by randomForest</article-title>
          [J].
          <source>R news</source>
          ,
          <year>2002</year>
          ,
          <volume>2</volume>
          (
          <issue>3</issue>
          ):
          <fpage>18</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Joachims</surname>
            <given-names>T.</given-names>
          </string-name>
          <source>Learning to classify text using support vector machines: Methods, theory and algorithms</source>
          [M]. Kluwer Academic Publishers,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Hulth</surname>
            <given-names>A.</given-names>
          </string-name>
          <article-title>Improved automatic keyword extraction given more linguistic knowledge</article-title>
          [C].
          <source>Proceedings of the 2003 conference on Empirical methods in natural language processing</source>
          . Association for Computational Linguistics,
          <year>2003</year>
          :
          <fpage>216</fpage>
          -
          <lpage>223</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Toutanova</surname>
            <given-names>K</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            <given-names>D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            <given-names>C D</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Singer</surname>
            <given-names>Y</given-names>
          </string-name>
          ,
          <year>2003</year>
          .
          <article-title>Feature-rich part-of-speech tagging with a cyclic dependency network</article-title>
          .
          <source>In Proc. the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology</source>
          , p.
          <fpage>173</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Turney</surname>
            <given-names>P D</given-names>
          </string-name>
          .
          <article-title>Coherent keyphrase extraction via web mining</article-title>
          [J].
          <source>arXiv preprint cs/0308033</source>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Song</surname>
            <given-names>F</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Croft</surname>
            <given-names>W B</given-names>
          </string-name>
          .
          <article-title>A general language model for information retrieval</article-title>
          [C].
          <source>Proceedings of the eighth international conference on Information and knowledge management</source>
          . ACM,
          <year>1999</year>
          :
          <fpage>316</fpage>
          -
          <lpage>321</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Robertson</surname>
            <given-names>S</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaragoza</surname>
            <given-names>H.</given-names>
          </string-name>
          <article-title>The probabilistic relevance framework: BM25 and beyond</article-title>
          [J].
          <source>Foundations and Trends in Information Retrieval</source>
          ,
          <year>2009</year>
          ,
          <volume>3</volume>
          (
          <issue>4</issue>
          ):
          <fpage>333</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>McCandless</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hatcher</surname>
            <given-names>E</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gospodnetic</surname>
            <given-names>O</given-names>
          </string-name>
          .
          <source>Lucene in Action: Covers Apache Lucene 3.0</source>
          [M]. Manning Publications Co.,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>