<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SBS 2016 Track mining: Classification with linguistic features for book search requests classification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohamed Ettaleb</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chiraz Latiri</string-name>
          <email>chiraz.latiri@gnet.tn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brahim Douar</string-name>
          <email>b.douar@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrice Bellot</string-name>
          <email>patrice.bellot@univ-amu.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aix-Marseille Université</institution>
          ,
          <addr-line>CNRS, LSIS UMR 7296, 13397, Marseille</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tunis EL Manar University, Faculty of Sciences of Tunis, LIPAH research Laboratory</institution>
          ,
          <addr-line>Campus Universitaire Farhat Hached, Tunis</addr-line>
          ,
          <country country="TN">Tunisia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe text mining approaches dedicated to the classification task of the Social Book Search Lab 2016 Mining Track. This track aims to exploit social knowledge extracted from LibraryThing and Reddit collections to identify which threads on online forums are book search requests. Our proposed classification model is based on a combination of different textual features, namely: (i) basic linguistic features such as nouns and verbs; and (ii) composed features such as term sequences and noun phrases. We then apply a NaiveBayes classifier to identify the user's intentions in the requests.</p>
      </abstract>
      <kwd-group>
        <kwd>classification</kwd>
        <kwd>noun phrases extraction</kwd>
        <kwd>sequences mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The Social Book Search (SBS) Lab investigates book search where the users'
information needs are complex, looking for more than objective metadata. In this
respect, the SBS Lab aims to research and develop techniques to support
users in complex book search tasks. It consists of three tracks:
1. Interactive Track: a user-oriented interactive task investigating systems that
support users in each of the multiple stages of a complex search task. The
track offers participants a complete experimental interactive IR setup and
a new multistage search interface to investigate how users move
through search stages.
2. Suggestion Track: a system-oriented task for systems to suggest books based
on rich search requests combining several topical and contextual relevance
signals, as well as user profiles and real-world relevance judgements.
3. Mining Track: an NLP/Text Mining track focusing on detecting and linking
book titles in online book discussion forums, as well as detecting book search
requests in forum posts for automatic book recommendation.</p>
      <p>In this paper, we only consider the mining track, which is new in the SBS
2016 edition and investigates two tasks: (i) Classification task: how
Information Retrieval systems can automatically identify book search requests in online
forums; and (ii) Linking task: how to detect and link books mentioned in online
book discussions.</p>
      <p>Our contribution deals only with the classification task. The final objective of
this task is to identify which threads on online forums are book search requests.
Thereby, given a forum thread with one or more posts, the system should
determine whether the opening post contains a request for book suggestions (i.e.,
binary classification of opening posts).</p>
      <p>In this respect, we propose to use two types of approaches: an
approach based on textual sequence mining, and an NLP method which relies
on the extraction of nouns, verbs and noun phrases (i.e., compound nouns), to improve
classification efficiency. Then, we use the NaiveBayes classifier with Weka
to identify the user's intentions in the requests.</p>
      <p>The remainder of this paper is organized as follows: Section 2 describes the
mining track and the test data. Section 3 recalls the basic definitions for
textual sequence mining and details our proposed approaches for book search
request classification. Section 4 details our submitted runs for the
mining track as well as the official results. The conclusion is given in Section
5.</p>
    </sec>
    <sec id="sec-2">
      <title>SBS 2016 Mining Track</title>
      <p>The SBS 2016 Mining Track investigates how systems can automatically identify
book search requests in online forums and how to detect and link books
mentioned in online book discussions. Users often have information needs that
are difficult to express with a classical search engine; in this case they rely
on online forums to get recommendations from other users.</p>
      <sec id="sec-2-1">
        <title>SBS requests classification task</title>
        <p>The classification task identifies which threads on online forums are book search
requests. That is, given a forum thread with one or more posts, the system should
determine whether the opening post contains a request for book suggestions.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Description of Data collections</title>
        <p>The SBS 2016 test collections contain:
1. A collection of 2,780,300 book records from Amazon, extended with social
metadata from LibraryThing. This set represents the books available through
Amazon. The records contain title information as well as a Dewey Decimal
Classification (DDC) code (for 61% of the books) and category and subject
information supplied by Amazon. Each book is identified by an ISBN. Note
that since different editions of the same work have different ISBNs, there
can be multiple records for a single intellectual work. Each book record is an
XML file with fields like ISBN, title, author, publisher, dimensions, number
of pages and publication date. Curated metadata comes in the form of a
Dewey Decimal Classification in the dewey field, Amazon subject headings
in the subject field, and Amazon category labels in the browseNode fields.
The social metadata from Amazon and LibraryThing is stored in the tag,
rating, and review fields.
2. Two data collections for the classification task, LibraryThing and Reddit:
- Reddit: the training data contains threads from the
suggestmeabook subreddit as positive examples and threads from the books
subreddit as negative examples. In the test data, the subreddit has been
removed (cf. Table 1).
- LibraryThing: 2,000 labelled threads for training, and 2,000 labelled
threads for testing.
&lt;?xml version="1.0"?&gt;
&lt;forum type="reddit"&gt;
&lt;thread id="2nw0um"&gt;
&lt;category&gt;suggestmeabook&lt;/category&gt;
&lt;title&gt;can anyone suggest a modern fantasy series. &lt;/title&gt;
&lt;posts&gt;
&lt;post id="2nw0um"&gt;
&lt;author&gt;blackbonbon&lt;/author&gt;
&lt;timestamp&gt;1417392344&lt;/timestamp&gt;
&lt;parentid&gt; &lt;/parentid&gt;
&lt;body&gt;.... where the baddy turns good, or a series similar to the broken empire trilogy.
I thoroughly enjoyed reading it along with skullduggery pleasant, the saga of darren shan,
the saga of lartern crepsley and the inheritance cycle. So whatever you got helps :D
cheers lads, and lassses.&lt;/body&gt;
&lt;upvotes&gt;8&lt;/upvotes&gt;
&lt;downvotes&gt;0&lt;/downvotes&gt;
&lt;/post&gt;
&lt;/posts&gt;
&lt;/thread&gt;
&lt;/forum&gt;</p>
        <p>Approaches for book search requests classification</p>
        <p>In this work, as depicted in Figure 1, we present two approaches for book search
request classification. The first one is based on sequence mining, which
extracts frequent sequences from the textual content of requests. The second
one is based on NLP techniques: it explores the textual content of requests
and extracts verbs, nouns and compound nouns.</p>
        <p>Linguistic feature extraction</p>
        <p>
          In the linguistic feature model, we make the simplifying assumption
that the text of a request can be represented as a collection of words in
which syntactic information is negligible and even word order is unimportant.
Text feature extraction is the process of transforming what is essentially a bag of
terms into a feature set that is usable by a classifier. We employed TreeTagger
to annotate the text with part-of-speech and lemma information [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. The linguistic feature model is the simplest method: it constructs a word
presence feature set from all the words of an instance. This method ignores
the order of the words and how many times a word occurs; all that matters
is whether the word is present in a list of words. In our approach, we chose to
keep only the nouns and verbs for each request of the collection.
        </p>
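        <p>The word-presence model restricted to nouns and verbs can be illustrated with a minimal sketch. It assumes the request has already been tagged and is available as (word, tag, lemma) triples; the Penn-style tag names, the helper names and the sample tokens are illustrative assumptions, not the tagger output used in our runs.</p>
        <preformat>
```python
from typing import Dict, Iterable, Tuple

# Hypothetical tagged token: (word, part-of-speech tag, lemma).
# The tag names follow a Penn-style convention, which is an assumption here.
TaggedToken = Tuple[str, str, str]


def is_noun_or_verb(pos: str) -> bool:
    """Return True for tags that start with N (nouns) or V (verbs)."""
    return pos.startswith("N") or pos.startswith("V")


def presence_features(tokens: Iterable[TaggedToken]) -> Dict[str, bool]:
    """Build a word-presence feature set from the lemmas of nouns and verbs.

    Word order and word frequency are ignored: each kept lemma appears once.
    """
    return {lemma.lower(): True for _, pos, lemma in tokens if is_noun_or_verb(pos)}


# Illustrative request, tagged by hand for the example.
request = [
    ("can", "MD", "can"),
    ("anyone", "NN", "anyone"),
    ("suggest", "VB", "suggest"),
    ("a", "DT", "a"),
    ("modern", "JJ", "modern"),
    ("fantasy", "NN", "fantasy"),
    ("series", "NN", "series"),
    ("series", "NN", "series"),  # repeated word: still a single feature
]
features = presence_features(request)
print(sorted(features))
```
        </preformat>
        <p>The repeated token contributes a single feature, and the modal, determiner and adjective are dropped, which mirrors the choice of keeping only nouns and verbs.</p>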
      </sec>
      <sec id="sec-2-3">
        <title>Compound nouns feature extraction</title>
        <p>
          Earlier work in the literature showed that simple term features are
not accurate enough to represent document contents in classification, due
to word ambiguity. A solution to this problem is to use compound nouns<sup>3</sup>
instead of simple words. The assumption is that compound nouns are more
likely to identify semantic entities than simple words. We propose a
linguistic approach to extract compound nouns from the request content of
the 2016 mining track. The goal is to identify the dependencies and
relationships between words through language phenomena. The linguistic approach for
compound noun extraction is based on two steps:
1. A syntactic analysis with a tagger (i.e., TreeTagger). Each word is
associated with a tag corresponding to its syntactic category, for example:
noun, adjective, preposition, proper noun, determiner, etc.
2. The tagged corpus is used to extract a set of compound nouns by
identifying syntactic patterns as detailed in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>
          We adopt the definition of syntactic patterns given in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], where a pattern
is a syntactic rule on the order of concatenation of grammatical categories
which form a noun phrase, i.e., a compound noun.
        </p>
        <p>For the English language, we chose to define 12 syntactic patterns: 4
syntactic patterns of size two (for example: Noun Noun, Adjective Noun, etc.),
6 syntactic patterns of size three (for example: Adjective Noun Noun,
Adjective Noun Gerundive, etc.) and 2 syntactic patterns of size four.</p>
        <p><sup>3</sup> By compound nouns, we refer to complex terms and noun phrases.</p>
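        <p>Pattern-based extraction over a tagged request can be sketched as follows. This is only an illustration: the pattern table contains three of the example patterns named above rather than the full set of 12, and the coarse category labels (NOUN, ADJ) and function name are assumptions made for the example.</p>
        <preformat>
```python
from typing import List, Sequence, Tuple

# Illustrative subset of syntactic patterns over coarse categories.
# The full set of 12 patterns is not reproduced here; this table is an assumption.
PATTERNS: List[Tuple[str, ...]] = [
    ("NOUN", "NOUN"),
    ("ADJ", "NOUN"),
    ("ADJ", "NOUN", "NOUN"),
]


def extract_compound_nouns(tagged: Sequence[Tuple[str, str]]) -> List[str]:
    """Return every contiguous word span whose category sequence matches a pattern."""
    found = []
    for start in range(len(tagged)):
        for pattern in PATTERNS:
            window = tagged[start:start + len(pattern)]
            # A window shorter than the pattern (end of text) can never match.
            if tuple(category for _, category in window) == pattern:
                found.append(" ".join(word for word, _ in window))
    return found


# Hand-tagged words from a request title, for illustration only.
tagged_request = [
    ("modern", "ADJ"),
    ("fantasy", "NOUN"),
    ("series", "NOUN"),
]
compounds = extract_compound_nouns(tagged_request)
print(compounds)
```
        </preformat>
        <p>Overlapping matches are all kept in this sketch, so a three-word phrase also yields its two-word sub-phrases; whether to keep or filter such nested candidates is a design choice left open here.</p>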
      </sec>
      <sec id="sec-2-4">
        <title>Sequences feature mining</title>
        <p>Most methods in text classification rely on contiguous sequences of words as
features. Indeed, if we want to take non-contiguous (gappy) patterns into
account, the number of features increases exponentially with the size of the text.
Furthermore, most of these patterns will be noisy. To overcome both issues,
sequential pattern mining can be used to efficiently extract a smaller number of
the most frequent features.</p>
        <p>
          The sequential pattern mining problem was first proposed in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], and then
improved in [5]. It is worth noting that many methods used to discover sequential
patterns are extensions of approaches dedicated to mining frequent
itemsets. Most of these approaches proceed in a bottom-up way. First, the frequent
sets, or sequences, of size 1 are found; then longer frequent sequences are
iteratively obtained starting from the shorter ones [5]. Finally, all the sequences
fulfilling the required conditions are found. In our work, we use the LCM_seq
algorithm [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]<sup>4</sup>, which is a variation of LCM<sup>5</sup> for sequence mining. The
algorithm follows the so-called PrefixSpan scheme, but the data structures and
processing method are LCM based.
        </p>
        <p>
          We adapt to our purpose the basic definitions of the theoretical framework
for frequent sequential pattern discovery introduced in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
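        <p>Definition 1. A sequence <inline-formula><tex-math>S = \langle t_1, \ldots, t_j, \ldots, t_n \rangle</tex-math></inline-formula>, such that each <inline-formula><tex-math>t_k \in V</tex-math></inline-formula>, where V is the vocabulary
and n is the length of S, is an n-termset for which the position of each term in the
sentence is maintained. S is called an n-sequence.</p>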
        <p>Definition 2. Let S be a sequence discovered from the collection P. The support
of S is the number of sentences in P that contain S. S is said to be frequent if
and only if its support is greater than or equal to the minimum support threshold
minsupp.</p>
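        <p>Definitions 1 and 2 can be made concrete with a short sketch that counts the support of a term sequence and tests it against minsupp. It is not the LCM_seq algorithm: it only checks a given candidate sequence, and it assumes that a sentence contains S when the terms of S occur in the same order with gaps allowed. The function names and the sample sentences are hypothetical.</p>
        <preformat>
```python
from typing import Sequence


def contains(sentence: Sequence[str], seq: Sequence[str]) -> bool:
    """True if the terms of seq occur in sentence in the same order (gaps allowed)."""
    remaining = iter(sentence)
    # Each membership test consumes the iterator up to the matched term,
    # so later terms must be found after earlier ones.
    return all(term in remaining for term in seq)


def support(seq: Sequence[str], sentences: Sequence[Sequence[str]]) -> int:
    """Number of sentences of the collection that contain seq (Definition 2)."""
    return sum(1 for sentence in sentences if contains(sentence, seq))


def is_frequent(seq: Sequence[str], sentences: Sequence[Sequence[str]], minsupp: int) -> bool:
    """A sequence is frequent iff its support reaches the minsupp threshold."""
    return support(seq, sentences) >= minsupp


# Toy collection of tokenized sentences, invented for the example.
sentences = [
    ["suggest", "a", "fantasy", "book"],
    ["please", "suggest", "me", "a", "book"],
    ["i", "read", "a", "book"],
]
print(support(("suggest", "book"), sentences))
print(is_frequent(("suggest", "book"), sentences, minsupp=2))
```
        </preformat>
        <p>A dedicated miner avoids enumerating candidates one by one; the sketch is only meant to show what the support of a single 2-sequence means.</p>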
        <p>
          To address book search request classification in an
efficient and effective manner, we argue that a synergy with advanced
text mining methods, especially sequence mining [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], is particularly appropriate.
Indeed, using frequent sequences of terms for request
classification can help select good features and improve classification accuracy,
mostly because of the large number of potentially interesting frequent sequences
that can be drawn from a request collection.
        </p>
      </sec>
      <sec id="sec-2-5">
        <title>Mining and learning process</title>
        <p>The thread classification system identifies which threads on online
forums are book search requests. Our proposed text mining based approaches are
depicted in Figure 1. The thread classification process consists of the
following steps:
1. Annotating the selected threads with part-of-speech and lemma information
using TreeTagger.
2. Extracting linguistic features, i.e., nouns, verbs and compound nouns, from the
annotated threads.
3. Generating the term sequence features using the efficient algorithm LCM_seq.
4. Generating the classification model using the NaiveBayes classifier<sup>6</sup> under
Weka<sup>7</sup>.
5. Applying the classification model to the supplied test set.</p>
        <p><sup>4</sup> http://research.nii.ac.jp/~uno/code/lcm_seq.html</p>
        <p><sup>5</sup> LCM: Linear time Closed itemset Miner</p>
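        <p>The last two steps of the process (learning a model from feature sets and applying it to unseen threads) can be sketched with a small Bernoulli naive Bayes written from scratch. This is an illustration under stated assumptions and not the Weka NaiveBayes implementation used in our runs: the Laplace smoothing, the label names and the toy training data are all hypothetical.</p>
        <preformat>
```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Collect the counts needed for prediction from (feature_set, label) pairs."""
    label_counts = Counter(label for _, label in examples)
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in examples:
        feature_counts[label].update(features)
        vocabulary.update(features)
    return label_counts, feature_counts, vocabulary


def predict(model, features):
    """Return the label with the highest log-posterior under a Bernoulli model."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)  # class prior
        for f in vocabulary:
            # Laplace-smoothed probability that feature f is present in this class.
            p = (feature_counts[label][f] + 1) / (n + 2)
            score += math.log(p if f in features else 1 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training threads, invented for the example.
training = [
    ({"suggest", "book", "fantasy"}, "request"),
    ({"recommend", "book"}, "request"),
    ({"read", "book", "yesterday"}, "other"),
    ({"finish", "novel"}, "other"),
]
model = train(training)
print(predict(model, {"suggest", "fantasy", "book"}))
print(predict(model, {"finish", "novel"}))
```
        </preformat>
        <p>Compound nouns and frequent sequences can be added to the same feature sets as extra string features, which is how the combined feature sets can be thought of in this sketch.</p>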
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments and results</title>
      <sec id="sec-3-1">
        <title>Runs description</title>
        <p>We conducted six runs according to the approaches described in Section 3:
four runs on the LibraryThing data collection and two runs on the
Reddit data collection.</p>
        <p>Runs on the LibraryThing data collection
1. Run1 (ID = Classification-NV): In this run, we used only the bag of
linguistic features (i.e., nouns and verbs) to generate the classification model, using
the NaiveBayes classifier under Weka with the default configuration<sup>8</sup>.
2. Run2 (ID = Classification-NVC): We first extracted the bag of linguistic
features (i.e., nouns and verbs) and compound nouns from a set of 2,000
threads. Then, we used these features to generate the classification model,
using the NaiveBayes classifier.
3. Run3 (ID = Classification-NVSeq): We used the nouns and verbs as in
Run1, then extracted the sequences of words using the LCM_seq algorithm
with a threshold of minsupp = 5. After a series of experiments with
different threshold values, we noticed that minsupp = 5 gives the best results and
has a clear impact on feature extraction. Finally, we combined all
features to build the classification model, using the NaiveBayes classifier.
4. Run4 (ID = Classification-CSeq): In this run, we combined the
compound nouns with sequences, using the NaiveBayes classifier.</p>
        <p><sup>6</sup> Bayesian classification is a supervised learning method as well as a
statistical method for classification.</p>
        <p><sup>7</sup> http://www.cs.waikato.ac.nz/ml/weka/</p>
        <p>Runs on the Reddit data collection
1. Run5 (ID = Classification-V): In this run, we used only the verbs as
features to build the classification model, using the NaiveBayes classifier.
2. Run6 (ID = Classification-VSeq): In the second run on the Reddit posts, we
used the verbs and the sequences of words as features, the sequences being extracted with the LCM_seq
algorithm with a threshold of minsupp = 3. We chose a low value of minsupp
because of the limited number of sequences extracted from the Reddit collection.
Finally, we generated the classification model with the NaiveBayes classifier.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Evaluation metric and results</title>
        <p>The results obtained by our runs for the request classification task
are evaluated with a single metric, accuracy. It measures how
often the classifier makes the correct prediction: it is the ratio between the
number of correct predictions and the total number of predictions (the number
of test data points), thus:
          <disp-formula><tex-math>\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)</tex-math></disp-formula>
where:
- TP: number of true positives
- FP: number of false positives
- TN: number of true negatives
- FN: number of false negatives</p>
        <p><sup>8</sup> We used in all experiments the NaiveBayes classifier with Weka using default
configurations.</p>
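        <p>As a quick illustration of Equation (1), the following sketch computes accuracy from the four confusion counts. The function name and the counts in the example are hypothetical and are not results from our runs.</p>
        <preformat>
```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Ratio of correct predictions (TP + TN) to all predictions, as in Equation (1)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to evaluate")
    return (tp + tn) / total


# Hypothetical confusion counts for 100 test threads.
example = accuracy(tp=40, tn=45, fp=5, fn=10)
print(example)
```
        </preformat>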
        <p>In the 2016 SBS Mining Track, a total of 3 teams submitted 20 runs: 2 teams
submitted 14 runs for the Classification task and 2 teams submitted 6 runs for
the Linking task.</p>
        <p>Table 2 shows the official 2016 SBS Mining Track results for our 4 runs conducted
on the LibraryThing collection. Our runs (Classification-NVC,
Classification-NVSeq, Classification-CSeq, Classification-NV) are ranked sixth, seventh, eighth
and tenth, respectively, for the classification task. Among our runs,
the combination of the bag of linguistic features (i.e., nouns and verbs) and
compound nouns, i.e., Classification-NVC, performs best in terms of accuracy.
We also note that the combination of nouns, verbs and sequences of words, i.e.,
Classification-NVSeq, increases accuracy compared to using only the bag of
linguistic features (i.e., nouns and verbs). This is mainly due to the differences
in how users describe their needs.</p>
        <p>Table 3 gives the official 2016 SBS Mining Track results for our 2 runs
conducted on the Reddit collection (Classification-VSeq and Classification-V), which
are ranked first and third, respectively, in the classification task. The best run
uses the sequences of words and the verbs as features for
classification. This result confirms that sequence mining is useful for the classification
task.</p>
        <p>It is worth noting that the classification evaluation results show
that our proposed approaches, based on NLP techniques, offer interesting results
and help to identify book search requests in online forums.</p>
        <p>In this paper, we presented our contribution to the 2016 Social Book Search
Lab, specifically to the SBS Mining Track. In the 6 submitted runs dedicated
to book search request classification, we tested three approaches for feature
selection, namely: the bag of linguistic features (i.e., nouns and verbs), compound
nouns and sequences, and their combination. We performed classification with
the NaiveBayes classifier under Weka. We showed that combining the bag of linguistic
features (i.e., nouns and verbs) and compound nouns improves accuracy, and that
integrating sequences in the classification process enhances performance. Thus,
the results confirm that the synergy between the NLP techniques (textual
sequence mining and noun phrase extraction) and the classification system is
fruitful.</p>
        <p>5. R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and
performance improvements. In Proceedings of the 5th International Conference on
Extending Database Technology, EDBT'96, volume 1057 of LNCS, pages 3-17, Avignon,
France, March 1996. Springer-Verlag.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Hatem</given-names>
            <surname>Haddad</surname>
          </string-name>
          .
          <article-title>French noun phrase indexing and mining for an information retrieval system</article-title>
          . In
          <source>String Processing and Information Retrieval, 10th International Symposium, SPIRE 2003</source>
          , Manaus, Brazil, October 8-10, 2003, Proceedings, pages
          <fpage>277</fpage>
          -
          <lpage>286</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Takanobu</given-names>
            <surname>Nakahara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Takeaki</given-names>
            <surname>Uno</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Katsutoshi</given-names>
            <surname>Yada</surname>
          </string-name>
          .
          <article-title>Extracting Promising Sequential Patterns from RFID Data Using the LCM Sequence</article-title>
          . In
          <source>Knowledge-Based and Intelligent Information and Engineering Systems: 14th International Conference, KES 2010</source>
          , Cardiff, UK, September 8-10, 2010, Proceedings, Part III, pages
          <fpage>244</fpage>
          -
          <lpage>253</lpage>
          . Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Helmut</given-names>
            <surname>Schmid</surname>
          </string-name>
          .
          <article-title>Probabilistic part-of-speech tagging using decision trees</article-title>
          . In
          <source>International Conference on New Methods in Language Processing</source>
          , pages
          <fpage>44</fpage>
          -
          <lpage>49</lpage>
          , Manchester, UK,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>R.</given-names>
            <surname>Srikant</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Agrawal</surname>
          </string-name>
          .
          <article-title>Mining generalized association rules</article-title>
          . In
          <source>Proceedings of the 21st International Conference on Very Large Databases, VLDB'95</source>
          , pages
          <fpage>407</fpage>
          -
          <lpage>419</lpage>
          , Zurich, Switzerland, September
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>