
SBS 2016 Track mining: Classification with linguistic features for book search requests classification

Mohamed Ettaleb

Chiraz Latiri

chiraz.latiri@gnet.tn 1

Brahim Douar

b.douar@gmail.com 1

Patrice Bellot

patrice.bellot@univ-amu.fr 0

0 Aix-Marseille Université, CNRS, LSIS UMR 7296, 13397, Marseille, France

1 Tunis EL Manar University, Faculty of Sciences of Tunis, LIPAH research Laboratory, Campus Universitaire Farhat Hached, Tunis, Tunisia

In this paper, we describe text mining approaches dedicated to the classification track of the Social Book Search Lab 2016. This track aims to exploit social knowledge extracted from the LibraryThing and Reddit collections to identify which threads on online forums are book search requests. Our proposed classification model is based on a combination of different textual features, namely: (i) basic linguistic features such as nouns and verbs; and (ii) composed features such as term sequences and noun phrases. We then apply a NaiveBayes classifier to identify the user's intentions in the requests.

classification, noun phrases extraction, sequences mining

The Social Book Search (SBS) Lab investigates book search where the users' information needs are complex, looking for more than objective metadata. In this respect, the SBS Lab aims to research and develop techniques to support users in complex book search tasks. It consists of three tracks:

1. Interactive Track: a user-oriented interactive task investigating systems that support users in each of the multiple stages of a complex search task. The track offers participants a complete experimental interactive IR setup and an exciting new multistage search interface to investigate how users move through search stages.
2. Suggestion Track: a system-oriented task for systems to suggest books based on rich search requests combining several topical and contextual relevance signals, as well as user profiles and real-world relevance judgements.
3. Mining Track: an NLP/Text Mining track focusing on detecting and linking book titles in online book discussion forums, as well as detecting book search requests in forum posts for automatic book recommendation.

In this paper, we only consider the Mining Track, which is new in the SBS 2016 edition and investigates two tasks: (i) Classification task: how information retrieval systems can automatically identify book search requests in online forums; and (ii) Linking task: how to detect and link books mentioned in online book discussions.

Our contribution deals only with the classification task. The final objective of this task is to identify which threads on online forums are book search requests. Thus, given a forum thread with one or more posts, the system should determine whether the opening post contains a request for book suggestions (i.e., binary classification of opening posts).

In this respect, we propose two types of approaches: an approach based on textual sequence mining, and an NLP method which relies on the extraction of nouns, verbs and noun phrases (i.e., compound nouns), to improve classification efficiency. We then use the NaiveBayes classifier in Weka to identify the user's intentions in the requests.

The remainder of this paper is organized as follows: Section 2 describes the mining track and the test data. Section 3 recalls the basic definitions for textual sequence mining and details our proposed approaches for book search request classification. Section 4 details our submitted runs for the mining track as well as the official results obtained. The conclusion is given in Section 5.

2 SBS 2016 mining Track

The SBS 2016 mining Track investigates how systems can automatically identify book search requests in online forums and how to detect and link books mentioned in online book discussions. Users often have information needs that are difficult to express with a classical search engine, and in this case they rely on online forums to get recommendations from other users.

2.1 SBS requests classification task

The classification task identifies which threads on online forums are book search requests. That is, given a forum thread with one or more posts, the system should determine whether the opening post contains a request for book suggestions.

2.2 Description of Data collections

The SBS 2016 test collections contain:

1. A collection of 2,780,300 book records from Amazon, extended with social metadata from LibraryThing. This set represents the books available through Amazon. The records contain title information as well as a Dewey Decimal Classification (DDC) code (for 61% of the books) and category and subject information supplied by Amazon. Each book is identified by an ISBN. Note that since different editions of the same work have different ISBNs, there can be multiple records for a single intellectual work. Each book record is an XML file with fields like ISBN, title, author, publisher, dimensions, number of pages and publication date. Curated metadata comes in the form of a Dewey Decimal Classification in the dewey field, Amazon subject headings in the subject field, and Amazon category labels in the browseNode fields. The social metadata from Amazon and LibraryThing is stored in the tag, rating, and review fields.
2. Two data collections for the classification task, LibraryThing and Reddit:
   - Reddit training data: the training data contains threads from the suggestmeabook subreddit as positive examples and threads from the books subreddit as negative examples. In the test data, the subreddit has been removed (cf. Table 1).
   - LibraryThing: 2,000 labelled threads for training, and 2,000 labelled threads for testing.

<?xml version="1.0"?>
<forum type="reddit">
  <thread id="2nw0um">
    <category>suggestmeabook</category>
    <title>can anyone suggest a modern fantasy series. </title>
    <posts>
      <post id="2nw0um">
        <author>blackbonbon</author>
        <timestamp>1417392344</timestamp>
        <parentid> </parentid>
        <body>.... where the baddy turns good, or a series similar to the broken empire trilogy. I thoroughly enjoyed reading it along with skullduggery pleasant, the saga of darren shan, the saga of lartern crepsley and the inheritance cycle.
So whatever you got helps :D cheers lads, and lassses.</body>
        <upvotes>8</upvotes>
        <downvotes>0</downvotes>
      </post>
    </posts>
  </thread>
</forum>
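A thread in the format shown above could be read with Python's standard library roughly as follows. This is only a sketch: the function name `opening_posts` is hypothetical, the sample is an abridged copy of the thread above, and treating the first `<post>` element as the opening post is an assumption about the data layout.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the Reddit thread shown above (body text shortened).
SAMPLE = """<?xml version="1.0"?>
<forum type="reddit">
  <thread id="2nw0um">
    <category>suggestmeabook</category>
    <title>can anyone suggest a modern fantasy series. </title>
    <posts>
      <post id="2nw0um">
        <author>blackbonbon</author>
        <timestamp>1417392344</timestamp>
        <parentid> </parentid>
        <body>.... where the baddy turns good</body>
        <upvotes>8</upvotes>
        <downvotes>0</downvotes>
      </post>
    </posts>
  </thread>
</forum>"""


def opening_posts(xml_text):
    """Yield (thread_id, category, text) for the opening post of each thread.

    Assumption: the first <post> of a thread is its opening post.
    `category` is None when the element is absent, as in the test data.
    """
    root = ET.fromstring(xml_text)
    for thread in root.iter("thread"):
        first = thread.find("./posts/post")
        if first is None:
            continue
        title = (thread.findtext("title") or "").strip()
        body = (first.findtext("body") or "").strip()
        yield thread.get("id"), thread.findtext("category"), f"{title} {body}".strip()


for thread_id, category, text in opening_posts(SAMPLE):
    # For the Reddit training data, the subreddit name gives the label.
    is_request = category == "suggestmeabook"
    print(thread_id, is_request, text)
```

Concatenating title and body is a design choice of this sketch; the paper does not state which fields were fed to the tagger.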

3 Approaches for book search requests classification

In this work, as depicted in Figure 1, we present two approaches for book search request classification. The first is based on sequence mining, which extracts frequent sequences from the textual content of requests. The second is based on NLP techniques: it explores the textual content of requests and extracts verbs, nouns and compound nouns.

3.1 Linguistic feature extraction

In the linguistic feature model, we begin with the simplifying assumption that the text of a request can be represented as a collection of words in which syntactic information is negligible and even word order is unimportant. Text feature extraction is the process of transforming what is essentially a bag of terms into a feature set that is usable by a classifier. We employed TreeTagger to annotate text with part-of-speech and lemma information [3]. The linguistic feature model is the simplest method: it constructs a word presence feature set from all the words of an instance. This method ignores the order of the words and how many times a word occurs; all that matters is whether the word is present in a list of words. In our approach, we chose to keep only the nouns and verbs for each request of the collection.
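As an illustration of the word presence feature set restricted to nouns and verbs, here is a minimal sketch. It assumes the request has already been tagged and is available as (word, tag, lemma) triples; the function name `noun_verb_features` and the rule "tags starting with N are nouns, tags starting with V are verbs" are assumptions of this sketch, not details given in the paper.

```python
def noun_verb_features(tagged_tokens):
    """Map one request to a word presence feature set over noun and verb lemmas.

    `tagged_tokens` is a list of (word, pos_tag, lemma) triples.
    Order and frequency are ignored: a feature is either present or not.
    """
    features = {}
    for word, tag, lemma in tagged_tokens:
        if tag.startswith(("N", "V")):  # assumed tag convention
            # Fall back to the surface form if the tagger found no lemma.
            key = lemma if lemma != "<unknown>" else word
            features[key.lower()] = True
    return features


tagged = [
    ("Can", "MD", "can"),
    ("anyone", "NN", "anyone"),
    ("suggest", "VV", "suggest"),
    ("a", "DT", "a"),
    ("modern", "JJ", "modern"),
    ("fantasy", "NN", "fantasy"),
    ("series", "NN", "series"),
    ("series", "NN", "series"),  # repeated word: still a single feature
]
print(noun_verb_features(tagged))
```

The repeated token shows that counts are discarded, which matches the presence-only model described above.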

3.2 Compound nouns feature extraction

Earlier works in the literature showed that simple term features are not accurate enough to represent document contents in classification, due to word ambiguity. A solution to this problem is to use compound nouns³ instead of simple words. The assumption is that compound nouns are more likely to identify semantic entities than simple words. We propose a linguistic approach to extract compound nouns from the request content of the mining track 2016. The goal is to identify the dependencies and relationships between words through language phenomena. The linguistic approach for compound noun extraction is based on two steps:

1. A syntactic analysis with a tagger (i.e., TreeTagger). Each word is associated with a tag corresponding to its syntactic category, for example: noun, adjective, preposition, proper noun, determiner, etc.
2. The tagged corpus is used to extract a set of compound nouns by identifying syntactic patterns, as detailed in [1].

We adopt the definition of syntactic patterns given in [1], where a pattern is a syntactic rule on the order of concatenation of grammatical categories which form a noun phrase, i.e., a compound noun.

For the English language, we chose to define 12 syntactic patterns: 4 of size two (for example: Noun Noun, Adjective Noun, etc.), 6 of size three (for example: Adjective Noun Noun, Adjective Noun Gerundive, etc.) and 2 of size four.

³ By compound nouns, we refer to complex terms and noun phrases.
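Pattern-based extraction of this kind can be sketched as a match of tag sequences against a pattern list. The sketch below uses only the four patterns named in the text, since the full set of 12 is not listed there; the one-letter tags (`N`, `A`, `G`, `D`) and the function name `compound_nouns` are hypothetical simplifications, not the tagger's real tagset.

```python
# Only the patterns named in the text; the remaining ones are not listed there.
PATTERNS = [
    ("N", "N"),       # Noun Noun
    ("A", "N"),       # Adjective Noun
    ("A", "N", "N"),  # Adjective Noun Noun
    ("A", "N", "G"),  # Adjective Noun Gerundive
]


def compound_nouns(tagged):
    """Return every contiguous word span whose tag sequence matches a pattern.

    `tagged` is a list of (word, coarse_tag) pairs.
    """
    tags = tuple(tag for _, tag in tagged)
    found = []
    for start in range(len(tagged)):
        for pattern in PATTERNS:
            end = start + len(pattern)
            if tags[start:end] == pattern:
                found.append(" ".join(word for word, _ in tagged[start:end]))
    return found


request = [("a", "D"), ("modern", "A"), ("fantasy", "N"), ("series", "N")]
print(compound_nouns(request))
# ['modern fantasy', 'modern fantasy series', 'fantasy series']
```

Overlapping matches are all kept here; whether the original system kept only the longest match is not stated in the paper.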

3.3 Sequences feature mining

Most methods in text classification rely on contiguous sequences of words as features. Indeed, if we want to take non-contiguous (gappy) patterns into account, the number of features increases exponentially with the size of the text. Furthermore, most of these patterns are noisy. To overcome both issues, sequential pattern mining can be used to efficiently extract a smaller number of the most frequent features.

The sequential pattern mining problem was first proposed in [4], and then improved in [5]. Many methods used to discover sequential patterns are extensions of approaches dedicated to mining frequent itemsets. Most of these approaches proceed in a bottom-up way: first, the frequent sets, or sequences, of size 1 are found; then longer frequent sequences are iteratively obtained from the shorter ones [5]. Finally, all the sequences fulfilling the required conditions are found. In our work, we use the LCM_seq algorithm [2]⁴, which is a variation of LCM⁵ for sequence mining. The algorithm follows the so-called prefix span scheme, but its data structures and processing method are LCM based.

We adapt to our purpose the basic definitions of the theoretical framework for frequent sequential pattern discovery introduced in [4].

Definition 1. A sequence $S = \langle t_1, \ldots, t_j, \ldots, t_n \rangle$, such that each $t_k$ belongs to the vocabulary $V$ and $n$ is its length, is an $n$-termset for which the position of each term in the sentence is maintained. $S$ is called an $n$-sequence.

Definition 2. Let $S$ be a sequence discovered from the collection $P$. The support of $S$ is the number of sentences in $P$ that contain $S$. $S$ is said to be frequent if and only if its support is greater than or equal to the minimum support threshold minsupp.
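The two definitions can be made concrete with a naive level-wise miner. This is not LCM_seq: it is a brute-force sketch of the bottom-up scheme, assuming that a sentence "contains" a sequence when the terms occur in the same order with gaps allowed. The function names and the `max_len` cut-off are hypothetical.

```python
def contains(sentence, seq):
    """True if the terms of `seq` occur in `sentence` in order (gaps allowed)."""
    remaining = iter(sentence)
    return all(term in remaining for term in seq)


def support(seq, sentences):
    """Number of sentences that contain `seq` (Definition 2)."""
    return sum(contains(sentence, seq) for sentence in sentences)


def frequent_sequences(sentences, minsupp, max_len=3):
    """Level-wise search: extend each frequent n-sequence by one frequent term."""
    vocabulary = sorted({term for sentence in sentences for term in sentence})
    frequent_terms = [t for t in vocabulary if support((t,), sentences) >= minsupp]
    level = [(t,) for t in frequent_terms]
    result = {}
    while level and len(level[0]) <= max_len:
        next_level = []
        for seq in level:
            supp = support(seq, sentences)
            if supp >= minsupp:
                result[seq] = supp
                next_level.extend(seq + (t,) for t in frequent_terms)
        level = next_level
    return result


sentences = [
    ("suggest", "fantasy", "series"),
    ("suggest", "good", "fantasy", "book"),
    ("read", "fantasy", "series"),
]
for seq, supp in frequent_sequences(sentences, minsupp=2).items():
    print(seq, supp)
```

With minsupp = 2, the frequent 2-sequences in this toy collection are ("fantasy", "series") and ("suggest", "fantasy"); the second one is gappy in the second sentence. Pruning relies on the fact that every prefix of a frequent sequence is itself frequent.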

To address book search request classification in an efficient and effective manner, we claim that a synergy with some advanced text mining methods, especially sequence mining [4], is particularly appropriate. Indeed, using frequent sequences of terms for request classification can help select good features and improve classification accuracy, mostly because of the large number of potentially interesting frequent sequences that can be drawn from a request collection.

3.4 Mining and learning process

The thread classification system serves to identify which threads on online forums are book search requests. Our proposed text mining based approaches are depicted in Figure 1. The thread classification process consists of the following steps:

1. Annotating the selected threads with part-of-speech and lemma information using TreeTagger.
2. Extracting linguistic features, i.e., verbs and compound nouns, from the annotated threads.
3. Generating the term sequence features using the efficient algorithm LCM_seq.
4. Generating the classification model using the NaiveBayes classifier⁶ under Weka⁷.
5. Applying the classification model to the supplied test set.

⁴ http://research.nii.ac.jp/~uno/code/lcm_seq.html
⁵ LCM: Linear time Closed itemset Miner
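Steps 4 and 5 can be illustrated with a small from-scratch classifier. The sketch below is a Bernoulli Naive Bayes over presence features with Laplace smoothing, written in plain Python; it is a stand-in for illustration and not Weka's NaiveBayes implementation, and the class name and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


class BernoulliNaiveBayes:
    """Naive Bayes over binary presence features with Laplace smoothing."""

    def fit(self, docs, labels):
        self.vocabulary = sorted({f for doc in docs for f in doc})
        self.docs_per_class = Counter(labels)
        self.feature_counts = defaultdict(Counter)  # class -> feature -> #docs
        for doc, label in zip(docs, labels):
            self.feature_counts[label].update(set(doc))
        self.total = len(labels)
        return self

    def predict(self, doc):
        present = set(doc)
        best_label, best_score = None, -math.inf
        for label, n_label in self.docs_per_class.items():
            score = math.log(n_label / self.total)  # log prior
            for feature in self.vocabulary:
                # Smoothed estimate of P(feature present | label).
                p = (self.feature_counts[label][feature] + 1) / (n_label + 2)
                score += math.log(p if feature in present else 1.0 - p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set: True marks a book search request.
train_docs = [
    {"suggest", "book", "fantasy"},
    {"recommend", "book"},
    {"read", "yesterday"},
    {"review", "author"},
]
train_labels = [True, True, False, False]

model = BernoulliNaiveBayes().fit(train_docs, train_labels)
print(model.predict({"suggest", "book"}))
print(model.predict({"read", "author"}))
```

Features unseen during training are ignored at prediction time, since only the training vocabulary is scored.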

4 Experiments and results

4.1 Runs description

We conducted six runs according to the approaches described in Section 3: four runs on the LibraryThing data collection and two runs on the Reddit data collection.

Runs on the LibraryThing data collection

1. Run1 (ID = Classification-NV): In this run, we used only the bag of linguistic features (i.e., nouns and verbs) to generate the classification model, using the NaiveBayes classifier under Weka with the default configuration⁸.
2. Run2 (ID = Classification-NVC): We first extracted the bag of linguistic features (i.e., nouns and verbs) and compound nouns from a set of 2,000 threads. Then, we used these features to generate the classification model, using the NaiveBayes classifier.
3. Run3 (ID = Classification-NVSeq): We used the nouns and verbs as in Run1, then extracted the sequences of words using the LCM_seq algorithm with a threshold of minsupp = 5. After a series of experiments with different threshold values, we found that minsupp = 5 gives the best results and has a clear impact on the extracted features. Finally, we combined all features to build the classification model, using the NaiveBayes classifier.
4. Run4 (ID = Classification-CSeq): In this run, we combined the compound nouns with sequences, using the NaiveBayes classifier.

⁶ Bayesian classification is a supervised learning method as well as a statistical method for classification.
⁷ http://www.cs.waikato.ac.nz/ml/weka/
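The four LibraryThing runs differ only in which feature families they combine, which can be written down as a small lookup table. The family names, the `features_for_run` helper and the `family=value` prefixing are assumptions of this sketch; the paper does not say how features from different families were merged.

```python
# Feature families used by each LibraryThing run, as described above.
RUNS = {
    "Classification-NV": ("nouns_verbs",),
    "Classification-NVC": ("nouns_verbs", "compound_nouns"),
    "Classification-NVSeq": ("nouns_verbs", "sequences"),
    "Classification-CSeq": ("compound_nouns", "sequences"),
}


def features_for_run(run_id, extracted):
    """Union the presence features of the families used by one run.

    `extracted` maps a family name to the set of features found in a thread.
    Prefixing with the family name keeps a noun and an identical 1-sequence
    apart (a design choice of this sketch).
    """
    combined = set()
    for family in RUNS[run_id]:
        combined.update(f"{family}={value}" for value in extracted[family])
    return combined


extracted = {
    "nouns_verbs": {"suggest", "series"},
    "compound_nouns": {"fantasy series"},
    "sequences": {"suggest fantasy"},
}
print(sorted(features_for_run("Classification-NVC", extracted)))
```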

Runs on the Reddit data collection

1. Run5 (ID = Classification-V): In this run, we used only the verbs as features to build the classification model, using the NaiveBayes classifier.
2. Run6 (ID = Classification-VSeq): In the second run on Reddit posts, we used the verbs and the sequences of words as features, extracting the sequences with the LCM_seq algorithm with a threshold of minsupp = 3. We chose a low value of minsupp due to the limited number of sequences extracted from the Reddit collection. Finally, we generated the classification model with the NaiveBayes classifier.

4.2 Evaluation metric and results

The results obtained by our runs for the classification task are evaluated with a single metric, accuracy. It simply measures how often the classifier makes the correct prediction: it is the ratio between the number of correct predictions and the total number of predictions (the number of test data points), thus:

$$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

where:
- TP: number of True Positives
- FP: number of False Positives
- TN: number of True Negatives
- FN: number of False Negatives

⁸ We used in all experiments the NaiveBayes classifier with Weka using default configurations.
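Equation (1) translates directly into code. The following sketch counts the four outcomes for binary labels and returns their ratio; the function name and the toy label lists are illustrative only and are not taken from the evaluation data.

```python
def accuracy(gold, predicted):
    """Accuracy as in Equation (1), for boolean gold and predicted labels."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and of equal length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)          # true positives
    tn = sum(1 for g, p in pairs if not g and not p)  # true negatives
    fp = sum(1 for g, p in pairs if not g and p)      # false positives
    fn = sum(1 for g, p in pairs if g and not p)      # false negatives
    return (tp + tn) / (tp + tn + fp + fn)


gold = [True, True, False, False, True]
predicted = [True, False, False, True, True]
print(accuracy(gold, predicted))  # 3 correct out of 5 -> 0.6
```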

In the 2016 SBS Mining Track, a total of 3 teams submitted 20 runs: 2 teams submitted 14 runs for the Classification task and 2 teams submitted 6 runs for the Linking task.

Table 2 shows the official 2016 SBS Mining Track results for our 4 runs conducted on the LibraryThing collection. Our runs Classification-NVC, Classification-NVSeq, Classification-CSeq and Classification-NV are ranked sixth, seventh, eighth and tenth, respectively, for the classification task. These results highlight that the combination of the bag of linguistic features (i.e., nouns and verbs) and compound nouns, i.e., Classification-NVC, performs best in terms of accuracy. We also note that the combination of nouns, verbs and sequences of words, i.e., Classification-NVSeq, increases accuracy compared to using only the bag of linguistic features (i.e., nouns and verbs). This is mainly due to the differences between users' descriptions of their needs.

Table 3 describes the official 2016 SBS Mining Track results for our 2 runs conducted on the Reddit collection (Classification-VSeq and Classification-V), which are ranked first and third, respectively, in the classification task. The best run uses the sequences of words and the verbs as features for classification. This result confirms that sequence mining is useful for the classification task.

It is worth noting that the classification evaluation results show that our proposed approaches, based on NLP techniques, offer interesting results and help to identify book search requests in online forums.

5 Conclusion

In this paper, we presented our contribution to the 2016 Social Book Search Track, especially the SBS Mining track. In the 6 submitted runs dedicated to book search request classification, we tested three approaches for feature selection, namely: bag of linguistic features (i.e., nouns and verbs), compound nouns, and sequences, as well as their combinations. We performed classification with the NaiveBayes classifier in Weka. We showed that combining the bag of linguistic features (i.e., nouns and verbs) and compound nouns improves accuracy, and that integrating sequences in the classification process enhances performance. The results thus confirmed that the synergy between the NLP techniques (textual sequence mining and noun phrase extraction) and the classification system is fruitful.

References

1. Hatem Haddad. French noun phrase indexing and mining for an information retrieval system. In String Processing and Information Retrieval, 10th International Symposium, SPIRE 2003, Manaus, Brazil, October 8-10, 2003, Proceedings, pages 277-286, 2003.
2. Takanobu Nakahara, Takeaki Uno, and Katsutoshi Yada. Extracting Promising Sequential Patterns from RFID Data Using the LCM Sequence. In Knowledge-Based and Intelligent Information and Engineering Systems: 14th International Conference, KES 2010, Cardiff, UK, September 8-10, 2010, Proceedings, Part III, pages 244-253. Springer Berlin Heidelberg, Berlin, Heidelberg, 2010.
3. Helmut Schmid. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, pages 44-49, Manchester, UK, 1994.
4. R. Srikant and R. Agrawal. Mining generalised association rules. In Proceedings of the 21st International Conference on Very Large Databases, VLDB'95, pages 407-419, Zurich, Switzerland, September 1995.
5. R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the 5th International Conference on Extending Database Technology, EDBT'96, volume 1057 of LNCS, pages 3-17, Avignon, France, March 1996. Springer-Verlag.