<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author"><string-name>Jean-Valere Cossu</string-name></contrib>
        <contrib contrib-type="author"><string-name>Benjamin Bigot</string-name></contrib>
        <contrib contrib-type="author"><string-name>Ludovic Bonnefoy</string-name></contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>LIA/Universite d'Avignon et des Pays de Vaucluse</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <abstract>
        <p>In this paper, we present the participation of the Computer Science Laboratory of Avignon (LIA) in the RepLab 2013 edition. RepLab is an evaluation campaign for Online Reputation Management systems. The LIA has produced an important number of experiments for every task of the campaign: filtering, topic priority detection, polarity for reputation and topic detection. Our approaches rely on a large variety of machine learning methods. We have chosen to mainly exploit tweet contents. In several of our experiments we have also added selected metadata. A smaller number of our proposals have integrated external information by using the provided links to Wikipedia and to users' homepages.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>RepLab addresses the challenging problem of online Reputation analysis, i.e.
mining and understanding opinions about companies and individuals by
extracting information conveyed in tweets. In this context, LIA's participants have
proposed several methods to automatically annotate tweets.</p>
      <p>The rest of this article is structured as follows. In section 2, we briefly discuss the datasets and the RepLab tasks. In section 3, we present the systems submitted by the LIA (http://lia.univ-avignon.fr/). In section 4, performances are reported before concluding and discussing some future work.</p>
      <sec id="sec-1-0">
        <title>Corpus</title>
        <p>The corpus is a multilingual collection of tweets referring to a set of 61 entities. These entities are spread over four domains: automotive, banking, universities and music/artists. The tweets cover a period going from the 1st of June 2012 to the 31st of December 2012. The entities' canonical names have been used as queries to extract tweets from a larger database. For each entity, at least 2,200 tweets have been collected. The first 700 tweets have been taken to compose the training set, and the remaining ones form the test set. Consequently, tweets concerning each of the four tasks are not homogeneously distributed in the datasets. We have selected 8,000 tweets from the training collection to build a development set.</p>
      </sec>
      <sec id="sec-1-1">
        <title>Filtering</title>
        <p>The Filtering task consists in identifying, in a stream of tweets, those which refer to a given entity, and in labelling these tweets as related or unrelated. For instance, among tweets written in English, systems have to decide whether a tweet containing the word "U2" actually refers to the famous music band or not. The lack of context is one of the main issues when processing tweets. These messages are limited to 140 characters and in many cases the text content is not sufficient to correctly classify a tweet as related or not.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Polarity for Reputation</title>
        <p>The goal of the Polarity for Reputation task is to find out whether a tweet contains a positive, negative or neutral statement concerning the reputation of a company. This task is significantly different from standard sentiment analysis, since the objective is to find a polarity about a reputation, without considering whether tweet contents are opinionated or not. For example, sentiments known as negative do not always imply a negative polarity for reputation characterization in tweets. We observed that the tweet "We'll miss you R.I.P. Whitney" has been associated with a negative label (the writer is sad because of someone's death), but this is undoubtedly a positive tweet about the reputation of Whitney Houston. Finally, the definition of polarity may be really different depending on the considered entity.</p>
      </sec>
      <sec id="sec-1-3">
        <title>Topic Priority Detection</title>
        <p>In the Topic Priority Detection task, we look for the priority level (alert, mildly important, unimportant) of a topic. The priority classes have been defined as follows:
1. alert: the topic deserves the immediate attention of reputation managers;
2. mildly important: the topic contributes to the reputation of the entity but does not require immediate attention;
3. unimportant: the topic can be neglected from a reputation management perspective.
It seems possible to detect priority levels without processing any new clustering task. Indeed, negative messages typically concern information requiring a high-priority reaction. Negative tweets may therefore be highly correlated with the higher priority levels. Still, many factors play a role in the understanding of the proposed priority level.</p>
      </sec>
      <sec id="sec-1-4">
        <title>Topic Detection</title>
        <p>Systems used for Topic Detection are first asked to find out the main subject of a message and then to cluster related tweets. The objective is therefore to bring together tweets referring to the same subject with regard to a given entity.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Approaches</title>
      <p>In this section we describe the LIA systems used in this edition.</p>
      <sec id="sec-2-1">
        <title>TF-IDF-Gini approach with SVM classification</title>
        <p>
          We proposed a supervised classification method based on Term Frequency-Inverse Document Frequency (TF-IDF) using the Gini purity criterion, coupled with a Support Vector Machine (SVM) classification. The system is composed of two main steps. The first one creates a vector representation of words using a term frequency Okapi/BM25 vector [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] with the TF-IDF-Gini method [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. The second part uses the extracted vectors to learn SVM classifiers.
        </p>
        <p>
          TF-IDF [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] has been widely used for extracting discriminative words from text. Several works have also reported improvements by using TF-IDF in association with the Gini purity criterion [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. SVMs are a set of discriminative supervised machine learning techniques aiming at determining a separation hyperplane [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] that maximizes the structural margin between training samples.
        </p>
        <p>
          Only the tweet textual content is used with this approach. Classifiers have been trained with vectorial representations of words in order to automatically assign the most relevant class (for the priority and polarity tasks) to a tweet. These tasks require a multi-class SVM classifier. We have chosen the one-against-one strategy and a linear kernel; this method has been reported to give better accuracy than the one-against-rest method [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
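        <p>As an illustration, the TF-IDF-Gini weighting can be sketched as follows. This is a minimal, hypothetical reconstruction in which the Gini purity of a term is the sum of its squared class-conditional probabilities; the exact weighting and the Okapi/BM25 variant used by the LIA are not reproduced here.</p>

```python
import math
from collections import Counter, defaultdict

def tfidf_gini(docs):
    """docs: list of (tokens, label) pairs. Returns {term: weight}, where the
    TF-IDF score of a term is multiplied by its Gini purity across classes."""
    n_docs = len(docs)
    df = Counter()                       # document frequency of each term
    tf = Counter()                       # global term frequency
    class_counts = defaultdict(Counter)  # term occurrences per class
    for tokens, label in docs:
        for t in set(tokens):
            df[t] += 1
        for t in tokens:
            tf[t] += 1
            class_counts[label][t] += 1
    weights = {}
    for term, freq in tf.items():
        idf = math.log(n_docs / df[term])
        # Gini purity: sum of squared class-conditional probabilities;
        # equals 1 when the term occurs in one class only
        total = sum(class_counts[c][term] for c in class_counts)
        purity = sum((class_counts[c][term] / total) ** 2 for c in class_counts)
        weights[term] = freq * idf * purity
    return weights

docs = [("good great bank".split(), "positive"),
        ("bad awful bank".split(), "negative")]
w = tfidf_gini(docs)
# "bank" occurs in every document, so its idf (and hence its weight) is 0
```

        <p>In the submitted system, such weighted vectors would then feed the linear SVM trained with the one-against-one multi-class strategy.</p>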
      </sec>
      <sec id="sec-2-2">
        <title>Boosting classi cation approach</title>
        <p>
          For the classification tasks, we propose to combine various features extracted from the tweets using a supervised machine learning meta-algorithm: Boosting [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. We chose to use the popular AdaBoost algorithm, a variation of the classical boosting approach. AdaBoost is a multiclass large-margin classifier based on the boosting of weak classifiers. The weak classifiers are given as input. They can be the occurrence or the absence of a specific word or n-gram (useful for linguistic features), or a numerical value. At the end of the training process, a list of selected rules is obtained, as well as their weights. With this set of rules, a score for each class is computed on each item to classify. The classification tool used is IcsiBoost [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], an open source tool based on the AdaBoost algorithm, like the Boostexter software [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. IcsiBoost presents the advantage of providing a confidence score between 0 (low confidence) and 1 (very confident) for each instance to classify. This classification process proposes a categorization of the tweets according to their polarity and their priority. It takes into consideration information contained in the tweets:
        </p>
        <p>1. user id;
2. tweet's textual content (bags of up to 3-grams);
3. language;
4. entity id;
5. category;
6. query string (bags of up to 3-grams).</p>
        <p>Note that the tweet textual content has been normalized with some particular manual rules, which mainly consist in separating punctuation from words (e.g. "price!" becomes "price !"). We chose not to remove the punctuation from the tweet content because we assume that this information may be useful for polarity and priority classification.</p>
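        <p>The boosting process described above can be sketched with word-presence stumps as weak classifiers. This is a simplified, self-contained illustration of binary AdaBoost, not the IcsiBoost implementation; the data, the stump form and the number of rounds are assumptions.</p>

```python
import math

def adaboost(docs, labels, rounds=3):
    """docs: list of token sets; labels: +1/-1.
    Weak classifiers are word-presence stumps: predict `pol` if the word
    occurs in the document, -pol otherwise."""
    n = len(docs)
    w = [1.0 / n] * n                       # example weights
    vocab = sorted(set().union(*docs))
    rules = []                              # (word, polarity, alpha)
    for _ in range(rounds):
        best, best_err = None, float("inf")
        for word in vocab:
            for pol in (1, -1):
                err = sum(wi for wi, d, y in zip(w, docs, labels)
                          if (pol if word in d else -pol) != y)
                if err < best_err:
                    best, best_err = (word, pol), err
        if best_err >= 0.5:
            break                           # no stump better than chance
        word, pol = best
        alpha = 0.5 * math.log((1 - best_err) / max(best_err, 1e-10))
        rules.append((word, pol, alpha))
        # re-weight examples: increase the weight of misclassified ones
        for i, (d, y) in enumerate(zip(docs, labels)):
            h = pol if word in d else -pol
            w[i] *= math.exp(-alpha * y * h)
        z = sum(w)
        w = [wi / z for wi in w]
    return rules

def score(rules, doc):
    # signed margin: the sign gives the class, the magnitude a confidence
    return sum(alpha * (pol if word in doc else -pol)
               for word, pol, alpha in rules)
```

        <p>At the end of training, the selected rules and their weights play the role of the rule list produced by IcsiBoost.</p>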
      </sec>
      <sec id="sec-2-3">
        <title>Cosine distance with TF-IDF and Gini purity criteria</title>
        <p>
          We proposed a supervised classification method based on a cosine distance computed over vectors built using discriminant features such as Term Frequency-Inverse Document Frequency (TF-IDF) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] using the Gini purity criterion [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. This system consists of two steps. First, the text is cleaned by removing hypertext links and punctuation marks, and we generate a list of n-grams by using the Gini purity criterion. During this step, stoplists from Oracle.com (http://docs.oracle.com) for both English and Spanish have been used. In the second step, we create term (word or n-gram) models for each class by using term frequency with the TF-IDF and Gini criteria. Models also contain specific tags for the cases where the second step has not been able to properly produce features from a training tweet. A cosine measure computes the similarity of a given tweet by comparing its bag of words to the whole bag built for each class, and tweets are ranked according to this measure. This classification process takes into account (depending on the task) one or several metadata among:
        </p>
        <p>1. user id; 2. entity id; 3. language.</p>
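        <p>A minimal sketch of the class-model cosine ranking follows, using raw term counts instead of the full TF-IDF-Gini weighting described above; the helper names are hypothetical.</p>

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def class_models(docs):
    """Build one bag-of-words model per class from (tokens, label) pairs."""
    models = {}
    for tokens, label in docs:
        models.setdefault(label, Counter()).update(tokens)
    return models

def classify_cosine(models, tokens):
    """Pick the class whose bag is most cosine-similar to the tweet's bag."""
    tweet = Counter(tokens)
    return max(models, key=lambda c: cosine(tweet, models[c]))
```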
      </sec>
      <sec id="sec-2-4">
        <title>Continuous Context Models</title>
        <p>Continuous Context Models (CCM) aim to capture and model the positional and lexical dependencies existing between a given word and its context. In this method, the presence of anchor words in tweets is required in order to build the context vectors used in CCM. For every given entity of the data set, we consider a predefined set of words including hashtags, "@" usernames and other specific terms. These words have been chosen on the training set in order to cover a large number of context examples for each entity.</p>
        <p>
          According to the procedure formerly presented in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], for each occurrence of a given entity in a tweet, we build one vector. This vector is filled with the relative positions of the words in the entity's neighbourhood with reference to the entity's position in the tweet. The vectors are then taken together in order to build a context-to-entity matrix, on which we apply a dimension reduction using a Singular Value Decomposition to reduce matrix sparseness. The matrix is then used to train a 2-class SVM classifier [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] with a linear kernel.
        </p>
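        <p>The construction of one context vector per entity occurrence could be sketched as follows; the window size and the (word, relative position) encoding are assumptions for illustration, not the LIA's exact implementation.</p>

```python
def context_vector(tokens, entity, window=3):
    """For each occurrence of `entity` in a tokenized tweet, record the
    relative positions of the surrounding words with reference to the
    entity's position (window size is a hypothetical parameter)."""
    vectors = []
    for i, tok in enumerate(tokens):
        if tok != entity:
            continue
        vec = {}
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                key = (tokens[j], j - i)   # word and signed relative offset
                vec[key] = vec.get(key, 0) + 1
        vectors.append(vec)
    return vectors

tweet = "i love my new ford focus".split()
vecs = context_vector(tweet, "ford")
```

        <p>Stacking such vectors over the training set gives the context-to-entity matrix on which the SVD is applied.</p>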
        <p>Continuous Context Models have been used for the filtering task and for the polarity and priority classifications. For the filtering task, the two classes are respectively composed of vectors extracted from unrelated and related tweets. For the polarity and priority classifications, the strategy is different. For these 3-class problems we have built three classifiers. For example, for the polarity classification we have built a positive-versus-not-positive model (not-positive corresponds to negative plus neutral tweets), a negative-versus-not-negative model and a neutral-versus-not-neutral model. The same procedure has been used for the priority classification. Decision rules for the final class attribution have been learnt on the training data set. We only use the tweet text content in these experiments. A normalization consisting in turning upper-case characters to lower-case and removing punctuation marks has been applied.</p>
      </sec>
      <sec id="sec-2-4b">
        <title>k-Nearest-Neighbour with discriminant features</title>
        <p>
          This method can be considered as a strongly improved version of the baseline. The system tries to match each tweet in the test set with the N most similar tweets in the training set. Tweet similarity is computed using the Jaccard measure on the bag-of-words discriminant representation of the tweets, built from Term Frequency-Inverse Document Frequency (TF-IDF) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] combined
with the Gini purity criteria [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. The process also takes into account tokens created from the metadata (author, entity id). A stoplist for both English and Spanish has been used; it contains tool words and the identifiers of entities which obtained a score equal to 0 with the official measures on the development set.
        </p>
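        <p>A minimal sketch of the Jaccard-based N-nearest-neighbour labelling; the representation here is a plain token set rather than the discriminant TF-IDF-Gini representation described above.</p>

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of tokens."""
    return len(a & b) / len(a | b) if a | b else 0.0

def knn_label(train, tweet_tokens, n=3):
    """Label a tweet by majority vote over its N most Jaccard-similar
    training tweets. `train` is a list of (token_set, label) pairs."""
    ranked = sorted(train, key=lambda tl: jaccard(tl[0], tweet_tokens),
                    reverse=True)
    top = [label for _, label in ranked[:n]]
    return max(set(top), key=top.count)
```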
      </sec>
      <sec id="sec-2-5">
        <title>Adaptation of the LIA's system used in KBA 2012</title>
        <p>
          In collaboration with the LSIS, we participated last year in the Knowledge Base Acceleration (KBA) task at TREC 2012 [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ]. The KBA task is very similar to the RepLab filtering and priority sub-tasks: filtering a time-ordered corpus for documents that are highly relevant to a predefined list of 29 entities from Wikipedia, and assigning them a degree of priority among central (alert), relevant (mildly important), neutral (related but unimportant) and garbage (unrelated). Even if the definitions are similar, the types of documents studied are different: blogs, forum posts, news and web pages vs. tweets.
        </p>
        <p>For the KBA task we developed a state-of-the-art approach, which captures the intrinsic characteristics of highly relevant documents by means of three types of features: document centric features, entity profile features, and time features [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. This set of features is computed for each candidate document and, using a classification approach, used to determine whether it is related or not to a given entity. A Random Forest classifier has been used in these experiments. One important point of this approach over most KBA 2012 systems is that only one classifier has been trained for all the entities, and it has been proven to remain competitive without training data associated to a specific tested entity.</p>
        <table-wrap id="tab1">
          <label>Table 1.</label>
          <caption><p>Document centric features, entity related features and time features. TFs are normalized by the size of the document if applicable.</p></caption>
          <table>
            <tbody>
              <tr><td>TF(e, d)</td><td>Term frequency of the entity e in d</td></tr>
              <tr><td>TF10%(e, d)</td><td>Term frequency of e for each 10% part of d</td></tr>
              <tr><td>TF20%(e, d)</td><td>Term frequency of e for each 20% part of d</td></tr>
              <tr><td>C(sent, e, d)</td><td>Count of sentences mentioning e</td></tr>
              <tr><td>entropy(d)</td><td>Entropy of document d</td></tr>
              <tr><td>length(d)</td><td>Count of words in d</td></tr>
              <tr><td>SIM1g(d, sd)</td><td>Cosine similarity between d and the entity's Wikipedia article, based on unigrams</td></tr>
              <tr><td>SIM2g(d, sd)</td><td>Cosine similarity with bigrams</td></tr>
              <tr><td>TF(re, d)</td><td>Term frequency of related entities in d</td></tr>
              <tr><td>TF(reL, d)</td><td>Term frequency of related entities (embedded in links) in d</td></tr>
              <tr><td>TF(e, d).IDF(e, 1h)</td><td>Term frequency in d and inverse document frequency for one hour</td></tr>
              <tr><td>DF(e, 1day)</td><td>Number of documents with e this day</td></tr>
              <tr><td>DF(e, 7d)</td><td>Number of documents with e in 7 days</td></tr>
              <tr><td>Var(DF(e, 7d))</td><td>Variance of the DF in 7 days</td></tr>
              <tr><td>TF(e, 7d)</td><td>Term frequency of e in 7 days</td></tr>
              <tr><td>TF(e, title, 7d)</td><td>TF of e in titles in 7 days</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>We want to measure the performance of this approach on another kind of documents and with a minimum of adaptation. Features peculiar to the KBA corpus have been removed, and no additional features have been built to match the specific features of the RepLab corpus. The feature set is listed in Table 1.</p>
        <p>Filtering task: we have submitted 3 runs for the filtering task:
- Run 4: tweets are cleaned: stop-words are deleted, as well as the @ before a user name, and hashtags are split. A classifier is trained on all positive and negative examples for the whole set of entities;
- Run 5: similar to Run 4, but a new set of features is computed on the web pages pointed to by the URLs in the tweet. If the tweet does not contain a URL, the values of the corresponding features are set to "missing";
- Run 6: similar to Run 5, but one classifier is trained per type of entity (automotive, universities, banking and music/artists).</p>
        <p>Priority task: one run has been submitted for the priority task. It is similar to Run 5 presented above. Two steps are used to associate a priority level to a document: first, documents are tested with a classifier trained on unimportant vs. mildly important/alert examples; then documents that have not been associated to the unimportant class go through a second classifier trained to separate mildly important documents from alert ones.</p>
      </sec>
      <sec id="sec-2-6">
        <title>Ultrastemming + n-grams</title>
        <p>
          For the filtering task, we proposed a supervised classification method based on word n-grams [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] and n-ultra stemming [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Tweets in both English and Spanish are present in the RepLab corpus. In order to avoid language detection or language-specific processing strategies, we use the common information of the words, i.e. their ultra stem. For example, "Information" and "Informacion" share the common 5-ultra stem "Infor". n-ultra stemming is a word normalization method that further reduces the document representation space. We propose to truncate each word to its five initial letters. The algorithm is very simple: we compute the 5-ultra stems of the tweets in the learning corpus. Then two simple probabilistic language models LM_X of n-grams (n = 1, 2, 3), one for each class (X = related/unrelated), have been created. We classify each tweet of the test set by computing the argmax over X of the LM_X scores. The results show that 5-stemming preserves the content information of each tweet, regardless of its language, and can be used to filter the tweets.
        </p>
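        <p>The 5-ultra stemming and language-model classification can be sketched as follows, here with unigram models and add-one smoothing as simplifying assumptions (the submitted system also uses 2- and 3-gram models).</p>

```python
import math
from collections import Counter

def ultra_stem(word, n=5):
    """Truncate a word to its n initial letters (n-ultra stemming)."""
    return word.lower()[:n]

def train_lm(tweets):
    """Unigram language model with add-one smoothing over 5-ultra stems."""
    counts = Counter(ultra_stem(w) for t in tweets for w in t.split())
    total = sum(counts.values())
    vocab = len(counts) + 1
    return lambda w: (counts[ultra_stem(w)] + 1) / (total + vocab)

def classify_lm(models, tweet):
    """Pick the class whose language model gives the tweet the best score."""
    return max(models, key=lambda c: sum(math.log(models[c](w))
                                         for w in tweet.split()))
```

        <p>Because "informacion" and "information" share the stem "infor", the same model can score tweets in either language.</p>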
      </sec>
      <sec id="sec-2-7">
        <title>Maximum a Posteriori Feature Selection</title>
        <p>The LIA's topic detection system first relies on the identification of headwords (HW) characteristic of one topic. HW are words, bigrams, distance bigrams and tweet authors selected using a Maximum A Posteriori (MAP) probability estimator. For each topic, we compile one ordered list of HW, ranked according to a purity criterion. The initial choice of features for topic hypothesization is a set HW_k of discriminative topic headwords for each topic T_k. In order to have a fair characterization of topics with discriminative word vocabularies, all headword vocabularies have been formed with the same size |HW_k|. Vocabularies of different topics may share some headwords.</p>
        <p>In order to attribute a topic to a tweet, we compute the topic contribution of a tweet Y_d for each topic T_k. This topic contribution HW(T_k|Y_d) is a sum of the contributions of the tweet to the topic, computed over the features selected for it. The tweet is attributed to the topic with the maximum HW contribution. The systems proposed by the LIA for topic detection vary in the number of selected features.</p>
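        <p>The headword selection and topic attribution can be sketched as follows; the purity score used here, an estimate of P(topic|word), is a simplified stand-in for the MAP estimator, and the overlap-based attribution is an assumption.</p>

```python
from collections import Counter

def headwords(docs, topic, k=3):
    """Rank candidate headwords for `topic` by the purity P(topic | word)
    estimated on the training set, and keep the k best."""
    in_topic, overall = Counter(), Counter()
    for tokens, label in docs:
        overall.update(tokens)
        if label == topic:
            in_topic.update(tokens)
    ranked = sorted(in_topic, key=lambda w: in_topic[w] / overall[w],
                    reverse=True)
    return ranked[:k]

def assign_topic(hw_sets, tokens):
    """Attribute the topic whose headword set overlaps the tweet most."""
    return max(hw_sets, key=lambda t: len(hw_sets[t] & set(tokens)))
```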
      </sec>
      <sec id="sec-2-8">
        <title>Merging algorithms</title>
        <p>The LIA's methods presented above rely on very different approaches, and we expect that combining the system outputs by means of merging algorithms will improve on the performance of any system taken alone. To this purpose, we have applied merging methods to every task except topic detection. We have used a linear combination of scores, as well as the ELECTRE and PROMETHEE algorithms. Seven of our systems have been combined for the polarity detection and filtering tasks, and six for priority classification.</p>
        <p>Linear combination of output scores: we dispose of N systems. For one tweet T_i of the test set, each system j proposes an entity label L_k, with k = 1...61, and a corresponding output score s_j(T_i, L_k). We first normalize to 1 the sum of the scores provided by each system over the whole test set. The output entity label is chosen according to

L(T_i) = argmax_{k = 1...61} sum_{j = 1...N} s_j(T_i, L_k). (1)</p>
        <p>
          ELECTRE method: the objective of this method [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] is to choose the best candidate from the entire set of systems. This method first consists in ranking entity labels against each other by considering how one entity label dominates another. In a second step, the method evaluates the proportion of systems in which this dominance between entity labels appears.
        </p>
        <p>
          PROMETHEE method: the Preference Ranking Organisation METHod for Enrichment Evaluations [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] is a multi-criteria analysis method. It compares several alternative actions taken by pair, and measures both the capacity of an entity label to dominate the other candidates and its capacity to be dominated by them. It finally produces a ranking of the alternatives.
        </p>
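        <p>The linear combination of Eq. (1) can be sketched as follows; for simplicity the normalization is done per tweet rather than over the whole test set, which is an assumption.</p>

```python
def combine(system_scores):
    """Linear combination of normalized system output scores.
    `system_scores` is a list (one entry per system) of {label: score}
    dicts for one tweet; each system's scores are normalized to sum to 1
    before being added, and the best-scoring label is returned."""
    totals = {}
    for scores in system_scores:
        z = sum(scores.values()) or 1.0
        for label, s in scores.items():
            totals[label] = totals.get(label, 0.0) + s / z
    return max(totals, key=totals.get)
```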
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Submissions and results</title>
      <p>Eleven methods compose the LIA's set of submissions. For reading convenience, these methods are summed up in Table 2 and are referred to by their method number in the results tables. We now compare our results with the baseline, and also with the median score computed over the scores obtained by all the RepLab participants for a given task.</p>
      <p>Filtering task: most of our runs, ranked according to F-measure in Table 3, are situated between the median and the baseline. Two systems (numbers 1 and 4, with F-measure scores of respectively 0.3819 and 0.3412) have reached performances greater than the baseline. The confidence interval (0.002) shows that, in terms of accuracy, many systems are equivalent, despite what can be seen according to the F-measure. The merging strategies (methods 6, 7 and 8) have not been able to produce good selection rules, since their performances remain lower than our best runs taken alone; a selection of the best candidates before the merging would have been better. Moreover, the differences between the entity label distributions of the training and test sets may introduce some noise during the learning process. We have observed better performances by using a development set where entity label populations are more equally distributed between training and test sets.</p>
      <table-wrap id="tab2">
        <label>Table 2.</label>
        <caption><p>LIA's systems for RepLab 2013</p></caption>
        <table>
          <thead>
            <tr><th>#</th><th>Method Description</th></tr>
          </thead>
          <tbody>
            <tr><td>1</td><td>k-NN with discriminant features</td></tr>
            <tr><td>2</td><td>Cosine distance with TF-IDF and Gini purity criteria</td></tr>
            <tr><td>3</td><td>Continuous context models</td></tr>
            <tr><td>4</td><td>Adaptation of the LIA's system used in KBA 2012</td></tr>
            <tr><td>5</td><td>Ultrastemming + n-grams</td></tr>
            <tr><td>6</td><td>PROMETHEE</td></tr>
            <tr><td>7</td><td>ELECTRE</td></tr>
            <tr><td>8</td><td>Linear system combination</td></tr>
            <tr><td>9</td><td>Boosting classification approach</td></tr>
            <tr><td>10</td><td>TF-IDF-Gini approach with SVM classification</td></tr>
            <tr><td>11</td><td>Maximum a Posteriori Feature Selection</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
Polarity task: performances ranked according to the Pearson correlation are reported in Table 4. One important aspect of polarity systems consists in predicting the average polarity of an entity with respect to the other entities. To cover this aspect, the correlation is computed between the average polarity of entities and the reference. It is therefore not necessary to capture the polarity of all tweets to correctly estimate the average polarity. In this task, most of our proposals' performances are between the median and the baseline scores. One method (number 1) is above the baseline and reaches a correlation value equal to 0.8799. Here again, systems are very close according to accuracy, while results can be really different with the other criteria. For some systems the results are far from what was seen on the development set; these differences come from the label distributions of the data sets and the rules learned during the training process.</p>
      <p>Priority Detection task: performances ranked according to F-measure are reported in Table 5. Most of our runs are situated between the median and the baseline values. Method number 1, based on the k-NN classification method, has obtained an F-measure equal to 0.3351, compared to 0.2965 reached by the baseline system. Several of our proposals have reached accuracy scores above the baseline, but here again the merging strategies did not provide better results than the best system.</p>
      <p>Topic Detection task: one system has been submitted for this task. The performances of the runs produced around this method are reported and ranked in terms of F-measure in Table 6. We can see that all our proposals are greater than both the median and the baseline scores, with an F-measure equal to 0.2463 for our best system.</p>
      <p>As reported in Table 7, runs 1 &amp; 2 yield a better classification for the class "other topics", while runs 3 and 4 do not consider "other topics" labels. Nevertheless, their performances are better, even if runs 3 &amp; 4 consider a lower number of tweets. In a complementary experiment realized after the campaign, we added a rule consisting in removing these "other topics" tweets from runs 1 &amp; 2. This rule improves the performances: the F-measure now reaches 0.2972 (R=0.4648, S=0.2307) for run 1 and 0.2928 (R=0.2763, S=0.3296) for run 2.</p>
      <p>Run 1 (679): 419 "other topics"; 40 "mention of a product"; 36 "u2 favourite songs"; 30 "second hand selling / buying"; 25 "4square"; 21 "secondhand cars"; 19 "nowplaying".</p>
      <p>Run 2 (648): 335 "other topics"; 53 "u2 favourite songs"; 46 "jokes"; 39 "u2 fans"; 37 "4square"; 36 "second hand selling / buying"; 35 "mention of a product".</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusions and perspectives</title>
      <p>In this paper we have presented the systems submitted by the Computer Science Laboratory of Avignon (LIA) to RepLab 2013, as well as the performances they reached. We have presented a large variety of approaches and, logically, observed a large variety of system performances. We have also proposed several combinations of systems using different merging strategies, in order to benefit from the diversity of information considered by our runs. Our results are globally good and are mostly situated between the median and the baseline, but could still be improved by considering a subset of systems instead of handling the system outputs with equal weights. In other words, new merging strategies will have to be explored. However, we did not pay enough attention to the label distribution while building our development set. This led us to introduce some noise in our models and to produce "over-trained" rules. Using cross-validation strategies with a partitioned development set would avoid these problems.</p>
      <p>In future work, we will propose some clustering strategies applied to label co-occurrences, and we will also consider the users' sphere of influence as a more important feature. Indeed, a writer whose tweets are followed by a large number of persons should be considered in a different manner than a user who is never read. Exploring how sentiments in web streams are affected by societal and political events, and their effects on topic and polarity trends, is also a very challenging question. Many situations may lead to "swinging opinion states", for instance during a political campaign or depending on the press coverage of an event.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Vapnik</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <article-title>Pattern recognition using generalized portrait method</article-title>
          ,
          <source>in Automation and Remote Control</source>
          ,
          <volume>24</volume>
          , pp
          <fpage>774</fpage>
          -
          <lpage>780</lpage>
          ,
          <year>1963</year>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Boser</surname>
            <given-names>B.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guyon</surname>
            <given-names>I.M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Vapnik</surname>
            <given-names>V.N.</given-names>
          </string-name>
          ,
          <article-title>A training algorithm for optimal margin classifiers</article-title>
          ,
          <source>in 5th annual workshop on Computational Learning Theory</source>
          , pp
          <fpage>144</fpage>
          -
          <lpage>152</lpage>
          ,
          <year>1992</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Joachims</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <article-title>Transductive inference for text classification using support vector machines</article-title>
          ,
          <source>in international Machine learning conference</source>
          , pp
          <fpage>200</fpage>
          -
          <lpage>209</lpage>
          ,
          <year>1999</year>
          , Morgan Kaufmann Publishers, Inc.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. Muller
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Smola</surname>
          </string-name>
          <string-name>
            <surname>A.</surname>
          </string-name>
          , Ratsch G., Scholkopf
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Kohlmorgen</surname>
          </string-name>
          <string-name>
            <given-names>J.</given-names>
            and
            <surname>Vapnik</surname>
          </string-name>
          <string-name>
            <surname>V.</surname>
          </string-name>
          ,
          <article-title>Predicting time series with support vector machines</article-title>
          ,
          <source>in ICANN'97</source>
          , pp
          <fpage>999</fpage>
          -
          <lpage>1004</lpage>
          ,
          <year>1997</year>
          , Springer
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>G.-X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ho</surname>
            ,
            <given-names>C.-H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.-J.</given-names>
          </string-name>
          ,
          <article-title>Recent advances of large-scale linear classification</article-title>
          ,
          <source>in Proceedings of the IEEE</source>
          ,
          <volume>100</volume>
          ,
          <issue>9</issue>
          , pp
          <fpage>2584</fpage>
          -
          <lpage>2603</lpage>
          ,
          <year>2012</year>
          , IEEE
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Schapire</surname>
            ,
            <given-names>R. E.</given-names>
          </string-name>
          ,
          <article-title>The Boosting Approach to Machine Learning: An Overview</article-title>
          ,
          <source>in Workshop on Non-linear Estimation and Classification</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Schapire</surname>
            ,
            <given-names>R. E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Singer</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <article-title>BoosTexter: A boosting-based system for text categorization</article-title>
          ,
          <source>in Machine Learning</source>
          ,
          <volume>39</volume>
          ,
          <fpage>135</fpage>
          -
          <lpage>168</lpage>
          ,
          <year>2000</year>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Favre</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Hakkani-Tur</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Cuendet</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <source>Icsiboost: an open-source implementation of BoosTexter</source>
          , http://code.google.com/p/icsiboost, 2007
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Understanding inverse document frequency: on theoretical arguments for IDF</article-title>
          ,
          <source>in Journal of Documentation</source>
          ,
          <volume>60</volume>
          ,
          <issue>5</issue>
          , pp
          <fpage>503</fpage>
          -
          <lpage>520</lpage>
          ,
          <year>2004</year>
          , Emerald Group Publishing Limited
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Dong</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <article-title>An Improved Algorithm of Bayesian Text Categorization</article-title>
          ,
          <source>in Journal of Software</source>
          ,
          <volume>6</volume>
          ,
          <issue>9</issue>
          , pp
          <fpage>1837</fpage>
          -
          <lpage>1843</lpage>
          ,
          <year>2011</year>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Bigot</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Senay</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Linares</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fredouille</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Dufour</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <article-title>Person Name Recognition in ASR outputs using Continuous Context Models</article-title>
          ,
          <source>in Proceedings of ICASSP 2013</source>
          ,
          <year>2013</year>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <article-title>Term weighting approaches in automatic text retrieval</article-title>
          ,
          <source>in Information Processing and Management</source>
          ,
          <volume>24</volume>
          , pp
          <fpage>513</fpage>
          -
          <lpage>523</lpage>
          ,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Torres-Moreno</surname>
            ,
            <given-names>J.-M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>El-Beze</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bellot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Bechet</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <article-title>Opinion detection as a topic classification problem</article-title>
          ,
          <source>in Textual Information Access. Chapter 9</source>
          ,
          ISTE Ltd and John Wiley and Sons,
          <year>2012</year>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C. D.</given-names>
          </string-name>
          and Schutze H.,
          <source>Foundations of Statistical Natural Language Processing</source>
          , The MIT Press Cambridge, Massachusetts.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Torres-Moreno</surname>
            ,
            <given-names>J.-M.</given-names>
          </string-name>
          ,
          <article-title>Beyond Stemming and Lemmatization: Ultra-stemming to Improve Automatic Text Summarization</article-title>
          ,
          <source>in CoRR</source>
          , abs/1209.3126,
          <year>2012</year>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>J. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kleiman-Weiner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roberts</surname>
            ,
            <given-names>D. A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Niu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Re</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Soboroff</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <article-title>Building an Entity-Centric stream filtering test collection for TREC 2012</article-title>
          ,
          <source>in Proceedings of the Text REtrieval Conference (TREC)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Bonnefoy</surname>
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bouvier</surname>
            <given-names>V.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Bellot</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <article-title>A Weakly-Supervised Detection of Entity Central Documents in a Stream</article-title>
          ,
          <source>in SIGIR</source>
          ,
          <year>2013</year>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Figueira</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Greco</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Ehrgott</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <article-title>Multiple Criteria Decision Analysis: State of the Art Surveys</article-title>
          , Springer Verlag,
          <year>2005</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>