=Paper= {{Paper |id=Vol-2319/paper28 |storemode=property |title=Noise-aware Missing Shipment Return Comment Classification in e-Commerce |pdfUrl=https://ceur-ws.org/Vol-2319/paper28.pdf |volume=Vol-2319 |authors=Avijit Saha,Vishal Kakkar,Ravindra Babu |dblpUrl=https://dblp.org/rec/conf/sigir/SahaKB18 }} ==Noise-aware Missing Shipment Return Comment Classification in e-Commerce== https://ceur-ws.org/Vol-2319/paper28.pdf
Noise-aware Missing Shipment Return Comment Classification in E-Commerce

Avijit Saha∗, Flipkart Internet Private Limited, Bangalore, India, avijit.saha@flipkart.com
Vishal Kakkar∗, Flipkart Internet Private Limited, Bangalore, India, vishal.kakkar@flipkart.com
T. Ravindra Babu, Flipkart Internet Private Limited, Bangalore, India, ravindra.bt@flipkart.com

ABSTRACT
E-Commerce companies face a number of challenges in return requests. Claims of missing items are one such challenge, where a customer claims through return comments that the main product is missing from the shipment. It is observed that a dominant part of such claims is inadvertent, given the limited literacy of customers. Some of them have fraudulent intent. At Flipkart, such claims are evaluated manually to examine whether the comment relates to a missing item. Classifying the claim intent automatically saves human bandwidth and provides a good customer experience by reducing the turnaround time to customers. However, this is challenging as comments are replete with spelling variations and non-English vernacular words, and are often incomplete and short. This is compounded by noisy labeling of such comments due to human bias and manual errors.

To classify the claim intent, we apply conventional as well as deep learning methods. To handle label noise, we employed state-of-the-art noise-aware techniques, which fail to perform due to pattern-specific label noise. Motivated by the wide pattern-specific label noise, we encode domain heuristics as labeling functions (LFs) which label subsets of the data. However, LFs may conflict and are prone to noise. We address the conflict by defining a conflict-score to rank the LFs. The proposed method of noise handling with LFs outperforms all the state-of-the-art noise-aware baselines.

KEYWORDS
E-Commerce, Text, Comment, Noise, Data Programming, Deep Learning, Machine Learning

ACM Reference Format:
Avijit Saha, Vishal Kakkar, and T. Ravindra Babu. 2018. Noise-aware Missing Shipment Return Comment Classification in E-Commerce. In Proceedings of ACM SIGIR Workshop on eCommerce (SIGIR 2018 eCom). ACM, New York, NY, USA, 8 pages. https://doi.org/10.475/123_4

∗ Equal Contributions

Copyright © 2018 by the paper’s authors. Copying permitted for private and academic purposes.
In: J. Degenhardt, G. Di Fabbrizio, S. Kallumadi, M. Kumar, Y.-C. Lin, A. Trotman, H. Zhao (eds.): Proceedings of the SIGIR 2018 eCom workshop, 12 July, 2018, Ann Arbor, Michigan, USA, published at http://ceur-ws.org

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SIGIR 2018 eCom, July 2018, Ann Arbor, Michigan, USA
© 2018 Copyright held by the owner/author(s).
ACM ISBN 123-4567-24-567/08/06.
https://doi.org/10.475/123_4

1 INTRODUCTION
E-commerce companies face a large number of return requests of various types (reason-codes). Missing-item is one such reason-code, where the customer claims that the main product is missing from the shipment. Unlike a usual return, product pick-up from the customer is avoided in a missing-item return. An example of a missing-item return: a customer ordered a handset and received a stone in place of the handset. A confirmed case of a missing item results in a loss to the company, since there is definite fraud by one of the stakeholders, such as the buyer, the seller, or the delivery team. Hence, careful scrutiny is necessary before the approval of any missing-item return.

A missing-item return request is generated broadly for two reasons. The first is definite fraud by one of the stakeholders. Secondly, given the limited literacy levels of customers, it is observed that claims do not always refer to a missing item but are inadvertently filed as missing-item. For example, a missing-item return with the customer comment ‘I did not like the item.’ clearly indicates that the return belongs to a return category other than missing-item, because the customer received the main product. Hence, the return should be cancelled. We call this a comment-mismatch (the customer’s comment does not match the return reason-code). On the other hand, a missing-item return with the customer comment ‘I ordered a phone but received an empty box.’ clearly indicates that the return belongs to the missing-item category. This return should be approved. We refer to this as a non-comment-mismatch (the customer’s comment matches the return reason-code).

At Flipkart, a dedicated operations team assesses the compatibility of customers’ comments on missing-item returns with the missing-item reason-code. Each missing-item return passes through this process and is rejected when the return comment is incompatible with the missing-item reason-code; otherwise it is approved. This process consumes a lot of human bandwidth and is prone to manual mistakes. Moreover, it hampers the customer experience due to the lag between the return placement time and its status update to the customer. Hence, we want to automate this process. In Machine Learning terms, given a missing-item return comment, we want to predict whether it is a comment-mismatch (positive class) or a non-comment-mismatch (negative class).

Customer comments are generally very noisy, mainly for three reasons: a) spelling mistakes: ‘empty’ is misspelled in the comment ‘Emety box’; b) usage of regional languages: the comment ‘Galt order ho gya h’, which means ‘I ordered the wrong product’, is in Hindi; and c) varied comment length: the token count of a comment ranges over [1, 323]. To handle such difficulties, we employ word embeddings and meta features. Besides a conventional classification method (xgboost [6]), we use BLSTM [24] to capture the sequential information in the comments.

Also, the manual label generation process, which marks a missing-item return comment as comment-mismatch/non-comment-mismatch, is very noisy. The sources of noise are manual error, human bias, and the lack of a well-calibrated operations team. The label noise varies
depending on the patterns and is non-iid. This is clearly visible from the fact that the overall label noise is ∼15%, while a specific pattern, ‘I ordered x quantity of an item but received y quantity’, has 50% noise. For this reason, the state-of-the-art noise-aware baseline [11], which uses BLSTM as the base model, fails severely. In fact, we observed a performance degradation of this noise-aware BLSTM relative to the vanilla BLSTM.

The pattern-specific noise variation and the high noise on certain patterns motivate us to use noise correction based on domain heuristics. Like data programming [20], we express weak supervision strategies or domain heuristics as labeling functions (LFs) which label subsets of the data. However, LFs may conflict and are prone to noise. To the best of our knowledge, no one has employed LFs to rectify label noise.

In a typical data programming setting, only data points are available and LFs are created to generate labels. The LFs may conflict on certain data points and have varying error rates. To handle this, data programming defines a generative process over the LFs to learn the correctness probability of each LF on each data instance. This information is then used to fit a noise-aware classifier.

The key difference between our setting and data programming is the availability of noisy labels (we call this the true LF) in our case. Unlike data programming, we would like to introduce LFs that are less noisy than the true LF. For all these reasons, instead of applying data programming directly, we resort here to a simple and promising method to correct label noise using LFs, and leave the exploration of data programming as future work.

We define multiple LFs to alter the noisy labels in our dataset. However, applying them directly to flip the noisy labels is impossible because the LFs conflict with each other: the same data instance may be labeled as positive by one LF and as negative by another. This requires us to generate a ranked list of LFs. To do that, we define a conflict score, which captures how little an LF conflicts with other LFs. Then, the ranked LFs are applied to alter the true labels. In case of conflict between LFs, the LF with the least conflict score is chosen to alter the noisy label. We show that the proposed method of noise handling with LFs outperforms all the state-of-the-art noise-aware baselines as well as the vanilla baselines.

We integrate the following aspects in the paper.
      • Explanation of a real-world problem and its challenges
      • Exploratory data analysis and feature engineering
      • State-of-the-art baselines (xgboost and LSTM) and their noise-aware variants
      • Noise handling with labeling functions
      • A proposed conflict-score to handle conflicts
      • Demonstration of the superior performance of the proposed method

Section 2 contains insights into the data. Related work, feature engineering, modeling, experimentation, and conclusion are described in Sections 3, 4, 5, 6, and 7, respectively.

2 DATA
2.1 Description
The data in our study comes from the customer comments on missing-item return requests. We use one year of customer comments: April 2017 - March 2018. We consider a comment-mismatch as the positive label and a non-comment-mismatch as the negative label. Table 1 describes the dataset statistics. We can observe that ∼41% of the labels are positive. We count the number of words in a comment by tokenizing it. Table 1 also shows that the word count of comments ranges over [1, 323]. This clearly shows the existence of comments of widely varying length in our dataset. To dive deeper, we plot a histogram of the number of comments w.r.t. word count in Figure 1. We clip the plot at word length 75 for better visualization. Clearly, it is a long-tail distribution: approximately 51% of the comments have fewer than ten words. However, there exists a fat tail of comments with very high word count.

Figure 1: Frequency distribution of comments w.r.t. word count

Table 2 shows examples of customer comments with different word counts. Interestingly, we can observe the presence of spelling errors (‘My mistek’) and regional language (‘Sir khali box mila h’, Hindi). Often, comments are very noisy and do not adhere to grammar rules, for example ‘Not working Good; tow time same product bye mistack accepted so remove it’. To provide more insight into the data, we show a word cloud of our dataset in Figure 2. Some of the important keywords are cx (customer), product, mobile, box, missing, and empty. The word cx occurs in comments when the customer calls customer care to place the return request.

Figure 2: Word clouds
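The word-count statistics described in this subsection could be computed along the following lines. This is a minimal sketch, not the authors' code: the whitespace tokenizer, the function names `word_count` and `word_count_summary`, and the three sample comments (quoted from the text above) are assumptions made for illustration.

```python
from collections import Counter


def word_count(comment):
    """Number of whitespace-separated tokens (assumed tokenizer)."""
    return len(comment.split())


def word_count_summary(comments, short_threshold=10, clip_at=75):
    """Summarise comment lengths: range, share of short comments, histogram."""
    counts = [word_count(c) for c in comments]
    return {
        "min": min(counts),
        "max": max(counts),
        # Fraction of comments with fewer than `short_threshold` words.
        "share_short": sum(n < short_threshold for n in counts) / len(counts),
        # Histogram restricted to word counts up to `clip_at`, as in Figure 1.
        "histogram": Counter(n for n in counts if n <= clip_at),
    }


# Toy input: three comments quoted in Section 2.1.
SAMPLE_COMMENTS = [
    "My mistek",
    "Sir khali box mila h",
    "Not working Good; tow time same product bye mistack accepted so remove it",
]

summary = word_count_summary(SAMPLE_COMMENTS)
print(summary["min"], summary["max"], round(summary["share_short"], 2))
```

On the real dataset, the same summary would be expected to reproduce the [1, 323] range and the roughly 51% share of comments under ten words, provided the tokenizer matches the one used by the authors.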
Table 1: Data Statistics

   # of instances     % of negatives      % of positives      min word count in any comment      max word count in any comment
      O(100k)               59                  41                          1                                  323

Table 2: Example of comments with different word count

   Word count     Comment
       1          Missing
                  Sorry
       2          Product missing.
                  My mistek
       4          I got one jacket
                  empty box no watch
       8          Charged more than 172/- from the MRP on box
                  Sir mera phone nahi tha box ke andar

2.2 Label Noise
Recall that non-comment-mismatch implies a genuine return request of type missing-item, and comment-mismatch implies a return request of any type other than missing-item. Given a missing-item return comment, our operations team at Flipkart marks it as either comment-mismatch (positive) or non-comment-mismatch (negative). This label generation process is very noisy due to manual error, human bias, and the lack of a well-calibrated operations team. Table 3 shows examples of noisy labels for comments with different word counts. The actual label and the true label represent the label in the dataset and the expected label, respectively. We show noisy labels from one-, two-, four-, and eight-word comments. For each word count, we show one comment whose label should be positive but is marked as negative and one comment whose label should be negative but is marked as positive.

Table 3: Example of noisy labels

   Word count     Comment                                                      Actual Label     True Label
       1          Ghh                                                               -1               1
                  Missing                                                            1              -1
       2          Wrong delivery                                                    -1               1
                  Missing product                                                    1              -1
       4          Customer wants the refund                                         -1               1
                  Product is not there                                               1              -1
       8          Please replacement this product otherwise return my money        -1               1
                  Item missing your department , Very poor bad service               1              -1

2.3 Noise Statistics
To estimate the overall noise in our dataset, we manually relabeled 3k randomly chosen examples and consider them as the test dataset. We calculate the noise as the mismatch between the actual label and the true label on the test dataset. The overall noise is estimated to be 15.65%. Also, the positive and negative classes have 10.06% and 19.45% noise, respectively.

3 RELATED WORK
Preprocessing is an important step for text classification. Two important blocks [1] of preprocessing are 1) Tokenization and 2) Filtering. Tokenization [23], which is the initial step of preprocessing, divides a text document into words known as tokens. In Filtering, stop words are removed.

Then, the preprocessed text is converted into feature vectors. One of the widely used models for feature generation is the bag-of-words (BOW) model [18]. It represents a document as a k-dimensional feature vector, where each coordinate represents the count of a specific word in the document. Often, term frequency-inverse document frequency (tfidf) [18] is used to penalize frequently occurring words. Recently, the word2vec (w2v) [19] model has gained much attention. It embeds a word into a k-dimensional vector space while preserving the property that words occurring in the same context have a higher similarity score. Word2vec [16] features are used for text classification in many ways: summation of the w2v embedding vectors, mean of the w2v embedding vectors, and tf-idf based weighted sum of the w2v embedding vectors corresponding to all the words in a document.

Support vector machines (SVMs) [14] are widely employed for text classification. Athanasiou [2] applied a gradient boosting machine to a sentiment analysis task and showed superior performance over SVM, Naive Bayes, and neural networks. A gradient boosting machine is a boosting algorithm where each iteration fits a new model to get a better class estimate. Each newly added model is correlated with the negative gradient of the loss function, and the loss is minimized using gradient descent. Extreme gradient boosting (Xgboost) [6] is another boosting algorithm, with better regularization, that performs well in practice.

Recently, deep learning algorithms [15, 22] have shown promising performance in text classification. Specifically, recurrent neural networks (RNNs) [15] are widely used architectures for capturing sequential information. Long short-term memory networks (LSTMs) [12] are a variant of RNN which helps to overcome some problems of RNNs, such as the vanishing gradient problem, and helps to remember context over long text. Many flavours of LSTMs have been proposed [22] for text classification, such as Multilayer-LSTM, Bidirectional-LSTM (BLSTM), and Tree-Structured LSTM. In Multilayer LSTM, LSTMs are stacked over each other to capture the non-linearity. In BLSTM, both past and future information are preserved using two hidden states, which helps to learn the context better.

Noisy labels can be handled broadly by three approaches [9]: a) label-noise robust models [4, 6], b) data cleansing methods [5, 13], and c) label-noise tolerant learning algorithms [3, 11]. In label-noise robust methods, label noise is handled by reducing overfitting.
Even though learning algorithms are theoretically robust to label noise, in practice performance varies from algorithm to algorithm; for example, bagging performs better than boosting [8]. Data cleansing methods filter out data points which appear to be mislabeled. Filtering can be done with various approaches, such as outlier detection [13], removal of all the data points misclassified by a classifier [5], and removal of any points which disproportionately increase the model complexity [10]. In label-noise tolerant algorithms, noise is handled explicitly in the modeling step. Label-noise robust logistic regression [3] modifies the loss function to handle noise. Recently, a probabilistic neural-network based framework [11] was developed, which views the true label as a latent variable and uses a softmax layer to predict it. The noise is explicitly modeled by an additional softmax layer that predicts the noisy label based on both the true label and the input features.

As creating labeled training data is difficult and time-consuming, many approaches have been developed to generate training data automatically, such as distant supervision [7, 17] and data programming [20]. Distant supervision heuristically maps a knowledge base of known relations to an unknown domain to generate training data. Data programming is a generic framework to create datasets programmatically using distant supervision. It expresses weak supervision strategies or domain heuristics as labeling functions (LFs) which label subsets of the data. The LFs may conflict on certain data points and have varying error rates. To handle this, data programming defines a generative process over the LFs to learn the correctness probability of each LF on each data instance. This information is then used to fit a noise-aware classifier.

4 FEATURE ENGINEERING
In this section, we discuss all the hand-crafted features which are used for the conventional Machine Learning models.

4.1 Meta Features
We construct nine meta features as shown in Figure 3. Word count, char count, alpha count, digit count, and non-alphanumeric count compute the number of words, characters, alphabetic characters, digits, and non-alphanumeric characters in a comment. To show the discriminating power of each meta feature, in Figure 3 we show a histogram of each feature w.r.t. both the positive and the negative class. In each subplot, the blue and green histograms represent the distributions for the negative and positive class, respectively.

In each subplot, the blue distribution is right-shifted. This implies that, in general, comments from the negative class are longer and have more alphabetic characters, digits, and non-alphanumeric characters compared to comments from the positive class. This is explained by the fact that short comments lack descriptive ability and thus have a higher chance of being a comment-mismatch. Interestingly, unique character count and unique alphabet count are the most discriminating features: there is a clear separation between the distributions of the positive and negative class for these two features.

4.2 Word2vec Features
We observed that our dataset is very noisy. For example, a keyword like ‘missing’ has numerous variations: missing, misssing, misisng, missig, etc. Moreover, there are semantically similar words; for example, ‘empty’ and ‘khali’ both mean empty. Also, customers use regional languages, such as Hindi (‘mujhe product nahi mila’). To handle these complexities, we train a word2vec model with 200 dimensions on one year of customers’ return comment data from all the return reason-codes. The number of comments is O(10M). We train the word2vec model considering only words which occur at least 25 times in our corpus. For tokenization, we use the Gensim [21] simple_preprocess method.

Table 4: Example of Spell Variation and Synonym

   missing      mobile      product      ordered
   mising       mobil       prodcut      orderd
   misssing     fone        produict     odered
   misisng      device      produc       order
   khali        handset     prioduct     booked
   empty        phone       item         purchased

Table 4 shows similar words from the word2vec model for four keywords: missing, mobile, product, and ordered. We can observe that the word2vec model is able to capture spelling variations. It is also able to capture semantically similar words, such as missing and empty, mobile and handset, device and phone, and ordered and purchased. Moreover, regional language variation is also captured, such as missing and khali (Hindi), mobile and fone (Hindi), and missing and illa (Tamil).

4.2.1 Sum of Word2vec Features: We sum the individual 200-dimensional embedding vectors of the words in a comment and use the result as the final feature in the model. To handle out-of-vocabulary words, we fall back to the 200-dimensional zero vector.

4.2.2 Weighted Word2vec Features: We fit a tfidf model on the training data. Then, we calculate a weighted sum of the individual 200-dimensional embedding vectors of the words in a comment, where the weight of each word embedding is the tfidf score of the word.

4.3 Bag-of-words (BOW) Features
Each comment in the training dataset is preprocessed with Gensim preprocessing. Then, a bag-of-words (BOW) model is trained with a vocabulary size of 5k and with English stop word removal.

5 MODEL

5.1 BASELINE
We tried multiple conventional Machine Learning algorithms widely used for text classification, such as SVM, naive Bayes, logistic regression, random forest, and xgboost. In our data, around 50% of the comments are long, and we found that long comments often carry sequential information, such as ‘customer ordered 2 items but received only 1’. To capture the sequential information, we experimented with different RNN-based models, such as RNN, LSTM, Multi-layer LSTM, and Bi-directional LSTM (BLSTM). Below, we describe only the best performing model from each of these approaches.

5.1.1 Xgboost: An xgboost model is trained with all the features described in Section 4. The model parameters are tuned by grid search.
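Some of the Section 4 features that feed this model could be computed roughly as follows. This is a minimal pure-Python sketch, not the authors' pipeline. The toy 3-dimensional `EMBEDDINGS` table stands in for the trained 200-dimensional word2vec model, and the whitespace tokenizer, the IDF formula `log(N/df) + 1`, and all function names are assumptions. Only the seven meta features named in Section 4.1 are shown, since the remaining two are not specified there, and treating whitespace as neither alphanumeric nor "non-alphanumeric" is also an assumption.

```python
import math
from collections import Counter

DIM = 3  # toy dimension; the paper uses 200-dimensional embeddings

# Hypothetical embeddings standing in for a trained word2vec model.
EMBEDDINGS = {
    "empty": [1.0, 0.0, 0.0],
    "box": [0.0, 1.0, 0.0],
    "missing": [1.0, 0.0, 1.0],
}


def tokenize(comment):
    """Lower-case whitespace tokenizer (assumed, for illustration only)."""
    return comment.lower().split()


def embed(token):
    """Embedding lookup with a zero-vector fallback for unknown words."""
    return EMBEDDINGS.get(token, [0.0] * DIM)


def meta_features(comment):
    """Seven of the nine meta features (those named in Section 4.1)."""
    alphas = [ch for ch in comment if ch.isalpha()]
    return {
        "word_count": len(comment.split()),
        "char_count": len(comment),
        "alpha_count": len(alphas),
        "digit_count": sum(ch.isdigit() for ch in comment),
        # Whitespace is excluded here; the paper does not say either way.
        "non_alnum_count": sum(
            not ch.isalnum() and not ch.isspace() for ch in comment
        ),
        "unique_char_count": len(set(comment)),
        "unique_alpha_count": len(set(alphas)),
    }


def sum_w2v(comment):
    """Section 4.2.1: sum of the embedding vectors of all tokens."""
    total = [0.0] * DIM
    for token in tokenize(comment):
        total = [t + v for t, v in zip(total, embed(token))]
    return total


def fit_idf(train_comments):
    """Inverse document frequency per word, fitted on training comments."""
    doc_freq = Counter()
    for comment in train_comments:
        doc_freq.update(set(tokenize(comment)))
    n_docs = len(train_comments)
    return {w: math.log(n_docs / df) + 1.0 for w, df in doc_freq.items()}


def weighted_w2v(comment, idf):
    """Section 4.2.2: tf-idf weighted sum of the embedding vectors."""
    total = [0.0] * DIM
    for token, tf in Counter(tokenize(comment)).items():
        weight = tf * idf.get(token, 0.0)  # words unseen in training get 0
        total = [t + weight * v for t, v in zip(total, embed(token))]
    return total


idf = fit_idf(["empty box", "box missing"])
features = (
    list(meta_features("Emety box 2!").values())
    + sum_w2v("Emety box 2!")
    + weighted_w2v("empty box", idf)
)
print(len(features))
```

Concatenating the feature groups into one flat vector per comment, as in the last lines, is one simple way to present them to a tree-based model. The BOW features of Section 4.3 would be appended in the same manner.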
Noise-aware Missing Shipment Return Comment Classification in E-Commerce SIGIR 2018 eCom, July 2018, Ann Arbor, Michigan, USA




                                                          Figure 3: Meta features


While performing the grid search, we restrict the maximum depth of each individual tree to 8
and the maximum number of estimators to 500. The best parameters are chosen using 5-fold
cross-validation.

   5.1.2 BLSTM: We used a BLSTM, which avoids feature engineering. A BLSTM preserves both past
and future information using two hidden states, which helps it learn the context better.
We experimented with pretrained word2vec embeddings as well as with learning the embeddings
from scratch in the network itself. We tune the number of neurons in the BLSTM. We also
experimented with adding fully connected ReLU layers to the network before the output layer.
The best parameters are chosen based on a validation set.

5.2    NOISE-AWARE BASELINE
We tried two approaches to handle noisy labels: 1) a data cleansing method and 2) a label
noise-tolerant algorithm.

    5.2.1 Data Cleansing Method: The goal of such methods is to filter out data points that
appear to be mislabeled. Here, we apply a model-prediction-based filtering method [5].
    In this filtering approach, we divide the training dataset into k folds (k = 5). We train an
xgboost model (with the best parameters found in Section 5.1.1) on k-1 folds and use it to
predict labels for the k-th fold. This process is repeated k times to get predicted labels for
the entire training dataset. We then filter out the instances where the actual label and the
predicted label disagree. On the filtered training data, an xgboost model (with the same
parameters) is trained, which forms the final model. We refer to this method as
xgboost+filtering.

   5.2.2 Label Noise-Tolerant Algorithm: In a label noise-tolerant algorithm, noise is handled
explicitly in the modeling step. Here, a probabilistic neural-network-based framework [11] is
considered to handle label noise. This framework views the true label as a latent variable. A
softmax layer is used to predict the true label. The noise is explicitly modeled by an
additional softmax layer that predicts the noisy label based on both the true label and the
input features.
   Let h = h(x) denote the non-linear function applied to an input x. The true label y is
modeled by:

$$ p(y = i \mid x; w) = \frac{\exp(u_i^\top h + b_i)}{\sum_{l=1}^{k} \exp(u_l^\top h + b_l)}, \quad i = 1, \dots, k \tag{1} $$

   where k is the number of classes and w is the network parameter set (including the softmax
layer). Next, a softmax output layer is added to predict the noisy label z based on both the
true label and the input features:

$$ p(z = j \mid y = i, x; w_{noise}) = \frac{\exp(u_{ij}^\top h + b_{ij})}{\sum_{l=1}^{k} \exp(u_{il}^\top h + b_{il})} \tag{2} $$

$$ p(z = j \mid x) = \sum_{i=1}^{k} p(z = j \mid y = i, x)\, p(y = i \mid x) \tag{3} $$

   where w_noise represents the parameters of the second softmax layer. Given n training data
points with feature vectors x_1, x_2, ..., x_n, corresponding noisy labels z_1, z_2, ..., z_n, and
true labels y_1, y_2, ..., y_n, the log-likelihood of the model parameters is written as:

$$ \sum_{t=1}^{n} \log p(z_t \mid x_t) \tag{4} $$

   This modeling approach is named the c-model and is learned with standard neural-network
training. For our experiments, the function h is a BLSTM with the same parameters as in
Section 5.1.2. This model is referred to as BLSTM+noise-aware.
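To make the noise-aware model concrete, the following is a minimal sketch of the forward computation in Eq. (1) to Eq. (4), written in plain Python. It is an illustration under stated assumptions, not the implementation used in the experiments: the hidden vector h is a hand-written list rather than a BLSTM output, class indices are zero-based, and the function names and all parameter values (U, b, U_noise, b_noise) are made up for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, h):
    return sum(a * b for a, b in zip(u, h))

def true_label_probs(h, U, b):
    """Eq. (1): p(y = i | x); U[i] plays the role of u_i and b[i] of b_i."""
    return softmax([dot(U[i], h) + b[i] for i in range(len(U))])

def noise_probs(h, U_noise, b_noise):
    """Eq. (2): row i holds p(z = j | y = i, x) for every noisy label j."""
    k = len(U_noise)
    return [softmax([dot(U_noise[i][j], h) + b_noise[i][j] for j in range(k)])
            for i in range(k)]

def noisy_label_probs(h, U, b, U_noise, b_noise):
    """Eq. (3): marginalise the latent true label y out of p(z | y, x) p(y | x)."""
    p_y = true_label_probs(h, U, b)
    p_z_given_y = noise_probs(h, U_noise, b_noise)
    k = len(p_y)
    return [sum(p_z_given_y[i][j] * p_y[i] for i in range(k)) for j in range(k)]

def log_likelihood(H, Z, U, b, U_noise, b_noise):
    """Eq. (4): sum of log p(z_t | x_t) over the training points."""
    return sum(math.log(noisy_label_probs(h, U, b, U_noise, b_noise)[z])
               for h, z in zip(H, Z))

# Toy setting: k = 2 classes, 2-dimensional hidden vector, made-up parameters.
h = [1.0, -0.5]
U = [[0.3, -0.2], [-0.1, 0.4]]
b = [0.0, 0.1]
U_noise = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
b_noise = [[2.0, 0.0], [0.0, 2.0]]  # a noise layer that mostly keeps the true label
p_z = noisy_label_probs(h, U, b, U_noise, b_noise)
print(p_z)
```

During training only the noisy label z is observed, so the quantity maximised is the log-likelihood of Eq. (4); at test time the first softmax (Eq. (1)) alone would be used to predict the true label.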
SIGIR 2018 eCom, July 2018, Ann Arbor, Michigan, USA                                       Avijit Saha, Vishal Kakkar, and T. Ravindra Babu

5.3    NOISE HANDLING WITH WEAK SUPERVISION
We encode domain heuristics as labeling functions (LFs) [20], which label subsets of the data.
However, LFs may conflict with each other and are prone to noise. Assuming a data point and
class pair (x, y) is drawn from the distribution X × {−1, 1}, an LF λi : X → {1, 0, −1} is a
user-defined function that encodes some domain heuristic and provides a non-zero label for some
subset of the data points. Here, 1 and -1 refer to the positive and negative class,
respectively, and 0 refers to the case where the LF cannot label the instance. LFs collectively
generate a large but potentially overlapping set of training labels. LFs can be created in many
ways, such as leveraging domain-specific patterns to label data points or using existing
knowledge bases to generate labels.

        Table 5: Example of Labeling Functions (LFs)

      def lambdapartial(x):
          return 1 if (number of positive integer tokens in x) >= 2 else 0

      def lambdasoap(x):
          return -1 if IsTokenPresent("soap", x) else 0

        Table 6: Statistics of Labeling Functions

      # of LFs     Coverage (%)     Overlap (%)      Conflict (%)
         17            100              40               9

        Table 7: Conflict Count

                         true      partial    phone missing    soap
      true                  0        5,752         368           44
      partial           5,752            0         247           54
      phone missing       368          247           0            0
      soap                 44           54           0            0

        Table 8: Overlap Count

                         true      partial    phone missing    soap
      true             O(100k)       3,619       1,819          621
      partial           3,619        9,371           0            0
      phone missing     1,819            0       2,187           42
      soap                621            0          42          665
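As an illustration of the LF abstraction and of the kind of statistics reported in Tables 6, 7, and 8, here is a small sketch in Python. It assumes the LF outputs are already collected in a vote matrix (one row per instance, one column per LF, entries in {1, 0, −1}); the function names and the toy matrix are hypothetical and are not taken from the dataset.

```python
def lf_statistics(votes):
    """Percentage of instances with at least one label (coverage), with more
    than one label (overlap), and with both a +1 and a -1 vote (conflict)."""
    n = len(votes)
    covered = overlapped = conflicted = 0
    for row in votes:
        nonzero = [v for v in row if v != 0]
        covered += len(nonzero) >= 1
        overlapped += len(nonzero) >= 2
        conflicted += (1 in nonzero) and (-1 in nonzero)
    return {
        "coverage": 100.0 * covered / n,
        "overlap": 100.0 * overlapped / n,
        "conflict": 100.0 * conflicted / n,
    }

def pairwise_counts(votes):
    """Pairwise conflict counts (opposite non-zero votes) and overlap counts
    (equal non-zero votes) between every pair of LFs."""
    m = len(votes[0])
    conflict = [[0] * m for _ in range(m)]
    overlap = [[0] * m for _ in range(m)]
    for row in votes:
        for i in range(m):
            for j in range(m):
                product = row[i] * row[j]
                if product == -1:
                    conflict[i][j] += 1
                elif product == 1:
                    overlap[i][j] += 1
    return conflict, overlap

# Toy vote matrix: 4 instances, 3 LFs (made-up values).
votes = [
    [1, 0, 0],    # only LF 0 labels this instance
    [1, -1, 0],   # LF 0 and LF 1 conflict
    [0, 0, 0],    # no LF labels this instance
    [-1, -1, 0],  # LF 0 and LF 1 agree
]
stats = lf_statistics(votes)
conflict, overlap = pairwise_counts(votes)
print(stats)
```

With this counting, the diagonal of the overlap matrix is simply the number of instances each LF labels, and the diagonal of the conflict matrix is always zero.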
    We consider that the actual labels came from an LF named λtrue, which has 100% coverage. As
the introduction of an LF noisier than λtrue would increase the overall noise in the dataset,
unlike data programming, we introduce only LFs that are less noisy than λtrue. Let λ denote the
m newly created LFs, {λi}, i = 1, ..., m, each of which looks at domain-specific patterns to
label data points.
    A specific pattern is observed in the comments: ‘I ordered x quantity of an item but received
y quantity’. This is a partial delivery, as the customer has received part of the order, and
should be marked as comment-mismatch. With this pattern, an LF lambdapartial is defined in
Table 5. We also observed that customers often write that they have received soap instead of a
mobile phone. This is a genuine missing-item request and should be marked as
non-comment-mismatch. With this pattern, an LF lambdasoap is also defined in Table 5. The
IsTokenPresent function returns true when soap is one of the tokens of comment x. Table 6
describes the statistics of all the LFs, where coverage represents the percentage of instances
with at least one label, conflict represents the percentage of instances with conflicting
labels, and overlap depicts the percentage of instances with more than one label.
    To dive deeper, we show the conflict and overlap counts in Table 7 and Table 8, respectively.
A cell (i, j) in Table 7 gives the number of data instances where (λi == 1 and λj == −1) or
(λi == −1 and λj == 1). Similarly, a cell (i, j) in Table 8 gives the number of data instances
where (λi == 1 and λj == 1) or (λi == −1 and λj == −1).
    As LFs conflict with each other, applying them directly to flip the true labels is
impossible. This requires us to generate a ranked list of LFs. To do that, for each λi ∈ λ, we
define a conflict-score as below. Note that the conflict count of λi is calculated by summing
the number of conflicts between λi and each rule from λ − λi.

$$ \text{cf\_score}_{\lambda_i} = \frac{\#\ \text{unique conflicts of } \lambda_i \text{ with other LFs}}{\text{Coverage of } \lambda_i} \tag{5} $$

    Intuitively, the cf_score captures how much an LF conflicts with other LFs. With this score,
Algorithm 1 denoises the training data. λsorted contains the list of the m newly created rules
sorted in ascending order of cf_score.

Algorithm 1 Label Denoising with Labeling Functions (LFs)
  Input: X, λtrue, λsorted
  Output: Y: final label vector
  append λtrue at the end of λsorted
  for xi ∈ X do
    flag = 0
    for λj ∈ λsorted do
       yi = λj(xi)
       if yi ≠ 0 then
          flag = 1
          break
       end if
    end for
    if flag == 0 then
       yi = λtrue(xi)
    end if
  end for

    After denoising the training data with Algorithm 1, we train the xgboost and BLSTM models on
the denoised training data. These two approaches are named xgboost+best-sequence and
BLSTM+best-sequence.
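The ranking by conflict-score and the denoising loop of Algorithm 1 can be sketched in Python as follows. This is a hedged illustration, not the authors' code: it reads "unique conflicts" as the number of instances on which an LF disagrees with at least one other new LF, uses the raw count of labeled instances as coverage, assumes whitespace tokenization, and uses two toy LFs modeled on Table 5 together with made-up comments and labels. All function names are hypothetical.

```python
from itertools import combinations

def is_token_present(token, comment):
    """True when `token` is one of the whitespace-separated tokens (assumed tokenizer)."""
    return token in comment.lower().split()

def lambda_partial(comment):
    """Partial delivery: two or more positive-integer tokens -> comment-mismatch (+1)."""
    count = sum(t.isdecimal() and int(t) > 0 for t in comment.split())
    return 1 if count >= 2 else 0

def lambda_soap(comment):
    """Soap received instead of a phone -> non-comment-mismatch (-1)."""
    return -1 if is_token_present("soap", comment) else 0

def cf_scores(comments, lfs):
    """cf_score per LF: instances where it conflicts with another LF, divided
    by the number of instances it labels (one reading of Eq. 5)."""
    votes = {name: [lf(c) for c in comments] for name, lf in lfs.items()}
    conflicted = {name: set() for name in lfs}
    for a, b in combinations(lfs, 2):
        for idx, (va, vb) in enumerate(zip(votes[a], votes[b])):
            if va * vb == -1:  # one LF votes +1, the other -1
                conflicted[a].add(idx)
                conflicted[b].add(idx)
    scores = {}
    for name in lfs:
        coverage = sum(v != 0 for v in votes[name])
        scores[name] = len(conflicted[name]) / coverage if coverage else float("inf")
    return scores

def denoise(comments, noisy_labels, lfs):
    """Algorithm 1: the first non-abstaining LF in ascending cf_score order sets
    the label; if every LF abstains, the original (lambda_true) label is kept."""
    scores = cf_scores(comments, lfs)
    ordered = sorted(lfs, key=scores.get)
    final = []
    for comment, original in zip(comments, noisy_labels):
        label = original
        for name in ordered:
            vote = lfs[name](comment)
            if vote != 0:
                label = vote
                break
        final.append(label)
    return final

comments = [
    "ordered 2 handsets but received 1",  # lambda_partial fires
    "got a soap instead of the phone",    # lambda_soap fires
    "ordered 3 received 1 and a soap",    # both fire and conflict
    "box was empty",                      # no LF fires
    "only soap in the box",               # lambda_soap fires
]
noisy_labels = [-1, 1, 1, -1, 1]
lfs = {"partial": lambda_partial, "soap": lambda_soap}
scores = cf_scores(comments, lfs)
denoised = denoise(comments, noisy_labels, lfs)
print(scores, denoised)
```

In this toy run lambda_soap has the lower conflict-score, so it is consulted first and decides the third comment, where the two LFs disagree. Keeping the original label as the fallback plays the role of appending λtrue at the end of the sorted list.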

6 EXPERIMENTS
6.1 Dataset
The complete data consists of O(100k) instances, out of which 3k randomly chosen instances form
the test data and the rest form the training data. The test dataset is manually relabeled to
generate clean labels. Table 9 shows the train and test data statistics. Note that the test
dataset is small because manual relabeling is time-consuming.

          Table 9: Train and Test Dataset Statistics

   Type      # of instances    % of neg    % of pos    Label type
   Train        O(100k)           59          41         Noisy
   Test          3,000           51.5        48.5        Clean

6.2    Experimental Setup
We compare the performance of xgboost+best-sequence and BLSTM+best-sequence against the
baselines: xgboost, BLSTM, xgboost+filtering, and BLSTM+noise-aware. Moreover, to showcase the
benefits of conflict handling with the cf_score, we compare the performance of
xgboost+best-sequence and BLSTM+best-sequence with xgboost+random-sequence and
BLSTM+random-sequence. For random-sequence, λsorted consists of a random permutation of the m
newly created rules. Model performance varies across random sequences. Hence, for both
xgboost+random-sequence and BLSTM+random-sequence, we repeat the experiments 10 times with
different random permutations of λ and report the mean and standard deviation. For all the
models, we fix the random state to 2018. As our dataset is well balanced, we use accuracy as the
evaluation metric.

6.3    Results
Table 10 shows the performance comparison between xgboost, BLSTM, and their noise-aware
variants. BLSTM performs best, with an accuracy of 87.43%. This indicates that using sequence
information indeed helps on our dataset. BLSTM and xgboost perform better than
BLSTM+noise-aware and xgboost+filtering, respectively. We observe that the state-of-the-art
noise-aware algorithms hurt the performance. We think the reason for this degradation is the
wide variation in pattern-specific noise.

             Table 10: Performance of Baselines

                     Model               Accuracy (%)
                    Xgboost                 86.90
                Xgboost+filtering           86.47
                    BLSTM                   87.43
               BLSTM+noise-aware            87.33

   Table 11 shows the performance comparison among our methods. Again, BLSTM with best-sequence
performs best. BLSTM+best-sequence and xgboost+best-sequence perform better than
BLSTM+random-sequence and xgboost+random-sequence, respectively. This shows the benefit of
conflict handling among LFs by cf_score. Note that the standard deviation of
BLSTM+random-sequence is quite low, and BLSTM+best-sequence still beats BLSTM+random-sequence,
again indicating that conflict handling with cf_score helps.
   Overall, BLSTM+best-sequence performs best, demonstrating the efficacy of our proposed
approach. Best-sequence provides benefits over random-sequence, showing the benefit of conflict
handling among LFs by cf_score. The sequence model BLSTM provides benefits over xgboost. We were
able to improve the accuracy from 86.90% to 90.04% with the help of BLSTM, LFs, and cf_score.

             Table 11: Performance of Proposed Methods

               Model                    Mean Accuracy (%)      Std
      Xgboost+random-sequence                87.64             1.37
      Xgboost+best-sequence                  88.39             NA
      BLSTM+random-sequence                  88.37             0.02
      BLSTM+best-sequence                    90.04             NA

7    CONCLUSION
We discussed the important problem of classifying missing-item return comments into
comment-mismatch/non-comment-mismatch. We highlighted the data- and noise-related challenges in
both comments and labels. We experimented with state-of-the-art machine learning and deep
learning methods as well as their noise-aware variants. We proposed a simple method with
labeling functions (LFs) to denoise the training dataset. A conflict-score is defined to handle
the conflicts between LFs. Empirically, we have shown the efficacy of our approach over the
baselines.
   As future work, we intend to explore the complete data programming framework to handle noisy
labels.

REFERENCES
 [1] Mehdi Allahyari, Seyed Amin Pouriyeh, Mehdi Assefi, Saied Safaei, Elizabeth D.
     Trippe, Juan B. Gutierrez, and Krys Kochut. 2017. A Brief Survey of Text Mining:
     Classification, Clustering and Extraction Techniques. CoRR abs/1707.02919 (2017).
     arXiv:1707.02919 http://arxiv.org/abs/1707.02919
 [2] Vasileios Athanasiou and Manolis Maragoudakis. 2017. A Novel, Gradient Boosting
     Framework for Sentiment Analysis in Languages where NLP Resources Are
     Not Plentiful: A Case Study for Modern Greek. Algorithms 10 (2017), 34.
 [3] Jakramate Bootkrajang and Ata Kabán. 2012. Label-Noise Robust Logistic
     Regression and Its Applications. In Proceedings of the 2012 European Conference
     on Machine Learning and Knowledge Discovery in Databases - Volume
     Part I (ECML PKDD’12). Springer-Verlag, Berlin, Heidelberg, 143–158.
     https://doi.org/10.1007/978-3-642-33460-3_15
 [4] Leo Breiman. 2001. Random Forests. Mach. Learn. 45, 1 (Oct. 2001), 5–32.
     https://doi.org/10.1023/A:1010933404324
 [5] Carla E. Brodley and Mark A. Friedl. 1999. Identifying Mislabeled Training Data.
     J. Artif. Int. Res. 11, 1 (July 1999), 131–167.
     http://dl.acm.org/citation.cfm?id=3013545.3013548
 [6] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting
     System. In Proceedings of the 22nd ACM SIGKDD International Conference on
     Knowledge Discovery and Data Mining (KDD ’16). ACM, New York, NY, USA,
     785–794. https://doi.org/10.1145/2939672.2939785
 [7] M. Craven and J. Kumlien. 1999. Constructing biological knowledge bases by
     extracting information from text sources. In Proceedings of the International
     Conference on Intelligent Systems for Molecular Biology.
 [8] Thomas G. Dietterich. 2000. An Experimental Comparison of Three Methods for
     Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization.
     Machine Learning 40, 2 (01 Aug 2000), 139–157.
     https://doi.org/10.1023/A:1007607513941
 [9] Benoît Frénay and Ata Kaban. 2014. A Comprehensive Introduction to Label Noise.
     i6doc.com.publ.
[10] Dragan Gamberger, Rudjer Boskovic, Nada Lavrac, and Ciril Groselj. 1999.
     Experiments With Noise Filtering in a Medical Domain. In Proc. of 16th ICML. Morgan
     Kaufmann, 143–151.


[11] Jacob Goldberger and Ehud Ben-Reuven. 2017. Training Deep Neural-networks
     Using a Noise Adaptation Layer.
[12] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory.
     Neural Comput. 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.
     8.1735
[13] Victoria J. Hodge and Jim Austin. 2004. A survey of outlier detection methodolo-
     gies. Artificial Intelligence Review 22 (2004), 2004.
[14] Thorsten Joachims. 1998. Text Categorization with Support Vector Machines:
     Learning with Many Relevant Features. In Proceedings of the 10th European
     Conference on Machine Learning (ECML’98). Springer-Verlag, Berlin, Heidelberg,
     137–142. https://doi.org/10.1007/BFb0026683
[15] Ji Young Lee and Franck Dernoncourt. 2016. Sequential Short-Text Classification
     with Recurrent and Convolutional Neural Networks. CoRR abs/1603.03827 (2016).
[16] Joseph Lilleberg, Yun Zhu, and Yanqing Zhang. 2015. Support vector machines
     and Word2vec for text classification with semantic features. In ICCI*CC, Ning
     Ge, Jianhua Lu, Yingxu Wang, Newton Howard, Philip Chen, Xiaoming Tao,
     Bo Zhang, and Lotfi A. Zadeh (Eds.). IEEE Computer Society, 136–140. http:
     //dblp.uni-trier.de/db/conf/IEEEicci/IEEEicci2015.html#LillebergZZ15
[17] Emily K. Mallory, Ce Zhang, Christopher Ré, and Russ B. Altman. 2016. Large-
     scale extraction of gene interactions from full-text literature using DeepDive.
     Bioinformatics 32, 1 (2016), 106–113. https://doi.org/10.1093/bioinformatics/
     btv476
[18] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. In-
     troduction to Information Retrieval. Cambridge University Press, New York, NY,
     USA.
[19] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient
     Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013).
[20] Alexander J. Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, and Christopher
     Ré. 2016. Data Programming: Creating Large Training Sets, Quickly. In NIPS.
     3567–3575.
[21] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling
     with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges
     for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/
     884893/en.
[22] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved
     Semantic Representations From Tree-Structured Long Short-Term Memory Net-
     works. CoRR abs/1503.00075 (2015).
[23] Jonathan J. Webster and Chunyu Kit. 1992. Tokenization As the Initial Phase in
     NLP. In Proceedings of the 14th Conference on Computational Linguistics - Volume 4
     (COLING ’92). Association for Computational Linguistics, Stroudsburg, PA, USA,
     1106–1110. https://doi.org/10.3115/992424.992434
[24] Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo
     Xu. 2016. Text Classification Improved by Integrating Bidirectional LSTM with
     Two-dimensional Max Pooling. CoRR abs/1611.06639 (2016).