=Paper=
{{Paper
|id=Vol-2319/paper28
|storemode=property
|title=Noise-aware Missing Shipment Return Comment Classification in e-Commerce
|pdfUrl=https://ceur-ws.org/Vol-2319/paper28.pdf
|volume=Vol-2319
|authors=Avijit Saha,Vishal Kakkar,Ravindra Babu
|dblpUrl=https://dblp.org/rec/conf/sigir/SahaKB18
}}
==Noise-aware Missing Shipment Return Comment Classification in e-Commerce==
Avijit Saha∗, Vishal Kakkar∗, T. Ravindra Babu
Flipkart Internet Private Limited, Bangalore, India
avijit.saha@flipkart.com, vishal.kakkar@flipkart.com, ravindra.bt@flipkart.com
∗ Equal Contributions
ABSTRACT
E-Commerce companies face a number of challenges in return requests. Claims of missing items are one such challenge, where a customer claims through return comments that the main product is missing from the shipment. It is observed that the dominant part of such claims is inadvertent, given the limited literacy of customers; some of them have fraud intent. At Flipkart, such claims are evaluated manually to examine whether the comment relates to a missing item. Classifying the claim intent automatically saves human bandwidth and provides a good customer experience by reducing the turnaround time to customers. However, this is challenging as comments are replete with spelling variations and non-English vernacular words, and are often incomplete and short. This is compounded by noisy labeling of such comments due to human bias and manual errors.

To classify the claim intent, we apply conventional as well as deep learning methods. To handle label noise, we employed state-of-the-art noise-aware techniques, which fail to perform due to pattern-specific label noise. Motivated by the wide pattern-specific label noise, we encode domain heuristics as labeling functions (LFs) which label subsets of the data. However, LFs may conflict and are prone to noise. We address the conflict by defining a conflict-score to rank the LFs. The proposed method of noise handling with LFs outperforms all the state-of-the-art noise-aware baselines.

KEYWORDS
E-Commerce, Text, Comment, Noise, Data Programming, Deep Learning, Machine Learning

ACM Reference Format:
Avijit Saha, Vishal Kakkar, and T. Ravindra Babu. 2018. Noise-aware Missing Shipment Return Comment Classification in E-Commerce. In Proceedings of ACM SIGIR Workshop on eCommerce (SIGIR 2018 eCom). ACM, New York, NY, USA, 8 pages. https://doi.org/10.475/123_4

1 INTRODUCTION
E-commerce companies face a large number of return requests of various types (reason-codes). Missing-item is one such reason-code, where the customer claims that the main product is missing from the shipment. Unlike a usual return, product pick-up from the customer is avoided in a missing-item return. An example of a missing-item return: a customer ordered a handset and received a stone in place of the handset. A confirmed case of a missing item results in a loss to the company, since there is a definite fraud by one of the stakeholders such as the buyer, the seller, or the delivery team. Hence, careful scrutiny is necessary before the approval of any missing-item return.

A missing-item return request is generated broadly for two reasons. The first can be a definite fraud by one of the stakeholders. Secondly, given the limited literacy levels of customers, it is observed that the claims do not always refer to a missing item but are inadvertently filed as missing-item. For example, a missing-item return with the customer comment ‘I did not like the item.’ clearly indicates that the return belongs to a category other than missing-item, because here the customer received the main product; hence, the return should be cancelled. We call this a comment-mismatch (the customer’s comment does not match the return reason-code). On the other hand, a missing-item return with the customer comment ‘I ordered a phone but received an empty box.’ clearly indicates that the return belongs to the missing-item category, and this return should be approved. We refer to this as a non-comment-mismatch (the customer’s comment matches the return reason-code).

At Flipkart, a dedicated operations team assesses the compatibility of customers’ comments on missing-item returns with the missing-item reason-code. Each missing-item return passes through this process and is rejected when the return comment is incompatible with the missing-item reason-code; otherwise it is approved. This process wastes a lot of human bandwidth and is prone to manual mistakes. Moreover, it hampers the customer experience due to the lag between the return placement time and the status update to the customer. Hence, we want to automate this process. As a Machine Learning problem, given a missing-item return comment, we want to predict whether it is a comment-mismatch (positive class) or a non-comment-mismatch (negative class).

Customer comments are generally very noisy, mainly for three reasons: a) spelling mistakes: ‘empty’ is misspelled in the comment ‘Emety box’; b) usage of regional languages: the comment ‘Galt order ho gya h’, which means ‘I ordered the wrong product’, is in Hindi; and c) varied comment length: token counts in a comment range over [1, 323]. To handle such difficulties, we employ word embeddings and meta features. Besides a conventional classification method (xgboost [6]), we use a BLSTM [24] to capture the sequential information in the comments.

Also, the manual label generation process, which marks a missing-item return comment as comment-mismatch/non-comment-mismatch, is very noisy. The sources of noise are manual error, human bias, and the lack of a well-calibrated operations team. The label noise varies depending on the pattern and is non-iid. This is clearly visible from the fact that the overall label noise is ∼15%, while a specific pattern, ‘I ordered x quantity of an item but received y quantity’, has 50% noise. For this reason, the state-of-the-art noise-aware baseline [11], which uses a BLSTM as the base model, fails severely. In fact, we observed a performance degradation of this noise-aware BLSTM over the vanilla BLSTM.
The pattern-specific noise variation and the high noise on certain patterns motivate us to use noise correction based on domain heuristics. As in data programming [20], we express weak supervision strategies or domain heuristics as labeling functions (LFs) which label subsets of the data. However, LFs may conflict and are prone to noise. To the best of our knowledge, no one has employed LFs to rectify label noise.
Figure 1: Frequency distribution of comments w.r.t. word count

In a typical data programming setting, only data points are available and LFs are created to generate labels. The LFs may conflict on certain data points and have varying error rates. To handle this, data programming defines a generative process over the LFs to learn the correctness probability of each LF on each data instance. This information is then used to fit a noise-aware classifier.
The key difference between our setting and data programming is the availability of noisy labels (which we call the true LF) in our case. Unlike data programming, we would like to introduce LFs that are less noisy than the true LF. For these reasons, instead of applying data programming directly, we resort to a simple and promising method to correct label noise using LFs, and leave the exploration of data programming as future work.
Figure 2: Word clouds

We define multiple LFs to alter the noisy labels in our dataset. However, applying them directly to flip the noisy labels is impossible because the LFs conflict with each other - the same data instance may be labeled as positive by one LF and negative by another. This requires us to generate a ranked list of LFs. To do that, we define a conflict score, which captures how much an LF conflicts with the other LFs. Then, the ranked LFs are applied to alter the noisy labels. In case of a conflict between LFs, the LF with the least conflict score is chosen to alter the noisy label. We show that the proposed method of noise handling with LFs outperforms all the state-of-the-art noise-aware baselines
as well as the vanilla baselines.
We integrate the following aspects in the paper:
• Explanation of a real-world problem and its challenges
• Exploratory data analysis and feature engineering
• State-of-the-art baselines - xgboost and LSTM - and their noise-aware variants
• Noise handling with labeling functions
• A proposed conflict-score to handle conflicts
• Superior performance of the proposed method

Section 2 contains insights into the data. Related work, feature engineering, modeling, experimentation, and conclusion are described in Sections 3, 4, 5, 6, and 7, respectively.

2 DATA
2.1 Description
The data in our study comes from the customer comments on missing-item return requests. We use one year of customer comments: April 2017 - March 2018. We consider a comment-mismatch as the positive label and a non-comment-mismatch as the negative label. Table 1 describes the dataset statistics. We can observe that ∼41% of the labels are positive. We count the number of words in a comment by tokenizing it. Table 1 also shows that the word count of comments ranges over [1, 323], which clearly shows the existence of widely variable length comments in our dataset. To deep dive, we plot a histogram of the number of comments w.r.t. word count in Figure 1. We clip the plot at word length 75 for better visualization. Clearly, it is a long-tail distribution - approximately 51% of comments have fewer than ten words. However, there exists a fat tail of comments with very high word counts.

Table 2 shows examples of customer comments with different word counts. Interestingly, we can observe the presence of spelling errors - ‘My mistek’ - and regional language - ‘Sir khali box mila h’ (Hindi). Often, comments are very noisy and do not adhere to grammar rules - ‘Not working Good; tow time same product bye mistack accepted so remove it’. To provide more insight into the data, we show a word cloud of our dataset in Figure 2. Some of the important keywords are cx (customer), product, mobile, box, missing, and empty. The word cx occurs in comments when the customer calls customer care to place the return request.
Table 1: Data Statistics

# of instances | % of negatives | % of positives | min word count | max word count
O(100k) | 59 | 41 | 1 | 323

Table 2: Examples of comments with different word counts

Word Count | Comment
1 | Missing
1 | Sorry
2 | Product missing.
2 | My mistek
4 | I got one jacket
4 | empty box no watch
8 | Charged more than 172/- from the MRP on box
8 | Sir mera phone nahi tha box ke andar

Table 3: Examples of noisy labels

Word count | Comment | Actual Label | True Label
1 | Ghh | -1 | 1
1 | Missing | 1 | -1
2 | Wrong delivery | -1 | 1
2 | Missing product | 1 | -1
4 | Customer wants the refund | -1 | 1
4 | Product is not there | 1 | -1
8 | Please replacement this product otherwise return my money | -1 | 1
8 | Item missing your department , Very poor bad service | 1 | -1

2.2 Label Noise
Recall that non-comment-mismatch implies a genuine return request of the missing-item type, and comment-mismatch implies a return request of any type other than missing-item. Given a missing-item return comment, our operations team at Flipkart marks it as either comment-mismatch (positive) or non-comment-mismatch (negative). This label generation process is very noisy due to manual error, human bias, and the lack of a well-calibrated operations team. Table 3 shows examples of noisy labels for comments with different word counts. The actual label and the true label represent the label in the dataset and the expected label, respectively. We show noisy labels for one-, two-, four-, and eight-word comments. For each word count, we show one comment whose label should be positive but is marked as negative, and one comment whose label should be negative but is marked as positive.

2.3 Noise Statistics
To estimate the overall noise in our dataset, we manually relabeled 3k randomly chosen examples and consider them as the test dataset. We calculate the noise by counting mismatches between the actual label and the true label on the test dataset. The overall noise is estimated to be 15.65%. Also, the positive and negative classes have 10.06% and 19.45% noise, respectively.

3 RELATED WORK
Preprocessing is an important step for text classification. Two important blocks [1] of preprocessing are 1) tokenization and 2) filtering. Tokenization [23], the initial step of preprocessing, divides a text document into words known as tokens. In filtering, stop words are removed.

Then, the preprocessed text is converted into feature vectors. One of the widely used models for feature generation is the bag-of-words (BOW) model [18]. It represents a document as a k-dimensional feature vector, where each coordinate represents the count of a specific word in the document. Often, term frequency-inverse document frequency (tfidf) [18] is used to penalize frequently occurring words. Recently, the word2vec (w2v) [19] model has gained much attention. It embeds a word into a k-dimensional vector space, preserving the property that words occurring in the same context have a higher similarity score. Word2vec [16] features are used for text classification in many ways - the summation of the w2v embedding vectors, the mean of the w2v embedding vectors, and the tfidf-based weighted sum of the w2v embedding vectors corresponding to all the words in a document.

Support vector machines (SVMs) [14] are widely employed for text classification. Athanasiou [2] applied a gradient boosting machine to a sentiment analysis task and showed superior performance over SVM, Naive Bayes, and a neural network. A gradient boosting machine is a boosting algorithm where each iteration fits a new model to get a better class estimate. Each newly added model is correlated with the negative gradient of the loss function, and the loss is minimized using gradient descent. Extreme gradient boosting (Xgboost) [6] is another boosting algorithm with better regularization that performs well in practice.

Recently, deep learning algorithms [15, 22] have shown promising performance in text classification. Specifically, recurrent neural networks (RNNs) [15] are widely used architectures to capture sequential information. The long short-term memory network (LSTM) [12] is a variant of the RNN which helps to overcome some of the problems of RNNs, such as the vanishing gradient problem, and helps to remember context over long text. Many flavours of LSTMs have been proposed [22] for text classification, such as the Multilayer-LSTM, the Bidirectional-LSTM (BLSTM), and the Tree-Structured LSTM. In a Multilayer LSTM, LSTMs are stacked over each other to capture non-linearity. In a BLSTM, both past and future information are preserved using two hidden states, which helps to learn the context better.

Noisy labels can be handled broadly by three approaches [9]: a) label-noise robust models [4, 6], b) data cleansing methods [5, 13], and c) label-noise tolerant learning algorithms [3, 11]. In label-noise robust methods, label noise is handled by reducing overfitting.
Even though learning algorithms are theoretically robust to label noise, in practice performance varies from algorithm to algorithm; for example, bagging performs better than boosting [8]. Data cleansing methods filter out data points which appear to be mislabeled. Filtering can be done with various approaches, such as outlier detection [13], removal of all the data points misclassified by a classifier [5], and removal of any points which disproportionately increase the model complexity [10]. In label-noise tolerant algorithms, noise is handled explicitly in the modeling step. Label-noise robust logistic regression [3] modifies the loss function to handle noise. Recently, a probabilistic neural-network based framework [11] was developed which views the true label as a latent variable; a softmax layer is used to predict it. The noise is explicitly modeled by an additional softmax layer that predicts the noisy label based on both the true label and the input features.

As creating labeled training data is difficult and time consuming, many approaches have been developed to generate training data automatically, such as distant supervision [7, 17] and data programming [20]. Distant supervision heuristically maps a knowledge base of known relations to an unknown domain to generate training data. Data programming is a generic framework to create datasets programmatically using distant supervision. It expresses weak supervision strategies or domain heuristics as labeling functions (LFs) which label subsets of the data. The LFs may conflict on certain data points and have varying error rates. To handle this, data programming defines a generative process over the LFs to learn the correctness probability of each LF on each data instance. This information is then used to fit a noise-aware classifier.
4 FEATURE ENGINEERING
In this section, we discuss all the hand-crafted features used for the conventional Machine Learning models.

4.1 Meta Features
We construct nine meta features, as shown in Figure 3. Word count, char count, alpha count, digit count, and non-alphanumeric count compute the number of words, characters, alphabets, digits, and non-alphanumeric characters in a comment. To show the discriminating power of each meta feature, Figure 3 shows a histogram of each feature w.r.t. both the positive and negative classes. In each subplot, the blue and green histograms represent the distribution for the negative and positive class, respectively.

Figure 3: Meta features

In each subplot, the blue distribution is right-shifted. This implies that, in general, comments from the negative class are longer and have more alphabets, digits, and non-alphanumeric characters compared to comments from the positive class. This is explained by the fact that short comments lack descriptive ability and thus have a higher chance of being a comment-mismatch. Interestingly, unique character count and unique alphabet count are the most discriminating features - there is a clear separation between the distributions of the positive and negative classes for these two features.
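To make these feature definitions concrete, here is a minimal Python sketch of the named meta features. The exact tokenization and the remaining two of the nine features are not spelled out in the text, so the whitespace splitting and the feature selection below are our assumptions.

def meta_features(comment):
    # Seven of the nine meta features named in Section 4.1; the paper
    # does not spell out the remaining two, so they are omitted here.
    tokens = comment.split()
    alphas = [c for c in comment if c.isalpha()]
    return {
        'word_count': len(tokens),
        'char_count': len(comment),
        'alpha_count': len(alphas),
        'digit_count': sum(c.isdigit() for c in comment),
        'non_alnum_count': sum(not c.isalnum() and not c.isspace() for c in comment),
        'unique_char_count': len(set(comment)),
        'unique_alpha_count': len(set(alphas)),
    }

print(meta_features('Sir khali box mila h'))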
4.2 Word2vec Features
We observed that our dataset is very noisy. For example, a keyword like ‘missing’ has numerous variations - missing, misssing, misisng, missig, etc. Moreover, there are semantically similar words; for example, ‘empty’ and ‘khali’ both mean empty. Also, customers use regional languages, such as Hindi (‘mujhe product nahi mila’). To handle these complexities, we train a word2vec model with 200 dimensions on one year of customers’ return comments from all the return reason-codes. The number of comments is O(10M). We train the word2vec model considering only words which occur at least 25 times in our corpus. For tokenization, we use the Gensim [21] simple_preprocess method.

Table 4: Examples of Spell Variations and Synonyms

missing | mobile | product | ordered
mising | mobil | prodcut | orderd
misssing | fone | produict | odered
misisng | device | produc | order
khali | handset | prioduct | booked
empty | phone | item | purchased

Table 4 shows similar words from the word2vec model for four keywords - missing, mobile, product, and ordered. We can observe that the word2vec model is able to capture spelling variations. It is also able to capture semantically similar words, such as missing and empty, mobile and handset, device and phone, and ordered and purchased. Moreover, regional language variations are also captured, such as missing and khali (Hindi), mobile and fone (Hindi), and missing and illa (Tamil).

4.2.1 Sum of Word2vec Features: We sum the individual 200-dimensional embedding vectors for each word in a comment and use the sum as the final feature in the model. To handle out-of-vocabulary words, we fall back to the 200-dimensional zero vector.

4.2.2 Weighted Word2vec Features: We fit a tfidf model on the training data. Then, we calculate a weighted sum of the individual 200-dimensional embedding vectors for the words in a comment, where the weight of an individual word embedding is the tfidf score of that word.
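The following is a minimal sketch of the word2vec feature construction under the stated settings (200 dimensions, min_count of 25, Gensim simple_preprocess). The toy corpus, the lowered min_count, and the per-document tfidf weighting are our assumptions; the Word2Vec argument is vector_size in Gensim 4.x (size in older versions).

import numpy as np
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from sklearn.feature_extraction.text import TfidfVectorizer

comments = ['Sir khali box mila h', 'Emety box', 'I got one jacket']  # toy stand-in corpus
tokenized = [simple_preprocess(c) for c in comments]

# Paper setting: 200 dimensions, words occurring at least 25 times;
# min_count is lowered here only so the toy corpus has a vocabulary.
w2v = Word2Vec(tokenized, vector_size=200, min_count=1)

def sum_w2v(tokens):
    # Section 4.2.1: sum of the embeddings, zero vector for OOV words
    vecs = [w2v.wv[t] if t in w2v.wv else np.zeros(200) for t in tokens]
    return np.sum(vecs, axis=0) if vecs else np.zeros(200)

tfidf = TfidfVectorizer(analyzer=lambda doc: doc).fit(tokenized)
X_tfidf = tfidf.transform(tokenized)
vocab = tfidf.vocabulary_

def weighted_w2v(doc_idx, tokens):
    # Section 4.2.2: tfidf-weighted sum of the word embeddings
    out = np.zeros(200)
    for t in set(tokens):
        if t in vocab and t in w2v.wv:
            out += X_tfidf[doc_idx, vocab[t]] * w2v.wv[t]
    return out

features = np.vstack([sum_w2v(toks) for toks in tokenized])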
4.3 Bag-of-words (BOW) Features
Each comment in the training dataset is preprocessed with Gensim preprocessing. Then, a bag-of-words (BOW) model is trained with a 5k vocabulary size and with English stop words removed.

5 MODEL
5.1 BASELINE
We tried multiple conventional Machine Learning algorithms widely used for text classification, such as SVM, naive Bayes, logistic regression, random forest, and xgboost. In our data, around 50% of the comments are long, and we found that long comments often carry sequential information, such as ‘customer ordered 2 items but received only 1’. To capture the sequential information, we experimented with different RNN-based models, such as the RNN, LSTM, Multi-layer LSTM, and Bi-directional LSTM (BLSTM). Below, we describe only the best performing model from each approach.

5.1.1 Xgboost: An xgboost model is trained with all the features described in Section 4. The model parameters are tuned by grid search. While performing the grid search, we restrict the max depth of an individual tree to 8 and the max number of estimators to 500. The best parameters are chosen using 5-fold cross-validation.
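A minimal sketch of this tuning step follows. The synthetic data and the specific grid values are our assumptions; only the stated bounds (max depth ≤ 8, at most 500 estimators, 5-fold CV) and the random state of 2018 come from the text.

import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

rng = np.random.default_rng(2018)
X = rng.normal(size=(200, 10))    # stand-in for the Section 4 features
y = rng.integers(0, 2, size=200)  # stand-in for the (noisy) labels

param_grid = {
    'max_depth': [4, 6, 8],           # restricted to <= 8
    'n_estimators': [100, 300, 500],  # restricted to <= 500
}
search = GridSearchCV(XGBClassifier(random_state=2018), param_grid,
                      cv=5, scoring='accuracy')  # 5-fold cross-validation
search.fit(X, y)
xgb_best = search.best_estimator_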
5.1.2 BLSTM: We used a BLSTM, which avoids feature engineering. In a BLSTM, both past and future information are preserved using two hidden states, which helps to learn the context better. We experimented with pretrained embeddings from the word2vec model as well as with learning the embeddings from scratch in the network itself. We tune the number of neurons in the BLSTM. We also experimented with adding fully connected ReLU layers to the network before the output layer. The best parameters are tuned based on a validation set.

5.2 NOISE-AWARE BASELINE
We tried two approaches to handle noisy labels: 1) a data cleansing method and 2) a label noise-tolerant algorithm.

5.2.1 Data Cleansing Method: The goal of such methods is to filter out data points which appear to be mislabeled. Here, we apply a model-prediction based filtering method [5].

In this filtering approach, we divide the training dataset into k folds (five folds). We train an xgboost model (with the best parameters found in Section 5.1.1) on k-1 folds and apply it to predict labels for the k-th fold. This process is repeated k times to get predicted labels for the entire training dataset. Then, we filter out instances with a disagreement between the actual label and the predicted label. On the filtered training data, an xgboost model (with the same parameters) is trained, which forms the final model. We refer to this method as xgboost+filtering.
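A minimal sketch of xgboost+filtering, reusing X, y, and the tuned parameters from the sketch in Section 5.1.1; cross_val_predict stands in for the manual five-fold loop described above.

from sklearn.model_selection import cross_val_predict
from xgboost import XGBClassifier

# Out-of-fold predictions: each instance is labeled by a model trained
# on the other four folds, mirroring the k-fold scheme of Section 5.2.1.
oof = cross_val_predict(XGBClassifier(**search.best_params_, random_state=2018),
                        X, y, cv=5)

# Drop instances whose (noisy) label disagrees with the prediction,
# then retrain with the same parameters on the filtered data.
keep = oof == y
final_model = XGBClassifier(**search.best_params_, random_state=2018).fit(X[keep], y[keep])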
5.2.2 Label Noise-Tolerant Algorithm: In a label-noise tolerant algorithm, noise is handled explicitly in the modeling step. Here, a probabilistic neural-network based framework [11] is considered to handle label noise. This framework views the true label as a latent variable. A softmax layer is used to predict the true label. The noise is explicitly modeled by an additional softmax layer that predicts the noisy label based on both the true label and the input features. Assuming the non-linear function applied on an input $x$ is $h = h(x)$, the true label $y$ is modeled by:

$$p(y = i \mid x; w) = \frac{\exp(u_i^T h + b_i)}{\sum_{l=1}^{k} \exp(u_l^T h + b_l)}, \quad i = 1, \dots, k \quad (1)$$

where $k$ is the number of classes and $w$ is the network parameter set (including the softmax layer). Next, a softmax output layer is added to predict the noisy label $z$ based on both the true label and the input features:

$$p(z = j \mid y = i, x; w_{\text{noise}}) = \frac{\exp(u_{ij}^T h + b_{ij})}{\sum_{l=1}^{k} \exp(u_{il}^T h + b_{il})} \quad (2)$$

$$p(z = j \mid x) = \sum_{i=1}^{k} p(z = j \mid y = i, x)\, p(y = i \mid x) \quad (3)$$

where $w_{\text{noise}}$ represents the parameters of the second softmax layer. Given $n$ training data points with feature vectors $x_1, x_2, \dots, x_n$, corresponding noisy labels $z_1, z_2, \dots, z_n$, and true labels $y_1, y_2, \dots, y_n$, the log-likelihood of the model parameters is written as:

$$\sum_{t=1}^{n} \log p(z_t \mid x_t) \quad (4)$$

This modeling approach is named the c-model and is learned using standard neural-network training. For our experiments, the $h$ function is a BLSTM with the same parameters as in Section 5.1.2. This model is referred to as BLSTM+noise-aware.
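Since the paper gives only the equations, the following PyTorch sketch of the c-model is our own rendering; the encoder stub (a stand-in for the BLSTM), the layer names, and the loss helper are assumptions.

import torch
import torch.nn as nn

class CModel(nn.Module):
    # Eq. (1): softmax over the latent true label y from h = h(x);
    # Eq. (2): one noise softmax per true class i, predicting z;
    # Eq. (3): marginalize over y to get p(z | x).
    def __init__(self, encoder, hidden_dim, k):
        super().__init__()
        self.encoder = encoder                          # h = h(x), e.g. a BLSTM
        self.true_head = nn.Linear(hidden_dim, k)       # u_i, b_i of Eq. (1)
        self.noise_head = nn.Linear(hidden_dim, k * k)  # u_il, b_il of Eq. (2)
        self.k = k

    def forward(self, x):
        h = self.encoder(x)                                  # (batch, hidden_dim)
        p_y = torch.softmax(self.true_head(h), dim=-1)       # p(y = i | x)
        logits = self.noise_head(h).view(-1, self.k, self.k)
        p_z_given_y = torch.softmax(logits, dim=-1)          # p(z = j | y = i, x)
        return torch.einsum('bi,bij->bj', p_y, p_z_given_y)  # Eq. (3)

def neg_log_likelihood(p_z, z):
    # Training maximizes Eq. (4): sum over t of log p(z_t | x_t),
    # i.e. the likelihood of the observed noisy labels.
    return -torch.log(p_z.gather(1, z.unsqueeze(1)).squeeze(1) + 1e-12).mean()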
5.3 NOISE HANDLING WITH WEAK SUPERVISION
We encode domain heuristics as labeling functions (LFs) [20], which label subsets of the data. However, LFs may conflict and are prone to noise. Assuming data point and class pairs (x, y) are drawn from the distribution X × {−1, 1}, an LF λi : X → {1, 0, −1} is a user-defined function that encodes some domain heuristic and provides a non-zero label for some subset of the data points, where 1 and −1 refer to the positive and negative class, respectively, and 0 refers to the case where the LF cannot label the instance. LFs collectively generate a large but potentially overlapping set of training labels. LFs can be created in many ways, such as leveraging domain-specific patterns to label data points or using existing knowledge bases to generate labels.

We consider that the actual labels came from an LF named λtrue, which has 100% coverage. As the introduction of an LF noisier than λtrue would increase the overall noise in the dataset, unlike data programming, we introduce LFs that are less noisy than λtrue. Let λ denote the m newly created LFs - {λi}, i = 1, ..., m - each of which looks at domain-specific patterns to label data points.

A specific pattern is observed in the comments - ‘I ordered x quantity of an item but received y quantity’. This is a partial delivery, as the customer has received part of the order, and should be marked as comment-mismatch. With this pattern, an LF lambda_partial is defined in Table 5. We also observed that customers often write that they have received soap instead of a mobile phone. This is a genuine missing-item request, and should be marked as non-comment-mismatch. With this pattern, an LF lambda_soap is also defined in Table 5. The IsTokenPresent function returns true when soap is one of the tokens of comment x. Table 6 describes the statistics of all the LFs, where coverage represents the percentage of instances with at least one label, conflict represents the percentage of instances with conflicting labels, and overlap depicts the percentage of instances with more than one label.

Table 5: Example Labeling Functions (LFs)

def lambda_partial(x):
    return 1 if count_positive_integer_tokens(x) >= 2 else 0

def lambda_soap(x):
    return -1 if IsTokenPresent('soap', x) else 0

Table 6: Statistics of Labeling Functions

# of LFs | Coverage (%) | Overlap (%) | Conflict (%)
17 | 100 | 40 | 9

To deep dive, we show the conflict and overlap counts in Table 7 and Table 8, respectively, where a cell (i, j) in Table 7 gives the number of data instances where (λi == 1 and λj == −1) or (λi == −1 and λj == 1). Similarly, a cell (i, j) in Table 8 gives the number of data instances where (λi == 1 and λj == 1) or (λi == −1 and λj == −1).

Table 7: Conflict Count

LF | true | partial | phone missing | soap
true | 0 | 5,752 | 368 | 44
partial | 5,752 | 0 | 247 | 54
phone missing | 368 | 247 | 0 | 0
soap | 44 | 54 | 0 | 0

Table 8: Overlap Count

LF | true | partial | phone missing | soap
true | O(100k) | 3,619 | 1,819 | 621
partial | 3,619 | 9,371 | 0 | 0
phone missing | 1,819 | 0 | 2,187 | 42
soap | 621 | 0 | 42 | 665

As the LFs conflict with each other, applying them directly to flip the labels from λtrue is impossible. This requires us to generate a ranked list of LFs. To do that, for each λi ∈ λ, we define a conflict-score as below. Note that the conflict count of λi is calculated by summing the number of conflicts between λi and each rule from λ − λi.

$$\text{cf\_score}_{\lambda_i} = \frac{\#\ \text{unique conflicts of}\ \lambda_i\ \text{with other LFs}}{\text{Coverage of}\ \lambda_i} \quad (5)$$
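A minimal sketch of the conflict-score computation, assuming lf_labels is an (instances × LFs) array with entries in {−1, 0, 1}; counting each instance at most once per LF is our reading of ‘unique conflicts’ in Eq. (5).

import numpy as np

def conflict_scores(lf_labels):
    # cf_score per LF: (# instances on which the LF conflicts with any
    # other LF) / (# instances the LF labels), following Eq. (5).
    n, m = lf_labels.shape
    scores = np.zeros(m)
    for i in range(m):
        li = lf_labels[:, i]
        others = np.delete(lf_labels, i, axis=1)
        # product is -1 exactly when one LF says +1 and the other -1
        conflict = ((li[:, None] * others) == -1).any(axis=1)
        coverage = (li != 0).sum()
        scores[i] = conflict.sum() / coverage if coverage else 0.0
    return scores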
Intuitively, the cf_score captures how much an LF conflicts with the other LFs. With this score, Algorithm 1 denoises the training data. λsorted contains the sorted list of the m newly created rules in ascending order w.r.t. cf_score.

Algorithm 1 Label Denoising with Labeling Functions (LFs)
Input: X, λtrue, λsorted
Output: Y: final label vector
append λtrue at the end of λsorted
for xi ∈ X do
    flag = 0
    for λj ∈ λsorted do
        yi = λj(xi)
        if yi ≠ 0 then
            flag = 1
            break
        end if
    end for
    if flag == 0 then
        yi = λtrue(xi)
    end if
end for

After denoising the training data with Algorithm 1, we apply the xgboost and BLSTM models on the denoised training data. These two approaches are named xgboost+best-sequence and BLSTM+best-sequence.
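Putting the pieces together, here is a minimal Python sketch of Algorithm 1, reusing conflict_scores from the sketch above; the two toy LFs stand in for the 17 used in the paper, and count_positive_integer_tokens is a hypothetical helper matching the description of lambda_partial.

import numpy as np

def count_positive_integer_tokens(x):
    # hypothetical helper for the lambda_partial pattern
    return sum(tok.isdigit() and int(tok) > 0 for tok in x.split())

def lambda_partial(x):
    # partial-delivery pattern => comment-mismatch (positive)
    return 1 if count_positive_integer_tokens(x) >= 2 else 0

def lambda_soap(x):
    # 'received soap instead of the phone' => genuine missing item (negative)
    return -1 if 'soap' in x.lower().split() else 0

def denoise(X, noisy_labels, lfs):
    # Algorithm 1: try the LFs in ascending conflict-score order; when
    # none fires, keep the label from lambda_true (the noisy label).
    lf_labels = np.array([[lf(x) for lf in lfs] for x in X])
    order = np.argsort(conflict_scores(lf_labels))
    Y = []
    for row, noisy in zip(lf_labels, noisy_labels):
        for j in order:          # least-conflicting LF wins a conflict
            if row[j] != 0:
                Y.append(row[j])
                break
        else:
            Y.append(noisy)      # lambda_true has 100% coverage
    return np.array(Y)

X = ['I ordered 2 items but received 1', 'got soap in the box']
print(denoise(X, noisy_labels=[-1, 1], lfs=[lambda_partial, lambda_soap]))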
6 EXPERIMENTS
6.1 Dataset
The complete data consists of O(100k) instances, out of which 3k randomly chosen instances form the test data and the rest form the train data. The test dataset is manually relabeled to generate clean labels. Table 9 shows the train and test data statistics. Note that the test dataset size is small because manual relabeling is time consuming.

Table 9: Train and Test Dataset Statistics

Type | # of instances | % of neg | % of pos | Label type
Train | O(100k) | 59 | 41 | Noisy
Test | 3,000 | 51.5 | 48.5 | Clean

6.2 Experimental Setup
We compare the performance of xgboost+best-sequence and BLSTM+best-sequence against the baselines - xgboost, BLSTM, xgboost+filtering, and BLSTM+noise-aware. Moreover, to showcase the benefits of the conflict handling with the cf_score, we compare the performance of xgboost+best-sequence and BLSTM+best-sequence with xgboost+random-sequence and BLSTM+random-sequence. For random-sequence, λsorted consists of a random permutation of the m newly created rules. Model performance varies for different random sequences. Hence, for both xgboost+random-sequence and BLSTM+random-sequence, we repeat the experiment 10 times with different random permutations of λ and report the mean and standard deviation. For all the models, we fix the random state to 2018. As our dataset is well balanced, we use accuracy as the evaluation metric.

6.3 Results
Table 10 shows the performance comparison between xgboost, BLSTM, and their noise-aware variants. BLSTM performs best with an accuracy of 87.43. This proves that using sequence information indeed benefits our dataset. BLSTM and xgboost perform better than BLSTM+noise-aware and xgboost+filtering, respectively. We can observe that the state-of-the-art noise-aware algorithms are hurting the performance. We think the reason for such a performance degradation is the wide pattern-specific noise variation.

Table 10: Performance of Baselines

Model | Accuracy
Xgboost | 86.90
Xgboost+filtering | 86.47
BLSTM | 87.43
BLSTM+noise-aware | 87.33

Table 11 shows the performance comparison among our methods. Again, BLSTM with best-sequence performs best. BLSTM+best-sequence and xgboost+best-sequence perform better than BLSTM+random-sequence and xgboost+random-sequence, respectively. This proves the benefits of conflict handling among LFs by cf_score. Note that the standard deviation of BLSTM+random-sequence is quite low, and still BLSTM+best-sequence beats BLSTM+random-sequence, again proving that conflict handling with cf_score helps.

Table 11: Performance of Proposed Methods

Model | Mean Accuracy | Std
Xgboost+random-sequence | 87.64 | 1.37
Xgboost+best-sequence | 88.39 | NA
BLSTM+random-sequence | 88.37 | 0.02
BLSTM+best-sequence | 90.04 | NA

Overall, BLSTM+best-sequence performs best, proving the efficacy of our proposed approach. Best-sequence provides benefits over random-sequence, proving the benefits of conflict handling among LFs by cf_score. The sequence model BLSTM is able to provide benefits over xgboost. We were able to improve the accuracy from 86.90% to 90.04% with the help of the BLSTM, LFs, and cf_score.

7 CONCLUSION
We discussed an important problem of classifying missing-item return comments into comment-mismatch/non-comment-mismatch. We highlighted the data and noise related challenges in both comments and labels. We experimented with state-of-the-art Machine Learning and Deep Learning methods as well as their noise-aware variants. We proposed a simple method with labeling functions (LFs) to denoise the training dataset. A conflict-score is defined to handle the conflicts between LFs. Empirically, we have shown the efficacy of our approach over the baselines.

As future work, we intend to explore the complete data programming framework to handle noisy labels.

REFERENCES
[1] Mehdi Allahyari, Seyed Amin Pouriyeh, Mehdi Assefi, Saied Safaei, Elizabeth D. Trippe, Juan B. Gutierrez, and Krys Kochut. 2017. A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques. CoRR abs/1707.02919 (2017). arXiv:1707.02919 http://arxiv.org/abs/1707.02919
[2] Vasileios Athanasiou and Manolis Maragoudakis. 2017. A Novel, Gradient Boosting Framework for Sentiment Analysis in Languages where NLP Resources Are Not Plentiful: A Case Study for Modern Greek. Algorithms 10 (2017), 34.
[3] Jakramate Bootkrajang and Ata Kabán. 2012. Label-Noise Robust Logistic Regression and Its Applications. In Proceedings of the 2012 European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I (ECML PKDD'12). Springer-Verlag, Berlin, Heidelberg, 143–158. https://doi.org/10.1007/978-3-642-33460-3_15
[4] Leo Breiman. 2001. Random Forests. Mach. Learn. 45, 1 (Oct. 2001), 5–32. https://doi.org/10.1023/A:1010933404324
[5] Carla E. Brodley and Mark A. Friedl. 1999. Identifying Mislabeled Training Data. J. Artif. Int. Res. 11, 1 (July 1999), 131–167. http://dl.acm.org/citation.cfm?id=3013545.3013548
[6] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, New York, NY, USA, 785–794. https://doi.org/10.1145/2939672.2939785
[7] M. Craven and J. Kumlien. 1999. Constructing biological knowledge bases by extracting information from text sources. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology.
[8] Thomas G. Dietterich. 2000. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Machine Learning 40, 2 (Aug. 2000), 139–157. https://doi.org/10.1023/A:1007607513941
[9] Benoît Frénay and Ata Kaban. 2014. A Comprehensive Introduction to Label Noise. i6doc.com publ.
[10] Dragan Gamberger, Rudjer Boskovic, Nada Lavrac, and Ciril Groselj. 1999. Experiments With Noise Filtering in a Medical Domain. In Proc. of 16th ICML. Morgan Kaufmann, 143–151.
[11] Jacob Goldberger and Ehud Ben-Reuven. 2017. Training Deep Neural-networks
Using a Noise Adaptation Layer.
[12] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory.
Neural Comput. 9, 8 (Nov. 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.
8.1735
[13] Victoria J. Hodge and Jim Austin. 2004. A survey of outlier detection methodolo-
gies. Artificial Intelligence Review 22 (2004), 2004.
[14] Thorsten Joachims. 1998. Text Categorization with Support Vector Machines:
Learning with Many Relevant Features. In Proceedings of the 10th European
Conference on Machine Learning (ECML’98). Springer-Verlag, Berlin, Heidelberg,
137–142. https://doi.org/10.1007/BFb0026683
[15] Ji Young Lee and Franck Dernoncourt. 2016. Sequential Short-Text Classification
with Recurrent and Convolutional Neural Networks. CoRR abs/1603.03827 (2016).
[16] Joseph Lilleberg, Yun Zhu, and Yanqing Zhang. 2015. Support vector machines
and Word2vec for text classification with semantic features.. In ICCI*CC, Ning
Ge, Jianhua Lu, Yingxu Wang, Newton Howard, Philip Chen, Xiaoming Tao,
Bo Zhang, and Lotfi A. Zadeh (Eds.). IEEE Computer Society, 136–140. http:
//dblp.uni-trier.de/db/conf/IEEEicci/IEEEicci2015.html#LillebergZZ15
[17] Emily K. Mallory, Ce Zhang, Christopher R, and Russ B. Altman. 2016. Large-
scale extraction of gene interactions from full-text literature using DeepDive.
Bioinformatics 32, 1 (2016), 106–113. https://doi.org/10.1093/bioinformatics/
btv476
[18] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. In-
troduction to Information Retrieval. Cambridge University Press, New York, NY,
USA.
[19] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient
Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013).
[20] Alexander J. Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, and Christopher
Ré. 2016. Data Programming: Creating Large Training Sets, Quickly. In NIPS.
3567–3575.
[21] Radim Řehůřek and Petr Sojka. 2010. Software Framework for Topic Modelling
with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges
for NLP Frameworks. ELRA, Valletta, Malta, 45–50. http://is.muni.cz/publication/
884893/en.
[22] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved
Semantic Representations From Tree-Structured Long Short-Term Memory Net-
works. CoRR abs/1503.00075 (2015).
[23] Jonathan J. Webster and Chunyu Kit. 1992. Tokenization As the Initial Phase in
NLP. In Proceedings of the 14th Conference on Computational Linguistics - Volume 4
(COLING ’92). Association for Computational Linguistics, Stroudsburg, PA, USA,
1106–1110. https://doi.org/10.3115/992424.992434
[24] Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo
Xu. 2016. Text Classification Improved by Integrating Bidirectional LSTM with
Two-dimensional Max Pooling. CoRR abs/1611.06639 (2016).