Semi-Automated Prevention and Curation of Duplicate
             Content in Social Support Systems
                                   Igor A. Podgorny                              Chris Gielow
                                       Intuit, Inc.                               Intuit, Inc.
                                    San Diego, USA                             San Diego, USA
                               igor_podgorny@intuit.com                    chris_gielow@intuit.com
ABSTRACT
TurboTax AnswerXchange is a popular social Q&A system supporting users working on U.S. federal and state tax returns. Based on a custom-built duplicate scoring model, 35% of AnswerXchange questions have been found to be near-duplicates responsible for 56% of AnswerXchange document views. This degrades the user experience for both the asker, who is unable to find an answer amid duplicates, and the answerer, who is unable to answer efficiently at scale. The duplicate questions tend to form micro-clusters that grow via preferential attachment and, once exceeding some 25 questions in size, start morphing into mega-clusters with a complex network topology. This behavior can be leveraged to design semi-automated content curation systems that detect whether a newly posted question is a duplicate and, if so, which duplicate cluster it belongs to. In order to improve user experience in AnswerXchange, we explore how human and artificial intelligence can be jointly employed and then present several data-driven intelligent user interfaces. The duplicate scoring models can be utilized as elements of question-posting and answering experiences, unanswered question queueing and answer bots. These approaches can be extended to any social support Q&A system where duplicate posting negatively impacts search relevance and content consumption.

Author Keywords
TurboTax; AnswerXchange; CQA; community question answering; social question answering; duplicate clusters; content deduplication.

ACM Classification Keywords
H.5.m. Information Interfaces and Presentation (e.g. HCI): Miscellaneous

© 2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. ESIDA'18, March 11, Tokyo, Japan.

INTRODUCTION
Social Q&A systems provide a convenient self-support option for tax and financial software applications where personalized long-tail content generated by the users can supplement curated knowledge base answers. Users often prefer self-help to assisted measures (e.g. phone support or online chat) and are often able to find and apply their solution faster. This also reduces the load on assisted channels, ensuring they remain available to those who need it. AnswerXchange (http://ttlc.intuit.com) is a social Q&A site where customers can learn and share their knowledge with other TurboTax customers while preparing U.S. federal and state tax returns and also find step-by-step instructions on using the TurboTax application [5, 6]. As the users step through the TurboTax interview pages, they can ask questions about software and tax topics (Figure 1) and receive answers in a matter of minutes. AnswerXchange has generated millions of questions and answers that have helped tens of millions of TurboTax customers since launching in 2007.

Figure 1. AnswerXchange question-posting user experience. Question title (a short summary of question limited to 255 characters) is mandatory. Question details (not shown) are optional and unlimited in size.

The majority of users can find answers by searching the existing content. The overall quality of a customer self-help system is therefore determined by how well the self-help system assists in finding the relevant content. The number of search sessions resulting in assisted support contacts (being as large as hundreds of thousands of customers per year) and the fraction of user up or down votes on self-support content provide convenient proxy metrics of content quality and search relevance in TurboTax self-help [5].
Figure 2. An example of duplicate AnswerXchange search results. Question titles and answer snippets are shown in purple and in black, respectively.

One problem with the existing question-posting experience (Figure 1) is that searches may result in multiple and often duplicate answers that are relatively close to the intent of the original question, but still do not match the original search intent (Figure 2). This interferes with the user’s ability to select from a diverse set of possible answers [5] and often results either in the submission of a duplicate question or in switching to a less-desired support channel. A related problem is that users may submit poor quality questions by not providing all of the relevant information needed for a good quality answer [5]. One solution is a manual review of the user generated content to archive some of the duplicate questions and related answers, if any, while keeping the best performing content in “live” status (i.e. making it available for search). This approach is labor intensive and does not address the problem with the question-posting user experience. Duplicate questions quickly build up, adding unnecessary burden on community question answering along the way.

The goal of this study is to address the problems of duplicate content prevention in AnswerXchange by combining machine learning and intelligent user interfaces. In what follows, we describe duplicate detection algorithms developed earlier and present a custom model trained on AnswerXchange questions. Next, we introduce the concept of “duplicate clusters” that provide a framework for semi-automated duplicate content prevention. Finally, we present several custom designed data-driven intelligent user interfaces for addressing the duplicate content problem.

RELATED WORK
The task of estimating semantic similarity of text documents has multiple practical applications and is of growing interest to the research community. The areas of research include web page similarity, document similarity, sentence similarity, search query similarity and utterance similarity in conversational user interfaces. These tasks are also related to a more general problem of detecting duplicates in database records [2].

Questions in social Q&A systems are often confined to one or two relatively short sentences and may warrant domain specific approaches to addressing question similarity. For example, two questions in a social Q&A system can be considered semantically identical if a single answer satisfies the needs of both original askers [3]. The answer may not yet exist in the production database but could be generated if needed. The task of duplicate-question detection is also related to the task of re-formulating a newly formed question [6] and automatically finding an answer to a new question [8].

The most recent results in the area of duplicate content scoring came from the 2017 Kaggle “Quora Pair” competition with model submissions from more than 3,000 teams (https://www.kaggle.com/c/quora-question-pairs). In this competition, the participants were tasked with classifying whether Quora question pairs are duplicates or not based on 200,000 training instances. Finally, the SemEval2017 Task on Community Question Answering (“Question–Comment Similarity”, “Question–Question Similarity”, etc.) resulted in submissions from 23 teams [4].

The problem of duplicate detection and curation is closely related to the task of predicting content quality in social Q&A systems. Content quality metrics may be helpful in selecting the best performing question and answer for the duplicate-question pair. Answer and question quality in social Q&A systems have been the focus of increasing attention from the scientific community [1, 9].

DUPLICATE-SCORING MODEL
AnswerXchange Search
AnswerXchange search is built with Apache Lucene open-source software (http://lucene.apache.org). By default, Lucene uses “tf-idf” (https://en.wikipedia.org/wiki/tf-idf) and “cosine-similarity” as standard methods of ranking search results. Shorter documents with the same set of matching keywords typically rank higher than longer documents with similar semantic meaning. An average AnswerXchange search query is 2-3 terms long (i.e. shorter than a typical AnswerXchange question) and it is often comparable in length with the title of a potentially duplicate question. The question details play a lesser role compared to titles, contributing to extra boosting of duplicate content by Lucene. The AnswerXchange Lucene ranking algorithm tends to boost new content and also accounts for various metadata such as helpfulness votes.
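To make the keyword-based ranking discussed above more concrete, the following is a minimal sketch of tf-idf weighting combined with cosine-similarity for short question titles. It is not Lucene's scoring formula and not the AnswerXchange implementation: the tokenizer, the smoothed idf variant and the function names (tokenize, tfidf_vectors, cosine_similarity) are assumptions made for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a question and split it into alphanumeric tokens (illustrative tokenizer)."""
    return "".join(ch if ch.isalnum() else " " for ch in text.lower()).split()


def tfidf_vectors(documents):
    """Return one sparse tf-idf vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed idf (an assumption) keeps weights positive for very common terms.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    titles = [
        "How do I amend a prior year?",
        "How do I find a prior years return?",
        "What is my AGI?",
    ]
    vectors = tfidf_vectors(titles)
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            score = cosine_similarity(vectors[i], vectors[j])
            print(f"{titles[i]!r} vs {titles[j]!r}: {score:.2f}")
```

Because the score depends only on shared keywords, two titles with the same words but different intent receive a high similarity, which is one reason a keyword-only measure may need to be complemented by a trained scoring model.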
Training Data
The problem of near-duplicate detection can be formulated as an unsupervised or supervised machine learning task [7]. In the unsupervised case, duplicate pairs and clusters can be found based on distance metrics such as cosine-similarity of the weighted tf-idf vectors, Jaccard similarity coefficient, distance in word2vec space, etc. In the supervised case, the problem of finding topical near-duplicate relations can be formulated as follows: given a pair of questions, the machine learnt model has to predict a “duplicate score” and determine if the questions are duplicates based on a pre-defined threshold. In this paper, we employ a “hybrid” approach starting with cosine-similarity metrics for data pre-processing and then adding a more accurate custom-built scoring model to the processing pipeline.

As the fraction of duplicate pairs in AnswerXchange is relatively low, the question pairs ranked by cosine-similarity provide a convenient data set for labeling based on the importance sampling approach. Towards this goal, we computed bag-of-words cosine-similarity (Appendix A) for 790,000 questions available for search in AnswerXchange at the end of 2017 U.S. Tax Day (April 18). Next, four AnswerXchange moderators added class labels (0 or 1) to a random sample of 4,000 near-duplicate pairs. Instances open to doubt were flagged by moderators and then re-labeled by consensus. 1,000 randomly sampled non-duplicate pairs were added for the final version of the training data set to make it equally divided between duplicate and non-duplicate pairs.

Duplicate-Scoring Model Features
The model features can be learnt from training data and/or by knowledge acquisition from AnswerXchange moderators. We have used the following model features:

• Cosine-similarity with tf-idf weighting (see Appendix A).
• Probabilistic topic ID of the question computed with Latent Dirichlet Allocation (see Appendix A).
• U.S. tax year in the question.
• Distinct words in the question pair.
• Common words in the question pair.
• Type of the question (e.g. “closed-ended” questions such as “Can I deduct …?” are typically tax related, while “how” questions are often product related).
• First word of the question.

Duplicate-Scoring Model Performance
Based on the set of 5,000 labeled question pairs, we trained and tested linear (logistic regression) and non-linear (random forest) binary classifiers using the Python machine learning library “scikit-learn”. The model predicts the class label (0 for a non-duplicate and 1 for a duplicate pair) and also the duplicate score (i.e. probability of the question pair belonging to either class, ranging from 0.0 to 1.0) that can be used to select user experience based on predefined threshold(s). We also trained a separate version of the logistic regression classifier using cosine-similarity as a single model feature. Shown in Table 1 are common metrics used for predictive model evaluation: area under curve (AUC) for receiver operating characteristic, F1 score and logarithmic loss (log loss) function for classification.

Model                  AUC     F1 Score   Log Loss
Logistic Regression    0.95    0.88       0.27
Random Forest          0.94    0.87       0.31
Cosine-similarity      0.83    0.73       0.48

Table 1. Model performance metrics for duplicate-scoring models (details are explained in the text).

As seen from Table 1, both logistic regression and random forest models achieve performance that is consistent with the goals of this exploratory study. At the same time, the cosine-similarity version underperforms the first two by a wide margin. This can be explained by the inability to find an optimal threshold separating duplicate and non-duplicate pairs using cosine-similarity alone. The following two examples illustrate the relationship between keyword-based cosine-similarity and the duplicate-question score computed with logistic regression.

The first example is an AnswerXchange question pair with a relatively low cosine-similarity of 0.61: (1) “I need a copy of my federal tax return for 2014” and (2) “I need 2015 Tax Return”. Both questions can be answered with a single instruction about getting a copy of a prior year tax return filed with TurboTax and hence are duplicates. The second example is a question pair with a high cosine-similarity of 1.0: (1) “do i have to file state taxes?” and (2) “how to file state taxes”. These questions are not duplicates because they belong to tax and product categories [5], respectively, and would require two different answers.

DUPLICATE CLUSTERS
Preferential Attachment and Topology
After identifying 5,597,799 duplicate question pairs in AnswerXchange (Appendix A), we built an undirected graph of 281,031 duplicate questions. Each duplicate pair and duplicate question identified with the model constituted a graph edge and a graph vertex, respectively. The resulting graph consists of 14,616 connected components hereafter referred to as “duplicate clusters.” To explore duplicate-cluster scaling behavior, we ranked clusters by the number of questions and plotted the number of questions per cluster vs. cluster rank in log-log scale (Figure 3). The largest cluster has 23,236 questions and the smallest ones only have two. The plot also includes graph (or edge) density:

    D = \frac{2E}{V(V - 1)},

where E is the number of edges (i.e. duplicate pairs) and V is the number of vertices (i.e. questions). Graph density is
equal to 1.0 for the fully connected graphs. In the latter case, each question in the cluster is connected to all remaining questions in the same duplicate cluster. Based on both question counts and graph density, the duplicate clusters in Figure 3 can be divided into three distinct groups marked as mega-clusters, transitional clusters and micro-clusters. These groups account for 84%, 2% and 14% of duplicate questions, respectively.

Figure 3. Scaling behavior of duplicate clusters (black dots) in AnswerXchange questions. The clusters are ranked by the number of questions in the descending order. Graph density for the clusters is shown in gray. Cyan and red dots refer to the clusters shown in Figures 4 and 5, respectively.

An example of a micro-cluster with 23 vertices is shown in Figure 4. Graph density is 0.54 and most of the vertices are interconnected, with the exception of three vertices connected by bridges to a denser graph core. The corresponding articulation points are marked by blue dots. Note that even if questions 1 and 2 are duplicates and questions 2 and 3 are duplicates, this does not mean that questions 1 and 3 are duplicates as well. This explains why a duplicate-cluster density is typically less than 1.0 unless the graph size is limited to two questions. As seen from Figure 3, micro-cluster scaling behavior follows Zipf distribution (https://en.wikipedia.org/wiki/zipf’s_law):

    n(r) = N r^{-\alpha},

where r ranges from about 100 to the total number of clusters R. Accordingly, the growth of N (\Delta N) and R (\Delta R) would be constrained by the following equation:

    \frac{\Delta N}{N} = \alpha \frac{\Delta R}{R}.

It is worth mentioning that Zipf distribution is an asymptotic case of a more general Yule-Simon distribution (https://en.wikipedia.org/wiki/Yule-Simon_distribution) typical for the preferential attachment process, meaning that a newly posted duplicate is more likely to become attached to the existing cluster than to form a new duplicate pair. The scaling parameter for the micro-clusters:

    \alpha = -\frac{\log n(r_1) - \log n(r_2)}{\log(r_1) - \log(r_2)}

can be estimated as 0.6. By extrapolating Zipf distribution to r=1 (that would correspond to a non-existing largest micro-cluster), one can estimate the N value as 400. This value, however, is almost two orders of magnitude less than the number of questions in the top mega-cluster.

Figure 4. A micro-cluster marked by cyan dot in Figure 3. Articulation points are shown by smaller blue dots.

To explain the scale break in the distribution shown in Figure 3, let us examine larger duplicate clusters in more detail. Shown in Figure 5 is a mega-cluster with 4,549 questions. The cluster has density equal to 0.0017 and 1048 articulation points. This means that the mega-clusters may consist of multiple sub-clusters that are semantically related to each other but whose elements are not duplicates unless they belong to the same sub-cluster.

Figure 5. Same as in Figure 4, but now for a mega-cluster.

As the number of duplicates reaches a certain level, the clusters start coalescing by establishing bridges with other clusters, duplicate pairs and stand-alone questions, quickly evolving from dense connected graphs to sparse graphs with a complex network topology. The area of transition is marked as transitional clusters in Figure 3.

Semi-Automated Duplicate Content Curation
While the task of duplicate content archiving is straightforward once duplicate pairs are found (Appendix A), the duplicate content can build up again unless question-posting and/or search experiences are modified.
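Since the curation approaches in this section build on the cluster quantities introduced earlier, it may help to see them in code. The sketch below groups duplicate pairs into connected components (the “duplicate clusters”), computes the graph density D = 2E / (V(V - 1)) of one cluster, and evaluates a two-point estimate of the Zipf scaling parameter under the convention n(r) = N r^(-α). The function names and the toy data are hypothetical; this is an illustration under those assumptions, not the pipeline used to produce the figures.

```python
import math
from collections import defaultdict


def duplicate_clusters(pairs):
    """Group questions into connected components of the undirected duplicate-pair graph."""
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)
    seen = set()
    clusters = []
    for start in neighbors:
        if start in seen:
            continue
        # Depth-first traversal collects every question reachable from `start`.
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(component)
    # Rank clusters by the number of questions, largest first.
    return sorted(clusters, key=len, reverse=True)


def graph_density(cluster, pairs):
    """Density D = 2E / (V (V - 1)) of the sub-graph induced by one cluster."""
    v = len(cluster)
    if v < 2:
        return 0.0
    # Count each undirected edge once and ignore self-loops.
    edges = {
        frozenset((a, b))
        for a, b in pairs
        if a != b and a in cluster and b in cluster
    }
    return 2.0 * len(edges) / (v * (v - 1))


def zipf_alpha(r1, n1, r2, n2):
    """Two-point estimate of alpha assuming n(r) = N * r**(-alpha)."""
    return -(math.log(n1) - math.log(n2)) / (math.log(r1) - math.log(r2))


if __name__ == "__main__":
    # Toy data: 1-2 and 2-3 are duplicate pairs, but 1-3 is not.
    toy_pairs = [(1, 2), (2, 3), (4, 5)]
    for cluster in duplicate_clusters(toy_pairs):
        print(sorted(cluster), round(graph_density(cluster, toy_pairs), 2))
```

In the toy data, questions 1, 2 and 3 form one cluster even though the pair (1, 3) is not a duplicate, so that cluster's density is below 1.0, which mirrors the non-transitivity noted for Figure 4.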
Our next goal is therefore to explore how the concept of         of the question and type of the question (i.e. user-generated
duplicate clusters discussed in the previous section can be      content marked as UGC or knowledge base content labeled
applied to these tasks. The curation of micro-clusters can be    as FAQ) are included in the third and fourth columns,
done automatically or semi-automatically (i.e. with              respectively. The last two columns are views accumulated
minimum human involvement) by retaining one or few best          over a given period and percentage of up-votes. The
performing long-tail documents (i.e. documents that include      documents can be ranked by views and/or votes providing a
both questions and answers) and assigning them a cluster         mechanism of identifying and removing non-performing
ID for subsequent re-use.                                        content either manually or automatically based on a set of
                                                                 predefined content quality thresholds.
The curation of mega-clusters represents a more
challenging problem. First, a single best performing              ID     POST_ID                         DOCUMENT      TYPE   VIEWS UPVOTE
document in a mega-cluster may simply not exist since the
cluster may contain multiple sub-clusters connected by            1     1,899,475 Can I deduct job-search expenses?    FAQ    17,019   74.8
                                                                  1     2,666,148 HI. Where do I enter my job search   UGC     1,759   77.9
bridges. Second, duplicate curation by a human is a               1     3,048,015 Where do I include job search        UGC     1,060   78.1
cumbersome task due to the mega-cluster complex                   1     3,356,358 Where do I enter my job search       FAQ     6,727   70.3
topology. While the exact solution may simply not exist,          1     3,705,028 Where do I deduct job search         UGC     2,999    67
approximate solutions may be sufficient to reduce the
                                                                  2     2,895,188 Where do I enter my medical          FAQ    25,243   79.9
number of duplicates posted in the AnswerXchange to an                  2,899,090 Why doesnt my refund change after
acceptable level. One approach would be to break the              2               I enter my medical expenses?         FAQ    13,765   79.1
mega-clusters into smaller parts by deleting bridges in the             2,956,890 where do i enter OUT OF POCKET
graph or by employing a conventional hierarchical                 2               medical expenses                     UGC     1,509   86.6
clustering. For example, the duplicate cluster shown in                Figure 7. Duplicate document metrics for the documents
Figure 5 can be split to 1363 connected components by                             marked by grey dots in Figure 6.
removing all articulation points (blue dots in Figure 5).
                                                                 Duplicate metrics can be operationalized by adding an
Most of the resulting connected components, however, are
                                                                 algorithm to match the best question to the best answer in
disconnected documents.
                                                                 the sub-cluster. Such a system would include answer
A more practical approach is to archive non-performing           deleting and merging manually or automatically by
short-tail content from the mega-cluster and curate the          attaching automatically generated “best” answer to the
resulting connected components. Shown in Figure 6 is a           “best” duplicate question. The solution can be implemented
subset of mega-cluster from Figure 5 that now only               as a back-end tool for trusted users assigned to the task of
includes documents with at least 100 views. This results in      duplicate archiving and hidden from the less experienced
breaking the original mega-cluster into 68 connected             regular users. The solution goes beyond simple duplicate
components which are easier to curate.                           archiving by providing an option to merge available
                                                                 answers to the existing duplicate questions. The non-human
                                                                 part of the solution includes quality ranking of the existing
                                                                 answers, e.g. up and down vote statistics as shown in Figure
                                                                 7. In this way, the newly formed question-answer pairs
                                                                 provide better quality content available for search by
                                                                 combining the visually appealing questions and the best
                                                                 ranked answers. This is done by combining artificial and
                                                                 human intelligence since the answer to a related question
                                                                 (that the system recommended) can be confirmed by the
                                                                 contributor if needed. The cluster notes can be edited by
                                                                 trusted users and applied to all articles within the cluster.
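The matching step described above can be illustrated with a minimal sketch. It assumes that question quality is approximated by view count and answer quality by net votes (up minus down); the paper names votes and views as possible proxies but does not prescribe this exact rule. The `Post` and `Answer` types and the `merge_cluster` helper are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Answer:
    text: str
    up_votes: int = 0
    down_votes: int = 0


@dataclass
class Post:
    post_id: int
    question: str
    views: int
    answers: List[Answer] = field(default_factory=list)


def best_question(cluster: List[Post]) -> Post:
    # Assumption: the most viewed question is the "best" one to keep.
    return max(cluster, key=lambda p: p.views)


def best_answer(cluster: List[Post]) -> Optional[Answer]:
    # Assumption: answers are ranked by net votes across the whole sub-cluster.
    candidates = [a for p in cluster for a in p.answers]
    if not candidates:
        return None
    return max(candidates, key=lambda a: a.up_votes - a.down_votes)


def merge_cluster(cluster: List[Post]) -> dict:
    """Pair the best question with the best answer and list posts to archive."""
    question = best_question(cluster)
    answer = best_answer(cluster)
    return {
        "keep": question.post_id,
        "question": question.question,
        "answer": answer.text if answer else None,
        "archive": [p.post_id for p in cluster if p.post_id != question.post_id],
    }
```

In a semi-automated setting, the returned proposal would be shown to a trusted user for confirmation rather than applied directly.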
                                                                 Real Time Duplicate Detection
                                                                 Finding duplicates to a given question requires (N-1)
                                                                 pairwise comparisons to the questions in the database and
Figure 6. A subset of the mega-cluster shown in Figure 5. Grey   may not be feasible in real time. The computational time
            dots mark documents used in Figure 7.                can be reduced by selecting potential duplicate matches
                                                                 with AnswerXchange search. The top performing
The next task is to present duplicate content in a form          documents in the clusters can be assigned an ID and
suitable for semi-automated content curation. Figure 7           indexed separately by the search engine. Once the search
shows an example of duplicate content metrics for eight          engine returns the documents ranked by relevancy to the
documents with at least 1000 views. The left column is a         newly formulated question, the duplicate-scoring model is
sub-cluster ID followed by a post ID identifying an              applied to the top matches to see if the new question is a
AnswerXchange document consisting of the original                duplicate and, if so, which duplicate cluster it belongs to.
question and all accumulated answers (not shown). The text
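The two-stage real-time detection described under "Real Time Duplicate Detection" (search first, then score only the top matches) might be sketched as follows. Both stages are stand-ins: `search_top_k` uses simple token overlap in place of AnswerXchange search, and `jaccard_score` replaces the custom-built duplicate-scoring model, whose internals are not given here. All function names, the value of `k`, and the threshold are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple


def tokens(text: str) -> set:
    return set(text.lower().split())


def search_top_k(query: str, index: Dict[int, str], k: int = 3) -> List[int]:
    # Stand-in for the search engine: rank cluster representatives by token overlap.
    q = tokens(query)
    ranked = sorted(index, key=lambda cid: len(q & tokens(index[cid])), reverse=True)
    return ranked[:k]


def jaccard_score(a: str, b: str) -> float:
    # Stand-in for the duplicate-scoring model (an assumption, not the paper's model).
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def find_duplicate_cluster(
    query: str,
    index: Dict[int, str],
    score: Callable[[str, str], float] = jaccard_score,
    k: int = 3,
    threshold: float = 0.5,
) -> Optional[Tuple[int, float]]:
    """Return (cluster_id, score) for the best match above threshold, else None."""
    best = None
    for cluster_id in search_top_k(query, index, k):
        s = score(query, index[cluster_id])
        if s >= threshold and (best is None or s > best[1]):
            best = (cluster_id, s)
    return best
```

Here `index` maps a cluster ID to the text of its top performing document, so only `k` model evaluations are needed per new question instead of N-1.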
DATA-DRIVEN USER EXPERIENCES                                    product - information which may be useful to anyone with
Accumulation of duplicate content can be prevented by           printing-related questions.
integrating a custom-built duplicate-scoring model and
question-posting experience. Another option is to expose an     D) The suggested answers are deduplicated using duplicate
intelligent interface to the trusted users by providing extra   score equalization so the answers are more useful. A
features for answering duplicate questions. Finally, the        “cluster browser” is also added below the results to help
duplicate question curation can be part of the content          refine amongst the most popular variations.
moderation process carried out by the AnswerXchange             Question Deduplication While Answering
trusted users or trained bots.                                  The second feature addresses the situation where a potential
Question Deduplication While Posting                            duplicate has been submitted and needs to be intercepted as
The first feature (Figure 8) extends the AnswerXchange          part of question answering experience. This concept is
“Question Optimizer” system [6]. The system prompts the         illustrated in Figures 9-10.
asker with personalized instructions created dynamically
based on real time analysis of the question’s semantics and
writing style. The “Question Optimizer” has been re-designed
to make duplicate questions more difficult to                     Chris asked % 30 minutes ago
submit without addressing the recommended re-phrasing.
                                                                 copy of 2014 return
The annotations to the concept are presented next.
                                                                 I need to get a copy of my 2014 return
                                                                 and I don't have the cd.


                                                                 ANSWER THIS


                                                                   Chris, try this to download a new copy

                                                                    $ I need a copy of my 2014 Tax return           &


                                                                 SUGGESTED ANSWERS

                                                                 I need a copy of my 2014 Tax return
                                                                 92% match • 2,314 duplicates • 5/3/16 • ! 45 " 0
                                                                 Sign back into your Turbo Tax online account.
                                                                                                                        E
                                                                 From the Welcome Back screen, select Visit My Tax
 Figure 8. Question-posting experience reveals the duplicates    Timeline
       and helps users re-phrase as a unique question.           $ attach and mark answered     # attach

A) The “Question-Optimizer” technology is envisioned to
include duplicate content detection in addition to providing
timely advice on how to re-phrase or deflect.
                                                                                        Answer
B) If a question falls in a known duplicate cluster, the best-
matching and most-referenced answers are shown.
C) Trusted users may attach “cluster notes” to curated             Figure 9. Contributor experience tagging and attaching
duplicate clusters; the notes then appear automatically with any               curated answer to the question.
question within the cluster. In the example shown in Figure
                                                                Specifically, Figure 9 illustrates the contributor (typically a
8, the duplicate cluster is about printing and the message
                                                                trusted user) answering experience and includes the
notes that the printing experience recently changed in the
                                                                following annotation:
E) The suggested answered question duplicate is presented       may revise their question and it will re-enter the answer
to the original asker and also displays the duplicate           queue. They also have the option to request a new answer
probability. The contributor can easily attach it to their      without submitting the question.
answer, which also tells the system the question was a
duplicate and should be archived in favor of the attached question.
                                                                Finally, the automatic flagging of an unanswered question as
                                                                a duplicate may be validated or invalidated by the trusted
                                                                users and used to update the training dataset for model re-training.
                                                                Question Deduplication with Automated Answers
 JaneDoe73 ⋆ SuperUser " 15 minutes ago                         The “Answer Bot” (Figure 11) is a feature driven by
                                                                artificial intelligence alone. The “Answer Bot” increases
 Chris, try this to download a new                              self-support efficiency by responding to a customer's
 copy                                                           questions by e-mail with answers from the matching
                                                                duplicate cluster if the posted question is flagged by the
 Your question shares the same answer as this
 similar question: I need a copy of my 2014
                                                       F        duplicate-scoring model as a duplicate.
 Tax return                                                     I) “Answer Bots” may automatically answer questions
                                                                determined to be duplicates. Like the contributor-assisted
                                                                experience, the bot will recommend the
   RECOMMENDED ANSWER                                           best answer within the duplicate cluster. The user is made
                                                                aware that a bot answered the question, and if unsatisfied
   Sign back into your Turbo Tax online account.
   From the Welcome Back screen, select Visit My
   Tax Timeline
                                                       G        may request a new answer, or revise their question.


   Select 2014 as the year from your Tax Timeline
   From the list of Some Things You Can Do on
   your Tax Timeline, select Download /Print My                  AnswerBot ! 15 minutes ago
   Return (PDF)
                                                                 I think your question might share the same
 SweetieJean ⋆ Rising Star " 1 year ago                          answer as this similar question: I need a copy
                                                                 of my 2014 Tax return
 # Note the printing experience in TurboTax
     changed in 2016
                                                       C
                                                                 I am a bot, and this action was performed
                                                                 automatically. If my answer is unhelpful, you may
                                                                 request a new answer or revise your question.
                                                                                                                            I
 MORE ACTIONS
                                                                   RECOMMENDED ANSWER

 $ Revise my question                                              Sign back into your Turbo Tax online account.

                                                       H
                                                                   From the Welcome Back screen, select Visit My
                                                                   Tax Timeline
 % Request a new answer
                                                                   Select 2014 as the year from your Tax Timeline
                                                                   From the list of Some Things You Can Do on
                                                                   your Tax Timeline, select Download /Print My
                                                                   Return (PDF)

                                                                Figure 11. Automated deduplication user experience as part of
                                                                            customized e-mail to the original asker.

 Figure 10. Original asker view of deduplicated question with   Further, the “Answer Bot” attaches the question to the
                     personalized answer.                       existing duplicate cluster automatically while providing a
Once the duplicate question is answered it becomes              generic or personalized answer. The bot replies trigger
available to the original asker (Figure 10).                    automated archiving of the duplicate content. The question
                                                                remains visible to the original asker but is not made
C) Re-purposing trusted users’ notes similar to those used in   available to AnswerXchange users and is suppressed from
question-posting experience (Figure 8).                         search results. A related option is to create two separate
F) A personalized note introduces the “recommended              queues of duplicate questions for answering. The questions
answer” while explaining it’s a duplicate.                      in the first queue would be assigned to designated
                                                                moderators who can customize duplicate content for the
G) The duplicate answer is presented with a sense of            original asker and archive it afterwards.        The less
authority.                                                      complicated questions in the second queue can be assigned
H) If the original asker is unsatisfied with the answer, they   to the “Answer Bot”.
DISCUSSION AND CONCLUSION                                       defined threshold. The total number of duplicate pairs was
Social Q&A systems often presume that the users comply          found to be 5,597,799 and contained 281,031 unique
with recommendations not to replicate the existing content.     questions (or 35% of the AnswerXchange “live” questions).
This is not the case for AnswerXchange where users often        In 2017, they contributed 56% to the AnswerXchange
avoid consuming existing content by posting a new               document views. The documents in the identified duplicate
duplicate question. These users may not realize that            pairs can be ranked by a suitable question (and answer)
AnswerXchange is a social Q&A site or lack the ability to       proxy content quality metrics as discussed earlier, for
find and apply existing answers to their question. We need      example by the number of views, votes, age of the post, or
to intervene with intelligent user interfaces to alter the      by a weighted combination thereof. The document with the
duplicate posting behavior. Towards this goal, we present       lower score can be removed consecutively from each pair
two algorithms for duplicate content curation and providing     resulting in a removal of 217,767 documents (27% of the
real time inputs to the AnswerXchange user interfaces. The      AnswerXchange “live” questions).
first algorithm determines if two questions are near-
                                                                ACKNOWLEDGMENTS
duplicates and can be combined with a search to detect
                                                                We thank anonymous reviewers for valuable comments.
duplicates in real time. The second algorithm uncovers all
duplicate pairs in AnswerXchange and is capable of              REFERENCES
handling deduplication task with a corpus of millions of        1.   Eugene Agichtein, Carlos Castillo, Debora Donato,
questions. We conclude the paper by presenting three                 Aristides Gionis, Gilad Mishne. 2008. Finding High-
question deduplication user interfaces. Our hypotheses to           Quality Content in Social Media. In: Proc. of the
validate include: (1) Will askers accept a duplicate when            International Conference on Web Search and Data
presented with an acceptable answer? (2) Will they accept a          Mining, 183-193.
duplicate with or without a personalized contributor note?      2.   Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis,
(3) If dissatisfied will they revise or request a new answer?        Vassilios S. Verykios. 2007. Duplicate Record
(4) Will they accept recommended answers from Answer                 Detection: A Survey. IEEE Trans. Knowl. Data Eng.,
Bots? We are planning to validate these hypotheses with a            19, 1-16.
set of rapid experiments prior to production.
                                                                3.   Klemens Muthmann, Alina Petrova.         2014. An
APPENDIX A: DUPLICATE PAIR DETECTION                                 automatic approach for identifying topical near-
Detecting duplicates for N=790,000 questions based on a              duplicate relations between questions from social
custom-built model would require N(N-1)/2 pairwise                   media Q/A sites. In: Classifying Big Data from the
computations. The task of finding duplicate pairs becomes            Web, 1-6.
computationally expensive once the corpus reaches several
                                                                4.   Preslav Nakov, Doris Hoogeveen, Lluís Màrquez,
hundred thousand documents. At the same time, computing
                                                                     Alessandro Moschitti, Hamdy Mubarak, Timothy
cosine-similarity for a question pair is faster than scoring
                                                                     Baldwin, Karin Verspoor. 2017. SemEval-2017 Task 3:
the same pair with the custom-built model and can be used to
reduce the number of potential duplicate pairs from billions         Community Question Answering. In: Proc. of the 11th
to millions of pairs. Further, dividing content by M                 Int. Workshop on Semantic Evaluation, 27-48.
probabilistic topics can reduce the number of pairwise          5.   Igor A. Podgorny, Matthew Cannon, Todd Goodyear.
comparisons by M, while not necessarily affecting the                2015a. Pro-active detection of content quality in
number of expected near-duplicate pairs.                             TurboTax AnswerXchange. In: Proc. of ACM
                                                                     Conference Companion on CSCW, 143-146.
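To make the cost argument concrete: for N = 790,000 the exhaustive approach needs N(N-1)/2 = 312,049,605,000 pairwise computations. The sketch below shows the pair count together with topic blocking and a cosine-similarity pre-filter. It assumes plain bag-of-words term counts as vectors and a precomputed topic label per document; the paper does not specify the vector representation or the topic model, and `candidate_pairs` and `topic_of` are hypothetical names.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, List, Tuple


def pair_count(n: int) -> int:
    """Number of unordered pairs among n documents: n(n-1)/2."""
    return n * (n - 1) // 2


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def candidate_pairs(
    docs: Dict[int, str], topic_of: Dict[int, int], threshold: float = 0.7
) -> List[Tuple[int, int]]:
    """Compare documents only within the same topic; keep pairs above threshold."""
    by_topic = defaultdict(list)
    for doc_id in docs:
        by_topic[topic_of[doc_id]].append(doc_id)
    # Assumption: bag-of-words term counts stand in for the real feature vectors.
    vectors = {d: Counter(text.lower().split()) for d, text in docs.items()}
    pairs = []
    for ids in by_topic.values():
        for i, j in combinations(sorted(ids), 2):
            if cosine(vectors[i], vectors[j]) >= threshold:
                pairs.append((i, j))
    return pairs
```

Only the surviving pairs would then be passed to the slower duplicate-scoring model. Because documents assigned to different topics are never compared, some true duplicates can be missed as M grows, which is the trade-off between execution time and recall that topic blocking introduces.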
     M         Duplicates        Execution time (min)
     50          63,355                    13
     30          72,920                   18.5
     10          73,068                    36
     1           83,773                   265

   Table A1. Duplicate statistics and computation time vs.
    number of probabilistic topics (M). Cosine-similarity
   threshold is 0.7. M=1 means processing N(N-1)/2 pairs.
                                                                6.   Igor A. Podgorny, Chris Gielow, Matthew Cannon,
                                                                     Todd Goodyear. 2015b. Real time detection and
                                                                     intervention of poorly phrased questions. In CHI’15
                                                                     Extended Abstracts, 2205-2210.
                                                                7.   R. S. Ramya, K. R. Venugopal, S. S. Iyengar, L.
                                                                     Patnaik. 2016. Feature Extraction and Duplicate
                                                                     Detection for Text Mining: A Survey. Global Journal
                                                                     of Computer Science and Technology 56, 5.
                                                                8.   Anna Shtok, Gideon Dror, Yoelle Maarek, Idan
Shown in Table A1 are results of the numerical experiments           Szpektor. 2012. Learning from the Past: Answering
conducted on a MacBook Pro laptop with 2.8 GHz processor             New Questions with Past Answers, WWW, 759-768.
speed. The processing pipeline included (1) dividing            9.   Ivan Srba, Mária Bieliková. 2016. A Comprehensive
questions into M topics, (2) computing cosine-similarity for         Survey and Classification of Approaches for
all pairs in a topic, and (3) applying the duplicate-scoring         Community Question Answering. In: TWEB, 10(3),
model to the pairs with cosine-similarity above a pre-               18:1-18:63.