Semi-Automated Prevention and Curation of Duplicate Content in Social Support Systems

Igor A. Podgorny
Intuit, Inc.
San Diego, USA
igor_podgorny@intuit.com

Chris Gielow
Intuit, Inc.
San Diego, USA
chris_gielow@intuit.com

© 2018. Copyright for the individual papers remains with the authors. Copying permitted for private and academic purposes. ESIDA'18, March 11, Tokyo, Japan.

ABSTRACT
TurboTax AnswerXchange is a popular social Q&A system supporting users working on U.S. federal and state tax returns. Based on a custom-built duplicate scoring model, 35% of AnswerXchange questions have been found to be near-duplicates responsible for 56% of AnswerXchange document views. This degrades the user experience for both the asker, who is unable to find an answer amid duplicates, and the answerer, who is unable to answer efficiently at scale. The duplicate questions tend to form micro-clusters that grow via preferential attachment and, once exceeding some 25 questions in size, start morphing into mega-clusters with a complex network topology. This behavior can be leveraged to design semi-automated content curation systems to detect whether a newly posted question is a duplicate and, if so, which duplicate cluster it belongs to. In order to improve user experience in AnswerXchange, we explore how human and artificial intelligence can be jointly employed and then present several data-driven intelligent user interfaces. The duplicate scoring models can be utilized as elements of question-posting and answering experiences, unanswered question queueing and answer bots. These approaches can be extended to any social support Q&A system where duplicate posting negatively impacts search relevance and content consumption.

Author Keywords
TurboTax; AnswerXchange; CQA; community question answering; social question answering; duplicate clusters; content deduplication.

ACM Classification Keywords
H.5.m. Information Interfaces and Presentation (e.g. HCI): Miscellaneous

INTRODUCTION
Social Q&A systems provide a convenient self-support option for tax and financial software applications where personalized long-tail content generated by the users can supplement curated knowledge base answers. Users often prefer self-help to assisted measures (e.g. phone support or online chat) and are often able to find and apply their solution faster. This also reduces the load on assisted channels, ensuring they remain available to those who need it. AnswerXchange (http://ttlc.intuit.com) is a social Q&A site where customers can learn and share their knowledge with other TurboTax customers while preparing U.S. federal and state tax returns and also find step-by-step instructions on using the TurboTax application [5, 6]. As the users step through the TurboTax interview pages, they can ask questions about software and tax topics (Figure 1) and receive answers in a matter of minutes. AnswerXchange has generated millions of questions and answers that have helped tens of millions of TurboTax customers since launching in 2007.

Figure 1. AnswerXchange question-posting user experience. Question title (a short summary of question limited to 255 characters) is mandatory. Question details (not shown) are optional and unlimited in size.

The majority of users can find answers by searching the existing content. The overall quality of a customer self-help system is therefore determined by how well the self-help system assists in finding the relevant content. The number of search sessions resulting in assisted support contacts (as large as hundreds of thousands of customers per year) and the fraction of user up or down votes on self-support content provide convenient proxy metrics of content quality and search relevance in TurboTax self-help [5].
Figure 2. An example of duplicate AnswerXchange search results. Question titles and answer snippets are shown in purple and in black, respectively.

[Unnumbered graphic; only part of its text is recoverable. Panel titles: "Search results are clogged with duplicates", "AI cluster analysis", "Training the model with human-scored pairs", "Top-ten AnswerXchange duplicate clusters (TurboTax AnswerXchange TY16)". Legible cluster labels include "How do I change my bank? (p502 v58,978)", "How do I file an extension? (p486 v42,273)", "What is my AGI? (p712 v13,937)", "How do I find last years return? (p266 v3,699)" and "Why is my state tax incomplete? (p549 v34,361)".]

One problem with the existing question-posting experience (Figure 1) is that searches may result in multiple and often duplicate answers that are relatively close to the intent of the original question, but still do not match the original search intent (Figure 2). This interferes with the user's ability to select from a diverse set of possible answers [5] and often results either in the submission of a duplicate question or in switching to a less-desired support channel. A related problem is that users may submit poor quality questions by not providing all of the relevant information needed for a good quality answer [5]. One solution is a manual review of the user generated content to archive some of the duplicate questions and related answers, if any, while keeping the best performing content in "live" status (i.e. making it available for search). This approach is labor intensive and does not address the problem with the question-posting user experience. Duplicate questions may quickly build up, adding unnecessary burden on community question answering along the way.

The goal of this study is to address the problems of duplicate content prevention in AnswerXchange by combining machine learning and intelligent user interfaces. In what follows, we describe duplicate detection algorithms developed earlier and present a custom model trained on AnswerXchange questions. Next, we introduce the concept of "duplicate clusters" that provide a framework for semi-automated duplicate content prevention. Finally, we present several custom designed data-driven intelligent user interfaces for addressing the duplicate content problem.

RELATED WORK
The task of estimating semantic similarity of text documents has multiple practical applications and is of growing interest from the research community. The areas of research include web page similarity, document similarity, sentence similarity, search query similarity and utterance similarity in conversational user interfaces. These tasks are also related to a more general problem of detecting duplicates in database records [2].

Questions in social Q&A systems are often confined to one or two relatively short sentences and may warrant domain specific approaches to addressing question similarity. For example, two questions in a social Q&A system can be considered semantically identical if a single answer satisfies the needs of both original askers [3]. The answer may not yet exist in the production database but could be generated if needed. The task of duplicate-question detection is also related to the task of re-formulating a newly formed question [6] and automatically finding an answer to a new question [8].

The most recent results in the area of duplicate content scoring came from the 2017 Kaggle "Quora Pair" competition with model submissions from more than 3,000 teams (https://www.kaggle.com/c/quora-question-pairs). In this competition, the participants were tasked to classify if Quora question pairs are duplicates or not based on 200,000 training instances. Finally, SemEval2017 Task on Community Question Answering ("Question–Comment Similarity", "Question–Question Similarity", etc.) resulted in submissions from 23 teams [4].

The problem of duplicate detection and curation is closely related to the task of predicting content quality in social Q&A systems. Content quality metrics may be helpful in selecting the best performing question and answer for the duplicate-question pair. Answer and question quality in social Q&A systems have been the focus of increasing attention from the scientific community [1, 9].

DUPLICATE-SCORING MODEL

AnswerXchange Search
AnswerXchange search is built with Apache Lucene open-source software (http://lucene.apache.org). By default, Lucene uses "tf-idf" (https://en.wikipedia.org/wiki/tf-idf) and "cosine-similarity" as standard methods of ranking search results. Shorter documents with the same set of matching keywords typically rank higher than longer documents with similar semantic meaning. An average search query is 2-3 terms long (i.e. shorter than a typical AnswerXchange question) and it is often comparable in length with the title of a potentially duplicate question. The question details play a lesser role compared to titles, contributing to extra boosting of duplicate content by Lucene. The AnswerXchange Lucene ranking algorithm tends to boost new content and also accounts for various metadata such as helpfulness votes.

Training Data
The problem of near-duplicate detection can be formulated as an unsupervised or supervised machine learning task [7]. In the unsupervised case, duplicate pairs and clusters can be found based on distance metrics such as cosine-similarity of weighted tf-idf vectors, Jaccard similarity coefficient, distance in word2vec space, etc. In the supervised case, the problem of finding topical near-duplicate relations can be formulated as follows: given a pair of questions, the machine learnt model has to predict a "duplicate score" and determine if questions are duplicates based on a pre-defined threshold. In this paper, we employ a "hybrid" approach starting with cosine-similarity metrics for data pre-processing and then adding a more accurate custom-built scoring model to the processing pipeline.

As the fraction of duplicate pairs in AnswerXchange is relatively low, the question pairs ranked by cosine-similarity provide a convenient data set for labeling based on the importance sampling approach. Towards this goal, we computed bag-of-words cosine-similarity (Appendix A) for 790,000 questions available for search in AnswerXchange at the end of 2017 U.S. Tax Day (April 18). Next, four AnswerXchange moderators added class labels (0 or 1) to a random sample of 4,000 near-duplicate pairs. Instances open to doubt have been flagged by moderators and then re-labeled by a consensus. 1,000 randomly sampled non-duplicate pairs have been added for the final version of the training data set to make it equally divided between duplicate and non-duplicate pairs.

Duplicate-Scoring Model Features
The model features can be learnt from training data and/or by knowledge acquisition from AnswerXchange moderators. We have used the following model features:

• Cosine-similarity with tf-idf weighting (see Appendix A).
• Probabilistic topic ID of the question computed with Latent Dirichlet Allocation (see Appendix A).
• U.S. tax year in the question.
• Distinct words in the question pair.
• Common words in the question pair.
• Type of the question (e.g. "closed-ended" questions such as "Can I deduct …?" are typically tax related, while "how" questions are often product related).
• First word of the question.

Duplicate-Scoring Model Performance
Based on the set of 5,000 labeled question pairs, we trained and tested linear (logistic regression) and non-linear (random forest) binary classifiers using the Python machine learning library "scikit-learn". The model predicts class label (0 for a non-duplicate and 1 for duplicate pair) and also the duplicate score (i.e. probability of the question pair to belong to either class ranging from 0.0 to 1.0) that can be used to select user experience based on predefined threshold(s). We also trained a separate version of the logistic regression classifier using cosine-similarity as a single model feature. Shown in Table 1 are common metrics used for predictive model evaluation: area under the curve (AUC) for receiver operating characteristic, F1 score and logarithmic loss (log loss) function for classification.

Model                 AUC    F1 Score   Log Loss
Logistic Regression   0.95   0.88       0.27
Random Forest         0.94   0.87       0.31
Cosine-similarity     0.83   0.73       0.48

Table 1. Model performance metrics for duplicate-scoring models (details are explained in the text).

As seen from Table 1, both logistic regression and random forest models achieve performance that is consistent with the goals of this exploratory study. At the same time, the cosine-similarity version underperforms the first two by a wide margin. This can be explained by the inability to find an optimal threshold separating duplicate and non-duplicate pairs using the cosine-similarity alone. The following two examples illustrate the relationship between keyword-based cosine-similarity and duplicate-question score computed with logistic regression.

The first example is an AnswerXchange question pair with a relatively low cosine-similarity of 0.61: (1) "I need a copy of my federal tax return for 2014" and (2) "I need 2015 Tax Return". Both questions can be answered with a single instruction about getting a copy of prior year tax return filed with TurboTax and hence are duplicates. The second example is a question pair with high cosine-similarity of 1.0: (1) "do i have to file state taxes?" and (2) "how to file state taxes". These questions are not duplicates because they belong to tax and product categories [5], respectively, and would require two different answers.

DUPLICATE CLUSTERS

Preferential Attachment and Topology
After identifying 5,597,799 duplicate question pairs in AnswerXchange (Appendix A), we built an undirected graph of 281,031 duplicate questions. Each duplicate pair and duplicate question identified with the model constituted graph edge and graph vertex, respectively. The resulting graph consists of 14,616 connected components hereafter referred to as "duplicate clusters." To explore duplicate-cluster scaling behavior, we ranked clusters by the number of questions and plotted the number of questions per cluster vs. cluster rank in log-log scale (Figure 3). The largest cluster has 23,236 questions and the smallest ones only have two. The plot also includes graph (or edge) density:

$$D = \frac{2E}{V(V-1)},$$

where E is number of edges (i.e. duplicate pairs) and V is the number of vertices (i.e. questions). Graph density is equal to 1.0 for the fully connected graphs. In the latter case, each question in the cluster is connected to all remaining questions in the same duplicate cluster. Based on both question counts and graph density, the duplicate clusters in Figure 3 can be divided into three distinct groups marked as mega-clusters, transitional clusters and micro-clusters. These groups account for 84%, 2% and 14% of duplicate questions, respectively.

Figure 3. Scaling behavior of duplicate clusters (black dots) in AnswerXchange questions. The clusters are ranked by the number of questions in the descending order. Graph density for the clusters is shown in gray. Cyan and red dots refer to the clusters shown in Figures 4 and 5, respectively.

An example of micro-cluster with 23 vertices is shown in Figure 4. Graph density is 0.54 and most of the vertices are interconnected with an exception of three vertices connected by bridges to a denser graph core. The corresponding articulation points are marked by blue dots. Note that even if questions 1 and 2 are duplicates and questions 2 and 3 are duplicates, this does not mean that questions 1 and 3 are duplicates as well. This explains why a duplicate-cluster density is typically less than 1.0 unless the graph size is limited to two questions.

Figure 4. A micro-cluster marked by cyan dot in Figure 3. Articulation points are shown by smaller blue dots.

As seen from Figure 3, micro-cluster scaling behavior follows Zipf distribution (https://en.wikipedia.org/wiki/zipf’s_law):

$$n(r) = N r^{-\alpha},$$

where r ranges from about 100 to the total number of clusters R. Accordingly, the growth of N (ΔN) and R (ΔR) would be constrained by the following equation:

$$\frac{\Delta N}{N} = \alpha \frac{\Delta R}{R}.$$

It is worth mentioning that Zipf distribution is an asymptotic case of a more general Yule-Simon distribution (https://en.wikipedia.org/wiki/Yule-Simon_distribution) typical for the preferential attachment process, meaning that a newly posted duplicate is more likely to become attached to the existing cluster than to form a new duplicate pair. The scaling parameter for the micro-clusters:

$$\alpha = -\frac{\log n(r_1) - \log n(r_2)}{\log(r_1) - \log(r_2)}$$

can be estimated as 0.6. By extrapolating Zipf distribution to r=1 (that would correspond to a non-existing largest micro-cluster), one can estimate N value as 400. This value, however, is almost two orders of magnitude less than the number of questions in the top mega-cluster.

To explain the scale break in the distribution shown in Figure 3, let us examine larger duplicate clusters in more detail. Shown in Figure 5 is a mega-cluster with 4,549 questions. The cluster has density equal to 0.0017 and 1,048 articulation points. This means that the mega-clusters may consist of multiple sub-clusters that are semantically related to each other but with the elements that are not duplicates unless they belong to the same sub-cluster.

Figure 5. Same as in Figure 4, but now for a mega-cluster.

As the number of duplicates reaches a certain level, the clusters start coalescing by establishing bridges with other clusters, duplicate pairs and stand-alone questions, quickly evolving from dense connected graphs to sparse graphs with a complex network topology. The area of transition is marked as transitional clusters in Figure 3.

Semi-Automated Duplicate Content Curation
While the task of duplicate content archiving is straightforward once duplicate pairs are found (Appendix A), the duplicate content can build up again unless question-posting and/or search experiences are modified. Our next goal is therefore to explore how the concept of duplicate clusters discussed in the previous section can be applied to these tasks. The curation of micro-clusters can be done automatically or semi-automatically (i.e. with minimum human involvement) by retaining one or few best performing long-tail documents (i.e. documents that include both questions and answers) and assigning them a cluster ID for subsequent re-use.

The curation of mega-clusters represents a more challenging problem. First, a single best performing document in a mega-cluster may simply not exist since the cluster may contain multiple sub-clusters connected by bridges. Second, duplicate curation by a human is a cumbersome task due to the complex topology of the mega-cluster. While the exact solution may simply not exist, approximate solutions may be sufficient to reduce the number of duplicates posted in the AnswerXchange to an acceptable level. One approach would be to break the mega-clusters into smaller parts by deleting bridges in the graph or by employing a conventional hierarchical clustering. For example, the duplicate cluster shown in Figure 5 can be split into 1,363 connected components by removing all articulation points (blue dots in Figure 5). Most of the resulting connected components, however, are disconnected documents.

A more practical approach is to archive non-performing short-tail content from the mega-cluster and curate the resulting connected components. Shown in Figure 6 is a subset of mega-cluster from Figure 5 that now only includes documents with at least 100 views. This results in breaking the original mega-cluster into 68 connected components which are easier to curate.

Figure 6. A subset of the mega-cluster shown in Figure 5. Grey dots mark documents used in Figure 7.

The next task is to present duplicate content in a form suitable for semi-automated content curation. Figure 7 shows an example of duplicate content metrics for eight documents with at least 1,000 views. The left column is a sub-cluster ID followed by a post ID identifying an AnswerXchange document consisting of the original question and all accumulated answers (not shown). The text of the question and type of the question (i.e. user-generated content marked as UGC or knowledge base content labeled as FAQ) are included in the third and fourth columns, respectively. The last two columns are views accumulated over a given period and percentage of up-votes. The documents can be ranked by views and/or votes providing a mechanism of identifying and removing non-performing content either manually or automatically based on a set of predefined content quality thresholds.

ID   POST_ID     DOCUMENT                                                         TYPE   VIEWS    UPVOTE (%)
1    1,899,475   Can I deduct job-search expenses?                                FAQ    17,019   74.8
1    2,666,148   HI. Where do I enter my job search                               UGC    1,759    77.9
1    3,048,015   Where do I include job search                                    UGC    1,060    78.1
1    3,356,358   Where do I enter my job search                                   FAQ    6,727    70.3
1    3,705,028   Where do I deduct job search                                     UGC    2,999    67
2    2,895,188   Where do I enter my medical                                      FAQ    25,243   79.9
2    2,899,090   Why doesnt my refund change after I enter my medical expenses?   FAQ    13,765   79.1
2    2,956,890   where do i enter OUT OF POCKET medical expenses                  UGC    1,509    86.6

Figure 7. Duplicate document metrics for the documents marked by grey dots in Figure 6.

Duplicate metrics can be operationalized by adding an algorithm to match the best question to the best answer in the sub-cluster. Such a system would include answer deleting and merging, done manually or automatically by attaching an automatically generated "best" answer to the "best" duplicate question. The solution can be implemented as a back-end tool for trusted users assigned to the task of duplicate archiving and hidden from the less experienced regular users. The solution goes beyond simple duplicate archiving by providing an option to merge available answers to the existing duplicate questions. The non-human part of the solution includes quality ranking of the existing answers, e.g. up and down vote statistics as shown in Figure 7. In this way, the newly formed question-answer pairs provide better quality content available for search by combining the visually appealing questions and the best ranked answers. This is done by combining artificial and human intelligence since the answer to a related question (that the system recommended) can be confirmed by the contributor if needed. The cluster notes can be edited by trusted users and applied to all articles within the cluster.

Real Time Duplicate Detection
Finding duplicates to a given question requires (N-1) pairwise comparisons to the questions in the database and may not be feasible in real time. The computational time can be reduced by selecting potential duplicate matches with AnswerXchange search. The top performing documents in the clusters can be assigned an ID and indexed separately by the search engine. Once the search engine returns the documents ranked by relevancy to the newly formulated question, the duplicate-scoring model is applied to the top matches to see if the new question is a duplicate and, if so, which duplicate cluster it belongs to.

DATA-DRIVEN USER EXPERIENCES
Accumulation of duplicate content can be prevented by integrating a custom-built duplicate-scoring model and question-posting experience. Another option is to expose an intelligent interface to the trusted users by providing extra features for answering duplicate questions. Finally, the duplicate question curation can be part of the content moderation process carried out by the AnswerXchange trusted users or trained bots.

Question Deduplication While Posting
The first feature (Figure 8) extends the AnswerXchange "Question Optimizer" system [6]. The system prompts the asker with personalized instructions created dynamically based on real time analysis of the question's semantics and writing style. The "Question Optimizer" has been re-designed to make a duplicate question more difficult to submit without addressing the recommended re-phrasing. The annotations to the concept are presented next.

Figure 8. Question-posting experience reveals the duplicates and helps users re-phrase as a unique question.

A) The "Question-Optimizer" technology is envisioned to include duplicate content detection in addition to providing timely advice on how to re-phrase or deflect.

B) If a question falls in a known duplicate cluster, the best matching and most referenced answer matches are shown.

C) Trusted users may attach "cluster notes" to curated duplicate clusters; the notes then appear automatically with any question within the cluster. In the example shown in Figure 8, the duplicate cluster is about printing and the message notes that the printing experience recently changed in the product - information which may be useful to anyone with printing-related questions.

D) The suggested answers are deduplicated using duplicate score equalization so the answers are more useful. A "cluster browser" is also added below the results to help refine amongst the most popular variations.

Question Deduplication While Answering
The second feature addresses the situation where a potential duplicate has been submitted and needs to be intercepted as part of question answering experience. This concept is illustrated in Figures 9-10.

Figure 9. Contributor experience tagging and attaching curated answer to the question.

Specifically, Figure 9 illustrates the contributor (typically a trusted user) answering experience and includes the following annotation:

E) The suggested answered question duplicate is presented to the original asker and also displays the duplicate probability. The contributor can easily attach it to their answer, which also tells the system the question was a duplicate and should be archived in favor of the attached duplicate.

Figure 10. Original asker view of deduplicated question with personalized answer.

Once the duplicate question is answered it becomes available to the original asker (Figure 10).

C) Re-purposing trusted users notes similar to those used in question-posting experience (Figure 8).

F) A personalized note introduces the "recommended answer" while explaining it's a duplicate.

G) The duplicate answer is presented with a sense of authority.

H) If the original asker is unsatisfied with the answer, they may revise their question and it will re-enter the answer queue. They also have the option to request a new answer without submitting the question.

Finally, the automatic flagging of an unanswered question as a duplicate may be validated or invalidated by the trusted users and used to update the training dataset for model re-training.

Question Deduplication with Automated Answers
The "Answer Bot" (Figure 11) is a feature driven by artificial intelligence alone. The "Answer Bot" increases self-support efficiency by responding to a customer's questions by e-mail with answers from the matching duplicate cluster if the posted question is flagged by the duplicate-scoring model as a duplicate.

I) "Answer Bots" may automatically answer questions determined to be duplicates. Like the contributor-assisted experience, the bot will recommend the best answer within the duplicate cluster. The user is made aware that a bot answered the question and, if unsatisfied, may request a new answer or revise their question.

Figure 11. Automated deduplication user experience as part of customized e-mail to the original asker.

Further, the "Answer Bot" attaches the question to the existing duplicate cluster automatically while providing a generic or personalized answer. The bot replies trigger automated archiving of the duplicate content. The question remains visible to the original asker but is not made available to AnswerXchange users and is suppressed from search results. A related option is to create two separate queues of duplicate questions for answering. The questions in the first queue would be assigned to designated moderators who can customize duplicate content for the original asker and archive it afterwards. The less complicated questions in the second queue can be assigned to the "Answer Bot".

DISCUSSION AND CONCLUSION
Social Q&A systems often presume that the users comply with recommendations not to replicate the existing content. This is not the case for AnswerXchange where users often avoid consuming existing content by posting a new duplicate question. These users may not realize that AnswerXchange is a social Q&A site or lack the ability to find and apply existing answers to their question. We need to intervene with intelligent user interfaces to alter the duplicate posting behavior. Towards this goal, we present two algorithms for duplicate content curation and providing real time inputs to the AnswerXchange user interfaces. The first algorithm determines if two questions are near-duplicates and can be combined with a search to detect duplicates in real time. The second algorithm uncovers all duplicate pairs in AnswerXchange and is capable of handling the deduplication task with a corpus of millions of questions. We conclude the paper by presenting three question deduplication user interfaces. The hypotheses to validate include: (1) Will askers accept a duplicate when presented with an acceptable answer? (2) Will they accept a duplicate with or without a personalized contributor note? (3) If dissatisfied will they revise or request a new answer? (4) Will they accept recommended answers from Answer Bots? We are planning to validate these hypotheses with a set of rapid experiments prior to production.

APPENDIX A: DUPLICATE PAIR DETECTION
Detecting duplicates for N=790,000 questions based on a custom-built model would require N(N-1)/2 pairwise computations. The task of finding duplicate pairs becomes computationally expensive once the corpus reaches several hundred thousand documents. At the same time, computing cosine-similarity for a question pair is faster than scoring the same pair with the custom-built model and can be used to reduce the number of potential duplicate pairs from billions to millions of pairs. Further, dividing content by M probabilistic topics can reduce the number of pairwise comparisons by a factor of about M, while not necessarily affecting the number of expected near-duplicate pairs.

M    Duplicates   Execution time (min)
50   63,355       13
30   72,920       18.5
10   73,068       36
1    83,773       265

Table A1. Duplicate statistics and computation time vs. number of probabilistic topics (M). Cosine-similarity threshold is 0.7. M=1 means processing N(N-1)/2 pairs.

Shown in Table A1 are results of the numerical experiments conducted on a MacBook Pro laptop with 2.8 GHz processor speed. The processing pipeline included (1) dividing questions into M topics, (2) computing cosine-similarity for all pairs in a topic, and (3) applying the duplicate-scoring model to the pairs with cosine-similarity above a pre-defined threshold. The total number of duplicate pairs was found to be 5,597,799 and contained 281,031 unique questions (or 35% of the AnswerXchange "live" questions). In 2017, they contributed 56% to the AnswerXchange document views. The documents in the identified duplicate pairs can be ranked by suitable question (and answer) proxy content quality metrics as discussed earlier, for example by the number of views, votes, age of the post, or by a weighted combination thereof. The document with the lower score can be removed consecutively from each pair resulting in a removal of 217,767 documents (27% of the AnswerXchange "live" questions).

ACKNOWLEDGMENTS
We thank anonymous reviewers for valuable comments.

REFERENCES
1. Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, Gilad Mishne. 2008. Finding High-Quality Content in Social Media. In: Proc. of the International Conference on Web Search and Data Mining, 183-193.
2. Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, Vassilios S. Verykios. 2007. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng., 19, 1-16.
3. Klemens Muthmann, Alina Petrova. 2014. An automatic approach for identifying topical near-duplicate relations between questions from social media Q/A sites. In: Classifying Big Data from the Web, 1-6.
4. Preslav Nakov, Doris Hoogeveen, Lluís Màrquez, Alessandro Moschitti, Hamdy Mubarak, Timothy Baldwin, Karin Verspoor. 2017. SemEval-2017 Task 3: Community Question Answering. In: Proc. of the 11th Int. Workshop on Semantic Evaluation, 27-48.
5. Igor A. Podgorny, Matthew Cannon, Todd Goodyear. 2015a. Pro-active detection of content quality in TurboTax AnswerXchange. In: Proc. of ACM Conference Companion on CSCW, 143-146.
6. Igor A. Podgorny, Chris Gielow, Matthew Cannon, Todd Goodyear. 2015b. Real time detection and intervention of poorly phrased questions. In: CHI’15 Extended Abstracts, 2205-2210.
7. R. S. Ramya, K. R. Venugopal, S. S. Iyengar, L. Patnaik. 2016. Feature Extraction and Duplicate Detection for Text Mining: A Survey. Global Journal of Computer Science and Technology 56, 5.
8. Anna Shtok, Gideon Dror, Yoelle Maarek, Idan Szpektor. 2012. Learning from the Past: Answering New Questions with Past Answers. In: WWW, 759-768.
9. Ivan Srba, Mária Bieliková. 2016. A Comprehensive Survey and Classification of Approaches for Community Question Answering. In: TWEB, 10(3), 18:1-18:63.
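
As a supplementary illustration of Appendix A, the three-step pipeline (topic blocking, cosine-similarity filtering, model scoring) and the construction of duplicate clusters as connected components can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the smoothed idf weighting, the whitespace tokenizer and the `score_fn` callback standing in for the duplicate-scoring model are all hypothetical, and the topic IDs are assumed to be precomputed (the paper obtains them with Latent Dirichlet Allocation).

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def tfidf_vectors(questions):
    """Build one sparse bag-of-words tf-idf vector (term -> weight) per question."""
    tokenized = [q.lower().split() for q in questions]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        # Smoothed idf is an assumption; the paper does not specify its variant.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in Counter(tokens).items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def find_duplicate_pairs(questions, topic_ids, cos_threshold=0.7,
                         score_fn=None, score_threshold=0.5):
    """Steps (1)-(3): block by topic, filter by cosine, then score survivors."""
    vectors = tfidf_vectors(questions)
    by_topic = defaultdict(list)
    for idx, topic in enumerate(topic_ids):
        by_topic[topic].append(idx)
    pairs = []
    for members in by_topic.values():
        for i, j in combinations(members, 2):  # pairs within one topic only
            if cosine(vectors[i], vectors[j]) < cos_threshold:
                continue
            # score_fn is a hypothetical stand-in for the duplicate-scoring model.
            if score_fn is None or score_fn(questions[i], questions[j]) >= score_threshold:
                pairs.append((i, j))
    return pairs


def duplicate_clusters(pairs):
    """Connected components of the undirected duplicate graph, largest first."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)
    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return sorted(clusters.values(), key=len, reverse=True)


def graph_density(num_vertices, num_edges):
    """D = 2E / (V (V - 1)); equals 1.0 for a fully connected cluster."""
    return 2.0 * num_edges / (num_vertices * (num_vertices - 1))


# Toy data (invented for illustration only).
questions = [
    "how do i print my tax return",
    "how do i print my tax return",   # same wording posted again
    "print my tax return how do i",   # same bag of words, different order
    "can i deduct medical expenses",
    "how do i print my tax return",   # same wording, assigned to another topic
]
topic_ids = [0, 0, 0, 1, 1]

pairs = find_duplicate_pairs(questions, topic_ids)
for cluster in duplicate_clusters(pairs):
    edges = sum(1 for i, j in pairs if i in cluster and j in cluster)
    print(sorted(cluster), graph_density(len(cluster), edges))
```

In the toy data the last question repeats the first one but sits in a different topic, so it is never compared with it. This is the trade-off visible in Table A1: more topics mean fewer comparisons but also some missed duplicates. Passing the same topic ID for every question corresponds to the M=1 row.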