Squabble: an efficient, scalable controversy classifier

Shiri Dori-Hacohen, Elinor Brondwine, and Jeremy Gollehon
AuCoDe
{firstname}@controversies.info

Abstract

We introduce Squabble, an efficient, scalable API for classifying controversial text such as news articles. Squabble is designed and implemented in Python for commercial use, following industry best practices that can be adopted by other researchers aiming to commercialize their innovations. We demonstrate multiple orders of magnitude speedup compared to prior work, while retaining effectiveness.

1 Introduction & Prior Work

The last few years have seen growing interest in computational analysis of controversies (cf. [GDFMGM17, MZDC14]).

Controversy Language Models. Jang et al. described a framework for detecting controversy probabilistically [JFDHA16] and introduced a novel approach based on controversy language models (CLM). CLM evaluates whether a document better matches a controversy vs. a non-controversy (or background) language model, relying on the following comparison: log P(D|C) - log P(D|NC) > alpha. Here, P(D|C) and P(D|NC) represent the probabilities that a document was generated from a controversial or a non-controversial unigram language model, respectively. CLM can be estimated by constructing a collection of controversial documents; here, we refer to one of the construction approaches, the Wikipedia Controversy Feature (WCF) [JFDHA16], which uses the top K Wikipedia articles with high controversy scores. Once trained, the classifier no longer relies on Wikipedia for its success; rather, the language model trained from Wikipedia can be applied to any text, whether or not it discusses topics covered in Wikipedia.
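The CLM comparison above can be sketched in a few lines of Python. This is a minimal illustration, not the Squabble implementation: it assumes pre-tokenized documents and add-one-smoothed unigram models, and all function names here are ours.

```python
import math
from collections import Counter

def unigram_lm(docs):
    """Build an add-one-smoothed unigram language model from tokenized docs.

    The extra +1 in the denominator (vocabulary size + 1) leaves probability
    mass for tokens never seen in training.
    """
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    vocab_size = len(counts)
    def log_prob(token):
        return math.log((counts[token] + 1) / (total + vocab_size + 1))
    return log_prob

def clm_score(doc, lm_controversial, lm_background):
    """log P(D|C) - log P(D|NC) under independent unigram models."""
    return sum(lm_controversial(t) - lm_background(t) for t in doc)

def is_controversial(doc, lm_c, lm_nc, alpha=0.0):
    """Apply the CLM decision rule: log P(D|C) - log P(D|NC) > alpha."""
    return clm_score(doc, lm_c, lm_nc) > alpha
```

In the WCF setting, the controversial model would be trained on the top-K high-controversy Wikipedia articles, with the threshold alpha tuned on held-out data.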
Recent work demonstrated a clear link between controversies and disinformation, showing that polarizing topics are more vulnerable to fake news and proposing controversy as a feature in the prediction and classification of mis- and disinformation [VQSZ19], underscoring the importance of classifying controversy. Others explored the connection between controversy and sentiment, highlighting that the two are related but not interchangeable constructs [KLF18, MZDC14]. Recent work also generated unsupervised explanations of what makes a topic controversial [KA19]. Social good and commercial applications of detecting controversy include crisis management, public relations, and defense. Consider, for example, that recent mass shootings and terrorist attacks were preceded by activity on social media referencing highly controversial matters1.

2 The Squabble API

As is common in research labs, the original CLM code, written in Java, was built with much attention paid to effectiveness but little to efficiency and organization. As an early-stage startup, we wanted to give our research team the ability to iterate on their models quickly over large data sets without waiting days for answers, a situation that had slowed the team down significantly. In order to bring CLM out of the research lab into a commercially viable product that could handle large volumes of text from different genres, we built a production-ready system in Python, dubbed "Squabble". We describe two main elements relevant for efficiency: the technical infrastructure and a redesign of the CLM approach.

Technical infrastructure. We created a system that can ingest, filter, and store raw data from a wide variety of sources such as news and social media.

Copyright (c) 2019 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. In: A. Aker, D. Albakour, A. Barrón-Cedeño, S. Dori-Hacohen, M. Martinez, J. Stray, S.
Tippmann (eds.): Proceedings of the NewsIR'19 Workshop at SIGIR, Paris, France, 25-July-2019, published at http://ceur-ws.org

1 https://www.npr.org/2019/03/15/703911997/the-role-social-media-plays-in-mass-shootings

Figure 1: The Squabble API architecture. See Figure 2 for a zoomed-in version of the "LM Generation" process.

Figure 2: Procedure for generating Language Models.

Table 1: Before, After & Success Metrics for Squabble. Infrastructure speeds reported on a per-core basis on a server; controversy scoring reported on a dual-core laptop.

                 Infrastructure            Controversy Scoring
Before           100 tweets/sec/core       7 requests/sec
Success metric   1,000 tweets/sec/core     500 requests/sec
After            100,000 tweets/sec/core   700 requests/sec
Speedup          1000x                     100x

Table 2: Dataset from prior work [DHA15] with key statistics.

                    # Docs (%)     Terms: Mean   Terms: Std
Controversial       78 (25.75%)    828.43        1159.78
Non-controversial   225 (74.25%)   367.1         564.11
All                 303 (100%)     485.86        787.28

As an initial testbed, and to stress-test our infrastructure, we used a 10-year history of the Twitter Gardenhose collection, a random 1% sample of all tweets from 2008-2018 [OBRS10]. We created a multi-threaded Python program storing data in a PostgreSQL database hosted on Amazon Web Services' RDS. We added a component that extracts both the tweet text and the text of any externally linked article when a link was included in the tweet. We kept the filtering stage simple yet flexible, accepting as input a text file with a list of keywords or hashtags of interest. Tweets containing any of the keywords or tags are included in the database.
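The filtering stage described above, a keyword/hashtag list deciding which tweets are stored, can be sketched as follows. This is a hypothetical minimal version, not our production code: sqlite3 stands in for the PostgreSQL/RDS store, matching is naive whitespace tokenization, and the linked-article extraction component is omitted.

```python
import sqlite3

def load_keywords(path):
    """Read one keyword or #hashtag per line from a plain-text file."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def matches(tweet_text, keywords):
    """True if the tweet contains any keyword or hashtag of interest."""
    tokens = {tok.lower().lstrip("#") for tok in tweet_text.split()}
    bare = {k.lstrip("#") for k in keywords}
    return bool(tokens & bare)

def ingest(tweets, keywords, db):
    """Store only the tweets that match the keyword list."""
    db.execute("CREATE TABLE IF NOT EXISTS tweets (text TEXT)")
    db.executemany(
        "INSERT INTO tweets (text) VALUES (?)",
        [(t,) for t in tweets if matches(t, keywords)],
    )
    db.commit()
```

Because each tweet is filtered independently, a function like this parallelizes trivially across ingestion threads.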
CLM revisited. We reimplemented and refined the controversy detection algorithm described in Section 1; the system architecture is presented in Figure 1. We used established Python packages such as NLTK [LB02] and Scikit-learn [PVG+11], and created a research testbed in order to evaluate the Squabble API. We describe evaluation details in Section 4. Squabble accepts data as a text stream via SQL queries or CSV files. In pilot efforts, prospective customers sent data via large CSV files, which we were able to ingest and run through Squabble rapidly. Prior to creating Squabble, such pilots would take days to run, slowing development down. Like our data-processing code, the Squabble code can likewise scale via multi-threading. In addition, Squabble can be applied in a wide variety of verticals, such as finance, defense, and public relations; we constructed Squabble explicitly to allow for that possibility. As an early-stage startup, this also gives us the flexibility to pivot easily should the need arise, without expensive retooling of the technology.

3 Efficiency improvements

Prior to commencing this project, we set internal success metrics for efficiency (see Table 1) that we estimated would allow us to successfully process customer requests and internal research at scale for the foreseeable future. Processing a massive data set into a structured database, on a budget, turned out to be the biggest core technological hurdle in the infrastructure portion of our system: PostgreSQL's concurrency behaviour couldn't handle the amount of data being sent. Once the underlying issue was resolved, data storage speed immediately increased by multiple orders of magnitude, not only meeting our success metric but also exceeding our most ambitious projections. Since this code was structured for scalability, we can add multi-threading or multi-processing with no additional effort. As seen in Table 1, we easily met our success metrics and achieved orders of magnitude speedup.

4 Evaluation

We leverage the dataset introduced by Dori-Hacohen and Allan [DHA15], which consists of judgments for 303 webpages2 from the ClueWeb09 collection3 and is presented in Table 2. The evaluation set is imbalanced, in the sense that it contains more non-controversial (225) than controversial (78) documents.
Therefore, we focused on balanced accuracy as our metric of choice against several baselines, such as sentiment and Naive Bayes, and we also report other metrics for completeness' sake (see Table 3). Our results with the WCF approach were in line with prior work [JFDHA16], demonstrating reproducibility of the original paper4, and also show that effectiveness was not sacrificed for the sake of efficiency.

2 http://ciir.cs.umass.edu/downloads
3 http://lemurproject.org/clueweb09/

Table 3: Classification scores for Squabble compared to several baselines. Squabble outperforms on all metrics evaluated (other than recall, which a trivial baseline achieves by definition).

                     B. Acc.   R       Acc.    P       F1
Squabble score       0.876     0.955   0.835   0.600   0.737
Sentiment            0.476     0.909   0.253   0.233   0.370
Random               0.545     0.727   0.451   0.267   0.390
MultinomialNB        0.816     0.864   0.791   0.543   0.667
All controversial    0.500     1.000   0.242   0.242   0.389
None controversial   0.500     0.000   0.758   NaN     NaN
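Balanced accuracy, the headline metric in Table 3, is the mean of per-class recall, which is what makes it robust to the 78/225 class imbalance; scikit-learn exposes it as balanced_accuracy_score, but a small stdlib sketch makes the definition explicit. The example below also shows why the trivial "all controversial" baseline earns perfect recall yet only 0.5 balanced accuracy.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, robust to class imbalance."""
    classes = set(y_true)
    recalls = []
    for c in classes:
        idx = [i for i, y in enumerate(y_true) if y == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# A trivial "everything is controversial" prediction on a 78/225 split
# achieves recall 1.0 on the controversial class but recall 0.0 on the
# non-controversial class, hence balanced accuracy 0.5 -- which is why
# plain accuracy alone would be misleading on this dataset.
y_true = [1] * 78 + [0] * 225
y_pred = [1] * 303
```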
5 Conclusion

We presented Squabble, a robust, commercially viable API for controversy classification, which is efficient, scalable, and applicable in a vertical-agnostic manner. By reimplementing the controversy language model [JFDHA16] in Python using industry best practices, we increased its efficiency by orders of magnitude without sacrificing effectiveness. Efficiency and scalability position Squabble to be used in commercial settings for a wide variety of applications. Its modularity and research testbed allow for extensibility and improvement in the future, as more effective methods of classifying controversy are discovered.

Limitations & Future Work. The dataset from prior work is somewhat limited [DHA15]; document length is inconsistent between controversial and non-controversial documents (see Table 2), and it is possible that CLM's effectiveness benefits from a bias toward longer documents. Foley showed that AUC on this dataset has effectively been maximized relative to its inter-annotator agreement [Fol18]. In future work, we hope to create new ground-truth data sets for controversy, and encourage other researchers to do the same. Future work will
scale our architecture to run concurrently on multiple servers and speed up controversy scoring further. More work is also needed to understand the connection between controversy and mis- and disinformation.

Acknowledgements. This material is based upon work supported by the National Science Foundation under Grant No. 1819477. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

References

[DHA15] Shiri Dori-Hacohen and James Allan. Automated controversy detection on the web. In European Conference on Information Retrieval, pages 423-434. Springer, 2015.

[Fol18] John Foley. Explainable agreement through simulation for tasks with subjective labels. arXiv preprint arXiv:1806.05004, 2018.

[GDFMGM17] Kiran Garimella, Gianmarco De Francisci Morales, Aristides Gionis, and Michael Mathioudakis. Reducing controversy by connecting opposing views. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, pages 81-90. ACM, 2017.

[JFDHA16] Myungha Jang, John Foley, Shiri Dori-Hacohen, and James Allan. Probabilistic approaches to controversy detection. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pages 2069-2072. ACM, 2016.

[KA19] Youngwoo Kim and James Allan. Unsupervised explainable controversy detection from online news. In European Conference on Information Retrieval, pages 836-843. Springer, 2019.

[KLF18] Kateryna Kaplun, Christopher Leberknight, and Anna Feldman. Controversy and sentiment: An exploratory study. In Proceedings of the 10th Hellenic Conference on Artificial Intelligence, page 37. ACM, 2018.

[LB02] Edward Loper and Steven Bird. NLTK: The Natural Language Toolkit. arXiv preprint cs/0205028, 2002.

[MZDC14] Yelena Mejova, Amy X. Zhang, Nicholas Diakopoulos, and Carlos Castillo. Controversy and sentiment in online news. arXiv preprint arXiv:1409.8152, 2014.

[OBRS10] Brendan O'Connor, Ramnath Balasubramanyan, Bryan R. Routledge, and Noah A. Smith. From tweets to polls: Linking text sentiment to public opinion time series. In Fourth International AAAI Conference on Weblogs and Social Media, 2010.

[PVG+11] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825-2830, 2011.

[VQSZ19] Michela Del Vicario, Walter Quattrociocchi, Antonio Scala, and Fabiana Zollo. Polarization and fake news: Early warning of potential misinformation targets. ACM Trans. Web, 13(2):10:1-10:22, March 2019.

4 Details omitted for space considerations.