TAR on Social Media: A Framework for Online Content Moderation

Eugene Yang

David D. Lewis

Ophir Frieder

IR Lab, Georgetown University, Washington, DC, USA; Reveal Brainspace, Chicago, IL, USA

Content moderation (removing or limiting the distribution of posts based on their contents) is one tool social networks use to fight problems such as harassment and disinformation. Manually screening all content is usually impractical given the scale of social media data, and the need for nuanced human interpretations makes fully automated approaches infeasible. We consider content moderation from the perspective of technology-assisted review (TAR): a human-in-the-loop active learning approach developed for high recall retrieval problems in civil litigation and other fields. We show how TAR workflows, and a TAR cost model, can be adapted to the content moderation problem. We then demonstrate on two publicly available content moderation data sets that a TAR workflow can reduce moderation costs by 20% to 55% across a variety of conditions.

Keywords: technology-assisted review, active learning, social media, content moderation, cost analysis

1. Introduction

Online social networks are powerful platforms for personal communication, community building, and free expression. Unfortunately, they can also be powerful platforms for harassment, disinformation, and perpetration of criminal and terrorist activities. Organizations hosting social networks, such as Facebook, Twitter, Reddit, and others, have deployed a range of techniques to counteract these threats and maintain a safe and respectful environment for their users.

One such approach is content moderation: removal (hard moderation) or demotion (soft moderation) of policy-violating posts [1, 2]. Despite recent progress in machine learning, online content moderation still relies heavily on human review [3]. Facebook's CEO Mark Zuckerberg stated that language nuances could get lost when relying on automated detection approaches, emphasizing the need for human judgment.¹ Ongoing changes in what is considered inappropriate content complicate the use of machine learning [4]. Policy experts have argued that complete automation of content moderation is socially undesirable regardless of algorithmic accuracy [5].

It is thus widely believed that both human moderation and automated classification will be required for online content moderation for the foreseeable future [1, 5, 6]. This has meant not just capital investments in machine learning tools for moderation, but also massive ongoing personnel expenses for teams of human reviewers [7].

Surprisingly, the challenge of reducing costs when both machine learning and manual review are necessary has been an active area of interest for almost two decades, but in a completely different area: civil litigation. Electronic discovery (eDiscovery) projects involve teams of attorneys, sometimes billing the equivalent of hundreds of euros per person-hour, seeking to find documents responsive to a legal matter [8]. As the volume of electronically produced documents grew, machine learning began to be integrated in eDiscovery workflows in the early 2000s, a history we review elsewhere [9].

The result in the legal world has been technology-assisted review (TAR): human-in-the-loop active learning workflows that prioritize the most important documents for review [10, 11]. One-phase (continuous model refinement) and two-phase (with separate training and deployment phases) TAR workflows are both in use [9, 12].

Because of the need to find most or all relevant documents, eDiscovery has been referred to as a high recall review (HRR) problem [13, 14, 15]. HRR problems also arise in systematic reviews in medicine, sunshine law requests, and other tasks [16, 17, 18]. Online content moderation is an HRR problem as well, in that a very high proportion of inappropriate content should be identified and removed.

Our contributions in this paper are two-fold. First, we describe how to adapt TAR and its cost-based evaluation framework to the content moderation problem. Second, we test this approach using two publicly available content moderation datasets. Our experiments show substantial cost reductions using the proposed TAR framework over both manual review of unprioritized documents and training of prioritized models on random samples.

DESIRES 2021 – 2nd International Conference on Design of Experimental Search & Information REtrieval Systems, September 15–18, 2021, Padua, Italy
eugene@ir.cs.georgetown.edu (E. Yang); desires2021paper@davelewis.com (D. D. Lewis); ophir@ir.cs.georgetown.edu (O. Frieder)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.

¹ https://www.businessinsider.com/zuckerberg-nuances-content-moderation-ai-misinformation-hearing-2021-3

2. Background

Content moderation on online platforms is a necessity [19, 20] and has been argued by some to be the defining feature of an online platform [6]. Despite terms of service and community rules on each platform, users produce inappropriate content, particularly when anonymous [21]. Inappropriate content includes toxic content such as hate speech [22], offensive content [23], and mis- and disinformation [4, 23]. It also includes content that is inappropriate for legal or commercial reasons, such as potential copyright violations [5, 24].

The identification of toxic content can require subtle human insight [4, 22], both due to attempts at obfuscation by posters, and because the inappropriateness of the content is often tied to its cultural, regional, and temporal context [1, 3]. Mis- and disinformation often consists of subtle mixtures of truthful and misleading content that require human common sense inferences and other background knowledge [4, 23].

Social media organizations have deployed numerous techniques for implementing community policies, including graph- and time-based analyses of communication patterns, user profile information, and others [25]. Our focus here, however, is on methods that use the content of a post.

Content monitoring falls into three categories: manual moderation, text classification, and human-in-the-loop methods. The latter two approaches leverage machine learning models and are sometimes collectively referred to as algorithmic content moderation in policy research [5].

Manual moderation is the oldest approach, dating back to email mailing lists. It is, however, extremely expensive at the scale of large social networks and suffers from potential human biases. Additionally, mental health concerns are an issue for moderators exposed to large volumes of toxic content [25, 26, 27].

The simplest text classification approaches are keyword filters, but these are susceptible to embarrassing mistakes² and countermeasures by content creators. More effective text classification approaches to content moderation are based on supervised machine learning [28, 29]. Content types that have been addressed include cyberbullying [29, 30, 31, 32], hate speech [22, 31, 33, 34, 35, 36], and offensive language in general [23, 37, 38, 39, 40, 41, 42].

However, some moderation judgments are inevitably too subtle for purely automated methods³, particularly when content is generated with the intent of fooling automated systems [1, 25, 43]. Content that is recontextualized from the original problematic context, for example through reposting, screenshotting, and embedding in new contexts, complicates moderation [2]. Additionally, bias in automated systems can arise both from learning from biased labels and from numerous other choices in data preparation and algorithmic settings [27, 44, 45].

Biased models risk further marginalizing and disproportionately censoring groups that already face discrimination [1]. Differences in cultural and regulatory contexts further complicate the definition of appropriateness, creating another dimension of complexity when deploying automated content moderation [4].

Human-in-the-loop approaches, where AI systems actively manage which materials are brought to the attention of human moderators, attempt to address the weaknesses of both approaches while gathering training data to support supervised learning components [25, 46].

Platforms use filtering mechanisms that proactively present only approved content (pre-moderation), removal mechanisms that passively take down inappropriate content, or both, depending on the intensity of moderation needed [4]. Reviewing protocols could shift from one to the other based on the frequency of violations or during a specific event, such as elections⁴. Regardless of the workflow, the core and arguably most critical component is review.

However, the primary research focus of human-in-the-loop content moderation has been on classification algorithm design and bias mitigation, rarely on the investigation of the overall workflow.

Like content moderation, eDiscovery is a high recall retrieval task applied to large bodies of primarily textual content (typically enterprise documents, email, and chat) [11, 12]. Both fixed data set and streaming task structures have been explored, though the streaming context tends to be bursty (e.g., all data from a single person arriving at once) rather than continuous. Since cost minimization is a primary rationale for TAR [47], research on TAR has focused on training regimens and workflows for minimizing the number, or more generally the cost, of documents reviewed [9, 12]. A new TAR approach is typically evaluated for its ability to meet an effectiveness target while minimizing cost, or a cost target while maximizing effectiveness [18, 48, 49]. This makes approaches developed for TAR natural to consider for content moderation.

² https://www.techdirt.com/articles/20200912/11133045288/paypal-blocks-purchases-tardigrade-merchandise-potentially-violating-us-sanctions-laws.shtml

³ https://venturebeat.com/2020/05/23/ai-proves-its-a-poor-substitute-for-human-content-checkers-during-lockdown/

⁴ https://www.washingtonpost.com/technology/2020/11/07/facebook-groups-election/

3. Applying TAR to Content Moderation

In most TAR applications, at least a few documents of the (usually rare) category of interest are available at the start of the workflow. These are used to initialize an iterative pool-based active learning workflow [50]. Reviewed documents are used to train a predictive model, which in turn is used to select further documents based on predicted relevance [51], uncertainty [52], or composite factors. Workflows may be batch-oriented (mimicking pre-machine learning manual workflows common in the law), or a stream of documents may be presented through an interactive interface with training done in the background. These active learning workflows have almost completely displaced training from random examples when supervised learning is used in eDiscovery.

Two workflow styles can be distinguished [9]. In a one-phase workflow, iterative review and training simply continue until a stopping rule is triggered [49, 53, 54]. Stopping may be conditioned on estimated effectiveness (usually recall), cost limits, and other factors [53, 55, 56]. Two-phase workflows stop training before review is finished, and deploy the final trained classifier to rank the remaining documents for review. The reviewed documents are typically drawn from the top of the ranking, with the depth in the ranking chosen so that an estimated effectiveness target is reached [18, 48]. Two-phase workflows are favored when labeling of training data needs to be done by more expensive personnel than are necessary for routine review.

The cost of both one- and two-phase TAR workflows can be captured in a common cost model [9]. The model defines the total cost of a one-phase review terminated at a particular point as the cost incurred in reviewing documents to that point, plus a penalty if the desired effectiveness target (e.g., a minimum recall value) has not been met. The penalty is simply the cost of continuing on to an optimal second-phase review from that point, i.e., reviewing the minimum number of prioritized documents needed to hit the effectiveness target. For a two-phase workflow, we similarly define total cost to be the cost of the training phase plus the cost of an optimal second phase using the final trained model.

These costs in both cases are idealizations in that there may be additional cost (e.g., a labeled random sample) to choose a phase two cutoff [?]. However, the model allows a wide range of workflows to be compared on a common basis, as well as allowing differential costs for review of positive vs. negative documents, or phase one vs. phase two documents.

While developed for eDiscovery, the above cost model is also a good fit for content moderation. As discussed in the previous section, the human-in-the-loop moderation approaches used in social media are complex, but in the end reduce to some combination of machine-assisted manual decisions (phase one) and automated decisions based on deploying a trained model (phase two). Operational decisions such as flagging and screening all posts from an account, or massive reviewing of posts related to certain events [4, 6], are all results of applying previously trained models, which is also a form of deployment. Also, broadly applying the model to filter content vastly reduces the moderation burden when similar content is rapidly being published on the platform, at the risk of false removals [4]. We do not claim that this simplified model is optimal for evaluating content moderation; it is an initial effort at modeling the human-in-the-loop moderation process.

When applying the model to content moderation, however, we assume uniform review costs for all documents. This seems the best assumption given the short length of texts reviewed and what is known publicly about the cost structure of moderation [6].

In the next section, we describe our experimental setting for adapting and evaluating TAR for content moderation.

Figure 1: Example content in the collections. (a) Wikipedia collection: "shut up mind your own business and go f*** some one else over". (b) ASKfm collection: ": being in love with a girl you dont even know yours is sadder" / ": f*** of you f***ing c***!"

4. Experiment Design

Here we review the data sets, evaluation metric, and implementation details for our experiment.

4.1. Data Sets

We used two fully labeled and publicly available content moderation data sets with a focus on inappropriate user-generated content. The Wikipedia personal attack data set [32] consists of 115,737 Wikipedia discussion comments with labels obtained via crowdsourcing. An example comment is presented in Figure 1(a). Eight annotators assigned one of five mutually exclusive labels to each document: Recipient Target, Third Party Target, Quotation Attack, Other Attack, and No Attack (our names). We defined three binary classification tasks corresponding to distinguishing Recipient Target, Third Party Target, or Other Attack from all other classes. (Quotation Attack had too low a prevalence.) A fourth binary classification task distinguished the union of all attacks from No Attack. A document was a positive example if 5 or more annotators put it in the positive class. The proportion of the positive class ranged from 13.44% to 0.18%.

The ASKfm cyberbullying dataset [29] contains 61,232 English utterance/response pairs, each of which we treated as a single document. An example conversation is presented in Figure 1(b). Linguists annotated both the poster and responder with zero or one of four mutually exclusive cyberbullying roles, as well as annotating the pair as a whole for any combination of 15 types of textual expressions related to cyberbullying. We treated these annotations as defining 23 binary classifications for a pair, with prevalence of the positive examples ranging from 4.63% to 0.04%.

For both data sets we refer to the binary classification tasks as topics and the units being classified as documents.

Documents were tokenized by separating at punctuation and whitespace. Each distinct term became a feature. We used log tf weighting as the features for the underlying classification model. The value of a feature was 0 if the term was not present, and otherwise 1 + log(tf), where tf is the number of occurrences of that term in the document.

4.2. Algorithms and Workflow

Our experiments simulated a typical TAR workflow. The first training round is a seed set consisting of one random positive example (simulating manual input) and one random negative example. At the end of each round, a logistic regression model was trained and applied to the unlabeled documents. The training batch for the next round was then selected by one of three methods: a random sampling baseline, uncertainty sampling [52], or relevance feedback (top scoring documents) [51]. Variants of the latter two are widely used in eDiscovery [57].

Labels for the training batch were looked up, the batch was added to the training set, and a new model was trained to repeat the cycle. Batches of size 100 and 200 were used, and training continued for 80 and 40 iterations respectively, resulting in 8002 coded training documents at the end (including the two seed documents).

We implemented the TAR workflow in libact⁵ [58], an open-source framework for active learning experiments. We fit logistic regression models using Vowpal Wabbit⁶ with default parameter settings. Our experiment framework is available on GitHub⁷.

⁵ https://github.com/ntucllab/libact
⁶ https://vowpalwabbit.org/

4.3. Evaluation

Our metric was total cost to reach 80% recall as described in Section 3. This was computed at the end of each training round as the number of training documents plus the ideal second phase review cost as a penalty, which is the number of additional top-ranked documents (if any) needed to bring recall up to 80%. Ranking was based on sorting the non-training documents by probability of relevance using the most recently trained model. Note that we experimented with 80% recall as an example. However, the TAR workflow is capable of running with an arbitrary recall target, such as 95% for systematic review [18, 56].

In actual TAR workflows, recall would be estimated from a labeled random sample. Since the cost of this sample would be constant across our experimental conditions, we used an oracle for recall instead.

5. Results and Analysis

Our core finding was that, as in eDiscovery, active selection of which documents to review reduces costs over random selection. Figure 2 shows mean cost to reach 80% recall over 20 replications (different seed sets and random samples) for six representative categories. On all six categories, all TAR workflows beat, within a few iterations, the baseline of reviewing a random 80% of the data set (horizontal line labeled Manual Review).
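The cost plotted for each training round can be made concrete with a short sketch. It assumes unit cost per reviewed document and oracle access to all labels, as in a simulation; the function names `cost_to_reach_recall` and `manual_review_cost` and the toy labels are hypothetical and are not taken from the authors' code.

```python
import math

def cost_to_reach_recall(train_labels, ranked_rest_labels, target_recall=0.8):
    """Training documents reviewed plus the ideal second-phase penalty.

    train_labels: 0/1 labels of the documents reviewed during training.
    ranked_rest_labels: 0/1 labels of the remaining documents, sorted by
        descending model score (oracle labels, simulation only).
    """
    total_pos = sum(train_labels) + sum(ranked_rest_labels)
    if total_pos == 0:
        # No positives exist, so the recall target is trivially met.
        return len(train_labels)
    found = sum(train_labels)
    penalty = 0
    for label in ranked_rest_labels:
        if found / total_pos >= target_recall:
            break  # target reached; no further review is charged
        found += label
        penalty += 1
    return len(train_labels) + penalty

def manual_review_cost(n_docs, target_recall=0.8):
    """Baseline: review a random fraction of the collection (assumed rounding up)."""
    return math.ceil(target_recall * n_docs)

# Toy example: 4 training documents, 10 remaining documents in ranked order.
train = [1, 0, 0, 1]
rest = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
print(cost_to_reach_recall(train, rest))  # 4 training + 4 ranked documents
print(manual_review_cost(115737))         # baseline for a collection of this size
```

When the training set alone already reaches the target, the penalty is zero and the cost is just the number of training documents, which is why cost eventually rises again as more batches are reviewed.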

The Wikipedia Attack category is typical of low to moderate prevalence categories (prevalence 0.1344). Uncertainty sampling strongly dominates both random sampling (too few positives chosen) and relevance feedback (too many redundant positives chosen for good training). Costs decrease uniformly with additional training. We plot 99% confidence intervals under the assumption that costs are normally distributed across replicates. Costs are not only higher for relevance feedback, but also less predictable.
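As an illustration of the interval computation, the sketch below builds a 99% confidence interval for the mean cost across replicates. The text states only the normality assumption; using the standard normal quantile (rather than a Student t quantile) and the sample standard deviation is an assumption made here, and the function name and cost values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev

def normal_confidence_interval(costs, level=0.99):
    """Confidence interval for the mean of `costs`, assuming normality.

    Uses the standard normal quantile; this choice is an assumption,
    since the source does not say which quantile was used.
    """
    n = len(costs)
    center = mean(costs)
    std_err = stdev(costs) / math.sqrt(n)      # sample std. dev. over sqrt(n)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # two-sided quantile
    return center - z * std_err, center + z * std_err

# Hypothetical costs from five replicate runs at one training round.
low, high = normal_confidence_interval([10, 12, 14, 16, 18])
print(low, high)
```

A wider interval at a given round corresponds to the less predictable costs noted for relevance feedback.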

The ASKfm Curse Exclusion (prevalence 0.0169) and Wikipedia Other Attack (prevalence 0.0019) categories are typical low prevalence categories. Uncertainty sampling and relevance feedback act similarly in such circumstances: even top scoring documents are at best uncertainly positive. Average cost across replicates levels off and starts to increase after 44 iterations for uncertainty sampling and 45 iterations for relevance feedback. This is the point at which additional training no longer pays for itself by improving the ranking of documents. For this category (and typically) this occurs shortly before 80% recall is reached on the training data alone (iteration 48 for uncertainty sampling and iteration 52 for relevance feedback).

⁷ https://github.com/eugene-yang/TAR-Content-Moderation

Tasks such as the ASKfm Sexism category (prevalence 0.0030) that deal with nuances in human language require more training data to produce a stable classifier. While obtaining training data by random sampling stops reducing the cost after the first iteration, uncertainty sampling and relevance feedback continue to take advantage of additional training data to minimize the cost and become more predictable.

Note that the general relationship between the prevalence of the task and the cost of reaching a certain recall target using TAR workflows is discussed in Yang et al. [9].

Table 1 looks more broadly at the two datasets, averaging costs both over all topics and over 20 replicate runs for each topic for batch sizes of both 100 and 200. By 20 iterations with batch size of 100 (2002 training documents), TAR workflows with both relevance feedback and uncertainty sampling significantly reduce costs versus TAR with random sampling. (Significance is based on paired t-tests assuming non-identical variances and making a Bonferroni correction for 72 tests.) All three TAR methods in turn dominate reviewing a random 80% of the dataset, which costs 92,590 for Wikipedia and 90,958 for ASKfm.

The cost improvement plateaued after the training sets reached 5000 documents for ASKfm but continued for Wikipedia. Categories in Wikipedia (prevalence 0.1344 to 0.0018) are generally more frequent than those in ASKfm (prevalence 0.0463 to 0.001), providing more advantage for training to identify more positive documents. A larger batch size slightly reduces the improvement, as the underlying classifiers are retrained less frequently. In practice, batch sizes depend on the cost structure of reviewing and the specific workflows in each organization. However, as the classifiers are frequently updated with more coded documents, the total cost would be reduced over the iterations.

Besides the overall cost reduction, Figure 3 shows a heatmap of mean precision across 20 replicates for batches 1 to 81 with batch size of 100, to give insight into the moderator experience of TAR workflows. Precision for relevance feedback starts high and declines very gradually. Uncertainty sampling maintains relatively constant precision. For the very low prevalence category Curse Exclusion we cut off the heatmap at 52 iterations for relevance feedback and 48 iterations for uncertainty sampling, since on average 80% recall is obtained on training data alone by those iterations. For both categories, even uncertainty sampling, which is intended to improve the quality of the classifier, improves batch precision over the random sampling baseline.
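The differences in batch precision follow from how each strategy picks its batch. The sketch below is a minimal illustration, assuming each unlabeled document has a predicted probability of being positive; the function names, the 0.5 uncertainty midpoint, and the toy scores are assumptions for illustration, not the authors' libact configuration.

```python
import random

def select_batch(scores, batch_size, method, rng=None):
    """Return indices of the unlabeled documents to review next.

    scores: predicted probability of the positive class per document.
    method: "relevance" (top scores), "uncertainty" (closest to 0.5),
            or "random" (baseline).
    """
    indices = list(range(len(scores)))
    if method == "relevance":
        ranked = sorted(indices, key=lambda i: scores[i], reverse=True)
    elif method == "uncertainty":
        ranked = sorted(indices, key=lambda i: abs(scores[i] - 0.5))
    elif method == "random":
        rng = rng or random.Random(0)
        ranked = indices[:]
        rng.shuffle(ranked)
    else:
        raise ValueError(f"unknown method: {method}")
    return ranked[:batch_size]

def batch_precision(batch, labels):
    """Fraction of positives in a non-empty batch; this is what a moderator sees."""
    return sum(labels[i] for i in batch) / len(batch)

# Toy pool of six documents with hypothetical scores and true labels.
scores = [0.95, 0.10, 0.55, 0.48, 0.80, 0.02]
labels = [1, 0, 1, 0, 1, 0]
for method in ("relevance", "uncertainty", "random"):
    batch = select_batch(scores, 2, method)
    print(method, batch, batch_precision(batch, labels))
```

Relevance feedback fills the batch with the highest scoring documents, so its precision starts high, while uncertainty sampling draws from the decision boundary, where roughly balanced batches are expected once the model is reasonably calibrated.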

6. Summary and Future Work

Our results suggest that TAR workflows developed for legal review tasks may substantially reduce costs for content moderation tasks. Other legal workflow techniques, such as routing near duplicates and conversational threads in batches to the same reviewer, may be worth testing as well.

This preliminary experiment omitted complexities that should be explored in more detailed studies. Both content moderation and legal cases involve (at different time scales) streaming collection of data, and concomitant constraints on the time available to make a review decision. Batching and prioritization must reflect these constraints.

Moderation must in addition deal with temporal variation in both textual content and the definitions of sensitive content, as well as scaling across many languages and cultures. As litigation and investigations become more international, these challenges may be faced in the law as well, providing opportunity for the legal and moderation fields to learn from each other.

References

[27] [...], in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1668–1678.
[28] J. Pavlopoulos, P. Malakasiotis, I. Androutsopoulos, Deeper attention to abusive user content moderation, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1125–1135.
[29] C. Van Hee, G. Jacobs, C. Emmery, B. Desmet, E. Lefever, B. Verhoeven, G. De Pauw, W. Daelemans, V. Hoste, Automatic detection of cyberbullying in social media text, PloS one 13 (2018) e0203794.
[30] K. Reynolds, A. Kontostathis, L. Edwards, Using machine learning to detect cyberbullying, in: 2011 10th International Conference on Machine Learning and Applications and Workshops, volume 2, IEEE, 2011, pp. 241–244.
[31] A. Schmidt, M. Wiegand, A survey on hate speech detection using natural language processing, in: Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, 2017, pp. 1–10.
[32] E. Wulczyn, N. Thain, L. Dixon, Ex machina: Personal attacks seen at scale, in: Proceedings of the 26th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2017, pp. 1391–1399.
[33] T. Davidson, D. Warmsley, M. Macy, I. Weber, Automated hate speech detection and the problem of offensive language, in: Eleventh International AAAI Conference on Web and Social Media, 2017.
[34] N. Djuric, J. Zhou, R. Morris, M. Grbovic, V. Radosavljevic, N. Bhamidipati, Hate speech detection with comment embeddings, in: Proceedings of the 24th International Conference on World Wide Web, ACM, 2015, pp. 29–30.
[35] P. Fortuna, S. Nunes, A survey on automatic detection of hate speech in text, ACM Computing Surveys (CSUR) 51 (2018) 1–30.
[36] C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad, Y. Chang, Abusive language detection in online user content, in: Proceedings of the 25th International Conference on World Wide Web, International World Wide Web Conferences Steering Committee, 2016, pp. 145–153.
[37] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, Predicting the type and target of offensive posts in social media, arXiv preprint arXiv:1902.09666 (2019).
[38] R. Kumar, A. N. Reganti, A. Bhatia, T. Maheshwari, Aggression-annotated corpus of Hindi-English code-mixed data, in: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), 2018.
[39] G. K. Pitsilis, H. Ramampiaro, H. Langseth, Detecting offensive language in tweets using deep learning, arXiv preprint arXiv:1801.04433 (2018).
[40] S. Sotudeh, T. Xiang, H.-R. Yao, S. MacAvaney, E. Yang, N. Goharian, O. Frieder, GUIR at SemEval-2020 task 12: Domain-tuned contextualized models for offensive language detection, arXiv preprint arXiv:2007.14477 (2020).
[41] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, R. Kumar, SemEval-2019 task 6: Identifying and categorizing offensive language in social media (OffensEval), arXiv preprint arXiv:1903.08983 (2019).
[42] M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak, L. Derczynski, Z. Pitenis, Ç. Çöltekin, SemEval-2020 task 12: Multilingual offensive language identification in social media (OffensEval 2020), arXiv preprint arXiv:2006.07235 (2020).
[43] R. Binns, M. Veale, M. Van Kleek, N. Shadbolt, Like trainer, like bot? Inheritance of bias in algorithmic content moderation, in: International Conference on Social Informatics, Springer, 2017, pp. 405–415.
[44] L. Dixon, J. Li, J. Sorensen, N. Thain, L. Vasserman, Measuring and mitigating unintended bias in text classification, in: Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society, 2018, pp. 67–73.
[45] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, A. Galstyan, A survey on bias and fairness in machine learning, arXiv preprint arXiv:1908.09635 (2019).
[46] D. Link, B. Hellingrath, J. Ling, A human-is-the-loop approach for semi-automated content moderation, in: ISCRAM, 2016.
[47] N. M. Pace, L. Zakaras, Where the money goes: Understanding litigant expenditures for producing electronic discovery, RAND Corporation, 2012.
[48] M. Bagdouri, W. Webber, D. D. Lewis, D. W. Oard, Towards minimizing the annotation cost of certified text classification, in: CIKM 2013, ACM, 2013, pp. 989–998.
[49] G. V. Cormack, M. R. Grossman, Autonomy and reliability of continuous active learning for technology-assisted review, arXiv preprint arXiv:1504.06868 (2015).
[50] B. Settles, Active learning literature survey (2009).
[51] J. Rocchio, Relevance feedback in information retrieval, The Smart retrieval system-experiments in automatic document processing (1971) 313–323.
[52] D. D. Lewis, W. A. Gale, A sequential algorithm for training text classifiers, in: SIGIR 1994, 1994, pp. 3–12.
[53] G. V. Cormack, M. R. Grossman, Engineering Quality and Reliability in Technology-Assisted Review, in: SIGIR, ACM Press, Pisa, Italy, 2016, pp. 75–84. URL: http://dl.acm.org/citation.cfm?doid=2911451.2911510. doi:10.1145/2911451.2911510.
[54] D. D. Lewis, E. Yang, O. Frieder, Certifying one-phase technology-assisted reviews (2021).
[55] E. Yang, D. D. Lewis, O. Frieder, Heuristic stopping [...], in: Proceedings of the 21st ACM Symposium on Document Engineering, 2021.
[56] Li, Kanoulas, When to stop reviewing in [...], ACM Transactions on Information Systems (TOIS) 38 (2020) 1–36.
[57] G. V. Cormack, M. R. Grossman, Evaluation of [...] review in electronic discovery, SIGIR 2014 (2014) 153–162. doi:10.1145/2600428.2609601.
[58] Y.-Y. Yang, S.-C. Lee, Y.-A. Chung, T.-E. Wu, S.-A. [...], [...] University, 2017. URL: https://github.com/ntu[...] [...]//arxiv.org/abs/1710.00379.