=Paper=
{{Paper
|id=Vol-2173/paper7
|storemode=property
|title=Moving Disambiguation of Regulations from the Cathedral to the Bazaar
|pdfUrl=https://ceur-ws.org/Vol-2173/paper7.pdf
|volume=Vol-2173
|authors=Manasi Patwardhan,Richa Sharma,Abhishek Sainani,Shirish Karande,Smita Ghaisas
|dblpUrl=https://dblp.org/rec/conf/hcomp/PatwardhanSSKG18
}}
==Moving Disambiguation of Regulations from the Cathedral to the Bazaar==
Manasi Patwardhan, Richa Sharma, Abhishek Sainani, Shirish Karande, Smita Ghaisas

TCS Research, 54-B Hadapsar Industrial Estate, Pune 411013, India. manasi.patwardhan@tcs.com

Copyright © 2018 for this paper by its authors. Copying permitted for private and academic purposes.

===Abstract===
Regulatory compliance is critical to the existence, continuity, and credibility of businesses. Regulations, however, are ridden with ambiguities that make their comprehension a challenge that seems surmountable only by experts. Experts' involvement in understanding regulatory requirements for every software development project is expensive and not scalable. Having software engineers perform disambiguation of such requirements would be a great value addition. We present our design of a 3-step crowdsourcing workflow that aims to convert the task of disambiguation into a series of micro-tasks to be performed by a crowd of software engineers. We demonstrate that the outcome of this workflow is on par with expert-enabled disambiguation at 4.5 times lower cost.

===Introduction===
Since regulations aim to safeguard the wellbeing of citizens, they are written with great rigor and discipline to minimize incidents of violations. However, their diction is so highly specialized that it is almost incomprehensible to the business communities who need to ensure regulatory compliance. Mechanisms to assure and demonstrate regulatory compliance have been researched for a long time (Breaux, Vail, and Anton 2006). However, researchers have noted that the ambiguities in regulations pose a challenge to requirements engineers, and thus the process of deriving system requirements tends to be error prone.

Massey et al. have created a legal ambiguity taxonomy for identifying and classifying ambiguities in regulations that govern software systems (Massey et al. 2014).
In their experiments involving software engineers (undergraduate and graduate students) in resolving ambiguities, they found that the engineers could identify ambiguous terms or phrases in a regulation statement, but were not able to agree on a consistent rationale. The authors therefore suggest that software engineers need expert inputs to validate their interpretations of ambiguities (Massey et al. 2015). Involving legal experts in every software project is expensive and therefore not scalable. In our work, we explore this line of research further by involving a crowd of professional software engineers not only to identify ambiguities but also to disambiguate regulations, with the aim of finding a viable and scalable alternative to the current expensive mode of disambiguation.

We conduct a series of pilot crowdsourcing experiments that help us design a 3-step workflow composed entirely of micro-tasks. Micro-task crowdsourcing has a potential that is yet to be fully explored in the field of software engineering (Adriano and van der Hoek 2016; Weidema et al. 2016; Zhao and van der Hoek 2015; LaToza and van der Hoek 2016). We employ micro-tasking to break down the complex task of disambiguation into smaller chunks of tasks, sequentially executed as the steps of the workflow, causing less cognitive load and resulting in better quality and scalability. We use an already proven method of peer evaluation as part of the crowdsourcing workflow to produce reliable data (Goto, Ishida, and Lin 2016; Ambati, Vogel, and Carbonell 2012; Hansen et al. 2013; Huang and Fu 2013). The outcome of the micro-task executed in the i-th step of the workflow is peer-evaluated in the (i+1)-th step, ensuring successive and incremental enhancement in quality.

For other complex tasks, such as tasks in linguistics (Hong and Baker 2011) and the medical domain (Zhai et al. 2013), the use of a lay crowd to replace experts has proven to be a feasible option, leading to more scalable and less costly solutions for data collection. Along similar lines, in this work we show that the crowd annotations we receive for ambiguity detection and disambiguation, upon reaching consensus, match those made by the experts, providing a clear indication that the wisdom of software engineers can equate to that of experts. We demonstrate that our approach moves this highly specialized task of disambiguation from the Cathedral to the Bazaar (Raymond 1999) and leads to a 4.5 times reduction in the cost of experts.

===Disambiguation===
There are six distinct types of regulation ambiguities defined by Massey et al. (Massey et al. 2014): 1. Lexical, 2. Syntactic, 3. Semantic, 4. Incompleteness, 5. Vagueness, and 6. Referential. As part of this study, we have focused on the first three types of ambiguities. A term/phrase in a regulation statement is lexically ambiguous if it has multiple dictionary meanings; disambiguation here would mean explicating the exact meaning. Syntactic ambiguity points at multiple word associations leading to multiple parse trees, and disambiguation here amounts to clarifying the scope of the word association. Semantic ambiguity occurs if a statement is not self-contained, and disambiguation would mean providing the additional contextual information for interpretation. Table 1 illustrates examples of regulation statements per ambiguity type, with an ambiguous term/phrase marked. A question posed on the term would highlight the source and type of ambiguity, and a list of valid explanatory answers to the question would result in disambiguation. For our study, we sought ground truth inputs from three experts who have worked with Health Insurance Portability and Accountability Act (HIPAA) regulations (Dwyer III, Weaver, and Hughes 2004) for more than 3 years. We asked the experts to select 5 regulation statements from HIPAA, each having terms/phrases depicting the three types of ambiguities.

{| class="wikitable"
|+ Table 1: Ambiguity Examples
! Ambiguity !! Regulatory Statement (marked term in bold) !! Question !! Answers (valid answers in bold)
|-
| Lexical || Implement hardware, software, and/or procedural mechanisms that '''record''' and examine activity in information systems that contain or use electronic protected health information. || In the given sentence, what is the meaning of the word 'record'? || a) to put in writing or digital form for future use b) information stored on a computer c) best performance d) to make a permanent or official note of e) a piece of evidence from the past
|-
| Syntactic || Implement policies and procedures to address the '''final disposition of''' electronic protected health information, and/or the hardware or electronic media on which it is stored. || In the given sentence, the phrase 'final disposition of' refers to? || a) electronic protected health information b) policies c) hardware d) address e) electronic media
|-
| Semantic || Implement hardware, software, and/or procedural mechanisms that record and '''examine activity''' in information systems that contain or use electronic protected health information. || What does "examine activity" mean? || a) Keep a log of what was done b) Notify admin that something was done c) Stop/block what is being done d) Identify what was done e) Classify what was done
|}
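To make the structure of these examples concrete, the minimal sketch below models one disambiguation unit (regulation statement, marked term/phrase, ambiguity type, clarifying question, and candidate answers) as it could be passed between micro-tasks. The class and field names are illustrative assumptions, not part of the authors' implementation.

<pre>
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch only: the paper does not prescribe a data format.
# The class and field names below are hypothetical.

@dataclass
class DisambiguationUnit:
    statement: str       # full regulation statement
    marked_span: str     # ambiguous term or phrase marked by the crowd
    ambiguity_type: str  # "lexical", "syntactic", or "semantic"
    question: str        # clarifying question posed on the marked span
    answers: List[str] = field(default_factory=list)  # candidate explanatory answers

# Example instance built from the lexical row of Table 1.
lexical_example = DisambiguationUnit(
    statement=("Implement hardware, software, and/or procedural mechanisms that "
               "record and examine activity in information systems that contain "
               "or use electronic protected health information."),
    marked_span="record",
    ambiguity_type="lexical",
    question="In the given sentence, what is the meaning of the word 'record'?",
    answers=[
        "to put in writing or digital form for future use",
        "information stored on a computer",
        "best performance",
        "to make a permanent or official note of",
        "a piece of evidence from the past",
    ],
)
</pre>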
===Pilot Tasks===
We conducted pilot crowdsourcing experiments with the specific intent of evaluating design trade-offs w.r.t. cognitive load, scalability, and quality. Our experiments consisted of 3 crowdsourcing tasks to collect regulation disambiguation data for 5 regulation statements. We targeted a crowd of 30 professional software engineers with 3 to 4 years of experience (henceforth referred to as crowd workers). They were asked to perform this task during their working hours.

In the first task, we tried to achieve disambiguation in a single step. We presented regulation statements and asked the crowd workers to either write their own policy statement(s) in response to the regulations or produce policies from credible sources which comply with the regulations. These policy statements would serve as explanatory texts for disambiguation. We achieved a very low participation (3 out of 30 crowd workers) with a 27% error rate (incorrect/spam inputs) and an average completion time of 3 minutes per regulation statement, indicating a high cognitive load. To address this issue, as part of the second pilot task, we designed the disambiguation as a two-step process: (i) pose questions about the ambiguities and (ii) provide answers. We still got a low participation (4 out of 30), with a small reduction in the error rate (24%) and completion time (average 2.5 minutes per regulation statement). Thus, the reduction in cognitive load was not significant enough.

Both these pilot tasks sought textual inputs, leading to a high cognitive load. Furthermore, algorithmically evaluating consensus on textual inputs is a challenge. To address this challenge, in the third pilot task we decided to seek discrete responses rather than textual ones. We presented regulation statements and supplementary text in the form of policy statements extracted from university websites which publish their HIPAA policies (NYU). The crowd provided binary annotations indicating whether a given policy in response to a regulation seemed to implement what was intended by the regulation statement. We received increased participation (24 out of 30) with reduced completion time (average 1 minute per task), alleviating cognitive load. However, 74% of the responses were incorrect. Moreover, the design of this pilot task is not scalable, as it requires collecting policy statements for every regulation statement from web sources.

For all three pilot tasks, we noted that the tasks involved comprehension of the regulation and strategizing for compliance. The comprehension was subjective because our crowd consisted of software engineers working in different domains. Accordingly, their foci while formulating or selecting policy statements and/or posing questions as responses were different. This led to a lot of variation in the responses, making it impossible to draw a consensus on the source of ambiguity. To address this challenge, we needed to direct their attention to specific ambiguities in the regulation statements, which are indicated by specific terms or phrases.
===Workflow Design===
With the observations made from our pilot studies, we arrived at the following conclusions: (1) The complex task of disambiguation has to be divided into smaller chunks of micro-tasks, so that the reduction in cognitive load would achieve better participation and quality of inputs. (2) The micro-task design should (i) be amenable to scalability, (ii) lead to a discrete set of responses, which eases the process of achieving consensus, and (iii) highlight the source of ambiguity in a regulation statement, alleviating the problem of varying focal points. (3) There is a need to design a workflow consisting of a sequence of micro-tasks, such that the crowd responses solicited in the i-th step of the workflow are reviewed and validated by another set of crowd workers (peers) in the (i+1)-th step, followed by providing responses on the validated inputs. Such peer evaluation would ensure successive and incremental enhancements in disambiguation without expert involvement and would also achieve crowd engagement, since workers are required to rationalize their validations by providing responses. The resultant workflow is described below.

====Workflow Step 1: Marking Ambiguous Terms and Posing Questions====
In this micro-task a crowd worker is (i) presented with a regulation statement, (ii) asked to mark a (set of) term(s) and/or phrase(s) in the statement which are ambiguous, and (iii) asked to pose a (set of) question(s) for every term or phrase marked, the answer to which would cause disambiguation. We apply majority voting to find consensus on the terms/phrases. Thus, the outcome of this micro-task is a set of regulation statements with a valid set of ambiguous terms/phrases and a set of corresponding questions for each term.

====Workflow Step 2: Validating Questions and Providing Answers====
The outcome of the prior micro-task is used as input here. A crowd worker is (i) presented with a regulation statement, along with a validated ambiguous term or phrase and the corresponding set of questions, (ii) asked to validate each question for its meaning, grammar, and applicability (whether answering it actually leads to disambiguation) by providing a binary input (Valid/Invalid), and (iii) for all the questions marked as valid, asked to provide a succinct answer to the question which would cause disambiguation. We ensure that the set of crowd workers attempting this micro-task is different from those who worked on Step 1; or, if they are the same set of workers, they do not get to work on their own set of responses (questions) from the earlier step. We use majority voting for consensus on the valid set of questions.
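As a rough illustration of how the consensus and allocation rules in Steps 1 and 2 could be realized, the sketch below applies simple majority voting to binary validations and filters out workers who would be reviewing their own questions. The function names, threshold handling, and worker IDs are assumptions; the paper states only that majority voting is used and that workers do not evaluate their own responses.

<pre>
from collections import Counter
from typing import List

# Sketch under assumptions: the paper specifies majority voting and that
# workers never review their own responses, but not this exact code.

def majority_vote(votes: List[str]) -> str:
    """Return the label ('valid'/'invalid') chosen by more than half the voters."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count > len(votes) / 2 else "no-consensus"

def eligible_reviewers(question_author: str, workers: List[str]) -> List[str]:
    """Workers may not validate questions they authored themselves."""
    return [w for w in workers if w != question_author]

# Usage example with hypothetical worker IDs and votes.
votes_for_question = ["valid", "valid", "invalid", "valid", "valid"]
print(majority_vote(votes_for_question))                  # -> "valid"
print(eligible_reviewers("w07", ["w01", "w07", "w12"]))   # -> ["w01", "w12"]
</pre>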
Thus, the outcome of this step is a set of regulation statements containing a term and/or phrase marked as ambiguous for which at least one question is validated. In addition, each of these questions is accompanied by a (set of) answer(s) provided by the crowd.

====Workflow Step 3: Validating Answers====
The outcome of the prior micro-task serves as input to this micro-task. A crowd worker is asked to (i) read the regulatory statement along with the marked term or phrase, (ii) read the question posed on the marked term, and (iii) choose any subset of the answers as valid answers to the posed question, considering the context of the regulation statement. Her response to an answer is 'yes' if she thinks that the answer is valid; otherwise it is 'no'. We follow the same strategy as discussed in the prior step to select and allocate micro-tasks to the crowd workers. We use majority voting for consensus on the valid set of answers.

===Data Collection and Analysis===
For Step 1 in the workflow, we selected five regulation statements from our expert-annotated data. We remind the reader that each statement contains terms or phrases that demonstrate all three types of ambiguities. 15 crowd workers marked 46 unique terms. Of these, 19 terms were majority-voted as ambiguous. For Step 2 in the workflow, for the same set of five regulation statements, we selected 3 terms/phrases (one per ambiguity type) for which crowd consensus was achieved in Step 1 and which were in agreement with the expert annotations. We also included 3 randomly chosen questions posed by the crowd workers on each of these terms in the prior micro-task. For each of these 45 micro-tasks (5 regulatory statements * 3 terms * 3 questions) we received inputs from a distinct set of 15 crowd workers. After majority voting, we had 35 valid questions and 10 invalid questions. For Step 3, we selected the same 5 regulation statements with the same set of 3 ambiguous terms. For each term, we randomly selected 1 majority-voted question that matched with that from the experts. For every question, we randomly selected 5 answers provided in the earlier step by crowd workers. Thus, we had a total of 75 micro-tasks. For each answer, we expected 5 binary responses from 15 crowd workers. After majority voting, we had 40 answers marked as valid and 35 invalid. The crowd consensus (majority voting) results were validated by the experts. The results of all three micro-tasks are illustrated in Table 2.

{| class="wikitable"
|+ Table 2: Confusion Matrix for the Workflow (P: Precision, R: Recall, F: F-score)
! Step !! Crowd Consensus !! Expert: Valid !! Expert: Invalid !! Total !! P !! R !! F
|-
| rowspan="3" | Step 1 || Valid || 17 || 2 || 19 || rowspan="3" | 89% || rowspan="3" | 85% || rowspan="3" | 87%
|-
| Invalid || 3 || 24 || 27
|-
| Total || 20 || 26 || 46
|-
| rowspan="3" | Step 2 || Valid || 30 || 5 || 35 || rowspan="3" | 86% || rowspan="3" | 97% || rowspan="3" | 91%
|-
| Invalid || 1 || 9 || 10
|-
| Total || 31 || 14 || 45
|-
| rowspan="3" | Step 3 || Valid || 31 || 7 || 38 || rowspan="3" | 82% || rowspan="3" | 91% || rowspan="3" | 86%
|-
| Invalid || 3 || 34 || 37
|-
| Total || 34 || 41 || 75
|}

To complete these micro-tasks the crowd workers took 1 to 5 minutes, which is of the same order as the time taken for completing the pilot tasks. However, we received 100% participation, with higher quality inputs. This shows that the micro-tasking and the workflow reduced the cognitive load and achieved higher crowd engagement.
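The precision, recall, and F-scores in Table 2 follow directly from the confusion-matrix counts. The short check below recomputes them for Step 1 (17 annotations judged valid by both crowd consensus and experts, out of 19 crowd-valid and 20 expert-valid annotations); it is a verification sketch, not code from the study.

<pre>
# Recompute the Step 1 scores reported in Table 2 from its counts.
# Crowd-valid & expert-valid = 17, crowd-valid total = 19, expert-valid total = 20.
true_valid, crowd_valid_total, expert_valid_total = 17, 19, 20

precision = true_valid / crowd_valid_total   # 17/19 ~ 0.89
recall = true_valid / expert_valid_total     # 17/20 = 0.85
f_score = 2 * precision * recall / (precision + recall)  # ~ 0.87

print(f"P={precision:.0%}, R={recall:.0%}, F={f_score:.0%}")  # P=89%, R=85%, F=87%
</pre>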
====Projected Cost Analysis====
HIPAA has about 5000 regulation statements. A crowd of software engineers working for an hour daily at a rate of 4 USD per hour, spending 3 minutes per task per worker, would cost USD 86.5K for HIPAA annotations. On the other hand, legal experts working at 200 USD per hour (a rate validated by legal experts in a personal communication), at the rate of 1.5 minutes per task, would cost USD 395K (4.5 times that of the software engineers).

===Conclusion and Future Work===
We have early indications of success for disambiguating regulations by utilizing a crowd consisting of software engineers. Our approach could lead to a 4.5-fold reduction in cost compared to employing legal experts. In the future, we intend to extend this work to the other ambiguity types (referential, incompleteness, and vagueness) and to employ techniques (such as adaptive task allocation, online Expectation Maximization, and active learning) that could help acquire annotations on a large scale, so that machine/deep learning algorithms can be trained to provide automated disambiguation.

===References===
* Adriano, C. M., and van der Hoek, A. 2016. Exploring microtask crowdsourcing as a means of fault localization. arXiv preprint arXiv:1612.03015.
* Ambati, V.; Vogel, S.; and Carbonell, J. 2012. Collaborative workflow for crowdsourcing translation. In Proceedings of the ACM 2012 Conference on Computer Supported Cooperative Work, 1191–1194. ACM.
* Breaux, T. D.; Vail, M. W.; and Anton, A. I. 2006. Towards regulatory compliance: Extracting rights and obligations to align requirements with regulations. In Requirements Engineering, 14th IEEE International Conference, 49–58. IEEE.
* Dwyer III, S. J.; Weaver, A. C.; and Hughes, K. K. 2004. Health Insurance Portability and Accountability Act. Security Issues in the Digital Medical Enterprise 72(2):9–18.
* Goto, S.; Ishida, T.; and Lin, D. 2016. Understanding crowdsourcing workflow: Modeling and optimizing iterative and parallel processes. In Fourth AAAI Conference on Human Computation and Crowdsourcing.
* Hansen, D. L.; Schone, P. J.; Corey, D.; Reid, M.; and Gehring, J. 2013. Quality control mechanisms for crowdsourcing: Peer review, arbitration, & expertise at FamilySearch Indexing. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work, 649–660. ACM.
* Hong, J., and Baker, C. F. 2011. How good is the crowd at real WSD? In Proceedings of the 5th Linguistic Annotation Workshop, 30–37. Association for Computational Linguistics.
* Huang, S.-W., and Fu, W.-T. 2013. Enhancing reliability using peer consistency evaluation in human computation. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work, 639–648. ACM.
* LaToza, T. D., and van der Hoek, A. 2016. Crowdsourcing in software engineering: Models, motivations, and challenges. IEEE Software 33(1):74–80.
* Massey, A. K.; Rutledge, R. L.; Antón, A. I.; and Swire, P. P. 2014. Identifying and classifying ambiguity for regulatory requirements. In Requirements Engineering Conference (RE), 2014 IEEE 22nd International, 83–92. IEEE.
* Massey, A. K.; Rutledge, R. L.; Antón, A. I.; Hemmings, J. D.; and Swire, P. P. 2015. A strategy for addressing ambiguity in regulatory requirements. Technical report, Georgia Institute of Technology.
* New York University HIPAA policies.
* Raymond, E. 1999. The cathedral and the bazaar. Knowledge, Technology & Policy 12(3):23–49.
* Weidema, E. R.; López, C.; Nayebaziz, S.; Spanghero, F.; and van der Hoek, A. 2016. Toward microtask crowdsourcing software design work. In CrowdSourcing in Software Engineering (CSI-SE), 2016 IEEE/ACM 3rd International Workshop on, 41–44. IEEE.
* Zhai, H.; Lingren, T.; Deleger, L.; Li, Q.; Kaiser, M.; Stoutenborough, L.; and Solti, I. 2013. Web 2.0-based crowdsourcing for high-quality gold standard development in clinical natural language processing. Journal of Medical Internet Research 15(4).
* Zhao, M., and van der Hoek, A. 2015. A brief perspective on microtask crowdsourcing workflows for interface design. In Proceedings of the Second International Workshop on CrowdSourcing in Software Engineering, 45–46. IEEE Press.