Enabling Next Generational Social Science with Machine Reading

Scott Appling
Georgia Institute of Technology
Atlanta, GA
scott.appling@gtri.gatech.edu

Erica Briscoe
Georgia Institute of Technology
Atlanta, GA
erica.briscoe@gtri.gatech.edu

K-CAP2017 Workshops and Tutorials Proceedings, 2017
©2017 Copyright held by the owner/author(s).

ABSTRACT
The social science research process has traditionally required researchers to engage in a largely manual information seeking process, followed by manual analysis, to extrapolate trends from past work into the study design process, including hypothesis generation and variable declaration. Across several computational disciplines, including probabilistic relational learning and machine reading, we see an opportunity to advance and significantly improve the social science research process in a world where more and more scientific textual data accrues on a yearly, if not daily, basis. Here we present an articulation of the problem we see with publishing scientific findings in largely unstructured natural language text, along with our perspective on how both micro- and macro-reading methods can play a role, together with the work being done on the scientific research cycle itself, to drive better and more efficient research across all of science.

CCS CONCEPTS
• Information systems → Information systems applications; Data mining; • Computing methodologies → Natural language processing; Information extraction;

KEYWORDS
Science of Machine Reading

1 INTRODUCTION
The social science research process, and more generally the scientific research process, is a general set of steps, forming a cycle, that researchers within the social sciences take as they engage in and conduct research in their sub-fields of interest. The process usually starts in the model step (see Figure 1 for our working definition of this process) with one or more questions of scientific inquiry that a researcher wants to formally investigate, where the researcher begins considering prior literature and scaffolding hypotheses; this is seen as the start of a research cycle. These 'investigations' take many forms (e.g. qualitative, quantitative, theoretical, conceptual) and sub-types (e.g. causal, non-causal). Depending on the type of investigation, for example an experimental design with hypotheses and analyses testing the effects of an independent variable on a dependent variable, different levels of background context are needed by the researcher to appropriately design such a study.

Figure 1: Research Cycle

2 THE PROBLEM
The research process itself, conceived and refined over hundreds of years, typically allows for new research to be designed and conducted by building on past knowledge. It is, however, within the past 60 years that the sheer magnitude of the scientific data being observed and collected has left researchers unable to keep up with and fully utilize it all. Perhaps as a symptom of this, or as the global workforce has slowly shifted away from physical labor jobs towards those of science and engineering, the yearly growth of the scientific literature has been rapidly increasing, whereas the amount of time researchers have to discover, digest, and synthesize new research directions has not [5]. The state of the research process is such that individual researchers are stuck with the massive data dilemma, like professionals in other STEM fields. As this happens, the ability to conduct future research begins to suffer from different kinds of problems, e.g. those related to information seeking behaviors [8] or those related to the ways experiment designs are constructed [4].

Researchers are often left choosing between what appears within the first couple of pages of their search platform's results and spending vast amounts of time trying to discover related terms (and consequently, studies) that should likely be considered as a part of their literature review, hypothesis, and experiment planning activities. Figure 2 is but one example of a bibliometric database's growth over the past several years; overall there is an increase from year to year as more research publications are produced. Although recent years have seen a push to create better bibliometric tools, citation search engines, and recommendation systems (e.g. [6]), the problem now brought out of the background is no longer finding the most relevant papers but what to do with them, given that the researcher cannot read, and perform the requisite level of critical thinking and analysis on, all or even likely a small percentage of the papers produced in a normal literature review process. We believe methods and new human-machine processes are needed to enable the next generation of human-driven scientific analysis: those that go beyond recommending papers to read and instead work collaboratively with human researchers to organize and aggregate findings towards the development and creation of new research directions and experimentation.

Figure 2: Scopus bibliographic data article growth. From [3]

3 MORE EFFICIENT RESEARCH CYCLES WITH MACHINE READING
Given, for example, several research papers (e.g. 20-40 papers) on a particular variable or construct of interest, averaging between 8 and 12 pages each, the researcher may spend between 2 and 4 days annotating and synthesizing what would amount to a meta-analysis over the set of papers to find the information they need to perform the critical thinking that drives hypothesis formation (taking place in the predict step). If instead there were semi-automated processes that, together with the researcher, extracted variables of interest, relationships, and experimental trends¹, then a significant amount of time could be saved from, among others, the traditional literature review and analysis tasks that occur during a research cycle; suddenly days of manual annotation and relationship summarization are reduced to minutes or hours. This is in fact an area where both macro- and micro-reading techniques can play a significant role. During macro-reading activities, a collection of research articles is skimmed to extract broad phenomena like variables or methods used in specific articles (e.g. [7, 10]), while micro-reading activities focus on specific passages of the scientific articles to extract hypotheses and result interpretations (e.g. [9, 11]). These results are used to automatically generate both structured representations of scientific findings and human-readable natural language reports.

4 CONCLUSIONS
The amount of scientific data being generated is growing at a faster rate every year, and the human ability to sufficiently include and reason over these vast amounts of knowledge is already being challenged. Gone are the days when research in sub-disciplines grew at a slow and steady rate and researchers and their graduate students could adequately review and synthesize findings as they built on prior works. And whereas some would say that the amount of data being generated bids farewell to traditional scientific methods and processes [1], we take an opposing view and argue that it is not the process or methods but the accessibility of the results to our analysis tools that impedes new rates of progress. We see the incorporation of machine reading research and methods (along with work from other, related fields [12], i.e. research on the scientific process itself²) to introduce structure over the scientific finding disclosure process, still largely in unstructured natural language text, as a useful means to enable more efficient and, indeed, next generational science.

¹ We see here a need for continued work related to the design and development of scientific research registration processes and conceptual taxonomies (see e.g. [2, 12]).
² E.g. towards taxonomy development for appropriately labeling scientific concepts and relationships.

ACKNOWLEDGMENTS
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA).

REFERENCES
[1] Chris Anderson. 2008. The end of theory: The data deluge makes the scientific method obsolete. Wired magazine 16, 7 (2008), 16–07.
[2] Kwame Asante, Eric Barbour, Lauren Barker, Melanie Benjamin, Sara D Bowman, Andrew P Boughton, Erin Braswell, Chelsea Chandler, Nan Chen, Sam Chrisinger, et al. 2017. Open Science Framework. (May 2017). osf.io/4znzp
[3] Elizabeth Dyas. 2014. Scopus, Science Direct, and Mendeley. (2014). https://www.slideshare.net/nulibrary/scopus-sciencedirect-and-mendeley Presentation.
[4] Daniele Fanelli, Rodrigo Costas, and John P. A. Ioannidis. 2017. Meta-assessment of bias in science. Proceedings of the National Academy of Sciences 114, 14 (2017), 3714–3719. https://doi.org/10.1073/pnas.1618569114 arXiv:http://www.pnas.org/content/114/14/3714.full.pdf
[5] Timo Hannay. 2015. Science's Big Data Problem. (Aug 2015). https://www.wired.com/insights/2014/08/sciences-big-data-problem/
[6] Gary King, Patrick Lam, and Margaret E Roberts. 2017. Computer-Assisted Keyword and Document Set Discovery from Unstructured Text. American Journal of Political Science (2017).
[7] Tom M Mitchell, Justin Betteridge, Andrew Carlson, Estevam Hruschka, and Richard Wang. 2009. Populating the semantic web by macro-reading internet text. In International Semantic Web Conference. Springer, 998–1002.
[8] Mai T Pham, Lisa Waddell, Andrijana Rajić, Jan M Sargeant, Andrew Papadopoulos, and Scott A McEwen. 2016. Implications of applying methodological shortcuts to expedite systematic reviews: three case studies using systematic reviews from agri-food public health. Research synthesis methods 7, 4 (2016), 433–446.
[9] Chris Quirk and Hoifung Poon. 2016. Distant Supervision for Relation Extraction beyond the Sentence Boundary. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics 1, Long Papers (2016).
[10] Sonse Shimaoka, Pontus Stenetorp, Kentaro Inui, and Sebastian Riedel. 2016. Neural Architectures for Fine-grained Entity Type Classification. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics 1, Long Papers (2016), 1271–1280.
[11] Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D Manning. 2012. Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning. Association for Computational Linguistics, 455–465.
[12] Anna Elisabeth van 't Veer and Roger Giner-Sorolla. 2016. Pre-registration in social psychology – A discussion and suggested template. Journal of Experimental Social Psychology 67, Supplement C (2016), 2–12. https://doi.org/10.1016/j.jesp.2016.03.004