<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enabling Next Generational Social Science with Machine Reading</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Scott Appling</string-name>
          <email>scott.appling@gtri.gatech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erica Briscoe</string-name>
          <email>erica.briscoe@gtri.gatech.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Georgia Institute of Technology</institution>
          ,
          <addr-line>Atlanta, GA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <abstract>
        <p>The social science research process has traditionally required researchers to engage in a largely manual information-seeking process, followed by manual analysis to extrapolate trends from past work into study design, including hypothesis generation and variable declaration. Across several computational disciplines, including probabilistic relational learning and machine reading, we see an opportunity to advance and significantly improve the social science research process in a world where scientific textual data accrues on a yearly, if not daily, basis. Here we articulate the problem we see with publishing scientific findings in largely unstructured natural language text, along with our perspective on how micro- and macro-reading methods, together with the work being done on the scientific research cycle itself, can drive better and more efficient research across all of science.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Information systems applications; Data
mining; • Computing methodologies → Natural language
processing; Information extraction;</p>
    </sec>
    <sec id="sec-2">
      <title>INTRODUCTION</title>
      <p>
        The social science research process, and more generally the
scientific research process, is a set of steps, forming a cycle, that
researchers within the social sciences generally take as they
conduct research in their sub-fields of interest. The process
usually starts in the model step (see Figure 1 for our working
definition of this process) with one or more questions of scientific inquiry
that a researcher wants to formally investigate; the researcher
begins by considering prior literature and scaffolding hypotheses, which
is seen as the start of a research cycle. These ‘investigations’ take
many forms (e.g. qualitative, quantitative, theoretical, conceptual)
and sub-types (e.g. causal, non-causal). Depending on the type of
investigation (for example, an experimental design with
hypotheses and analyses testing the effects of an independent variable on
a dependent variable), different levels of background context are
needed by the researcher to appropriately design such a study.
      </p>
      <p>
The research process itself, conceived and refined over hundreds
of years, typically allows for new research to be designed and
conducted by building on past knowledge. It is, however, within the
past 60 years that the sheer magnitude of the scientific data being
observed and collected has resulted in an inability for researchers
to keep up and fully utilize it all. Perhaps as a symptom of this,
or as the global workforce has slowly shifted away from physical
labor jobs towards those of science and engineering, the
scientific literature has been growing faster every year,
whereas the amount of time researchers have to discover, digest,
and synthesize new research directions has not increased [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
The state of the research process is such that individual researchers,
like professionals in other STEM fields, are stuck with a massive data
dilemma. As this happens, the ability to conduct future research
begins to suffer from different kinds of problems, e.g. those related
to information-seeking behaviors [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] or those related to the ways
experiment designs are constructed [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Researchers are often left choosing between what appears
within the first couple of pages of their search platform’s results and
spending vast amounts of time trying to discover related terms (and
consequently, studies) that should likely be considered as a part of
their literature review and their hypothesis and experiment planning
activities. Figure 2 is but one example of a bibliometric database’s
growth over the past several years; overall there is an increase from
year to year as more research publications are produced. Although in
recent years there has been a push to create better bibliometric tools,
citation search engines, and recommendation systems
(e.g. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]), the problem is no longer only one of finding the most relevant
papers. Brought out of the background is the problem of what to do with
those papers, given that the researcher cannot read, and perform the
requisite level of critical thinking and analysis on, all or likely even
a small percentage of the papers surfaced in a normal literature
review process. We believe new methods and human-machine
processes are needed to enable the next generation of human-driven
scientific analysis: those that go beyond recommending papers
to read and instead work collaboratively with human researchers
to organize and aggregate findings towards the development of
new research directions and experimentation.
      </p>
    </sec>
    <sec id="sec-4">
      <title>MORE EFFICIENT RESEARCH CYCLES WITH MACHINE READING</title>
      <p>
        Given, for example, a set of research papers (e.g. 20-40 papers)
on a particular variable or construct of interest, each averaging
8-12 pages, the researcher may spend between 2 and 4 days
annotating and synthesizing what amounts to a meta-analysis over
the set of papers to find the information they need to perform the
necessary critical thinking that drives hypothesis formation (taking
place in the predict step). If instead there were semi-automated
processes that, together with the researcher, extracted variables of
interest, relationships, and experimental trends<sup>1</sup>, then a
significant amount of time could be saved from, among others, the
traditional literature review and analysis tasks that occur during a
research cycle; days of manual annotation and relationship
summarization would be reduced to minutes or hours. This is in
fact an area where both macro- and micro-reading techniques can
play a significant role. During macro-reading activities, a collection
of research articles is skimmed to extract broad phenomena like
variables or methods used in specific articles (e.g. [
        <xref ref-type="bibr" rid="ref10 ref7">7, 10</xref>
        ]) while
micro-reading activities are focused on specific passages of the
scientific articles to extract hypotheses and result interpretations (e.g.
[
        <xref ref-type="bibr" rid="ref11 ref9">9, 11</xref>
        ]). These results can then be used to automatically generate both
structured representations of scientific findings and human-readable
natural language reports.
      </p>
    </sec>
    <sec id="sec-5">
      <title>CONCLUSIONS</title>
      <p>
        The amount of scientific data being generated is growing at a faster
rate every year, and the human ability to sufficiently include
and reason over these vast amounts of knowledge is already being
challenged. Gone are the days when research in sub-disciplines grew
at a slow and steady rate and when researchers and their graduate
students could adequately review and synthesize findings as they
built on prior work. And whereas some would say that the amount
of data being generated bids farewell to traditional scientific
methods and processes [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we take an opposing view and argue
that it is not the process or methods but the accessibility of the
results to our analysis tools that impedes new rates of progress; we
see the incorporation of machine reading research and methods
(along with work from other related fields [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], i.e. research
on the scientific process itself) into the
scientific finding disclosure process, still largely in unstructured
natural language text, as a useful means to enable more efficient
and, indeed, next generational science.
      </p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGMENTS</title>
      <p>
        This material is based upon work supported by the Defense
Advanced Research Projects Agency (DARPA).
      </p>
      <p>
        <sup>1</sup> We see here a need for continued work on the design and development of
scientific research registration processes and conceptual taxonomies (see e.g. [
        <xref ref-type="bibr" rid="ref12 ref2">2, 12</xref>
        ]).
      </p>
      <p>
        <sup>2</sup> E.g. towards taxonomy development for appropriately labeling scientific concepts
and relationships.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Chris</given-names>
            <surname>Anderson</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>The end of theory: The data deluge makes the scientific method obsolete</article-title>
          .
          <source>Wired Magazine</source>
          <volume>16</volume>
          ,
          <issue>7</issue>
          (
          <year>2008</year>
          ),
          <fpage>16</fpage>
          -
          <lpage>07</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Kwame</given-names>
            <surname>Asante</surname>
          </string-name>
          , Eric Barbour, Lauren Barker, Melanie Benjamin,
          <string-name>
            <given-names>Sara D</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew P</given-names>
            <surname>Boughton</surname>
          </string-name>
          , Erin Braswell, Chelsea Chandler, Nan Chen, Sam Chrisinger, et al.
          <year>2017</year>
          . Open Science Framework. (May
          <year>2017</year>
          ). osf.io/4znzp
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Elizabeth</given-names>
            <surname>Dyas</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Scopus, Science Direct, and Mendeley</article-title>
          . (
          <year>2014</year>
          ). https://www.slideshare.net/nulibrary/scopus-sciencedirect-and-mendeley Presentation.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Daniele</given-names>
            <surname>Fanelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Rodrigo</given-names>
            <surname>Costas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>John P. A.</given-names>
            <surname>Ioannidis</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Meta-assessment of bias in science</article-title>
          .
          <source>Proceedings of the National Academy of Sciences</source>
          <volume>114</volume>
          ,
          <issue>14</issue>
          (
          <year>2017</year>
          ),
          <fpage>3714</fpage>
          -
          <lpage>3719</lpage>
          . https://doi.org/10.1073/pnas.1618569114 arXiv:http://www.pnas.org/content/114/14/3714.full.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Timo</given-names>
            <surname>Hannay</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Science's Big Data Problem</article-title>
          .
          (Aug
          <year>2015</year>
          ). https://www.wired.com/insights/2014/08/sciences-big-data-problem/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Gary</given-names>
            <surname>King</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Lam</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Margaret E</given-names>
            <surname>Roberts</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Computer-Assisted Keyword and Document Set Discovery from Unstructured Text</article-title>
          .
          <source>American Journal of Political Science</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] Tom M Mitchell, Justin Betteridge, Andrew Carlson, Estevam Hruschka, and Richard Wang.
          <year>2009</year>
          .
          <article-title>Populating the semantic web by macro-reading internet text</article-title>
          .
          In
          <source>International Semantic Web Conference</source>
          . Springer,
          <fpage>998</fpage>
          -
          <lpage>1002</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Mai T</given-names>
            <surname>Pham</surname>
          </string-name>
          , Lisa Waddell, Andrijana Rajić, Jan M Sargeant,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Papadopoulos</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Scott A</given-names>
            <surname>McEwen</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Implications of applying methodological shortcuts to expedite systematic reviews: three case studies using systematic reviews from agri-food public health</article-title>
          .
          <source>Research Synthesis Methods</source>
          <volume>7</volume>
          ,
          <issue>4</issue>
          (
          <year>2016</year>
          ),
          <fpage>433</fpage>
          -
          <lpage>446</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Chris</given-names>
            <surname>Quirk</surname>
          </string-name>
          and
          <string-name>
            <given-names>Hoifung</given-names>
            <surname>Poon</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Distant Supervision for Relation Extraction beyond the Sentence Boundary</article-title>
          .
          <source>Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics</source>
          <volume>1</volume>
          , Long Papers
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Sonse</given-names>
            <surname>Shimaoka</surname>
          </string-name>
          , Pontus Stenetorp, Kentaro Inui, and
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Riedel</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Neural Architectures for Fine-grained Entity Type Classification</article-title>
          .
          <source>Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics</source>
          <volume>1</volume>
          , Long Papers
          (
          <year>2016</year>
          ),
          <fpage>1271</fpage>
          -
          <lpage>1280</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Mihai</given-names>
            <surname>Surdeanu</surname>
          </string-name>
          , Julie Tibshirani, Ramesh Nallapati, and
          <string-name>
            <given-names>Christopher D</given-names>
            <surname>Manning</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Multi-instance multi-label learning for relation extraction</article-title>
          .
          In
          <source>Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning</source>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          ,
          <fpage>455</fpage>
          -
          <lpage>465</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] Anna Elisabeth van 't Veer and Roger Giner-Sorolla
          .
          <year>2016</year>
          .
          <article-title>Pre-registration in social psychology - A discussion and suggested template</article-title>
          .
          <source>Journal of Experimental Social Psychology</source>
          <volume>67</volume>
          ,
          Supplement C
          (
          <year>2016</year>
          ),
          <fpage>2</fpage>
          -
          <lpage>12</lpage>
          . https://doi.org/10.1016/j.jesp.2016.03.004
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>