<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Using the Web as a Data Source: Challenges for Linked Science</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Advanced Research of Spatial Information and Department of Geography, Hunter College, City University of New York, New York</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Web makes access to data of interest for disciplines such as geography, sociology, economics, or linguistics almost instantaneous, removing the barrier of lengthy and costly data collection. Using such data from the Web is problematic in terms of the validity, transparency, and reproducibility of the corresponding research, though, as little is known about the subject population and access to the data is often under control of private corporations. This paper discusses the implications of such research and points out potential solutions in the context of Linked Science.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Carsten Keßler</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>
        Data collection has traditionally been the bottleneck of many research endeavors,
and gathering the data required to test a hypothesis may still take months, years,
or even decades. Physicists need to design and construct complex
instrumentation in order to prove the existence of subatomic particles, and astronomers may
even have to plan a mission into space to do their work. Other fields, however,
are benefiting from an abundance of data at their fingertips, often only a call
to a handful of API functions or web services away. Datasets collected through
social media platforms such as Facebook or Twitter and collaborative,
voluntary efforts such as Wikipedia or OpenStreetMap enable almost instantaneous
research across a number of fields. Likewise, the sensor web [
        <xref ref-type="bibr" rid="ref3 ref7">7, 3</xref>
        ] offers access
to an ever-growing number of real-time data streams. Researchers in geography,
sociology, economics, and linguistics, to name but a few, are already using those
resources (see, for example, [8, 10, 1, 2]).
      </p>
      <p>This discussion paper raises some of the issues that come with the access
to those data sources and their use in scientific research. It discusses the
implications for validity of the corresponding studies and their reproducibility, and
draws conclusions in the context of Linked Science.</p>
      <p>[...] subjects can easily find correlations that do not exist at a larger scale if the
population investigated is not balanced with respect to the parameters under
consideration. As an example, if the goal of a study is to see if there is a
correlation between someone's age and the likelihood that they like red wine,<sup>1</sup> the
population under consideration must be balanced in terms of other personal and
social attributes, such as gender, race, education, and income. If the sample
population in our hypothetical study is not chosen appropriately, consisting largely of
older women and younger men, the results might indicate that there is a stronger
preference for red wine in older people than in younger people, when in reality,
it might be that women prefer red wine more often than men.</p>
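      <p>The red wine example can be made concrete with a small simulation. The following is only a minimal sketch under stated assumptions: the preference rates (70 percent of women, 30 percent of men), the group sizes, and all function names are hypothetical and chosen purely for illustration; none of them is taken from an actual study. By construction, age has no effect on the preference, yet the unbalanced sample shows an apparent age effect that disappears in the balanced one.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

# Hypothetical assumption: the preference depends on gender only, not on age.
LIKE_PERCENT = {"female": 70, "male": 30}


def make_group(gender, age_group, size):
    """Return `size` records; exactly LIKE_PERCENT[gender] percent like red wine."""
    likers = size * LIKE_PERCENT[gender] // 100
    return [
        {"gender": gender, "age_group": age_group, "likes_red_wine": i < likers}
        for i in range(size)
    ]


def build_sample(counts):
    """Build a sample from a {(gender, age_group): size} mapping."""
    records = []
    for (gender, age_group), size in counts.items():
        records.extend(make_group(gender, age_group, size))
    return records


def like_rate_by(records, key):
    """Share of participants who like red wine, per value of `key`."""
    totals = defaultdict(int)
    likers = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        likers[record[key]] += record["likes_red_wine"]  # True counts as 1
    return {value: likers[value] / totals[value] for value in totals}


# Unbalanced sample: mostly older women and younger men.
biased = build_sample({
    ("female", "older"): 900, ("male", "older"): 100,
    ("female", "younger"): 100, ("male", "younger"): 900,
})

# Balanced sample: gender is evenly distributed within each age group.
balanced = build_sample({
    ("female", "older"): 500, ("male", "older"): 500,
    ("female", "younger"): 500, ("male", "younger"): 500,
})

print("biased, by age:  ", like_rate_by(biased, "age_group"))
print("balanced, by age:", like_rate_by(balanced, "age_group"))
print("biased, by gender:", like_rate_by(biased, "gender"))
```
]]></preformat>
      <p>Under these assumptions, the unbalanced sample reports a preference of 66 percent among older and 34 percent among younger participants, whereas the balanced sample reports 50 percent for both groups. Grouping the unbalanced sample by gender recovers the 70 and 30 percent that were built into the simulation.</p>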
      <p>This fictional example shows that careful design of the sample population is
crucial to come to valid conclusions.<sup>2</sup> Research that draws heavily from social
media data, however, is especially prone to this problem, because (a) too little is
known about the participants in a study, and (b) the attributes known about the
participants are very hard to verify. Moreover, different social media platforms
are often used by user groups with different, but distinct, profiles. After all, social
media is being used to socialize with peers, who will often have at least some
demographic properties in common. It is hence difficult to use social media data
for studies that are supposedly saying something about the general population.
In reality, many of these studies are most likely really only saying something
about the users of a specific social media service; this is along the lines of the
old joke that many psychology studies do not really say anything about the
general public, but a lot about psychology students, as this is the main group
taking part in their human participants tests.</p>
    </sec>
    <sec id="sec-3">
      <title>Reproducibility Issues</title>
      <p>
        Using social media data in research also entails problems for the reproducibility
of any studies. User profiles and the data available are in constant flux, so
that it is virtually impossible to replicate a study with the same set of users.
While studies that test general statements about the user population of a service
("Are Facebook users more inclined to conservative political positions than
Twitter users?") can be replicated in principle, this is only possible as long as the
service is available, makes the required information available through its API,
and maintains a large user base; all of these factors are outside of the control of
the investigator. Finally, data archiving is problematic because in their terms of
service, many social networks prohibit making local copies of their data obtained
through APIs. Even if such archiving is permitted, the large volume of the data
can make archiving difficult or at least expensive.
      </p>
      <p><sup>1</sup> Clinite [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] suggests that there is no such relationship.</p>
      <p><sup>2</sup> This is a particular pitfall for studies that aim at finding a certain correlation
predicted in the hypothesis, and ultimately lead to constructed correlations. The website
http://www.tylervigen.com/spurious-correlations has some very obvious, yet
entertaining examples of such constructed correlations.
      </p>
      <p>
        Some recent studies have also raised transparency concerns. They have been
conducted by some social networks' in-house research teams who had access to
data that is not available to anybody outside of the company at that scale [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
This renders reproducibility completely impossible and forces outside reviewers
and researchers to blindly trust the stated results. It also raises the question
whether such results should be accepted for publication in the first place. For the
studies cited above, the reviewers have decided that the community should know
about this research, despite the fact that the basic principle of reproducibility
has been violated. The research community will have to come to a consensus for
handling such cases as more and more potentially interesting data is collected
by private, commercial services that do not provide outside researchers access
to their main asset.
      </p>
    </sec>
    <sec id="sec-4">
      <title>Implications and Conclusions for Linked Science</title>
      <p>
        Openness, transparency, and reproducibility are core principles of Linked
Science [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. While the scientific community is developing and testing approaches
and technologies for data publishing and archiving, these do not work well for
research on data from sources such as social media or sensor networks. This kind
of data can be hard to archive and make publicly accessible because of the sheer
volume, or because of restrictive terms of service of commercial providers. In
order to address the latter point, a legislative effort may be required that legalizes
data archiving from publicly accessible APIs for research purposes. The
scientific community also needs to decide whether it wants to make reproducibility
optional, allowing researchers at commercial enterprises to report on findings
that no one else can verify or reproduce.
      </p>
      <p>A stricter enforcement of reproducibility and transparency principles for the
acceptance of journal and conference submissions is required to solve this
problem. The Linked Science principles of semantically annotating, interconnecting,
and publishing scientific resources show that the technologies for these processes
are already there. Efforts to develop executable papers that automatically
perform the data analysis steps of a study show that the added value of providing
these resources goes beyond theoretical reproducibility: they actually reproduce
the data analysis. Enforcing the publishing of these resources will also require
legal certainty for researchers who work with data from private corporations. As
their role as data providers for research is growing, legislation is needed that
allows archiving of data obtained from their public APIs. The scientific community
hence needs to be stricter about its core principles, leveraging the
opportunities offered by technology-driven frameworks such as Linked Science, while the
legal circumstances have to be adjusted to ensure transparency without risking
lawsuits.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Ballatore</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bertolotto</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wilson</surname>
            ,
            <given-names>D.C.</given-names>
          </string-name>
          :
          <article-title>Geographic knowledge extraction and semantic similarity in OpenStreetMap</article-title>
          .
          <source>Knowledge and Information Systems</source>
          <volume>37</volume>
          (
          <issue>1</issue>
          ),
          <fpage>61</fpage>
          –
          <lpage>81</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Benson</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haghighi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barzilay</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Event discovery in social media feeds</article-title>
          .
          <source>In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1</source>
          . pp.
          <fpage>389</fpage>
          –
          <lpage>398</lpage>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Botts</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Percivall</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Davidson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>OGC® sensor web enablement: Overview and high level architecture</article-title>
          .
          <source>In: GeoSensor networks</source>
          , pp.
          <fpage>175</fpage>
          –
          <lpage>190</lpage>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Burke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kraut</surname>
            ,
            <given-names>R.E.</given-names>
          </string-name>
          :
          <article-title>Growing closer on Facebook: changes in tie strength through social network site use</article-title>
          .
          <source>In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems</source>
          . pp.
          <fpage>4187</fpage>
          –
          <lpage>4196</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Burks</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zadeh</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Rapid estimate of ground shaking intensity by combining simple earthquake characteristics with tweets</article-title>
          .
          <source>In: 10th US Nat. Conf. Earthquake Eng.</source>
          , Front. Earthquake Eng., Anchorage, AK, USA, Jul. 21–25 (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Clinite</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>The Preferences in Wine of Various Aged Consumers</article-title>
          .
          <source>Bachelor thesis</source>
          , California Polytechnic State University, San Luis Obispo (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Delin</surname>
            ,
            <given-names>K.A.</given-names>
          </string-name>
          :
          <article-title>The sensor web: A macro-instrument for coordinated sensing</article-title>
          .
          <source>Sensors</source>
          <volume>2</volume>
          (
          <issue>7</issue>
          ),
          <fpage>270</fpage>
          –
          <lpage>285</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Ellison</surname>
            ,
            <given-names>N.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steinfield</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lampe</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>The benefits of Facebook friends: social capital and college students' use of online social network sites</article-title>
          .
          <source>Journal of Computer-Mediated Communication</source>
          <volume>12</volume>
          (
          <issue>4</issue>
          ),
          <fpage>1143</fpage>
          –
          <lpage>1168</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Kauppinen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baglatzi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Keßler</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Linked science: Interconnecting scientific assets</article-title>
          . In:
          <string-name>
            <surname>Critchlow</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dam</surname>
            ,
            <given-names>K.K.V.</given-names>
          </string-name>
          (eds.)
          <source>Data Intensive Science</source>
          , pp.
          <fpage>383</fpage>
          –
          <lpage>400</lpage>
          . CRC Press, USA (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Sakaki</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Okazaki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Matsuo</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Earthquake shakes Twitter users: real-time event detection by social sensors</article-title>
          .
          <source>In: Proceedings of the 19th international conference on World wide web</source>
          . pp.
          <fpage>851</fpage>
          –
          <lpage>860</lpage>
          .
          <publisher-name>ACM</publisher-name>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>