<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Source: DATA.GO.JP (on Mar</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Using Open Data for Social Sciences</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Jawaharlal Nehru University</institution>
          ,
          <addr-line>New Delhi, Delhi 110067</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>26</volume>
      <issue>2021</issue>
      <fpage>0000</fpage>
      <lpage>0003</lpage>
      <abstract>
        <p>Information and communication technologies (ICT) are changing research methods in the social sciences, especially the ways of obtaining data. Online surveys and Big Data analyses are the most widely used of these. Using Open Data is another approach made possible by the spread of ICT. Open Data published by administrative agencies has potential, but it also presents some difficulties. This study discusses the practical use of Open Data and focuses on problems related to data in Portable Document Format (PDF) files. These problems seem to occur because many officers in charge of Open Data do not pay attention to the principles and practical use of Open Data. This reveals a gap between drafters and practitioners in our society.</p>
      </abstract>
      <kwd-group>
        <kwd>Research Method</kwd>
        <kwd>Social Sciences</kwd>
        <kwd>Questionnaire Survey</kwd>
        <kwd>Big Data</kwd>
        <kwd>Open Data</kwd>
        <kwd>File Format</kwd>
        <kwd>PDF</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <sec id="sec-1-1">
        <title>1.1 Questionnaire survey</title>
        <p>If a questionnaire is well designed with answer options or Likert scales, it is possible to collect
numeric data, because the results of such surveys are categorical and quantifiable. In the
past, the social sciences often relied on physical, or offline, questionnaire surveys.</p>
        <p>Nowadays, online questionnaire surveys are also a popular way of
gathering user data. Respondents can answer the questionnaire on smartphones
or tablets. However, online surveys have problems that offline surveys
do not. Typical cases are summarized in Table 1.
Online survey respondents are automatically limited to Internet users who
have registered with a survey company. We cannot reach those who do not use the
Internet or those who are not registered with any Internet service. In addition,
online direct mail is easily ignored, which leads to a low response rate. Therefore,
online surveys have been regarded as biased in sampling.</p>
        <p>
          On the other hand, conventional offline surveys have also run into problems recently.
The mail survey in Table 1 has been an efficient data collection tool since 1788 [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
These days, nuclear families, which consist only of parents and children, are
increasing, especially in urban areas. When researchers visit their houses in the daytime,
they may not be able to collect questionnaires because no family member is
at home. It is also difficult to reach young people who live alone. In
addition, people are unlikely to open the door to an unknown visitor. These
problems lead to sampling bias, because respondents become limited to those who can
respond to researchers in such circumstances.
        </p>
        <p>
          Once almost everyone uses the Internet, online surveys may have
less sampling bias than conventional surveys. Visiting respondents in an offline
survey can secure reliable responses, while online submission can reduce the Hawthorne
effect, in which respondents tend to give desirable answers that make them look more
normative [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Each of these methods has its pros and cons.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2 Big Data</title>
        <p>
          With improvements in ICT and computing, we have come to handle Big Data, that is,
huge volumes of information transactions. Many large corporations, including GAFA (Google, Amazon,
Facebook, Apple), utilize the Big Data that they collect in
their businesses [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Using Big Data has tremendous potential to benefit the social
sciences. However, such studies of Big Data remain a black box. Some
companies apply the findings to their own businesses; others optimize them as B2B
commodities. The analyses of the data and the findings are not disclosed. Though SNS
companies such as Twitter provide APIs for obtaining data, there are limitations on using
them.
        </p>
        <p>
          Companies are eager to protect their algorithms, but at the same time, the
data itself can contain sensitive information, for instance transaction data or
personal customer information. That is why many countries are developing legal
systems for personal information (Table 2). Japan amended the Act on the
Protection of Personal Information for the use of Big Data. It includes “Clear Indication of
the Purpose of Use”, “Consent of the Person on Provision to a Third Party” and
“Anonymization of Information”. The General Data Protection Regulation (GDPR)
in the EU has more rules. The DPA in Kenya and the PDP in India are said to be based on
the GDPR [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ][
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].
        </p>
        <p>[Table 2: legal systems on personal information in the USA, Japan, China, the EU, Kenya and India; * marks amended laws.]</p>
        <p>
          In June 2013, Hitachi announced that it would start a service that uses the
boarding and alighting history of Suica, the IC prepaid fare card of JR East (Japan’s largest railway company),
as Big Data and provides it as station-area marketing
information. At first, JR East claimed that it was not disclosing personal information
about its customers, but a month later it admitted selling data without their consent and apologized [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The service remained withheld as of 2021.
        </p>
        <p>
          In January 2021, the messaging app WhatsApp announced a new privacy
policy that would allow WhatsApp to share data with its parent company, Facebook.
The policy does not apply in the EU, because it would violate the GDPR [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. People encouraged each
other to move from WhatsApp to other messaging apps such as Signal or Telegram.
In the end, WhatsApp postponed the update of its privacy policy.
        </p>
        <p>
          In March 2021, it was reported that LINE, a messaging app similar to WhatsApp and dominant in
Japan, had let Chinese engineers at a Shanghai affiliate access Japanese users’ data
without informing the users [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. LINE Corporation was founded as part of a South
Korean game company.
        </p>
        <p>
          Big Data can be international, so it is important to pay attention to the latest
trends around the world. Even if a use of data is legitimate in Japan, it may violate
the GDPR in the EU [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Though these legal systems encourage the use of Big Data,
companies will be cautious about using it further. Moreover, it will take time
before Big Data is put to a broad range of academic uses.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Using Open Data</title>
      <p>
        We have already covered some of the challenges of
collecting numeric data in large amounts in the social sciences. Using Open Data could
offer an alternative way of collecting data. Knowledge is open if anyone is free
to access, use, modify, and share it [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Usually there are limits and difficulties
in accessing data, as mentioned in the previous section. Even after the data are obtained, there are
still problems such as licenses, copyright, patents or other mechanisms of control. Open
Data are free from these restrictions. Two kinds of data are expected to become
Open Data: academic data in the sciences, and social data
obtained by administrative agencies.
      </p>
      <p>Using academic Open Data is, in other words, the secondary use of data. The
data of the GSS (General Social Survey) in the USA are generally available in formats
designed for statistical programs, and the “GSS Data Explorer” allows users to test
hypotheses and look for interesting correlations directly on the website.</p>
      <p>
        Social data obtained by administrative agencies are also published and are free
for the public to access. According to a questionnaire survey in Japan, the medians
of the Open Data possession rates of governments were only 1% to 5% in each category:
Spatial Information, Agroforestry, Commerce and Industry, Medical and Welfare,
Education Tourism, and Others [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. This means that Open Data from local
governments have great potential. Most government data are census data, so they
are free from the sampling bias of social surveys. When they are published, the publication
itself indicates that they are free from the problems of private information found in Big Data held by private
companies.
      </p>
      <p>
        One illustration of a problem with Open Data from administrative agencies is
the dispute over the official announcement of the results of the national
academic ability survey in Japan. When Hashimoto, then governor of Osaka prefecture,
decided to publish the data by city, town and village, some
municipalities and activists opposed it. When he became the mayor of Osaka city, he
disclosed the results of the city by school. The results for 2011 and 2012 are now
available, except for municipalities with only one school [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. An academic study that used the
data took care not to identify those schools [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>As mentioned in the Introduction, Open Data from administrative agencies are not
only numeric but can also come in the form of documents. These are provided as
PDF (Portable Document Format) files. There were 9776 PDF data sets on the
Japanese data catalogue site “DATA.GO.JP”, which enables cross-sectional search of
data held by the central government [14]. They made up about 40% of all the data
sets on the site. Four years later, although the number of data sets had increased by 50%, the share of
the PDF format still stood at 41.2% of the total (Fig. 1).</p>
      <p>These PDF files are often not machine-readable, even if they contain textual
or numeric data. To retrieve the data, we may need OCR (optical character
recognition) software. For example, the Election Commission of India publishes
donation data in tables listing donors and amounts. However, the tables are not
machine-readable. After retrieving the data with software, we have to review the
error rate of the OCR algorithm by visual inspection. Such files are scans
of paper documents that had been printed out. Punch holes in sequential
documents often damage parts of the data. The spread of paperless transactions in
administrative agencies may solve these problems.</p>
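      <p>The review of the OCR error rate mentioned above can be made concrete with a character error rate: the edit distance between the OCR output and a manually verified transcription, divided by the length of the verified text. The following is a minimal sketch under that assumption and is not the procedure used in this study; the function names and the sample strings are hypothetical and serve only as an illustration.</p>
      <preformat>
```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn reference into hypothesis."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the verified reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: a manually checked table row versus its OCR output,
# in which the digit 1 was read as the letter l and the digit 0 as the letter O.
verified = "Donor A 10500"
ocr_output = "Donor A lO500"
print(character_error_rate(verified, ocr_output))
```
      </preformat>
      <p>In practice, the verified text would come from visually checking a small sample of pages, and the resulting rate would indicate whether the remaining pages need the same manual review.</p>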
      <p>Table 3 (excerpt). 4 stars: All the above plus: use open standards from W3C (RDF and SPARQL) to
identify things, so that people can point at your stuff.</p>
      <p>Berners-Lee, known as the inventor of the World Wide Web, developed a star
rating system “in order to encourage people -- especially government data
owners -- along the road to good linked data” [15] (Table 3). This table presents a
scale that is well known to officers in charge of Open Data in governments.</p>
      <p>
        The PDF format is supposed to be worth 3 stars, since it is independent of
application software, hardware, and operating systems. However, many data sets in PDF
will not earn even 2 stars because they are not machine-readable. In the survey of
local governments in Japan [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], it was discussed whether machine-readable PDF
should be distinguished from non-machine-readable PDF in the questionnaire.
Officers in charge at a certain municipality said, “They may not be aware of the
difference between normal PDF and image PDF. PDF may be only PDF for them.”
It was concluded that such a question could be difficult for respondents to answer
if they did not have the background needed to make this distinction.
      </p>
      <p>Though Berners-Lee drew a blueprint for Linked Open Data, the star scheme
he developed does not cover all the current problems connected with Open Data. For
instance, to achieve Linked Open Data, we first need more Open Data. The value of data
increases when there are more related data [16]. However, data are not easy
to use unless they are machine-readable, and if they are not easy to use, people may
not use them. A structured interview study showed that data users
can motivate officers in charge of Open Data [17]. The feelings of officers cannot
be overlooked. The above-mentioned officers also said, “It is a pleasure for public
servants that people use Open Data. It encourages us to contribute to the public
interest.” They know how many times their Open Data sets have been downloaded.
If people use more data, officers may publish more data (Fig. 2). The difference
between 1 star and 2 stars is therefore as important as that between 0 stars and 1
star. The following are proposals for the star scheme.</p>
      <p>• There should be an instruction to distinguish non-machine-readable PDF
from machine-readable PDF.
• “Machine-readable” should be paraphrased for ordinary people as “possible to copy and paste
textual data or matrix data”.
• PDF should be included among the examples, as well as CSV and Excel.
• A “2 stars system” can highlight the importance of the difference between
1 star and 2 stars. For now, it can be more effective in encouraging officers in charge
to move toward Linked Open Data.
• Open Data providers can share the work for 3 to 5 stars with third parties or
Open Data catalogue sites.</p>
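      <p>The proposals above amount to a small decision rule, which can be sketched as follows. This is one possible reading and is not part of the star scheme itself or of any official tool: the Dataset fields and the star_rating function are hypothetical names, and the treatment of PDF (1 star for a scanned image, 3 stars when the text can be copied) is an assumption that follows the argument of this paper.</p>
      <preformat>
```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical description of one published data set."""
    open_license: bool          # available on the web under an open license
    machine_readable: bool      # text or tables can be copied, not only viewed
    open_format: bool           # non-proprietary format such as CSV or PDF
    uses_w3c_standards: bool    # RDF and SPARQL are used to identify things
    linked_to_other_data: bool  # linked to other people's data


def star_rating(dataset: Dataset) -> int:
    """Return 0 to 5 stars; each star requires all the stars below it."""
    criteria = [
        dataset.open_license,
        dataset.machine_readable,
        dataset.open_format,
        dataset.uses_w3c_standards,
        dataset.linked_to_other_data,
    ]
    stars = 0
    for satisfied in criteria:
        if not satisfied:
            break
        stars += 1
    return stars


# Assumption of this sketch: a scanned (image) PDF is an open format
# but not machine-readable, so it stops at 1 star.
scanned_pdf = Dataset(True, False, True, False, False)
text_pdf = Dataset(True, True, True, False, False)
excel_file = Dataset(True, True, False, False, False)

print(star_rating(scanned_pdf), star_rating(text_pdf), star_rating(excel_file))
```
      </preformat>
      <p>Because each star depends on the one below it, the machine-readable flag alone decides whether a PDF file stays at 1 star or moves up, which is the distinction the first proposal asks officers to make.</p>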
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>This study highlighted some of the many challenges involved in collecting numeric
data in the social sciences, the actual condition of Open Data from administrative
agencies, and a practical use of Open Data. ICT is changing research
methods. Conducting a social survey is becoming difficult. Utilizing Big Data for
academic purposes still presents many problems that need to be solved. At
the same time, using Open Data from administrative agencies has tremendous
potential. Open Data is a source of a large volume of free documents. Many
of these documents are made by scanning printed documents and are published as
non-machine-readable PDF files. People may not pick up such Open Data, which
are hard to use. More use of Open Data can generate more Open Data from the public
sector. Therefore, some proposals on the problem of Open Data in PDF files
were presented. Though the problem of a file format in this study looks
trivial, it may have prevented the spread of Open Data. The star scheme was
made to encourage officers in charge of Open Data. Although it is well known to
them, its principle still does not seem to have been put into practice even after a decade.
It is significant to have pointed out this gap between drafters and practitioners
in our society.</p>
      <p>This study only pointed out the existence of the PDF problem. It was
discussed only with cases from Japan and India. It was not examined whether the
problem exists all over the world, or how many non-machine-readable PDF files there
are. There may be reasons why officers in charge tend to make image PDF
files. For example, they may give priority to signatures or stamps.
Convenience is not always the right criterion. The issue should be discussed together with electronic
signatures. These points remain for future work.
      </p>
      <p>14. Honda M.: The whole aspect of public data to suppose from “DATA.GO.JP”. Journal of Japan Society of Information and Knowledge, 26(4), 320–325 (2017)</p>
      <p>15. Berners-Lee, T.: Linked Data. (2006). https://www.w3.org/DesignIssues/LinkedData, last accessed 2021/03/30</p>
      <p>16. Shapiro C., Varian H. R.: Information Rules: A Strategic Guide to the Network Economy. Harvard Business School Press, Boston, Mass. (1999)</p>
      <p>17. Honda M., Kajikawa Y.: Importance of communication between policy makers and external actors in the policy formation process. Proceedings of the 15th National Convention of Japanese Association for Communication, Information and Society. pp. 204-207 (2018)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. de Heer W., de Leeuw E.D., van der Zouwen J.:
          <article-title>Methodological Issues in Survey Research: a Historical Review</article-title>
          .
          <source>Bulletin of Sociological Methodology/Bulletin de Méthodologie Sociologique</source>
          <volume>64</volume>
          (
          <issue>1</issue>
          ),
          <fpage>25</fpage>
          -
          <lpage>48</lpage>
          (
          <year>1999</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Landsberger</surname>
            <given-names>H. A.</given-names>
          </string-name>
          :
          <article-title>Hawthorne revisited: a plea for an open city</article-title>
          . Ithaca, N.Y.: Cornell University (
          <year>1957</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Marr</surname>
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Big Data in Practice: How 45 successful companies used Big Data Analytics to Deliver Extraordinary Results</article-title>
          . Chichester, Wiley (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kazeem</surname>
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Kenya is stepping up its citizens' digital security with a new EU-inspired data protection law</article-title>
          .
          <source>Quartz Africa</source>
          , November 12 (
          <year>2019</year>
          ). https://qz.com/africa/1746202/kenya-haspassed-new-data-protection-laws-in-compliance-with-gdpr/, last accessed 2021/03/30
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Jain</surname>
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>An existentialist dilemma for the Non-Personal Data regulation?</article-title>
          <source>Telecom.com</source>
          , March 23 (
          <year>2021</year>
          ). https://telecom.economictimes.indiatimes.com/tele-talk/an-existentialistdilemma-for-the-non-personal-data-regulation/4861, last accessed 2021/03/30
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Metcalfe</surname>
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Japan Railway Company Apologizes for Selling IC Card Data</article-title>
          .
          <source>The Wall Street Journal</source>
          , July 29 (
          <year>2013</year>
          ). https://www.wsj.com/articles/BL-JRTB-14515, last accessed 2021/03/30
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Lakshmanan</surname>
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>WhatsApp Will Disable Your Account If You Don't Agree Sharing Data With Facebook</article-title>
          .
          <source>The Hacker News</source>
          , Jan 6 (
          <year>2021</year>
          ). https://thehackernews.com/2021/01/whatsappwill-delete-your-account-if.html, last accessed 2021/03/30
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Reuters:
          <article-title>Japan to probe Line after reports it let Chinese engineers access user data</article-title>
          . March 17 (
          <year>2021</year>
          ). https://www.reuters.com/article/us-japan-line-access-idUSKBN2B901E, last accessed 2021/03/30
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Terada</surname>
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Overview of foreign legal systems related to personal information protection</article-title>
          .
          <source>JIPDEC</source>
          (
          <year>2019</year>
          ). https://www.jipdec.or.jp/archives/publications/J0005156.pdf, last accessed 2021/03/30
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. Open Knowledge Foundation:
          <article-title>Open Definition 2.1</article-title>
          . https://opendefinition.org/od/2.1/en/, last accessed 2021/05/15
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Noda</surname>
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Honda</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yoshida</surname>
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Economic Effect by Open Data in Local Government in Japan</article-title>
          . In: Baghdadi, Y. and Harfouche, A. (eds.)
          <source>ICT for a Better Life and a Better World, The Impact of Information and Communication Technologies on Organizations and Society</source>
          . pp.
          <fpage>165</fpage>
          -
          <lpage>173</lpage>
          . Springer, Heidelberg (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. Osaka prefecture:
          <article-title>Public elementary school, junior high school and kindergarten</article-title>
          . http://www.pref.osaka.lg.jp/life/list2.php?ctg02_id=18, last accessed 2021/03/30
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Uesugi</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yano</surname>
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>A Geodemographic Analysis to Assess Variations in School Performance Based on Educational Achievement: A Case Study of Osaka City, Japan</article-title>
          .
          <source>Japanese Journal of Human Geography (Jimbun Chiri)</source>
          ,
          <volume>70</volume>
          (
          <issue>2</issue>
          ),
          <fpage>253</fpage>
          -
          <lpage>271</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>