Using Open Data for Social Sciences

                                Akio Yoshida [0000-0003-1001-314X]

                Jawaharlal Nehru University, New Delhi, Delhi 110067, India
                               akio.yoshida@gmail.com


      Abstract. Information and communication technologies (ICT) are changing
      research methods in social sciences, especially in the ways of getting data. Online
      surveys and Big Data analyses are used the most among them. Using Open Data
      is another way realized by the spread of ICT. Using Open Data by administrative
      agencies has potential but also some difficulties. This study discusses the practical
      use of Open Data, and focuses on problems related to data in Portable Document
      Format (PDF) files. Those problems seem to occur because many officers in charge
      of Open Data do not pay attention to the principle and practical use of Open Data.
      It shows a gap between drafters and practitioners in our society.


      Keywords: Research Method, Social Sciences, Questionnaire Survey, Big Data,
      Open Data, File Format, PDF.


1   Introduction
Social data can be used to improve and help researches in the field of social science.
Those data exist in a variety of shapes. They are texts, voices, photos, movies,
etc. They are also called materials, records, sources, and evidences. However,
analyzing data in so many different formats presents a challenge. Therefore, one
approach would be to try to quantify the collected data. With numeric data, results
from objective views can be shown in statistical ways.
     The rise of ICT brought changes to the research methods in social sciences.
In addition to online questionnaire surveys and Big Data analyses, Open Data
analyses are used recently. Contrasting with former two, this study shows the
potential of using Open Data, as well as the practical problems of using them.

1.1 Questionnaire survey
If questionnaire was made well with options or Likert scales, it is possible to collect
numeric data, as the results of such surveys are categorical and quantifiable. In the
past social sciences have often relied on physical or offline questionnaire surveys.
     Nowadays, online surveys with questionnaire are also a popular way for
gathering user data. Respondents can answer to the questionnaire with smart

 Copyright © 2021 for this paper by its authors. Use permitted under
 Creative Commons License Attribution 4.0 International (CC BY 4.0).
phones or tablet terminators. However, there have been problems in online sur-
veys compared with offline surveys. Typical cases are summarized in Table 1.
The online survey respondents are automatically limited to the Internet users who
have registered to a survey company. We cannot reach those who do not use the
Internet and those who are not registered to any Internet services. In addition,
online direct mails are easy to be ignored. It leads to low recovery rate. Therefore,
online surveys have been regarded being biased in sampling.

                    Table 1. Online vs Offline in typical social surveys.
                                          Online                            Offline
 Object                                Internet user                   General people
 Sampling                        Registered participants                    Random
 Delivery                               Online DM                       Postal service
 Recovery                               Reply form                          Visiting
 Recovery rate                             Low                               High
 Sampling Bias                             High                              Low
 Costs                                      Low                              High

     On the other hand, conventional offline surveys also got problems recently.
Mail survey in Table 1 has been an efficient data collection tool since 1788 [1].
These days nuclear families, which consist only of parents and children, are in-
creasing especially in urban area. When researchers visit their houses in daytime,
they may not be able to recover questionnaires because no family members are
available there. It is also difficult to reach young people who live alone. In ad-
dition, people are unlikely to open the door to unknown person’s visit. These
problems lead to sampling bias. Respondents come to be limited to those who can
react to researchers in such circumstances.
     When almost all the people come to use the Internet, online survey may have
less sampling bias than conventional survey. Visiting respondents in offline sur-
vey can secure reliable response, while online submission can reduce Hawthorne
effect that respondents tend to give desirable answers, which let them look more
normative [2]. Each of these methods has its pros and cons.

1.2 Big Data
With improvement of ICT and computing, we came to treat Big Data, that is,
huge amount of transactions of information. Including GAFA (Google, Amazon,
Facebook, Apple), many large corporations utilize Big Data which they collect in
their businesses [3]. Using Big Data has a tremendous potential to benefit social
sciences. However, those studies of Big Data are in a black box format. Some

                                             42
companies apply the findings to their businesses; others optimize them as B2 2B
commodities. Analyses of data and the findings are not disclosed. Though SNS
companies such as Twitter provide API to get data, there are limitations of using
them.
     Companies are eager to protect their algorithms, but at the same time, the
data itself can contain sensitive information, for instance transaction data or per-
sonal customer information. That is why many countries are developing legal
systems on personal information (Table 2). Japan amended the Act on the Protec-
tion of Personal Information for use of Big Data. It includes “Clear Indication of
the Purpose of Use”, “Consent of the Person on Provision to A Third Party” and
“Anonymization of Information”. General Data Protection Regulation (GDPR)
in EU has more rules. It is said that DPA in Kenya and PDP in India are based on
GDPR [4][5].
                     Table 2. Legal systems on personal information.
                                       Laws                             Enforcement
                         Sector and State Specific Rules /                  -/
 USA
                      FTC: Federal Trade Commission Act § 5             1914 *2012
 Japan               Act on Protection of Personal Information          2003 *2017
 China                        CS: Cybersecurity Law                        2017
 EU                 GDPR: General Data Protection Regulation               2018
 Kenya                     DPA: The Data Protection Act                    2019
 India                 PDP: Personal Data Protection Bill 2019            Pending
*amended

      In June 2013, Hitachi announced that it would start a service that utilizes the
boarding / alighting history of JR East (Japan’s largest railway company)’s Suica
(IC prepaid fare card) as big data and provides it as station area marketing infor-
mation. At first, JR East claimed that it was not disclosing personal information
on its customers, but admitted selling data without their consent and apologized
after a month [6]. It is still withheld in 2021.
      In January 2021, messaging app WhatsApp announced the new Privacy Pol-
icy, which will allow WhatsApp to share data with its parent company, Facebook.
It does not apply in EU, because it violates GDPR [7]. People encouraged each
other to shift from WhatsApp to other messaging apps, Signal or Telegram. At
last, WhatsApp postponed the update of its privacy policy.
      In March 2021, LINE, which is very similar to WhatsApp and dominant in
Japan, let Chinese engineers at a Shanghai affiliate access Japanese users’ data
without informing them [8]. LINE Corporation was founded as a part of a South
Korean game company.


                                           43
     Big Data can be international. It is important to pay attention to the latest
trend in the world. Even if the use of data is legitimate in Japan, it may violate
GDPR in EU [9]. Though these legal systems encourage the use of Big Data,
companies will be careful with further use of them. Moreover, it will take time
for a broad range of academic use of Big Data.

2   Using Open Data
      We have already covered some of the challenges when it comes to collect-
ing numeric and large amount of data in social sciences. Using Open Data could
present an alternative way of data collection. Knowledge is open if anyone is free
to access, use, modify, and share it [10]. Usually there are limits and difficulties
to access data, as mentioned in previous section. Even after getting data, there are
still problems as license, copyright, patent or other mechanisms of control. Open
Data are free from these restrictions. There are two kinds of data expected to be
Open Data. First, they are academic data in sciences. Second, they are social data
obtained by administrative agencies.
      Using academic Open Data is, in other words, the secondary use of data. The
data of GSS (General Social Survey) in USA are generally available in formats
designed for statistical programs, and “GSS Data Explorer” allows users to test
hypotheses, and look for interesting correlations directly on the website.
      Social data obtained by administrative agencies are also published and free
to access by the public. According to a questionnaire survey in Japan, medians
of Open Data government possession rates were only 1% to 5% in each section:
spatial Information, Agroforestry, Commerce and Industry, Medical and Welfare,
Education Tourism, and Others [11]. This means Open Data by local govern-
ments have a big potential. Most of data by governments are census data. They
are free from sampling bias in social survey. When they are published, that certi-
fies they are free from the problems of private information in Big Data by private
companies.


                                        44
                                                    Source: DATA.GO.JP (on Mar 26, 2021)
       Fig. 1. Numbers of file formats in Open Data by the central government of Japan.

     By way of illustration for a problem in Open Data by administrative agency,
there have been arguments on official announcement about the results of national
academic ability survey in Japan. When the governor of Osaka prefecture, Hashi-
moto decided to publish the data by cities, towns and villages, some municipali-
ties and activists were against it. When he became the mayor of Osaka city, he
disclosed the results of the city by schools. Now results in 2011 and 2012 are
available except municipalities with only one school [12]. An academic use of the
data considered not to identify those schools [13].
     As mentioned at Introduction, Open Data by administrative agencies are not
only numeric but can also come in the form of documents. They are provided in
PDF (portable document format) files. There were 9776 PDF data sets in Japa-
nese data catalogue site “DATA.GO.JP”. They enabled cross-sectional search of
the data by the central government [14]. That made up about 40% of all the data
sets in the site. After 4 years, while the data sets increased by 50%, the rate of
PDF format keeps still 41.2% of all (Fig. 1).
     These PDF files often are not machine-readable, even if they have literal
or numeric data. When we retrieve them, we may need OCR (optical character
recognition) software. For example, Election Commission of India has data of
donation, which have tables with donors and amounts. However, they are not
machine-readable. After retrieving data with software, we have to review the er-
ror rate of the OCR algorithm, with viewing operation. They are scanned data
from paper documents, which were printed out. Punch holes in sequential docu-
ments often damage some parts of data. The spread of paper-less transactions in
administrative agencies may solve these problems.

                                             45
              Table 3. Star scheme* toward Linked Open Data by Berners-Lee.
             Available on the web (whatever format) but with an open licence, to be Open
         
             Data
             Available as machine-readable structured data (e.g. excel instead of image
       
             scan of a table)

      as (2) plus non-proprietary format (e.g. CSV instead of excel)

             All the above plus, Use open standards from W3C (RDF and SPARQL) to
   
             identify things, so that people can point at your stuff
  All the above, plus: Link your data to other people’s data to provide context
                                                                       *added in 2010
                                                             Source: Linked Data [15]

     Berners-Lee, known as the inventor of the world wide web, developed star
rating system “in order to encourage people -- especially government data own-
ers -- along the road to good linked data” [15] (Table 3). This table present a
scale, well known to officers in charge of Open Data in governments.
     PDF format is supposed to be worth 3 stars in a manner independent of ap-
plication software, hardware, and operating systems. However, many data in PDF
will not get even 2 stars because they are not machine-readable. In the survey on
local governments in Japan [11], it was discussed whether machine-readable PDF
should be distinguished from non-machine-readable one in the questionnaire. Of-
ficers in charge of a certain municipality said, “They may not be aware of the dif-
ference between normal PDF and image PDF. PDF may be only PDF for them.”
It was considered that such a question could be difficult to answer by respond-
ents – if they do not have the necessary background to make this distinction.


    Fig. 2. A Cycle toward more Open Data related with Star scheme and the PDF problem.


                                            46
      Though Berners-Lee drew a blueprint of Linked Open Data, the star scheme
he developed does not cover all current problems connected to open data. For
instance, to get Linked Open Data, we need more Open Data. The value of data
will increase, if there are more related data [16]. However, those data are not easy
to use, unless they are machine-readable. If they are not easy to use, people may
not use them. There is a structured interview research, which showed data users
could motivate officers in charge of Open Data [17]. Emotion of officers cannot
be overlooked. The above-mentioned officers also said, “It is pleasure for public
servants that people use Open Data. It encourages us to contribute for public
interest.” They know how many times their Open Data sets were downloaded.
If people use more data, officers may publish more data (Fig. 2). The difference
between 1 star and 2 stars is very important as well as that between 0 star and 1
star. Here are proposals to Star scheme.
      • There should be an instruction to distinguish non-machine-readable PDF
         from machine-readable PDF.
      • Machine-readable should be translated to “possible to copy and paste tex-
         tual data or matrix data” for ordinary people.
      • PDF should be included as examples, as well as CSV and excel.
      • “2 stars system” can highlight the importance of the difference between
         1 star and 2 star. It can be more efficient to encourage officers in charge
         toward Linked Open Data, so far.
      • Open Data providers can share 3 to 5 stars works to the third parties or
         Open Data catalogue site.

3   Conclusion
This study highlights some of the many challenges involved in collecting numeric
data in social sciences, the actual conditions of Open Data by administrative
agencies, and a practical use of Open Data. There are changes in research methods
with ICT. Conducting a social survey is getting difficult. Utilizing Big Data for
academic purpose still presents many problems, which need to be solved. At
the same time, using Open Data by administrative agencies has a tremendous
potential. Open data is a source for a high volume of free documents. Many
document data are made by scanning printed documents. They are published in
non-machine-readable PDF files. People may not pick up such Open Data, which
are hard to use. More use of Open Data can generate more Open Data from public
sectors. Therefore, some proposals on the problem of Open Data in PDF files
were presented. Though the problem of a file format in this study looks very
trivial, it may have prevented the spread of Open Data. The Star scheme was
made to encourage officers in charge of Open Data. It has been well known to
them, its principle still does not seem to be realized by them even after a decade.


                                        47
It must be significant to have pointed out a gap between drafters and practitioners
in our society.
      This study only pointed out the existence of the problem of PDF. It was dis-
cussed only with cases in Japan and India. It was not examined whether the prob-
lem exists all over the world, and how many non-machine-readable PDF there
are. There can be some reasons that officers in charge tend to make image PDF
files. For example, they may be going to put priority on signatures or stamps.
Convenience is not always right. It should be discussed with Electronic Signature
together. These points should be improved and will be the future works.

References
1. de Heer W., de Leeuw E.D., van der Zouwen J.: Methodological Issues in Survey Research:
    a Historical Review. Bulletin of Sociological Methodology/Bulletin de Méthodologie Soci-
    ologique 64(1), 25–48 (1999).
2. Landsberger H. A.: Hawthorne revisited: a plea for an open city. Ithaca, N.Y.: Cornell Univer-
    sity (1957).
3. Marr B.: Big Data in Practice: How 45 successful companies used Big Data Analytics to De-
    liver Extraordinary Results. Chichester, Wiley (2016).
4. Kazeem Y.: Kenya is stepping up its citizens’ digital security with a new EU-inspired data
    protection law. Quartz Africa, November 12 (2019) https://qz.com/africa/1746202/kenya-has-
    passed-new-data-protection-laws-in-compliance-with-gdpr/, last accessed 2021/03/30
5. Jain R.: An existentialist dilemma for the Non-Personal Data regulation?,” Telecom.com,
    March 23 (2021). https://telecom.economictimes.indiatimes.com/tele-talk/an-existentialist-
    dilemma-for-the-non-personal-data-regulation/4861, last accessed 2021/03/30
6. Metcalfe J.: Japan Railway Company Apologizes for Selling IC Card Data. The Wall Street Jour-
    nal, July 29 (2013). https://www.wsj.com/articles/BL-JRTB-14515, last accessed 2021/03/30
7. Lakshmanan R.: WhatsApp Will Disable Your Account If You Don’t Agree Sharing Data With
    Facebook. The Hacker News, Jan 6, (2021). https://thehackernews.com/2021/01/whatsapp-
    will-delete-your-account-if.html, last accessed 2021/03/30
8. Reuters: Japan to probe Line after reports it let Chinese engineers access user data. March 17
    (2021). https://www.reuters.com/article/us-japan-line-access-idUSKBN2B901E, last accessed
    2021/03/30
9. Terada S.: Overview of foreign legal systems related to personal information protection. JI-
    PDEC (2019). https://www.jipdec.or.jp/archives/publications/J0005156.pdf, last accessed
    2021/03/30
10. Open knowledge Foundation: Open Definition 2.1 https://opendefinition.org/od/2.1/en/ last ac-
    cessed 2021/05/15
11. Noda T., Honda M., Yoshida A.: Economic Effect by Open Data in Local Government in Ja-
    pan,” In: Baghdadi, Y. and Harfouche, A. (eds.) ICT for a Better Life and a Better World, The
    Impact of Information and Communication Technologies on Organizations and Society. pp.
    165–173. Springer, Heidelberg, (2019).
12. Osaka prefecture.: Public elementary school, junior high school and kindergarten. http://www.
    pref.osaka.lg.jp/life/list2.php?ctg02_id=18, last accessed 2021/03/30
13. Uesugi M., Yano K.: A Geodemographic Analysis to Assess Variations in School Performance
    Based on Educational Achievement: A Case Study of Osaka City, Japan. Japanese Journal of
    Human Geography (Jimbun Chiri), 70(2), 253–271 (2018)


                                               48
14. Honda M.: The whole aspect of public data to suppose from “DATA.GO.JP”. Journal of Japan
    Society of Information and Knowledge, 26(4), 320–325, (2017)
15. Berners-Lee, T.: Linked Data. (2006). https://www.w3.org/DesignIssues/LinkedData, last ac-
    cessed 2021/03/30
16. Shapiro C.: Information rules : a strategic guide to the network economy. Varian, Hal R. Boston,
    Mass. Harvard Business School Press (1999)
17. Honda M., Kajikawa Y.: Importance of communication between policy makers and external
    actors in the policy formation process. Proceedings of the 15th National convention of Japanese
    Association for Communication, Information and Society. pp. 204-207 (2018)


                                                49