<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SocioScope: an Integrated Framework for Understanding Society from Social Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hoang Long Nguyen</string-name>
          <email>longnh238@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Minsung Hong</string-name>
          <email>minsung.holdtime@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Na-Yeong Cho</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Junbeom Kim</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Camacho</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jason J. Jung</string-name>
          <email>j2jung@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Engineering, Chung-Ang University</institution>
          ,
          <addr-line>84 Heukseok-ro, Dongjak-gu, Seoul</addr-line>
          ,
          <country country="KR">Korea</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, Universidad Autonoma de Madrid</institution>
          ,
          <addr-line>Madrid</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We recognize the importance of data, especially in the era of the Internet of Things. However, effectively collecting social data from different sources and analyzing the relationships between them to understand our society is a big challenge. SocioScope is built to solve this problem. We also recognize that many researchers spend time conducting the same work (i.e., collecting and pre-processing data); therefore, we provide SocioScope as a framework for reducing their effort. The outputs of our system are used not only to understand social data but can also serve as inputs for other work to create useful applications.</p>
      </abstract>
      <kwd-group>
        <kwd>SocioScope framework</kwd>
        <kwd>Social data</kwd>
        <kwd>Social information</kwd>
        <kwd>Society understanding</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
<title>Introduction</title>
      <p>
        We are living in the era of data due to the growth of the Internet of Things (IoT). In
an IoT system, devices (e.g., wireless sensor networks, GPS, control systems) are
connected, and data is created through every event. By 2013, IoT had
been integrated into different systems using multiple technologies. Therefore,
data has grown quickly in every aspect, including volume, velocity, and variety. According to
statistics reported in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], data reached 4.4 zettabytes in 2013. This brings huge benefits
to society.
      </p>
      <p>
        Data comprises facts about our world. There are two types of data:
tacit and explicit [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Tacit data is acquired through experience and is
embedded in the human mind. In contrast, explicit data consists of printed
and electronic materials. In this work, we focus on explicit data. We
also concentrate on social data rather than individual data (e.g., all data
collected about President Barack Obama). Social data refers to the set of data
created by different individuals on social media, mass media, and
sensors. However, social data is raw and discrete: it exists without any
context or analysis, and it is not useful in itself unless it is processed to obtain
information.
      </p>
      <p>Therefore, we need a system for integrating and analyzing social data to
understand its relationships and connections. We then obtain social information, which
is a structured representation of social data. With social information, we
can answer questions related to "who, what, where, and when". For
example, suppose the social data concerns the number of vehicles on a street. By
analyzing this data, we can understand our society through questions
such as: who is on this street? What kinds of vehicles are people using?
Which streets are available? And when do traffic jams occur? The
scope of society is also very dynamic and depends on the social data that we get:
a society can be a university, a company, or a community.</p>
      <p>
        Our contributions in this paper are as follows. We build SocioScope with the
goal of collecting social data and creating social information, from which we
can gain insight into the society around us. We also recognize that researchers
must spend a lot of time on the same work (i.e., collecting and pre-processing data).
By providing the SocioScope framework, we want to reduce this effort. The
outputs of SocioScope can serve as inputs for other work to create social
knowledge, or even social wisdom. Returning to the example above, we can
build an application that guides users around traffic jams using social information.
This is what we call social knowledge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>In this section, we have given an overview of our motivation. Section 2 describes
our system in detail. The performance of SocioScope is presented
in Section 3. Finally, we conclude and discuss future work in Section 4.</p>
    </sec>
    <sec id="sec-2">
      <title>SocioScope Framework</title>
      <sec id="sec-2-1">
        <title>Overview</title>
        <p>In this section, we provide the details needed to understand the overview of
our system. SocioScope is implemented in the Java programming language and
is divided into two independent parts: a web service and a background service (i.e., a Windows
service or Linux daemon), as shown in Fig. 1.</p>
        <p>The web service side is implemented using the Apache Tomcat servlet
container and JavaServer Pages technology. We choose Model View Controller
(MVC) as our programming pattern to cleanly separate the logical handling of data
from its presentation. The important functions are designed into the
background service, which is run from the command prompt. The advantages of a background
service are higher system performance and better security; moreover, it reduces
the impact of incorrect user behavior. These two systems communicate with each other
using Apache Thrift, a framework for cross-language service
development. Apache Thrift works effectively with diverse programming languages
(e.g., C++, Java, Python, and PHP). Further, we can take advantage of Apache
Thrift when we want to integrate other applications into SocioScope.</p>
        <p>Dealing with big data is one of the big challenges for SocioScope. Hence, we
applied big data processing techniques when designing the system.
The MongoDB database is selected because it is schema-less (i.e., extra fields can easily be
extended and altered), NoSQL, and fast to access (i.e., it keeps the
working set in internal memory to allow faster data access). In addition, MongoDB can be
used together with Hadoop to power a big data system.</p>
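        <p>To illustrate what schema-less means in practice, the following minimal sketch (plain Java maps standing in for MongoDB documents; the real system would go through the MongoDB driver, which is not shown in this paper) stores two records with different fields in the same collection:</p>
        <preformat>
```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SchemaLess {
    // Two "documents" in the same collection with different fields:
    // extra fields can be added without altering any schema.
    public static List<Map<String, Object>> sampleCollection() {
        Map<String, Object> tweet = new HashMap<>();
        tweet.put("id", 1L);
        tweet.put("text", "traffic jam downtown");

        Map<String, Object> sensorReading = new HashMap<>();
        sensorReading.put("id", 2L);
        sensorReading.put("vehicleCount", 42);   // field absent from the tweet
        sensorReading.put("location", "Seoul");

        List<Map<String, Object>> collection = new ArrayList<>();
        collection.add(tweet);
        collection.add(sensorReading);
        return collection;
    }
}
```
        </preformat>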
        <p>Besides, we also build a Linked Data (LD) database for better
understanding and generating structured data. When text data is collected, we tokenize
each sentence to create a bag-of-words. Our system then retrieves information from
Linked Open Data sets (i.e., DBpedia and Freebase) based on these words. We use the LD
database for producing specific metadata for our documents and for conducting
context-based analysis as well.</p>
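        <p>The tokenization step above can be sketched as follows (a simple illustration assuming lowercase word splitting; the actual tokenizer used by SocioScope may differ):</p>
        <preformat>
```java
import java.util.HashMap;
import java.util.Map;

public class BagOfWords {
    // Split a sentence into lowercase word tokens and count occurrences.
    public static Map<String, Integer> build(String text) {
        Map<String, Integer> bag = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty()) continue;   // skip artifacts of leading punctuation
            bag.merge(token, 1, Integer::sum);
        }
        return bag;
    }
}
```
        </preformat>
        <p>Each distinct word in the resulting bag can then be used as a lookup key against the Linked Open Data sets.</p>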
      </sec>
      <sec id="sec-2-2">
        <title>Modules</title>
        <p>All modules described here are supported by our system. We also list the
URL schemes that give direct access to these modules; each URL follows the
format "domain/subpage". Below is the list of modules that SocioScope contains:
User Management
- Sign up (/sign up): Users must register an account on SocioScope to access
the system. This URL shows a form for applying for a new account.
Scripts verify the content that the user inputs.
- Sign in (/sign in): This page lets an individual gain access by entering a
username and password. After successfully logging in, the user is redirected
to the SocioScope homepage.
- Home (/home): This is the homepage of SocioScope. All information
about the system is shown here.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Social Data Collection</title>
        <p>- API account (/api account): To make authorized requests for data
from the sources, we have to register applications or tokens with each
API. This page is used for managing all the API accounts.
- API tracking (/API tracking): Each API has different rate-limit strategies.</p>
        <p>If the number of requests exceeds the limit, the API returns an error code
and no data is acquired. We can control API rate limits using this page.
- Batch crawler (/batch crawler): The batch crawler is used to deal
with blocks of data that have already been stored over a period of time. It
works well when the user wants to process large volumes of data. We support
crawling data not only by keyword but also by location and time.
Tab. 3 shows the user interface of the crawler page. Moreover, we also address
the multilingual problem: users can collect data in different languages.
- Stream crawler (/stream crawler): The main features of the stream crawler are
similar to those of the batch crawler; however, the stream crawler is used
to analyze data in real time. It is useful for fraud detection
tasks (e.g., in the healthcare, telecommunications, or banking areas). We create
a set of listeners for collecting data: every time new data is
generated at a data source, it is automatically collected by these listeners.
- Manual crawler (/manual crawler): Almost all data sources use IDs to manage
data. When the user already has a list of data IDs, this feature
fetches the data by passing the IDs into the system.
- Collected data (/collected data): After crawling, the data is stored in the database.</p>
        <p>This page allows retrieving all the data from the database. Pagination techniques
are used to break large data sets into smaller portions to increase system
performance. In addition, we support exporting data to files (e.g., text and
CSV files) in JSON format.</p>
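        <p>The pagination technique above can be sketched as an offset/limit computation (a hand-rolled illustration; in practice this maps onto the database's skip and limit operators):</p>
        <preformat>
```java
import java.util.List;

public class Pagination {
    // Return the 1-indexed page `page` of `items`, `pageSize` records per page.
    public static <T> List<T> page(List<T> items, int page, int pageSize) {
        int from = (page - 1) * pageSize;            // records to skip
        if (from >= items.size()) return List.of();  // past the end: empty page
        int to = Math.min(from + pageSize, items.size());
        return items.subList(from, to);
    }
}
```
        </preformat>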
      </sec>
      <sec id="sec-2-4">
        <title>Social Information Generation</title>
        <p>
          - Natural Language Processing (/nlp): Provides basic tools for natural
language processing (e.g., tokenizer, POS tagger, named entity recognizer,
stemming, lemmatization, sentiment analysis, and coreference resolution). This
is an important step for understanding text data.
(Figure: crawler options: (a) selecting sources, (b) inputting keywords, (c) choosing location, (d) picking up time.)
- Time Series (/time series): Displays how the frequency of data changes over
a period of time. From the time-series visualization, we can easily recognize
patterns in the data or even observe abnormal signals.
- Signal Processing (/signal processing): Converts a signal from the time domain
to the frequency domain using different techniques (e.g., fast Fourier
transform, discrete Fourier transform, discrete cosine transform, discrete sine
transform, discrete Hartley transform, fast wavelet transform, and wavelet packet
transform). Because of many proven functions, it is easier to compute and
process a signal in the frequency domain than in the time domain.
- Word Cloud (/word cloud): A word cloud is a visualization method that prominently
displays the words appearing most frequently in the source text.
It gives us a starting point for deeper analysis later on (e.g.,
judging which words co-occur in specific topics) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
        </p>
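        <p>As a hedged illustration of the time-to-frequency conversion listed above, the following is a direct O(n^2) discrete Fourier transform (the signal-processing library that SocioScope actually uses is not named in this paper):</p>
        <preformat>
```java
public class Dft {
    // Naive DFT: returns the spectrum interleaved as [Re0, Im0, Re1, Im1, ...].
    public static double[] transform(double[] x) {
        int n = x.length;
        double[] out = new double[2 * n];
        for (int k = 0; k < n; k++) {           // frequency bin k
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {       // time sample t
                double angle = -2 * Math.PI * k * t / n;
                re += x[t] * Math.cos(angle);
                im += x[t] * Math.sin(angle);
            }
            out[2 * k] = re;
            out[2 * k + 1] = im;
        }
        return out;
    }
}
```
        </preformat>
        <p>For a constant input, all energy lands in the DC bin, which makes periodic patterns in a data-frequency time series easy to spot once transformed.</p>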
      </sec>
    </sec>
    <sec id="sec-3">
      <title>System Performance</title>
      <p>In this section, we demonstrate the performance of SocioScope. There are three
issues that we want to examine: i) the performance of the crawling process,
ii) the effectiveness of MongoDB when dealing with big data, and iii) the time
consumed by the data analysis tasks.</p>
      <p>(Figure: analysis modules: (a) natural language processing, (b) signal processing, (c) time series, (d) word cloud.)</p>
      <p>All experiments are conducted
on a computer with the following specifications: an Intel(R) Core(TM) i5-4590
CPU at 3.30GHz with 12GB RAM. Twitter is selected as the source for the
experiments because data from this source can be considered big data. Moreover,
the request limitation of the Twitter API (180 calls every 15 minutes) is also a
challenge that we want to overcome.</p>
      <p>We first measure the performance of the crawling feature by collecting data for
1 minute, 10 minutes, and 100 minutes using the batch crawling and stream
crawling features. To overcome the request limitation, we create 35
different Twitter API applications and build an application pool: a collection of
Twitter API applications managed by queueing.
Every time an application reaches its rate limit, it is automatically moved
to the end of the queue and another available application is used. Therefore, the
crawling process runs continuously without any delay. Moreover, we also
implement a thread pool for parallel crawling: at any time, a thread is used
as long as it is still available. Obama and Trump are the two keywords we
choose for testing the batch crawler and stream crawler, respectively. The results
in Tab. 1 show that the batch crawler is quite stable (about 4200 tweets per minute),
while the stream crawler is quite dynamic due to the real-time nature of the data. From
these results, we conclude that with SocioScope we can collect much
more data than with plain HTTP requests.</p>
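      <p>The application pool above can be sketched with a simple queue (the class names here, e.g., ApiApp, are hypothetical; real credential handling and HTTP calls are omitted):</p>
      <preformat>
```java
import java.util.ArrayDeque;
import java.util.Deque;

public class AppPool {
    /** Hypothetical stand-in for one registered Twitter API application. */
    public static class ApiApp {
        final String name;
        int remainingCalls;          // e.g. 180 per 15-minute window
        ApiApp(String name, int remainingCalls) {
            this.name = name;
            this.remainingCalls = remainingCalls;
        }
    }

    private final Deque<ApiApp> queue = new ArrayDeque<>();

    public AppPool(ApiApp... apps) {
        for (ApiApp a : apps) queue.addLast(a);
    }

    // Use the app at the head of the queue; once it hits its rate limit,
    // rotate it to the tail and continue with the next app. This sketch
    // assumes at least one app in the pool still has calls remaining.
    public ApiApp acquire() {
        ApiApp head = queue.peekFirst();
        if (head.remainingCalls == 0) {
            queue.addLast(queue.pollFirst());  // rotate exhausted app to the end
            head = queue.peekFirst();
        }
        head.remainingCalls--;
        return head;
    }
}
```
      </preformat>
      <p>Rotating exhausted applications to the tail gives each one its 15-minute window to recover while crawling continues on the rest of the pool.</p>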
      <p>Further, we are interested in demonstrating the effectiveness of the MongoDB database
when dealing with big data. To optimize system performance, we first
apply indexing to frequently retrieved records to minimize the number of
disk accesses required. Besides, the WiredTiger storage engine is applied. This
engine supports document-level locking, so the system achieves better
concurrency. Furthermore, WiredTiger uses the snappy compression algorithm to
reduce the amount of data that has to be written to or read from disk.
Fig. 5 shows that NoSQL outperforms SQL databases when dealing with big
data. The time is measured from when the user clicks the retrieve button until the
result is shown.</p>
      <p>(Fig. 5: retrieval time in milliseconds versus number of records (&gt;4,000, &gt;40,000, &gt;80,000, and &gt;450,000) for MySQL, MongoDB-MMAPv1, MongoDB-WiredTiger, and MongoDB-WiredTiger-Indexed.)</p>
      <p>Finally, we focus on the data analysis tasks. The time consumed by the natural
language processing, signal processing, time series, and word cloud visualization
tasks is approximately in line with the results in Fig. 5. This shows that the time SocioScope
takes to analyze data is insignificant.</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>
        In this paper, we have introduced SocioScope, a framework for collecting
and analyzing social data to create social information. Various
analysis tools (e.g., natural language processing, signal processing, and visualization)
are supported for discovering our society from social data. In addition, the
output of SocioScope can be leveraged to generate social knowledge
(e.g., event detection [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], trust-based recommendation system [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]), or even social
wisdom.
      </p>
      <p>We also plan future work to enhance the utility of our system. Other
sources will be investigated for integration, such as mass
media (e.g., newspapers and radio), social media (e.g., Instagram,
Flickr, and Foursquare), and sensors (e.g., cameras and wearable devices),
to enrich the social data. Besides, we are considering applying Hadoop, a powerful
framework for dealing with big data, to improve our system. Finally, other
analysis tools (e.g., sampling, transformation, denoising, and feature extraction
modules) will be added for a better understanding of social data.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgment</title>
      <p>This work was supported by the National Research Foundation of Korea (NRF)
grant funded by the Korea government (MSIP) (NRF-2017R1A2B4010774).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><surname>Ackoff</surname>, <given-names>R.L.</given-names></string-name>:
          <article-title>From data to wisdom</article-title>.
          <source>Journal of Applied Systems Analysis</source>
          <volume>16</volume>(<issue>1</issue>),
          <fpage>3</fpage>–<lpage>9</lpage>
          (<year>1989</year>)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Erevelles</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fukawa</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Swayne</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Big data consumer analytics and the transformation of marketing</article-title>
          .
          <source>Journal of Business Research</source>
          <volume>69</volume>(<issue>2</issue>),
          <fpage>897</fpage>–<lpage>904</lpage>
          (<year>2016</year>)
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Heimerl</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lohmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lange</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ertl</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Word cloud explorer: Text analytics based on word clouds</article-title>
          .
          In: <source>Proceedings of the Hawaii International Conference on System Sciences (HICSS 2014)</source>, Hilton Waikoloa, Hawaii, USA, Jan 6-9, 2014,
          pp. <fpage>1833</fpage>–<lpage>1842</lpage>. IEEE
          (<year>2014</year>)
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Nguyen</surname>, <given-names>H.L.</given-names></string-name>,
          <string-name><surname>Lee</surname>, <given-names>O.J.</given-names></string-name>,
          <string-name><surname>Jung</surname>, <given-names>J.E.</given-names></string-name>,
          <string-name><surname>Park</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Um</surname>, <given-names>T.W.</given-names></string-name>,
          <string-name><surname>Lee</surname>, <given-names>H.W.</given-names></string-name>:
          <article-title>Event-driven trust refreshment on ambient services</article-title>.
          <source>IEEE Access</source>
          <volume>5</volume>,
          <fpage>4664</fpage>–<lpage>4670</lpage>
          (<year>2017</year>)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jung</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          :
          <article-title>Real-time event detection on social data stream</article-title>
          .
          <source>Mobile Networks and Applications</source>
          <volume>20</volume>(<issue>4</issue>),
          <fpage>475</fpage>–<lpage>486</lpage>
          (<year>2015</year>)
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>E.A.</given-names>
          </string-name>
          :
          <article-title>The role of tacit and explicit knowledge in the workplace</article-title>
          .
          <source>Journal of Knowledge Management</source>
          <volume>5</volume>(<issue>4</issue>),
          <fpage>311</fpage>–<lpage>321</lpage>
          (<year>2001</year>)
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>