<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sensemaking on Wikipedia by Secondary School Students with SynerScope</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>W.R. van Hage</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>F. Núñez Serrano</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>T. Ploeger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>J.E. Hoeksema</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SynerScope B.V.</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universidad Politécnica de Madrid</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>VU University Amsterdam</institution>
        </aff>
      </contrib-group>
      <fpage>48</fpage>
      <lpage>56</lpage>
      <abstract>
        <p>Visual analytics of linked data can be done by secondary school students with minimal preparation. We study the learning curve of students while answering typical Web analytics questions on Wikipedia and DBpedia using SynerScope visual analytics software. We find that after a short tutorial students are able to answer most complex questions in a few minutes, learning by trial and error. Older students are faster on average, but motivation appears to be a stronger factor than age for success. Answering speed doubles within two hours of experience while correctness increases.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The world will soon face a critical shortage of data scientists, professionals with
analytical expertise who can take advantage of (linked) data to answer questions [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. One
strategy to mitigate this problem is to enable non-experts to take over part of the data
science tasks. We posit that data science comprises many tasks that do not all
require expert-level knowledge. In this article we restrict ourselves to a category of data
science sensemaking tasks on Web data that is common in data journalism and involves
basic analytics operations, search, and Web browsing. We hypothesise that, given the
right tools, untrained people can quickly be trained to do such tasks, avoiding a
complete data science education.
      </p>
      <p>The goal of this article is to test this hypothesis with an experiment that
demonstrates the feasibility of having untrained people do prototypical sensemaking tasks given
visual analytics tools. Specifically, we look at secondary school students with no
analytical experience, and ask them to answer complex questions about Wikipedia content
using the SynerScope<sup>4</sup> visual analytics software illustrated in Figure 1. We want to
know whether users can get to an answer after a minimal amount of training in the tool, how
long it takes them to find an answer, whether their time-to-answer decreases as their
experience with the tool increases, and what the influence is of their age
and corresponding level of education.</p>
      <p>The line of reasoning we follow is that the required skills for such sensemaking data
science tasks can be rapidly acquired or substituted with appropriate tools. If this is the
case and if SynerScope is an appropriate tool for the task, then we should be able to
show that unskilled people can accomplish the sensemaking tasks.</p>
      <p><sup>4</sup> http://www.synerscope.com</p>
      <p>
        This idea of empowering people by means of augmented reasoning through
human-computer interaction is not new [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], but in recent years the development of interactive
tools for visual analytics has intensified. Some of these tools are targeted at
programmers (e.g., [
        <xref ref-type="bibr" rid="ref1 ref10 ref11">1, 10, 11</xref>
        ]), while other tools target non-programmers (e.g., [
        <xref ref-type="bibr" rid="ref12 ref13 ref14 ref2 ref8 ref9">13, 12, 14, 9, 2,
8</xref>
        ]). For this experiment we need a tool from the latter category that is network-centric
and allows search and Web browsing. We use SynerScope [
        <xref ref-type="bibr" rid="ref13 ref4 ref5">13, 5, 4</xref>
        ], one of the tools
that meets these requirements.
      </p>
      <p>The rest of this paper is organised as follows: Section 2 describes the SynerScope
software in more detail. Section 3 outlines the experimental set-up, including the tasks,
tooling, and procedure. Section 4 shows our findings. Section 5 discusses our findings,
draws conclusions and suggests future work.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 The SynerScope Software</title>
      <p>SynerScope is a visual analytics application that delivers real-time interaction with
dynamic network-centric data. SynerScope supports simultaneous visualisations and
coordinates user interaction, enabling the user to identify causal relationships and to uncover
unforeseen connections.</p>
      <p>The central interaction paradigm of SynerScope is Multiple and Coordinated Views.
SynerScope shows a number of different perspectives on data, for example, relations
and time, and each selection made in any of these views causes an equivalent selection
to be made in all other views. This enables the user to explore correlations between
different facets of data.</p>
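      <p>The coordination between views can be pictured as a simple publish-and-subscribe arrangement. The following Python sketch is illustrative only: the class and method names (Coordinator, View, broadcast, select) and the identifiers used are hypothetical and do not describe SynerScope's actual implementation.</p>
      <preformat preformat-type="code">
```python
class Coordinator:
    """Keeps every registered view on the same selection."""

    def __init__(self):
        self.views = []

    def register(self, view):
        self.views.append(view)
        view.coordinator = self

    def broadcast(self, selected_ids):
        # Every view, including the one where the selection started,
        # receives its own copy of the selected identifiers.
        for view in self.views:
            view.selected = set(selected_ids)


class View:
    """A hypothetical view that holds a set of selected Node or Link ids."""

    def __init__(self, name):
        self.name = name
        self.selected = set()
        self.coordinator = None

    def select(self, selected_ids):
        # A selection made in this view is forwarded to the coordinator.
        self.coordinator.broadcast(selected_ids)


coordinator = Coordinator()
network, timeline, world_map = View("network"), View("timeline"), View("map")
for view in (network, timeline, world_map):
    coordinator.register(view)

# Selecting two (hypothetical) links in the timeline updates all views.
timeline.select({"link-3", "link-8"})
```
      </preformat>
      <p>In this sketch a selection made in the timeline ends up in the network and map views as well, which mirrors the behaviour described above.</p>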
      <p>SynerScope is designed to work with a very basic information schema. This schema
consists of two object types: Nodes and Links. Links connect two Nodes. Both Nodes
and Links can have additional attributes of a number of data types, including integers,
floating point numbers, free text, date and time, latitude and longitude.</p>
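      <p>As a rough illustration of this schema, the two object types could be modelled as follows. This is a sketch under our own assumptions: the field names (node_id, source, target, attributes) and all attribute names and values are hypothetical, and latitude and longitude are simply stored as floating point numbers.</p>
      <preformat preformat-type="code">
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Union

# Attribute value types named in the text: integers, floating point numbers,
# free text, and date and time. Latitude and longitude are floats here.
AttributeValue = Union[int, float, str, datetime]


@dataclass
class Node:
    node_id: str
    attributes: Dict[str, AttributeValue] = field(default_factory=dict)


@dataclass
class Link:
    source: str  # node_id of the source Node
    target: str  # node_id of the target Node
    attributes: Dict[str, AttributeValue] = field(default_factory=dict)


# Placeholder data: two pages and one page link between them.
page_a = Node("page_a", {"title": "Page A", "word_count": 1200,
                         "latitude": 52.37, "longitude": 4.89})
page_b = Node("page_b", {"title": "Page B", "word_count": 300})
link = Link(page_a.node_id, page_b.node_id, {"observed": datetime(2013, 9, 1)})
```
      </preformat>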
      <p>What follows is a short overview of each visualisation that is offered by SynerScope.</p>
      <p><bold>Hierarchical Edge Bundling View</bold> The Hierarchical Edge Bundling View (HEB) is the
primary network view in SynerScope. Each Node is visualised as a point on a circle,
and each Link is visualised as a curved line between its source and target Node.</p>
      <p>The Nodes are grouped hierarchically, based on one or more of their attributes. The
Links between Nodes of the same hierarchical category are bundled together (as if they
were tied together with a cable tie).</p>
      <p><bold>Massive Sequence View</bold> The Massive Sequence View (MSV) is the primary temporal
view in SynerScope. Each Node gets a fixed position on the horizontal axis. Nodes are
grouped hierarchically in the same fashion as in the HEB. Links between Nodes are
represented by a horizontal line between the respective positions of the Nodes. On the
vertical axis the user can select a scalar attribute, typically a time or date. This orders
the Links temporally.</p>
      <p><bold>Map View</bold> The Map View is the primary spatial view in SynerScope. The user can
select two attributes from any Node or Link data source to interpret as WGS84 latitude
and longitude coordinates. These attributes are used to plot the Nodes (not the Links)
on a map as points.</p>
      <p><bold>Scatter Plot View</bold> The Scatter Plot View uses Cartesian coordinates to relate the
values of two attributes of either Nodes or Links. Dots are drawn on a two-dimensional
chart, their position relative to the horizontal and vertical axes being determined by
the attributes’ values. A third attribute can be used to set the size of the dots.</p>
      <p><bold>Search and Filter View</bold> The Search and Filter View is an interactive view that allows
the user to select Nodes or Links by searching by value.</p>
      <p><bold>Web View</bold> The Web View is an interactive view that allows the user to view any URLs
that are an attribute of a Node or a Link.</p>
      <p>The user can interact with SynerScope’s views in several ways: by selecting and
highlighting data, drilling down to or up from a selection, and expanding selections
from nodes to connected links or vice versa. Every interaction method is coordinated
across multiple views.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3 Experimental Set-up</title>
      <p>
        <bold>Sensemaking Tasks</bold> In the experiment we look at 10 exemplar Web analytics
questions that each require a combination of at least two of the following operations to
answer: network navigation, filtering on categorical and numerical variables, grouping
and counting, search, Web browsing within Wikipedia, and zooming in on data
selections. Examples of the questions are: “How many former AFC Ajax soccer players died
in Paramaribo and what was the cause of death?”, or “Which page about a disease
is linked to most from pages about physicists?”. The complete set of questions can be
found on FigShare [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Question number 8 is marked as a difficult question, because it
is the only question that involves a set intersection between two sets of network patterns.
      </p>
      <p>
        <bold>SynerScope Visual Analytics Tooling</bold> The SynerScope tool used by the students is a
graphically accelerated visual analytics application that combines a number of views on
networked data. It offers real-time interactive exploration using scatter plots, timelines,
maps, hierarchical edge bundling network layouts, an integrated Web browser, a search
engine, and a spreadsheet table view. The selections made in any of these views are
propagated to all the other views. A video illustrating interaction with the Wikipedia
data can be found on FigShare [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
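      <p>To make the operations behind the sensemaking questions concrete, the sketch below expresses the two example questions as plain data operations: two filters combined by a set intersection, and a grouping-and-counting step over page links. The students performed these steps interactively in SynerScope, not in code. All records, names, and attribute keys (team, death_place) are hypothetical placeholders, not experiment data.</p>
      <preformat preformat-type="code">
```python
from collections import Counter

# Hypothetical page records; names and values are placeholders.
pages = [
    {"title": "Player A", "team": "AFC Ajax", "death_place": "Paramaribo"},
    {"title": "Player B", "team": "FC Barcelona", "death_place": "Paramaribo"},
    {"title": "Player C", "team": "AFC Ajax", "death_place": "Amsterdam"},
]

# Filtering on two categorical variables gives two selections.
died_in_paramaribo = {p["title"] for p in pages if p["death_place"] == "Paramaribo"}
played_for_ajax = {p["title"] for p in pages if p["team"] == "AFC Ajax"}

# Both conditions must hold, so the answer is the intersection.
answer = died_in_paramaribo.intersection(played_for_ajax)

# Grouping and counting: which disease page is linked to most from physicist pages?
# Each tuple is a hypothetical (source page, target page) link.
links = [
    ("Physicist X", "Disease P"),
    ("Physicist Y", "Disease P"),
    ("Physicist Y", "Disease Q"),
]
most_linked, link_count = Counter(target for _, target in links).most_common(1)[0]
```
      </preformat>
      <p>Stopping after the first filter would return every player who died in Paramaribo, regardless of club, which is why the second selection matters.</p>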
      <p><bold>Procedure</bold> The experiment consists of five parts: (1) a 30-minute plenary introduction to
the experiment and the data sets used, (2) a 15-minute plenary tutorial on the SynerScope
visual analytics tool, (3 and 4) two 45-minute sessions in which students try to answer questions
using SynerScope, and (5) a concluding discussion and personal interviews. The students
are asked to answer as many as possible of 10 questions about 3 subsets of Wikipedia
within 90 minutes. Each subset centers on pages about a specific topic.</p>
      <p>
        <bold>Data Sets</bold> The topics covered in the experiment are: (1) Athletes classified as soccer
players and trainers of AFC Ajax, FC Barcelona, and Manchester United, (2) Scientists
classified as physicists, (3) Artists in the pop genre. Each of these three sets consists of
around 3000 Wikipedia pages about the topic (the seed set), all the pages that are linked
to from the seed pages (the “out” context), all pages that link to the seed pages (the
“in” context), and all the links between the seed, “out” context, and “in” context pages.
This amounts to three sets of around 100k–200k pages and 300k–500k page links. Each
page is assigned around 18 attributes with information about the page, such as the page
title, the number of words on the page, the in-degree and out-degree, a three-level
hierarchical topic classification of the main subject of the page (e.g. Actor-Artist-Person,
or Building-ArchitecturalStructure-Place) derived from the DBpedia rdf:type property
of the corresponding DBpedia resource, birth/death date and place, and topic-specific
properties such as soccer team, university, or band, respectively. An example of the three
schemas can be found in the hand-outs for the students [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. We made a selection of
the DBpedia types (downloaded September 2013) that form a hierarchical partitioning
of the Wikipedia pages. We only considered types from the DBpedia ontology, ignoring
other type hierarchies such as YAGO, Freebase, and Schema.org. The selection process
involved dividing the types into three hierarchical layers, and imposing a preferential
ordering onto the types. For example, Amsterdam was assigned City at level 1,
PopulatedPlace at level 2, and Place at level 3, discarding types such as Settlement to form
a proper partition. When type information is missing, a placeholder type is assigned.
      </p>
      <p>
        <bold>Test Subjects</bold> The students involved as test subjects in the experiment are 63 middle
school and high school students (9 female, 54 male) from three schools in the
Amsterdam area between the ages of 12 and 18, divided into 34 groups of size 1–3. The
experiments were performed in two labs of the VU University Amsterdam Network
Institute<sup>5</sup>: one running SynerScope in the Amazon cloud accessed through a Web-based
client (OTOY), the other running SynerScope natively on gaming PCs with modern
NVIDIA GeForce GPUs. The students were paired up and given a hand-out describing
the three data sets, listing all the questions, and containing a form to record the answers
and the time taken [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. During the experiment students received answers to
specific technical questions, but were given no other guidance that would help them
find answers.
      </p>
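      <p>The type selection described under Data Sets can be sketched as follows. This is a minimal sketch and not the procedure actually used: the contents of each layer, their preference order, and the placeholder name Unknown are assumptions. Only the type names themselves (City, PopulatedPlace, Place, Settlement, and the Actor and Building examples) come from the text.</p>
      <preformat preformat-type="code">
```python
# Assumed layer contents, listed in assumed order of preference.
LEVELS = {
    1: ["City", "Actor", "Building"],
    2: ["PopulatedPlace", "Artist", "ArchitecturalStructure"],
    3: ["Place", "Person"],
}
PLACEHOLDER = "Unknown"  # hypothetical name for the placeholder type


def classify(rdf_types):
    """Pick one type per level: the most preferred type that the page has.

    Types that appear in no layer (such as Settlement) are discarded, and a
    level without any matching type receives the placeholder.
    """
    chosen = []
    for level in (1, 2, 3):
        match = next((t for t in LEVELS[level] if t in rdf_types), PLACEHOLDER)
        chosen.append(match)
    return tuple(chosen)


amsterdam = classify({"City", "Settlement", "PopulatedPlace", "Place"})
untyped = classify(set())
```
      </preformat>
      <p>Because every page receives exactly one type per level, the pages are partitioned at each level. The sketch does not check that the three chosen types are consistent with each other, which a real selection procedure would need to ensure.</p>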
    </sec>
    <sec id="sec-4">
      <title>4 Results</title>
      <p>
        There was a large variation in the productivity of the various students, as can be seen in
Figure 2. This can be expected of students who have no intrinsic motivation to cooperate
in the experiment. The motivated students answered all questions, while two groups did
nothing and are excluded from the results. In general, 10 questions were too many
for most students to answer in two 45-minute sessions. Most students
managed to answer the questions of two topics (6 or 7 questions). Of the questions that were
answered, about 60% were answered correctly. There was a large variation, depending
on the difficulty of the question. This is illustrated in Figure 3. Some questions were
answered partially. For example, when asked for a number and an explanation, only the
number or the explanation was answered correctly. We performed significance tests for
the differences in duration between all the categories shown in Figure 2 with Welch’s
t-test, and similarly for the categories in Figure 4. There was a slight increase in the
number of questions that were answered correctly over time. This trend is significant
according to a Mann-Kendall test (p = 0.0318), even when counting partial answers
as false answers. Students performed faster and more consistently for subsequent
questions. This is illustrated in Figure 4 (right), specifically with questions 1–7, which were
consistently answered before time ran out. This increase in speed is significant between
the first and last of the questions in the sequence at a confidence level of 95%. Older
students seemed to be faster than younger students, but their answers were of
comparable correctness. Although the difference in mean time taken between the fastest
and slowest age groups is a factor of 2, a Mann-Kendall test does not show a significant
downward trend (p = 0.178). This is due to the relatively small number of observations
(34 student teams) and a class of particularly talented middle school freshmen that
performed on par with 18-year-olds, but with a significantly higher accuracy. The data used
to derive these conclusions can be found on FigShare [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ].
      </p>
      <p><sup>5</sup> Network Institute Tech Labs, http://www.networkinstitute.org/tech-labs/
      </p>
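      <p>For readers unfamiliar with the Mann-Kendall trend test mentioned in the Results, the sketch below shows one common form of it: the S statistic with a normal approximation and continuity correction. It assumes there are no tied observations (no tie correction is applied to the variance), and the approximation is rough for short series. The input series is invented for illustration; it is not the experiment data and does not reproduce the reported p-values.</p>
      <preformat preformat-type="code">
```python
import math


def mann_kendall(values):
    """Return (S, z, two-sided p) for a Mann-Kendall trend test without ties."""
    n = len(values)
    # S sums the signs of all differences between later and earlier observations.
    s = sum(
        (values[j] > values[i]) - (values[i] > values[j])
        for i in range(n - 1)
        for j in range(i + 1, n)
    )
    # Variance of S under the null hypothesis of no trend (no tie correction).
    variance = n * (n - 1) * (2 * n + 5) / 18.0
    # Continuity correction: move S one step towards zero before standardising.
    if s > 0:
        z = (s - 1) / math.sqrt(variance)
    elif 0 > s:
        z = (s + 1) / math.sqrt(variance)
    else:
        z = 0.0
    # Two-sided p-value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return s, z, p


# Invented series of mean answer times in minutes, in question order.
s, z, p = mann_kendall([12, 10, 11, 8, 7, 6, 5])
```
      </preformat>
      <p>A negative S indicates a downward trend, and the trend is called significant when p falls below the chosen threshold.</p>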
      <p>[Figures not reproduced here. Recoverable labels: legend entries “Answered Partially” and “Not Answered”; axes “Question” (1–10), “Student”, a percentage scale (0%–100%), and “Mean time to answer” (0m–30m); age groups 12, 13, 15, 17, and 18; annotations “scientists”, “gifted students”, and “difficult question”.]</p>
      <p>
Given the preliminary nature of these results, we cannot draw very strong conclusions
yet. If we had more test subjects, we could have repeated the experiment with the topics
offered in a randomised order, which would strengthen the conclusions by removing
the learning effect and topic preference between the various topics.</p>
      <p>We are impressed by what our young test subjects were able to achieve. When given
the right tools, visual analytics of linked data really can be done by secondary school
students with minimal preparation. We found that after a short tutorial students are able
to answer most complex questions in a few minutes, learning by trial and error. Within
two hours of experience, answering speed doubles while correctness increases.</p>
      <p>The older test subjects asked for help more frequently when they got stuck than
the younger test subjects, who just found their own way through trial and error and
therefore also took longer to get to an answer than the older students (as can be seen in
Section 4). Overall, motivation appears to be a stronger success factor than age. This
belief is hard to make concrete, but it is reinforced by our observation that students
were quick to accept their first findings as a definitive answer to the question they were
working on and wanted to move on to the next question as
soon as possible. In contrast to professionals, the students did not verify their answers.
For instance, when the students had to find out how many AFC Ajax soccer players died
in Paramaribo, they typically accepted all the soccer players that died in Paramaribo as
an answer, without checking if they played for AFC Ajax. We think this can be explained
by the lack of feedback during the experiment. Students were not penalised for wrong
answers or rewarded for right answers, and the experiment was a one-time encounter
with the software. We expect that many of the incorrect or partial answers could have
been improved if the students had verified their answers.</p>
      <p>The experiment reinforced our belief that visual analytics software must be highly
interactive and present immediate feedback to the user. During the interviews at the end
of the session students were generally positive about the software and tasks and thought
the experiment gave them a new perspective on Wikipedia. Their main negative remark
was that SynerScope running on Amazon was distractingly slow. In actuality, the
software was equally fast on Amazon instances as on local machines, but the lag introduced
by network congestion, network latency, and video compression removed the sensation
of true interactivity. In isolated cases, for example when zooming out to the entire data
set of 400k links, students had to wait a few seconds. Delays in interaction like these
appeared to interrupt the students’ train of thought.</p>
      <p>We found that students of all ages are able to effectively use the SynerScope tool
to answer the questions. Older students are usually faster, but not significantly more
accurate. We would like to further test these findings with older and younger subjects.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>Thanks go to the Damstede, Pieter Nieuwland College, and Cygnus Gymnasium schools
for their participation in this experiment. We thank the VU Network Institute for the
use of their facilities, and Samir Naaimi for his assistance during the experiments. This
work was done within the context of the SAGAN project supported by ONR Global
NICOP grant N62909-14-1-N030, the EU FP7 NewsReader project (316404), and the
Dutch COMMIT Data2Semantics project.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. D3.js:
          <article-title>D3.js - Data-Driven Documents</article-title>
          (<year>2014</year>), http://d3js.org/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2. Gapminder:
          <article-title>Gapminder: Unveiling the beauty of statistics for a fact based world view</article-title>
          (<year>2014</year>), http://www.gapminder.org/
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><surname>van Hage</surname>, <given-names>W.R.</given-names></string-name>:
          <article-title>SynerScope on Wikipedia (movie)</article-title>
          (06 <year>2014</year>), http://dx.doi.org/10.6084/m9.figshare.1061499
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name><surname>Holten</surname>, <given-names>D.</given-names></string-name>,
          <string-name><surname>Cornelissen</surname>, <given-names>B.</given-names></string-name>,
          <string-name><surname>van Wijk</surname>, <given-names>J.</given-names></string-name>:
          <article-title>Trace Visualization Using Hierarchical Edge Bundles and Massive Sequence Views</article-title>.
          In:
          <source>Visualizing Software for Understanding and Analysis, 2007. VISSOFT 2007. 4th IEEE International Workshop on</source>.
          pp.
          <fpage>47</fpage>-<lpage>54</lpage>
          (<year>June 2007</year>)
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><surname>Holten</surname>, <given-names>D.</given-names></string-name>:
          <article-title>Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data</article-title>.
          <source>IEEE Transactions on Visualization and Computer Graphics</source>
          <volume>12</volume>(<issue>5</issue>),
          <fpage>741</fpage>-<lpage>748</lpage>
          (<year>Sep 2006</year>), http://dx.doi.org/10.1109/TVCG.2006.147
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><surname>Licklider</surname>, <given-names>J.C.R.</given-names></string-name>:
          <article-title>Man-computer symbiosis</article-title>.
          <source>Human Factors in Electronics, IRE Transactions on</source>
          (<issue>1</issue>),
          <fpage>4</fpage>-<lpage>11</lpage>
          (<year>1960</year>)
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name><surname>Manyika</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Chui</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Brown</surname>, <given-names>B.</given-names></string-name>,
          <string-name><surname>Bughin</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Dobbs</surname>, <given-names>R.</given-names></string-name>,
          <string-name><surname>Roxburgh</surname>, <given-names>C.</given-names></string-name>,
          <string-name><surname>Hung Byers</surname>, <given-names>A.</given-names></string-name>:
          <article-title>Big data: The next frontier for innovation, competition, and productivity</article-title>
          (<year>2011</year>), http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name><surname>Paulheim</surname>, <given-names>H.</given-names></string-name>:
          <article-title>Explain-a-lod: Using linked open data for interpreting statistics</article-title>.
          In:
          <source>Proceedings of the 2012 ACM international conference on Intelligent User Interfaces</source>.
          pp.
          <fpage>313</fpage>-<lpage>314</lpage>.
          ACM (<year>2012</year>), http://www.ke.tu-darmstadt.de/resources/explain-a-lod
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Qlikview:
          <article-title>Business Intelligence and Data Visualization Software - Qlik</article-title>
          (<year>2014</year>), http://www.qlik.com/
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. R:
          <article-title>The R Project for Statistical Computing</article-title>
          (<year>2014</year>), http://www.r-project.org/
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name><surname>Skjaeveland</surname>, <given-names>M.G.</given-names></string-name>:
          <article-title>Sgvizler: A javascript wrapper for easy visualization of sparql result sets</article-title>.
          In:
          <source>Extended Semantic Web Conference</source>
          (<year>2012</year>), http://dev.data2000.no/sgvizler/
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name><surname>Spotfire</surname>, <given-names>T.</given-names></string-name>:
          <article-title>TIBCO Spotfire - Business Intelligence Analytics Software &amp; Data Visualization</article-title>
          (<year>2014</year>), http://spotfire.tibco.com/
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. SynerScope:
          <article-title>SynerScope - Connecting the dots</article-title>
          (<year>2014</year>), http://www.synerscope.com/
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Tableau:
          <article-title>Business Intelligence and Analytics - Tableau Software</article-title>
          (<year>2014</year>), http://www.tableausoftware.com/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name><surname>van Hage</surname>, <given-names>W.R.</given-names></string-name>,
          <string-name><surname>Ploeger</surname>, <given-names>T.</given-names></string-name>,
          <string-name><surname>Hoeksema</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Núñez Serrano</surname>, <given-names>F.</given-names></string-name>:
          <article-title>Wikipedia SynerScope experiment</article-title>
          (<year>06 2014</year>), http://dx.doi.org/10.6084/m9.figshare.1060254
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>