<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Educational Technology &amp; Society (Special Issue on Learning &amp; Knowledge Analytics, edited by George Siemens &amp; Dragan Gašević)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Fostering Analytics on Learning Analytics Research: the LAK Dataset</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Davide Taibi</string-name>
          <email>davide.taibi@itd.cnr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Dietze</string-name>
          <email>dietze@l3s.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute for Educational Technologies, National Research Council of Italy</institution>
          ,
          <addr-line>Via Ugo La Malfa 153, Palermo</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>L3S Research Center</institution>
          ,
          <addr-line>Appelstr. 9a, 30167 Hannover</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <volume>15</volume>
      <issue>3</issue>
      <abstract>
        <p>This paper describes the Learning Analytics and Knowledge (LAK) Dataset, an unprecedented collection of structured data created from a set of key research publications in the emerging field of learning analytics. The unstructured publications have been processed and exposed in a variety of formats, most notably according to Linked Data principles, in order to provide simplified access for researchers and practitioners. The aim of this dataset is to provide the opportunity to conduct investigations, for instance, about the evolution of the research field over time, correlations with other disciplines or to provide compelling applications which take advantage of the dataset in an innovative manner. In this paper, we describe the dataset, the design choices and rationale and provide an outlook on future investigations.</p>
      </abstract>
      <kwd-group>
        <kwd>Learning Analytics</kwd>
        <kwd>Linked Data</kwd>
        <kwd>Educational Data Mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <p>Table 1: Papers included in the LAK dataset</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>As part of an international team of research practitioners
consisting of the Society for Learning Analytics Research
(SoLAR)<sup>1</sup>, ACM<sup>2</sup>, the LinkedUp project<sup>3</sup> and the Educational
Technology Institute of the National Research Council of Italy
(CNR-ITD), we have released an unprecedented resource for
Learning Analytics and Educational Data Mining. In order to [...]</p>
      <sec id="sec-2-1">
        <title>1 www.solaresearch.org/</title>
      </sec>
      <sec id="sec-2-2">
        <title>2 http://acm.org/</title>
      </sec>
      <sec id="sec-2-3">
        <title>3 http://linkedup-project.eu/</title>
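        <p>Table 1, column "Publication": Proceedings of the ACM International Conference on Learning Analytics and Knowledge (LAK) (2011-12); Proceedings of the International Conference on Educational Data Mining (2008-12); Educational Technology &amp; Society, Special Issue on Learning &amp; Knowledge Analytics (edited by George Siemens &amp; Dragan Gašević). Values recovered from the remaining cells: 10, 239, 16.</p>
      </sec>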
      <sec id="sec-2-6">
        <title>4 http://www.educationaldatamining.org/proceedings 5 http://www.ifets.info/issues.php?id=56</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2. FROM SCHOLARLY PAPER TO STRUCTURED DATA</title>
    </sec>
    <sec id="sec-5">
      <title>2.1 The extraction process</title>
      <p>In order to process and analyze the set of unstructured journal
and conference papers, the data was transformed into structured data.
Each conference proceedings is available on the Web in PDF
format, but each collection has its own structure. Even though the
most used format is the ACM template<sup>6</sup>, papers do not
always comply with it entirely, calling for specifically
adapted extraction mechanisms. The overall knowledge
extraction process is composed of three main steps:</p>
      <p>1. Transforming PDF documents to plain textual representation.</p>
      <p>2. Cleaning up and consolidation of the textual information.</p>
      <p>3. Extracting structured data from text.</p>
      <sec id="sec-5-3">
        <p>In the first step, the PDF file containing the proceedings of a
conference, or the papers of a journal, is split up in order to have
one document for each paper. Each PDF file is then
processed with the pdf2text tool in order to obtain a textual
representation of each paper.</p>
        <p>In the second step, the text files are processed in order to
transform them into a partially structured format that can be
handled automatically. In particular, at this step tables and
figures are removed from the paper while their captions,
which can be useful for text mining, are kept; footnotes are
also removed from the text, and bulleted or numbered lists are
organized using a homogeneous format.</p>
        <p>As part of the third step, the text files are processed in order to
extract the most important sections of each document.
The authors' names, affiliations and countries are
represented using the FOAF ontology.
For each paper the following information is collected: title,
authors, keywords, abstract, full text and its relationship with the
type of publication or event (journal or conference proceedings).
It is important to note that, besides the common metadata of the
learning analytics papers, such as title, abstract, authors and
affiliation, the full text of the papers is also stored in the dataset.
At this stage the full text is stored without considering its
separation into paragraphs and sections; however, the processing
performed in step 2 has also identified the titles and
paragraphs of sections and subsections, thus providing the basis
for analyzing the full text at a finer granularity in future versions of
the LAK dataset. The references of each paper are also extracted but are
not made available in this version of the dataset.</p>
        <p>6 http://www.acm.org/sigs/publications/proceedings-templates</p>
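        <p>The clean-up performed in the second step can be illustrated with a minimal Python sketch. It is not the pipeline used to build the dataset: the regular expressions, the function name clean_paper_text and the "- " list format are assumptions chosen for illustration. The sketch drops footnote-like lines (a number followed by a URL), keeps captions, and rewrites bulleted or numbered list items in one homogeneous format.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical heuristics: the rules actually used for the LAK dataset are not
# described in the paper.
FOOTNOTE = re.compile(r"^\d{1,2}\s*(https?://|www\.)\S+$")
LIST_ITEM = re.compile(r"^(?:[\u2022\u25aa*-]|\d+[.)])\s+(.*)$")


def clean_paper_text(raw: str) -> str:
    """Return the text with footnote lines removed and list items normalized."""
    cleaned = []
    for line in raw.splitlines():
        line = line.strip()
        if not line or FOOTNOTE.match(line):
            continue  # skip blank lines and footnote-like lines
        item = LIST_ITEM.match(line)
        if item:
            line = "- " + item.group(1)  # one homogeneous list format
        cleaned.append(line)
    return "\n".join(cleaned)


sample = """Table 1: Papers included in the LAK dataset
6 http://www.acm.org/sigs/publications/proceedings-templates
1. Transforming PDF documents
\u2022 Cleaning up the text
Plain sentence."""
print(clean_paper_text(sample))
```
]]></preformat>
        <p>Because the caption line does not look like a footnote or a list item, it passes through unchanged, which mirrors the decision to keep captions for later text mining.</p>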
      </sec>
    </sec>
    <sec id="sec-6">
      <title>2.2 The schema</title>
      <p>
        The schema used to describe the papers in the dataset is based on
two established schemas: the Semantic Web Conference (SWC)
ontology<sup>7</sup> (already used to describe metadata about publications
from the Semantic Web conferences and related events<sup>8</sup>) and the
Linked Education schema<sup>9</sup>. The Linked Education schema has
been developed to represent and catalog both educational datasets and
education-related datasets, the latter being datasets not specifically
created for education but usable in an educational
context. The schema has been used to annotate datasets and
resources as part of an integrated dataset<sup>10</sup>, which contains
educationally relevant resources such as LinkedUniversities [<xref ref-type="bibr" rid="ref1">1</xref>]
and the mEducator Educational Resources [<xref ref-type="bibr" rid="ref4">4</xref>], with their Open
Educational Resources and materials explicitly related to
education, as well as implicitly educationally relevant datasets
such as BBC Programmes [<xref ref-type="bibr" rid="ref3">3</xref>], ACM Library Metadata<sup>11</sup> and
Europeana [<xref ref-type="bibr" rid="ref2">2</xref>].
      </p>
      <p>7 http://data.semanticweb.org/ns/swc/ontology</p>
      <p>8 http://data.semanticweb.org/</p>
      <p>9 http://data.linkededucation.org/ns/linked-education.rdf</p>
      <p>10 http://linkedup.l3s.uni-hannover.de:8880/openrdf-sesame/repositories/linked-learning-selection?query</p>
      <p>11 http://acm.rkbexplorer.com/</p>
      <p>The main entities collected in the LAK
dataset are paper authors, institutions and papers related to the
learning analytics area. Authors and institutions are
represented using the classes Person and
Organization of the FOAF ontology, respectively, while papers are represented using the
class InProceedings of the SWRC ontology. The
LAK Dataset, at the time of writing, includes 779 authors,
connected to 295 institutions, and 315 posters, abstracts, short papers and
full papers.</p>
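      <p>The modelling described above can be sketched as a handful of RDF triples. The Python snippet below is illustrative only: it uses the FOAF classes Person and Organization and the SWRC class InProceedings named in the text, together with the swrc:year and led:body properties that appear in the SPARQL query of Section 2.3. The example.org resource URIs, the helper name triple, the invented author name, and the use of foaf:name, foaf:made and swrc:affiliation to link the entities are assumptions, not documented properties of the dataset.</p>
      <preformat><![CDATA[
```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"
SWRC = "http://swrc.ontoware.org/ontology#"
LED = "http://data.linkededucation.org/ns/linked-education.rdf#"
EX = "http://example.org/lak/"  # hypothetical base URI, not the dataset's


def triple(subject: str, predicate: str, obj: str, literal: bool = False) -> str:
    """Serialize one statement as an N-Triples line.

    Only backslashes and double quotes are escaped in literals, which is
    enough for this short illustration.
    """
    if literal:
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        value = '"' + escaped + '"'
    else:
        value = "<" + obj + ">"
    return "<" + subject + "> <" + predicate + "> " + value + " ."


paper, author, org = EX + "paper/1", EX + "person/1", EX + "organization/1"
triples = [
    triple(author, RDF_TYPE, FOAF + "Person"),
    triple(author, FOAF + "name", "Jane Doe", literal=True),  # invented name
    triple(org, RDF_TYPE, FOAF + "Organization"),
    triple(author, SWRC + "affiliation", org),  # assumed linking property
    triple(paper, RDF_TYPE, SWRC + "InProceedings"),
    triple(paper, SWRC + "year", "2011", literal=True),
    triple(paper, LED + "body", "Full text of the paper ...", literal=True),
    triple(author, FOAF + "made", paper),  # assumed linking property
]
print("\n".join(triples))
```
]]></preformat>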
    </sec>
    <sec id="sec-7">
      <title>2.3 Access Methods</title>
      <p>In order to support different access methods for the data, the
resources of the LAK dataset have been published in different
formats:</p>
      <p>A dump file in zipped RDF/XML format can be directly
downloaded from the SoLAR research web page.</p>
      <p>A version of the dataset in a format that can be processed
with the R statistical software<sup>12</sup> has been provided.</p>
      <p>A Linked Data interface with a public SPARQL endpoint
has been developed in order to provide access to structured
RDF metadata according to LOD principles<sup>13</sup>.</p>
      <p>The following SPARQL query<sup>14</sup> on the LAK dataset can be used
to extract the full text of all 2011 papers (LAK 2011 and EDM
2011 conferences) in .srx format (an XML file which can be opened
in any text editor):</p>
      <preformat>PREFIX led: &lt;http://data.linkededucation.org/ns/linked-education.rdf#&gt;
PREFIX swrc: &lt;http://swrc.ontoware.org/ontology#&gt;
SELECT ?paper ?fulltext WHERE {
  ?paper led:body ?fulltext .
  ?paper swrc:year ?year .
  FILTER (?year = "2011")
}</preformat>
      <p>On the SoLAR website, some useful examples for querying the
SPARQL endpoint<sup>15</sup> of the LAK dataset are reported<sup>16</sup>.
The example queries allow users, for instance, to retrieve the
papers co-authored by two selected authors, or all papers published
in both EDM and LAK conferences by the authors affiliated to an
institution.</p>
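      <p>As a sketch of programmatic access, the following Python snippet builds the request URL for the full-text query above against the endpoint given in footnote 13 and parses a SPARQL XML results (.srx) document with the standard library. The function names and the inline sample result are invented for illustration, and the endpoint may no longer be online, so no network request is issued here.</p>
      <preformat><![CDATA[
```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ENDPOINT = "http://data.linkededucation.org/openrdf-sesame/repositories/lakconference"
SRX = "{http://www.w3.org/2005/sparql-results#}"  # SPARQL XML results namespace

QUERY = """PREFIX led: <http://data.linkededucation.org/ns/linked-education.rdf#>
PREFIX swrc: <http://swrc.ontoware.org/ontology#>
SELECT ?paper ?fulltext WHERE {
  ?paper led:body ?fulltext .
  ?paper swrc:year ?year .
  FILTER (?year = "2011")
}"""


def build_query_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Return a GET URL carrying the percent-encoded SPARQL query."""
    return endpoint + "?" + urlencode({"queryLn": "SPARQL", "query": query})


def parse_srx(document: str) -> list:
    """Turn an .srx document into a list of {variable: value} dictionaries."""
    rows = []
    for result in ET.fromstring(document).iter(SRX + "result"):
        row = {}
        for binding in result.findall(SRX + "binding"):
            value = binding[0]  # the <uri>, <literal> or <bnode> child
            row[binding.get("name")] = value.text or ""
        rows.append(row)
    return rows


# Invented sample response, shaped like the W3C SPARQL XML results format.
SAMPLE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head><variable name="paper"/><variable name="fulltext"/></head>
  <results>
    <result>
      <binding name="paper"><uri>http://example.org/paper/1</uri></binding>
      <binding name="fulltext"><literal>Sample full text</literal></binding>
    </result>
  </results>
</sparql>"""

print(build_query_url(QUERY))
print(parse_srx(SAMPLE))
```
]]></preformat>
      <p>The generated URL can be opened in a browser or passed to any HTTP client; the response, when the endpoint is reachable, would be parsed in the same way as the sample document.</p>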
    </sec>
    <sec id="sec-8">
      <title>3. LAK DATA CHALLENGE</title>
      <p>Beyond merely publishing the data, we are actively encouraging
its innovative use and exploitation as part of a public LAK Data
Challenge<sup>17</sup> sponsored by the European project LinkedUp. An
initial competition is co-located with the ACM LAK13
Conference, Leuven, Belgium (April 2013)<sup>18</sup>. The challenge
revolves around the overall question of what insights can be
gained from analytics on the LAK corpus about the general
discipline of Learning Analytics and its connection to other fields.
How can we make sense of this emerging field’s historical roots,
current state, and future trends, based on how its members report
and debate their research? Challenge submissions should exploit
the LAK Dataset by covering one or more of the following
non-exclusive topics:</p>
      <p>Analysis &amp; assessment of the emerging LAK community
in terms of topics, people, citations or connections with
other fields</p>
      <p>Innovative applications to explore, navigate and visualise
the dataset (and/or its correlation with other datasets)</p>
      <p>Usage of the dataset as part of recommender systems</p>
      <p>12 http://www.r-project.org/</p>
      <p>13 http://data.linkededucation.org/openrdf-sesame/repositories/lakconference?query=[your sparql query]</p>
      <p>14 http://data.linkededucation.org/openrdf-sesame/repositories/lakconference?queryLn=SPARQL&amp;query=PREFIX%20led%3A%3Chttp%3A%2F%2Fdata.linkededucation.org%2Fns%2Flinkededucation.rdf%23%3E%0APREFIX%20swrc%3A%3Chttp%3A%2F%2Fswrc.ontoware.org%2Fontology%23%3E%0A%0Aselect%20%3Fpaper%20%3Ffulltext%20where%20%7B%3Fpaper%20led%3Abody%20%3Ffulltext%20.%20%3Fpaper%20swrc%3Ayear%20%3Fyear%20.%20FILTER%20%28%3Fyear%20%3D%20%222011%22%29%20%7D&amp;infer=true</p>
      <p>15 http://data.linkededucation.org/openrdf-sesame/repositories/lakconference</p>
    </sec>
    <sec id="sec-9">
      <title>4. CONCLUSIONS</title>
      <p>The LAK Dataset is a starting point to allow the
analysis of learning analytics as an emerging research discipline,
including its definition and evolution. Along with the growth of
research work in learning analytics and related fields, we intend
to expand the dataset by adding new research publications. In
addition, while the dataset currently contains plain metadata and
the full text of the research publications, we envisage extracting and
adding further data about contained entities and topics, to provide
simple means for assessing, exploring and navigating the data.</p>
    </sec>
    <sec id="sec-10">
      <title>5. ACKNOWLEDGMENTS</title>
      <p>This work is partly funded by the European Union under FP7
Grant Agreement No 317620 (LinkedUp).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Fernandez</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>d'Aquin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Motta</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Linking Data Across Universities: An Integrated Video Lectures Dataset</article-title>
          .
          <source>In Proceedings of the 10th International Semantic Web Conference (ISWC</source>
          <year>2011</year>
          ),
          <fpage>23</fpage>
          -
          <lpage>27</lpage>
          Oct 2011, Bonn, Germany.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Haslhofer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isaac</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>data.europeana.eu - The Europeana Linked Open Data Pilot</article-title>
          .
          <source>In Proceedings of the International Conference on Dublin Core and Metadata Applications (DC</source>
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Kobilarov</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Scott</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raimond</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Oliver</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sizemore</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smethurst</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Media Meets Semantic Web - How the BBC Uses DBpedia and Linked Data to Make Connections</article-title>
          .
          <source>In Proceedings of the 6th European Semantic Web Conference (ESWC2009).</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Mitsopoulou</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taibi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giordano</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dietze</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>H. Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bamidis</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bratsas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Woodham</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Connecting Medical Educational Resources to the Linked Data Cloud: the mEducator RDF Schema, Store and API</article-title>
          , in
          <source>Linked Learning</source>
          <year>2011</year>
          ,
          <source>Proceedings of the 1st International Workshop on eLearning Approaches for the Linked Data Age, CEUR-WS</source>
          , Vol.
          <volume>717</volume>
          ,
          <year>2011</year>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>