<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Social mining from Wi-Fi campus data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Roberto Puccetti</string-name>
          <email>roberto.puccetti@unipi.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>supervised by Dino Pedreschi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mirco Nanni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CNR Pisa</institution>
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Pisa University</institution>
          ,
          <addr-line>Pisa</addr-line>
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>This research deals with methods and algorithms for analyzing the Wi-Fi data of individuals at the scale of a large campus area, studying the personality traits, class attendance, mobility and social network structure of students. In particular, the main driving application is the study of whether and how such elements influence academic performance. Data mining has been used in the telecommunication industry for several applications, including marketing, security and network reliability. In the educational field, data mining techniques are used to extract patterns that reveal hidden information in educational data. In studies of academic performance, a number of behavioral patterns have been linked to it, such as time allocation, active social ties, sleep duration and sleep quality, or participation in sport activities. While most existing studies suffer from the biases and limitations often associated with surveys and self-reports, our research is based on the Wi-Fi network of an urban campus area and therefore analyzes a large set of data that is not affected by such biases. We first want to analyze such data to assess their quality, and then to develop a tool that enables the extraction of latent knowledge from dynamic and multidimensional networks. As an application area, we use movements within a university campus and then study the influence of daily behavior on students' performance. We can define our research as multidisciplinary, in the sense that it draws on many fields of information science: Wi-Fi data analysis, social networks (construction [1], topological properties [2] and the study of communities [3]) and privacy issues [4]. However, most of the survey work done so far concerns the prediction of students' academic performance [5], [6], [7], [8], [9], [10]. 
        </p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>These works share many aspects with our research, but the most innovative aspects we introduce are: behavior observed directly rather than surveyed through questionnaires, using a large dataset collected from students (about 50k) through their tablets/smartphones; a plurality of courses of study; and social network observations derived from campus Wi-Fi data (the places students visit and their physical proximity at the same time).</p>
    </sec>
    <sec id="sec-2">
      <title>Research plan</title>
      <p>To reach the goal of the research, the activity includes the following main tasks: to adopt or develop methods and algorithms to analyze Wi-Fi data at the scale of a large campus, and to develop a software prototype, built on top of such methods and algorithms, that can be used to solve a selected set of problems, for example predicting students’ academic performance. Finally, we want to collect a “so-big dataset” from a campus area as a case study to test the prototype. For the case study, we plan: i) to extract and evaluate the importance of different sets of features for supervised learning models, in particular for students’ performance prediction; ii) to identify the individual and network factors that best correlate with students’ performance; iii) to predict students’ performance; iv) to investigate significant differences among performance groups in terms of the most important individual and network features. The data used to build the case study come from a consistent dataset: Wi-Fi access logs and exam results from the University of Pisa, described in more detail later.</p>
    </sec>
    <sec id="sec-3">
      <title>Preliminary results</title>
      <p>Wi-Fi data usually suffer from various issues, such as sparsity, noise and uncertainty, which need to be dealt with before any analysis task. We start our work by studying these issues and trying to remove (or mitigate) them.</p>
      <p>Semantic labelling of Access Points</p>
      <p>Another preliminary task is to understand the role that each Access Point (AP) connection plays in people’s activity. This means understanding two main features of the access points: their location in the area and the purpose of their use. We therefore aggregated APs to calculate and interactively display their daily use (see Fig. 1), and we identified 11 classes of use (Didactics, Central Administration, Study area, Recreational activities, Dormitory, etc.). This second task follows two approaches: Top-Down, supervised (based on location), and Data-Driven, using visual tools and the Dynamic Time Warping algorithm to validate the classes that were fixed a priori.</p>
      <p>We use two different clustering methods to study users’ ties: behavior similarity, which analyzes what percentage of their time students spend on APs and for which purpose, and physical collocation, which presupposes friendship between two individuals who habitually attend the same places. In the first, each AP class is assigned a score from 1 to 2 to “weight” the importance of the frequency with which that kind of location is visited, for instance assigning higher weights to spaces devoted to study and (secondarily) to social activities. Each student is then represented by an array giving, for each category, the percentage of their time spent in it. This representation allows the direct comparison of two individuals through a simple weighted sum (basically an L1-norm where each feature has an associated importance). The distance from student i to student j is the sum, over all categories, of the absolute difference between the two students’ percentages, multiplied by the weight of the category.</p>
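      <p>Reading the description above as a weighted L1-norm, the distance can be written as follows. The notation is an assumption introduced here for illustration and is not taken from the original formula: C is the number of AP categories, w_c is the weight of category c (between 1 and 2), and p_{i,c} is the percentage of time that student i spends in category c.</p>
      <disp-formula><tex-math>d(i,j) = \sum_{c=1}^{C} w_c \, \lvert p_{i,c} - p_{j,c} \rvert</tex-math></disp-formula>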
      <p>In this way, we build a weighted distance matrix over students, which is used for clustering operations aimed at grouping students who spend their time in a similar way (Fig. 2).</p>
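      <p>As a minimal sketch of how such a matrix could be computed, the Python fragment below builds pairwise weighted L1 distances from per-category time shares. The category names, the weight values and the toy profiles are illustrative assumptions, not values from the study; time shares are expressed as fractions rather than percentages.</p>
      <preformat>
```python
from typing import Dict, List

# Hypothetical category weights in the 1-2 range mentioned in the text;
# the names and values are illustrative assumptions.
WEIGHTS: Dict[str, float] = {
    "study_area": 2.0,
    "didactics": 1.8,
    "recreational": 1.5,
    "dormitory": 1.0,
}


def weighted_l1(p_i: Dict[str, float], p_j: Dict[str, float],
                weights: Dict[str, float]) -> float:
    """Weighted L1 distance between two time-share profiles."""
    return sum(w * abs(p_i.get(c, 0.0) - p_j.get(c, 0.0))
               for c, w in weights.items())


def distance_matrix(profiles: List[Dict[str, float]],
                    weights: Dict[str, float]) -> List[List[float]]:
    """Symmetric matrix of pairwise weighted L1 distances."""
    n = len(profiles)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = weighted_l1(profiles[i], profiles[j], weights)
            matrix[i][j] = matrix[j][i] = d
    return matrix


# Toy profiles: fraction of connected time per category (each sums to 1).
profiles = [
    {"study_area": 0.5, "didactics": 0.5},
    {"study_area": 0.5, "didactics": 0.25, "recreational": 0.25},
    {"dormitory": 1.0},
]
D = distance_matrix(profiles, WEIGHTS)
```
      </preformat>
      <p>A matrix of this kind can be passed to any clustering method that accepts precomputed distances. Because the weights multiply the per-category differences, two students who differ mainly in a highly weighted category end up farther apart than two who differ in a low-weighted one.</p>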
      <p>Instead, the physical collocation approach considers the simultaneous use of the same access point as a similarity factor. Starting from the time logs, we study the overlapping time as a function of AP use to determine a reasonable cut-off. We decided to consider 20 minutes as a reasonable overlapping time. With this parameter, and adopting the Jaccard distance, we construct the relationship matrix used by social network methods to determine the students’ communities. Using two different methods (DEMON and Louvain), we obtain the same number of communities with both. This suggests that the students’ environment is strongly clustered as a function of the places frequented.</p>
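      <p>The collocation step can be sketched as follows. This is one possible reading, not the procedure used in the study: each session is discretized into 20-minute slots, a student’s presence is the set of (AP, slot) pairs fully covered by a session, and the Jaccard distance is computed between presence sets. The session format, the slot discretization and the toy data are all assumptions made for illustration.</p>
      <preformat>
```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Set, Tuple

SLOT_MINUTES = 20  # overlap threshold reported in the text

# (student, access point, start minute, end minute); hypothetical format
Session = Tuple[str, str, int, int]
Presence = Set[Tuple[str, int]]


def presence_sets(sessions: List[Session]) -> Dict[str, Presence]:
    """Map each student to the (AP, slot) pairs fully covered by a session."""
    presence: Dict[str, Presence] = defaultdict(set)
    for student, ap, start, end in sessions:
        first = -(-start // SLOT_MINUTES)  # ceiling division: first whole slot
        last = end // SLOT_MINUTES         # slots in [first, last) fit entirely
        for slot in range(first, last):
            presence[student].add((ap, slot))
    return dict(presence)


def jaccard_distance(a: Presence, b: Presence) -> float:
    """1 - |A intersect B| / |A union B|; 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a.intersection(b)) / len(union)


def pairwise_distances(sessions: List[Session]) -> Dict[Tuple[str, str], float]:
    """Jaccard distance for every pair of students, keyed by sorted pair."""
    presence = presence_sets(sessions)
    students = sorted({s[0] for s in sessions})
    return {
        (u, v): jaccard_distance(presence.get(u, set()), presence.get(v, set()))
        for u, v in combinations(students, 2)
    }


# Toy, made-up sessions (minutes from the start of the day).
sessions = [
    ("s1", "ap_library", 0, 60),
    ("s2", "ap_library", 20, 80),
    ("s3", "ap_canteen", 0, 60),
]
distances = pairwise_distances(sessions)
```
      </preformat>
      <p>In this toy example s1 and s2 share two of the four library slots they cover between them, giving a distance of 0.5, while s3 never shares an AP with the others and is at distance 1. A similarity of one minus the distance could then serve as the edge weight of the graph given to a community detection method. Fixed slots miss overlaps that straddle a slot boundary, so a pairwise interval comparison may be preferable in practice.</p>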
    </sec>
    <sec id="sec-4">
      <title>Conclusions and next steps</title>
      <p>After community discovery, based on the analysis of students’ behaviors, we want to correlate the communities with the students’ performance. The first and simplest method we will use to compute performance is to combine exam results with their importance and their timeliness (e.g. penalizing exams that are taken later than scheduled). Finally, based on this score and on the cluster membership described above, we will profile the students. This will be done by studying how to optimize the accuracy of profiling. Defining such accuracy in a well-founded way is part of the challenge.</p>
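      <p>To make the idea of a performance score concrete, the sketch below shows one hypothetical way to combine exam results, importance and timeliness. It is not the formula used in this work: the use of credits as importance, the grade scale, the discount 1 / (1 + penalty * terms_late) and the default penalty value are all assumptions chosen only for illustration.</p>
      <preformat>
```python
from typing import List, NamedTuple


class Exam(NamedTuple):
    grade: float      # exam mark, e.g. on an 18-30 scale (assumption)
    credits: float    # stands in for the "importance" of the exam (assumption)
    terms_late: int   # exam sessions past the scheduled one; 0 means on time


def performance_score(exams: List[Exam], penalty: float = 0.1) -> float:
    """Credit-weighted mean grade, discounted for late exams.

    Each grade is divided by (1 + penalty * terms_late), so an on-time
    exam counts fully and later exams count progressively less.
    """
    total_credits = sum(e.credits for e in exams)
    if total_credits == 0:
        return 0.0
    weighted = sum(
        e.credits * e.grade / (1.0 + penalty * e.terms_late) for e in exams
    )
    return weighted / total_credits


# Two toy careers with the same grades; the second passes one exam late.
on_time = [Exam(30, 6, 0), Exam(24, 12, 0)]
delayed = [Exam(30, 6, 0), Exam(24, 12, 2)]
score_on_time = performance_score(on_time)
score_delayed = performance_score(delayed)
```
      </preformat>
      <p>With these toy inputs the delayed career scores lower than the on-time one even though the grades are identical, which is the behavior the timeliness term is meant to capture. The shape of the discount and the value of the penalty would need to be chosen and validated against real data.</p>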
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>L.</given-names>
            <surname>Kovanen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Saramäki</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Kaski</surname>
          </string-name>
          ,
          <article-title>"Reciprocity of mobile phone calls,"</article-title>
          <source>JDySES</source>
          , vol.
          <volume>2</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>138</fpage>
          -
          <lpage>151</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          393, no.
          <issue>6684</issue>
          , pp.
          <fpage>440</fpage>
          -
          <lpage>442</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Kianmehr</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Alhajj</surname>
          </string-name>
          ,
          <article-title>"Calling communities analysis and identification using machine learning techniques,"</article-title>
          <source>Expert Systems with Applications</source>
          , vol.
          <volume>36</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>6218</fpage>
          -
          <lpage>6226</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>L.</given-names>
            <surname>Sweeney</surname>
          </string-name>
          ,
          <article-title>"k-anonymity: a model for protecting privacy,"</article-title>
          <source>International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems</source>
          , vol.
          <volume>10</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>557</fpage>
          -
          <lpage>570</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>D.</given-names>
            <surname>Gašević</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Janzen</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zouaq</surname>
          </string-name>
          ,
          <article-title>"'Choose your classmates, your GPA is at stake!' The association of cross-class social ties and academic performance,"</article-title>
          <source>Am Behav Sci</source>
          , vol.
          <volume>57</volume>
          , no.
          <issue>10</issue>
          , pp.
          <fpage>1460</fpage>
          -
          <lpage>1479</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>M.P.</given-names>
            <surname>Vitale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.C.</given-names>
            <surname>Porzio</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Doreian</surname>
          </string-name>
          ,
          <article-title>"Examining the effect of social influence on student performance through network autocorrelation models,"</article-title>
          <source>J Appl Stat</source>
          , vol.
          <volume>43</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>115</fpage>
          -
          <lpage>127</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>PV.</given-names>
            <surname>Marsden</surname>
          </string-name>
          and
          <string-name>
            <given-names>KE.</given-names>
            <surname>Campbell</surname>
          </string-name>
          ,
          <article-title>"Measuring tie strength,"</article-title>
          <source>Soc Forces</source>
          , vol.
          <volume>63</volume>
          , no.
          <issue>2</issue>
          , p.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          9, no.
          <year>2015</year>
          , pp.
          <fpage>6415</fpage>
          -
          <lpage>6426</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>I.</given-names>
            <surname>Smirnov</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Thurner</surname>
          </string-name>
          ,
          <article-title>"Formation of homophily in academic performance: students prefer to change their friends rather than performance,"</article-title>
          https://arxiv.org/abs/1606.09082,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Kassarnig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bjerre-Nielsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sapiezynski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.D.</given-names>
            <surname>Dreyer</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.L.</given-names>
            <surname>Jørgensene</surname>
          </string-name>
          ,
          <article-title>"Academic performance and behavioral patterns,"</article-title>
          <source>EPJ Data Science</source>
          , vol.
          <volume>7</volume>
          , no.
          <issue>10</issue>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>