<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Investigating Crowdsourcing as an Evaluation Method for TEL Recommender Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mojisola Erdt</string-name>
          <email>erdt@kom.tu-darmstadt.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Florian Jomrich</string-name>
          <email>jomrich@kom.tu-darmstadt.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Katja Schuler</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christoph Rensing</string-name>
          <email>rensing@kom.tu-darmstadt.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Multimedia Communications Lab, Technische Universität Darmstadt</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <abstract>
        <p>Offline evaluations using historical data offer a fast and repeatable way to evaluate TEL recommender systems. However, this is only possible if the historical datasets contain all the information needed by the recommender algorithm. Another challenge is that a resource is only evaluated as relevant if the user indicated interest in it in the past. This, however, does not mean that the user would not be interested in a newly recommended resource. User experiments help to complement offline evaluations, but due to the effort and cost of performing these experiments, very few are conducted. Crowdsourcing is a solution to this challenge as it gives access to sufficient willing users. This paper investigates the evaluation of a graph-based recommender system for TEL using crowdsourcing. Initial results show that crowdsourcing can indeed be used as an evaluation method for TEL recommender systems.</p>
      </abstract>
      <kwd-group>
        <kwd>recommender systems</kwd>
        <kwd>evaluation</kwd>
        <kwd>crowdsourcing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        At the workplace, it is increasingly common to learn on the job in order to
accomplish a certain task or to learn about a new topic needed to solve a particular
problem. These days, most of this knowledge is gained from resources found on
the Web, e.g. from videos on YouTube (www.youtube.com), slides on SlideShare
(www.slideshare.net) or forums on LinkedIn (www.linkedin.com). Recommender
systems help by suggesting resources fitting the task the person is presently
trying to solve or gain knowledge about. Various kinds of recommender systems have
been proposed for TEL, each having its particular aims and advantages [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <p>
        A lot of research has gone into the evaluation of TEL recommender systems
based on standard methods from information retrieval (IR), which mostly
determine the precision of such algorithms using cross-validation on
historical or synthetically created datasets. These offline evaluation methods are
fast to conduct once the datasets exist, and they can be repeated and easily compared
to other evaluation results [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. However, getting datasets that contain exactly the
information needed for a specific algorithm remains a challenge.
      </p>
      <p>Copyright © 2013 for the individual papers by the papers' authors.</p>
      <p>
        For example, in order to evaluate our graph-based recommender approach AScore [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a
hierarchical activity structure is required. Activities are learning goals or tasks defined
by the learner in a hierarchical structure [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. When the learner finds resources
that are needed to achieve a learning goal or to solve a task, he attaches them to
the respective activities. Activities thus support the learner during his learning
process by helping him plan and organize his tasks and learning resources.
AScore exploits these activity structures to recommend learning resources to the
learner or to other learners working on related activities. There are, however, very
few datasets that have such hierarchical activity structures [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Consequently, the
offline evaluation of this approach based on historical data is limited. This
motivated us to search for an alternative evaluation method.
      </p>
      <p>
        Another challenge arises when evaluating with historical datasets: if
new resources are recommended to a user who did not have or know these
resources in the past, there is no way of judging whether the user would like them
in the future. There have been attempts to complement offline evaluations by
conducting user experiments [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. However, due to the high effort required to
perform user experiments, not many have been conducted thus far. There is therefore
a gap between the fast, easy-to-conduct offline evaluations and the online
experiments. An attempt to bridge this gap is the online evaluation approach
using crowdsourcing [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Certainly, doubts arise regarding the quality of
results from an evaluation performed by unknown crowdworkers for a few cents.
Experiments, however, do show that results from crowdsourcing can be just as good
as those from traditional user experiments [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], depending of course on the design of
the task to solve [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>In this paper, we investigate using crowdsourcing to evaluate our TEL
recommender system AScore, comparing it to the state-of-the-art FolkRank. Our goal
is to test for relevance, novelty and diversity.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Crowdsourcing can be described as an open call to online users from a very large
community to contribute to solving a problem or to perform a human intelligence
task in exchange for payment, social recognition or entertainment [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
Advantages of crowdsourcing are the fast access to a vast population, the low cost,
high quality and flexibility [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Limitations are the artificiality of the task, the
unknown population and the need for quality control to detect spammers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Crowdsourcing has been used in research to solve various tasks in many different
domains, e.g. for surveys, usability testing, classification or translation tasks [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
An example in IR is TERC - Technique for Evaluating Relevance by
Crowdsourcing [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], developed to test the effectiveness of IR systems. Recommender strategies
have also been evaluated using crowdsourcing [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to determine the relevance of
the recommendations made. Other measures such as novelty, redundancy and
diversity have also been measured using crowdsourcing where the crowdworkers
state their preference judgements for certain items [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Crowdsourcing Evaluation Concept and Results</title>
      <p>In the crowdsourcing user experiment we investigate the following three hypotheses:</p>
      <p>H1. Relevance: AScore recommends more relevant resources than FolkRank.</p>
      <p>H2. Novelty: AScore recommends more unknown or new resources than FolkRank.</p>
      <p>H3. Diversity: AScore recommends more diverse resources than FolkRank.</p>
      <p>
        In order to generate recommendations for the experiment, initial research
on the topic of "Climate Change" was needed to create a base graph structure
(an extended folksonomy) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to run the recommender algorithms on. We selected
climate change because it is a topic currently being debated worldwide, and it can
thus be assumed that the resources recommended for this topic can be understood
and evaluated by most participants of the survey. Hence, prior to the experiment,
we asked 5 experts using CROKODIL [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to search the Web for resources
pertaining to specified activities and sub-activities relating to climate change.
About 70 resources were found and attached to 8 activities. The graph structure
thus created, comprising the users, resources, tags and activities, was then used
to generate recommendations with the two algorithms AScore and FolkRank.
Such a limited dataset would be inadequate for an offline evaluation, but it is
sufficient to prepare an online user experiment.
      </p>
      <p>
        In each questionnaire, 5 resources were recommended either for the more general
activity "Understanding Climate Change" or for the more specific sub-activity
"Analyze the catastrophes which are currently happening or going to happen
because of the higher worldwide temperature". These resources were recommended
either by AScore or by FolkRank. For each recommended resource, 10 questions
were asked (see Fig. 1): 3 questions for each hypothesis (answered on a 7-point
Likert scale) and one control question to help us detect spammers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The
participants were asked to first search the Web for resources relating to the
general topic of climate change in order to be able to judge the relevance, novelty
and diversity of the recommendations that followed.
      </p>
      <p>Hypothesis 1: Relevance</p>
      <p>Q1: The given Internet resource supports me very well in my research about the topic.</p>
      <p>Q2: If I could only use this resource, my research would still be very successful.</p>
      <p>Q3: Without this resource just by using my own resources, my research about the given topic would still be very good.</p>
      <p>Hypothesis 2: Novelty</p>
      <p>Q4: The Internet resource gives me new insights and/or information for my task.</p>
      <p>Q5: I would have found this resource on my own/anyway/during my research.</p>
      <p>Q6: There are lots of important aspects about the topic described in this resource that lack in other resources.</p>
      <p>Hypothesis 3: Diversity</p>
      <p>Q7: This Internet resource differs strongly from my other resources.</p>
      <p>Q8: This resource informs me comprehensively about my topic.</p>
      <p>Q9: This resource covers the whole spectrum of research about the given topic.</p>
      <p>Control Questions</p>
      <p>Q10a. How many pictures and tables that are relevant to the given research topic does the given resource contain?</p>
      <p>Q10b. Give a short summary of the recommended resource above by giving 4 keywords describing its content.</p>
      <p>Q10c. Describe the content of the given resource in two sentences.</p>
      <p>Fig. 1. Questions asked in the questionnaire for each hypothesis, and control questions</p>
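      <p>As a minimal sketch of how answers to the questions in Fig. 1 might be grouped by hypothesis, the following Python example averages the 7-point Likert answers to Q1 through Q9 per hypothesis. The names HYPOTHESIS_QUESTIONS and mean_rating_per_hypothesis and the sample answers are hypothetical. The sketch assumes that raw answers are pooled without reverse-scoring any item, which the questionnaire description does not specify.</p>
      <preformat><![CDATA[
```python
from statistics import mean

# Grouping of Likert questions by hypothesis, following Fig. 1.
# Q10 is the control question and is not part of any hypothesis.
HYPOTHESIS_QUESTIONS = {
    "relevance": ("Q1", "Q2", "Q3"),
    "novelty": ("Q4", "Q5", "Q6"),
    "diversity": ("Q7", "Q8", "Q9"),
}


def mean_rating_per_hypothesis(responses):
    """Average the 1-7 Likert answers over all responses, per hypothesis.

    Assumption: raw answers are pooled as given (no reverse-scoring).
    """
    result = {}
    for hypothesis, questions in HYPOTHESIS_QUESTIONS.items():
        ratings = [response[q] for response in responses for q in questions]
        result[hypothesis] = mean(ratings)
    return result


# Made-up answers for two rated resources, NOT data from the study.
example_responses = [
    {"Q1": 6, "Q2": 5, "Q3": 4, "Q4": 6, "Q5": 3, "Q6": 5, "Q7": 4, "Q8": 5, "Q9": 3},
    {"Q1": 7, "Q2": 4, "Q3": 4, "Q4": 5, "Q5": 2, "Q6": 5, "Q7": 5, "Q8": 6, "Q9": 4},
]
summary = mean_rating_per_hypothesis(example_responses)
```
]]></preformat>
      <p>Computing such a summary separately for the questionnaires generated by each algorithm would give one average per hypothesis and algorithm, which is one possible basis for a comparison.</p>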
      <p>The evaluation jobs were placed on two crowdsourcing platforms: 60 jobs
on microWorkers (http://www.microworkers.com, retrieved 19.06.2013) and 40 jobs
on CrowdFlower (http://crowdflower.com, retrieved 19.06.2013). We received results
from all over the world; most of the crowdworkers, however, came from the USA and
Bangladesh. After eliminating spammers, we had a total of 68 fully answered
questionnaires from paid crowdworkers. We additionally invited 57 voluntary
non-crowdworkers (mostly students) to take part in the survey in order to be able
to compare the quality of their results with those from crowdworkers. In total,
125 fully answered questionnaires were considered for the evaluation.</p>
      <p>The results of the experiment are shown in Fig. 2, where AScore (left, in grey)
is compared to FolkRank (right, in red). The average answers given on the Likert
scale (from 1 to 7) are shown. For each question, AScore receives a better average
score than FolkRank. We conducted a two-sample Student's t-test for each of the
hypotheses. Table 1 gives an overview of the results. Hypothesis 1 (Relevance) is
supported, as the t-test gives a p-value less than 0.05. This means the answers to
questions Q1, Q2 and Q3 support the hypothesis that AScore recommends more relevant
resources than FolkRank. Hypothesis 2 (Novelty) is supported as well, as the p-value
from the t-test is also less than 0.05. This shows that Q4, Q5 and Q6 support the
hypothesis that AScore recommends more novel resources than FolkRank. Hypothesis 3
(Diversity), measured by Q7, Q8 and Q9, is however not supported, as the p-value is
greater than 0.05. Therefore, it is not possible to say that AScore recommends more
diverse resources than FolkRank. This could be an indication that diversity is
harder to evaluate. In conclusion, the results of the experiment support the first
two hypotheses: the recommendations made by AScore are more relevant and novel than
those made by FolkRank.</p>
      <p>In this paper, we argue the need for an alternative evaluation method for TEL
recommender systems and propose using crowdsourcing. Initial results show that this
is possible and indicate that AScore provides more relevant and novel
recommendations than FolkRank. We plan to further analyse the data collected to
determine the impact of activity hierarchies by comparing the results of
recommendations made to a sub-activity with those made to an activity higher in
the hierarchy. We hypothesize that recommendations should increase in novelty
the further down the hierarchy they are made. We also plan to compare the results
between crowdworkers and non-crowdworkers and, with these insights, to improve our
proposed crowdsourcing evaluation concept and apply it to further scenarios such as
evaluating recommendations of learning resources from external sources.</p>
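      <p>To illustrate the two-sample Student's t-test used to compare AScore and FolkRank ratings, the following Python sketch computes the t statistic and the degrees of freedom in the classic equal-variance (pooled) form. The function name pooled_t_statistic and the ratings are hypothetical and are not data from the study. How the answers were aggregated before testing is not stated here, so the sketch simply compares two lists of ratings.</p>
      <preformat><![CDATA[
```python
import math
from statistics import mean, variance


def pooled_t_statistic(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for a two-sample Student's t-test.

    Assumption: both samples share a common variance (pooled estimate).
    """
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two ratings")
    dof = n_a + n_b - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / dof
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("ratings have zero variance; t is undefined")
    return (mean(sample_a) - mean(sample_b)) / std_err, dof


# Made-up 7-point Likert ratings, NOT data from the study.
ascore_ratings = [6, 5, 7, 6, 5, 6]
folkrank_ratings = [4, 5, 3, 4, 5, 3]
t_value, dof = pooled_t_statistic(ascore_ratings, folkrank_ratings)
```
]]></preformat>
      <p>The p-value is then obtained from the t-distribution with the returned degrees of freedom, for example with a statistics library such as SciPy (scipy.stats.ttest_ind with equal_var=True). For the made-up ratings above, the t statistic is about 3.84 with 10 degrees of freedom, which exceeds the two-sided 5% critical value of about 2.228.</p>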
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Alonso</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baeza-Yates</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Design and implementation of relevance assessments using crowdsourcing</article-title>
          .
          <source>In: Advances in Information Retrieval. LNCS</source>
          , vol.
          <volume>6611</volume>
          , pp.
          <fpage>153</fpage>
          –
          <lpage>164</lpage>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Anjorin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rensing</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bischoff</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bogner</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reger</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Faltin</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Steinacker</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , Ludemann,
          <string-name>
            <surname>A.</surname>
          </string-name>
          ,
          <article-title>Dom nguez Garc a</article-title>
          , R.:
          <article-title>CROKODIL - A Platform for Collaborative Resource-Based Learning</article-title>
          .
          <source>In: Towards Ubiquitious Learning</source>
          . pp.
          <volume>29</volume>
          {
          <fpage>42</fpage>
          . LNCS, Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Anjorin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodenhausen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García</surname>
            ,
            <given-names>R.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rensing</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Exploiting semantic information for graph-based recommendations of learning resources</article-title>
          .
          <source>In: 21st Century Learning for 21st Century Skills</source>
          , pp.
          <fpage>9</fpage>
          –
          <lpage>22</lpage>
          . Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Chandar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carterette</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Using preference judgments for novel document retrieval</article-title>
          .
          <source>In: Research and development in IR</source>
          . pp.
          <fpage>861</fpage>
          –
          <lpage>870</lpage>
          . SIGIR, ACM
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Habibi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Popescu-Belis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Using crowdsourcing to compare document recommendation strategies for conversations</article-title>
          .
          <source>In: Workshop on Recommendation Utility Evaluation: Beyond RMSE</source>
          . p.
          <fpage>15</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kazai</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>In search of quality in crowdsourcing for search engine evaluation</article-title>
          .
          <source>In: Advances in Information Retrieval. LNCS</source>
          , vol.
          <volume>6611</volume>
          , pp.
          <fpage>165</fpage>
          –
          <lpage>176</lpage>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Manouselis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Drachsler</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vuorikari</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hummel</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koper</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Recommender Systems in TEL</article-title>
          . In: Rec. Sys. Handbook, pp.
          <fpage>387</fpage>
          –
          <lpage>415</lpage>
          . Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>