<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards a Recommender System for Statistical Research Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniel Bahls</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guido Scherp</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Klaus Tochtermann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wilhelm Hasselbring</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Leibniz Information Centre for Economics (ZBW)</institution>
          ,
          <addr-line>Kiel</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Software Engineering Group, Kiel University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2012</year>
      </pub-date>
      <fpage>61</fpage>
      <lpage>72</lpage>
      <abstract>
        <p>To effectively promote the exchange of scientific data, retrieval services are required to suit the needs of the research community. A large amount of research in the field of economics is based on statistical data, which is often drawn from external sources like data agencies, statistical offices or affiliated institutes. Since producing such data for a particular research question is expensive in time and money (if possible at all), research activities are often influenced by the availability of suitable data. Researchers choose or adjust their questions so that the empirical foundation to support their results is given. As a consequence, researchers look out and poll for newly available data in all sorts of directions, because an information infrastructure for this domain is lacking. This circumstance and a recent report from the High Level Expert Group on Scientific Data motivate recommendation and notification services for research data sets. In this paper, we elaborate on a case-based recommender system for statistical data, which allows for precise query specification. We discuss required similarity measures on the basis of cross-domain code lists and propose a system architecture. To address the problem of continuous polling, we elaborate on a notification service that informs researchers about newly available data sets based on their personal request.</p>
      </abstract>
      <kwd-group>
        <kwd>Research Data Management</kwd>
        <kwd>Semantic Digital Data Library</kwd>
        <kwd>Linked Data</kwd>
        <kwd>Statistics</kwd>
        <kwd>Recommender Systems</kwd>
        <kwd>Case-Based Reasoning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        At present, efforts are being made to treat research data as bibliographic
artifacts for re-use, transparency and citation. Data publications will be
submitted to digital archives and registered in central catalogs, which lays the ground
for information services that support the scientific community in finding relevant
data. Since every scientific discipline brings its own challenges in this endeavor,
specific solutions are required, so that valuable, and hence accepted, services can
be offered to the scientific community [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The High Level Expert Group on
Scientific Data recommends providing data recommendation services that suggest
relevant research data to the individual scientist [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. This appears to be
particularly applicable in the domain of economics, where research activity is influenced
by the availability of statistical research data sets.<sup>3</sup> Researchers adjust to what
data is available and adapt research questions so that the empirical foundation
can be given.
      </p>
      <p>As a consequence, researchers look out and poll for newly available data in
all sorts of directions, because an information infrastructure for this
domain is lacking. They exchange news on newly available data at conferences, at meetings
or simply at lunch time or during coffee breaks. They also revisit websites of data
agencies, repositories and familiar institutes to run their personal portfolio of
keyword-based queries on regular web search engine interfaces, trying to
express their request for specific data sets. Although this is best practice at present, the
strategy is laborious and unlikely to return a complete list of relevant
data sets. This picture was shared with us in interviews we conducted with
researchers in economics.</p>
      <p>Having catalogs of registered research data sets puts us in a good position to
address the above problem and develop well-conceived search tools and services
for our scientific community. Besides the fact that the catalog itself lays the
ground for a more organized search, this paper tries to address the following two
aspects of the identified problem:
        <list list-type="order">
          <list-item><p>Phrasing several queries with different keywords and filters of all kinds to
cover the range of relevant data sets.</p></list-item>
          <list-item><p>Continuous polling at regular time intervals.</p></list-item>
        </list>
      </p>
      <p>The remainder of the paper is structured as follows. We review related work
and decide on our approach in Section 2. Section 3 summarizes the findings and
formulates the functional requirements for our proposed system. Since we follow
a case-based recommendation approach, we examine case base and case structure
in Section 4 and then elaborate on a similarity measure design on the basis of common
code lists in Section 5. Section 6 considers a notification service, and Section 7
proposes a system architecture. Finally, we close with conclusions and outlook in Section 8.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        In the domain of statistical research data, one main difficulty is given by data
protection and usage rights, so that uploading entire data collections to an
independent repository causes legal problems. This is one of several reasons why we
have decided to use Semantic Web technologies for the data model, which are
strong in fine-grained referencing and in dealing with distributed data sources.
In particular, we use the RDF Data Cube Vocabulary (QB), which integrates the
SDMX standard<sup>4</sup> and is increasingly recognized in the domain of statistics [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. A more detailed argumentation and an overall vision for our research
is given in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
        <fn><label>3</label><p>A large amount of research in the field of economics is based on statistical data,
which is often drawn from external sources like data agencies, statistical offices or
affiliated institutes. Producing such data for a particular research question is expensive
in time and money, if possible at all.</p></fn>
        <fn><label>4</label><p>Statistical Data and Metadata eXchange Language http://sdmx.org/</p></fn>
      </p>
      <p>
        There are several different types of recommender systems, for which a
comprehensive overview can be found in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Especially in e-commerce environments,
collaborative filtering has become established as a common technique. Online stores like
Amazon<sup>5</sup> recommend products on the basis of similar user profiles, following the
idea that one might be interested in the products that other users with similar
interest patterns have purchased. While this technique can be applied irrespective
of the kind of items operated on, it demands large amounts of usage data from
a sufficient number of users in order to produce meaningful recommendations.
This initial overhead is known as the cold-start problem and usually requires user
acceptance long before the value of item recommendation can be experienced.
      </p>
      <p>
        Another technique makes use of the items’ digital content<sup>6</sup>; we refer to
such systems as content-based recommendation systems [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Typically, items are mapped onto
a vector space model where distances between them can be calculated using
common mathematical means. This technique has become established particularly in
the context of textual items, where measures like TF-IDF<sup>7</sup> are frequently used. However,
this approach again depends on an initial set of usage data. While collaborative
filtering compares patterns among user profiles, content-based retrieval is based
on the usage history of a single user and suggests similar items according to what
she or he found useful or not useful earlier.
      </p>
      <p>A third system type is based on background knowledge and calculates
recommendations merely on the basis of a given user query and domain-specific
preference knowledge encoded in the form of rules or specifically designed
similarity measures. The approach therefore does not build on usage data at all and
thus is not affected by the cold-start problem. Since usage data on statistical
research data sets is not easily available to us and difficult to acquire in sufficient
quantity, we find this approach most suitable for our domain. The amount of
statistical research data is tremendous, and the amount of usage data required
scales accordingly if we plan to include all available data sets for
recommendation. In addition, recommending data sets that are similar to the ones used
previously may not be helpful in the scientific domain, where researchers often
work on various projects simultaneously or change their research area when
moving to another organization. The systems described above also tend to recommend
older items, because usage statistics on newer ones build up slowly.<sup>8</sup> While these
drawbacks do not apply to knowledge-based recommenders, another advantage
is their strength in explaining results, so that users can understand why a
particular recommendation was considered relevant. Furthermore, a lot of
background knowledge for statistical data is available and has even been formalized
in SDMX<sup>9</sup>, DDI<sup>10</sup>, code lists and the RDF Data Cube Vocabulary (QB), which
also encourages a knowledge-based approach.
        <fn><label>5</label><p>http://www.amazon.com</p></fn>
        <fn><label>6</label><p>be it metadata, a textual description or the digital item itself, as for example in
document retrieval scenarios</p></fn>
        <fn><label>7</label><p>Term Frequency - Inverse Document Frequency</p></fn>
        <fn><label>8</label><p>also known as the time-span problem</p></fn>
      </p>
      <p>
        Knowledge-based recommender systems are typically constraint-based or
case-based [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. While the former uses rule sets and constraint resolvers to produce
recommendations, the case-based approach uses specifically designed similarity
measures that are meant to reflect the user’s understanding of utility [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. We
have chosen to follow a case-based approach on the grounds of positive
experiences in earlier projects. As a consequence, a research data set is considered a case and
may be referred to as such in the following. Cases in general can be represented
textually, as a feature vector, or as a structured representation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The cases
according to our RDF-based data model are already in structured form, which
gives reason to choose a structured CBR<sup>11</sup> approach over a textual, feature-based
or other approach.
      </p>
      <p>
        Common data repositories do not yet offer recommendation features and
focus on providing full text search interfaces and filtering features. Text search
algorithms often yield scores that allow for relevance ranking and are applied to
textual fields of the respective underlying metadata model. Search criteria given
for the more structured part of the model<sup>12</sup> are usually filtered on, meaning
that all unmatched items are removed from the ranking [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. A typical
implementation imposes this rather technical and limited viewpoint on the user, who
switches back and forth modifying query phrase and parameters to cover the
whole spectrum of possibly interesting search results, simply to deal with the
limitations of such a rigid interface.<sup>13</sup> Admittedly, these issues are difficult
to overcome, and most retrieval algorithms incorporate stemming, query
expansion and other strategies while still targeting a simple interface, which certainly
is another important design goal. Our aim is first to capture a clear picture of the user’s
needs, which requires no further editing once specified. Every item that
matches the query entirely would be considered a perfect match, and in this respect
the approach performs like the common ones. In addition, however, the
system should be able to find near matches and offer further means of knowledge
discovery, which is a more high-level approach in the first place.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Functional Requirements</title>
      <p>
        To address research objective 1, the system must provide an interface that allows
for precise specification of a data request, enabling researchers to pinpoint
the perfect data set regardless of whether such data exists. This can be done
on the basis of the RDF Data Cube Vocabulary, which provides a wide range of
predicates and attributes to formulate precise queries. The system further needs
to know which aspects of the query are of greater and which are of lesser
significance, which can be handled with the help of user-defined weights [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
        <fn><label>9</label><p>Statistical Data and Metadata eXchange Language http://sdmx.org/</p></fn>
        <fn><label>10</label><p>Data Documentation Initiative http://www.ddialliance.org/</p></fn>
        <fn><label>11</label><p>Case-based reasoning, or case-based recommending in our case</p></fn>
        <fn><label>12</label><p>e.g. creation date, size, country of origin or other domain-specific fields</p></fn>
        <fn><label>13</label><p>rephrasing query terms, resetting date ranges, size parameters, geo location and
other</p></fn>
      </p>
      <p>The second objective can be achieved through a notification service that sends
out updates on newly available data to the individual user whenever a data set is estimated to be
relevant.</p>
      <p>An understanding of utility must be encoded in the system, so that it can estimate
whether data not perfectly matching the user’s description may still
interest the user. In case-based recommender systems, such knowledge is encoded
in similarity measures that are used to determine an estimated degree of utility of
a particular case under a given query. Such measures must be designed carefully
and must not make assumptions about user preferences where no foundation is given.
Case-based recommenders can in principle be applied to our research objectives.
However, the value of this approach depends on whether meaningful
similarity measures can be implemented, which will be investigated in Section 5.</p>
    </sec>
    <sec id="sec-4">
      <title>Case Structure</title>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Cross-domain code lists issued by the SDMX consortium</p></caption>
        <table>
          <thead>
            <tr><th>Code list</th><th>Description</th></tr>
          </thead>
          <tbody>
            <tr><td>CL_OBS_STATUS</td><td>Status of an observation with respect to events such as the ones reflected in the codes composing the code list.</td></tr>
            <tr><td>CL_CONF_STATUS</td><td>Coded information about the sensitivity and confidentiality status of the data.</td></tr>
            <tr><td>CL_DECIMALS</td><td>Gives information on the number of decimal digits used in the data.</td></tr>
            <tr><td>CL_FREQ</td><td>Indicates the “frequency” of the data (e.g. monthly) and thus, indirectly, also implies the type of “time reference” that could be used for identifying the data with respect to time.</td></tr>
            <tr><td>CL_SEX</td><td>Provides information on the gender.</td></tr>
            <tr><td>CL_TIME_FORMAT</td><td>Time format as written in the SDMX-EDI and SDMX-ML messages; these codes (based on the ISO 8601 standard) indicate the type of time references used in the data. The numeric codes (203, 102, ..., 702) are used only in the SDMX-EDI messages, and the alphanumeric codes (P1D, ..., PT1M) only in the SDMX-ML messages.</td></tr>
            <tr><td>CL_UNIT_MULT</td><td>Unit multiplier; indicates the magnitude in the units of measurement.</td></tr>
            <tr><td>CL_AREA</td><td>Reference area and/or counterpart area; geographical areas, defined as areas included within the borders of a country, region, group of countries, etc.</td></tr>
            <tr><td>CL_CURRENCY</td><td>Provides code values for currencies.</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>
Since we use the RDF Data Cube Vocabulary to organize statistical research
data, the number of available attributes to describe research data sets is very
large. RDF-based descriptions are also extensible by design, which may prove
useful when dealing with long-tail research data of individual researchers.
Hence, we only review some of the common attributes in order to assess the
value of this approach. Table 1 gives an overview of the Cross-Domain Code
Lists issued by the SDMX consortium [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. First of all, we need to clarify
the notion of a case and how to map the RDF data to a case base. Figure 1
illustrates the structure of a case and where the SDMX code list attributes are
located.
      </p>
      <p>
        This structured representation suggests applying the local-global principle,
which is an established paradigm in the CBR domain [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Local similarity
measures are used to determine similarities on attribute level, while global similarity
measures aggregate the resulting values on object level. There are two types
of objects, DataSet and Observation, and thus two global similarity measures
are needed. Instances of DataSet are the items to be retrieved or recommended,
while instances of Observation make up a large portion of their actual content.
Because of the n:m relation between DataSet and Observation, we need
measures for dealing with multiple values [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. In the following, we write sim(q,c)
to denote the similarity function of a query value q and a case value c, where
both variables q and c are elements of the respective attribute’s value range.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Similarity Measures</title>
      <sec id="sec-5-1">
        <title>Local Similarity Measures</title>
        <p>CL_CURRENCY specifies the currency used in a data set. If the user explicitly
queries for data sets with Euro as a currency, only such data sets should be
considered suitable. Suggesting that a data set using USD would be more useful
than one using CHF has no basis.<sup>14</sup> Thus, the similarity measure for this type of attribute
should be entirely uninformed:
          <disp-formula id="eq1"><label>(1)</label><tex-math><![CDATA[\mathrm{sim}(q, c) = \begin{cases} 1, & \text{if } q = c \\ 0, & \text{otherwise} \end{cases}]]></tex-math></disp-formula>
        </p>
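        <p>As a minimal illustration, the uninformed measure above could be written in Python as follows. The function name sim_exact and the example currency codes are our own choices for this sketch and are not taken from any SDMX tooling.</p>
        <preformat>
```python
def sim_exact(q: str, c: str) -> float:
    """Uninformed local similarity: 1.0 on an exact match, 0.0 otherwise."""
    return 1.0 if q == c else 0.0


# Currency codes used for illustration only.
print(sim_exact("EUR", "EUR"))  # 1.0
print(sim_exact("EUR", "USD"))  # 0.0
```
        </preformat>
        <p>Because the measure makes no assumption about how close two different currencies are, USD and CHF receive the same score for a query on EUR.</p>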
        <p>We find a different situation for the area code, where groups of countries
and regions build up taxonomies. Whether data about Bavaria is useful when
Germany was specified in the query depends on interpretation: It may be
interpreted as “data about any region in Germany is fine” or “data about Germany
on the national level is needed”. Sophisticated user interfaces would be required
for disambiguation, so we rather try to bypass this problem and aim for a
vaguer but generic measure. This is supported by the consideration that
even with a more precise query, the utility of a data set on other regions
remains hard to assess in general. When a data set on Bavaria is requested and
a data set on Brandenburg is given, one may argue that only the data on
Bavaria represents the population a researcher wants to do research on, and any
other data is simply unsuitable. In contrast, one may argue as well that both regions
are siblings in the sense that Germany is the subsuming parent, and a similarity
value above zero appears reasonable, as the data set may still reflect some of
the features the researcher is after, while a data set on Idaho (USA) may no longer
be suitable, and yet another data set on Chengdu (China) cannot be
used at all. Several techniques are available for implementing a taxonomy-based
similarity measure. One generic option is to calculate a value based on the
length of the shortest path. A more specific option depends on the actual query
semantics and needs further consideration and discussion.</p>
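        <p>The generic shortest-path option could be sketched as follows. The taxonomy PARENT is a hypothetical excerpt and not the actual CL_AREA hierarchy, and the normalization 1 / (1 + path length) is only one of several possible assumptions for mapping a path length to a similarity value.</p>
        <preformat>
```python
# Hypothetical area taxonomy given as child -> parent links.
PARENT = {
    "Bavaria": "Germany",
    "Brandenburg": "Germany",
    "Germany": "Europe",
    "Idaho": "USA",
    "USA": "North America",
    "Chengdu": "China",
    "China": "Asia",
    "Europe": "World",
    "North America": "World",
    "Asia": "World",
}


def ancestors(node: str) -> list[str]:
    """Return the chain from node up to the root, node included."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_length(a: str, b: str) -> int:
    """Number of edges on the shortest path between a and b in the tree."""
    depth_in_b = {n: i for i, n in enumerate(ancestors(b))}
    for steps_a, node in enumerate(ancestors(a)):
        if node in depth_in_b:
            # First shared ancestor when walking up from a.
            return steps_a + depth_in_b[node]
    raise ValueError("nodes are not in the same taxonomy")


def sim_area(q: str, c: str) -> float:
    """Assumed normalization: identical areas score 1, distant ones approach 0."""
    return 1.0 / (1.0 + path_length(q, c))
```
        </preformat>
        <p>With this excerpt, Brandenburg is two edges away from Bavaria and therefore scores higher than Idaho, which is only connected via the root. The sketch does not resolve the ambiguity between a region and its parent discussed above.</p>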
        <p>Other codes have ordered value ranges. The CL_DECIMALS code list denotes
the number of decimals used in the data. It seems reasonable to assume that
any data providing a higher number of decimals suits just as well as the data
queried for, since numbers can always be rounded. In contrast, a smaller number
of decimals than requested should be considered less suitable, as it means a
lack of precision. Since the degree of precision decreases proportionally with
the difference between case and query value, this suggests a typical “more is better”
similarity measure:
          <disp-formula id="eq2"><label>(2)</label><tex-math><![CDATA[\mathrm{sim}(q, c) = \begin{cases} 1, & \text{if } q \le c \\ 1 - \dfrac{q - c}{d}, & \text{otherwise} \end{cases}]]></tex-math></disp-formula>
where d denotes the maximal difference between query and case, which is ten in
this case.</p>
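        <p>A possible reading of this “more is better” measure is sketched below. We assume here that the penalty grows linearly with the number of missing decimals, i.e. that the similarity is 1 - (q - c) / d when the case offers fewer decimals than requested; the function name and the default d = 10 are illustrative.</p>
        <preformat>
```python
def sim_decimals(q: int, c: int, d: int = 10) -> float:
    """'More is better': full score when the case offers at least q decimals."""
    if c >= q:
        return 1.0
    # Assumed linear penalty per missing decimal; d is the maximal difference.
    return 1.0 - (q - c) / d
```
        </preformat>
        <p>For example, a query for four decimals is fully satisfied by a case with six decimals, whereas a case with two decimals is penalized for the two missing digits.</p>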
        <p>A similar case is found for the code list CL_FREQ. Quarterly data can be
aggregated from monthly or daily data. But if monthly data is requested and
quarterly data is given, the request is not perfectly met. Such data might still
be more useful than yearly data, so that a measure similar to the above could
be reasonable. While CL_DECIMALS is based on numbers, CL_FREQ is symbolic.
Therefore, we could define an order and map frequency symbols to integers, so
that a function similar to the above can be applied.
          <fn><label>14</label><p>U.S. dollar (USD), Swiss franc (CHF)</p></fn>
        </p>
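        <p>Mapping frequency symbols to integers could look like the following sketch. The ranks and the restriction to four codes (annual, quarterly, monthly, daily) are assumptions made for illustration; the linear penalty is likewise assumed rather than prescribed by the code list.</p>
        <preformat>
```python
# Assumed ordering from coarse to fine; the ranks are illustrative.
FREQ_RANK = {"A": 0, "Q": 1, "M": 2, "D": 3}


def sim_freq(q: str, c: str) -> float:
    """'More is better' on the frequency rank: finer data can be aggregated."""
    rank_q, rank_c = FREQ_RANK[q], FREQ_RANK[c]
    if rank_c >= rank_q:
        return 1.0
    d = len(FREQ_RANK) - 1  # maximal rank difference
    return 1.0 - (rank_q - rank_c) / d
```
        </preformat>
        <p>Under these assumptions, quarterly data offered for a monthly query scores higher than yearly data, in line with the consideration above.</p>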
        <p>
          For the free text fields as listed in [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], the measure should be based on
common techniques like TF-IDF<sup>15</sup> or n-grams. However, it must be ensured that
the value is normalized, so that resulting similarity values can be set in relation
to those of other attributes when aggregating in the global measure.
        </p>
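        <p>As one example of a normalized text measure, the following sketch compares character n-grams using the Jaccard coefficient, which always lies between 0 and 1. The choice of trigrams, the padding and the Jaccard coefficient itself are our assumptions; a TF-IDF-based cosine similarity would be an alternative.</p>
        <preformat>
```python
def ngrams(text: str, n: int = 3) -> set[str]:
    """Character n-grams of the lower-cased text, padded with spaces."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def sim_text(q: str, c: str, n: int = 3) -> float:
    """Jaccard overlap of the two n-gram sets; always within [0, 1]."""
    grams_q, grams_c = ngrams(q, n), ngrams(c, n)
    if not grams_q and not grams_c:
        return 1.0  # two empty texts are treated as identical
    return len(grams_q.intersection(grams_c)) / len(grams_q.union(grams_c))
```
        </preformat>
        <p>Since the result is already normalized, it can be combined directly with the other local measures in the global measure.</p>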
      </sec>
      <sec id="sec-5-2">
        <title>Aggregation of Similarity Values</title>
        <p>A so-called global similarity measure is used to aggregate the results from the
attribute-level similarity calculations. As we are still at the stage of initial considerations,
there is no point in arguing whether weighted means, Euclidean or other types of
aggregation are the right method to choose. We state, however, that the measure
should support user-defined weights in the query, as this allows the researcher to
emphasize one parameter or another.</p>
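        <p>Without committing to a particular aggregation method, a weighted mean with user-defined weights could be sketched as follows. The attribute names, similarity values and weights are hypothetical.</p>
        <preformat>
```python
def global_similarity(local_sims: dict[str, float],
                      weights: dict[str, float]) -> float:
    """Weighted mean of local similarity values; weights come from the query."""
    total = sum(weights[attr] for attr in local_sims)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return sum(weights[attr] * s for attr, s in local_sims.items()) / total


# Illustrative local results and user-defined weights (hypothetical values).
local = {"currency": 1.0, "area": 0.5, "decimals": 0.75}
weights = {"currency": 2.0, "area": 1.0, "decimals": 1.0}
score = global_similarity(local, weights)  # (2.0 + 0.5 + 0.75) / 4.0
```
        </preformat>
        <p>Doubling the weight of the currency attribute pulls the overall score towards its local similarity value; other aggregation functions could be substituted without changing the interface.</p>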
        <p>
          To complete the similarity measure, we further need to specify how multiple
values are dealt with. For example, the researcher may request data on the
geographic locations France, Germany and United Kingdom. The utility of a data
set that represents the populations of England and France may then be
calculated by finding the best partner for every requested country<sup>16</sup> and taking the minimum,
maximum or average as the overall similarity value of the geographic attribute.
Which strategy to choose depends on the particular attribute and should be
examined carefully in evaluation with end users [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
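        <p>The best-partner strategy for multiple values could be sketched as follows. The local measure toy_area_sim is a hypothetical stand-in that treats England as a partial match for the United Kingdom; the value 0.5 is arbitrary.</p>
        <preformat>
```python
from statistics import mean
from typing import Callable, Iterable


def sim_multi(query_values: Iterable[str],
              case_values: Iterable[str],
              local_sim: Callable[[str, str], float],
              aggregate: Callable[[list[float]], float] = mean) -> float:
    """Match every query value with its best partner in the case, then aggregate."""
    query_list, case_list = list(query_values), list(case_values)
    if not query_list or not case_list:
        return 0.0
    best = [max(local_sim(q, c) for c in case_list) for q in query_list]
    return aggregate(best)


def toy_area_sim(q: str, c: str) -> float:
    """Hypothetical local measure for the example countries."""
    if q == c:
        return 1.0
    if {q, c} == {"United Kingdom", "England"}:
        return 0.5
    return 0.0


requested = ["France", "Germany", "United Kingdom"]
offered = ["England", "France"]
average_score = sim_multi(requested, offered, toy_area_sim)
strict_score = sim_multi(requested, offered, toy_area_sim, aggregate=min)
```
        </preformat>
        <p>In this example the best partners yield the values 1.0, 0.0 and 0.5. Averaging gives a moderate score, whereas taking the minimum rejects the data set because Germany is not covered at all.</p>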
      </sec>
      <sec id="sec-5-3">
        <title>Undefined values</title>
        <p>There are some special cases we need to consider. When a query specifies a
value for a particular attribute for which the case compared does not provide
any value, the measure must still yield some value. For instance, if male is
specified for gender in the query and the attribute value is not given in the
case, utility should be considered zero, because the user explicitly stated that the
represented population should be male. It appears reasonable to take this as the
default measure. However, if the user specifies free for confidentiality status
and the case does not provide any information in this regard<sup>17</sup>, the data set is
not necessarily unsuitable. A reasonable way to deal with this issue could be to
simply ignore this attribute in the global similarity function.</p>
        <p>Another special situation occurs when a researcher requests data that
contains values of some currency, but does not want to specify the currency
any further. She is certain that monetary values must be part of the data, while the
currency unit itself is of secondary importance. One way to cater for this is to introduce a
special value * and let sim(*,c)=1 for any case c.
          <fn><label>15</label><p>term frequency-inverse document frequency</p></fn>
          <fn><label>16</label><p>best with respect to the local similarity measure</p></fn>
          <fn><label>17</label><p>due to incomplete annotation</p></fn>
        </p>
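        <p>The special cases of this subsection could be handled by a thin wrapper around any local measure, as in the following sketch. Returning None to signal that an attribute should be ignored in the global measure, and the flag name ignore_if_missing, are assumptions of this sketch.</p>
        <preformat>
```python
from typing import Callable, Optional

WILDCARD = "*"


def sim_with_missing(q: str,
                     c: Optional[str],
                     local_sim: Callable[[str, str], float],
                     ignore_if_missing: bool = False) -> Optional[float]:
    """Wrap a local measure with the special cases for '*' and missing values.

    Returns None when the attribute should be left out of the global measure.
    """
    if c is None:
        # Default: a missing case value counts as no utility at all.
        return None if ignore_if_missing else 0.0
    if q == WILDCARD:
        # Any present value satisfies the wildcard query.
        return 1.0
    return local_sim(q, c)
```
        </preformat>
        <p>A global measure would then skip all attributes for which None is returned. Note that the wildcard only matches cases that actually provide a value, which reflects the requirement that monetary values must be part of the data.</p>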
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Notification Service</title>
      <p>Since empirical research appears to be quite data-driven and
researchers need to continuously look out for new data sets in order to stay
up to date, we want to make some considerations on a notification service.<sup>18</sup> As
our approach is to capture the researcher’s request for data with high precision,
we are in a position to test incoming data sets for relevance and send out
messages.<sup>19</sup> One strategy in this regard would be to notify about every data set that
meets a user-defined similarity threshold. From experience, however, similarity
values tend to accumulate in a particular range, which is highly dependent on
the similarity measure design and the respective user query, and thus it may
be difficult to provide a specific threshold value. In that sense, the values
calculated with the help of the similarity measure should rather be regarded as
scores that provide a means for ranking. To bypass this problem, we suggest sending
out such rankings of newly registered data sets on a regular basis as per user
settings. The user may then take a closer look at the top matches and estimate
their utility individually.</p>
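      <p>The ranking-based notification could be sketched as follows. The data set records, the identifiers and the placeholder measure toy_score are hypothetical; in the proposed system the score would be the global similarity measure of Section 5.</p>
      <preformat>
```python
from typing import Callable


def build_digest(query: dict,
                 new_data_sets: list[dict],
                 score: Callable[[dict, dict], float],
                 top_n: int = 5) -> list[tuple[float, str]]:
    """Rank newly registered data sets for one stored query, keep the top matches."""
    ranked = sorted(((score(query, ds), ds["id"]) for ds in new_data_sets),
                    key=lambda pair: pair[0], reverse=True)
    return ranked[:top_n]


def toy_score(query: dict, ds: dict) -> float:
    """Placeholder global measure: share of query attributes matched exactly."""
    hits = sum(1 for key, value in query.items() if ds.get(key) == value)
    return hits / len(query)


query = {"currency": "EUR", "freq": "M"}
incoming = [
    {"id": "ds-1", "currency": "USD", "freq": "A"},
    {"id": "ds-2", "currency": "EUR", "freq": "M"},
    {"id": "ds-3", "currency": "EUR", "freq": "Q"},
]
digest = build_digest(query, incoming, toy_score, top_n=2)
```
      </preformat>
      <p>Instead of a fixed threshold, the digest simply contains the best-ranked new data sets for each stored query; how often it is sent and how many entries it contains would be part of the user settings.</p>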
    </sec>
    <sec id="sec-7">
      <title>Proposed Architecture</title>
      <p>The recommender component is considered part of a larger digital archive system
that manages statistical research data. Figure 2 gives an illustration of the entire
system architecture, where three main components depict the relevant parts of
the recommender system.</p>
      <p>
        The case retrieval engine requires access to the data base that contains the
data sets and to the similarity measures, which should be kept in a separate
data base so as to allow for independent editing whenever administrative review is
needed. The archived research data is usually maintained in its specific data
format, which in our case is based on RDF.<sup>20</sup> If the retrieval engine is implemented
using sequential similarity calculation, the data repository can be accessed as a
case base directly, since no further indexing is required. This, however, leads to
long computation times in case of a large case base. For more efficient retrieval,
more optimized methods like Case Retrieval Nets [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] should be considered,
which build their own data structure from case base and similarity measures.
Therefore, the recommender component needs to be notified whenever there are
updates on the data repository or similarity measures.
      </p>
      <p>The notification service needs access to the users’ notification queries and
their e-mail addresses, which are stored in the user preferences data base. The
component should be notified whenever updates occur on the data repository,
so that new data sets can be tested for relevance immediately.
        <fn><label>18</label><p>cf. Google Alerts</p></fn>
        <fn><label>19</label><p>whenever a relevant candidate is detected, or collated, via e-mail, Twitter or another
channel</p></fn>
        <fn><label>20</label><p>Resource Description Framework</p></fn>
      </p>
      <p>
        A detailed discussion of the user interface exceeds the scope of this paper.
However, it must provide for query specification, display of results and
configuration of user settings regarding the notification service as discussed in Section 6.
Furthermore, we suggest integrating explanation features in order to provide
transparency to the user on how results were retrieved. A generic interface
design and implementation can be found in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and some ideas for a query
interface designed specifically for statistical research data using the RDF Data
Cube Vocabulary can be found in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-8">
      <title>Conclusion and Outlook</title>
      <p>
        We have examined some of the common code lists for statistical data with respect
to their specification and found indicators that motivate a particular similarity
measure design. For some of them we were able to justify a specific design,
whereas for other code lists it is difficult to make assertions, which suggests rather
uninformed measures. Ultimately, a final assessment of the utility of a particular
data set can only be made by the researcher. A similarity measure can only
approximate a common sense of utility [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. It easily fails due to limited query
expressiveness and inability to interpret its actual semantics and the actual user
needs. One option to overcome this problem is to allow for customization and
personalization of similarity measures. Whenever a user is presented with
unexpected results, an explanation may be given and the user may give feedback
on the similarity measure. Since structural CBR systems in general are easily
equipped with explanation support and customization of similarity measures [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], some of the open similarity design questions could be answered by the
individual user within a particular research context. However, ordinary measures
for dealing with multiple values and the application of user-defined weights in
the aggregating function enable a more gradual scaling of retrieval results with
respect to user needs.
      </p>
      <p>Another common practice in empirical research is the use of proxy variables,
where some data correlates highly with other data. Such information could be useful
for recommending relevant data. A similarity measure could be extended
to make use of such relations if they are represented in the data model.</p>
      <p>The proposed recommender system is based on the RDF Data Cube
Vocabulary. The user is therefore in a position to specify precisely the kind of data
needed, and the system has the required means to assess the suitability of available
data sets. In addition, provided the measure reflects a reasonable
understanding of utility, the introduced notification service helps researchers keep up to
date, and thus both research goals defined in Listing 1 were met. Nevertheless,
an evaluation is yet to be carried out, which is the subject of future work. With
further progress on a research data management infrastructure and the
continuing exchange with the scientific community, we will get a clearer picture of the
applicability of this approach.</p>
      <p>Finally, a prototype is needed in order to gain feedback from the research
community we are addressing, which we consider implementing as we proceed
with the research on a data management infrastructure.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Feijen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>What researchers want - a literature study of researchers' requirements with respect to storage and access to research data</article-title>
          (
          <year>February 2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Wood</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andersson</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bachem</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Best</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Genova</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopez</surname>
            ,
            <given-names>D.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Los</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marinucci</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Romary</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van de Sompel</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vigen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wittenburg</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giaretta</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Riding the wave: How Europe can gain from the rising tide of scientific data</article-title>
          .
          <source>European Union</source>
          (
          <year>2010</year>
          )
          <article-title>Final report of the High Level Expert Group on Scientific Data: A submission to the European Commission</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gottron</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hachenberg</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zapilko</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Towards a semantic data library for the social sciences</article-title>
          .
          <source>In: SDA'11: Proceedings of the International Workshop on Semantic Digital Archives</source>
          . (
          <year>2011</year>
          ) in Preparation.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cyganiak</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Field</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gregory</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halb</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tennison</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Semantic statistics: Bringing together SDMX and SCOVO</article-title>
          . In:
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berners-Lee</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hausenblas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , eds.:
          <source>LDOW</source>
          . Volume
          <volume>628</volume>
          of CEUR Workshop Proceedings., CEUR-WS.org (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Milošević</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Janev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Spasić</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Milojković</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vraneš</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Publishing statistical data as linked open data</article-title>
          .
          <source>In: Proceedings of the 2nd International Conference on Information Society Technology, Information Society of the Republic of Serbia</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Halb</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raimond</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hausenblas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Building Linked Data For Both Humans and Machines</article-title>
          .
          <source>In: WWW 2008 Workshop: Linked Data on the Web (LDOW2008)</source>
          , Beijing, China (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Bahls</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tochtermann</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Addressing the long tail in empirical research data management</article-title>
          .
          <source>In: 12th International Conference on Knowledge Management (IKNOW '12)</source>
          , Graz, Austria,
          <publisher-name>ACM</publisher-name>
          (
          <year>2012</year>
          ) in Preparation.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Burke</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Recommender Systems: An Introduction, by Dietmar Jannach, Markus Zanker, Alexander Felfernig, and Gerhard Friedrich</article-title>
          .
          <source>International Journal of Human-Computer Interaction</source>
          <volume>28</volume>
          (
          <issue>1</issue>
          ) (
          <year>2012</year>
          )
          <fpage>72</fpage>
          -
          <lpage>73</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Bergmann</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Richter</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmitt</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stahl</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vollrath</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Utility-oriented matching: A new research direction for case-based reasoning</article-title>
          . In:
          <source>Professionelles Wissensmanagement: Erfahrungen und Visionen. Proceedings of the 1st Conference on Professional Knowledge Management</source>
          .
          <publisher-name>Shaker</publisher-name>
          . (
          <year>2001</year>
          )
          <fpage>264</fpage>
          -
          <lpage>274</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Bergmann</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
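          ,
          <string-name>
            <surname>Kolodner</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plaza</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :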
          <article-title>Representation in case-based reasoning</article-title>
          .
          <source>Knowl. Eng. Rev</source>
          .
          <volume>20</volume>
          (
          <issue>3</issue>
          ) (
          <year>September 2005</year>
          )
          <fpage>209</fpage>
          -
          <lpage>213</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Bridge</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Göker</surname>
            ,
            <given-names>M.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGinty</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smyth</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Case-based recommender systems</article-title>
          .
          <source>Knowledge Engineering Review</source>
          <volume>20</volume>
          (
          <year>September 2005</year>
          )
          <fpage>315</fpage>
          -
          <lpage>320</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Richter</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          :
          <article-title>Case based reasoning and the search for knowledge</article-title>
          .
          <source>In: Proceedings of the 7th industrial conference on Advances in data mining: theoretical aspects and applications. ICDM'07</source>
          , Berlin, Heidelberg, Springer-Verlag (
          <year>2007</year>
          )
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <collab>SDMX Content-Oriented Guidelines</collab>
          :
          <article-title>Annex 1: cross-domain concepts 2009</article-title>
          .
          <string-name>
            <surname>Area</surname>
          </string-name>
          (
          <year>2009</year>
          )
          <fpage>1</fpage>
          -
          <lpage>47</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <collab>SDMX Content-Oriented Guidelines</collab>
          :
          <article-title>Annex 2: cross-domain code lists 2009</article-title>
          .
          <string-name>
            <surname>Area</surname>
          </string-name>
          (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Stahl</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth-Berghofer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Rapid prototyping of CBR applications with the open source tool myCBR</article-title>
          . In:
          <string-name>
            <surname>Althoff</surname>
            ,
            <given-names>K.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bergmann</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Minor</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hanft</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , eds.:
          <source>ECCBR</source>
          . Volume
          <volume>5239</volume>
          of Lecture Notes in Computer Science., Springer (
          <year>2008</year>
          )
          <fpage>615</fpage>
          -
          <lpage>629</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Lenz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Case retrieval nets as a model for building flexible information systems</article-title>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Roth-Berghofer</surname>
            ,
            <given-names>T.R.</given-names>
          </string-name>
          :
          <article-title>Explanations and case-based reasoning: Foundational issues</article-title>
          . In:
          <string-name>
            <surname>Funk</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>González Calero</surname>
            ,
            <given-names>P.A.</given-names>
          </string-name>
          , eds.:
          <source>Advances in Case-Based Reasoning. Volume 3155 of Lecture Notes in Computer Science</source>
          . Springer Berlin / Heidelberg (
          <year>2004</year>
          )
          <fpage>195</fpage>
          -
          <lpage>209</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Bahls</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth-Berghofer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Explanation support for the case-based reasoning tool myCBR</article-title>
          .
          <source>Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence, July 22-26, 2007, Vancouver, British Columbia, Canada</source>
          (
          <year>2007</year>
          )
          <fpage>1844</fpage>
          -
          <lpage>1845</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>