<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <abstract>
        <p>Linked data experience at Macmillan: Building discovery services for scientific and scholarly content on top of a semantic data model</p>
        <p>Tony Hammond and Michele Pasin</p>
        <p>Macmillan Science and Education, The Macmillan Campus, London, N1 9XW, UK. {tony.hammond, michele.pasin}@macmillan.com</p>
        <p>Background</p>
        <p>Macmillan Science and Education is a publisher of high-impact scientific and scholarly information and publishes journals, books, databases and services across the sciences and humanities. Publications include the multidisciplinary journal Nature, the popular magazine Scientific American, domain-specific titles and society-owned journals under the Nature Publishing Group and Palgrave Macmillan Journals imprints, as well as ebooks on the Palgrave Connect portal.</p>
        <p>We have recently implemented a linked data architecture at the core of our publishing workflow, with an archive of over 1 million articles and a publication rate in the hundreds of articles per day. We build on a common metadata model defined by an OWL 2 ontology. To meet acceptable page response times we have evolved a hybrid storage/query platform. User-facing queries are resolved against RDF/XML document includes using XQuery, with execution speeds of tens to hundreds of milliseconds depending on complexity, whereas data enrichment and integration is managed at the ETL layer using SPARQL query/update together with SPIN and Schematron rules and bespoke code (Java/Scala).</p>
        <p>Recently released discovery products on the nature.com<sup>1</sup> platform are based squarely on this linked data foundation and include subject pages<sup>2</sup> as a new navigational paradigm, and also bidirectional links between articles and related articles.</p>
        <p>In general, we have found that by building on top of a rich and consistent data model we can provide new navigational pathways for users to discover and explore different facets of our content. Using such a simple entity-relationship model coupled with global addressing allows us to be truly web scalable.</p>
        <p>Our data model is realized using a linked data technology stack which provides a number of significant benefits over traditional approaches to managing data. The use of RDF encourages the use of a standard naming convention, and makes this generally accessible by enforcing a global naming policy. It provides a higher-level semantic focus for operations, which means that we are less susceptible to</p>
        <p><sup>1</sup> http://www.nature.com/</p>
        <p><sup>2</sup> http://www.nature.com/subjects</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Infrastructure</title>
    </sec>
    <sec id="sec-2">
      <title>Challenges</title>
      <p>Initially we had intended to query the graphs directly in a triplestore and had developed a linked data API for this purpose. In practice, we found that our implementation was not fit for purpose and failed in two critical dimensions: performance and robustness. Typical result sets were being delivered in seconds or tens of seconds, whereas we were tasked to achieve ~20 ms, some 2–3 orders of magnitude faster. Additionally, for reliability we required a clustered solution, but the triplestore we had implemented was unclustered and non-transactional.</p>
      <p>Since our immediate concern was to support our online products, we decided on a tailored API that directly reflected the data model. Our main principles were that the API should be chunky not chatty, i.e. we aim to provide all required data in a single call; that data should be represented as it is consumed, rather than how it is stored; that it support common use cases in simple, obvious ways; that it ensure a guaranteed, consistent speed of response for more complex queries; and that it build on a foundation of standard, pragmatic REST using collections and items.</p>
      <p>This led to our developing a hybrid system for storage and query of the data model. The data is modelled in RDF throughout but is replicated and distributed between RDF and XML data stores. We have added semantic sections as</p>
      <p><sup>3</sup> http://data.nature.com/</p>
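      <p>The API principles above ("chunky not chatty", data represented as it is consumed, collections and items) can be illustrated with a minimal Python sketch. It is not the production API: the dictionaries, identifiers, field names and the functions get_article_item and get_article_collection are all hypothetical, chosen only to show an item response that embeds everything a page needs in a single call.</p>
      <preformat><![CDATA[
```python
# Hypothetical in-memory stand-ins for the stored data; none of these
# identifiers or field names come from the actual Macmillan data model.
ARTICLES = {
    "a1": {"title": "Example article one", "subject_ids": ["s1"], "related_ids": ["a2"]},
    "a2": {"title": "Example article two", "subject_ids": ["s1", "s2"], "related_ids": ["a1"]},
}
SUBJECTS = {
    "s1": {"label": "Example subject A"},
    "s2": {"label": "Example subject B"},
}


def get_article_item(article_id):
    """Return one 'chunky' item: the article plus the data a page needs.

    Stored references (IDs) are expanded into the shape a consumer would
    render, so no follow-up calls are required. Unknown IDs yield None.
    """
    stored = ARTICLES.get(article_id)
    if stored is None:
        return None
    return {
        "id": article_id,
        "title": stored["title"],
        "subjects": [
            {"id": sid, "label": SUBJECTS[sid]["label"]}
            for sid in stored["subject_ids"]
        ],
        "related": [
            {"id": rid, "title": ARTICLES[rid]["title"]}
            for rid in stored["related_ids"]
        ],
    }


def get_article_collection():
    """Return the collection view: lightweight summaries, sorted by ID."""
    return [{"id": aid, "title": a["title"]} for aid, a in sorted(ARTICLES.items())]
```
]]></preformat>
      <p>Under these assumptions, a call such as get_article_item("a2") returns the title, the subject labels and the related article titles together, whereas a "chatty" design would return only IDs and leave the client to issue one further request per subject and per related article.</p>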
    </sec>
    <sec id="sec-3">
      <p>RDF/XML includes within our XML documents as well as creating standalone RDF/XML documents for our core and domain ontologies. Retrievals based on document ID are now realized with XQuery, and augmented by in-memory key/value lookups, yielding acceptable API response times. SPARQL queries are currently restricted to build-time data assembly.</p>
      <p>Future aims are threefold: 1) to grow the data model with additional things and relations as new product requirements arise; 2) to open up the user query palette to more fully exploit the graph structure while maintaining acceptable API responsiveness; and 3) to create an extended mindshare and understanding throughout the company of the value of building and maintaining the discovery graph as a real enterprise asset.</p>
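      <p>The split described in this section, with graph queries confined to build time and request-time retrieval reduced to ID-based lookups, can be sketched in Python as follows. This is an illustration under stated assumptions, not the system itself: the triples are plain tuples standing in for an RDF store, the join is written in Python where the text uses SPARQL, and the predicate names and the functions build_subject_index and subjects_for are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object); the predicate names
# are illustrative and not taken from the Macmillan ontology.
TRIPLES = [
    ("article:a1", "hasSubject", "subject:s1"),
    ("article:a2", "hasSubject", "subject:s1"),
    ("article:a2", "hasSubject", "subject:s2"),
    ("subject:s1", "label", "Example subject A"),
    ("subject:s2", "label", "Example subject B"),
]


def build_subject_index(triples):
    """Build-time step: join articles to subject labels once, up front.

    This plays the role the text assigns to SPARQL (data assembly at
    build time); here the join is plain Python over the tuples above.
    """
    labels = {s: o for s, p, o in triples if p == "label"}
    index = defaultdict(list)
    for s, p, o in triples:
        if p == "hasSubject":
            index[s].append(labels[o])
    return dict(index)


# Assembled once at build time, then held in memory.
SUBJECT_INDEX = build_subject_index(TRIPLES)


def subjects_for(article_id):
    """Request-time step: a single key/value lookup, no graph traversal."""
    return SUBJECT_INDEX.get(article_id, [])
```
]]></preformat>
      <p>The design choice being illustrated is that the cost of traversing the graph is paid once, when the index is assembled, so that each request touches only a precomputed key/value structure.</p>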
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>