1. INTRODUCTION

A practical tool for creating, managing and sharing evolving linked data

Graham Klyne

graham.klyne@oerc.ox.ac.uk 1

Cerys Willoughby

Cerys.Willoughby@soton.ac.uk 0

Kevin Page

kevin.page@oerc.ox.ac.uk 1 0 Chemistry, University of Southampton , Highfield, Southampton, SO17 1BJ , UK 1 Oxford e-Research Centre, University of Oxford , 7 Keble Rd, Oxford, OX1 3QG , UK

Annalist is a software system for individuals and small groups to reap the bene ts of using RDF linked data, supporting them to easily create data that participates in a wider web of linked data. It presents a exible web interface for creating, editing and browsing evolvable data, without requiring the user to be familiar with minutiae of the RDF model or syntax, or to perform any programming, HTML coding or prior con guration. Development of Annalist was motivated by data capture and sharing concerns in a small bioinformatics research group, and customized personal information management. Requirements centre particularly on achieving low activation energy for simple tasks, exibility to add structural details as data is collected, access-controlled sharing, and ability to connect private data with public data on the web. It is designed as a web server application, presenting an interface for de ning data structure and managing data. Data is stored as text les that are amenable to access by existing software, with the intent that a range of applications may be used in concert to gather, manage and publish data. During its development, Annalist has been used in a range of applications, which have informed decisions about its design and proven its exibility and robustness in use. It has been particularly e ective in exploring and rapid prototyping designs for linked data on the web, covering science and humanities research, creative art and personal information.

Semantic Web Linked data Data management

1. INTRODUCTION

In a blog post based on an 2013 ESWC keynote presentation, Karger[ 20 ] argues that a primary feature of Semantic Web applications should be to accommodate evolving data: Corresponding author \A Semantic Web application is one whose schema is expected to change". He also argues: \The current state of tools for end users to capture, communicate, and manage their information is terrible".

Annalist (\keeper of records") is a linked data notebook, a software system for creating, editing and managing RDF[ 9 ] linked data, which attempts to address some of the problems noted by Karger, by allowing the structure of stored data to evolve with understanding of requirements and the nature of the available data. Its primary aim is to enable individual users and small groups to reap the bene ts of creating and using linked data; i.e. to create data that can be shared, evolved and re-mixed with other data on the web.

Annalist is application-agnostic, but has been developed to address data management for small research groups lacking capacity for web site development. It aims to be: easy-touse, without programming; exible, allowing structure to be crystallised around available data; sharable, facilitating collaboration with local and remote colleagues; and re-mixable, for combining locally created data with community resources. Annalist also \scratches an itch" as a tool for web-based personal information management and sharing.

While supporting contribution to linked data at web scale, Annalist's design assumes that individual datasets t comfortably in the available RAM and local le system of a modern personal computer. It is not a general-purpose RDF editor, but approaches data from a perspective rooted in application concepts rather as an RDF graph, and not all RDF structures can be generated or directly managed (this does not preclude linking to arbitrary RDF data). Finally, it is not a general publishing platform: the presentation of data is oriented towards data management actives, and assume the user is familiar with the content.

Annalist is an open development1, with source code, design notes and documentation kept in a public Github repository2. There is also a public demonstrator and tutorial3. 2.

MOTIVATION AND REQUIREMENTS

One motivating use case for developing Annalist was FlyTED, a database of Drosophila Testis Gene Expression images[ 39 ][ 41 ]. Our experiences with FlyTED (summarized in gure 1) give rise to requirements identi ed in parentheses, and summarized below. 1http://oss-watch.ac.uk/resources/odm 2https://github.com/gklyne/annalist 3http://annalist.net/ 2 Evaluate OME 3 BioImage discontinued 4 ePrints adopted 5 FlyTED database public

NAR paper 6

Live database lost in system failure 7 BioImage Database

FlyTED project Approximate timeline (year) 2004 2007 2010

The FlyTED database was originally intended to be published using the BioImage Database[ 31, 7 ], which was an early implementation of a database incorporating metadata based on semantic web standards. Limitations of early RDF and web software tool sets imposed design compromises that eventually led to the BioImage Database software not being sustained. Also, an early version of Open Microscopy Environment (OME)[ 35 ] was evaluated, but its metadata schema was found to be insu ciently exible to accommodate the range of annotation information that was required (R4, R5). Eventually, a modi ed version of ePrints repository software[ 16 ] was used to publish the gene expression images and associated annotations. Some time after the data were published, a combination of a virtual storage service failure and a backup system con guration error resulted in the running system being lost. Although the original data remained available, the live data was stored in a relational database le; loss of the publication platform, and lack of available resources to re-apply the ePrints customizations and rebuild the database meant the database was not reinstated (R8, R9, R10). The loss might have been mitigated if live data had been more easily shared, including via version management systems (R6).

Although not large, the FlyTED data were expensive to gather, hence valuable, with each combination of gene and phenotype requiring literature and database searches, statistical analysis, laboratory procedures for sample preparation, microscopic image capture and annotation by a biological expert. Initial input of image annotations used spreadsheets, as the biologists were familiar with these (R1, R3). During the course of the investigations documented by FlyTED, the terms used to record developmental stages during which genes were active were adjusted to provide better coverage of the observations (R5). Programmatic access to the underlying data was subsequently used to create an exemplar application, OpenFlyData[ 25 ], which provided facilities for search and co-display of gene information across a number of Drosophila databases (R11).

The FlyTED database was created by a small team of software developers working closely with biologists, but the original intent was to pave the way for tools that biologists could use without developer support (R1, R2, R3). Without the support of developers, the biologists would almost certainly have not gone beyond creating spreadsheets containing their observations, the contents of which cannot easily be cross-referenced with external data sources without prior knowledge of their structure (e.g. which column is used for the FlyBase gene ID?). In creating the published database, observed genes were cross-referenced with FlyBase[ 12 ], the community database of information about Drosophila genes, and annotated using terms compliant with the MISFISHIE standard[ 10 ] (R4, R7).

Finally, and not directly related to the FlyTED experience, we also wanted Annalist to be suitable for o ine use (e.g. for eld work) (R12).

Summary of requirements

R1 Ease of use: possible to quickly create a simple collection and start capturing data.

R2 Ease of use: no programming or HTML coding needed to create a new collection.

R3 Ease of use: detailed knowledge of RDF and/or OWL not needed to create or edit data.

R4 Flexibility: choice of RDF vocabulary used in the data. R5 Flexibility: possible to de ne or adapt structure of data as it is collected.

R6 Sharability: data can be shared between collaborators using a variety of techniques, including online access and o ine le copying.

R7 Remixability: use of domain vocabularies or ontologies to facilitate combining with community datasets; ability to present data as linkable web resources, and to link to external web resources.

R8 Portability: possible to move data between live systems; not dependent on a single central service.

R9 Sustainability of software: data capture, editing and browsing possible using unmodi ed software.

R10 Sustainability of data: underlying data stored and exposed using a standard, easily used data format. R11 Visible data: underlying data exposed so that functions not provided by the system (e.g. data visualization) can be implemented by independent software.

R12 O ine working: deployable on a personal computer, allowing work on linked data collections without an Internet connection. 3.

RELATED WORK

As a system for creating and managing data on the web, Annalist enters a crowded space with a wealth of alternatives available. But, despite this, we are unaware of anything that provides the out-of-box capability of Annalist for creating linked data, and meeting the requirements outlined. Figure 2 gives an overview of systems with respect to the requirements listed above. Our survey focuses on tools that directly present end-user data management interfaces. There are systems (e.g. Virtuoso, Sesame, Jena, WikiBase, etc.) that are primarily semantic data stores and developer tools that are not covered. Also, tools for data cleaning (e.g. OpenRe ne, LODRe ne) or middleware that augments existing data (e.g. Poolparty, Sponger) are considered complementary rather than alternatives to Annalist. 3.1

Semantic Web Systems and Tools

Callimachus4 provides exible support for sharable linked data, but is a tool for developers rather than end-users. Semantic media wiki5[ 37 ] is usable without additional development, but is not really suited to desktop deployment, and 4http://callimachusproject.org 5http://semantic-mediawiki.org ✔ ✧ ✔ ✔ ✔ ✔ ✔ ✔ ✔

✔ ✧ ? ✔ ✔ ✔ ✧ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✔ ✧

✔ ✔ ✧ ✔ ✔ ✔ ✔ ✧ ✔ ✧ ✧ ✧ ✔ ✔ ✧ ✔ ✔ ✧

✔ ✧ ✔ ✔ ✔ ✧ ✔ ✔ ✔ ✧ ✧ ✧ ✔ ✔ ✧ ✔ ✔ ✧ ✔ ✔ ✔ ✔ ✧ ✧ ✔ ✧ ✧ ✔ ✔ Semantic MediaWiki ✧ ✔ ✔ ✔ ✧ ✧ ? ✔ the data is not amenable to version management or sharing via le sharing. Rauschmayer presented a poster[ 29 ] about the Hyena RDF editor at SemWiki 2008, which appears to envisage similar usage scenarios as Annalist, but does not currently appear to be available as a usable tool. This work is described in Rauschmayer's PhD thesis[ 28 ], which states that the core idea is to use a central repository for linked information, where Annalist is conceived as being just a small part in a wider linked data infrastructure. Wikidata6, built upon the Wikibase7 data store, acts as central storage for structured data of Wikimedia projects. It has similarities to Annalist - \items" and \statements" parallel Annalist's Entities and Fields, but the user interface is not customizable and it does not appear to support the creation of data collections independently of Wikipedia and related projects.

There are ontology design tools, such as Protege[ 34 ] (including WebProtege), which can be used to create RDF data, but a focus on ontology design leads to a complex interface that is not well suited for end-user creation and management of linked data. Changing the data structure requires an understanding of ontology design.

Piggy Bank[ 17 ] was developed as a tool for consuming web data, and creating a local RDF store to facilitate navigation and merging data from diverse sources. The emphasis here was on consuming web data from heterogeneous sources (something that Annalist can facilitate), but not so much on creating linked data for sharing and eventual publication.

Fresnel[ 26 ] is an RDF vocabulary for controlling presentation of RDF, for which Annalist uses home-spun terms. The development of Annalist focused initially on creating a user interface to create linked data without knowledge of HTML or RDF (requirements R2, R3), and the vocabulary needed to describe the presentation emerged from this approach. With the technical requirements now established in running code, evaluating a retro- t of Fresnel could be a topic for further work. RDForms8 (\RDF Forms") is a JavaScript 6https://www.wikidata.org/ 7urlhttp://wikiba.se 8http://rdforms.org/ library supporting a declarative description of views for editing and presenting RDF, whose interface appears to have some aspects in common with Annalist. But RDForms is a developer tool, and not something that can be used without programming.

One use for Annalist has been to create additional information, or annotations, related to web pages (e.g., see section 5.5 below). Pundit9 is positioned as a semantic web annotation tool for research, capable of performing faceted search over annotated web pages10. It appears to be able to create linked data annotations, but it is not clear if it can create free-standing linked datasets, or how easily the annotations created can be exported and/or consumed by other applications. Another annotation tool is Domeo11[ 8 ]: this, too, can create RDF annotations of online documents, but does not appear to create free-standing data. 3.2

Data sharing platforms

Figshare12 is a a proprietary web platform for research data sharing that is well-suited for sharing research papers, supporting data and other materials, but does not of itself provide support for creating re-mixable linked data. ResearchSpace13 is developing \a collaborative environment for humanities and cultural heritage research using knowledge representation and Semantic Web technologies", sharing some goals with Annalist, but specialized for cultural heritage by building speci cally on CIDOC CRM[ 22 ].

The Database Wiki[ 6 ] is another project to provide a generic, collaborative, user-friendly interface over structured data. In this case, the underlying data is XML rather than RDF, and there is less emphasis on linking with external data. This project is informed by Form Lenses[ 27 ], a principled approach to mapping between stored data and a presented user interface, based on Applicative Functors[ 24 ], which o er a possible avenue for future work with Annalist.

Histcross14[ 18 ] was a semantic database of historical data, subsequently replaced by Segrada15, apparently with many similar goals to Annalist, but does not work with linked data so would not readily participate in a wider network of data. 3.3

Spreadsheets and desktop databases

Regular spreadsheets (Excel, Open O ce, etc.) are very popular for research and personal information management, o ering exibility, ease of use and sharing (e.g. via CSV), but do not easily support combining data from di erent sources, do not provide direct web access to the underlying data, and are not particularly well suited for use with version control systems. Right eld[ 38 ] is a tool that augments spreadsheets to facilitate entry of semantically constrained terms, and as such goes some way to addressing the remixing problems of spreadsheet data, but does not really lend itself to creating multiple cross-linked linked data structures, and shares other limitations of spreadsheets.

Bakke[ 1 ] reports an an experiment with Related Worksheets that explores the management of multiple relations between worksheets in a desktop application. Their paper 9http://thepund.it 10http://eswcdemo.gramsciproject.org 11http://swan.mindinformatics.org 12http:// gshare.com 13http://www.researchspace.org 14https://github.com/mkalus/histcross 15http://www.segrada.org clearly explain some problems that Annalist aims to address, and proceeds to evaluate how they can be addressed in a spreadsheet interface. The work suggests user interface designs that might be helpful, but does not of itself provide usable tools for creating linked data.

Desktop databases such as Access16 require some con guration e ort before they can be used for capturing data, which in turn are constrained by the relational schema used, and not readily linked with external datasets. 3.4

Content management systems

Other classes of web application that might be considered for research data management include Content Management Systems (CMSs, such as Wordpress17 or Drupal18). These require signi cant development and/or con guration e ort to create a data sharing platform, and do not support the full exibility of RDF linked data. Drupal has built-in RDF support that is layered over an underlying schema, and is not amenable to change without re-working the underlying site con guration. Also, CMSs tend to hide the underlying data from direct view or manipulation, rather than exposing it for other applications to use in di erent ways. 3.5

Electronic Laboratory Notebooks

Annalist in some respects resembles electronic laboratory notebook systems (ELNs). There are many proprietary ELNs that are aimed at commercial research laboratories and as such may be beyond the budget of an individual or small research group. There are also some open source ELNs (e.g. Voegele et al[ 36 ], elabftw19, LabTrove[ 13 ]. We have not speci cally evaluated any of these, but in general they o er a blog-like platform, where textual notes may be augmented with named attribute or tabular data. We are not aware of any ELNs that support web linked data.

DESIGN

Annalist adopts a frame-oriented, or entity-oriented, approach to presenting and storing data, rather than being RDF graph based. Frames are considered to be easy for people to understand as they model some aspects of human memory patterns[ 18 ]. schraefel and Karger[ 30 ] explore in some depth the presentation of semantic web data in user interactions, and emphasize consideration of \what do we want to do"; in the case of Annalist, what we want to do is create, manage and share linked data on the web. We observe that when researchers use spreadsheets to create data, it is commonly arranged with a row of information for each of a number of similar entities (e.g. microarray descriptions commonly use a row of values for each of several thousand gene probes). The frame-based approach has implementation advantages, too: it provides a convenient grouping of data such that the description of each entity is stored as a separate le, assigned a URL, and directly accessed as a web resource.

In discussions with researchers about their preference for using spreadsheets, we were told that one of their reasons is that spreadsheets do not impose a priori constraints on what can be entered, making it easy for them to enter data as it becomes available. Bakke et al [ 1 ] also note \When it comes 16https://products.o ce.com/en-us/access 17https://wordpress.org 18https://www.drupal.org 19http://elabftw.net/ to general editing tasks on tabular data, spreadsheet systems have an advantage even over most tailor-made applications", and advantages of a system that can \allow temporary inconsistencies". Frey et al [ 13 ] also note that semantically aware tools \could be too heavyweight and prescriptive", restricting re-use in other areas. Annalist adopts a principle that its rst task is to make it easy for researchers to capture their data; de ning structure is secondary. Further, there is no attempt to validate data entered, or impose any kind of quality standards: we take a view that validity, quality and re nement are dependent on a context of use[ 40 ], and as such are usefully applied in such context.

Linked data vocabulary terms are commonly associated with schemas (or ontologies), but we observe that such terms may be adopted independently of any schema. Annalist supports evolving data initially though addition of terms, even to the extent of taking an unstructured narrative, identifying signi cant elements, and progressively articulating them using new terms, preferably drawn from existing stable schemas (e.g. sections 5.1 and 5.2). A related concern is evolving schemas, which incur changes to terms already used in data. Annalist provides support for supertype URIs in type de nitions, and property aliases, which can assist with type and property URI migration; generalization of these features is ongoing. We have also performed migrations by direct editing of the underlying data; while not an option for non-technical users, it shows that schema evolution can also be assisted by external services.

The requirements for internal data to be exposed to third party applications, and data portability, are addressed by using JSON { speci cally JSON-LD[ 33 ] { as the primary internal data storage format. JSON-LD conventions allow data to be interpreted as RDF, yet retaining the ease-of-use of JSON, and use by applications that have no knowledge of RDF. Within the data managed by Annalist, internal links are stored as URI relative references[ 4 ], to be resolved against the URL used to access the data, allowing data to be copied from one Annalist deployment to another without changes. Use of Compact URIs (CURIEs20) for eld names and types, with pre xes de ned in external JSON-LD context les, allows for more compact, readable and understandable data compared with using full URIs.

Browser View Model

Django-based

CSS, JS, PNG, etc.

Controller Web application

Static web resources Form rendering

engine Data model access

Field renderers

Identity provider (e.g. Google) Web page templates One for al lists, one for al entity views, and one for al entity edit forms.

Page content is determined by control data from the file system.

Storage

File system

Control data: types, views, etc.

User data

Figure 3 shows the main components of the Annalist software. It is implemented as an HTTP server application, written in Python using the Django21 software framework. All the essential Annalist application logic is implemented by the server, but there is some limited use of Javascript in the browser to provide a more responsive user interface; the intent here is that Annalist can be used in browser enviroments where Javascript is disabled or unavailable. The Annalist server software is designed to be deployed locally (on a personal computer), on a private network, or on a publicly accessible host.

At the heart of Annalist is a dynamic web-page creator and form rendering engine that combines user data with a form description to create an HTML web page. Figure 4 is an example of data viewed using Annalist. The underlying JSON-LD data can be accessed by web retrieval, either via Annalist, or a suitably con gured web server; the le layout is designed to preserve relative references. This helps to ensure that access to the data is not dependent on the health of the Annalist service. A goal is to allow user data to be stored on the web, separately from the Annalist service itself, though 21https://www.djangoproject.com there remain some access control details to be resolved to make this a reality.

Figure 5 shows an example of the Annalist data editing interface. It actually shows a form used for editing the de nition of a form description, so is self-referential: the labels of the elds on the left of the page are echoed in the list of eld descriptions at the bottom of the page.

Figure 6 shows an excerpt from a listing of records. These examples illustrate main kinds of display provided by Annalist: a detailed view of a single record, which may be an editing view or a view-only display, and a summary list of multiple records. Further examples of the Annalist user interface can be found in the Annalist tutorial document22.

Site Collection Record type Data record

Record view

Record list view

Field description

Field group (optional)

Like the user data, form descriptions and other con guration data are stored as JSON-LD les, and are editable through the Annalist web interface. This makes Annalist self-maintaining, in the sense that there is no separate con guration interface or other mechanism needed to de ne data types, storage structures, or their presentation.

Data are organized as illustrated in gure 7. At the top level is an Annalist Site, which is associated with an Internet host (or localhost for desktop deployment). Site data is grouped into free-standing Collections, which contain user data, and metadata to de ne its structure and presentation.

The data are stored in Data records, each of which is presumed to describe some entity, and corresponds to an addressable web resource or le (each having a distinct URL). Record types correspond to the type of entity described, and are used to combine similar entities for user presentation (e.g. in lists), and also in the underlying data storage (e.g. entities of di erent types are stored in di erent directories or storage containers). Record views de ne forms used 22http://annalist.net/documents/tutorial/main for creating, editing or viewing data records; Record lists de ne presentation of multiple entities for browsing. Field de nitions are referenced by record views and record lists, and control the internal representation and presentation of record component values. Field groups are used to group elds for various purposes, e.g. to de ne repeated groups of elds. URIs used for type and property URIs are contained in the record type, eld and view de nitions, and may be drawn from existing ontologies, or local ad/hoc identi ers with potential to adopt existing vocabularies as correspondences are determined.

Access control is managed in two parts: authentication is by a third-party identity provider (IDP) using OpenID Connect23 returning an authenticated email address. Annalist has been tested to date using the Google login service24. Access control is handled by permissions stored as Annalist records, which are de ned and applied on a per-collection basis, with fall-back to site-wide permissions. Permissions required for access may depend on record type (e.g. ADMIN permission required to access the permission records), and in future this might be used for ner-grained control.

APPLICATIONS

Annalist has been used with several personal and research projects, described below, which have informed its ongoing development. The rst example includes a sketch of its implementation to provide insight into use of Annalist, and all can be examined at the URLs given. 5.1

The Carolan Guitar

This is a project of Nottingham University's Mixed Reality Laboratory on \Augmenting a Guitar with its Digital Footprint"[ 3 ], recording its history online in the form of a blog25. Annalist has been used to create a linked data overlay of this history that links to the blog itself, and also to other key resources that are part of its history26. This overlay models events (construction, composition, performance and others), people, places, artifacts, materials, musical compositions and more using vocabularies drawn from RDFS[ 5 ], CIDOC CRM[ 22, 11 ], FRBRoo[ 2 ], and W3C PROV[ 23 ]. 5.1.1

Carolan Guitar implementation

Artifact Tool

Entity Design Material

Work

Construction Composition Performance

Event Place

Person

Role 23http://openid.net/connect/ 24https://developers.google.com/identity/protocols/ OpenIDConnect 25http://carolanguitar.com 26http://demo.annalist.net/annalist/c/Carolan Guitar/d/ Artifact/Carolan Guitar/ elements of both PROV (prov:Entity, prov:Activity) and CIDOC CRM (crm:E71_Man-Made_Thing, crm:E5_Event) ontologies. Other, more re ned, types are introduced as judged useful to capture the guitar's history. The Carolan Guitar itself is an instance of Artifact, a subtype of Entity, which is primarily an instance of the CIDOC CRM type crm:E24_Physical_Man-Made_Thing, but is also associated with a number of other types in the type de nition27: {"annal:id": "Artifact", "annal:type": "annal:Type", "rdfs:label": "A constructed physical entity", "rdfs:comment": "An artifact, such as a musical instrument or some other object.", "annal:uri": "crm:E24_Physical_Man-Made_Thing", "annal:supertype_uris": [ {"annal:supertype_uri": "prov:Entity"}, {"annal:supertype_uri": "crm:E77_Persistent_Item"}, {"annal:supertype_uri": "crm:E70_Thing"}, {"annal:supertype_uri": "crm:E71_Man-Made_Thing"}, {"annal:supertype_uri": "crm:E18_Physical_Thing"}, {"annal:supertype_uri": "frbroo:F7_Object"}], "annal:type_list": "_list/Artifacts", "annal:type_view": "_view/Artifact"}

The central subject, the Carolan Guitar, is presented using the Artifact Record view28. This view describes an Artifact with an identi er, type, label, description, links to further information, and (central to this application) a list of life events, which correspond to a journal of its history.

Information about the Carolan Guitar and its life events are recorded in its description29, e.g.: : {"crm:P12i_was_present_at":

"Construction/Construction_9"}, {"crm:P12i_was_present_at":

"Performance/First_performance"}, {"crm:P12i_was_present_at": "Performance/Stairway"}, {"crm:P12i_was_present_at": "Performance/Hop_jam"}, {"crm:P12i_was_present_at": "Event/Photo_shoot"}, {"crm:P12i_was_present_at":

"Composition/Catch_the_moment"}, :

Here, relative URL references are used to designate life events, each of which may record information about where it took place, entities used, and who was involved in what roles. Di erent types of event in the guitar's history (construction, composition, performance, etc.) may also have di erent information: a construction event view30 may include information about the tools, materials used; a performance view31 may include details of the works performed.

The modeling of the Carolan Guitar's story is by no means complete (if such a thing is ever possible), and some choices of what to include could reasonably be described as arbitrary. But, using Annalist, the description can be augmented and 27http://demo.annalist.net/annalist/c/Carolan Guitar/d/ type/Artifact/type meta.jsonld 28http://demo.annalist.net/annalist/c/Carolan Guitar/d/ view/Artifact/view meta.jsonld 29http://demo.annalist.net/annalist/c/Carolan Guitar/d/ Artifact/Carolan Guitar/entity data.jsonld 30http://demo.annalist.net/annalist/c/Carolan Guitar/d/ view/Construction/view meta.jsonld 31http://demo.annalist.net/annalist/c/Carolan Guitar/d/ view/Performance/view meta.jsonld re ned with additional types and more detailed view descriptions as new requirements are encountered. Indeed, this has already happened several times during its development. 5.2

Smoke: creating an audio-visual poem

Procedural Bending is presented in Garrelfs' PhD thesis[ 14 ] as a model for discourse about creative processes. It has similarities with the W3C PROV model[ 23 ], but also some key di erences. Annalist has been used to create a description of the creation of \Smoke"32, an \experimental documentary come audio-visual poem" about mid 20th-century air pollution in the cities of Pittsburgh and St Louis. In this case, a semi-structured blog-like journal was created by the artist using Annalist33, together with a Procedural Blend diagram using the model from Garrelfs' thesis. We worked with the artist to encode the blend diagram as Annalist records, and in the process were able to re ne the model to make it more consistently encodable while preserving the original descriptive intent. 5.3

Chemistry Personas

Chemistry Personas34 evaluates Annalist as a tool for capturing records of academic researchers in chemistry, and for identifying metadata from these records. It was used to create a set of interfaces for the capture and linking of information about people, organisations, projects, and resources associated with experiments such as plans, materials, equipment, activities, and the experiment records themselves that incorporate linked data. Designs of the models are based on research information and associated metadata from experiment records, and observed recording practices in chemistry from a range of universities across the world.

Annalist was found to be useful for capturing research records and associated data, with the exibility to be easily adapted to the needs of di erent research groups and individual researchers. We found the generic capability to create specialized interfaces for capturing information allowed us to handle the di erent requirements of di erent domains and disciplines. The linked-data aspect is particularly useful in enabling the simple reuse of resources and plans, and inclusion of frequently used information into the research records. 5.4

Canal Cruising Log

The canal cruising log35 is an example of Annalist used for personal information management. It captures information about narrowboat cruising on the English canal network, with information about daily movements, waterways, places visited, other interesting locations, and maintenance activities performed. It is based on a handwritten log book, and attempts to capture information in searchable form that may be useful when planning waterways travels. The information modelling is ad hoc (i.e. uses private vocabulary terms). Using Annalist, it would be quick and easy to revisit and add class URIs from standard ontologies. Property URIs are harder to update, but but work is in progress to support data migration as properties change. 32http://irisgarrelfs.com/smoke 33http://cream.annalist.net/annalist/c/IG Philadelphia Project/ 34http://cream.annalist.net/annalist/c/Chemistry Personas/ 35http://demo.annalist.net/annalist/c/CruisingLog/ 5.5

Accommodation search

The accommodation search collection36 is another example of personal information management, this time with a clear real-world outcome. We sought a new home for an elderly relative that would make it easier to provide increasinglyneeded levels of support. The web made it easy to nd candidate properties, but there were speci c requirements (e.g., physical accessibility) not selectable by available search facilities, so we had to lter from a large number of candidate properties; further, good properties would come to market and disappear quickly, so prompt information sharing was needed. Annalist was used to rapidly create a specialised database of candidates, with links to existing property web sites, additional annotations, an an overall scoring of suitability according to our particular requirements. This was shared among family members, and when the ideal property appeared we were able to consult over the details and arrange an early visit. 6. 6.1

DISCUSSION Evaluation of requirements

Section 2 set out a number of requirements arising mainly from past experiences creating FlyTED. We now review those requirements against the implemented Annalist applications described in section 5.

Requirements R2 (no programming), R6 (data sharing), R9 (use of unmodi ed software) and R10 (expose data in a standard format) are demonstrated by all of the implementations described. Our work on these implementations has repeatedly exploited R8 (portability of data) and R12 (o ine working) by using a GitHub repository37 to transfer work-in-progress data between an o ine laptop and online servers; this use of Github for data exchange, backup and versioning also demonstrates another aspect of R10 (sustainability of data). Annalist's exposure of underlying data (R11) is present in all the applications described, but has not been signi cantly exploited by independent software; this will be tested in future work.

The implementation of The Carolan Guitar data has shown extensive use of R4 and R7 (choice and mixing of existing RDF vocabularies), combining PROV, CIDOC CRM, FRBRoo and some private terms. Requirement R5 (evolution of structure) was used when working through the Carolan Guitar's history, starting with its construction, but in later stages dealing with musical performances and compositions.

The The Canal Cruising Log implementation in particular demonstrates R3 (not needing knowledge of RDF) as this has been developed without reference to any speci c RDF terms or characteristics. The frame-oriented approach to data presentation appears to be more approachable than requiring users to work with the RDF graph/triple structure (also suggested by Kalus[ 18 ]). There is one aspect where RDF in uences remain visible: eld descriptions must specify a property string in the form of a URI or CURIE that is used to identify an attribute in the data (the e ect of which is to constrain the syntax of attribute identifying strings).

The Accommodation search application demonstrates the achievement of R1 (to quickly create a collection and start 36http://demo.annalist.net/annalist/c/Accommodation search/ 37https://github.com/ninebynine capturing data) as this ephemeral application would never have been realized if it would have taken more than an hour or so to create the collection with some initial data records. 6.2

Further observations

Primary storage of data as simple text resources: there is no database or triple store behind Annalist. For locally stored resources, each addressable resource is stored as a single JSON-LD le, and the request URL is mapped to a corresponding lename. The underlying resources may be served directly by a web server without any Annalist deployment being present, which we believe could be a boon for long-term sustainability of research data (e.g., the Annalist demo site38 also o ers links39 that connect directly to the underlying data). This approach allows collections to be versioned using common version management tools (e.g. git), and shared via version repositories.

Edit con icts: Annalist handles concurrent updates to individual entities by atomic updates. There is an unimplemented design for warning when an entity changes during editing. Consistency of values between entities is not currently enforced (see section 4).

Performance: the Annalist demonstrator runs on a modestly provisioned virtual machine (1 virtual CPU, 2Gb RAM). We have not undertaken formal measurements, but have not found performance to be a limitation in day-to-day use. This matches our expectation that the underlying Linux le system is very e cient for accessing small les that comprise Annalist collections. Some particular operations perform repeated le accesses, and future work is planned to optimize these cases. Planned developments will introduce an index alongside the at les, probably a triple store, to support e cient search and query over the data.

Data types for organisation, views to de ne structure: traditional database systems use data types (or equivalent) both to categorize data records and to associate structure with the data (e.g. a relational table de nes the structure of each row in a table). This is less true for schema-free databases (e.g. MongoDB40), but even here we may see that structural features such as indexes are associated with what is e ectively a data type (e.g. in MongoDB, a \collection"). With Annalist, types are used simply to categorize data records, and the structure of any record is determined by the view (or views) that are used to create or edit it. This means that di erent views can be used with a record type, according to the context (e.g., considering a person as an employee or as a customer). (When editing a record with elds not referenced by the view used, those elds are unchanged when the record is saved.)

Collection portability: URLs and URIs: moving a collection between Annalist deployments means that the URLs used to access records can change. But for some elds, such as type descriptions, we need stable identi ers that don't change with location of the data. The implementation of Annalist recognizes this tension, and distinguishes between URLs used to access and retrieve resources and URIs used to identify them. This goes somewhat against the grain of web wisdom, and in practice the distinction is used quite sparingly, but we believe this illustrates that in pragmatic applications, particularly where there may be copies of the 38http://demo.annalist.net/ 39http://annalist.net/annalist sitedata/ 40https://www.mongodb.com/ same information in di erent locations, the distinction may have some value. (This di ers from earlier discussions about URNs and URIs41, in that the distinction is in no way dependent on the URI scheme used: a given HTTP URI may be a URI or URL depending on how it is used.)

Usability: we have not yet undertaken formal usability studies, but our experience to date indicates that it is possible to quickly create data managment forms that are usable with no special knowledge or experience. We have found that creating structure and view de nitions can become complicated when there are multiple relationships between entity types, and improvements in this area are ongoing.

Scale: Annalist deals with collections of modest size, but through the milieu of the web even such modest collections, in su cient numbers, may contribute to data at much greater scales. Tools, like Annalist, that facilitate creation of linked data at local scales may be a key to enabling fully distributed datasets at web scale. 7.

FURTHER WORK

The nature of Annalist as a generic tool means there is inevitably far more that could be done than has been achieved to date. Work-in-progress enhancements include: modular type/view de nitions, importing de nitions from a predened collection, which we see as allowing end-users to get started even more quickly with generating their own linked data (e.g. using \canned" de nitions for bibliographic and provenance information); usability improvements to streamline common data entry tasks (e.g. automatic creation of default views and lists associated with a data type); data migration facilities to assist with applying vocabulary changes to existing data.

Work is currently underway to create an independent frontend for presentation of musical performance data created using Annalist (based on the structures developed for the Carolan Guitar), which we plan to use to develop a demonstration system aimed at enhancing audience experience of live music concerts. This will provide an exemplar of how Annalist may be used as part of a larger ensemble of tools for creating and deploying applications using linked data.

We have noted that Annalist is not a \big data" system, and that design choices may constrain the e ective size of a single Annalist collection. But by creating multiple independent, cross-linked and web-searchable data, we anticipate that Annalist collections may be combined with other data sources, contributing to creation of linked data resources at larger scale. One way to explore this idea would be to use Annalist to reinstate public access to the FlyTED data. Some initial explorations are under way, and successful achievement of this could provide a particularly compelling evaluation of the Annalist principles and design.

Looking ahead, we anticipate creation of \data bridges" to allow existing data (especially in spreadsheets) to be presented as linked data through Annalist; this might extend further to real-time data acquisition, with data from sensors like GPS or real-time feeds.

The current implementation of Annalist uses the server host le system for data storage, but it was an original goal that Annalist could work with third-party storage. A candidate for this would be a Linked Data Platform (LDP)[ 32 ] server, though there are matters of access control to be 41http://www.w3.org/TR/uri-clari cation/ resolved. The Social Linked Data (Solid) project42 uses WebId43 for access control, and could provide an promising opportunity if it is possible to devise a mechanism to link OpenID Connect authentication with WebId authorization.

Other areas of possible future work for Annalist include provenance[ 15 ] recording and support for provenance pingbacks[ 21 ] to recognize downstream use of Annalist data.

Looking further ahead, we note that the Annalist dynamic form generator has evolved in a somewhat ad hoc fashion, in response to evolving recognition of requirements. As noted in section 3.2, theoretical work on \form lenses"[ 27 ] might be adopted to provide a more principled grounding for this aspect of Annalist.

Finally, we note that while Annalist has been an open development from its outset, it has been signi cantly conducted so far by an \open community" of just one developer. For a viable future we must engage a wider community of users and developers to establish a more long-term sustainable project. Contributions will be most welcome!

CONCLUSIONS

Annalist has been used in a number of research projects for prototyping linked data information structures and, even as a work-in-progress, has proven exible and robust in use by a small number of diverse users, with a low cost to get started with a new collection. It has also proved e ective in personal information management projects involving annotation of existing web resources and sharing structured data on the Web. While many of the capabilities of Annalist are provided by other systems, we are not aware of any other that combines the key features of Annalist in a package that can be used \out of the box" for data management.

We have noticed that, while Annalist is easy to use for basic data entry and browsing, developing more complex structures requires greater e ort, and does bene t from an awareness of the RDF model (particularly with respect to the use of URIs, or CURIEs, to identify things, classes of things and relations between things). But this e ort can be applied incrementally, yielding rapid bene ts and feedback, supporting agility in creating information designs.

Annalist's exible approach to information structuring has permitted an approach that di ers from that often used when creating databases (e.g. for research data), starting with a very loosely structured narrative and progressively re ning structured information around that narrative. In this, we feel that we have created a tool that goes some way to achieving Karger's objectives for semantic web applications, viz. \to work e ectively over whatever schemas their users choose to create or import"[ 19 ].

ACKNOWLEDGEMENTS

The development and evaluation of Annalist has been supported in part by EPSRC EP/L019981/1 Fusing Semantic and Audio Technologies for Intelligent Music Production and Consumption and by the JISC-funded Research Data Spring CREAM project44. 42https://github.com/solid/solid-spec 43http://www.w3.org/2005/Incubator/webid/spec/ identity/ 44https://blog.soton.ac.uk/cream/

[1]

Bakke ,

Karger , and

Miller . A spreadsheetbased user interface for managing plural relationships in structured data . In Proc. SIGCHI conference on human factors in computing systems. In CHI '11. ACM , Vancouver, 2011 , pp. 2541 { 2550 . doi: 10 .1145/1978942. 1979313.

[2]

Bekiari ,

Doerr , and P. Le B uf, eds. FRBR - object-oriented de nition and mapping to FRBRer (version 1.0) . 1.0 ed. May 2009 , 1 .0 ed., 2009 . url: http://cidoc.ics.forth.gr/docs/frbr oo/frbr docs/ FRBRoo V1. 0 draft 2009 may .pdf.

[3]

Benford ,

Hazzard , et al. Augmenting a guitar with its digital footprint . In Proc. international conference on new interfaces for musical expression . Louisiana State University, 2015 , pp. 303 { 306 . url: http://www. nime.org/proceedings/2015/nime2015 264.pdf.

[4]

Berners-Lee ,

Fielding , and

Masinter . RFC 3986, Uniform

Resource

Identi er (URI): Generic Syntax . 2005 . url: https://tools.ietf.org/html/rfc3986.

[5]

Brickley and

Guha . RDF schema 1.1. W3C Recommendation. World Wide Web Consortium, Feb . 2014 . url: http://www.w3.org/TR/rdf-schema/.

[6]

Buneman ,

Cheney , et al. The Database Wiki project: a general-purpose platform for data curation and collaboration . SIGMOD record , 40 ( 3 ): 15 { 20 , 2011 . doi: 10 .1145/2070736.2070740.

[7]

Catton ,

Sparks , and

D. M.

Shotton . Chapter 21: publishing and nding images in the BioImage Database, an image database for biologists . In, Cell biology (third edition) , pp. 207 { 216 . Academic Press, Burlington, third edition ed., 2006 . doi: 10 .1016/B978- 012164730-8/ 50149 -0.

[8]

Ciccarese ,

Ocana , and

Clark . Open semantic annotation of scienti c publications using DOMEO. J. biomedical semantics, 3(S-1 ): S1 , 2012 . url: http : //www.jbiomedsem.com/content/3/S1/S1.

[9]

Cyganiak ,

Wood , and M. Lanthaler. RDF 1 . 1 concepts and abstract syntax . W3C Recommendation. Feb . 2014 . url: http : / / www . w3 . org / TR / rdf11 - concepts/.

[10]

E. W.

Deutsch et al. Development of the minimum information speci cation for in situ hybridization and immunohistochemistry experiments (MISFISHIE) . Omics: a journal of integrative biology , 10 ( 2 ): 205 { 208 , 2006 .

[11]

Doerr . The CIDOC conceptual reference module: an ontological approach to semantic interoperability of metadata . AI mag. , 24 ( 3 ): 75 { 92 , Sept . 2003 . url: http://dl.acm.org/citation.cfm?id= 958671 . 958678 .

[12] G. dos Santos et al. Flybase: introduction of the Drosophila Melanogaster release 6 reference genome assembly and large-scale migration of genome annotations . Nucleic acids research , 43 ( Database-Issue ): 690 { 697 , 2015 . doi: 10 .1093/nar/gku1099.

[13]

J. G.

Frey ,

A. J.

Milsted , et al. Labtrove: a lightweight, web based, laboratory blog as a route towards a marked up record of work in a bioscience research laboratory . Plos one , 8 ( 7 ):e67460{[18pp], July 2013 . doi: 10 .1371/ journal.pone.0067460. url: http://eprints.soton.ac.uk/ 355078/.

[14] I. Garrelfs. From inputs to outputs: an investigation of process in sound art practice . PhD thesis . University of the Arts London, May 2015 . url: http://irisgarrelfs. com/thesis.

[15]

Groth and L. Moreau. PROV overview: an overview of the PROV family of documents . W3C Working Group Note. W3C, Apr. 2013 . url: http : / / www . w3.org/TR/prov-overview/.

[16]

Gutteridge . GNU EPrints 2 overview . In 11th panhellenic academic libraries conference. Event Dates: 2002 , 2002 . url: http://eprints.soton.ac.uk/256840/.

[17]

Huynh ,

Mazzocchi , and

Karger . Piggy Bank: experience the semantic web inside your web browser . Web semantics: science, services and agents on the World Wide Web , 5 ( 1 ), 2007 . doi: 10 .1016/j.websem. 2006 . 12 .002.

[18]

Kalus . Semantic networks and historical knowledge management: introducing new methods of computerbased research . Journal of the Association for History and Computing , 10 ( 3 ), Dec. 2007 . url: http : / / hdl . handle.net/2027/spo.3310410.0010.301.

[19]

Karger . Keynote at ESWC part 2: how the semantic web can help end users . Tech. rep. MIT CSAIL Research , 2013 . url: http://haystack.csail.mit.edu/ blog/2013/06/06/keynote- at - eswc - part- 2 - how- thesemantic-web-can-help-end-users/.

[20]

Karger . Keynote at the ESWC part 1: the state of end user information management . Tech. rep. MIT CSAIL Research , 2013 . url: http : / / haystack . csail . mit.edu/blog/2013/06/05/keynote-at-the-europeansemantic - web - conference - part- 1 - the- state - of - enduser - information-management/.

[21]

Klyne and

Groth. PROV-AQ: provenance access and query . W3C Working Group Note. W3C, Apr. 2013 . url: http://www.w3.org/TR/prov-aq/.

[22] P. Le B uf,

Doerr , et al. De nition of the CIDOC conceptual reference model , version V6.2. Tech. rep. International Council of Museums , May 2015 . url: http://www.cidoc-crm. org/docs/cidoc crm version 6 . 2.pdf.

[23]

Lebo et al. PROV-O: the PROV ontology . W3C Recommendation. W3C , Apr. 2013 . url: http://www. w3.org/TR/prov-o/.

[24]

McBride and

Paterson . Applicative programming with e ects . J. funct. program. , 18 ( 1 ):1{ 13 , Jan . 2008 . doi: 10 .1017/S0956796807006326.

[25]

Miles ,

Shotton , et al. OpenFlyData: an exemplar data web integrating gene expression data on the fruit y Drosophila Melanogaster . Journal of biomedical informatics , 43 ( 5 ): 752 { 761 , 2010 . doi: http://dx.doi. org/10.1016/j.jbi. 2010 . 04 .00.

[26]

Pietriga ,

Karger , et al. Fresnel: a browser independent presentation vocabulary for RDF . In Proc. 5th international conference on the semantic web . In ISWC'06 . Springer-Verlag, Athens, 2006 , pp. 158 { 171 . doi: 10 .1007/11926078 12.

[27]

Rajkumar ,

Lindley , et al. Lenses for web data . Electronic communications of the EASST , 57 ( 57 ), 2013 . doi: 10 .14279/tuj.eceasst. 57 .879.

[28]

Rauschmayer . Connected information management . PhD thesis . Ludwig-Maximilians-Universitat Munchen, Feb. 2010 . url: http://nbn-resolving.de/urn:nbn:de: bvb: 19 - 114390 .

[29]

Rauschmayer . Hyena RDF editor. Poster. Displayed at SemWiki 2008 , 3rd Semantic Wiki Workshop: The Wiki Way of Semantics. 2008 . url: http://ceur- ws. org/Vol- 360 /poster-11.pdf.

[30] m. schraefel

and D.

Karger . The pathetic fallacy of RDF . In International workshop on the semantic web and user interaction (SWUI) 2006 . url: http://eprints. soton.ac.uk/262911/.

[31] D. M. Shotton , S

Sparks , and C

Catton . An introduction to the BioImage database, an image database for biologists . In Proc. 4th european light microscopy initiative meeting, Gothenburg , 2004 .

[32]

Speicher ,

Arwe , and

Malhotra . Linked data platform 1.0 .

W3C

Recommendation. Feb . 2015 . url: http://www.w3.org/TR/ldp/.

[33]

Sporny ,

Longley , et al. JSON-LD 1.0: a JSONbased serialization for linked data . W3C Recommendation. Jan . 2014 . url: http://www.w3.org/TR/json-ld/.

[34] Stanford Center for Biomedical Informatics Research (BMIR). Protege ontology editor . url: http://protege. stanford.edu.

[35]

J. R.

Swedlow et al. The Open Microscopy Environment (OME) Data Model and XML le: open tools for informatics and quantitative analysis in biological imaging . Genome biology , 6 ( 5 ):R47{ R47 , 2005 . doi: 10 .1186/gb-2005-6-5-r47.

[36]

Voegele et al. A universal open-source electronic laboratory notebook . Bioinformatics , 29 ( 13 ): 1710 { 1712 , 2013 . doi: 10 .1093/bioinformatics/btt253.

[37]

Vo lkel and M. a. Krotzsch. Semantic wikipedia . In Proc. 15th international conference on World Wide Web. ACM , Edinburgh, Scotland, 2006 , pp. 585 { 594 . doi: 10 .1145/1135777.1135863.

[38]

Wolstencroft et al. RightField: semantic enrichment of systems biology data using spreadsheets . In 8th IEEE international conference on e-science, e-science 2012 , Chicago, IL, USA, October 8- 12 , 2012 , pp. 1 { 8 . doi: 10 .1109/eScience. 2012 . 6404412 .

[39]

Zhao ,

Klyne ,

Benson , E. Gudmannsdottir,

White-Cooper , and

Shotton . FlyTED: the Drosophila testis gene expression database . Nucleic acids research , 38 ( Suppl . 1):D710{ D715 , 2010 . issn: 0305 - 1048 . doi: 10 .1093/nar/gkp1006.

[40]

Zhao ,

Klyne ,

Gamble , and

Goble . A checklist-based approach for quality assessment of scienti c information . In 3rd international workshop on linked science (LISC2013) , 2013 . url: http://linkedscience. org/wp-content/uploads/2013/04/paper5.pdf.

[41]

Zhao ,

Klyne , and

Shotton . Building a semantic web image repository for biological research images . In S. Bechhofer,

Hauswirth , J. Ho mann, and M. Koubarakis, editors, The semantic web: research and applications . Vol. 5021 , in Lecture Notes in Computer Science, pp. 154 { 169 . Springer Berlin Heidelberg, 2008 . doi: 10 .1007/978-3- 540 -68234-9 14 .