Proceedings of the 27th International Symposium Nuclear Electronics and Computing (NEC’2019)
                         Budva, Becici, Montenegro, September 30 – October 4, 2019


    JOIN2 A PUBLICATION DATABASE AND REPOSITORY
                  BASED ON INVENIO
                    L. Baracchi1,a, A. Wagner2,b JOIN2 Collaboration
             1
                 Deutsches Zentrum für Neurodegenerative Erkrankungen, DZNE, Library
                          2
                              Deutsches Elektronen-Synchrotron, DESY, Library

                      E-mail: a laura.baracchi@dzne.de, b alexander.wagner@desy.de


JOIN² is a shared repository infrastructure that brings together eight research institutes for the
development of a full-fledged scholarly publication database and repository based on the Invenio v1.1
open source framework for large-scale digital repositories. Seven JOIN² instances are already
successfully deployed and two more institutes have joined seamlessly during the last year, resulting in
the overall consolidation of the system and its functionalities. JOIN² provides a general solution built
around a well-defined publication workflow which represents the cornerstone of the JOIN² paradigm.
Always preferring simplicity to complexity and implementing a convergent, inclusive solution, the
JOIN² members have consolidated their successful development workflow and collaboration. We
highlight how JOIN² is able to address the needs of a heterogeneous group of research centres.

Keywords: JOIN², publication database, institutional repository, library system, publishing,
services, reporting


                                                                        Laura Baracchi, Alexander Wagner


                                                              Copyright © 2019 for this paper by its authors.
                     Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


                                                                                                         51
      Proceedings of the 27th International Symposium Nuclear Electronics and Computing (NEC’2019)
                         Budva, Becici, Montenegro, September 30 – October 4, 2019


1. Introduction
         With the main goal of establishing a centralized single source for publication reporting,
already in 2000 Forschungszentrum Jülich (FZJ) started the development of a publication database
that went into production in 2002. Later Deutsches Elektronen-Synchrotron (DESY) partnered with
FZJ and adopted this system, where functionalities required for an institutional Open Access
repository were added. While being successfully used for their purposes, around 2009 it became clear,
that these databases required substantial functional enhancements. It turned out quickly however, that
both systems had diverged significantly over time, and that additions would now require individual
development at each partner. This lead to the decision to replace these custom built solutions by some
well-established Open Source system and to address local requirements by more general concepts
allowing the partners to rely on a common code base, leading to the establishment of the JOIN²
initiative1: a joint effort tackling the needs of different research centres in a structured, coordinated
way.
         From user requests it also became clear that the new system should not only be a tool for
publication reporting, but it should also
      further the visibility of research done on site. Thus integration into the existing web presence,
         ensuring high quality indexing by search engines and providing data directly to external
         services like OpenAIRE (the European OpenScience repository2) or BASE (a
         multidisciplinary academic search engine3) became more of a focus.
      allow to derive personalized publications lists easily e.g. for a vita, as well as aggregations for
         project or institutional pages on the web or via exports for basically any kind of use cases.
      take the requirements of OpenScience into account. This called for a tight integration of the
         publications database and Open Access repository and for good integration with workflows
         for both Green and Gold Open Access publication including fee and embargo handling.
      be used as a central document repository not only for own publications, but also for all kinds
         of works required to conduct the own research. This common request by users requires a good
         integration with tools for reference formatting like BibTeX or commercial tools for use with
         word processors as well as good import functions for fast addition of content.
      be used as a central database to report on publications made at the institution. This adds very
         high demands on data quality and normalization.
         These requirements led JOIN²’s search for a suitable software towards repository systems that
can handle a data model beyond simple Dublin Core[1–4] and are able to cope with large numbers of
records while providing fast searches and high availability. It also led to the decision to base JOIN² on
individual, closed collections reflecting the organizational structure of the hosting sites. At it’s core it
is more a web based literature management system. It also demanded for handling of full text files
beyond pure Open Access including fine grained access control mechanisms, and while still focusing
on publications, it should allow for full text formats beyond PDF.

2. Implementation
        After a detailed market analysis, the two institutions decided to go for Invenio4[5]. Even
though it was clear from the start that some additions would have been necessary, it was concluded
that the flexible, in libraries well established and understood, meta data model Marc21 is suitable to
handle different requirements while keeping a common code base. This was, and still is, considered to
be the decisive advantage of Invenio compared to other solutions available. However, some additional
development still needed to be done.
The main missing functionalities were identified as:


1
        https://join2.de
2
        https://openaire.eu
3
        https://base-search.net
4
        https://invenio-software.org/legacy


                                                                                                        52
      Proceedings of the 27th International Symposium Nuclear Electronics and Computing (NEC’2019)
                         Budva, Becici, Montenegro, September 30 – October 4, 2019


     Web forms, that allow non-experts to submit their publications easily, while ensuring very high
      meta data quality without much manual curation. This is tackled by importers from commonly
      used systems5 to streamline submissions in conjunction with extensive authority control.
     Export interfaces that allow for re-use in bibliographic software6 and especially on the web.
     Authority control to avoid inconsistent inputs and thus to increase the quality of bibliographic
      data. This also includes author disambiguation required for personalized publication lists on the
      web.
     A workflow that allows for at least three steps to guarantee bibliographic quality and proper
      permission handling for bibliographic data and full text files on the repository side.
         It was further agreed that all functions that are offered to the user will be used by staff
members as well, and that, by design, there will be no additional hidden functions only visible to staff.
This ensures on one hand that everything is working as expected, while on the other hand staff
members will always see the same information and in the same way as any user and can thus assist
easily if needs be. It also helped to identify and fix weaknesses in the overall submission process and
streamline it as much as possible. As a result, e.g. all bibliographic reports are produced using the
provided search functionalities and export formats.
         From the very beginning GSI Helmholtzzentrum für Schwerionenforschung (GSI) showed
interest in the project and joined right before the implementation, as did RWTH Aachen University
Library (UB RWTH) only shortly afterwards. DESY and GSI also planned to migrate their library
systems to Invenio and, ideally, run them as an integrated system similar to CDS7 at CERN.
         After some preliminary work in 2010, implementation focussing on the publications database
began with these four partners in 2011[6].
         At the end of 2012 JOIN² reached the first major mile-stone when JuSER went into production
to replace the old database VDB at Forschungszentrum Jülich. By 2013, all other partners had systems
running on site and most were already in production[7]. During 2013 JOIN² also organized the Invenio
User Group Workshop (IUGW) at Forschungszentrum Jülich8 The Heinz Meier-Leibnitz Zentrum
(MLZ) in Garching joined as a new partner and their system iMPULSE went into production already
in 2014.
         Having all sites of the partners online was the next decisive mile-stone for JOIN². Given the
diversity of the institutions involved it proved the flexibility and also the viability of the solution.
Having a university on board JOIN² had to cope with publications from all fields, ranging from Arts
and Humanities to all areas of the Sciences. While this did not allow for any subject specific short
cuts9, it ensured the capability to handle very general use cases and finally paid off as more partners
from various fields joined.
         Designed as a distributed system from the start, a lot of the infra structure was built and
established e.g. for joint development or system roll out. Unlike the usual approach in Invenio, JOIN²
always featured a fully automatic system for setup and configuration where the parameters for the
latter are organized in git repositories. It thus was the first Invenio system that employed an overlay
structure as is now common practice. This required quite some work up front, but it proved a necessity
as the project expanded later and it allows the project to establish a monthly almost seamless roll out
routine. It is also one of the building blocks for the successful dockerization of the project, again a
technology that barely existed at the start of JOIN².
         Having all original partners’ instances in production freed up some resources and finally
allowed for new partners to come on board. By 2014 the German Cancer Research Centre (DFKZ)
joined the project[8] while the German Centre for Neurodegenerative Diseases (DZNE) already
showed first interest as well[9].


5
         doi, pubmed, arXiv, inspire etc.
6
         This required some enhancement of existing exports, especially for EndNote/RIS.
7
         https://cds.cern.ch
8
         sic!Jülich
https://www.fz-juelich.de/zb/EN/About_us/conferences/%5Bsic!%5D2013/sic_2013_node.html.
9
         Effectivley, it even required to straighten some of those taken by CERN.


                                                                                                     53
      Proceedings of the 27th International Symposium Nuclear Electronics and Computing (NEC’2019)
                         Budva, Becici, Montenegro, September 30 – October 4, 2019


         Being on the agenda already from the start DESY finally started to migrate their library
system (ILS) and integrate it into pubdb, which thus became a central hub for all publication related
services at DESY. The most visible part of this integration is the library catalogue. Development on
this part was completed in 2017[10] and shortly afterwards GSI was also able to adopt the new
functionalities. However, an ILS has quite a few requirements based on workflows in the background
reflecting different processes and needs and again the initial approach proved flexible enough to tackle
them smoothly.
         One of the major issues JOIN² had to solve from the start was name forms in non-English
languages, especially for author disambiguation and normalization[11]. In general each name has an
official form and can have n alias forms associated. While in western-European languages for names
this involves only a few characters in Latin script, the issue becomes more pronounced in non-Latin
scripts like Cyrillic. Thus first contacts with the Joint Institute for Nuclear Research (JINR) date back
to the IUGW 2013, mainly due to this common issue.
         Again authority records are employed to solve this problem. Here they model people10, and in
order to describe them properly these records also hold additional attributes like an address or multiple
name forms. Authority records have been a common concept in the library world for decades11, while
it is a quite new concept to the repository world and did not exist in Invenio. Being a basic
requirement however, ideas and work on how to do this in JOIN² date back to the initial project layout.
While the implementation was done well before the advent of ORCiD12, the schema is flexible enough
that JOIN² is, and has always been, ready for ORCiD. Effectively, it can even use ORCiDs as primary
author identifier at any point in time. The same is true for other kinds of identifiers.


 Sample of screen shots from various instances and their seamless output to the institutes web pages.
   The map shows the partner institutions of JOIN². Dark colour indicates sites that host the actual
systems, light shade refers to individual lab sites associated to one partner. E.g. DESY PUBDB serves
                 the two lab locations of DESY: Hamburg (dark) and Zeuthen (light).
                                 See also the project web site at https://join2.de

       After some initial trials to flesh out the functionality of JOIN² for re-use in JDS13, JINR
decided to become a partner in 2017[12] and finally a Memorandum of Understanding between JINR

10
        JOIN² features a similar handling for institutes, groups, grants, experiments, journals etc.
11
        see [11] and https://www.loc.gov/marc/authority/ for a full description of Marc Authority
12
        https://orcid.org
13
        http://jds.jinr.ru


                                                                                                       54
      Proceedings of the 27th International Symposium Nuclear Electronics and Computing (NEC’2019)
                         Budva, Becici, Montenegro, September 30 – October 4, 2019


and DESY on behalf of JOIN² was signed. The current version of JOIN² now solved the issues with
multiple scripts for searching and also displaying the authority records, but there still remain some
issues to be solved with regard to multi-script submission. In practice, adding another script results in
at least two official forms of a name, and the one to be choosen may depend on the language of the
data. Shortly after JINR also DNZE joined the project and started implementation in 2018. These two
new partners also required some internationalization within the project. While for quite some time the
main language (except for code and technical issue tracking) was German, it now moved to English in
almost all areas. This also required a rewrite of the documentation, which is mainly tackled by DZNE
along with their implementation of DZNEPUB. Furthermore, DZNE was the first partner that
employed docker from the start, another important mile-stone for the project which implemented an
enabling technology for the future14.
         Today, JOIN² operates seven systems in production with two more (DZNE and JINR) to
come. More than 28.000 staff members and more than 6.000 visiting scientists have full access to
functionalities. JOIN²-systems serve more than 475.000 records out of which more than 68.000 offer a
freely accessible full text and 135.000 are shared authorities.


3. Advanced functions
         The powerful author disambiguation together with the possibility to store all kinds of
publications from whatever time period allow to keep the whole academic record. Using the web
export will guarantee pages to stay updated and the reuse of data required for reporting avoids multiple
submissions of the same data. This is particularly useful in case of publications with several several
authors from the same institute. The integrated full text repository also allows transparent access to the
publications for co-workers. In case of Open Access publications the whole community can access
them easily from the authors’ web pages. Invisible, semantic mark-up is employed for the export to
ensure proper indexing by search engines and thus further visibility, while integration in the overall
web presence of the research institution gives quite a boost in ranking. The latter again profits from an
Open Access full text attached.
         Additionally, the JOIN² workflow ensures that it is always safe for authors to attach their
articles. No file will be released to Open Access without manual checking for legal restrictions and if
necessary, embargoes are handled automatically. This also allows to build an archive of the own
achievements over time while adding relevant publications to the institute’s collection will build a
valuable resource for future work that allows the whole group on campus easy access.
         As outlined above even though right now each JOIN²-instance employs a local author
identification scheme, the implementation is done such that it already integrates as much as possible
with ORCiD. If JOIN² is used as an OpenAccess publishing platform, minting of DOIs can be
automatized and will also pass on the ORCiD of the authors and thus can even update the authors’
ORCiD profiles automatically if configured to in ORCiD. Alongside ORCiD all author records can
also feature a number of other author identifiers. E.g. if the inspire-id is associated with an author even
the import of large collaboration papers from INSPIRE-HEP15 including author association is fast and
easy.16 The main bottle neck is currently parsing and handling web service returns with way beyond
2.000 individual authors, which takes a few seconds.
         Recent enhancements of JOIN² are targeted to streamline the publication workflow. The first
step optimized the delivery process of theses in High Energy Physics to INSPIRE-HEP. To this end a
harvesting routine was established that adds relevant records published by DESYs publishing house on
PUBDB to INSPIRE-HEP automatically and in a timely manner. In a second step workflows used in
the publishing house for the production of proceedings were optimized. In this case the publishing
house not only produces the full volume, but also each contribution as an individual article. The new

14
        RWTH publications, while starting out on a dedicated machine is also already in production
running in docker and development using docker is now commonly adopted.
15
        https://inspirehep.net
16
        A close collaboration between INSPIRE-HEP and JOIN² ensures to fetch inspire-ids and
provide them for all partners for almost automatic addition to the authors records.


                                                                                                       55
      Proceedings of the 27th International Symposium Nuclear Electronics and Computing (NEC’2019)
                         Budva, Becici, Montenegro, September 30 – October 4, 2019


process now ensures that they are grouped together and that all records are interlinked so it becomes
easy to navigate trough the volume.17 Furthermore, the library is now able to just use the very same
record for the printed edition of the book as for the digital edition while in the past a second record had
to be created. Finally, the same process that was employed to add relevant theses to INSPIRE-HEP
now also fetches proceedings including their contributions. For in-house authors the new procedures
also ensure proper reporting of the publications and notification of the groups involved.
        Another major enhancement was required to handle the current change in the publishing
industry. The traditional publication model was based on subscription fees, that is one pays to read.
With the movement towards full Open Access commercial publishers and societies change more and
more to models based on Article Processing Charges (APC). In these models one pays to publish.18
This is becoming an increasingly important part, as can be seen e.g. from the numbers published by
the openAPC19 project which is tracking these expenses internationally. To streamline the internal
processes in these cases and also to make the cost aspect transparent for the users new functions were
added to the systems. As similar requirements already exist to run the library system in case of book
purchases it was possible to employ synergies between both modules. Currently, the definition of a
common, xml-based export format for the price information is under discussion with openAPC.


4. Conclusion
         Building on a 100% open source framework and around the users’ needs, JOIN² has developed
a publication database, repository and integrated library system able to address the needs of an
expanding set of diverse research centres providing rich functionalities in the simplest way. The
definition and enforcement of a uniform publication workflow is at the core of the JOIN² approach.
We believe the JOIN² collaboration model to have proven very successful.


Bibliography
[1] Information and documentation — the dublin core metadata element set — part 1: Core elements.
(International Organization for Standardization, 2017).
[2] Information and documentation — the dublin core metadata element set — part 2: DCMI
properties and classes. (International Organization for Standardization, 2017).
[3] The dublin core metadata element set. (National Information Standards Organization, 2013).
[4] Encoding dublin core metadata in html. (Internet Engineering Task Force, 2010).
[5] Wagner, A. Veröffentlichungsdatenbank und Volltextrepositorium. BIT online 14, 45–48 (2011).
[6] Wagner, A. Ein neues JUWEL? - Publikationsmanagement für Wissenschaft und Administration.
in 13 (Spezialbibliotheken - Freund und Follower der Wissenschaft, Jülich (Germany), 9 Nov 2011 -
11 Nov 2011; Forschungszentrum Jülich, Verlag, 2011). doi:10.3204/PUBDB-2017-00882
[7] Wagner, A. Invenio@HGF – status and perspectives. in (Forschungszentrum Jülich; 2nd Invenio
User Group Workshop, Jülich (Germany), 18 Nov 2013 - 20 Nov 2013, 2013). doi:10.3204/PUBDB-
2017-00869
[8] Wagner, A. & Thiele, R. Invenio@HGF – Collaborative repository infrastructure. in (Open
Repositories 2014, Helsinki (Finland), 8 Jun 2014 - 13 Jun 2014, 2014). doi:10.3204/DESY-2014-
02793


17
        See e.g. https://doi.org/10.3204/DESY-PROC-2013-04.
18
        However, often unnoticed, even traditional publication models featured quite a number of
additional fees like page or colour charges or fees for hybrid Open Access.
19
        https://www.intact-project.org/openapc


                                                                                                       56
      Proceedings of the 27th International Symposium Nuclear Electronics and Computing (NEC’2019)
                         Budva, Becici, Montenegro, September 30 – October 4, 2019


[9] Wagner, A. JOIN² – going for the ². in (Deutsches Zentrum für Neurodegenerative Erkrankungen,
2015). doi:10.3204/PUBDB-2015-03786
[10] Wagner, A. Invenio as a library system. in (Heinz Maier-Leibnitz Zentrum; 4th Invenio User
Group Workshop, Garching (Germany), 21 Mar 2017 - 24 Mar 2017, 2017). doi:10.3204/PUBDB-
2017-01357
[11] Wagner, A. Authority Control in Invenio. in (Forschungszentrum Jülich; 2nd Invenio User Group
Workshop, Jülich (Germany), 18 Nov 2013 - 20 Nov 2013, 2013). doi:10.3204/PUBDB-2017-00870
[12] Wagner, A. JOIN² – A scientists toolbox. in (Joint Institute for Nuclear Research, 2018).
doi:10.3204/PUBDB-2018-00618


                                                                                                     57