=Paper=
{{Paper
|id=Vol-1611/paper2
|storemode=property
|title=From Users to Systems: Identifying and Overcoming Barriers to Efficiently Access Archival Data
|pdfUrl=https://ceur-ws.org/Vol-1611/paper2.pdf
|volume=Vol-1611
|authors=Nicola Ferro,Gianmaria Silvello
|dblpUrl=https://dblp.org/rec/conf/jcdl/FerroS16
}}
==From Users to Systems: Identifying and Overcoming Barriers to Efficiently Access Archival Data==
Nicola Ferro, Department of Information Engineering, University of Padua, Padua, Italy, nicola.ferro@unipd.it

Gianmaria Silvello, Department of Information Engineering, University of Padua, Padua, Italy, gianmaria.silvello@unipd.it

In Proceedings of 1st International Workshop on Accessing Cultural Heritage at Scale (ACHS’16), June 22, 2016, Newark, NJ, USA. Copyright 2016 for this paper by its authors. Copying permitted for private and academic purposes.

===ABSTRACT===
Digital archives are one of the pillars of our cultural heritage and they are increasingly opening up to end-users by focusing on the accessibility of their resources. Moreover, digital archives are complex and distributed systems where interoperability plays a central role and efficient access to and exchange of resources is a challenge.

In this paper, we investigate user and interoperability requirements in the archival realm and we discuss how next generation archival systems should operate a paradigm shift, bringing a new model of access to archival resources which better addresses these needs.

To this end, we employ the data structures and query primitives based on the NEsted SeTs for Object hieRarchies (NESTOR) model to efficiently access archival data, overcoming the identified barriers and limitations.

===Keywords===
set-based data models, archival data, XPath, XML

===1. INTRODUCTION===
Archives, along with libraries and museums, are one of the main cultural institutions encompassed by Digital Libraries (DL). Archives represent the trace of the activities of a physical or legal person in the course of their business, which is preserved because of its continued value over time. They are composed of unique documents interlinked with each other as well as with their production and preservation environments. The main characteristic of archives lies in the hierarchical structure used to retain the context and the full informational power of archival data.

The hierarchical structure shaping archives is a foundational feature of traditional paper-based archival description, the so-called finding aid. This is reflected in its digital counterpart, the Encoded Archival Description (EAD) [14] eXtensible Markup Language (XML) format, which is the key brick for managing, finding and accessing archival data.

Over the last decade, thanks to the centrality of the Web for information access and the rapid evolution of DL services, we have witnessed a major shift towards a “radical user orientation” [12] of archives, where usability and findability of resources are becoming number one priorities [20] given the “dramatic increase” [3] in the number of people accessing them. A recent user study [11] analyzing the user interaction patterns with finding aids highlighted that “[they] focus on rules for description rather than on facilitating access to and use of the materials they list and describe” and that many archive users have serious issues using finding aids [1]. Common and frequent user interaction patterns with finding aids are navigational and thus require users to browse the archival hierarchy to make sense of the archival data; for instance, two common interaction patterns are [11]: top-down, where users “start at the highest level, gain background and context, and work down to the most specific level of detail”, and bottom-up, where users “start at the most detailed level seeking specific information, and then move back to the higher levels”.

From this new point of view, digital finding aids (i.e. EAD) constrain the user orientation of archives because several key operations are neither possible nor efficient, given that it is problematic to: (i) let the user access a specific item on-the-fly, whereas we have to define fixed access points to the archival hierarchy [8]; (ii) let the user reconstruct the context of an item without having to browse the whole archival hierarchy [2]; and (iii) present the user with only selected items from an archive, whereas we have to give them the archive as a whole [7, 18].

From the technological perspective, the presented limitations also affect the interoperability of archives in distributed environments, thus preventing the exchange of resources by means of standard DL technologies such as the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH, http://www.openarchives.org/pmh/) [8, 15]. Indeed, a single EAD file describes a whole archive and thus it is not possible to share or exchange only a subset of records in a distributed environment; for archives, it is common to be required to exchange only the high-level descriptions (e.g., fonds and sub-fonds) or only the records open to public disclosure. This problem prevents the exchange of finding aids with variable granularity by means of OAI-PMH, forcing archival institutions to share whole archives or nothing. EAD provides archivists with many degrees of freedom in tagging practice, exacerbating the differences in how XML elements are used and nested one inside the other [10]. This makes it difficult to know in advance how an institution will use the hierarchical elements and then to define general rules and paths to access EAD elements; for instance, there is no guarantee that an XML Path Language (XPath) expression returning all the series or the units in a given EAD file will work with a different file in another collection or even in the same one.
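The fragility of fixed paths can be illustrated with a minimal Python sketch. The two EAD fragments below are hypothetical and heavily simplified (they are not taken from the collections studied in this paper); the component elements c and c01 and the level attribute follow common EAD usage, while the path SERIES_PATH and the helper series_titles are assumptions introduced only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment A: series tagged as numbered components (c01).
EAD_A = """
<ead><archdesc level="fonds"><dsc>
  <c01 level="series"><did><unittitle>Series D</unittitle></did></c01>
  <c01 level="series"><did><unittitle>Series E</unittitle></did></c01>
</dsc></archdesc></ead>
"""

# Hypothetical fragment B: unnumbered components (c), series nested one level deeper.
EAD_B = """
<ead><archdesc level="fonds"><dsc>
  <c level="subfonds">
    <c level="series"><did><unittitle>Series F</unittitle></did></c>
  </c>
</dsc></archdesc></ead>
"""

# A location path written with fragment A's tagging practice in mind.
SERIES_PATH = "./archdesc/dsc/c01[@level='series']"


def series_titles(xml_text, path=SERIES_PATH):
    """Return the titles of the components selected by `path`."""
    root = ET.fromstring(xml_text)
    return [c.findtext("did/unittitle") for c in root.findall(path)]


titles_a = series_titles(EAD_A)  # the path matches fragment A
titles_b = series_titles(EAD_B)  # the same path selects nothing in fragment B
print(titles_a, titles_b)
```

The same path that returns both series in the first fragment returns an empty list for the second one, even though that fragment also contains a series; a path has to be adapted to each tagging practice before it can be reused.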
In this paper, we build on the above observations about the user and interoperability needs in the archival realm to discuss how next generation archival systems should operate a paradigm shift, bringing a new model of access to archival resources which better addresses these needs. In particular, the contribution of the paper is to turn the above requirements into specific access use cases to archival resources, discussing how and why current approaches represent a barrier to their complete fulfillment, and showing how our proposed solution, called NEsted SeTs for Object hieRarchies (NESTOR) [8, 9], represents a step forward.

Indeed, NESTOR [8] defines an alternative way to represent hierarchical data by expressing the relationships between objects through the inclusion property between sets, in contrast to the binary relation between nodes exploited by the tree, which is the typical model used to represent archival data. NESTOR has been instantiated by three data structures on which query primitives, proven to be highly efficient in a wide spectrum of cases, have been realized [9]. NESTOR represents a paradigm shift with respect to state-of-the-art solutions for accessing hierarchical data because it answers query primitives (e.g., descendants and children to deal with the top-down interaction pattern, and ancestors and parent to deal with the bottom-up one) by exploiting basic set operations which do not require browsing and navigating the hierarchy.

Moreover, in order to fully understand the difference between NESTOR and state-of-the-art navigational (i.e., XPath-based) approaches, we conducted a case study evaluation based on ten real-world heterogeneous EAD files representing different key challenges for the identified access use cases, where we discuss the main drawbacks of a navigation-based access approach and how they are addressed by the NESTOR set-based one. We also show how the intrinsic differences between NESTOR and traditional navigational approaches are consistently reflected in the query execution times, which are a quantitative proxy for appreciating the paradigm shift represented by NESTOR and its impact.

The rest of the paper is organized as follows: Section 2 provides relevant background information; Section 3 discusses the examined use cases; Section 4 presents the experimental outcomes. Finally, Section 5 draws some conclusions.

Figure 1: A sample archive and its EAD representation. Panels: (a) Archival Tree; (b) EAD representation. The sample archive comprises Fonds A, Sub-fonds B and C, Series D, E and F, Units G, H, I and L, and archival records 1 to 13.

===2. BACKGROUND===

====2.1 Digital Archives====
Archives are composed of “unique records of corporate bodies and the papers of individuals and families” [14]. The original order (i.e. the principle of provenance) of the documents within an archive is preserved because the context and the physical order in which the documents are held are as valuable as their content [6].

According to the International Standard for Archival Description (General) (ISAD(G)), archival description (i.e. the finding aids) proceeds from general to specific as a consequence of the provenance principle and has to show, for every unit of description, its relationships and links with other units and to the general fonds, taking the form of a tree as shown in Figure 1 on the left. The digital encoding of ISAD(G) is the Encoded Archival Description (EAD) [14], shown in Figure 1 on the right, which is an XML description of a whole archive that reflects the archival structure, holds relations between entities and retains context.

EAD follows the traditional archival paradigm where experts know exactly what they are looking for and, for example, browse EAD to know the location of physical records [12]. By contrast, in the new user-oriented paradigm enabled by digital archives “users no longer have to be dependent on the physical presence of archivists to identify, review, and retrieve materials” [23], but they need effective means for performing information seeking activities. As a matter of fact, EAD turns out to be problematic in: (i) supporting user-oriented information access; (ii) supporting flexible access control policies; (iii) enabling interoperability between digital archives working in distributed environments.

====2.2 XPath: A Navigational Approach====
XPath (http://www.w3.org/TR/xpath/) is widely adopted for searching and selecting portions of EAD files. XPath is a language for addressing parts of an XML document; it provides basic facilities for manipulation of several data types and adopts a path notation for navigating through the hierarchical structure of an XML document. A “location path” is a common kind of XPath expression, which selects a set of nodes relative to a given node and returns as output the node-set containing the nodes selected by the location path. Each step of a location path is composed of three parts: (i) an axis, which specifies the tree relationship between the nodes; (ii) a node test, which specifies the node type and expanded-name of the selected nodes; and (iii) zero or more predicates that can further refine the selected set of nodes.

Archival systems typically rely on third-party and standard libraries for XPath processing. Since the NESTOR data structures and query primitives are implemented in Java and work in-memory, we are interested in comparing them to state-of-the-art in-memory Java-based solutions. Xalan (http://xml.apache.org/xalan-j/), Jaxen (http://jaxen.codehaus.org/) and JXPath (http://commons.apache.org/proper/commons-jxpath/) are the three most used state-of-the-art Java libraries for XPath processing.

====2.3 NESTOR: A Set-Based Approach====
The NESTOR model is defined by two set-based data models: the Nested Set Model (NS-M) and the Inverse Nested Set Model (INS-M) [8]; they are formally defined in the context of set theory as a collection of subsets. The most intuitive way to understand how these models work is to relate them to the archival tree. In Figure 2a we can see how the archive shown in Figure 1 is mapped into an organization of nested sets based on the NS-M.

Figure 2: The archive of Figure 1 modeled with the NS-M and the INS-M. Panels: (a) Euler-Venn representation of the NS-M; (b) DocBall representation of the INS-M.

From Figure 2a we can see that the NS-M adopts a bottom-up approach: (i) each set corresponds to an archival division; (ii) the innermost sets are the leaves of the hierarchy, e.g. the units; (iii) supersets are created as we climb up the hierarchy, e.g. the series, sub-fonds and fonds. The archival records are represented as elements belonging to the sets. With the NS-M an archive is modeled as a collection of subsets where there is a set (the “fonds”) which contains all the subsets (“subfonds”, “series”, “units”) of the archive and where two subsets at the same level (e.g. two “series”) cannot have common elements, thus their intersection is empty.

As shown in Figure 2b, the INS-M adopts a top-down approach: (i) each set corresponds to an archival division; (ii) the innermost set is the root of the hierarchy, i.e. the fonds; (iii) supersets are created as we climb down the hierarchy, e.g. sub-fonds, series and then units. As for the NS-M, the archival records are represented as elements belonging to the sets. With the INS-M an archive is modeled as a collection of sets where there exists an archival division shared by all other divisions; in our example, the “fonds” is the archival division common to all the other divisions in the archive.

This vision overcomes EAD limitations because in NESTOR each archival record is an element belonging to a set which can be selected and managed independently from the other records; thus, we can return to the users a list of records belonging to different archival divisions at any level, allowing them to access and consult the records while hiding the complexity of the whole archival structure.

NESTOR can be instantiated by three data structures [9]: Direct Data Structure (DDS), Inverse Data Structure (IDS) and Hybrid Data Structure (HDS). Each one of these structures is composed of three dictionaries, one containing the materialization of the sets, one containing the direct subsets of each set and the last one containing all the supersets of each set. DDS is a structure built around the constraints defined by the NS-M, IDS is a structure built around the constraints of the INS-M and HDS can be seen as a mixture between DDS and IDS [9].

When we deal with a collection of sets defined by NESTOR, we can distinguish between set-wise and element-wise primitives. The former enable us to query the structure of an archive, whereas the latter query the content of the archive (i.e., the archival records). For instance, by means of the set-wise primitives we can ask for all the series of a specific sub-fonds, whereas with the element-wise primitives we can ask for all the archival records belonging to the series of that sub-fonds.

NESTOR primitives (i.e., Descendants, Ancestors, Children and Parent) are efficient alternative implementations of XPath primitives, as shown in [9] where we conducted an extensive evaluation on five EAD collections, Wikipedia and two synthetic XML datasets and we compared NESTOR with state-of-the-art XPath engines. In [9] we evaluated NESTOR on average performance by testing the primitives on thousands of files and then presenting mean execution times; in this paper we investigate how NESTOR primitives behave with specific digital archives and how efficiently they answer common and frequent archival operations.

===3. USE CASES===
We present three user-oriented use cases derived from common interaction patterns identified in the archival domain and four interoperability use cases based on the exchange of archival data in distributed environments.

====3.1 User-oriented Use Cases====

=====Use Case 1: identifying and selecting relevant material=====
This use case is related to the “searching for known material” information seeking activity investigated by Duff and Johnson in [5]. This activity may be performed by researchers at the beginning of a project to establish a context and detect relevant information and it may be re-iterated several times to “reevaluate information that has suddenly gained new significance” [5]. Such activities can be associated with the top-down pattern of interaction identified by Freund and Toms in [11] where the users “start at the highest level [of an archival description], gain background and context, and work down to the most specific level of detail”.

In Figure 3 we can see a graphical representation of this use case. We consider an archival system that answers a user query which, starting from a given context node, requires a list of archival records to be returned. From this list the user then selects the description of, say, sub-fonds C; in this case two frequent queries to be answered are: to return the sub-divisions (series D, series E, series F, unit G, unit H, unit I and unit L) which are part of this sub-fonds, i.e. a structural query, and to return all the records (the actual records or their descriptions contained by the three series and four units which descend from sub-fonds C) associated with this sub-fonds, i.e. a content query.
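The set-based reading of these queries can be sketched in a few lines of Python, under stated assumptions: the hierarchy follows the sample archive (fonds A, sub-fonds B and C, series D to F under sub-fonds C, units G to L under series F), whereas the placement of records in divisions, the dictionary layout and all function names are hypothetical. The linear scans below are a naive stand-in and not the DDS, IDS or HDS structures of [9].

```python
# Hypothetical tree: each division maps to its parent (None for the root).
PARENT = {
    "fonds A": None,
    "sub-fonds B": "fonds A", "sub-fonds C": "fonds A",
    "series D": "sub-fonds C", "series E": "sub-fonds C", "series F": "sub-fonds C",
    "unit G": "series F", "unit H": "series F", "unit I": "series F", "unit L": "series F",
}

# Hypothetical record placement (illustrative only): records attached to each division.
OWN_RECORDS = {
    "fonds A": {1}, "sub-fonds B": {13}, "sub-fonds C": {2, 3},
    "series D": {11, 12}, "series E": {10}, "series F": {4, 5},
    "unit G": {6}, "unit H": {7}, "unit I": {8}, "unit L": {9},
}


def tree_ancestors(division):
    """Walk the tree upwards; used only to materialize the sets."""
    out, parent = [], PARENT[division]
    while parent is not None:
        out.append(parent)
        parent = PARENT[parent]
    return out


def build_ns_m():
    """NS-M: a division's set holds its records and those of its descendants."""
    sets = {d: set(r) for d, r in OWN_RECORDS.items()}
    for division, records in OWN_RECORDS.items():
        for ancestor in tree_ancestors(division):
            sets[ancestor] |= records
    return sets


def build_ins_m():
    """INS-M: a division's set holds its records and those of its ancestors."""
    sets = {d: set(r) for d, r in OWN_RECORDS.items()}
    for division in OWN_RECORDS:
        for ancestor in tree_ancestors(division):
            sets[division] |= OWN_RECORDS[ancestor]
    return sets


def proper_subsets(sets, name):
    return {d for d, s in sets.items() if s < sets[name]}


def proper_supersets(sets, name):
    return {d for d, s in sets.items() if s > sets[name]}


NS_M, INS_M = build_ns_m(), build_ins_m()

# Structural query on sub-fonds C: subsets in the NS-M, supersets in the INS-M.
divisions_of_c = proper_subsets(NS_M, "sub-fonds C")
# Content query on sub-fonds C in the NS-M: the set itself.
records_of_c = NS_M["sub-fonds C"]
print(sorted(divisions_of_c), sorted(records_of_c))
```

Once the sets are materialized, neither query walks the hierarchy: the structural answer comes from inclusion tests between sets and the content answer is a single lookup. The sketch relies on every division holding at least one record of its own, so that inclusion between sets mirrors the hierarchy exactly.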
Figure 3: Use-case 1: Identifying and selecting relevant material. Panels: (a) Tree; (b) Nested Sets Model; (c) Inverse Nested Sets Model. In the tree, the descendants are obtained with the structural expression /fondsA/subfondsC/descendant-or-self::* and the content expression /fondsA/subfondsC/descendant-or-self::*/text(). In the NS-M, the structural operation gets all the subsets of sub-fonds C and the content operation gets all the elements belonging to sub-fonds C. In the INS-M, the structural operation gets all the supersets of sub-fonds C and the content operation gets all the elements belonging to sub-fonds C and its supersets.

With a navigational approach based on XPath, the structural query corresponds to the following XPath expression: /fondsA/subfondsC/descendant-or-self::*; and the content query corresponds to: /fondsA/subfondsC/descendant-or-self::*/text(). Both these expressions require navigating the archival tree to the sub-fonds C division and then visiting all of its descendants.

In Figure 3 we see that the NS-M answers the structural query by returning all the subsets of sub-fonds C (i.e. all its descendants), whereas the INS-M answers it by returning all the supersets of the sub-fonds (which, in the INS-M, again represent its descendants). The content query is answered by the NS-M by returning all the elements belonging to sub-fonds C, whereas the INS-M has to return the union of all the elements belonging to sub-fonds C and its supersets. We can see that the NS-M and the INS-M answer the queries by exploiting two different primitives: the first is based on the subsets of a set, whereas the second is based on its supersets. In the NS-M the descendants of an archival node, say sub-fonds C, are the subsets of the set representing sub-fonds C, whereas in the INS-M the descendants are the supersets of the given set.

=====Use Case 2: building contextual knowledge=====
“Building context is the sine qua non of historical research” [5] and one of the main functions of archives. As we described above, the context of an archival record is required to disclose its full informational power and thus, reconstructing the knowledge of a record or of an archival division is one of the most common and important operations an archival system has to provide. This operation can be associated with the bottom-up pattern of interaction also identified by [11], where the users “start at the most detailed level seeking specific information, and then move back to the higher levels to make sense of the information and place it in context if necessary”.

Figure 4 presents the operations required to “build contextual knowledge” of an archival description. To guide the user when exploring the archive, the more accurate the returned contextual information is, the better; indeed, if we return the whole archive to the user then s/he might be disoriented by the large amount of heterogeneous information [22]. To address this aspect we need to return to the user all and only the archival divisions from the selected unit up to the root.

If we consider the case presented in Figure 4, where we need to reconstruct the context of “Unit L”, we can see that a structural query needs to return all the archival divisions up to the root (i.e., the ancestors of unit L, which are series F, sub-fonds C and fonds A) and the content query returns all the records or descriptions contained by these divisions. With an XPath-based approach, the structural query (e.g., /fondsA/subfondsC/seriesF/unitL/ancestor-or-self::*) requires navigating the archival tree from the leaf “unit L” up to the root; the output of this query is a sub-tree with the same root as the original tree, but containing only those nodes on the path between “Fonds A” and the leaf “unit L”. The content query (/fondsA/subfondsC/seriesF/unitL/ancestor-or-self::*/text()) performs the same operation but selects only the data nodes that are then returned to the user.

As shown in Figure 4, the NS-M answers the query about the context by exploiting a set-wise primitive which returns all the supersets of the selected division, whereas the INS-M does so by returning all its subsets. This operation also has an element-wise counterpart answering the content query; in this case, the NS-M returns all the elements belonging to the union of the supersets of the selected unit, whereas the INS-M simply returns the elements belonging to the set of the unit.

Figure 4: Use-case 2: Building Contextual Knowledge. Panels: (a) Tree; (b) Nested Sets Model; (c) Inverse Nested Sets Model. The figure poses the structural operation “What is the context of unit L?” and the content operation “Which records are related to record 9?”. In the tree, the ancestors are obtained with the structural expression /fondsA/subfondsC/seriesF/unitL/ancestor-or-self::* and the content expression /fondsA/subfondsC/seriesF/unitL/ancestor-or-self::*/text(). In the NS-M, the structural operation gets all the supersets of unit L and the content operation gets all the elements belonging to unit L and its supersets. In the INS-M, the structural operation gets all the subsets of unit L and the content operation gets all the elements belonging to unit L.

=====Use Case 3: seeking unknown archival material=====
This use case is related to the “becoming oriented to a new archive or collection” information seeking activities investigated in [5]. It analyses a common scenario where users do not have a clear idea about what they are looking for and may proceed systematically from one archival division to another. This use case is also related to the two previous ones because, among other operations, it may require analyzing the descendants of a given archival division or record as well as climbing up the hierarchy. Indeed, we can see this use case as a combination of the top-down and the bottom-up patterns and it can be associated with the “systematic interrogation” interaction [11], where the users “develop hypotheses as to where in the finding aids structure the information is most likely to be and check each one in turn”.

In Figure 5 we show this use case, where the user selects an archival division or a record and then asks for all the archival divisions (structural or set-wise) or all the records (content or element-wise) at the same level as the selected element (e.g. the siblings of this element). For instance, if the user selects one of the record descriptions represented by “Unit L” in the figure, this operation allows her/him to retrieve all the other descriptions connected to it (e.g. all the sibling units of “Unit L” or the elements belonging to them).

We can see that to answer this interrogation, both from the structural and the content viewpoints, the navigational approach requires two XPath expressions, where the first one returns the parent node of the given node and the second, starting from this parent node, returns all of its children; note that to do this, navigational approaches need to visit each child node and thus the higher the number of children, the higher the complexity of this operation.

The NS-M answers the query with a set-wise primitive by returning all the direct subsets (i.e. the children) of the superset (i.e. the parent) to which the selected unit belongs; as usual, the INS-M reverses this logic and answers by returning all the direct supersets of the subset to which the selected unit belongs. The element-wise primitive takes the sets output by the set-wise one and then returns all the elements belonging to them.

Figure 5: Use-case 3: Seeking unknown archival material. Panels: (a) Tree; (b) Nested Sets Model; (c) Inverse Nested Sets Model. The figure poses the structural operation “Which divisions are related to unit L?” and the content operation “Which records are related to unit L?”. In the tree, parent and children are obtained with the structural expressions /fondsA/subfondsC/seriesF/unitG/parent::* and /fondsA/subfondsC/seriesF/child::*, and the content expressions /fondsA/subfondsC/seriesF/unitG/parent::*/text() and /fondsA/subfondsC/seriesF/child::*/text(). In the NS-M, the operations get all the subsets of the superset of unit L and all the elements of those subsets. In the INS-M, the operations get all the supersets of the subset of unit L and all the elements of those supersets.

====3.2 Interoperability-oriented Use Cases====
As described above and reported in [15], digital finding aids encoded with the EAD standard represent a barrier towards the very interoperability this standard aims to enable. Indeed, as we see below, with EAD there are several OAI-PMH functions which cannot be used by archival systems. On the other hand, NESTOR set-based operations can be straightforwardly employed by archival systems to use all OAI-PMH functionalities with digital finding aids [8].

=====Use Case 4: Get Records=====

Table 1: Statistics of ten selected EAD files.
{| class="wikitable"
! File !! Size (MB) !! # nodes !! Depth !! Max fan-out !! Average fan-out
|-
| EAD-01 || 0.368 || 7,316 || 10 || 823 || 4.33
|-
| EAD-02 || 1.853 || 21,355 || 10 || 1,610 || 1.62
|-
| EAD-03 || 3.131 || 42,123 || 13 || 2,453 || 1.49
|-
| EAD-04 || 3.866 || 75,094 || 9 || 10,271 || 1.73
|-
| EAD-05 || 4.043 || 51,946 || 12 || 1,320 || 1.80
|-
| EAD-06 || 5.310 || 73,372 || 12 || 3,663 || 1.87
|-
| EAD-07 || 6.017 || 57,362 || 14 || 565 || 1.91
|-
| EAD-08 || 9.242 || 103,703 || 18 || 340 || 1.62
|-
| EAD-09 || 9.746 || 160,031 || 14 || 8,930 || 2.01
|-
| EAD-10 || 15.512 || 188,862 || 17 || 696 || 1.62
|}

This use case is based on a common OAI-PMH request where a service provider requests all the records belonging to an archive. This use case can also be addressed by navigational approaches, simply by exchanging the whole EAD file via OAI-PMH.

NESTOR addresses this case by relying on the descendant content operation shown in Figure 3 with a slight variation; indeed, in the figure we ask for all the descendants of sub-fonds C, whereas in this case we are asking the NS-M to return the set representing “Fonds A”, which contains all the records in the archive, and the INS-M to return the union of all records belonging to the set “Fonds A” and its supersets.

=====Use Case 5: Get Sub-hierarchy=====
This use case is a specialization of the previous one, where the service provider requests only those records belonging to the sub-hierarchy rooted in a given archival division. Navigational approaches do not apply to this case, whereas NESTOR can address it by means of the descendant content operation as shown in Figure 3.

=====Use Case 6: Get Context=====
In this case the service provider requests all the records belonging to a specific division, say “Unit L”, and to all the related divisions up to the root as shown in Figure 4.

As in the previous case, navigational approaches do not apply, whereas NESTOR addresses this case by employing the ancestor content primitive, which for the NS-M returns the union of all the records belonging to “Unit L” and its supersets and for the INS-M returns all the elements belonging to “Unit L”.

=====Use Case 7: List Sets=====
This use case is related to the “ListSets” OAI-PMH verb “used to retrieve the set structure of a repository” and allows the service provider to know the structure of a local repository in advance.

This request cannot be answered by an XPath expression because it is not possible to extract only structural information while filtering out all data nodes; moreover, the OAI-PMH set-based organization of metadata does not apply to EAD. On the other hand, answering the “ListSets” verb is natural for NESTOR because it retains the structure by exploiting inclusion relationships between sets. Therefore, it answers this request by employing the descendant structure operation as shown in Figure 3.

===4. VALIDATION===
We proposed three different instantiations of NESTOR according to three alternative data structures, namely DDS, IDS and HDS. In order to compare the query operations defined on these data structures with currently adopted solutions for operating on digital archives, we selected two EAD collections that provide us with real-world archival data: the National Archives of the Netherlands (http://www.nationaalarchief.nl/) and the Library of Congress finding aids.

We selected ten EAD files taken from these collections, representing a wide variety of archives with different characteristics that pose key challenges for archival systems. The statistics about these files are reported in Table 1.

DDS, IDS and HDS are compared to widely adopted, ready-to-use XPath-based solutions for operating on the structure and the content of EAD files: Xalan, Jaxen and JXPath, which represent the state-of-the-art solutions for dealing with EAD files. We ensure a fair comparison because all the tested solutions are implemented in Java, work in central memory and are tested on the same machine.

The main characteristic of EAD files representing a challenge for XPath libraries is the number of nodes in each file; the selected files are of increasing sizes to show that the performance of navigation-based solutions depends on the number of nodes and the overall dimension of the EAD files, whereas this does not apply to the set-based operations implemented by NESTOR. Indeed, in Figure 6 we can see that all the XPath libraries answer in linear time with respect to the size of the EAD file because they need to navigate big hierarchies by visiting a great number of nodes. On the other hand, we can see that IDS answers the descendant structural operation in constant time for all the EAD files and it is five orders of magnitude faster than XPath-based solutions. DDS and HDS show some dependence on the size of the EAD file; indeed, they need to perform some set operations (more nodes mean more operations) that require some time, even though for the descendant content operation they are several orders of magnitude more efficient than navigating the archival hierarchy. Overall, IDS is the best solution for addressing use cases 1 and 7, whereas DDS is the best for use cases 1, 4 and 5.

It is interesting to note that for addressing use cases 1, 4 and 5, XPath-based libraries are slower for the EAD-04 file, which is the one with the highest number of children (i.e., 10,271), followed by EAD-09, which also has a high number of children (i.e., 8,930). These two files are challenging for all the use cases requiring the descendants or the children of a node, such as use cases 1, 3 and 5. Navigation-based solutions are particularly challenged by this case, as we can see in Figure 6 for the content operation and in Figure 8. On the other hand, we can see that IDS and HDS are not affected by the high max fan-out of these files, given that they can answer without visiting the high number of child nodes, just by returning a set or by performing basic set operations. DDS requires more set operations than the other two set-based solutions; even though in most cases it is consistently more efficient than navigation-based solutions, it is still less efficient than IDS and HDS, which are extremely efficient for these cases. The overall performance reported in Figure 8, with a particular focus on EAD-04 and EAD-09, shows that set-based solutions are particularly well suited to address the operation employed by use case 3.

Lastly, use case 2 requires climbing up the archival hierarchy from a given entry point. We considered EAD files with variable depth (from 9 to 17) and we validated the ancestor operations using the deepest node in each hierarchy as entry point, which represents the worst case scenario for any archival system. From a performance viewpoint, in Figure 7 we can appreciate the difference between the NESTOR set-based approaches and the XPath navigational approaches. Indeed, NESTOR-based solutions behave consistently for all the tested EAD files and do not depend on the depth and size of EAD files. On the other hand, the XPath libraries behave differently from file to file, showing a dependence on the number of nodes, fan-out and depth of the files; for in-

[5] M. W. Duff and C. A. Johnson. Accidentally Found on Purpose: Information-Seeking Behavior of Historians in Archives. The Library Quarterly, 72(4):472–496, 2002.

[6] L. Duranti. Diplomatics: New Uses for an Old Science. Society of Amer. Arch. and Association of Canadian Arch., 1998.

[7] M. Y. Eidson. Describing Anything That Walks: The Problem Behind the Problem of EAD. Journal of Archival Organization, 1(4):5–28, 2002.

[8] N. Ferro and G. Silvello. NESTOR: A Formal Model for Digital Archives. Inf. Proc. Manage., 49(6):1206–1240, 2013.

[9] N. Ferro and G. Silvello.
Descendants, Ancestors, Children stance, JXPath behaves less efficiently when EAD files have and Parent: A Set-Based Approach to Efficiently Address a high max fan-out (EAD-04 and EAD-09), whereas Xalan XPath Primitives. Inf. Proc. Manage., 52(3):399-429, 2016. performances worsen as the number of nodes increases. [10] L. Francisco-Revilla, C. B. Trace, H. Li, and S. A. Buchanan. Encoded Archival Description: Data Quality and Analysis. 5. CONCLUSIONS Proc. American Society for Inf. Science and Tech., 51(1):1– 10, 2014. In this paper we identified and described the barriers pre- venting an efficient access to archival data. We described [11] L. Freund and E. G. Toms. Interacting with Archival Finding the main drawbacks of EAD and we showed how it impairs Aids. JASIST, 67(4):994-1008, 2015. a smooth and efficient access to archival descriptions as well [12] I. Huvila. Participatory archive: towards decentralised cura- as that it does not satisfy several interoperability require- tion, radical user orientation, and broader contextualisation ments. of records management. Archival Science, 8(1):15–36, 2008. We analyzed the role of the NESTOR model in the context [13] N. A. Khan. Emerging Trends in OAI-PMH Application. of digital archives and described its main advantages with In Design, Development, and Management of Resources for respect to state-of-the-art navigational-based solutions. We Digital Library Services, pages 147–159, 2013. have seen that NESTOR set-based approach represents a paradigm shift in the access of XML files which is well-suited [14] D. V. Pitti. Encoded Archival Description. An Introduction and Overview. D-Lib Mag., 5(11), 1999. to enable interaction and interoperability functionalities in the archival context. [15] C. J. Prom. Does EAD Play Well with Other Metadata We identified and described seven use cases highlighting Standards? Searching and Retrieving EAD Using the OAI the key challenges archival systems have to address in or- Protocols. J. 
of Arch. Org., 1(3):51–72, 2002. der to deal with common user interaction patterns and to [16] C. J. Prom. User Interactions with Electronic Finding Aids satisfy interoperability requirements. In this frame of refer- in a Controlled Setting. The American Archivist, 67(2):234– ence, we compared and discussed strengths and limitations 268, 2004. of navigational-based solutions with respect to NESTOR set-based ones. [17] C. J. Prom and T. G. Habing. Using the Open Archives Ini- tiative Protocols with EAD. In Proc. 2nd Joint Conference We have seen that NESTOR is a model of access to archival on Digital Libraries, pages 171–180. ACM Press, 2002. resources that allows us to better address the identified needs both from the user and the interoperability viewpoints. From [18] J. Roth. Serving Up EAD: An Exploratory Study on the a quantitative standpoint, the experimental validation con- Deployment and Utilization of Encoded Archival Description firms that NESTOR-based solutions consistently outperform Finding Aids. The Amer. Arch., 64(2):214–237, 2001. state-of-the-art solutions; moreover, we have seen that NESTOR- [19] W. Scheir. First Entry: Report on a Qualitative Exploratory based solutions are less dependent – or not dependent at all Study of Novice User Experience with Online Finding Aids. – on the hierarchical structure of archives than navigational- J. of Arch. Org., 3(4):49–85, 2006. based ones. [20] A. Sexton, C. Turner, G. Yeo, and S. Hockey. Understand- ing users: a prerequisite for developing new technologies. References Journal of the Society of Archivists, 25(1):33–49, 2004. [1] J. C. Chapman. Observing Users: An Empirical Analysis [21] S. L. Shreeves, T. G. Habing, K. Hagedorn, and J. A. Young. of User Interaction with Online Finding Aids. J. of Arch. Current Developments and Future Trends for the OAI Pro- Org., 8(1):4–30, 2010. tocol for Metadata Harvesting. Library Trends, 53(4):576– 589, Spring 2005. [2] J. G. Daines and C. L. Nimer. 
Re-Imagining Archival Dis- play: Creating User-Friendly Finding Aids. J. of Arch. Org., [22] S. Yako. It’s Complicated: Barriers to EAD Implementation. 9(1):4–31, 2011. American Archivist, 71(2):456–475, 2008. [23] J. Zhang. Archival Representation in the Digital Age. J. of [3] M. G. Daniels and E. Yakel. Seek and You May Find: Suc- Arch. Org., 10(1):45–68, 2012. cessful Search in Online Finding Aid Systems. American Archivist, 73:535–468, 2010. [24] X. Zhou. Examining Search Functions of EAD Finding Aids Web Sites. J. of Arch. Org., 4(3/4):99–118, 2008. [4] E. Discovery, S. Shaw, and P. Reynolds. Creating the Next Generation of Archival Finding Aids. D-Lib Mag., 13(5/6), 2007. Use-cases 1 and 7 Use-cases 1, 4 and 5 5 10 Descendant Structural Operation 10 5 Descendant Content Operation DDS 4 IDS 4 10 HDS 10 Xalan 3 10 Jaxen 10 3 Execution Times (msec), log scale Execution Times (msec), log scale JXpath 2 2 10 10 1 1 10 10 0 0 10 10 −1 −1 10 10 −2 −2 10 10 −3 −3 DDS 10 10 IDS HDS −4 10 10 −4 Xalan Jaxen JXpath −5 −5 10 10 EAD01 EAD02 EAD03 EAD04 EAD05 EAD06 EAD07 EAD08 EAD09 EAD10 EAD01 EAD02 EAD03 EAD04 EAD05 EAD06 EAD07 EAD08 EAD09 EAD10 EAD files EAD files Figure 6: Execution times of the descendant structural and content operations. Use-case 2 Use-cases 2 and 6 5 Ancestor Structural Operation 5 Ancestor Content Operation 10 10 XPath: DDS DDS 4 IDS 4 IDS 10 10 HDS HDS Xalan Xalan 10 3 Jaxen 10 3 Jaxen Execution Times (msec), log scale Execution Times (msec), log scale JXpath JXpath 2 2 10 10 1 1 10 10 0 0 10 10 −1 −1 10 10 −2 −2 10 10 −3 −3 10 10 −4 −4 10 10 −5 −5 10 10 EAD01 EAD02 EAD03 EAD04 EAD05 EAD06 EAD07 EAD08 EAD09 EAD10 EAD01 EAD02 EAD03 EAD04 EAD05 EAD06 EAD07 EAD08 EAD09 EAD10 EAD files EAD files Figure 7: Execution times of the ancestor structural and content operations. 
Figure 8: Execution times of the parent and children structural operations. Panels (use case 3): Parent Structural Operation, Parent Content Operation, Children Structural Operation and Children Content Operation. X-axis: EAD files (EAD01 to EAD10); y-axis: execution times (msec), log scale; series: DDS, IDS, HDS, Xalan, Jaxen, JXPath.
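As an illustration of the set-based idea behind the descendant content operation discussed for use cases 4 and 5, the following is a minimal Python sketch of a nested-set view of a small archive. It is not the DDS, IDS or HDS implementation evaluated in the paper: the toy hierarchy, the record identifiers and the function names (`build_nested_sets`, `descendant_content`, `descendant_structure`, `ancestor_structure`) are hypothetical, and the sketch assumes that each division's set holds its own records plus those of all its descendants, and that every division holds at least one record of its own so that nesting is strict.

```python
from typing import Dict, FrozenSet, List, Set

# Hypothetical toy archive: each division maps to its child divisions.
CHILDREN: Dict[str, List[str]] = {
    "Fonds A": ["Sub-fonds B", "Sub-fonds C"],
    "Sub-fonds B": [],
    "Sub-fonds C": ["Series D"],
    "Series D": ["Unit L"],
    "Unit L": [],
}

# Hypothetical records described directly at each division.
OWN_RECORDS: Dict[str, Set[str]] = {
    "Fonds A": {"a1"},
    "Sub-fonds B": {"b1", "b2"},
    "Sub-fonds C": {"c1"},
    "Series D": {"d1"},
    "Unit L": {"l1", "l2"},
}


def build_nested_sets(root: str) -> Dict[str, FrozenSet[str]]:
    """Build one set per division: its own records plus all descendants' records."""
    nested: Dict[str, FrozenSet[str]] = {}

    def visit(node: str) -> FrozenSet[str]:
        records = set(OWN_RECORDS[node])
        for child in CHILDREN[node]:
            records |= visit(child)
        nested[node] = frozenset(records)
        return nested[node]

    visit(root)
    return nested


NESTED = build_nested_sets("Fonds A")


def descendant_content(division: str) -> FrozenSet[str]:
    """Records of a division and of its whole sub-hierarchy: a single set lookup."""
    return NESTED[division]


def descendant_structure(division: str) -> List[str]:
    """Divisions below the given one, found as proper subsets (no tree walk)."""
    target = NESTED[division]
    return sorted(name for name, s in NESTED.items() if s < target)


def ancestor_structure(division: str) -> List[str]:
    """Divisions above the given one, found as proper supersets."""
    target = NESTED[division]
    return sorted(name for name, s in NESTED.items() if s > target)


if __name__ == "__main__":
    print(sorted(descendant_content("Fonds A")))    # whole archive
    print(sorted(descendant_content("Sub-fonds C")))  # one sub-hierarchy
    print(descendant_structure("Sub-fonds C"))
    print(ancestor_structure("Unit L"))
```

Once the sets are built, asking for a sub-hierarchy no longer requires visiting nodes one by one, which is the property the validation attributes to the set-based solutions. The structural functions here scan all sets for simplicity; that scan is a shortcut of this sketch and says nothing about how the evaluated data structures achieve their reported timings.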