=Paper= {{Paper |id=Vol-3019/p22 |storemode=property |title=A Ready-to-Use Solution to Explore Linked Archives with MetaindeX and Gephi |pdfUrl=https://ceur-ws.org/Vol-3019/LinkedArchives_2021_paper_8.pdf |volume=Vol-3019 |authors=Laurent Millet-Lacombe }} ==A Ready-to-Use Solution to Explore Linked Archives with MetaindeX and Gephi== https://ceur-ws.org/Vol-3019/LinkedArchives_2021_paper_8.pdf
          A ready-to-use solution to explore linked archives
                     with MetaindeX and Gephi

                         Laurent Millet-Lacombe [0000-0002-2623-9648]

                       MetaindeX laurentmlcontact-metaindex@yahoo.fr
                                   http://www.metaindex.fr




          Abstract. MetaindeX [1] is an open-source [8] online application that
          offers a way to explore and study linked archives. In this article
          we focus on the user interface features for loading data and for
          displaying and studying the links within a set of documents from the
          French Archives Nationales.

          Keywords: links · graphs · user interface · NoSQL database · metaindex
          · gephi · archives



1       Context and Data Preparation

1.1       Corpus Overview

We work with a corpus of more than 2800 references from French archives.
This corpus comes from a PhD project whose author kindly agreed to lend the
data for this demonstration.
     The contents are mainly extracted from notarial archives (estate lawyers)
from 16th-17th century Paris. The delivered corpus consists of a set of Excel
files totalling about 20000 lines. Each line represents a specific person (full
name, gender, professional situation, textual description of personal and
professional connections) as seen within a specific archive (date, type,
institution, document id).


1.2       Document and Link Modelling

NoSQL Database. The MetaindeX tool relies on the ElasticSearch NoSQL database
[5], whose model can be summarized as a set of catalogs (called 'indices'), each
catalog containing documents, and each document being made of a unique
identifier and some fields (mainly strings or numbers). Although this database
does not define constraints between fields, a minimal schema definition is still
required, as a list of the names and types of the available fields. This schema
is defined at catalog level; a document from the catalog can then use any of
those fields.
______________
* Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0
    International (CC BY 4.0).

Link modelling. Links are represented as a 'string' field in the database,
containing the identifiers of the target documents to point at. For example, if
a catalog contains 3 documents with IDs "doc001", "doc002" and "doc003", and we
want document "doc002" to reference documents "doc001" and "doc003", then
"doc002" will contain a field (which we will call 'doclink' for this example)
with the text "doc001,doc003" as its value. The application interprets this
string when needed.
    An additional 'weight' can be associated with each link by adding a suffix
of the form ':weight' to the referenced id. Following our previous example, the
value "doc001:4,doc003" for our 'doclink' field would then mean a link to
"doc001" with a weight of 4 and a link to "doc003" with an implicit weight of 1.
This convention can be used directly in CSV files, where a column called
'doclink' in our example would contain, for each line, the IDs and weights of
the documents to point at.
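    The convention above can be sketched in a few lines of Python. This is an
illustrative parser only, not MetaindeX's actual implementation: the function
name parse_links is hypothetical, and we assume that weights are integers and
that document IDs contain neither ',' nor ':'.

```python
def parse_links(value, default_weight=1):
    """Parse a link field such as 'doc001:4,doc003' into (target_id, weight) pairs.

    Assumptions (not stated in the paper): weights are integers and
    document IDs contain neither ',' nor ':'.
    """
    links = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty entries, e.g. a trailing comma
        target, sep, weight = item.partition(":")
        # No ':weight' suffix means the implicit weight of 1
        links.append((target, int(weight) if sep and weight else default_weight))
    return links


print(parse_links("doc001:4,doc003"))  # [('doc001', 4), ('doc003', 1)]
```

The same function can be applied to each cell of a 'doclink' column when reading
a CSV file.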
    The actual meaning of a link weight is left to the user's interpretation.
For example, if the link is between two "cities", it could represent some amount
of commercial exchange, while if it is between two artists, it could represent
the number of pieces of art they own from each other.

2   Preparing, Uploading and Exploring Data and Links
A list of unique persons and a list of unique archives were extracted from the
original Excel files as two separate CSV files (about 4400 individuals and 2800
archives), with the help of the OpenRefine tool [7] for data cleaning and a set
of dedicated Python scripts for data extraction and formatting. Links were also
reconciled for both personal and professional relationships between persons.
    Contents are uploaded to the server by dragging CSV files over the catalog
contents in the user interface. The MetaindeX CSV-import module then lets the
user map CSV columns to new or existing fields in the catalog.
    Once loaded, documents are represented as cards, each card being a single
document, i.e. a single line from our input CSV files (in our case either an
archive or a person). Each card can be expanded to see or edit the document's
fields, as illustrated in Figure 1.
    Figure 1 also shows that links are resolved to display a more user-friendly
summary of the corresponding document rather than a bare list of IDs. Clicking
on a link jumps to the target document, allowing the user to navigate through
those connections.
    Lucene query syntax [2] is available in the search engine and allows
advanced searches, for example "find all persons whose first name is
approximately 'Antoine' and who were born before the year 1700", which
translates to the following query:
type:person AND firstname:Antoine~ AND datestart:<1700
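    Since MetaindeX relies on ElasticSearch, one plausible way to run such a
query is ElasticSearch's query_string query, which accepts this syntax,
including the fuzzy operator '~' and one-sided ranges such as '<1700'. The
sketch below only builds the request body; whether MetaindeX proceeds this way
internally is an assumption, and build_search_body is a hypothetical helper.

```python
import json


def build_search_body(lucene_query, size=20):
    """Wrap a Lucene-style query in an ElasticSearch 'query_string' request body.

    Hypothetical helper: the paper does not describe how MetaindeX
    forwards queries to ElasticSearch.
    """
    return {"size": size, "query": {"query_string": {"query": lucene_query}}}


body = build_search_body("type:person AND firstname:Antoine~ AND datestart:<1700")
print(json.dumps(body, indent=2))
```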
    Finally, although its usage and possibilities are beyond the scope of this
demonstration, note that a Kibana [6] module is integrated into MetaindeX,
allowing the user to create advanced statistical charts on their corpus.




Fig. 1. Screenshot from MetaindeX when consulting contents of a document (here a
character)



3     Generating Graphs

3.1   Basic Graph Generation

MetaindeX can generate a graph description file (GEXF format [3]) compatible
with the main graph applications such as Gephi [4]. In such a generated graph,
each document is a node and each link is an edge. The generation module lets
the user select which fields to use as node metadata, allowing fine
customization of graph rendering based on the contents, for example by
assigning a color to nodes depending on the value of a given field.
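    To make the mapping concrete, here is a minimal sketch of how documents
and their link fields could be turned into a GEXF 1.2 file using only the
Python standard library. It is not MetaindeX's export code: the function names,
the 'name' label field and the choice of directed edges are assumptions made
for illustration.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"


def split_links(value):
    """Yield (target_id, weight) pairs from a value such as 'doc001:4,doc003'."""
    for item in value.split(","):
        item = item.strip()
        if item:
            target, sep, weight = item.partition(":")
            yield target, (weight if sep and weight else "1")


def build_gexf(documents, link_field="doclink", label_field="name", meta_fields=()):
    """Return GEXF text with one node per document and one edge per link.

    documents maps a document id to a dict of fields. Field names and
    the directed edge type are illustrative assumptions.
    """
    root = ET.Element("gexf", {"xmlns": GEXF_NS, "version": "1.2"})
    graph = ET.SubElement(root, "graph", {"mode": "static", "defaultedgetype": "directed"})
    if meta_fields:
        # Declare the fields exported as node metadata
        attrs = ET.SubElement(graph, "attributes", {"class": "node"})
        for idx, name in enumerate(meta_fields):
            ET.SubElement(attrs, "attribute", {"id": str(idx), "title": name, "type": "string"})
    nodes = ET.SubElement(graph, "nodes")
    edges = ET.SubElement(graph, "edges")
    for doc_id, fields in documents.items():
        label = str(fields.get(label_field, doc_id))
        node = ET.SubElement(nodes, "node", {"id": doc_id, "label": label})
        present = [(i, n) for i, n in enumerate(meta_fields) if n in fields]
        if present:
            attvalues = ET.SubElement(node, "attvalues")
            for idx, name in present:
                ET.SubElement(attvalues, "attvalue", {"for": str(idx), "value": str(fields[name])})
    edge_id = 0
    for doc_id, fields in documents.items():
        for target, weight in split_links(fields.get(link_field, "")):
            if target in documents:  # skip links to unknown documents
                ET.SubElement(edges, "edge", {"id": str(edge_id), "source": doc_id,
                                              "target": target, "weight": weight})
                edge_id += 1
    return ET.tostring(root, encoding="unicode")


docs = {
    "doc001": {"name": "Person A", "gender": "f"},
    "doc002": {"name": "Person B", "gender": "m", "doclink": "doc001:4,doc003"},
    "doc003": {"name": "Person C"},
}
gexf_text = build_gexf(docs, meta_fields=("gender",))
print(gexf_text)
```

Writing gexf_text to a file with a .gexf extension should give something Gephi
can open; the 'gender' attribute is then available there for coloring nodes.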
    Detailed usage of Gephi is beyond the scope of this article, but once the
GEXF file has been generated from our data and loaded in Gephi, with only a few
settings we can already identify some clusters within professional and personal
relationships among persons, as shown in Figure 2 (each grey dot represents a
person; names have been hidden for better readability of the networks).
    Some clusters are already clearly identifiable there, where several links
converge around the same groups of individuals. A link was created each time
two persons were found to have a professional relationship (for example master
and student), shown in green, or a personal relationship (marriage witness,
siblings, etc.), shown in pink. This information could help researchers get a
better view of social and professional relationships across their corpus, and
perhaps interpret more accurately the historical facts they find in the archive
contents.




Fig. 2. Graph (detail) of professional (green) and personal (pink) relationships among
individuals (grey dots).




           Fig. 3. Graph of professional relationships grouped by parish.

3.2   Aggregated Graph Generation
Since bigger graphs can be more difficult to read, MetaindeX also offers a
specific algorithm to generate simplified, aggregated graphs. Nodes and links
are gathered according to the values of a specific field. This "group-by"
feature yields much more readable graphs focused directly on the topic the
researcher is interested in.
    Following our example, we can group persons by the parish they live in,
which means that all persons having a "parish" field with a similar value will
be grouped into a single node, and their respective links will also be
aggregated onto this node.
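    The group-by idea can be sketched as follows. This is our own illustrative
reading, not the algorithm implemented in MetaindeX: we assume that grouping
uses exact field values, that links inside a single group are dropped, that
links are counted regardless of direction, and that each link counts once
whatever its weight. The names aggregate_by, "Parish A" and "Parish B" are
hypothetical.

```python
from collections import Counter


def aggregate_by(documents, group_field, link_field="doclink"):
    """Collapse documents into one node per value of group_field.

    Returns (sizes, link_counts): sizes maps each group to its number of
    documents, link_counts maps an unordered pair of groups to the number
    of links between their members. Intra-group links are ignored.
    """
    group_of = {doc_id: fields.get(group_field) for doc_id, fields in documents.items()}
    sizes = Counter(g for g in group_of.values() if g is not None)
    link_counts = Counter()
    for doc_id, fields in documents.items():
        src = group_of[doc_id]
        for item in fields.get(link_field, "").split(","):
            target = item.strip().partition(":")[0]  # drop any ':weight' suffix
            dst = group_of.get(target)
            if src is None or dst is None or src == dst:
                continue
            link_counts[tuple(sorted((src, dst)))] += 1
    return sizes, link_counts


people = {
    "p1": {"parish": "Parish A", "doclink": "p2,p3"},
    "p2": {"parish": "Parish B", "doclink": "p1"},
    "p3": {"parish": "Parish B", "doclink": ""},
    "p4": {"parish": "Parish A"},
}
sizes, link_counts = aggregate_by(people, "parish")
print(dict(sizes))        # {'Parish A': 2, 'Parish B': 2}
print(dict(link_counts))  # {('Parish A', 'Parish B'): 3}
```

In a rendering tool, sizes would drive the node size and link_counts the edge
thickness.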
    Figure 3 shows the result: all parishes found in the corpus, with node size
depending on the number of individuals registered as living there and link
thickness based on the number of links that individuals from a given parish
have with individuals from another parish. On that graph, we can see that the
parishes Saint-Germain-Le-Vieux and Saint-Médéric (top right) seem to have
quite numerous professional relationships (thicker link) despite the smaller
number of persons recorded as living there (smaller node size) in our corpus.
This could perhaps give the researcher a new approach or hints for
understanding the social relationships of this community.

4     Conclusion
This demonstration has shown that, starting from a consistent set of data,
MetaindeX loads the corpus quickly and efficiently and lets us explore its
contents with advanced queries and link navigation. The GEXF export module,
with both basic and group-by algorithms, lets us easily load our data into
Gephi for advanced graph exploration. Although data preparation remains a major
step that sometimes requires some coding skills, the proposed environment, made
of the MetaindeX and Gephi tools, offers the researcher a ready-to-use,
coding-free, quick and efficient ecosystem to explore a set of linked archives.

References
1. MetaindeX, http://www.metaindex.fr. Last accessed 3 Jun 2021
2. Lucene Query Language, https://lucene.apache.org/core/. Last accessed 3 Jun 2021
3. GEXF file format, https://gephi.org/gexf/format/. Last accessed 3 Jun 2021
4. Mathieu Bastian, Sebastien Heymann and Mathieu Jacomy. Gephi:
   An Open Source Software for Exploring and Manipulating Networks,
   http://www.aaai.org/ocs/index.php/ICWSM/09/paper/view/154. 2009
5. ElasticSearch,
   https://www.elastic.co/guide/en/elasticsearch/reference/current/elasticsearch-intro.html.
   Last accessed 3 Jun 2021
6. Kibana, https://www.elastic.co/guide/en/kibana/current/introduction.html. Last
   accessed 3 Jun 2021
7. OpenRefine, https://openrefine.org/. Last accessed 3 Jun 2021
8. MetaindeX source code, https://github.com/laurentmldev/metaindex. Last update
   Aug 2021