=Paper=
{{Paper
|id=Vol-2607/short5
|storemode=property
|title=Excavating the Data Pit: the Portuguese Parish Memories (1758) as a Gold Standard
|pdfUrl=https://ceur-ws.org/Vol-2607/short5.pdf
|volume=Vol-2607
|authors=Ivo Santos,Fernanda Olival,Ofélia Sequeira
}}
==Excavating the Data Pit: the Portuguese Parish Memories (1758) as a Gold Standard==
Ivo Santos [0000-0001-5152-6027], Fernanda Olival [0000-0003-4762-3451], and
Ofélia Sequeira [0000-0003-2376-1817]
CIDEHUS - University of Évora, Portugal
cidehus@uevora.pt
http://www.cidehus.uevora.pt/
{ifs,mfo,osequeira}@uevora.pt
This work is funded by national funds through the Foundation for Science and
Technology, under the project UIDB/00057/2020.
Abstract. Research in History and Archaeology tends towards the continuous
development of new databases that are completely independent of each other. The
consequences are data fragmentation, the atomisation of knowledge, and
ultimately the creation of data silos. This happens partly because of academic
tradition, but also because these disciplines work with fragmented information
to understand historical data and their contexts, which enables multiple
narratives and interpretations. For these disciplines, however, context is a
key aspect that should always be preserved.
The Memórias Paroquiais (Parish Memories) are the responses to a survey
organized in 3 major parts (land, mountain and river), and they are an essential
source for obtaining a detailed picture of Portugal in 1758-1761. We believe
that this primary source could reach a new level if it were worked from a
different approach: semantically annotated, processed and modeled.
We propose that the Portuguese Parish Memories, due to their intrinsic
characteristics, should constitute a Knowledge Base (KB) that connects
with other historical sources and research outputs. Ultimately, the Parish
Memories could be a Gold Standard for Natural Language Processing, with an
impact on research into other historical sources of Early Modern Portugal,
regardless of the knowledge domain.
Keywords: Parish Memories · Knowledge Base · Gold Standard ·
Natural Language Processing · Open Linked Data.
Copyright © 2020 for this paper by its authors. Use permitted under Creative
Commons License Attribution 4.0 International (CC BY 4.0). DHandNLP, 2 March
2020, Evora, Portugal.
1 Introduction
The Memórias Paroquiais (Parish Memories) are an essential source for obtaining
a detailed picture of Portugal in 1758-1761. They are the responses to a survey,
organized in 3 major parts (the locality itself, the mountain and the river),
which was printed and sent to those responsible for the dioceses of the country
by notice (aviso) of the Kingdom’s Secretary of State of 18 January 1758.
Following the tradition of earlier examples, the inquiry was limited to the
Kingdom; not even the Atlantic Islands were included.
The survey included a total of 60 questions: 27 about the locality, 13 about
the mountain and 20 about the river. Not surprisingly, the parish priests only
responded to what suited their territory. The questions were not limited to
historical aspects. For the land, or the parish itself, they inquired about
administrative and jurisdictional (ecclesiastical and secular) issues,
demographic data, the major ‘fruits of the locality’, fairs, the impact of the
1755 earthquake, and the existence of a seaport or ramparts. The data about the
mountains are fundamental for a study of the natural landscape and the use of
resources: the questions asked about the size of the mountains, rivers, special
water sources, medicinal herbs, mines, lagoons, villages, monasteries and
churches. The river part was equally detailed: size and swiftness of flow,
navigability, direction of the current, fish and fishery-related activities,
bridges, mills, cultivation of the banks, etc. Each of the three parts closed
with an open request for ‘all that is worthy of memory’ and was not included
in the survey. It was an invitation to describe what was specific and relevant
about each place.
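The structure of the questionnaire described above can be written down as a small data model. The following is only a minimal sketch: the question counts come from the text, while the part label "Locality", the class name and the identifier scheme (in the style of labels such as River-10) are assumptions made for illustration.

```python
from dataclasses import dataclass

# Question counts per part, as stated in the text (27 + 13 + 20 = 60).
# The part labels are illustrative names, not identifiers from the source.
PART_SIZES = {"Locality": 27, "Mountain": 13, "River": 20}


@dataclass(frozen=True)
class Question:
    part: str    # one of the keys of PART_SIZES
    number: int  # 1-based position within its part

    @property
    def qid(self) -> str:
        # Hypothetical identifier scheme, e.g. "River-10".
        return f"{self.part}-{self.number}"


def build_survey() -> list[Question]:
    """Enumerate every question of the survey, part by part."""
    return [
        Question(part, n)
        for part, size in PART_SIZES.items()
        for n in range(1, size + 1)
    ]


survey = build_survey()
print(len(survey))     # 60
print(survey[-1].qid)  # River-20
```

Stable identifiers of this kind would let every answer, and later every annotation, be addressed by parish and question.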
It was up to each bishop or prelate, or to the Cathedral chapter if the See
was vacant, to distribute the inquiries. Many of them turned to the vicars
forane and the vicars-general to carry out the task of sending the survey to
the parish priests. It was the latter, or their representatives, who responded
with information about their parish. The data therefore come from someone
who generally knew the target territory well and was a skilled writer.
Some answers reveal that some priests consulted knowledgeable persons and
that the task was given some care. This was the case in the parish of Nossa
Senhora da Graça de Monforte, in the South of Portugal (Alentejo region): “A
printed paper was handed to us from Your Most Illustrious Reverend [the
Cathedral Chapter of Elvas]. It arrived through the Reverend Vicar Forane of
this Village, Doctor Francisco Mendes Madeira. So we should answer each question
individually, what we knew and what we asked for information. And because it
was the time of Lent and the fun of history was embarrassing for us in our
occupation, as well as the investigations on antiques and, due to the lack of
libraries and old papers, we rely on people of certain trust who have served us
willingly.” [1, p. 1175]. Some priests listened to the oldest people in the
locality [4, p. 393] and others did not consult anyone. There were parish
priests who indicated a vast bibliography on the parish [3, p. 287-88]; others,
such as the parish priest of one of the parishes of Lisbon (Campo Grande),
lamented that his church had no archives with “documents or other old papers
from which much news could be drawn” [5, p. 318].
The parish priests had 3 months to respond and many did so quickly, although
there were still responses dating from 1759, 1760 and even 1761, especially in
Lisbon, which was at the time struggling with the impact of the 1755
earthquake [5, p. 29-30].
Note that the parish, headquartered in the main church, was the smallest
organizational division of the territory at that time. It was also “the most
cohesive and homogeneous territorial and social unit of the life of the
Portuguese populations of the past” [3, p. 14]. The Parish Memories thus collect
highly relevant data, both because of the micro scale used and because the
survey covers the whole country, described through the same grid of questions.
The Torre do Tombo holds 44 volumes of these handwritten Memories, all
available online. In fact, only 41 volumes correspond to the papers sent by the
parish priests. To these were added an index volume and two volumes of summary
data on approximately 500 parishes. These data may be of interest to mayors and
local councillors, demographers, botanists, zoologists, architects, and, most of
all, historians and archaeologists.
2 Workflow
2.1 Motivations
According to R. Kummer, “Historians do not want to find database records;
they want to understand historical contexts” [9]. In other words, History and
Archaeology work with fragmented information to understand historical data and
their contexts, which enables the creation of multiple narratives and
interpretations. Databases should therefore be prepared for characteristics
such as subjectivity, plural views, fragmented data, and uncertainty.
Since the advent of Digital Humanities, there has been a progressive attempt to
apply computational methods to History. In the Portuguese case, however, it is
still difficult to find pre-processed, standardized and openly available
historical data for analysis. Similarly, not all research projects seek to make
their data available, which, in extreme cases, may render the conclusions
impossible to reproduce. This gap makes it difficult to verify and discuss
results, which is essential to the scientific process. In part, this difficulty
stems from the same academic tradition that values the effort needed to find
novel and unpublished documentation or to analyze already known sources in an
original and innovative way.
Among other data sets, the transcripts made by the Portugal 1758 project
are available in CIDEHUS Digital [7] with textual search, faceting and the
possibility of navigating directly to each question in each parish. This search
(based on Apache Solr), while useful for generic research in the Parish
Memories, does not programmatically reflect the relationships between entities
and events visible in the text, characterize the relationships present in the
data, or help the user to understand the context of relevant information. On the
other hand, the linking of these data with other sources, such as the Portuguese
Corography by Padre António Carvalho da Costa (published between 1706 and
1712), was only tested briefly through toponymy. This limits more complex
analyses for some users, as they require prior data processing, knowledge and
time.
In the situation described, the chosen approach limits the study of
context, which is essential to the heritage sciences. This limit is found
in studies of History and Archaeology, but also in research focused on
other domains, since any study carried out in isolation will tend to ignore
existing relationships that do not fall within its scope. This means that
the completely independent development of new databases results in more
data fragmentation, the atomisation of knowledge, and ultimately the creation of
data silos [10]. In the case of the Portuguese Parish Memories, current practice
is generally limited to extracting the information relevant to the researcher’s
own interests and therefore implies endless iterations of building listings or
databases (e.g. the presence or absence of industry or foodstuffs). Ultimately,
we consider that research done in this way amounts to the constant construction
of data silos which, if built differently or related to each other, would give
us a more complete view of Early Modern Portugal: the context.
2.2 Proposal
We believe that the Parish Memories as a source of information could reach a
new level if worked from a different perspective: semantically annotated,
processed and modeled. With this approach, they can become a Knowledge
Base (KB) that connects with historical sources and research outputs. We
therefore propose that:
– all transcripts and notes should be available online, in accordance with Open
Science principles;
– it should be possible for anyone interested in the topic to collaborate
(crowdsourcing);
– collaboration should be independent of the user’s geographical area and
background knowledge (with different levels of access to control the
quality of data processing);
– the Parish Memories should be semantically annotated, and an ontology should
be constructed to represent this Knowledge Base;
– the annotations should be associated with a thesaurus.
Implementing this proposal requires a paradigm shift in Parish Memories research
practices, one that respects research interests and academic tradition but
introduces a normalizing workflow for the study of the source. In this way, we
intend to provide a route out of present-day isolation for each piece of
knowledge constructed from this historical source, while maintaining its
original context, both historical and scientific.
2.3 Methods
The proposed workflow is based on the INCEpTION platform. This platform
meets most of our requirements: it allows interactive and collaborative semantic
annotation and supports different levels of access. These features alone
allow any user to collaborate and allow the annotations to be reviewed by
another, presumably more experienced, user. It is thus possible, for example,
for students of Digital Humanities or Palaeography, or indeed any interested
citizen, to contribute to the transformation of the Portuguese Parish Memories
into a Knowledge Base and to realise its full potential. Even from a traditional
point of view, this has numerous advantages, as a means of mitigating
transcription errors or even of enriching the base transcription with knowledge
from users who know the territory under analysis very well.
Among other things, the semantic annotation capability provides “concept
linking, fact linking, knowledge base population, semantic frame annotation” [8]
and includes machine learning algorithms that actively assist the annotation
task. From corrected and learned annotations, INCEpTION can therefore suggest
annotations in other processed or pre-processed texts. This approach, called
machine-assisted interactive annotation (human-in-the-loop with Active
Learning (AL)) [6], makes the concept, entity and fact linking task faster and
more efficient [8]. At the same time, the tool creates the KB on the fly as
annotations are created.
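The idea of suggesting annotations from previously accepted ones can be illustrated with a deliberately simple sketch. It is not INCEpTION's recommender, nor a real active learning strategy: the class, its methods, the labels and the sample sentence are all hypothetical, and the "model" merely remembers which label annotators gave to each term.

```python
from collections import Counter, defaultdict


class SuggestionModel:
    """Toy recommender: remembers which label annotators gave each term."""

    def __init__(self) -> None:
        self._votes: dict[str, Counter] = defaultdict(Counter)

    def learn(self, term: str, label: str) -> None:
        """Record one accepted (human-corrected) annotation."""
        self._votes[term.lower()][label] += 1

    def suggest(self, text: str) -> list[tuple[str, str]]:
        """Propose (term, label) pairs for terms that were annotated before."""
        suggestions = []
        for token in text.split():
            key = token.strip(".,;:()").lower()
            if key in self._votes:
                # Use the label chosen most often by human annotators.
                label, _count = self._votes[key].most_common(1)[0]
                suggestions.append((key, label))
        return suggestions


model = SuggestionModel()
model.learn("mills", "Mill")          # invented annotations, for illustration
model.learn("presses", "OilPress")
print(model.suggest("The river has two mills and three presses."))
# [('mills', 'Mill'), ('presses', 'OilPress')]
```

In a human-in-the-loop setting, each suggestion would be accepted or corrected by an annotator and fed back through `learn`, so later suggestions reflect the corrections.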
In addition to these features, the platform includes corpus search to facilitate
annotation [2] and is modular; other algorithms are being developed to assess
whether the annotations entered are plausible [8].
We propose starting the annotation task with related questions,
for example the question “If the margins of the river are cultivated, and if there
is a lot of fruit or wild trees” (River-10) and the question “If there are mills,
olive oil presses, cloth beater, waterwheels, or some other device” (River-16).
This makes it possible to restrict the annotation work to specific domains of
knowledge, in this case mainly economy and heritage, in which CIDEHUS has
several specialists who can contribute to a successful and smooth
start of the task. It will also make it possible to test the NLP detection of
similar references in texts where the parish priests answered in free text
rather than question by question.
Taking advantage of the platform’s capabilities, we also propose that the
annotated transcripts should always be linked to a thesaurus, in order to
preserve the original orthography in which they were written, allow searches
with current spelling, and enrich the associated knowledge.
After an initial investment by researchers, it is important to engage the
community in the collaboration, especially, but not only, the academic
community. Universities have a very relevant critical mass in other disciplinary
areas that can and should be harnessed. Researchers in those areas may not even
like history, but they should be made aware that the realities they study have a
past dimension, capable of generating new meanings or explanations.
3 Discussion
Context is essential, in the Humanities and beyond. As such, databases in
History and Archaeology should reflect all the information present in the
sources, together with its context. Technologies such as NLP or LOD are
increasingly essential to overcome this issue.
The workflow proposed here is not innovative, but it makes it possible, for
example, to characterize the predominance of fruits per parish and at
the micro-scale (locality, mountain and river); to link the information present
in the Parish Memories to databases and to onomastic or prosopographic data,
whether from the same period or not; or even to help build a question-answering
system for Early Modern Portuguese History.
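As an illustration of the first use case, once annotations exist as structured records, the predominance of fruits per parish reduces to a simple aggregation. This is a sketch under stated assumptions: the record layout (parish, part, label, value), the labels and all the data are invented for the example.

```python
from collections import Counter, defaultdict

# Invented annotation records: (parish, survey part, label, value).
annotations = [
    ("Parish A", "Locality", "Fruit", "wheat"),
    ("Parish A", "Locality", "Fruit", "olive oil"),
    ("Parish A", "River", "Fruit", "wheat"),
    ("Parish B", "Locality", "Fruit", "wine"),
    ("Parish B", "Mountain", "Mine", "iron"),
]


def fruits_by_parish(records):
    """Count how often each fruit is annotated in each parish."""
    counts = defaultdict(Counter)
    for parish, _part, label, value in records:
        if label == "Fruit":
            counts[parish][value] += 1
    return counts


def predominant(counts):
    """Pick the most frequently annotated fruit for each parish."""
    return {parish: c.most_common(1)[0][0] for parish, c in counts.items()}


print(predominant(fruits_by_parish(annotations)))
# {'Parish A': 'wheat', 'Parish B': 'wine'}
```

Grouping on the pair (parish, part) instead of the parish alone would give the same view at the micro-scale of locality, mountain and river.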
The heterogeneity of the Parish Memories may thus be their greatest richness as
a KB. The progressive implementation of semantic annotations from
various knowledge domains will make it possible to provide a basis for linking
to other coeval data, to other databases, and to the products of research,
regardless of the knowledge domain.
In short, we propose that the Portuguese Parish Memories, due to their
intrinsic characteristics, should constitute a Gold Standard for the Natural
Language Processing of other historical sources of Early Modern Portugal,
regardless of the knowledge domain, with the Portuguese Corography as
the natural next step.
References
1. ANTT (ed.): Memórias Paroquiais, vol. 24, p. 1175
2. Boullosa, B., de Castilho, R.E., Laskari, N.K., Klie, J.C., Gurevych, I.: Integrating
knowledge-supported search into the INCEpTION annotation platform. In:
Proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing: System Demonstrations. pp. 127–132 (2018)
3. Capela, J.V., Ferreira, A.d.C. (eds.): Braga triunfante ao tempo das Memórias
Paroquiais de 1758. s.n., Braga (2002)
4. Capela, J.V., Matos, H. (eds.): As freguesias dos distritos de Aveiro e Coimbra nas
Memórias Paroquiais de 1758 : memórias, história e património. Ed. José Viriato
Capela, Braga (2011), http://repositorium.sdum.uminho.pt/handle/1822/19969,
does not contain pp. 129-699.
5. Capela, J.V., Matos, H., Castro, S. (eds.): As freguesias dos distritos de Lisboa e
Setúbal nas ’Memórias Paroquiais de 1758’: memórias, história e património. Casa
Museu de Monção/Universidade do Minho, Braga (2016)
6. de Castilho, R.E., Ide, N., Kim, J.D., Klie, J.C., Suderman, K.: Towards
cross-platform interoperability for machine-assisted text annotation. Genomics &
Informatics 17(2) (2019)
7. CIDEHUS: Portugal 1758 - CIDEHUS Digital (2019),
http://www.cidehusdigital.uevora.pt/portugal1758, Last accessed on 2019-12-10
8. Klie, J.C., Bugert, M., Boullosa, B., de Castilho, R.E., Gurevych, I.: The
INCEpTION platform: Machine-assisted and knowledge-oriented interactive
annotation. In: Proceedings of the 27th International Conference on
Computational Linguistics: System Demonstrations. pp. 5–9 (2018)
9. Kummer, R.: Archaeology and the semantic web—prospects and challenges. In:
Frischer, B., Webb Crawford, J., Koller, D. (eds.) Computer Applications and
Quantitative Methods in Archaeology (CAA). Proceedings of the 37th International
Conference. pp. 178–190. Archaeopress (2010)
10. Migliorini, S., Grossi, P., Belussi, A.: An interoperable spatio-temporal model
for archaeological data based on ISO standard 19100. Journal on Computing
and Cultural Heritage 11(1), 1–28 (2017). https://doi.org/10.1145/3057929