Semantic Annotation and Retrieval of Parliamentary Content: A
      Case Study on the Spanish Congress of Deputies
                                 Iván Cantador                                                           Lara Quijano-Sánchez
                          ivan.cantador@uam.es                                                            lara.quijano@uam.es
                       Escuela Politécnica Superior                                                   Escuela Politécnica Superior
                     Universidad Autónoma de Madrid                                                 Universidad Autónoma de Madrid
                              Madrid, Spain                                                                   Madrid, Spain

ABSTRACT
In this paper, we present an ontology-based annotation and retrieval approach for parliamentary
content, such as debate transcripts and law proposals. Exploiting a number of domain ontologies,
semantic web technologies and information retrieval techniques, our approach extracts topics,
concepts and named entities (e.g., names of politicians and political parties) appearing in input
documents. The domain ontologies were designed to support multilinguality, and were built from the
United Nations taxonomy of sustainable development goals. The approach was instantiated with a text
corpus extracted from the Spanish Congress of Deputies and is being integrated into an e-government
platform.

CCS CONCEPTS
• Applied computing → Computing in government; • Information systems → Ontologies; Information
extraction; Information retrieval.

KEYWORDS
parliamentary content, semantic annotation, argument extraction, ontology-based information retrieval

1    INTRODUCTION
Managing and publicly providing digital libraries on parliamentary activity are essential tasks for
open government, promoting democracy, enhancing transparency, and facilitating accountability.
However, the amount of multimedia content recording the debates and proposals generated by
parliaments is huge and ever-increasing. This, together with the unstructured nature of such content,
makes its organization, access and retrieval challenging.
    As stated in [12], metadata facilitates the classification, storage and retrieval of e-government
resources. It summarizes the available contents, allows users to manage, find and access the
resources, helps them understand and determine whether the corresponding information meets particular
requirements, and, thanks to a consistent description of the data, promotes its sharing and exchange.
    Public administrations are aware of the advantages of sharing open government data with regard to
transparency, stakeholder collaboration, improved services, and new economic activities [20]. Hence,
in the last decade, there has been a large increase in initiatives to publish and interlink
government data and services. This has been facilitated by the use of semantic web technologies and
standards, and the generation of Linked Open Data (LOD) [24].
    In this context, researchers have developed approaches to automatically generate semantic
annotations for parliamentary content of diverse types, such as laws, political programs, and
transcriptions compiling interventions of parliament members in plenary meetings. The majority of
these approaches have focused on identifying certain parliament entities such as political groups and
representative members [23], and a limited number of topics addressed within the input text documents
[8]. Moreover, in general, they only support content in a single language [20].
    Aiming to address these limitations, we present a first version of an ontology-based approach
that makes use of information retrieval techniques and semantic web technologies to annotate and
retrieve parliamentary contents in multiple languages. More specifically, our approach is built upon
a knowledge base composed of ontologies covering the United Nations taxonomy of sustainable
development goals, which are related to a variety of domains, such as education, economy, natural
resources, climate change, and social rights. The approach identifies concepts (i.e., classes) and
instances (i.e., class individuals) of the above ontologies in input text documents, by means of
information retrieval techniques applied to indices created from multilingual labels of the ontology
concepts. The extracted concepts not only represent several levels of thematic annotations, but also
allow computing ontology-based similarities that enhance the retrieval of semantically related search
and recommendation results beyond keyword-based matching.
    As a proof of concept, the approach has been instantiated and preliminarily evaluated on a text
corpus managed by Parlamento2030, an online platform that monitors parliamentary activity in the
Spanish Congress of Deputies. A user study on search tasks shows the benefits of the semantically
enhanced annotation and retrieval results provided by our approach.
    The remainder of the paper is structured as follows. In Section 2, we survey related work on both
retrieval of parliamentary content and semantic retrieval for e-government applications. In Section
3, we introduce the case study addressed with the Parlamento2030 platform. Next, in Sections 4 and 5
we present the proposed approach, distinguishing between its knowledge base building, semantic
annotation, and ontology-based retrieval methods. Then, in Section 6, we present preliminary results
from a user study on search tasks.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
Parlamento2030, https://www.parlamento2030.es/
                                                                                                                                  Cantador et al.


2     RELATED WORK
Our approach is built upon an ontology-based framework for annotating and retrieving parliamentary
content. Hence, in this section, we survey related work by distinguishing between specific approaches
to information retrieval in parliamentary contexts and more general approaches to semantic annotation
and retrieval in e-government applications.

2.1    Retrieval of Parliamentary Content
In [22], Sánchez-Nielsen et al. conceptualized a smart system where citizens interact with elected
representatives in Parliaments, access Parliament proceedings, and subscribe to new parliamentary
contents. Among other issues, the authors proposed the use of semantic technologies and recommender
systems to incorporate models for concept and knowledge representation, manual and automatic
annotation of textual content from plenary sessions, fragmentation of audiovisual content, provision
of customized feeds and information retrieval, and production of automatic reports. Recently, in
[23], the authors presented an approach to automatically annotate video transcriptions of the debates
held in plenary meetings of the Canary Islands Parliament, in Spain. Specifically, the annotations,
expressed as RDF semantic data, were associated with concepts belonging to an ad hoc ontology that
modeled legislation, representation and parliamentary activity concepts, such as legislatures,
legislative proposals, political groups, representative members, sessions, interventions and votes.
Exploiting the generated annotations, the authors developed a prototype aimed at retrieving video
clips that fulfill a user’s specified need on the parliamentary activity, as well as contextual
information for content understanding.
   In a series of works ([4, 5, 8, 26]), De Campos, Fernández-Luna, Huete and colleagues investigated
information retrieval approaches for the Parliament of Andalusia, in Spain. In [8], the authors
presented an XML digital library automatically created from the official documents (published in PDF)
of the parliament session diaries, i.e., the transcriptions of all the deputies’ interventions in
plenary and commission sessions. The generated XML files contained simple metadata, such as session
date, starting and ending times, agenda points, addressed topics, and vote results. Making use of
such XML structures, in [4], the authors developed a system to support the manual annotation of
certain parts of the transcriptions with their corresponding segments in videos recorded during the
parliament sessions. Exploiting the generated annotations, a search engine for structured documents
based on Bayesian Networks and Influence Diagrams was tested to retrieve parts of the transcriptions
and videos relevant to a given query. More recently, in [26], the authors enhanced their search
engine with personalization capabilities. In particular, motivated by the need to preserve the user’s
privacy in political contexts, the retrieval model was adapted to a number of content-based
stereotype profiles, which were built through several term and category weighting techniques. Lastly,
in [5], instead of addressing personalized search, the authors focused on an information filtering
task, where the members of parliament receive personalized recommendations of those documents that
may be relevant for them. Through empirical comparison, the evaluated information retrieval- and
machine learning-based methods did not show significant performance differences.
   Unlike the previous works, in [11], Kaptein and Marx attempted to extract high-level semantic
annotations from parliamentary debates, aiming to summarize and visualize the narrative structure of
meetings as tables of topics and intervention graphs. The authors developed a search engine that
exploited the generated annotations, enabling the provision of entry points to documents (in XML),
grouping of search results, and faceted data exploration. In a user study on a large dataset with
official transcripts of meetings from the Dutch Parliament, users reported that, in comparison to a
standard document retrieval system, the proposed search engine provided a better overview of the data.
   In addition to identifying certain parliament entities (e.g., political groups and representative
members), as done in [23], we also aim to extract thematic annotations at several semantic levels,
namely topics and domains. Similarly to the approaches presented in [4, 8, 11], ours processes
transcriptions stored in text documents. However, instead of generating annotations in XML, our
system produces RDF tuples, which could be linked to external semantic web repositories. We note that
the RDF annotations of [23] refer to a limited vocabulary of legislation, representation, and
parliamentary activities, whereas our annotations correspond to concepts from a variety of domains.
The personalization of search results, as proposed by [5, 26], is left as future work.

2.2    Semantic Retrieval in e-Government
The Olimpo system [10] represents one of the first reported approaches to semantic search for an
e-government application. In particular, the system applied case-based reasoning and information
retrieval techniques to find documents similar to an input list of documents provided by the user
through an iterative process. Implemented for searching United Nations (UN) security resolutions, the
system made use of the structured representation of documents to identify and extract from the
documents a variety of attributes, such as subjects, dates, institution acronyms, country names,
number of decisions, and text parts with higher occurrence of indicative expressions of the
resolutions. More recently, Liu and Hu [12] also presented an approach to extract metadata from
e-government information resources, and exploit it to provide semantic search functionalities. In
this case, the identification of attributes was conducted by means of lexical analysis (i.e.,
part-of-speech labeling), stop word removal, and term frequency-based filtering for the Chinese
language. The generated metadata was stored in XML, but the authors proposed its transformation to
RDF.
   In the previous works, the exploited metadata consisted of a limited set of attributes without
semantic structure or relationships between them. In contrast, others have proposed the use of
ontologies as knowledge representation frameworks whose interlinked concepts form the fundamental
elements of the document annotations used by the search engines. In this context, semantic web
technologies, e.g., RDF and OWL, represent a popular trend in the literature.
   Amato, Mazzeo and Picariello [1] presented a system that exploits ontologies and NLP techniques to
annotate e-government multimedia documents. The ontologies contained both domain


knowledge and lexical vocabularies for Italian and English languages. In the paper, the authors did
not provide descriptions of the domain ontologies. They, nonetheless, implemented an information
retrieval prototype where a user study was conducted on a small collection of 60 criminal law and
juridical documents, showing preliminary accuracy results of the annotation process. In [17], Moreira
et al. presented POWER, an ontology of political processes designed to track politicians, political
organizations, and elections in social media. The authors also presented EMPOWERED, a framework to
populate the POWER ontology with information extracted from various resources. The authors assessed
the framework on the Portuguese Government and National Elections Committee websites, retrieving 3.6K
politicians, 3K terms related to political institutions, and 74 political associations from mandates
that have taken place since 1976. They did not explain the developed semantic annotation method, and
proposed to exploit the annotations for information retrieval tasks (i.e., expert finding and
question answering) and to align the ontologies with Linked Open Data repositories, such as DBpedia,
YAGO and FOAF. Motivated by the benefits of LOD to support interoperability between European
administrations and to improve the information access for citizens across Europe, in [20], Narducci,
Palmonari and Semeraro presented CroSeR, a cross-language e-government service retriever for
different European languages. The underlying semantic annotation algorithm of the system enriched the
short descriptions of the services with labels extracted from Wikipedia concepts related to the
services. In particular, it used Explicit Semantic Analysis [9], which disambiguates a word meaning
through a semantic similarity with Wikipedia concepts. The authors carried out an empirical
evaluation consisting of an information retrieval task on a catalogue of 2.4K services in five
different languages (namely Dutch, Belgian, German, Norwegian and Swedish), and comparing CroSeR with
well-known semantic annotators, such as Wikipedia Miner [16] and DBpedia Spotlight [15]. More
recently, in [14], the authors presented OntoloGov, a system that supports interoperability between
e-government repositories, and provides semantic search functionalities. More specifically, the
system performed a metadata extraction process, made use of ontology-based contextual user profiles,
and applied case-based reasoning to support knowledge retrieval.
   As done in [10], we consider content generated by the UN, but, instead of focusing on vocabularies
associated with its security resolutions, we use its taxonomy of sustainable development goals,
covering various domains. Similarly to the approach presented in [12], we make use of natural
language processing techniques and resources for the semantic annotation process. Moreover, as done
in [1] and [17], we propose ontology-based representations. However, the former did not provide
descriptions of its domain ontologies, and the latter used a limited ontology modeling political
processes. Lastly, as in [20], supporting multilinguality and generating LOD represent two of the
principal requirements for our system. Unlike that work, we leave the rigorous evaluation of generic
semantic annotation and retrieval methods for the future.

3     CASE STUDY
Our approach is intended to be integrated into Parlamento2030, an online platform that monitors
parliamentary activity in the Spanish Parliament. For such integration, the approach has been
implemented and preliminarily evaluated on a dataset managed by Parlamento2030. In Section 4, we
describe this dataset and the knowledge base we have generated from it. Before that, we introduce the
Spanish Parliament and the Parlamento2030 tool.

3.1     The Spanish Congress of Deputies
The Cortes Generales (lit. General Courts) are the Spanish Parliament. Established and regulated in
the Constitution of Spain, they form a bicameral legislature, consisting of the Senate (Senado),
i.e., the upper house, and the Congress of Deputies (Congreso de los Diputados), i.e., the lower
house.
   In the Congress of Deputies, parliamentarians form working groups, called commissions, in different
areas of interest (e.g., economy, education, healthcare, agriculture, etc.), and discuss initiatives
related to the commissions they belong to. The proposals generated by the commissions are presented
and debated in plenary sessions. Certain proposals are then formalized as law projects, which have to
be voted on and approved by a majority of the deputies for their implementation as laws. The Senate,
which represents the territorial regions in Spain and supervises the work done by the Congress of
Deputies, does not propose laws, but revises and suggests changes to the law projects provided by the
deputies.
   As a form of government transparency, both law proposals and Parliament session diaries are
available online as HTML and PDF documents. The Parlamento2030 platform crawls, scrapes and
categorizes such diaries to provide search functionalities, which we describe next.

3.2     The Parlamento2030 platform
Part of the Salvador Soler Foundation, CIECODE, Centro de Investigación y Estudios sobre Coherencia y
Desarrollo (lit. Center for Research and Studies on Coherence and Development), aims to analyze
public policies and private practices of developed countries, inform about their effects on
developing countries, and make proposals to move towards a more egalitarian society and fair world.
   In particular, CIECODE implements innovative projects on access to political information. Among
them, Parlamento2030 is an online platform that monitors parliamentary activity in the Spanish
Congress of Deputies, aiming to promote an active, informed and demanding citizenship and a
responsible political class subject to public scrutiny. It is an adaptation of the TIPI Ciudadano
tool and is built upon the CIECODE Political Watch open-source framework.
   The Parlamento2030 tool scans all the political activities of the Congress of Deputies by crawling
the activity transcriptions publicly available on the Congress website, and automatically
categorizing the crawled content according to their relationships with 17 priority thematic areas for
poverty, social justice and sustainable

                                                                                       Fundación Salvador Soler, https://unmundosalvadorsoler.org
                                                                                       CIECODE, https://www.ciecode.es
                                                                                       TIPI Ciudadano, https://tipiciudadano.es
                                                                                       CIECODE Political Watch, https://github.com/politicalwatch
                                                                                       Congreso de los Diputados, http://www.congreso.es


development. More specifically, it annotates the parliamentary con-     Table 1: Examples of concepts and keywords of the dataset.
tent by keyword matching using a vocabulary with more than 3K
Spanish terms provided by expert individuals and organizations in        Domain       Concept                   Keywords
each of the areas.                                                       Poverty      Unemployment and          unemployment and
   The platform has a search engine with which the user can refine                    vulnerability             vulnerability, unemployed and
                                                                                                                vulnerability, vulnerable
information filtering queries based on multiple criteria, such as
                                                                                                                unemployed
author, date, theme and keyword (see Figure 1). It also offers a
                                                                                      Global poverty rate       global poverty rate, worldwide
personalized system of alerts that allows a user to be up to date on                                            poverty rate, international
the political news of her topics of interest. Parlamento2030’s code                                             poverty rate
and data can be freely accessed and downloaded.                          Education    Reading and               reading and math skills, reading
   Our approach is built and tested on a Parlamento2030 dataset                       math skills               and math proficiency
composed of the above mentioned 17 thematic vocabularies and a                        STEM careers              STEM careers and women, STEM
collection of structured transcriptions of parliamentary activity. In                 and women                 degrees and women, STEM and
the next section, we describe the dataset, as well as the semantic                                              women
knowledge base and annotations our approach generates from such                       ANECA                     ANECA, National Agency for
dataset.                                                                                                        Quality Assessment and
                                                                                                                Accreditation


                                                                        4.1    Original dataset
                                                                        The Parlamento2030 system performs keyword matching heuristics
                                                                        based on regular expressions to identify the topics of the textual
                                                                        content periodically published in the website of the Spanish Con-
                                                                        gress of Deputies. The regular expressions were manually generated
                                                                        and curated by the experts that built the domain vocabularies.
                                                                           The topics correspond to the 17 Sustainable Development Goals
                                                                        (SDGs) established by United Nations: no poverty (G1), zero hunger
                                                                        (G2), good health and well-being (G3), quality education (G4), gen-
                                                                        der equality (G5), clean water and sanitation (G6), affordable and
                                                                        clean energy (G7), decent work and economic growth (G8), indus-
                                                                        try, innovation and infrastructure (G9), reduced inequality (G10),
                                                                        sustainable cities and communities (G11), responsible consump-
                                                                        tion and production (G12), climate action (G13), life below water
                                                                        (G14), life on land (G15), peace and justice strong institutions (G16),
                                                                        and cooperation and alliances (G17). These goals are aligned with
                                                                        Agenda 2030, the global action plan and commitment to eradicate
                                                                        poverty and achieve sustainable development by 2030 worldwide.
                                                                           For each of these goals, in cooperation with CIECODE mem-
                                                                        bers, experts on different domains elaborated a list of targets, i.e.,
                                                                        issues of interest and relevant problems to be addressed by gov-
                                                                        ernments and public administrations. Each target was split into
                                                                        concepts, which are represented by a set of keywords in Spanish.
                                                                        More specifically, the Parlamento2030 system uses a vocabulary
                                                                        of more than 3K concepts, where each concept has associated a
                                                                        number of regular expressions that generate keywords according to
Figure 1: Screenshots of Parlamento2030 search form and                 singular and plural forms, morphological deviations (e.g., masculine
result pages.                                                           and feminine forms, articles and prepositions linked to nouns), ad-
                                                                        jectives and adverbs, abbreviations and acronyms. For example, for
                                                                        the Quality education (“Educación de calidad”) target, the concept
                                                                        “ayudas oficiales al desarrollo en educación” (lit. official develop-
                                                                        ment grants in education) has keywords such as “ayudas oficiales al
4   KNOWLEDGE BASE AND SEMANTIC                                         desarrollo en educación”, “ayuda oficial al desarrollo en educación”,
    ANNOTATIONS                                                         “ayuda oficial al desarrollo educativo,” and “AOD en educación.” For
In this section, we present the knowledge base building and se-         a better comprehension, Table 1 shows some examples of concepts
mantic annotation methods we developed. Before, we introduce            and keywords of the dataset translated into English.
the original Parlamento2030 dataset on which we run the above
methods.                                                                https://www.un.org/sustainabledevelopment
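As a rough illustration of how a single regular expression can cover the keyword variants of a concept, the following Python sketch matches the four keywords listed above for “ayudas oficiales al desarrollo en educación”. The pattern and the find_keywords helper are hypothetical and written only for this example; the actual Parlamento2030 expressions are not shown here.

```python
import re

# Hypothetical pattern (an assumption, not a Parlamento2030 expression).
# It covers singular/plural forms ("ayuda(s) oficial(es)"), the adjective
# variant ("desarrollo educativo") and the acronym ("AOD").
CONCEPT_PATTERN = re.compile(
    r"\b(?:ayudas?\s+oficial(?:es)?\s+al\s+desarrollo\s+(?:en\s+educación|educativo)"
    r"|AOD\s+en\s+educación)\b",
    re.IGNORECASE,
)


def find_keywords(text):
    """Return every keyword occurrence of the concept found in the text."""
    return [match.group(0) for match in CONCEPT_PATTERN.finditer(text)]


if __name__ == "__main__":
    sample = "La AOD en educación y la ayuda oficial al desarrollo educativo"
    print(find_keywords(sample))
```

A single alternation keeps all variants of one concept in one place, at the cost of also accepting ungrammatical mixes such as “ayudas oficial”; a stricter pattern would list the singular and plural forms separately.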


4.2     Knowledge base
Our approach uses a semantic knowledge base built from the Par-
lamento2030 dataset. In particular, the knowledge base integrates
domain ontologies that follow the schema shown in Figure 2. The
schema has 3 main classes, namely Goal, Target and Concept.
   The Goal subclasses and individuals correspond to the 17 United
Nations SDGs. The Target subclasses and individuals are associated
with the domain targets established in Parlamento2030 for each
SDG. Lastly, the Concept subclasses and individuals correspond
to topics (concepts) and keywords assigned to each target in Par-
lamento2030. A Concept subclass may have subclasses, forming
partial taxonomies that represent the knowledge of the covered
domains; Figure 3 shows a partial view of the ontology with
some concepts of the Education domain. Moreover, through the
rdfs:label property, the Concept individuals have one or more
terms in different languages. These terms correspond to the key-
words exploited in Parlamento2030; see examples in Table 1.
   Each class, individual and property of the ontologies is identi-
fied by a language- and semantics-independent URI. For instance,
tipi:target#Target_4_6 is the URI of target T4.6 Literacy, related
to goal G4 Quality education, whose URI is tipi:goal#Goal_4. To
support multilinguality, every class and individual can have multiple
string labels through the rdfs:label property. Hence, for example,
the individual of T4.6 class may have label values such as “literacy”
and “alfabetización” for English and Spanish, respectively.
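The identification scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the expand and labels_for helpers, the in-memory LABELS dictionary and the "en"/"es" language tags are hypothetical; the base URI is the one the paper gives for the tipi: prefix, and the labels are the ones quoted above.

```python
# Base URI that replaces the "tipi:" prefix, as stated in the paper.
TIPI_PREFIX = "http://ciecode.es/tipi/ontology/"


def expand(curie):
    """Expand a 'tipi:'-prefixed identifier into a full URI."""
    prefix, sep, local = curie.partition(":")
    if prefix != "tipi" or not sep:
        raise ValueError("not a tipi: identifier: %r" % curie)
    return TIPI_PREFIX + local


# Hypothetical in-memory stand-in for rdfs:label values:
# entity URI -> {language tag: [labels]}. The URI itself carries no language.
LABELS = {
    expand("tipi:target#Target_4_6"): {
        "en": ["literacy"],
        "es": ["alfabetización"],
    },
}


def labels_for(curie, lang):
    """Return the labels of an entity in the given language (empty if none)."""
    return LABELS.get(expand(curie), {}).get(lang, [])


if __name__ == "__main__":
    print(expand("tipi:goal#Goal_4"))
    print(labels_for("tipi:target#Target_4_6", "es"))
```

Because the URI is language-independent, adding a new language only means adding labels to an existing entity, never minting new identifiers.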
   Lastly, the individuals of Goal, Target and Concept subclasses are linked by means of the
tipi:hasGoal and tipi:hasTarget properties. Following the above-mentioned examples, the
individual tipi:concept#Concept_942 (i.e., digital literacy) has target tipi:target#Target_4_6
(i.e., literacy), which is related to goal tipi:goal#Goal_4 (i.e., quality education). The
individuals of Concept subclasses are created from the singular and acronym forms of every list
of keywords available in Parlamento2030.
   The building of classes and individuals was conducted automatically. The organization of the
Concept subclasses (including the creation of new inner subclasses), in contrast, was done
manually by experts, using the graphical user interface of the Protégé tool [19]. The whole
knowledge base is composed of 169 targets, 3.6K concepts and 10.3K terms.

Figure 2: Ontology schema diagram.

The prefix tipi: has to be replaced by http://ciecode.es/tipi/ontology/

Figure 3: Partial view of the Education domain ontology, visualized by means of its rdfs:label
values in Spanish.

4.3     Semantic Annotations
Figure 4 shows the architecture of the implemented semantic annotation method, which makes use
of two indices: an index for the input parliamentary documents and an index for our ontological
knowledge base.

Figure 4: Semantic annotation framework.

   Parlamento2030 crawls and scrapes the documents published in the Spanish Congress of Deputies
website, providing JSON files with the content generated by parliamentarians (i.e., debates and
law proposals) in plain text. From these files, our method builds
an index (on the top right of the figure) using the Apache Lucene library. This index is used by
a document index manager to retrieve a ranking of documents for a given keyword-based query. The
documents are indexed by title, content and language separately. Indexed terms are weighted with
TF-IDF values computed on the whole corpus of documents. For a query, the result list is limited
to 100 documents, and the ranking scores of the documents are normalized to sum to 1.
   Our method builds a second Lucene index (in the middle of the figure) to efficiently access
the information available in the domain ontologies. In this case, each ontology entity (i.e.,
class or individual) is indexed by domain (goal), target, concept name, concept keyword, RDF
label, and language. The index also stores the URI of each entity. Similarly to the document
index, there is an ontology index manager that retrieves a ranked list of entities for a given
keyword-based query.
   The ontology index manager uses an ontology manager (on the left of the figure) to
efficiently obtain all the ontology entities and their data. The latter was indeed the component
that created the UN sustainable development ontologies from the Parlamento2030 dataset, and
stored them in an RDF repository using the Apache Jena framework.
   Once the document and ontology indices are built, a document annotation component (on the
bottom right of the figure) generates XML files with the semantic annotations of the input
documents. The annotation process is as follows (for a given language). The annotator launches
on the document index several queries q_{e,k} for each ontology entity e. The queries are
composed of the k terms associated with the entity name, keywords and labels, and thus generate
several ranked lists of documents that contain the terms associated with e. Next, the obtained
ranking scores s_{e,k}(d) of the documents are aggregated into weights w_e(d), which measure the
relevance of each ontology entity e for each document d:

    w_e(d) = \sum_{k \in terms(d)} s_{e,k}(d)

   These weights are normalized by document, and are considered as semantic annotations of the
documents at entity (concept) level.
   Through the relationships of the ontologies, we can compute weights w_e(t) and w_t(g),
respectively establishing the relevance of entity e for target t and the relevance of target t
for goal g. As a first implementation, we set w_e(t) = 1 if entity e (or the class of e) is
related to target t by means of the tipi:hasTarget property, and
w_t(g) = \sum_{t \to g} w_e(t) for those targets t that are related to goal g by means of the
tipi:hasGoal property.
   Using the weights w_e(t) and w_t(g), our method computes weights w_t(d) and w_g(d) that will
be considered as semantic annotations of the documents at target and goal levels:

    w_t(d) = \sum_{e \to t} w_e(t) \cdot w_e(d)

    w_g(d) = \sum_{t \to g} w_t(g) \cdot w_t(d)

   As an illustrative example, the next XML fragment shows some semantic annotations generated
for an input document about actions planned by the Hydrographic Confederation, formed by
Spanish-Portuguese committees for the management and care of the Duero river. From keywords such
as “embalses” and “presas” (lit. reservoirs and dams), our method extracts “Recurso hídrico”
(lit. water resource), “Gestión integrada del agua” (lit. integrated water management), and
“Agua limpia y saneamiento” (lit. clean water and sanitation), as semantic annotations at
concept, target and goal levels.

<doc_annotation_list>
  <doc_id>81c8bb406e52a1e8009619ee0632705c7f366ca6</doc_id>
  <doc_title>Actuaciones previstas por la Confederacion Hidrografica ...</doc_title>
  <term_annotation_list>
    <term_annotation>
      <term>embalses</term>
      <weight>0.5</weight>
    </term_annotation>
    <term_annotation>
      <term>presas</term>
      <weight>0.5</weight>
    </term_annotation>
  </term_annotation_list>
  <concept_annotation_list>
    <concept_annotation>
      <concept_uri>tipi_concept:Concept_147</concept_uri>
      <concept_name>Recurso hidrico</concept_name>
      <weight>1.0</weight>
    </concept_annotation>
  </concept_annotation_list>
  <target_annotation_list>
    <target_annotation>
      <target_uri>tipi_target:Target_6_5</target_uri>
      <target_name>Gestion integrada del agua</target_name>
      <weight>1.0</weight>
    </target_annotation>
  </target_annotation_list>
  <goal_annotation_list>
    <goal_annotation>
      <goal_uri>tipi_goal:Goal_6</goal_uri>
      <goal_name>Agua limpia y saneamiento</goal_name>
      <weight>1.0</weight>
    </goal_annotation>
  </goal_annotation_list>
</doc_annotation_list>

   We note that, in addition to the explained thematic annotations, our method also extracts
annotations related to named entities, such as proper nouns of people (e.g., parliamentarians),
organizations (e.g., government agencies, political parties) and places.

Apache Lucene, https://lucene.apache.org
Apache Jena, https://jena.apache.org/

5      ONTOLOGY-BASED RETRIEVAL
In this section, we present the developed semantic search method. First, we describe the
ontology-based document representation model used by the method.

5.1        Document Representation
As we will explain in the next subsection, our search method is built upon the well known Vector
Space Model, VSM [21]. Hence, to describe content documents, i.e., parliamentary debate
transcripts and law proposals, we make use of a vector representation. However, instead of the
classical information retrieval representation based on terms, we propose to use semantically
related concepts as units of information. The explicit relationships between concepts allow
computing semantic relatedness, and addressing the VSM limitation of term vector pairwise
orthogonality (i.e., linear independence), an unrealistic assumption where any pair of terms do
not relate to each other [25, 27]. Using concepts also allows avoiding ambiguities of polysemic
terms and applying semantic inference through the concept relationships at retrieval stage [7].
    Formally, let O = {E, R} be an ontology composed of entities E = {C, I} that can be classes
C or individuals I, and relationships R : E → E ∪ L that link pairs of entities or entities with
literals (e.g., numeric or string). Let e_1, e_2, ..., e_N ∈ E belong to an N-dimensional
Euclidean space. A document d is represented as a vector d = (w_{d,1}, ..., w_{d,N}) ∈ R^N,
where the weight w_{d,n} corresponds to the semantic annotation score computed as explained in
Section 4.3.
    In this paper, the considered relationships are the properties of our knowledge base
(Section 4.2), and the entities of a given document correspond to its semantic annotations
extracted by our approach (Section 4.3). Next, we explain how the relationships are used by the
proposed search method.

5.2 Search Method
Our search method is based on the Generalized Vector Space Model, GVSM [27] proposed by
Tsatsaronis and Panagiotopoulou [25], which incorporates a semantic relatedness (SR) measure
into the term-based vector similarity of the VSM.
    Let d = (w_{d,1}, ..., w_{d,N}) ∈ R^N and q = (w_{q,1}, ..., w_{q,N}) ∈ R^N be the weight
vectors of document d and query q, respectively. The value w_{q,n} is set to 1 if entity e_n
appears in the input query q, and to 0 otherwise. Considering a typical keyword-based search
scenario, we map the keywords of the user’s query to entities. This is done by the
OntologyManager explained in Section 4.3 (Figure 4), by exact matching of the keywords with the
entity terms (see Figure 2).
    We define the similarity between d and q as follows:

    sim(d, q) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} w_{d,i} \cdot w_{q,j} \cdot SR(e_i, e_j)}{\sqrt{\sum_{i=1}^{N} w_{d,i}^2} \cdot \sqrt{\sum_{i=1}^{N} w_{q,i}^2}}

    To implement this similarity, we propose to use the semantic relatedness metric proposed by
Corella and Castells in [3], which we explain next. With E being the set of existing ontology
entities, the semantic relatedness SR between two entities e_1, e_2 ∈ E is measured in terms of
their distance in the ontology hierarchy as follows. Let e_0 be the closest ancestor (super
class) to e_1 and e_2 in the ontology hierarchy, and let h_1 = 1 + dist(e_1, e_0) (and
h_2 = 1 + dist(e_2, e_0)) be 1 plus the number of levels between e_1 (and e_2) and e_0 in the
ontology hierarchy. We define the semantic relatedness between entities e_1 and e_2 as:

    SR(e_1, e_2) = \left(1 - \frac{\alpha}{h(O)} \cdot \frac{|h_1 - h_2|}{h_1 + h_2}\right) \cdot \frac{1}{\min(h_1, h_2)} \cdot \left(1 - \frac{\max(h_1, h_2) - 1}{h(O)}\right)

    The formula has three factors:
     • The first factor measures the distance between e_1 and e_2 as a proportion of the depth
       h(O) of the ontology hierarchy. The α ∈ [0, 1] parameter allows ensuring a minimum
       non-zero SR value, even for the most dissimilar entities, so that SR ranges in a certain
       interval [a, 1] with a > 0. The authors set α = 0.8 in their experiments.
     • The second factor increases SR proportionally to the closeness of the two entities to
       their common ancestor e_0. Let us consider cases where h_1 = h_2, for which the first
       factor of SR equals 1. The second factor allows establishing a higher SR value for the
       case h_1 = h_2 = x than the case h_1 = h_2 = y if x < y.
     • The third factor decreases SR when e_1 and e_2 are in the same branch of the ontology
       class hierarchy; that is, they are not sibling classes or instances.

An ontology hierarchical level can be either a rdfs:subClassOf relationship between two classes,
or a rdf:type relationship between an individual and its class.

6    EXPERIMENTS
To show the semantic capabilities of the proposed search method, we tested 3 different variants
of the query-document similarity given in Section 5.2. The first similarity, sim_key, is
implemented with SR(e_1, e_2) = 1, ∀e_1, e_2 ∈ E, i.e., it does not exploit ontological
relationships between entities, and thus it is equivalent to the classical VSM. This version is
limited to matching of query and document key terms. The second similarity, sim_sem_all,
considers the SR(e_1, e_2) value of every entity in the document and query vectors. It thus
exploits any ontological relationship underlying the query and document. This version favors the
retrieval of a large number of documents belonging to the domain of the query. Lastly, the third
similarity, sim_sem_max, only applies the maximum SR(e_1, e_2) value for the query and document
entities. This version also extends the keyword matching, but focuses on the strongest
relationships according to the closeness of the entities in the domain ontology.
    To help interpret the experimental results, we next present a real search example for the
query “educación especial” (lit. special education).
    Running this query on our Parlamento2030 corpus, the sim_key method only retrieved 3
documents (d_1, d_2, d_3) having the term “educación especial.” These documents had been
annotated with tipi:concept#Concept_929 (i.e., the special education concept), which, in the
ontology, has associated terms such as specific needs education, aided education, and
exceptional education. The method omitted these terms since it is based on the matching of the
query keywords.
    The sim_sem_all method, in contrast, was able to retrieve a total of 45 documents, all of
them belonging to the education domain. As for the first method, the documents d_1, d_2 and d_3
were retrieved in the top 3 ranking positions. After them, in the results list, there were 4
documents about access to education and school bullying. In the ontology, these two concepts are
siblings of the special education concept. In particular, they are subclasses of the educational
problems concept. Next, in the ranking, there were 7 documents (d_5, ..., d_8) highly related to
the query. These documents were annotated with the tipi:concept#Concept_937 (i.e., Specific
Needs of Educational Support, SNES). This concept is a direct child (subclass) of the special
education concept in the ontology. Its relationship with special education is stronger than that
of access to education and school bullying. However, its associated SR values have less
influence on the ranking scores than the TF-IDF values of the document entities.
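To make the difference between child and sibling relationships in this example concrete, the Python sketch below evaluates the SR measure and the similarity of Section 5.2. It is a minimal illustration under stated assumptions: it reads the SR formula as a product of three factors, uses α = 0.8, and takes a hypothetical hierarchy depth h(O) = 5, since the actual depth of the ontology is not reported. The function names are ours and do not correspond to the implemented system.

```python
import math


def semantic_relatedness(h1, h2, depth, alpha=0.8):
    """SR of two entities located h1 and h2 levels from their closest ancestor.

    h1 and h2 are 1 plus the number of levels to the common ancestor;
    depth is h(O), the depth of the ontology hierarchy (assumed value).
    """
    factor_distance = 1.0 - (alpha / depth) * abs(h1 - h2) / (h1 + h2)
    factor_closeness = 1.0 / min(h1, h2)
    factor_branch = 1.0 - (max(h1, h2) - 1) / depth
    return factor_distance * factor_closeness * factor_branch


def gvsm_similarity(doc, query, sr):
    """GVSM similarity between sparse weight vectors.

    doc and query map entity identifiers to weights; entities missing from a
    mapping have weight 0. sr is a callable giving SR(e_i, e_j).
    """
    numerator = sum(
        w_d * w_q * sr(e_i, e_j)
        for e_i, w_d in doc.items()
        for e_j, w_q in query.items()
    )
    norm = math.sqrt(sum(w * w for w in doc.values())) * math.sqrt(
        sum(w * w for w in query.values())
    )
    return numerator / norm if norm else 0.0


if __name__ == "__main__":
    DEPTH = 5  # hypothetical h(O)
    # Direct child: the ancestor is the parent itself, so h1 = 1 and h2 = 2.
    child = semantic_relatedness(1, 2, DEPTH)
    # Siblings: both entities are one level below their common parent.
    sibling = semantic_relatedness(2, 2, DEPTH)
    print(child, sibling)  # the child relation scores higher than the sibling one
```

Under these assumptions a direct subclass is scored as more related to the query concept than a sibling class, which matches the qualitative ordering described in the example; the absolute values depend on the assumed depth.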


Table 2: Number of indexed documents with parsed content.

 Document type                                                Num.
 Non-law proposals in plenary session                         40
 Law proposals by parliamentary groups                        13
 Law proposals by deputies                                    2
 Law proposals by autonomous cities and communities           10
 Popular legislative initiatives                              4
 Proposals to reform the congress rules                       3

Table 3: Average precision P@N values for each method.

 Method          P@5           P@10         P@15          P@20
 sim_key         0.633         0.483        0.422         0.358
 sim_sem_all     0.733         0.550        0.500         0.492
 sim_sem_max     0.733         0.683        0.656         0.600

    Boosting the semantic relatedness in the ranking process, the sim_sem_max method generated a
result list where documents d_5, ..., d_8 appear after the first 3 positions.
    Testing the methods on the Parlamento2030 corpus, we found other similar examples. We do not
present them due to space limitations. Moreover, aiming to show a more rigorous evaluation of
the methods, we performed a preliminary experiment, described next.
    The Parlamento2030 corpus had 3534 documents. From them, only 72 had textual content parsed.
Table 2 shows the types of these documents. The remaining documents had links to associated PDF
files from the Congress of Deputies. We indexed and annotated all the documents analyzing their
titles and their content if available. A total of 435 documents were annotated with 642 concept
annotations, meaning a coverage of 12.3% of the corpus. The Parlamento2030 system has topic tags
for 427 documents obtained through their language-dependent regular expressions.
    The obtained Parlamento2030 dataset did not have sets of queries and relevance judgements,
so we opted to conduct a user study. Three experts participated in the study. After reviewing
the document collection, each of them stated 5 queries. Then, they were requested to assess the
top 20 results provided by the 3 search methods for the 15 queries, stating whether each
retrieved document was non relevant, relevant or highly relevant for the corresponding query.
The Fleiss’ kappa inter-rater agreement coefficient was κ = 0.984, meaning an almost perfect
agreement between the assessors’ judgements. The experts also evaluated the correctness of the
annotations of the documents, measuring an accuracy of 98.7% for the semantic annotation
process.
    Table 3 reports the average precision values P@N of the three methods for the top
N = 5, 10, 15, 20 results. Relevant and highly relevant assessments were considered as positive
judgements. The reported values show that the proposed semantic methods outperform the VSM-based
sim_key method. Moreover, they show that limiting the ontological expansion of matched concepts,
as done by sim_sem_max, improves the accuracy achieved by sim_sem_all.

7   CONCLUSIONS
In this paper, we have presented a novel approach to ontology-based annotation and retrieval of
parliamentary content. The approach was built upon domain ontologies that cover a large number
of topics related to the United Nations taxonomy of sustainable development goals. As a proof of
concept, the approach was instantiated and preliminarily evaluated on a corpus extracted from
the Spanish Congress of Deputies, and used by the Parlamento2030 platform. The approach is in a
preliminary stage. Next, we describe several future research lines.
   The ontological schema of the approach is composed of three main classes, namely Goal, Target
and Concept, and three properties to relate the class individuals, namely hasGoal, hasTarget and
subClassOf (i.e., subConceptOf). As shown in the paper, exploiting these properties leads to
semantically enhanced search results. However, more specific properties between particular
entities could be considered as well. For instance, we may have inter-domain properties that
relate concepts, such as Job security (from G1: No poverty) equivalentTo Precarious work (G8:
Decent work and economic growth), and targets, such as Promoting renewable energy (G7:
Affordable and clean energy) impactsOn Decreasing pollution (G13: Climate action), which
impactsOn Disease prevention (G3: Good health and well-being). Exploiting these and other types
of relationships could enrich the semantic inference capabilities of the retrieval method, which
would be able to better find related documents. Properties and relationships may be defined
manually or extracted automatically by mining external knowledge bases.
   Moreover, our semantic annotation method is able to extract named entities, such as proper
nouns of people, organizations and places. We may extend the ontology to model these issues. For
instance, we may have classes and properties describing and relating cities and administrative
divisions. Then, we could exploit this information to find government initiatives according to
particular demographic, sociocultural, political and economic attributes of target locations, or
in terms of geographic relationships between such locations, e.g., educational initiatives
proposed for a certain surrounding area.
   Modeling and extracting time aspects of parliamentary content are left as future work.
Temporal traceability of addressed goals and targets, and time-based government initiative
similarities are examples of interesting functionalities for a parliamentary information
retrieval system. We also envision the extension of the annotation method to work at
argumentation level [2, 6, 13, 18], aiming to automatically extract argumentative structures
from input texts. In particular, we propose to develop a method able to identify arguments, as
well as components from them, such as facts, statements and predictions. The extracted arguments
would serve as summaries of existing debates and proposals, and may be used as instruments to
measure and monitor ideological and activity dynamics in the Parliament.
   Our approach supports multilinguality, but we only implemented it with a vocabulary in
Spanish. A further step would be to translate the generated ontology entities into other
languages, starting with English. For both the ontologies and semantic annotations, we plan to
link their entities to external knowledge bases, following the
Linked Open Data initiative. In this context, we would make all the generated
resources publicly available as RDF repositories.
   The proposed information retrieval method, which makes use of a novel
semantic relatedness metric between ontology entities, could be adapted to
provide personalized search and recommendation. For this purpose, as done in
previous work [5, 22, 26], we will have to model individual or
stereotype-based user profiles, taking privacy and application issues into
account.
   Lastly, an exhaustive evaluation of the proposed approach has yet to be
conducted. To this end, we plan to perform both offline experiments and online
user studies. For the latter, the approach will be integrated into the
Parlamento2030 platform.

ACKNOWLEDGMENTS
This work was supported by the Spanish Ministry of Science and Innovation
(PID2019-108965GB-I00). The authors acknowledge CIECODE for providing the
dataset used in this work, and thank its director (Javier Pérez) and members
(Pablo Martín, Belén Agüero and Irene Matín) for their help and support in the
project. They also thank Alejandro Bellogín for his help on the regular
expression processing.

REFERENCES
 [1] Flora Amato, Antonino Mazzeo, Vincenzo Moscato, and Antonio Picariello. 2009.
     A system for semantic retrieval and long-term preservation of multimedia
     documents in the e-government domain. International Journal of Web and Grid
     Services 5, 4 (2009), 323–338.
 [2] Claire Cardie, Cynthia R Farina, Matt Rawding, and Adil Aijaz. 2008. An
     eRulemaking corpus: Identifying substantive issues in public comments. In
     Proceedings of the 11th International Conference on Language Resources and
     Evaluation.
 [3] Miguel Ángel Corella and Pablo Castells. 2006. A heuristic approach to semantic
     web services classification. In Proceedings of the 10th International Conference
     on Knowledge-Based and Intelligent Information and Engineering Systems.
     Springer, 598–605.
 [4] Luis M de Campos, Juan M Fernández-Luna, Juan F Huete, and Carlos J
     Martín-Dancausa. 2008. An integrated system for accessing the digital library
     of the Parliament of Andalusia: Segmentation, annotation and retrieval of
     transcriptions and videos. In Proceedings of the 8th International Workshop on
     Pattern Recognition in Information Systems - Volume 1. SciTePress, 38–47.
 [5] Luis M de Campos, Juan M Fernández-Luna, Juan F Huete, and Luis
     Redondo-Expósito. 2017. Comparing machine learning and information
     retrieval-based approaches for filtering documents in a parliamentary setting.
     In Proceedings of the 11th International Conference on Scalable Uncertainty
     Management. Springer, 64–77.
 [6] Vlad Eidelman and Brian Grom. 2019. Argument Identification in Public
     Comments from eRulemaking. In Proceedings of the 17th International
     Conference on Artificial Intelligence and Law. 199–203.
 [7] Miriam Fernández, Iván Cantador, Vanesa López, David Vallet, Pablo Castells, and
     Enrico Motta. 2011. Semantically enhanced Information Retrieval: An
     ontology-based approach. Journal of Web Semantics 9, 4 (2011), 434–452.
 [8] Juan M Fernández-Luna, Juan F Huete, Manuel Gómez, and Carlos J
     Martín-Dancausa. 2008. Development of the XML digital library from the
     parliament of Andalucía for intelligent structured retrieval. In Proceedings of
     the 17th International Symposium on Methodologies for Intelligent Systems.
     Springer, 417–423.
 [9] Evgeniy Gabrilovich and Shaul Markovitch. 2009. Wikipedia-based semantic
     interpretation for natural language processing. Journal of Artificial Intelligence
     Research 34 (2009), 443–498.
[10] Hugo C Hoeschl, Tânia Cristina D Bueno, Andre Bortolon, Eduardo S Mattos,
     Marcelo S Ribeiro, Irineu Theiss, and Ricardo Miranda Barcia. 2004. An intelligent
     search engine for electronic government applications for the resolutions of the
     United Nations Security Council. In Building the E-Service Society. Springer,
     23–41.
[11] Rianne Kaptein and Maarten Marx. 2010. Focused retrieval and result aggregation
     with political data. Information Retrieval 13, 5 (2010), 412–433.
[12] Xiaoxing Liu and Changxia Hu. 2012. Research and design on e-government
     information retrieval model. Procedia Engineering 29 (2012), 3170–3174.
[13] Anastasios Lytos, Thomas Lagkas, Panagiotis Sarigiannidis, and Kalina Bontcheva.
     2019. The evolution of argumentation mining: From models to social media and
     emerging tools. Information Processing & Management 56, 6 (2019), 102055.
[14] Antonio Martín and Carlos León. 2015. Semantic framework for an efficient
     information retrieval in the e-government repositories. In Handbook of Research
     on Democratic Strategies and Citizen-Centered E-Government Services. IGI Global,
     192–213.
[15] Pablo N Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. 2011.
     DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of
     the 7th International Conference on Semantic Systems. 1–8.
[16] David Milne and Ian H Witten. 2008. Learning to link with Wikipedia. In
     Proceedings of the 17th ACM Conference on Information and Knowledge
     Management. 509–518.
[17] Silvio Moreira, David Batista, Paula Carvalho, Francisco M Couto, and Mário J
     Silva. 2011. POWER - Politics Ontology for Web Entity Retrieval. In Proceedings
     of the 23rd International Conference on Advanced Information Systems Engineering.
     Springer, 489–500.
[18] Gaku Morio. 2018. Annotating Online Civic Discussion Threads for Argument
     Mining. In 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI).
     IEEE, 546–553.
[19] Mark A Musen. 2015. The protégé project: A look back and a look forward. AI
     Matters 1, 4 (2015), 4–12.
[20] Fedelucio Narducci, Matteo Palmonari, and Giovanni Semeraro. 2013.
     Cross-language semantic retrieval and linking of e-gov services. In Proceedings
     of the 12th International Semantic Web Conference. Springer, 130–145.
[21] Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. A vector space model
     for automatic indexing. Communications of the ACM 18, 11 (1975), 613–620.
[22] Elena Sánchez-Nielsen and Francisco Chávez-Gutiérrez. 2008. Personalized
     and on-demand retrieval of parliamentary proceedings with social feedback on
     elected representatives. In Proceedings of the 21st Annual Conference on Legal
     Knowledge and Information Systems. IOS Press, 53–62.
[23] Elena Sánchez-Nielsen, Francisco Chávez-Gutiérrez, and Javier Lorenzo-Navarro.
     2019. A semantic parliamentary multimedia approach for retrieval of video clips
     with content understanding. Multimedia Systems 25, 4 (2019), 337–354.
[24] Nigel Shadbolt, Kieron O’Hara, Tim Berners-Lee, Nicholas Gibbins, Hugh Glaser,
     Wendy Hall, et al. 2012. Linked open government data: Lessons from data.gov.uk.
     IEEE Intelligent Systems 27, 3 (2012), 16–24.
[25] George Tsatsaronis and Vicky Panagiotopoulou. 2009. A generalized vector
     space model for text retrieval based on semantic relatedness. In Proceedings of
     the Student Research Workshop at EACL 2009. 70–78.
[26] Eduardo Vicente-López, Luis M de Campos, Juan M Fernández-Luna, and Juan F
     Huete. 2016. Use of textual and conceptual profiles for personalized retrieval of
     political documents. Knowledge-Based Systems 112 (2016), 127–141.
[27] SK Michael Wong, Wojciech Ziarko, and Patrick CN Wong. 1985. Generalized
     vector spaces model in information retrieval. In Proceedings of the 8th Annual
     International ACM SIGIR Conference on Research and Development in Information
     Retrieval. 18–25.