Semantic Annotation and Retrieval of Parliamentary Content: A Case Study on the Spanish Congress of Deputies

Iván Cantador (ivan.cantador@uam.es), Escuela Politécnica Superior, Universidad Autónoma de Madrid, Madrid, Spain
Lara Quijano-Sánchez (lara.quijano@uam.es), Escuela Politécnica Superior, Universidad Autónoma de Madrid, Madrid, Spain

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

ABSTRACT
In this paper, we present an ontology-based annotation and retrieval approach for parliamentary content, such as debate transcripts and law proposals. Exploiting a number of domain ontologies, semantic web technologies and information retrieval techniques, our approach extracts topics, concepts and named entities (e.g., names of politicians and political parties) appearing in input documents. The domain ontologies were designed to support multilinguality, and were built from the United Nations taxonomy of sustainable development goals. The approach was instantiated with a text corpus extracted from the Spanish Congress of Deputies and is being integrated into an e-government platform.

CCS CONCEPTS
• Applied computing → Computing in government; • Information systems → Ontologies; Information extraction; Information retrieval.

KEYWORDS
parliamentary content, semantic annotation, argument extraction, ontology-based information retrieval

1 INTRODUCTION
Managing and publicly providing digital libraries on parliamentary activity are essential tasks for open government, promoting democracy, enhancing transparency, and facilitating accountability. However, the amount of multimedia content recording the debates and proposals generated by parliaments is huge and ever-increasing. This, together with the unstructured nature of such content, makes its organization, access and retrieval challenging.

As stated in [12], metadata facilitates the classification, storage and retrieval of e-government resources. It summarizes the available contents, allows users to manage, find and access the resources, helps to understand and determine whether the corresponding information meets particular requirements, and, thanks to a consistent description of the data, promotes its sharing and exchange.

Public administrations are aware of the advantages of sharing open government data with regard to transparency, stakeholder collaboration, improved services, and new economic activities [20]. Hence, in the last decade, there has been a large increase in initiatives to publish and interlink government data and services. This has been facilitated by the use of semantic web technologies and standards, and the generation of Linked Open Data (LOD) [24].

In this context, researchers have developed approaches to automatically generate semantic annotations for parliamentary content of diverse types, such as laws, political programs, and transcriptions compiling interventions of parliament members in plenary meetings. The majority of these approaches have focused on identifying certain parliament entities such as political groups and representative members [23], and a limited number of topics addressed within the input text documents [8]. Moreover, in general, they only support content in a single language [20].

Aiming to address these limitations, we present a first version of an ontology-based approach that makes use of information retrieval techniques and semantic web technologies to annotate and retrieve parliamentary contents in multiple languages. More specifically, our approach is built upon a knowledge base composed of ontologies covering the United Nations taxonomy of sustainable development goals, which are related to a variety of domains, such as education, economy, natural resources, climate change, and social rights. The approach identifies concepts (i.e., classes) and instances (i.e., class individuals) of the above ontologies in input text documents, by means of information retrieval techniques applied to indices created from multilingual labels of the ontology concepts. The extracted concepts not only represent several levels of thematic annotations, but also allow computing ontology-based similarities that enhance the retrieval of semantically related search and recommendation results beyond keyword-based matching.

As a proof of concept, the approach has been instantiated and preliminarily evaluated on a text corpus managed by Parlamento2030 (https://www.parlamento2030.es/), an online platform that monitors parliamentary activity in the Spanish Congress of Deputies. A user study on search tasks shows the benefits of the semantically enhanced annotation and retrieval results provided by our approach.

The remainder of the paper is structured as follows. In Section 2, we survey related work on both retrieval of parliamentary content and semantic retrieval for e-government applications. In Section 3, we introduce the case study addressed with the Parlamento2030 platform. Next, in Sections 4 and 5, we present the proposed approach, distinguishing between its knowledge base building, semantic annotation, and ontology-based retrieval methods. Then, in Section 6, we present preliminary results from a user study on search tasks.

2 RELATED WORK
Our approach is built upon an ontology-based framework for annotating and retrieving parliamentary content. Hence, in this section, we survey related work by distinguishing between specific approaches to information retrieval in parliamentary contexts and more general approaches to semantic annotation and retrieval in e-government applications.

2.1 Retrieval of Parliamentary Content
In [22], Sánchez-Nielsen et al. conceptualized a smart system where citizens interact with elected representatives in Parliaments, access Parliament proceedings, and subscribe to new parliamentary contents. Among other issues, the authors proposed the use of semantic technologies and recommender systems to incorporate models for concept and knowledge representation, manual and automatic annotation of textual content from plenary sessions, fragmentation of audiovisual content, provision of customized feeds and information retrieval, and production of automatic reports. Recently, in [23], the authors presented an approach to automatically annotate video transcriptions of the debates that occurred in plenary meetings of the Canary Islands Parliament, in Spain. Specifically, the annotations, expressed as RDF semantic data, were associated with concepts belonging to an ad hoc ontology that modeled legislation, representation and parliamentary activity concepts, such as legislatures, legislative proposals, political groups, representative members, sessions, interventions and votes. Exploiting the generated annotations, the authors developed a prototype aimed at retrieving video clips that fulfill a user’s specified need on the parliamentary activity, as well as contextual information for content understanding.

In a series of works ([4, 5, 8, 26]), De Campos, Fernández-Luna, Huete and colleagues investigated information retrieval approaches for the Parliament of Andalusia, in Spain. In [8], the authors presented an XML digital library automatically created from the official documents (published in PDF) of the parliament session diaries, i.e., the transcriptions of all the deputies’ interventions in plenary and commission sessions. The generated XML files contained simple metadata, such as session date, starting and ending times, agenda points, addressed topics, and vote results. Making use of such XML structures, in [4], the authors developed a system to support the manual annotation of certain parts of the transcriptions with their corresponding segments in videos recorded during the parliament sessions. Exploiting the generated annotations, a search engine for structured documents based on Bayesian Networks and Influence Diagrams was tested to retrieve parts of the transcriptions and videos relevant to a given query. More recently, in [26], the authors enhanced their search engine with personalization capabilities. In particular, motivated by the need to maintain the user’s privacy in political contexts, the retrieval model was adapted to a number of content-based stereotype profiles, which were built through several term and category weighting techniques. Lastly, in [5], instead of addressing personalized search, the authors focused on an information filtering task, where the members of parliament receive personalized recommendations of those documents that may be relevant for them. Through empirical comparison, the evaluated information retrieval- and machine learning-based methods did not show significant performance differences.

Differently to the previous works, in [11], Kaptein and Marx attempted to extract high level semantic annotations from parliamentary debates, aiming to summarize and visualize the narrative structure of meetings as tables of topics and intervention graphs. The authors developed a search engine that exploited the generated annotations, enabling the provision of entry points to documents (in XML), grouping of search results, and faceted data exploration. In a user study on a large dataset with official transcripts of meetings from the Dutch Parliament, users reported that, in comparison to a standard document retrieval system, the proposed search engine provided a better overview of the data.

In addition to identifying certain parliament entities (e.g., political groups and representative members), as done in [23], we also aim to extract thematic annotations at several semantic levels, namely topics and domains. Similarly to the approaches presented in [4, 8, 11], ours processes transcriptions stored in text documents. However, instead of generating annotations in XML, our system produces RDF tuples, which could be linked to external semantic web repositories. We note that the RDF annotations of [23] refer to a limited vocabulary of legislation, representation, and parliamentary activities, whereas our annotations correspond to concepts from a variety of domains. The personalization of search results, as proposed by [5, 26], is left as future work.

2.2 Semantic Retrieval in e-Government
The Olimpo system [10] presents one of the first reported approaches to semantic search for an e-government application. In particular, the system applied case-based reasoning and information retrieval techniques to find documents similar to an input list of documents provided by the user through an iterative process. Implemented for searching United Nations (UN) security resolutions, the system made use of the structured representation of documents to identify and extract from the documents a variety of attributes, such as subjects, dates, institution acronyms, country names, number of decisions, and text parts with higher occurrence of indicative expressions of the resolutions. More recently, Liu and Hu [12] also presented an approach to extract metadata from e-government information resources, and exploit it to provide semantic search functionalities. In this case, the identification of attributes was conducted by means of lexical analysis (i.e., part-of-speech labeling), stop words removal, and term frequency-based filtering for the Chinese language. The generated metadata was stored in XML, but the authors proposed its transformation to RDF.

In the previous works, the exploited metadata consisted of a limited set of attributes without semantic structure or relationships between them. Differently, others have proposed the use of ontologies as knowledge representation frameworks whose interlinked concepts form the fundamental elements of the document annotations used by the search engines. In this context, semantic web technologies, e.g., RDF and OWL, represent a popular trend in the literature.

Amato, Mazzeo and Picariello [1] presented a system that exploits ontologies and NLP techniques to annotate e-government multimedia documents. The ontologies contained both domain knowledge and lexical vocabularies for Italian and English languages. In the paper, the authors did not provide descriptions of the domain ontologies. They, nonetheless, implemented an information retrieval prototype where a user study was conducted on a small collection of 60 criminal law and juridical documents, showing preliminary accuracy results of the annotation process. In [17], Moreira et al. presented POWER, an ontology of political processes designed to track politicians, political organizations, and elections in social media. The authors also presented EMPOWERED, a framework to populate the POWER ontology with information extracted from various resources. The authors assessed the framework on the Portuguese Government and National Elections Committee websites, retrieving 3.6K politicians, 3K terms related to political institutions, and 74 political associations from mandates that have taken place since 1976. They did not explain the developed semantic annotation method, and proposed to exploit the annotations for information retrieval tasks (i.e., expert finding and question answering) and to align the ontologies with Linked Open Data repositories, such as DBpedia, YAGO and FOAF. Motivated by the benefits of LOD to support interoperability between European administrations and to improve the information access for citizens across Europe, in [20], Narducci, Palmonary and Semeraro presented CroSeR, a cross-language e-government service retriever for different European languages. The underlying semantic annotation algorithm of the system enriched the short descriptions of the services with labels extracted from Wikipedia concepts related to the services. In particular, it used Explicit Semantic Analysis [9], which disambiguates a word meaning through a semantic similarity with Wikipedia concepts. The authors carried out an empirical evaluation consisting of an information retrieval task on a catalogue of 2.4K services in five different languages (namely Dutch, Belgian, German, Norwegian and Swedish), comparing CroSeR with well known semantic annotators, such as Wikipedia Miner [16] and DBpedia Spotlight [15]. More recently, in [14], the authors presented OntoloGov, a system that supports interoperability between e-government repositories, and provides semantic search functionalities. More specifically, the system performed a metadata extraction process, made use of ontology-based contextual user profiles, and applied case-based reasoning to support knowledge retrieval.

As done in [10], we consider content generated by the UN, but, instead of focusing on vocabularies associated with its security resolutions, we use its taxonomy of sustainable development goals, covering various domains. Similarly to the approach presented in [12], we make use of natural language processing techniques and resources for the semantic annotation process. Moreover, as done in [1] and [17], we propose ontology-based representations. However, the former did not provide descriptions of its domain ontologies, and the latter used a limited ontology modeling political processes. Lastly, as in [20], supporting multilinguality and generating LOD represent two of the principal requirements for our system. Differently to that work, we leave the rigorous evaluation of generic semantic annotation and retrieval methods for the future.

3 CASE STUDY
Our approach is intended to be integrated into Parlamento2030, an online platform that monitors parliamentary activity in the Spanish Parliament. For such integration, the approach has been implemented and preliminarily evaluated on a dataset managed by Parlamento2030. In Section 4, we describe this dataset and the knowledge base we have generated from it. Before that, we introduce the Spanish Parliament and the Parlamento2030 tool.

3.1 The Spanish Congress of Deputies
The Cortes Generales (lit. General Courts) are the Spanish Parliament. Established and regulated in the Constitution of Spain, they form a bicameral legislative chamber, consisting of the Senate (Senado), i.e., the upper house, and the Congress of Deputies (Congreso de los Diputados, http://www.congreso.es), i.e., the lower house.

In the Congress of Deputies, parliamentarians form working groups, called commissions, in different areas of interest (e.g., economy, education, healthcare, agriculture, etc.), and discuss initiatives related to the commissions they belong to. The proposals generated by the commissions are presented and debated in plenary sessions. Certain proposals are then formalized as law projects, which have to be voted on and approved by a majority of the deputies for their implementation as laws. The Senate, which represents the territorial regions in Spain and supervises the work done by the Congress of Deputies, does not propose laws, but revises and suggests changes to the law projects provided by the deputies.

As a form of government transparency, both law proposals and Parliament session diaries are available online as HTML and PDF documents. The Parlamento2030 platform crawls, scrapes and categorizes such diaries to provide search functionalities, which we describe next.

3.2 The Parlamento2030 platform
Forming part of the Salvador Soler Foundation (Fundación Salvador Soler, https://unmundosalvadorsoler.org), CIECODE (https://www.ciecode.es), Centro de Investigación y Estudios sobre Coherencia y Desarrollo (lit. Center for Research and Studies on Coherence and Development), aims to analyze public policies and private practices of developed countries, inform about their effects on developing countries, and make proposals to move towards a more egalitarian society and fair world.

In particular, CIECODE implements innovative projects on access to political information. Among them, Parlamento2030 is an online platform that monitors parliamentary activity in the Spanish Congress of Deputies, aiming to promote an active, informed and demanding citizenship and a responsible political class subject to public scrutiny. It is an adaptation of the TIPI Ciudadano tool (https://tipiciudadano.es) and is built upon the CIECODE Political Watch open-source framework (https://github.com/politicalwatch).

The Parlamento2030 tool scans all the political activities of the Congress of Deputies by crawling the activity transcriptions publicly available at the congress website, and automatically categorizing the crawled content according to their relationships with 17 priority thematic areas for poverty, social justice and sustainable development. More specifically, it annotates the parliamentary content by keyword matching using a vocabulary with more than 3K Spanish terms provided by expert individuals and organizations in each of the areas.

The platform has a search engine with which the user can refine information filtering queries based on multiple criteria, such as author, date, theme and keyword (see Figure 1). It also offers a personalized system of alerts that allows a user to be up to date on the political news of her topics of interest. Parlamento2030’s code and data can be freely accessed and downloaded.

Figure 1: Screenshots of Parlamento2030 search form and result pages.

Our approach is built and tested on a Parlamento2030 dataset composed of the above mentioned 17 thematic vocabularies and a collection of structured transcriptions of parliamentary activity. In the next section, we describe the dataset, as well as the semantic knowledge base and annotations our approach generates from such dataset.

4 KNOWLEDGE BASE AND SEMANTIC ANNOTATIONS
In this section, we present the knowledge base building and semantic annotation methods we developed. Before that, we introduce the original Parlamento2030 dataset on which we run the above methods.

4.1 Original dataset
The Parlamento2030 system performs keyword matching heuristics based on regular expressions to identify the topics of the textual content periodically published in the website of the Spanish Congress of Deputies. The regular expressions were manually generated and curated by the experts that built the domain vocabularies.

The topics correspond to the 17 Sustainable Development Goals (SDGs) established by the United Nations (https://www.un.org/sustainabledevelopment): no poverty (G1), zero hunger (G2), good health and well-being (G3), quality education (G4), gender equality (G5), clean water and sanitation (G6), affordable and clean energy (G7), decent work and economic growth (G8), industry, innovation and infrastructure (G9), reduced inequality (G10), sustainable cities and communities (G11), responsible consumption and production (G12), climate action (G13), life below water (G14), life on land (G15), peace and justice strong institutions (G16), and cooperation and alliances (G17). These goals are aligned with Agenda 2030, the global action plan and commitment to eradicate poverty and achieve sustainable development by 2030 worldwide.

For each of these goals, in cooperation with CIECODE members, experts on different domains elaborated a list of targets, i.e., issues of interest and relevant problems to be addressed by governments and public administrations. Each target was split into concepts, which are represented by a set of keywords in Spanish. More specifically, the Parlamento2030 system uses a vocabulary of more than 3K concepts, where each concept has associated a number of regular expressions that generate keywords according to singular and plural forms, morphological deviations (e.g., masculine and feminine forms, articles and prepositions linked to nouns), adjectives and adverbs, abbreviations and acronyms. For example, for the Quality education (“Educación de calidad”) target, the concept “ayudas oficiales al desarrollo en educación” (lit. official development grants in education) has keywords such as “ayudas oficiales al desarrollo en educación”, “ayuda oficial al desarrollo en educación”, “ayuda oficial al desarrollo educativo,” and “AOD en educación.” For a better comprehension, Table 1 shows some examples of concepts and keywords of the dataset translated into English.

Table 1: Examples of concepts and keywords of the dataset.

Domain    | Concept                        | Keywords
Poverty   | Unemployment and vulnerability | unemployment and vulnerability, unemployed and vulnerability, vulnerable unemployed
Poverty   | Global poverty rate            | global poverty rate, worldwide poverty rate, international poverty rate
Education | Reading and math skills        | reading and math skills, reading and math proficiency
Education | STEM careers and women         | STEM careers and women, STEM degrees and women, STEM and women
Education | ANECA                          | ANECA, National Agency for Quality Assessment and Accreditation

4.2 Knowledge base
Our approach uses a semantic knowledge base built from the Parlamento2030 dataset. In particular, the knowledge base integrates domain ontologies that follow the schema shown in Figure 2.

The schema has 3 main classes, namely Goal, Target and Concept. The Goal subclasses and individuals correspond to the 17 United Nations SDGs. The Target subclasses and individuals are associated with domain targets established in Parlamento2030 for each SDG. Lastly, the Concept subclasses and individuals correspond to topics (concepts) and keywords assigned to each target in Parlamento2030. A Concept subclass may have subclasses, forming partial taxonomies that represent the knowledge of the covered domains; see in Figure 3 the partial view of the ontology with some concepts on the Education domain. Moreover, through the rdfs:label property, the Concept individuals have one or more terms in different languages. These terms correspond to the keywords exploited in Parlamento2030; see examples in Table 1.

Each class, individual and property of the ontologies is identified by a language- and semantics-independent URI.
For instance, tipi:target#Target_4_6 is the URI of target T4.6 Literacy, related to goal G4 Quality education, whose URI is tipi:goal#Goal_4 (the prefix tipi: has to be replaced by http://ciecode.es/tipi/ontology/). To support multilinguality, every class and individual can have multiple String labels through the rdfs:label property. Hence, for example, the individual of T4.6 class may have label values such as “literacy” and “alfabetización” for English and Spanish, respectively.

Lastly, the individuals of Goal, Target and Concept subclasses are linked by means of the tipi:hasGoal and tipi:hasTarget properties. Following the above mentioned examples, the individual tipi:concept#Concept_942 (i.e., digital literacy) has target tipi:target#Target_4_6 (i.e., literacy), which is related to goal tipi:goal#Goal_4 (i.e., quality education). The individuals of Concept subclasses are created from the singular and acronym forms of every list of keywords available in Parlamento2030.

The building of classes and individuals was conducted automatically. The organization of the Concept subclasses (including the creation of new inner subclasses), in contrast, was done manually by experts, using the graphical user interface of the Protégé tool [19]. The whole knowledge base is composed of 169 targets, 3.6K concepts and 10.3K terms.

Figure 2: Ontology schema diagram.

Figure 3: Partial view of the Education domain ontology, visualized by means of its rdfs:label values in Spanish.

4.3 Semantic Annotations
Figure 4 shows the architecture of the implemented semantic annotation method, which makes use of two indices: an index for the input parliamentary documents and an index for our ontological knowledge base.

Figure 4: Semantic annotation framework.

Parlamento2030 crawls and scrapes the documents published in the Spanish Congress of Deputies website, providing JSON files with the content generated by parliamentarians (i.e., debates and law proposals) in plain text. From these files, our method builds
From these files, our method builds Cantador et al. an index (on the top right of the figure) using the Apache Lucene As an illustrative example, the next XML fragment shows some library. This index is used by a document index manager to retrieve semantic annotations generated for an input document about ac- a ranking of documents for a given keyword-based query. The tions planned by the Hydrographic Confederation, formed by Spanish- Portuguese committees for the management and care of the Duero documents are indexed by title, content and language separately. river. From keywords such as “embalses” and “presas” (lit. reser- Indexed terms are weighted with TF-IDF values computed on the voirs and dams), our method extracts “Recurso hídrico” (lit. water whole corpus of documents. For a query, the result list are limited resource), “Gestión integrada del agua” (lit. integrated water man- to the 100 documents and the ranking scores of the documents are agement), and “Agua limpia y saneamiento” (lit. clean water and normalized to sum 1. sanitation), as semantic annotations at concept, target and goal Our method builds a second Lucene index (in the middle of levels. the figure) to efficiently access to the information available in the < d o c _ i d >81 c 8 b b 4 0 6 e 5 2 a 1 e 8 0 0 9 6 1 9 e e 0 6 3 2 7 0 5 c 7 f 3 6 6 c a 6 < / d o c _ i d > domain ontologies. In this case, each ontology entity –i.e., class < d o c _ t i t l e > A c t u a c i o n e s p r e v i s t a s por l a Confederacion or individual– is indexed by domain (goal), target, concept name, Hidrografica . . . < / doc_title > concept keyword, RDF label, and language. The index also stores the URI of each entity. Similarly to the document index, there is an e m b a l s e s < / term > ontology index manager that retrieves a ranked list of entities for a < weight > 0 . 5 < / weight > given keyword-based query. 
The ontology index manager uses an ontology manager (on the left of the figure) to efficiently obtain all the ontology entities and p r e s a s < / term > their data. The latter was indeed the component that created the < weight > 0 . 5 < / weight > UN sustainable development ontologies from the Parlamento2030 dataset, and stored them in a RDF repository using the Apache Jena framework. Once the document and ontology indices are built, a document < c o n c e p t _ u r i > t i p i _ c o n c e p t : Concept_147 < concept_name > R e c u r s o h i d r i c o < / concept_name > annotation component (on the bottom right of the figure) generates < weight > 1 . 0 < / weight > XML files with the semantic annotations of the input documents. The annotation process is as follows (for a given language). The annotator launches on the document index several queries qe,k for each ontology entity e. The queries are composed of the k < t a r g e t _ u r i > t i p i _ t a r g e t : Target_6_5 terms associated to the entity name, keywords and labels, and thus < t a r g e t _ n a m e > G e s t i o n i n t e g r a d a d e l agua < / t a r g e t _ n a m e > generate several ranked lists of documents that contain the terms < weight > 1 . 0 < / weight > associated to e. Next, the obtained ranking scores se ,k (d) of the documents are aggregated into weights w e (d), which measure the relevance of each ontology entity e for each document d: < g o a l _ u r i > t i p i _ g o a l : Goal_6 < / g o a l _ u r i > Õ w e (d) = se,k (d) < goal_name >Agua l i m p i a y s a n e a m i e n t o < / goal_name > k ∈terms(d ) < weight > 1 . 0 < / weight > These weights are normalized by document, and are considered as semantic annotations of the documents at entity (concept) level. 
Through the relationships of the ontologies, we can compute We note that, in addition to the explained thematic annotations, weights w e (t) and w t (д), respectively establishing the relevance of our method also extracts annotations related to named entities, such entity e for target t and the relevance of target t for goal д. As a as proper nouns of people (e.g., parliamentarians), organizations first implementation, we set w e (t) = 1 if entity e (or the class of (e.g., government agencies, political parties) and places. e) is related to target t by means of the tipi:hasTarget property, and w t (д) = t →д w e (t) for those targets t that are related to goal Í 5 ONTOLOGY-BASED RETRIEVAL д by means of the tipi:hasGoal property. In this section, we present the developed semantic search method. Using the weights w e (t) and w t (д), our method computes weights Before, we describe the ontology-based document representation w t (d) and wд (d) that will be considered as semantic annotations model used by the method. of the documents at target and goal levels: w t (d) = Õ w e (t) · w e (d) 5.1 Document Representation e→t As we will explain in the next subsection, our search method is Õ built upon the well known Vector Space Model, VSM [21]. Hence, to wд (d) = w t (д) · w t (d) describe content documents, i.e., parliamentary debate transcripts t →д and law proposals, we make use of a vector representation. However, instead of the classical information retrieval representation based Apache Lucene, https://lucene.apache.org on terms, we propose to use semantically related concepts as units Apache Jena, https://jena.apache.org/ of information. 
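To make the concept-based representation more concrete, the following minimal Python sketch shows one way the annotation weights of Section 4.3 could be computed before being used as vector components. It is an illustration under stated assumptions, not the implementation of our system: the function names `entity_weights` and `propagate`, the dictionary layouts, and the per-document normalization to a sum of 1 are hypothetical choices, and the toy scores simply mirror the Duero river example.

```python
from collections import defaultdict


def entity_weights(scores):
    """Aggregate ranking scores s_{e,k}(d) into per-document weights w_e(d).

    `scores` maps entity -> term -> document -> score. The result maps
    document -> entity -> weight, normalized so that the weights of each
    document sum to 1 (an assumed normalization; the text only says
    "normalized by document").
    """
    raw = defaultdict(lambda: defaultdict(float))
    for entity, per_term in scores.items():
        for per_doc in per_term.values():
            for doc, score in per_doc.items():
                raw[doc][entity] += score  # sum over the terms of the entity
    weights = {}
    for doc, per_entity in raw.items():
        total = sum(per_entity.values())
        weights[doc] = {e: (w / total if total else 0.0) for e, w in per_entity.items()}
    return weights


def propagate(child_weights, parent_of, link_weight=None):
    """Lift weights one level up: w_parent(d) = sum of link * w_child(d).

    `parent_of` maps child URI -> parent URI (tipi:hasTarget or tipi:hasGoal).
    `link_weight` optionally maps (child, parent) -> weight; missing links
    default to 1, as in the first implementation of w_e(t).
    """
    link_weight = link_weight or {}
    lifted = defaultdict(float)
    for child, weight in child_weights.items():
        parent = parent_of.get(child)
        if parent is not None:
            lifted[parent] += link_weight.get((child, parent), 1.0) * weight
    return dict(lifted)


# Toy data mirroring the Duero example: two keywords of a single concept.
scores = {"tipi_concept:Concept_147": {"embalses": {"doc1": 0.5}, "presas": {"doc1": 0.5}}}
has_target = {"tipi_concept:Concept_147": "tipi_target:Target_6_5"}
has_goal = {"tipi_target:Target_6_5": "tipi_goal:Goal_6"}

concept_w = entity_weights(scores)["doc1"]   # concept-level annotations
target_w = propagate(concept_w, has_target)  # target-level annotations
goal_w = propagate(target_w, has_goal)       # goal-level annotations
print(concept_w, target_w, goal_w)
```

The link weights between targets and goals are passed in by the caller rather than derived inside the sketch, so any definition of $w_t(g)$ can be plugged in without changing the propagation step.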
The explicit relationships between concepts allow computing semantic relatedness, and address the VSM limitation of term vector pairwise orthogonality (i.e., linear independence), an unrealistic assumption where any pair of terms do not relate to each other [25, 27]. Using concepts also allows avoiding ambiguities of polysemic terms and applying semantic inference through the concept relationships at retrieval stage [7].

Formally, let $O = \{E, R\}$ be an ontology composed of entities $E = \{C, I\}$ that can be classes $C$ or individuals $I$, and relationships $R : E \to E \cup L$ that link pairs of entities or entities with literals (e.g., numeric or string). Let $e_1, e_2, \ldots, e_N \in E$ belong to an $N$-dimensional Euclidean space. A document $d$ is represented as a vector $\mathbf{d} = (w_{d,1}, \ldots, w_{d,N}) \in \mathbb{R}^N$, where the weight $w_{d,n}$ corresponds to the semantic annotation score computed as explained in Section 4.3. In this paper, the considered relationships are the properties of our knowledge base (Section 4.2), and the entities of a given document correspond to its semantic annotations extracted by our approach (Section 4.3). Next, we explain how the relationships are used by the proposed search method.

5.2 Search Method
Our search method is based on the Generalized Vector Space Model, GVSM [27] proposed by Tsatsaronis and Panagiotopoulou [25], which incorporates a semantic relatedness (SR) measure into the term-based vector similarity of the VSM.

Let $\mathbf{d} = (w_{d,1}, \ldots, w_{d,N}) \in \mathbb{R}^N$ and $\mathbf{q} = (w_{q,1}, \ldots, w_{q,N}) \in \mathbb{R}^N$ be the weight vectors of document $d$ and query $q$, respectively. The value $w_{q,n}$ is set to 1 if entity $e_n$ appears in the input query $q$, and to 0 otherwise. Considering a typical keyword-based search scenario, we map the keywords of the user’s query to entities. This is done by the OntologyManager explained in Section 4.3 (Figure 4), by exact matching of the keywords with the entity terms (see Figure 2).

We define the similarity between $d$ and $q$ as follows:

$$\mathrm{sim}(d, q) = \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} w_{d,i} \cdot w_{q,j} \cdot SR(e_i, e_j)}{\sqrt{\sum_{i=1}^{N} w_{d,i}^2} \cdot \sqrt{\sum_{i=1}^{N} w_{q,i}^2}}$$

To implement this similarity, we propose to use the semantic relatedness metric proposed by Corella and Castells in [3], which we explain next. With $E$ being the set of existing ontology entities, the semantic relatedness $SR$ between two entities $e_1, e_2 \in E$ is measured in terms of their distance in the ontology hierarchy as follows. Let $e_0$ be the closest ancestor (super class) to $e_1$ and $e_2$ in the ontology hierarchy, and let $h_1 = 1 + \mathrm{dist}(e_1, e_0)$ (and $h_2 = 1 + \mathrm{dist}(e_2, e_0)$) be 1 plus the number of levels between $e_1$ (and $e_2$) and $e_0$ in the ontology hierarchy. We define the semantic relatedness between entities $e_1$ and $e_2$ as:

[Equation for $SR(e_1, e_2)$, a product of three factors. Only the terms $\alpha\,|h_1 - h_2|$, $\max(h_1, h_2) - 1$ and $1$ survive here; the rest of the expression is missing.]

• ... SR value, even for the most dissimilar entities, so that $SR$ ranges in certain interval $[a, 1]$ with $a > 0$. The authors set $\alpha = 0.8$ in their experiments.
• The second factor increases $SR$ proportionally to the closeness of the two entities to their common ancestor $e_0$. Let us consider cases where $h_1 = h_2$, for which the first factor of $SR$ equals 1. The second factor allows establishing a higher $SR$ value for the case $h_1 = h_2 = x$ than the case $h_1 = h_2 = y$ if $x < y$.
• The third factor decreases $SR$ when $e_1$ and $e_2$ are in the same branch of the ontology class hierarchy; that is, they are not sibling classes or instances.

6 EXPERIMENTS
To show the semantic capabilities of the proposed search method, we tested 3 different variants of the query-document similarity given in Section 5.2. The first similarity, $\mathrm{sim}_{key}$, is implemented with $SR(e_1, e_2) = 1, \forall e_1, e_2 \in E$, i.e., it does not exploit ontological relationships between entities, and thus it is equivalent to the classical VSM. This version is limited to matching of query and document key terms. The second similarity, $\mathrm{sim}_{sem\_all}$, considers the $SR(e_1, e_2)$ value of every entity in the document and query vectors. It thus exploits any ontological relationship underlying the query and document. This version favors the retrieval of a large number of documents belonging to the domain of the query. Lastly, the third similarity, $\mathrm{sim}_{sem\_max}$, only applies the maximum $SR(e_1, e_2)$ value for the query and document entities. This version also extends the keyword matching, but focuses on the strongest relationships according to the closeness of the entities in the domain ontology.

For a better comprehension of the experimental results, we present next a real search example for the query “educación especial” (lit. special education). Running this query on our Parlamento2030 corpus, the $\mathrm{sim}_{key}$ method only retrieved 3 documents ($d_1$, $d_2$, $d_3$) having the term “educación especial.” These documents had been annotated with tipi:concept#Concept_929 (i.e., the special education concept), which, in the ontology, has associated terms such as specific needs education, aided education, and exceptional education. The method omitted these terms since it is based on the matching of the query keywords.

The $\mathrm{sim}_{sem\_all}$ method, in contrast, was able to retrieve a total of 45 documents, all of them belonging to the education domain. As for the first method, the documents $d_1$, $d_2$ and $d_3$ were retrieved in the top 3 ranking positions. After them, in the results list, there were 4 documents about access to education and school bullying. In the ontology, these two concepts are siblings of the special education concept. In particular, they are subclasses of the educational problems concept. Next, in the ranking, there were 7 documents ($d_5, \ldots, d_8$) highly related to the query.
These docu- h(O) h 1 + h 2 min(h 1, h 2 ) h(O) ments were annotated with the tipi:concept#Concept_937 (i.e., The formula has three factors: Specific Needs of Educational Support, SNES). This concept is a direct • The first factor measures the distance between e 1 and e 2 as a child (subclass) of the especial education concept in the ontology. proportion of the depth h(O) of the ontology hierarchy. The Its relationship with especial education is stronger than access to α ∈ [0, 1] parameter allows ensuring a minimum non-zero education and school bullying. However, its associated SR values An ontology hierarchical level can be either a rdfs:subclassOf relationship between have less influence on the ranking scores than the TF-IDF values of two classes, or a rdf:type relationship between an individual and its class. the document entities. Cantador et al. Table 2: Number of indexed documents with parsed content. 7 CONCLUSIONS In this paper, we have presented a novel approach to ontology-based Document type Num. annotation and retrieval of parliamentary content. The approach Non-law proposals in plenary session 40 was built upon domain ontologies that cover a large number of Law proposals by parliamentary groups 13 topics related to the United Nations taxonomy of sustainable devel- Law proposals by deputies 2 opment goals. As a proof of concept, the approach was instantiated Law proposals by autonomous cities and communities 10 and preliminary evaluated on a corpus extracted from the Spanish Popular legislative initiatives 4 Congress of Deputies, and used by the Parlamento2030 platform. Proposals to reform the congress rules 3 The approach is in a preliminary stage. Next, we describe several future research lines. The ontological schema of the approach is composed of three Table 3: Average precision P@N values for each method. 
main classes, namely Goal, Target and Concept, and three prop- erties to relate the class individuals, namely hasGoal, hasTarget Method P@5 P@10 P@15 P@20 and subClassOf (i.e., subConceptOf). As shown in the paper, ex- simkey 0.633 0.483 0.422 0.358 ploiting these properties leads to semantically enhanced search simsem_all 0.733 0.550 0.500 0.492 results. However, more specific properties between particular enti- simsem_max 0.733 0.683 0.656 0.600 ties could be considered as well. For instance, we may have inter- domain properties that relate concepts –such as Job security (from G1: No poverty) equivalentTo Precarious work (G8: De- cent work economic growth)–, and targets –such as Promoting renewable energy (G7: Affordable and clean energy) impactsOn Boosting the semantic relatedness in the ranking process, the Decreasing pollution (G13: Climate change), which impactsOn simsem_max method generated a result list where documents d 5, . . . , d 8 Disease prevention (G3: Good health well being). Exploiting appear after the first 3 positions. these and other types of relationships could enrich the semantic Testing the methods on the Parlamento2030 corpus, we found inference capabilities of the retrieval method, which would be able other similar examples. We do not present them due to space limi- to better find related documents. Properties and relationships may tations. Moreover, aiming to show a more rigorous evaluation of be defined manually or extracted automatically by mining external the methods, we performed a preliminary experiment, described knowledge bases. next. Moreover, our semantic annotation method is able to extract The Parlamento2030 corpus had 3534 documents. From them, named entities, such as proper nouns of people, organizations and only 72 had textual content parsed. Table 2 shows the types of these places. We may extend the ontology to model these issues. For documents. 
The remainder documents had links to associated PDF instance, we may have classes and properties describing and relating files from the Congress of Deputies. We indexed and annotated cities and administrative divisions. Then, we could exploit this all the documents analyzing their titles and their content if avail- information to find government initiatives according to particular able. A total of 435 documents were annotated with 642 concept demographic, sociocultural, political and economic attributes of annotations, meaning a coverage of 12.3% of the corpus. The Par- target locations, or in terms of geographic relationships between lamento2030 system has topic tags for 427 documents obtained such locations, e.g., educational initiatives proposed for certain through their language-dependent regular expressions. surrounding area. The obtained Parlamento2030 dataset did not have sets of queries Modeling and extracting time aspects of parliamentary content and relevance judgements, so we opted to conduct a user study. are left as future work. Temporal traceability of addressed goals Three experts participated in the study. After revising the document and targets, and time-based government initiative similarities are collection, each of them stated 5 queries. Then, they were requested examples of interesting functionalities for a parliamentary infor- to assess the 20-top results provided by the 3 search methods for mation retrieval system. We also envision the extension of the the 15 queries, stating whether each retrieved document was non annotation method to work at argumentation level [2, 6, 13, 18], relevant, relevant or highly relevant for the corresponding query. aiming to automatically extract argumentative structures from in- The Fleiss’ kappa inter-rater correlation coefficient was κ = 0.984, put texts. 
In particular, we propose to develop a method able to meaning an almost perfect agreement between the assessors’ judge- identify arguments, as well as components from them, such as facts, ments. The experts also evaluated the correctness of the annotations statements and predictions. The extracted arguments would serve of the documents, measuring an accuracy of 98.7% for the semantic as summaries of existing debates and proposals, and may be used annotation process. as instruments to measure and monitor ideological and activity Table 3 reports the average precision values P@N of the three dynamics in the Parliament. methods for the top N = 5, 10, 15, 20 results. Relevant and highly Our approach supports multilinguiality, but we only implemented relevant assessments were considered as positive judgements. The it with a vocabulary in Spanish. A further step would be to translate reported values show that the proposed semantic methods outper- the generated ontology entities into other languages, starting with form the VSM-based simkey method. Moreover, they show that English. For both the ontologies and semantic annotations, we plan limiting the ontological expansion of matched concepts by the to link their entities to external knowledge bases, following the simsem_max improves the accuracy achieved by simsem_all . Semantic Annotation and Retrieval of Parliamentary Content: A Case Study on the Spanish Congress of Deputies Linked Open Data initiative. In this context, we would make all the [14] Antonio Martín and Carlos León. 2015. Semantic framework for an efficient generated resources publicly available as RDF repositories. information retrieval in the e-government repositories. In Handbook of Research on Democratic Strategies and Citizen-Centered E-Government Services. IGI Global, The proposed information retrieval method, which makes use 192–213. 
of a novel semantic relatedness metric between ontology entities, [15] Pablo N Mendes, Max Jakob, Andrés García-Silva, and Christian Bizer. 2011. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of could be adapted to provide personalized search and recommen- the 7th International Conference on Semantic Systems. 1–8. dation. For such purpose, as done in previous work [5, 22, 26], we [16] David Milne and Ian H Witten. 2008. Learning to link with Wikipedia. In Pro- will have to model individual or stereotype-based user profiles, ceedings of the 17th ACM Conference on Information and Knowledge Management. 509–518. according to privacy and application issues. [17] Silvio Moreira, David Batista, Paula Carvalho, Francisco M Couto, and Mário J Lastly, an exhaustive evaluation of the proposed approach has Silva. 2011. POWER - Politics Ontology for Web Entity Retrieval. In Proceedings to be conducted. In this sense, we plan to perform both offline ex- of the 23rd International Conference on Advanced Information Systems Engineering. Springer, 489–500. periments and online user studies. For the latter case, the approach [18] Gaku Morio. 2018. Annotating Online Civic Discussion Threads for Argument will be integrated into the Parlamento2030 platform. Mining. In 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI). IEEE, 546–553. [19] Mark A Musen. 2015. The protégé project: A look back and a look forward. AI ACKNOWLEDGMENTS Matters 1, 4 (2015), 4–12. [20] Fedelucio Narducci, Matteo Palmonari, and Giovanni Semeraro. 2013. Cross- This work was supported by the Spanish Ministry of Science and language semantic retrieval and linking of e-gov services. In Proceedings of the Innovation (PID2019-108965GB-I00). The authors acknowledge 12th International Semantic Web Conference. Springer, 130–145. CIECODE for providing the dataset used in this work, and thank its [21] Gerard Salton, Anita Wong, and Chung-Shu Yang. 1975. 
A vector space model for automatic indexing. Communications of the ACM 18, 11 (1975), 613–620. director –Javier Pérez– and members –Pablo Martín, Belén Agüero [22] Elena Sánchez-Nielsen and Francisco Chávez-Gutiérrez. 2008. Personalized and Irene Matín– for their help and support in the project. They and on-demand retrieval of parliamentary proceedings with social feedback on elected representatives. In Proceedings of the 21st Annual Conference on Legal also thank Alejandro Bellogín for his help on the regular expression Knowledge and Information Systems. IOS Press, 53–62. processing. [23] Elena Sánchez-Nielsen, Francisco Chávez-Gutiérrez, and Javier Lorenzo-Navarro. 2019. A semantic parliamentary multimedia approach for retrieval of video clips with content understanding. Multimedia Systems 25, 4 (2019), 337–354. REFERENCES [24] Nigel Shadbolt, Kieron O’Hara, Tim Berners-Lee, Nicholas Gibbins, Hugh Glaser, [1] Flora Amato, Antonino Mazzeo, Vincenzo Moscato, and Antonio Picariello. 2009. Wendy Hall, et al. 2012. Linked open government data: Lessons from data.gov.uk. A system for semantic retrieval and long-term preservation of multimedia doc- IEEE Intelligent Systems 27, 3 (2012), 16–24. uments in the e-government domain. International Journal of Web and Grid [25] George Tsatsaronis and Vicky Panagiotopoulou. 2009. A generalized vector Services 5, 4 (2009), 323–338. space model for text retrieval based on semantic relatedness. In Proceedings of [2] Claire Cardie, Cynthia R Farina, Matt Rawding, and Adil Aijaz. 2008. An eRule- the Student Research Workshop at EACL 2009. 70–78. making corpus: Identifying substantive issues in public comments. In Proceedings [26] Eduardo Vicente-López, Luis M de Campos, Juan M Fernández-Luna, and Juan F of the 11th International Conference on Language Resources and Evaluation. Huete. 2016. Use of textual and conceptual profiles for personalized retrieval of [3] Miguel Ángel Corella and Pablo Castells. 2006. 
A heuristic approach to semantic political documents. Knowledge-Based Systems 112 (2016), 127–141. web services classification. In Proceedings of the 10th International Conference on [27] SK Michael Wong, Wojciech Ziarko, and Patrick CN Wong. 1985. Generalized Knowledge-Based and Intelligent Information and Engineering Systems. Springer, vector spaces model in information retrieval. In Proceedings of the 8th Annual 598–605. International ACM SIGIR Conference on Research and Development in Information [4] Luis M De Campos, Juan M Fernández-Luna, Juan F Huete, and Carlos J Martín- Retrieval. 18–25. Dancausa. 2008. An integrated system for accessing the digital library of the Parliament of Andalusia: Segmentation, annotation and retrieval of transcriptions and videos. In Proceedings of the 8th International Workshop on Pattern Recognition in Information Systems - Volume 1. SciTePress, 38–47. [5] Luis M de Campos, Juan M Fernández-Luna, Juan F Huete, and Luis Redondo- Expósito. 2017. Comparing machine learning and information retrieval-based approaches for filtering documents in a parliamentary setting. In Proceedings of the 11th International Conference on Scalable Uncertainty Management. Springer, 64–77. [6] Vlad Eidelman and Brian Grom. 2019. Argument Identification in Public Com- ments from eRulemaking. In Proceedings of the 17th International Conference on Artificial Intelligence and Law. 199–203. [7] Miriam Fernández, Iván Cantador, Vanesa López, David Vallet, Pablo Castells, and Enrico Motta. 2011. Semantically enhanced Information Retrieval: An ontology- based approach. Journal of Web Semantics 9, 4 (2011), 434–452. [8] Juan M Fernández-Luna, Juan F Huete, Manuel Gómez, and Carlos J Martín- Dancausa. 2008. Development of the XML digital library from the parliament of Andalucía for intelligent structured retrieval. In Proceedings of the 17th Interna- tional Symposium on Methodologies for Intelligent Systems. Springer, 417–423. 
[9] Evgeniy Gabrilovich and Shaul Markovitch. 2009. Wikipedia-based semantic interpretation for natural language processing. Journal of Artificial Intelligence Research 34 (2009), 443–498. [10] Hugo C Hoeschl, Tânia Cristina D Bueno, Andre Bortolon, Eduardo S Mattos, Marcelo S Ribeiro, Irineu Theiss, and Ricardo Miranda Barcia. 2004. An intelligent search engine for electronic government applications for the resolutions of the United Nations Security Council. In Building the E-Service Society. Springer, 23–41. [11] Rianne Kaptein and Maarten Marx. 2010. Focused retrieval and result aggregation with political data. Information retrieval 13, 5 (2010), 412–433. [12] Xiaoxing Liu and Changxia Hu. 2012. Research and design on e-government information retrieval model. Procedia Engineering 29 (2012), 3170–3174. [13] Anastasios Lytos, Thomas Lagkas, Panagiotis Sarigiannidis, and Kalina Bontcheva. 2019. The evolution of argumentation mining: From models to social media and emerging tools. Information Processing & Management 56, 6 (2019), 102055.