=Paper=
{{Paper
|id=Vol-3128/paper3
|storemode=property
|title=Identifying family ties among politicians: Challenges of information extraction evaluation
|pdfUrl=https://ceur-ws.org/Vol-3128/paper3.pdf
|volume=Vol-3128
|authors=Diana Santos,Suemi Higuchi,Claudia Freitas
|dblpUrl=https://dblp.org/rec/conf/dhandnlp-ws/SantosHF22
}}
==Identifying family ties among politicians: Challenges of information extraction evaluation==
Diana Santos<sup>1</sup> [0000-0002-3108-7706]⋆, Suemi Higuchi<sup>2</sup> [0000-0002-6255-3781], and Cláudia Freitas<sup>3</sup> [0000-0001-6807-8558]

<sup>1</sup> Linguateca & University of Oslo, Norway, d.s.m.santos@ilos.uio.no

<sup>2</sup> FGV, Brazil, Suemi.Higuchi@fgv.br

<sup>3</sup> Linguateca & PUC-Rio, Brazil, claudiafreitas@puc-rio.br

⋆ Copyright © 2022 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

'''Abstract.''' We discuss several challenges of evaluating information extraction patterns, using the DHBB corpus, a public resource for the Dicionário Histórico-Biográfico Brasileiro. Our goal is to stress both the limitations and the advantages of using a corpus-based approach for the task of identifying political families in Brazilian society.

'''Keywords:''' Evaluation · Information extraction · Brazilian politics

===1 Family ties in Brazilian politics===

It is often mentioned that in Brazil family ties matter a lot for success in politics [12, 7]. However, this is not easy to measure, and therefore not easy to confirm. Given the availability of the DHBB corpus, we decided to extract all family relationships mentioned there and assess whether they concerned family relationships among politicians. This can be considered a kind of distant reading for History [2, 10], and it highlighted the need to be very concrete about what exactly one is evaluating.

We begin by explaining how we annotated family ties in the DHBB corpus, then how we annotated that a particular name was already a biographee in DHBB, and then how the family relations were extracted. We then discuss how to evaluate the result, and show that there are several ways in which one can evaluate the resulting concordances.

In our understanding, a politician is someone who is invested in his or her position through election, nomination or designation, usually a member of the executive or legislative branch.<sup>4</sup> People in positions that serve merely bureaucratic functions, such as technical advisers and consultants, whether in the executive, legislative or judiciary branches or in the military, are generally not considered politicians, although they are involved in government decision-making processes [8].

<sup>4</sup> https://corregedorias.gov.br/assuntos/perguntas-frequentes/agentes-publicos-e-agentes-politicos

===2 Annotating family in the AC/DC===

DHBB belongs to the AC/DC family of corpora, a project designed to make large corpora available and searchable on the Web [9]. Because we were especially interested in family ties in the context of Brazilian politics, we added them as one of the semantic fields available in AC/DC. This may be relevant for other kinds of text as well, as the current organization of DIP [11] for extracting characters and their family ties demonstrates.<sup>5</sup>

<sup>5</sup> More information on the categories can be found at https://www.linguateca.pt/Gramateca/Familia.html

===3 Grounding biographees===

Unlike the family semantic domain, some information only makes sense for DHBB, namely the unification of several distinct names as referring to a particular biographee (Lula, Lula da Silva, Luís Inácio Lula da Silva, for example). In fact, each politician who has an entry in DHBB receives a unique identifier (stored in the id field), and during the creation of the corpus we tried to unify the different ways of referring to the same person, using manually constructed rules, as described in [5].

===4 Obtaining extraction patterns===

By looking at ten entries whose holders are known to have many family ties with other politicians, the second author devised a set of patterns, divided into five groups (not necessarily mutually exclusive), as detailed in [3, page 119].

# Relations between the entry politician and other people biographed in DHBB
# Relations between the entry politician, expressed by a possessive pronoun assumed to refer to the biographee, and named politicians
# Relations between the entry politician and another, unnamed politician
# Relations between two politicians biographed in DHBB, neither of them the biographee (this gave us 35 cases, all of them correct)
# General family relations described in DHBB between names

Example sentences are listed below, each preceded by the number of its group:

* (1) Paulo Maluf, seu padrinho, (...) ("Paulo Maluf, his godfather")
* (2) Sua esposa era filha de João Alves de Sousa, militar, tenente-coronel e chefe político em Patrocínio (MG). ("His wife was the daughter of João Alves de Sousa, military officer, lieutenant colonel and political boss in Patrocínio (MG).")
* (3) Seu pai foi eleito deputado constituinte (...) ("His father was elected constituent deputy")
* (5) (...) os ex-pessedistas, liderados por Crispim Jaques Bias Fortes, secretário de Obras Públicas, filho do ex-governador José Francisco Bias Fortes (...) ("the former PSD members, led by Crispim Jaques Bias Fortes, Secretary of Public Works, son of former governor José Francisco Bias Fortes")

Note that the second example illustrates an indirect relationship between politicians: the biographee is the son-in-law of JAS (his wife is the daughter of JAS).

===5 Evaluating the results===

Since the patterns yielded a large number of results, we drew a sample for each kind of pattern and manually evaluated the resulting 198 cases. We soon noticed that the evaluation could be done according to the following criteria:

* did the patterns find valid family relations? (criterion 1)
* did the patterns find family relations between politicians? (criterion 2)
* did the patterns find family relations between politicians who could be identified in DHBB? (criterion 3)

In addition, one could evaluate the patterns strictly, considering only the text they extract and none of the surrounding text:

* did the patterns extract family relations between politicians? (criterion 4)
* did the patterns extract family relations between politicians who could be identified in DHBB? (criterion 5; first names alone are not enough)

See, for example, the following cases:

* (...) no entanto derrotada dentro do partido que optou por Gleisi Hoffmann, esposa do ministro do Planejamento Paulo Bernardo (GH, wife of the minister)
* No mês seguinte, foi acusado de envolvimento na morte de Severino Alves de Lacerda, filho do ex-prefeito do município paraibano de Aguiar. (SAL, son of the ex-mayor)

The first case allows us to find a relationship between two politicians (Gleisi Hoffmann and Paulo Bernardo) in the whole sentence, but not in the extracted pattern, so it yields YES for the first three criteria. The second, although it identifies a relationship between two politicians, does not allow us to identify the second politician, and thus yields YES only for the first, second and fourth criteria.

In Table 1 we show the results of this fivefold evaluation [1]. This shows that many decisions have to be agreed upon, and that evaluation depends on exactly what one is interested in. The results of case 3 were reported in [4]. What we would like to stress here is that all these numbers are appropriate, but they evaluate different things. While the first only looks at the precision of family relations in encyclopedic text, the second and fourth measure family links among politicians, and the third and fifth measure the capability of extracting links among named politicians. The difference between the second and third on the one hand and the fourth and fifth on the other lies in the amount of text to be processed. Obviously, the pattern itself is much easier to use to obtain triples of the form A-family link-B, which can, for example, be depicted in a graph.

{| class="wikitable"
|+ Table 1. Evaluating the extraction in five different ways
! Pattern set !! crit 1 !! crit 2 !! crit 3 !! crit 4 !! crit 5
|-
| 1 || 1 || .48 || .48 || .32 || .30
|-
| 2 || .98 || .64 || .64 || .46 || .44
|-
| 3 || 1 || .88 || .88 || .62 || .60
|-
| 5 || .96 || .44 || .42 || .24 || .18
|-
| Total || .985 || .605 || .595 || .405 || .375
|}

===6 New extraction patterns===

While analysing the cases obtained, it became clear that the patterns themselves could be significantly improved if we took the (main) political positions into consideration.
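The idea of restricting a family pattern to political positions can be sketched as follows. This is only an illustration in Python under stated assumptions: the regular expression, the kinship list (<code>KIN_NOUNS</code>), the position list (<code>POSITION_NOUNS</code>) and the function <code>find_political_kin</code> are all hypothetical, and they are not the AC/DC queries or the short list actually used for DHBB.

```python
import re

# Hypothetical kinship nouns (a few taken from the example sentences in the paper).
KIN_NOUNS = ["filho", "filha", "esposa", "marido", "pai", "mãe",
             "irmão", "irmã", "sobrinho", "neto"]

# Hypothetical short list of political position nouns; NOT the list used for DHBB.
POSITION_NOUNS = ["governador", "prefeito", "ministro", "deputado",
                  "senador", "presidente"]

# "<kin> do/da/dos/das <(ex-)position>": the noun after the kinship term
# must be a political position noun, optionally prefixed by "ex-".
PATTERN = re.compile(
    r"\b(?P<kin>" + "|".join(KIN_NOUNS) + r")\s+d[oa]s?\s+"
    r"(?P<position>(?:ex-)?(?:" + "|".join(POSITION_NOUNS) + r"))\b",
    re.IGNORECASE,
)


def find_political_kin(sentence):
    """Return (kinship noun, position noun) pairs matched in a sentence."""
    return [
        (m.group("kin").lower(), m.group("position").lower())
        for m in PATTERN.finditer(sentence)
    ]


if __name__ == "__main__":
    examples = [
        "Gleisi Hoffmann, esposa do ministro do Planejamento Paulo Bernardo",
        "secretário de Obras Públicas, filho do ex-governador José Francisco Bias Fortes",
        "Sua esposa era filha de João Alves de Sousa, militar",
    ]
    for sentence in examples:
        print(find_political_kin(sentence))
```

On these three sentences the sketch returns a pair for the first two and an empty list for the third, where the kinship term is followed directly by a name rather than by a position noun. A pattern that accepted any noun after the kinship term would return more matches with fewer politicians among them, which is the trade-off this section addresses.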
Although the DHBB has not yet been reliably annotated with all the relevant political positions, we created a short list of positions, changed the rules that mentioned a noun so that they require a "political position noun", and obtained new results.

{| class="wikitable"
|+ Table 2. Improving the queries with political positions
! Pattern set !! Before !! After
|-
| 1 || 3753 || 2225
|-
| 3 || 640 || 78
|-
| 5 || 624 || 338
|}

We were able to get down from 5,017 cases to 2,641 cases, which is almost half, as shown in Table 2. The patterns themselves also became more reliable, in the sense of yielding the position of the family member much more frequently. This is confirmed by a new random sample, which was again manually reviewed (see Table 3) and which can be inspected in [6]. Interestingly, most politicians identified in our sample were named, which casts doubt on the need to separate politicians from identifiable politicians, and shows that the DHBB authors were very careful to name the people mentioned.

{| class="wikitable"
|+ Table 3. Evaluating the extraction in five different ways with the new patterns [6]
! !! crit 1 !! crit 2 !! crit 3 !! crit 4 !! crit 5
|-
| All || .985 || .765 || .765 || .705 || .690
|}

Even though the exercise reported here may seem to exhaust the evaluation possibilities, several interesting issues remain to be solved, such as full names vs. first names only, concordances where more than one family relationship is present, and the automatic recovery of the possessive pronoun's referent. A final issue is the common presence in DHBB of family members who hold power in Brazil although they are not politicians in our sense.

===References===

# Avaliação dos antigos padrões com 5 critérios. Tech. rep., Linguateca (5 January 2022), https://www.linguateca.pt/acesso/dhbb/AvaliacaoCincoCriterios5jan2022.html
# Fortes, A., Alvim, L.G.M.: Evidências, códigos e classificações: o ofício do historiador e o mundo digital. Esboços: histórias em contextos globais 27(45), 207–227 (2020)
# Higuchi, S.: Extração automática de informações: uma leitura distante do Dicionário Histórico-Biográfico Brasileiro (DHBB). Ph.D. thesis, PUC-Rio, Rio de Janeiro (May 2021), http://www.linguateca.pt/documentos/TeseSuemiHiguchi2021.pdf
# Higuchi, S., Freitas, C., Santos, D.: Automatic information extraction: a distant reading of the Brazilian Historical-Biographical Dictionary. In: PROPOR 2022. Springer (March 2022)
# Higuchi, S., Santos, D., Freitas, C., Rademaker, A.: Distant reading Brazilian politics. In: Proceedings of 4th Conference of The Association Digital Humanities in the Nordic Countries (Copenhagen, March 6-8 2019) (March 2019), https://www.linguateca.pt/Diana/download/aprDHN2019.pdf
# Avaliação dos novos padrões com 5 critérios. Tech. rep., Linguateca (5 January 2022), https://www.linguateca.pt/acesso/dhbb/AvaliacaoNovosPadroes5jan2022.html
# Oliveira, R.C.d., Goulart, M.H.H.S., Vanali, A.C., Monteiro, J.M.: Família, parentesco, instituições e poder no Brasil: retomada e atualização de uma agenda de pesquisa. Revista Brasileira de Sociologia 5(11), 165–198 (2017), https://dialnet.unirioja.es/descarga/articulo/6227086.pdf
# Cargos políticos no dicionário histórico-biográfico brasileiro. Tech. rep., Linguateca (5 January 2022), https://www.linguateca.pt/acesso/dhbb/AvaliacaoCincoCriterios5jan2022.html
# Santos, D.: Corpora at Linguateca: Vision and roads taken. In: Berber Sardinha, T., Ferreira, T.L.S.B. (eds.) Working with Portuguese Corpora, pp. 219–236. Bloomsbury (2014)
# Santos, D.: Humanidades Digitais e História: algumas observações (16 December 2021), https://www.linguateca.pt/Diana/download/SantosPIDH.pdf
# Santos, D., Willrich, R., Langfeldt, M., de Moraes, R.G., Mota, C., Pires, E., Schumacher, R., Pereira, P.S.: Identifying literary characters in Portuguese: Challenges of an international shared task. In: PROPOR 2022. Springer (March 2022)
# Schoenster, L.: Clãs políticos seguem dominando Congresso na próxima legislatura. Tech. rep., Transparência Brasil (November 2014), https://www.transparencia.org.br/downloads/publicacoes/Cl%C3%A3s%20pol%C3%ADticos%20seguem%20dominando%20Congresso%20na%20pr%C3%B3xima%20legislatura.pdf