<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Identifying family ties among politicians</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Family ties in Brazilian politics</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Linguateca &amp; PUC-Rio</institution>
          ,
          <country country="BR">Brazil</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Linguateca &amp; University of Oslo</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>We discuss several challenges of evaluating information extraction patterns, using the DHBB corpus, a public resource for the Dicionário Histórico-Biográfico Brasileiro. Our goal is to stress both the limitations and the advantages of a corpus-based approach to identifying political families in Brazilian society. It is often claimed that family ties matter a great deal for success in Brazilian politics [12, 7]. However, this is not easy to measure and therefore to confirm. Given the availability of the DHBB corpus, we decided to extract all family relationships mentioned there and assess whether they concerned family relationships among politicians. This can be considered a kind of distant reading for History [2, 10], and it highlighted the need to be very concrete about what exactly one is evaluating. We begin by explaining how we annotated family ties in the DHBB corpus, then how we annotated that a particular name was already a biographee in DHBB, and then how the family relations were extracted. We then discuss how to evaluate the result, and show that there are several ways to evaluate the resulting concordances. In our understanding, a politician is someone who is invested in his or her position through election, nomination or designation, usually a member of the executive or legislative branch. Positions that serve merely for bureaucratic</p>
      </abstract>
      <kwd-group>
        <kwd>Evaluation</kwd>
        <kwd>Information extraction</kwd>
        <kwd>Brazilian politics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        functions, such as technical advisers and consultants, whether in the executive,
legislative or judiciary branches or in the military, are generally not considered politicians,
although they are involved in government decision-making processes [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Annotating family in the AC/DC</title>
      <p>
        DHBB belongs to the AC/DC family of corpora, a project designed to make
large corpora available and searchable on the Web [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Because we were
especially interested in family ties in the context of Brazilian politics, we added
them as one of the semantic fields available in AC/DC, something that may
also be relevant for other kinds of text, as the current organization of DIP [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
for extracting characters and their family ties demonstrates.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Grounding biographees</title>
      <p>
        Unlike the family semantic domain, some information only makes
sense for DHBB, namely the unification of several distinct names as
corresponding to a particular biographee (Lula, Lula da Silva, Luís Inácio Lula da Silva, for
example). In fact, each politician who has an entry in DHBB receives a unique
identifier (stored in the id field), and during creation of the corpus we tried to
unify the several different ways of referring to the same person, using manually
constructed rules, as described in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
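      <p>As a rough illustration of this unification step, the following Python sketch maps several name variants to one identifier. It is only a sketch under stated assumptions: the rule table, the identifier value bio-0001 and the function names normalize and ground are hypothetical, and the manually constructed rules actually used for the corpus are not reproduced here.</p>
      <preformat>
```python
import unicodedata

# Hypothetical rule table: each known name variant points to one biographee id.
# The id value is made up for illustration; it is not a real DHBB identifier.
VARIANTS = {
    "Lula": "bio-0001",
    "Lula da Silva": "bio-0001",
    "Luís Inácio Lula da Silva": "bio-0001",
}


def normalize(name):
    """Lowercase, strip accents and collapse whitespace so variants compare equal."""
    decomposed = unicodedata.normalize("NFD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


# Index the rule table by normalized form.
INDEX = {normalize(variant): bio_id for variant, bio_id in VARIANTS.items()}


def ground(name):
    """Return the biographee id for a name mention, or None if it is unknown."""
    return INDEX.get(normalize(name))
```
      </preformat>
      <p>With this table, ground("Lula") and ground("Luis Inacio Lula da Silva") return the same identifier, while a name with no rule returns None and stays ungrounded.</p>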
    </sec>
    <sec id="sec-4">
      <title>Obtaining extraction patterns</title>
      <p>By looking at ten entries whose holders are known to have many family ties
with other politicians, the second author devised a set of patterns, divided into
five groups (not necessarily mutually exclusive), as detailed in [3, page 119].
1. Relations between the entry politician and other people biographed in DHBB
2. Relations between the entry politician, referred to by a possessive pronoun
(assumed to refer to the biographee), and named politicians
3. Relations between the entry politician and another, unnamed politician
4. Relations between two politicians biographed in DHBB, neither of them the
biographee (this gave us 35 cases, all of them correct)
5. General family relations described in DHBB between names</p>
      <p>Example sentences for each group are listed below, preceded by the group number:
– (1) Paulo Maluf, seu padrinho, (...)
– (2) Sua esposa era filha de João Alves de Sousa, militar,
tenente-coronel e chefe político em Patrocínio (MG).
– (3) Seu pai foi eleito deputado constituinte (...)
– (5) (...) os ex-pessedistas, liderados por Crispim Jaques Bias Fortes,
secretário de Obras Públicas, filho do ex-governador José Francisco
Bias Fortes (...)
Note that the second example illustrates an indirect relationship between
politicians: the biographee is the son-in-law of João Alves de Sousa (JAS), since his wife is JAS's daughter.
More information on the family categories can be found at
https://www.linguateca.pt/Gramateca/Familia.html.</p>
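      <p>To make the shape of such a pattern concrete, here is a minimal Python sketch that finds a kinship term followed by an optional position and a proper name. Python regular expressions stand in for the corpus queries, and the kinship list, the NAME expression and the function extract are assumptions made for illustration; they are not the patterns of [3, page 119].</p>
      <preformat>
```python
import re

# Small, hypothetical kinship lexicon; the family semantic field in the corpus is richer.
KIN = r"(?:filho|filha|pai|mãe|esposa|marido|irmão|irmã|padrinho|sobrinho|neto)"

# A proper name: capitalized words, optionally joined by lowercase particles.
NAME = r"[A-ZÀ-Ý]\w+(?:\s+(?:(?:de|da|do|dos|das|e)\s+)?[A-ZÀ-Ý]\w+)*"

# Kinship term, "de/do/da(s)", optional lowercase position words, then the name.
PATTERN = re.compile(rf"\b({KIN})\s+d[eoa]s?\s+((?:[a-zà-ÿ-]+\s+)*)({NAME})")


def extract(sentence):
    """Return (kinship term, position, name) triples found in the sentence."""
    return [
        (m.group(1), m.group(2).strip(), m.group(3))
        for m in PATTERN.finditer(sentence)
    ]
```
      </preformat>
      <p>Applied to the group 5 example, this sketch returns the kinship term filho, the position ex-governador and the name José Francisco Bias Fortes. It is deliberately naive: a capitalized word inside the position, as in ministro do Planejamento, would be absorbed into the name.</p>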
    </sec>
    <sec id="sec-5">
      <title>Evaluating the results</title>
      <p>Since the patterns yielded a large number of results, we drew a sample for each
kind of pattern and manually evaluated the resulting 198 cases. We soon noticed that
evaluation could be done according to the following criteria:
– did the patterns find valid family relations? (criterion 1)
– did the patterns find family relations between politicians? (criterion 2)
– did the patterns find family relations between politicians which were possible
to identify in DHBB? (criterion 3)</p>
      <p>In addition, one could evaluate the patterns strictly, considering only the text
matched by the pattern and none of the surrounding text.</p>
      <p>– did the patterns extract family relations between politicians? (criterion 4)
– did the patterns extract family relations between politicians which were
possible to identify in DHBB? (criterion 5) (first names alone are not enough)
See, for example, the following cases:
– (...) no entanto derrotada dentro do partido que optou por Gleisi
Hoffmann, esposa do ministro do Planejamento Paulo Bernardo (GH, wife
of the minister)
– No mês seguinte, foi acusado de envolvimento na morte de Severino Alves
de Lacerda, filho do ex-prefeito do município paraibano de Aguiar.
(SAL, son of the ex-mayor)
The first case allows us to find a relationship between two
politicians (Gleisi Hoffmann and Paulo Bernardo) in the whole sentence, but not in the extracted
pattern, so it yields YES for the first three criteria. The second, although it identifies
a relationship between two politicians, does not allow us to identify the second
politician, and thus yields YES only for the first, second and fourth criteria.</p>
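      <p>The bookkeeping behind such a multi-criterion evaluation can be sketched as follows. The two rows encode only the two cases discussed above, not the 198 judged cases, and the data layout and the function name precision_per_criterion are assumptions made for illustration.</p>
      <preformat>
```python
from fractions import Fraction

# Each judged concordance records YES (True) or NO (False) for criteria 1 to 5.
# These two rows encode the two cases discussed in the text; a real sample is larger.
JUDGEMENTS = [
    {1: True, 2: True, 3: True, 4: False, 5: False},  # wife of the minister
    {1: True, 2: True, 3: False, 4: True, 5: False},  # son of the ex-mayor
]


def precision_per_criterion(judgements):
    """Return the share of YES answers for each criterion over the judged sample."""
    total = len(judgements)
    if total == 0:
        raise ValueError("the judged sample is empty")
    return {
        criterion: Fraction(sum(row[criterion] for row in judgements), total)
        for criterion in range(1, 6)
    }
```
      </preformat>
      <p>On these two rows, criteria 1 and 2 score 1, criteria 3 and 4 score 1/2, and criterion 5 scores 0, which shows how the same sample can yield different figures depending on the question asked.</p>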
      <p>
        In Table 1 we show the results of this fivefold evaluation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This shows that
many decisions have to be agreed upon, and that evaluation depends on exactly
what one is interested in. The results of case 3 were reported in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>What we would like to stress here is that all these numbers are appropriate,
but evaluate different things. While the first only looks at the precision of family
relations in encyclopedic text, the second and fourth measure politician family
links, and the third and fifth measure the capability of extracting links among
named politicians. The difference between the second and third criteria, on the one hand,
and the fourth and fifth, on the other, concerns the amount of text to be processed. Obviously, the pattern itself is
much easier to use to obtain triples of the form A-family link-B, which can, for
example, be depicted in a graph.</p>
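      <p>As an illustration of the triple form, the sketch below stores A-family link-B triples and renders them as Graphviz DOT text. The two triples come from examples quoted in this paper; the function names build_graph and to_dot and the choice of DOT as an output format are assumptions, not part of the reported work.</p>
      <preformat>
```python
from collections import defaultdict

# Triples (A, family link, B) taken from example sentences quoted in the text.
TRIPLES = [
    ("Gleisi Hoffmann", "esposa", "Paulo Bernardo"),
    ("Crispim Jaques Bias Fortes", "filho", "José Francisco Bias Fortes"),
]


def build_graph(triples):
    """Group outgoing (link, B) pairs under each source name A."""
    graph = defaultdict(list)
    for a, link, b in triples:
        graph[a].append((link, b))
    return dict(graph)


def to_dot(triples):
    """Render the triples as Graphviz DOT text, one labelled edge per triple."""
    lines = ["digraph family {"]
    for a, link, b in triples:
        lines.append(f'  "{a}" -> "{b}" [label="{link}"];')
    lines.append("}")
    return "\n".join(lines)
```
      </preformat>
      <p>The text returned by to_dot(TRIPLES) can be passed to any DOT renderer to draw the family links as labelled arrows.</p>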
    </sec>
    <sec id="sec-6">
      <title>New extraction patterns</title>
      <p>While analysing the cases obtained, it became clear that the patterns themselves
could be significantly improved if we took into consideration the (main) political
positions. Although DHBB has not yet been reliably annotated with all relevant political
positions, we created a short list of positions, changed the rules that
mentioned a noun so that they require a “political position noun”, and obtained new results.</p>
      <p>We were able to reduce the number of cases from 5,017 to 2,641, a reduction of
almost half (about 47%), as shown in Table 2. Also, the patterns themselves became more reliable,
in the sense of yielding the position of the family member much more frequently.</p>
      <p>
        This is confirmed by a new random sample, which was again manually reviewed
(see Table 3) and which can be inspected in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Interestingly, most politicians
identified in our sample were named, which casts doubt on the need to separate
politicians from identifiable politicians, and shows that the DHBB authors were
very careful to name the people mentioned.
      </p>
      <p>Even though the exercise reported here may seem to exhaust the evaluation
possibilities, several interesting issues remain to be solved, such as: full names
vs. first names only, concordances where more than one family relationship is
present, and the automatic recovery of the possessive pronoun’s referent. Finally,
there is the common presence in DHBB of family members who hold power in Brazil
although they are not politicians in our sense.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <article-title>Avaliação dos antigos padrões com 5 critérios</article-title>
          .
          <source>Tech. rep.</source>
          , Linguateca (5 January
          <year>2022</year>
          ), https://www.linguateca.pt/acesso/dhbb/AvaliacaoCincoCriterios5jan2022.html
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Fortes</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alvim</surname>
            ,
            <given-names>L.G.M.:</given-names>
          </string-name>
          <article-title>Evidências, códigos e classificações: o ofício do historiador e o mundo digital</article-title>
          .
          <source>Esboços: histórias em contextos globais</source>
          <volume>27</volume>
          (
          <issue>45</issue>
          ),
          <fpage>207</fpage>
          -
          <lpage>227</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Higuchi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Extração automática de informações: uma leitura distante do Dicionário Histórico-Biográfico Brasileiro (DHBB)</article-title>
          .
          <source>Ph.D. thesis</source>
          , PUC-Rio, Rio de Janeiro (May
          <year>2021</year>
          ), http://www.linguateca.pt/documentos/TeseSuemiHiguchi2021.pdf
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Higuchi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Automatic information extraction: a distant reading of the Brazilian Historical-Biographical Dictionary</article-title>
          .
          <source>In: PROPOR 2022</source>
          . Springer (March
          <year>2022</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Higuchi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Freitas</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rademaker</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Distant reading Brazilian politics</article-title>
          .
          <source>In: Proceedings of the 4th Conference of the Association Digital Humanities in the Nordic Countries (Copenhagen, March 6-8, 2019)</source>
          (March
          <year>2019</year>
          ), https://www.linguateca.pt/Diana/download/aprDHN2019.pdf
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <article-title>Avaliação dos novos padrões com 5 critérios</article-title>
          .
          <source>Tech. rep.</source>
          , Linguateca (5 January
          <year>2022</year>
          ), https://www.linguateca.pt/acesso/dhbb/AvaliacaoNovosPadroes5jan2022.html
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Oliveira</surname>
            ,
            <given-names>R.C.d.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goulart</surname>
            ,
            <given-names>M.H.H.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanali</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Monteiro</surname>
            ,
            <given-names>J.M.:</given-names>
          </string-name>
          <article-title>Família, parentesco, instituições e poder no Brasil: retomada e atualização de uma agenda de pesquisa</article-title>
          .
          <source>Revista Brasileira de Sociologia</source>
          <volume>5</volume>
          (
          <issue>11</issue>
          ),
          <fpage>165</fpage>
          -
          <lpage>198</lpage>
          (
          <year>2017</year>
          ), https://dialnet.unirioja.es/descarga/articulo/6227086.pdf
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <article-title>Cargos políticos no dicionário histórico-biográfico brasileiro</article-title>
          .
          <source>Tech. rep.</source>
          , Linguateca (5 January
          <year>2022</year>
          ), https://www.linguateca.pt/acesso/dhbb/AvaliacaoCincoCriterios5jan2022.html
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Corpora at Linguateca: Vision and roads taken</article-title>
          . In:
          <string-name>
            <surname>Berber Sardinha</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ferreira</surname>
            ,
            <given-names>T.L.S.B.</given-names>
          </string-name>
          (eds.) Working with Portuguese Corpora, pp.
          <fpage>219</fpage>
          -
          <lpage>236</lpage>
          .
          <publisher-name>Bloomsbury</publisher-name>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Humanidades Digitais e História: algumas observações</article-title>
          (16 December
          <year>2021</year>
          ), https://www.linguateca.pt/Diana/download/SantosPIDH.pdf
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Willrich</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Langfeldt</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Moraes</surname>
            ,
            <given-names>R.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mota</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pires</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schumacher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pereira</surname>
            ,
            <given-names>P.S.:</given-names>
          </string-name>
          <article-title>Identifying literary characters in Portuguese: Challenges of an international shared task</article-title>
          .
          <source>In: PROPOR 2022</source>
          . Springer (March
          <year>2022</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Schoenster</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Clãs políticos seguem dominando Congresso na próxima legislatura</article-title>
          .
          <source>Tech. rep., Transparência Brasil</source>
          (November
          <year>2014</year>
          ), https://www.transparencia.org.br/downloads/publicacoes/Cl%C3%A3s%20pol%C3%ADticos%20seguem%20dominando%20Congresso%20na%20pr%C3%B3xima%20legislatura.pdf
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>