Towards the Automatic Classification of Speech Subjects
          in the Danish Parliament Corpus

Dorte Haltrup Hansen1, Costanza Navarretta2[0000-0002-4242-9249], Lene Offersgaard3 and
                        Jürgen Wedekind4[0000-0002-0759-6009]

       Centre for Language Technology, Department of Nordic Studies and Linguistics,
                            University of Copenhagen, Denmark
                                 1dorteh@hum.ku.dk
                                2costanza@hum.ku.dk
                                  3leneo@hum.ku.dk
                               4jwedekind@hum.ku.dk




       Abstract. This paper addresses the semi-automatic subject area annotation of the
       Danish Parliament Corpus 2009-2017 in order to construct a gold standard corpus
       for automatic classification. The corpus consists of the transcriptions of the
       speeches in the Danish parliamentary meetings. In our annotation work, we
       mainly use subject categories proposed by Danish scholars in political science.
       The relevant subject areas have been manually annotated using the titles of the
       agenda items of the parliamentary meetings, and the subject areas have then been
       assigned to the corresponding speeches. Some subjects co-occur in the agendas,
       since they are often debated at the same time. The fact that the same speech can
       belong to more than one subject area is further analysed. Currently, more than
       29,000 speeches have been classified using the titles of the agenda items. Different
       evaluation strategies have been applied. We also describe automatic classification
       experiments on a subset of the corpus using features extracted with NLP
       techniques. The best results (96% F-score) were obtained using features extracted
       from the agenda item titles. These results indicate that the gold standard corpus
       and the agenda items can be used to automatically classify parliamentary debates
       with high accuracy.

       Keywords: Parliamentary Debates, Subject Classification, Gold Standard Cor-
       pus.


1      Introduction

The transcriptions of parliamentary debates (Hansards) are available in many countries,
and researchers from different disciplines, such as political science, linguistics and
computational linguistics, have examined them in a variety of contexts. A classification
of the speeches into subject areas is certainly the most basic technique for analysing
their content. Nevertheless, it is beneficial for practical applications, such as search
optimisation, and it is useful for more sophisticated analyses, e.g. of the tone of the debates
on immigration, a topic that also appears in debates on taxation, unemployment
and foreign policy.
    In this paper, we report on the creation of a gold standard corpus consisting of the
speeches from the Danish Parliament Corpus 2009-2017 classified by subject area, as
well as on experiments to classify the debates into subject areas using basic NLP
methods and machine learning techniques. The corpus contains Hansards of the sittings
in the Chamber of the Danish Parliament and has recently been made available as a
collection through the Danish CLARIN research infrastructure [1]. The corpus consists
of approx. 41 million running words and 182,192 speeches1. Information about the sittings,
the names of the speakers, their party, the time of the speeches, and the titles of the agenda
items is provided in the corpus. However, the corpus does not contain information
about the subjects of either the speeches or the agenda items.
    The paper is organised as follows. Section 2 describes related work. In Section 3, we
account for the adopted classification scheme and, in Section 4, we present the method
used for constructing the gold standard corpus. The analysis and evaluation of the an-
notated corpus are provided in Section 5. In Section 6, we report on the automatic clas-
sification experiments and their results. The final section concludes and suggests future
research.


2         Related Work

Political domains have been categorised according to various schemes depending on
the task. The Comparative Manifesto Project, CMP2 [2] and the Comparative Agendas
Project, CAP3 developed two domain classification systems for comparative studies.
    In the Comparative Manifesto Project, party election programmes (manifestos) were
annotated using 560 categories in order to determine the policy preferences of political
parties. The Comparative Agendas Project classifies policy activities around the world
according to 21 general categories and 192 sub-categories.
    The Danish Policy Agendas Project at the University of Aarhus is manually anno-
tating parliamentary activities in the Danish Parliament from 1953 onward4. The
data comprise, e.g., policy bills, legislative hearings, parliamentary debates, and
speeches by the prime minister. The project uses the CAP coding scheme. Recently,
experiments with semi-automatic classification have been carried out on Danish city
council agendas [3]. In these experiments, a Naive Bayes classifier was applied to a
corpus manually annotated on the basis of the council agendas; the agendas were then
lemmatised and used as test material. The best classification results on some of the
data were 75%.
    The CAP and CMP classifications are too complex and too broad for the scope of
subjects addressed in the Danish Parliament. Moreover, the CAP scheme was originally

1 Hansen, Dorte Haltrup, 2018, The Danish Parliament Corpus 2009 - 2017, v1, CLARIN-DK-UCPH Centre Repository, http://hdl.handle.net/20.500.12115/8.
2 https://manifesto-project.wzb.eu/
3 https://www.comparativeagendas.net/
4 http://www.agendasetting.dk/

proposed to describe the policy areas of the US Congress, and although it has been
extended and revised to be more widely applicable, it still suffers from this bias. Some
of the major categories of CAP, such as 400 General Agriculture with sub-categories
comprising e.g. 403 Food Inspection and Safety and 408 Fisheries and Fishing, perfectly
describe subjects debated in the Danish Parliament, while other categories, e.g. 23
Cultural Policy Issues, do not: sport, which in Denmark normally falls under culture, is
grouped in CAP under 15 Industrial and commercial policy (1526 Sport and Gambling).
   An alternative approach, mentioned in [5, 6], is to use the names of the ministries as
categories for classifying data related to German politics. Since the ministries’ names
and areas of responsibility can change from one election period to another [4], this is
not a viable solution for our data. Therefore, Zirn [5] uses a scheme based on the
responsibilities of the committees to which the agenda items are assigned. Her classification
scheme thus corresponds to the 22 committees of the German Parliament. Inspired by
Zirn’s work, we have developed a classification scheme that reflects the responsibility
areas of the committees of the Danish Parliament. In this way, we can also connect the
subject areas and the spokespersons for those areas. We show that the spokespersons for
a particular area are, not surprisingly, the politicians who speak most often about that
subject area and related subjects.
   Automatic text classification of large collections of texts is a natural language
processing subarea, which has developed extensively over the past decades. It consists of
assigning predefined classes to text documents by training machine learning algorithms
on features extracted from the texts with various NLP techniques. Examples of training
features are the number of words in the texts, the length of their sentences, bags of words
(bow), lemmas, lemmas of particular word classes, and TF*IDF values (term frequency *
inverse document frequency) [8, 9]. In three-fold sentiment classification of various
datasets, researchers have obtained between 63.9% and 98.6% accuracy, depending on
the data [10].
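   As a concrete illustration of the feature types mentioned above, the following minimal
sketch shows how bow and TF*IDF representations can be built with scikit-learn; the two
toy documents are placeholders, not material from the corpus.

```python
# A minimal sketch of bow and TF*IDF feature extraction with scikit-learn.
# The two toy documents are placeholders, not speeches from the corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "forslag til lov om afgift",
    "forhandling om sundhed og sygehuse",
]

bow_matrix = CountVectorizer().fit_transform(docs)      # raw term counts (bow)
tfidf_matrix = TfidfVectorizer().fit_transform(docs)    # TF*IDF-weighted terms

print(bow_matrix.shape, tfidf_matrix.shape)
```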


3       Classification Scheme

Scholars of political science in Denmark have suggested categorising the subject areas
of Danish politics into the following 23 main classes5: Agriculture, Business, Culture,
Defence, Economy, Education, Energy, Environment, European Integration, Foreign
Affairs, Government Operations, Health Care, Housing, Immigration, Justice, Labour,
Local and Regional Affairs, Personal Rights, Politics, Social Affairs, Technology, Ter-
ritories and Transportation.
   We use these subject areas for our annotations and group the responsibility areas
(spokesmanships) under them. The responsibility areas for 2015-17 were found on the
Danish parliament website and have been used in the present work. The three categories
Government Operations, Politics and Personal Rights have been omitted since they
deal mostly with meta-content and not with specific political domains. If speeches on


5 Mail communication with Prof. Christoffer Green-Pedersen, Political Science Department, University of Aarhus, about the CAP classification and its Danish version.

these occur, they will be categorised as Other. We merged the categories Technology
and Transportation into the category Infrastructure, which also comprises IT.
    In Table 1, we show the Danish-specific subject areas and the corresponding CAP
classes, as well as the spokesmanships related to them in the Danish parliament. The
latter information is based on the spokesmanships in the period 2015-17. The table
shows that the Danish subject areas match the main CAP codes fairly well. Exceptions
are Local and Regional Affairs and Housing, which map to the same code in CAP (14)
but are distinct areas in Danish politics. The same holds for Foreign Affairs and European
Integration, which map to the same major subject area in CAP but are distinct areas in
the Danish parliament. Other problematic cases are, e.g., Consumer Policy, which in
Denmark is normally categorised under Agriculture together with Food, while in CAP it
is categorised under (15) General Banking, Finance, and Domestic Commerce, and the
subject area Culture, which in Denmark normally comprises Sports, while the latter
subject is categorised differently in CAP.

 Table 1. Danish subject area classification of parliamentary speeches based on spokesman-
                    ships (2015-17) and the corresponding classes in CAP.

 Chosen subject areas        Spokesmanships in the           Corresponding CAP subject areas
                             Danish parliament
 Economy                     Finance                         1 Domestic Macroeconomic Issues
                             Fiscal Affairs                  1 Domestic Macroeconomic Issues
 Health Care                 Psychiatry                      3 Health
                             Health                          3 Health
 Agriculture                 Animal Welfare                  4 Agriculture
                             Fisheries                       4 Agriculture
                             Food                            4 Agriculture
                             Agriculture                     4 Agriculture
                             Consumer Policy                 1525 Consumer Policy
 Labour                      Labour market                   5 Labour and Employment
 Education                   Higher Education and Research   6 Education
                             Education                       6 Education
 Environment                 Environment                     7 Environment
 Energy                      Energy                          8 Energy
                             Climate                         705 Air and noise pollution, climate
                                                             change and climate policies
 Immigration                 Immigration and Integration     9 Immigration and Refugee Issues
                             Alien Affairs                   9 Immigration and Refugee Issues
                             Naturalization                  9 Immigration and Refugee Issues
 Infrastructure              Transportation                  10 Transportation
                             IT                              17 Space, Science, Technology, and
                                                             Communications
                             Media                           17 Space, Science, Technology, and
                                                             Communications
 Justice                     Legal affairs                   12 Law, Crime, and Family Issues
                             Constitutional Matters          20 Government issues
 Social Affairs              Children                        13 Social Welfare
                             Family                          13 Social Welfare
                             Disabled                        13 Social Welfare
                             Social services                 13 Social Welfare
                             Senior citizens                 13 Social Welfare
                             Gender equality                 2 Civil Rights, Minority Issues, and
                                                             Civil Liberties
 Housing                     Housing                         14 Community Development and
                                                             Housing Issues
 Local and Regional Affairs  Rural Districts and Islands     14 Community Development and
                                                             Housing Issues
                             Municipal Affairs               2001 Local Government Issues
 Business                    Trade and Industry              15 Industrial and commercial policy
 Defence                     Defence                         16 Defence
 Foreign Affairs             Foreign Affairs                 19 International Affairs and Foreign Aid
                             Development Cooperation         19 International Affairs and Foreign Aid
 European Integration        EU                              1910 International Affairs and Foreign Aid
 Territories                 Faroe Islands                   2105 Dependencies and Territorial Issues
                             Greenland                       2105 Dependencies and Territorial Issues
 Culture                     Cultural Affairs                23 Cultural Policy Issues
                             Ecclesiastical Affairs          210 The Danish national church
                             Sport                           1526 Sport and Gambling




4          Method

As already mentioned, the Danish Parliament Corpus 2009-2017 does not contain
information on subject areas or on the committees responsible for them. Therefore, we
use the titles of the agenda items for the meetings as an indication of the subject areas of
the speeches given at these meetings. In total, there are 182,192 speeches under 7,336
different agenda items.
    We extracted the titles of the agenda items and normalised them, e.g. “First reading
of bill 193: XYZ” was normalised to XYZ. This resulted in 6,722 different agenda
titles. For each title, up to three subjects from the chosen classification scheme were
coded manually. For example, for the title Tax on saturated fat in food, Agriculture
(comprising Food) was chosen as the primary subject, while Economy (comprising
Tax) was annotated as the secondary subject. The subject area classification of the
agenda items was added automatically to the speeches in the time slots allocated to
them. The process was repeated until there were more than 1,000 examples (speeches)
for each of the 19 subject areas. The only exception is the subject area Territories,
which was not assigned to as many speeches. The annotated corpus currently comprises
more than 29,000 speeches. We are now using the annotations as training and test data
for the automatic subject area classification.
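   The sketch below illustrates the two steps described above: normalising an agenda item
title and propagating its manually coded subjects to the speeches held under it. The regular
expression and the data structures are assumptions for illustration only; the paper does not
list the exact normalisation rules.

```python
import re

def normalise_title(title: str) -> str:
    # Strip reading/bill prefixes, e.g. "First reading of bill 193: XYZ" -> "XYZ".
    # The pattern is illustrative; the actual normalisation rules are not specified here.
    return re.sub(r"^(First|Second|Third) reading of bill \d+:\s*", "", title).strip()

# Hypothetical structures: manually coded subjects per normalised title (up to three),
# and speeches carrying the title of the agenda item they were given under.
title_subjects = {"Tax on saturated fat in food": ["Agriculture", "Economy"]}
speeches = [
    {"text": "...", "agenda_title": "First reading of bill 193: Tax on saturated fat in food"},
]

# Propagate the subject areas of each agenda item to the speeches under it.
for speech in speeches:
    speech["subjects"] = title_subjects.get(normalise_title(speech["agenda_title"]), [])

print(speeches[0]["subjects"])  # ['Agriculture', 'Economy']
```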


5      Evaluation and Analysis

Of the 6,722 agenda titles, 1,079 were manually marked for subject areas by one annotator
and then corrected by a second one. In 9% of the annotations, the second annotator
proposed another subject area or a different ranking of the two or three subjects
proposed by the first annotator. The annotators discussed the disagreement cases and in
some cases involved a third annotator, producing an agreed-upon classification. The 29,249
classified speeches contain over 615,000 tokens. Of these speeches, 16,743 (57%)
are annotated with only one subject area, 11,455 (39%) with two subject areas, and
1,051 (3.6%) with three subject areas.
    As an initial evaluation of the classification, we extracted the speakers talking about
each subject area in the 18,473 speeches that were classified under a single subject, and
we marked the spokespersons for those areas in the period 2015-2017. We found that
the spokespersons of the subject areas and related areas are in most cases the most
frequent speakers on those areas in that period. However, because politicians can be
spokespersons for more than one subject area and spokespersons can change their area
of responsibility within the same election period, this information can only be an
approximate indication that the speeches have been classified correctly.
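   A simple way to carry out this check is to count, for every subject area, how often each
politician speaks and to compare the top of each list with the spokesperson lists for
2015-17. The sketch below shows this counting step on hypothetical records, not on the
actual corpus.

```python
from collections import Counter, defaultdict

# Hypothetical records: one entry per single-subject speech.
single_subject_speeches = [
    {"speaker": "Speaker A", "subject": "Health Care"},
    {"speaker": "Speaker B", "subject": "Health Care"},
    {"speaker": "Speaker A", "subject": "Health Care"},
    {"speaker": "Speaker C", "subject": "Defence"},
]

speaker_counts = defaultdict(Counter)
for speech in single_subject_speeches:
    speaker_counts[speech["subject"]][speech["speaker"]] += 1

# The most frequent speakers per subject area can then be compared with
# the spokespersons for that area in the period 2015-2017.
for subject, counts in speaker_counts.items():
    print(subject, counts.most_common(5))
```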


6      Automatic Classification of the Speeches: First
       Experiments

In this section, we describe experiments on automatically categorising the parliament
speeches into the given subject areas using supervised classification. That is, we use a
training set T = {(s1, c1), . . . , (sn, cn)} consisting of speeches that have each been hand-
labelled with the appropriate class, and the task is to find a classifier and a model that
are capable of mapping new speeches s to their correct class c. In our experiments, we
have used a subset of the annotated speeches. The subset consists of 19,676 speeches
belonging to 18 classes (we excluded the class Territories because of the low number of
speeches). Each class contains between 900 and 1,180 speeches. All the speeches in
the chosen subset consist of at least 5 words. The speeches and the titles of the agenda
items have been part-of-speech (PoS) tagged and lemmatised.
   We extracted the lemmas of nouns and proper nouns from the speeches and
removed numbers and prepositions from the lemmas of the titles of the agenda items.
The training features we tested are the following: bow of the agenda item titles
(selected lemma types), bow of the lemmas of the speeches, TF*IDF of the lemmas of
the speeches, TF*IDF of the n-grams of the speeches’ lemmas (up to trigrams) and of
the characters (chars) of the lemmas (up to 4-grams), TF*IDF of the nominal lemmas,
and information about the speakers. The latter comprises the gender, the role (minister,
member) and the party of the speakers. Combinations of some of the features were also
tested (see Table 2).
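   The following sketch shows how such feature sets can be built with scikit-learn
vectorisers: a bow of lemmas, word n-gram TF*IDF (up to trigrams), character n-gram
TF*IDF (up to 4-grams), and a selection of nominal lemmas from PoS-tagged input. The
tagged example and the lemma strings are placeholders, not corpus data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical PoS-tagged speech: (lemma, tag) pairs; keep nouns and proper nouns only.
tagged = [("regering", "NOUN"), ("foreslå", "VERB"), ("afgift", "NOUN")]
nominal_lemmas = " ".join(lemma for lemma, tag in tagged if tag in {"NOUN", "PROPN"})

# Toy corpus of lemmatised speeches (placeholders).
speech_lemmas = ["regering foreslå afgift", "sundhed sygehus patient"]

bow_vec = CountVectorizer()                                          # lemma bow
word_tfidf = TfidfVectorizer(ngram_range=(1, 3))                     # lemma 1-3-grams, TF*IDF
char_tfidf = TfidfVectorizer(analyzer="char", ngram_range=(1, 4))    # char 1-4-grams, TF*IDF

X_bow = bow_vec.fit_transform(speech_lemmas)
X_word = word_tfidf.fit_transform(speech_lemmas)
X_char = char_tfidf.fit_transform(speech_lemmas)
print(X_bow.shape, X_word.shape, X_char.shape)
```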
    The Python scikit-learn package was used for the experiments. The speeches were
randomised, and the data were then divided into a training set (60% of the data), a test set
and an evaluation set (20% of the data each). We trained and tested the features on the
training and test sets, respectively, and finally we tested the obtained models on the
evaluation data. The scikit-learn multinomial Naïve Bayes and support vector machine
classifiers were applied. The Naïve Bayes classifier obtained the best results. This is
probably because Naïve Bayes also performs well on sparse data, and some of the
speeches consist of only a few words. In Table 2, we report the results obtained by this
classifier with different features in terms of precision, recall and weighted F-score.
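   A minimal sketch of this setup, assuming hypothetical `texts` and `labels` lists, could
look as follows: shuffle, split 60/20/20, train a multinomial Naïve Bayes classifier on bow
features and report precision, recall and weighted F-score on the held-out evaluation set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Hypothetical toy data standing in for the agenda titles/speeches and their subject areas.
texts = ["skat lov afgift", "sundhed sygehus patient", "skole undervisning elev"] * 20
labels = ["Economy", "Health Care", "Education"] * 20

# 60% training, 20% test, 20% held-out evaluation (shuffled).
X_rest, X_eval, y_rest, y_eval = train_test_split(texts, labels, test_size=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

vectoriser = CountVectorizer()
classifier = MultinomialNB()
classifier.fit(vectoriser.fit_transform(X_train), y_train)

# Precision, recall and (weighted) F-score on the evaluation set.
print(classification_report(y_eval, classifier.predict(vectoriser.transform(X_eval))))
```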

                         Table 2. Classification features and results

 No.    Features                                      Precision          Recall   F-score
 1      agenda item titles – bow                      0.96               0.96     0.96
 2      all lemmas – bow                              0.75               0.74     0.74
 3      all lemmas – TF*IDF                           0.73               0.7      0.7
 4      TF*IDF n-grams (lemmas)                       0.7                0.69     0.69
 5      TF*IDF n-grams (chars)                        0.65               0.48     0.45
 6      nominal lemmas – TF*IDF                       0.73               0.72     0.72
 7      all lemmas – bow and nominal lemmas TF*IDF    0.75               0.73     0.73
 8      agenda item titles and lemmas – bow           0.91               0.91     0.91
 9      lemma bow, TF*IDF nominal, speaker features   0.75               0.74     0.73


The results show that training the classifier on the bow of the extracted agenda item
titles (1) gives the best performance. This indicates that the manual classification was
made consistently. When data from the speeches are involved, the performance drops
from an F-score of 0.96 to an F-score of 0.91 for the best result (8). Furthermore, all
results are significantly better than those obtained by training a majority classifier
(F-score 0.06) and than chance.
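   The majority baseline mentioned above can be reproduced with scikit-learn's
DummyClassifier; with 18 roughly equally sized classes its weighted F-score is necessarily
very low. The variables below are the hypothetical ones from the previous sketch, not the
paper's own code.

```python
from sklearn.dummy import DummyClassifier

# Majority-class baseline (the features are ignored by this strategy); reuses the
# hypothetical split and vectoriser from the sketch above.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(vectoriser.transform(X_train), y_train)
print(baseline.score(vectoriser.transform(X_eval), y_eval))
```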
    A first analysis of the automatically annotated data indicates that most classification
errors when using features extracted from the speeches were due to the limited length
of some of the speeches and to the fact that we did not remove from the transcriptions
comments added by the transcribers concerning, e.g., the poor quality of the audio file.


7      Discussion and Concluding Remarks

In this paper, we have presented the construction and evaluation of a subject area
annotated subcorpus of the Danish Parliament Corpus 2009-2017. The coding
scheme mainly follows the classification used by Danish scholars in political science, and
this classification can be mapped onto the international CAP system.
   The annotation of the speeches was performed by manually annotating the titles of
the agenda items and then automatically propagating their subject areas to the speeches
under them. Most of the annotated speeches are classified under a single class, but in
some cases the annotators classified the speeches under two classes (39% of the
speeches) or under three (3.6% of the cases). The manual evaluation and the comparison
of the speeches’ subject areas with the speakers’ roles in the parliament committees
indicate that the classification is appropriate.
   The automatic classification experiments, which we performed on part of the gold
standard corpus, show that training a multinomial Naive Bayes classifier on a bow
extracted from the agenda item titles results in an F-score of 0.96, which is extremely
good. In a similar task on municipal council debates, [3] obtained much poorer results.
This may be because in those meetings the agenda item titles were not assigned as
consistently as in the Danish Parliament.
   The results obtained with features extracted automatically from the speeches also
indicate that the parliament sessions follow the pre-defined agendas quite closely.
However, using these features as training data is only useful when the speeches have a
certain length. The uneven length of the speeches is also problematic for machine
learning algorithms that require large amounts of data. Nevertheless, linguistic
features could be useful when looking for more specific topics within the general subject
areas.
   In the future, we will test whether we can predict the second subject area when
relevant, and we will experiment with other features and classifiers applied to more
data, also selecting speeches that contain at least 50 or 100 words.


References
 1. Hansen, D. H., Navarretta, C., Offersgaard, L.: A Pilot Gender Study of the Danish Parlia-
    ment Corpus. In: Proceedings of the ParlaCLARIN Workshop at the 11th edition of the
    Language Resources and Evaluation Conference, Japan (2018).
 2. Budge, I., Klingemann, H. D., Volkens, A., Bara, J., Tanenbaum, E.: Mapping Policy Pref-
    erences. Estimates for Parties, Electors, and Governments 1945-1998. Oxford University
    Press (2001)
 3. Loftis, M. W., Mortensen, P. B.: Collaborating with the Machines: A Hybrid Method for
    Classifying Policy Documents. Policy Studies Journal,
    https://doi.org/10.1111/psj.12245 (2018)
 4. Mortensen, P. B., Green-Pedersen, C.: Institutional Effects of Changes in Political Atten-
    tion: Explaining Organizational Changes in the Top Bureaucracy. Journal of Public
    Administration Research and Theory, 25(1), 165–189,
    https://doi.org/10.1093/jopart/muu030 (2015)
 5. Zirn, C.: Analyzing Positions and Topics in Political Discussions of the German Bundestag.
    In: Proceedings of the ACL Student Research Workshop, pp. 26–33 (2014)
 6. Zirn, C., Glavas, G., Nanni, F., Eichorst, J., & Stuckenschmidt, H.: Classifying topics and
    detecting topic shifts in political manifestos. Proceedings of the International Conference on
    the Advances in Computational Analysis of Political Text (PolText 2016), 88–93, Dubrov-
    nik, Croatia, 14–16 July (2016).
 7. Spärck Jones, K.: A Statistical Interpretation of Term Specificity and Its Application in Re-
    trieval. Journal of Documentation. 28: 11–21 (1972)
 8. Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., Kochut, K.:
    A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques. CoRR,
    vol. abs/1707.02919 (2017)
 9. Korde, V., Mahender, C. N.: Text Classification and Classifiers: A Survey. International
    Journal of Artificial Intelligence & Applications, Vol. 3(2), pp. 85–99, March (2012).
10. Joulin A., Grave E., Bojanowski P., Mikolov T. Bag of Tricks for Efficient Text Classifica-
    tion. Proceedings of the 15th Conference of the European Chapter of the Association for
    Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain, April
    3-7 (2017).