On Semi-Automatic Creation of Dataset for Multi-Document Automatic Summarization of News Articles and Forum Threads

Volodymyr Taranukha a, Tetiana Horokhova b and Yaroslav Linder a

a Taras Shevchenko National University of Kyiv, Volodymirska st., 64, Kyiv, 01033, Ukraine
b Borys Grinchenko Kyiv University, Bulvarno-Kudriavska st., 18/2, Kyiv, 04053, Ukraine

Abstract
The problem of semi-automatic dataset creation for multi-document summarization and forum thread summarization is analyzed. Aspects specific to Slavic languages are underlined. Dedicated algorithms for this purpose were designed and tested. Due to the non-smooth nature of the optimization problem, genetic algorithms were suggested. Some new and interesting results were obtained.

Keywords
Automatic summarization, multi-document summarization, forum thread summarization, dataset creation

Information technology and implementation (IT&I-2021), December 1–3, 2021, Kyiv, Ukraine
EMAIL: taranukha@ukr.net (A. 1); t.horokhova@kubg.edu.ua (A. 2); yaroslav.linder@gmail.com (A. 3)
ORCID: 0000-0002-9888-4144 (A. 1); 0000-0003-0113-8653 (A. 2); 0000-0003-1076-9211 (A. 3)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

The extreme proliferation of modern electronics (first and foremost, mobile phones) has made electronic data sources widely available to all kinds of users. In response to this tendency, multiple organizations, newspapers, and forums jumped at the chance to provide their own point of view, spin narratives, push advertisements, etc. This extended and enhanced data flow usually takes the form of text with some images, and there is far too much of it in the usual data flow aimed at a single person. In this research, automatic summarization is suggested as a tool to solve the problem of locating and distilling information. Among the areas of application for automatic summarization, two stand out as more problematic:
1. Multi-document summarization.
2. Forum summarization.
On top of being more complex than a typical summarization task, there is an extra layer of problems when it comes to Slavic languages. First and foremost, it is the lack of a good dataset to facilitate research.
The problem of multi-document summarization of news events is to provide a well-organized summary that covers an event completely while minimizing repetition. The focus and point of view of the input articles for an event may differ. Recent works in this area have tried neural models to exploit the graph structure of relations among text clusters. Also, a couple of recent papers have tried neural encoder-decoder models for multi-document summarization [1,2]. Due to the sparsity and high expense of human-written summaries, the generation of large-scale multi-document summarization datasets for training has been hampered. There was an attempt [3] to train abstractive sequence-to-sequence models on a huge corpus of Wikipedia text, with citations and search engine results as input documents. As far as efficiency goes, there is a notable loss of quality in these results compared to the results achieved in single-document summarization. So, a dedicated dataset to train multi-document summarization is sorely required.
The WWW discussion forums come in a variety of flavors, each with its own topic and community. User-generated content on web forums is an excellent source of information.
In the case of question-and-answer sites like Quora, the opening post is a question and the responses are answers to that question. The best answer in these forums may be chosen by the forum community via voting. On the other hand, there is no such thing as "the best answer" in discussion forums where people share their thoughts and experiences. Furthermore, discussion threads on a single topic might easily contain dozens or hundreds of individual posts, making it difficult to identify the important information in the thread, especially when using a mobile device to visit the forum. In this research, extractive summarization [4] is proposed to extract salient units of text from a source and then concatenate them to generate a shorter version of the discussion. Sentences are commonly utilized as summarization units in most summarization assignments, but for this assignment it is expected that posts will be better suited as basic units for summaries of discussion threads.
While there are many differences between the two tasks (multi-document summarization and forum summarization), there are also some similarities due to the enclosed nature of articles in data sources (or posts in threads), so there is an option to exploit said similarities on top of problem-specific features.
This paper describes useful elements to build the required dataset. The main point of the research is to provide tools for semi-automatic preliminary summary generation that will help to create the summaries required for future research.

2. Related works

To assess summarization systems, man-made reference summaries are often utilized. The TIPSTER Text Summarization Evaluation Conference [5], the NIST Document Understanding Conference [6], and the NIST Text Analytics Conference [7] all employed benchmarks based on reference summaries. For extractive summarization, which is the approach investigated in this study, a reference summary is a subset of text units picked from the source document. Depending on the task, these units might be sentences [8], utterances [9], or forum posts [10]. The first and last options are examined in this work.
Summarization is a very subjective task: the substance of the summary varies, as does the length of the summary produced. Experts who write summaries frequently disagree on what information should be included in the summary. To address this problem, the DUC 2005 assessment approach was established to account for diversity in human-generated reference summaries. As a result, for each of the 50 themes, at least four distinct summaries were developed. Each topic in the NIST TAC Guided Summarization Task received up to four alternative reference summaries.
When it comes to establishing a reference, specialists writing abstractive summaries are typically asked to create a summary of a certain length for a specific document or document collection. As a result, a corpus of reference summaries is usually produced for abstractive discussion thread summarization [11]. While abstractive summaries are not the greatest option when it comes to the evaluation of extractive summaries, they can be used in conjunction with variations of the ROUGE [13] metrics to make them helpful. There is ROUGE 2.0 [12], particularly designed with the ability to process synonym substitution, which is often used by humans in abstractive summarization tasks.
The key feature is the agreement between human experts on the content of an extractive summary.
It can be measured using the percentage of common decisions and the proportions of selected and non-selected units by the experts. The agreement is then calculated in terms of effect size (a number measuring the strength of the relationship between two variables in a population). A useful measure is Fleiss' κ [14]:

κ = (Pr(a) − Pr(e)) / (1 − Pr(e)),   (1)

where Pr(a) is the measured agreement (the percentage of common decisions) and Pr(e) is the expected agreement based on the proportions of selected and non-selected units by the experts.
A negative κ indicates structural disagreement. If κ = 0, there is no agreement between the experts (the observed agreement is as good as random). A positive κ up to 0.2 indicates slight agreement; if 0.2 < κ < 0.4, it is fair agreement; if 0.4 ≤ κ < 0.6, moderate agreement; if 0.6 ≤ κ < 0.8, substantial agreement; and if 0.8 ≤ κ ≤ 1, strong agreement.
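As a worked illustration of equation (1), the following minimal Python sketch computes Fleiss' κ for the binary selected/not-selected case; the function name and input layout are our own assumptions, not part of the algorithms described later.

```python
from typing import List

def fleiss_kappa(votes: List[int], n_experts: int) -> float:
    """Fleiss' kappa (eq. 1) for binary unit selection:
    votes[i] is the number of experts who selected text unit i."""
    n_units = len(votes)
    # Pr(a): mean per-unit share of agreeing expert pairs.
    pr_a = sum(v * v + (n_experts - v) * (n_experts - v) - n_experts
               for v in votes) / (n_units * n_experts * (n_experts - 1))
    # Pr(e): expected agreement from the overall selection proportion.
    p_selected = sum(votes) / (n_units * n_experts)
    pr_e = p_selected ** 2 + (1.0 - p_selected) ** 2
    return (pr_a - pr_e) / (1.0 - pr_e)

# Five experts judging four units: two near-unanimous picks, two rejections.
print(fleiss_kappa([5, 4, 1, 0], n_experts=5))  # 0.6 -> substantial agreement
```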
For the purpose of this research, an extensive search was performed, but not a single paper was found with an agreement higher than "moderate". Among the available papers, the highest scores go to single-document news article summaries ("moderate") [15]. Multi-document summarization is expectedly worse. There was no research on conversation transcripts and the like, but even worse marks are expected there. For now, the most direct way to resolve this issue is to use voting based on the number of experts in favor of a certain segment. Also, little to no good data was found on the summarization of forum threads.
All of the abovementioned problems are exacerbated when it comes to Slavic languages, the Ukrainian and Russian languages in particular.
Recent neural-network-based research has almost totally superseded the non-neural approach. There are papers on both extractive and abstractive approaches to this matter [16-19]. However, it is crucial to point out that a machine-learning-based approach works well if and only if there is a sufficiently large and sufficiently comprehensive dataset. Moreover, some modern approaches, such as T5 [20] and GPT-3 [21], will not work without very significant investments from third parties. So, for the purpose of this research, non-neural approaches were chosen, especially for extractive multi-document summarization. There were some papers on this topic, covering both extractive and abstractive approaches, such as [22-24], albeit most of them are significantly outdated.

3. Initial analysis

Since there is no good dataset to build on, the problem was reformulated in a roundabout manner: how can one develop a feature-based method that will produce good semi-finished summaries to reduce the future workload on the expert(s)? Subsequently, this question was divided into several other questions and a number of assumptions.
The basic assumptions are the following:
1. A thread is a sequence of small documents forming a discussion, with each user having a different point of view on the topic of discussion.
2. A news flow is a set of documents representing different points of view corresponding to the different editorial policies of news agencies. Often such a set can form a discussion if the topic stays the same over a notable period.
According to the assumptions, the research questions were formulated:
1. Do the basic assumptions stand?
2. What is the best form to represent a preliminary multi-document news summary?
3. What is the best form to represent a preliminary thread summary?
4. What is the best length of a preliminary multi-document news summary?
5. What is the best length of a preliminary thread summary?
6. What are the characteristics of articles that are selected by humans to be included in the preliminary multi-document news summary?
7. What are the characteristics of the posts that are selected by humans to be included in the preliminary thread summary?
8. What are the major qualities important for humans in a preliminary multi-document news summary?
9. What are the major qualities important for humans in a preliminary thread summary?
To answer these questions, a small poll was carried out among students of Taras Shevchenko National University of Kyiv. The answers were the following:
1. Yes, the basic assumptions look relevant and logical.
2. For the multi-document news summary, the best option is to sort articles by relative relevance to the topic and reduce each subsequent article, removing irrelevant and repetitive parts.
3. For the thread summary, the best way is to include whole posts if the thread is short enough. For long threads, it is useful to define the topic by the first post and reduce the content of each subsequent post by removing irrelevant and repetitive parts.
4. The best length of a preliminary multi-document news summary corresponds to a single screen. It was noted that it is useful to include hyperlinks to the detailed articles.
5. The best length of a preliminary thread summary is from 5 to 7 posts.
6. The most important characteristic for an article to be selected into a preliminary multi-document news summary is relevance.
7. The most important characteristic for a post to be selected into a preliminary thread summary is relevance.
8. The major quality for a preliminary multi-document news summary is representativeness.
9. The major qualities for a thread summary are representativeness and readability.
The answers point to a notable difference in source structure: it is much easier to obtain a readable summary from a set of articles as long as one of them serves as a basis. For thread summarization, it is much harder to get a consistent (i.e. readable) summary. These answers, combined with the lack of a notable dataset (especially in the Ukrainian language), forced the development of a sequence of methods based on a priori heuristics derived from the period predating the current resurgence of neural networks.
Of the two key qualities, readability looks like the one that is harder to achieve. So, readability was analyzed further. A text differs from a set of grammatically correct sentences by a certain number of connections between the sentences that are part of it. These connections are of a different nature. It is necessary to specify what types of connectivity are observed in the text:
 structural coherence;
 logical coherence;
 semantic closeness.
The structural coherence between the elements is most often formed by using special words or grammatical forms to connect the elements of the text. Structural coherence is established by the following means:
 anaphora;
 elliptical structures;
 repetition of structural elements of the text, namely phrases and words;
 usage of conjunctions.
The logical coherence of the text is ensured at the level of interpretation, although it has certain syntactical features. For example, the words «якщо» ("if"), «то» ("then"), and «інакше» ("otherwise" or "else") create logical connections in the text that are clearly different from the structural connections as defined above. Given the incomplete tools available for text interpretation, most of the emphasis in this task is placed on structural coherence. There are several reasons for this.
First, logical coherence is based on connections such as "explanation", "cause", "consequence", and so on. The main problem lies in the fact that not all of these connections, and not in all cases, have clear markers in the form of appropriate vocabulary or grammatical structures.
Second, logical coherence is formed by interpretation and is therefore subjective. Thus, it would have to be considered for each reader separately, and for this reason it is not invariant.
Third, logical coherence is formed through interpretation and therefore requires a full understanding of the text. Unfortunately, this requirement is beyond the capabilities of any modern machine learning model.
Thus, the algorithm is mainly based on the analysis of structural coherence, especially since this often allows finding logical coherence as well, because logical coherence is often accompanied by structural coherence. Semantic closeness is a special type of coherence and will be discussed below.

4. Algorithms to create dataset

Two complex algorithms were tested for the purpose of creating preliminary summaries. The first was developed in IRTC IT&S in department 165 during the "Pattern computer" research program [25]. Only some features and the usage of the first algorithm will be described in this paper. The second algorithm was hand-tailored using relatively modern instruments and is described below.

4.1. Initial processing

There is a pipeline established for the purpose of initial processing:
 a subsystem of morphological analysis;
 a subsystem of partial parsing;
 a subsystem for simplified anaphora resolution.
A big Ukrainian dictionary (over 100,000 words) was used for morphological analysis. For out-of-dictionary words, a heuristic algorithm was used [26]. The main purpose of this analysis is to find the canonical forms of words and some grammatical characteristics such as gender, case, tense, etc. Having canonical forms greatly improves the frequencies of text elements and also allows service words to be processed in the correct manner. For English texts it is possible to use stemmers, but for Slavic languages (Ukrainian and Russian in particular) it is notably more efficient to use dedicated morphological dictionaries.
On the basis of the received morphological data, a primitive syntactic analysis is carried out. Adjectives (and participles) are associated with the corresponding nouns, and nouns are associated with the corresponding verbs. Morphological features are used for anaphora resolution. Only then is a simple semantic closeness analysis performed, namely: among the alternatives, the word sense that is closest in semantic similarity [27] to the words of the context is selected. To determine the semantic similarity, the semantic database WordNet localized to the Ukrainian language [28] was used. Other aspects of semantic closeness, such as implied inference, are ignored because they are considered too complex to analyze.
This pipeline is used in both algorithms, though for the purposes of the new one, minor alterations were made to fit it into modern programming languages and libraries.

4.2. Important element detection

The importance of text elements is determined by how much the user is interested in them and how important they are for presenting the content of the text. To evaluate the importance of a term (word), the simple tf-idf statistic is used. The tf-idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general:

tf-idf_d(t) = f_{t,d} · log(N / n_t),   (2)

where f_{t,d} is the raw count of the term in the document, N is the total number of documents in the selected set, and n_t is the number of documents containing the term t.
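A compact sketch of equation (2) is shown below; it assumes the documents have already been reduced to lists of canonical word forms by the pipeline from Section 4.1, which is our own simplification of the setup.

```python
import math
from collections import Counter
from typing import Dict, List

def tf_idf(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Per-document tf-idf weights (eq. 2). Each document is a list of
    canonical word forms produced by the morphological pipeline."""
    n_docs = len(documents)
    doc_freq = Counter()                 # n_t: documents containing term t
    for doc in documents:
        doc_freq.update(set(doc))
    weights = []
    for doc in documents:
        raw_counts = Counter(doc)        # f_{t,d}
        weights.append({t: f * math.log(n_docs / doc_freq[t])
                        for t, f in raw_counts.items()})
    return weights
```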
This particular weighting scheme promotes diversity, though it is important to underline the mutual influence of terms (words). The input document(s) may use multiple synonyms, which can water down the observed saliency and thus exclude important elements from the summary. To avoid this problem, it is useful to calculate lexical chains [29] in the texts using WordNet [28]. For the purpose of this research, lexical chains were used. In order to simplify the process, the scores of the chains were calculated inside each text independently:

Homogeneity_d(Ch) = (n_d(Ch) − 1) / Length_d(Ch),   (3)

where n_d(Ch) is the number of distinct terms occurring in the chain for the document, and Length_d(Ch) is the length of the chain (the total number of occurrences of the different terms in the chain) in the document.

Score_d(Ch) = Length_d(Ch) · Homogeneity_d(Ch).   (4)

Chains were tested with the quality criterion

Score_d(Ch) > Avg(Score_d(Ch)) + 2σ,   (5)

where Avg(Score_d(Ch)) is the average of all scores of all chains in the particular document and σ is their standard deviation.
Initially, the sentences received scores based both on the chain scores and on the tf-idf scores:

Score_d(S) = Σ_{t∈S, Ch∋t} tf-idf_d(t) · Score_d(Ch),   (6)

where the summation includes only terms from the sentence S that are also included in a relevant chain. This approach by itself penalized shorter posts or shorter articles, for they often have shorter lexical chains.
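Equations (3)-(6) can be sketched as follows. The chain representation (a mapping from a chain id to the sequence of term occurrences) is our own simplification, since the actual chain construction relies on the localized WordNet.

```python
import statistics
from typing import Dict, List, Set

def surviving_chain_scores(chains: Dict[int, List[str]]) -> Dict[int, float]:
    """Score the lexical chains of one document (eqs. 3-4) and apply the
    quality criterion (eq. 5). chains[cid] lists the term occurrences."""
    scores = {}
    for cid, occurrences in chains.items():
        length = len(occurrences)                 # Length_d(Ch)
        distinct = len(set(occurrences))          # n_d(Ch)
        homogeneity = (distinct - 1) / length     # eq. 3
        scores[cid] = length * homogeneity        # eq. 4
    avg = statistics.mean(scores.values())
    sigma = statistics.pstdev(scores.values())
    # eq. 5: keep chains scoring two standard deviations above the mean.
    return {cid: s for cid, s in scores.items() if s > avg + 2 * sigma}

def sentence_score(terms: Set[str], tfidf: Dict[str, float],
                   chain_score_of_term: Dict[str, float]) -> float:
    """Eq. 6: sum tf-idf times chain score over the sentence terms that
    belong to a surviving chain."""
    return sum(tfidf.get(t, 0.0) * chain_score_of_term[t]
               for t in terms if t in chain_score_of_term)
```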
4.3. Genetic algorithm

A fairly standard genetic algorithm was used in the research. The chromosome was defined as a list corresponding to the sentence numbers that will be included in the final document (summary). An element having the value "True" indicates that the sentence will be included in the summary; the value "False" corresponds to sentences that are not included in the summary. The number of values equal to "True" must not exceed the summary sentence allotment. A chromosome can mutate by randomly changing the value of a randomly picked list item. Two chromosomes can perform cross-over in several ways:
1. Equal uniform random selection of each element from the parents.
2. Single-point cross-over, when everything before (and including) the point is taken from one parent and everything else from the other.
3. Dual-point cross-over, when the head and tail are taken from one parent and the middle part between the two cut points is taken from the other parent.
Each version of cross-over has its own influence on performance and evaluation results. After the cross-over is performed, either padding or trimming can happen. If the summary is shorter than necessary, the most salient unused sentences are included in the chromosome. If the summary is longer than necessary, the least salient sentences are removed from it.
The algorithm works as follows:

    The first generation of chromosomes is generated at random and placed into the empty list L1.
    For G generations:
        L2 = {}
        Random mutations are imposed on the chromosomes K times. The mutants are placed into L2.
        For all chromosomes on the list L1:
            For all chromosomes on the list L1:
                The pair of chromosomes creates offspring by cross-over.
                The descendants are put into L2.
        The chromosomes in L2 are sorted by rating and the N best are selected.
        If the combined score of population L1 equals the combined score of population L2: abort the algorithm.
        L1 = L2

where G is the number of generations, N is the power of the chromosome set, and K is the mutability parameter. The rating is calculated as the total sum of the individual scores of the terms, combined with a global coherence score in accordance with the principles laid out in the Initial analysis section.
The first version of cross-over was implemented in the algorithm developed in Department 165. The second and third ones were implemented during this research.
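A minimal sketch of the chromosome operators described above, assuming saliency scores per sentence from Section 4.2; population handling and the rating function are left out for brevity.

```python
import random
from typing import List

Chromosome = List[bool]  # gene i is True if sentence i enters the summary

def mutate(parent: Chromosome) -> Chromosome:
    """Flip the inclusion flag of one randomly chosen sentence."""
    child = parent[:]
    i = random.randrange(len(child))
    child[i] = not child[i]
    return child

def crossover_uniform(a: Chromosome, b: Chromosome) -> Chromosome:
    """Type 1: equal uniform random selection per element from the parents."""
    return [random.choice(pair) for pair in zip(a, b)]

def crossover_single_point(a: Chromosome, b: Chromosome) -> Chromosome:
    """Type 2: everything up to and including the point from a, rest from b."""
    p = random.randrange(len(a))
    return a[:p + 1] + b[p + 1:]

def crossover_dual_point(a: Chromosome, b: Chromosome) -> Chromosome:
    """Type 3: head and tail from a, middle between the cut points from b."""
    p, q = sorted(random.sample(range(len(a)), 2))
    return a[:p] + b[p:q] + a[q:]

def repair(ch: Chromosome, allotment: int, saliency: List[float]) -> None:
    """Pad with the most salient unused sentences or trim the least salient
    used ones until the summary fits the sentence allotment."""
    while sum(ch) < allotment:
        ch[max((i for i, used in enumerate(ch) if not used),
               key=saliency.__getitem__)] = True
    while sum(ch) > allotment:
        ch[min((i for i, used in enumerate(ch) if used),
               key=saliency.__getitem__)] = False
```

In this sketch, repair would be applied to each offspring right after cross-over or mutation, matching the padding/trimming step described above.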
5. First series of numerical experiments

For each version of cross-over, experiments were performed on Ukrainian texts, and the results were evaluated by hand. Each experiment consisted of 10 launches of multi-document summarization and 10 launches of forum thread summarization. During each launch of multi-document summarization, 7 different articles collected from the source (https://www.ukr.net/news/politics.html) were processed. During each launch of thread summarization, a thread with at least 7 posts from https://replace.org.ua/forum/9/ («Український форум програмістів → Обговорення», "Ukrainian programmers' forum → Discussion") was processed. Each time, a score was manually assigned to the summary based on perceived performance, ranging from 1 ("entirely unsatisfactory") to 5 ("good selection for future work"). The average scores and their variances are presented in Table 1.
It was initially expected that the first type of cross-over would perform best, because this is true for many other problems solved using genetic algorithms. But in this experiment things went the other way: the first type of cross-over ended up as the worst of the three in both tasks. There is also an unexpected difference between the performance of the second and third types of cross-over across the two tasks. Usually a consistent difference in performance is expected across the board as long as the task stays the same.

Table 1
First series of evaluation results

Cross-over | Multi-document summarization: average score | Multi-document summarization: score variance | Thread summarization: average score | Thread summarization: score variance
Type 1 | 3.8 | 0.84 | 3.2 | 1.07
Type 2 | 4.3 | 0.77 | 3.5 | 0.5
Type 3 | 4.2 | 0.84 | 3.7 | 0.46

6. Algorithm adjustments and second series of numerical experiments

Due to the unexpected behavior of the summarization algorithm, several hypotheses were put forward and tested. They mostly revolved around the notions of continuity, document (post) boundaries, and hidden features of human perception. For example, in most cases the results of multi-document summarization retained the majority of sentences from the most salient (important) document, while the less important documents ended up as small chunks mashed together at the end of the summary. Nevertheless, this did not prevent the testers from giving relatively high marks to such summaries. At the same time, destruction (chunking) of the first post in the thread summarization task was a surefire way to generate a low expert score regardless of the relative value (contribution) of the abovementioned post to the general discussion quality in the thread. Moreover, chunking of any post carried a significant negative impact on the score of the generated summary, while exclusion of the same post often resulted in a notably milder expert score penalty. To address the abovementioned issues, some changes were introduced to the summarization algorithm.
First of all, the first post in a thread was made mandatory regardless of its actual contribution to the summary. Second, a chunking penalty was inserted into the final calculations:

Penalty_d = Selected_d / Total_d,   (7)

where Selected_d is the number of selected sentences from the document and Total_d is the total number of sentences in the document.

Score_d(S) = Σ_{t∈S, Ch∋t} tf-idf_d(t) · Score_d(Ch) · Penalty_d.   (8)

This resulted in a higher score if fewer posts were cut into pieces during the general optimization. Observation of the results often showed that the algorithm retained the initial boundaries of posts in exchange for removing some posts entirely.
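Equations (7) and (8) can be sketched as follows; the per-document data layout is our own assumption, carried over from the earlier sketches.

```python
from typing import Dict, List, Set

def chunking_penalty(selected: int, total: int) -> float:
    """Eq. 7: the share of a document's sentences kept in the summary.
    Fully used documents score 1.0; heavily chunked ones are down-weighted."""
    return selected / total

def penalized_document_score(sentences: List[Set[str]], chosen: List[bool],
                             tfidf: Dict[str, float],
                             chain_score_of_term: Dict[str, float]) -> float:
    """Eq. 8: the summed sentence scores of one document, scaled by the
    document's chunking penalty."""
    penalty = chunking_penalty(sum(chosen), len(sentences))
    total = 0.0
    for terms, used in zip(sentences, chosen):
        if used:
            total += sum(tfidf.get(t, 0.0) * chain_score_of_term.get(t, 0.0)
                         for t in terms)
    return total * penalty
```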
Also, additional changes were introduced to the genetic algorithm to improve the quality of summaries and the speed of convergence. This approach has some similarities to the chromosome reuse strategy [30], but instead of a chromosome library it uses memory about ancestral behavior directly.

6.1. Chromosomes with memory

The chromosome was defined as the list corresponding to the sentence numbers that will be included in the final document (summary). In comparison to the standard chromosome, an extra field was introduced for each variable (gene). This field contains information about recent changes in the variable and is used during mutation and cross-over. There are two rules which influence the behavior of chromosomes with a memory field:
1. If a new (mutant) chromosome would be created by flipping a Boolean variable that was recently in the opposite state, another variable is picked for flipping.
2. If two chromosomes undergoing cross-over have too many (above a certain threshold T) variables going in opposite directions (as indicated by the respective memory fields), the cross-over does not produce offspring. If necessary, another pair of chromosomes will undergo cross-over.
For the purpose of this research, the extra memory field was represented by Boolean variables, but in the general case the memory field can be implemented as a set of integer or even real variables. First of all, this optimization is intended to boost convergence by preventing the algorithm from needlessly recalculating summary quality while it cycles the values of the same (sub)set of variables. While in general this optimization can be used with any kind of cross-over, for the purpose of this research it was applied to the dual-point cross-over, as it showed the best average performance. All other versions of cross-over were removed from consideration.
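A minimal sketch of the two memory rules, assuming the Boolean memory variant used in this research; the data layout, the bounded re-picking, and the threshold handling are our own illustration.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MemoryChromosome:
    genes: List[bool]
    # memory[i] remembers the value gene i most recently flipped to;
    # None means the gene has not changed recently. Booleans here,
    # integers or reals in the general case.
    memory: List[Optional[bool]]

def mutate(parent: MemoryChromosome) -> MemoryChromosome:
    """Rule 1: avoid flipping a gene back to a state it recently left."""
    child = MemoryChromosome(parent.genes[:], parent.memory[:])
    i = random.randrange(len(child.genes))
    for _ in range(len(child.genes)):            # bounded re-picking
        if child.memory[i] != child.genes[i]:    # not a recent flip target
            break
        i = random.randrange(len(child.genes))
    child.genes[i] = not child.genes[i]
    child.memory[i] = child.genes[i]
    return child

def may_cross_over(a: MemoryChromosome, b: MemoryChromosome,
                   threshold: int) -> bool:
    """Rule 2: refuse cross-over when too many genes recently moved in
    opposite directions; the caller then tries another pair."""
    opposed = sum(1 for ma, mb in zip(a.memory, b.memory)
                  if ma is not None and mb is not None and ma != mb)
    return opposed <= threshold
```

How the memory fields propagate through cross-over is not fixed by the description above; inheriting each gene's memory from the contributing parent is one natural choice.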
6.2. Final evaluation results

The evaluation results for the fourth and fifth versions of preliminary summary generation are presented in Table 2.

Table 2
Second series of evaluation results

Version | Multi-document summarization: average score | Multi-document summarization: score variance | Thread summarization: average score | Thread summarization: score variance
Version 4 | 4.4 | 0.49 | 3.9 | 0.77
Version 5 | 4.4 | 0.93 | 3.9 | 0.98

As shown in the table, the introduction of the strict order and the chunking penalty was most beneficial for thread summarization. Nevertheless, none of the tweaks to the algorithm allowed perfect scores. Also, regardless of the tweaks, forum thread summarization definitely works worse in comparison to multi-document summarization. On the issue of performance, the results were not very conclusive. In most cases the algorithm converges notably faster, but there were some cases when it ran for the full number of generations and failed to achieve good results. This is also reflected in the notable growth of the score variance.

7. Conclusions and future work

For the purpose of the future development of dataset(s) for the evaluation of summaries, some experiments were performed. It was shown that it is possible to achieve good results (rated as acceptable by human experts) with genetic algorithms for semi-automatic summary generation for a multi-document summarization dataset. For a thread summarization dataset, it can be beneficial to combine different approaches. The relatively high variance in scores points to a mix of some good and some poor results. It can be presumed that by picking the n-best from several methods, the abovementioned problems can be alleviated.
The strange behavior of the optimization algorithms probably emerges from the nature of the medium of thread discussions. Unlike journalists, forum writers actually discuss things directly and often quote each other, creating complex and convoluted chains of reasoning. Moreover, sometimes they do not quote directly but instead make indirect references or draw conclusions based on the discussed subject. This makes the analysis of such occasions inconvenient not only for the algorithm but also for human assistants.
Future work will be centered on getting enough manpower to implement datasets for Slavic languages, first and foremost a dataset for the Ukrainian language.

8. Acknowledgements

The authors would like to acknowledge the following people for their contributions to the research: prof. Anisimov A.V. from the Faculty of Computer Sciences and Cybernetics, Taras Shevchenko National University of Kyiv, for useful suggestions on the nature of natural language texts and general support; staff members of dpt. 165 of IRTC IT&S, Kyiv, for libraries and support provided during this research.

9. References

[1] Y. Liu and M. Lapata, Hierarchical transformers for multi-document summarization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, July 28 - August 2, 2019, pp. 5070-5081. URL: https://aclanthology.org/P19-1500.pdf
[2] M. Yasunaga et al., Graph-based neural multi-document summarization. In: Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada, August 3 - August 4, 2017, pp. 452-462. URL: https://aclanthology.org/K17-1045.pdf
[3] P. J. Liu et al., Generating Wikipedia by summarizing long sequences. 2018. URL: https://arxiv.org/abs/1801.10198
[4] J. Xu and G. Durrett, Neural extractive text summarization with syntactic compression. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, Hong Kong, China, November 3-7, 2019, pp. 3292-3303. URL: https://aclanthology.org/D19-1324.pdf
[5] I. Mani et al., The TIPSTER SUMMAC text summarization evaluation. In: Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics, 1999, pp. 77-85. https://doi.org/10.3115/977035.977047
[6] A. Nenkova, Automatic text summarization of newswire: Lessons learned from the Document Understanding Conference. 2005. URL: https://www.aaai.org/Papers/AAAI/2005/AAAI05-228.pdf
[7] M. El-Haj, U. Kruschwitz and C. Fox, University of Essex at the TAC 2011 MultiLingual Summarisation Pilot. 2011. URL: http://repository.essex.ac.uk/8920/1/UoEssex.proceedings.pdf
[8] H. Jing and K. McKeown, Cut and paste based text summarization. In: Proceedings of the 1st Meeting of the North American Chapter of the Association for Computational Linguistics, 2000, pp. 178-185. URL: https://aclanthology.org/A00-2024.pdf
[9] M. Li et al., Keep meeting summaries on topic: Abstractive multi-modal meeting summarization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, July 2019, pp. 2190-2196. URL: https://aclanthology.org/P19-1210.pdf
[10] V.A. Grozin, N.F. Gusarova and N.V. Dobrenko, Feature selection for language independent text forum summarization. In: Proceedings of the International Conference on Knowledge Engineering and the Semantic Web, Springer, September 2015, pp. 63-71.
[11] E. Barker et al., The SENSEI annotated corpus: Human summaries of reader comment conversations in on-line news. In: Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, September 2016, pp. 42-52. URL: https://aclanthology.org/W16-3605.pdf
[12] K. Ganesan, ROUGE 2.0: Updated and improved measures for evaluation of summarization tasks. 2018. URL: https://arxiv.org/pdf/1803.01937.pdf
[13] C.Y. Lin, ROUGE: A package for automatic evaluation of summaries. In: Workshop Text Summarization Branches Out, Barcelona, Spain, July 2004, pp. 74-81. URL: https://aclanthology.org/W04-1013.pdf
[14] J. L. Fleiss and J. Cohen, The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, Vol. 33 (1973), pp. 613-619. https://doi.org/10.1177/001316447303300309
[15] M. Mitra, A. Singhal and C. Buckley, Automatic text summarization by paragraph extraction. In: Intelligent Scalable Text Summarization, 1997. URL: https://aclanthology.org/W97-0707.pdf
[16] K. Al-Sabahi, Z. Zuping and M. Nadher, A hierarchical structured self-attentive model for extractive document summarization (HSSAS). IEEE Access, 6, 2018, pp. 24205-24212. URL: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=8344797
[17] K. Yao et al., Deep reinforcement learning for extractive document summarization. Neurocomputing, 284, 2018, pp. 52-62. URL: https://www.researchgate.net/publication/322715462_Deep_Reinforcement_Learning_for_Extractive_Document_Summarization
[18] W. Li et al., Improving neural abstractive document summarization with explicit information selection modeling. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 1787-1796. URL: https://aclanthology.org/D18-1205.pdf
[19] J. Tan, X. Wan and J. Xiao, Abstractive document summarization with a graph-based attentional neural model. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Volume 1: Long Papers, July 2017, pp. 1171-1181. URL: https://aclanthology.org/P17-1108.pdf
[20] T5. URL: https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html
[21] GPT-3. URL: https://openai.com/blog/gpt-3-apps/
[22] A. Haghighi and L. Vanderwende, Exploring content models for multi-document summarization. In: Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, June 2009, pp. 362-370. URL: https://aclanthology.org/N09-1041.pdf
[23] C.Y. Lin and E. Hovy, From single to multi-document summarization. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 2002, pp. 457-464. URL: https://aclanthology.org/P02-1058.pdf
[24] X. Wan and J. Yang, Multi-document summarization using cluster-based link analysis. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2008, pp. 299-306. URL: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.222.6018&rep=rep1&type=pdf
[25] V. Gritsenko, Zakluchnuy zvit pro vukonannya DCNTP "Obrazny kompyuter" [Final report on completion of STSTP "Pattern computer"]. IRTC IT&S, Kyiv, 2010. 44 p. URL: http://obrazcomp.irtc.org.ua/Pressa/Zvit/Zvit_OK.pdf
[26] A.V. Anisimov, A.N. Romanik, V.Yu. Taranukha, Evristicheskiye algoritmy dlya opredeleniya kanonicheskih form i gramaticheskih harakteristik slov [Heuristic algorithms for determination of canonical forms and grammatical characteristics of words]. Cybernetics and Systems Analysis, Vol. 40 (2004), Iss. 2, pp. 3-15.
[27] Z. Wu and M. Palmer, Verb semantics and lexical selection. In: 32nd Annual Meeting of the Association for Computational Linguistics, New Mexico State University, Las Cruces, New Mexico, 1994, pp. 133-138.
[28] A. Anisimov et al., Ukrainian WordNet: creation and filling. In: International Conference on Flexible Query Answering Systems, Springer, Berlin, Heidelberg, September 2013, pp. 649-660.
[29] R. Barzilay and M. Elhadad, Using lexical chains for text summarization. In: Advances in Automatic Text Summarization, 1999, pp. 111-121. URL: https://academiccommons.columbia.edu/doi/10.7916/D8086DM3/download
[30] A. Acan and Y. Tekol, Chromosome reuse in genetic algorithms. In: Genetic and Evolutionary Computation Conference, Springer, Berlin, Heidelberg, July 2003, pp. 695-705.