
Marking up Dramatic Text: a Case Study of “7 stories” by Morris Panych

Ivan Bekhta

ivan.bekhta@gmail.com 0

Nataliia Hrytsiv

nataliia.m.hrytsiv@lpnu.ua 1

Anastasiia Matviychuk 1

0 Lviv Franko National University, Universytetska Street, 1, Lviv, 79000, Ukraine

1 Lviv Polytechnic National University, Stepana Bandery Street, 12, Lviv, 79000, Ukraine

The paper elucidates the process, challenges and results of using computational linguistics tools (NLP) and a pre-computer technique (TEI for tagging character utterances) in processing dramatic text. As the material for analysis we have chosen the modern play “7 stories” of the Canadian playwright Morris Panych, studied from the viewpoint of statistical indicators and textual coefficients. Special attention is paid to the statistical parameters of the main characters in the play. The results show the following numeric characteristics: number of meanings (N); maximal meaning (max); minimal meaning (min); range (R); mode (Mo); median (Md); mean (Ẋ); standard deviation (Ϭ); coefficient of variation (ν); standard error (Sẋ); measurement error (ε).

Keywords: translation, NLP, quantitative analysis, text mark-up, applied linguistics, drama text tagging

1. Introduction

In addition, an approach towards, and a detailed study of, dramatic texts as a unique literary genre is a separate challenge in current studies, with special requirements for the application of NLP tools and text mark-up. Therefore, the study of Morris Panych's play "7 Stories" is relevant.

Modern Canadian drama has been little studied from numerous viewpoints, i.e. philological, translatological and rhetorical, and least of all from the angle of mathematical linguistics and statistics.

In order to understand the specifics of dramatic works, we additionally considered the concept of author's style, postmodern literature (to which the work under study belongs), and the life paths of the author and the translator.

The play "7 Stories" by Morris Panych, translated by Ivan Krychfalushiy, is an example of postmodern literature, which arose as a challenge and opposition to the laws of modernism.

2. Method and preparation characteristics

Considering the vast quantities of source text (ST) and target text (TT) data available today for analysis, as discussed in [3, 4, 5, 6], Natural Language Processing is among the most interesting and promising aspects of data science [7, 8, 9, 10, 11, 12, 13].

By default, the text data of the original is difficult to process [14, 15, 16, 17, 18]; given the challenge of comparing and contrasting it with the translated drama text [19, 20, 21, 22, 23, 24], the task can be complicated [25, 26, 27, 28], though incredibly appealing [29, 30, 31, 32, 33].

Within this study project, we opted to explore how NLP techniques, especially mark-up possibilities, can advance the processing of performance/drama text for statistical profiling of ST and TT.

The project outlined in the current paper explores the distribution of the number of words in a sentence, as well as other numeric characteristics, analyzed collectively and for each character of the drama under analysis, in contrast with the Ukrainian translation.

2.1. Stages of working with the text document “7 stories” by Morris Panych

A number of actions were performed for the statistical analysis, in the following stages:

• The books containing the original text and the translation were scanned for further manipulation using ABBYY FineReader software;
• The scans were converted from .pdf to .docx so that the text could be marked up;
• The formatting of the text was checked, discrepancies between the scanned pdf file and the text documents were detected, and the text was normalized in the MS Word editor.

Next, the focus was on:

• Selection of a text mark-up system according to its features;
• Implementation of proper tags for the original work;
• Implementation of proper tags for the translated version;
• Processing of the calculated text results using the Python programming language;
• Analysis and description of the statistical parameters N, max, min, R, Mo, Md, Ẋ, Ϭ, ν, Sẋ, ε.

The original text and its translation were marked up using the same marking rules.

XML (eXtensible Markup Language), a text markup language, was used to implement the mark-up at the structural level.

XML was preferred since it fully determines the logical structure of a document.

XML ensures that data such as images, text, and other parts of a Web document can be defined and structured regardless of the platform used to render them.

Since the current paper deals with a dramatic work, the text mark-up and tag patterns were selected and adjusted for the analysis of this type of work. Let us now turn to the text mark-up system specific to drama text.

2.2. Mark-up pattern

2.2.1. Pattern

The following text markings were chosen according to the features of the dramatic work:

<chtr>...</chtr> — paired marking, which is used to indicate a solid whole part of the text related to a particular character;

<cnm>...</cnm> — paired marking, which is used to indicate the name of the character with a colon;

<s>...</s> — paired marking, which is used to denote a sentence in the speech of the character;

<mtr>...</mtr> — paired marking, which is used to mark all author's remarks throughout the text.

2.2.2. Example

<mtr>The action of the play takes place outside an apartment building-on the ledge, outside various windows of the seventh storey. As the play progresses, the lights emphasize the time elapsed between early evening and late night. As the play opens, we hear a party in progress from one of the windows. MAN stands on the ledge, in a state of perplexity, contemplating the depths below. He seems disturbed, confused. Then he comes to what seems to be a resolution. He prepares to jump. When he is about to leap, the window next to him flies open. CHARLOTTE appears. She holds a man's wallet, which she attempts to throw out the window. RODNEY, charging up from behind, grabs her hand. A window-ledge struggle ensues.</mtr>
<chtr><cnm>CHARLOTTE</cnm> <s>Let GO of me!!!</s><s> Let GO!!</s></chtr>
<chtr><cnm>RODNEY</cnm> <mtr>(threatening)</mtr><s> So-help-me-GOD, CHARLOTTE. </s></chtr>
<chtr><cnm>CHARLOTTE</cnm> <mtr>(daring him)</mtr><s> What??</s><s> WHAT??!! </s></chtr>
<chtr><cnm>RODNEY</cnm> <s>Give me back my wallet! </s></chtr>
<mtr>She tries to throw it again. They struggle. </mtr>
<chtr><cnm>RODNEY</cnm> <s>What’s WRONG with you?</s><s> Are you CRAZY?! </s></chtr>
<chtr><cnm>CHARLOTTE</cnm> <s>YES! </s><s>YES, I AM!!! </s></chtr>
<chtr><cnm>RODNEY</cnm> <s>MY GOLD CARD is in there!! </s></chtr>
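To illustrate how such mark-up lends itself to counting, here is a minimal Python sketch that reads a fragment of the example and collects the number of words in each <s> sentence per character. It is not the authors' script: the wrapping <play> root element, the function name sentence_lengths, and the whitespace-based definition of a word are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Fragment of the marked-up play. The <play> root is an assumption added here,
# because the fragment on its own has no single root element.
SAMPLE = """<play>
<mtr>She tries to throw it again. They struggle.</mtr>
<chtr><cnm>CHARLOTTE</cnm> <s>Let GO of me!!!</s><s> Let GO!!</s></chtr>
<chtr><cnm>RODNEY</cnm> <mtr>(threatening)</mtr><s> So-help-me-GOD, CHARLOTTE. </s></chtr>
<chtr><cnm>RODNEY</cnm> <s>Give me back my wallet! </s></chtr>
</play>"""


def sentence_lengths(xml_text):
    """Map each character name to the word counts of that character's <s> sentences."""
    lengths = defaultdict(list)
    root = ET.fromstring(xml_text)
    for turn in root.iter("chtr"):
        # <cnm> holds the character name, possibly followed by a colon.
        name = turn.findtext("cnm", default="").strip().rstrip(":")
        for sentence in turn.iter("s"):
            text = "".join(sentence.itertext())
            # Assumption: a word is any whitespace-separated token.
            lengths[name].append(len(text.split()))
    return dict(lengths)


print(sentence_lengths(SAMPLE))
```

Author's remarks in <mtr> are skipped because only <s> elements are counted, so stage directions do not inflate a character's sentence lengths.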

3. Results

This section of the study presents statistics calculated from the number of words in a sentence. That is, the unit of measurement in this statistical calculation is the word. The findings illustrate the contrast between the ST and TT results for the statistical parameters, i.e. N, max, min, R, Mo, Md, Ẋ, Ϭ, ν, Sẋ, ε. The schematic representation follows the data of each drama character one by one.

3.1. Analysis of the part of the text that belongs to the drama character of "Charlotte"

Having analysed the distribution of the number of words in a sentence by absolute and relative frequency, we have obtained the following numeric characteristics:

Charlotte: the whole ST data: 1 — 58 (90,62%); 2 — 4 (6,25%); 3 — 1 (1,56%); 4 — 1 (1,56%).

The data for «Charlotte» show that the absolute frequency of sentences with 1 word equals 58; with 2 words, 4; with 3 words, 1; and with 4 words, 1.

In the translation, the most frequent sentences are also those with 1 word.

Charlotte: the whole TT data: 1 — 35 (30,97%); 4 — 17 (15,04%); 5 — 13 (11,50%); 2 — 12 (10,62%); 6 — 12 (10,62%); 3 — 10 (8,85%); 7 — 5 (4,42%); 11 — 3 (2,65%); 9 — 2 (1,77%); 10 — 2 (1,77%); 8 — 1 (0,88%); 12 — 1 (0,88%). The last two are the least frequent.
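The absolute and relative frequencies listed above can be derived from a plain list of sentence lengths. The following Python sketch shows one way to do so, using Charlotte's ST counts as input; the function name frequency_table and the tie-breaking rule (ascending length for equal frequencies) are assumptions chosen to match the order in which the paper lists its data.

```python
from collections import Counter


def frequency_table(lengths):
    """Return (length, absolute frequency, relative frequency in %) rows,
    most frequent first; ties are ordered by ascending length."""
    counts = Counter(lengths)
    total = sum(counts.values())
    rows = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(length, n, round(100 * n / total, 2)) for length, n in rows]


# Charlotte's ST data as reported: 58, 4, 1 and 1 sentences of length 1 to 4.
charlotte_st = [1] * 58 + [2] * 4 + [3] + [4]

for length, n, percent in frequency_table(charlotte_st):
    print(f"{length}: {n} ({percent:.2f}%)")
```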

On the basis of the data above, the following parameters are calculated: number of meanings, maximal meaning, minimal meaning, range, mode, median, mean, standard deviation, coefficient of variation, standard error, measurement error.

Results are presented in Table 1.

3.2. Analysis of the part of the text that belongs to the drama character of "Rodney"

Having analysed the distribution of the number of words in a sentence by absolute and relative frequency, we have obtained the following numeric characteristics:

Rodney: the whole ST data: 1 — 37 (90,24%); 2 — 3 (7,32%); 3 — 1 (2,44%).

The data for «Rodney» show that the absolute frequency of sentences with 1 word equals 37; with 2 words, 3; and with 3 words, 1.

Rodney: the whole TT data: 1 — 16 (21,05%); 2 — 13 (17,11%); 3 — 11 (14,47%); 4 — 9 (11,84%); 5 — 9 (11,84%); 6 — 7 (9,21%); 7 — 5 (6,58%); 9 — 3 (3,95%); 8 — 2 (2,63%); 10 — 1 (1,32%).

Based on the data above, the following calculations are made and presented in Table 2. Table 2 shows the following results:

ST data numeric characteristic: Number of meanings (N) — 41; maximal meaning (max) — 3; minimal meaning (min) — 1; range (R) — 2; mode (Mo) — 1; median (Md) — 2,0; mean (Ẋ) — 1,12; standard deviation (Ϭ) — 0,39; coefficient of variation (ν) — 0,3519; standard error (Sẋ) — 0,0617; measurement error (ε) — 0,1077.

TT data numeric characteristic: Number of meanings (N) — 76; maximal meaning (max) — 10; minimal meaning (min) — 1; range (R) — 9; mode (Mo) — 1; median (Md) — 5,5; mean (Ẋ) — 3,76; standard deviation (Ϭ) — 2,37; coefficient of variation (ν) — 0,6304; standard error (Sẋ) — 0,2721; measurement error (ε) — 0,1417.
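The paper does not state the formulas behind these parameters, so the following Python sketch is a reconstruction under stated assumptions: the population standard deviation is used, the standard error is Ϭ/√N, the coefficient of variation is Ϭ/Ẋ, and the measurement error is 1,96·Sẋ/Ẋ. With these assumptions the sketch reproduces the Rodney ST values reported above (1,12; 0,39; 0,3519; 0,0617; 0,1077). The function name describe and the dictionary keys are hypothetical.

```python
import math
from statistics import fmean, median, pstdev


def describe(freq):
    """Summarise a {sentence length: absolute frequency} distribution."""
    values = [length for length, n in freq.items() for _ in range(n)]
    n = len(values)
    mean = fmean(values)
    sd = pstdev(values)              # assumption: population standard deviation
    se = sd / math.sqrt(n)           # standard error of the mean
    return {
        "N": n,
        "max": max(values),
        "min": min(values),
        "R": max(values) - min(values),
        "Mo": max(freq, key=freq.get),   # most frequent sentence length
        "Md": median(freq.keys()),       # median of the distinct lengths (see note)
        "mean": mean,
        "sd": sd,
        "v": sd / mean,                  # coefficient of variation
        "se": se,
        "eps": 1.96 * se / mean,         # assumption: relative error at the 95% level
    }


rodney_st = {1: 37, 2: 3, 3: 1}
for name, value in describe(rodney_st).items():
    print(name, round(value, 4))
```

A note on the median: the reported Md values coincide with the median of the distinct sentence lengths (1, 2, 3 gives 2,0 for Rodney's ST data), which is what the sketch computes. A frequency-weighted median, statistics.median(values), would give 1 for the same data.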

3.3. Analysis of the part of the text that belongs to the drama character of "Man"

By analogy with the previous characters (Charlotte and Rodney), we obtain the results for the other characters; here, Man.

Man: the whole ST data: 1 — 228 (87,36%); 2 — 27 (10,34%); 3 — 6 (2,30%).

Thus, the data for «Man» show that the absolute frequency of sentences with 1 word equals 228; with 2 words, 27; and with 3 words, 6.

Man: the whole TT data: 1 — 99 (18,50%); 3 — 90 (16,82%); 4 — 78 (14,58%); 2 — 61 (11,40%); 5 — 48 (8,97%); 6 — 47 (8,79%); 7 — 37 (6,92%); 8 — 18 (3,36%); 9 — 16 (2,99%); 10 — 9 (1,68%); 11 — 8 (1,50%); 12 — 7 (1,31%); 15 — 4 (0,75%); 13 — 3 (0,56%); 16 — 3 (0,56%); 18 — 2 (0,37%); 14 — 1 (0,19%); 17 — 1 (0,19%); 19 — 1 (0,19%); 23 — 1 (0,19%); 27 — 1 (0,19%).

Next, we have calculated number of meanings, maximal meaning, minimal meaning, range, mode, median, mean, standard deviation, coefficient of variation, standard error, measurement error. The results are demonstrated in Table 3.

ST data numeric characteristic: Number of meanings (N) — 261; maximal meaning (max) — 3; minimal meaning (min) — 1; range (R) — 2; mode (Mo) — 1; median (Md) — 2,0; mean (Ẋ) — 1,15; standard deviation (Ϭ) — 0,42; coefficient of variation (ν) — 0,3619; standard error (Sẋ) — 0,0258; measurement error (ε) — 0,0439.

TT data numeric characteristic: Number of meanings (N) — 535; maximal meaning (max) — 27; minimal meaning (min) — 1; range (R) — 26; mode (Mo) — 1; median (Md) — 11,0; mean (Ẋ) — 4,52; standard deviation (Ϭ) — 3,47; coefficient of variation (ν) — 0,7678; standard error (Sẋ) — 0,1500; measurement error (ε) — 0,0651.

3.4. Analysis of the part of the text that belongs to the drama character of "Leonard"

By analogy with the previous characters, we obtain the results for the character Leonard.

Leonard: the whole ST data: 1 — 92 (86,79%); 2 — 12 (11,32%); 3 — 2 (1,89%).

Leonard: the whole TT data: 1 — 30 (14,49%); 5 — 28 (13,53%); 2 — 27 (13,04%); 3 — 27 (13,04%); 4 — 24 (11,59%); 6 — 23 (11,11%); 8 — 15 (7,25%); 7 — 8 (3,86%); 9 — 5 (2,42%); 10 — 5 (2,42%); 12 — 4 (1,93%); 14 — 3 (1,45%); 13 — 2 (0,97%); 16 — 2 (0,97%); 17 — 2 (0,97%); 11 — 1 (0,48%); 19 — 1 (0,48%).

ST data numeric characteristic: Number of meanings (N) — 106; maximal meaning (max) — 3; minimal meaning (min) — 1; range (R) — 2; mode (Mo) — 1; median (Md) — 2,0; mean (Ẋ) — 1,15; standard deviation (Ϭ) — 0,41; coefficient of variation (ν) — 0,3539; standard error (Sẋ) — 0,0396; measurement error (ε) — 0,0674.

TT data numeric characteristic: Number of meanings (N) — 207; maximal meaning (max) — 19; minimal meaning (min) — 1; range (R) — 18; mode (Mo) — 1; median (Md) — 9,0; mean (Ẋ) — 4,94; standard deviation (Ϭ) — 3,53; coefficient of variation (ν) — 0,7148; standard error (Sẋ) — 0,2453; measurement error (ε) — 0,0974.

3.5. Analysis of the part of the text that belongs to the drama character of "Jennifer"

Jennifer: the whole ST data: 1 — 21 (84,00%); 2 — 3 (12,00%); 6 — 1 (4,00%).

Jennifer: the whole TT data: 6 — 5 (19,23%); 4 — 4 (15,38%); 2 — 3 (11,54%); 3 — 3 (11,54%); 5 — 2 (7,69%); 9 — 2 (7,69%); 1 — 1 (3,85%); 7 — 1 (3,85%); 8 — 1 (3,85%); 10 — 1 (3,85%); 11 — 1 (3,85%); 14 — 1 (3,85%); 15 — 1 (3,85%).

ST data numeric characteristic: Number of meanings (N) — 25; maximal meaning (max) — 6; minimal meaning (min) — 1; range (R) — 5; mode (Mo) — 1; median (Md) — 2,0; mean (Ẋ) — 1,32; standard deviation (Ϭ) — 1,01; coefficient of variation (ν) — 0,7642; standard error (Sẋ) — 0,2018; measurement error (ε) — 0,2996.

TT data numeric characteristic: Number of meanings (N) — 26; maximal meaning (max) — 15; minimal meaning (min) — 1; range (R) — 14; mode (Mo) — 6; median (Md) — 7,0; mean (Ẋ) — 5,96; standard deviation (Ϭ) — 3,55; coefficient of variation (ν) — 0,5948; standard error (Sẋ) — 0,6955; measurement error (ε) — 0,2287.

3.6. Analysis of the part of the text that belongs to the drama character of "Marshall"

Marshall: the whole ST data: 1 — 94 (85,45%); 2 — 15 (13,64%); 4 — 1 (0,91%).

Marshall: the whole TT data: 2 — 31 (15,74%); 4 — 27 (13,71%); 3 — 26 (13,20%); 5 — 25 (12,69%); 6 — 21 (10,66%); 8 — 16 (8,12%); 7 — 11 (5,58%); 9 — 11 (5,58%); 1 — 9 (4,57%); 10 — 7 (3,55%); 11 — 6 (3,05%); 12 — 2 (1,02%); 16 — 2 (1,02%); 17 — 2 (1,02%); 23 — 1 (0,51%).

ST data numeric characteristic: Number of meanings (N) — 110; maximal meaning (max) — 4; minimal meaning (min) — 1; range (R) — 3; mode (Mo) — 1; median (Md) — 2,0; mean (Ẋ) — 1,16; standard deviation (Ϭ) — 0,44; coefficient of variation (ν) — 0,3760; standard error (Sẋ) — 0,0417; measurement error (ε) — 0,0703.

TT data numeric characteristic: Number of meanings (N) — 197; maximal meaning (max) — 23; minimal meaning (min) — 1; range (R) — 22; mode (Mo) — 2; median (Md) — 8,0; mean (Ẋ) — 5,39; standard deviation (Ϭ) — 3,38; coefficient of variation (ν) — 0,6279; standard error (Sẋ) — 0,2409; measurement error (ε) — 0,0877.

3.7. Analysis of the part of the text that belongs to the drama character of "Joan"

Joan: the whole ST data: 1 — 43 (84,31%); 2 — 7 (13,73%); 3 — 1 (1,96%).

Joan: the whole TT data: 3 — 16 (16,49%); 4 — 16 (16,49%); 1 — 13 (13,40%); 5 — 12 (12,37%); 2 — 10 (10,31%); 7 — 10 (10,31%); 6 — 6 (6,19%); 9 — 4 (4,12%); 8 — 3 (3,09%); 12 — 3 (3,09%); 11 — 1 (1,03%); 14 — 1 (1,03%); 17 — 1 (1,03%); 18 — 1 (1,03%).

ST data numeric characteristic: Number of meanings (N) — 51; maximal meaning (max) — 3; minimal meaning (min) — 1; range (R) — 2; mode (Mo) — 1; median (Md) — 2,0; mean (Ẋ) — 1,18; standard deviation (Ϭ) — 0,43; coefficient of variation (ν) — 0,3651; standard error (Sẋ) — 0,0602; measurement error (ε) — 0,1002.

TT data numeric characteristic: Number of meanings (N) — 97; maximal meaning (max) — 18; minimal meaning (min) — 1; range (R) — 17; mode (Mo) — 3; median (Md) — 7,5; mean (Ẋ) — 4,81; standard deviation (Ϭ) — 3,35; coefficient of variation (ν) — 0,6958; standard error (Sẋ) — 0,3402; measurement error (ε) — 0,1385.


3.8. Analysis of the part of the text that belongs to the drama character of "Michael"

ST data numeric characteristic: Number of meanings (N) — 37; maximal meaning (max) — 2; minimal meaning (min) — 1; range (R) — 1; mode (Mo) — 1; median (Md) — 1,5; mean (Ẋ) — 1,08; standard deviation (Ϭ) — 0,27; coefficient of variation (ν) — 0,2525; standard error (Sẋ) — 0,0449; measurement error (ε) — 0,0814.

TT data numeric characteristic: Number of meanings (N) — 55; maximal meaning (max) — 12; minimal meaning (min) — 1; range (R) — 11; mode (Mo) — 4; median (Md) — 5,5; mean (Ẋ) — 5,16; standard deviation (Ϭ) — 2,57; coefficient of variation (ν) — 0,4979; standard error (Sẋ) — 0,3467; measurement error (ε) — 0,1316.

3.9. Analysis of the part of the text that belongs to the drama character of "Rachel"

Rachel: the whole ST data: 1 — 53 (91,38%); 2 — 5 (8,62%).

Rachel: the whole TT data: 4 — 18 (15,00%); 5 — 14 (11,67%); 7 — 14 (11,67%); 3 — 12 (10,00%); 2 — 11 (9,17%); 6 — 11 (9,17%); 1 — 10 (8,33%); 8 — 5 (4,17%); 9 — 4 (3,33%); 11 — 4 (3,33%); 10 — 3 (2,50%); 12 — 3 (2,50%); 13 — 3 (2,50%); 14 — 3 (2,50%); 16 — 3 (2,50%); 15 — 1 (0,83%); 20 — 1 (0,83%).


ST data numeric characteristic: Number of meanings (N) — 58; maximal meaning (max) — 2; minimal meaning (min) — 1; range (R) — 1; mode (Mo) — 1; median (Md) — 1,5; mean (Ẋ) — 1,09; standard deviation (Ϭ) — 0,28; coefficient of variation (ν) — 0,2584; standard error (Sẋ) — 0,0369; measurement error (ε) — 0,0665.

TT data numeric characteristic: Number of meanings (N) — 120; maximal meaning (max) — 20; minimal meaning (min) — 1; range (R) — 19; mode (Mo) — 4; median (Md) — 9,0; mean (Ẋ) — 6,03; standard deviation (Ϭ) — 3,94; coefficient of variation (ν) — 0,6529; standard error (Sẋ) — 0,3596; measurement error (ε) — 0,1168.

3.10. Analysis of the part of the text that belongs to the drama character of "Percy"

Percy: the whole ST data: 1 — 34 (80,95%); 2 — 7 (16,67%); 3 — 1 (2,38%).

Percy: the whole TT data: 6 — 12 (16,44%); 3 — 11 (15,07%); 4 — 10 (13,70%); 5 — 7 (9,59%); 1 — 6 (8,22%); 2 — 5 (6,85%); 7 — 4 (5,48%); 8 — 4 (5,48%); 9 — 3 (4,11%); 11 — 3 (4,11%); 14 — 3 (4,11%); 10 — 2 (2,74%); 12 — 1 (1,37%); 18 — 1 (1,37%); 23 — 1 (1,37%).

ST data numeric characteristic: Number of meanings (N) — 42; maximal meaning (max) — 3; minimal meaning (min) — 1; range (R) — 2; mode (Mo) — 1; median (Md) — 2,0; mean (Ẋ) — 1,21; standard deviation (Ϭ) — 0,46; coefficient of variation (ν) — 0,3827; standard error (Sẋ) — 0,0717; measurement error (ε) — 0,1158.

TT data numeric characteristic: Number of meanings (N) — 73; maximal meaning (max) — 23; minimal meaning (min) — 1; range (R) — 22; mode (Mo) — 6; median (Md) — 8,0; mean (Ẋ) — 5,90; standard deviation (Ϭ) — 4,04; coefficient of variation (ν) — 0,6839; standard error (Sẋ) — 0,4726; measurement error (ε) — 0,1569.

3.11. Analysis of the part of the text that belongs to the drama character of "Al"

ST data numeric characteristic: Number of meanings (N) — 31; maximal meaning (max) — 3; minimal meaning (min) — 1; range (R) — 2; mode (Mo) — 1; median (Md) — 2,0; mean (Ẋ) — 1,32; standard deviation (Ϭ) — 0,59; coefficient of variation (ν) — 0,4457; standard error (Sẋ) — 0,1059; measurement error (ε) — 0,1569.

TT data numeric characteristic: Number of meanings (N) — 58; maximal meaning (max) — 16; minimal meaning (min) — 1; range (R) — 15; mode (Mo) — 6; median (Md) — 7,5; mean (Ẋ) — 5,40; standard deviation (Ϭ) — 3,41; coefficient of variation (ν) — 0,6316; standard error (Sẋ) — 0,4476; measurement error (ε) — 0,1626.

3.12. Analysis of the part of the text that belongs to the drama character of "Nurse Wilson"

Nurse Wilson: the whole ST data: 1 — 42 (87,50%); 2 — 5 (10,42%); 3 — 1 (2,08%).

Nurse Wilson: the whole TT data: 3 — 10 (13,16%); 4 — 10 (13,16%); 1 — 9 (11,84%); 5 — 9 (11,84%); 2 — 7 (9,21%); 6 — 6 (7,89%); 7 — 6 (7,89%); 12 — 4 (5,26%); 8 — 3 (3,95%); 9 — 3 (3,95%); 11 — 2 (2,63%); 13 — 2 (2,63%); 18 — 2 (2,63%); 10 — 1 (1,32%); 17 — 1 (1,32%); 23 — 1 (1,32%).


ST data numeric characteristic: Number of meanings (N) — 48; maximal meaning (max) — 3; minimal meaning (min) — 1; range (R) — 2; mode (Mo) — 1; median (Md) — 2,0; mean (Ẋ) — 1,15; standard deviation (Ϭ) — 0,41; coefficient of variation (ν) — 0,3558; standard error (Sẋ) — 0,0588; measurement error (ε) — 0,1007.

TT data numeric characteristic: Number of meanings (N) — 76; maximal meaning (max) — 23; minimal meaning (min) — 1; range (R) — 22; mode (Mo) — 3; median (Md) — 8,5; mean (Ẋ) — 5,91; standard deviation (Ϭ) — 4,48; coefficient of variation (ν) — 0,7586; standard error (Sẋ) — 0,5141; measurement error (ε) — 0,1705.

3.13. Analysis of the part of the text that belongs to the drama character of "Lilian"

Lilian: the whole ST data: 1 — 68 (91,89%); 2 — 6 (8,11%).

Lilian: the whole TT data: 2 — 23 (14,94%); 4 — 19 (12,34%); 3 — 18 (11,69%); 5 — 17 (11,04%); 1 — 14 (9,09%); 6 — 13 (8,44%); 10 — 12 (7,79%); 8 — 10 (6,49%); 7 — 9 (5,84%); 9 — 7 (4,55%); 11 — 3 (1,95%); 13 — 2 (1,30%); 16 — 2 (1,30%); 17 — 2 (1,30%); 12 — 1 (0,65%); 14 — 1 (0,65%); 18 — 1 (0,65%).

ST data numeric characteristic: Number of meanings (N) — 74; maximal meaning (max) — 2; minimal meaning (min) — 1; range (R) — 1; mode (Mo) — 1; median (Md) — 1,5; mean (Ẋ) — 1,08; standard deviation (Ϭ) — 0,27; coefficient of variation (ν) — 0,2525; standard error (Sẋ) — 0,0317; measurement error (ε) — 0,0575.

TT data numeric characteristic: Number of meanings (N) — 154; maximal meaning (max) — 18; minimal meaning (min) — 1; range (R) — 17; mode (Mo) — 2; median (Md) — 9,0; mean (Ẋ) — 5,51; standard deviation (Ϭ) — 3,69; coefficient of variation (ν) — 0,6704; standard error (Sẋ) — 0,2975; measurement error (ε) — 0,1059.

3.14. Analysis of the part of the text that belongs to the secondary drama characters

3.14.1. Character "One"

One: the whole ST data: 1 — 1 (100,00%).

One: the whole TT data: 4 — 2 (40,00%); 3 — 1 (20,00%); 5 — 1 (20,00%); 6 — 1 (20,00%).

ST data numeric characteristic: Number of meanings (N) — 1; maximal meaning (max) — 1; minimal meaning (min) — 1; range (R) — 0; mode (Mo) — 1; median (Md) — 1,0; mean (Ẋ) — 1,00; standard deviation (Ϭ) — 0,00; coefficient of variation (ν) — 0,0000; standard error (Sẋ) — 0,0000; measurement error (ε) — 0,0000.

TT data numeric characteristic: Number of meanings (N) — 5; maximal meaning (max) — 6; minimal meaning (min) — 3; range (R) — 3; mode (Mo) — 4; median (Md) — 4,5; mean (Ẋ) — 4,40; standard deviation (Ϭ) — 1,02; coefficient of variation (ν) — 0,2318; standard error (Sẋ) — 0,4561; measurement error (ε) — 0,2032.

3.14.2. Character "Two"

Two: the whole ST data: 1 — 2 (66,67%); 2 — 1 (33,33%).

Two: the whole TT data: 4 — 2 (33,33%); 8 — 2 (33,33%); 5 — 1 (16,67%); 10 — 1 (16,67%).

ST data numeric characteristic: Number of meanings (N) — 3; maximal meaning (max) — 2; minimal meaning (min) — 1; range (R) — 1; mode (Mo) — 1; median (Md) — 1,5; mean (Ẋ) — 1,33; standard deviation (Ϭ) — 0,47; coefficient of variation (ν) — 0,3536; standard error (Sẋ) — 0,2722; measurement error (ε) — 0,4001.

TT data numeric characteristic: Number of meanings (N) — 6; maximal meaning (max) — 10; minimal meaning (min) — 4; range (R) — 6; mode (Mo) — 4; median (Md) — 6,5; mean (Ẋ) — 6,50; standard deviation (Ϭ) — 2,29; coefficient of variation (ν) — 0,3525; standard error (Sẋ) — 0,9354; measurement error (ε) — 0,2821.

3.14.3. Character "Three"

ST data numeric characteristic: Number of meanings (N) — 5; maximal meaning (max) — 2; minimal meaning (min) — 1; range (R) — 1; mode (Mo) — 1; median (Md) — 1,5; mean (Ẋ) — 1,20; standard deviation (Ϭ) — 0,40; coefficient of variation (ν) — 0,3333; standard error (Sẋ) — 0,1789; measurement error (ε) — 0,2922.

TT data numeric characteristic: Number of meanings (N) — 5; maximal meaning (max) — 8; minimal meaning (min) — 3; range (R) — 5; mode (Mo) — 4; median (Md) — 5,0; mean (Ẋ) — 5,00; standard deviation (Ϭ) — 1,79; coefficient of variation (ν) — 0,3578; standard error (Sẋ) — 0,8000; measurement error (ε) — 0,3136.

3.14.4. Character "Four"

Four: the whole ST data: 1 — 2 (100,00%).

Four: the whole TT data: 4 — 2 (50,00%); 1 — 1 (25,00%); 2 — 1 (25,00%).

ST data numeric characteristic: Number of meanings (N) — 2; maximal meaning (max) — 1; minimal meaning (min) — 1; range (R) — 0; mode (Mo) — 1; median (Md) — 1,0; mean (Ẋ) — 1,00; standard deviation (Ϭ) — 0,00; coefficient of variation (ν) — 0,0000; standard error (Sẋ) — 0,0000; measurement error (ε) — 0,0000.

TT data numeric characteristic: Number of meanings (N) — 4; maximal meaning (max) — 4; minimal meaning (min) — 1; range (R) — 3; mode (Mo) — 4; median (Md) — 2,0; mean (Ẋ) — 2,75; standard deviation (Ϭ) — 1,30; coefficient of variation (ν) — 0,4724; standard error (Sẋ) — 0,6495; measurement error (ε) — 0,4629.

4. Comparative analysis of word distribution in sentences

4.1. Difference in the Number of meanings (N) in ST and TT

Given from the results above that the statistical parameters of the translated variant exceed those of the original drama in the majority of cases, we now turn to one parameter, the Number of meanings (N). We compare the data and find the difference (if present). Our assumption 1 is that the TT is much longer in terms of word usage within the sentence.

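Character       ST    TT    Difference
Charlotte       64    113   +49
Rodney          41    76    +35
Man             261   535   +274
Leonard         106   207   +101
Jennifer        25    26    +1
Marshall        110   197   +87
Joan            51    97    +46
Michael         37    55    +18
Rachel          58    120   +62
Percy           42    73    +31
Al              31    58    +27
Nurse Wilson    48    76    +28
Lilian          74    154   +80
One             1     5     +4
Two             3     6     +3
Three           5     5     0
Four            2     4     +2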
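The differences in N can be recomputed directly from the ST and TT values. The short Python sketch below does this and also reports the TT/ST ratio for the character with the largest gap; the pairing of values with character names follows the N values reported in Section 3, and the variable names are illustrative only.

```python
names = ["Charlotte", "Rodney", "Man", "Leonard", "Jennifer", "Marshall",
         "Joan", "Michael", "Rachel", "Percy", "Al", "Nurse Wilson",
         "Lilian", "One", "Two", "Three", "Four"]
st_n = [64, 41, 261, 106, 25, 110, 51, 37, 58, 42, 31, 48, 74, 1, 3, 5, 2]
tt_n = [113, 76, 535, 207, 26, 197, 97, 55, 120, 73, 58, 76, 154, 5, 6, 5, 4]

# Difference in the Number of meanings (N) per character: TT minus ST.
difference = {name: tt - st for name, st, tt in zip(names, st_n, tt_n)}

# Character with the largest difference, and how many times TT exceeds ST.
largest = max(difference, key=difference.get)
ratio = tt_n[names.index(largest)] / st_n[names.index(largest)]

for name, diff in difference.items():
    print(f"{name}: {diff:+d}")
print(f"Largest difference: {largest} ({difference[largest]:+d}), TT/ST = {ratio:.2f}")
```

For the character Man the ratio is slightly above 2, which is consistent with the observation that the TT value is roughly double the ST value.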

To recall, the character "Man" is the protagonist of the play. He is a well-dressed gentleman who is about to jump off the seventh story.

He has a number of conversations with the residents of the building. He feels lost and compelled to stand on the seventh story of the building. Taking into account the results of Figure 1, we hold assumption 2: the translator adds a considerable number of words (274), or rather doubles the ST quantity, for a number of possible reasons:

• to explain the original;
• to compensate for losses of literary imagery;
• to add something of the translator's own, to recreate, so to say, the original;
• due to structural and lexico-grammatical allomorphic features of the language pair.

Whatever reason stands behind the translator’s decision-making, it is fertile ground for further Translation Studies analysis.

4.2. Analysis of the whole text

Here we focus on statistical parameters with the defined unit of measurement, the word. The number of words in the utterances of a drama text is important for several reasons:

• the length of lines of the written script;
• chronometry and metrics of the whole drama act;
• pithiness and iconicity of each phrase.

Below are the results on the distribution of the number of words in a TT sentence by absolute and relative frequency.

In the translated text, the sentence lengths in descending order of frequency are: 4 — 259 (14,2%); 1 — 255 (13,98%); 3 — 255 (13,98%); 2 — 219 (12,01%); 5 — 205 (11,24%); 6 — 182 (9,98%); 7 — 121 (6,63%); 8 — 84 (4,61%); 9 — 62 (3,4%); 10 — 49 (2,69%); 11 — 32 (1,75%); 12 — 31 (1,7%); 14 — 14 (0,77%); 13 — 13 (0,71%); 16 — 13 (0,71%); 17 — 9 (0,49%); 18 — 7 (0,38%); 15 — 6 (0,33%); 23 — 4 (0,22%); 19 — 2 (0,11%); 20 — 1 (0,05%); 27 — 1 (0,05%). The last two are the least frequent.

Figure 2 shows a comparison of the number of words in the sentences of the whole TT drama work.

The x-axis is the number of sentences, and the y-axis is the number of words in a sentence.

5. Conclusions

The article has drawn on the main advances of statistical linguistics. The original Canadian play has been compared with the corresponding translated text in terms of statistical parameters, which, to our knowledge, has not been done before.

The paper is of practical and applied value; its scientific value lies in the fact that the suggested approach and methods will eventually allow a plausible scientific hypothesis to be formulated and substantiated in the realm of statistical linguistics and translation studies. At this point it has been shown that bilingual drama texts are well suited to NLP and reveal promising outcomes.

We have verified the absolute and relative distribution and the probability measurement, as well as N, max, min, R, Mo, Md, Ẋ, Ϭ, ν, Sẋ, ε in the sentences of both texts.

Specifically designed software, a combination of the XML markup language, a Microsoft Excel spreadsheet, and the Python programming language, has been used. The results of the statistical calculations for the drama “7 stories” by Morris Panych by the unit of measure "word" are presented in the corresponding Tables 1 – 17.

Structural recognition provides useful information about the characters of the play, in the original and in translation, namely the length of the sentence in word units, which will help with further comparisons of ST and TT. The quantitative characteristics of the original play and its Ukrainian translation on the lexical level, relying on linguistic statistical analysis, have been clarified: the Number of meanings (N) of the translated text considerably exceeds that of the original and demands further analysis. The discrepancy becomes obvious with a number of characters (Man, Leonard, Marshall, Lilian).

The correlation of coefficients has been presented in tables and figures to illustrate the material under research.

The prospect of the study is to further explore the problems of the translator’s meaningful choices, which resulted in the data declared above.

6. Acknowledgement

The project has been carried out within the complex academic topic “Application of modern technologies for optimization of information processes in natural language” at Lviv Polytechnic National University. At the initial stage the project underwent the consultancy of Ihor Kulchytskyy, to whom we express our gratitude.

7. References

S. Shaheen, and M. Spruit. Full-Text or Abstract? Examining Topic Coherence Scores Using Latent Dirichlet Allocation. 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2017, doi:10.1109/dsaa.2017.61.

Topic Modeling in Python with Gensim. Machine Learning Plus, 16 Apr. 2020, URL: www.machinelearningplus.com/nlp/topic-modeling-gensim-python.

K. Aguilar, NLP Techniques with Shakespeare’s Plays: Cleaning and Classifying Text with the Bard, 2020. URL: https://medium.com/analytics-vidhya/nlp-techniques-withshakespeares-plays-d8843ba26a4f.

O. Levchenko, M. Dilai, (2019) Attitudes Toward Feminism in Ukraine: A Sentiment Analysis of Tweets. In: Shakhovska N., Medykovskyy M. (eds) Advances in Intelligent Systems and Computing III. CSIT 2018. Advances in Intelligent Systems and Computing, vol 871. Springer, Cham. doi:10.1007/978-3-030-01069-0_9

[1] Panych, Seven Stories, Vancouver: Talonbooks, 2013.

[2] Panych, 7 istorii, [per. z anhliiskoi Ivana Krychfalushiia], Brusturiv: Dyskursus, 2014.

[3] Laviosa (Ed.), Corpus-based Translation Studies: Theory, Findings, Applications, Rodopi, 2002.

[4] K. H. Chen and H. H. Chen, Aligning bilingual corpora especially for language pairs from different families. Information Sciences Applications, 1995, 42, pp. 57-81.

[5] Munday, A Computer-assisted approach to the Analysis of Translation Shifts, Meta, 1998, XLIII, 4.

[6] Zanettin, Parallel corpora in translation studies: Issues in corpus design and analysis. In Intercultural Faultlines. Research Models in Translation Studies I: Textual and Cognitive Aspects, ed. M. Olohan, pp. 105-118. Manchester: St. Jerome, 2000.

[7] Allen, Natural Language Understanding. Cummings Publishing Company, Redwood City, 1995.

[8] Barnard, et al. “SGML-Based Markup for Literary Texts: Two Problems and Some Solutions.” Computers and the Humanities, vol. 22, no. 4, 1988, pp. 265-276. JSTOR, URL: www.jstor.org/stable/30200136. Accessed 28 Feb. 2021.

[9] Blackburn, Bos, Kohlhase, & H. De Nivelle, Inference and computational semantics. In Computing Meaning, Springer Netherlands, 2001, pp. 11-28.

[10] Dagan and Glickman, Probabilistic textual entailment: generic applied modeling of language variability. In Proceedings of the PASCAL Workshop on Learning Methods for Text Understanding and Mining, Grenoble, France, 2004, pp. 26-29.

[11] Dale, Moisl, H. Somers (Eds.), Handbook of natural language processing. CRC Press, 2000.

[12] Dilai, Levchenko, Discourses Surrounding Feminism in Ukraine: A Sentiment Analysis of Twitter Data. 2018 IEEE 13th International Scientific and Technical Conference on Computer Sciences and Information Technologies, CSIT 2018 - Proceedings, 2018. doi:10.1109/STC-CSIT.2018.8526694

[13] Hogan, The Web of Data. Springer, 2020.

[14] Lytvyn, Vysotska, Hamon, Grabar, Sharonova, Cherednichenko, O. Kanishcheva (Eds.), Computational Linguistics and Intelligent Systems. Proc. 4th Int. Conf. COLINS 2020. Volume I: Workshop. Lviv, Ukraine, April 23-24, 2020, CEUR-WS.org, online.

[15] Marcus, Santorini, Marcinkiewicz, Building a Large Annotated Corpus of English: Penn TreeBank. Computational Linguistics: Special Issue on Using Large Corpora, 1993, 19(2), pp. 313-330.

[16] Matthews, An Introduction to Natural Language Processing Through Prolog, Routledge: London and New York, 2014.

[17] Oakes, Sentence and word alignment in the CRATER project. In Using Corpora for Language Research, ed. J. Thomas and M. Short, London: Longman, 1996, pp. 211-233.

[18] Pavis, Theatre at the Crossroads of Culture, Routledge, 1992.

[19] Bassnett, Translating for the Theatre: The Case Against Performability. TTR: traduction, terminologie, rédaction, 1991, 4(1), pp. 99-111. URL: https://doi.org/10.7202/037084ar.

[20] Bassnett, Still Trapped in the Labyrinth: Further Reflections on Translation and Theatre, Constructing Cultures: Essays on Literary Translation. Multilingual Matters, 1998, pp. 90-108.

[21] T.H. Howard-Hill, Modern Textual Theories and the Editing of Plays. The Library, 6th ser., 1989, 11, pp. 89-115.

[22] Issacharoff, F. Robin Jones (Eds.), Performing Texts. Philadelphia: University of Pennsylvania Press, 1988.

[23] Lavagnino, E. Mylonas, The show must go on: Problems of tagging performance texts. Comput Hum, 1995, pp. 113-121. URL: https://doi.org/10.1007/BF01830705

[24] Corpus-based Language Studies: An Advanced Resource Book, ed. T. McEnery, R. Xiao, Y. Tono, Routledge, 2006.

[25] Dershowitz, E. Nissan (Eds.), Language, Culture, Computation: Computing for the Humanities, Law and Narratives. Springer, 2014.

[26] Levchenko, Tyshchenko and Dilai. Associative Verbal Network of the Conceptual Domain БІДА (MISERY) in Ukrainian. Proceedings of the 4th Conference on Computational Linguistics and Intelligent Systems (COLINS 2020). Volume I: Main Conference. URL: http://ceur-ws.org/Vol-2604/

[27] Shakhovska and M. Medykovskyy (Eds.), Advances in Intelligent Systems and Computing III: Selected papers from the International Conference on Computer Science and Information Technologies, CSIT 2018, September 11-14, Lviv, Ukraine. Springer: Springer Nature Switzerland, 2019.

[28] C.M. Sperberg-McQueen, Text in the Electronic Age: Textual Study and Text Encoding, with Examples from Medieval Texts. Literary and Linguistic Computing, 6 (1991), pp. 34-46.

[29] C.M. Sperberg-McQueen and L. Burnard (Eds.), Guidelines for Electronic Text Encoding and Interchange (TEI P3). Chicago and Oxford: Text Encoding Initiative, 1994.