<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Lviv, Ukraine, November</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Context-Based Question-Answering System for the Ukrainian Language</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Ukrainian Catholic University Faculty of Applied Sciences</institution>
          ,
          <addr-line>Lviv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>1</volume>
      <fpage>5</fpage>
      <lpage>16</lpage>
      <abstract>
        <p>We introduce a context-based question answering model for the Ukrainian language based on Wikipedia articles. It uses the Bidirectional Encoder Representations from Transformers (BERT) [1] model, which takes a context (a Wikipedia article) and a question about the context, and returns an answer to the question. The model consists of two parts. The first is a pretrained multilingual BERT model, which is trained on Wikipedia articles in the 100 most popular languages. The second is the fine-tuned model, which is trained on a dataset of questions and answers about Wikipedia articles. The training and validation data is the Stanford Question Answering Dataset (SQuAD) [2]. There are no question answering datasets for the Ukrainian language. The plan is to build an appropriate dataset with machine translation, use it for the fine-tuning stage, and compare the results with models fine-tuned on other languages. The next experiment is to train a model on a Slavic languages dataset before fine-tuning on the Ukrainian language and compare the results.</p>
      </abstract>
      <kwd-group>
        <kwd>Context-based Question Answering</kwd>
        <kwd>Bidirectional Encoder Representations from Transformers</kwd>
        <kwd>multilingual BERT</kwd>
        <kwd>fine-tuning</kwd>
        <kwd>Generative Pre-trained Transformer</kwd>
        <kwd>Stanford Question Answering Dataset</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>In addition, a question answering system might be beneficial for lawyers, medical
workers, and other specialized professions.</p>
    </sec>
    <sec id="sec-2">
      <title>General Formulation of the Problem</title>
      <p>Question answering is one of the classical problems in natural language processing
(NLP). The input to a context-based question-answering model consists of a context and a
question. The context can be an article, a document, an essay, a paper, or any
other piece of textual information. In this project, we will use articles from Wikipedia.
The question is posed in natural human language. Both articles and questions are in the
Ukrainian language. The output of the model is a phrase from the context that
contains the answer to the question.</p>
      <sec id="sec-2-1">
        <title>Review of Related Work</title>
        <p>
          Despite the importance of the problem, it has not yet been adequately solved for the Ukrainian
language. We found no published results for the Ukrainian language apart from some
multilingual models such as BERT [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Classical Methods</title>
      <p>
        Let us start the review of existing methods with the classical approaches. By
classical, we mean methods that rely on well-known strategies without artificial neural
network models. They can be unsupervised or supervised. Unsupervised
approaches are based on word embedding [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] distances and word frequencies. Supervised
methods train on a labeled dataset (logistic regression, support vector machines,
etc.). Logic-based methods (for example, Machine
Comprehension Using Commonsense Knowledge [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]) can also be counted among the classical methods. Such
methods are applied to the question-answering task because logical representations capture
more abstract concepts, such as temporal or logical relations, which is useful for
learning commonsense knowledge.
      </p>
      <p>Unsupervised Methods. Two approaches are distinguished within this
category of methods: measuring the Euclidean distance between sentences, and
counting word and phrase frequencies.</p>
      <p>
        Euclidean distance between sentences. The first traditional method we came across
while reviewing related work is finding the minimal Euclidean distance between the
question and the sentences of the context [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The idea of this approach is to compute the
average of the word vectors for each sentence. The answer to the question is the
context sentence closest to the question according to Euclidean distance. It is possible
to narrow the answer down by splitting the sentence into phrases, but this is an additional task,
which will decrease the accuracy of the method. Another drawback of the described
method is its reliance on the quality of the word embeddings. Also, this method does not take
into account dependencies between the words in a sentence.
      </p>
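      <p>The sentence-selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the two-dimensional vectors in EMBEDDINGS are invented toy values, and the function names are hypothetical; a real system would load pretrained word embeddings instead.</p>
      <preformat>
```python
import math

# Toy 2-dimensional word vectors (invented for illustration only);
# a real system would load pretrained embeddings instead.
EMBEDDINGS = {
    "kyiv": [1.0, 0.0],
    "capital": [0.9, 0.1],
    "ukraine": [0.8, 0.2],
    "dnipro": [0.0, 1.0],
    "river": [0.1, 0.9],
    "flows": [0.2, 0.8],
}


def sentence_vector(sentence):
    """Average the vectors of the known words in a sentence."""
    vectors = [EMBEDDINGS[w] for w in sentence.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0]
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def closest_sentence(question, sentences):
    """Return the context sentence with the minimal Euclidean distance to the question."""
    q = sentence_vector(question)
    return min(sentences, key=lambda s: math.dist(q, sentence_vector(s)))
```
      </preformat>
      <p>Calling closest_sentence with a question and a list of context sentences returns the sentence whose averaged vector lies nearest to the averaged question vector. Word order plays no role in the average, which is exactly the limitation noted above.</p>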
      <p>
        Word and phrase frequency. It is possible to use an n-gram approach [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for generating
an answer. The question is parsed into a dependency tree and rebuilt into a declarative
sentence in which the target word or phrase is missing. The missing phrase is filled in by an n-gram
model. An artificial neural network model can replace the n-gram model, as
discussed below. The drawbacks of this approach are the low accuracy of dependency parser
models and the reliance on phrase frequencies in a relatively small volume of text.
      </p>
      <p>
        Supervised Methods. This category of methods often uses logistic regression and
support vector machine approaches. Supervised traditional methods are described in
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The authors use the SQuAD dataset mentioned above for training. The context
is split into sentences, and each sentence is labeled in a binary vector: the target sentence
is marked as 1 and all other items as 0. After that, a multinomial logistic regression [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]
or a support vector machine is trained on the labeled data [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. One of the advantages
of this approach is the ability to add features to the model (dependencies between
the words, term frequency (TF), inverse document frequency (IDF) [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], etc.). Term
frequency is a feature that increases the weight of words that are frequent in a document, while inverse
document frequency, vice versa, decreases the weight of words that are widespread across documents.
      </p>
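      <p>As a minimal sketch of the TF and IDF features mentioned above, the snippet below computes one common variant of the weighting: relative term frequency multiplied by log(N / df). The function name and the whitespace tokenization are assumptions made for illustration; other TF-IDF variants (smoothing, different normalization) are equally valid.</p>
      <preformat>
```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    TF is the relative frequency of a term in the document; IDF is
    log(N / df), where df is the number of documents containing the term.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Count in how many documents each term occurs at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```
      </preformat>
      <p>A term that occurs in every document gets the weight 0, because log(N / N) = 0, while a term that is frequent in one document and rare elsewhere gets a high weight.</p>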
      <p>Pros and Cons. The advantages of the classical approaches are simplicity and high
transparency of the models. However, the model performance on artificial samples
is not good enough (about 70% accuracy on the SQuAD validation set). The result
gets worse as the size of the context increases or when the goal is to retrieve a more specific
answer (a phrase instead of a sentence). Moreover, the results for the Ukrainian
language are even worse than for English. This is due to the higher grammatical
complexity of the Ukrainian language, the smaller amount of text corpora, the presence of grammatical cases,
and other language specifics.</p>
    </sec>
    <sec id="sec-4">
      <title>Artificial Neural Network Models</title>
      <p>In this part, we will review supervised and unsupervised cases for each main model.</p>
      <p>
        Long Short-Term Memory Model. The long short-term memory model (LSTM) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is
a recurrent neural network architecture that allows building sequence-to-sequence
models. The input and output sequence lengths are not fixed. As input, the LSTM model
takes a context and a question, and it returns scores for the words of the context. To connect
the vector for the context with the vector for the question, we add an attention layer. It is a crucial
part of a question answering system based on the LSTM model. The attention layer is a dot
product of the context and question output vectors. The result of the dot product
is then converted into the probability of being an answer to the question. The approach
mentioned above is described in the paper dedicated to Bidirectional Attention Flow (BAF)
[
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
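      <p>The dot-product attention step described above can be illustrated with a deliberately simplified sketch. It assumes that the recurrent encoder has already produced one output vector per context position and a single vector for the question; the function names are hypothetical, and the full Bidirectional Attention Flow model is considerably more elaborate than this.</p>
      <preformat>
```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def answer_probabilities(context_vectors, question_vector):
    """Score each context position by its dot product with the question vector."""
    scores = [
        sum(c * q for c, q in zip(vector, question_vector))
        for vector in context_vectors
    ]
    return softmax(scores)
```
      </preformat>
      <p>The position whose encoder output is most aligned with the question vector receives the highest probability of belonging to the answer.</p>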
      <p>
        Generative pre-trained transformer (GPT). There is a second version of this
model called GPT-2 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. GPT-2 is one of the state-of-the-art models in language
modeling tasks. It was trained on a large and diverse corpus of internet pages, which
makes the style of the generated text more varied. This model can only generate the next
word based on the previous text. Hence, to make it answer a question, we have to
rephrase the question into a declarative sentence with a gap in place of the
answer, and GPT-2 will generate the answer. The peculiarity of this model is the absence of
a context. On the one hand, this can be an advantage if there is no specific data to
retrieve the answer from. On the other hand, the accuracy will be low for tasks from specialized
areas (law, medicine, etc.), as the model was not trained on data from the corresponding
areas. In any case, GPT-2 cannot be applied to the Ukrainian language, as it is trained only
on English texts. Moreover, training the model from scratch or even pre-training it on
Ukrainian corpora requires a lot of resources and time.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Bidirectional Encoder Representations from Transformers</title>
      <p>
        Bidirectional Encoder Representations from Transformers (BERT) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a
transformer-based neural network, introduced by Google researchers, which shows
state-of-the-art results in a wide variety of NLP tasks. The multilingual BERT model was built for the 100
most popular languages used in Wikipedia, and it can be used for these languages
out of the box. The BERT training process consists of two stages. The first stage is
pre-training on text corpora for the language modeling task. The second stage is
fine-tuning on question-answer datasets. The first stage requires substantial
computational resources. Fine-tuning, however, can be performed even on a single graphics
processing unit (GPU).
      </p>
      <p>
        Multilingual BERT results. There are several variants of the multilingual BERT
model, which differ in the fine-tuning process: models fine-tuned on
a translated dataset or on the original (English) dataset, and cased (the original letter case is kept) or
uncased (all words are lowercased) models. Table 1 shows the results of the BERT variants on the
Cross-lingual Natural Language Inference (XNLI) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] dataset (translated datasets).
      </p>
      <sec id="sec-5-1">
        <title>Research Hypotheses and Problem</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Hypotheses</title>
      <p>Accuracy Hypothesis. The main objective of this project is to build a question-answering
model for the Ukrainian language whose accuracy on well-known benchmarks is close to the results shown in Table
1. The first hypothesis is that it is possible to achieve an
accuracy of about 70-80%, which is close to the results for the other languages reported by
Google researchers.</p>
      <p>Model Comparison Hypothesis. One more goal is to compare different approaches
to fine-tuning. Some datasets have been translated by humans into Russian and other
Slavic languages. It seems that fine-tuning the model on Slavic language datasets and then
fine-tuning it on the dataset translated into Ukrainian might improve performance
for the Ukrainian language compared with direct fine-tuning on the Ukrainian
language dataset. So, the next task of this project is to confirm or reject this hypothesis.</p>
    </sec>
    <sec id="sec-7">
      <title>Problems</title>
      <p>Translation Problems. To achieve the project goals mentioned above, we need to find
an appropriate machine translator to create the dataset in the Ukrainian language, build
different model pipelines, and compare the results. Furthermore, a small human-translated
dataset in the Ukrainian language might be required to verify the models.</p>
      <p>Article Retrieval Problems. In addition, the project needs to retrieve Wikipedia
articles in the Ukrainian language. Some articles in the datasets exist in the
English Wikipedia but are absent from the Ukrainian one. Hence, we have to detect such items
and exclude them from the datasets. Moreover, Wikipedia provides articles in the
Extensible Markup Language (XML) format, which must be converted into
human-readable text.</p>
      <sec id="sec-7-1">
        <title>Envisioned Approach</title>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Dataset Generation</title>
      <p>
        The very first task is to generate a Ukrainian language dataset from the existing datasets
(SQuAD [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) with a machine translator. A related subtask is
comparing translators and checking the quality of the translation, because the quality of the
translated dataset directly affects the quality of the model. Translation quality can be
checked by reverse translation. If the difference between the original text and the text
after the forward and the backward translation is small enough, it indicates high quality
of the translator.
      </p>
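      <p>The reverse-translation check described above could be sketched as follows. The callables forward and backward are placeholders for real translator calls, which this proposal does not specify, and the token-level similarity ratio from Python's difflib is only one possible way to measure the difference between the two texts.</p>
      <preformat>
```python
from difflib import SequenceMatcher


def round_trip_similarity(text, forward, backward):
    """Translate text forward and back, then compare it with the original.

    `forward` and `backward` are placeholders for real translator calls.
    Returns a ratio in [0, 1]; 1.0 means the round trip reproduced the text.
    """
    restored = backward(forward(text))
    original_tokens = text.lower().split()
    restored_tokens = restored.lower().split()
    return SequenceMatcher(None, original_tokens, restored_tokens).ratio()
```
      </preformat>
      <p>Averaging this ratio over a sample of the dataset gives a rough score for comparing candidate translators. The threshold for "small enough" would still have to be chosen empirically.</p>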
    </sec>
    <sec id="sec-9">
      <title>Data Storing</title>
      <p>Generated datasets and retrieved articles from the Ukrainian Wikipedia are stored in a
database to make access to the data more convenient. As the data size exceeds the available
random-access memory (RAM), we will need to split the data and read it in parts.</p>
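      <p>Reading the data in parts can be sketched with a generator that fetches a fixed number of rows at a time. The choice of SQLite, the function name, and the batch size are assumptions made for illustration only; the proposal does not name a specific database.</p>
      <preformat>
```python
import sqlite3


def iter_batches(connection, query, batch_size=1000):
    """Yield lists of rows so that only one batch is held in memory at a time."""
    cursor = connection.execute(query)
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield rows
```
      </preformat>
      <p>A training loop can then iterate over iter_batches(connection, query) and process each batch before the next one is loaded.</p>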
    </sec>
    <sec id="sec-10">
      <title>Models Pipeline</title>
      <p>The base pre-trained model is the multilingual BERT model. It is then fine-tuned on
different datasets and their variations. The first model will be fine-tuned on the translated
training datasets (SQuAD). The accuracy of the model is calculated on the test sets of
the corresponding datasets. The next model is fine-tuned on the human-translated
datasets for Slavic languages. After that, the model will be fine-tuned on the
machine-translated dataset for the Ukrainian language. Combinations at the fine-tuning stage
produce different models, which are compared on the test sets and on the
manually created human-translated Ukrainian language dataset.</p>
      <sec id="sec-10-1">
        <title>Research Methodology and Plan</title>
      </sec>
    </sec>
    <sec id="sec-11">
      <title>Methodological Approach</title>
      <p>
        One can distinguish three methodological approaches [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]:
─ Quantitative methods are appropriate for measuring, ranking, comparing, etc.
─ Qualitative methods are best for describing, interpreting, and contextualizing. Very often, they deal with textual results.
─ Mixed methods combine numerical measurement and exploration.
      </p>
      <p>On the one hand, quantitative methods are best suited for comparing the fine-tuned models
with each other and with state-of-the-art models for the English language.</p>
      <p>
        The F1 score [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and precision will be used as quantitative measures,
together with two types of matching. The first is exact matching of answers; the second
checks whether the original answer is present in the model answer.
      </p>
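      <p>The two types of matching and the F1 score mentioned above can be sketched as follows. The function names, the lowercasing, and the whitespace tokenization are assumptions for illustration; the F1 variant shown is the token-overlap form commonly used for extractive question answering, which may differ from the exact evaluation script eventually used.</p>
      <preformat>
```python
from collections import Counter


def exact_match(prediction, truth):
    """True if the answers are identical after lowercasing and whitespace cleanup."""
    return prediction.lower().split() == truth.lower().split()


def contains_answer(prediction, truth):
    """True if the reference answer appears inside the model answer (substring check)."""
    return " ".join(truth.lower().split()) in " ".join(prediction.lower().split())


def token_f1(prediction, truth):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    true_tokens = truth.lower().split()
    pred_counts = Counter(pred_tokens)
    true_counts = Counter(true_tokens)
    # Number of tokens shared by both answers, respecting repetitions.
    overlap = sum(min(count, true_counts[token]) for token, count in pred_counts.items())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)
```
      </preformat>
      <p>For a model answer that is longer than the reference, exact_match fails while contains_answer succeeds, and token_f1 gives partial credit in proportion to the overlap.</p>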
      <p>On the other hand, sometimes the answer to the question is not precisely equal to the
expected value, but its meaning is correct. That is why mixed methods are the most
appropriate for question-answering model evaluation. At this stage, the
question-answering system may require a subsystem (an additional artificial neural network or a simple
set of rules) that decides whether the answer is correct even if the model response is not
equal to the labeled value.</p>
      <p>In addition, as mentioned above, the last stage of model evaluation uses a
human-translated test set in the Ukrainian language, which makes a
qualitative assessment possible.</p>
    </sec>
    <sec id="sec-12">
      <title>Plan for the Research</title>
    </sec>
    <sec id="sec-13">
      <title>Milestone</title>
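      <table-wrap>
        <table>
          <thead>
            <tr><th>Milestone</th><th>Start Date</th><th>End Date</th></tr>
          </thead>
          <tbody>
            <tr><td>Translation datasets and building database of articles</td><td>Sep 2019</td><td>Oct 2019</td></tr>
            <tr><td>Building baseline model</td><td>Oct 2019</td><td>Oct 2019</td></tr>
            <tr><td>Building advanced fine-tuned models</td><td>Oct 2019</td><td>Dec 2019</td></tr>
            <tr><td>Model evaluation</td><td>Nov 2019</td><td>Dec 2019</td></tr>
            <tr><td>Writing master thesis</td><td>Oct 2019</td><td>Jan 2020</td></tr>
            <tr><td>Submission of thesis for final review</td><td>-</td><td>8 Jan 2020</td></tr>
            <tr><td>Master Thesis Defense</td><td>-</td><td>End of Jan 2020</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-14">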
      <sec id="sec-14-1">
        <title>Conclusive Remarks and Outlook</title>
        <p>The most valuable potential result of the project is a high-performance
context-based question-answering model for the Ukrainian language. After the
completion of this work, we will know how to build question-answering systems for the
Ukrainian language. Further, these methods can be applied to other Slavic languages
or to languages with very complicated grammar, peculiar properties, or non-Latin
characters.</p>
        <p>The translated datasets will be reusable in other research projects and can
serve as a starting point for the human translation process.</p>
        <p>There is a point in the research plan where a hypothesis might fail, and the research would have to
start from scratch. It is the hypothesis that a high-performance
question-answering model can be built by fine-tuning on machine-translated datasets. This approach
showed good results for the English, Spanish, German, and Arabic languages. However,
the performance for the Urdu language is significantly worse than for the languages
mentioned earlier (see Table 1).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          : BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Stanford Question Answering Dataset. https://rajpurkar.github.io/SQuADexplorer/</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Al-Rfou</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Perozzi</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skiena</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          : Polyglot:
          <article-title>Distributed word representations for multilingual NLP</article-title>
          .
          <source>arXiv preprint arXiv:1307.1662</source>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Rajpurkar</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lopyrev</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>SQuAD: 100,000+ questions for machine comprehension of text</article-title>
          .
          <source>arXiv preprint arXiv:1606.05250</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Sperandei</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <article-title>Understanding logistic regression analysis</article-title>
          .
          <source>Biochemia Medica</source>
          <volume>24</volume>
          (
          <issue>1</issue>
          ),
          <fpage>12</fpage>
          -
          <lpage>18</lpage>
          (
          <year>2014</year>
          ). doi: 10.11613/BM.2014.003
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Kwok</surname>
          </string-name>
          , J.T.Y.:
          <article-title>Automated text categorization using support vector machine</article-title>
          .
          <source>In: 5th International Conference on Neural Information Processing</source>
          , pp.
          <fpage>347</fpage>
          -
          <lpage>351</lpage>
          (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McGill</surname>
            ,
            <given-names>M.J.</given-names>
          </string-name>
          :
          <article-title>Introduction to Modern Information Retrieval</article-title>
          .
          <string-name>
            <surname>McGraw-Hill</surname>
          </string-name>
          , Inc., New York (
          <year>1983</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Computation</source>
          ,
          <volume>9</volume>
          (
          <issue>8</issue>
          ),
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Seo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kembhavi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farhadi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hajishirzi</surname>
          </string-name>
          , H.:
          <article-title>Bidirectional attention flow for machine comprehension</article-title>
          .
          <source>arXiv preprint arXiv:1611.01603</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Radford</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Child</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Amodei</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Language models are unsupervised multitask learners</article-title>
          .
          <source>OpenAI Blog</source>
          <volume>1</volume>
          (
          <issue>8</issue>
          ) (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Google AI Research. https://github.com/google-research/bert</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <article-title>Cross-lingual Natural Language Inference dataset</article-title>
          . https://github.com/facebookresearch/XNLI
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Newman</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Benz</surname>
            ,
            <given-names>C.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ridenour</surname>
            ,
            <given-names>C.S.</given-names>
          </string-name>
          :
          <article-title>Qualitative-Quantitative Research Methodology: Exploring the Interactive Continuum</article-title>
          . SIU Press, Carbondale and Edwardsville (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Goutte</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gaussier</surname>
          </string-name>
          , E.:
          <article-title>A probabilistic interpretation of precision, recall, and F-score, with implication for evaluation</article-title>
          . In: Losada, D.E., Fernández-Luna, J.M. (eds.) ECIR 2005.
          <source>LNCS</source>
          , vol.
          <volume>3408</volume>
          , pp.
          <fpage>345</fpage>
          -
          <lpage>359</lpage>
          . Springer, Berlin, Heidelberg (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Apidianaki</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mohammad</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>May</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shutova</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carpuat</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <source>Proceedings of the 12th International Workshop on Semantic Evaluation. Association for Computational Linguistics</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Ostermann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roth</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Modi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thater</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pinkal</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>SemEval 2018 Task 11: Machine comprehension using commonsense knowledge</article-title>
          .
          <source>In: 12th International Workshop on Semantic Evaluation</source>
          , pp.
          <fpage>747</fpage>
          -
          <lpage>757</lpage>
          . Association for Computational Linguistics (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>