<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Multihop-Multilingual Co-attention Method for Visual Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Debajyoty Banik</string-name>
          <email>debajyoty.banik@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Devashish Kumar Singh</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohit Kumar Pandey</string-name>
          <email>mohitpandeybgp@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
<aff id="aff0">
          <label>0</label>
          <institution>Kalinga Institute of Industrial Technology</institution>, <addr-line>Bhubaneswar</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <fpage>21</fpage>
      <lpage>23</lpage>
      <abstract>
<p>Our model revolves around expanding web search to the multi-domain setting. Our project helps close a gap in current research on visual learning: harvesting images to gain knowledge from them. We intend to mirror human behaviour in gathering knowledge from multi-domain sources. What matters is the information, not its source; Web 2.0 and Web 3.0 contain a great many images, and a picture speaks a thousand words. We therefore seek ways to harvest this information efficiently enough that ordinary people's queries can be answered from information extracted from the pictures.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Our study uses Faster R-CNN on the relevant images to make the process more accurate, faster, and relevant. We try to mimic human behaviour when looking for relevant information in a multimodal source pool, whereas current research tends to rely only on the relevance scores of images, making it less accurate.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Novelty</title>
      <p>
        The bert-base-cased tokenizer tokenizes all text segments, including the questions, answers, textual sources, and image captions. Each image is represented by 100 regions from an object detection model, a Faster R-CNN variant with a ResNeXt-101 FPN backbone and Visual Genome [<xref ref-type="bibr" rid="ref3">3</xref>] pre-training. We feed the Transformer the sequence [CLS], S, [SEP], Q, A, [SEP]; to satisfy the auto-regressive property, attention masks are applied to the tokens in A. During fine-tuning, we employ the usual Masked-Language-Modeling [<xref ref-type="bibr" rid="ref4">4</xref>] loss. We decode by repeatedly appending a [MASK] to the input's end, swapping it for the predicted token, and then adding a new [MASK] for the subsequent time step. Generation ends after emitting [SEP] or [PAD], or when the length reaches a fixed limit. To expose the improvements and costs stemming from providing models with data from both modalities, we additionally provide two modality-specific variants, VLPI and VLPT, which answer image- or text-based inquiries alone rather than the whole data.
      </p>
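      <p>The decoding loop described above can be sketched as follows; the names here (e.g. predict_token) are illustrative stand-ins for the fine-tuned Transformer's masked-token predictor, not our actual implementation.</p>

```python
MASK, SEP, PAD = "[MASK]", "[SEP]", "[PAD]"
MAX_LEN = 8  # generation stops here even without [SEP]/[PAD]

def decode(prefix, predict_token):
    """Greedy decoding by repeated [MASK] filling: append a [MASK],
    let the model predict the token at that slot, substitute it,
    and repeat until [SEP], [PAD], or the length limit."""
    generated = []
    for _ in range(MAX_LEN):
        tok = predict_token(prefix + generated + [MASK])
        if tok in (SEP, PAD):
            break
        generated.append(tok)
    return generated

# Toy predictor standing in for the model:
canned = iter(["a", "red", "car", SEP])
answer = decode(["[CLS]", "what", "colour", "?", SEP], lambda toks: next(canned))
print(answer)  # ['a', 'red', 'car']
```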
    </sec>
    <sec id="sec-3">
      <title>3. Task formulation</title>
<p>Suppose somebody asks a question; a set of positive results is then produced, each being either a text snippet or an image-description pair. Each result carries attributes, such as its location or another characteristic, that serve as references and act as critical points in answering the question asked.</p>
      <p>We accomplish the task in two stages. First, given the question Q and sources s1, s2, ..., sn, the model identifies the positive pairs found by searching the photos. In the second stage, the model uses question Q and the selected sources as context C to produce answer A. Ideally, a single-stage system would combine the processing of Q, s1, s2, ..., sn to jointly determine A and C; however, we are not aware of any modelling tools that can consume sufficiently large multimodal contexts to accomplish this, so future research is needed.</p>
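      <p>The two-stage procedure can be sketched as follows; select and generate are hypothetical stand-ins for the trained source-selection and answer-generation models.</p>

```python
def answer_question(question, sources, select, generate, k=2):
    """Stage 1: score every source and keep the top-k as context C.
    Stage 2: generate answer A from the question Q and context C."""
    ranked = sorted(sources, key=lambda s: select(question, s), reverse=True)
    context = ranked[:k]
    return generate(question, context)

# Toy scorer (word overlap) and generator to exercise the control flow:
def toy_select(q, s):
    return len(set(q.split()).intersection(s.split()))

def toy_generate(q, ctx):
    return "answer from: " + "; ".join(ctx)

sources = ["the cat sat", "dogs bark loudly", "the cat purrs"]
result = answer_question("does the cat purr", sources, toy_select, toy_generate)
print(result)  # answer from: the cat sat; the cat purrs
```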
    </sec>
    <sec id="sec-4">
      <title>4. Answers from Text</title>
      <sec id="sec-4-1">
        <title>4.1. Hard Negative Mining</title>
<p>In the process of hard negative mining for text, we mine sources such as Wikipedia articles and similar material, and select those that overlap with the noun phrases extracted from the inquiry, though some questions lack clarity.</p>
        <p>DLQ-2022: International Workshop on Deep Learning for Question Answering, Co-located with the KGSWC-2022, November 21-23, 2022, Madrid, Spain.</p>
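        <p>A minimal sketch of this overlap-based selection, assuming the noun phrases have already been extracted (the toy noun list stands in for a real noun-phrase extractor):</p>

```python
def noun_overlap(question_nouns, article):
    """Count how many of the question's nouns appear in the article."""
    words = set(article.lower().split())
    return sum(1 for n in question_nouns if n in words)

def mine_sources(question_nouns, articles, threshold=1):
    """Rank articles by noun overlap and keep those above the threshold."""
    ranked = sorted(articles, key=lambda a: noun_overlap(question_nouns, a), reverse=True)
    return [a for a in ranked if noun_overlap(question_nouns, a) >= threshold]

nouns = ["eiffel", "tower"]
articles = ["The Eiffel Tower is in Paris", "Bananas are yellow", "A tower of cards"]
mined = mine_sources(nouns, articles)
print(mined)  # ['The Eiffel Tower is in Paris', 'A tower of cards']
```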
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Answers from Images</title>
      <sec id="sec-5-1">
        <title>5.1. Hard Negative Mining For Image</title>
<p>To respond to all questions and provide references, during hard negative mining we develop pairs of text- and image-based data, produced while breaking down the question. The text is sourced from articles, Wikipedia, magazines, comments, etc., and is chosen based on the nouns present in the question.</p>
        <p>For pictures, we use search-engine APIs such as the Bing APIs to find images relevant to the question on the basis of the image description and other factors, and we pair these texts and images to form pairs.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Categorization</title>
<p>We divide the questions into yes/no questions and wh-questions (which, why, how, and so on), compare such nouns across these pairs, and classify them with the help of GQA and xGQA.</p>
        <p>When a question lacks clarity, we sometimes also simply sample the sources randomly and use all the available sources.</p>
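        <p>A toy rule-based version of this split between yes/no and wh-questions (the real pipeline classifies with GQA and xGQA; the keyword lists below are illustrative):</p>

```python
WH_WORDS = ("which", "what", "why", "how", "where", "who", "when")
YES_NO_STARTS = ("is", "are", "do", "does", "can", "was", "were")

def categorize(question):
    """Return the wh-word for wh-questions, 'yes/no' for polar ones."""
    first = question.lower().split()[0]
    if first in WH_WORDS:
        return first
    if first in YES_NO_STARTS:
        return "yes/no"
    return "other"

print(categorize("Which animal is shown?"))  # which
print(categorize("Is the car red?"))         # yes/no
```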
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Answering &amp; Quality Control</title>
      <sec id="sec-6-1">
        <title>6.1. Quality Control</title>
        <p>
          To ensure that we give quality content in the answers, not something fake, we ensure quality control through two methods: one is crowd-sourcing [<xref ref-type="bibr" rid="ref5">5</xref>] and the other is feedback loops [<xref ref-type="bibr" rid="ref2">2</xref>]. A group of annotators is trained and selected; after that, each batch is given data and a bonus for out-of-the-box thinking. Each group receives data along with constructive feedback loops to correct our mistakes through this large pool of human understanding.
        </p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Fluency</title>
        <p>
          Fluency is measured with the help of the BARTScore, a newly proposed measure based on accurate assessment of paraphrase quality. The BARTScore [<xref ref-type="bibr" rid="ref6">6</xref>] measures the probability of generating one text from another [<xref ref-type="bibr" rid="ref7">7</xref>].
        </p>
        <p>
          In our scenario this is calculated as BARTScore(r, c) [<xref ref-type="bibr" rid="ref8">8</xref>], which can be understood as the likelihood of producing a candidate c given a reference r.
        </p>
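        <p>A minimal sketch of one common BARTScore formulation, the mean per-token log-likelihood of the candidate given the reference; the log-probabilities below are hypothetical numbers, since a real score would come from a seq2seq model such as BART:</p>

```python
import math

def bart_score(token_log_probs):
    """Mean log p(c_t | earlier candidate tokens, reference r)."""
    return sum(token_log_probs) / len(token_log_probs)

# Hypothetical per-token probabilities for a 4-token candidate:
lp = [math.log(0.9), math.log(0.8), math.log(0.95), math.log(0.7)]
score = bart_score(lp)
print(round(score, 3))  # -0.184
```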
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Faster R-CNN</title>
<p>Faster R-CNN is an extension of Fast R-CNN and, as its name implies, is quicker than Fast R-CNN because of the region proposal network (RPN). Regions-with-CNN-features (R-CNN) models traditionally generate proposals with algorithms such as selective search; using a novel RPN to generate regional proposals instead saves considerable time. Faster R-CNN combined with the RPN is one of the best deep-learning-based detectors in the R-CNN series.</p>
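      <p>The RPN idea can be illustrated by its anchor grid: at every cell of the convolutional feature map, boxes of several scales and aspect ratios are laid down, and the trained network scores and refines these instead of running selective search. The sketch below only generates the anchors; the stride, scales, and ratios are illustrative values, not those of any particular implementation.</p>

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16, scales=(64, 128), ratios=(0.5, 1.0)):
    """One (x1, y1, x2, y2) anchor per cell, scale, and aspect ratio."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # centre of this feature-map cell in image coordinates
            cx, cy = x * stride + stride / 2, y * stride + stride / 2
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)

a = make_anchors(2, 2)
print(a.shape)  # (16, 4): 2x2 cells, 2 scales, 2 ratios
```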
      <p>
        The ROI Pooling layer, part of a CNN framework for effective end-to-end object identification, is intimately related to the proposals obtained by the RPN [<xref ref-type="bibr" rid="ref10">9</xref>]. Based on Faster R-CNN implementations trained with the Caffe deep learning framework, the viability of R-CNN is examined on the ResNet-101 and PVANET networks.
      </p>
      <sec id="sec-7-1">
        <title>7.0.1. DRAWBACKS OF R-CNN</title>
        <p>The fact that RPN is trained so that all of its anchors in a mini-batch are 256 and taken from
the same single image presents one potential disadvantage of the quicker R-CNN. As a result,
samples may be correlated, which means that their features are likewise associated,
delaying convergence.</p>
        <p>From here, we could see that the pros are more than the cons, so it is not a bad idea to use it
until a better and more advanced system sets its foot on the market.3
3DLQ-2022: International Workshop on Deep Learning for Question Answering, Co-located with the KGSWC-2022,
November 21-23, 2022, Madrid, Spain.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Multimodal feature-wise attention module (MulFA)</title>
      <p>
        Most currently used techniques take only spatial cross-grounding into account; in other words, they determine the relationship between each spatial object in an image and the question. These models concentrate solely on learning spatial attention and entirely disregard attention along the feature-channel dimension of both the image and the question representation. Several computer vision tasks, for example classification [<xref ref-type="bibr" rid="ref8">8</xref>] and image captioning [10], have shown that incorporating a feature-channel attention mechanism yields better performance than usual, because it allows the model to learn effectively.
      </p>
<p>In this paper we propose MulFA, which seeks to produce greater attention weights to emphasize informative aspects and suppress less significant ones. To generate attention weights, MulFA uses bilinear models. There are two types of MulFA, one for the image modality and the other for text: IMulFA (see figure 2) and QMulFA (see figure 3). IMulFA is used for modulating images, and QMulFA is used for the question or text modality.</p>
      <sec id="sec-8-1">
        <title>8.0.1. Image multimodal feature-wise attention module (IMulFA)</title>
        <p>
          In this process the image is understood by the model, in four steps [11][<xref ref-type="bibr" rid="ref7">7</xref>]:
1. Squeezing the image feature vectors,
2. Fusing the feature-wise statistics and question signals,
3. Computing the feature-wise attention weights,
4. Feature-wise re-calibration of the image features.
        </p>
        <p>The process is described more briefly with the help of figure 2, which clearly shows the flow of IMulFA. Here V is the image in vector form, K is the number of object features, and M is the number of feature channels.</p>
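        <p>The four steps can be sketched as follows; the weight matrices and the simplified element-wise fusion are illustrative stand-ins for the learned bilinear-fusion and linear layers:</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def imulfa(V, q, Wf, Wh):
    """V: K x M object features, q: 1 x M question feature."""
    v_bar = V.mean(axis=0, keepdims=True)  # 1. squeeze over the K objects
    fused = (v_bar @ Wf) * (q @ Wf)        # 2. fuse statistics with question signal
    alpha = sigmoid(fused @ Wh)            # 3. feature-wise weights, 1 x M
    return alpha * V                       # 4. re-calibrate every channel of V

rng = np.random.default_rng(0)
K, M = 4, 6
V, q = rng.normal(size=(K, M)), rng.normal(size=(1, M))
Wf, Wh = rng.normal(size=(M, M)), rng.normal(size=(M, M))
V_new = imulfa(V, q, Wf, Wh)
print(V_new.shape)  # (4, 6)
```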
      </sec>
      <sec id="sec-8-2">
        <title>8.0.2. Question multimodal feature-wise attention module(QMulFA)</title>
<p>In this process the question or text is processed by the model, in just three simple steps:
1. Combining information from multiple sources to create the feature-wise attention weight vectors,
2. Squeezing the attention weight vector,
3. Re-calibrating the question features.</p>
        <p>Figure 3 illustrates the process. The question feature-wise attention weight vector is created by fusing the signals from the visual and question feature-channel statistics. The ith object feature vector v_i produces the ith weight vector h_i as
h_i = W_h f(v_i, Q),
where W_h is a parameter matrix of the single linear layer and f(v_i, Q) denotes the fusion feature obtained by a bilinear model.</p>
        <p>There are K weight vectors h_i, each of which can direct the question's feature-wise focus. Accordingly, to integrate the effects of all items we use an average pooling operation:
α = (1/K) Σ_{i=1}^{K} h_i,
where α denotes the question feature-wise attention weight vector.</p>
        <p>The question features are then re-calibrated by combining Q with the attention weights via element-wise multiplication:
Q' = α ⊙ Q,
where Q' ∈ R^{1×M} is the feature-wise attended question feature. We define this procedure as QMulFA.</p>
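        <p>The three steps above can be sketched as follows; the weight matrices and the simplified element-wise fusion stand in for the learned bilinear model and linear layer:</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def qmulfa(V, q, Wf, Wh):
    """V: K x M object features, q: 1 x M question feature."""
    H = sigmoid(((V * q) @ Wf) @ Wh)  # 1. one feature-wise weight vector per object
    alpha = H.mean(axis=0)            # 2. average-pool the K weight vectors
    return alpha * q                  # 3. re-calibrate the question feature

rng = np.random.default_rng(1)
K, M = 4, 6
V, q = rng.normal(size=(K, M)), rng.normal(size=(1, M))
Wf, Wh = rng.normal(size=(M, M)), rng.normal(size=(M, M))
q_new = qmulfa(V, q, Wf, Wh)
print(q_new.shape)  # (1, 6)
```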
      </sec>
    </sec>
    <sec id="sec-9">
      <title>9. Multimodal feature-wise co-attention module for VQA</title>
      <p>With regard to the picture and question modalities, we have created a feature-wise attention-learning module. We suggest three co-attention mechanisms to combine them, each with a different approach to how picture and question feature-wise attention is prioritized. The first two mechanisms, which we call alternating co-attention, perform feature-wise attention on the query and the image in turn, as below:</p>
      <p>V' = IMulFA(V, Q), Q' = QMulFA(V', Q) or Q' = QMulFA(V, Q), V' = IMulFA(V, Q'). The third mechanism, which we call parallel co-attention, generates image and question attention simultaneously, defined as</p>
      <p>V’ = IMulFA(V,Q) Q’ = QMulFA(V,Q)
10. Multimodal spatial attention module
The issue of visual question answering (VQA) in computer vision is widely recognized. Due to
how crucial it is to comprehend an image, text-based VQA assignments have recently attracted a
lot of attention. In this area of research, we suggest a cutting-edge encoder-decoder framework to
specifically predict complicated responses. We use the attention mechanism, which can choose
characteristics based on the questions, to obtain the more pertinent features for the inquiry.</p>
      <p>In order to answer correctly, or even relevantly to the question, we need to focus on the region related to the question; hence we make use of the multimodal spatial attention module. As its name suggests, it focuses on the important parts of the image and suppresses the rest. To do that, we first fuse the visual features V' ∈ R^{K×M} and the question features Q' ∈ R^{1×M}.</p>
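      <p>A minimal sketch of this step, assuming the K x M fused feature is already computed (the scoring weights are an illustrative stand-in for the bilinear model):</p>

```python
import numpy as np

def spatial_attention(F, Ws):
    """Score each of the K regions of the fused feature F (K x M) and
    softmax over regions to get the spatial attention distribution."""
    scores = (F @ Ws).ravel()          # one score per image region
    e = np.exp(scores - scores.max())  # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(2)
K, M = 5, 6
F = rng.normal(size=(K, M))
Ws = rng.normal(size=(M, 1))
p = spatial_attention(F, Ws)
print(p.shape, round(float(p.sum()), 6))  # (5,) 1.0
```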
      <p>The attention distribution over the image regions is produced by the bilinear model's computations: the fusion feature is fed to a softmax function, as illustrated in figure 1.</p>
    </sec>
    <sec id="sec-11">
      <title>11. VQA 1.0 and VQA 2.0</title>
      <p>Bottom-up [12] applies the Faster R-CNN-based bottom-up attention approach suggested in [13], which attends to the regions related to a question.</p>
      <p>MLB (Multimodal Low-Rank Bilinear Pooling) [14] addresses the computational cost of the bilinear model while retaining its acceptable capacity for representation.</p>
      <p>MFH (Multimodal Factorized High-Order Pooling) extends MLB to incorporate multimodal characteristics through high-order pooling; it utilizes high-level characteristics and image convolutional features.</p>
      <p>The BAN (Bilinear Attention Network) adopts a bottom-up focus on image attributes together with question features. BAN produces an attention map by computing the bilinear interaction between each pair of picture and question features.</p>
      <p>The counter model is specifically designed to handle hard counting questions, which require a model to specify which types of objects need to be counted.</p>
    </sec>
    <sec id="sec-12">
      <title>12. Datasets</title>
      <p>VQA 1.0. This dataset contains more than 204k images from the Microsoft Common Objects in Context (MS COCO) dataset, more than 600k questions (at least 3 questions per image), and around 6 million answers (10 answers per question). It has three splits: 1. Train: 80k images and 240k question-answer pairs; 2. Val: 40k images and 120k question-answer pairs; 3. Test: 80k images and 240k question-answer pairs.</p>
      <p>VQA 2.0. This is the updated version of VQA 1.0 at a larger scale, with 240k images from MS COCO, more than 1 million questions, and 11 million answers. It is composed of 443,757 image-question-answer triples for training, 214,354 for validation, and 447,793 for testing [15].</p>
      <p>Our evaluation findings for the VQA 1.0 test set are displayed in table 1. We contrast the outcomes of our models with those of a number of cutting-edge models, including the VQA 1.0 Challenge's reigning champion, the MFH model [16]. Table 1 demonstrates that our model UFSCAN outperforms every method, including the challenge winner; with the exception of the three MFH-based models, it greatly outperforms the rest. The most recent model, MFH+CoAtt+GloVe (bottom-up), is trained on the same train and validation sets as UFSCAN and uses the same bottom-up attention features. Notably, MFH uses more question characteristics than UFSCAN and adds a question spatial attention method. Nevertheless, UFSCAN exceeds the top-performing MFH model, MFH+CoAtt+GloVe (bottom-up), highlighting the benefits of our suggested MulFA. Additionally, with data augmentation using the Visual Genome, our model UFSCAN+VG achieves the best overall accuracy of 70.19% and 70.24% on the test-dev and test-standard sets, respectively. As a consequence, UFSCAN performs at the cutting edge on VQA 1.0 [17].</p>
    </sec>
    <sec id="sec-13">
      <title>13. Conclusions</title>
      <p>In this work we create a new model for answering questions in a multimodal way, a great challenge in these changing times as we move from Web 2.0 to Web 3.0. The model is designed to simulate the environment one faces in the real world while searching for information: it searches multiple domains for the answers rather than depending only on a text query [20].</p>
      <p>
        At the same time, we also focus on the fluency and accuracy of the answer. For the purpose of bridging multimodal QA and IR research, we have offered both a restricted and a complete retrieval setting. In addition to reflecting our daily web experience, this data set offers the community a playground to investigate significant sub-challenges, with the goal of developing a single model for knowledge aggregation, multimodal reasoning, and open-domain visual comprehension. Our project's ultimate objective is to gather pertinent data from the multi-domain mode, combine it over a sizable context window, and produce fluent, natural answers.
      </p>
      <p>
[9] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems 28 (2015).
[10] H. Xu, K. Saenko, Ask, attend and answer: Exploring question-guided spatial attention for visual question answering, in: European Conference on Computer Vision, Springer, 2016, pp. 451-466.
[11] S. Zhang, M. Chen, J. Chen, F. Zou, Y.-F. Li, P. Lu, Multimodal feature-wise co-attention method for visual question answering, Information Fusion 73 (2021) 1-10.
[12] D. Teney, P. Anderson, X. He, A. Van Den Hengel, Tips and tricks for visual question answering: Learnings from the 2017 challenge, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4223-4232.
[13] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and top-down attention for image captioning and visual question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6077-6086.
[14] J.-H. Kim, K.-W. On, W. Lim, J. Kim, J.-W. Ha, B.-T. Zhang, Hadamard product for low-rank bilinear pooling, arXiv preprint arXiv:1610.04325 (2016).
[15] F. Ortiz-Rodriguez, S. Tiwari, R. Panchal, J. M. Medina-Quintero, R. Barrera, Mexin: multidialectal ontology supporting NLP approach to improve government electronic communication with the Mexican ethnic groups, in: DG.O 2022: The 23rd Annual International Conference on Digital Government Research, 2022, pp. 461-463.
[16] Z. Yu, J. Yu, C. Xiang, J. Fan, D. Tao, Beyond bilinear: Generalized multimodal factorized high-order pooling for visual question answering, IEEE Transactions on Neural Networks and Learning Systems 29 (2018) 5947-5959.
[17] D. Gaurav, F. O. Rodriguez, S. Tiwari, M. Jabbar, Review of machine learning approach for drug development process, in: Deep Learning in Biomedical and Health Informatics, CRC Press, 2021, pp. 53-77.
[18] Y. Zhang, J. Hare, A. Prügel-Bennett, Learning to count objects in natural images for visual question answering, arXiv preprint arXiv:1802.05766 (2018).
[19] J.-H. Kim, J. Jun, B.-T. Zhang, Bilinear attention networks, in: Advances in Neural Information Processing Systems, volume 31, Curran Associates, Inc., 2018.
[20] S. Gupta, S. Tiwari, F. Ortiz-Rodriguez, R. Panchal, KG4ASTRA: question answering over Indian missiles knowledge graph, Soft Computing 25 (2021) 13841-13855.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Bender</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koller</surname>
          </string-name>
          ,
          <article-title>Climbing towards nlu: On meaning, form, and understanding in the age of data, in: Proceedings of the 58th annual meeting of the association for computational linguistics</article-title>
          ,
          <year>2020</year>
          , pp.
          <fpage>5185</fpage>
          -
          <lpage>5198</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bisk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thomason</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Andreas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lapata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lazaridou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>May</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nisnevich</surname>
          </string-name>
          , et al.,
          <source>Experience grounds language</source>
          , arXiv preprint arXiv:
          <year>2004</year>
          .
          <volume>10151</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Krishna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Groth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kravitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kalantidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Shamma</surname>
          </string-name>
          , et al.,
          <article-title>Visual genome: Connecting language and vision using crowdsourced dense image annotations</article-title>
          ,
          <source>International journal of computer vision 123</source>
          (
          <year>2017</year>
          )
          <fpage>32</fpage>
          -
          <lpage>73</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , Bert:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          , arXiv preprint arXiv:
          <year>1810</year>
          .
          <volume>04805</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Nangia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sugawara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Trivedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Warstadt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Vania</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Bowman</surname>
          </string-name>
          ,
          <article-title>What ingredients make for an efective crowdsourcing protocol for dificult nlu data collection tasks?</article-title>
          ,
          <source>arXiv preprint arXiv:2106.00794</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yuan</surname>
          </string-name>
          , G. Neubig, P. Liu, Bartscore:
          <article-title>Evaluating generated text as text generation</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>27263</fpage>
          -
          <lpage>27277</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yuan</surname>
          </string-name>
          , G. Neubig, P. Liu, Bartscore:
          <article-title>Evaluating generated text as text generation</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>27263</fpage>
          -
          <lpage>27277</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K. J.</given-names>
            <surname>Shih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hoiem</surname>
          </string-name>
          ,
          <article-title>Where to look: Focus regions for visual question answering</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>4613</fpage>
          -
          <lpage>4621</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          , K. He, R. Girshick, J. Sun,
          <article-title>Faster R-CNN: Towards real-time object detection with region proposal networks</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>28</volume>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>