<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Knowledge-based Deep Heterogeneous Graph Matching Model for Multimodal RecipeQA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yunjie Wu</string-name>
          <email>yunjie_wu@tju.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sai Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaowang Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhiyong Feng</string-name>
          <email>zyfeng@tju.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liang Wan</string-name>
          <email>lwan@tju.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Intelligence and Computing, Shenzhen Research Institute of Tianjin University, Tianjin University</institution>
          ,
          <addr-line>Tianjin</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>College of Intelligence and Computing, Tianjin University</institution>
          ,
          <addr-line>Tianjin</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>RecipeQA is a multimodal task that requires understanding a multimodal context. Since recipe instructions are procedural, temporal relations are essential to support procedural understanding. Due to the high divergence of representations, it is challenging to model the temporal relations of multimodal and dynamic recipes. In this paper, we propose a Knowledge-based Deep Heterogeneous Graph Matching Model (DHGM) to model the temporal structures of recipes. Firstly, we present a knowledge-based recipe encoder to reduce the divergence between recipe entities. Secondly, we design a two-stage heterogeneous graph matching method to guarantee neighborhood consensus. Experimental results show that our proposed approach obtains the best performance on the RecipeQA dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal Reading Comprehension</kwd>
        <kwd>Heterogeneous Graph</kwd>
        <kwd>Matching</kwd>
        <kwd>Knowledge Graph</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>RecipeQA [1] is a recently proposed Multimodal Machine Comprehension
(M3C) task that comprises instructional recipes with multimodal context. RecipeQA
provides several multiple-choice tasks that require a joint understanding of both
visual and textual procedural knowledge.</p>
      <p>Since cooking instructions are procedural, understanding temporal relations is
essential for RecipeQA. However, due to the high divergence of representations
between different instructions, it is challenging to model these temporal relations.
Beyond the heterogeneity of multimodal data, even the same textual entity may be
expressed differently across steps. For example, the “ground beef” in Step 1 becomes a
“patty” in Step 3 after cooking, yet the two mentions still refer to the same entity.
Existing works either learn the dynamic states of entities for procedural
reasoning [2] or attempt to answer questions with attention-based alignment [3, 4]. It is
not easy for them to capture the semantics of the temporal structure, since the
structural semantics is lost when the recipes are encoded. Inspired by DGMC [6], we
model the temporal structure of recipes with deep graph matching.</p>
      <p>In this paper, we propose DHGM for RecipeQA to explicitly model temporal
structures in the graph matching process. Firstly, we present a
knowledge-based recipe embedding to reduce the divergence between recipe entities.
Secondly, we design a two-stage heterogeneous graph matching method to guarantee
neighborhood consensus for injecting temporal structure. Finally, we conduct
experiments on the RecipeQA dataset and achieve the best performance.</p>
    </sec>
    <sec id="sec-2">
      <title>Deep Heterogeneous Graph Matching Model</title>
      <p>The framework of DHGM is shown in Figure 1. We build heterogeneous graphs
with temporal relations for the recipe and the question, which are embedded based on
the knowledge graph. The correct answer is chosen with a two-stage method:
local heterogeneous feature matching and neighborhood consensus matching.</p>
    <sec id="sec-2">
      <title>Step 1: Put the</title>
      <p>minced meat in a
bowl and add salt,
pepper, parmigiano
and 1 egg. Mix well.
Food</p>
      <p>Ontologies
Recipes</p>
    </sec>
    <sec id="sec-3">
      <title>FoodKG</title>
    </sec>
    <sec id="sec-4">
      <title>Reciptor</title>
      <p>ResNet-50
   


p
r
o
j
p
r
o
j</p>
    </sec>
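      <p>To make the two-stage pipeline of Figure 1 concrete, the following is a minimal,
hypothetical sketch of how the stages could be composed at inference time, assuming
PyTorch and placeholder modules (encode, refine, score) that stand in for the
components described in the following sections; it is an illustration, not the
released implementation.</p>
      <preformat>
import torch

def dhgm_forward(recipe_graph, question_graph, candidates, modules, refine_steps=3):
    # Knowledge-based embedding of both heterogeneous graphs (placeholder encoder).
    H_r = modules.encode(recipe_graph)     # recipe node embeddings, shape [n_r, d]
    H_q = modules.encode(question_graph)   # question node embeddings, shape [n_q, d]

    # Stage 1: local heterogeneous feature matching (initial soft correspondences).
    S = torch.softmax(H_q @ H_r.t(), dim=-1)

    # Stage 2: neighborhood consensus matching (iterative refinement of S).
    for _ in range(refine_steps):
        S = modules.refine(S, H_q, H_r, question_graph, recipe_graph)

    # Choose the answer candidate that best matches the aligned recipe context.
    scores = torch.stack([modules.score(H_q, S @ H_r, a) for a in candidates])
    return torch.softmax(scores, dim=0)
      </preformat>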
    <sec id="sec-5">
      <title>Knowledge-based embedding Local heterogeneous feature matching</title>
      <p />
      <p>GNN1
MLP( - )
 ( )</p>
      <p>S
i
m
,(
·
·
)</p>
      <p>(0)
MLP


graph FoodKG. With an anchor recipe ra, we extract the related triples from
FoodKG through a similarity model. Given an entity ea, a positive partner ep
connected to ea and a negative partner en not connected to ea, the triplet loss
is used to optimize the learned embeddings in semantic space:</p>
      <p>L_emb(e_a, e_p, e_n) = max(0, d(e_a, e_p) - d(e_a, e_n) + α)    (1)</p>
      <p>where d(e_i, e_j) is used to measure the distance between two embeddings and α is a
margin parameter.</p>
      <p>Furthermore, we employ a pretrained ResNet-50 model to represent each image.</p>
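      <p>As a minimal illustration of Eq. (1), the triplet objective can be realized with
PyTorch's built-in triplet margin loss; the entity encoder below is a hypothetical
placeholder for the FoodKG-based embedding model, and the margin value is an
assumption.</p>
      <preformat>
import torch.nn as nn

# Triplet loss over knowledge-based entity embeddings, as in Eq. (1):
#   L_emb = max(0, d(e_a, e_p) - d(e_a, e_n) + margin)
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)  # margin=1.0 is an assumption

def embedding_loss(encoder, anchor_entity, positive_entity, negative_entity):
    # `encoder` is a hypothetical module mapping entities to vectors in the
    # shared semantic space learned from FoodKG triples.
    e_a = encoder(anchor_entity)
    e_p = encoder(positive_entity)
    e_n = encoder(negative_entity)
    return triplet_loss(e_a, e_p, e_n)
      </preformat>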
    </sec>
    <sec id="sec-4">
      <title>Local heterogeneous feature matching</title>
      <p>We construct a graph for each recipe and each question with the temporal relations.
A graph can be defined as G = (V, A), where V is the vertex set and A is the
adjacency matrix of the edges. For a heterogeneous node with type t, we employ a
type-specific transformation matrix M_t to project the node into the same semantic
space:</p>
      <p>x'_i = M_t x_i    (2)</p>
      <p>where x_i is the original feature and x'_i is the projected feature. The node features
are initialized as</p>
      <p>h_i^(0) = x'_i    (3)</p>
      <p>and the node feature h_i^(t) in layer t is updated with localized information from the
previous layer as follows:</p>
      <p>a_ij^(t) = softmax_j(att(h_i^(t-1), h_j^(t-1)))    (4)</p>
      <p>h_i^(t) = σ(Σ_{j∈N_i} a_ij^(t) h_j^(t-1))    (5)</p>
      <p>where N_i is the set of neighbors of node i, att(·) is the self-attention model for
learning the weights of different neighbors, and σ(·) denotes a nonlinear activation.
Given the node embeddings of the recipe H_r and of the question H_q, we obtain the
initial correspondence matrix as:</p>
      <p>S^(0) = masked_softmax(H_q H_r^T)    (6)</p>
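      <p>The following is a minimal sketch of Eqs. (2)-(6) in PyTorch. The type-specific
projections, the attention scorer, the choice of ReLU as the nonlinearity, and the mask
construction are simplified stand-ins for the components described above rather than
the exact implementation.</p>
      <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalHeteroMatcher(nn.Module):
    def __init__(self, in_dims, hidden_dim):
        super().__init__()
        # One projection matrix M_t per node type t, Eq. (2); `in_dims` maps
        # a type name to its raw feature dimension.
        self.type_proj = nn.ModuleDict(
            {t: nn.Linear(d, hidden_dim, bias=False) for t, d in in_dims.items()}
        )
        # Attention scorer att(.) over pairs of node features, Eq. (4).
        self.att = nn.Linear(2 * hidden_dim, 1)

    def project(self, feats, node_types):
        # h_i^(0) = x'_i = M_t x_i, Eqs. (2)-(3).
        return torch.stack(
            [self.type_proj[t](x) for x, t in zip(feats, node_types)]
        )

    def propagate(self, h, neighbors):
        # One layer of attention-weighted aggregation over each node's neighbors.
        out = []
        for i, nbrs in enumerate(neighbors):
            pairs = torch.cat([h[i].expand(len(nbrs), -1), h[nbrs]], dim=-1)
            a = F.softmax(self.att(pairs).squeeze(-1), dim=0)               # Eq. (4)
            out.append(torch.relu((a.unsqueeze(-1) * h[nbrs]).sum(dim=0)))  # Eq. (5)
        return torch.stack(out)

    def initial_correspondence(self, H_q, H_r, mask):
        # S^(0) = masked_softmax(H_q H_r^T), Eq. (6); invalid pairs get -inf logits.
        logits = H_q @ H_r.t()
        logits = logits.masked_fill(mask.logical_not(), float('-inf'))
        return F.softmax(logits, dim=-1)
      </preformat>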
    </sec>
    <sec id="sec-5">
      <title>Neighborhood consensus matching</title>
      <p>We iteratively refine the correspondence matrix S^(l) to guarantee the
neighborhood consensus. The node features H_q and H_r can be passed along the soft
correspondence S to obtain node features H'_r and H'_q in the other domain:</p>
      <p>H'_r = S^T H_q  and  H'_q = S H_r    (7)</p>
      <p>We then employ a node indicator function I to map corresponding neighborhoods
into the sub-domain and propagate messages with a GNN model:</p>
      <p>O_q = GNN(I, X_q, A_q)  and  O_r = GNN((S^(l))^T I, X_r, A_r)    (8)</p>
      <p>where X is the feature matrix of the nodes. We measure the consensus of a node pair
(v_i, v_j) by d_ij = o_i^q - o_j^r, and the correspondence matrix is updated as
follows:</p>
      <p>S^(l+1)_ij = softmax_j(S^(l)_ij + MLP(d_ij))    (9)</p>
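      <p>A minimal sketch of the refinement loop of Eqs. (7)-(9), assuming PyTorch, a
generic gnn(indicator, features, adjacency) message-passing function, and an identity
matrix as the node indicator I (following the consensus idea of DGMC [6]); these are
simplifying assumptions rather than the exact implementation.</p>
      <preformat>
import torch
import torch.nn.functional as F

def consensus_refine(S, X_q, X_r, A_q, A_r, gnn, mlp, num_steps=3):
    # Iteratively refine the soft correspondence matrix S, Eqs. (7)-(9).
    n_q = X_q.size(0)
    for _ in range(num_steps):
        # Node indicator I (here: one-hot coloring of question nodes), mapped
        # onto the recipe graph through the current correspondences, Eq. (8).
        I = torch.eye(n_q, device=S.device)
        O_q = gnn(I, X_q, A_q)
        O_r = gnn(S.t() @ I, X_r, A_r)

        # Pairwise consensus differences d_ij and the update of Eq. (9).
        d = O_q.unsqueeze(1) - O_r.unsqueeze(0)      # shape [n_q, n_r, dim]
        S = F.softmax(S + mlp(d).squeeze(-1), dim=-1)
    return S
      </preformat>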
      <p>Finally, we utilize a similarity model followed by an MLP layer to obtain the final
answer:</p>
      <p>P(a_k) = softmax(similarity(H_q, S H_r))    (10)</p>
      <p>The final objective function is defined as follows:</p>
      <p>L = - Σ_{i∈V} log(S^(L)_{i,gt(i)}) - Σ_{(a+, a-)∈A} [log(p(a+)) + log(1 - p(a-))]    (11)</p>
      <p>where gt(i) is the ground-truth correspondence of node i.</p>
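      <p>For completeness, a hedged sketch of the answer probability of Eq. (10) and the
combined objective of Eq. (11); the similarity module, the candidate answer set, and the
tensor shapes are illustrative assumptions.</p>
      <preformat>
import torch
import torch.nn.functional as F

def answer_distribution(H_q, H_r, S, candidates, similarity, mlp):
    # P(a_k) = softmax(similarity(H_q, S H_r)), Eq. (10): one score per candidate.
    matched = S @ H_r
    scores = torch.stack([mlp(similarity(H_q, matched, a)) for a in candidates])
    return F.softmax(scores.squeeze(-1), dim=0)

def dhgm_loss(S_final, gt, p_pos, p_neg):
    # Eq. (11): a correspondence term over ground-truth node matches plus a
    # ranking term over positive / negative answer probabilities.
    rows = torch.arange(S_final.size(0))
    corr_term = -torch.log(S_final[rows, gt]).sum()
    answer_term = -(torch.log(p_pos) + torch.log(1.0 - p_neg)).sum()
    return corr_term + answer_term
      </preformat>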
    </sec>
    <sec id="sec-6">
      <title>Experiments and Evaluation</title>
      <p>We conduct the experiments on three tasks of the RecipeQA dataset (i.e., visual
cloze, visual coherence, and visual ordering). The models are trained on each task
separately. We employ accuracy as the metric.</p>
      <p>Table 1 presents the quantitative results on the test set. The proposed DHGM
outperforms the other benchmark models on all three tasks, demonstrating that
temporal structural information plays an essential role in procedural understanding.
We also evaluate a DHGM variant without the knowledge-based recipe embedding
to verify the impact of FoodKG; it still outperforms the benchmark models.
DHGM explicitly models the temporal structure in graph matching, which is
beneficial for understanding the multimodal temporal context. We observe that
the performance drops without the knowledge-based embedding, which shows
that knowledge is helpful for enhancing the recipe representations.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>In this paper, we propose DHGM for RecipeQA to explicitly model the temporal
structure in graph matching. We believe that the graph-matching-based idea of
modeling the temporal structure is meaningful for understanding multimodal
procedural data similar to RecipeQA. Experimental results on the RecipeQA
dataset demonstrate the excellent performance of the proposed DHGM. In
future work, we will extend our approach to other tasks with heterogeneous
procedural data to verify its generalization.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The work described in this paper is partially supported by the Shenzhen Science
and Technology Foundation (JCYJ20170816093943197) and the Key Research and
Development Program of Hubei Province (No. 2020BAB026).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Yagcioglu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erdem</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erdem</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ikizler-Cinbis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes</article-title>
          .
          <source>In: Proc. of EMNLP</source>
          <year>2018</year>
          , pp.
          <fpage>1358</fpage>
          -
          <lpage>1368</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Amac</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yagcioglu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erdem</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erdem</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Procedural Reasoning Networks for Understanding Multimodal Procedures</article-title>
          .
          <source>In: Proc. of CoNLL</source>
          <year>2019</year>
          , pp.
          <fpage>441</fpage>
          -
          <lpage>451</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Faghihi</surname>
            ,
            <given-names>H. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mirzaee</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paliwal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kordjamshidi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Latent Alignment of Procedural Concepts in Multimodal Recipes</article-title>
          .
          <source>In: Proc. of ALVR</source>
          <year>2020</year>
          , pp.
          <fpage>26</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bai</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Multi-Level Multimodal Transformer Network for Multimodal Recipe Comprehension</article-title>
          .
          <source>In: Proc. of SIGIR</source>
          <year>2020</year>
          , pp.
          <fpage>1781</fpage>
          -
          <lpage>1784</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaki</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          :
          <article-title>Reciptor: An effective pretrained model for recipe representation learning</article-title>
          .
          <source>In: Proc. of KDD</source>
          <year>2020</year>
          , pp.
          <fpage>1719</fpage>
          -
          <lpage>1727</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenssen</surname>
            ,
            <given-names>J. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morris</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Masci</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kriege</surname>
            ,
            <given-names>N. M.</given-names>
          </string-name>
          :
          <article-title>Deep graph matching consensus</article-title>
          .
          <source>In: Proc. of ICLR</source>
          <year>2020</year>
          ,
          <article-title>poster</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>