A Knowledge-based Deep Heterogeneous Graph Matching Model for Multimodal RecipeQA

Yunjie Wu1, Sai Zhang1, Xiaowang Zhang1, Zhiyong Feng1,2*, and Liang Wan1

1 College of Intelligence and Computing, Tianjin University, Tianjin, China
2 College of Intelligence and Computing, Shenzhen Research Institute of Tianjin University, Tianjin University, Tianjin, China
{yunjie_wu, zhang_sai, xiaowangzhang, zyfeng, lwan}@tju.edu.cn
* Corresponding Author

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. RecipeQA is a multimodal task that requires understanding a multimodal context. Since recipe instructions are procedural, temporal relations are essential to support procedural understanding. Due to the high divergence of representations, it is challenging to model the temporal relations of multimodal and dynamic recipes. In this paper, we propose a Knowledge-based Deep Heterogeneous Graph Matching Model (DHGM) to model the temporal structure of recipes. Firstly, we present a knowledge-based recipe encoder to reduce the divergence between recipe entities. Secondly, we design a two-stage heterogeneous graph matching method to guarantee neighborhood consensus. Experimental results show that our approach obtains the best performance on the RecipeQA dataset.

Keywords: Multimodal Reading Comprehension · Heterogeneous Graph Matching · Knowledge Graph

1 Introduction

RecipeQA [1] is a recently proposed Multimodal Machine Comprehension (M3C) task that comprises instructional recipes with multimodal context. RecipeQA provides several multiple-choice tasks that require a joint understanding of both visual and textual procedural knowledge.

Since cooking instructions are procedural, understanding temporal relations is essential for RecipeQA. However, due to the high divergence of representations between different instructions, modeling temporal relations is challenging. In addition to the heterogeneity of multimodal data, even the same textual entity can change across steps. For example, the "ground beef" of Step 1 may become a "patty" in Step 3 after cooking, yet both refer to the same entity. Existing works either learn the dynamic states of entities for procedural reasoning [2] or answer questions with attention-based alignment [3, 4]. Neither readily captures the semantics of the temporal structure, since this structural information is lost while encoding the recipes. Inspired by DGMC [6], we model the temporal structure of recipes through deep graph matching.

In this paper, we propose DHGM for RecipeQA to explicitly model temporal structure during graph matching. Firstly, we present a knowledge-based recipe embedding to reduce the divergence between recipe entities. Secondly, we design a two-stage heterogeneous graph matching method that guarantees neighborhood consensus for injecting temporal structure. Finally, we conduct experiments on the RecipeQA dataset and achieve the best performance.

2 Deep Heterogeneous Graph Matching Model

The framework of DHGM is shown in Figure 1. We build heterogeneous graphs with temporal relations for the recipe and the question, whose nodes are embedded with the help of a knowledge graph. The correct answer is then chosen with a two-stage method: local heterogeneous feature matching and neighborhood consensus matching.

Fig. 1. The overall framework of DHGM for RecipeQA

2.1 Knowledge-based recipe embedding

To reduce the divergence between entity representations in recipes, we employ the pretrained model Reciptor [5] for recipe embedding with an external knowledge graph, FoodKG. For an anchor recipe r_a, we extract the related triples from FoodKG through a similarity model. Given an entity e_a, a positive partner e_p connected to e_a, and a negative partner e_n not connected to e_a, a triplet loss is used to optimize the learned embeddings in semantic space:

\[ \mathcal{L}_{emb}(e_a, e_p, e_n) = \max\big(0,\ d(e_a, e_p) + \alpha - d(e_a, e_n)\big) \tag{1} \]

where d(e_i, e_j) measures the distance between two embeddings and α is a margin parameter. Furthermore, we employ a pretrained ResNet-50 model to represent each image.
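As an illustration of Eq. (1), below is a minimal PyTorch sketch of the triplet objective. The Euclidean distance, the margin value, and the batch layout are assumptions made for the sketch; in our setting the inputs would be entity embeddings produced by Reciptor over FoodKG triples.

```python
import torch
import torch.nn.functional as F

def triplet_loss(e_a: torch.Tensor, e_p: torch.Tensor,
                 e_n: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """Triplet loss of Eq. (1) over a batch of (anchor, positive,
    negative) entity embeddings of shape (batch, dim).

    alpha is the margin parameter; 0.2 is an assumed default, and
    Euclidean distance stands in for the unspecified d(., .).
    """
    d_pos = F.pairwise_distance(e_a, e_p)  # d(e_a, e_p)
    d_neg = F.pairwise_distance(e_a, e_n)  # d(e_a, e_n)
    # max(0, d_pos + alpha - d_neg), averaged over the batch
    return torch.clamp(d_pos + alpha - d_neg, min=0.0).mean()
```

This is equivalent to torch.nn.TripletMarginLoss with margin=alpha; writing it out explicitly just mirrors the notation of Eq. (1).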
2.2 Local heterogeneous feature matching

We construct a graph with temporal relations for each recipe and question. A graph is defined as G = (V, A), where V is the vertex set and A is the adjacency matrix of the edges. For a heterogeneous node of type t, we employ a type-specific transformation matrix M_t to project the node into a shared semantic space:

\[ x_i' = M_t x_i \tag{2} \]

where x_i is the original feature and x_i' is the projected feature. The feature h_i^{(t)} of node i at layer t is updated with localized information as follows:

\[ h_i^{(0)} = x_i' \tag{3} \]

\[ a_{ij}^{(t)} = \mathrm{softmax}\big(\mathrm{att}(h_i^{(t-1)}, h_j^{(t-1)})\big) \tag{4} \]

\[ h_i^{(t)} = \sigma\Big(\sum_{j \in \mathcal{N}_i} a_{ij}^{(t)} h_j^{(t-1)}\Big) \tag{5} \]

where N_i is the neighbor set of node i and att(·) is a self-attention model that learns the weights of different neighbors. Given the node embeddings H_r of the recipe and H_q of the question, we obtain the initial correspondence matrix as:

\[ S^{(0)} = \mathrm{mask\_softmax}(H_q H_r^{\top}) \tag{6} \]

A minimal sketch of this stage is given after Sect. 2.3.

2.3 Neighborhood consensus matching

We iteratively refine the correspondence matrix S^{(l)} to guarantee neighborhood consensus. The node features H_q and H_r are passed along the soft correspondence S to obtain node features H_r' and H_q' in the other domain:

\[ H_r' = S^{\top} H_q \quad \text{and} \quad H_q' = S H_r \tag{7} \]

We then employ a node indicator matrix I to map corresponding neighborhoods into sub-domains and propagate messages with a GNN model:

\[ O_q = \mathrm{GNN}(I, X_q, A_q) \quad \text{and} \quad O_r = \mathrm{GNN}(S^{(l)\top} I, X_r, A_r) \tag{8} \]

where X is the node feature matrix. We measure the consensus of a node pair (v_i, v_j) by d_{ij} = o_i^q - o_j^r, and the correspondence matrix is updated as follows:

\[ S_{i,j}^{(l+1)} = \mathrm{softmax}\big(S_{i,j}^{(l)} + \mathrm{MLP}(d_{ij})\big)_{i,j} \tag{9} \]

Finally, we utilize a similarity model followed by an MLP layer to obtain the answer probability:

\[ P(a_k) = \mathrm{softmax}\big(\mathrm{similarity}(H_q, S H_r)\big) \tag{10} \]

The final objective function is defined as follows:

\[ \mathcal{L} = -\sum_{i \in V_q} \log\big(S_{i,\pi_{gt}(i)}^{(L)}\big) + \sum_{(a^+, a^-) \in \mathcal{A}} \big[-\log(p(a^+)) - \log(1 - p(a^-))\big] \tag{11} \]

where π_gt(i) denotes the ground-truth correspondences.
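As promised at the end of Sect. 2.2, here is a minimal PyTorch sketch of the local heterogeneous feature matching stage (Eqs. 2-6). The two node types, the single attention layer, the dense adjacency, and the added self-loops are simplifying assumptions, and the masking of invalid candidates in Eq. (6) is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalMatcher(nn.Module):
    """Local heterogeneous feature matching (Sect. 2.2), simplified.

    Projects typed nodes into a shared space (Eq. 2), aggregates
    neighbors with attention (Eqs. 3-5), and computes the initial
    soft correspondence matrix S^(0) (Eq. 6, masking omitted).
    """

    def __init__(self, d_text: int, d_img: int, d_model: int):
        super().__init__()
        self.proj = nn.ModuleDict({           # type-specific M_t (Eq. 2)
            "text": nn.Linear(d_text, d_model, bias=False),
            "img": nn.Linear(d_img, d_model, bias=False),
        })
        self.att = nn.Linear(2 * d_model, 1)  # att(h_i, h_j) in Eq. 4

    def encode(self, feats, types, adj):
        # Eqs. 2-3: h_i^(0) = M_t x_i, one projection per node type.
        h = torch.stack([self.proj[t](x) for t, x in zip(types, feats)])
        n = h.size(0)
        adj = adj + torch.eye(n)              # self-loops so no row is empty
        # Eq. 4: attention logits for every (i, j), masked to neighbors.
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        logits = self.att(pairs).squeeze(-1)
        a = F.softmax(logits.masked_fill(adj == 0, float("-inf")), dim=-1)
        # Eq. 5: attention-weighted neighborhood aggregation.
        return torch.relu(a @ h)

    def forward(self, q_feats, q_types, q_adj, r_feats, r_types, r_adj):
        Hq = self.encode(q_feats, q_types, q_adj)
        Hr = self.encode(r_feats, r_types, r_adj)
        S0 = F.softmax(Hq @ Hr.t(), dim=-1)   # Eq. 6, row-wise softmax
        return S0, Hq, Hr
```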
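And a matching sketch of the consensus refinement loop (Eqs. 8-9), in the spirit of DGMC [6]. The one-layer linear message-passing GNN, the dense identity indicator, and the fixed iteration count are assumptions; the feature transfer of Eq. (7) would simply be S.t() @ Hq and S @ Hr on the outputs of the previous sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConsensusRefiner(nn.Module):
    """Neighborhood consensus matching (Sect. 2.3), simplified.

    Propagates a node indicator matrix through both graphs (Eq. 8),
    measures the disagreement d_ij = o_i^q - o_j^r, and refines the
    correspondence matrix S (Eq. 9) for a fixed number of iterations.
    """

    def __init__(self, n_q: int, d_hidden: int, n_iters: int = 3):
        super().__init__()
        # One linear message-passing layer stands in for the GNN; an
        # assumption, since the paper does not pin down its architecture.
        self.gnn = nn.Linear(n_q, d_hidden, bias=False)
        self.mlp = nn.Sequential(nn.Linear(d_hidden, d_hidden),
                                 nn.ReLU(),
                                 nn.Linear(d_hidden, 1))
        self.n_iters = n_iters

    def propagate(self, ind, adj):
        # GNN(ind, adj): aggregate transformed indicators over neighbors.
        return torch.relu(adj @ self.gnn(ind))

    def forward(self, S, adj_q, adj_r):
        n_q = S.size(0)
        I = torch.eye(n_q)                      # dense node indicator matrix
        for _ in range(self.n_iters):
            o_q = self.propagate(I, adj_q)          # Eq. 8, question graph
            o_r = self.propagate(S.t() @ I, adj_r)  # Eq. 8, recipe graph
            # Eq. 9: d_ij for every candidate pair, scored by an MLP,
            # then a row-wise softmax renormalizes the updated S.
            d = o_q.unsqueeze(1) - o_r.unsqueeze(0)  # (n_q, n_r, d_hidden)
            S = F.softmax(S + self.mlp(d).squeeze(-1), dim=-1)
        return S
```

DGMC itself samples sparse random indicator functions for scalability; the dense identity above is the small-graph special case. The refined S then feeds Eq. (10) to score each candidate answer.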
3 Experiments and Evaluation

We conduct experiments on three tasks of the RecipeQA dataset (i.e., visual cloze, visual coherence, and visual ordering). The models are trained on each task separately, and we employ accuracy as the metric. Table 1 presents the quantitative results on the test set.

The proposed DHGM outperforms the other benchmark models on all three tasks, demonstrating that temporal structural information plays an essential role in procedural understanding. We also evaluate a DHGM variant without the knowledge-based recipe embedding (DHGM w/o KG) to verify the impact of FoodKG; this variant still outperforms the benchmark models. DHGM explicitly models the temporal structure in graph matching, which benefits the understanding of the multimodal temporal context. The performance drop without the knowledge-based embedding shows that external knowledge helps to enhance the recipe representations.

Table 1. Accuracy on the test sets of RecipeQA.

Model               Visual Cloze   Visual Coherence   Visual Ordering
HUMAN               77.60          81.60              64.00
Hasty Student       27.35          65.80              40.88
Impatient Reader    27.36          28.08              26.74
PRN (Single Task)   56.31          53.64              62.77
PRN (Multi Task)    46.45          40.58              62.67
DHGM w/o KG         48.16          45.37              61.54
DHGM                51.57          50.28              63.11

4 Conclusion

In this paper, we propose DHGM for RecipeQA to explicitly model temporal structure in graph matching. We believe that the graph-matching view of temporal structure is valuable for understanding multimodal procedural data similar to RecipeQA. Experimental results on the RecipeQA dataset demonstrate the strong performance of the proposed DHGM. In future work, we will extend our approach to other tasks with heterogeneous procedural data to verify its generalization.

5 Acknowledgments

The work described in this paper is partially supported by the Shenzhen Science and Technology Foundation (JCYJ20170816093943197) and the Key Research and Development Program of Hubei Province (No. 2020BAB026).

References

1. Yagcioglu, S., Erdem, A., Erdem, E., Ikizler-Cinbis, N.: RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes. In: Proc. of EMNLP 2018, pp. 1358-1368.
2. Amac, M. S., Yagcioglu, S., Erdem, A., Erdem, E.: Procedural Reasoning Networks for Understanding Multimodal Procedures. In: Proc. of CoNLL 2019, pp. 441-451.
3. Faghihi, H. R., Mirzaee, R., Paliwal, S., Kordjamshidi, P.: Latent Alignment of Procedural Concepts in Multimodal Recipes. In: Proc. of ALVR 2020, pp. 26-31.
4. Liu, A., Yuan, S., Zhang, C., Luo, C., Liao, Y., Bai, K., Xu, Z.: Multi-Level Multimodal Transformer Network for Multimodal Recipe Comprehension. In: Proc. of SIGIR 2020, pp. 1781-1784.
5. Li, D., Zaki, M. J.: Reciptor: An Effective Pretrained Model for Recipe Representation Learning. In: Proc. of KDD 2020, pp. 1719-1727.
6. Fey, M., Lenssen, J. E., Morris, C., Masci, J., Kriege, N. M.: Deep Graph Matching Consensus. In: Proc. of ICLR 2020.