<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Knowledge-based Deep Heterogeneous Graph Matching Model for Multimodal RecipeQA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yunjie Wu</string-name>
          <email>yunjie_wu@tju.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sai Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xiaowang Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhiyong Feng</string-name>
          <email>zyfeng@tju.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Liang Wan</string-name>
          <email>lwan@tju.edu.cn</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>College of Intelligence and Computing, Shenzhen Research Institute of Tianjin University, Tianjin University</institution>
          ,
          <addr-line>Tianjin</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>College of Intelligence and Computing, Tianjin University</institution>
          ,
          <addr-line>Tianjin</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>RecipeQA is a multimodal task that requires understanding a multimodal context. Since recipe instructions are procedural, temporal relations are essential to support procedural understanding. Due to the high divergence of representations, it is challenging to model the temporal relations of multimodal and dynamic recipes. In this paper, we propose a Knowledge-based Deep Heterogeneous Graph Matching Model (DHGM) to model the temporal structures of recipes. Firstly, we present a knowledge-based recipe encoder to reduce the divergence between recipe entities. Secondly, we design a two-stage heterogeneous graph matching method to guarantee neighborhood consensus. Experimental results show that our proposed approach obtains the best performance on the RecipeQA dataset.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal Reading Comprehension</kwd>
        <kwd>Heterogeneous Graph</kwd>
        <kwd>Matching</kwd>
        <kwd>Knowledge Graph</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>RecipeQA [1] is a recently proposed Multimodal Machine Comprehension
(M3C) task that comprises instructional recipes with multimodal context. RecipeQA
provides several multiple-choice tasks that require a joint understanding of both
visual and textual procedural knowledge.</p>
      <p>Since cooking instructions are procedural, understanding temporal relations is
essential for RecipeQA. However, due to the high divergence of representations
between different instructions, it is challenging to model these temporal relations.
Beyond the heterogeneity of multimodal data, even the same textual entity may be
expressed differently across steps. For example, the “ground beef” in Step 1 becomes a
“patty” in Step 3 after cooking, yet the two mentions still refer to the same entity.
Existing works either learn the dynamic states of entities for procedural
reasoning [2] or attempt to answer questions with attention-based alignment [3, 4]. It is
not easy for them to capture the semantics of the temporal structure, since the
structural semantics is lost when the recipes are encoded. Inspired by DGMC [6], we
model the temporal structure of recipes with deep graph matching.</p>
      <p>In this paper, we propose DHGM for RecipeQA to explicitly model temporal
structures in the graph matching process. Firstly, we present a
knowledge-based recipe embedding to reduce the divergence between recipe entities.
Secondly, we design a two-stage heterogeneous graph matching method to guarantee
neighborhood consensus for injecting temporal structure. Finally, we conduct
experiments on the RecipeQA dataset and achieve the best performance.</p>
    </sec>
    <sec id="sec-2">
      <title>Deep Heterogeneous Graph Matching Model</title>
      <p>The framework of DHGM is shown in Figure 1. We build heterogeneous graphs
with temporal relations for the recipe and the question, which are embedded based on
the knowledge graph. The correct answer is chosen with a two-stage method:
local heterogeneous feature matching and neighborhood consensus matching.</p>
    <sec id="sec-2">
      <title>Step 1: Put the</title>
      <p>minced meat in a
bowl and add salt,
pepper, parmigiano
and 1 egg. Mix well.
Food</p>
      <p>Ontologies
Recipes</p>
    </sec>
    <sec id="sec-3">
      <title>FoodKG</title>
    </sec>
    <sec id="sec-4">
      <title>Reciptor</title>
      <p>ResNet-50
   


p
r
o
j
p
r
o
j</p>
    </sec>
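      <p>To make the two-stage pipeline of Figure 1 concrete, the following is a minimal,
hypothetical sketch of how the stages could be composed at inference time, assuming
PyTorch and placeholder modules (encode, refine, score) that stand in for the
components described in the following sections; it is an illustration, not the
released implementation.</p>
      <preformat>
import torch

def dhgm_forward(recipe_graph, question_graph, candidates, modules, refine_steps=3):
    # Knowledge-based embedding of both heterogeneous graphs (placeholder encoder).
    H_r = modules.encode(recipe_graph)     # recipe node embeddings, shape [n_r, d]
    H_q = modules.encode(question_graph)   # question node embeddings, shape [n_q, d]

    # Stage 1: local heterogeneous feature matching (initial soft correspondences).
    S = torch.softmax(H_q @ H_r.t(), dim=-1)

    # Stage 2: neighborhood consensus matching (iterative refinement of S).
    for _ in range(refine_steps):
        S = modules.refine(S, H_q, H_r, question_graph, recipe_graph)

    # Choose the answer candidate that best matches the aligned recipe context.
    scores = torch.stack([modules.score(H_q, S @ H_r, a) for a in candidates])
    return torch.softmax(scores, dim=0)
      </preformat>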
    <sec id="sec-5">
      <title>Knowledge-based embedding Local heterogeneous feature matching</title>
      <p />
      <p>GNN1
MLP( - )
 ( )</p>
      <p>S
i
m
,(
·
·
)</p>
      <p>(0)
MLP


graph FoodKG. With an anchor recipe ra, we extract the related triples from
FoodKG through a similarity model. Given an entity ea, a positive partner ep
connected to ea and a negative partner en not connected to ea, the triplet loss
is used to optimize the learned embeddings in semantic space:</p>
      <p>L_emb(e_a, e_p, e_n) = max(0, d(e_a, e_p) - d(e_a, e_n) + α)    (1)</p>
      <p>where d(e_i, e_j) is used to measure the distance between two embeddings and α is a
margin parameter.</p>
      <p>Furthermore, we employ a pretrained ResNet-50 model to represent each image.</p>
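      <p>As a minimal illustration of Eq. (1), the triplet objective can be realized with
PyTorch's built-in triplet margin loss; the entity encoder below is a hypothetical
placeholder for the FoodKG-based embedding model, and the margin value is an
assumption.</p>
      <preformat>
import torch.nn as nn

# Triplet loss over knowledge-based entity embeddings, as in Eq. (1):
#   L_emb = max(0, d(e_a, e_p) - d(e_a, e_n) + margin)
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)  # margin=1.0 is an assumption

def embedding_loss(encoder, anchor_entity, positive_entity, negative_entity):
    # `encoder` is a hypothetical module mapping entities to vectors in the
    # shared semantic space learned from FoodKG triples.
    e_a = encoder(anchor_entity)
    e_p = encoder(positive_entity)
    e_n = encoder(negative_entity)
    return triplet_loss(e_a, e_p, e_n)
      </preformat>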
    </sec>
    <sec id="sec-4">
      <title>Local heterogeneous feature matching</title>
      <p>We construct a graph for each recipe and each question with the temporal relations.
A graph can be defined as G = (V, A), where V is the vertex set and A is the
adjacency matrix of the edges. For a heterogeneous node with type t, we employ a
type-specific transformation matrix M_t to project the node into the same semantic
space:</p>
      <p>x'_i = M_t x_i    (2)</p>
      <p>where x_i is the original feature and x'_i is the projected feature. The node features
are initialized as</p>
      <p>h_i^(0) = x'_i    (3)</p>
      <p>and the node feature h_i^(t) in layer t is updated with localized information from the
previous layer as follows:</p>
      <p>a_ij^(t) = softmax_j(att(h_i^(t-1), h_j^(t-1)))    (4)</p>
      <p>h_i^(t) = σ(Σ_{j∈N_i} a_ij^(t) h_j^(t-1))    (5)</p>
      <p>where N_i is the set of neighbors of node i, att(·) is the self-attention model for
learning the weights of different neighbors, and σ(·) denotes a nonlinear activation.
Given the node embeddings of the recipe H_r and of the question H_q, we obtain the
initial correspondence matrix as:</p>
      <p>S^(0) = masked_softmax(H_q H_r^T)    (6)</p>
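      <p>The following is a minimal sketch of Eqs. (2)-(6) in PyTorch. The type-specific
projections, the attention scorer, the choice of ReLU as the nonlinearity, and the mask
construction are simplified stand-ins for the components described above rather than
the exact implementation.</p>
      <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalHeteroMatcher(nn.Module):
    def __init__(self, in_dims, hidden_dim):
        super().__init__()
        # One projection matrix M_t per node type t, Eq. (2); `in_dims` maps
        # a type name to its raw feature dimension.
        self.type_proj = nn.ModuleDict(
            {t: nn.Linear(d, hidden_dim, bias=False) for t, d in in_dims.items()}
        )
        # Attention scorer att(.) over pairs of node features, Eq. (4).
        self.att = nn.Linear(2 * hidden_dim, 1)

    def project(self, feats, node_types):
        # h_i^(0) = x'_i = M_t x_i, Eqs. (2)-(3).
        return torch.stack(
            [self.type_proj[t](x) for x, t in zip(feats, node_types)]
        )

    def propagate(self, h, neighbors):
        # One layer of attention-weighted aggregation over each node's neighbors.
        out = []
        for i, nbrs in enumerate(neighbors):
            pairs = torch.cat([h[i].expand(len(nbrs), -1), h[nbrs]], dim=-1)
            a = F.softmax(self.att(pairs).squeeze(-1), dim=0)               # Eq. (4)
            out.append(torch.relu((a.unsqueeze(-1) * h[nbrs]).sum(dim=0)))  # Eq. (5)
        return torch.stack(out)

    def initial_correspondence(self, H_q, H_r, mask):
        # S^(0) = masked_softmax(H_q H_r^T), Eq. (6); invalid pairs get -inf logits.
        logits = H_q @ H_r.t()
        logits = logits.masked_fill(mask.logical_not(), float('-inf'))
        return F.softmax(logits, dim=-1)
      </preformat>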
    </sec>
    <sec id="sec-5">
      <title>Neighborhood consensus matching</title>
      <p>We iteratively refine the correspondence matrix S^(l) to guarantee the
neighborhood consensus. The node features H_q and H_r can be passed along the soft
correspondence S to obtain node features H'_r and H'_q in the other domain:</p>
      <p>H'_r = S^T H_q  and  H'_q = S H_r    (7)</p>
      <p>We then employ a node indicator function I to map corresponding neighborhoods
into the sub-domain and propagate messages with a GNN model:</p>
      <p>O_q = GNN(I, X_q, A_q)  and  O_r = GNN((S^(l))^T I, X_r, A_r)    (8)</p>
      <p>where X is the feature matrix of the nodes. We measure the consensus of a node pair
(v_i, v_j) by d_ij = o_i^q - o_j^r, and the correspondence matrix is updated as
follows:</p>
      <p>S^(l+1)_ij = softmax_j(S^(l)_ij + MLP(d_ij))    (9)</p>
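      <p>A minimal sketch of the refinement loop of Eqs. (7)-(9), assuming PyTorch, a
generic gnn(indicator, features, adjacency) message-passing function, and an identity
matrix as the node indicator I (following the consensus idea of DGMC [6]); these are
simplifying assumptions rather than the exact implementation.</p>
      <preformat>
import torch
import torch.nn.functional as F

def consensus_refine(S, X_q, X_r, A_q, A_r, gnn, mlp, num_steps=3):
    # Iteratively refine the soft correspondence matrix S, Eqs. (7)-(9).
    n_q = X_q.size(0)
    for _ in range(num_steps):
        # Node indicator I (here: one-hot coloring of question nodes), mapped
        # onto the recipe graph through the current correspondences, Eq. (8).
        I = torch.eye(n_q, device=S.device)
        O_q = gnn(I, X_q, A_q)
        O_r = gnn(S.t() @ I, X_r, A_r)

        # Pairwise consensus differences d_ij and the update of Eq. (9).
        d = O_q.unsqueeze(1) - O_r.unsqueeze(0)      # shape [n_q, n_r, dim]
        S = F.softmax(S + mlp(d).squeeze(-1), dim=-1)
    return S
      </preformat>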
      <p>Finally, we utilize a similarity model followed by an MLP layer to obtain the final
answer:</p>
      <p>P(a_k) = softmax(similarity(H_q, S H_r))    (10)</p>
      <p>The final objective function is defined as follows:</p>
      <p>L = - Σ_{i∈V} log(S^(L)_{i,gt(i)}) - Σ_{(a+, a-)∈A} [log(p(a+)) + log(1 - p(a-))]    (11)</p>
      <p>where gt(i) is the ground-truth correspondence of node i.</p>
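      <p>For completeness, a hedged sketch of the answer probability of Eq. (10) and the
combined objective of Eq. (11); the similarity module, the candidate answer set, and the
tensor shapes are illustrative assumptions.</p>
      <preformat>
import torch
import torch.nn.functional as F

def answer_distribution(H_q, H_r, S, candidates, similarity, mlp):
    # P(a_k) = softmax(similarity(H_q, S H_r)), Eq. (10): one score per candidate.
    matched = S @ H_r
    scores = torch.stack([mlp(similarity(H_q, matched, a)) for a in candidates])
    return F.softmax(scores.squeeze(-1), dim=0)

def dhgm_loss(S_final, gt, p_pos, p_neg):
    # Eq. (11): a correspondence term over ground-truth node matches plus a
    # ranking term over positive / negative answer probabilities.
    rows = torch.arange(S_final.size(0))
    corr_term = -torch.log(S_final[rows, gt]).sum()
    answer_term = -(torch.log(p_pos) + torch.log(1.0 - p_neg)).sum()
    return corr_term + answer_term
      </preformat>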
    </sec>
    <sec id="sec-6">
      <title>Experiments and Evaluation</title>
      <p>We conduct the experiments on three tasks of the RecipeQA dataset (i.e., visual
cloze, visual coherence, and visual ordering). The models are trained on each task
separately. We employ accuracy as the metric.</p>
      <p>Table 1 presents the quantitative results on the test set. The proposed DHGM
outperforms the other benchmark models on all three tasks, demonstrating that
temporal structural information plays an essential role in procedural understanding.
We also evaluate a DHGM variant without the knowledge-based recipe embedding
to verify the impact of FoodKG; it still outperforms the benchmark models.
DHGM explicitly models the temporal structure in graph matching, which is
beneficial for understanding the multimodal temporal context. We observe that
the performance drops without the knowledge-based embedding, which shows
that knowledge is helpful for enhancing the recipe representations.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>In this paper, we propose DHGM for RecipeQA to explicitly model the temporal
structure in graph matching. We believe that the graph-matching-based idea of
modeling the temporal structure is meaningful for understanding multimodal
procedural data similar to RecipeQA. Experimental results on the RecipeQA
dataset demonstrate the excellent performance of the proposed DHGM. In
future work, we will extend our approach to other tasks with heterogeneous
procedural data to verify its generalization.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The work described in this paper is partially supported by the Shenzhen Science
and Technology Foundation (JCYJ20170816093943197) and the Key Research and
Development Program of Hubei Province (No. 2020BAB026).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Yagcioglu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erdem</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erdem</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ikizler-Cinbis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes</article-title>
          .
          <source>In: Proc. of EMNLP</source>
          <year>2018</year>
          , pp.
          <fpage>1358</fpage>
          -
          <lpage>1368</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Amac</surname>
            ,
            <given-names>M. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yagcioglu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erdem</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erdem</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Procedural Reasoning Networks for Understanding Multimodal Procedures</article-title>
          .
          <source>In: Proc. of CoNLL</source>
          <year>2019</year>
          , pp.
          <fpage>441</fpage>
          -
          <lpage>451</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Faghihi</surname>
            ,
            <given-names>H. R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mirzaee</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paliwal</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kordjamshidi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Latent Alignment of Procedural Concepts in Multimodal Recipes</article-title>
          .
          <source>In: Proc. of ALVR</source>
          <year>2020</year>
          , pp.
          <fpage>26</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yuan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bai</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>Multi-Level Multimodal Transformer Network for Multimodal Recipe Comprehension</article-title>
          .
          <source>In: Proc. of SIGIR</source>
          <year>2020</year>
          , pp.
          <fpage>1781</fpage>
          -
          <lpage>1784</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaki</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          :
          <article-title>Reciptor: An effective pretrained model for recipe representation learning</article-title>
          .
          <source>In: Proc. of KDD</source>
          <year>2020</year>
          , pp.
          <fpage>1719</fpage>
          -
          <lpage>1727</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lenssen</surname>
            ,
            <given-names>J. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morris</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Masci</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kriege</surname>
            ,
            <given-names>N. M.</given-names>
          </string-name>
          :
          <article-title>Deep graph matching consensus</article-title>
          .
          <source>In: Proc. of ICLR</source>
          <year>2020</year>
          ,
          <article-title>poster</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>