A Knowledge-based Deep Heterogeneous Graph Matching Model for Multimodal RecipeQA

Yunjie Wu1, Sai Zhang1, Xiaowang Zhang1, Zhiyong Feng1,2*, and Liang Wan1

1 College of Intelligence and Computing, Tianjin University, Tianjin, China
2 College of Intelligence and Computing, Shenzhen Research Institute of Tianjin University, Tianjin University, Tianjin, China
{yunjie_wu, zhang_sai, xiaowangzhang, zyfeng, lwan}@tju.edu.cn
* Corresponding Author

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. RecipeQA is a multimodal task that requires understanding a multimodal context. Since recipe instructions are procedural, temporal relations are essential to support procedural understanding. Due to the high divergence of representations, it is challenging to model the temporal relations of multimodal and dynamic recipes. In this paper, we propose a Knowledge-based Deep Heterogeneous Graph Matching Model (DHGM) to model the temporal structure of recipes. Firstly, we present a knowledge-based recipe encoder to reduce the divergence between recipe entities. Secondly, we design a two-stage heterogeneous graph matching method to guarantee neighborhood consensus. Experimental results show that our approach obtains the best performance on the RecipeQA dataset.

Keywords: Multimodal Reading Comprehension · Heterogeneous Graph Matching · Knowledge Graph

1 Introduction

RecipeQA [1] is a recently proposed Multimodal Machine Comprehension (M3C) task that comprises instructional recipes with multimodal context. RecipeQA provides several multiple-choice tasks that require a joint understanding of both visual and textual procedural knowledge.

Since cooking instructions are procedural, understanding temporal relations is essential for RecipeQA. However, due to the high divergence of representations between different instructions, modeling temporal relations is challenging. In addition to the heterogeneity of multimodal data, even the same textual entity can change across steps. For example, the "ground beef" of Step 1 may become a "patty" in Step 3 after cooking, yet both refer to the same entity. Existing works either learn the dynamic states of entities for procedural reasoning [2] or answer questions with attention-based alignment [3, 4]. Neither readily captures the semantics of the temporal structure, since this structural information is lost while encoding the recipes. Inspired by DGMC [6], we model the temporal structure of recipes through deep graph matching.

In this paper, we propose DHGM for RecipeQA to explicitly model temporal structure during graph matching. Firstly, we present a knowledge-based recipe embedding to reduce the divergence between recipe entities. Secondly, we design a two-stage heterogeneous graph matching method that guarantees neighborhood consensus for injecting temporal structure. Finally, we conduct experiments on the RecipeQA dataset and achieve the best performance.

2 Deep Heterogeneous Graph Matching Model

The framework of DHGM is shown in Figure 1. We build heterogeneous graphs with temporal relations for the recipe and the question, whose nodes are embedded with the help of a knowledge graph. The correct answer is then chosen with a two-stage method: local heterogeneous feature matching and neighborhood consensus matching.

Fig. 1. The overall framework of DHGM for RecipeQA

2.1 Knowledge-based recipe embedding

To reduce the divergence between entity representations in recipes, we employ the pretrained model Reciptor [5] for recipe embedding with an external knowledge graph, FoodKG. For an anchor recipe r_a, we extract the related triples from FoodKG through a similarity model. Given an entity e_a, a positive partner e_p connected to e_a, and a negative partner e_n not connected to e_a, a triplet loss is used to optimize the learned embeddings in semantic space:

\[ \mathcal{L}_{emb}(e_a, e_p, e_n) = \max\big(0,\ d(e_a, e_p) + \alpha - d(e_a, e_n)\big) \tag{1} \]

where d(e_i, e_j) measures the distance between two embeddings and α is a margin parameter. Furthermore, we employ a pretrained ResNet-50 model to represent each image.
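As an illustration of Eq. (1), below is a minimal PyTorch sketch of the triplet objective. The Euclidean distance, the margin value, and the batch layout are assumptions made for the sketch; in our setting the inputs would be entity embeddings produced by Reciptor over FoodKG triples.

```python
import torch
import torch.nn.functional as F

def triplet_loss(e_a: torch.Tensor, e_p: torch.Tensor,
                 e_n: torch.Tensor, alpha: float = 0.2) -> torch.Tensor:
    """Triplet loss of Eq. (1) over a batch of (anchor, positive,
    negative) entity embeddings of shape (batch, dim).

    alpha is the margin parameter; 0.2 is an assumed default, and
    Euclidean distance stands in for the unspecified d(., .).
    """
    d_pos = F.pairwise_distance(e_a, e_p)  # d(e_a, e_p)
    d_neg = F.pairwise_distance(e_a, e_n)  # d(e_a, e_n)
    # max(0, d_pos + alpha - d_neg), averaged over the batch
    return torch.clamp(d_pos + alpha - d_neg, min=0.0).mean()
```

This is equivalent to torch.nn.TripletMarginLoss with margin=alpha; writing it out explicitly just mirrors the notation of Eq. (1).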
2.2 Local heterogeneous feature matching

We construct a graph with temporal relations for each recipe and question. A graph is defined as G = (V, A), where V is the vertex set and A is the adjacency matrix of the edges. For a heterogeneous node of type t, we employ a type-specific transformation matrix M_t to project the node into a shared semantic space:

\[ x_i' = M_t x_i \tag{2} \]

where x_i is the original feature and x_i' is the projected feature. The feature h_i^{(t)} of node i at layer t is updated with localized information as follows:

\[ h_i^{(0)} = x_i' \tag{3} \]

\[ a_{ij}^{(t)} = \mathrm{softmax}\big(\mathrm{att}(h_i^{(t-1)}, h_j^{(t-1)})\big) \tag{4} \]

\[ h_i^{(t)} = \sigma\Big(\sum_{j \in \mathcal{N}_i} a_{ij}^{(t)} h_j^{(t-1)}\Big) \tag{5} \]

where N_i is the neighbor set of node i and att(·) is a self-attention model that learns the weights of different neighbors. Given the node embeddings H_r of the recipe and H_q of the question, we obtain the initial correspondence matrix as:

\[ S^{(0)} = \mathrm{mask\_softmax}(H_q H_r^{\top}) \tag{6} \]

A minimal sketch of this stage is given after Sect. 2.3.

2.3 Neighborhood consensus matching

We iteratively refine the correspondence matrix S^{(l)} to guarantee neighborhood consensus. The node features H_q and H_r are passed along the soft correspondence S to obtain node features H_r' and H_q' in the other domain:

\[ H_r' = S^{\top} H_q \quad \text{and} \quad H_q' = S H_r \tag{7} \]

We then employ a node indicator matrix I to map corresponding neighborhoods into sub-domains and propagate messages with a GNN model:

\[ O_q = \mathrm{GNN}(I, X_q, A_q) \quad \text{and} \quad O_r = \mathrm{GNN}(S^{(l)\top} I, X_r, A_r) \tag{8} \]

where X is the node feature matrix. We measure the consensus of a node pair (v_i, v_j) by d_{ij} = o_i^q - o_j^r, and the correspondence matrix is updated as follows:

\[ S_{i,j}^{(l+1)} = \mathrm{softmax}\big(S_{i,j}^{(l)} + \mathrm{MLP}(d_{ij})\big)_{i,j} \tag{9} \]

Finally, we utilize a similarity model followed by an MLP layer to obtain the answer probability:

\[ P(a_k) = \mathrm{softmax}\big(\mathrm{similarity}(H_q, S H_r)\big) \tag{10} \]

The final objective function is defined as follows:

\[ \mathcal{L} = -\sum_{i \in V_q} \log\big(S_{i,\pi_{gt}(i)}^{(L)}\big) + \sum_{(a^+, a^-) \in \mathcal{A}} \big[-\log(p(a^+)) - \log(1 - p(a^-))\big] \tag{11} \]

where π_gt(i) denotes the ground-truth correspondences.
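As promised at the end of Sect. 2.2, here is a minimal PyTorch sketch of the local heterogeneous feature matching stage (Eqs. 2-6). The two node types, the single attention layer, the dense adjacency, and the added self-loops are simplifying assumptions, and the masking of invalid candidates in Eq. (6) is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalMatcher(nn.Module):
    """Local heterogeneous feature matching (Sect. 2.2), simplified.

    Projects typed nodes into a shared space (Eq. 2), aggregates
    neighbors with attention (Eqs. 3-5), and computes the initial
    soft correspondence matrix S^(0) (Eq. 6, masking omitted).
    """

    def __init__(self, d_text: int, d_img: int, d_model: int):
        super().__init__()
        self.proj = nn.ModuleDict({           # type-specific M_t (Eq. 2)
            "text": nn.Linear(d_text, d_model, bias=False),
            "img": nn.Linear(d_img, d_model, bias=False),
        })
        self.att = nn.Linear(2 * d_model, 1)  # att(h_i, h_j) in Eq. 4

    def encode(self, feats, types, adj):
        # Eqs. 2-3: h_i^(0) = M_t x_i, one projection per node type.
        h = torch.stack([self.proj[t](x) for t, x in zip(types, feats)])
        n = h.size(0)
        adj = adj + torch.eye(n)              # self-loops so no row is empty
        # Eq. 4: attention logits for every (i, j), masked to neighbors.
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        logits = self.att(pairs).squeeze(-1)
        a = F.softmax(logits.masked_fill(adj == 0, float("-inf")), dim=-1)
        # Eq. 5: attention-weighted neighborhood aggregation.
        return torch.relu(a @ h)

    def forward(self, q_feats, q_types, q_adj, r_feats, r_types, r_adj):
        Hq = self.encode(q_feats, q_types, q_adj)
        Hr = self.encode(r_feats, r_types, r_adj)
        S0 = F.softmax(Hq @ Hr.t(), dim=-1)   # Eq. 6, row-wise softmax
        return S0, Hq, Hr
```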
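And a matching sketch of the consensus refinement loop (Eqs. 8-9), in the spirit of DGMC [6]. The one-layer linear message-passing GNN, the dense identity indicator, and the fixed iteration count are assumptions; the feature transfer of Eq. (7) would simply be S.t() @ Hq and S @ Hr on the outputs of the previous sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConsensusRefiner(nn.Module):
    """Neighborhood consensus matching (Sect. 2.3), simplified.

    Propagates a node indicator matrix through both graphs (Eq. 8),
    measures the disagreement d_ij = o_i^q - o_j^r, and refines the
    correspondence matrix S (Eq. 9) for a fixed number of iterations.
    """

    def __init__(self, n_q: int, d_hidden: int, n_iters: int = 3):
        super().__init__()
        # One linear message-passing layer stands in for the GNN; an
        # assumption, since the paper does not pin down its architecture.
        self.gnn = nn.Linear(n_q, d_hidden, bias=False)
        self.mlp = nn.Sequential(nn.Linear(d_hidden, d_hidden),
                                 nn.ReLU(),
                                 nn.Linear(d_hidden, 1))
        self.n_iters = n_iters

    def propagate(self, ind, adj):
        # GNN(ind, adj): aggregate transformed indicators over neighbors.
        return torch.relu(adj @ self.gnn(ind))

    def forward(self, S, adj_q, adj_r):
        n_q = S.size(0)
        I = torch.eye(n_q)                      # dense node indicator matrix
        for _ in range(self.n_iters):
            o_q = self.propagate(I, adj_q)          # Eq. 8, question graph
            o_r = self.propagate(S.t() @ I, adj_r)  # Eq. 8, recipe graph
            # Eq. 9: d_ij for every candidate pair, scored by an MLP,
            # then a row-wise softmax renormalizes the updated S.
            d = o_q.unsqueeze(1) - o_r.unsqueeze(0)  # (n_q, n_r, d_hidden)
            S = F.softmax(S + self.mlp(d).squeeze(-1), dim=-1)
        return S
```

DGMC itself samples sparse random indicator functions for scalability; the dense identity above is the small-graph special case. The refined S then feeds Eq. (10) to score each candidate answer.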
3 Experiments and Evaluation

We conduct experiments on three tasks of the RecipeQA dataset (i.e., visual cloze, visual coherence, and visual ordering). The models are trained on each task separately, and we employ accuracy as the metric. Table 1 presents the quantitative results on the test set.

The proposed DHGM outperforms the other benchmark models on all three tasks, demonstrating that temporal structural information plays an essential role in procedural understanding. We also evaluate a DHGM variant without the knowledge-based recipe embedding (DHGM w/o KG) to verify the impact of FoodKG; this variant still outperforms the benchmark models. DHGM explicitly models the temporal structure in graph matching, which benefits the understanding of the multimodal temporal context. The performance drop without the knowledge-based embedding shows that external knowledge helps to enhance the recipe representations.

Table 1. Accuracy on the test sets of RecipeQA.

Model               Visual Cloze   Visual Coherence   Visual Ordering
HUMAN               77.60          81.60              64.00
Hasty Student       27.35          65.80              40.88
Impatient Reader    27.36          28.08              26.74
PRN (Single Task)   56.31          53.64              62.77
PRN (Multi Task)    46.45          40.58              62.67
DHGM w/o KG         48.16          45.37              61.54
DHGM                51.57          50.28              63.11

4 Conclusion

In this paper, we propose DHGM for RecipeQA to explicitly model temporal structure in graph matching. We believe that the graph-matching view of temporal structure is valuable for understanding multimodal procedural data similar to RecipeQA. Experimental results on the RecipeQA dataset demonstrate the strong performance of the proposed DHGM. In future work, we will extend our approach to other tasks with heterogeneous procedural data to verify its generalization.

5 Acknowledgments

The work described in this paper is partially supported by the Shenzhen Science and Technology Foundation (JCYJ20170816093943197) and the Key Research and Development Program of Hubei Province (No. 2020BAB026).

References

1. Yagcioglu, S., Erdem, A., Erdem, E., Ikizler-Cinbis, N.: RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes. In: Proc. of EMNLP 2018, pp. 1358-1368.
2. Amac, M. S., Yagcioglu, S., Erdem, A., Erdem, E.: Procedural Reasoning Networks for Understanding Multimodal Procedures. In: Proc. of CoNLL 2019, pp. 441-451.
3. Faghihi, H. R., Mirzaee, R., Paliwal, S., Kordjamshidi, P.: Latent Alignment of Procedural Concepts in Multimodal Recipes. In: Proc. of ALVR 2020, pp. 26-31.
4. Liu, A., Yuan, S., Zhang, C., Luo, C., Liao, Y., Bai, K., Xu, Z.: Multi-Level Multimodal Transformer Network for Multimodal Recipe Comprehension. In: Proc. of SIGIR 2020, pp. 1781-1784.
5. Li, D., Zaki, M. J.: Reciptor: An Effective Pretrained Model for Recipe Representation Learning. In: Proc. of KDD 2020, pp. 1719-1727.
6. Fey, M., Lenssen, J. E., Morris, C., Masci, J., Kriege, N. M.: Deep Graph Matching Consensus. In: Proc. of ICLR 2020.