=Paper=
{{Paper
|id=Vol-3121/short1
|storemode=property
|title=Contrastive Visual and Language Translational Embeddings for Visual Relationship Detection
|pdfUrl=https://ceur-ws.org/Vol-3121/short1.pdf
|volume=Vol-3121
|authors=Thanh Tran,Paulo E. Santos,David Powers
|dblpUrl=https://dblp.org/rec/conf/aaaiss/TranSP22
}}
==Contrastive Visual and Language Translational Embeddings for Visual Relationship Detection==
Thanh Tran¹, Paulo E. Santos¹ and David Powers¹
¹College of Science and Engineering, Flinders University of South Australia, 1284 South Rd, Clovelly Park SA 5042, Australia
Abstract
Visual relationship detection aims to understand real-world interactions between object pairs by detecting
visual relation triples written in the form of (subject, predicate, object). Previous work has explored the
use of contrastive learning to generate joint visual and language embeddings that aid the detection
of both seen and unseen visual relation triples. However, these contrastive approaches often learned
the mapping functions implicitly and did not fully consider the underlying structure of visual relation
triples, limiting the models’ use cases and their ability to generalize to unseen compositions. This
ongoing work aims to construct joint visual and language embedding models that can capture such
hierarchical structure between objects and predicates by explicitly imposing structural loss constraints.
In this short paper, we propose VLTransE, a novel embedding model that applies translational loss in
conjunction with the visual-language contrastive loss to learn transferable embedding spaces for subjects,
objects, and predicates. At test time, the model ranks potential visual relationships by aggregating the
visual-language consistency score and the translational score. The preliminary results show that the
contrastive model trained with the translational loss constraint can capture hierarchical information
which aids the prediction of not only visual predicates but also masked-out objects, while achieving
comparable predicate prediction results to the model trained without the translational loss.
Keywords
Visual Relationship Detection, Scene Graph, Translational Embedding, Zero-shot Learning, Contrastive
Learning
1. Introduction
Understanding the visual world is essential for many modern machine learning tasks including
visual question answering [1], image retrieval [2], and image captioning [3]. Visual relationship
detection (VRD) [4] aims to facilitate such understanding by bridging the gap between low-level
visual information and high-level symbolic visual relation triples, written in the form of (subject,
predicate, object). Given the successful performance of deep neural networks in low-level
perception tasks such as object classification and object detection, multiple works [4, 5, 6, 7, 8]
for VRD have built neural classification models that directly predict the visual predicate from the
input image and text, achieving state-of-the-art results on the VRD benchmarks [4, 9]. However,
these methods have two main limitations. First, the models learn directly from the dataset
distribution, making them susceptible to dataset biases and limiting their ability to generalize
to rare compositions of visual relation triples at test time. For example, a classification model
may detect (person, riding, horse) while struggling to detect (person, riding, cow) or (person,
riding, dog). Second, these models are optimized for a narrowly defined task in the given
benchmarks, making it difficult to extend the model’s use case beyond the given task and domain.

In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI
2022 Spring Symposium on Machine Learning and Knowledge Engineering for Hybrid Intelligence (AAAI-MAKE 2022),
Stanford University, Palo Alto, California, USA, March 21–23, 2022.
thanh.tran1725@gmail.com (T. Tran); paulo.santos@flinders.edu.au (P. E. Santos); david.powers@flinders.edu.au (D. Powers)
https://www.flinders.edu.au/people/paulo.santos (P. E. Santos); https://www.flinders.edu.au/people/david.powers (D. Powers)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073.
To address these two issues, this research approaches the problem from a different angle.
Instead of tackling the VRD problem as an end-to-end classification task, this work assumes
the graphical structure of these visual relation triples, interprets this structure as a knowledge
graph [10], and formulates the VRD problem as a knowledge graph completion problem [11].
However, unlike traditional knowledge graphs that are based on factual knowledge bases, the
knowledge graphs here are represented by a set of visual entities and their interactions, where
the nodes are subjects and objects grounded in the image through bounding boxes, and the
edges are the relation predicates that exist between pairs of subjects and objects [12]. In the
current literature, such formulation of a knowledge graph is also called a scene graph [12], and
the task of knowledge graph completion is called scene graph completion [13].
Central to the scene graph completion is the idea of scene graph embedding (SGE), which aims
to build embedding models that transform the entities and relations into low-dimensional vector
spaces while preserving the structure of the original knowledge graph [10]. Such an embedding
approach is beneficial to VRD in two ways. First, because these embedding spaces preserve the
graphical structure, unseen relations can be inferred by aggregating the relevant neighbors’
features [14]. Second, like any other knowledge graph, a scene graph can be augmented with
other domain-specific knowledge graphs [15, 16] or common-sense knowledge graphs [17, 18]
during training, allowing the model to make out-of-domain inferences at test time.
In this work, we aim to perform scene graph embedding using the contrastive learning
approach [19, 20], which learns representations by pulling together the target vector (or anchor)
and a matching (positive) vector, while pushing apart the anchor from non-matching (negative)
vectors. We believe that such a contrastive approach can help us construct a better scene graph
representation that can be transferred to other downstream tasks while giving us more control
over the output embedding spaces. Thus, this short paper proposes VLTransE (Figure 1), a
visual-language contrastive scene graph embedding model that preserves the local structure
of the graph through the use of translational loss constraint [21]. At test time, the model is
evaluated on the predicate detection task and tail entity (object) prediction task (Figure 2). The
preliminary results in Tables 1 and 2 show that the method performs reasonably well on both
tasks, while Table 3 shows that the model trained with translational loss can achieve comparable
predicate prediction results to the model trained without translational loss on unseen triples.
2. Related Work
This section presents a review of the work related to compositional grounding of visual con-
cepts on language [9], with an emphasis on visual relationship detection and scene graph
representational learning through the use of contrastive learning and translational embeddings.
Visual Relationship Detection aims to capture real-world interactions between subject
and object pairs (e.g., (person, riding, horse)), allowing the model to detect not only objects but
also relations between objects. However, due to the large number of potential real-world
interactions, existing visual relationship datasets including VRD [4] and Visual Genome [9] are
often sparse and unbalanced, where common relationships occur more frequently than rarer but
plausible ones. While Lu et al. [4] and Yu et al. [5] have shown that leveraging language biases
can help the models learn co-occurrences statistical priors, such approaches often limit the
model’s generalization ability and prevent the model from dealing with the variability of visual
appearances. Thus, other works [22, 23] have interpreted VRD as a zero-shot detection task,
and use contrastive learning to construct joint visual and language embedding spaces that can
be transferred to detect unseen visual relation triples. Here, [22] emphasizes the importance of
analogy transfer, which is a downstream neural network module that leverages compositional
embedding parts to compose novel visual relation triples. While our method also constructs
distinct subject, object, and predicate embedding spaces using contrastive learning, the entire
pipeline is trained end-to-end and the method focuses on the use of translation scoring functions
(Figure 2) to rank the predicates instead of having a separate downstream network.
Contrastive Learning focuses on minimizing the distance between the target embedding
(anchor) vector and the matching (positive) embedding vector, while maximizing the distance
between the anchor vector and the non-matching (negative) embedding vectors. Recent work on
contrastive learning has shown that discriminative or contrastive approaches can (i) produce
transferable embeddings for visual objects through the use of data augmentation [20], and (ii)
learn joint visual and language embedding space that can be used to perform zero-shot detection
[24]. Given the sparseness and long-tailed property of scene graph datasets, application (i) of
the contrastive approach can help the model learn better visual appearance embeddings of (subject,
object) pairs under limited resource settings. Moreover, in application (ii), contrastive learning
gives a clearer separation of the visual embeddings and language embeddings compared to the
traditional black-box neural fusion approaches [25, 26], giving us more control over both the
symbolic triples input and the final output embedding spaces.
Scene Graph Embedding and Translational Embedding. The above task of constructing
joint transferable visual and language embedding spaces for (𝑠𝑢𝑏𝑗𝑒𝑐𝑡, 𝑝𝑟𝑒𝑑𝑖𝑐𝑎𝑡𝑒, 𝑜𝑏𝑗𝑒𝑐𝑡) can
also be interpreted as a scene graph embedding task. Here, a scene graph is a graph-based
formulation that explicitly models objects, attributes of objects, and relationships between
objects [12]. Because a scene graph can be interpreted as a knowledge graph, common knowledge
graph embedding techniques [27] can also be used for scene graph embeddings. Thus, inspired
by translational embeddings [21], H. Zhang et al. [6] build a model called VTransE that predicts
the visual predicate by assuming the translational property of (subject, predicate, object)
triples, where EMB(subject) + EMB(predicate) ≈ EMB(object). Similar to VTransE, the model
presented in this paper also enforces the translational loss constraints to
preserve the local graph structure. However, instead of training an end-to-end softmax predictor,
the method here uses contrastive learning with negative sampling to learn the three separate
visual-language embedding spaces.
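The translational property EMB(subject) + EMB(predicate) ≈ EMB(object) can be illustrated with a minimal sketch over toy vectors (hypothetical values, not embeddings learned by VTransE or by our model):

```python
import numpy as np

def translation_score(emb_s, emb_p, emb_o):
    """Cosine similarity between EMB(subject) + EMB(predicate) and EMB(object).
    A score near 1 means the triple satisfies EMB(s) + EMB(p) ≈ EMB(o)."""
    pred = emb_s + emb_p
    return float(np.dot(pred, emb_o) / (np.linalg.norm(pred) * np.linalg.norm(emb_o)))

# Toy 3-d embeddings chosen so that person + riding ≈ horse.
person = np.array([1.0, 0.0, 0.0])
riding = np.array([0.0, 1.0, 0.0])
horse  = np.array([1.0, 1.0, 0.0])
tree   = np.array([0.0, 0.0, 1.0])

# The valid triple scores higher than the corrupted one.
assert translation_score(person, riding, horse) > translation_score(person, riding, tree)
```

A learned embedding model replaces these toy vectors with trained subject, predicate, and object embeddings, but the scoring idea is the same.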
Figure 1: Overview of the proposed model. Red, black, and blue colors represent subject, predicate, and
object respectively. The blue FC rectangles are independent fully connected layers with ReLU activation
functions. The final output of the model consists of six embeddings: three visual embeddings (v_s, v_p,
and v_o) and three language embeddings (w_s, w_p, and w_o). These embeddings are then trained on two
sets of losses: (L^vl_subj, L^vl_pred, L^vl_obj) are the visual-language consistency losses, while L^Tr_v
and L^Tr_w are the translational losses for the visual and language embedding triples.

In the preliminary research reported in this paper, we aim to perform zero-shot visual
relationship detection through the use of scene graph embedding, where we construct three
separate visual and language embedding spaces for subject, predicate, and object using con-
trastive loss. While there are multiple contrastive loss functions [28, 29, 30], the visual-language
contrastive loss in this work uses triplet margin loss, where one anchor vector of one modality
is contrasted against one positive and one negative vector of the other modality. To preserve the
local structure of the scene graph during embedding, the method also enforces the translational
loss constraint separately in the language triplet embeddings and the visual triplet embeddings.
While translational loss only preserves the first-order proximity or local structure of the scene
graph, we believe that this method can be extended to other scene graph embedding techniques
in the future.
3. The VLTransE Architecture
This section describes the proposed architecture and outlines the details of the current
implementation. The general architecture consists of three modules: (1) the Visual and Spatial Module
that generates visual embeddings based on the extracted features from the images and bounding
boxes’ coordinates (Figure 1, left), (2) the Language Module that learns contextualized token
embeddings which changes according to the context of the input triples (Figure 1, right), (3) the
Loss Functions that enforce translational losses to preserve the first-order graph structure
and visual-language contrastive losses to ensure the consistency between the (visual, language)
embeddings pairs (Figure 1, center).
Figure 2: Test-Time Scoring Functions. Red, black, and blue colors represent subject, predicate, and
object respectively. d(x, y) computes the cosine distance between x and y, and the distances are ranked
in ascending order. In (a), n is the number of predicate classes in the dataset. In (b), e is the number
of object classes in the dataset.
3.1. Visual and Spatial Module
Visual Module. One of the main sub-tasks of visual relationship detection is to detect subjects
and objects from a given image, and extract their visual features for downstream embeddings.
Given the success of CNN-based architecture [31] in learning image representations from large
scale pre-training, the visual feature extraction module in this work uses Faster R-CNN [32]
pre-trained on the COCO 2017 dataset. In the current implementation, Faster R-CNN consists of
a shared ResNet-101 backbone network [33], a region proposal network (RPN), and a region of
interest (ROI) detector. Thus, given the ground truth subject, object, and union bounding boxes,
the visual features are extracted from the shared backbone through a region of interest pooling
operation, yielding z_s, z_o, and z_union respectively.
Spatial Module. The model also extracts spatial information from the subject, object, and
union bounding boxes to incorporate spatial and position priors. Similar to J. Zhang et al. [7],
given the three boxes 𝑏𝑠 , 𝑏𝑢𝑛𝑖𝑜𝑛 , 𝑏𝑜 in [𝑥, 𝑦, 𝑤, ℎ] format, where (𝑥, 𝑦) is the starting coordinate
and (𝑤, ℎ) is the width and height of the box, the spatial encoder generates a 22-dimensional
feature vector:
Δ(b₁, b₂) = ⟨ (x₁ − x₂)/w₂, (y₁ − y₂)/h₂, log(w₁/w₂), log(h₁/h₂) ⟩   (1)

c(b) = ⟨ x/w_img, y/h_img, (x + w)/w_img, (y + h)/h_img, (w·h)/(w_img·h_img) ⟩   (2)

⟨ Δ(b_s, b_o), Δ(b_s, b_union), Δ(b_union, b_o), c(b_s), c(b_o) ⟩   (3)
The spatial feature vector in Equation (3) then goes through two fully connected layers to get
the final 64-dimensional spatial embedding, s_p.
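Equations (1)–(3) can be sketched as follows; the construction of the union box as the tightest box enclosing the subject and object boxes is our assumption for illustration, since the module itself receives the union box as input:

```python
import numpy as np

def delta(b1, b2):
    # Equation (1): 4-d relative offset between two [x, y, w, h] boxes.
    x1, y1, w1, h1 = b1
    x2, y2, w2, h2 = b2
    return [(x1 - x2) / w2, (y1 - y2) / h2, np.log(w1 / w2), np.log(h1 / h2)]

def c(b, w_img, h_img):
    # Equation (2): 5-d normalized position and area of a box.
    x, y, w, h = b
    return [x / w_img, y / h_img, (x + w) / w_img, (y + h) / h_img,
            (w * h) / (w_img * h_img)]

def spatial_features(b_s, b_o, w_img, h_img):
    # Assumed union box: the tightest box covering both input boxes.
    x_u = min(b_s[0], b_o[0])
    y_u = min(b_s[1], b_o[1])
    b_u = [x_u, y_u,
           max(b_s[0] + b_s[2], b_o[0] + b_o[2]) - x_u,
           max(b_s[1] + b_s[3], b_o[1] + b_o[3]) - y_u]
    # Equation (3): concatenation of three deltas and two box encodings.
    return np.array(delta(b_s, b_o) + delta(b_s, b_u) + delta(b_u, b_o)
                    + c(b_s, w_img, h_img) + c(b_o, w_img, h_img))

feat = spatial_features([10, 20, 50, 80], [100, 30, 40, 60], 640, 480)
assert feat.shape == (22,)  # 3 * 4 + 2 * 5 = 22 dimensions
```

The 22-dimensional vector then feeds the two fully connected layers that produce the 64-dimensional spatial embedding s_p.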
These extracted visual and spatial feature vectors are then passed through three separate neural
networks to generate the subject, predicate, and object embeddings. For the subject and object
embeddings, the visual feature vectors go through three fully connected layers with ReLU
activation functions to get 256-dimensional embedding vectors, v_s and v_o. Similarly, the union
feature vector, z_union, is first concatenated with the spatial embedding, s_p, before going through
three fully connected layers with ReLU activation functions to get the 256-dimensional predicate
embedding vector, v_p.
3.2. Language Module
For the language module, the model also learns three separate neural networks for subject,
predicate, and object that map the pre-trained language features toward the final joint visual
and language spaces. Here, the architecture uses BERT [34] instead of word2vec [35] as our
pre-trained language encoder to leverage the contextualized information from the entire triplet.
We believe that contextualized encoders like BERT are beneficial for visual relationship detection
because the same predicate can have different meanings under different (subject, object) contexts.
Thus, after extracting contextualized feature embeddings from BERT and passing them through
the three separate neural networks, the final outputs are three 256-dimensional embedding
vectors: w_s, w_p, and w_o.
3.3. Loss Functions
The model uses triplet margin loss as the primary metric loss function, although this can be
replaced with other contrastive loss functions [28, 30]. Here, cosine similarity is used as the
distance metric 𝑑 for all triplet margin loss functions.
Visual and Language Consistency Loss. Using triplet margin loss, the following loss function
aims to bring the three positive visual embeddings (v_s, v_p, v_o) closer to the three positive
language embeddings (w_s, w_p, w_o), while pushing apart negative pairs. To reduce the number
of equations, the loss function in Equation (6), or L^vl, is applied separately to the three subject,
predicate, and object heads. Therefore, given the set V = {(v, w)} of positive visual and language
embedding pairs, the set V^v− = {(v⁻, w)} of pairs of negative visual with positive language
embeddings, and the set V^w− = {(v, w⁻)} of pairs of positive visual with negative language
embeddings, the triplet losses are:
L^vl_v = Σ_{(v, w) ∈ V} (1/|V^v−|) Σ_{(v⁻, w) ∈ V^v−} [m + d(v, w) − d(v⁻, w)]₊   (4)

L^vl_w = Σ_{(v, w) ∈ V} (1/|V^w−|) Σ_{(v, w⁻) ∈ V^w−} [m + d(v, w) − d(v, w⁻)]₊   (5)

L^vl = L^vl_v + L^vl_w   (6)

where [x]₊ = max(0, x) denotes the positive part of the input, m denotes a margin of 0.2,
and d is the cosine similarity distance metric. L^vl is applied correspondingly to the subject,
object, and predicate pairs.
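A minimal numpy sketch of the symmetric consistency loss in Equations (4)–(6); grouping each positive (visual, language) pair with its own lists of negative visual and negative language embeddings is our assumption about the batching scheme:

```python
import numpy as np

def cos_dist(a, b):
    # Cosine distance: one minus the cosine similarity.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def consistency_loss(pos_pairs, neg_visual, neg_language, m=0.2):
    """Equations (4)-(6): triplet margin losses between visual and language
    embeddings, averaged over the negatives attached to each positive pair."""
    loss_v = 0.0  # Equation (4): corrupt the visual side.
    loss_w = 0.0  # Equation (5): corrupt the language side.
    for (v, w), v_negs, w_negs in zip(pos_pairs, neg_visual, neg_language):
        loss_v += np.mean([max(0.0, m + cos_dist(v, w) - cos_dist(vn, w))
                           for vn in v_negs])
        loss_w += np.mean([max(0.0, m + cos_dist(v, w) - cos_dist(v, wn))
                           for wn in w_negs])
    return loss_v + loss_w  # Equation (6)
```

When positives coincide and negatives are orthogonal, the hinge terms clip to zero and the loss vanishes, which is the geometry the training objective pushes toward.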
Translational Loss. To enforce the structural priors of visual relation triples, the model also
enforces the translational loss on the visual embeddings and language embeddings. Thus, given
a set S of valid triples (s, p, o) and a set S⁻ of randomly selected negative triples (s′, p, o′),
the translational losses are defined as:
L^Tr_v = Σ_{(s, p, o) ∈ S} (1/|S⁻|) Σ_{(s′, p, o′) ∈ S⁻} [m + d(v_s + v_p, v_o) − d(v_s′ + v_p, v_o′)]₊   (7)

L^Tr_w = Σ_{(s, p, o) ∈ S} (1/|S⁻|) Σ_{(s′, p, o′) ∈ S⁻} [m + d(w_s + w_p, w_o) − d(w_s′ + w_p, w_o′)]₊   (8)
Here, the predicate embeddings act as the translational vector between the subject and object
embeddings. Thus, the final combined loss function is defined as:
L = L^vl_subj + L^vl_obj + L^vl_pred + L^Tr_v + L^Tr_w   (9)
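Equation (7) can be sketched in the same style (Equation (8) is the identical computation applied to the language embeddings); attaching a list of negative (s′, o′) corruptions to each positive triple is our assumption:

```python
import numpy as np

def cos_dist(a, b):
    # Cosine distance: one minus the cosine similarity.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def translational_loss(pos_triples, neg_triples, m=0.2):
    """Equation (7): triplet margin loss enforcing v_s + v_p ≈ v_o for valid
    triples; each positive triple is contrasted against its corrupted
    (s', o') negatives, which keep the same predicate embedding."""
    loss = 0.0
    for (vs, vp, vo), negs in zip(pos_triples, neg_triples):
        loss += np.mean([max(0.0, m + cos_dist(vs + vp, vo)
                                  - cos_dist(ns + vp, no))
                         for (ns, no) in negs])
    return loss
```

Here the predicate embedding acts as the translation vector: a triple is well placed when adding v_p to v_s lands near v_o, and badly placed corruptions incur a hinge penalty.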
Test-Time Inference. To perform test-time inference on the generated visual and language
embeddings, the evaluation algorithm computes the cosine similarity distances between the
visual embeddings and language embeddings, and ranks them to select the top predictions.
Depending on the evaluation task, different embedding parts of (subject, predicate, object) can be
used (Figure 2). In this paper, we evaluated the model on two tasks: (i) the predicate prediction
task, and (ii) the tail entity prediction task.
For the predicate prediction task (Figure 2a), both the ground truth bounding boxes and
labels for the subject and object are given. Thus, given the ground truth bounding boxes and
an image, the three visual embeddings (v_s, v_p, v_o) for subject, predicate, and object are first
generated by the visual and spatial module (Figure 1, left). For the language modality, due to the
use of the BERT contextualized encoder, the evaluation algorithm first enumerates all possible
(subject, predicate_i, object) triples, where i ∈ (0, n) and n is the number of predicates. These
triples are then passed through the language module (Figure 1, right) to generate n language
embedding triples, (w_s, w_p, w_o)_{i ∈ (0, n)}. Thus, given the visual embedding triple (v_s, v_p, v_o)
and the language embedding triples (w_s, w_p, w_o)_{i ∈ (0, n)}, the visual-language consistency score
for the predicate is computed as:

score_consistency = d(v_p, w_p)   (10)

and the translational score is computed from:

score_translational = d(v_o − v_s, w_o − w_s)   (11)

These two scores are then multiplied to get the final ranking score:

score_combined = score_consistency · score_translational   (12)
For the tail entity prediction task (Figure 2b), only the ground truth bounding box for the
subject and the ground truth labels for the subject and predicate are provided. Thus, without the
ground truth object bounding box, the union box is set to be the subject bounding box. Therefore,
given the image and the subject bounding box, the visual and spatial module (Figure 1, left)
generates the visual embeddings (v_s, v_p) for the subject and predicate. Similarly, from the
subject and predicate ground truth labels, the evaluation algorithm first enumerates all possible
(subject, predicate, object_i) triples, where i ∈ (0, e) and e is the number of object classes. These
triples are then passed through the language module (Figure 1, right) to generate (w_s, w_p, w_o)_{i ∈ (0, e)}
embedding triples. Given the visual embeddings (v_s, v_p) and the language (w_s, w_p, w_o)_{i ∈ (0, e)}
embedding triples, the translational score is computed as:

score_translational = d(v_s + v_p, w_s + w_p)   (13)
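Both test-time scoring functions can be sketched as rankings over the enumerated candidate language triples. The sketch ranks by descending cosine similarity rather than ascending cosine distance, and the helper names are ours:

```python
import numpy as np

def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def rank_predicates(v_s, v_p, v_o, lang_triples):
    """Figure 2a: score each candidate predicate i by the product of the
    consistency term and the translational term, then rank candidates."""
    scores = [cos_sim(v_p, w_p) * cos_sim(v_o - v_s, w_o - w_s)
              for (w_s, w_p, w_o) in lang_triples]
    return np.argsort(scores)[::-1]  # indices of best candidates first

def rank_objects(v_s, v_p, lang_triples):
    """Figure 2b / Equation (13): score each candidate object i by comparing
    the translated visual subject with the translated language subject."""
    scores = [cos_sim(v_s + v_p, w_s + w_p)
              for (w_s, w_p, w_o) in lang_triples]
    return np.argsort(scores)[::-1]
```

With real embeddings, `lang_triples` would hold the n (or e) BERT-encoded candidate triples; the top-ranked index recovers the predicted predicate or tail object.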
4. Preliminary Results
This section evaluates the performance of VLTransE on the VRD dataset, which contains 4000
images for training and 1000 images for testing. In total, the VRD dataset contains 100 object
classes, 70 predicate classes, and 37,993 relationships.
Table 1
Predicate Prediction Results on VRD test set
seen and unseen triples unseen triples only
scoring function Recall top@1 Recall top@5 Recall top@1 Recall top@5
𝑠𝑐𝑜𝑟𝑒𝑐𝑜𝑛𝑠𝑖𝑠𝑡𝑒𝑛𝑐𝑦 15.21 36.71 3.93 16.60
𝑠𝑐𝑜𝑟𝑒𝑡𝑟𝑎𝑛𝑠𝑙𝑎𝑡𝑖𝑜𝑛 6.83 26.01 4.62 16.25
𝑠𝑐𝑜𝑟𝑒𝑐𝑜𝑚𝑏𝑖𝑛𝑒𝑑 18.64 43.75 6.33 22.58
Table 2
Tail Entity Prediction Results on VRD test set
seen and unseen triples unseen triples only
scoring function Recall top@1 Recall top@5 Recall top@1 Recall top@5
𝑠𝑐𝑜𝑟𝑒𝑡𝑟𝑎𝑛𝑠𝑙𝑎𝑡𝑖𝑜𝑛 10.72 33.80 3.51 14.37
Table 3
Comparing the model trained with and without translational loss
seen and unseen triples unseen triples only
model Recall top@1 Recall top@5 Recall top@1 Recall top@5
without translational loss 24.18 44.20 7.10 22.07
with translational loss 18.64 43.75 6.33 22.58
Evaluation. All evaluation results are computed using the recall metric on the top 𝑛 ranked
items. For the predicate prediction task, Table 1 shows that by multiplying the visual-language
consistency score (𝑠𝑐𝑜𝑟𝑒𝑐𝑜𝑛𝑠𝑖𝑠𝑡𝑒𝑛𝑐𝑦 ) and the translational score (𝑠𝑐𝑜𝑟𝑒𝑡𝑟𝑎𝑛𝑠𝑙𝑎𝑡𝑖𝑜𝑛 ) instead of
using just the visual-language consistency score, the performances of the model when detecting
all predicates and unseen predicates improved by 22.6% and 61% respectively on Recall top@1
metric, and by 19.2% and 36% on the Recall top@5 metric. Table 3 shows that the model trained
with the additional translational loss performs worse than the model trained without the
translational loss constraint when evaluated on the entire test set. However, the results of
the two models become comparable when evaluated solely on unseen compositions of visual
relation triples.
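The Recall top@k metric used in Tables 1–3 can be sketched as follows (the predicate labels in the example are illustrative, not drawn from the VRD dataset):

```python
def recall_at_k(ranked_predictions, ground_truths, k):
    """Fraction of test instances whose ground-truth label appears among the
    top-k ranked predictions (the Recall top@k metric of Tables 1-3)."""
    hits = sum(gt in preds[:k]
               for preds, gt in zip(ranked_predictions, ground_truths))
    return hits / len(ground_truths)

# Three test instances, each with a ranked list of candidate predicates.
ranked = [["ride", "on", "near"], ["on", "wear", "has"], ["near", "on", "ride"]]
truth = ["on", "wear", "ride"]
assert recall_at_k(ranked, truth, 1) == 0.0       # no top-1 hit
assert abs(recall_at_k(ranked, truth, 2) - 2 / 3) < 1e-9
```

Enlarging k can only increase recall, which is why the top@5 columns dominate the top@1 columns in the tables.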
With the additional translational structural loss, the embedding space can now be extended
to tasks other than visual relationship detection. Here, we evaluated the model on the tail entity
prediction task, where the goal is to infer potential objects given only the subject and predicate
ground truth label. The results shown in Table 2 indicate that the model can perform reasonably
well given that no additional visual information is provided.
5. Discussion and Future Work
Visual Relationship Detection is the cornerstone of many modern machine learning tasks that
require a comprehensive understanding of the visual scene. Current contrastive distance metric
approaches in learning joint visual-language embeddings for VRD often rely on neural networks
learning the necessary transformations implicitly without any structural constraints. To this
end, we propose VLTransE, a contrastive visual-language embedding model that preserves
the first-order structure of the graph through the use of the translational constraint. While
the results shown in Table 3 indicate that additional constraints can interfere with the model
learning process and reduce the model’s performance on the given VRD benchmark, Table 1
and 2 show the versatility of the embeddings, where the same embedding space can be used
for tasks other than visual relationship detection. While the initial results of the model’s first
iteration are reasonable, further experiments are needed to identify failure corner cases and to
verify the impact of language biases.
There are certain limitations with the proposed approach that we want to explore in future
research. First, the translational loss can only preserve the first-order proximity of the scene
graph, where only immediate neighbors’ features are used, limiting the expressiveness of the
embedding spaces. Thus, we might consider extending the method to other graph embedding
techniques that consider not only the local structure, but also the global structure. Second,
the method shown here uses random triplet selection for the contrastive losses, which could
prevent the model from converging to the optimal solution. Therefore, future work may consider
other contrastive loss functions and negative sampling techniques. Finally, while the method
induces the graphical structure in the final output embedding spaces, it remains open on how to
effectively visualize these embeddings or transfer them to create a more explicit representation.
References
[1] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, D. Parikh, VQA: Visual
Question Answering, in: 2015 IEEE International Conference on Computer Vision (ICCV),
IEEE, Santiago, Chile, 2015, pp. 2425–2433. doi:10.1109/ICCV.2015.279.
[2] S. Schuster, R. Krishna, A. Chang, L. Fei-Fei, C. D. Manning, Generating Semantically
Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval, in: Pro-
ceedings of the Fourth Workshop on Vision and Language, Association for Computational
Linguistics, Lisbon, Portugal, 2015, pp. 70–80. doi:10.18653/v1/W15-2812.
[3] A. Karpathy, L. Fei-Fei, Deep Visual-Semantic Alignments for Generating Image Descrip-
tions, arXiv:1412.2306 [cs] (2015). arXiv:1412.2306.
[4] C. Lu, R. Krishna, M. Bernstein, L. Fei-Fei, Visual Relationship Detection with Language
Priors, arXiv:1608.00187 [cs] (2016). arXiv:1608.00187.
[5] R. Yu, A. Li, V. I. Morariu, L. S. Davis, Visual Relationship Detection with Internal and
External Linguistic Knowledge Distillation, in: 2017 IEEE International Conference on
Computer Vision (ICCV), IEEE, Venice, 2017, pp. 1068–1076. doi:10.1109/ICCV.2017.
121.
[6] H. Zhang, Z. Kyaw, S.-F. Chang, T.-S. Chua, Visual Translation Embedding Network for
Visual Relation Detection, in: 2017 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), IEEE, Honolulu, HI, 2017, pp. 3107–3115. doi:10.1109/CVPR.2017.
331.
[7] J. Zhang, K. J. Shih, A. Elgammal, A. Tao, B. Catanzaro, Graphical Contrastive Losses for
Scene Graph Parsing, arXiv:1903.02728 [cs] (2019). arXiv:1903.02728.
[8] Y.-C. Su, S. Changpinyo, X. Chen, S. Thoppay, C.-J. Hsieh, L. Shapira, R. Soricut, H. Adam,
M. Brown, M.-H. Yang, B. Gong, 2.5D Visual Relationship Detection, arXiv:2104.12727
[cs] (2021). arXiv:2104.12727.
[9] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J.
Li, D. A. Shamma, M. S. Bernstein, L. Fei-Fei, Visual Genome: Connecting Language and
Vision Using Crowdsourced Dense Image Annotations, International Journal of Computer
Vision 123 (2017) 32–73. doi:10.1007/s11263-016-0981-7.
[10] S. Ji, S. Pan, E. Cambria, P. Marttinen, P. S. Yu, A Survey on Knowledge Graphs: Represen-
tation, Acquisition and Applications, IEEE Transactions on Neural Networks and Learning
Systems (2021) 1–21. doi:10.1109/TNNLS.2021.3070843. arXiv:2002.00388.
[11] Z. Chen, Y. Wang, B. Zhao, J. Cheng, X. Zhao, Z. Duan, Knowledge Graph Completion: A
Review, IEEE Access 8 (2020) 192435–192456. doi:10.1109/ACCESS.2020.3030076.
[12] J. Johnson, R. Krishna, M. Stark, L.-J. Li, D. A. Shamma, M. S. Bernstein, L. Fei-Fei, Image
retrieval using scene graphs, in: 2015 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), IEEE, Boston, MA, USA, 2015, pp. 3668–3678. doi:10.1109/CVPR.
2015.7298990.
[13] H. Wan, Y. Luo, B. Peng, W.-S. Zheng, Representation Learning for Scene Graph Completion
via Jointly Structural and Visual Embedding, in: Proceedings of the Twenty-Seventh
International Joint Conference on Artificial Intelligence, International Joint Conferences
on Artificial Intelligence Organization, Stockholm, Sweden, 2018, pp. 949–956. doi:10.
24963/ijcai.2018/132.
[14] P. Maheshwari, R. Chaudhry, V. Vinay, Scene Graph Embeddings Using Relative Similarity
Supervision, arXiv:2104.02381 [cs] (2021). arXiv:2104.02381.
[15] A. Saxena, A. Jain, O. Sener, A. Jami, D. K. Misra, H. S. Koppula, RoboBrain: Large-Scale
Knowledge Engine for Robots, arXiv:1412.0691 [cs] (2015). arXiv:1412.0691.
[16] C. Henson, S. Schmid, T. Tran, A. Karatzoglou, Using a Knowledge Graph of Scenes to
Enable Search of Autonomous Driving Data.
[17] R. Speer, J. Chin, C. Havasi, ConceptNet 5.5: An Open Multilingual Graph of General
Knowledge, arXiv:1612.03975 [cs] (2018). arXiv:1612.03975.
[18] M. Sap, R. LeBras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin, B. Roof, N. A.
Smith, Y. Choi, ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning,
arXiv:1811.00146 [cs] (2019). arXiv:1811.00146.
[19] H. Wu, J. Mao, Y. Zhang, Y. Jiang, L. Li, W. Sun, W.-Y. Ma, Unified Visual-Semantic
Embeddings: Bridging Vision and Language With Structured Meaning Representations, in:
2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), IEEE,
Long Beach, CA, USA, 2019, pp. 6602–6611. doi:10.1109/CVPR.2019.00677.
[20] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A Simple Framework for Contrastive
Learning of Visual Representations, in: Proceedings of the 37th International Conference
on Machine Learning, PMLR, 2020, pp. 1597–1607.
[21] A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, O. Yakhnenko, Translating Embeddings
for Modeling Multi-relational Data, in: Advances in Neural Information Processing
Systems, volume 26, Curran Associates, Inc., 2013.
[22] J. Peyre, J. Sivic, I. Laptev, C. Schmid, Detecting Unseen Visual Relations Using Analogies,
in: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, Seoul,
Korea (South), 2019, pp. 1981–1990. doi:10.1109/ICCV.2019.00207.
[23] J. Zhang, Y. Kalantidis, M. Rohrbach, M. Paluri, A. Elgammal, M. Elhoseiny, Large-Scale
Visual Relationship Understanding, arXiv:1804.10660 [cs] (2019). arXiv:1804.10660.
[24] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell,
P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning Transferable Visual Models From
Natural Language Supervision, arXiv:2103.00020 [cs] (2021). arXiv:2103.00020.
[25] Y.-C. Chen, L. Li, L. Yu, A. E. Kholy, F. Ahmed, Z. Gan, Y. Cheng, J. Liu, UNITER: UNiversal
Image-TExt Representation Learning, arXiv:1909.11740 [cs] (2020). arXiv:1909.11740.
[26] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, J. Dai, VL-BERT: Pre-training of Generic
Visual-Linguistic Representations, arXiv:1908.08530 [cs] (2020). arXiv:1908.08530.
[27] F. Bianchi, G. Rossiello, L. Costabello, M. Palmonari, P. Minervini, Knowledge Graph
Embeddings and Explainable AI, arXiv:2004.14843 [cs] (2020). doi:10.3233/SSW200011.
arXiv:2004.14843.
[28] K. Sohn, Improved Deep Metric Learning with Multi-class N-pair Loss Objective, in:
Advances in Neural Information Processing Systems, volume 29, Curran Associates, Inc.,
2016.
[29] B. Yu, T. Liu, M. Gong, C. Ding, D. Tao, Correcting the Triplet Selection Bias for triplet
loss, in: V. Ferrari, M. Hebert, C. Sminchisescu, Y. Weiss (Eds.), Computer Vision – ECCV
2018, volume 11210, 2018, p. 17.
[30] A. van den Oord, Y. Li, O. Vinyals, Representation Learning with Contrastive Predictive
Coding, arXiv:1807.03748 [cs, stat] (2019). arXiv:1807.03748.
[31] Y. LeCun, P. Haffner, L. Bottou, Y. Bengio, Object Recognition with Gradient-Based
Learning, in: D. A. Forsyth, J. L. Mundy, V. di Gesú, R. Cipolla (Eds.), Shape, Contour
and Grouping in Computer Vision, Lecture Notes in Computer Science, Springer, Berlin,
Heidelberg, 1999, pp. 319–345. doi:10.1007/3-540-46805-6_19.
[32] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards Real-Time Object Detection
with Region Proposal Networks, in: Advances in Neural Information Processing Systems,
volume 28, Curran Associates, Inc., 2015.
[33] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition,
arXiv:1512.03385 [cs] (2015). arXiv:1512.03385.
[34] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidi-
rectional Transformers for Language Understanding, arXiv:1810.04805 [cs] (2019).
arXiv:1810.04805.
[35] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in
Vector Space, arXiv:1301.3781 [cs] (2013). arXiv:1301.3781.