An Insect-Inspired Randomly Weighted Neural Network with Random Fourier Features for Neuro-Symbolic Relational Learning

Jinyung Hong1, Theodore P. Pavlic1,2,3,4
1 School of Computing and Augmented Intelligence, Arizona State University, Tempe, AZ 85281, USA
2 School of Sustainability, Arizona State University, Tempe, AZ 85281, USA
3 School of Complex Adaptive Systems, Arizona State University, Tempe, AZ 85281, USA
4 School of Life Sciences, Arizona State University, Tempe, AZ 85281, USA

Abstract
The computer-science field of Knowledge Representation and Reasoning (KRR) aims to understand, reason about, and interpret knowledge as efficiently as human beings do. Because many logical formalisms and reasoning methods in the area have shown the capability of higher-order learning, such as abstract concept learning, integrating artificial neural networks (ANNs) with KRR methods for learning complex and practical tasks has received much attention. For example, Neural Tensor Networks (NTNs) are neural-network models capable of transforming symbolic representations into vector spaces where reasoning can be performed through matrix computation; when used in Logic Tensor Networks (LTNs), they are able to embed first-order logic symbols such as constants, facts, and rules into real-valued tensors. The integration of KRR and ANNs suggests a potential avenue for bringing biological inspiration from neuroscience into KRR. However, higher-order learning is not exclusive to human brains. Insects, such as fruit flies and honey bees, can solve simple associative learning tasks and learn abstract concepts such as “sameness” and “difference,” which is viewed as a higher-order cognitive function and typically thought to depend on top-down neocortical processing. Empirical research with fruit flies strongly supports that a randomized representational architecture is used in olfactory processing in insect brains. Based on these results, we propose a Randomly Weighted Feature Network (RWFN) that incorporates randomly drawn, untrained weights in its encoder and uses an adapted linear model as a decoder. The randomized projections between input neurons and higher-order processing centers in the insect brain are mimicked in RWFN by a single-hidden-layer neural network that specially structures latent representations in the hidden layer using random Fourier features, which better represent complex relationships between inputs through kernel approximation. Because of this special representation, RWFNs can effectively learn the degree of relationship among inputs by training only a linear decoder model. We compare the performance of RWFNs to LTNs for Semantic Image Interpretation (SII) tasks that have been used as a representative example of how LTNs utilize reasoning over first-order logic to surpass the performance of solely data-driven methods. We demonstrate that, compared to LTNs, RWFNs can achieve better or similar performance for both object classification and detection of the part-of relations between objects in SII tasks while using far fewer learnable parameters (1:62 ratio) and a faster learning process (1:2 ratio of running speed). Furthermore, we show that because the randomized weights do not depend on the data, several decoders can share a single randomized encoder, giving RWFNs a unique economy of spatial scale for simultaneous classification tasks.

Keywords: insect neuroscience, model architecture, randomization, neuro-symbolic computing
1. Introduction

The human brain has an extraordinary ability to memorize and learn new things to solve a variety of problems ranging in difficulty from trivial to complex. To understand the cognitive architecture of the brain, research on producing a wiring diagram of the connections among all neurons, called Connectomics [1], has focused not only on the human brain [2] but also on the brains of insects [3, 4], and such research has influenced the development of machine learning and Artificial Intelligence (AI) [5]. However, far more is known about the function of coarse-grained, high-level structures in the brain than the neuron-scale layout of important brain regions. Similarly, the high degrees of freedom in artificial neural networks (ANNs) have provided an opportunity for the introduction of Knowledge Representation and Reasoning (KRR) to constructively constrain ANN architectures and training methods. In particular, combining KRR techniques with ANNs promises to enhance the high performance of modern AI with explainability and interpretability, which is necessary for generalized human insight and increased trustworthiness. Several recent studies across statistical relational learning (SRL), neural-symbolic computing, knowledge completion, and approximate inference [6, 7, 8, 9] have shown that neural networks can be integrated with logical systems to perform robust learning and effective inference while also providing increased interpretability from symbolic knowledge extraction. These neural-network knowledge-representation approaches use relational embeddings to represent relational predicates in a neural network [10, 11, 12, 13]. For example, Neural Tensor Networks (NTNs) are structured to encode the degree of association among pairs of entities in the form of tensor operations on real-valued vectors [12]. These NTNs have been synthesized with neural-symbolic integration [7] in the development of Logic Tensor Networks (LTNs) [14], which extend the power of NTNs to reason over first-order many-valued logic [15]. Although KRR aims to lift the reasoning ability of computers to that of humans, such higher-order learning and reasoning capabilities are not unique to humans. Insect neuroscience has shown that insects exhibit sophisticated and complex behaviors even though they possess miniature central nervous systems compared to the human brain [16]. For example, attention-like processes have been demonstrated in fruit flies and honey bees [17, 18], and concept learning has been shown in bees [19]. Specifically, the honey bee brain has been shown to exhibit a high level of cognitive sophistication, learning relational concepts such as “same,” “different,” “larger than,” and “better than,” among others, and researchers continue to study the neurobiological mechanisms and computational models underlying these capabilities [20]. Just as KRR is now being used to better shape ANNs for more sophisticated reasoning and increased interpretability, the architectures demonstrated in the honey bee brain may provide insights into how to augment ANNs with higher-order reasoning abilities akin to those demonstrated in insects.
In this paper, we propose Randomly Weighted Feature Networks (RWFNs), an insect-brain-inspired single-hidden-layer neural network for relational embedding that incorporates randomly drawn, untrained weights in its encoder with a trained linear model as a decoder. Our approach is mainly motivated by neural circuits in the insect brain centered around the Mushroom Body (MB). The MB, analogous to the neocortex in humans, is a vital region of the insect brain supporting concept learning because it is responsible for stimulus identification, categorization, and elemental learning [21, 22, 23]. We can model the MB as a neural network with three layers: Input Neurons (INs) – Kenyon Cells (KCs) – mushroom body Extrinsic Neurons (ENs). One of the remarkable properties of the MB is that the connections between INs and KCs are relatively random and sparse [24]. To mimic this characteristic, we use a random weight matrix to transform the input between the input and hidden layers and generate the latent representation of the relationship between real-valued input entities. By doing so, the learning process involves only training the weights between the hidden and output layers, which is simple and fast. In contrast, a conventional LTN incorporates an NTN specially trained to capture logical relationships present in data, which requires more learnable parameters and a more complex learning process. Our method is also influenced by random Fourier features [25], a kernel approximation method that overcomes the issues of conventional kernel machines or kernel methods [26]. Kernel methods are among the most powerful and theoretically grounded approaches for nonlinear statistical learning problems, including classification, regression, clustering, and others [27, 28, 29]. However, their main issues are a lack of scalability to large datasets and a slow training process [25, 30]. Random Fourier features address these issues by approximating the kernel function with an explicit feature mapping that projects the input data into a randomized feature space, on which faster linear models can then be trained. Interestingly, the random Fourier features model can also be viewed as a single-hidden-layer neural network with fixed weights between the input and hidden layers. Thus, we leverage this to substitute for the tensor operations in conventional NTNs that model the linear interactions between entities, and we use the projection of the input into another space as an additional feature representation in the hidden layer of our model to learn relationships. Our proposed model is therefore an insect-inspired single-hidden-layer network whose latent representation is derived from the integration of the IN–KC input transformation and random Fourier features, and it only requires training a linear decoder. By applying the model to Semantic Image Interpretation (SII) tasks, we show that a trained linear decoder in RWFNs can effectively capture the likelihood of part-of relationships at a level of performance exceeding that of traditional LTNs, even with far fewer parameters and a faster learning process.
To the best of our knowledge, this is the first research to integrate both insect neuroscience and neuro-symbolic approaches for reasoning under uncertainty and for learning in the presence of data and rich knowledge. Furthermore, because the encoder weights in our model do not depend upon the data, the single encoder can be shared among several decoders, each trained for a different classifier, giving RWFNs an economy of spatial scale in applications where several classifiers need to be used simultaneously.

2. Related Work and Background

Insect Neuroscience The MB in the insect brain receives processed olfactory, visual, and mechanosensory stimuli [31] and is viewed as the critical region responsible for multimodal associative learning [21]. In the fruit fly, each of the thousands of Kenyon Cells (KCs) in the MB receives a random set of ∼7 inputs from INs [24, 32], and this is similarly true for honey bees [33]. A simplified neural circuit modeling the MB is a neural network with three layers consisting of: i) INs that provide olfactory, visual, and mechanosensory inputs, ii) KCs that generate the sparse encoding of sensory stimuli, and iii) ENs that activate several different behavioral responses [34]. In particular, INs receive various inputs from Antennal Lobe (AL) glomeruli and from the Medulla and Lobula optic neuropils [34]. For simplicity, we focus on the olfactory pathway between glomeruli in the AL and KCs in the MB [34, 33]. The insect olfactory neural circuit has a divergence–convergence structure where ∼800 AL glomeruli form a coded feature vector that expands into a sparse representation across ∼170,000 KCs, which is decoded by ∼400 ENs that actuate motor pathways based on this information-processing pipeline [33]. This general divergence–convergence structure applies equally well to honey bees and fruit flies [35]; therefore, for brevity in modeling our architecture, we draw interchangeably on the olfactory neural circuits of these insects. In this paper, the transformation of the odorant representation in the AL into the higher-order representation across the KCs in the MB is mirrored by the projection between the input and hidden layers of our model, and this representation plays the critical role in learning the relationships present in the input.

Random Fourier Features Kernel machines, e.g., Support Vector Machines (SVMs) [36, 37], have received significant attention due to their capability for function approximation and their excellent performance at detecting decision boundaries given enough training data. These methods use transformations, such as a lifting function 𝜙, that help to better discriminate among different inputs. Given dataset vector inputs x, y ∈ ℝ^𝑑, the kernel function 𝑘(x, y) = ⟨𝜙(x), 𝜙(y)⟩ represents the similarity (i.e., inner product) between x and y in the transformed space. However, because of the potential complexity of the transformation 𝜙, learning with the kernel function 𝑘 may require significant computational and storage costs. Random Fourier features [25], instead, provide a data transformation that permits a far less expensive approximation of the kernel function. For each vector input x ∈ ℝ^𝑑, the technique applies a randomized feature function z ∶ ℝ^𝑑 → ℝ^𝐷 (generally, 𝐷 ≫ 𝑑 with sample size 𝑁 ≫ 𝐷) that maps x to evaluations of 𝐷 random Fourier bases drawn from the Fourier transform of the kernel 𝑘.
In this transformed space, kernel evaluations can be approximated by linear operations, as in:

𝑘(x, y) = ⟨𝜙(x), 𝜙(y)⟩ ≈ z(x)^⊤ z(y)    (1)

Thus, by transforming the input with z, fast linear learning methods can be leveraged to approximate the evaluations of nonlinear kernel machines. In this paper, we use random Fourier features as latent representations that reduce the complexity of learning relations among real-valued entities. As described in Section 1, adaptable NTNs within LTNs have been used to encode relationships among real-valued entities. We replace the adaptable NTNs with random Fourier features, which have high expressiveness with low decoding overhead. Details and intuitions are given in Section 3.

Logic Tensor Networks (LTNs) The RWFNs we propose are meant to improve upon LTNs for statistical relational learning tasks. LTNs integrate learning based on NTNs [12] with reasoning using first-order, many-valued logic [15], all implemented in TensorFlow [14]. Here, we briefly introduce LTN syntax and semantics for mapping logical symbols to numerical values and learning reasoning relations among real-valued vectors using logical formulas. Although a first-order-logic (FOL) language ℒ and its signature are defined by three disjoint sets – i) 𝒞 (constants), ii) ℱ (functions), and iii) 𝒫 (predicates) – we ignore the function symbols ℱ because they are not used in the SII tasks that we focus on here. For any predicate symbol 𝑠, 𝛼(𝑠) denotes its arity, and logical formulas in ℒ enable the description of relational knowledge. The objects being reasoned over with FOL are mapped to an interpretation domain ⊆ ℝ^𝑛 so that every object is associated with an 𝑛-dimensional vector of real numbers. Intuitively, this 𝑛-tuple indicates 𝑛 numerical features of an object. Thus, predicates are interpreted as fuzzy relations on real vectors. With this numerical background, we can now define the numerical grounding of FOL with the following semantics; this grounding is necessary for NTNs to reason over logical statements. Let 𝑛 ∈ ℕ. An 𝑛-grounding, or simply grounding, 𝒢 for a FOL ℒ is a function defined on the signature of ℒ satisfying the following conditions:

𝒢(𝑐) ∈ ℝ^𝑛 for every constant symbol 𝑐 ∈ 𝒞;
𝒢(𝑃) ∈ ℝ^{𝑛⋅𝛼(𝑃)} → [0, 1] for every predicate symbol 𝑃 ∈ 𝒫.

Given a grounding 𝒢, the semantics of closed terms and atomic formulas is defined as follows:

𝒢(𝑃(𝑡_1, … , 𝑡_𝑚)) ≜ 𝒢(𝑃)(𝒢(𝑡_1), … , 𝒢(𝑡_𝑚))

The semantics for connectives, such as 𝒢(¬𝜙), 𝒢(𝜙 ∧ 𝜓), 𝒢(𝜙 ∨ 𝜓), and 𝒢(𝜙 → 𝜓), can be computed by following a fuzzy logic such as the Łukasiewicz 𝑡-norm [15]. A partial grounding 𝒢̂ can be defined on a subset of the signature of ℒ. A grounding 𝒢 is said to be a completion of 𝒢̂ if 𝒢 is a grounding for ℒ and coincides with 𝒢̂ on the symbols where 𝒢̂ is defined. Let a grounded theory (GT) be a pair ⟨𝒦, 𝒢̂⟩ with a set 𝒦 of closed formulas and a partial grounding 𝒢̂. A grounding 𝒢 satisfies a GT ⟨𝒦, 𝒢̂⟩ if 𝒢 completes 𝒢̂ and 𝒢(𝜙) = 1 for all 𝜙 ∈ 𝒦. A GT ⟨𝒦, 𝒢̂⟩ is satisfiable if there exists a grounding 𝒢 that satisfies ⟨𝒦, 𝒢̂⟩. In other words, deciding the satisfiability of ⟨𝒦, 𝒢̂⟩ amounts to searching for a grounding 𝒢 such that all the formulas of 𝒦 are mapped to 1. If a GT is not satisfiable, the best possible satisfaction that we can reach with a grounding is of interest. The grounding 𝒢* that best satisfies the GT captures the implicit correlation between quantitative features of objects and their categorical/relational properties.
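To make the fuzzy semantics above concrete, the following minimal Python sketch (our own illustration, not part of the LTN implementation) evaluates the standard Łukasiewicz connectives on grounded truth degrees in [0, 1]; the truth values of the example predicates are hypothetical placeholders, not outputs of a trained grounding.

```python
# Minimal sketch of Lukasiewicz fuzzy semantics for grounded connectives.
# Truth degrees below are hypothetical placeholders, not learned groundings.

def neg(a):         # G(not phi)     = 1 - G(phi)
    return 1.0 - a

def t_and(a, b):    # G(phi and psi) = max(0, a + b - 1)  (Lukasiewicz t-norm)
    return max(0.0, a + b - 1.0)

def t_or(a, b):     # G(phi or psi)  = min(1, a + b)      (t-conorm)
    return min(1.0, a + b)

def implies(a, b):  # G(phi -> psi)  = min(1, 1 - a + b)  (residuum)
    return min(1.0, 1.0 - a + b)

# Example: degree to which "Cat(b1) and partOf(b2, b1) -> Tail(b2)" holds
cat_b1, partof_b2_b1, tail_b2 = 0.9, 0.8, 0.7   # hypothetical grounded degrees
print(implies(t_and(cat_b1, partof_b2_b1), tail_b2))   # -> 1.0
```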
The grounding of an 𝑚-ary predicate 𝑃, namely 𝒢(𝑃), is defined as a generalization of the NTN [12], as a function from ℝ^{𝑚𝑛} to [0, 1], as follows:

𝒢_LTN(𝑃)(v) = 𝜎(𝑢_𝑃^⊤ f(v^⊤ 𝑊_𝑃^{[1:𝑘]} v + 𝑉_𝑃 v + 𝑏_𝑃))    (2)

where v = ⟨v_1^⊤, … , v_𝑚^⊤⟩^⊤ is the 𝑚𝑛-ary vector obtained by concatenating each v_𝑖, 𝜎 is the sigmoid function, and f is the hyperbolic tangent (tanh). The parameters for 𝑃 are: 𝑊_𝑃^{[1:𝑘]}, a 3-D tensor in ℝ^{𝑘×𝑚𝑛×𝑚𝑛}; 𝑉_𝑃 ∈ ℝ^{𝑘×𝑚𝑛}; 𝑏_𝑃 ∈ ℝ^𝑘; and 𝑢_𝑃 ∈ ℝ^𝑘. Because our RWFN model can likewise be used to ground a predicate as 𝒢_RWFN(𝑃), we can directly compare the performance of RWFNs with LTNs on the SII tasks.

Figure 1: The architectures of RWFNs and RWFNs with weight sharing. (a) Visualization of the structure of the Randomly Weighted Feature Network. In the depicted case, the input vector v consists of two entities, 𝑒_1, 𝑒_2 ∈ ℝ^3, and the network learns a binary relation (𝑒_1, 𝑅, 𝑒_2) between them, such as (Cat, hasPart, Tail). (b) Visualization of the structure of the Randomly Weighted Feature Network with weight sharing. In the case of learning each classifier from class 𝒞_1 to class 𝒞_𝑖, RWFNs allow us to use the same encoder to extract features from the data of every class from 𝒞_1 to 𝒞_𝑖.

3. Randomly Weighted Feature Networks (RWFNs)

In this section, we introduce the details of Randomly Weighted Feature Networks (RWFNs). The underlying intuition behind the development of this model can be found in Appendix A.

3.1. Model Architecture

Let the input vector v be [v_1^⊤, … , v_𝑚^⊤]^⊤, the 𝑚𝑛-ary vector where 𝑚 is the arity and 𝑛 is the input dimension. We first define two kinds of latent representations: i) the input transformation between AL glomeruli and KCs inspired by the insect brain, and ii) the transformed input using a randomized feature mapping z(⋅) from random Fourier features. For the bio-inspired representation, we select 𝑁_in ∈ [1, 𝑚𝑛) indices of the input at random without replacement for each hidden node (𝑁_in < 𝑚𝑛 prevents hidden-unit outputs from becoming trivially 0 by Eq. (3); in our setting, 𝑁_in = 7). In other words, the output 𝑣̄_𝑗 of each hidden node 𝑗 ∈ {1, ..., 𝐵} is a weighted combination of all 𝑚𝑛 inputs where only 𝑁_in < 𝑚𝑛 inputs have 𝑤_{𝑗,𝑖} = 1 and all other inputs have 𝑤_{𝑗,𝑖} = 0. The inputs are thus effectively gated by the weights on each hidden node, and the weight matrix W ∈ ℝ^{𝑚𝑛×𝐵} in this computation is random, binary, and sparse. Once the 𝑗th hidden node has produced the weighted sum 𝑣̄_𝑗, a post-processing step produces an intermediate output in the hidden layer. Mimicking Eq. (6) with 𝐶 = 1, the 𝑗th intermediate output 𝑣̂_𝑗 is:

𝑣̂_𝑗 = 𝑣̄_𝑗 − 𝜇,   where 𝜇 = (1/𝐵) ∑_{𝑖=1}^{𝐵} 𝑣̄_𝑖    (3)

and 𝐵 is the number of hidden units. Therefore, the sparse output of the 𝑗th KC node can be defined as ℎ_𝑗^{(1)} = 𝑔(𝑣̂_𝑗), where 𝑔 is the ReLU function [38] that allows the model to produce sparse hidden output, which is more biologically plausible. We then define the output vector as h_1 = [ℎ_1^{(1)}, … , ℎ_𝐵^{(1)}]^⊤. On the other hand, to generate random Fourier features, we use a randomized feature function z(⋅) from [25, 39] to project the input as follows:

h_2 = z(v) = √(2/𝐵) cos(R^⊤ v + b)    (4)

where R ∼ Normal^{𝑚𝑛×𝐵}(0, 1) and b ∼ Uniform^𝐵(0, 2𝜋), which yields a Gaussian-kernel approximation. Consequently, the output vector h_2 can be considered another latent representation of the relationships among the inputs. Rahimi et al. [25], Sutherland and Schneider [39], and Liu et al. [30] provide theoretical derivations of kernel approximation and comparative analyses of various kinds of random Fourier features.
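To make Eqs. (3) and (4) concrete, the following NumPy sketch builds the two fixed, untrained latent representations for a single input vector. The sizes (arity 𝑚 = 2, 𝑛 = 64, 𝐵 = 200, 𝑁_in = 7) are illustrative choices matching the settings described later, and the helper name `encode` is ours; this is a simplified sketch rather than the released implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
mn, B, N_in = 2 * 64, 200, 7      # illustrative sizes: arity m=2, n=64 features, B hidden units

# (i) Insect-inspired representation: sparse, binary, random gating matrix W (used in Eq. 3)
W = np.zeros((mn, B))
for j in range(B):                # each hidden node reads N_in randomly chosen inputs
    W[rng.choice(mn, size=N_in, replace=False), j] = 1.0

# (ii) Random Fourier features for a Gaussian-kernel approximation (Eq. 4)
R = rng.standard_normal((mn, B))
b = rng.uniform(0.0, 2.0 * np.pi, size=B)

def encode(v):
    """Fixed (untrained) RWFN encoder: returns the concatenated hidden representation."""
    v_bar = W.T @ v                               # weighted sums at the "KC" layer
    h1 = np.maximum(v_bar - v_bar.mean(), 0.0)    # global inhibition + ReLU (Eq. 3)
    h2 = np.sqrt(2.0 / B) * np.cos(R.T @ v + b)   # random Fourier features (Eq. 4)
    return np.tanh(np.concatenate([h1, h2]))      # h fed to the linear decoder of Eq. (5)

h = encode(rng.standard_normal(mn))               # only the decoder weights beta are trained
```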
Finally, using the above two latent representations, our RWFN grounding can be defined as a function from ℝ^{𝑚𝑛} to [0, 1]:

𝒢_RWFN(𝑃)(v) = 𝜎(𝛽^⊤ h) = 𝜎(𝛽^⊤ f([h_1; h_2]))    (5)

where h is the final hidden representation obtained by applying the hyperbolic tangent (tanh) function f to the concatenation of h_1 and h_2, and 𝜎 is the sigmoid function; the tanh function is used for numerical stabilization. Because our model requires adapting only 𝛽 ∈ ℝ^{2𝐵}, it has a faster learning process with fewer parameters compared to LTNs. Fig. 1a shows a visualization of the structure of our model.

3.2. RWFNs with Weight Sharing

In the insect brain, extrinsic neurons from the MB are processed by several small, downstream neuropils that ultimately lead to decision-making outcomes, such as muscle actuation. If we view these small neuropils as decoding the complex representations in the MB, then different decoders responsible for different decisions all use information sourced from the same randomized representations in the MB. The MB can be viewed as a generalized encoder that is not tailored to a particular task; consequently, it provides a shared resource that reduces the complexity of these downstream neuropils. Because the weights of the randomized encoder of an RWFN are independent of the training data, they can also serve as a shared resource for multiple relatively simple (i.e., linear) downstream decoders trained for different classifiers. We refer to this property as weight sharing. Fig. 1b shows a visualization of the structure of our model applied with weight sharing to the learning of 𝑖 different classifiers. The large solid box surrounds a single encoder that serves as a common feature extractor for all classifiers. The entities in the original definition of RWFNs in Fig. 1a become placeholders to be filled by the input for each classifier. Instead of generating a randomized encoder for each classifier, every classifier uses the same encoder, and training only requires learning the weights of that classifier's simple linear decoder. This approach increases reusability and cost efficiency in a way beyond what is possible with LTNs, which must train all encoder and decoder networks separately for each classifier.

4. Experimental Evaluation

To evaluate the performance of our proposed RWFNs against LTNs, we employ both for SII tasks, which extract structured semantic descriptions from images. Very few SRL applications have been applied to SII tasks because of the high complexity involved with image learning. Donadello et al. [40] define the two main tasks of SII as: (i) the classification of bounding boxes, and (ii) the detection of the part-of relation between any two bounding boxes. They demonstrated that LTNs can successfully improve the performance of solely data-driven approaches, including the state-of-the-art Fast Region-based Convolutional Neural Network (Fast R-CNN) [41]. Our experiments compare the performance of RWFNs and LTNs on these two SII tasks. The tasks are well defined in first-order logic, and the code implemented in the TensorFlow framework has been provided and can be used to compare the performance of LTNs with RWFNs.

4.1. Methods

Here, we provide details of our experimental comparison of RWFNs and LTNs. We utilize the formalization of SII in first-order logic from Donadello et al. [40].
For brevity, we describe: (i) the difference between the grounded theories of RWFNs and LTNs, (ii) the dataset used in the experiments (Appendix B), and (iii) the RWFN and LTN hyperparameters used in the experiments (Appendix B). We omit other formalization details of the SII tasks, which can be found elsewhere [40].

Defining the Grounded Theories for RWFNs and LTNs A set of bounding boxes of images correctly labelled with the classes that they belong to, and pairs of bounding boxes properly labelled with the part-of relation, were provided. These datasets can be considered a training set, and a grounded theory 𝒯_LTN ≜ ⟨𝒦, 𝒢̂_LTN⟩ can be constructed. In particular, 𝒦 contains: (i) the set of closed literals 𝐶_𝑖(𝑏) and partOf(𝑏, 𝑏′) for every bounding box 𝑏 labelled with 𝐶_𝑖 and for every pair of bounding boxes ⟨𝑏, 𝑏′⟩ connected by the partOf relation, and (ii) the set of mereological constraints for the part-of relation, including asymmetry constraints, lists of the several parts of an object, and restrictions that whole objects cannot be part of other objects and that part objects cannot be divided further into parts. Furthermore, the partial grounding 𝒢̂_LTN is defined on all bounding boxes of all the images in the training set, where both 𝑐𝑙𝑎𝑠𝑠(𝐶_𝑖, 𝑏) and the bounding box coordinates are computed by the Fast R-CNN object detector. 𝒢̂ is not defined for the predicate symbols in 𝒫 and is to be learned. A grounded theory 𝒯_RWFN ≜ ⟨𝒦, 𝒢̂_RWFN⟩ is defined analogously, where the partial grounding 𝒢̂_RWFN is described for predicates using Eq. (5). Thus, we can directly compare the performance of 𝒢̂_RWFN (Eq. (5)) and 𝒢̂_LTN (Eq. (2)).

Figure 2: Precision–recall curves for indoor object type classification and the partOf relation between objects. (a) RWFNs achieve similar performance for object type classification compared to LTNs, achieving an Area Under the Curve (AUC) of 0.772 (compared to 0.770). (b) RWFNs outperform LTNs on the detection of part-of relations, achieving an AUC of 0.647 (compared to 0.613).

4.2. Results

Our experiments mainly focus on the comparison of performance between our model and LTNs, but the figures also include results for Fast R-CNN [41] on type classification and the inclusion-ratio 𝑖𝑟 baseline on the part-of detection task. If 𝑖𝑟 is greater than a given threshold 𝑡ℎ (in our experiments, 𝑡ℎ = 0.7), then the bounding boxes are said to be in the partOf relation. Every bounding box 𝑏 is classified into 𝐶 ∈ 𝒫_1 if 𝒢(𝐶(𝑏)) > 𝑡ℎ. Results for indoor objects are shown in Fig. 2, where AUC is the area under the precision–recall curve. The results show that, for the part-of relation and object type classification, RWFNs achieve better performance than LTNs. However, there is some variance in the results because of the stochastic nature of the experiments. Consequently, we carried out five such experiments for each task, for which the sample averages and 95% confidence intervals are shown in Table 1. These results confirm that our model can achieve similar performance to LTNs for object type classification and superior performance for detection of part-of relations. In Table 1, we only included AUC numbers for RWFNs with weight sharing (last column) for object type classification because part-of detection only requires a single classifier.
The performance of RWFNs with weight sharing on the object type classification task (which requires 11 classifiers for indoor objects, 23 for vehicles, and 26 for animals) shows only a marginal gap compared to the other models, which demonstrates the effectiveness and efficiency of using a single shared encoder in RWFNs with weight sharing.

Table 1: AUC of T1 (object type classification) and T2 (detection of the part-of relation) for LTN, RWFN, and RWFN with weight sharing across label groups. MEAN±2×SD for all models. Best performances shown in bold.

Label-Task    LTN           RWFN          RWFN w/ W.S.
Indoor-T1     .769±.0314    .770±.0092    .773±.028
Indoor-T2     .619±.082     .648±.0621    —
Vehicle-T1    .709±.0289    .711±.0162    .706±.0111
Vehicle-T2    .576±.0355    .613±.0489    —
Animal-T1     .701±.024     .700±.024     .697±.0237
Animal-T2     .640±.0783    .661±.0364    —

As summarized in Appendix C, we also conducted ablation studies to assess the degree to which the AL–MB input transformation and the random Fourier features each contribute to the performance of the model. We have also included, in Appendix D, a detailed comparison of performance among LTNs, RWFNs, and RWFNs with weight sharing. Specifically, we compare LTNs and RWFNs in terms of numbers of learnable parameters and running times; we also compare RWFNs with and without weight sharing in terms of space complexity.

5. Conclusion and Future Work

In this paper, we introduced Randomly Weighted Feature Networks, which incorporate an insect-brain-inspired neuronal feature representation and unique random features derived from random Fourier features. The RWFN encoder acts as a generalized feature extractor with high relational expressiveness, while the learning model itself has a relatively simple structure. We demonstrated how insights from the insect nervous system can be applied to the fields of neural-symbolic computing and knowledge representation and reasoning for relational learning. Our work can be advanced in several ways. For one, RWFNs can be applied to other variants of SII problems proposed by Donadello and Serafini [42], and the performance of our model and LTNs on zero-shot learning in SII tasks can be compared. In addition, we plan to extend the application of RWFNs to tasks that need to extract structural knowledge not only from images but also from text, such as visual question-answering challenges. Furthermore, we will investigate how other methods from neuroscience for exploring biologically plausible learning algorithms might be applicable to our model. Finally, we will extend RWFNs to include a recurrent part for representing dynamic features of time-series data, similar to reservoir computing [43, 44, 45]; this approach may allow for extracting time-varying relational knowledge necessary for developing a framework for data-driven reasoning over temporal logic.

Acknowledgments

This work was supported in part by NSF SES-1735579.

References

[1] S. Seung, Connectome: How the brain’s wiring makes us who we are, HMH, 2012.
[2] O. Sporns, G. Tononi, R. Kötter, The human connectome: a structural description of the human brain, PLoS Comput Biol 1 (2005) e42.
[3] K. Eichler, F. Li, A. Litwin-Kumar, Y. Park, I. Andrade, C. M. Schneider-Mizell, T. Saumweber, A. Huser, C. Eschbach, B. Gerber, et al., The complete connectome of a learning and memory centre in an insect brain, Nature 548 (2017) 175–182.
[4] S.-y. Takemura, Y. Aso, T. Hige, A. Wong, Z. Lu, C. S. Xu, P. K. Rivlin, H. Hess, T. Zhao, T.
Parag, et al., A connectome of a learning and memory center in the adult drosophila brain, Elife 6 (2017) e26975.
[5] D. Hassabis, D. Kumaran, C. Summerfield, M. Botvinick, Neuroscience-inspired artificial intelligence, Neuron 95 (2017) 245–258.
[6] D. Koller, N. Friedman, S. Džeroski, C. Sutton, A. McCallum, A. Pfeffer, P. Abbeel, M.-F. Wong, D. Heckerman, C. Meek, et al., Introduction to statistical relational learning, MIT press, 2007.
[7] A. S. Garcez, L. C. Lamb, D. M. Gabbay, Neural-symbolic cognitive reasoning, Springer Science & Business Media, 2008.
[8] J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference, Elsevier, 2014.
[9] M. Nickel, K. Murphy, V. Tresp, E. Gabrilovich, A review of relational machine learning for knowledge graphs, Proceedings of the IEEE 104 (2015) 11–33.
[10] I. Sutskever, G. E. Hinton, Using matrices to model symbolic relationship, in: Advances in neural information processing systems, 2009, pp. 1593–1600.
[11] A. Bordes, J. Weston, R. Collobert, Y. Bengio, Learning structured embeddings of knowledge bases, in: Twenty-Fifth AAAI Conference on Artificial Intelligence, 2011.
[12] R. Socher, D. Chen, C. D. Manning, A. Ng, Reasoning with neural tensor networks for knowledge base completion, in: Advances in neural information processing systems, 2013, pp. 926–934.
[13] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, T. Lillicrap, A simple neural network module for relational reasoning, in: Advances in neural information processing systems, 2017, pp. 4967–4976.
[14] L. Serafini, A. d. Garcez, Logic tensor networks: Deep learning and logical reasoning from data and knowledge, arXiv preprint arXiv:1606.04422 (2016).
[15] M. Bergmann, An introduction to many-valued and fuzzy logic: semantics, algebras, and derivation systems, Cambridge University Press, 2008.
[16] A. Avarguès-Weber, N. Deisig, M. Giurfa, Visual cognition in social insects, Annual review of entomology 56 (2011) 423–443.
[17] B. van Swinderen, R. J. Greenspan, Salience modulates 20–30 hz brain activity in drosophila, Nature neuroscience 6 (2003) 579–586.
[18] J. Spaethe, J. Tautz, L. Chittka, Do honeybees detect colour targets using serial or parallel visual search?, Journal of Experimental Biology 209 (2006) 987–993.
[19] A. Avarguès-Weber, A. G. Dyer, M. Combe, M. Giurfa, Simultaneous mastering of two abstract concepts by the miniature brain of bees, Proceedings of the National Academy of Sciences 109 (2012) 7481–7486.
[20] A. Avarguès-Weber, M. Giurfa, Conceptual learning by miniature brains, Proceedings of the Royal Society B: Biological Sciences 280 (2013) 20131907.
[21] R. Menzel, Searching for the memory trace in a mini-brain, the honeybee, Learning & memory 8 (2001) 53–62.
[22] C. G. Galizia, Olfactory coding in the insect brain: data and conjectures, European Journal of Neuroscience 39 (2014) 1784–1795.
[23] M. Bazhenov, R. Huerta, B. H. Smith, A computational framework for understanding decision making through integration of basic learning rules, Journal of Neuroscience 33 (2013) 5686–5697.
[24] S. J. Caron, V. Ruta, L. Abbott, R. Axel, Random convergence of olfactory inputs in the drosophila mushroom body, Nature 497 (2013) 113–117.
[25] A. Rahimi, B. Recht, et al., Random features for large-scale kernel machines, in: NIPS, volume 3, Citeseer, 2007, p. 5.
[26] A. J. Smola, B. Schölkopf, Learning with kernels, volume 4, Citeseer, 1998.
[27] J. Zhu, T.
Hastie, Kernel logistic regression and the import vector machine, Journal of Computational and Graphical Statistics 14 (2005) 185–205.
[28] H. Drucker, C. J. Burges, L. Kaufman, A. Smola, V. Vapnik, et al., Support vector regression machines, Advances in neural information processing systems 9 (1997) 155–161.
[29] I. S. Dhillon, Y. Guan, B. Kulis, Kernel k-means: spectral clustering and normalized cuts, in: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 2004, pp. 551–556.
[30] F. Liu, X. Huang, Y. Chen, J. A. Suykens, Random features for kernel approximation: A survey in algorithms, theory, and beyond, arXiv preprint arXiv:2004.11154 (2020).
[31] P. Mobbs, The brain of the honeybee apis mellifera. i. the connections and spatial organization of the mushroom bodies, Philosophical Transactions of the Royal Society of London. B, Biological Sciences 298 (1982) 309–354.
[32] K. Inada, Y. Tsuchimoto, H. Kazama, Origins of cell-type-specific olfactory processing in the drosophila mushroom body circuit, Neuron 95 (2017) 357–367.
[33] F. Peng, L. Chittka, A simple computational model of the bee mushroom body can explain seemingly complex forms of olfactory learning and memory, Current Biology 27 (2017) 224–230.
[34] A. J. Cope, E. Vasilaki, D. Minors, C. Sabo, J. A. Marshall, A. B. Barron, Abstract concept learning in a simple neural network inspired by the insect brain, PLoS computational biology 14 (2018) e1006435.
[35] K. Endo, Y. Tsuchimoto, H. Kazama, Synthesis of conserved odor object representations in a random, divergent-convergent network, Neuron 108 (2020) 367–381.
[36] B. E. Boser, I. M. Guyon, V. N. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the fifth annual workshop on Computational learning theory, 1992, pp. 144–152.
[37] C. Cortes, V. Vapnik, Support-vector networks, Machine learning 20 (1995) 273–297.
[38] X. Glorot, A. Bordes, Y. Bengio, Deep sparse rectifier neural networks, in: Proceedings of the fourteenth international conference on artificial intelligence and statistics, JMLR Workshop and Conference Proceedings, 2011, pp. 315–323.
[39] D. J. Sutherland, J. Schneider, On the error of random fourier features, arXiv preprint arXiv:1506.02785 (2015).
[40] I. Donadello, L. Serafini, A. D. Garcez, Logic tensor networks for semantic image interpretation, arXiv preprint arXiv:1705.08968 (2017).
[41] R. Girshick, Fast r-cnn, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
[42] I. Donadello, L. Serafini, Compensating supervision incompleteness with prior knowledge in semantic image interpretation, in: 2019 International Joint Conference on Neural Networks (IJCNN), IEEE, 2019, pp. 1–8.
[43] A. A. Ferreira, T. B. Ludermir, Genetic algorithm for reservoir computing optimization, in: 2009 International Joint Conference on Neural Networks, IEEE, 2009, pp. 811–815.
[44] X. Sun, T. Li, Q. Li, Y. Huang, Y. Li, Deep belief echo-state network and its application to time series prediction, Knowledge-Based Systems 130 (2017) 17–29.
[45] X. Wang, Y. Jin, K. Hao, Echo state networks regulated by local intrinsic plasticity rules for regression, Neurocomputing 351 (2019) 111–122.
[46] S. Dasgupta, C. F. Stevens, S. Navlakha, A neural algorithm for a fundamental computing problem, Science 358 (2017) 793–796.
[47] Y. Aso, K. Grübel, S. Busch, A. B. Friedrich, I. Siwanowicz, H.
Tanimoto, The mushroom body of adult drosophila characterized by gal4 drivers, Journal of neurogenetics 23 (2009) 156–172.
[48] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, A. Yuille, Detect what you can: Detecting and representing objects using holistic models and body parts, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1971–1978.
[49] T. Tieleman, G. Hinton, Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural Networks for Machine Learning, 2012.
[50] T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, Optuna: A next-generation hyperparameter optimization framework, in: Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, 2019, pp. 2623–2631.

A. The Intuitions of RWFNs

For the bio-inspired representation in our model, we concentrated on implementing: (i) how to build random connections between the AL glomeruli (input layer) and the KCs (hidden layer), and (ii) how to guarantee hidden-layer sparsity to best differentiate one odor stimulus from another. We used a sparse, binary, and random matrix to define an arbitrary set of inputs for each KC in the model, with inspiration from Caron et al. [24], Peng and Chittka [33], and Dasgupta et al. [46]. In particular, in biological models of the insect brain and the AL–MB interface, the firing rates from 7 randomly selected glomeruli are passed to and summed by each KC [24, 32]. Furthermore, Endo et al. [35] developed a computational model of the sparsity of the KCs' output activity based on global inhibition by the average KC input. In their model, KCs output an intermediate result subject to global inhibition from the average glomerular input to all KCs. The final KC activity is then produced by thresholding the inhibited output through a ramp function, which is functionally equivalent to the Rectified Linear Unit (ReLU) activation function [38]. Thus, the output of the 𝑗th KC, KC_out_𝑗, in the computational model is described as:

KC_out_𝑗 = 𝜙(KC_in_𝑗 − 𝐶 ⋅ (1/𝑁_KC) ∑_𝑗 KC_in_𝑗)    (6)

where KC_in_𝑗 indicates the weighted sum of input from 7 random indices of the input vector, 𝜙 is the ReLU, 𝐶 is the strength of global inhibition, and 𝑁_KC is the total number of KCs. The parameters 𝐶 = 1.0 and 𝑁_KC = 2000 were chosen to match the values best calibrated to real KC responses [35, 47]. With this KC representation, Endo et al. [35] trained a linear decoder to successfully classify “group” from “non-group” odors. Similarly, we make use of Eq. (6) and train a linear model for learning latent relationships among inputs. For the other hidden representation using random Fourier features in our model, based on Eq. (1), we can define a decision function 𝑓(x), given a dataset of 𝑁 data samples x_𝑛 ∈ ℝ^𝑑 and a randomized feature mapping z ∶ ℝ^𝑑 → ℝ^𝐷, as follows:

𝑓(x) = ∑_{𝑛=1}^{𝑁} 𝛼_𝑛 𝑘(x_𝑛, x) = ∑_{𝑛=1}^{𝑁} 𝛼_𝑛 ⟨𝜙(x_𝑛), 𝜙(x)⟩ ≈ ∑_{𝑛=1}^{𝑁} 𝛼_𝑛 z(x_𝑛)^⊤ z(x) = 𝛽^⊤ z(x)    (7)

This indicates that if z(⋅) can approximate 𝜙(⋅) well, we can simply map our data using z(⋅) and then use a linear model to learn, because both 𝛽 and z(⋅) in the above equation are 𝐷-vectors. Therefore, the remaining task is to find a random projection function z(⋅) that can approximate the corresponding nonlinear kernel machine appropriately. The reason we leverage random Fourier features is the conciseness and efficiency with which they compute linear interactions among inputs, which can replace the bilinear model in Eq. (2).
In Eq. (2), the bilinear tensor is used to compute the relation, which seems intuitive because each slice of the tensor is responsible for one type of relation. However, this computation requires a high computational cost and a large number of parameters. In contrast, the random Fourier features in Eq. (7) can perform a similar task with a much faster learning process and fewer parameters. Considering how Eq. (6) and Eq. (7) can be used in our model, the hidden representation can be expressed as the concatenation of KC_out and z(⋅).

B. Details of Experiments

Hardware specification of the server The hardware specification of the server that we used for the experiments is as follows:
• CPU: Intel® Core™ i7-6950X CPU @ 3.00GHz (up to 3.50 GHz)
• RAM: 128 GB (DDR4 2400MHz)
• GPU: NVIDIA GeForce Titan Xp GP102 (Pascal architecture, 3840 CUDA Cores @ 1.6 GHz, 384 bit bus width, 12 GB GDDR G5X memory)

Source codes All source code, trained models, and figures in this paper are available at https://github.com/jyhong0304/SII.

Datasets The PASCAL-Part dataset [48] and ontologies (WordNet) are chosen for the part-of relation. The PASCAL-Part dataset contains 10103 images with bounding boxes. They are annotated with object types and the part-of relation defined between pairs of bounding boxes. There are three main groups of labels—animals, vehicles, and indoor objects—with their corresponding parts and the “part-of” label. There are 59 labels (20 labels for whole objects and 39 labels for parts). The images were split into a training set with 80% of the images and a test set with 20% of the images, maintaining the same proportion of the number of bounding boxes for each label. Given a set of bounding boxes detected by an object detector (Fast R-CNN), the task of object classification is to assign an object type to each bounding box. The task of part-of detection is to decide, given two bounding boxes, whether the object contained in the first is a part of the object contained in the second.

Hyperparameter Setting To compare the performance of RWFNs and LTNs, we trained the two models separately. For LTNs, we configured the experimental environment following Donadello et al. [40]. The LTNs were configured with a tensor of 𝑘 = 6 layers. For RWFNs, the number of hidden nodes is 𝐵 = 200 for an object type classifier. In addition, we set the number of hidden nodes of the classifier for part-of detection to twice the number of hidden nodes 𝐵 of a classifier for object classification, i.e., 400, because the input dimension for the classifier detecting the part-of relation is twice as large as the input space required for a classifier for object categorization. Following Donadello et al. [40], both models make use of a regularization parameter 𝜆 = 10^{−10}, Łukasiewicz's 𝑡-norm (𝜇(𝑎, 𝑏) = max(0, 𝑎 + 𝑏 − 1)), and the harmonic mean as an aggregation operator. We ran 1000 training epochs of the RMSProp [49] learning algorithm available in TensorFlow for each model.

Hyperparameter Searching for RWFNs To find the best number of hidden nodes 𝐵, we used the Optuna framework [50] with 500 iterations over the range [64, 512]. The Optuna framework allows us to dynamically construct the parameter search space because we can formulate hyperparameter optimization as the maximization/minimization of an objective function that takes a set of hyperparameters as input and returns a validation score; a minimal sketch of such an objective is shown below.
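The following minimal Optuna sketch illustrates this formulation. The helper `train_rwfn_and_score` is a hypothetical stand-in for training an RWFN with a given number of hidden nodes and returning its validation AUC; it is not a function from our released code and here returns a dummy value so the sketch runs end to end.

```python
import optuna

def train_rwfn_and_score(num_hidden):
    """Hypothetical stand-in: train an RWFN with `num_hidden` hidden units and
    return its validation AUC. Returns a dummy score here so the sketch runs."""
    return 0.5 + 0.001 * (num_hidden % 10)   # placeholder, not a real AUC

def objective(trial):
    # Search the number of hidden nodes B over the range [64, 512]
    B = trial.suggest_int("num_hidden", 64, 512)
    return train_rwfn_and_score(B)

study = optuna.create_study(direction="maximize")   # maximize the validation score
study.optimize(objective, n_trials=500)              # 500 iterations, as in our setup
print(study.best_params)
```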
In our case, the validation score returned was the test AUC value. Furthermore, Optuna provides efficient sampling methods, such as relational sampling that exploits the correlations among the parameters.

C. Ablation Studies

Table 2 shows the results of ablation studies. In order to show how much the two hidden representations – the AL–MB input transformation and the random Fourier features – each contribute to the performance of our model, we built two separate RWFN models: one using the AL–MB input transformation only and another using random Fourier features only. We then performed five experiments and averaged the AUCs of each model for object classification and part-of detection. The dimension of 𝛽 for each model was set to the same as the number of hidden nodes of the original RWFN, which is 200.

Table 2: AUC of T1 (object type classification) and T2 (detection of the part-of relation) for the AL–MB representation (AL–MB) and random Fourier features (RFF) across label groups. MEAN±2×SD for all models. The best performance is displayed in bold.

Label-Task    AL–MB        RFF
Indoor-T1     .743±.021    .766±.012
Indoor-T2     .525±.102    .641±.010
Vehicle-T1    .710±.017    .715±.009
Vehicle-T2    .612±.027    .572±.079
Animal-T1     .705±.017    .709±.013
Animal-T2     .664±.069    .646±.020

For the object type classification and part-of detection tasks using the Indoor labels, the random Fourier features outperform the AL–MB input transformation. On the other hand, the AL–MB input transformation shows better performance than the random Fourier features for part-of detection using the Vehicle and Animal labels. Therefore, these ablation studies show that the model architecture of RWFNs in Eq. (5) can fully utilize both hidden representations, which compensate for each other and together contribute to the good performance shown in Table 1.

D. Performance Analysis

Relative Complexity of RWFNs and LTNs To better appreciate the relative performance of RWFNs and LTNs, we can compare the number of parameters for grounding a unary predicate in each model. The dimension of the input in the dataset for both RWFNs and LTNs is 𝑛 = 64. As shown in Eq. (2), the parameters to learn in LTNs are {𝑢_𝑃 ∈ ℝ^𝑘, 𝑊_𝑃^{[1:𝑘]} ∈ ℝ^{𝑛×𝑛×𝑘}, 𝑉_𝑃 ∈ ℝ^{𝑘×𝑛}, 𝑏_𝑃 ∈ ℝ^𝑘}, where 𝑘 = 6 following the configuration of the LTNs. Thus, the number of parameters in LTNs is (𝑛² + 𝑛 + 2) ⋅ 𝑘 = (64² + 64 + 2) ⋅ 6 = 24972. On the other hand, from Eq. (3) and Eq. (4), the parameters in RWFNs are {W ∈ ℝ^{𝑛×𝐵}, R ∈ ℝ^{𝑛×𝐵}, b ∈ ℝ^𝐵, 𝛽 ∈ ℝ^{2𝐵}}, where 𝐵 = 200 following the configuration of the RWFNs. Therefore, the total number of parameters in RWFNs is (2𝑛 + 3) ⋅ 𝐵 = (2 ⋅ 64 + 3) ⋅ 200 = 26200. Although our method requires more space than LTNs (26200 > 24972), the parameters {W, R, b} in RWFNs are randomly drawn, fixed weights. Thus, it is also necessary to compare the number of learnable parameters across the two models. All of the above parameters of LTNs must be adaptable, whereas the only parameters to learn in RWFNs for object type classification are 𝛽 ∈ ℝ^{2𝐵}. Thus, the number of learnable parameters is 400, which is much smaller than that of LTNs. This means that the ratio of the numbers of parameters to learn is about 400 ∶ 24972 ≈ 1 ∶ 62.
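For reference, the counts above can be reproduced with a few lines of Python; all values (𝑛 = 64, 𝑘 = 6, 𝐵 = 200) are taken from the configurations described in this appendix.

```python
# Parameter counts for grounding a unary predicate, using the values from the text
n, k, B = 64, 6, 200

ltn_params = (n**2 + n + 2) * k        # u_P, W_P, V_P, b_P are all learnable -> 24972
rwfn_total = (2 * n + 3) * B           # W, R, b (fixed, random) plus beta     -> 26200
rwfn_learn = 2 * B                     # only beta in R^{2B} is trained        -> 400

print(ltn_params, rwfn_total, rwfn_learn, ltn_params / rwfn_learn)  # ratio ~ 62
```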
Consequently, the non-adaptable parameters in RWFNs can have significant power to represent the latent relationships among objects, so the model can efficiently extract relational knowledge even though it uses fewer adaptable parameters. Furthermore, the number of learnable LTN parameters depends heavily on the number of input features, whereas the number of learnable parameters in RWFNs is independent of the number of features. In principle, this could allow the learning process in our model to be accelerated if the feature representation from the encoder model is pre-processed and stored.

Figure 3: Comparison of running time (sec), including data configuration time and training time, for LTN and RWFN.

Running Time Fig. 3 depicts the comparison of running time, including data configuration time and training time, for LTNs and RWFNs. The running time of RWFNs is roughly half that of LTNs. This is because the number of learnable parameters in RWFNs is far smaller than in LTNs, and RWFNs learn linear models, which are much simpler than the models used in LTNs.

Space Complexity of RWFNs with Weight Sharing Weight sharing is a unique feature of RWFNs that can greatly reduce the necessary space complexity when multiple classifiers are used simultaneously. In the depicted case of learning 𝑖 classifiers in Fig. 1b, the space complexity for RWFNs without weight sharing is (2 ⋅ 𝑛 ⋅ 𝐵 + 3 ⋅ 𝐵) ⋅ 𝑖 = (2𝑛 + 3) ⋅ 𝐵 ⋅ 𝑖 ≈ 𝑂(𝑖 ⋅ 𝐵 ⋅ 𝑛). However, with weight sharing, RWFNs achieve a much better space complexity of 2 ⋅ 𝑛 ⋅ 𝐵 + 𝐵 + 2 ⋅ 𝐵 ⋅ 𝑖 ≈ 𝑂(𝐵 ⋅ 𝑛), because 𝐵 ⋅ 𝑖 < 𝐵 ⋅ 𝑛 for the experiments conducted in the SII task (it can also be 𝑂(𝑖 ⋅ 𝐵) for a different task). This indicates that one of the factors that most strongly influences the space complexity of the original RWFNs becomes negligible when using weight sharing, which makes the model cost efficient and economical.
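As a rough worked example, assuming the object type classification setting reported above (𝑛 = 64 input features, 𝐵 = 200 hidden units) and the indoor-object case of 𝑖 = 11 classifiers, the two expressions evaluate as follows; the concrete numbers are our own illustration of the formulas rather than reported measurements.

```python
# Rough worked example of the space-complexity expressions above
n, B, i = 64, 200, 11   # input features, hidden units, number of indoor-object classifiers

without_sharing = (2 * n + 3) * B * i        # one full encoder per classifier -> 288200
with_sharing = 2 * n * B + B + 2 * B * i     # one shared encoder + i decoders ->  30200

print(without_sharing, with_sharing, without_sharing / with_sharing)  # ~9.5x smaller
```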