<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Alternative implementations of the Auxiliary Duplicating Permutation Invariant Training</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jens Gulin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karl Åström</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Vision and Machine Learning, Centre for Mathematical Sciences, Lund University</institution>
          ,
          <addr-line>Lund</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sony Europe B.V., Lund, Sweden Software Program (WASP) funded by the Knut and Alice Wallenberg Foundation</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>4</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>Simultaneous sound event localization and detection (SELD) for multi-source sound events is an open research ifeld. The Multi-ACCDOA format is a popular way to handle activity-coupled sound events where the same class occurs at multiple locations at the same time. An important part is the Auxiliary Duplicating Permutation Invariant Training (ADPIT) paradigm that calculates the loss for order-agnostic output. The baseline system for the DCASE SELD challenge 2024 has an implementation of ADPIT. In this paper we discuss alternative ways to implement ADPIT with the goal to reduce multiplications, to make the equivalent calculations faster. ADPIT duplicates output when there are fewer events than tracks. A brief discussion how this difers from permutation invariant training without duplicated output is also included. The loss calculations are likely not the execution bottleneck in the current challenge setup, but ADPIT scales poorly for an increased number of tracks and improved eficiency is thus of general interest for audio localization.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        This work presents diferent ways to implement ADPIT [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and a discussion on their computational
requirements. For this paper, we consider the problem in abstraction and for no specific computational
platform. Because of that, there is neither implementation nor formal performance analysis included.
Although it may be considered a work-in-progress, we want to present the discussion and welcome
input from the community.
      </p>
      <p>
        Although zeroth-order optimization [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is advancing, stochastic gradient descent (SGD) based methods
like Adam [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] are the default choices. Either way, the loss function is iterated repeatedly during training.
The discussion will refer to branch free loss calculation, meaning a vectorizable implementation not
branching of depending on data content. This is not to be confused with [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] using “without branching”
to describe a unified SELD network, with sound detection, classification and localization in the same
lfow.
      </p>
      <p>
        Time complexity is commonly discussed in terms of asymptotic cost () when the number of samples
increases [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. From that point of view, the number of tracks (or slots, as we will call them) is fixed, and
eficiency in terms of tracks is just afecting the coeficient and not  complexity. On the other hand,
having a large constant factor may be prohibiting to both large and small sample sizes, and calculations
wasted are good for nothing. This is the motivation to discuss how to reduce the cost per track. With
control of that cost, we argue that also higher number of tracks should be explored.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>This section provides terminology and an in-depth introduction to ADPIT, as well as a brief discussion
of related work.</p>
      <sec id="sec-2-1">
        <title>2.1. Problem formulation</title>
        <p>Consider the actual problem as the eficient execution of the loss function during permutation invariant
training of a SELD network. To simplify discussion, we will describe a more abstract task and formulate
the problem more generally.</p>
        <p>Assume that we have recordings of an environment containing the sounds of events and
nonevents, together with a (fairly accurate) ground truth (GT) list of these events (as provided from oracle
knowledge). Although training is meant to give the network the generalized ability to separate them, the
environment-specific semantic diference of events and non-events is disregarded here: What makes it
an event is simply that it is listed as GT. At any time, there may be no event active, one event, or multiple
events overlapping fully or partially. The events have a start and end time, a source position in space,
and an audio class. The classes are predefined for the environment, e.g. female speech, clapping and
music. The time may be divided into discrete frames, so that the event duration is instead represented
as a number of consecutive frame-wise events.</p>
        <p>The network can be any entity that takes input, say raw audio, and produces a set of tokens as output.
Each token represents an estimation of one event inferred from the input, we will however reserve the
word event for GT and reserve token for the estimates. The loss function evaluates the output quality
compared to the GT, and provides feedback (gradients) to guide the training. The internal representation
of the token and the event may be diferent, but the loss function is able to compare them.</p>
        <p>The output is arranged in a fixed number of  slots. All slots need to be filled, when the network
infer fewer tokens, slots are marked empty by the zero (⊘ ) token. As mentioned, the output is a set,
where order is disregarded, but in practice the output slots are ordered. For permutation invariant
training (PIT), the problem is to find the best way to map each output to a distinct event from GT,
ifnding the sequence of  events that best matches the ordered output. With  ≤  actual GT events
in a scope, each  will introduce a permutation category, and we label the events accordingly; ⊘ , {A0},
{B0, B1}, etc.</p>
        <p>
          We can assume that the loss function directly compares one token to one event, even if multiple
entries are input together. A further generalization is that the loss operates independently within
a scope, where a scope may be e.g. frame-wise or frame-class-wise. That is, when scoped for both
frame and class, there is no need to evaluate a token towards another frame or another class. Finally,
batches are processed together, and the combined loss is the average over all samples. Batching allows
for eficient vectorization with limited memory usage and it also provides a stochastic regularization
smoothing gradients.
2.2. ADPIT
This section introduces the work of Shimada et al. presented in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Everything here reference their
paper to provide the background needed, only rephrased slightly to fit the discussion. Any commentary
given is that of our own.
        </p>
        <p>The ADPIT method is described as PIT with auxiliary duplication into empty slots. Our interpretation
is that duplication introduces identical copies of enough events to fill all slots, while auxiliary means
that copying is meant to support the result. Equivalently, the network is trained to produce close to
identical copies, while free to chose which tokens to duplicate and how similar they would be in any
way that could support training the most. Compared to PIT, the experimental results indicated that
ADPIT made a large improvement. Our hypothesis is that the duplication gives the network better
learning possibilities through redundancy. The Multi-ACCDOA format extends the Single-ACCDOA
with multiple slots per class. The paper claims that this gave better results for cases where there were
multiple active events from the same class, and did not worsen the normal case.</p>
        <p>ADPIT scope is frame-based processed class-wise. There are a fixed number of  output slots1 per
class  for each frame , and  is the number of active events of that class in the current frame. Due
to duplication, there should be no ⊘ -tokens unless there are no estimates at all for the scope.
1Called tracks in reference, but there is no tracking over time so we call them slots.</p>
        <p>Their eq. 6, here slightly reformulated, gives an upper limit of number of permutations possible for a
scope:
^(,  ) =
{︃  ×  −  if  &gt; 0,
1
if  = 0,
(1)
where × is scalar multiplication,   counts the number of (partial) permutations of  events onto
 slots without repetition. The exponential  −  counts all combinations of the  events copied
into the (possibly) remaining  −  slots. We rename it ^(,  ) to clarify that it is an upper limit
and show that the function is the same regardless how the variables are scoped. They acknowledge that
the unique number of sequences may be lower than ^, since (1) does not account for that a duplicate is
indistinguishable from the original.</p>
        <p>In the example of  = 3 slots and  = 2 events, calculation yields ^(3, 2) = 12, but the actual
number of permutations is 6. Table 1 lists all twelve variants grouped into six cells to show this more
clearly. We refer to indistinguishable sequences (variants within a cell) as aliasing and they should be
processed as only one in an eficient implementation.</p>
        <p>B0 B0 B1 B0 B1 B0 B1 B0 B0 B1 B1 B0 B1 B0 B1 B0 B1 B1</p>
        <p>B0 B0 B1 B0 B1 B0 B1 B0 B0 B1 B1 B0 B1 B0 B1 B0 B1 B1</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Baseline implementation</title>
        <p>The baseline system for the DCASE SELD challenge 2024 has an implementation2 of ADPIT. In overview
it has several steps: 1) preprocess GT and introduce a dummy dimension that separates the events
based on  , 2a) at training, load preprocessed scope and calculate loss for each possible sequence, for
all categories at the same time, 2b) pick the minimum loss and use that for gradient calculation and
back-propagation, 3) at inference, identify and discard duplicate output.</p>
        <p>The implementation is scoped for  = 3 slots for each class and time frame. GT is preprocessed to do
the proper scoping and at the same time  is calculated. Step 1 applies  = 6 dummy events. The idea
is to give the entire set {A0, B0, B1, C0, C1, C2} in parallel. For a specific scope, only one  can be the
actual target and in result only one of the subsets A, B or C are active, while the other dummy entries
are all zeros. In step 2, the events of A, B and C are permuted separately. The number of permutations
of B is correctly set to 6. With 1 for A and 6 for C, the total is 13 sequences per scope, again processed
together. Now, if not cared for, the inactive dummy entries could severely disrupt the assumption that
the lowest loss is the best matching sequence. To regulate the numerous all-zero entries, an overlay
strategy ensures that these zero-entries are converted into existing sequences that do not interfere.
All sequences and the overlays are shown in Table 2. Note that when  = 0, we can see that every
row will turn into all zeros, which is an appropriate target for ⊘ -events, and this category requires no
additional logic.</p>
        <p>The purpose for processing also “useless” dummy losses is to make a branch free implementation,
which allows full vectorization and automatic gradient calculations: Each scope is processed exactly
the same, regardless of the data. The overlay strategy is however not optimal, and Section 3.2 proposes
another way. The step 3 has a fixed angular threshold and does not rely on event-pair permutations
(since no GT knowledge is available). It might be improved still, but this step is not part of the learning
loop, thus less of a priority here.</p>
        <p>In this context, the exact implementation of the loss function is not important. For the sake of
completeness, the loss function is the mean squared error (MSE) of the estimated position and the event
position. Since the scope is class-based, the loss function can disregard class. In the end, the loss will
still penalize a token with no matching event of that class or an event with no matching token.</p>
        <p>
          Note: The baseline implementation also has a function to calculate various metrics against GT, for
logging during training or standalone evaluation. The linear_sum_assignment function from SciPy
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], is used to find the best token-event pairing. The implementation is not optimized for the specific
problem and reusing the token pairing (algorithm) from the loss function might be beneficial, but this
is left for future study.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.4. PIT and Related Work</title>
        <p>
          Permutation invariant training has been used for diferent tasks, such as speaker separation [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and
SELD with track-wise loss [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. ADPIT is in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] both used for the usual frame-class-scope SELD and, in
a novel setting, for training a multi-source time-of-arrival feature extractor.
        </p>
        <p>
          Relying on exhaustive permutation and taking the sequence with minimum loss, [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] mentions  !
sequences. A naive permutation strategy is however  ! and to avoid that for  &lt;  , some slots
need to be discarded. The auxiliary autoencoding PIT (A2PIT) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], providing fault tolerance for speech
separation, replaces empty targets with the original input. From a permutation perspective, any unique
encoding for empty targets is however equivalent. It is clarified that A2PIT first identify the assumed
valid (non-empty) outputs and then add (empty) or remove (valid) slots randomly to get exactly as many
as the  real targets. This is simple, but not a perfect strategy, as random selection does not guarantee
best fit. A possibility is that the loss function still does  ! permutations, and that  is used only for
the balance factor between valid and empty pairs.
        </p>
        <p>Even  ! sequences of PIT are often fewer than exhaustive permutations for ADPIT. With ADPIT,
the slots are potentially encoded in many diferent representations, adding permutation complexity. It
is also more dificult to filter out the “empty” ADPIT slots, for the same reason. In result, PIT handles
any  within the same permutations and no extra care is needed for a branch free implementation. It
would be interesting to see more papers compare if the ADPIT complexity is worthwhile.</p>
        <p>
          Exhaustive permutation is a naive way to solve the linear assignment problem, or equivalently to find
the minimum weight matching in bipartite graphs. The weights are pre-calculated as a cost matrix with
 rows and  columns. The Hungarian algorithm [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] (also known as Kuhn–Munkres) originally
solved this for an  square matrix in ( 4). Variants, such as the modified Jonker-Volgenant algorithm
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] implemented in SciPy [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ], reach ( 3) and handle  ̸= . The implementation is not branch
free, and since  ! &lt;  3 until  = 6 we rely on exhaustive summation for now. The auctioning
algorithm is according to [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] rarely faster and the worst case sometimes very slow, but it may be better
suited for parallelization and well-structured problems. In addition, for branching algorithms, the best
or average case is often faster than the worst case.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Proposals: K(N,M), ARMPIT and BRADPIT</title>
      <p>This section presents a number of simple and elegant implementation strategies for ADPIT. We
collectively name the branch free ones, especially the combination of them, ADPIT with Reduced Multiplications
(ARMPIT). In addition, we believe that a branching approach may be a suitable implementation, and
present also Branch Remedied ADPIT (BRADPIT). First step is to clarify the permutation requirements.</p>
      <sec id="sec-3-1">
        <title>3.1. Number of sequences</title>
        <p>The area of combinatorics is a well developed study of permutations and combinations within
mathematics. The interpretation provided for (1) uses some commonly understood rules, but a broader search
also provides the exact count we need. A table of thirty conceptual counts [12, page 26] identifies our
problem as Count B3, providing the exact number of sequences,</p>
        <p>(,  ) =  ! (,  )
for 1 ≤  ≤</p>
        <p>
          =0
= ∑︁(− 1)− ︂(  )︂  =  ! ∑︁ (− 1)−   ,

=0 ( − )! !
where (,  ) counts the number of ways to arrange all  outputs into  unlabeled groups, with
no group empty. The factorial  !, alternatively   , then counts the ways to label the  groups (as
 events). (,  ) are known as the Stirling numbers of the second kind [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], defined as the number
of ways to partition a set of  elements into  non-empty subsets. The On-Line Encyclopedia of
Integer Sequences (OEIS) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] provides values for the Stirling numbers in entry A008277, as well as the
pre-calculated (2) and its maximum value for each N in entries A019538 and A002869 respectively. To
provide an understanding of the equation beyond Stirling numbers, (3) is an adaptation from [15, eq.
6.19] and the last step follows from the definition of the binomial coeficients. Note that the factorial is
absorbed by the binomial in (2) and from symmetry some references substitute  =  − .
        </p>
        <p>
          The case with  = 0 slots has no practical application, but  = 0 events occur whenever there is
“silence”. For these cases, (, 0) = 0 and there are indeed no ways to label the output if there are no
events. However, following (1), originally from [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], and defining
        </p>
        <p>(,  )
for  = 0,  &lt;</p>
        <p>
          = 1
(,  ) =   ,
for 1 ≤  &lt; 
(2)
(3)
(4)
(5)
makes sense, considering this as a special case where each output slot is expected to approach the
zeroevent ⊘ . The number of slots should be chosen to exceed any expected number of events. Whenever
 &gt;  , there are too few slots to output all events, and thus no slots for duplicates. PIT can still score
a (best subset) loss for the outputs given. The number of sequences are then Count B0 [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ],
which concludes all cases.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Optimal overlay</title>
        <p>In the baseline implementation, the overlays are not used to their full potential. Instead of just
safeguarding against empty entries, each overlay can permute one  category, see example in Table 3. The
order of sequences within each overlay does not matter here, but it may be noted that the Overlay B
arrangement is chosen for the purpose of upcoming Section 3.3.</p>
        <p>The optimal overlay strategy has a certain beauty for the  = 3 case, since two of the three cases
are exactly six entries. For other cases, one permutation category will dominate, and as for Overlay A
there will be redundancy. Yet, this strategy is a simple drop in replacement that reduces loss iterations
from the sum of all categories to the count of the largest category. For the implemented case, this
halves the work instantly. We would also suggest moving all of the overlay calculations out of the
training loop. Instead of passing six dummy events from preprocessing, provide the six sequences for
the relevant category. Make sure to store the data in an eficient format, so that managing the increased
data footprint is not slower than constructing the sequences on the fly.</p>
        <p>An additional insight is that the loss evaluation returns the gradients and the loss value, but not the
event permutation order. Doing so would be slightly more complicated with this more compact overlay
format. It is crucial to keep track of the calculated gradients, but the min function is a well established
pooling function that should not pose any problem.</p>
        <p>As for the baseline implementation, the loss function would evaluate each slot against the given
event sequence and sum the components. The time complexity of this strategy is mainly determined by
the distance calculations and matrix multiplications involved in evaluating the loss for a sequence. This
approach evaluates 6 × 3 = 18 pairs instead of 13 × 3. The reduced number of sequences is a great
improvement, but even for the Overlay C category, each token-event pair is evaluated twice.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Distance first</title>
        <p>Although the permutations help make sure that all token-event pairs are evaluated, separating the
enumeration of pairs from the permutation of events is our next approach. There are  ×  = 9
token-event pairs for the worst category. Lining up these pairs for vectorized calculation then turns the
loss of each permutation into simple sums.</p>
        <p>To make this a branch free algorithm, we break down the sums formed by the overlays in Table 3
and calculate each component separately. The sequences are reformulated in Table 4 using dummy
event sums. There are six unique sums X, and when paired with the matching slots of the sequence
S0S1S2, it turns out to be twelve unique pairs. This is slightly higher than the “worst category” nine,
and a trade-of for being branch free. Due to an intentional symmetry-breaking ordering chosen, one
sequence (in bold, X4X5X0) is entirely made up from pairs that are not reused. Thus the row can be
handled as a sequence loss rather than the separate pairs, but from our complexity perspective that still
calculates 12 loss-pairs.</p>
        <p>Remember that sequence X4X5X0, and thus ○ 12 ○ 11 ○ 10 , is an overlay of three sequences where only
one is active. We can replace it with two sequences that focus on each category instead. The C2C0C1
sequence is also X3X1X2, which is available for reuse in pairs ○ 9 , ○ 7 and ○ 5 . When category C is active
that concludes the needed permutations, for category B it produces B1B0B0 which is a permutation
already covered (by the last sequence). Calculating B1B0B0 twice is wasteful, but harmless since we
reuse existing pairs. In the same way, A0A0A0 is an obvious repetition, and we will ignore category A
henceforth. The B0B1B1 sequence is still needed to close the category B permutations. The X1X3X3 is
not possible as it would produce the C0C2C2 sequence, which is not a category C permutation since
it allows duplication instead of the true C1 event. Instead we propose X1X3X0 (giving B0B1B1 and
redundant C0C2C1), i.e. pairs ○ 1 and ○ 4 reused and only ○ 10 used only once. In total this makes ten loss
iterations and seven sums to find the min loss. Instead of six dummy events sums, this strategy only
uses four (discarding X4 and X5), which can be pre-processed. We have not found a way to reach nine
loss iterations in a completely branch free manner.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Branching benefits (BRADPIT)</title>
        <p>So far, we focused on the loss calculation for a single scope. Since Multi-ACCDOA has a fixed  for
each scope, loss is calculated for many slots even if it is unlikely to have overlapping events. With
many classes, the most likely GT is zero events for most of them, possibly one event for a few and
in rare cases overlapping events of one class. The statistics of course depends on the dataset, but the
general distribution is likely biased towards sparsity. An approach based on branch free processing
needs to cover the worst case, even with an eficient implementation. There should be an advantage
for an implementation that can recognize the majority of scopes as category A or ⊘ and avoid any
permutation cost for them.</p>
        <p>If we know the ground truth event number M, can we branch out the diferent categories without
breaking the gradients? Will it lead to faster learning? Theoretically, a dataset biased towards the best
case should benefit from an implementation with ability to adapt to  . Analysis for a specific use-case
is needed in order to know if there is a significant improvement to learning time. Any improvements
are still to the coeficient, not the  time complexity. The benefit however grows when  grows, as
long as the dataset is biased towards lower  .</p>
        <p>In pre-processing, pass on the specific value of  , it may replace the “active” field currently in the
baseline implementation ( &gt; 0 is active). A first approach simply finds the maximum  in the
current batch and branches to an eficient implementation for that maximum. If a batch has only scopes
of  &lt; 2, the entire batch can have a permutation free loss calculation. From a gradient perspective,
processing the entire branch the same way is essentially branch free. Note that the variation of training
data within a batch should reflect the stochastic nature of the dataset. Batching samples strategically to
achieve higher  consistency would impact learning negatively. Sorting within a batch just before loss
calculation is however not changing the results, which leads to a second approach.</p>
        <p>A batch is a number of samples grouped together to produce a common loss and a joint gradient.
The loss is however calculated for several similar parts (e.g. individual scopes) and then joined (the
average). Grouping the parts into sub-batches based on  can improve speed without afecting the
total batch average. Since each time frame can have multiple scopes with diferent  , this strategy
opts to separate them even if that breaks the benefit of continuous memory layout. Most importantly,
the  &lt; 2 categories are handled together, in the same way as above. The remaining few scopes can
then be grouped together, also based on the maximum as above. The diference to the first approach
may seem small, but it guarantees that the majority of the scopes gets the optimal implementation.
If  is high or the batch size is large, more than two groupings may be beneficial. Groups that are
“downwards compatible” to any lower  can accept spill-over from lower categories, which may help
balance vectorization. When a group is instead completely homogeneous, the implementation may be
faster than one that caters to more categories. The earlier example for  = 3 has six pairs for  = 2,
nine pairs for  = 3, but ten pairs for the combined implementation. A diference this small is however
not worth pursuing, unless there are plenty of iterations in each group.</p>
        <p>So far, the sub-batches are processed separately, only joined for the averaging step. The third approach
instead follows a map-reduce strategy to join the steps that can be vectorized regardless of category:
Once the token-event pairs have been formed (based on  ), the MSE calculation is generic; with a
list of templates referring to the MSE, the sums are generic. In a balanced manner, some steps may
combine all categories, others split categories. This approach is probably not useful unless a well
balanced implementation can be found. It is not obvious that a high  would make that more likely,
the imbalance is rather increasing from more diversity.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>The loss execution cost is approximately , the total number of token-event pairs evaluated in an
epoch, as in (6).  may be diferent in each scope. If the pairs are not evaluated individually, the
number depends on the number of permutation sequences, (8). With a frame-class scope, the number
of scopes is (7).</p>
      <p>
        =  ×  
 =  ×  
 =  × 
(6)
(7)
(8)
A theoretic execution cost of these loss implementations on a fictive dataset similar to STARSS23 [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
is shown in Table 5. There are 13 classes, 200k frames and no frames have  &gt; 5. Only 30k of the
frames have  &gt; 1, and even there it is unlikely that the overlapping events are of the same class.
BRADPIT results depend on how many scopes that have  &gt; 1 and we present two approximations.
The optimistic value assumes that each scope can be calculated as a single sequence. This is not true,
but even if we assume that each frame with overlap afects one scope, it only adds the cost for 30k
scopes to the pessimistic . The pessimistic  is unrealistic for frame-class scope, since not
all overlapping events are of the same class.
      </p>
      <p>The Multi-ACCDOA format has multiple slots per class. Since overlapping events of the same class is
rare, nearly all slots are empty or duplicates. If the scope is instead frame-wise, and the output format
and loss function can handle tokens that combine class and location, selecting  = 5 is an option.
With the baseline implementation, the cost for that may seem prohibitively large. With ARMPIT or
BRADPIT however, Table 5 indicates that  = 5, from the perspective of loss iterations, may be better
than  = 3. We have not done a full analysis of the  for  = 5, and these results should be
considered speculative. In particular, ARMPIT “dist. first” has 5 × 5 worst category token-event pairs,
but due to insights from Section 3.3, that number is quadrupled to give a conservative estimate that
allows for a branch free loss.</p>
      <p>The optimistic BRADPIT value for  = 5 again assumes that each scope can be calculated as a single
sequence. This is not realistic, in fact the actual  &gt; 1 count is 30k and the pessimistic number takes
that into account. We consider it pessimistic as the ARMPIT pairs are likely higher than needed, but
this time the reality is likely closer to pessimism. The reason is that the frame-scope already removed
the empty scopes, and the penalty for higher  grows fast. Also note that the summing of sequences is
disregarded in the metric, but that is increasingly costly when  grows.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>How permutations are evaluated for an ADPIT loss implementation can potentially improve the
evaluation time significantly. With these improvements in place, implementing frame-wise ARMPIT
with  = 5 does not seem infeasible for STARSS23, perhaps even a better option than Multi-ACCDOA
with  = 3. A natural continuation of the work is to verify the results with an implementation. We
here present (,  ), the accurate number of ADPIT permutations for each category.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Shimada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Koyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Takahashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Takahashi</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>Multi-ACCDOA</surname>
          </string-name>
          :
          <article-title>Localizing and detecting overlapping sounds from the same class with auxiliary duplicating permutation invariant training</article-title>
          ,
          <source>in: Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>316</fpage>
          -
          <lpage>320</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Difenderfer</surname>
          </string-name>
          , et al.,
          <article-title>DeepZero: Scaling up zeroth-order optimization for deep model training</article-title>
          ,
          <source>in: International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A method for stochastic optimization</article-title>
          ,
          <source>in: International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sipser</surname>
          </string-name>
          , Introduction to the
          <source>Theory of Computation</source>
          , 2nd ed.,
          <source>Thomson Course Technology</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Oliphant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Peterson</surname>
          </string-name>
          , et al.,
          <source>SciPy: Open source scientific tools for Python</source>
          ,
          <year>2024</year>
          . URL: http://www.scipy.org/, version 1.14.0.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kolbaek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.-H.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jensen</surname>
          </string-name>
          ,
          <article-title>Permutation invariant training of deep models for speakerindependent multi-talker speech separation</article-title>
          , in: International Conference on Acoustics,
          <source>Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>241</fpage>
          -
          <lpage>245</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICASSP.
          <year>2017</year>
          .
          <volume>7952154</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Iqbal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Kong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>An</surname>
          </string-name>
          , et al.,
          <article-title>An improved event-independent network for polyphonic sound event localization and detection</article-title>
          , in: International Conference on Acoustics,
          <source>Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>885</fpage>
          -
          <lpage>889</lpage>
          . doi:
          <volume>10</volume>
          .1109/ICASSP39728.
          <year>2021</year>
          .
          <volume>9413473</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Berg</surname>
          </string-name>
          , et al.,
          <article-title>Learning multi-target TDOA features for sound event localization and detection</article-title>
          ,
          <source>in: Detection and Classification of Acoustic Scenes and Events (DCASE)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>16</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mesgarani</surname>
          </string-name>
          ,
          <article-title>Separating Varying Numbers of Sources with Auxiliary Autoencoding Loss</article-title>
          ,
          <source>in: Proc. Interspeech</source>
          <year>2020</year>
          ,
          <year>2020</year>
          , pp.
          <fpage>2622</fpage>
          -
          <lpage>2626</lpage>
          . doi:
          <volume>10</volume>
          .21437/Interspeech.2020-
          <volume>34</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          ,
          <article-title>The hungarian method for the assignment problem</article-title>
          ,
          <source>Naval Research Logistics (NRL) 52</source>
          (
          <year>1955</year>
          ). URL: https://api.semanticscholar.org/CorpusID:9426884.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D. F.</given-names>
            <surname>Crouse</surname>
          </string-name>
          ,
          <article-title>On implementing 2D rectangular assignment algorithms</article-title>
          ,
          <source>IEEE Transactions on Aerospace and Electronic Systems</source>
          <volume>52</volume>
          (
          <year>2016</year>
          )
          <fpage>1679</fpage>
          -
          <lpage>1696</lpage>
          . doi:
          <volume>10</volume>
          .1109/TAES.
          <year>2016</year>
          .
          <volume>140952</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Proctor</surname>
          </string-name>
          ,
          <article-title>Let's Expand Rota's Twelvefold Way For Counting Partitions</article-title>
          !,
          <year>2007</year>
          . URL: https: //arxiv.org/abs/math/0606404.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>O.-Y.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. V.</given-names>
            <surname>Manna</surname>
          </string-name>
          ,
          <article-title>Congruences for Stirling numbers of the second kind</article-title>
          ,
          <source>in: Gems in Experimental Mathematics</source>
          ,
          <year>2010</year>
          , pp.
          <fpage>97</fpage>
          -
          <lpage>111</lpage>
          . doi:
          <volume>10</volume>
          .1090/conm/517.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>OEIS</given-names>
            <surname>Foundation</surname>
          </string-name>
          <article-title>Inc</article-title>
          .,
          <source>The On-Line Encyclopedia of Integer Sequences</source>
          ,
          <year>2024</year>
          . At http://oeis.org.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Graham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Knuth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Patashnik</surname>
          </string-name>
          , Concrete Mathematics:
          <article-title>A Foundation for Computer Science</article-title>
          , second ed.,
          <string-name>
            <surname>Addison-Wesley</surname>
          </string-name>
          , Reading, MA,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>K.</given-names>
            <surname>Shimada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Politis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sudarsanam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Krause</surname>
          </string-name>
          , et al.,
          <article-title>STARSS23: An audio-visual dataset of spatial recordings of real scenes with spatiotemporal annotations of sound events</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          , volume
          <volume>36</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>72931</fpage>
          -
          <lpage>72957</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>