       Optimization of Data Locality in Relaxed
            Concurrent Priority Queues*

                 Andrey Tabakov1[0000−0001−7220−107X] and Alexey
                         Paznikov1[0000−0002−3735−6882]

Saint Petersburg Electrotechnical University “LETI”, 5 Professora Popova Str., Saint
       Petersburg 197022, Russia komdosh@yandex.ru, apaznikov@gmail.com



        Abstract. Parallel computing is one of the top priorities in computer
        science. The main means of parallel information processing is a distributed
        computing system (CS): a composition of elementary machines that interact
        through a communication medium. Modern distributed CSs implement
        thread-level parallelism (TLP) within a single computing node (multi-core
        CS with shared memory), as well as process-level parallelism (PLP) across
        the entire distributed CS. The design of scalable concurrent data structures
        for shared-memory systems is one of the promising approaches based on
        relaxing the order of operation execution. The growing demand for scalable
        parallel programs (especially in complex hierarchical control systems) drives
        the evolution of MPI, which nowadays supports not only common PLP but
        also TLP within each MPI process. Relaxed concurrent data structures are
        non-linearizable and do not provide strong operation semantics (such as
        FIFO/LIFO for linear lists, deletion of the maximum (minimum) element
        for priority queues, etc.). In this paper, we use an approach based on
        designing a concurrent data structure as multiple simple data structures
        distributed among the threads. To execute an operation (insert, delete), a
        thread randomly chooses a subset of these simple structures and acts on
        them. We propose optimized relaxed concurrent priority queues based on
        this approach. We designed algorithms for optimizing the selection of
        priority queues for insert/delete operations and an algorithm for balancing
        the elements among the queues.

        Keywords: Distributed Computing Systems · Concurrent Data Struc-
        tures · Relaxed Data Structures · Relaxed Semantics · Multithreading.


1     Introduction
Multicore computing systems (CS) with shared memory are the primary
means of solving complex problems and processing large amounts of data. Such a system is
*
    The reported study was funded by RFBR according to research project
    no. 19-07-00784 and was supported by the Russian Federation President Council
    on Grants for governmental support of young Russian scientists (project
    SP-4971.2018.5).

    Copyright © 2019 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).

used both autonomously and as part of distributed CSs (cluster and multi-cluster
systems, massively parallel systems). Such systems include a number of multicore
processors that share a single address space and have a hierarchical structure
(Fig. 1). For example, the Summit (OLCF-4) supercomputer (17 million processor
cores), which is first in the TOP500 rating, includes 4608 computing nodes [3].
Among the existing CSs with shared memory, there are SMP systems, which provide
processors with uniform memory access latency, and NUMA systems, which are
represented as a set of nodes (a processor together with its local memory) and are
characterized by different latencies for processor access to local and remote
memory segments [1]. Parallel programs for shared-memory CSs are created in
the multithreaded programming model. The main problem arising in the develop-
ment of such programs is organizing access to shared data structures. It is
necessary to ensure the correct execution of operations by parallel threads
(no race conditions, deadlocks, etc.) and to provide scalability for a large num-
ber of threads and a high intensity of operations. For this, it is necessary to
develop means for synchronizing access to shared memory.




                  Fig. 1. Multicore system hierarchical structure.




2   Methods of Thread Synchronization

Locks. Semaphores and mutexes grant access to shared memory areas to a
single thread at any time. Among the locking algorithms, we can distinguish
TTAS, CLH, MCS, Lock Cohorting, Flat Combining, Remote Core Locking,
etc. [4]. The disadvantages of locks are the possibility of deadlocks (livelocks),
thread starvation, priority inversion, and lock convoys.
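
For illustration, a minimal C++ sketch of the TTAS (test-and-test-and-set)
spinlock mentioned above; this is our simplified rendering, not a production lock:

    #include <atomic>

    // TTAS spinlock: spin on a plain read and attempt the atomic
    // exchange only when the lock appears to be free.
    class TTASLock {
        std::atomic<bool> locked{false};
    public:
        void lock() {
            for (;;) {
                while (locked.load(std::memory_order_relaxed)) { /* spin on read */ }
                if (!locked.exchange(true, std::memory_order_acquire))
                    return;   // acquired
            }
        }
        void unlock() { locked.store(false, std::memory_order_release); }
    };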

Non-blocking concurrent data structures. These have the property of lin-
earizability: any parallel execution of operations is equivalent to some sequential
execution of them, and the completion of each operation does not require the
completion of any other operation. Among non-blocking data structures, there
are three classes: wait-free, in which each thread completes any operation in a
finite number of steps; lock-free, in which some thread completes its operation
in a finite number of steps; and obstruction-free, in which any thread completes
its operation in a finite number of steps if the execution of the other threads is
suspended. Among the most common non-blocking algorithms and data structures
are the Treiber stack, the Michael-Scott queue, the Harris-Michael list, the
elimination backoff stack, combining trees, diffracting trees, counting networks,
the Cliff Click hash table, split-ordered lists, etc. [5]. The disadvantages of
non-blocking data structures include the high complexity of their development
(compared to other approaches), the lack of hardware support for atomic
operations on large or non-adjacent memory words (double compare-and-swap,
double-width compare-and-swap), and memory reclamation problems (the ABA problem).
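
As a minimal illustration of the lock-free style, the following sketch shows the
Treiber stack push built on compare-and-swap (our simplified code; memory
reclamation, and hence the ABA problem, is ignored here):

    #include <atomic>

    // Treiber stack push: retry the CAS until no other thread has
    // changed `head` between the read and the update.
    struct Node { int value; Node* next; };

    std::atomic<Node*> head{nullptr};

    void push(int v) {
        Node* n = new Node{v, head.load(std::memory_order_relaxed)};
        // on failure, compare_exchange_weak reloads the current head into n->next
        while (!head.compare_exchange_weak(n->next, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed)) {
            // retry
        }
    }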



Transactional memory. Within this approach, the program organizes transac-
tional sections in which protection of shared memory areas is implemented. A
transaction is a finite sequence of transactional read/write operations [16]. A
transactional read copies the contents of the specified section of shared memory
into the corresponding section of the thread's local memory. A transactional
write copies the contents of the specified section of local memory into the
corresponding section of shared memory accessible to all threads. After completing
its execution, a transaction can either be committed or aborted. Committing a
transaction means that all changes made to memory become irreversible. There
are software (LazySTM, TinySTM, GCC TM, etc.) and hardware (Intel TSX,
AMD ASF, Oracle Rock, etc.) implementations of transactional memory. Among
the shortcomings of transactional memory are restrictions on certain types of
operations within transactional sections, the overhead of tracking changes in
memory, and the complexity of debugging programs. Given the above shortcomings
of the existing synchronization tools, their performance may be insufficient for
modern multithreaded programs. In this paper, scalability is increased by
relaxing the semantics of data structures.
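
For illustration, a minimal transactional section using GCC's software
transactional memory (a sketch assuming compilation with -fgnu-tm; it is not
tied to any particular library listed above):

    // Reads and writes inside the block are tracked by the TM runtime
    // and committed atomically, or rolled back on conflict.
    static int balance = 0;

    void deposit(int amount) {
        __transaction_atomic {
            balance += amount;   // transactional read and write of `balance`
        }
    }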


3   Relaxed Concurrent Data Structures
The basis of this approach is a compromise between scalability (performance)
and strictness of the semantics of operations [2]. The semantics of operations
are relaxed to increase the potential for scaling. In most existing non-blocking
thread-safe structures and blocking algorithms, there is a single point of
execution of operations on the structure. For example, when inserting an element
into a queue in a multithreaded system, it is necessary to use a single pointer
to the first element of the structure. This is a bottleneck, since each thread
is forced to lock this element, causing other threads to wait. The principle of
quasi-linearizability [13] is applicable here: several events may occur during
the execution of some operations, simultaneously changing the data structure in
such a way that after performing one of the operations, the state of the data
structure is undefined. A relaxed concurrent data structure randomly selects one
of its available substructures. If the substructure is locked successfully, the
thread completes the operation; otherwise, it randomly selects another one. Thus,
synchronization of threads is minimized, and the losses in the accuracy of
operations are acceptable [2]. The main representatives of relaxed data
structures are SprayList, k-LSM, and Multiqueues.

SprayList This structure is based on the SkipList [11]. A SprayList is a
connected graph in which the lowest level is a sorted linked list of all elements,
and each higher level contains each element of the level below with a given
fixed probability. In the SkipList, the search proceeds deterministically from
top to bottom and from left to right, following one pointer per iteration. Unlike
the SkipList, the SprayList does not use this linear search from top to bottom
and from beginning to end; instead, it performs a random walk from top to bottom
and from left to right. If the search fails, or another thread has locked the
item, the algorithm returns to the previous item. After the required element is
found, operations on the list are performed in the same way as in the SkipList.
However, in the worst case, inserting items into a SprayList causes significant
overhead due to the need to maintain an ordered list.

k-LSM The log-structured merge tree (LSM) is used as the basic structure
of k-LSM [12]. Each change to the structure is recorded in a separate log, and
tree nodes are sorted arrays (blocks); a block at level L of the tree can contain
N elements, where 2^{L-1} < N ≤ 2^L. Each thread has a local, distributed
LSM. The shared LSM is the result of merging several distributed LSM
structures; all threads can access the shared LSM through a single pointer. The
k-LSM structure is obtained by combining the shared and distributed LSM
structures. When performing an insert operation, a thread stores the element in
its local LSM structure. A delete operation searches for the smallest key in the
local LSM. If the local structure is empty and no insert operation is required,
an attempt is made to access foreign LSM structures: the search is performed
among all distributed and shared LSM structures, and if a structure that is not
locked is found, the operation is performed on it. The disadvantages of this
structure are the synchronization of accesses to the shared LSM and the attempts
to access a foreign LSM, since there is no guarantee that it is not empty.

Multiqueues This structure [9] (Fig. 2) is a composition of simple priority
queues protected by locks. Each thread owns two or more priority queues. The
insert operation is performed on a random queue that is not locked by another
thread. The operation of deleting the element with the minimum key is performed
as follows: two random unlocked queues are selected, the values of their minimum
elements are compared, and the element with the smallest value is deleted from
the corresponding queue. This element is not always the global minimum of the
structure, but it is close to the minimum, and for real problems this error can
be neglected. This paper proposes algorithms that optimize the execution of
operations for priority queues based on Multiqueues; the algorithms optimize
the choice of the structure on which to perform an operation [8].
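
To make the scheme concrete, the following minimal C++ sketch (ours, in the
spirit of [9], not the authors' implementation) shows a Multiqueue with a
relaxed delete-min; locking and random selection are deliberately simplified:

    #include <climits>
    #include <cstddef>
    #include <functional>
    #include <mutex>
    #include <optional>
    #include <queue>
    #include <random>
    #include <vector>

    class MultiQueue {
    public:
        explicit MultiQueue(std::size_t numQueues) : queues_(numQueues) {}

        // Insert into a random queue that is not locked by another thread.
        void insert(int key) {
            for (;;) {
                Slot& s = randomSlot();
                if (s.lock.try_lock()) {
                    s.pq.push(key);
                    s.lock.unlock();
                    return;
                }
            }
        }

        // Relaxed delete-min: lock two random queues, compare their minima
        // and remove the smaller one. The result is only close to the
        // global minimum, which is exactly the relaxation described above.
        std::optional<int> deleteMin() {
            for (;;) {
                Slot& a = randomSlot();
                Slot& b = randomSlot();
                if (&a == &b || !a.lock.try_lock()) continue;
                if (!b.lock.try_lock()) { a.lock.unlock(); continue; }
                Slot& best = minOf(a) <= minOf(b) ? a : b;
                std::optional<int> res;
                if (!best.pq.empty()) { res = best.pq.top(); best.pq.pop(); }
                a.lock.unlock();
                b.lock.unlock();
                return res;
            }
        }

    private:
        struct Slot {
            std::mutex lock;
            // min-heap: top() is the smallest key
            std::priority_queue<int, std::vector<int>, std::greater<int>> pq;
        };

        int minOf(Slot& s) { return s.pq.empty() ? INT_MAX : s.pq.top(); }

        Slot& randomSlot() {
            static thread_local std::mt19937 rng{std::random_device{}()};
            std::uniform_int_distribution<std::size_t> d(0, queues_.size() - 1);
            return queues_[d(rng)];
        }

        std::vector<Slot> queues_; // kp simple priority queues in total
    };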




             Fig. 2. Multiqueues - Relaxed Concurrent Priority Queue




4     Operation Execution Optimization
The disadvantage of the current implementation of insert and delete operations
in Multiqueues is the algorithm for selecting a random queue: a thread
performing an operation is likely to access queues locked by other threads.
The Multiqueues structure includes kp queues, where k is the number of
queues per thread and p is the number of threads.

4.1   Evaluation of the Queue Identifier by the Thread

In this article, methods are proposed for reducing the number of collisions by
limiting the range from which a structure is randomly chosen. By a collision we
mean an access to a memory area that is already locked by another thread. The
sets of queues and threads are divided in half. Let i (1) be the thread
identifier. Threads from the first half (4) perform the random queue selection
among the first half of the queues (2), from which one queue is chosen to
perform the operation on the Multiqueues structure; threads from the second
half (5) search for a queue in the second half of the queues (3).

              i ∈ {0, 1, . . . , p}          (1)
              q ∈ {0, 1, . . . , kp/2}       (2)
              q ∈ {kp/2 + 1, . . . , kp}     (3)
              i ∈ {0, 1, . . . , p/2}        (4)
              i ∈ {p/2 + 1, . . . , p}       (5)
              q ∈ {pi, . . . , pi + k}       (6)
An approach to optimizing queue selection based on the "binding" of queues to
threads has also been developed. This scheme specifies the order in which a
thread accesses the queues. The binding uses the following model: the set
contains kp queues in total, and each thread has the fixed queues given by (6).
When performing an operation, a thread first accesses one of the queues in (6),
thereby minimizing accesses to locked queues. If all queues from the set (6)
are locked, then the original queue selection algorithm among all queues is
used. Both selection rules are summarized in the code sketch below.
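
The following C++ sketch (helper names are ours) returns the queue-index range
a thread draws from under each rule; the binding range is written exactly as
printed in formula (6):

    #include <cstddef>

    // p = number of threads, k = queues per thread, i = thread identifier.
    struct Range { std::size_t lo, hi; };

    // Half-partition rule: threads from the first half draw queues from
    // [0, kp/2], the rest from [kp/2 + 1, kp] (formulas (2)-(5)).
    Range halfRange(std::size_t i, std::size_t p, std::size_t k) {
        return (i <= p / 2) ? Range{0, k * p / 2}
                            : Range{k * p / 2 + 1, k * p};
    }

    // Binding rule: thread i first tries only its own fixed queues
    // (formula (6), as printed), falling back to the full range when
    // they are all locked.
    Range boundRange(std::size_t i, std::size_t p, std::size_t k) {
        return Range{p * i, p * i + k};
    }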


4.2   Optimized Algorithms

Algorithms presented in this section use the following primitives: Lock/Unlock
locks and unlocks the mutex of one of the specified queues; tryLock checks
that the queue is not being processed by another thread at the same time and
tries to lock its mutex; RandQueue returns one queue from the set of all
queues by an identifier within the specified range; Rand2Queue is the same as
RandQueue but returns a pair of randomly selected queues; threadId is the
identifier of the current thread within this process, from 0 to p. For the
implementation, it is recommended to pin threads to cores (thread affinity),
which helps to minimize context switching between threads [15].
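
On Linux, thread affinity can be set, for example, via pthread_setaffinity_np;
a minimal sketch (error handling omitted for brevity):

    #include <pthread.h>
    #include <sched.h>

    // Pin the calling thread to one core; `core` must be a valid CPU
    // index on this machine.
    void pinToCore(unsigned core) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
    }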



OptHalfInsert Algorithm 1 represents an optimized algorithm for inserting an
element into the Multiqueues structure. The queue range is selected depending
on which half of the threads the current thread identifier belongs to.
    do
        if i ∈ {0, 1, . . . , p/2} then
            q = RandQueue(0, kp/2);
        else
            q = RandQueue(kp/2 + 1, kp);
        end
    while tryLock(q) = false;
    insert(q, element);
    Unlock(q);
                              Algorithm 1: OptHalfInsert
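
A possible C++ rendering of Algorithm 1 is given below. It assumes the Slot
layout of the earlier Multiqueue sketch (here with a max-heap,
std::priority_queue<int>, to match removeMax in the following algorithms) and a
hypothetical helper randomSlotIn(lo, hi) returning a random slot with 0-based
index in [lo, hi]:

    #include <cstddef>

    void optHalfInsert(std::size_t i, std::size_t p, std::size_t k, int key) {
        const std::size_t kp = k * p;
        for (;;) {
            Slot& q = (i <= p / 2) ? randomSlotIn(0, kp / 2)           // first half
                                   : randomSlotIn(kp / 2 + 1, kp - 1); // second half
            if (q.lock.try_lock()) {   // tryLock(q)
                q.pq.push(key);        // insert(q, element)
                q.lock.unlock();       // Unlock(q)
                return;
            }
        }
    }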
                                 Optimization of Data Locality in Relaxed      7

OptHalfDelete Algorithm 2 proceeds exactly like OptHalfInsert, but it considers
two queues to increase the accuracy of deleting the maximum (minimum) element.
    do
        if i ∈ {0, 1, . . . , p/2} then
            (q1, q2) = Rand2Queue(0, kp/2);
        else
            (q1, q2) = Rand2Queue(kp/2 + 1, kp);
        end
        q = GetMaxElementQueue(q1, q2);
    while tryLock(q) = false;
    removeMax(q);
    Unlock(q);
                              Algorithm 2: OptHalfDelete
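
A hedged C++ sketch of Algorithm 2 follows; topOf(s) is an assumed helper that
peeks at a slot's top key (returning the smallest possible key for an empty
queue). The unlocked peek is a benign inaccuracy under relaxed semantics; a
production implementation would cache the tops atomically:

    #include <cstddef>
    #include <optional>

    std::optional<int> optHalfDelete(std::size_t i, std::size_t p, std::size_t k) {
        const std::size_t kp = k * p;
        const std::size_t lo = (i <= p / 2) ? 0 : kp / 2 + 1;
        const std::size_t hi = (i <= p / 2) ? kp / 2 : kp - 1;
        for (;;) {
            Slot& a = randomSlotIn(lo, hi);
            Slot& b = randomSlotIn(lo, hi);
            Slot& q = topOf(a) >= topOf(b) ? a : b;   // GetMaxElementQueue
            if (q.lock.try_lock()) {                  // tryLock(q)
                std::optional<int> res;
                if (!q.pq.empty()) { res = q.pq.top(); q.pq.pop(); } // removeMax(q)
                q.lock.unlock();                      // Unlock(q)
                return res;
            }
        }
    }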




OptExactDelete Algorithm 3 is an alternative optimized algorithm for deleting
the maximum element. It is similar to Algorithm 2, with one change: at first it
attempts to lock the queues "attached" to the thread (identified by threadId),
and only after failing does it search among all queues.
    firstIteration = true;
    do
        if firstIteration then
            (q1, q2) = Rand2Queue(threadId, threadId + p);
        else
            if i ∈ {0, 1, . . . , p/2} then
                (q1, q2) = Rand2Queue(0, kp/2);
            else
                (q1, q2) = Rand2Queue(kp/2 + 1, kp);
            end
        end
        q = GetMaxElementQueue(q1, q2);
        firstIteration = false;
    while tryLock(q) = false;
    removeMax(q);
    Unlock(q);
                              Algorithm 3: OptExactDelete
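
A sketch of Algorithm 3 in the same style; the bound range (threadId,
threadId + p) is taken verbatim from the pseudocode, and the helper names
follow the earlier sketches:

    #include <cstddef>
    #include <optional>

    std::optional<int> optExactDelete(std::size_t i, std::size_t p, std::size_t k) {
        bool firstIteration = true;
        for (;;) {
            Slot *a, *b;
            if (firstIteration) {              // queues "attached" to thread i
                a = &randomSlotIn(i, i + p);   // range as in Algorithm 3
                b = &randomSlotIn(i, i + p);
            } else {                           // fall back to the half-range rule
                const std::size_t kp = k * p;
                const std::size_t lo = (i <= p / 2) ? 0 : kp / 2 + 1;
                const std::size_t hi = (i <= p / 2) ? kp / 2 : kp - 1;
                a = &randomSlotIn(lo, hi);
                b = &randomSlotIn(lo, hi);
            }
            Slot& q = topOf(*a) >= topOf(*b) ? *a : *b; // GetMaxElementQueue
            firstIteration = false;
            if (q.lock.try_lock()) {
                std::optional<int> res;
                if (!q.pq.empty()) { res = q.pq.top(); q.pq.pop(); }
                q.lock.unlock();
                return res;
            }
        }
    }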
    As a result of continuous program execution, an imbalance may arise in the
relaxed priority queues: some queues may contain significantly more elements
than others. This leads to reduced performance, since empty queues become
unsuitable for delete operations, which increases the time needed to find a
suitable queue. A balancing algorithm (Algorithm 4) for the complete Multiqueues
structure has been created to distribute elements uniformly among the queues.


    q1 = FindLargestQueue();
    q2 = FindShortestQueue();
    if size(q1) > AvgSizeTotalSize() ∗ α then
        Lock(q1);
        q2IsLocked = LockWithTimeout(q2);
        if q2IsLocked then
            sizeToTransfer = size(q1) ∗ β;
            TransferElements(q1, q2, sizeToTransfer);
            Unlock(q2);
        end
        Unlock(q1);
    end
                       Algorithm 4: Balancing algorithm
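
A sketch of the balancing pass in C++. findLargestQueue, findShortestQueue,
avgQueueSize and transferElements are hypothetical helpers over the slot array;
alpha and beta are the tuning factors of the pseudocode, and the 100 µs timeout
is our placeholder. The slot mutex is assumed to be a std::timed_mutex here so
that LockWithTimeout can be modelled with try_lock_for:

    #include <chrono>
    #include <cstddef>
    #include <mutex>

    void balance(double alpha, double beta) {
        Slot& q1 = findLargestQueue();    // most elements
        Slot& q2 = findShortestQueue();   // fewest elements
        if (q1.pq.size() > avgQueueSize() * alpha) {
            std::lock_guard<std::timed_mutex> g1(q1.lock);   // Lock(q1)
            if (q2.lock.try_lock_for(std::chrono::microseconds(100))) {
                const std::size_t n =
                    static_cast<std::size_t>(q1.pq.size() * beta);
                transferElements(q1, q2, n);  // move n elements from q1 to q2
                q2.lock.unlock();
            }
        }  // g1 releases q1 here
    }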


5      Evaluation

5.1     Testing Platform

All experiments reported in this paper were run on one node of a cluster
equipped with two Intel Xeon X5670 processors; each socket contains six cores
at 2.93 GHz (Hyper-Threading disabled). 16 GB of RAM was used.


5.2     Measurement Technique

As the efficiency indicator, we used the throughput, calculated as the sum of the
throughputs of the individual threads (7), where n is the number of insert/delete
operations performed by thread i and t is their execution time [14].

                                     b_i = n/t                                 (7)
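
Written out as the aggregate over all p threads (the notation B, n_i, t_i is
ours), the total throughput implied by (7) is

    B = \sum_{i=1}^{p} b_i = \sum_{i=1}^{p} \frac{n_i}{t_i}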


5.3     Benchmarks

The performance of the original and optimized Multiqueues was compared on
individual insert/delete operations. Each thread was allocated k = 2 queues.
Each thread performed n = 10^6 insert operations and n = 0.5 × 10^6 delete
operations. The first experiment shows the dependence of the throughput of the
insert and delete operations on the number of threads used (Fig. 3 and Fig. 4).
The following combinations of the insertion and deletion algorithms were
analyzed:

  – the original insertion algorithm (Insert) and the original deletion
    algorithm (Delete)
  – the optimized insertion algorithm (OptHalfInsert) and the optimized
    deletion algorithm (OptHalfDelete)
  – the optimized insertion algorithm (OptHalfInsert) and the alternative
    optimized deletion algorithm (OptExactDelete)
Fig. 3. Insert Throughput by Threads (Number of Operations [MOPPS] vs. Number
of Threads (p); series: Insert, OptHalfInsert).

Fig. 4. Deletion Throughput by Threads (Number of Operations [MOPPS] vs. Number
of Threads (p); series: Delete, OptHalfDelete, OptExactDelete).

Fig. 5. Insert Throughput by Number of Queues (Number of Operations [MOPPS] vs.
Number of Queues per Thread (k); series: Insert, OptHalfInsert).

Fig. 6. Deletion Throughput by Number of Queues (Number of Operations [MOPPS]
vs. Number of Queues per Thread (k); series: Delete, OptHalfDelete,
OptExactDelete).



Figures 5 and 6 analyze the optimal number of queues k per thread. We used a
fixed number of threads p = 12. For the algorithms OptHalfInsert and
OptExactDelete, the maximum system throughput is achieved with k = 4 queues
per thread.


6    Future Work

High-performance processing of large data volumes is challenging in modern
control systems. For this reason, modern systems for complex object control
(especially large-scale distributed hierarchical control systems) are based on
multi- and many-core distributed computer systems. Computer systems include a
wide class of systems, from embedded systems and mobile devices to cluster
computing systems, massively parallel systems, GRID systems, and cloud-based
CSs. Algorithmic and software tools for parallel programming are the basis for
building modern systems for big data processing, machine learning, and
artificial intelligence. The main class of systems used for high-performance
information processing is distributed CSs: collectives of elementary machines
interacting through a communication environment. In the design of parallel
programs for distributed CSs, the de facto standard is the message-passing
model, which is primarily represented by the MPI (Message Passing Interface)
standard. The scalability of MPI programs depends significantly on the
efficiency of the implementation of collective information exchange operations
(collectives) [7]. Such operations are used in most MPI programs and account
for a significant proportion of total program execution time; an adaptive
approach to the development of collectives is promising. Nowadays, using only
the message-passing model (MPI-everywhere) may not be sufficient to develop
efficient MPI programs. In this regard, a promising approach is to use MPI for
interaction between computing nodes and multithreading support systems
(PThreads, OpenMP, Intel TBB) inside the nodes. The main task in implementing
this hybrid mode is the organization of scalable access of parallel threads to
shared data structures (context identifiers, virtual channels, message queues,
request pools, etc.). The MPI standard defines the following threading levels
for hybrid mode (a usage sketch follows the list):

 – MPI_THREAD_SINGLE: one thread of execution
 – MPI_THREAD_FUNNELED: the program is multithreaded, but only the main
   thread may perform MPI operations
 – MPI_THREAD_SERIALIZED: only one thread at a time may call MPI functions
 – MPI_THREAD_MULTIPLE: each thread may call MPI functions at any time.
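
A minimal sketch of initializing a hybrid MPI program with the
MPI_THREAD_MULTIPLE level (standard MPI calls, shown here in C++):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        int provided = 0;
        // Request full multithreading support and check what was granted.
        MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
        if (provided < MPI_THREAD_MULTIPLE) {
            std::fprintf(stderr, "MPI_THREAD_MULTIPLE is not supported\n");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }
        // ... threads may now call MPI functions concurrently ...
        MPI_Finalize();
        return 0;
    }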

One implementation of hybrid multithreaded MPI programs in the
MPI_THREAD_MULTIPLE mode is the MPICH library with the CH4 device, which
defines the rules for using lock-free data structures. In this mode, two types
of synchronization are available: trylock, in which a thread cyclically tries
to acquire the mutex and access the queue, and handoff, a thread-safe queue
whose access involves an active wait for an item by the thread.
It is proposed to replace the thread-safe work queue from the izem library with
a relaxed thread-safe queue, Multiqueues, in which the mechanism for accessing
queue elements has been improved.
The use of Multiqueues with relaxed operation semantics avoids bottlenecks
when synchronizing threads [6]. Unlike most existing lock-free thread-safe
data structures and locking algorithms, where there is a single point of
execution of operations on the structure, relaxed data structures use a set of
simple sequential structures whose composition is treated as a single logical
structure. As a result, the number of possible points of access to the
structure increases. This approach allows much greater throughput to be
achieved compared to existing data structures.

7    Conclusion

An optimized version of a relaxed concurrent priority queue based on Multi-
queues has been developed. The developed insertion and deletion algorithms
achieve 1.2 and 1.6 times higher performance, respectively, compared to the
original insertion and deletion algorithms. The optimization is achieved by
reducing the number of collisions through limiting the range of random
structure selection. In the implementation, thread affinity is used to minimize
contention between cores. It is proposed to use a scalable thread-safe queue
with relaxed semantics in hybrid multithreaded MPI programs (the MPI + threads
model), which will reduce the overhead of synchronizing threads when performing
operations on a working task queue. There is evidence that this queue is highly
scalable and reduces the overhead of thread synchronization. The implementation
of these algorithms is publicly available at
https://github.com/Komdosh/Multiqueues.


References

1. Anenkov, A. D., Paznikov, A. A., Kurnosov, M. G.: Algorithms for access localiza-
   tion to objects of scalable concurrent pools based on diffracting trees in multicore
   computer systems. In: 2018 XIV International Scientific-Technical Conference on
   Actual Problems of Electronics Instrument Engineering (APEIE) pp. 374-380. IEEE
   (2018) https://doi.org/10.1109/APEIE.2018.8545197
2. Tabakov A. V., Veretennikov L. M.: Relaxed concurrent data structures. In: III
   Science of Present and Future, pp. 105–107. ETU, Russia (2018)
3. TOP500 supercomputers list, https://www.top500.org/news/summit-up-and-
   running-at-oak-ridge-claims-first-exascale-application. Last accessed 7 Nov 2019
4. Herlihy M., Shavit N.: The art of multiprocessor programming. Morgan Kaufmann,
   Boston (2011)
5. Shavit N., Touitou D.: Software transactional memory. Distributed Computing 10,
   99–116 (1997)
6. Tabakov, A. V., Paznikov, A. A.: Using relaxed concurrent data struc-
   tures for contention minimization in multithreaded MPI programs. Journal
   of Physics: Conference Series 1399(3), pp. 033037. IOP Publishing (2019)
   https://doi.org/10.1088/1742-6596/1399/3/033037
7. Tabakov A., Paznikov A.: Modelling of parallel threads synchronization in hybrid
   MPI+Threads programs. In XXI IEEE International Conference on Soft Computing
   and Measurements (SCM), 4–7 (2019)
8. Paznikov A., Anenkov A.: Implementation and analysis of distributed relaxed
   concurrent queues in remote memory access model. In: XIII International Sym-
   posium “Intelligent Systems – 2018” (INTELS’18), Procedia Computer Science 50,
   654–662 (2019) https://doi.org/10.1016/j.procs.2019.02.101
9. Rihani H., Sanders P., Dementiev R.: MultiQueues: Simpler, faster, and better re-
   laxed concurrent priority queues. arXiv preprint arXiv:1411.1209 (2015)
   https://arxiv.org/pdf/1411.1209.pdf. Last accessed 7 Nov 2019
10. Goncharenko E.A., Paznikov A.A., Tabakov A.V.: Evaluating the performance of
   atomic operations on modern multicore systems. Journal of Physics: Conference
   Series 1399, 033107 (2019) https://doi.org/10.1088/1742-6596/1399/3/033107

11. Alistarh D. et al.: The SprayList: A scalable relaxed priority queue. ACM SIGPLAN
    Notices 50(8), 11–20 (2015)
12. Wimmer M. et al.: The lock-free k-LSM relaxed priority queue. ACM SIGPLAN
    Notices 50(8), 277–278 (2015)
13. Afek Y., Korland G., Yanovsky E.: Quasi-linearizability: Relaxed consistency for
    improved concurrency. OPODIS, LNCS 6490, 3–10 (2010)
14. Tabakov A. V., Paznikov A. A.: Algorithms for optimization of relaxed concurrent
    priority queues in multicore systems. In: 2019 IEEE Conference of Russian Young
    Researchers in Electrical and Electronic Engineering (EIConRus), pp. 360–365 (2019)
    https://doi.org/10.1109/EIConRus.2019.8657105
15. Paznikov, A., Shichkina, Y.: Algorithms for optimization of processor and memory
   affinity for Remote Core Locking synchronization in multithreaded applications.
   Information 9(1), 21–24 (2018) https://doi.org/10.3390/info9010021
16. Kulagin I. I.: Means of architectural-oriented optimization of parallel program ex-
   ecution for computing systems with multilevel parallelism. NUSC 10, 77–82 (2017)
17. Paznikov, A. A., Smirnov, V. A., Omelnichenko, A. R. Towards Effi-
   cient Implementation of Concurrent Hash Tables and Search Trees Based
   on Software Transactional Memory. In: 2019 International Multi-Conference
   on Industrial Engineering and Modern Technologies (FarEastCon) 1–5 (2019).
   https://doi.org/10.1109/FarEastCon.2019.8934131