<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Optimization of Data Locality in Relaxed Concurrent Priority Queues?</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Saint Petersburg Electrotechnical University \LETI"</institution>
          ,
          <addr-line>5 Professora Popova Str., Saint Petersburg 197022</addr-line>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Parallel computing is one of the top priorities in computer science. The main means of parallel processing information is a distributed computing system (CS) - a composition of elementary machines that interact through a communication medium. Modern distributed CSs implement thread-level parallelism (TLP) within a single computing node (multi-core CS with shared memory), as well as process-level parallelism (PLP) process-level parallelism for the entire distributed CS. Design of scalable concurrent data structures for shared memory systems is one of promising approach to relaxation of operation execution order. The growing demand for scalable parallel programs (especially in complex hierarchical control systems) states the evolution of MPI, which nowadays supports not only common PLP, but TLP within each the MPI-process. Relaxed concurrent data structures are non-linearizable and do not provide strong operation semantics (such as FIFO/LIFO for linear lists, delete max (min) element for priority queues, etc.). In the paper, we use the approach based on design of concurrent data structure as multiple simple data structures distributed among the threads. For operation execution (insert, delete), a thread randomly chooses a subset of these simple structures and make actions on them. We propose optimized relaxed concurrent priority queues based on this approach. We designed algorithms for optimization of priority queues selection for insert/delete operations and algorithm for balancing of elements in queues.</p>
      </abstract>
      <kwd-group>
        <kwd>Distributed Computing Systems Concurrent Data Structures Relaxed Data Structures Relaxed Semantics Multithreading</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        The Multicore computing systems (CS) with shared memory are the primary
means of solving complex problems and processing large amounts of data. It is
? The reported study was funded by RFBR according to the research projects
19-07-00784 and was supported by Russian Federation President Council on Grants
for governmental support for young Russian scientists (project SP-4971.2018.5).
used both autonomously and as part of distributed CS (cluster, multi-cluster
systems, systems with mass parallelism). Such systems include number of multicore
processors that share a single address space and have a hierarchical structure
(Fig. 1). Thus, the Summit (OLCF-4) supercomputer (17 million processor
cores), which is rst in the TOP500 rating, includes 4608 computing nodes. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
Among the existing CS with shared memory, there are SMP-systems, providing
the same speed of access of processors to memory, and NUMA-systems,
represented as a set of nodes (composition of processor and local memory) and
characterized by di erent latency for access of processors to local and remote
memory segments [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Parallel programs for shared-memory CS are created in
a multithreaded programming model. The main problem arising in the
development of programs is the organization of access to shared data structures. It is
necessary to implement the correct execution of operations by parallel threads
(no race conditions, dead-locks, etc.) and to provide scalability for a large
number of threads and a high intensity of operations. For this, it is necessary to
develop means for synchronizing access to shared memory.
Locks Semaphores, mutexes implement access to shared memory areas with a
single thread at any time. Among the locking algorithms, we can distinguish
TTAS, CLH, MCS, Lock Cohorting, Flat Combining, Remote Core Locking,
etc. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The disadvantages of locks are the possibility of deadlocks (livelocks),
threads starvation, priority inversion and lock convoy.
Non-blocking concurrent data structures they have the property of
linearizability, assuming that any parallel execution of operations is equivalent to
some sequential execution of them, and the completion of each operation does not
require the completion of any other operation. Among non-blocking data
structures, there are three classes: wait-free { each thread completes the execution of
any operation in a nite number of steps; lock-free { some threads complete the
execution of operations in a nite number of steps; obstruction-free { any thread
completes the operations for a nite number of steps, if the execution of other
threads is suspend-ed. Among the most common non-blocking algorithms and
data structures, we could highlight Treiber Stack, Michael Scott Queue,
Harris Michael List, Elimination Backo Stack, Combining Trees, Di racting Trees,
Counting Network, Cli Click Hash Table, Split-ordered lists, etc. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The
disadvantages of non-blocking data structures include the high complexity of their
development (compared to other approaches), the lack of hardware support for
atomic operations for large or non-adjacent memory (double compare-and-swap,
double-width compare-and-swap), problems of freeing memory (ABA problem).
Transactional memory Within this approach, the program organizes
transactional sections in which the protection of shared memory areas is implemented.
Transaction is the nite operations sequence of the transactional read / write
actions [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The transactional read operation copies the contents of the
specied section of shared memory into the corresponding section of the local thread
memory. The transactional write copies the contents of the speci ed section of
local memory into the corresponding section of shared memory accessible to all
threads. After completion of the execution, the transaction can either be
committed or canceled. Committing a transaction implies that all changes made
to memory become irreversible. There are software (LazySTM, TinySTM, GCC
TM, etc.) and hardware (Intel TSX, AMD ASF, Oracle Rock, etc.) transactional
memory. Among the shortcomings of the transactional memory, it is possible to
distinguish restrictions on certain types of operations within the transactional
sections, overhead in tracking changes in memory, the complexity of debugging
the program. Given the above discussed shortcomings of the existing
synchronization tools, their performance may be insu cient for modern multithreaded
programs. In this paper, the method of increasing scalability is used due to
relaxation of the semantics of data structures.
3
      </p>
    </sec>
    <sec id="sec-2">
      <title>Relaxed Concurrent Data Structures</title>
      <p>
        The basis of this approach is a compromise between scalability (performance)
and the correctness of the semantics of operations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It is proposed to relax
the semantics of the implementation of operations to increase the possibility of
scaling. In most existing non-blocking thread-safe structures and blocking
algorithms, there is a single point of execution of operations on the structure. For
example, when inserting an element into a queue, it is necessary to use a single
pointer to the rst element of the structure, in the case of a multi-threaded
system. This fact is a bottleneck, since each thread is forced to block one element,
causing other threads to wait. The principle of quasi-linearizability [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] is
applicable to this approach, which assumes that several events may occur during
the execution of some operations, simultaneously changing the data structure in
such a way that after performing one of the operations, the state of the data
structure is unde ned. Relaxed concurrent data structures randomly select on
of available structure. In case of successful locking of the structure, the thread
completes the operation; otherwise, randomly selects an another structure. Thus,
synchronization of threads is minimized, losses in the accuracy of operations are
acceptable [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The main representatives of relaxed data structures are SprayList,
k-LSM, Multiqueue.
      </p>
      <p>
        SprayList This structure is based on the SkipList structure [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. SprayList is
a connected graph, where at the lower level of the structure there is a connected
sorted list of all elements, and each next level with a given xed probability
contains the elements of the list of the lower level. The search for this structure
is carried out linearly from top to bottom and from left to right; one pointer
follows each iteration. Unlike the SkipList, SprayList assumes not a linear search
from top to bottom and from beginning to end, is uses a random movement from
top to bottom and from left to right. If the search fails or another thread locks
the item, the algorithm returns to the previous item. After nding the necessary
element, the operations with the list are performed in the same way as in the
SkipList. However, in the worst case, inserting items into a SprayList causes
signi cant overhead due to the need to maintain an ordered list.
k-LSM The log-structured tree with merge (LSM) is used as the basic structure
of k-LSM [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Each structure change is recorded in a separate log le, tree nodes
are sorted arrays (blocks), each of which is at the L level of the tree and can
contain N elements (2L{1 &lt; N 2L). Each thread has a local distributed
LSM. Common LSM is the result of the merging of several distributed LSM
structures. All threads can access common LSM using a single pointer. As a result
of combining common and distributed LSM structures, a k-LSM structure was
obtained. When performing the insert operation, the thread stores the element in
the local LSM structure. During a delete operation, the search for the smallest
key in the local LSM is used. If the local structure is empty, and not insert
operation is required, an attempt is made to access foreign LSM structures, the
search is performed among all distributed and common LSM structures, and if
it was found the structure that is not locked, an operation is performed on it.
The disadvantages of this structure are the synchronization of calls to a common
LSM and an attempt to call someone else's LSM, since there is no guarantee that
it is not empty.
      </p>
      <p>
        Multiqueue This structure [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] (Fig. 2) is a composition of simple priority
queues protected by locks. Each thread has two or more priority queues. The
operation of inserting an element is performed in a random, unlocked by another
thread, queue. The operation of deleting an element with the minimum key is
performed as follows: two random unlocked queues are selected, their values of
the minimum elements are compared and the element with the smallest value
from the corresponding queue is deleted. This element is not always the minimum
inserted in the global structure, but it is close to the minimum and for real
problems, this error can be neglected. This paper proposes algorithms to optimize
the execution of operations for priority queues based on Multiqueues. Algorithms
allow you to optimize the choice of structure for performing the operation [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Operation Execution Optimization</title>
      <p>The disadvantage of the current implementation of insert and delete operations
in Multiqueues is the algorithm for searching for a random queue. The thread
performing the operation most likely turns to queues locked by other threads.
The structure of Multiqueues includes kp queues, where k is the number of
queues per thread, p is the number of threads.
4.1</p>
      <sec id="sec-3-1">
        <title>Evaluation Queue Identi er by Thread</title>
        <p>In the article methods are proposed for reducing the number of collisions based
on the limitation of the range for randomly choosing a structure. By collisions
is meant access a memory area that was already locked by another thread. The
set of queues and threads is divided in half. Let i in (1) be the thread identi er,
then for the rst half of all threads (2), the random queue selection operation is
performed among the queues (3), where 1 is selected to perform the operation
queue in the Multiqueue structure. For the second half of threads (4), queue
searching in the second half of queues (5).</p>
        <p>i 2 0; 1; : : : ; p
q 2 0; 1; : : : ; kp=2
q 2 kp=2 + 1; : : : ; kp
(1)
(2)
(3)
i 2 0; 1; : : : ; p=2
i 2 p=2 + 1: : : ; p
q 2 pi; : : : ; pi + k
(4)
(5)
(6)
An approach to optimizing the selection of queues, based on the "binding" of
queues to threads, has also been developed. This scheme allows you to specify
the order in which the thread accesses queues. In the implementation of
fastening, the following model is used: in total, the set contains kp queues, then each
thread has (6) xed queues. When performing an operation, the thread rst
turns to the one of (6) queue, thereby minimizing calls to blocked queues. If all
queues from the set (6) are blocked, then the original queue selection algorithm
among all queues is used.
4.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>Optimized Algorithms</title>
        <p>
          Algorithms presented in this section have the following semantics: Lock/Unlock
{ lock and unlock thread mutex for one of speci ed queues, tryLock { check
that queue is not processing by another thread at the exact same time and try
to lock mutex, RandQueue { return one queue from set of all queues by its
identi er that in the speci ed range, Rand2Queue { same as a RandQueue, but
returns pair of randomly selected queues, threadId - identi er of current thread
for this concrete process from 0 to p. For implementation it is recommended to
use thread a nity with cores, it helps to minimize context switching between
threads [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ].
        </p>
        <p>OptHalfInsert Algorithm 1 represents an optimized insertion algorithm of an
element in a Multiqueues structure. The queue is selected depending on which
half of the threads the current thread identi er belongs to.</p>
        <p>do
if i 2 0; 1; : : : ; p=2 then</p>
        <p>q = RandQueue(0, kp/2);
else</p>
        <p>q = RandQueue(kp/2+1, kp);
end
while tryLock(q) = f alse;
insert(q, element);
Unlock(q);</p>
        <sec id="sec-3-2-1">
          <title>Algorithm 1: OptHalfInsert</title>
          <p>OptHalfDelete Algorithm 2 is exactly the same as OptHalfInsert, but it takes
two queues for increase accuracy of deleted max (min) element.
do
if i 2 0; 1; : : : ; p=2 then</p>
          <p>(q1, q2) = Rand2Queue(0, kp/2);
else</p>
          <p>(q1, q2) = Rand2Queue(kp/2+1, kp);
end
q = GetMaxElementQueue(q1, q2);
while tryLock(q) = f alse;
removeMax(q);
Unlock(q);</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Algorithm 2: OptHalfDelete</title>
          <p>OptExactDelete Alternative optimized Algorithm 3. for deleting the
maximum element, is similar to Algorithm 3. but with one change: at rst it attempt
to lock queues that "attached" to the thread (as it is threadId), and only after
failing it search queues among all.</p>
          <p>rstIteration = true;
do
if rstIteration then</p>
          <p>(q1, q2) = Rand2Queue(threadId, threadId+p);
else
if i 2 0; 1; : : : ; p=2 then</p>
          <p>(q1, q2) = Rand2Queue(0, kp/2);
else</p>
          <p>(q1, q2) = Rand2Queue(kp/2+1, kp);
end
end
q = GetMaxElementQueue(q1, q2);</p>
          <p>rstIteration = false;
while tryLock(q) = f alse;
removeMax(q);
Unlock(q);</p>
        </sec>
        <sec id="sec-3-2-3">
          <title>Algorithm 3: OptExactDelete</title>
          <p>As a result of continuous running program, there may be an imbalance in
a relaxed priority queues: some queues may contain signi cantly more elements
than others. This circumstance leads to a decrease in the performance of the
algorithms, since empty queues become unsuitable for a delete operation, which
increases the search time for suitable queues to perform the operation. A
balancing algorithm (Algorithm 4) for the complete Multiqueues structure has been
created, for a uniform distribution of elements among the queues.
q1=FindLargestQueue();
q2=FindShortestQueue();
if size(q1) &gt; AvgSizeT otalSize() then</p>
          <p>Lock(q1);
q2IsLocked=LockWithTimeout(q2);
if q2IsLocked then
sizeToTransfer = size(q1)
TransferElements(q1, q2, sizeToTransfer);</p>
          <p>Unlock(q2);
end</p>
          <p>Unlock(q1);
end</p>
        </sec>
        <sec id="sec-3-2-4">
          <title>Algorithm 4: Balancing algorithm</title>
          <p>5
5.1</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <sec id="sec-4-1">
        <title>Testing Platform</title>
        <p>All experiments reported in this paper were processed on one node of cluster,
which equipped with dual-socket Intel Xeon X5670 where each socket contains
six cores with 2.93 GHz each (Hyper-Threading is disabled). 16Gb of RAM was
used.
5.2</p>
      </sec>
      <sec id="sec-4-2">
        <title>Measurement Technique</title>
        <p>
          As an indicator of e ciency, the throughput was used, which is calculated as the
sum of the carrying capacities of the threads (7), where n is the number of insert
/ delete operations by the thread i, t is the execution time of operations [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
bi = n=t
(7)
5.3
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>Benchmarks</title>
        <p>The e ectiveness of the original and optimized multiqueues was compared.
Individual insert / delete operations were investigated. Each thread was allocated
k = 2 queues. Each thread performed n = 106 insert operations and n = 0,5
106 delete operations. The following experiment shows the dependence of the
number of random operations (insertion, deletion) on the number of threads
used. The following options for using the insertion and deletion algorithms were
analyzed:
{ The original insertion algorithm (Insert) and the original element deletion
algorithm (Delete)
{ Optimized insertion algorithm (OptHalfInsert) and optimized element
deletion algorithm (OptHalfDelete)
0 2 4 6 8 10121416182022242628303234</p>
        <p>Number of Threads (p)</p>
        <p>Fig. 4. Deletion Throughput by Threads.</p>
        <p>Fig. 6. Deletion Throughput by Number
of Queues.
{ Optimized insertion algorithm (OptHalfInsert) and an alternative optimized
element removal algorithm (OptExactDelete)
The analysis of the optimal number of k queues per threads (Fig. 5 and g. 6).
We used a xed number of threads p = 12. The algorithms OptHalfInsert and
OptExactDelete has the maximum throughput of the system is achieved when
the number of queues k = 4 per thread.
6</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Future Work</title>
      <p>
        Coming up with high-performance data processing of large data volumes is
challenging in modern control systems. Concerning this, modern systems for complex
object control (especially large-scale distributed hierarchical control systems) are
based on multi- and many-core distributed computer systems. Computer systems
include a wide class of systems { from embedded systems and mobile devices
to cluster computing systems, massively parallel systems, GRID systems, and
cloud-based CS. Algorithmic and software tools for parallel programming is the
basis for building modern systems for processing big data, machine learning and
arti cial intelligence. The main class of systems used for high-performance
information processing are distributed CS - collectives of elementary machines
interacting through a communication environment. In design of parallel programs for
distributed CS, the de-facto standard is the messaging model, which is primarily
represented by the MPI (Message Passing Interface) standard. The scalability
of MPI programs depends signi cantly on the e ciency of the implementation
of collective information exchange operations (collectives) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Such operations
are used in most of the MPI programs, they account for a signi cant proportion
of the total program execution time. An adaptive approach for development of
collectives is promising. Nowadays, using only the message passing model
(MPIeverywhere) may not be su cient to develop e ective MPI programs. In this
regard, a promising approach is to use MPI for interaction between computer
nodes and multithreading support systems (PThreads, OpenMP, Intel TBB)
inside the nodes. The main task of implementation of hybrid mode is the
organization of scalable access of parallel threads to shared data structures (context
identi ers, virtual channels, message queues, request pools, etc.). Types of MPI
standard hybrid mode:
{ MPI THREAD SINGLE - one thread of execution
{ MPI THREAD FUNNELED - is a multi-threaded program, but only one
thread can perform MPI operations
{ MPI THREAD SERIALIZED - only one thread at the exact same time can
make a call to MPI functions
{ MPI THREAD MULTIPLE - each program ow can perform MPI functions
at any time.
      </p>
      <p>One of the implementations of the hybrid multi-threaded MPI program in the
MPI THREAD MULTIPLE mode is the MPICH version CH4 library, which
de nes standards for using lock-free data structures. In this mode, two types
of synchronization are available: trylock - in which the program cyclically tries
to capture the mutex and access the queue; hando - a thread-safe queue when
accessed which causes an active wait for an item by a thread.</p>
      <p>
        It is proposed to replace the thread-safe work queue from the izem library with
a relaxed thread-safe queue - Multiqueues, in which the mechanism for accessing
queue elements has been improved
The use of a Multiqueues with relaxed semantics for performing operations will
avoid the occurrence of bottlenecks when synchronizing threads [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Unlike most
existing lock-free thread-safe data structures and locking algorithms, where there
is a single point of execution of operations on the structure, a set of simple
sequential structures is used in relaxed data structures, the composition of which
is considered as a logical single structure. As a result, the number of possible
points of access to this structure increases. This approach will allow to achieve
much greater throughput compared to existing data structures.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>An optimized version of a relaxed concurrent priority queue based on
Multiqueues has been developed. The developed insertion and deletion algorithms
has a 1.2 and 1.6 times performance boost, respectively, compared to the
original insertion and deletion algorithms. Optimization is achieved by reducing the
number of collisions based on the limitation of the random structure selection. In
implementation is the thread a nity is used for minimization contention between
cores. It is proposed to use a scalable thread-safe queue with relaxed semantics
in hybrid multi-threaded MPI programs (MPI + threads model), which will
reduce the overhead of synchronizing threads when performing operations with a
working task queue. There are some evidences, that this queue is highly scalable
and reduce overheads for thread synchronization. The implementation of these
algorithms is publicly available at https://github.com/Komdosh/Multiqueues.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Anenkov</surname>
            ,
            <given-names>A. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paznikov</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kurnosov</surname>
            ,
            <given-names>M. G.</given-names>
          </string-name>
          :
          <article-title>Algorithms for access localization to objects of scalable concurrent pools based on di racting trees in multicore computer systems</article-title>
          . In: 2018
          <string-name>
            <given-names>XIV</given-names>
            <surname>International Scienti</surname>
          </string-name>
          c-Technical Conference on Actual Problems of Electronics Instrument Engineering (APEIE) pp.
          <fpage>374</fpage>
          -
          <lpage>380</lpage>
          . IEEE (
          <year>2018</year>
          ) https://doi.org/10.1109/APEIE.
          <year>2018</year>
          .8545197
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Tabakov</surname>
            <given-names>A. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Veretennikov L. M.</surname>
          </string-name>
          <article-title>: Relaxed Concurrent Data Structures</article-title>
          .
          <source>In III Science of Present and Future</source>
          , pp.
          <volume>105</volume>
          {
          <issue>107</issue>
          ,
          <string-name>
            <surname>ETU</surname>
          </string-name>
          , Russia, (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. TOP500 supercomputers list, https://www.top500.
          <article-title>org/news/summit-up-andrunning-at-oak-ridge-claims- rst-exascale-application</article-title>
          .
          <source>Last accessed 7 Nov 2019</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Herlihy</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shavit</surname>
            <given-names>N.:</given-names>
          </string-name>
          <article-title>The art of multiprocessor programming</article-title>
          . Morgan Kaufmann, Boston (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Shavit</surname>
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Touitou</surname>
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Software transactional memory</article-title>
          .
          <source>Distributed Computing</source>
          <volume>10</volume>
          ,
          <issue>99</issue>
          {
          <fpage>116</fpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Tabakov</surname>
            ,
            <given-names>A. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paznikov</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          :
          <article-title>Using relaxed concurrent data structures for contention minimization in multithreaded MPI programs</article-title>
          .
          <source>Journal of Physics: Conference Series</source>
          <volume>1399</volume>
          (
          <issue>3</issue>
          ), pp.
          <fpage>033037</fpage>
          . IOP Publishing (
          <year>2019</year>
          ) https://doi.org/10.1088/
          <fpage>1742</fpage>
          -6596/1399/3/033037
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Tabakov</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paznikov</surname>
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Modelling of parallel threads synchronization in hybrid MPI+Threads programs</article-title>
          .
          <source>In XXI IEEE International Conference on Soft Computing and Measurements (SCM)</source>
          ,
          <volume>4</volume>
          {
          <issue>7</issue>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Paznikov</surname>
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anenkov</surname>
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Implementation and Analysis of Distributed Relaxed Concurrent Queues in Remote Memory Access Model In XIII International Symposium \Intelligent Systems { 2018" (INTELS'18)</article-title>
          .
          <source>Procedia Computer Science</source>
          <volume>50</volume>
          ,
          <issue>654</issue>
          {
          <fpage>662</fpage>
          (
          <year>2019</year>
          ) https://doi.org/10.1016/j.procs.
          <year>2019</year>
          .
          <volume>02</volume>
          .101
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Rihani</surname>
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sanders</surname>
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dementiev</surname>
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Multiqueues: Simpler, faster, and better relaxed concurrent priority queues</article-title>
          .
          <source>arXiv pre-print arXiv:1411.1209 50</source>
          ,
          <issue>277</issue>
          {
          <fpage>278</fpage>
          (
          <year>2015</year>
          ) https://arxiv.org/pdf/1411.1209.pdf.
          <source>Last accessed 7 Nov 2019</source>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Goncharenko</surname>
            <given-names>E.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paznikov</surname>
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tabakov</surname>
            <given-names>A.V.</given-names>
          </string-name>
          :
          <article-title>Evaluating the performance of atomic operations on modern multicore systems</article-title>
          .
          <source>Journal of Physics: Conference Series</source>
          <volume>1399</volume>
          ,
          <issue>033107</issue>
          (
          <year>2019</year>
          ) https://doi.org/10.1088/
          <fpage>1742</fpage>
          -6596/1399/3/033107
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Alistarh D</surname>
          </string-name>
          . et al.
          <article-title>The SprayList: A scalable relaxed priority queue</article-title>
          .
          <source>ACM SIGPLAN Notices</source>
          <volume>10</volume>
          ,
          <issue>11</issue>
          {
          <fpage>20</fpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Wimmer</surname>
            <given-names>M.</given-names>
          </string-name>
          et al.
          <article-title>The lock-free k-LSM relaxed priority queue</article-title>
          .
          <source>ACM SIGPLAN Notices</source>
          <volume>50</volume>
          ,
          <issue>277</issue>
          {
          <fpage>278</fpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Y.</given-names>
            <surname>Afek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Korland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. Yanovsky.</given-names>
            :
            <surname>Quasi-Linearizability</surname>
          </string-name>
          :
          <article-title>Relaxed Consistency for Improved Concurrency</article-title>
          .
          <source>OPODIS 6490</source>
          ,
          <issue>3</issue>
          {
          <fpage>10</fpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Tabakov</surname>
            ,
            <given-names>A. V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paznikov</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          :
          <article-title>Algorithms for optimization of relaxed concurrent priority queues in multicore systems</article-title>
          .
          <source>In: 2019 IEEE Conference of Russian Young Researchers in Electrical and Electronic</source>
          Engineering (EIConRus)
          <volume>360</volume>
          {
          <fpage>365</fpage>
          . (
          <year>2019</year>
          ) https://doi.org/10.1109/EIConRus.
          <year>2019</year>
          .8657105
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Paznikov</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shichkina</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Algorithms for optimization of processor and memory a nity for Remote Core Locking synchronization in multithreaded applications</article-title>
          .
          <source>Information</source>
          <volume>9</volume>
          (
          <issue>1</issue>
          ),
          <volume>21</volume>
          {
          <fpage>24</fpage>
          (
          <year>2018</year>
          ) https://doi.org/10.3390/info9010021
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Kulagin</surname>
            <given-names>I. I.</given-names>
          </string-name>
          :
          <article-title>Means of architectural-oriented optimization of parallel program execution for computing systems with multilevel parallelism</article-title>
          .
          <source>NUSC 10</source>
          ,
          <issue>77</issue>
          {
          <fpage>82</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Paznikov</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smirnov</surname>
            ,
            <given-names>V. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Omelnichenko</surname>
            ,
            <given-names>A. R.</given-names>
          </string-name>
          <string-name>
            <surname>Towards</surname>
          </string-name>
          E - cient
          <source>Implementation of Concurrent Hash Tables and Search Trees Based on Software Transactional Memory. In: 2019 International Multi-Conference on Industrial Engineering and Modern Technologies (FarEastCon) 1{5</source>
          (
          <year>2019</year>
          ). https://doi.org/10.1109/FarEastCon.
          <year>2019</year>
          .8934131
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>