Adaptive Regular Pattern Prefetcher for L1
    Data Cache for Microprocessors with RISC-V
                       ISA

Bogdana Tishuk[0000−0002−4599−3136] and Sergei Bykovskii[0000−0003−4163−9743]

             ITMO University, 49 Kronverksky Pr., St. Petersburg, Russia
                tishuk52@gmail.com, sergei bykovskii@itmo.ru


        Abstract. Due to high percent of data is represented as N - dimen-
        sional arrays or other array - based structures, especially in scientific
        applications, access to this data will have regular pattern, consequently
        regular pattern prefetchers work effectively for a wide range of tasks.
        Main advantages of regular pattern prefetchers are low off - chip traffic,
        hardware simplicity, consequently low power consumption, compact area
        on the chip and low incremental memory bus load, caused by prefetching.
        This paper introduces AR2P - adaptive regular pattern prefetcher. The
        key ideas of AR2P are adaptive prefetch depth that depends on spatial
        locality of data, late prefetch detection, and decreasing power consump-
        tion as compared with RPT prefetcher due to flip - flop switch reduce.
        We evaluate AR2P using cycle accurate single core MARSS - RISCV
        simulator and TACLE benchmark set. AR2P shows 95% coverage on the
        average on benchmarks with regular access pattern and 37% for complex
        patterns.

        Keywords: RISC-V · prefetching · cache


1     Introduction
Instruction and Data prefetchers are critical modules of modern computing sys-
tems, due to DRAM-memory is a bottleneck and one memory access costs nearly
100 machine cycles, consequently cache misses significantly influence on task ex-
ecution time. Prefetching data and instructions into cache before demand is
designed to reduce cache misses and increase performance.
    Now microprocessors usually contain several prefetchers, for caches of differ-
ent levels and different memory access patterns. For example, Intel microarchi-
tectures: Ivy Bridge, Broadwell and other have two prefetchers for L2 cache and
two prefetchers for L1 data cache. For L1 data cache there is an adjacent line
prefetcher and PC - localized prefetcher. ARM cores have similar prefetching
subsystem.
    Prefetchers can take place in memory subsystem as shown on Figure 1 as a
variant.
    Copyright c 2019 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0).
2         B. Tishuk, S. Bykovskii


                    Figure 1 - Prefetchers in memory subsystem

      Main questions, that arise in design process of L1 prefetching subsystem are:

    – Is it possible to adapt prefetcher for various access patterns dynamically?
    – What is optimal for designed system, large L1 cache and simple prefetcher,
      or small L1 cache and complex accurate prefetcher?
    – Where is effective to keep prefetched data, in cache or in on-chip prefetch
      buffers?
    – Which event triggers prefetching?
        • Cache misses
        • LLC access
        • Prefetced data hits
    – How do conflict misses in not fully - associative caches depend on prefetch-
      ing?

    There is a trade off between accuracy, coverage and hardware complexity, al-
though recent researches[5] show that high percentage of effective prefetches are
accurately detected regular pattern prefetches, so high performance of prefetch-
ing subsystem can be reached by rather simple hardware.


2      Related works

This section introduces analysis of various regular pattern prefetchers and cor-
related prefetchers for irregular patterns, that are used in today computing sys-
tems.


2.1     Regular pattern prefetchers

The simplest regular pattern prefetcher is next N line prefetcher. On triggering
event on address X, prefetcher puts into cache memory N lines next to accessed
                                  Title Suppressed Due to Excessive Length        3

line. If memory virtualization is used, prefetcher should stop on physical page
bounds. On sequential access patterns like matrix operations with primitive data
types or for instruction cache next N line prefetchers are rather effective. Jamison
D. shows in his research that next N line prefetcher can have accuracy and
coverage more 90 % on hydro and about 75% on mgrid benchmarks [1].
    Other simple prefetcher is a stride prefetcher. On triggering event on address
X, prefetcher puts into cache memory words from X + S to X + (S + N)
addresses in common case, where S is the stride. Prefetcher detects if stride
repeats necessary times, it predicts that next stride will be same and prefetches
data. In common case, stride prefetchers are effective on operations with arrays
of structures and matrix rows and columns. Now microprocessors usually use
RPT prefetcher. To introduce RPT prefetcher consider situation, presented on
Figure 2.


                       Figure 2 - Accesses arrays in cycle


    There is a cycle, where each element of array C should be summed up with
appropriate elements of arrays A and B. On each step of the cycle A[i], B[i], C[i]
are fetched:
Stride 1 : address A[i] - address C[i-1]
Stride 2 : address B[i] - address A[i]
Stride 3 : address C[i] - address B[i]

    Due to this stride is various, consequently clearly discernible stride does not
exist and stride prefetcher is ineffective here. The solution to the problem is
an on chip table, where prefetcher keeps last memory access address , detected
stride and confidence counter. On triggering event for defined PC current access
address is compared with last address. If stride is the same, confidence counter
updates, else stride field of table updates with new value. If confidence counter
value is more than reference value, prefetch occurs on current address with added
stride. Main metrics: accuracy - percent of prefetch hit and coverage - percent
of cache misses, prevented by prefetcher, are 90% for art benchmark, and 30 %
and 23 % for gcc [3].
4      B. Tishuk, S. Bykovskii

2.2   Correlated prefetchers

Correlated prefetchers are effective on non-stride or irregular memory access
patterns as processing linked list or graph elements. Common principle of cor-
related prefetcher is memorizing sequential streams of memory accesses, and on
prefetching next stream part on triggering event.


Markov Table. Markov table is one of early solutions to organize temporary
correlated access pairs spatially, by table, indexed by physical address[2]. Each
table entry is fix - sized, but correlated streams can variate from two memory
words to several MBytes. Consequently Markov Table is prone to inner fragmen-
tation and loss of useful metadata due to fact that short streams waste space in
Markov table, long streams are truncated.


Domino Domino prefetcher[6] is accurate correlated prefetcher for L1 data
cache. Domino keeps metadata about correlated L1 data cache misses in two on
- chip miss history tables MHT1 and MHT2. Also Domino keeps information
about two previous L1 data cache misses in runtime.


                            Figure 3 - MHT1 entry

MHT2 holds tags of two previous misses and is indexed as by exclusive or of
current and previous L1 data cache miss. On L1 data cache miss Domino looks
up entry with the same tags of misses in MHT2 and prefetches data on prediction
address. If entry in MHT2 does not exist, Domino updates MHT2 with tags of
two previous misses and current miss as prediction, and looks up entry in MHT1.
If MHT1 entry exists, Domino prefetcher data on prediction address, if entry is
not available, Domino updates MHT1.


MISB ARM introduced Managed Irregular Stream Buffer (MISB) prefetcher
in 2019. The main idea of MISB is to create a new structural address space
in which correlated physical addresses are kept in sequence. The key point is
that in this structural address space, streams of correlated memory addresses
are both temporally ordered and spatially ordered. The mapping from physical
to structural addresses and vise versa is performed at a cache line granularity
by two spatially indexed on-chip address metadata caches whose contents can
be easily synchronized with that of the TLB[2].Unlike ISB, introduced in 2013
, MISB has inner stride prefetcher for metadata caches, Bloom filter, to filter
requests to off - chip mapping tables when mappings do not exist in mapping
tables, and finer mechanism of interaction with TLB. MISB shows significant
speedup on Libquantum and SoPlex - 95 % and 65% , on gcc about 20% [2].
MISB is an effective algorithm to process irregular access patterns, but hardware
complexity of MISB is significantly higher in comparison to other prefetchers.
                                   Title Suppressed Due to Excessive Length         5

3    Proposal
Today’s microprocessors usually have separate L1 caches for instructions and
data. If there are no branches, instruction are fetched sequentially from memory,
consequently for L1 instruction cache next N line prefetcher can be effectively
used with branch predictor interaction.
    For L1 data cache there is a solution to modify localized PC data prefetcher
by improving prefetch stride scale and prefetch depth prediction mechanism for
regular patterns. The second modification is unlike RPT prefetcher updates Last
address field for all entries of it’s on-chip metadata table, our solution updates
Last address for specified PC entry. This reduces power consumption due to field
last address flip-flop switch decrease.
    In common case for stride prefetcher if stride is close to the current demand,
next demand may occur before necessary data be prefetched, that defeats pur-
pose of prefetching. If stride is large, prefetching occurs too early, prefetched data
can be evicted from cache. RPT prefetcher potentially suffers of late prefetches.
This problem can be solved by detecting that demand was occurred before data
was prefetched. If this repeats several times sequentially, prefetcher decides to
increment stride . If stride doesn’t change for several times, prefetcher assumes
that it is a vector or matrix, calculates prefetch depth. Organization of prefetcher
is shown on Figure 4.


              Figure 4 - Functional scheme of proposed prefetcher
At the first moment stride is detected as difference between last address base +
offset and current address. Offset indicates position in prefetched vector, when
prefetch depth more than 1. Further if difference is equal stride, confidence
counter increments. Prefetch depth can be calculated as difference between con-
fidence counter and reference repeat value. When current address become equal
6      B. Tishuk, S. Bykovskii

last address base + prefetch depth, last address base changed to current address
and new data is prefetched. If stride changes, scale, confidence counter, late
counter are reset. On late prefetch event late counter increments. If late counter
overflows, stride is increased by itself, scale increments. Algorithm of proposed
prefetcher is shown on Figure 5


                         Figure 5 - Proposed algorithm
                                 Title Suppressed Due to Excessive Length      7

4   Evaluation
For value evaluation of proposed method was used cycle accurate simulator
MARSS - RISCV with following configuration:
 – core : out of order, single core
 – L1 data cache
     • size : 32 KB
     • ways : 8
     • eviction policy : LRU
 – L2 shared cache
     • size : 512 KB .
     • ways : 8
     • eviction policy : LRU
   The evaluation uses TACLE benchmarks to explore prefetcher efficiency on
regular access patterns and Linux launch for complex pattern exploration. To
make sure that results given by MARSS - RISCV simulator are adequate, bench-
marks were simulated on MARSS-RISCV without prefetcher and results of sim-
ulation were compared to RTL simulation of RISC-V core with the same config-
uration. The difference in measurements was not more than 0,9%.
   Models of AR2P, stride prefetcher, adjacent line prefetcher and Domino
prefetcher were simulated. Single core results are presented on the Figure 6.


                         Figure 6 - Single core coverage
   Coverage is a metric that shows what portion of cache misses were prevented
by prefetcher.
                                            p
                                      c=1−    ,                               (1)
                                           m
   where c - coverage, p - misses with prefetcher, m - misses without prefetcher.
   On diagram Tacle benchmarks bsort and matrix1 have mainly regular access
patterns, Tacle powerwindow and countnegatives have complex regular and ir-
regular patterns. Linux launch has more irregular than regular memory access
patterns.
8         B. Tishuk, S. Bykovskii

   AR2P shows 95% coverage on the average on Tacle benchmark set with
mainly regular patterns, that is 30% more effective than stride prefetcher and
20% more effective than adjacent line prefetcher. On complex patterns as Linux
launch AR2P shows 37% coverage, that is 25% less effective than Domino be-
cause of conflict evictions from reference prefetch table and non clear stride.
   At the moment vector of attention is directed on:
    – The ability to dynamically adapt to irregular and complex patterns. The
      main idea is to detect, if clear stride does not exist and conflict evictions
      occur frequently, prefetcher assumes, that memory access pattern is irregular
      and uses reference prefetch table as miss history table.
    – The dependence of efficiency of prefetcher on associativity of reference prefetch
      table/miss history table.


5      Conclusion
Researches[5] show that most effective prefetching is done by detecting and pro-
cessing regular access patterns. In this paper, we presented an analysis of the
most common data prefetching algorithms and presented an adaptive algorithm
for regular patterns that tracks and prevents late prefetching. Also, due to the
adaptive prefetch depth, the memory bus bandwidth consumption is reduced. As
expected, AR2P shows inferior results for irregular access patterns and current
works are directed on the ability to dynamically adapt to irregular and complex
patterns


References
1. Jamison D. Collins, Dean M. Tullsen: Runtime Identification of Cache Conflict
   Misses: The Adaptive Miss Buffer. ACM Transactions on Computer Systems, 4(5),
   414–439 (2001)
2. Jain A., Lin C.: Linearizing Irregular Memory Accesses for Improved Correlated
   Prefetching. In:Proceedings of the 46th Annual IEEE/ACM International Sympo-
   sium on Microarchitecture-MICRO-46, pp. 247–259. ACM , Davis, CA, USA (2013)
3. Al-Sukhni H., Holt J., Connors D. : Improved Stride Prefetching using Extrinsic
   Stream Characteristics. In:Performance Analysis of Systems and Software, 2006
   IEEE International Symposium, pp. 166–176. IEEE Xplore, April (2006)
4. Wu, H., Nathella, K., Sunwoo, D., Jain, A., Lin, C: Efficient Metadata Manage-
   ment for Irregular Data Prefetching. In:ISCA ’19: ACM International Symposium
   on Computer Architecture, pp. 1–13. June 22–26, 2019, Phoenix, Arizona, USA
5. Kondguli S., Huang M.: T2: A Highly Accurate and Energy Efficient Stride
   Prefetcher. In: 2017 IEEE International Conference on Computer Design (ICCD).,
   pp. 373–376.
6. Bakhshalipour M.,Lotfi-Kamran P., Sarbazi-Azad. H: Domino Temporal Data
   Prefetcher. In: 2018 IEEE International Symposium on High Performance Com-
   puter Architecture (HPCA)., pp. 131–142