Empathy List Archives

gem5-users@gem5.org

The gem5 Users mailing list

Issue regarding Indirect Memory Prefetcher

Anshul Naithani

Mon, Dec 4, 2023 5:05 PM

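I was debugging the execution of 'IndirectMemoryPrefetcher' (IMP) and found
that the stride (stream) prefetcher implemented inside it may be incorrect.
Instead of fetching the next few cachelines (the number set by
'streamingDistance'), it only sends a prefetch request for the current
cacheline, and that request gets squashed because the cacheline is already
in the cache or in the MSHR.

I compared the execution of the stream prefetcher inside IMP with the
original standalone 'StridePrefetcher' and found that the offset to the next
few requested cachelines was not multiplied by the cacheline size (for
stride=1 and streamingDistance=4 it should request blockAddress + i * 64,
where i ranges from 1 to 4). As a result, even with streamingDistance=4,
'IndirectMemoryPrefetcher' was only requesting the current cacheline, while
'StridePrefetcher' was fetching the next 4 cachelines.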
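
To make the expected arithmetic concrete, here is a minimal standalone sketch
(not gem5 code). It assumes a 64-byte cacheline, and the names `kBlockSize`,
`blockAlign`, and `streamCandidates` are hypothetical, chosen only for
illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Addr = std::uint64_t;

// Assumed cacheline size in bytes (the 64 used in the message).
constexpr Addr kBlockSize = 64;

// Round an address down to the start of its cacheline.
Addr blockAlign(Addr addr) { return addr & ~(kBlockSize - 1); }

// Hypothetical helper: cachelines a stream prefetcher would request for
// distances 1..distance. `strideBytes` is the per-step offset in bytes.
std::vector<Addr> streamCandidates(Addr addr, std::int64_t strideBytes,
                                   int distance)
{
    assert(distance >= 0);
    std::vector<Addr> lines;
    for (int i = 1; i <= distance; ++i) {
        Addr target = addr + static_cast<Addr>(i * strideBytes);
        lines.push_back(blockAlign(target));
    }
    return lines;
}
```

With addr = 0x1000 and a distance of 4, a 64-byte step yields 0x1040, 0x1080,
0x10c0 and 0x1100, while a 1-byte step yields 0x1000 four times, i.e. only
the current cacheline.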

In indirect_memory.cc
[image: Screenshot 2023-12-03 at 21.36.33.png]
In stride.cc
[image: Screenshot 2023-12-03 at 21.31.53.png]

For code like:

uint8_t dataArray[size];
printf("Stride prefetcher training start\n");

// Sequential one-byte reads over dataArray.
for (int i = 0; i < size; i++)
    x = dataArray[i];
printf("Stride prefetcher training end\n");

The value of 'delta' in stride.cc is 64, while in indirect_memory.cc it is 1.

As a result, StridePrefetcher pushes 4 separate cachelines into 'addresses',
while the stream prefetcher in IndirectMemoryPrefetcher pushes only the
current cacheline.
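
A delta of 64 versus 1 for the same one-byte access pattern would be
consistent with one prefetcher rounding sub-cacheline strides up to a full
line and the other using the raw byte difference. The sketch below shows such
a rounding step in isolation. It is an assumption about the mechanism, not a
copy of stride.cc or indirect_memory.cc, and `roundStrideToBlock` is a
hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: widen a stride smaller than one cacheline to exactly
// one cacheline, keeping its sign; larger strides pass through unchanged.
// A stride of 0 is treated as positive and becomes +blkSize.
std::int64_t roundStrideToBlock(std::int64_t strideBytes,
                                std::int64_t blkSize)
{
    assert(blkSize > 0);
    if (std::abs(strideBytes) < blkSize)
        return strideBytes < 0 ? -blkSize : blkSize;
    return strideBytes;
}
```

Under this assumption, a raw stride of 1 byte with 64-byte cachelines becomes
64, whereas skipping the rounding leaves it at 1.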

What I want to confirm is whether, for a loop like

for (int i = 0; i < size; i++)
    x = array2[array1[i]];

array1 is being prefetched by the StridePrefetcher or not.
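
In this loop, array1 is read sequentially (the part a stream prefetcher can
cover), and each loaded value then selects an element of array2. The address
of that second access follows ordinary array indexing, sketched below as a
standalone illustration. `indirectTarget` and its parameters are hypothetical
names, and element sizes are assumed to be powers of two so that the scaling
can be written as a shift.

```cpp
#include <cassert>
#include <cstdint>

using Addr = std::uint64_t;

// Hypothetical helper: address touched by array2[index] when each element
// of array2 is (1 << shift) bytes wide and array2 starts at `base`.
Addr indirectTarget(Addr base, std::uint64_t index, unsigned shift)
{
    assert(shift < 64);
    return base + (index << shift);
}
```

For example, with 4-byte elements (shift = 2), base 0x2000 and index 5, the
access lands at 0x2014. Computing that address ahead of time requires the
index value from array1 to be available early, which is why it matters
whether array1 itself is being prefetched.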

Is my understanding correct, or is 'IndirectMemoryPrefetcher' meant to be
used in some other way (for example, in a 'MultiPrefetcher' along with
'StridePrefetcher')?

I'm running my compiled C++ code as:
build/X86/gem5.opt configs/deprecated/example/se.py --cpu-type=DerivO3CPU
--caches --l1d_size=32kB --l1i_size=32kB --l2cache --l2_size=512kB
--mem-type=DDR3_1600_8x8 --bp-type=TAGE_SC_L_64KB
--l1d-hwp-type=StridePrefetcher --l2-hwp-type=StridePrefetcher -c streamPoC
