While debugging 'IndirectMemoryPrefetcher' (IMP), I found that the stride
(or stream) prefetcher implemented inside it may be incorrect. Instead of
fetching the next few cachelines (the number is set by 'streamingDistance'),
it was sending a prefetch request only for the current cacheline, and that
request got squashed because the cacheline was already in the cache or in
the MSHR.

I compared the stream prefetcher inside IMP with the original standalone
'StridePrefetcher' and found that, in IMP, the offsets of the next requested
cachelines were not multiplied by the cacheline size. For stride=1 and
streamingDistance=4 it should request blockAddress + i * 64, where i ranges
from 1 to 4. As a result, even with streamingDistance=4,
'IndirectMemoryPrefetcher' was requesting only the current cacheline, while
'StridePrefetcher' was fetching the next 4 cachelines.
In indirect_memory.cc
[image: Screenshot 2023-12-03 at 21.36.33.png]
In stride.cc
[image: Screenshot 2023-12-03 at 21.31.53.png]
For code like:
    uint8_t dataArray[size];
    printf("Stride prefetcher training start\n");
    for (int i = 0; i < size; i++)
        x = dataArray[i];
    printf("Stride prefetcher training end\n");
The value of 'delta' in stride.cc is 64, while in indirect_memory.cc it is
1. As a result, StridePrefetcher pushes 4 separate cachelines into
'addresses', while the stride prefetcher in IndirectMemoryPrefetcher pushes
only the current cacheline.
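To make the arithmetic concrete, here is a minimal standalone sketch of the
candidate-address computation. It is not the gem5 source: 'streamCandidates'
and its parameter names are hypothetical, and the only assumption carried
over is that each candidate is addr + i * delta, rounded down to its
cacheline.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not gem5 code): return the block-aligned addresses a
// stream prefetcher would request for `distance` steps of `delta` bytes.
std::vector<std::uint64_t> streamCandidates(std::uint64_t addr,
                                            std::int64_t delta,
                                            unsigned distance,
                                            std::uint64_t blkSize)
{
    // The alignment mask below only works for power-of-two block sizes.
    assert(blkSize != 0 && (blkSize & (blkSize - 1)) == 0);

    std::vector<std::uint64_t> blocks;
    for (unsigned i = 1; i <= distance; ++i) {
        const std::int64_t offset = static_cast<std::int64_t>(i) * delta;
        const std::uint64_t candidate =
            addr + static_cast<std::uint64_t>(offset);
        // Round down to the start of the cacheline holding the candidate.
        blocks.push_back(candidate & ~(blkSize - 1));
    }
    return blocks;
}
```

With addr = 0x1000, distance = 4 and blkSize = 64, a delta of 64 yields
0x1040, 0x1080, 0x10c0 and 0x1100, while a delta of 1 yields 0x1000 four
times, which matches the behaviour described above.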
My goal is to confirm whether, for a loop like
    for (int i = 0; i < size; i++)
        x = array2[array1[i]];
array1 is being prefetched by the StridePrefetcher.
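For reference, a self-contained version of that loop could look like the
sketch below. The element types, the index generator and the function names
('indirectSum', 'makeIndices') are my own assumptions for illustration.
Returning the sum is just a way to keep the compiler from dropping the loads.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical index generator: fills array1 with scattered indices in
// [0, range) using a simple linear congruential sequence.
std::vector<std::uint32_t> makeIndices(std::size_t n, std::uint32_t range)
{
    assert(range != 0);
    std::vector<std::uint32_t> idx(n);
    std::uint32_t state = 12345u;
    for (auto& v : idx) {
        state = state * 1664525u + 1013904223u;
        v = state % range;
    }
    return idx;
}

// array1 is read sequentially (the stream access); array2 is read through
// the indices stored in array1 (the indirect access).
std::uint64_t indirectSum(const std::vector<std::uint32_t>& array1,
                          const std::vector<std::uint8_t>& array2)
{
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < array1.size(); ++i)
        sum += array2[array1[i]];
    return sum;
}
```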
Is my understanding correct, or is 'IndirectMemoryPrefetcher' meant to be
used in some other way (for example, inside 'MultiPrefetcher' together with
'StridePrefetcher')?
I'm running my compiled C++ binary as:
    build/X86/gem5.opt configs/deprecated/example/se.py \
        --cpu-type=DerivO3CPU --caches --l1d_size=32kB --l1i_size=32kB \
        --l2cache --l2_size=512kB --mem-type=DDR3_1600_8x8 \
        --bp-type=TAGE_SC_L_64KB --l1d-hwp-type=StridePrefetcher \
        --l2-hwp-type=StridePrefetcher -c streamPoC