Empathy List Archives

Nick F

Wed, Jul 12, 2023 1:13 AM

Good afternoon,

I have been trying to use Gem5 to research and study the performance of
several different computer architectures. However, I have been noticing
that I may be unable to accurately model the differences in cycle length
for computer programs.

Take for example these two programs:

#include <stdint.h>

int main(void)
{
    for (uint32_t i = 0; i < 1000; i++) {
        uint32_t x = 5 * 6;
        if (x != 30) {
            return 1;
        }
    }
    return 0;
}

#include <stdint.h>

int main(void)
{
    for (uint32_t i = 0; i < 1000; i++) {
        uint32_t x = 5 + 6;
        if (x != 11) {
            return 1;
        }
    }
    return 0;
}

Compiling and running both individually on a basic RISC-V CPU config,
they both exit at exactly 1,297,721,000. However, in a real system, each
multiply operation would take longer and I'd suspect doing 1000
multiplications would have even a tiny difference in performance. My own
research would also have difficulties analyzing relative performance
unless I'm missing something.

Even custom instructions seem to execute in a single CPU cycle
regardless of how the hardware would be implemented.

Is there a good way to define cycle delays in my Gem5 environment? I can
implement a "multiply" function that inserts a bunch of no-ops, but that
would become harder to manage as program complexity grows.

I've written a small blog post
https://fleker.medium.com/modeling-memristors-to-execute-physically-accurate-imply-operations-in-gem5-ef888b7dc49b
exploring some of what I've tried in the past week. If anyone here has
any suggestions I'd be interested to hear them.

Thanks,

Nick


Ayaz Akram

Wed, Jul 12, 2023 3:00 AM

Hi Nick,

I wonder which optimization flag you are using while compiling your
program? My guess is that the behavior you are observing is because the
compiler is able to figure out that x is a constant that can be
determined statically, so the binary it generates in both cases
probably does not do anything (there is nothing else in the loop
either). As a result, you don't see any difference in cycle count.

As far as your blog post is concerned, please note that the format block
defines how your instruction will be executed in the simulation. Basically,
the value of `i` will not have any impact on the latency of the
instruction. To change the latency, you will have to separately change the
latency of the opClass of the instruction you have added. For example, for
O3CPU you can have a look at the latencies of different instructions:

https://gem5.googlesource.com/public/gem5/+/refs/heads/develop/src/cpu/o3/FuncUnitConfig.py

-Ayaz


Eliot Moss

Wed, Jul 12, 2023 3:05 AM

Unless you work really hard to defeat it, any compiler will do 5 + 6 or 5 * 6 at
compile time, producing a small constant. The if test will go away, and the loop
probably will, too. Did you look at the actual machine code of the executable?

Even if you turned all optimizations off, etc., many cores can do adds and
multiplies in about the same amount of time, and given pipelines, other work,
etc., it might well come out to the same number of cycles even if the operations
are there.

So, I'd first check the machine code, and would also want to know the specific
gem5 model being used (in order, out of order, etc.) and other parameters ...

Best - EM


gem5-users@gem5.org

Analyzing instruction cycle count