gem5-users@gem5.org

The gem5 Users mailing list

Replacing CPU model in GPU-FS

Anoop Mysore
Fri, Jun 30, 2023 10:10 AM

According to the GPU-FS blog
(https://www.gem5.org/2023/02/13/moving-to-full-system-gpu.html):
"Currently KVM and X86 are required to run full system. Atomic and
Timing CPUs are not yet compatible with the disconnected Ruby network
required for GPUFS and is a work in progress."
My understanding is that KVM is used to boot Ubuntu; so, are the GPU
applications run on KVM? Also, what does a "disconnected" Ruby network
mean there?
If Atomic/Timing support is indeed in progress, is there any
work-in-progress code I can develop on, or any (noob-friendly)
documentation of what needs to be done to extend the support to
Atomic/O3 CPUs?
For a project I'm working on, I need complete visibility into the CPU+GPU
cache hierarchy, plus perhaps a few more custom probes; could you comment
on whether going with KVM in the meantime would be restrictive, given
that it leverages the host for the virtualized hardware?

Please let me know if I have got any of this wrong or if there are other
details you think would be useful.

Anoop Mysore
Fri, Jun 30, 2023 3:43 PM

It appears the host part of GPU applications is indeed executed on KVM,
per these slides:
https://www.gem5.org/assets/files/workshop-isca-2023/slides/improving-gem5s-gpufs-support.pdf

A few more questions:

  1. I missed that O3 CPU models aren't mentioned as supported -- would
    enabling one be as easy as changing the cpu_type in the config file
    and running? I intend to run with the latest O3 CPU config I have
    (modeling an Intel CPU).
  2. The Ruby network that's used -- is it intercepting (perhaps just MMIO)
    memory operations from the KVM CPU? Could you please briefly describe how
    Ruby is working with both KVM and GPU (or point me to any document)?
  3. The GPU MMIO trace we pass during simulator invocation -- what exactly
    is this? If it's a trace of the kernel driver/CPU's MMIO calls into the
    GPU, how is it portable across different programs within a benchmark
    suite -- HeteroSync, for example?
  4. In HeteroSync, there's fine-grain synchronization between CPU and GPU
    in many apps. If I use vega10_kvm.py, which has a discrete GPU with a
    KVM CPU, where do the synchronizations happen?
  5. If I want to move to an integrated GPU model with an O3 CPU (the only
    requirement is the shared LLC) -- are there any resources that can help
    me? I do see a bootcamp that uses apu_se.py -- can this be utilized at
    least partially to support a full-system O3 CPU + integrated GPU? Are
    there any modifications that need to be made to support
    synchronizations in the L3?

Please excuse the jumbled questions; I am in the process of gaining more
clarity.

Poremba, Matthew
Fri, Jun 30, 2023 5:00 PM

Hi,

No worries about the questions! I will try to answer them all, so this will be a long email 😊:

The disconnected (or disjoint) Ruby network is essentially the same as the APU Ruby network used in SE mode - that is, it combines two Ruby protocols in one protocol (MOESI_AMD_Base and GPU_VIPER). They are disjoint because there are no paths / network links between the GPU and CPU side, simulating a discrete GPU. These protocols work together because they use the same network messages / virtual channels to the directory – basically, you cannot simply drop in another CPU protocol and have it work.

Atomic CPU support started working very recently – as in this week.  It is on the review board right now and I believe it might be part of the gem5 v23.0 release.  However, the reason Atomic and KVM CPUs are required is that they use the atomic_noncaching memory mode and basically bypass the CPU cache. The timing CPUs (Timing and O3) try to generate routes to the GPU side, which causes deadlocks.  I have not had time to look into this further, but that is the status.
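
As a concrete illustration of that pairing, here is a minimal, hedged
sketch of how a gem5 run script selects a KVM or Atomic CPU together with
atomic_noncaching. The class names are standard gem5 ones (X86KvmCPU
requires a KVM-capable host and build), but the surrounding system
construction is omitted, so treat it as illustrative rather than the
actual GPUFS config:

    # Minimal sketch (not the real GPUFS script): KVM and Atomic CPUs
    # both run with the atomic_noncaching memory mode, which sends CPU
    # accesses straight to memory and bypasses the CPU-side Ruby caches.
    from m5.objects import System, X86KvmCPU, AtomicSimpleCPU

    use_kvm = True   # X86KvmCPU needs a KVM-capable host and gem5 build
    system = System()
    system.mem_mode = 'atomic_noncaching'
    cpu_class = X86KvmCPU if use_kvm else AtomicSimpleCPU
    system.cpu = [cpu_class(cpu_id=i) for i in range(2)]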

| are the GPU applications run on KVM?

The CPU portion of GPU applications runs on KVM.  The GPU is simulated in timing mode so the compute units, cache, memory, etc. are all simulated with events.  For an application that simply launches GPU kernels, the CPU is just waiting for the kernels to finish.

For your other questions:

  1. Unfortunately no, it is not that easy. There is an issue with timing CPUs that is still an outstanding bug – we focused on the atomic CPU recently as a way to allow users who aren't able to use KVM to still use the GPU model.
  2. KVM exits whenever there is a memory request outside of its VM range. The PCI address range is outside the VM range, so, for example, when the CPU writes to PCI space it will trigger an event for the GPU. The only Ruby involvement here is that Ruby will send all requests outside of its memory range to the IO bus (KVM or not). A toy sketch of this routing follows the list.
  3. The MMIO trace is only to load the GPU driver and not used in applications. It basically contains some reasonable register values for anything that is not modeled in gem5 so that we do not need to model them (e.g., graphics, power management, video encode/decode, etc.).  This is not required for compute-only GPU variants but that is a different topic.
  4. I’m not familiar enough with this particular application to answer this question.
  5. I think you will need to use SE mode to do what you are trying to do.  Full system mode is using the real GPU driver, ROCm stack, etc. which currently does not support any APU-like devices. SE mode is able to do this by making use of an emulated driver.
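
To make item 2 concrete, here is a purely illustrative Python toy of that
dispatch; the address ranges are invented for the example and are not
gem5's actual memory map:

    # Toy model of KVM exit routing: accesses inside the VM's DRAM range
    # are handled natively by KVM; anything outside it (e.g., a write to
    # a GPU PCI BAR) exits to the simulated IO path.
    VM_DRAM = range(0x0000_0000, 0x8000_0000)      # 2 GiB guest DRAM (made up)
    GPU_PCI_BAR = range(0xC000_0000, 0xC100_0000)  # hypothetical MMIO window

    def route(paddr: int) -> str:
        if paddr in VM_DRAM:
            return "handled natively inside KVM"
        if paddr in GPU_PCI_BAR:
            return "KVM exit -> gem5 IO bus -> GPU device event"
        return "KVM exit -> gem5 IO bus"

    print(route(0x1000))        # guest DRAM access
    print(route(0xC000_0100))   # GPU register write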

-Matt

Matt Sinclair
Fri, Jun 30, 2023 5:40 PM

Just to follow up on 4 and 5:

  4. The synchronization should happen at the directory level here, since
    this is the first level of the memory system where both the CPU and GPU
    are connected.  However, I have not tested whether, if the programmer
    sets the GLC bit (which should perform the atomic at the GPU's LLC),
    Ruby has the functionality to send invalidations as appropriate to
    allow this.  I suspect it would work as is, but would have to check ...

  5. Yeah, for the reasons Matt P already stated, O3 is not currently
    supported in GPUFS.  So GPUSE would be a better option here.  Yes, you
    can use the apu_se.py script as the base script for running GPUSE
    experiments.  There are a number of examples on gem5-resources for how
    to get started with this (including HeteroSync), but I normally
    recommend starting with square if you haven't used the GPU model before:
    https://gem5.googlesource.com/public/gem5-resources/+/refs/heads/develop/src/gpu/square/.
    In terms of support for synchronization at different levels of the
    memory hierarchy, by default the GPU VIPER coherence protocol assumes
    that all synchronization happens at the system level (at the directory,
    in the current implementation).  However, one of my students will be
    pushing updates (hopefully today) that allow non-system-level support
    (e.g., the GPU LLC "GLC" level as mentioned above).  It sounds like you
    want to change the cache hierarchy and coherence protocol to add another
    level of cache (the L3) before the directory and after the CPU/GPU
    LLCs?  If so, you would need to change the current Ruby support to add
    this additional level and the appropriate transitions to do so.
    However, if you instead meant that you are thinking of the directory
    level as synchronizing between the CPU and GPU, then you could use the
    support as is without any changes (I think).  A sketch of swapping the
    CPU model in such an SE-mode script follows below.
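
As a hedged sketch of that CPU-model swap: the O3 class name varies by
gem5 version (X86O3CPU in recent releases, DerivO3CPU in older ones), and
exactly how a given script such as apu_se.py exposes the choice also
varies, so treat this as illustrative only:

    # Minimal sketch of selecting the O3 CPU model in an SE-mode gem5
    # config script. Only the CPU/memory-mode pairing is shown; caches,
    # the GPU, and the workload setup are omitted.
    from m5.objects import System, X86O3CPU

    n_cpus = 2
    system = System()
    system.mem_mode = 'timing'  # O3 is a timing CPU, not atomic_noncaching
    system.cpu = [X86O3CPU(cpu_id=i) for i in range(n_cpus)]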

Hope this helps,
Matt S.

Anoop Mysore
Tue, Jul 4, 2023 11:00 AM

Thank you so much for the kind and detailed explanations!

Just to clarify: I can use the APU config (apu_se.py) and switch out to an
O3 CPU, and I would still have the detailed GPU model, and the disconnected
Ruby model that synchronizes between CPU and GPU at the system-level
directory -- is that correct?

Last question: when using the APU config for simulating HeteroSync --
which, for example, has a sleep-mutex primitive that invokes
__builtin_amdgcn_s_sleep() -- is there any OS involvement? If yes, would
SE mode's emulation of those syscalls inevitably sacrifice fidelity in a
way that could be argued to lead to inaccurate evaluations of
heterogeneous coherence implementations? Or are there any other factors
of insufficient fidelity that might be important in this regard?

Matt Sinclair
Wed, Jul 5, 2023 5:09 PM

Answers:

  1. Yes, I believe so.  However, I have never personally tried using the O3
    model with the GPU.  Matt P has, I believe, so he may have better feedback
    there.

  2. I have not followed the chain of events all the way through here, but
    I believe the builtin you highlighted is used at the compiler level by
    HIPCC/LLVM to generate the appropriate assembly for a given AMD GPU.  In
    this case (gfx900), I believe there is a 1-1 correspondence between this
    builtin and an s_sleep assembly instruction (maybe with the addition of
    a v_mov-type instruction before it to set the register to the
    appropriate sleep value).  I am not aware of the s_sleep builtin
    requiring OS calls (or emulation).  But what you have described is more
    generally the issue with SE mode (CPU, GPU, etc.) -- because SE mode
    does not model OS calls, the fidelity of anything involving the OS will
    be lower.  Perhaps a trite way to answer this is: if the fidelity of the
    OS calls is important for the applications you are studying, then I
    strongly recommend using FS mode.

Hope this helps,
Matt S.

On Tue, Jul 4, 2023 at 6:01 AM Anoop Mysore mysanoop@gmail.com wrote:

Thank you so much for the kind and detailed explanations!

Just to clarify: I can use the APU config (apu_se.py) and switch out to an
O3 CPU, and I would still have the detailed GPU model, and the disconnected
Ruby model that synchronizes between CPU and GPU at the system-level
directory -- is that correct?

Last question: when using the APU config for simulating HeteroSync which,
for example, has a sleep mutex primitive that invokes a
__builtin_amdgcn_s_sleep(), is there any OS involvement? If yes, would SE
mode's emulation of those syscalls inexorably sacrifice any fidelity that
could be argued leads to inaccurate evaluations of heterogeneous coherence
implementations? Or are any there other factors of insufficient fidelity
that might be important in this regard?

On Fri, Jun 30, 2023 at 7:40 PM Matt Sinclair <
mattdsinclair.wisc@gmail.com> wrote:

Just to follow-up on 4 and 5:

  1. The synchronization should happen at the directory-level here, since
    this is the first level of the memory system where both the CPU and GPU are
    connected.  However, I have not tested if the programmer sets the GLC bit
    (which should perform the atomic at the GPU's LLC) if Ruby has the
    functionality to send invalidations as appropriate to allow this.  I
    suspect it would work as is, but would have to check ...

  5. Yeah, for the reasons Matt P already stated, O3 is not currently
    supported in GPUFS, so GPUSE would be a better option here.  Yes, you can
    use the apu_se.py script as the base script for running GPUSE experiments.
    There are a number of examples on gem5-resources for how to get started
    with this (including HeteroSync), but I normally recommend starting with
    square if you haven't used the GPU model before:
    https://gem5.googlesource.com/public/gem5-resources/+/refs/heads/develop/src/gpu/square/.
    In terms of support for synchronization at different levels of the memory
    hierarchy, by default the GPU VIPER coherence protocol assumes that all
    synchronization happens at the system level (at the directory, in the
    current implementation).  However, one of my students will be pushing
    updates (hopefully today) that allow non-system-level support (e.g., the
    GPU LLC "GLC" level as mentioned above).  It sounds like you want to change
    the cache hierarchy and coherence protocol to add another level of cache
    (the L3) before the directory and after the CPU/GPU LLCs?  If so, you would
    need to change the current Ruby support to add this additional level and
    the appropriate transitions to do so.  However, if you instead meant that
    you are thinking of the directory level as synchronizing between the CPU
    and GPU, then you could use the support as is without any changes (I think).

Hope this helps,
Matt S.

On Fri, Jun 30, 2023 at 12:05 PM Poremba, Matthew via gem5-users <
gem5-users@gem5.org> wrote:


Hi,

No worries about the questions! I will try to answer them all, so this
will be a long email 😊:

The disconnected (or disjoint) Ruby network is essentially the same as
the APU Ruby network used in SE mode – that is, it combines two Ruby
protocols into one (MOESI_AMD_base and GPU_VIPER).  They are disjoint
because there are no paths / network links between the GPU and CPU sides,
simulating a discrete GPU. These protocols work together because they use
the same network messages / virtual channels to the directory –
basically, you cannot simply drop in another CPU protocol and have it work.

The Atomic CPU started working very recently – as in this week.  It is in
code review right now and I believe might be part of the gem5 v23.0
release.  However, the reason the Atomic and KVM CPUs are required is that
they use the atomic_noncaching memory mode and basically bypass the CPU
cache. The timing CPUs (timing and O3) try to generate routes to the
GPU side, which causes deadlocks.  I have not had any time to look into
this further, but that is the status.
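
A hedged illustration of that memory-mode distinction (mem_mode is a real
System parameter and the values below are the standard gem5 ones, but the
system object here is a placeholder, not the actual GPUFS config code):

    # KVM and Atomic CPUs run with the CPU caches bypassed:
    system.mem_mode = 'atomic_noncaching'

    # Timing and O3 CPUs need timing mode, where CPU requests actually
    # traverse the Ruby network -- and currently deadlock on the GPU side:
    system.mem_mode = 'timing'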

| are the GPU applications run on KVM?

The CPU portion of GPU applications runs on KVM.  The GPU is simulated
in timing mode so the compute units, cache, memory, etc. are all simulated
with events.  For an application that simply launches GPU kernels, the CPU
is just waiting for the kernels to finish.

For your other questions:

  1. Unfortunately no, it is not that easy. There is a still-outstanding
    bug with the timing CPUs – we focused on the Atomic CPU recently as a way
    to allow users who aren’t able to use KVM to use the GPU model.

  2. KVM exits whenever there is a memory request outside of its VM
    range. The PCI address range is outside the VM range, so for example when
    the CPU writes to PCI space it will trigger an event for the GPU. The only
    Ruby involvement here is that Ruby will send all requests outside of its
    memory range to the IO bus (KVM or not).

  3. The MMIO trace is only to load the GPU driver and not used in
    applications. It basically contains some reasonable register values for
    anything that is not modeled in gem5 so that we do not need to model them
    (e.g., graphics, power management, video encode/decode, etc.).  This is not
    required for compute-only GPU variants but that is a different topic.

  4. I’m not familiar enough with this particular application to answer
    this question.

  5. I think you will need to use SE mode to do what you are trying to
    do.  Full system mode is using the real GPU driver, ROCm stack, etc. which
    currently does not support any APU-like devices. SE mode is able to do this
    by making use of an emulated driver.

-Matt

AM
Anoop Mysore
Thu, Jul 6, 2023 10:49 AM

I understand; thanks again for the details.
