From d99193d3fcc4b2a0dacc0a9d7e4951ea611a3e96 Mon Sep 17 00:00:00 2001 From: Jon Leech Date: Wed, 7 Feb 2024 19:51:30 -0800 Subject: Fix some improperly named files. --- proposals/VK_KHR_cooperative_matrix.adoc | 50 +++++++ proposals/VK_KHR_cooperative_matrix.asciidoc | 50 ------- proposals/VK_KHR_shader_expect_assume.adoc | 93 ++++++++++++ proposals/VK_KHR_shader_expect_assume.asciidoc | 93 ------------ proposals/VK_KHR_shader_maximal_reconvergence.adoc | 162 +++++++++++++++++++++ .../VK_KHR_shader_maximal_reconvergence.asciidoc | 162 --------------------- proposals/VK_KHR_shader_subgroup_rotate.adoc | 150 +++++++++++++++++++ proposals/VK_KHR_shader_subgroup_rotate.asciidoc | 150 ------------------- 8 files changed, 455 insertions(+), 455 deletions(-) create mode 100644 proposals/VK_KHR_cooperative_matrix.adoc delete mode 100644 proposals/VK_KHR_cooperative_matrix.asciidoc create mode 100644 proposals/VK_KHR_shader_expect_assume.adoc delete mode 100644 proposals/VK_KHR_shader_expect_assume.asciidoc create mode 100644 proposals/VK_KHR_shader_maximal_reconvergence.adoc delete mode 100644 proposals/VK_KHR_shader_maximal_reconvergence.asciidoc create mode 100644 proposals/VK_KHR_shader_subgroup_rotate.adoc delete mode 100644 proposals/VK_KHR_shader_subgroup_rotate.asciidoc diff --git a/proposals/VK_KHR_cooperative_matrix.adoc b/proposals/VK_KHR_cooperative_matrix.adoc new file mode 100644 index 00000000..83766b7a --- /dev/null +++ b/proposals/VK_KHR_cooperative_matrix.adoc @@ -0,0 +1,50 @@ +// Copyright 2021-2024 The Khronos Group Inc. +// +// SPDX-License-Identifier: CC-BY-4.0 + += VK_KHR_cooperative_matrix +:toc: left +:refpage: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/ +:sectnums: +
+This document proposes adding support for so-called cooperative matrix
+operations that enable multiple shader invocations to cooperatively and
+efficiently perform matrix multiplications. 
+
+== Problem Statement
+
+A growing number of GPU applications are making use of matrix multiplication
+operations. Modern GPU HW can take advantage of cross-invocation communication
+channels or other hardware facilities to implement matrix multiplication
+operations more efficiently, but there is currently no suitable standard
+SPIR-V/API mechanism to expose these features to applications or libraries.
+
+== Solution Space
+
+Applications or libraries can use subgroup primitives to write more efficient
+matrix multiplication kernels but, while technically possible on some hardware,
+this approach often does not make it possible to write optimal kernels and
+requires applications to have a lot of device-specific knowledge.
+
+With VK_NV_cooperative_matrix, NVIDIA exposed a new set of abstractions for such
+cooperative matrix operations. These include cooperative load and store
+instructions, a matrix multiplication-addition instruction, as well as limited
+support for element-wise operations on these matrices. Since the release of
+that extension, a growing body of evidence in the form of discussions and
+other similar vendor extensions suggests that this approach is suitable for
+a wide variety of devices and applications and is thus a good candidate for
+standardisation.
+
+== Proposal
+
+Work towards a standard extension that exposes abstractions similar to those
+released under VK_NV_cooperative_matrix.
+
+== Examples
+
+See specifications and presentations for VK_NV_cooperative_matrix.
+
+== Issues
+
+None.
+
diff --git a/proposals/VK_KHR_cooperative_matrix.asciidoc b/proposals/VK_KHR_cooperative_matrix.asciidoc deleted file mode 100644 index 83766b7a..00000000 --- a/proposals/VK_KHR_cooperative_matrix.asciidoc +++ /dev/null @@ -1,50 +0,0 @@ -// Copyright 2021-2024 The Khronos Group Inc. 
-// -// SPDX-License-Identifier: CC-BY-4.0 - -= VK_KHR_cooperative_matrix -:toc: left -:refpage: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/ -:sectnums: - -This document proposes adding support for so-called cooperative matrix -operations that enables multiple shader invocations to cooperatively and -efficiently perform matrix multiplications. - -== Problem Statement - -A growing number of GPU applications are making use of matrix multiplication -operations. Modern GPU HW can take advantage of cross-invocation communication -channels or other hardware facilities to implement matrix multiplications -operations more efficiently but there is currently no suitable standard -SPIR-V/API mechanism to expose these features to applications or libraries. - -== Solution Space - -Applications or libraries can use subgroup primitives to write more efficient -matrix multiplication kernels but, while technically possible on some hardware, -this approach often does not make it possible to write optimal kernels and -requires applications to have a lot of device-specific knowledge. - -NVIDIA exposed with VK_NV_cooperative_matrix a new set of abstractions for such -cooperative matrix operations. These include cooperative load and store -instructions, a matrix multiplication-addition instruction as well a limited -support for element-wise operations on these matrices. Since the release of -that extension, a growing body of evidence in the form of discussions and -other similar vendor extensions suggests that this approach is suitable for -a wide variety of devices and applications and is thus a good candidate for -standardisation. - -== Proposal - -Work towards a standard extension that exposes abstractions similar as those -released under VK_NV_cooperative_matrix. - -== Examples - -See specifications and presentations for VK_NV_cooperative_matrix. - -== Issues - -None. 
- 
diff --git a/proposals/VK_KHR_shader_expect_assume.adoc b/proposals/VK_KHR_shader_expect_assume.adoc new file mode 100644 index 00000000..9cfe62c2 --- /dev/null +++ b/proposals/VK_KHR_shader_expect_assume.adoc @@ -0,0 +1,93 @@ +// Copyright 2021-2024 The Khronos Group, Inc. +// +// SPDX-License-Identifier: CC-BY-4.0 +
+= VK_KHR_shader_expect_assume
+:toc: left
+:refpage: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/
+:sectnums:
+
+This document proposes adding support for expect/assume SPIR-V instructions
+to guide shader program optimizations.
+
+== Problem Statement
+
+Shader writers or generators, as well as other SPIR-V producers (e.g. machine
+learning compilers), often have access to information that could enable the SPIR-V
+consumers in Vulkan implementations to make better optimization decisions, such
+as knowledge of the likely value of objects or whether a given condition holds,
+but which they cannot communicate to a Vulkan SPIR-V consumer using existing features.
+
+== Solution Space
+
+SPIR-V already provides some mechanisms for producers to give hints to consumers
+in a limited number of scenarios:
+
+- `OpBranchConditional` can accept branch weights that enable producers to
+indicate the likelihood of each path. This does not, however, generalize
+to `OpSwitch` constructs.
+
+- Various so-called _Loop Controls_ make it possible for producers to provide
+metadata about the iteration count of loops or desired unrolling behaviour.
+
+There is, however, no generic mechanism exposed for SPIR-V producers to communicate
+optimisation information to consumers. 
SPIR-V does support dedicated instructions, +introduced by the +http://htmlpreview.github.io/?https://github.com/KhronosGroup/SPIRV-Registry/blob/master/extensions/KHR/SPV_KHR_expect_assume.html[SPV_KHR_expect_assume] +extension, that make it possible for producers to communicate to consumers the +likely value of an object or whether a given condition holds, but this extension +is currently not exposed in Vulkan. + +== Proposal + +Expose the +http://htmlpreview.github.io/?https://github.com/KhronosGroup/SPIRV-Registry/blob/master/extensions/KHR/SPV_KHR_expect_assume.html[SPV_KHR_expect_assume] +extension in Vulkan. + +The `SPV_KHR_expect_assume` extension introduces two new instructions: + +- `OpExpectKHR` makes it possible to state the most probable value of its input. +- `OpAssumeTrueKHR` enables the optimizer to assume that the provided condition is +always true. + +== Examples + +As an illustration, consider the following pseudocode example: + +[source] +---- +c = 20 +d = 2 +b = c / d + +if (a - b > 0) { + ... +} else { + ... +} +---- + +The writer or producer may know that a > 10. This knowledge makes it possible +to completely remove the `else` branch. In this case, the producer could perform +that optimisation alone. However, if the producer only knows that `a` is greater +than _some_ value provided, say with a specialization constant, it can no longer +perform the optimisation. Adding that information to the SPIR-V module would +enable the SPIR-V consumer to do it. + +Another possible use could be to provide guarantees that a particular value +is not NaN or infinite: + +[source] +---- +value = load(...) +assume(!isnan(value)) +---- + +== Issues + +1) What shader stages should the instructions introduced by this extension +be allowed in? + +*PROPOSED*: No restrictions are placed on the shader stages the instructions can +be used in. 
+ diff --git a/proposals/VK_KHR_shader_expect_assume.asciidoc b/proposals/VK_KHR_shader_expect_assume.asciidoc deleted file mode 100644 index 9cfe62c2..00000000 --- a/proposals/VK_KHR_shader_expect_assume.asciidoc +++ /dev/null @@ -1,93 +0,0 @@ -// Copyright 2021-2024 The Khronos Group, Inc. -// -// SPDX-License-Identifier: CC-BY-4.0 - -= VK_KHR_shader_expect_assume -:toc: left -:refpage: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/ -:sectnums: - -This document proposes adding support for expect/assume SPIR-V instructions -to guide shader program optimizations. - -== Problem Statement - -Shader writers or generators as well as other SPIR-V producers (e.g. Machine -Learning compilers) often have access to information that could enable the SPIR-V -consumers in Vulkan implementations to make better optimization decisions, such -as knowledge of the likely value of objects or whether a given condition holds, -but which they cannot communicate to a Vulkan SPIR-V consumer using existing features. - -== Solution Space - -SPIR-V already provides some mechanisms for producers to give hints to consumers -in a limited number of scenarios: - -- `OpBranchConditional` can accept branch weights that enable producers to -indicate the likelihood of each path. This does not however generalize -to `OpSwitch` constructs. - -- Various so called _Loop Controls_ make it possible for producers to provide -metadata about the iteration count of loops or desired unrolling behaviour. - -There is however no exposed generic mechanism for SPIR-V producers to communicate -optimisation information to consumers. 
SPIR-V does support dedicated instructions, -introduced by the -http://htmlpreview.github.io/?https://github.com/KhronosGroup/SPIRV-Registry/blob/master/extensions/KHR/SPV_KHR_expect_assume.html[SPV_KHR_expect_assume] -extension, that make it possible for producers to communicate to consumers the -likely value of an object or whether a given condition holds, but this extension -is currently not exposed in Vulkan. - -== Proposal - -Expose the -http://htmlpreview.github.io/?https://github.com/KhronosGroup/SPIRV-Registry/blob/master/extensions/KHR/SPV_KHR_expect_assume.html[SPV_KHR_expect_assume] -extension in Vulkan. - -The `SPV_KHR_expect_assume` extension introduces two new instructions: - -- `OpExpectKHR` makes it possible to state the most probable value of its input. -- `OpAssumeTrueKHR` enables the optimizer to assume that the provided condition is -always true. - -== Examples - -As an illustration, consider the following pseudocode example: - -[source] ----- -c = 20 -d = 2 -b = c / d - -if (a - b > 0) { - ... -} else { - ... -} ----- - -The writer or producer may know that a > 10. This knowledge makes it possible -to completely remove the `else` branch. In this case, the producer could perform -that optimisation alone. However, if the producer only knows that `a` is greater -than _some_ value provided, say with a specialization constant, it can no longer -perform the optimisation. Adding that information to the SPIR-V module would -enable the SPIR-V consumer to do it. - -Another possible use could be to provide guarantees that a particular value -is not NaN or infinite: - -[source] ----- -value = load(...) -assume(!isnan(value)) ----- - -== Issues - -1) What shader stages should the instructions introduced by this extension -be allowed in? - -*PROPOSED*: No restrictions are placed on the shader stages the instructions can -be used in. 
- 
diff --git a/proposals/VK_KHR_shader_maximal_reconvergence.adoc b/proposals/VK_KHR_shader_maximal_reconvergence.adoc new file mode 100644 index 00000000..7b361e4e --- /dev/null +++ b/proposals/VK_KHR_shader_maximal_reconvergence.adoc @@ -0,0 +1,162 @@ +// Copyright 2024 The Khronos Group, Inc. +// +// SPDX-License-Identifier: CC-BY-4.0 +
+= VK_KHR_shader_maximal_reconvergence
+:toc: left
+:refpage: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/
+:sectnums:
+
+== Problem Statement
+
+The SPIR-V specification defines several types of instructions as communicating between invocations.
+It refers to these instructions as
+https://www.khronos.org/registry/SPIR-V/specs/unified1/SPIRV.html#tangled_instruction[tangled
+instructions].
+Tangled instructions include very useful instructions such as subgroup
+operations and derivatives.
+In order to correctly reason about their programs, shader authors need to be
+able to understand, and to be provided some guarantees about, which invocations
+will be tangled together.
+Unfortunately, SPIR-V does not provide strong guarantees surrounding the
+divergence and reconvergence of invocations.
+The
+https://www.khronos.org/registry/SPIR-V/specs/unified1/SPIRV.html#uniform_control_flow[guarantees]
+it does provide are rather weak and lead to unreliable behaviour across
+different devices (or even different drivers of the same device).
+
+VK_KHR_shader_subgroup_uniform_control_flow provides stronger guarantees, but
+still has some drawbacks from a shader author's point of view.
+Shader authors would like to be able to reason about the divergence and
+reconvergence of invocations executing shaders written in an HLL and have that
+reasoning translate faithfully into SPIR-V.
+
+== Solution Space
+
+The following options were considered to address this issue:
+
+1. Add new mechanisms to SPIR-V, and optionally HLLs, that provide explicit
+   divergence and reconvergence information directly in the shader.
+2.
Add new guarantees to SPIR-V (through a new execution mode) that guarantee
+   that divergence and reconvergence in SPIR-V map intuitively from the shader's
+   representation in an HLL.
+
+The main advantage of option 1 is that it is completely explicit.
+The main disadvantage is that it likely requires additional changes in HLLs
+(otherwise just use option 2) and that it requires shader authors to write more
+verbose code to achieve what should, intuitively, be obvious behavior.
+
+The main advantage of option 2 is that there is almost no burden placed on
+shader authors (beyond requesting the new style of execution).
+Their code works how they expect across different devices.
+The main disadvantage is that drivers must be cautious to preserve the
+information implicitly encoded in the SPIR-V control flow graph throughout
+internal transformations in order to guarantee the expected divergence and
+reconvergence.
+Option 2 is a clear win for shader authors, and the difficulty for
+implementations is expected to be manageable.
+
+== Proposal
+
+=== SPV_KHR_maximal_reconvergence
+
+This extension exposes the ability to use the SPV_KHR_maximal_reconvergence
+SPIR-V extension, which provides
+extra guarantees surrounding divergence and reconvergence.
+
+The extension introduces the idea of a tangle, which is the set of invocations
+that execute a specific dynamic instruction instance, and provides a set of
+rules to reason about which invocations are included in each tangle.
+
+The rules are designed to match shader author intuition of divergence and
+reconvergence in an HLL.
+That is, divergence and reconvergence information is inferred directly from the
+control flow graph of the SPIR-V module.
+
+=== Examples
+
+[source,c]
+----
+uint myMaterialIndex = ...;
+for (;;) {
+    uint materialIndex = subgroupBroadcastFirst(myMaterialIndex);
+    if (myMaterialIndex == materialIndex) {
+        // Vulkan specification requires uniform access to the resource.
+        vec4 diffuse = texture(diffuseSamplers[materialIndex], uv);
+
+        // ... 
+
+        break;
+    }
+}
+----
+
+In the above example, the shader author relies on invocations executing
+different loop iterations being diverged from each other; however, SPIR-V does
+not guarantee this to be the case.
+Without maximal reconvergence, an implementation may interleave invocations
+among different iterations of the loop, inadvertently breaking the uniform
+access.
+Another potential problem is that implementations may treat the resource access
+as occurring outside the loop altogether, depending on how the compiler analyzes
+the program.
+With maximal reconvergence, invocations executing different loop iterations
+are never in the same tangle and the break block is always considered to be
+inside the loop.
+With those restrictions, this example behaves as the shader author expects.
+
+[source,c]
+----
+// Free should be initialized to 0.
+layout(set=0, binding=0) buffer BUFFER { uint free; uint data[]; } b;
+void main() {
+    bool needs_space = false;
+    ...
+    if (needs_space) {
+        // gl_SubgroupSize may be larger than the actual subgroup size so
+        // calculate the actual subgroup size.
+        uvec4 mask = subgroupBallot(needs_space);
+        uint size = subgroupBallotBitCount(mask);
+        uint base = 0;
+        if (subgroupElect()) {
+            // "free" tracks the next free slot for writes.
+            // The first invocation in the subgroup allocates space
+            // for each invocation in the subgroup that requires it.
+            base = atomicAdd(b.free, size);
+        }
+
+        // Broadcast the base index to other invocations in the subgroup.
+        base = subgroupBroadcastFirst(base);
+        // Calculate the offset from "base" for each invocation.
+        uint offset = subgroupBallotExclusiveBitCount(mask);
+
+        // Write the data in the allocated slot for each invocation that
+        // requested space.
+        b.data[base + offset] = ...;
+    }
+    ... 
+}
+----
+
+This example is borrowed from the
+https://github.com/KhronosGroup/Vulkan-Guide/blob/master/chapters/extensions/VK_KHR_shader_subgroup_uniform_control_flow.adoc[guide
+for VK_KHR_shader_subgroup_uniform_control_flow].
+Even with subgroup uniform control flow, the rewritten example carried the caveat
+that the code could only be executed from subgroup uniform control flow.
+With maximal reconvergence, the unaltered version of the code (as listed above)
+can be used directly to perform atomic compaction.
+The extra subgroup operations required by subgroup uniform control flow are no longer required.
+Maximal reconvergence guarantees that the election, broadcast, and bit count all
+operate on the same tangle.
+
+== Issues
+
+=== RESOLVED: Can a single behavior be provided for switch statements?
+
+Unfortunately, maximal reconvergence cannot guarantee a single behavior for
+switch statements.
+There are too many different implementations of switch statements;
+restricting the divergence and reconvergence behavior would have serious
+negative performance impacts on some implementations.
+Instead, shader authors should avoid switch statements in favour of if/else
+statements if they require guarantees about divergence and reconvergence.
+
diff --git a/proposals/VK_KHR_shader_maximal_reconvergence.asciidoc b/proposals/VK_KHR_shader_maximal_reconvergence.asciidoc deleted file mode 100644 index 7b361e4e..00000000 --- a/proposals/VK_KHR_shader_maximal_reconvergence.asciidoc +++ /dev/null @@ -1,162 +0,0 @@ -// Copyright 2024 The Khronos Group, Inc. -// -// SPDX-License-Identifier: CC-BY-4.0 -
-= VK_KHR_shader_maximal_reconvergence
-:toc: left
-:refpage: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/
-:sectnums:
-
-== Problem Statement
-
-The SPIR-V specification defines several types of instructions as communicating between invocations. 
-It refers to these instructions as -https://www.khronos.org/registry/SPIR-V/specs/unified1/SPIRV.html#tangled_instruction[tangled -instructions]. -Tangled instructions include very useful instructions such as subgroup -operations and derivatives. -In order to correctly reason about their programs, shader authors need to be -able to understand, and be provided some guarantees, about which invocations -will be tangled together. -Unfortunately, SPIR-V does not provide strong guarantees surrounding the -divergence and reconvergence of invocations. -The -https://www.khronos.org/registry/SPIR-V/specs/unified1/SPIRV.html#uniform_control_flow[guarantees] -it does provide are rather weak and lead to unreliable behaviour across -different devices (or even different drivers of the same device). - -VK_KHR_shader_subgroup_uniform_control_flow provides stronger guarantees, but -still has some drawbacks from a shader author's point of view. -Shader authors would like to be able to reason about the divergence and -reconvergence of invocations executing shaders written in a HLL and have that -reasoning translate faithfully into SPIR-V. - -== Solution Space - -The following options were considered to address this issue: - -1. Add new mechanisms to SPIR-V, and optionally HLLs, that provide explicit - divergence and reconvergence information directly in the shader. -2. Add new guarantees to SPIR-V (through a new execution mode) that guarantee - divergence and reconvergence in SPIR-V maps intuitively from the shader's - representation in a HLL. - -The main advantage of option 1 is that is completely explicit. -The main disadvantage is it likely requires additional changes in HLL -(otherwise just use option 2) and that it requires shader authors to write more -verbose code to achieve what should, intuitively, be obvious behavior. - -The main advantage of option 2 is that there is almost no burden placed on -shader authors (beyond requesting the new style of execution). 
-Their code works how they expect across different devices. -The main disadvantage is that drivers must be cautious to preserve the -information implicitly encoded in the SPIR-V control flow graph throughout -internal transformations in order to guarantee the expected divergence and -reconvergence. -Option 2 is a clear win for shader authors and the difficulty for -implementations is expected to be manageable. - -== Proposal - -=== SPV_KHR_maximal_reconvergence - -This extension exposes the ability to use the SPIR-V extension, which provides -extra guarantees surrounding divergence and reconvergence. - -The extension introduces the idea of a tangle, which is the set of invocations -that execute a specific dynamic instruction instance and provides a set of -rules to reason about which invocations are included in each tangle. - -The rules are designed to match shader author intuition of divergence and -reconvergence in an HLL. -That is, divergence and reconvergence information is inferred directly from the -control flow graph of the SPIR-V module. - -=== Examples - -[source,c] ----- -uint myMaterialIndex = ...; -for (;;) { - uint materialIndex = subgroupBroadcastFirst(myMaterialIndex); - if (myMaterialIndex == materialIndex) { - // Vulkan specification requires uniform access to the resource. - vec4 diffuse = texture(diffuseSamplers[materialIndex], uv); - - // ... - - break; - } -} ----- - -In the above example, the shader author relies on invocations executing -different loop iterations being diverged from each other; however, SPIR-V does -not guarantee this to be the case. -Without maximal reconvergence, an implementation may interleave invocations -among different iterations of the loop, inadvertently breaking the uniform -access. -Another potential problem is that implementations may treat the resource access -as occurring outside the loop altogether depending on how the compiler analyzes -the program. 
-With maximal reconvergence, invocations are executing different loop iterations -are never in the same tangle and the break block is always considered to be -inside the loop. -With those restrictions, this example behaves as the shader author expects. - -[source,c] ----- -// Free should be initialized to 0. -layout(set=0, binding=0) buffer BUFFER { uint free; uint data[]; } b; -void main() { - bool needs_space = false; - ... - if (needs_space) { - // gl_SubgroupSize may be larger than the actual subgroup size so - // calculate the actual subgroup size. - uvec4 mask = subgroupBallot(needs_space); - uint size = subgroupBallotBitCount(mask); - uint base = 0; - if (subgroupElect()) { - // "free" tracks the next free slot for writes. - // The first invocation in the subgroup allocates space - // for each invocation in the subgroup that requires it. - base = atomicAdd(b.free, size); - } - - // Broadcast the base index to other invocations in the subgroup. - base = subgroupBroadcastFirst(base); - // Calculate the offset from "base" for each invocation. - uint offset = subgroupBallotExclusiveBitCount(mask); - - // Write the data in the allocated slot for each invocation that - // requested space. - b.data[base + offset] = ...; - } - ... -} ----- - -This example is borrowed from the -https://github.com/KhronosGroup/Vulkan-Guide/blob/master/chapters/extensions/VK_KHR_shader_subgroup_uniform_control_flow.adoc[guide -for VK_KHR_shader_subgroup_uniform_control flow]. -Even with subgroup uniform control flow the rewritten example had a caveat that -the code could only be executed from subgroup uniform control flow. -With maximal reconvergence, the unaltered version of code (as listed above) can -be used directly to perform atomic compaction. -The extra subgroup operations required by subgroup uniform control flow are no longer required. -Maximal reconvergence guarantees that the election, broadcast and bit count all -operate on the same tangle. 
- -== Issues - -=== RESOLVED: Can a single behavior be provided for switch statements? - -Unfortunately, maximal reconvergence cannot guarantee a single behavior for -switch statements. -There are too many different implementations for a switch statement, -restricting the divergence and reconvergence behavior would have serious -negative performance impacts on some implementations. -Instead, shader authors should avoid switch statements in favour of if/else -statements if they require guarantees about divergence and reconvergence. - diff --git a/proposals/VK_KHR_shader_subgroup_rotate.adoc b/proposals/VK_KHR_shader_subgroup_rotate.adoc new file mode 100644 index 00000000..83eb8646 --- /dev/null +++ b/proposals/VK_KHR_shader_subgroup_rotate.adoc @@ -0,0 +1,150 @@ +// Copyright 2021-2024 The Khronos Group, Inc. +// +// SPDX-License-Identifier: CC-BY-4.0 + +# Subgroup rotation instruction +:toc: left +:refpage: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/ +:sectnums: + +## Problem Statement + +Subgroup operations are useful in the implementation of many compute algorithms. +Rotating values across invocations within a subgroup in particular can be useful +in the implementation of the convolution routines used in neural network inference. + +A rotation by N rotates values "down" N invocations within the subgroup. + +A rotation by (SubgroupSize - N) rotates values "up" N invocations +within the subgroup. + +Taking the example of a subgroup of size 16, a rotation by 2 would, +when executed by the invocation identified by id 0, return the value from the +invocation identified by id 2. The same rotation instruction, when executed +by the invocation identified by id 14, would return the value from the invocation +identified by id 0. + +While this rotation operation can be built on top of existing subgroup instructions, +doing so results in far from optimal performance on some implementations. 
+
+## Solution Space
+
+### Using existing broadcast instruction
+
+It is possible to broadcast the value for each invocation to all other invocations
+and for each invocation to calculate the id of the invocation whose value it needs
+to retain. This is very inefficient and the cost of the rotation operation as a
+whole grows linearly with the size of the subgroup. It is included here only for
+the sake of completeness.
+
+### Using existing shuffle instruction
+
+The rotation operation above can be built on top of the *OpGroupNonUniformShuffle*
+instruction, here abbreviated as `Shuffle`, as follows:
+
+```
+ShuffleRotate(value, amount) = Shuffle(value, ((amount + LocalId) & (SubgroupSize - 1)))
+```
+
+*OpGroupNonUniformShuffle* does not require the source
+invocation's id to be dynamically uniform within the subgroup, which results in
+inefficient code for implementations that can optimise the case where the source
+ID is dynamically uniform. Admittedly, it is possible for applications to decorate
+the calculated source id with `Uniform`, and for implementations to detect that pattern
+and emit optimised code, but this approach can be complex and costly to implement as
+well as brittle, especially without introducing a new high-level language construct.
+
+### Using existing relative shuffle instruction
+
+It is similarly possible to implement the rotation operation using the
+*OpGroupNonUniformShuffleUp* or *OpGroupNonUniformShuffleDown* relative shuffle
+instructions, which are more efficient on some implementations. However, these
+instructions also do not require the source invocation id to be dynamically
+uniform, and their relative nature makes calculating the source invocation ID
+required for a rotation operation more complex than with a general shuffle.
+
+### New shuffle features
+
+Another solution that was considered is the addition of new subgroup features
+that only enable shuffle instructions for cases where the source invocation ID
+is dynamically uniform. 
While this would be a significant step toward enabling a
+more efficient implementation of the rotation operation described here on
+implementations that can optimise this case, it would not solve the implementation
+complexity issues mentioned above.
+
+This functionality would however be otherwise useful and could be added to the
+current proposal or be the subject of a separate proposal.
+
+### New dedicated SPIR-V instruction
+
+Introduce a new dedicated SPIR-V instruction that performs subgroup rotation
+operations and requires the rotation amount to be dynamically uniform.
+
+## Proposal
+
+Expose a new dedicated SPIR-V instruction, as defined by
+http://htmlpreview.github.io/?https://github.com/KhronosGroup/SPIRV-Registry/blob/main/extensions/KHR/SPV_KHR_subgroup_rotate.html[SPV_KHR_subgroup_rotate],
+that rotates values across the invocations of a subgroup and requires
+the rotation amount to be dynamically uniform within the subgroup.
+
+Specify new built-in functions to expose the SPIR-V instruction in GLSL:
+
+```
+genType subgroupRotate(genType value, uint delta);
+genIType subgroupRotate(genIType value, uint delta);
+genUType subgroupRotate(genUType value, uint delta);
+genBType subgroupRotate(genBType value, uint delta);
+genDType subgroupRotate(genDType value, uint delta);
+
+genType subgroupClusteredRotate(genType value, uint delta, uint clusterSize);
+genIType subgroupClusteredRotate(genIType value, uint delta, uint clusterSize);
+genUType subgroupClusteredRotate(genUType value, uint delta, uint clusterSize);
+genBType subgroupClusteredRotate(genBType value, uint delta, uint clusterSize);
+genDType subgroupClusteredRotate(genDType value, uint delta, uint clusterSize);
+
+If GL_EXT_shader_subgroup_extended_types_int8 is enabled:
+
+genI8Type subgroupRotate(genI8Type value, uint delta);
+genU8Type subgroupRotate(genU8Type value, uint delta);
+
+genI8Type subgroupClusteredRotate(genI8Type value, uint delta, uint clusterSize);
+genU8Type 
subgroupClusteredRotate(genU8Type value, uint delta, uint clusterSize); + +If GL_EXT_shader_subgroup_extended_types_int16 is enabled: + +genI16Type subgroupRotate(genI16Type value, uint delta); +genU16Type subgroupRotate(genU16Type value, uint delta); + +genI16Type subgroupClusteredRotate(genI16Type value, uint delta, uint clusterSize); +genU16Type subgroupClusteredRotate(genU16Type value, uint delta, uint clusterSize); + +If GL_EXT_shader_subgroup_extended_types_int64 is enabled: + +genI64Type subgroupRotate(genI64Type value, uint delta); +genU64Type subgroupRotate(genU64Type value, uint delta); + +genI64Type subgroupClusteredRotate(genI64Type value, uint delta, uint clusterSize); +genU64Type subgroupClusteredRotate(genU64Type value, uint delta, uint clusterSize); + +If GL_EXT_shader_subgroup_extended_types_float16 is enabled: + +genF16Type subgroupRotate(genF16Type value, uint delta); + +genF16Type subgroupClusteredRotate(genF16Type value, uint delta, uint clusterSize); + +``` + +Each of the rotate functions shuffles `value` to the invocation with a `gl_SubgroupInvocationID` equal to `(gl_SubgroupInvocationID + delta) % gl_SubgroupSize` for `subgroupRotate`, or to the invocation with a `gl_SubgroupInvocationID` equal to `(gl_SubgroupInvocationID - (gl_SubgroupInvocationID % clusterSize)) + ((gl_SubgroupInvocationID % clusterSize + delta) % clusterSize)` for `subgroupClusteredRotate` functions. + +## Examples + +``` +OpCapability GroupNonUniformShuffleRotateKHR +... +%result = OpGroupNonUniformShuffleRotateKHR %result_type Subgroup %value %amount +``` + +## Further Functionality + +See the above description for new shuffle features that would require the +source invocation id to be dynamically uniform. 
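The index arithmetic proposed above, and its equivalence to the shuffle-based emulation described in the Solution Space, can be modelled outside of GLSL. The following Python sketch is illustrative only: the function names are invented for this model, and the shuffle emulation assumes a power-of-two subgroup size, as the masking formula requires.

```python
# Pure-Python model of the proposed rotation semantics. Each list index
# stands for a subgroup invocation (gl_SubgroupInvocationID); each element
# is that invocation's `value` argument.

def subgroup_rotate(values, delta):
    """Invocation i receives the value passed by invocation
    (i + delta) % SubgroupSize, i.e. values rotate 'down' by delta."""
    n = len(values)  # SubgroupSize
    return [values[(i + delta) % n] for i in range(n)]

def subgroup_clustered_rotate(values, delta, cluster_size):
    """Same rotation, performed independently within each
    cluster_size-sized slice of the subgroup."""
    n = len(values)
    out = []
    for i in range(n):
        base = i - (i % cluster_size)  # first invocation of i's cluster
        out.append(values[base + (i % cluster_size + delta) % cluster_size])
    return out

def shuffle_rotate(values, amount):
    """Rotation emulated with a general shuffle, as in the Solution Space:
    Shuffle(value, (amount + LocalId) & (SubgroupSize - 1)).
    Valid only when SubgroupSize is a power of two."""
    n = len(values)
    return [values[(amount + i) & (n - 1)] for i in range(n)]

subgroup = list(range(16))  # SubgroupSize = 16, invocation i holds value i
rotated = subgroup_rotate(subgroup, 2)
# Matches the Problem Statement example: invocation 0 receives the value
# from invocation 2, and invocation 14 receives the value from invocation 0.
assert rotated[0] == 2 and rotated[14] == 0
# The shuffle-based emulation agrees with the dedicated rotation.
assert shuffle_rotate(subgroup, 2) == rotated
# Clustered rotation wraps within each 4-invocation cluster.
assert subgroup_clustered_rotate(subgroup, 1, 4)[3] == 0
```

The clustered variant never reads outside an invocation's own cluster, which is why its source index is built from the cluster base plus a rotation within `clusterSize` rather than within the whole subgroup.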
diff --git a/proposals/VK_KHR_shader_subgroup_rotate.asciidoc b/proposals/VK_KHR_shader_subgroup_rotate.asciidoc deleted file mode 100644 index 83eb8646..00000000 --- a/proposals/VK_KHR_shader_subgroup_rotate.asciidoc +++ /dev/null @@ -1,150 +0,0 @@ -// Copyright 2021-2024 The Khronos Group, Inc. -// -// SPDX-License-Identifier: CC-BY-4.0 - -# Subgroup rotation instruction -:toc: left -:refpage: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/ -:sectnums: - -## Problem Statement - -Subgroup operations are useful in the implementation of many compute algorithms. -Rotating values across invocations within a subgroup in particular can be useful -in the implementation of the convolution routines used in neural network inference. - -A rotation by N rotates values "down" N invocations within the subgroup. - -A rotation by (SubgroupSize - N) rotates values "up" N invocations -within the subgroup. - -Taking the example of a subgroup of size 16, a rotation by 2 would, -when executed by the invocation identified by id 0, return the value from the -invocation identified by id 2. The same rotation instruction, when executed -by the invocation identified by id 14, would return the value from the invocation -identified by id 0. - -While this rotation operation can be built on top of existing subgroup instructions, -doing so results in far from optimal performance on some implementations. - -## Solution Space - -### Using existing broadcast instruction - -It is possible to broadcast the value for each invocation to all other invocations -and for each invocation to calculate the id of the invocation whose value it needs -to retain. This is very inefficient and the cost of the rotation operation as a -whole grows linearly with the size of the subgroup. It is included here only for -the sake of completeness. 
- -### Using existing shuffle instruction - -The rotation operation above can be built on top of the *OpGroupNonUniformShuffle* -instruction, here abbreviated as `Shuffle`, as follows: - -``` -ShuffleRotate(value, amount) = Shuffle(value, ((amount + LocalId) & (SubgroupSize - 1))) -``` - -*OpGroupNonUniformShuffle* does not require the source -invocation's id to be dynamically uniform within the subgroup which results in -inefficient code for implementations that can optimise the case where the source -ID is dynamically uniform. Admittedly, it is possible for applications to decorate -the calculated source id with `Uniform` and implementations to detect that pattern -and emit optimised code but this approach can be complex and costly to implement as -well as brittle, especially without introducing a new high-level language construct. - -### Using existing relative shuffle instruction - -It is similarly possible to implement the rotation operation using the -*OpGroupNonUniformShuffleUp* or *OpGroupNonUniformShuffleDown* relative shuffle -instruction that are more efficient on some implementations. However, these -instructions also do not require the source invocation id to be dynamically -uniform and their relative nature makes calculating the source invocation ID -required for a rotation operation more complex than with a general shuffle. - -### New shuffle features - -Another solution that was considered is the addition of new subgroup features -that only enable shuffle instructions for cases where the source invocation ID -is dynamically uniform. While this would be a significant step toward enabling a -more efficient implementation of the rotation operation described here on -implementations that can optimise this case, it would not solve the implementation -complexity issues mentioned above. - -This functionality would however be otherwise useful and could be added to the -current proposal or be the object of a separate proposal. 
- -### New dedicated SPIR-V instruction - -Introduce a new dedicated SPIR-V instruction that performs subgroup rotation -operations and requires the rotation distance to be dynamically uniform. - -## Proposal - -Expose a new dedicated SPIR-V instruction, as defined by -http://htmlpreview.github.io/?https://github.com/KhronosGroup/SPIRV-Registry/blob/main/extensions/KHR/SPV_KHR_subgroup_rotate.html[SPV_KHR_subgroup_rotate] -to express rotating values across the invocations of a subgroup that requires -the rotation amount to be dynamically uniform within the subgroup. - -Specify new built-in functions to expose the SPIR-V instruction in GLSL: - -``` -genType subgroupRotate(genType value, uint delta); -genIType subgroupRotate(genIType value, uint delta); -genUType subgroupRotate(genUType value, uint delta); -genBType subgroupRotate(genBType value, uint delta); -genDType subgroupRotate(genDType value, uint delta); - -genType subgroupClusteredRotate(genType value, uint delta, uint clusterSize); -genIType subgroupClusteredRotate(genIType value, uint delta, uint clusterSize); -genUType subgroupClusteredRotate(genUType value, uint delta, uint clusterSize); -genBType subgroupClusteredRotate(genBType value, uint delta, uint clusterSize); -genDType subgroupClusteredRotate(genDType value, uint delta, uint clusterSize); - -If GL_EXT_shader_subgroup_extended_types_int8 is enabled: - -genI8Type subgroupRotate(genI8Type value, uint delta); -genU8Type subgroupRotate(genU8Type value, uint delta); - -genI8Type subgroupClusteredRotate(genI8Type value, uint delta, uint clusterSize); -genU8Type subgroupClusteredRotate(genU8Type value, uint delta, uint clusterSize); - -If GL_EXT_shader_subgroup_extended_types_int16 is enabled: - -genI16Type subgroupRotate(genI16Type value, uint delta); -genU16Type subgroupRotate(genU16Type value, uint delta); - -genI16Type subgroupClusteredRotate(genI16Type value, uint delta, uint clusterSize); -genU16Type subgroupClusteredRotate(genU16Type value, uint 
delta, uint clusterSize); - -If GL_EXT_shader_subgroup_extended_types_int64 is enabled: - -genI64Type subgroupRotate(genI64Type value, uint delta); -genU64Type subgroupRotate(genU64Type value, uint delta); - -genI64Type subgroupClusteredRotate(genI64Type value, uint delta, uint clusterSize); -genU64Type subgroupClusteredRotate(genU64Type value, uint delta, uint clusterSize); - -If GL_EXT_shader_subgroup_extended_types_float16 is enabled: - -genF16Type subgroupRotate(genF16Type value, uint delta); - -genF16Type subgroupClusteredRotate(genF16Type value, uint delta, uint clusterSize); - -``` - -Each of the rotate functions shuffles `value` to the invocation with a `gl_SubgroupInvocationID` equal to `(gl_SubgroupInvocationID + delta) % gl_SubgroupSize` for `subgroupRotate`, or to the invocation with a `gl_SubgroupInvocationID` equal to `(gl_SubgroupInvocationID - (gl_SubgroupInvocationID % clusterSize)) + ((gl_SubgroupInvocationID % clusterSize + delta) % clusterSize)` for `subgroupClusteredRotate` functions. - -## Examples - -``` -OpCapability GroupNonUniformShuffleRotateKHR -... -%result = OpGroupNonUniformShuffleRotateKHR %result_type Subgroup %value %amount -``` - -## Further Functionality - -See the above description for new shuffle features that would require the -source invocation id to be dynamically uniform. -- cgit v1.2.3