Merge pull request #5108 from Barinzaya/core-simd-indices-redadd-redmul

Alternate `reduce_add`/`reduce_mul` intrinsics
author: gingerBill <gingerBill@users.noreply.github.com> 2025-05-06 15:46:49 +0100
committer: GitHub <noreply@github.com> 2025-05-06 15:46:49 +0100
commit: 0cf5b5984de51c18382b07f20b997f89442fab36 (patch)
tree: bf1dd12e79c8ff88ba3495e4a24708eea372e398 /core/simd
parent: e07451898396c84b6a80e63c3e283939d699784a (diff)
parent: dd5b7852ce569027e87d77f46601210aa4180947 (diff)
1 files changed, 192 insertions, 2 deletions
diff --git a/core/simd/simd.odin b/core/simd/simd.odin
index 0e69304c3..a97155f58 100644
--- a/core/simd/simd.odin
+++ b/core/simd/simd.odin
@@ -1759,7 +1759,103 @@ Returns:
 replace :: intrinsics.simd_replace
 
 /*
-Reduce a vector to a scalar by adding up all the lanes.
+Reduce a vector to a scalar by adding up all the lanes in a bisecting fashion.
+
+This procedure returns a scalar that is the sum of all lanes, calculated by
+bisecting the vector into two parts, where the first contains lanes [0, N/2)
+and the second contains lanes [N/2, N), and adding the two halves element-wise
+to produce N/2 values. This is repeated until only a single element remains.
+This order may be faster to compute than the ordered sum for floats, as it can
+often be better parallelized.
+
+The order of the sum may be important for accounting for precision errors in
+floating-point computation, as floating-point addition is not associative, that
+is `(a+b)+c` may not be equal to `a+(b+c)`.
+
+Inputs:
+- `v`: The vector to reduce.
+
+Result:
+- Sum of all lanes, as a scalar.
+
+**Operation**:
+
+	for n > 1 {
+		n = n / 2
+		for i in 0 ..< n {
+			a[i] += a[i+n]
+		}
+	}
+	res := a[0]
+
+Graphical representation of the operation for N=4:
+
+	     +-----------------------+
+	     | v0  | v1  | v2  | v3  |
+	     +-----------------------+
+	        |     |     |     |
+	       [+]<-- | ---'      |
+	        |    [+]<--------'
+	        |     |
+	        `>[+]<'
+	           |
+	           v
+	        +-----+
+	result: | y0  |
+	        +-----+
+*/
+reduce_add_bisect :: intrinsics.simd_reduce_add_bisect
+
+/*
+Reduce a vector to a scalar by multiplying up all the lanes in a bisecting fashion.
+
+This procedure returns a scalar that is the product of all lanes, calculated by
+bisecting the vector into two parts, where the first contains indices [0, N/2)
+and the second contains indices [N/2, N), and multiplying the two halves
+together element-wise to produce N/2 values. This is repeated until only a
+single element remains. This order may be faster to compute than the ordered
+product for floats, as it can often be better parallelized.
+
+The order of the product may be important for accounting for precision errors
+in floating-point computation, as floating-point multiplication is not
+associative, that is `(a*b)*c` may not be equal to `a*(b*c)`.
+
+Inputs:
+- `v`: The vector to reduce.
+
+Result:
+- Product of all lanes, as a scalar.
+
+**Operation**:
+
+	for n > 1 {
+		n = n / 2
+		for i in 0 ..< n {
+			a[i] *= a[i+n]
+		}
+	}
+	res := a[0]
+
+Graphical representation of the operation for N=4:
+
+	     +-----------------------+
+	     | v0  | v1  | v2  | v3  |
+	     +-----------------------+
+	        |     |     |     |
+	       [x]<-- | ---'      |
+	        |    [x]<--------'
+	        |     |
+	        `>[x]<'
+	           |
+	           v
+	        +-----+
+	result: | y0  |
+	        +-----+
+*/
+reduce_mul_bisect :: intrinsics.simd_reduce_mul_bisect
+
+/*
+Reduce a vector to a scalar by adding up all the lanes in an ordered fashion.
 
 This procedure returns a scalar that is the ordered sum of all lanes. The
 ordered sum may be important for accounting for precision errors in
@@ -1782,7 +1878,7 @@ Result:
 reduce_add_ordered :: intrinsics.simd_reduce_add_ordered
 
 /*
-Reduce a vector to a scalar by multiplying all the lanes.
+Reduce a vector to a scalar by multiplying all the lanes in an ordered fashion.
 
 This procedure returns a scalar that is the ordered product of all lanes.
 The ordered product may be important for accounting for precision errors in
@@ -1805,6 +1901,100 @@ Result:
 reduce_mul_ordered :: intrinsics.simd_reduce_mul_ordered
 
 /*
+Reduce a vector to a scalar by adding up all the lanes in a pairwise fashion.
+
+This procedure returns a scalar that is the sum of all lanes, calculated by
+adding each even-indexed element with the following odd-indexed element to
+produce N/2 values. This is repeated until only a single element remains. This
+order is supported by hardware instructions for some types/architectures (e.g.
+i16/i32/f32/f64 on x86 SSE, i8/i16/i32/f32 on ARM NEON).
+
+The order of the sum may be important for accounting for precision errors in
+floating-point computation, as floating-point addition is not associative, that
+is `(a+b)+c` may not be equal to `a+(b+c)`.
+
+Inputs:
+- `v`: The vector to reduce.
+
+Result:
+- Sum of all lanes, as a scalar.
+
+**Operation**:
+
+	for n > 1 {
+		n = n / 2
+		for i in 0 ..< n {
+			a[i] = a[2*i+0] + a[2*i+1]
+		}
+	}
+	res := a[0]
+
+Graphical representation of the operation for N=4:
+
+	   +-----------------------+
+	v: | v0  | v1  | v2  | v3  |
+	   +-----------------------+
+	      |     |     |     |
+	      `>[+]<'     `>[+]<'
+	         |           |
+	         `--->[+]<--'
+	               |
+	               v
+	            +-----+
+	    result: | y0  |
+	            +-----+
+*/
+reduce_add_pairs :: intrinsics.simd_reduce_add_pairs
+
+/*
+Reduce a vector to a scalar by multiplying all the lanes in a pairwise fashion.
+
+This procedure returns a scalar that is the product of all lanes, calculated by
+bisecting the vector into two parts, where the first contains lanes [0, N/2)
+and the second contains lanes [N/2, N), and multiplying the two halves together
+multiplying each even-indexed element with the following odd-indexed element to
+produce N/2 values. This is repeated until only a single element remains. This
+order may be faster to compute than the ordered product for floats, as it can
+often be better parallelized.
+
+The order of the product may be important for accounting for precision errors
+in floating-point computation, as floating-point multiplication is not
+associative, that is `(a*b)*c` may not be equal to `a*(b*c)`.
+
+Inputs:
+- `v`: The vector to reduce.
+
+Result:
+- Product of all lanes, as a scalar.
+
+**Operation**:
+
+	for n > 1 {
+		n = n / 2
+		for i in 0 ..< n {
+			a[i] = a[2*i+0] * a[2*i+1]
+		}
+	}
+	res := a[0]
+
+Graphical representation of the operation for N=4:
+
+	   +-----------------------+
+	v: | v0  | v1  | v2  | v3  |
+	   +-----------------------+
+	      |     |     |     |
+	      `>[x]<'     `>[x]<'
+	         |           |
+	         `--->[x]<--'
+	               |
+	               v
+	            +-----+
+	    result: | y0  |
+	            +-----+
+*/
+reduce_mul_pairs :: intrinsics.simd_reduce_mul_pairs
+
+/*
 Reduce a vector to a scalar by finding the minimum value between all of the lanes.
 
 This procedure returns a scalar that is the minimum value of all the lanes
author	gingerBill <gingerBill@users.noreply.github.com>	2025-05-06 15:46:49 +0100
committer	GitHub <noreply@github.com>	2025-05-06 15:46:49 +0100
commit	0cf5b5984de51c18382b07f20b997f89442fab36 (patch)
tree	bf1dd12e79c8ff88ba3495e4a24708eea372e398 /core/simd
parent	e07451898396c84b6a80e63c3e283939d699784a (diff)
parent	dd5b7852ce569027e87d77f46601210aa4180947 (diff)