I have a project where I need to reduce a huge amount of data quickly, and I thought Metal might be the right tool for that.

I've been a C / Objective-C / Swift developer for a long time, but I'm totally new to Metal, so I'm (very slowly) learning it.

I found the sample code below in the Metal Shading Language Specification (PDF); it performs a parallelized reduction over an input array. It's not what I eventually want to do, but the structure would be similar, so I think it's a good starting point:
```metal
kernel void reduce(const device int *input [[buffer(0)]],
                   device atomic_int *output [[buffer(1)]],
                   threadgroup int *ldata [[threadgroup(0)]],
                   uint gid [[thread_position_in_grid]],
                   uint lid [[thread_position_in_threadgroup]],
                   uint lsize [[threads_per_threadgroup]],
                   uint simd_size [[threads_per_simdgroup]],
                   uint simd_lane_id [[thread_index_in_simdgroup]],
                   uint simd_group_id [[simdgroup_index_in_threadgroup]])
{
    // Perform the first level of reduction.
    // Read from device memory, write to threadgroup memory.
    int val = input[gid] + input[gid + lsize];

    for (uint s = lsize / simd_size; s > simd_size; s /= simd_size)
    {
        // Perform per-SIMD partial reduction.
        for (uint offset = simd_size / 2; offset > 0; offset /= 2)
            val += simd_shuffle_down(val, offset);

        // Write per-SIMD partial reduction value to
        // threadgroup memory.
        if (simd_lane_id == 0)
            ldata[simd_group_id] = val;

        // Wait for all partial reductions to complete.
        threadgroup_barrier(mem_flags::mem_threadgroup);

        val = (lid < s) ? ldata[lid] : 0;
    }

    // Perform final per-SIMD partial reduction to calculate
    // the threadgroup partial reduction result.
    for (uint offset = simd_size / 2; offset > 0; offset /= 2)
        val += simd_shuffle_down(val, offset);

    // Atomically update the reduction result.
    if (lid == 0) {
        atomic_fetch_add_explicit(output, val, memory_order_relaxed);
    }
}
```

According to this question, it has 2 bugs, so I tried to use this suggested fix instead:
```metal
kernel void reduce(const device int *input [[buffer(0)]],
                   device atomic_int *output [[buffer(1)]],
                   threadgroup int *ldata [[threadgroup(0)]],
                   uint gid [[thread_position_in_grid]],
                   uint lid [[thread_position_in_threadgroup]],
                   uint lsize [[threads_per_threadgroup]],
                   uint simd_size [[threads_per_simdgroup]],
                   uint simd_lane_id [[thread_index_in_simdgroup]],
                   uint simd_group_id [[simdgroup_index_in_threadgroup]])
{
    // Perform the first level of reduction.
    // Read from device memory, write to threadgroup memory.
    int val = input[gid];

    for (uint s = lsize / simd_size; s > 1; s /= simd_size)
    {
        // Perform per-SIMD partial reduction.
        for (uint offset = simd_size / 2; offset > 0; offset /= 2)
            val += simd_shuffle_down(val, offset);

        // Write per-SIMD partial reduction value to
        // threadgroup memory.
        if (simd_lane_id == 0)
            ldata[simd_group_id] = val;

        // Wait for all partial reductions to complete.
        threadgroup_barrier(mem_flags::mem_threadgroup);

        val = (lid < s) ? ldata[lid] : 0;
    }

    // Perform final per-SIMD partial reduction to calculate
    // the threadgroup partial reduction result.
    for (uint offset = simd_size / 2; offset > 0; offset /= 2)
        val += simd_shuffle_down(val, offset);

    // Atomically update the reduction result.
    if (lid == 0) {
        atomic_fetch_add_explicit(output, val, memory_order_relaxed);
    }
}
```

I'm trying to execute it with this code in Swift, summing the values from 1 to 1000:
```swift
func do_reduce() throws {
    // -- Init --

    // Get the default device.
    let device = MTLCreateSystemDefaultDevice()!

    // Load the shader file.
    guard let library = device.makeDefaultLibrary() else {
        throw NSError(domain: "test", code: 1,
                      userInfo: [NSLocalizedDescriptionKey: "Failed to find the default library"])
    }

    // Load the function.
    guard let function = library.makeFunction(name: "reduce") else {
        throw NSError(domain: "test", code: 1,
                      userInfo: [NSLocalizedDescriptionKey: "Failed to find the function"])
    }

    // Create a compute pipeline state object.
    let functionPSO = try device.makeComputePipelineState(function: function)

    // Create a command queue.
    guard let commandQueue = device.makeCommandQueue() else {
        throw NSError(domain: "test", code: 1,
                      userInfo: [NSLocalizedDescriptionKey: "Failed to create a command queue"])
    }

    // -- Prepare data --

    // Input buffer.
    let count = 1_000
    let input = (0..<count).map { $0 + 1 }
    let inputBuffer = device.makeBuffer(
        bytes: input,
        length: MemoryLayout<Int>.stride * count,
        options: .storageModeShared
    )!

    // Output buffer.
    let initialValue: Int32 = 0
    let outputBuffer = device.makeBuffer(
        bytes: [initialValue],
        length: MemoryLayout<Int32>.stride,
        options: .storageModeShared
    )!

    // -- Encode --

    let simdWidth = functionPSO.threadExecutionWidth

    // Create a command buffer to hold commands.
    let commandBuffer = commandQueue.makeCommandBuffer()!

    // Start a compute pass.
    let computeEncoder = commandBuffer.makeComputeCommandEncoder()!

    // Encode the pipeline state object and its parameters.
    // Note: the size to use for the threadgroup memory (`threadgroup int *ldata [[threadgroup(0)]]`)
    // is a bit unclear, but apparently, at least on Apple Silicon, `threadExecutionWidth` is what
    // should be used. So let's use that, and hope for the best.
    computeEncoder.setComputePipelineState(functionPSO)
    computeEncoder.setBuffer(inputBuffer, offset: 0, index: 0)
    computeEncoder.setBuffer(outputBuffer, offset: 0, index: 1)
    computeEncoder.setThreadgroupMemoryLength(simdWidth, index: 0)

    // Calculate a threadgroup size.
    // - `threadsPerGrid` is the amount of data we want to deal with (number of pixels
    //   in an image, array size, etc.).
    // - ~`threadsPerThreadgroup` is a bit abstruse to me, so I'm just using what Apple suggests
    //   in their doc, and hope for the best.~ Correction: I use `maxTotalThreadsPerThreadgroup`
    //   as Apple does in their MetalComputeBasic project.
    //let w = simdWidth
    //let h = pipeline.maxTotalThreadsPerThreadgroup / simdWidth
    let w = functionPSO.maxTotalThreadsPerThreadgroup < count
        ? functionPSO.maxTotalThreadsPerThreadgroup
        : count
    let threads = MTLSize(width: count, height: 1, depth: 1)
    let threadsPerThreadgroup = MTLSize(width: w, height: 1, depth: 1)

    // Encode the compute command.
    computeEncoder.dispatchThreads(threads, threadsPerThreadgroup: threadsPerThreadgroup)

    // End the compute pass.
    computeEncoder.endEncoding()

    // -- Execute --

    commandBuffer.commit()
    commandBuffer.waitUntilCompleted()

    // Show the result.
    let result = outputBuffer.contents().bindMemory(to: Int32.self, capacity: 1).pointee
    print("result: \(result)")
}
```

The value returned is not the one expected: it gives "123256" (i.e. the sum of 1 to 496) instead of 500500.
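For reference, I double-checked both numbers on the CPU with plain Swift (nothing Metal-related), and also printed the strides of the two integer types I'm mixing, in case that's relevant:

```swift
// CPU reference for the value I expect from the GPU reduction.
let expected = (1...1_000).reduce(0, +)
print(expected)  // 500500

// The value I actually get really is the sum of only the first 496 inputs.
let partial = (1...496).reduce(0, +)
print(partial)   // 123256

// Strides of the types involved: Swift's `Int` is 8 bytes on 64-bit
// platforms, while the kernel declares the input as a 4-byte `int`.
// I don't know whether that matters here.
print(MemoryLayout<Int>.stride, MemoryLayout<Int32>.stride)  // 8 4
```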
I have the feeling that the `count` I'm using isn't right. The Metal function reads every value of the input buffer with `int val = input[gid];`, where `gid` is supposed to cover all the indexes because of the `threadsPerGrid` value passed to `dispatchThreads` (at least that's how I understand it, and how Apple explains it in their doc and in the "MetalComputeBasic" sample project). So I'm not sure how that articulates with the `for (uint s = lsize / simd_size; s > 1; s /= simd_size)` loop at the top level of the reduction. And indeed, if I "play" with this `count` (use a bigger value), it changes the sum (closer to 500500, but never reaching it), without ever crashing with some Metal equivalent of an out-of-bounds error (if such a thing exists…). In short: I don't know what I'm doing.
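To at least convince myself that I understand the inner `simd_shuffle_down` loop, I simulated it on the CPU. This is just a sketch of my mental model, not real Metal behaviour; iterating over the lanes in ascending order stands in for the simultaneous reads, since each lane only reads from a higher-numbered lane that hasn't been updated yet in the same round:

```swift
// CPU simulation (my own sketch, not the Metal API) of the per-SIMD
// shuffle-down reduction pattern from the kernel.
func shuffleDownSum(_ input: [Int]) -> Int {
    let simdSize = input.count        // e.g. 32 lanes
    var lanes = input
    var offset = simdSize / 2
    while offset > 0 {
        // simd_shuffle_down(val, offset): each lane reads from lane + offset.
        // Lanes whose source would be out of range are left unchanged here;
        // only lane 0's final value matters for the partial sum.
        for lane in 0..<simdSize where lane + offset < simdSize {
            lanes[lane] += lanes[lane + offset]
        }
        offset /= 2
    }
    return lanes[0]                   // lane 0 ends up with the SIMD-group sum
}

print(shuffleDownSum(Array(1...32)))  // 528, i.e. 32 * 33 / 2
```

That part behaves as I expect, which is why I suspect the problem is in how the outer loop and my dispatch interact.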
Can anyone see where I went wrong here, and explain why?
