Comment
Author: Admin | 2025-04-28
Hi,

This is probably not a bug, more a limitation. Not sure if there's a solution.

In a for loop I call CLBlastSgemm with:

tA = yes
tB = no
colMaj = yes
m = 512
n = 8
k = 500000
minimal lda, ldb, ldc (lda = 500000, ldb = 500000, ldc = 512)

CPU memory rapidly exceeds 32GB. I then observe either

CLBlast: OpenCL error: clEnqueueNDRangeKernel: -4

or

Memory access fault by GPU node-1 on address 0xd8a9dd000. Reason: Page not present or supervisor privilege

I guess CLBlast is allocating workspace memory on the GPU (but I'm not sure why CPU memory goes up).

If I change from tA = yes to tA = no, the memory consumption is greatly reduced and the loop runs to completion, which is surprising as the matrix sizes are unchanged.

The above is all with the final cl_event * parameter to CLBlastSgemm set to nullptr. If instead I pass a pointer to a cl_event and then block after each gemm call with clWaitForEvents, there are no problems, unsurprisingly.

The above was observed on a Fiji Nano GPU using ROCm 1.6.

Thanks.
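For a sense of scale, here is a rough sketch of the buffer sizes implied by the parameters above. It assumes 4-byte floats and tightly packed column-major storage. The helper names (`matrix_bytes`, `copies_that_fit`) are made up for illustration, and the idea that each unsynchronized call might hold a temporary roughly the size of A is only a guess, not something confirmed from the CLBlast source.

```cpp
#include <cstdint>

// Assumption: single-precision floats occupy 4 bytes.
constexpr std::uint64_t kFloatBytes = 4;

// Bytes occupied by a column-major float matrix with leading
// dimension `ld` and `cols` columns (ld * cols elements).
constexpr std::uint64_t matrix_bytes(std::uint64_t ld, std::uint64_t cols) {
    return ld * cols * kFloatBytes;
}

// Problem sizes from the report.
constexpr std::uint64_t m = 512, n = 8, k = 500000;

// With tA = yes the stored A is k x m (lda = k);
// B is k x n (ldb = k); C is m x n (ldc = m).
constexpr std::uint64_t a_bytes = matrix_bytes(k, m);
constexpr std::uint64_t b_bytes = matrix_bytes(k, n);
constexpr std::uint64_t c_bytes = matrix_bytes(m, n);

// Hypothetical: how many temporaries of `per_call_bytes` fit
// into a memory budget of `budget_bytes`.
constexpr std::uint64_t copies_that_fit(std::uint64_t budget_bytes,
                                        std::uint64_t per_call_bytes) {
    return budget_bytes / per_call_bytes;
}

// 32 GiB, the host memory figure mentioned in the report.
constexpr std::uint64_t kHostBudget = 32ULL * 1024 * 1024 * 1024;
```

Under these assumptions A alone is about 1 GB, so a few dozen calls queued without synchronization would be enough to pass 32GB if each one kept an A-sized temporary alive. That would be consistent with the problem going away when waiting on the event after each call, but it remains a hypothesis.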