This is not a fast way to sort on GPU. The fastest known sorting algorithm on CUDA is Onesweep, which uses a lot of sophisticated techniques to take advantage of GPU-style parallelism and work around its limitations.
Linebender is working (slowly) on adapting these ideas to GPUs more portably. There's a wiki page here with some resources:
https://linebender.org/wiki/gpu/sorting/
Onesweep paper (Nvidia 2022): https://research.nvidia.com/publication/2022-06_onesweep-fas...
Onesweep GitHub repo: https://github.com/b0nes164/GPUSorting
The second one is Thomas Smith's independent reimplementation of Onesweep. For the official version, see https://github.com/NVIDIA/cccl . The Onesweep implementation is in cub/cub/agent/agent_radix_sort_onesweep.cuh .
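For reference, here is a minimal sketch of how that CUB radix sort is actually invoked (buffer names are illustrative, not from the article). CUB's two-phase API first queries the temporary-storage size with a null pointer, then performs the sort; in recent CUB versions this is the code path that runs the Onesweep-based kernels on supported GPUs.

    #include <cub/cub.cuh>
    #include <cuda_runtime.h>

    // Sort n unsigned keys already resident on the device.
    void sort_keys_u32(const unsigned* d_keys_in, unsigned* d_keys_out, int n) {
        void*  d_temp_storage = nullptr;
        size_t temp_storage_bytes = 0;
        // First call: d_temp_storage == nullptr, so only the scratch size is computed.
        cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                       d_keys_in, d_keys_out, n);
        cudaMalloc(&d_temp_storage, temp_storage_bytes);
        // Second call: performs the actual sort.
        cub::DeviceRadixSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                       d_keys_in, d_keys_out, n);
        cudaFree(d_temp_storage);
    }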
To be fair, I took this far more as an exploration on writing CUDA than I did an attempt at the best sorting method.
Yup, nothing wrong with clear exposition about simpler algorithms, there's definitely a place for that. I just thought HN readers should have some more context on whether we were looking at a programming exercise or state of the art algorithms.
Sure but it's still incredibly misleading.
> it's still incredibly misleading
What, exactly, is misleading? The title of the blog post is "Sorting Algorithms with CUDA" and I didn't get the feeling that the author is touting their "Bottom-up iterative merge sort" as the fastest possible way of sorting with CUDA. There is even a "Future Work" section at the end, implying even the author knows it can be done better.
I think the faux-academic style is throwing people off.
If the author is taking a truly academic perspective, then a section should be included with background on the state of the art, best known performance, etc.
If this is just a blog post (which is more likely) with less rigor, then the style could reflect that better. For instance, calling it an "introduction" or "exercise".
Reread the first paragraph of this blog post.
How many academic papers start with "I went for a NVIDIA recruiting event some days ago, that was a great event and it motivated me to try to rewrite the sorting algorithms using CUDA"?
As the other posts have said, this isn't the right algorithm. Onesweep and its kin are cool but intimidating. The magic is easier to understand if one looks into the core algorithm: radix sort.
A very readable explanation is here: https://gpuopen.com/download/publications/Introduction_to_GP...
It turns out that radix sort can be implemented in a way that is very straightforward to parallelize. It's a beautiful and elegant approach and worth knowing about!
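To make that concrete, below is a hedged sketch of the classic one-bit-per-pass "split" formulation of GPU radix sort (the GPU Gems 3 approach). Function names and buffer management are illustrative rather than taken from the article, and production sorts process several bits per pass with per-block histograms, but each pass here is just three data-parallel steps: flag, scan, scatter.

    #include <cuda_runtime.h>
    #include <thrust/scan.h>
    #include <thrust/execution_policy.h>
    #include <utility>

    __global__ void flag_zero_bit(const unsigned* keys, unsigned* flags, int n, int bit) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) flags[i] = ((keys[i] >> bit) & 1u) == 0u;   // 1 where the bit is 0
    }

    __global__ void scatter_by_bit(const unsigned* keys, unsigned* out,
                                   const unsigned* scanned, int n, int bit,
                                   unsigned total_zeros) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        unsigned is_one = (keys[i] >> bit) & 1u;
        // Zeros keep their scanned rank; ones go after all zeros, preserving order.
        unsigned dst = is_one ? total_zeros + (i - scanned[i]) : scanned[i];
        out[dst] = keys[i];
    }

    void radix_sort_u32(unsigned* d_keys, unsigned* d_tmp, unsigned* d_scan, int n) {
        const int threads = 256, blocks = (n + threads - 1) / threads;
        for (int bit = 0; bit < 32; ++bit) {
            flag_zero_bit<<<blocks, threads>>>(d_keys, d_scan, n, bit);
            unsigned last_flag;
            cudaMemcpy(&last_flag, d_scan + n - 1, sizeof(unsigned), cudaMemcpyDeviceToHost);
            thrust::exclusive_scan(thrust::device, d_scan, d_scan + n, d_scan);
            unsigned last_scan;
            cudaMemcpy(&last_scan, d_scan + n - 1, sizeof(unsigned), cudaMemcpyDeviceToHost);
            scatter_by_bit<<<blocks, threads>>>(d_keys, d_tmp, d_scan, n, bit,
                                                last_flag + last_scan);
            std::swap(d_keys, d_tmp);   // ping-pong buffers between passes
        }
        // 32 passes is an even number of swaps, so the sorted keys end up back
        // in the buffer the caller passed as d_keys.
    }

The two small device-to-host copies per pass are there only to keep the sketch short; a real implementation keeps everything on the device and fuses these steps.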
Yeah, but radix sort requires you to have an ordered integer key, not just two objects being comparable, like with most general sorting techniques.
I always ask in these threads: what sort of data do you have that cannot be divided into several integers, such that sorting by those integers according to some priority implements your comparator correctly, and where your comparator is not so expensive as to be impractical? On one occasion someone replied "sorting unicode strings in collation order", but recomputing the collation inside each comparison doesn't seem practical, and if you precompute all the collation keys (spending several times the memory occupied by the input), the cache behavior of comparison sorts will be quite bad.
Structs containing strings and doubles, for example, are well suited to radix sort.
A sequence of integers such that sorting by those integers implements the comparator correctly is precisely what the Unicode Collation Algorithm produces as its sort key. https://www.unicode.org/reports/tr10/#Scope
Yes.
Best I've been able to come up with is fractions. Though true fractions are rare (and can be converted to a sequence of integers that can be sorted lexicographically, see continued fractions).
Technically floats qualify, but not in an interesting way since you basically just need to take care of the sign bit and a few special cases.
Really, most things can be sorted with radix sort, though I wouldn't want to be the one having to implement it for Unicode strings (sorting text in general is one of those problems you'd preferably let other people solve for you).
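For the float case mentioned above, the sign-bit handling boils down to a small bit trick. This is a generic sketch, not anything from the article; NaNs are the remaining "special case" in that they simply end up at the extremes of the sorted order.

    #include <cstdint>
    #include <cstring>

    // Map a float to a uint32_t whose unsigned ordering matches the float's
    // numeric ordering, so the keys can be fed straight into a radix sort.
    uint32_t float_to_sortable_bits(float f) {
        uint32_t u;
        std::memcpy(&u, &f, sizeof u);             // reinterpret the IEEE-754 bits
        // Negative floats: flip every bit (their raw patterns sort in reverse).
        // Non-negative floats: set the sign bit so they sort above all negatives.
        return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
    }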
Kind of a fun toy problem to play around with. I noticed you had thread coarsening as an option - there is often some gain to be had there. This is also a fun thing to explore with Nsight - the things impacting your performance aren't always obvious, and it is a pretty good profiler. (I wrote about a surprising interaction between thread coarsening and automatic loop unrolling that I found with Nsight here: https://www.spenceruresk.com/loop-unrolling-gone-bad-e81f66f...)
You may also want to look at other sorting algorithms - common CPU sorting algorithms are hard to map onto GPU hardware efficiently. A sorting network like bitonic sort involves more total work (and you have to pad the input to a power of 2), but it often runs much faster on parallel hardware (a minimal sketch follows this comment).
I had a fairly naive implementation that would sort 10M elements in around 10ms on an H100. I'm sure it could get quite a bit faster with more work, but the arrays need to be fairly big to make up for the kernel launch overhead.
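As a rough illustration of such a sorting network, here is the standard global-memory bitonic sort pattern (similar in shape to the widely circulated CUDA samples). It assumes n is a power of two and a multiple of the block size, and all names are illustrative.

    #include <cuda_runtime.h>

    // One compare-and-swap stage of a bitonic sorting network over global memory.
    __global__ void bitonic_step(int* data, int j, int k) {
        unsigned i   = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned ixj = i ^ j;                      // partner index for this stage
        if (ixj > i) {
            bool ascending = (i & k) == 0;         // direction of this subsequence
            if ((ascending  && data[i] > data[ixj]) ||
                (!ascending && data[i] < data[ixj])) {
                int t = data[i]; data[i] = data[ixj]; data[ixj] = t;
            }
        }
    }

    // Host driver: n must be a power of two (pad with INT_MAX otherwise).
    void bitonic_sort(int* d_data, int n) {
        const int threads = 256, blocks = n / threads;
        for (int k = 2; k <= n; k <<= 1)
            for (int j = k >> 1; j > 0; j >>= 1)
                bitonic_step<<<blocks, threads>>>(d_data, j, k);  // stages serialize via launch order
    }

Every thread does the same tiny compare-and-swap with no data-dependent work imbalance, which is why this maps so well to GPUs despite the O(n log^2 n) total work.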
For a more convenient way to use GPUs for algorithms like this, the Futhark language [1] can be very valuable. It is a very high-level language that compiles to GPU code, and the compiled programs can be called from Python as libraries. On their website there is an example of a merge sort implementation [2].
[1] https://futhark-lang.org/ [2] https://futhark-lang.org/examples/merge-sort.html
Do you use futhark in prod? Do you use it at all actually?
I don't use it at all currently because the problems it solves do not come up at my job. But if they did, I would gladly use it.
If your job involves a lot of heavy number crunching it might be useful.
I don't use it because it didn't seem stable for CUDA when I tried it out.
As for number crunching, I'd probably use CuPy (outside of the typical ML stuff).
Lololol then why are you recommending it like you know anything about it? I will never understand this kind of comment on hn - like you don't get why hyping something up that you don't actually understand is bad?
I agree with your comment in principle, but disagree in this instance. I'd also recommend Futhark even though I'm not using it in production, based on positive experience I've had with it previously. I happened to own an AMD Instinct MI50 (32 GB) card, which more or less sucks for AI but has really nice FP64 performance, so I figured I might have a stab at "scientific computing". Believe it or not, Futhark was one thing that worked, made sense, and worked reliably, too. It's quite an intuitive piece of compiler, although I'm sure it's a far cry from something like cache-optimised CUDA C, HIP, or whatever in terms of raw performance; but surely there's something you could do to the OpenCL emitter if you really wanted to.
Although on second thought something like JAX is probably the better choice these days anyway.
I use it in prod. Currently actually expanding our use of it. The main selling point for us was AD (automatic differentiation).
Nice! This reminds me of a small project I did in college implementing bitonic sort [0] in CUDA for a GPU-accelerated Burrows-Wheeler transform.
https://github.com/jedbrooke/cuda_bwt
I believe I got the implementation for bitonic sort here: https://gist.github.com/mre/1392067
[0]: https://en.wikipedia.org/wiki/Bitonic_sorter
Love your notes explaining some of the GPU thread-indexing concepts!
Shameless plug for my own little post going a bit into the performance benefits of "vectorized" sorting, even vs. programmable L1 cache: https://winwang.blog/posts/bitonic-sort/
I'm a fan of this blazingly fast radix sort implementation: https://github.com/b0nes164/GPUSorting
It's great since it easily works with the CUDA driver API, unlike CUB, which is mostly tied to the runtime API. It also has Onesweep, but I haven't been able to make that one work.
I've looked at using it before with Unity, but I couldn't get past the bottleneck of having to get data to it and then back again. There's an overhead with using e.g. compute shaders as well but not nearly as much.
I think you need huge arrays for it to be worth sorting them on the GPU. Copying the data from RAM to the GPU and back takes some time.
Some rendering algorithms require sorting, like the currently hugely popular Gaussian splats. Those sort something like 5 to 20 million items per frame, in real time.
They may already be in GPU memory. Sorting is rarely all the processing that you need to do on your data.
Let me save you the time: Someone wrote a sorting algorithm on a GPU. It was slow. It's not state of the art and they aren't an expert. It's not clear that they know how to use a GPU effectively. This is just someone's personal playing with GPU programming. (Not judging, but there's literally nothing interesting here at all to most people who would be attracted by the title. It's just a personal blog post.)
Not judging, but I am having trouble seeing how
> there's literally nothing interesting here at all to most people who would be attracted by the title
is not some kind of judgment.
The TL;DR is that the author was able to use the GPU to get merge sort speedups over the CPU once the number of elements being sorted exceeded 10,000,000.
thrust::sort comes from Thrust, an Nvidia C++ parallel-algorithms library that ships with the CUDA Toolkit. The article's author started out implementing a merge sort in CUDA, but once it proved slower than the CPU, the author tried the thrust::sort library and got a faster result in some cases. The author did not yet try a parallel merge sort.
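For context, the library path the author fell back on looks roughly like the sketch below (sizes and names are illustrative). For primitive keys, thrust::sort dispatches to CUB's radix sort under the hood.

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cstdlib>

    int main() {
        const int n = 10'000'000;
        thrust::host_vector<int> h(n);
        for (int i = 0; i < n; ++i) h[i] = std::rand();

        thrust::device_vector<int> d = h;              // host -> device copy
        thrust::sort(d.begin(), d.end());              // radix sort for primitive keys
        thrust::copy(d.begin(), d.end(), h.begin());   // device -> host, if the CPU needs it
        return 0;
    }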
I would be curious whether anyone knows which database engines take advantage of GPUs and see actual sort/query performance boosts, and on what size datasets. My impression is that a few engines have tried it, but the payoff is small enough that the industry as a whole hasn't adopted it.
Modern database engines are bandwidth bound. Moving data to and from GPUs is expensive and slow by database engine standards so any performance gain due to higher memory bandwidth is usually lost in the data transfer overhead and much lower effective I/O bandwidth.
Every database co-processor, not just GPUs, has had the same issue.
As they say, a coprocessor is a box that turns your cpu-bound problem into an i/o-bound problem.
1. Parallel radix sort (CPU or GPU) can also be pretty fast. For input distributions without too much pattern and keys that are not too large, radix sort is probably the fastest if you can use it.
2. For typical applications, memory transfer speed matters more than sorting performance on the GPU. If most of your work is done on the CPU, transferring the data to the GPU may take more time than sorting the array. Not sure if unified memory (Apple M-series chips and AMD's new APUs) can remedy this, though.
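A rough way to see point 2 in practice is to time the transfer and the sort separately with CUDA events; this sketch uses illustrative sizes and Thrust for brevity.

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <cuda_runtime.h>
    #include <cstdio>
    #include <cstdlib>

    int main() {
        const size_t n = 1 << 24;                       // ~16M unsigned keys
        thrust::host_vector<unsigned> h(n);
        for (size_t i = 0; i < n; ++i) h[i] = std::rand();
        thrust::device_vector<unsigned> d(n);

        cudaEvent_t t0, t1, t2;
        cudaEventCreate(&t0); cudaEventCreate(&t1); cudaEventCreate(&t2);

        cudaEventRecord(t0);
        thrust::copy(h.begin(), h.end(), d.begin());    // host -> device transfer
        cudaEventRecord(t1);
        thrust::sort(d.begin(), d.end());               // on-device sort
        cudaEventRecord(t2);
        cudaEventSynchronize(t2);

        float copy_ms = 0.f, sort_ms = 0.f;
        cudaEventElapsedTime(&copy_ms, t0, t1);
        cudaEventElapsedTime(&sort_ms, t1, t2);
        std::printf("copy: %.2f ms, sort: %.2f ms\n", copy_ms, sort_ms);
        return 0;
    }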
Not a database engine, but I'm processing distributed dataframes (~400TB). We're seeing a perf/price ratio around 3x~4x, though it's only a limited sample of workloads/datasets. (Note: we are not sorting, yet)
As far as I can tell, it's not so much that the payoff is small in absolute terms, but that it's small relative to the immaturity/scarcity of GPU programming expertise and the availability of GPUs (esp. on-prem).
Radix sort??
Surprised that GPU-accelerated databases are not a bigger thing. For example, I've never even heard of PG-Strom in the wild, nor does it implement a sorting algorithm like Onesweep, IIRC.