Quote from the musl mailing list:
> The mallocng allocator was designed to favor very low memory overhead, low worst-case fragmentation cost, and strong hardening over performance. This is because it's much easier and safer to opt in to using a performance-oriented allocator for the few applications that are doing ridiculous things with malloc to make it a performance bottleneck than to opt out of trading safety for performance in every basic system utility that doesn't hammer malloc.
[1] https://www.openwall.com/lists/musl/2025/09/05/3
Instead of “harmful to performance”, why can’t we say “slow”?
Harmful should be reserved for things that affect security or privacy, e.g. things that accidentally encourage bugs, like goto does.
"Considered harmful" is a meme they're referencing but yeah...its pretty stale at this point.
To me it’s not a meme, it’s a reference to a very famous letter by Dijkstra regarding goto statements.
https://en.wikipedia.org/wiki/Considered_harmful
That is the meme.
You want to object because of a misunderstanding. The usage of the word meme here is correct in its original sense. The word cliche would also work.
Not really "goto statements" so much as the go-to arbitrary control flow semantic aka jump.
C's goto is a housecat to the full-blown jump's tiger. No doubt an angry housecat is a nuisance, but the tiger is much more dangerous.
C goto won't let you jump straight into the middle of unrelated code, for example, but the jump instruction has no such limit and neither did the feature Dijkstra was discussing.
Only once we convince C developers that a lack of performance isn't inherently harmful.
More like Python / JS devs
C devs are the few I've met that seem to actually care.
A language community which so prizes the linked list is in no position to go throwing such stones.
Linux lucked out: when you're doing tricky wait-free concurrent algorithms, that intrusive linked list you hand-designed was a good choice. But over in userland you'll find another hand-rolled list in somebody's single-threaded file parser, and oh, a growable array would be fifty times faster; shame the C programmer doesn't have one in their toolbox.
I think you misunderstood. That's exactly the problem. C developers consider slow performance harmful, which is often dumb.
Except that's why I use hundreds of C programs every day, but complain about the few python programs and all the sloppy websites.
You do you. Most people don't care about software that much in general. The most important thing is that it does the job and it does it securely. C won't help you with bugs in any shape or form (in fact it's famously bug-friendly), so it often makes more sense to use a tech stack that either helps with those or lowers the cost on the developer side.
People care about performance. There are numerous studies about that, showing, for instance, a direct correlation between how fast a page loads and conversion rate. Also, Chrome's initial pitch was almost all about performance, and it really was fast. They only became complacent once they got their majority market share.
It makes sense to use a tech stack that lowers the cost on the developer side in the same way that it makes sense to make junk food. Why produce good, tasty food when there is more money to be made by just selling cheap stuff? It does the most important thing: give people calories without poisoning them (short term).
Yeah, but we're talking about the performance of the language. People do have a baseline level of accepted performance, but this is about perceived performance, and if software feels slow, most of the time it's just because of some dumb design. Like a decision to show an animated request to sign up for the newsletter on the first visit. Or loading 20 high quality images in a grid view on top of the page. Or just in general choosing animations that feel slow even though they're hitting the FPS target perfectly without hiccups.
Get rid of those dumb decisions and it could have been pure JS and be 100% fine. C has no value here. The slow performance of JS is not harmful here. Discord is fast enough although it's Electron. VS Code is also fast enough.
But I'd also like to respond to the food analogy, since it's funny.
Let's say that going full untyped scripting language would be the fast food. You get things fast, it does the job, but is unhealthy. You can write only so much bash before throwing up.
Developing in C is like cooking for those equally dumb expensive unsustainable restaurants which give you "an experience" instead of a full healthy meal. Sure, the result uses the best ingredients, it's incredibly tasty but there's way too little food for too much cost. It's bad for the economy (the money should've been spent elsewhere), bad for the customer (same thing about money + he's going to be hungry!) and bad for the cook (if he chose a different job, he'd contribute to the society in better ways!) :D
Just go for something in the middle. Eat some C# or something.
externalising developer cost onto runtime performance only makes sense if humans will spend more time writing than running (in aggregate).
Essentially you’re telling me that the software being made is not useful to many people, because the handful of developers writing it will spend more time writing it than their userbase will spend running it.
Otherwise you’re inflicting something on humanity.
Dumping toxic waste in a river is much cheaper than properly disposing of it too; yet we understand that we are causing harm to the environment and litigate people who do that.
Slow software is fine in low volumes (think: shitting in the woods), but dumping it on huge numbers of users by default is honestly ridiculous (Teams, I’m looking at you, with your expectation to run always and on everyone’s machine!)
> Most people don't care about software that much in general.
This is an example of not caring about the software per se, but only about the outcome.
> [C is] in fact it's famously bug-friendly
Yes, but as a user I like that. I have a game that, from the user experience, seems to have tons of use-after-free bugs. You see that as a user, as strings shown in the UI suddenly turn to garbage and then change very fast. Even with such fatal bugs, the program continues to work, which I like as a user, since I just want to play the game; I don't care if the program is correct. When I want to get rid of this garbage text, I simply close the in-game window and reopen it and everything is fine.
On the other side there are games written in Pascal or Java, which might not have as many bugs, but every single null pointer exception is fatal. This led to me not playing those games anymore, because doing well and then having the program crash is so frustrating. I'd rather have it running a bit longer with silent corruption.
A null-pointer dereference in C will be just as fatal (modulo optimizations).
I think people also care that software runs reasonably quickly. Among non-technical people, "my Windows is slow" seems to be a common complaint.
Sure, but this is perceived performance and it's 100% unrelated to the language. It's bugs, I/O, telemetry, updates, ads, other unnecessary background things, or just dumb design (e.g. showing onedrive locations first when trying to save a file in Word) in general.
C won't help with any of that. Unless the cost of development using it will scare away management which requests those dumb features. Fair enough then :)
> or just dumb design (e.g. showing onedrive locations first when trying to save a file in Word)
Your example is not one of 'dumb' design, it is a deliberate 'dark pattern': pushing you to use OneDrive as much as possible so as to earn more money.
> The most important thing is that it does the job and it does it securely
ROTFL. Is there any security audit? /s
it does the job - mostly.
Maybe it's not 'slow' but more 'generalized for a wide range of use-cases'? Because is it really slow for what it does, or simply slower compared to a specialized implementation? (This is like calling a regular person's car slow compared to an F1 car... sure, the F1 is fast, but good luck taking your kids on holiday or doing weekly shopping runs.)
glibc is faster in basically every usecase, though.
“Generalised to a wide range of use cases” is a really strange way to say “unsuitable to most multi-threaded programs”.
In 2025 an allocator not cratering multi-threaded programs is the opposite of specialisation.
It only matters when your threads allocate with such a high frequency that they run into contention.
Too-high access frequency to a shared resource is not a "general case" but simply poorly designed multithreaded code (besides, a high allocation frequency through the system allocator is also poor design for any single-threaded code; application code simply should not assume any specific performance behaviour from the system allocator).
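A minimal sketch (hypothetical function names, not from any project discussed here) of what "not hammering the allocator in the hot path" can look like: hoist the allocation out of the loop and reuse one scratch buffer, so the allocator is only touched when the buffer actually needs to grow.

    #include <stdlib.h>
    #include <string.h>

    /* Allocator-heavy: one malloc/free pair per item funnels every iteration
       (and, in a threaded program, every thread) through the libc allocator. */
    void process_items_naive(const char **items, size_t n) {
        for (size_t i = 0; i < n; i++) {
            size_t len = strlen(items[i]) + 1;
            char *tmp = malloc(len);
            if (!tmp) return;
            memcpy(tmp, items[i], len);
            /* ... work on tmp ... */
            free(tmp);
        }
    }

    /* Allocator-light: grow one scratch buffer and reuse it across iterations. */
    void process_items_reuse(const char **items, size_t n) {
        char *tmp = NULL;
        size_t cap = 0;
        for (size_t i = 0; i < n; i++) {
            size_t len = strlen(items[i]) + 1;
            if (len > cap) {
                char *grown = realloc(tmp, len * 2);
                if (!grown) break;
                tmp = grown;
                cap = len * 2;
            }
            memcpy(tmp, items[i], len);
            /* ... work on tmp ... */
        }
        free(tmp);
    }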
Well, what is "such a high frequency"? Different allocators have different breaking points, and the musl's one is apparently very low.
> application code simply should not assume any specific performance behaviour from the system allocator
Technically, yes. Practically, no; that's why e.g. the C++ standard mandates the time complexity of its containers. If you can't assume any specific performance from your system, that means you have to prepare for every system-provided functionality to be exponentially slow, and obviously you can't do that.
Take, for instance, the JSON parser in GTA V [0]: apparently, sscanf(buffer, "%d", &n) calls strlen(buffer) internally, so using it to parse numbers in a hot loop on 2 MiB-long JSON craters your performance. On one hand, sure, one can argue that glibc/musl developers are within their right to implement sscanf however inefficiently they want, and the application developers should not expect any performance targets from it, and therefore, probably should not use it. On the other hand, what is even the point of the standard library if you're not supposed to use it for anything practical? Or, for that matter, why waste your time writing an implementation that no-one should use for anything practical anyhow, due to its abysmal performance?
[0] https://news.ycombinator.com/item?id=26296339
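For illustration, here's a rough sketch of the difference (hypothetical helper names, assuming whitespace-separated integers; not the actual GTA V code): every sscanf() call may strlen() the entire remaining buffer, whereas strtol() reports where it stopped, so the buffer is only walked once.

    #include <stdio.h>
    #include <stdlib.h>

    /* Quadratic in practice on large buffers: in some libc implementations
       sscanf() takes strlen() of the whole remaining buffer on every call,
       so parsing k numbers out of an n-byte buffer costs roughly O(k*n). */
    size_t parse_with_sscanf(const char *buf, int *out, size_t max) {
        size_t count = 0;
        int value, consumed;
        while (count < max && sscanf(buf, "%d%n", &value, &consumed) == 1) {
            out[count++] = value;
            buf += consumed;
        }
        return count;
    }

    /* Linear: strtol() stops at the first non-digit and reports where it
       stopped, so the buffer is walked exactly once. */
    size_t parse_with_strtol(const char *buf, int *out, size_t max) {
        size_t count = 0;
        char *end;
        while (count < max) {
            long value = strtol(buf, &end, 10);
            if (end == buf) break;        /* nothing left to parse */
            out[count++] = (int)value;
            buf = end;
        }
        return count;
    }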
My simple rule of thumb: if the general purpose allocator shows up in performance profiles, then there's too much allocation going on in the hot path (e.g. depending on the 'system allocator' being fast in all situations is a convenient but sloppy attitude for code that's supposed to be portable, since neither the C standard nor POSIX say anything about performance).
They don't, but if your C standard library is slow you should get a new one.
FWIW on Emscripten I specifically pick the slow-but-small emmalloc instead of the fast-but-big jemalloc because a small size matters more than performance in that case. My C code also rarely heap-allocates, and the few heap-allocations that happen are all in the init-phase, not in the hot path - e.g. even in multithreaded code, the MUSL allocator would be totally fine.
Performance in edge-cases by far isn't the only metric that matters for allocators.
The root cause of the issue is that musl malloc uses a single heap and relies on locking to support multiple threads. This means each allocation/free must acquire this lock. Imo it's good for single threaded programs (which might've been musl's main usecase), but Rust programs nowadays mostly use multiple threads.
In contrast mimalloc, a similarly minimalistic allocator, has per-thread heaps, with each thread owning the memory it allocates, and cross-thread frees are handled in a deferred manner.
This works very well with Rust's ownership system, where objects rarely move between threads.
Internally, both allocators use size-class based allocation, into predefined chunks, with the key difference being that musl uses bitmaps and mimalloc uses free lists to keep track of memory.
Musl could be fixed if they switched from the single-heap model to per-thread heaps as well.
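For the curious, a toy sketch of that per-thread-heap idea (illustrative only, nothing like mimalloc's real code): same-thread frees stay local and lock-free, while frees arriving from other threads are pushed onto an atomic list that the owning thread drains lazily when it next allocates.

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdlib.h>
    #include <threads.h>

    typedef struct block {
        struct block *next;
        struct heap  *owner;            /* heap of the thread that allocated it */
        /* payload would follow */
    } block;

    typedef struct heap {
        block *local_free;              /* touched only by the owning thread */
        _Atomic(block *) deferred_free; /* pushed to by other threads, lock-free */
    } heap;

    static thread_local heap my_heap;

    void heap_free(block *b) {
        if (b->owner == &my_heap) {     /* fast path: same thread, no atomics */
            b->next = my_heap.local_free;
            my_heap.local_free = b;
        } else {                        /* cross-thread free: defer to the owner */
            block *head = atomic_load(&b->owner->deferred_free);
            do {
                b->next = head;
            } while (!atomic_compare_exchange_weak(&b->owner->deferred_free,
                                                   &head, b));
        }
    }

    block *heap_alloc(void) {
        if (!my_heap.local_free)        /* drain remote frees only when empty */
            my_heap.local_free = atomic_exchange(&my_heap.deferred_free,
                                                 (block *)NULL);
        block *b = my_heap.local_free;
        if (b) {
            my_heap.local_free = b->next;
        } else {
            b = malloc(sizeof(block) + 64);  /* stand-in for grabbing fresh pages */
        }
        if (b) b->owner = &my_heap;
        return b;
    }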
> a similarly minimalistic allocator
mimalloc has about 10kloc, while (assuming I'm looking in the right place) the new musl allocator has 891 and the old musl allocator has 518 lines of code. I wouldn't call an order of magnitude difference in line count 'similar'.
It's minimalistic in the sense that it compiles to a tiny binary; a lot of the extra code is either per-platform support (musl is POSIX-only afaik) or for debugging. Yes it's bigger, but still tiny compared to something like jemalloc, and I'm sure it's like 10kb in a binary.
yeah, the Mimalloc design is just the correct one.
Maybe it's just that the allocator is absolutely fine for single thread programs, and that's what a lot of programs are...
It's not so long ago that the GNU libc had a very similar allocator too, and that's why you'd pop Hoard in your LD_PRELOAD or whatever.
Not every program is multi-threaded, and so not every program would experience thread contention.
Programs that tend to have higher performance requirements are typically multi threaded and those are the ones that are also hit particularly hard by this issue.
glibc malloc still doesn't work well for multi-threaded apps. It is prone to memory fragmentation, which causes excessive memory usage. One can reduce the number of arenas using the MALLOC_ARENA_MAX environment variable, and in many cases it's a good idea, but it could increase lock contention.
If you care about efficiency of a multi-threaded app you should use jemalloc (sadly no longer maintained but still works well), mi-malloc or tcmalloc.
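For reference, the arena cap can be set either via the environment (MALLOC_ARENA_MAX=2 ./app) or programmatically; a minimal glibc-specific sketch, assuming <malloc.h> and mallopt(M_ARENA_MAX) are available:

    #include <malloc.h>
    #include <stdio.h>

    int main(void) {
    #ifdef M_ARENA_MAX
        /* Equivalent to running with MALLOC_ARENA_MAX=2: fewer arenas means less
           fragmentation but potentially more lock contention between threads. */
        if (mallopt(M_ARENA_MAX, 2) == 0)
            fprintf(stderr, "mallopt(M_ARENA_MAX) failed\n");
    #endif
        /* ... rest of the application ... */
        return 0;
    }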
Glibc malloc also has a fun bug where it doesn't return memory to the OS to make it look better on benchmarks.
Hot take: Almost all programs are actually multithreaded. The only exception is tiny UNIX-like shell utilities that are meant to run in parallel with other processes, and toy programs.
The third exception is programs that should be multithreaded but aren't because they are written in languages where adding more threads is disproportionately hard (C, C++) or impossible (Python, Ruby, etc.).
how are C/C++ disproportionately hard? the concept of multi-threading is the same for any language that supports it, most of the primitives are the same, and it's really not a lot of code, nor complicated code, to implement those.
the difficulty totally lies in the design... actually using parallelism where it matters. - tons of multi-threaded programs are just single-thread with a lot of 'scheduler' spliced into this one thread -_-
It is disproportionately hard to do multithreading in C and C++ because the blast radius is huge and the tooling is not good. Languages with runtimes (Java, C#, etc.) give you lots of analysis and well-defined failure modes (in most cases), and Rust prevents almost all related bugs through the type system.
In terms of effort or expense, making any C or C++ program multithreaded is at least an order of magnitude harder/more expensive, even when designed for it from the beginning, so lots of programs aren't multithreaded that could be.
Maybe the large number of standard library functions that operate on globals and require you to remember the "_r" variant of that function exists, or the mess with handling signals, or the fact that Win32 and Posix use significantly different primitives for synchronization? Or maybe just the fact that most libraries for C/++ won't have built-in threading support and you need to synchronize at each call site?
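A small illustration of the "_r" point, using localtime as the example (the same applies to strtok, readdir and friends): the plain function returns a pointer to libc-internal static storage, so concurrent callers race, while the POSIX reentrant variant writes into caller-owned memory.

    #include <stdio.h>
    #include <time.h>

    void print_date_unsafe(time_t t) {
        struct tm *tm = localtime(&t);   /* shared static buffer: not thread-safe */
        printf("%04d-%02d-%02d\n", tm->tm_year + 1900, tm->tm_mon + 1, tm->tm_mday);
    }

    void print_date_safe(time_t t) {
        struct tm tm;                    /* caller-owned storage */
        if (localtime_r(&t, &tm))        /* POSIX reentrant variant */
            printf("%04d-%02d-%02d\n", tm.tm_year + 1900, tm.tm_mon + 1, tm.tm_mday);
    }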
Unless I'm writing Java, I avoid multithreading whenever possible. I hear it's also nice in Go.
Go is kind of broken here, since multithreading is one of the extremely few ways to cause UB in Go.
Rust is very much best in class here.
I'm not seeing how this justifies a 700x performance difference.
For docker images, cgr.dev/chainguard/wolfi-base (https://images.chainguard.dev/directory/image/wolfi-base/ver...) is a great replacement for Alpine. Wolfi is glibc based. It's easy to switch from Alpine since Wolfi uses apk for package management with similar package names and also contains busybox like Alpine.
I’d much rather go with distroless, if it's a choice.
But I think you can tweak musl to perform well, and musl is closer to the spec than glibc, so I would rather use it; even if it's slower in the default case for multithreaded programs.
> But I think you can tweak musl to perform well
You cannot; its allocator does thread safety via a big lock and that’s that.
> musl is closer to the spec than glibc
Is it?
> even if its slower in the default case for multithreaded programmes.
That’s far from the only situation where it’s slower though.
Yeah, the musl people tend to closely follow the spec, this doesn’t always win them friends: https://news.ycombinator.com/item?id=22682510
Swapping in jemalloc for the system allocator will net you huge performance wins if you link against musl, but you’ll still have issues with multithreading performance due to the slower implementations of necessary helpers.
Rich replaced the default musl malloc some time ago for exactly those reasons. Maybe they still used the old musl libc?
The new one was drafted here: https://github.com/richfelker/mallocng-draft
The new allocator does nothing to improve performance in a threaded / contended application: https://www.openwall.com/lists/musl/2025/09/04/3
The response to the link here is really telling.
It blames it all on app code like Wayland.
This is addressed in the article: https://nickb.dev/blog/default-musl-allocator-considered-har...
From the article:
> “the new ng allocator in MUSL doesn’t make a dime of a difference”
Yes, sorry, missed that at the very end.
The musl pthread mutexes are also awfully slow: https://justine.lol/mutex/
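If you want to reproduce that kind of comparison yourself, a toy contended-mutex microbenchmark (a sketch, not the benchmark from the linked article) is enough to see the difference between libcs: build it against glibc and against musl and compare wall-clock time.

    /* gcc -O2 -pthread bench.c && time ./a.out */
    #include <pthread.h>
    #include <stdio.h>

    #define THREADS 4
    #define ITERS   1000000

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long counter;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < ITERS; i++) {
            pthread_mutex_lock(&lock);   /* heavily contended across THREADS */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[THREADS];
        for (int i = 0; i < THREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < THREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);
        return 0;
    }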
I believe musl is supposed to be optimised heavily for size, not speed.
Specifically its goals are low memory overhead and hardening. Safe defaults, and easy to swap to a performance-oriented malloc for those apps that want it.
My question is: why is Rust performance contingent on a C malloc?
> why is Rust performance contingent on a C malloc?
Because Rust switched to “system” allocators way back for compatibility with, well, the system, as well as introspection / perf tooling, to lower the size of basic programs, and to lower maintenance.
It used to use jemalloc, but that took a lot of space in even the most basic binary and because jemalloc is not available everywhere it still had to deal with system allocators anyway.
So basically, the Rust project made a bad decision and now it's all musl's fault? ;)
Sounds like a sane decision to me? Using musl is the developer's decision, not Rust's.
It's not a developer decision on Alpine, where musl is the system libc. Otherwise I fully agree, application developers are mainly responsible for the performance of their applications.
Using the system allocator is also a developer decision. They can use any custom allocator they want. A lot of programs use Jemalloc regardless of what the system allocator is.
The rust project made a sensible decision given its direction and goals, and musl’s allocator is garbage for any multithreaded program.
> and musl’s allocator is garbage for any multithreaded program.
...it only matters if the threads allocate/free with such a high frequency that they run into contention; the C stdlib allocator is a shared resource, and user code really shouldn't assume that the allocator fixes their poor design decisions for multithreaded code.
Ah yes, the "you're holding it wrong" argument.
If other allocators are able to handle a situation perfectly well, even a general-purpose allocator like the one in glibc, that suggests that musl's is deficient.
glibc's allocator is about 10x more code than musl's. Why should it be controversial that different C stdlib implementations set different priorities?
A smaller code base also means a smaller attack surface and fewer potential bugs.
The question remains: why does the Rust ecosystem depend so much on a system component they ultimately have no control over?
“The allocator is perfectly fine as long as you don’t use it” is more a confirmation than a disagreement.
Intel has announced a desktop CPU with 52 cores.
Edit: To be more precise, an engineering sample was spotted.
AMD's threadrippers had 64 cores in 2020. The workstation targeted threadripper pro reaches 96. These are desktop parts, the top end of their server offering has 192 cores.
never blame rust. rust is the replacement for C.
It's only for Rust binaries that are built with the -linux-musl* (instead of -linux-gnu*) toolchains, which are not the default, and usually used to make portable/static binaries.
Unless you're on a distro like Alpine where musl is the system libc. Which is common in, e.g., containers.
It's still possible to build Rust binaries with jemalloc if you need the performance (or another allocator). Also, it will heavily depend on the usecase; for many usecases, Rust will in fact pressure the heap less, precisely because it tracks ownership, and passing or returning structs by value (on the stack) is often "free" if you pass the ownership as well.
> Corollary: hats off to Red Hat for supporting their distro releases for such a lengthy period of time.
This has been my bane at various open source projects, because at some point somebody will say that all currently supported Linux distributions should be supported by a project. This works as a rule of thumb, except for RHEL, which has some truly ancient GCC versions provided in the "extended support" OS versions.
* The oldest supported version in "production" is RHEL 8, and in "extended support" it's RHEL 7.
* RHEL 8 (released 2019) provides gcc 8 (released May 2018). RHEL 7 (released 2014) provides gcc 4.8 (released March 2013).
* gcc 8 supports C++17, but not C++20. gcc 4.8 supports most of C++11 (some C++ stdlib implementations weren't added until later), but doesn't support C++14.
So the well-meaning cutoff of "support the compiler provided by supported major OS versions" becomes a royal pain, since it would mean avoiding useful functionality in C++17 until mid-2024 (when RHEL 7 went from "production" to "extended support") or mid-2028 (when RHEL 7 "extended support" will end). It's not as bad at the moment, since C++20 and C++23 were relatively minor changes, but C++26 is shaping up to be a pretty useful change, and that wouldn't be usable until around 2035 when RHEL 10 leaves "production".
I wouldn't mind it as much if RHEL named the support something sensible. By the end of a "production" window, the OS is still absolutely suitable as a deployment platform for existing software. Unlike other "production" OS versions, though, it is no longer reasonable as a target for new development at that point.
RHEL has gcc-toolset-N (previously devtoolset-N-gcc) for that. It's perfectly fine to only support building a project with, say, the penultimate gcc-toolset. Or ask for a payment for support, which is the norm in this (LTS) space.
Oh, absolutely, and I usually push for having users install a more recent compiler. The problem comes when the compatibility policy is defined in terms of the default compiler provided, because then it requires a larger discussion around that entire policy.
GCC 12 is available for RHEL 7.
> at some point somebody will say that all currently supported Linux distributions should be supported by a project
Ask for payment for extended support as well.
Can someone please write a '"considered harmful" considered harmful' piece.
Here's one from 2002: https://meyerweb.com/eric/comment/chech.html
Perfect.
Alas this is a huge foot gun that ensnares many orgs. Because engineers seem drawn like moths to the flame to Alpine container images. Yes they are small, but the ramifications of Alpine & using musl are significant.
Optimizing for size & stdlib code simplicity is probably not the best fit for your application server! Container size has always struck me as such a Goodhart's Law issue (and worse, already a bad measure as it measures only a very brief part of the software lifecycle). Goodhart's Law:
> When a measure becomes a target, it ceases to be a good measure
This particular musl/Alpine footgun can be worked around. It's not particularly hard to install and use another allocator on Alpine or anywhere, really. Ruby folks in particular seem to have a lot of lore around jemalloc, with various version preferences and MALLOC_CONF settings on top of that. But in general I continue to feel like Alpine base images bring in quite an X factor, even if you knowingly adjust the allocator: the prevalence of Alpine in container images feels unfortunate & eccentric.
Going distroless is always an option. A little too radical for my tastes though, usually. I think of musl+busybox+apk as the distinguishing aspects of Alpine, so on that basis I'm excited to see the recent huge strides by uutils, the Rust rewrite of GNU coreutils focused on compatibility, while offering BusyBox-like all-in-one binary convenience! It should make a nice compact coreutils for containers! The recent 0.2 has competitive performance, which is awesome to see. https://www.phoronix.com/news/Rust-Coreutils-0.2
Huh I guess I'm lucky I never faced this, we've always used Debian or RHEL containers where I've worked. Every time I toyed with using a minimalist distro I found debugging to be much more difficult and ended up abandoning the idea.
Once the container OS forks and runs your binary, I'm curious why does it matter? Is it because people run interpreted code (like Python or Node) and use runtimes that link musl libc? If you deploy JVM or Go apps this will probably not be a factor.
The JVM will also use whatever libc is available, afaik. Here's an article from 2021 on switching a JVM container to jemalloc. But this isn't for the heap, it's just for the JVM itself & IO-related concerns! https://blog.malt.engineering/java-in-k8s-how-weve-reduced-m...
Go is a rare counter example, which ignores the system allocator & bundles its own.
GNU coreutils can be built as a single binary with ./configure --enable-single-binary. One can install this variant on Fedora for example with the coreutils-single package, and this is used in some container images.
I'm not a fan of Rust, I'm more of a C++ guy, but Ripgrep is also nice; I always install it.
Chimera Linux did some changes on their distro because of that.
EDIT: Ah, they were mentioned, of course.
On the subject of malloc replacements: telescope (a gopher/gemini client) used to be a bit crashy until I used jemalloc on some platforms (with LD_PRELOAD).
Also, the performance rendering pages with tons of links improved a lot.
Another day, another reason to avoid musl libc