AI PCs Aren't Good at AI: The CPU Beats the NPU

488 points by dbreunig 9 months ago

isusmelj 9 months ago

I think the results show that just in general the compute is not used well. That the CPU took 8.4ms and GPU took 3.2ms shows a very small gap. I'd expect more like 10x - 20x difference here. I'd assume that the onnxruntime might be the issue. I think some hardware vendors just release the compute units without shipping proper support yet. Let's see how fast that will change.

Also, people often mistake the reason for an NPU is "speed". That's not correct. The whole point of the NPU is rather to focus on low power consumption. To focus on speed you'd need to get rid of the memory bottleneck. Then you end up designing your own ASIC with it's own memory. The NPUs we see in most devices are part of the SoC around the CPU to offload AI computations. It would be interesting to run this benchmark in a infinite loop for the three devices (CPU, NPU, GPU) and measure power consumption. I'd expect the NPU to be lowest and also best in terms of "ops/watt"

AlexandrB 9 months ago

> Also, people often mistake the reason for an NPU is "speed". That's not correct. The whole point of the NPU is rather to focus on low power consumption.
I have a sneaking suspicion that the real real reason for an NPU is marketing. "Oh look, NVDA is worth $3.3T - let's make sure we stick some AI stuff in our products too."
- pclmulqdq 9 months ago
  
  The correct way to make a true "NPU" is to 10x your memory bandwidth and feed a regular old multicore CPU with SIMD/vector instructions (and maybe a matrix multiply unit).
  Most of these small NPUs are actually made for CNNs and other models where "stream data through weights" applies. They have a huge speedup there. When you stream weights across data (any LLM or other large model), you are almost certain to be bound by memory bandwidth.
  - sounds 9 months ago
    
    Apple Silicon is surprisingly a good approach here -
    * On CPU: SIMD NEON * On CPU: custom matrix multiply accelerator, separate from SIMD unit * On CPU package: NPU * GPU
    Then they go and hide it all in proprietary undocumented features and force you to use their framework to access it :c
  - bee_rider 9 months ago
    
    I’m sure we’ll get GPNPU. Low precision matvecs could be fun to play with.
    
    touisteur 9 months ago
    
    SHAVE from MOVIDIUS was fun, before Intel bought them out.
    
    hedgehog 9 months ago
    
    Did they become un-fun? There are a bunch on the new Intel CPUs.
    
    touisteur 9 months ago
    
    Most of the toolchain got hidden behind openvino and there was no hardware released for years. Keembay was 'next year' for years. I have some code for DSP using it that I can't use anymore. Has Intel actually released new shave cores, with an actual dev environment ? I'm curious.
    
    hedgehog 9 months ago
    
    The politics behind the software issues are complex. At least from the public presentation the new SHAVE cores are not much changed besides bigger vector units. I don't know what it would take to make a lower level SDK available again but it sure seems like it would be useful.
- Spooky23 9 months ago
  
  Microsoft needs to throw something in the gap to slow down MacBook attrition.
  The M processors changed the game. My teams support 250k users. I went from 50 MacBooks in 2020 to over 10,000 today. I added zero staff - we manage them like iPhones.
  - pjmlp 9 months ago
    
    Microsoft has indeed a problem, however only in countries whose people can afford Apple level prices, and not everyone is a G7 citizen.
    
    jocaal 9 months ago
    
    Microsoft is slowly being squeezed from both sides of the market. Chromebooks have silently become wildly popular on the low end. The only advantage I see windows have is corporate and gaming. But valve is slowly chopping away at the gaming advantage as well.
    
    pjmlp 9 months ago
    
    Chromebooks are no where to be seen outside US school market.
    Coffe shops, trains and airports in Europe? Nope, rare animal on tables.
    European schools? Most countries parents buy their kids a computer, and most often it is a desktop used by the whole family, or a laptop of some kind running Windows, unless we are talking about the countries where buying Apple isn't an issue on the monthly expenses.
    Popular? In Germany, the few times they get displayed on shopping mall stores, they get rountinely discounted, or bundled with something else, until finally they get rid of them.
    Valve is heavily dependent on game studios producing Windows games.
    
    vrighter 9 months ago
    
    I have never seen, much less interacted with, a chromebook. I don't think they're as popular as you think, in a lot of not-usa
  - cj 9 months ago
    
    Rightly so.
    The M processor really did completely eliminate all sense of “lag” for basic computing (web browsing, restarting your computer, etc). Everything happens nearly instantly, even on the first generation M1 processor. The experience of “waiting for something to load” went away.
    Not to mention these machines easily last 5-10 years.
    
    morsch 9 months ago
    
    It's fine. For basic computing, my M3 doesn't feel much faster than my Linux desktop that's like 8 years old. I think the standard for laptops was just really, really low.
    
    thanksgiving 9 months ago
    
    > I think the standard for laptops was just really, really low.
    As someone who used windows laptops, I was amazed when I saw someone sitting next to me on a public transit subway on her MacBook Pro editing images on photoshop with just her trackpad. The standard for windows laptops used to be that low (about ten or twelve years ago?) that seeing a MacBook trackpad just woke someone is a part of my permanent memory.
    
    thesuitonym 9 months ago
    
    I don't understand the hype around Apple trackpads. 15 years ago, sure, there was a huge gulf of difference, but today? The only difference that I can see or fee, at least between lenovo or dell and apple, is that the mac trackpad is physically larger.
    
    nxobject 9 months ago
    
    As a very happy M1 Max user (should've shelled out for 64GB of RAM, though, for local LLMs!), I don't look forward to seeing how the Google Workspace/Notions/etc. of the world somehow reintroduce lag back in.
    
    bugbuddy 9 months ago
    
    The problem for Intel and AMD is they are stuck with an OS that ships with a lag-inducing Anti-malware suite. I just did a simple git log and it took 2000% longer than usual because the Antivirus was triggered to scan and run a simulation on each machine instruction and byte of data accessed. The commit log window stayed blank waiting to load long enough for me to complete another tiny project. It always ruin my day.
    
    alisonatwork 9 months ago
    
    Pro tip: turn off malware scanning in your git repos[0]. There is also the new Dev Drive feature in Windows 11 that makes it even easier for developers (and IT admins) to set this kind of thing up via policies[1].
    In companies where I worked where the IT team rolled out "security" software to the Mac-based developers, their computers were not noticeably faster than Windows PCs at all, especially given the majority of containers are still linux/amd64, reflecting the actual deployment environment. Meanwhile Windows also runs on ARM anyway, so it's not really something useful to generalize about.
    [0] https://support.microsoft.com/en-us/topic/how-to-add-a-file-...
    [1] https://learn.microsoft.com/en-us/windows/dev-drive/
    
    bugbuddy 9 months ago
    
    Unfortunately, the IT department people think they are literal GODs for knowing how to configure Domain Policies and lock down everything. They even refuse to help or even answer requests for help when there are false positives on our own software builds that we cannot unmark as false positives. These people are proactively antagonistic to productivity. Management could not careless…
    
    lynx23 9 months ago
    
    Nobody wants to be resonsible for giving allowing exceptions in security-matters. Its far easier to ignore the problems at hand, then to risk being wrong just once.
    
    thesuitonym 9 months ago
    
    They don't think they're gods, they just think you're an idiot. This is not to say that you are, or even that they believe YOU individually are an idiot, it's just that users are idiots.
    There are also insurance, compliance, and other constraints that IT folks have that make them unwilling to turn off scanning for you.
    
    cj 9 months ago
    
    > they just think you're an idiot.
    To be fair, the average employee doesn’t have much more than idiot-level knowledge when it comes to security.
    The majority of employees would rather turn off automatic OS updates simply because it’s a hassle to restart your computer because god forbid they you loose those 250 chrome tabs waiting for you to never get around to revisiting!
    
    xxs 9 months ago
    
    they are allowed to do that for the folks that produce the goods of course, it just makes a lot harder to retain the said idiots.
    
    xxs 9 months ago
    
    the short answer is that you can't without the necessary permissions, and even if you do - the next roll out will wipe out your changes.
    So the pro-part of the tip does not apply.
    On my own machines anti-virus is one the very first things to be removed. Most of the time I'd turn off all the swap file, yet Windows doesn't overcommit and certain applications are notorious for allocating memory w/o even using it.
    
    zdw 9 months ago
    
    This is most likely due to corporate malware.
    Even modern macs can be brought to their knees by something that rhymes with FrowdStrike Calcon and interrupts all IO.
    
    djur 9 months ago
    
    Oh, just work for a company that uses Crowdstrike or similar. You'll get back all the lag you want.
    
    n8cpdx 9 months ago
    
    Chrome managed it. Not sure how since Edge still works reasonably well and Safari is instant to start (even faster than system settings, which is really an indictment of SwiftUI).
    
    ddingus 9 months ago
    
    I have a first gen M1 and it holds up very nicely even today. I/O is crazy fast and high compute loads get done efficiently.
    One can bury the machine and lose very little basic interactivity. That part users really like.
    Frankly the only downside of the MacBook Air is the tiny storage. The 8GB RAM is actually enough most of the time. But general system storage with only 1/4 TB is cramped consistently.
    Been thinking about sending the machine out to one of those upgrade shops...
    
    lynguist 9 months ago
    
    Why did you buy a 256GB device for personal use in the first place? Too good of a deal? Or saving these $400 for upgrades for something else?
    
    112233 9 months ago
    
    Not OP, but by booting M1 from external thunderbolt nvme you lose less than 50% of benchmark disk throughput (3GB/s is still ridiculously fast), can buy 8TB drive for less than 1k, plus can boot it on another M1 mac if something happens. If there was "max mem, min disk" model, would def get that.
    
    ddingus 9 months ago
    
    Interesting. You know I bought one of those USB 3 port expanders from TEMU and it is excellent! (I know, TEMU right? But it was so cheap!)
    I could 3d print a couple of brackets and probably lodge a bigger SSD or the smaller form factor eMMC I think and pack it all into a little package one just plugs in. The port extender is currently shaped such that it fits right under the Air tilting it nicely for general use.
    The Air only has external USB... still, I don't need to boot from it. The internal one can continue to do that. Storage is storage for most tasks.
    
    ddingus 9 months ago
    
    I got it for a song. Literally a coupla hundred bucks a few months after release.
    So yeah, great deal. And I really wanted to run the new CPU.
    Frankly, I can do more and generally faster than I would expect running on those limited resources. It has been a quite nice surprise.
    For a lot of what I do, the RAM and storage are enough.
    
    bzzzt 9 months ago
    
    Depends on the application as well. Just try to start up Microsoft Teams.
  - wkat4242 9 months ago
    
    In our company we see the opposite. 5 years ago all the devs wanted Mac instead of Linux. Now they want to go back.
    I think part of the reason is that we manage Mac pretty strictly now but we're getting there with Linux too.
    We also tried to get them to use WSL 1 and 2 but they just laugh at it :) And point at its terrible disk performance and other dealbreakers. Can't blame them.
- itishappy 9 months ago
  
  I assume you're both right. I'm sure NPUs exist to fill a very real niche, but I'm also sure they're being shoehorned in everywhere regardless of product fit because "AI big right now."
  - wtallis 9 months ago
    
    Looking at it slightly differently: putting low-power NPUs into laptop and phone SoCs is how to get on the AI bandwagon in a way that NVIDIA cannot easily disrupt. There are plenty of systems where a NVIDIA discrete GPU cannot fit into the budget (of $ or Watts). So even if NPUs are still somewhat of a solution in search of a problem (aka a killer app or two), they're not necessarily a sign that these manufacturers are acting entirely without strategy.
  - brookst 9 months ago
    
    The shoehorning only works if there is buyer demand.
    As a company, if customers are willing to pay a premium for a NPU, or if they are unwilling to buy a product without one, it is not your place to say “hey we don’t really believe in the AI hype so we’re going to sell products people don’t want to prove a point”
    
    MBCook 9 months ago
    
    Is there demand? Or do they just assume there is?
    If they shove it in every single product and that’s all anyone advertises, whether consumers know it will help them or not, you don’t get a lot of choice.
    If you want the latest chip, you’re getting AI stuff. That’s all there is to it.
    
    Terr_ 9 months ago
    
    "The math is clear: 100% of our our car sales come from models with our company logo somewhere on the front, which shows incredible customer desire for logos. We should consider offering a new luxury trim level with more of them."
    "How many models to we have without logos?"
    "Huh? Why would we do that?"
    
    MBCook 9 months ago
    
    Heh. Yeah more or less.
    To some degree I understand it, because as we’ve all noticed computers have pretty much plateaued for the average person. They last much longer. You don’t need to replace them every two years anymore because the software isn’t out stripping them so fast.
    AI is the first thing to come along in quite a while that not only needs significant power but it’s just something different. It’s something they can say your old computer doesn’t have that the new one does. Other than being 5% faster or whatever.
    So even if people don’t need it, and even if they notice they don’t need it, it’s something to market on.
    The stuff up thread about it being the hotness that Wall Street loves is absolutely a thing too.
    
    ddingus 9 months ago
    
    That was all true nearly 10 years ago. And it has only improved. Almost any computer one finds these days is capable of the basics.
    
    bdd8f1df777b 9 months ago
    
    There are two kinds of buyer demands: product, buyers, and the stock buyers. The AI hype can certainly convince some of the stock buyers.
    
    Spooky23 9 months ago
    
    Apple will have a completely AI capable product line in 18 months, with the major platforms basically done.
    Microsoft is built around the broken Intel tick/tick model of incremental improvement — they are stuck with OEM shitware that will take years to flush out of the channel. That means for AI, they are stuck with cloud based OpenAI, where NVIDIA has them by the balls and the hyperscalers are all fighting for GPU.
    Apple will deliver local AI features as software (the hardware is “free”) at a much higher margin - while Office 365 AI is like $400+ a year per user.
    You’ll have people getting iPhones to get AI assisted emails or whatever Apple does that is useful.
    
    hakfoo 9 months ago
    
    We're still looking for "that is useful".
    The stuff they've been trying to sell AI to the public with is increasingly looking as absurd as every 1978 "you'll store your recipes on the home computer" argument.
    AI text became a Human Centipede story: Start with a coherent 10-word sentence, let AI balloon it into five pages of flowery nonsense, send it to someone else, who has their AI smash it back down to 10 meaningful words.
    Coding assistance, even as spicy autocorrect, is often a net negative as you have to plow through hallucinations and weird guesses as to what you want but lack the tools to explain to it.
    Image generation is already heading rapidly into cringe territory, in part due to some very public social media operations. I can imagine your kids' kids in 2040 finding out they generated AI images in the 2020s and looking at them with the same embarrassment you'd see if they dug out your high-school emo fursona.
    There might well be some more "closed-loop" AI applications that make sense. But are they going to be running on every desktop in the world? Or are they going to be mostly used in datacentres and purpose-built embedded devices?
    I also wonder how well some of the models and techniques scale down. I know Microsoft pushed a minimum spec to promote a machine as Copilot-ready, but that seems like it's going to be "Vista Basic Ready" redux as people try to run tools designed for datacentres full of Quadro cards, or at least high-end GPUs, on their $299 HP laptop.
    
    jjmarr 9 months ago
    
    Cringe emo girls are trendy now because the nostalgia cycle is hitting the early 2000s. Your kid would be impressed if you told them you were a goth gf. It's not hard to imagine the same will happen with primitive AIs in the 40s.
    
    defrost 9 months ago
    
    Early 2000's ??
    Bela Lugosi Died in 1979, and Peter Murphy was onto his next band by 1984.
    By 2000 Goth was fully a distant dot in the rear view mirror for the OG's
    In 2002, Murphy released *Dust* with Turkish-Canadian composer and producer Mercan Dede, which utilizes traditional Turkish instrumentation and songwriting, abandoning Murphy's previous pop and rock incarnations, and juxtaposing elements from progressive rock, trance, classical music, and Middle Eastern music, coupled with Dede's trademark atmospheric electronics.
    https://www.youtube.com/watch?v=Yy9h2q_dr9k
    https://en.wikipedia.org/wiki/Bauhaus_(band)
    
    djur 9 months ago
    
    I'm not sure what "gothic music existed in the 1980s" is meant to indicate as a response to "goths existed in the early 2000s as a cultural archetype".
    
    defrost 9 months ago
    
    That Goths in 2000's were at best second wave nostalgia cycle of Goths from the 1980s.
    That people recalling Goths in that period should beware of thinking that was a source and not an echo.
    In 2006 Noel Fielding's Richmond Felicity Avenal was a basement dwelling leftover from many years past.
    
    bee_rider 9 months ago
    
    True Goth died our way before any of that. They totally sold out when the sacked Rome, the gold went to their heads and everything since then has been nostalgia.
    
    defrost 9 months ago
    
    That was just the faux life Westside Visigoths .. what'd you expect?
    #Ostrogoth #TwueGoth
    
    carlob 9 months ago
    
    There was a submission here a few months ago about the various incarnations of goth starting from the late Roman empire.
    https://www.the-hinternet.com/p/the-goths
    
    defrost 9 months ago
    
    Was there? This one: https://news.ycombinator.com/item?id=41232761 ?
    Nice: https://www.youtube.com/watch?v=VZvSqgn_Zf4
    
    HelloNurse 9 months ago
    
    I expect this sort of thing to go out of fashion and/or be regulated after "AI" causes some large life loss, e.g. starting a war or designing a collapsing building.
    
    Spooky23 9 months ago
    
    The product isn’t released, so I don’t think we know what is or isn’t good.
    People are clearly finding LLM tech useful, and we’re barely scratching the surface.
    
    nxobject 9 months ago
    
    I hope that once they get a baseline level of AI functionality in, they start working with larger LLMs to enable some form of RAG... that might be their next generational shift.
    
    wkat4242 9 months ago
    
    > while Office 365 AI is like $400+ a year per user
    And I'm pretty sure this is only Introductory pricing. As people get used to it and use it more it won't cover the cost. I think they rely on the gym model currently; many people not using the ai features much. But eventually that will change. Also, many companies figured that out and pull the copilot license from users that don't use it enough.
    
    im3w1l 9 months ago
    
    Until AI chips become abundant, and we are not there yet, cloud AI just makes too much sense. Using a chip constantly vs using it 0.1% of the time is just so many orders of magnitude better.
    Local inference does have privacy benefits. I think at the moment it might make sense to send most of queries to a beefy cloud model, and send sensitive queries to a smaller local one.
    
    justahuman74 9 months ago
    
    Who is getting $400/y of value from that?
    
    xp84 9 months ago
    
    Apple hasn’t shipped any ai features besides betas. I trust the people responsible for the useless abomination that is Siri to deliver a useful ai tool as much as I would trust Joe Biden to win a breakdancing competition.
    
    brookst 9 months ago
    
    Well fortunately for all of us the people delivering client side ML today are totally different from the people who implemented a server side rule base assistant 10 years ago.
    
    xp84 9 months ago
    
    The fact that they couldn’t deliver something even approaching kindergarten levels of understanding a year ago makes me worry that either zero of the people who know what they’re doing in present-day “AI” work at Apple, or, plenty of great minds do but they can’t get anything done because Apple’s management is too conservative to release something that would be vastly more powerful than Siri but might possibly under certain circumstances hurt someone’s feelings or otherwise embarrass Apple.
    Nothing would make me happier than to finally be wrong betting against “Siri,” though.
- conradev 9 months ago
  
  The real consumers of the NPUs are the operating systems themselves. Google’s TPU and Apple’s ANE are used to power OS features like Apple’s Face ID and Google’s image enhancements.
  We’re seeing these things in traditional PCs now because Microsoft has demanded it so that Microsoft can use it in Windows 11.
  Any use by third party software is a lower priority
- shermantanktop 9 months ago
  
  That’s how we got an explosion of interesting hardware in the early 80s - hardware companies attempting to entice consumers by claiming “blazing 16 bit speeds” or other nonsense. It was a marketing circus but it drove real investments and innovation over time. I’d hope the same could happen here.
- kmeisthax 9 months ago
  
  You forget "Because Apple is doing it", too.
  - rjsw 9 months ago
    
    I think other ARM SoC vendors like Rockchip added NPUs before Apple, or at least around the same time.
    
    acchow 9 months ago
    
    I was curious so looked it up. Apple's first chip with an NPU was the A11 bionic in Sept 2017. Rockchip's was the RK1808 in Sept 2019.
    
    j16sdiz 9 months ago
    
    Google TPU was introduced around same time as apple. Basically everybody knew it can be something around that time, just don't know exactly how
    
    Someone 9 months ago
    
    https://en.wikipedia.org/wiki/Tensor_Processing_Unit#Product... shows the first one is from 2015 (publicly announced in 2016). It also shows they have a TDP of 75+W.
    I can’t find TDP for Apple’s Neural Engine (https://en.wikipedia.org/wiki/Neural_Engine), but the first version shipped in the iPhone 8, which has a 7 Wh battery, so these are targeting different markets.
    
    GeekyBear 9 months ago
    
    Face ID was the first tent pole feature that ran on the NPU.
    
    bdd8f1df777b 9 months ago
    
    Even if it were true, they wouldn’t have the same influence as Apple has.
- Dalewyn 9 months ago
  
  There are no nerves in a neural processing unit, so yes: It's 300% bullshit marketing.
  - brookst 9 months ago
    
    Neural is an adjective. Adjectives do not require their associated nouns to be present. See also: digital computers have mo fingers at all.
    
    -mlv 9 months ago
    
    I always thought 'digital' referred to numbers, not fingers.
    
    bdd8f1df777b 9 months ago
    
    The derivative meaning has been use so widely that it has surpassed its original one in usage. But it doesn’t change the fact that it originally refers to the fingers.
  - jcgrillo 9 months ago
    
    Maybe the N secretly stands for NFT.. Like the tesla self driving hardware only smaller and made of silicon.
- WithinReason 9 months ago
  
  yeah I'm not sure being 1% utilised helps power consumption
theresistor 9 months ago

> Also, people often mistake the reason for an NPU is "speed". That's not correct. The whole point of the NPU is rather to focus on low power consumption.
It's also often about offload. Depending on the use case, the CPU and GPU may be busy with other tasks, so the NPU is free bandwidth that can be used without stealing from the others. Consider AI-powered photo filters: the GPU is probably busy rendering the preview, and the CPU is busy drawing UI and handling user inputs.
- cakoose 9 months ago
  
  Offload only makes sense if there are other advantages, e.g. speed, power.
  Without those, wouldn't it be better to use the NPUs silicon budget on more CPU?
  - mapt 9 months ago
    
    For PC CPUs, there are already so many watts per square millimeter that many of the top tiers of the recent generations are running thermally throttled 24/7; More cooling improves performance rather than temperatures because it allows more of the cores to run at 'full' speed or at 'boost' speed. This kills their profitable market segmentation.
    In this environment it makes some sense to use more efficient RISC cores, and to spread out cores a bit with dedicated bits that either aren't going to get used all the time, or that are going to be used at lower power draws, and combining cores with better on-die memory availability (extreme L2/L3 caches) and other features. Apple even has some silicon in the power section left as empty space for thermal reasons.
    Emily (formerly Anthony) on LTT had a piece on the Apple CPUs that pointed out some of the inherent advantages of the big-chip ARM SOC versus the x86 motherboard-daughterboard arrangement as we start to hit Moore's Wall. https://www.youtube.com/watch?v=LFQ3LkVF5sM
  - theresistor 9 months ago
    
    If you know that you need to offload matmuls, then building matmul hardware is more area efficient than adding an entire extra CPU. Various intermediate points exist along that spectrum, e.g. Cell's SPUs.
  - heavyset_go 9 months ago
    
    More CPU means siphoning off more of the power budget on mobile devices. The theoretical value of NPUs is power efficiency on a limited budget.
  - avianlyric 9 months ago
    
    Not really. To get extra CPU performance that likely means more cores, or some other general compute silicon. That stuff tends to be quite big, simply because it’s so flexible.
    NPUs focus on one specific type of computation, matrix multiplication, and usually with low precision integers, because that’s all a neural net needs. That vast reduction in flexibility means you can take lots of shortcuts in your design, allowing you cram more compute into a smaller footprint.
    If you look at the M1 chip[1], you can see the entire 16-Neural engine has a foot print about the size of 4 performance cores (excluding their caches). It’s not perfect comparison, without numbers on what the performance core can achieve in terms of ops/second vs the Neural Engine. But it seems reasonable to be that the Neural Engine and handily outperform the performance core complex when doing matmul operations.
    [1] https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...
kmeisthax 9 months ago

> I think some hardware vendors just release the compute units without shipping proper support yet
This is Nvidia's moat. Everything has optimized kernels for CUDA, and maybe Apple Accelerate (which is the only way to touch the CPU matrix unit before M4, and the NPU at all). If you want to use anything else, either prepare to upstream patches in your ML framework of choice or prepare to write your own training and inference code.
- noduerme 9 months ago
  
  I'm not sure why this is a moat. Isn't it just a matter of translation from CUDA to some other instruction set? If AMD or someone else makes cheaper hardware that does the same thing, it doesn't seem like a stretch for them to release a PyTorch patch or whatever.
  - david-gpu 9 months ago
    
    Most of the computations are done inside NVidia proprietary libraries, not open-source CUDA. And if you saw what goes inside those libraries, I think you would agree that it is a substantial moat.
    
    theGnuMe 9 months ago
    
    There are clean room approaches like AMDs and Scale.
    
    caeril 9 months ago
    
    Geohot has multiple (and ongoing) rants about the sheer instability of AMD RDNA3 drivers. Lisa Su engaged directly with him on this, and she didn't seem to give a shit about their problems.
    AMD is not taking ML applications seriously, outside of their marketing hype.
    
    fvv 9 months ago
    
    Rdna3 is not cdna
    
    david-gpu 9 months ago
    
    Are you suggesting that Scale can take cuDNN kernels and run them at anything resembling peak performance on AMD GPUs?
    Because functional compatibility is hardly useful if the performance is not up to par, and cuDNN will run specific kernels that are particularly tuned to not only a specific model of GPU, but also to the specific inputs that the user is submitting. NVidia is doing a ton of work behind the scenes to both develop high-performance kernels for their exact architecture, but also to know which ones are best for a particular application.
    This is probably the main reason why I was hesitant to join AMD a few years ago and to this day it seems like it was a good decision.
  - blharr 9 months ago
    
    Sure you can probably translate rough code and get something that "works" but all the thousands of small optimizations that are baked in are not trivial to just translate.
    
    noduerme 9 months ago
    
    I like the take that small optimizations, taken together, amount to a moat. I feel like this could be a profoundly understated paradigm.
spookie 9 months ago

I've been building an app in pure C using onnxruntime, and it outperforms a comparable one done with python by a substancial amount. There are many other gains to be made.
(In the end python just calls C, but it's pretty interesting how much performance is lost)
- dacryn 9 months ago
  
  agree there, but then again using ort in Rust is faster again.
  You cannot compare python with a onxx executor.
  I don't know what you used in Python, but if it's pytorch or similar, those are built with flexibility in mind, for optimal performance you want to export those to onxx and use whatever executor that is optimized for your env. onxxruntime is one of them, but definitely not the only one, and given it's from Microsoft, some prefer to avoid it and choose among the many free alternatives.
  - rerdavies 9 months ago
    
    Why would the two not be entirely comparable? PyTorch may be slower at building the models; but once the model is compiled and loaded on the NPU, there's just not a whole lot of Python involved anymore. A few hundred CPU cycles to push the input data using python; a few hundred CPU cycles to receive the results using python. And everything in-between gets executed on the NPU.
    
    noduerme 9 months ago
    
    I really wish Python wasn't the language controlling all the C code. You need a controller, in a scripting language that's easy to modify, but it's a rather hideous choice. It would be like choosing to build the world's largest social network in PHP or something. lol.
    
    robertlagrant 9 months ago
    
    > it's a rather hideous choice
    Why?
    
    rerdavies 9 months ago
    
    Be careful for what you wish for.
    I've just spent a week writing Neural net code in C++, so i have direct insight into what a C++ implementation might look like.
    Much as I dislike python, and having to deal with endless runtime errors when your various inputs and outputs are mismatched, the inconvenience pales in comparison with having to wade through three pages of error messages that a C++ compiler generates when you have a single mismatched templated Eigen matrix with the incorrect dimension. The "Required from here" message you are actually interested in is typically the 3rd or fourth "Required from here", amongst a massive stack of cascading errors, each of which wraps to about 4 lines when displayed. You know what I mean. Sometimes you don't get a "Required from here" at all, which is horrifying. And it's infuriating to find and parse the template arguments of classes involved.
    Debugging Python runtime errors is kind of horrible, and rubs me the wrong way on principle. But it is sweetness and light compared to debugging C++ compile-time error messages, which are unimaginably horrible.
    The project: converting a C++ Neural Amp Modeller LV2 plugin to use fixed-size matrices (Eigen::Matrix<float,N,M>) instead of dynamic matrixes (Eigen::MatrixXf) to see if doing so would improve performance. (It does. Signficantly). So a substantial and realistic experiment in doing ML work in C++. Not directly comparable to working in Pytorch, but directly analogous in that it involves hooking up high-level ML constructs, like Conv1D, LayerT, WaveNet ML chunks.
    
    johnny22 9 months ago
    
    isn't that the case? Which then became a dialect of php with a custom interpreter (and then compiler) as they scaled.
    
    noduerme 9 months ago
    
    Yes, that was the case. I was being sarcastic. Zuck wrote facebook in PHP and spent millions of dollars then writing a custom interpreter to let his janky code run slightly faster than normal.
    
    rerdavies 9 months ago
    
    Zuck's obvious mistake: he should have written the PHP compiler to precompile chunks of GPU code that would be the only code that actually runs when serving web pages. </sarcasm>
    Facebook isn't really a comparable problems, because ALL of the performance-critical code in PyTorch does runs on a GPU.
godelski 9 months ago
They definitely aren't doing the timing properly, but also what you might think is timing is not what is generally marketed. But I will say, those marketed versions are often easier to compare. One such example is that if you're using GPU then have you actually considered that there's an asynchronous operation as part of your timing?
If you're naively doing `time.time()` then what happens is this
```
  start = time.time() # cpu records time
  pred = model(input.cuda()).cuda() # push data and model (if not already there) to GPU memory and start computation. This is asynchronous
  end = time.time() # cpu records time, regardless of if pred stores data
```
You probably aren't expecting that if you don't know systems and hardware. But python (and really any language) is designed to be smart and compile into more optimized things than what you actually wrote. There's no lock, and so we're not going to block operations for cpu tasks. You might ask why do this? Well no one knows what you actually want to do. And do you want the timer library now checking for accelerators (i.e. GPU) every time it records a time? That's going to mess up your timer! (at best you'd have to do a constructor to say "enable locking for this accelerator") So you gotta do something a bit more nuanced.
If you want to actually time GPU tasks, you should look at cuda event timers (in pytorch this is `torch.cuda.Event(enable_timing=True)`. I have another comment with boilerplate)
Edit:
There's also complicated issues like memory size and shape. They definitely are not being nice to the NPU here on either of those. They (and GPUs!!!) want channels last. They did [1,6,1500,1500] but you'd want [1,1500,1500,6]. There's also the issue of how memory is allocated (and they noted IO being an issue). 1500 is a weird number (as is 6) so they aren't doing any favors to the NPU, and I wouldn't be surprised that this is a surprisingly big hit considering how new these things are
And here's my longer comment with more details: https://news.ycombinator.com/item?id=41864828
- artemisart 9 months ago
  
  Important precision: the async part is absolutely not python specific, but comes from CUDA, indeed for performance, and you will have to use cuda events too in C++ to properly time it.
  For ONNX the runtimes I know of are synchronous as we don't do each operation individually but whole models at once, there is no need for async, the timings should be correct.
  - godelski 9 months ago
    
    Yes, it isn't python, it is... hardware. Not even CUDA specific. It is about memory moving around and optimization (remember, even the CPUs do speculative execution). I say a little more in the larger comment.
    I'm less concerned about the CPU baseline and more concerned about the NPU timing. Especially given the other issues
fennecfoxy 9 months ago

I think it's definitely possibly now (or very soon) for an LLM to write native GPU/NPU code to get itself to run on different hardware.
hulitu 9 months ago

> The whole point of the NPU is rather to focus on low power consumption
You know which chip has the lowest power consumption ? The one which is turned off. /s
lincpa 9 months ago

[dead]

jsheard 9 months ago

These NPUs are tying up a substantial amount of silicon area so it would be a real shame if they end up not being used for much. I can't find a die analysis of the Snapdragon X which isolates the NPU specifically but AMDs equivalent with the same ~50 TOPS performance target can be seen here, and takes up about as much area as three high performance CPU cores:

https://www.techpowerup.com/325035/amd-strix-point-silicon-p...

ezst 9 months ago

I can't wait for the LLM fad to be over so we get some sanity (and efficiency) back. I personally have no use for this extra hardware ("GenAI" doesn't help me in any way nor supports any work-related tasks). Worse, most people have no use for that (and recent surveys even show predominant hostility towards AI creep). We shouldn't be paying extra for that, it should be opt-in, and then it would become clear (by looking at the sales and how few are willing to pay a premium for "AI") how overblown and unnecessary this is.
- kalleboo 9 months ago
  
  > most people have no use for that
  Apple originally added their NPUs before the current LLM wave to support things like indexing your photo library so that objects and people are searchable. These features are still very popular. I don't think these NPUs are fast enough for GenAI anyway.
  - wmf 9 months ago
    
    MS Copilot and "Apple Intelligence" are running a small language model and image generation on the NPU so that should count as "GenAI".
    
    kalleboo 9 months ago
    
    It's still in beta so we'll see how things go but I saw someone testing what Apple Intelligence ran on-device vs sent off to the "private secure cloud" and even stuff like text summaries were being sent to the cloud.
  - grugagag 9 months ago
    
    I wish I could turn that off on my phone.
- mardifoufs 9 months ago
  
  NPUs were a thing (and a very common one in mobile CPUs too) way before the LLM craze.
- barfingclouds 9 months ago
  
  Agreed. I use perplexity and chat gpt daily, but if I understand correctly that’s all done off device which is fine with me. I don’t need to generate weird images, or have my phone rewrite emails for me, or summarize stuff.
- jcgrillo 9 months ago
  
  I just got an iphone and the whole photos thing is absolutely garbage. All I wanted to do was look through my damn photos and find one I took recently but it started playing some random music and organized them in no discernible order.. like it wasn't the reverse time sorted.. Idk what kind of fucked up "creative process" came up with that bullshit but I sure wish they'd unfuck it stat.
  The camera is real good though.
  - james_marks 9 months ago
    
    There’s an album called “Recents” that’s chronological and scrolled to the end.
    “Recent” seems to mean everything; I’ve got 6k+ photos, I think since the last fresh install, which is many devices ago.
    Sounds like the view you’re looking for and will stick as the default once you find it, but you do have to bat away some BS at first.
  - nioj 9 months ago
    
    I think the music part is related to the setting called "Show Featured Content" in the Photos app settings.
    
    jcgrillo 9 months ago
    
    Yeah that's how I ended up making it stop but if this is what these NPUs are being used for just.. why?
  - coldpie 9 months ago
    
    There is a chronological view tucked in there somewhere, but they really do hide it behind the other crap. Once you manage to get into it, it usually stays that way, but sometimes it kicks me back out to the random nonsense view and I have to take a few minutes to find chronological again.
- renewiltord 9 months ago
  
  I was telling someone this and they gave me link to a laptop with higher battery life and better performance than my own, but I kept explaining to them that the feature I cared most about was die size. They couldn't understand it so I just had to leave them alone. Non-technical people don't get it. Die size is what I care about. It's a critical feature and so many mainstream companies are missing out on my money because they won't optimize die size. Disgusting.
  - _zoltan_ 9 months ago
    
    News flash: you're in the niche of the niche. People don't care about die size.
    I'd be willing to bet that the amount of money they are missing out on is miniscule and is by far offset by people's money who care about other stuff. Like you know, performance and battery life, just to stick to your examples.
    
    mattnewton 9 months ago
    
    That’s exactly what the poster is arguing- they are being sarcastic.
    
    shermantanktop 9 months ago
    
    It whooshed over my head too. That’s the danger of sarcasm…it’s a cooperative form of humor but the other party might not get it.
  - nl 9 months ago
    
    Is this a parody?
    Why would anyone care about die size? And if you do why not get one of the many low power laptops with Atoms etc that do have small die size?
    
    thfuran 9 months ago
    
    Yes, they're making fun of the comment they replied to.
    
    singlepaynews 9 months ago
    
    Would you do me the favor of explaining the joke? I get the premise—nobody cares about die size, but the comment being mocked seems perfectly innocuous to me? They want a laptop without an NPU b/c according to link we get more out of CPU anyways? What am I missing here?
    
    michaelt 9 months ago
    
    It has been the norm for several decades to have hardware features that go unused.
    The realities of mass manufacturing and supply chains and whatnot mean it's cheaper to get a laptop with a webcam I don't use, a fingerprint reader I don't use, and an SD card reader I don't use. It's cheaper to get a CPU with integrated graphics I don't use, a trusted execution environment I don't use, remote management features I don't use. It's cheaper to get a discrete GPU with RGB LEDs I don't use, directx support I don't use, four outputs when I only need one. It's cheaper to get a motherboard with integrated wifi than one without.
    
    tedunangst 9 months ago
    
    No, no, no, you just don't get it. The only thing Dell will sell me is a laptop 324mm wide, which is totally appalling, but if they offered me a laptop that's 320mm wide, I'd immediately buy it. In my line of work, which is totally serious business, every millimeter counts.
    
    throwaway48476 9 months ago
    
    Maybe through a game of telephone they confused die size and node size?
  - waveBidder 9 months ago
    
    your satire is off base enough that people don't understand it's satire.
    
    0xDEAFBEAD 9 months ago
    
    Says a lot about HN that so many believed he was genuine.
    
    heavyset_go 9 months ago
    
    The Poe's Law means it's working.
  - ezst 9 months ago
    
    I'm fine with the mockery, I genuinely hadn't realized that "wanting to pay for what one needs" was such a hot and controversial take.
    
    ginko 9 months ago
    
    The extra cost of the area spent on npu cores is pretty hard to quantify. I guess removing it would allow for higher yields and number of chips per wafer but then you’d need to set up tooling for two separate runs (one with npu and one without) Add to that that most of the cost is actually the design of the chip and it’s clear why manufacturers just always add the extra features. Maybe they could sell a chip with the NPU permanently disabled but I guess that wouldn’t be what you want either?
    Fwiw there should be no power downside to having an unused unit. It’ll just not be powered.
    
    ezst 9 months ago
    
    The argument boils down to "since it's there, better to keep it because making a version without it would defeat economies of scale and not save much, if at all", and that's a sensible take… under the assumption that there's a general demand for NPUs, which I contest.
    In practice, everyone is paying a premium for NPUs that only a minority desires, and only a fraction of that minority essentially does "something" with it.
    This thread really helps to show that the use-cases are few, non-essential, and that the general application landscape hasn't adopted NPUs and has very little incentive to do so (because of the alien programming model, because of hardware compat across vendors, because of the ecosystem being a moving target with little stability in sight, and because of the high-effort/low-reward in general).
    I do want to be wrong, of course. Tech generally is exciting because it offers new tools to crack old problems, opening new venues and opportunities in the process. Here it looks like we have a solution in search for a problem that was set by marketing departments.
    
    Miraste 9 months ago
    
    Modern SoCs already have all kinds of features with use-cases that are few and non-essential. Granted they don't take as much space as NPUS, but manufacturers are betting that if NPUs are available, software will evolve to use them regularly. If it doesn't, they'll probably go away in a few generations. But at a minimum, Microsoft and Apple seem highly committed to using them.
  - fijiaarone 9 months ago
    
    Yeah, I know what you mean. I hate lugging around a big CPU core.
- DrillShopper 9 months ago
  
  Corporatized gains in the market from hype Socialized losses in increased carbon emissions, upheaval from job loss, and higher prices on hardware.
  The more they say the future will be better the more that it looks like the status quo.
Kon-Peki 9 months ago

Modern chips have to dedicate a certain percentage of the die to dark silicon [1] (or else they melt/throttle to uselessness), and these kinds of components count towards that amount. So the point of these components is to be used, but not to be used too much.
Instead of an NPU, they could have used those transistors and die space for any number of things. But they wouldn't have put additional high performance CPU cores there - that would increase the power density too much and cause thermal issues that can only be solved with permanent throttling.
[1] https://en.wikipedia.org/wiki/Dark_silicon
- jcgrillo 9 months ago
  
  Question--what's to be lost by making your features sufficiently not dense to allow them to cool at full tilt?
  - AlotOfReading 9 months ago
    
    Messes with timing, among other things. A lot of those structures are relatively fixed blocks that are designed for specific sizes. Signals take more time to propagate longer distances, and longer conductors have worse properties. Dense and hot is faster and more broadly useful.
    
    jcgrillo 9 months ago
    
    Interesting, so does that mean we're basically out of runway without aggressive cooling?
    
    joha4270 9 months ago
    
    No.
    Every successive semiconductor node uses less power than the previous per transistor at the same clock speed. Its just that we then immediately use this headroom to pack more transistors closer and run them faster, so every chip keeps running into power limits, even if they continually do more with said power.
  - positr0n 9 months ago
    
    Good discussion on how at multi GHz clock speeds, the speed of light is actually limiting on some circuit design choices: https://news.ycombinator.com/item?id=12384596
- IshKebab 9 months ago
  
  If they aren't being used it would be better to dedicate the space to more SRAM.
  - a2l3aQ 9 months ago
    
    The point is parts of the CPU have to be off or throttled down when other components are under load to maintain TDP, adding cache that would almost certainly be being used defeats the point of that.
    
    jsheard 9 months ago
    
    Doesn't SRAM have much lower power density than logic with the same area though? Hence why AMD can get away with physically stacking cache on top of more cache in their X3D parts, without the bottom layer melting.
    
    Kon-Peki 9 months ago
    
    Yes, cache has a much lower power density and could have been a candidate for that space.
    But I wasn’t on the design team and have no basis for second-guessing them. I’m just saying that cramming more performance CPU cores onto this die isn’t a realistic option.
    
    wtallis 9 months ago
    
    The SRAM that AMD is stacking also has the benefit of being last-level cache, so it doesn't need to run at anywhere near the frequency and voltage that eg. L1 cache operates at.
    
    IshKebab 9 months ago
    
    Cache doesn't use nearly as much power as active computation; that was my point.
  - VHRanger 9 months ago
    
    SRAM is extremely hot, it's the very opposite of dark silicon
JohnFen 9 months ago

> These NPUs are tying up a substantial amount of silicon area so it would be a real shame if they end up not being used for much.
This has been my thinking. Today you have to go out of your way to buy a system with an NPU, so I don't have any. But tomorrow, will they just be included by default? That seems like a waste for those of us who aren't going to be running models. I wonder what other uses they could be put to?
- jonas21 9 months ago
  
  NPUs are already included by default in the Apple ecosystem. Nobody seems to mind.
  - acchow 9 months ago
    
    It enables many features on the phone that people like, all without sending your personal data to the cloud. Like searching your photos for "dog" or "receipt".
  - JohnFen 9 months ago
    
    It's not really a question of minding if it's there, unless its presence increases cost, anyway. It just seems a waste to let it go idle, so my mind wanders to what other use I could put that circuitry to.
  - shepherdjerred 9 months ago
    
    I actually love that Apple includes this — especially now that they’re actually doing something with it via Apple Intelligence
- crazygringo 9 months ago
  
  Aren't they used for speech recognition -- for dictation? Also for FaceID.
  They're useful for more things than just LLM's.
  - JohnFen 9 months ago
    
    Yes, but I'm not interested in those sorts of uses. I'm wondering what else an NPU could be used for. I don't know what an NPU actually is at a technical level, so I'm ignorant of the possibilities.
    
    ItsBob 9 months ago
    
    I'm probably about to show my ignorance here (I'm not neck-deep in the AI space but I am a software architect...) but are they not just dedicated matrix multiplication engines (plus some other AI stuff)? So instead of asking the CPU to do the math, you have a dedicated area that does it instead... well, that's my understanding of it.
    As to why, I think it's along the lines of this: the CPU does 100 things, one of those is AI acceleration. Let's take the AI acceleration and give it its own space instead so we can keep the power down a bit, add some specialization, and leave the CPU to do other stuff.
    Again, I'm coming at this from a high-level as if explaining it to my ageing parents.
    
    JohnFen 9 months ago
    
    Yes, that's my understanding as well. What I meant is that I don't know the fine details. My ignorance is purely because I don't actually have a machine that has an NPU, so I haven't bothered to study up on them.
- heavyset_go 9 months ago
  
  The idea is that your OS and apps will integrate ML models, so you will be running models whether you know it or not.
  - JohnFen 9 months ago
    
    I'm confident that I'll be able to know and control whether or not my Linux and BSD machines will be using ML models.
    
    hollerith 9 months ago
    
    --and whether anyone is using your interactions with your computer to train a model.
    
    heavyset_go 9 months ago
    
    Luckily, while NPUs do nothing about data exfiltration, they're a poor solution for training models. Your data is still going to get sucked up to the mothership, but offloading training to your machine hopefully won't happen.
    
    hollerith 9 months ago
    
    Yes, when I was writing my comment, I was imagining my user-interaction data getting sucked up to data centers.
    
    heavyset_go 9 months ago
    
    I agree with the premise as a Linux user myself, but if you're using any JetBrains products, or Zoom, you're running models on the client-side. I suspect small models will continue to creep into apps. Even Firefox ships ML models in the browser.
- consteval 9 months ago
  
  We already can't fit much more in CPUs. You can't just throw cores in there. CPUs these days are, like, 80% cache if you look at the die. We constantly shrink the compute part, but we don't put much more compute - that space is just used for cache.
  So, I'm not sure that you're wasting much with the NPU. But I'm not an expert.
- jsheard 9 months ago
  
  > But tomorrow, will they just be included by default?
  That's already the way things are going due to Microsoft decreeing that Copilot+ is the future of Windows, so AMD and Intel are both putting NPUs which meet the Copilot+ performance standard into every consumer part they make going forwards to secure OEM sales.
  - AlexAndScripts 9 months ago
    
    It almost makes me want to find some use for them on my Linux box (not that is has an NPU), but I truly can't think of anything. Too small to run a meaningful LLM, and I'd want that in bursts anyway, I hate voice controls (at least with the current tech), and Recall sounds thoroughly useless. Could you do mediocre machine translation on it, perhaps? Local github copilot? An LLM that is purely used to build an abstract index of my notes in the background?
    Actually, could they be used to make better AI in games? That'd be neat. A shooter character with some kind of organic tactics, or a Civilisation/Stellaris AI that doesn't suck.
    
    ywvcbk 9 months ago
    
    > box
    Presumably you have a GPU? If so there is nothing an NPU can do that a discrete GPU can’t (and it would be much slower than a recent GPU).
    The real benefits are power efficiency and cost since they are built into the SoC which are not necessarily that useful on a desktop PC.
    
    Miraste 9 months ago
    
    In short: no. Current-gen NPUs are so slow they can't do anything useful. AMD and Intel have 2nd-gen ones that came out a few weeks ago, and by spec they may able to run local translation and small LLMs (haven't seen benchmarks yet), but for now they are laptop-only.
  - bcoates 9 months ago
    
    Microsoft has declared a whole lot of things to be the future of Windows, almost all of them were quietly sidelined in a version or two.
    https://www.joelonsoftware.com/2002/01/06/fire-and-motion/
    
    jsheard 9 months ago
    
    Yeah, but the lead times on silicon mean we're going to be stuck with Microsoft's decision for while regardless of how hard they commit to it. AMD and Intel probably already have two or three future generations of Copilot+ CPUs in the pipeline.
- idunnoman1222 9 months ago
  
  Voice to text
kllrnohj 9 months ago

Snapdragon X still has a full 12 cores (all same cores, it's homogeneous) and the Strix Point is also 12 cores but in a 4+8 configuration but with the "little" cores not sacrificing that much (nothing like the little cores in ARM's designs which might as well not even exist, they are a complete waste of silicon). Consumer software doesn't scale to that, so what are you going to do with more transistors allocated to the CPU?
It's not unlike why Apple puts so many video engines in their SoCs - they don't actually have much else to do with the transistor budget they can afford. Making single thread performance better isn't limited by transistor count anymore and software is bad at multithreading.
- wmf 9 months ago
  
  GPU "infinity" cache would increase 3D performance and there's a rumor that AMD removed it to make room for the NPU. They're not out of ideas for features to put on the chip.

eightysixfour 9 months ago

I thought the purpose of these things was not to be fast, but to be able to run small models with very little power usage? I have a newer AMD laptop with an NPU, and my power usage doesn't change using the video effects that supposedly run on it, but goes up when using the nvidia studio effects.

It seems like the NPUs are for very optimized models that do small tasks, like eye contact, background blur, autocorrect models, transcription, and OCR. In particular, on Windows, I assumed they were running the full screen OCR (and maybe embeddings for search) for the rewind feature.

boomskats 9 months ago

That's especially true because yours is a Xilinx FPGA. The one that they just attached to the latest gen mobile ryzens is 5x more capable too.
AMD are doing some fantastic work at the moment, they just don't seem to be shouting about it. This one is particularly interesting https://lore.kernel.org/lkml/DM6PR12MB3993D5ECA50B27682AEBE1...
edit: not an FPGA. TIL. :'(
- errantspark 9 months ago
  
  Wait sorry back up a bit here. I can buy a laptop that has a daughter FPGA in it? Does it have GPIO??? Are we seriously building hardware worth buying again in 2024? Do you have a link?
  - eightysixfour 9 months ago
    
    It isn't as fun as you think - they are setup for specific use cases and quite small. Here's a link to the software page: https://ryzenai.docs.amd.com/en/latest/index.html
    The teeny-tiny "NPU," which is actually an FPGA, is 10 TOPS.
    Edit: I've been corrected, not an FPGA, just an IP block from Xilinx.
    
    wtallis 9 months ago
    
    It's not a FPGA. It's an NPU IP block from the Xilinx side of the company. It was presumably originally developed to be run on a Xilinx FPGA, but that doesn't mean AMD did the stupid thing and actually fabbed a FPGA fabric instead of properly synthesizing the design for their laptop ASIC. Xilinx involvement does not automatically mean it's an FPGA.
    
    boomskats 9 months ago
    
    Do you have any more reading on this? How come the XDNA drivers depend on Xilinx' XRT runtime?
    
    wtallis 9 months ago
    
    It would be surprising and strange if AMD didn't reuse the software framework they've already built for doing AI when that IP block is instantiated on an FPGA fabric rather than hardened in an ASIC.
    
    boomskats 9 months ago
    
    Well, I'm irrationally disappointed, but thanks. Appreciate the correction.
    
    almostgotcaught 9 months ago
    
    because XRT has a plugin architecture: XRT<-shim plugin<-kernel driver. The shims register themselves with XRT. The XDNA driver repo houses both the shim and the kernel driver.
    
    boomskats 9 months ago
    
    Thanks, that makes sense.
    
    eightysixfour 9 months ago
    
    Thanks for the correction, edited.
    
    boomskats 9 months ago
    
    Yes, the one on the ryzen 7000 chips like the 7840u isn't massive, but that's the last gen model. The one they've just released with the HX370 chip is estimated at 50 TOPS, which is better than Qualcomm's ARM flagship that this post is about. It's a fivefold improvement in a single generation, it's pretty exciting.
    A̵n̵d̵ ̵i̵t̵'̵s̵ ̵a̵n̵ ̵F̵P̵G̵A̵ It's not an FPGA
    
    almostgotcaught 9 months ago
    
    > And it's an FPGA.
    nope it's not.
    
    boomskats 9 months ago
    
    I've just ordered myself a jump to conclusions mat.
    
    almostgotcaught 9 months ago
    
    Lol during grad school my advisor would frequently cut me off and try to jump to a conclusion, while I was explaining something technical often enough he was wrong. So I did really buy him one (off eBay or something). He wasn't pleased.
  - dekhn 9 months ago
    
    If you want GPIOs, you don't need (or want) an FPGA.
    I don't know the details of your use case, but I work with low level hardware driven by GPIOs and after a bit of investigation, concluded that having direect GPIO access in a modern PC was not necessary or desirable compared to the alternatives.
    
    errantspark 9 months ago
    
    I get a lot of use out of the PRUs on the BeagleboneBlack, I would absolutely get use out of an FPGA in a laptop.
    
    dekhn 9 months ago
    
    It makes more sense to me to just use the BeagleboneBlack in concert with the FPGA. Unless you have highly specific compute or data movement needs that can't be satisfied over a USB serial link. If you have those needs, and you need a laptop, I guess an FPGA makes sense but that's a teeny market.
- pclmulqdq 9 months ago
  
  It's not an FPGA. It's a VLIW DSP that Xilinx built to go into an FPGA-SoC to help run ML models.
  - almostgotcaught 9 months ago
    
    this is the correct answer. one of the compilers for this DSP is https://github.com/Xilinx/llvm-aie.
- beeflet 9 months ago
  
  It would be cool if most PCs had a general purpose FPGA that could be repurposed by the operating system. For example you could use it as a security processor like a TPM or as a bootrom, or you could repurpose it for DSP or something.
  It just seems like this would be better in terms of firmware/security/bootloading because you would be more able to fix it if an exploit gets discovered, and it would be leaner because different operating systems can implement their own stuff (for example linux might not want pluton in-chip security, windows might not want coreboot or linux-based boot, bare metal applications can have much simpler boot).
  - walterbell 9 months ago
    
    Xilinx Artix 7-series PicoEVB fits in M.2 wifi slot and has an OSS toolchain, http://www.enjoy-digital.fr/
- davemp 9 months ago
  
  Unfortunately FPGA fabric is ~2x less power efficient than equivalent ASIC logic at the same clock speeds last time I checked. So implementing general purpose logic on an FPGA is not usually the right option even if you don’t care about FMAX or transistor counts.
- numpad0 9 months ago
  
  Sorry for an OT comment but what is going on with that ascii art!? The content fits within 80 columns just fine[1], is it GPT generated?
  1: https://pastebin.com/raw/R9BrqETR
conradev 9 months ago

That is my understanding as well: low power and low latency.
You can see this in action when evaluating a CoreML model on a macOS machine. The ANE takes half as long as the GPU which takes half as long as the CPU (actual factors being model dependent)
- nickpsecurity 9 months ago
  
  To take half as long, doesn’t it have to perform twice as fast? Or am I misreading your comment?
  - eightysixfour 9 months ago
    
    No, you can have latency that is independent of compute performance. The CPU/GPU may have other tasks and the work has to wait for the existing threads to finish, or for them to clock up, or have slower memory paths, etc.
    If you and I have the same calculator but I'm working on a set of problems and you're not, and we're both asked to do some math, it may take me longer to return it, even though the instantaneous performance of the math is the same.
    
    refulgentis 9 months ago
    
    In isolation, makes sense.
    Wouldn't it be odd for OP to present examples that are the opposite of their claim, just to get us thinking about "well the CPU is busy?"
    Curious for their input.
  - conradev 9 months ago
    
    The GPU is stateful and requires loading shaders and initializing pipelines before doing any work. That is where its latency comes from. It is also extremely power hungry.
    The CPU is zero latency to get started, but takes longer because it isn't specialized at any one task and isn't massively parallel, so that is why the CPU takes even longer.
    The NPU often has a simpler bytecode to do more complex things like matrix multiplication implemented in hardware, rather than having to instantiate a generic compute kernel on the GPU.
monkeynotes 9 months ago

I believe that low power = cheaper tokens = more affordable and sustainable, to me this is what a consumer will benefit from overall. Power hungry GPUs seem to sit better in research, commerce, and enterprise.
The Nvidia killer would be chips and memory that are affordable enough to run a good enough model on a personal device, like a smartphone.
I think the future of this tech, if the general populace buys into LLMs being useful enough to pay a small premium for the device, is personal models that by their nature provide privacy. The amount of personal information folks unload on ChatGPT and the like is astounding. AI virtual girlfriend apps frequently get fed the most darkest kinks, vulnerable admissions, and maybe even incriminating conversations, according to Redditors that are addicted to these things. This is all given away to no-name companies that stand up apps on the app store.
Google even states that if you turn Gemini history on then they will be able to review anything you talk about.
For complex token prediction that requires a bigger model the personal could switch to consulting a cloud LLM, but privacy really needs to be ensured for consumers.
I don't believe we need cutting edge reasoning, or party trick LLMs for day to day personal assistance, chat, or information discovery.
refulgentis 9 months ago

You're absolutely right IMO, given what I heard when launching on-device speech recognition on Pixel, and after leaving Google, what I see from ex. Apple Neural Engine vs. CPU when running ONNX stuff.
I'm a bit suspicious of the article's specific conclusion, because it is Qualcomm's ONNX, and it be out of date. Also, Android loved talking shit about Qualcomm software engineering.
That being said, its directionally correct, insomuch as consumer hardware AI acceleration claims are near-universally BS unless you're A) writing 1P software B) someone in the 1P really wants you to take advantage.
- kristianp 9 months ago
  
  1P?
  - refulgentis 9 months ago
    
    First party, i.e. Google/Apple/Microsoft
godelski 9 months ago
```
  > but to be able to run small models with very little power usage
```
yes
But first, I should also say you probably don't want to be programming these things with python. I doubt you'll get good performance there, especially as the newness means optimizations haven't been ported well (even using things like TensorRT is not going to be as fast as writing it from scratch, and Nvidia is throwing a lot of man power at that -- for good reason! But it sure as hell will get close and save you a lot of time writing).
They are, like you say, generally optimized for doing repeated similar tasks. That's also where I suspect some of the info gathered here is inaccurate.
```
  (I have not used these NPU chips so what follows is more educated guesses, but I'll explain. Please correct me if I've made an error)
```
Second, I don't trust the timing here. I'm certain the CUDA timing (at the end) is incorrect, as the code written wouldn't properly time. Timing is surprisingly not easy. I suspect the advertised operations are only counting operations directly on the NPU while OP would have included CPU operations in their NPU and GPU timings[0]. But the docs have benchmarking tools, so I suspect they're doing something similar. I'd be interested to know the variance and how this holds after doing warmups. They do identify the IO as an issue, and so I think this is evidence of this being an issue.
Third, their data is improperly formatted.
```
  MATRIX_COUNT, MATRIX_A, MATRIX_B, MATRIX_K = (6, 1500, 1500, 256)
  INPUT0_SHAPE = [1, MATRIX_COUNT, MATRIX_A, MATRIX_K]
  INPUT1_SHAPE = [1, MATRIX_COUNT, MATRIX_K, MATRIX_B]
  OUTPUT_SHAPE = [1, MATRIX_COUNT, MATRIX_A, MATRIX_B]
```
You want "channels last" here. I suspected this (do this in pytorch too!) and the docs they link confirm.
1500 is also an odd choice and this could be cause for extra misses. I wonder how things would change with 1536, 2048, or even 256. Might (probably) even want to look smaller, since this might be a common preprocessing step. Your models are not processing full res images and if you're going to optimize architecture for models, you're going to use that shape information. Shape optimization is actually pretty important in ML[1]. I suspect this will be quite a large miss.
Fourth, a quick look at the docs and I think the setup is improper. Under "Model Workflow" they mention that they want data in 8 or 16 bit *float*. I'm not going to look too deep, but note that there are different types of floats (e.g. pytorch's bfloat is not the same as torch.half or torch.float16). Mixed precision is still a confusing subject and if you're hitting issues like these it is worth looking at. I very much suggest not just running a standard quantization procedure and calling it a day (start there! But don't end there unless it's "good enough", which doesn't seem too meaningful here.)
FWIW, I still do think these results are useful, but I think they need to be improved upon. This type of stuff is surprisingly complex, but a large amount of that is due to things being new and much of the details still being worked out. Remember that when you're comparing to things like CPU or GPU (especially CUDA) that these have had hundreds of thousands of man hours put into then and at least tens of thousands into high level language libraries (i.e. python) to handle these. I don't think these devices are ready for the average user where you can just work with them from your favorite language's abstraction level, but they're pretty useful if you're willing to work close to the metal.
[0] I don't know what the timing is for this, but I do this in pytorch a lot so here's the boilerplate
```
    times = torch.empty(rounds)
    # Don't need use dummy data, but here
    input_data = torch.randn((batch_size, *data_shape), device="cuda")
    # Do some warmups first. There's background actions dealing with IO we don't want to measure
    #    You can remove that line and do a dist of times if you want to see this
    # Make sure you generate data and save to a variable (write) or else this won't do anything
    for _ in range(warmup):
        data = model(input_data)
    for i in range(rounds):
        starter = torch.cuda.Event(enable_timing=True)
        ender = torch.cuda.Event(enable_timing=True)
        starter.record()
        data = model(input_data)
        ender.record()
        torch.cuda.synchronize()
        times[i] = starter.elapsed_time(ender)/1000
    total_time = times.sum()
```
The reason we do it this way is if we just wrap the model output with a timer then we're looking at CPU time but the GPU operations are asynchronous so you could get deceptively fast (or slow) times
[1] https://www.thonking.ai/p/what-shapes-do-matrix-multiplicati...
moffkalast 9 months ago

[flagged]
- eightysixfour 9 months ago
  
  The 7940HS shipped before recall and doesn't support it because it is not performant enough, so, that doesn't make sense.
  I just gave you a use case, mine in particular uses it for background blur and eye contact filters with the webcam and uses essentially no power to do it. If I do the same filters with nvidia broadcast, the power usage is dramatically higher.
  - moffkalast 9 months ago
    
    I doubt there's no notable power draw, NPUs in general have always pulled a handful of watts which should at least about match a modern CPU's idle draw. But it does seem odd that your power usage doesn't change at all, it might be always powered on or something.
    Eye contact filters seem like a horrible thing, autocorrect won't work better than a dictionary with a tiny model and I doubt these things can come even close to running whisper for decent voice transcription. Background blur alright, but that's kind of stretching it. I always figured Zoom/Teams do these things serverside anyway.
    And alright, if it's not MS making them do it, then they're just chasing the fad themselves while also shipping subpar hardware. Not sure if that makes it better.
    
    kalleboo 9 months ago
    
    > I doubt these things can come even close to running whisper for decent voice transcription
    https://github.com/ggerganov/whisper.cpp/pull/566
    "The performance gain is more than x3 compared to 8-thread CPU"
    And this is on the 3 year old M1 Pro
    
    Dylan16807 9 months ago
    
    > I doubt these things can come even close to running whisper for decent voice transcription.
    Whisper runs almost realtime on a single core of my very old CPU. I'd be very surprised if it can't fit in an NPU.
  - wtallis 9 months ago
    
    Intel is also about to launch their first desktop processors with an NPU which falls far short of Microsoft's performance requirements for a "Copilot+ PC". Should still be plenty for webcam use.

protastus 9 months ago

Deploying a model on an NPU requires significant profile based optimization. Picking up a model that works fine on the CPU but hasn't been optimized for an NPU usually leads to disappointing results.

CAP_NET_ADMIN 9 months ago

Beauty of CPUs - they'll chew through whatever bs code you throw at them at a reasonable speed.
- marginalia_nu 9 months ago
  
  I don't think this is correct. The difference between well optimized code and unoptimized code on the CPU is frequently at least an order of magnitude performance.
  Reason it doesn't seem that way is that the CPU is so fast we often bottleneck on I/O first. However, for compute-workloads like inference, it really does matter.
  - consteval 9 months ago
    
    While this is true, the most effective optimizations you don't do yourself. The compiler or runtime does it. They get the low-hanging fruit. You can further optimize yourself, but unless your design is fundamentally bad, you're gonna be micro-optimizing.
    gcc -O0 and -O2 has a HUGE performance gain. We don't really have anything to auto-magically do this for models, yet. Compilers are intimately familiar with x86.
    
    marginalia_nu 9 months ago
    
    While the compiler is decent at producing code that is good in terms of saturating the instruction pipeline, there are many things the compiler simply can't help you with.
    Having cache friendly memory access patterns is perhaps the biggest one. Though automatic vectorization is also still not quite there, so in cases where there's a severe bottleneck, doing that manually may still considerably improve performance, if the workload is vectorizable.
catgary 9 months ago

Yeah whenever I’ve spoken to people who work on stuff like IREE or OpenXLA they gave me the impression that understanding how to use those compilers/runtimes is an entire job.

fancyfredbot 9 months ago

The write up on the GitHub repo is much more informative than the blog.

When running int8 matmul using onnx performance is ~0.6TF.

https://github.com/usefulsensors/qc_npu_benchmark

dang 9 months ago

Thanks—we changed the URL to that from https://petewarden.com/2024/10/16/ai-pcs-arent-very-good-at-.... Readers may way want to look at both, of course!
- dhruvdh 9 months ago
  
  Oh, maybe also change the title? I flagged it because of the title/url not matching.

cjbgkagh 9 months ago

> We've tried to avoid that by making both the input matrices more square, so that tiling and reuse should be possible.

While it might be possible it would not surprise me if a number of possible optimizations had not made it into Onnx. It appears that Qualcomm does not give direct access to the NPU and users are expected to use frameworks to convert models over to it, and in my experience conversion tools generally suck and leave a lot of optimizations on the table. It could be less of NPUs suck and more of the conversions tools suck. I'll wait until I get direct access - I don't trust conversion tools.

My view of NPUs is that they're great for tiny ML models and very fast function approximations which is my intended use case. While LLMs are the new hotness there are huge number of specialized tasks that small models are really useful for.

Hizonner 9 months ago

> While LLMs are the new hotness there are huge number of specialized tasks that small models are really useful for.
Can you give some examples? Preferably examples that will run continuously enough for even a small model to stay in cache, and are valuable enough to a significant number of users to justify that cache footprint?
I am not saying there aren't any, but I also honestly don't know what they are and would like to.
- cjbgkagh 9 months ago
  
  I guess these days basically anything in ML prior to LLMs would be considered small. LLMs are rather unusual because of how large they are.
  NNs can be used as a general function approximators so any function which can be approximated is a candidate for using a NN in it's place. I have a very complex trig function that produces a high dimensional smooth manifold which I know will only be used within a narrow range of inputs and I can sacrifice some accuracy for speed. My inner loops have inner loops which have inner loops with inner loops. So when you're 4+ inner loops deep the speed becomes essential. I can sweep the entire input domain to make sure the error always stays within limits.
  If you're doing things such as counting instructions, intrinics, inline assembly, bit-twiddling, fast math, polynomial approximations, LUTs, fixed point math, etc. you could probably add NNs to your toolkit.
  Stockfish uses a 'small' 82K parameter neural net of 3 dense integer only layers (https://news.ycombinator.com/item?id=27734517). I think Stockfish performance would be a really good candidate for testing NPUs as there is a time / accuracy tradeoff.
- consteval 9 months ago
  
  iPhones use a lot of these. There's a bunch of little features that run on the NPU.
  Suggestions, predictive text, smart image search, automatic image classification, text selection in images, image processing. These don't run continuously, but I think they are valuable to a lot of users. The predictive text is quite good, and it's very nice to be able to search for vague terms like "license plate" and get images in my camera roll. Plus, selecting text and copying it from images is great.
  For desktop usecases, I'm not sure.
jaygreco 9 months ago

I came here to say this. I haven’t worked with the Elite X but the past gen stuff I’ve used (865 mostly) the accelerators - compute DSP and much smaller NPU - required _very_ specific setup, compilation with a bespoke toolchain, and communication via RPC to name a few.
I would hope the NPU on Elite X is easier to get to considering the whole copilot+ thing, but I bring this up mainly to make the point that I doubt it’s just as easy as “run general purpose model, expect it to magically teleport onto the NPU”.

_davide_ 9 months ago

The RTX 4080 should be capable of ~40 TFLOPS, yet they only report 2,160 billion operations per second. Shouldn't this be enough to reconsider the benchmark? They probably made some serious error in measuring FLOPS. Regarding the fact that CPU beats NPU is possible but they should benchmark many matrix multiplications without any application synchronization in order to have a decent comparison.

Grimblewald 9 months ago

That isnt the half of it. A quick skim of the documentation shows that the cpu inference wasnt done in a comparable way either.

lambda-research 9 months ago

The benchmark is matrix multiplcation with the shapes `(6, 1500, 256) X (6, 256, 1500)`, which just aren't that big in the AI world. I think the gap would be larger with much larger matrices.

E.g. Llama 3.1 8B which is one of the smaller models has matrix multiplications like `(batch, 14336, 4096) x (batch, 4096, 14336)`.

I just don't think this benchmark is realistic enough.

NebulaTrek 9 months ago

We ran qprof (a Qualcomm NPU profiler) on this benchmark. The profiling results indicate that the workload was distributed to the vector cores instead of the tensor core, which provide the vast majority of the compute power in the NPU (my back of napkin math suggests that HMX is 30x stronger than HVX).

The workload is relatively small, which results in underutilization of the hardware capacity due to the overhead associated with input/output quantization-dequantization and NCHW-NHCW mapping. Padding the weights and inputs to be a multiple of 64 would also help the performance.

Edit: Link to the profiling graph https://imgur.com/a/2OKR93e

Estimated HVX compute capability 421.43*1024/8 = 1.46TOPS in int8,

in which 4 is number of vector cores

2 is number operation per cycle

1.43GHz is HVX frequency

1024bit is vector register width

8bit is precision

NebulaTrek 9 months ago

The formula was formatted wrong, it should be 4 * 2 * 1.43 * 1024 / 8

teilo 9 months ago

Actual article title: Benchmarking Qualcomm's NPU on the Microsoft Surface Tablet

Because this isn't about NPUs. It's about a specific NPU, on a specific benchmark, with a specific set of libraries and frameworks. So basically, this proves nothing.

gnabgib 9 months ago

The title is from the original article (https://petewarden.com/2024/10/16/ai-pcs-arent-very-good-at-...), the URL was changed by dang: https://news.ycombinator.com/item?id=41863591
iml7 9 months ago

But you can’t get more clicks. You have to attack enough people to get clicks.I feel like this place is becoming more and more filled with posts and titles like this.
- gerdesj 9 months ago
  
  Internet points are a bit crap but HN generally discusses things properly and off topic and downright weird stuff generally gets downvoted to doom.

freehorse 9 months ago

I always thought that the main point of NPUs is energy efficiency (and being able to run ML models without taking over all computer resources, making it practical to integrate ML applications in the OS itself in ways that it does not disturb the user or the workflow) rather than being exceptionally faster. At least this has been my experience with running stable diffusion on macs. Similar to using other specialised hardware like media encoders; they are not necessarily faster than a CPU if you throw a dozen+ cpu cores on the task, but it will draw a minuscule part of the power.

tokyolights 9 months ago

Not sure why this isn't discussed more here. I think exactly the same, the NPU occupies more silicon area because it has custom circuits specifically to reduce the number of cycles (and thus energy) needed to perform those calculations. Doesn't necessarily mean that a CPU wouldn't be able to bulldoze through it faster (with much more energy).

cloudhan 9 months ago

OK, I am one of the developers in onnxruntime team. Perviously working on ROCm EP now has been transfered to QNN EP. The following is purely devrant and the opinions are mine.

So ROCm already sucks whereas QNN sucks even harder!

The conclusion here is NVIDIA knows how to make software that just works. AMD makes software that might work. Qualcomm, however, knows zero piece of shit of how to make a useful software.

The dev experience is just another level of disaster with Qualcomm. Their tools and APIs return absolutely zero useful infomation about what error you are getting, just an error code that you can grep from their include headers from SDK. To debug an error code, you need strace to get the internal error string on the device. Their profiler merely gives you a trace that cannot be associated back to original computing logic with very high stddev on the runtime. Their docs website is not indexed by the MF search engine, not to say LLMs, so if you have any question, good luck then!

So if you don't have a reason to use QNN, just don't use it (and any other NPU you name it).

Back to the benchmark script. There is a lot of flaws as I can see.

1. the session is not warmed up and the iteraion is too small. 2. the onnx graph is too small, I suspect the onnxruntime overhead cannot be ignored in this case. Try stack more gemm in the graph instead of increasing the iteration naively. 3. the "htp_performance_mode": "sustained_high_performance" might give a lower perf compare to "burst" mode.

A more reliable way to benchmark might just dump the context binary[1] and dump context inputs[2] and run this with qnn-net-run to get rid of the onnxruntime overhead.

[1]: https://github.com/cloudhan/misc-nn-test-driver/blob/main/qn... [2]: https://github.com/cloudhan/misc-nn-test-driver/blob/main/qn...

cloudhan 9 months ago

NPU folks offen time say
> it's not enough time to get new silicon designs specifically for <blahblah>
Where blahblah stands for a model architecture that has caused a paradigm shift.
When you need a new silicon for a new model, you are already losing.

pzo 9 months ago

Haven't played much with Qualcomm NPU but Apple Neural Engine available in iOS and MacOS for many Computer Vision models was significantly faster than when running on CPU or GPU (e.g. mediapipe models, yolo, depth-anything) - to the point that inference was much faster on Macbook M2 Max using its NPU that is the same as on older iPhones rather than executing on all 38 GPU cores.

This all depends on model architecture, conversions and tuning. Apple provides good tooling in XCode for benchmarking models up to execution time of single operators and where such operator got executed (CPU, GPU, NPU) in case couldn't been executed on NPU and have to fallback to CPU/GPU. Sometimes model have be tweaked to slightly different operator if it's not available in NPU. On top of that ML frameworks/runtimes such as ONNX/Pytorch/TensorflowLite sometimes don't implement all operators in CoreML or MPS.

lostmsu 9 months ago

The author's benchmark sucks if he could only get 2 tops from a laptop 4080. The thing should be doing somewhere around 80 tops.

Given that you should take his NPU results with a truckload of salt.

Havoc 9 months ago

>We see 1.3% of Qualcomm's NPU 45 Teraops/s claim

To me that suggests that the test is wrong.

I could see intel massaging results, but that far off seems incredibly improbable

wmf 9 months ago

This headline is seriously misleading because the author did not test AMD or Intel NPUs. If Qualcomm is slow don't say all AI PCs are not good.

guelermus 9 months ago

One should pay attention also to power efficiency, a direct comparison could be misleading here.

piskov 9 months ago

Snapdragon touts 45 TOPS but it’s only int8.

For example Apple's m3 neural engine is mere 18 TOPS but it’s FP16.

So windows has bigger number but it’s not apple to apple comparison.

Did author test int8 performance?

m00x 9 months ago

NPUs are efficient, not especially fast. The CPU is much bigger than the NPU and has better cache access. Of course it'll perform better.

acdha 9 months ago

It’s more complicated than that (you’re assuming that the bigger CPU is optimized for the same workload) but it’s also irrelevant to the topic at hand: they’re seeing this NPU within a factor of 2-4 of the CPU, but if it performed half as well as Qualcomm claims it would be an order of magnitude faster. The story here isn’t another round of the specialized versus general debate but that they fell so far short of their marketing claims.
llm_nerd 9 months ago

NPUs are actually incredibly fast for standard inference operations.
This benchmark is horribly flawed in many ways, and was so evidently useless that I'm surprised that they still decided to "publish" this. When your test gets 1% of the published performance, it's a good indication that things aren't being done correctly.

p1necone 9 months ago

I might be overly cynical but I just assumed that the entire purpose of "AI PCs" was marketing - of course they don't actually achieve much. Any real hardware that's supposedly for the "AI" features will actually be just special purpose hardware for literally anything the sales department can lump under that category.

jamesy0ung 9 months ago

What exactly does Windows do with a NPU? I don't own an 'AI PC' but it seems like the NPUs are slow and can't run much.

I know Apple's Neural Engine is used to power Face ID and the facial recognition stuff in Photos, among other things.

dagaci 9 months ago

Its used for improving video calls, special effects, image editing/ effects and noise cancelling, teams stuff
DrillShopper 9 months ago

It supports Microsoft's Recall (now required) spyware
- Janicc 9 months ago
  
  Please remind me again how Recall sends data to Microsoft. I must've missed that part. Or are you against the print screen button too? I heard that takes images too. Very scary.
  - cmeacham98 9 months ago
    
    While calling it spyware like GP is over-exaggeration to a ridiculous level, comparing Recall to Print Screen is also inaccurate:
    Print Screen takes images on demand, Recall does so effectively at random. This means Recall could inadvertently screenshot and store information you didn't intend to keep a record of (To give an extreme example: Imagine an abuser uses Recall to discover their spouse browsing online domestic violence resources).
  - bloated5048 9 months ago
    
    It's always safe to assume it does if it's closed source. I rather be suspicious of big corporations seeking to profit at every step than naive.
    Also, it's security risk which already been exploited. Sure, MS fixed it, but can you be certain it won't be exploited some time in the future again?
  - Terr_ 9 months ago
    
    > Please remind me again how Recall sends data to Microsoft. I must've missed that part.
    Sure, just post the source code and I'll point out where it does so, I somehow misplaced my copy. /s
    The core problem here is trust, and over the last several years Microsoft has burned a hell of a lot of theirs with power-users of Windows. Even their most strident public promises of Recall being "opt-in" and "on-device only" will--paradoxically--only be kept as long as enough people remain suspicious.
    Glance away and MS go back to their old games, pushing a mandatory "security update" which reset or entirely-removes your privacy settings and adding new "telemetry" streams which you cannot inspect.
  - throwaway314155 9 months ago
    
    [dead]
downrightmike 9 months ago

AI PC is just a marketing term, doesn't have any real substance
- acdha 9 months ago
  
  Yea, we know that. I believe that’s why the person you’re replying too was asking for examples of real usage.

dmitrygr 9 months ago

In general MAC unit utilization tends to be low for transformers, but 1.3% seems pretty bad. I wonder if they fucked up the memory interface for the NPU. All the MACs in the world are useless if you cannot feed them.

Hizonner 9 months ago

It's a tablet. It probably has like one DDR channel. It's not so much that they "fucked it up" as that they knowingly built a grossly unbalanced system so they could report a pointless number.
- dmitrygr 9 months ago
  
  Well, no. If the CPU can hit better numbers on the same model then the bandwidth from the DDR IS there. Probably the NPU does not attach to the proper cache level, or just has a very thin pipe to it
  - Hizonner 9 months ago
    
    The CPU is only about twice as good as the NPU, though (four times as good on one test). The NPU is being advertised as capable of 45 trillion operations per second, and he's getting 1.3 percent of that.
    So, OK, yeah, I concede that the NPU may have even worse access to memory than the CPU, but the bottom line is that neither one of them has anything close to what it needs to to actually delivering anything like the marketing headline performance number on any realistic workload.
    I bet a lot of people have bought those things after seeing "45 TOPS", thinking that they'd be able to usefully run transformers the size of main memory, and that's not happening on CPU or NPU.
    
    dmitrygr 9 months ago
    
    Yup, sad all round. We are in agreement.
moffkalast 9 months ago

I recall looking over the Ryzen AI architecture and the NPU is just plugged into PCIe and thus gets completely crap memory bandwidth. I would expect it might be similar here.
- PaulHoule 9 months ago
  
  I spent a lot of time with a business partner and an expert looking at the design space for accelerators and it was made very clear to me that the memory interface puts a hard limit on what you can do and that it is difficult to make the most of. Particularly if a half-baked product is being rushed out because of FOMO you’d practically expect them to ship something that gives a few percent of the performance because the memory interface doesn’t really work, it happens to the best of them:
  https://en.wikipedia.org/wiki/Cell_(processor)
- wtallis 9 months ago
  
  It's unlikely to be literally connected over PCIe when it's on the same chip. It just looks like it's connected over PCIe because that's how you make peripherals discoverable to the OS. The integrated GPU also appears to be connected over PCIe, but obviously has access to far more memory bandwidth.

woadwarrior01 9 months ago

IMO, benchmarking accelerator hardware with onnxruntime is like benchmarking a CPU with a Python script.

> We've seen similar performance results to those shown here using the Qualcomm QNN SDK directly.

Why not include those results?

Mistletoe 9 months ago

>The second conclusion is that the measured performance of 573 billion operations per second is only 1.3% of the 45 trillion ops/s that the marketing material promises.

It just gets so hard to take this industry seriously.

fschutze 9 months ago

Is there a possibility to use the Qualcomm SNPE SDK? I thought this SDK isn't bad. Also, for those who have access to the Qualcomm NPU: Is the Hexagon SDK working properly? Do apps still need to be signed (which i never got to work) when using Hexagon?

NoPicklez 9 months ago

Fairly misleading title, boiling down AI PCs to just the Microsoft Surface running Qualcomm

ein0p 9 months ago

Memory bound workload is memory bound. Doesn’t matter how many TOPS you have if you’re sitting idle waiting on DRAM during generation. You will, however notice a difference in prefill for long prompts.

downrightmike 9 months ago

They should have just made a pci card and not tried to push whole new machines on us. We are all good with the machines we already have. If you want to sell a new feature, then it needs to be an add-on

stanleykm 9 months ago

the ARM SME could be an interesting alternative to NPUs in the future. Unlike the NPUs which have at best some fixed function API it will be possible to program the SMEs more directly

bsmartt 9 months ago

what are all these folks hoping to accomplish? By crying and starting shit about windows recall, all you did was signal to their shareholders and the financial analysts that windows recall actually substance and not just a marketing facade. Otherwise, why would all those nerds be so angry?

So microsoft takes some of the criticisms on twitter and gets them in before shipping. Free appsec, nice.

Now, microsoft doesnt care about your benchmarks, dude. Grandma isnt gonna notice these workloads finish faster on a different compiled program utilizing different chips. Her last PC was EOL'd 10 years ago, it certainly cant keep up with this new ai laptop.

irusensei 9 months ago

Are NPUs the VLIW of our times in terms of hype?

wyldfire 9 months ago

It's interesting that you bring that up. VLIW isn't hype in this case, it's the actual architecture of the Hexagon ISA used in these PCs. As to whether AI PCs revolutionize our lives - well, that certainly might be hype.
cloudhan 9 months ago

Yes.

bsmartt 9 months ago

You dont seriously think MSFT expects this shit to benefit consumers do you? Their datacenters are overheating and the billing meter is still ticking while they burn, they need to figure out how to get consumers to start paying for this shit before they go broke and wall st sells them off for parts..

Either way, these are some of the first personal computers to have NPUs. They will improve. CPUs are 20 years optimized, this is literally the first try for some of these companies

bsmartt 9 months ago

if anything this is a very promising benchmark for the new tech,. we're getting close.
so what this means if NPUs are anywhere close to CPUs in the benchmarks is that NPUs are going to blow past CPUs very soon, because CPUs dont have much more weight to shed whereas NPUs are just getting started.

pram 9 months ago

I laughed when I saw that the Qualcomm “AI PC” is described as this in the ComfyUI docs:

"Avoid", "Nothing works", "Worthless for any AI use"

LikelyABurner 9 months ago

Didn't believe that anyone would be bridge-burning-happy enough to put this in their official docs, but you're not kidding: https://github.com/comfyanonymous/ComfyUI/wiki/Which-GPU-sho...
In retrospect, the fact that Intel and AMD's stock prices both closed slightly up when Microsoft announced the Snapdragon X on Windows 11 was a dead giveaway that the major players knew behind the scenes that it was being released seriously under baked.

tromp 9 months ago

> the 45 trillion operations per second that’s listed in the specs

Such a spec should be ideally be accompanied by code demonstrating or approximating the claimed performance. I can't imagine a sports car advertising a 0-100km/h spec of 2.0 seconds where a user is unable to get below 5 seconds.

dmitrygr 9 months ago

most likely multiplying the same 128x128 matrix from cache to cache. That gets you perfect MAC utilization with no need to hit memory. Gets you a big number that is not directly a lie - that perf IS attainable, on a useless synthetic benchmark
- kmeisthax 9 months ago
  
  Sounds great for RNNs! /s
tedunangst 9 months ago

I have some bad news for you regarding how car acceleration is measured.
- otterley 9 months ago
  
  Well, what is it?
  - ukuina 9 months ago
    
    Everything from rolling starts to perfect road conditions and specific tires, I suppose.

hkgjjgjfjfjfjf 9 months ago

Sutherland's wheel of reincarnation turns.

rationalfaith 9 months ago

[dead]