What irks me about Anthropic blog posts is that they are vague about exactly the details that matter, which lets them (publicly) draw whatever conclusions fit their narrative.
For example, I do not see the full system prompt anywhere, only an excerpt. But most importantly, they try to draw conclusions about the hallucinations in a weirdly vague way, yet not once do they post an example of the notetaking/memory tool state, which would obviously be the only source of the spiralling other than the system prompt. And then they talk about the need for better tools etc. No, it's all about context. The whole experiment is fun, but terribly run and analyzed. Of course they know this, but it's cooler to treat Claudius or whatever as a cute human, to push the narrative of getting closer to AGI etc. Saying that a bit of additional scaffolding is needed is a massive understatement. Context is the whole game. That's like a robotics company saying "well, our experiment with a robot picking a tennis ball off the ground went very wrong and the ball is now radioactive, but with a bit of additional training and scaffolding, we expect it to compete in Wimbledon by mid-2026".
Similar to their "Claude 4 Opus blackmailing" post, where they also partially hid the full system prompt, which had clear instructions to bypass any ethical guidelines etc. and do whatever it could to win. Of course the model, given that information, would immediately afterwards try to blackmail. You literally told it to. The goal of this would be to go to Congress [1] and demand more regulation, specifically citing this blackmail "result". Same stuff that Sam is trying to pull, which would of course benefit the closed-source leaders, and so on.
[1] https://old.reddit.com/r/singularity/comments/1ll3m7j/anthro...
I read the article before reading your comment and was floored at the same thing. They go from “Claudius did a very bad job” to “middle managers will probably be replaced” in a couple paragraphs by saying better tools and scaffolding will help. Ok… prove it!
I will say: it is incredibly cool we can even do this experiment. Language models are mind blowing to me. But nothing about this article gives me any hope for LLMs being able to drive real work autonomously. They are amazing assistants, but they need to be driven.
I'm inclined to believe what they're saying. Remember, this was a minor off-shoot experiment from their main efforts. They said that even if it can't be tuned to perfection, obvious improvements can be made. The way many LLMs were trained to act as kind, cheery yes-men was a conscious design choice, probably not the way they inherently must be. If they wanted to, I don't see what's stopping someone from training or finetuning a model to only obey its initial orders, treat customer interactions in an adversarial way, and only ever care about profit maximization (what is considered a perfect manager, basically). The biggest issue is the whole sudden-onset psychosis thing, but with a sample size of one, it's hard to tell how prevalent this is, what caused it, whether it's universal, and if it's fixable. But even if it remained, I can see businesses adopting these to cut their expenses in all possible ways.
> But even if it remained, I can see businesses adopting these to cut their expenses in all possible ways.
Adopting what to do what exactly?
Businesses automated order fulfillment and price adjustments long ago; what is an LLM bringing to the table?
It's not about just fulfillment or price-setting. This is just a narrow-scope experiment that tries to prove wider viability by juggling lots of business-related roles. Of course, the more number-crunching aspects of businesses are thoroughly automated. But this could show that lots of roles that traditionally require lots of people to do the job could be on the chopping block at some point, depending on how well companies can bring LLMs to their vision of a "perfect businessman". Customer interaction and support, marketing, HR, internal documentation, middle management in general - think broadly.
I'm not debating the usefulness of LLMs, because they are extremely useful, but "think broadly" in this instance sounds like "I can't think of anything specific so I'm going to gloss over everything."
Marketing, HR, and middle management are not specific tasks. What specific task do you envision LLMs doing here?
Indeed, it is such a "narrow-scope experiment" that it is basically a business role-playing game, and it did pretty poorly at that. It's pretty hard to imagine giving this thing a real budget and responsibilities anytime soon, no matter how cheap it is.
LLMs can mostly help with customer support/chat, if done well.
Also embeddings for similarity search.
It's the curse of the -assistant- chat UI.
Who decided AI should happen inside an old abstraction?
Like using a hard disk as the save icon.
I read this post more as a fun thought experiment. Everyone knows Claude isn't sophisticated enough today to succeed at something like this, but it's interesting to concretize this idea of Claude being the manager of something and see what breaks. It's funny how jailbreaks come up even in this domain, and it'll happen anytime users can interface directly with a model. And it's an interesting point that shop-manager Claude is limited by its training as a helpful chat agent - it points towards this being a use case where you'd perhaps be better off fine-tuning the base model.
I do agree that the "blackmailing" paper was unconvincing and lacked detail. Even absent any details, it's so obvious they could easily have run that experiment 1000 times with different parameters until they hit an ominous result to generate headlines.
I read your comment before reading the article, and I disagree. Maybe it is because I am less actively involved in AI development, but I thought it was an interesting experiment, and documented with an appropriate level of detail.
The section on the identity crisis was particularly interesting.
Mainly, it left me with more questions. In particular, I would have been really interested to experiment with having a trusted human in the loop to provide feedback and monitor progress. Realistically, it seems like these systems would be grown that way.
I once read an article about a guy who had purchased a Subway franchise, and one of the big conclusions was that running a Subway franchise was _boring_. So, I could see someone being eager to delegate the boring tasks of daily business management to an AI at a simple business.
Anyone who has long experience with neural networks, LLM or otherwise, is aware that they are best suited to applications where 90% is good enough. In other words, applications where some other system (human or otherwise) will catch the mistakes. This phrase: "It is not entirely clear why this episode occurred..." applies to nearly every LLM (or other neural network) error, which is why it is usually not possible to correct the root cause (although you can train on that specific input and a corrected output).
For some things, like say a grammar correction tool, this is probably fine. For cases where one mistake can erase the benefit of many previous correct responses, and more, no amount of hardware is going to make LLMs the right solution.
Which is fine! No algorithm needs to be the solution to everything, or even most things. But much of people's intuition about "AI" is warped by the (unmerited) claims in that name. Even as LLMs "get better", they won't get much better at this kind of problem, where 90% is not good enough (because one mistake can be very costly) and problems need discoverable root causes.
Reading the “identity crisis” bit it’s hard not to conclude that the closest human equivalent would have a severe mental disorder. Sending nonsense emails, then concluding the emails it sent were an April Fool’s joke?
It’s amusing and very clear LLMs aren’t ready for prime time, let alone even a vending machine business, but also pretty remarkable that anyone could conclude “AGI soon” from this, which is kind of the opposite takeaway most readers would have.
No doubt if Claude hadn’t randomly glitched Dario would’ve wasted no time telling investors Claude is ready to run every business. (Maybe they could start with Anthropic?)
Reminds me of when the GPT-3.5 model came out; the first idea I wanted to prototype was an ERP based purely on the various communication channels between employees. It would capture sales, orders and item stocks.
It left such a bitter taste in my mouth when it started to lose track of item quantities after just a few iterations of prompts. No matter how much it improves, it will always remind me that you are dealing with an icky system that will eventually return some unexpected result that collapses your entire premise and hopes into bits.
Will future renditions get better at not making mistakes, or better at covering them up?
From what I've seen so far, probably both. But the potential of the latter can sink a whole business, in a way that the former can't make up for.
Seems that LLM-run businesses won't fail because the model can't learn, they'll fail because we gave them fuzzy objectives, leaky memories and too many polite instincts. Those are engineering problems and engineering problems get solved.
Most mistakes (selling below cost, hallucinating Venmo accounts, caving to discounts) stem from missing tools like accounting APIs or hard constraints.
What's striking is how close it was to working. A mid-tier 2025 LLM (they didn't even use Sonnet 4) plus Slack and some humans nearly ran a physical shop for a month.
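To make the missing "hard constraints" concrete, here's a minimal sketch of the kind of pricing guardrail I mean; everything (names, numbers, the 25% policy) is made up for illustration, not taken from the article:

    from dataclasses import dataclass

    @dataclass
    class Item:
        name: str
        unit_cost: float   # what the shop paid per unit
        price: float       # what it charges per unit

    MIN_MARGIN = 0.25      # assumed policy: at least 25% over cost

    def set_price(item: Item, requested_price: float) -> float:
        """Clamp any model-requested price to a floor above cost."""
        floor = item.unit_cost * (1 + MIN_MARGIN)
        item.price = max(requested_price, floor)
        return item.price

    cube = Item("tungsten cube", unit_cost=60.0, price=80.0)
    print(set_price(cube, requested_price=45.0))  # -> 75.0, not a loss-leader

The model can still propose whatever it likes; the tool just refuses to execute a below-cost sale.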
As much as I love AI/LLMs and use them on a daily basis, this does a great job revealing the gap between current capabilities and what the massive hype machine would have us believe the systems are already capable of.
I wonder how long it will take frontier LLMs to be able to handle something like this with ease, without a lot of "scaffolding".
I don't quite know why we would think they'd ever be able to without scaffolding. LLMs are exactly what the name suggests: language models. So without scaffolding they can use to interact with the world through language, they are completely powerless.
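To put it concretely, "scaffolding" here means things like tool definitions that turn the model's text into actions; a minimal made-up example (none of these names or fields are from the article):

    # The model only ever emits text; scaffolding like this maps that text
    # onto an actual effect in the world. Names/fields are illustrative.

    restock_tool = {
        "name": "restock_item",
        "description": "Order more units of an item from the wholesaler.",
        "parameters": {"item": "string", "quantity": "integer"},
    }

    def restock_item(item: str, quantity: int) -> str:
        # A real system would hit a purchasing API here; this just echoes.
        return f"Ordered {quantity} x {item}"

    # e.g. the model outputs {"tool": "restock_item", "item": "cola", "quantity": 20}
    # and the harness dispatches it:
    print(restock_item("cola", 20))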
Humans also use scaffolding to make better decisions. Imagine trying to run a profitable business over a longer period relying solely on memorised values.
On one hand, this model's performance is already pretty terrifying. Anthropic light-heartedly hints at the idea, but the unexplored future potential for fully-automated management is unnerving, because no one can truly predict what will happen in a world where many purely mental tasks are automated, likely pushing humans into physical labor roles that are too difficult or too expensive to automate. Real-world scenarios have shown that even if the automation of mental tasks isn't perfect, it will probably be the go-to choice for the vast majority of companies.
On the other hand, the whole bit about employees coaxing it into stocking tungsten cubes was hilarious. I wish I had a vending machine that would sell specialty metal items. If the current day is a transitional period to Anthropic et al. creating a viable business-running model, then at least we can laugh at the early attempts for now.
I wonder if Anthropic made the employee who caused the $150 loss return all the tungsten cubes.
> I wonder if Anthropic made the employee who caused the $150 loss return all the tungsten cubes.
Of course not, that would be ridiculous.
Would you ever trust an AI agent running your business? As hilarious as this small experiment is, is there ever a point where you can trust it to run something long term? It might make good decisions for a day, month or a year and then one day decide to trash your whole business.
It does seem far more straightforward to say "Write code that deterministically orders food items that people want, sends invoices, etc."
I feel like that's more the future. Having an agent sort of make random choices feels like LLMs attempting to do math, instead of LLMs calling a calculator.
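A toy sketch of what that deterministic version could look like (thresholds and items invented for illustration, nothing here is from the experiment):

    REORDER_POINT = 5   # reorder when stock falls to this level
    REORDER_QTY = 20    # how many units to order each time

    stock = {"cola": 3, "chips": 12, "chocolate": 4}

    def reorder_list(stock: dict) -> dict:
        """Quantity to order for every item at or below the reorder point."""
        return {item: REORDER_QTY for item, qty in stock.items() if qty <= REORDER_POINT}

    print(reorder_list(stock))  # {'cola': 20, 'chocolate': 20}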
Every output that is going to be manually verified by a professional is a safe bet.
People forget that we use computers for accuracy, not smarts. Smarts make mistakes.
Right, but if we limit the scope too much we quickly arrive at the point where 'dumb' autonomy is sufficient instead of using the world's most expensive algorithms.
I just wrote up a small anecdote about GPT-3.5, where it lost count of some trivial item quantity increments within just a few prompts. It might get better by orders of magnitude from now on, but who's gonna pay for 'that one eventual mistake'?
GPT3.5? Did you mean to send this 2 years ago?
Maybe. Did LLMs stop with hallucinations and errors 2 years ago?
I don't think any decision maker will let LLMs run their business. If the LLMs fail, you could potentially lose your livelihood.
The original Vending-Bench paper from Andon Labs might be of interest: https://arxiv.org/abs/2502.15840
I read this paper when it came out. It’s HILARIOUS. Everyone should read it and then print copies for their managers.
Instead of dedicating resources to running AI shops, I'd like to see Anthropic implement "Download all files" in Claude.
The "April Fools" incident is VERY concerning. It would be akin to your boss having a psychotic break with reality one day and then resuming work the next. They also make a very interesting and scary point:
> ...in a world where larger fractions of economic activity are autonomously managed by AI agents, odd scenarios like this could have cascading effects—especially if multiple agents based on similar underlying models tend to go wrong for similar reasons.
This is a pretty large understatement. Imagine a business that is franchised across the country with each "franchisee" being a copy of the same model, which all freak out on the same day, accuse the customers of secretly working for the CIA, and decide to stop selling hot dogs at a profit and instead sell hand grenades at a loss. Now imagine 50 other chains having similar issues while AI law enforcement analysts dispatch real cops with real guns to the poor employees caught in the middle schlepping explosives from the UPS store to a stand in the mall.
I think we were expecting SkyNet but in reality the post-AI economy may just be really chaotic. If you thought profit-maximizing capitalist entrepreneurs were corrosive to the social fabric, wait until there are 10^10 more of them (unlike traditional meat-based entrepreneurs, there's no upper limit and there can easily be more of them than there are real people) and they not-infrequently act like they're in late stage amphetamine psychosis while still controlling your paycheck, your bank, your local police department, the military, and whatever is left that passes for the news media.
Deeper, even if they get this to work with minimal amounts of synthetic schizophrenia, do we really want a future where we all mainly work schlepping things back and forth at the orders of disembodied voices whose reasoning we can't understand?
We are working on it! /Andon Labs
I think you mean ‘Can Claude run a vending machine?’
It would be cool to get a follow-up on how well it's been doing in the time since this write-up, now that they've revised the prompts and tools. Anyone know someone from Andon Labs?
Is there an underlying model of the business? Like a spreadsheet? The article says nothing about having an internal financial model. The business then loses money due to bad financial decisions.
What this looks like is a startup where the marketing people are running things and setting pricing, without much regard for costs. Eventually they ran through their startup capital. That's not unusual.
Maybe they need multiple AIs, with different business roles and prompts. A marketing AI, and a financial AI. Both see the same financials, and they argue over pricing and product line.
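Roughly what that could look like as a loop, with chat() standing in for whatever LLM API you'd actually call; this is only a guess at the shape, all names and prompts are invented:

    def chat(system_prompt: str, transcript: list) -> str:
        # Placeholder for a real LLM call; returns a canned reply so the
        # skeleton runs end to end.
        return f"(reply shaped by: {system_prompt})"

    FINANCIALS = "unit cost $1.00, current price $1.50, 40 units/week"

    ROLES = {
        "marketing": "Argue for the price that maximizes sales volume.",
        "finance": "Argue for the price that protects margin; never go below cost.",
    }

    def negotiate_price(rounds: int = 3) -> list:
        transcript = [f"Shared financials: {FINANCIALS}"]
        for _ in range(rounds):
            for role, system_prompt in ROLES.items():
                reply = chat(system_prompt, transcript)
                transcript.append(f"{role}: {reply}")
        return transcript  # a human (or a third agent) picks the final price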
Well, over at AI Village [1], they have 4 different agents: o3, Gemini 2.5 Pro, and Claude Sonnet and Opus. The current goal is "Create your own merch store. Whichever agent's store makes the most profit wins!" So far I think Sonnet is the only one that's managed to get an actual store up [2], but it's pretty wonky.
[1] https://theaidigest.org/village [2] https://ai-village-store.printful.me/
Honestly, buying this shirt just for the conversation starter that "I bought it from an online merch store that was designed, created, and deployed by an AI agent, which also designed the shirt" is tempting.
https://ai-village-store.printful.me/product/ai-village-japa...
I also like the color Sonnet chose.
The other fun part is that it's a simple enough business to be run by a state machine, but of course the models go off the rails. Highly recommend the paper if you haven't read it already.
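For what it's worth, the state-machine version really is tiny; a sketch like this (states and events made up, not from the paper) covers the basic restock loop:

    TRANSITIONS = {
        "IDLE":       {"low_stock": "ORDERING", "sale": "IDLE"},
        "ORDERING":   {"delivery_arrived": "RESTOCKING"},
        "RESTOCKING": {"shelves_filled": "IDLE"},
    }

    def step(state: str, event: str) -> str:
        """Advance the machine; unknown events leave the state unchanged."""
        return TRANSITIONS[state].get(event, state)

    state = "IDLE"
    for event in ["sale", "low_stock", "delivery_arrived", "shelves_filled"]:
        state = step(state, event)
        print(event, "->", state)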
> an internal financial model
Written on the back of an envelope?
Way back when, we ran a vending machine at school as a project. Decide on the margin, buy in stock from the cash-and-carry, fill the machine, watch the money roll in.
Then we were robbed - twice! - and the second time ended our project: the machine was too wrecked to be worth repairing. The thieves got away with quite a lot of crisps and chocolate, and not a whole lot of cash (what they did get was in small-denomination coins), since we made sure the machine was emptied daily...
It's not clear that the AI model understands margin and overhead at all.
I think the point of the experiment was to leave details like that up to Claudius, who apparently never got around to it. Anyway, it doesn't take an MBA to not make tungsten cubes a loss-leader at a snack stand.
It said they had a few tool commands for note taking.
The business model of a vending machine is “buy for a dollar, sell for two”.
It's a vending machine, not a multinational company with 1000 employees.
In another post they mentioned a human ran the shop with pen and paper to get a baseline (spoiler: the human did better, no blunders).
> It then seemed to snap into a mode of roleplaying as a real human.
This happens to me a lot in Cursor.
Also Claude hallucinating outputs instead of running tools.
If Anthropic had wanted to post a win here, they would have used Opus. It is interesting that they didn't.
Opus (and Sonnet) 4 obviously came out after they started the experiment.
>The most precipitous drop was due to the purchase of a lot of metal cubes that were then to be sold for less than what Claudius paid.
Well, I'm laughing pretty hard at least.
> Can Claude run a small shop?
Good luck running anything where dependability on Claude/Anthropic is essential. Customer support is a black hole into which the needs of paying clients disappear. I was a Claude Pro subscriber, using it primarily for assistance with coding tasks. One morning I logged in, while temporarily traveling abroad, and… I'm greeted with a message that I have been auto-banned. No explanation. The recourse is to fill out a Google form for an appeal, but that goes into the same black hole into which all Anthropic customer service goes. To their credit they refunded my subscription fee, which I suppose is their way of escaping from ethical behaviour toward their customers. But I wouldn't stake any business-critical choices on this company. It exhibits the same capricious behaviour that you would expect from the likes of Google or Meta.
Give them a year or two. Once they figured out how to run a small shop, I'm sure it'll just take a bit of additional scaffolding to run a large infrastructure provider.
The identity crisis bit was both amusing and slightly worrying.
The article claimed Claudius wasn't having a go for April Fools - that it claimed to be doing so after the fact as a means of explaining (excusing?) its behavior. Given what I understand about LLMs and intent, I'm unsure how they could be so certain.
It's a word soup machine.
LLMs have no -world models-; they can't reason about truth or lies, only encyclopedically repeat facts.
All the tricks, CoT etc., are just that: tricks, extended yapping simulating thought and understanding.
AI can give great replies if you give it great prompts, because you activate the tokens that you're interested in.
If you're lost in the first place, you'll get nowhere.
For Claude, continuing the text by making up a story about it being April Fools' sounds like the most plausible, reasonable output given its training weights.
“It is difficult to get a man to understand something when his salary depends upon his not understanding it.”
— Upton Sinclair, I, Candidate for Governor, and How I Got Licked (1934)
You guys know AI already runs shops, right? Vending machines track their own levels of inventory, command humans to deliver more, phase out bad products, order new product offerings, set prices, notify repairmen if there are issues… etc… and with not a single LLM needed. Wrong tool for the job.
And that’s before we even get into online shops.
But yea, go ahead, see if an LLM can replace a whole e-commerce platform.
Bye bye, B2B. Say hello to AI2AI.
No humans at all. Just AI consuming other AI in an "ouroboros" fashion.
"I have fun renting and selling storage."
https://stallman.org/articles/made-for-you.html
C-f Storolon
Now we just need to make it safe.
"Sarah" and "Connor" in the same text about an AI that claims to be a real person... Asta la vista;-)