Memory bandwidth is the bottleneck in the Spark. If you replace the SoC with an optimized ASIC but keep the same 256-bit LPDDR5, the performance will be the same. You can increase performance by using wider memory, but that's also more expensive.
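A rough way to see why bandwidth, not the compute die, sets the ceiling: in single-stream decode, every weight is streamed from memory once per generated token, so tokens/s is bounded by bandwidth divided by model size in bytes. This sketch assumes a transfer rate of 8533 MT/s on the 256-bit bus and a hypothetical 70B-parameter model quantized to 4 bits. Both figures are illustrative assumptions. It ignores KV-cache reads and batching, so real numbers will be lower.

```python
def peak_bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: float) -> float:
    """Theoretical peak bandwidth in GB/s (decimal): bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfer_rate_mts * 1e6 / 1e9


def decode_ceiling_tok_s(bandwidth_gbs: float, params_billions: float, bytes_per_param: float) -> float:
    """Upper bound on single-stream decode speed when every weight is read once per token."""
    model_size_gb = params_billions * bytes_per_param  # billions of params x bytes/param = GB
    return bandwidth_gbs / model_size_gb


if __name__ == "__main__":
    # Assumed: 256-bit bus at 8533 MT/s (illustrative, not a measured spec).
    bw = peak_bandwidth_gbs(256, 8533)
    # Hypothetical 70B-parameter model at 4-bit quantization (0.5 bytes per parameter).
    ceiling = decode_ceiling_tok_s(bw, 70, 0.5)
    print(f"peak bandwidth ~{bw:.0f} GB/s, decode ceiling ~{ceiling:.1f} tok/s")
```

Swapping in a faster ASIC changes neither input to `decode_ceiling_tok_s`. Only a wider bus or faster memory moves the result, which is the comment's point.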
Personally, I doubt it. Apple hamstrung themselves with unified SoC memory; there are cheap dGPUs that smoke the M5's prefill speeds and even have faster decode. Apple is running into the limitations of pitting a mobile integrated chipset against the desktop form factor. At that scale, an SoC stops looking like a smart decision.
The software side is still pretty sketchy, too. Apple's ecosystem is fractured between the NPU, MPS, and Accelerate BLAS, with libraries like MLX and Core ML built precariously on top. Apple would have to commit to a full rearchitecture of its GPU to challenge Nvidia, which would fracture that ecosystem even further.
Me too! The problem is that people don't love having 128 GB of DDR5 held back by a laptop-grade iGPU. It delivers strictly non-interactive speeds for LLMs of that size.
When you layer those same models across 128 GB of dGPUs, you can actually fill the KV cache in seconds instead of minutes. And you get higher memory bandwidth on most professional cards.
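Prefill is compute-bound rather than bandwidth-bound, which is why the gap shows up as seconds versus minutes. A common rule of thumb is about 2 FLOPs per parameter per prompt token. This sketch applies it to a hypothetical 70B model and a 32k-token prompt, with two made-up sustained throughputs standing in for an iGPU and a set of dGPUs. It ignores the attention term, which grows with context length, so treat the results as lower bounds.

```python
def prefill_seconds(params: float, prompt_tokens: int, sustained_flops: float) -> float:
    """Approximate prefill time using the ~2 FLOPs per parameter per token rule of thumb.

    Ignores attention FLOPs (which grow with context length) and memory stalls.
    """
    total_flops = 2 * params * prompt_tokens
    return total_flops / sustained_flops


if __name__ == "__main__":
    params = 70e9     # hypothetical 70B-parameter model
    prompt = 32_768   # hypothetical 32k-token prompt
    # Hypothetical sustained throughputs, not vendor specs.
    for label, flops in [("iGPU-class, 30 TFLOPS", 30e12), ("multi-dGPU, 400 TFLOPS", 400e12)]:
        print(f"{label}: ~{prefill_seconds(params, prompt, flops):.0f} s")
```

Under these assumptions, the slower device needs about two and a half minutes to fill the cache, and the faster setup needs about eleven seconds.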
I don't expect them to be AS fast as Nvidia anytime soon. Understood that they need architectural improvements to get there.
Apple's business model will be to pay Google for compute for now, and then as they get better on device, move more and more locally. So they're very well incentivized to get better. The thing they've been best at in the last 19 years has been spinning flywheels they already have, and this is exactly that.
I'm just genuinely convinced that Apple's AI flywheel is going in reverse. They killed their golden goose with OpenCL, which had a genuine shot at dethroning CUDA if Apple had taken it seriously. It had industry-wide buy-in and multiple implementations before Apple threw in the towel. When they designed Apple Silicon, they could have used the lessons from that experience to create a CUDA-like general-purpose compute layer instead of focusing on raster efficiency for their GPUs. Nvidia had proven this was possible on low-power ARM SoCs like Tegra and Jetson, which did deliver CUDA in handheld devices. But Apple instead chose to delegate AI to the NPU, which is now dark silicon on devices that defer to MPS backends for most inference. The architecture is locked into an expensive and suboptimal raster-first GPU design.
It's not hard to see why Apple made those mistakes, and the rest of the industry made many of them too. It's specifically tragic that Apple snatched defeat from the jaws of victory with GPGPU programming, and it makes me think that their future will hold more subscription services and fewer half-assed technical efforts. Or they rip up the foundation and start from scratch; it's never too late to start work on Apple Silicon 2.
I think it's easy to understand why Apple wouldn't build low-level engineering solutions: they'd rather control the platform and just have developers call MLX. If I were in their shoes, I'm not sure I'd make the same call. But it's a call, and it's consistent with the rest of their ecosystem decisions.
Broadcom has become wealthy by being Google's TPU hardware partner, including sharing their TSMC capacity with Google, and evidently now they are doing the same thing with OpenAI. What a brilliant way to take advantage of the AI gold rush!
I wish they weren't using their piles of money to extort money out of the software industry like they are with VMware and Bitnami.
> Broadcom has become wealthy by being Google's TPU hardware partner...
Kinda, but not exactly.
Broadcom cornered the enterprise infra and security market in the late 2010s and early 2020s after acquiring CA Technologies, BMC (EDIT: Did NOT acquire them, they were considering it back in 2018 but decided against it and KKR ended up acquiring them), Symantec (which they bought instead of BMC), and VMware. That let them build a strong cybersecurity story during the late-2010s cybersecurity and SaaS boom.
That gave them plenty of cash flow, which helped subsidize their hardware business when hardware wasn't viewed as being as hot as it is today.
Additionally, Broadcom is GCP's marquee customer and has been for a little under a decade, so they were able to make a sweetheart deal: all the software businesses at Broadcom would use GCP exclusively, and in return GCP would work with Broadcom to design Google's silicon and source the infra needed for Google's DC buildouts.
Ironically, the US government blocking Broadcom's acquisition of Qualcomm (a 2018 presidential order issued on CFIUS's recommendation) was the best thing that could have happened to Broadcom, because it left Broadcom the dry powder to dominate enterprise SaaS and build a strong niche in the cybersecurity space.
> piles of money to extort money out of the software industry
From personal experience, executives and leadership who started off in the electronics and hardware industry are much more vicious and cutthroat than their peers who started in software.
Working in an industry that historically had to deal with heavy commoditization, low margins, and long-tail sales produces leadership that can execute. Additionally, no one climbs the leadership ladder without having spent years as a line-level engineer, though to an extent that's true in software as well.
Edit: can't reply
> Did they acquire also BMC?
Nope.
Broadcom was considering acquiring them in 2018 but decided not to go through with the opportunity and KKR jumped in.
Good information. Broadcom is a playa: lots and lots of acquisitions! (A quick Google search turns up a very eventful history for Broadcom.)
> From personal experience, executives and leadership who started off in the electronics and hardware industry are much more vicious and cutthroat than their peers who started in software.
Andy Grove's Only the Paranoid Survive is quite a name for a management book. It implies surviving in exactly the world you're describing.
I just read a claim on Twitter that the reason these companies (Google and Amazon as well as OpenAI) are using Broadcom isn't just for design expertise, but because Broadcom have allocation agreements in place with TSMC and the memory manufacturers.
Most design partners have allocation agreements. The thing is, Broadcom is an absolute GIANT in the ASIC design space, and its closest competitor, Marvell, is a fraction of its size.
There are a lot of large tech companies that most of HN has never heard of that completely dominate entire segments.
Nope, not down. "total Personal Computing Device (PCD) market — comprising traditional PCs and tablets — posted 2.8% year-over-year growth in Q1 2026, with combined shipments reaching 103.3 million units. PC shipments grew 3% YoY with 65.6 million units" https://www.idc.com/promo/pcdforecast/
Q2 is forecast to be negative, partly because of RAM prices like you said, but for the most part this is something that only price-sensitive nerds care about. Broadcom sells a ton of server chips. Server sales are up 30% vs. last year, so I highly doubt they're desperate to use their allocation.
> the full-year 2026 [PCD] outlook has been revised to −10.4% year-over-year
because
> erosion of consumer purchasing power amid regional inflation and currency volatility in many key markets, compounded by memory and storage shortages that are proving more severe than anticipated in the previous forecast cycle.
The positive Q1 YoY growth
> was largely the product of pull-forward demand, as both consumer and commercial buyers accelerated purchases ahead of anticipated price increases and limited product availability.
The idea that only nerds care about the cost of things is... absurd.
> The idea that only nerds care about the cost of things is... absurd.
For hardware purchases, laypeople may go about it the other way from what nerds would do: instead of deciding what they need in terms of computing power and memory, and then finding a cheap offer for that, they just decide how much they want to spend, and then buy a device at that price point irrespective of its performance characteristics. If you shop like this, and would have purchased anything but a rock-bottom low-end device two years ago, prices have remained stable.
I was actually thinking of smartphones first because they seem to be the best-selling "personal computing devices" (a different definition from IDC's) and come with a lot of RAM these days (8-16 GB or so? Mine has 12). And there I confused Broadcom with Qualcomm: Qualcomm's biggest end customers seem to be smartphone buyers.
I thought of PCs second since most chip manufacturers make something or other that goes into them (Broadcom probably more than Qualcomm), and yes, it's very surprising that PC sales don't seem to be down yet.
Pretty huge move. Google and their TPUs are looking infinitely more prescient: I think they're on their 7th generation, and they inspired offshoots like Groq's LPU and perhaps others, such as Cerebras and its Wafer Scale Engine.
However, based on first impressions, it seems like this is meant for the inference side and not training, which is also an interesting choice.
What makes you think this? With wider adoption, the ratio will shift in favor of inference. And API price is becoming more important than SOTA capability.
Training is pretty much a one-time cost, and it's already getting cheaper thanks to architectural improvements. Inference, though, is an ongoing cost that takes orders of magnitude more resources over time, so making it far more efficient yields far greater gains over time.
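A break-even calculation makes the one-time versus ongoing distinction concrete. Every figure below is a hypothetical placeholder. The point is that a recurring cost overtakes a fixed one after a finite number of periods, and that growth in usage pulls that point earlier.

```python
def months_until_inference_exceeds_training(training_cost: float,
                                            monthly_inference: float,
                                            monthly_growth: float = 0.0) -> int:
    """Return the first month in which cumulative inference spend exceeds a one-time training cost.

    monthly_growth is the fractional month-over-month growth in inference spend (0.05 = 5%).
    """
    if monthly_inference <= 0:
        raise ValueError("monthly_inference must be positive")
    if monthly_growth < 0:
        raise ValueError("negative growth could mean the threshold is never crossed")
    cumulative = 0.0
    spend = monthly_inference
    month = 0
    while cumulative <= training_cost:
        month += 1
        cumulative += spend
        spend *= 1 + monthly_growth
    return month


if __name__ == "__main__":
    # Hypothetical: a 100 (say, $M) training run versus 10/month of inference.
    print(months_until_inference_exceeds_training(100, 10))        # flat usage
    print(months_until_inference_exceeds_training(100, 10, 0.10))  # usage growing 10%/month
```

With flat usage, inference overtakes training in month 11. With 10% monthly growth, it overtakes training in month 8. Under either assumption, inference efficiency ends up dominating the total.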
> early testing shows that Jalapeño will deliver performance per watt substantially better than current state-of-the-art
We're starting to see what really matters here, and though this is hand-wavy, the TPU makes similar claims.
I think Google's memo about having no moat still stands (see: https://newsletter.semianalysis.com/p/google-we-have-no-moat... if you are unaware). It kind of makes sense that all of this is looking more like the hardware race between IBM, DEC, Cray, and Sun from the '60s to the '90s. History doesn't repeat, but it often rhymes, and I suspect these efforts will follow the same trajectory.
To be clear, that is not "Google's memo". It's a memo by a guy who happened to work at Google. There is a diversity of opinions at a company that employs 180,000 people.
Cerebras's Codex Spark 5.3 has been a huge flop. Small context window and old model. But hopefully they can improve so that we can benefit from 1000 tokens/second with GPT 5.5.
One thing I don't like about California-based companies is how cringe the names always are.
"Jalapeño" is such a bad name; having an "ñ" already makes it difficult and annoying to deal with in so many little ways. Good luck with that.
But also, there's the sort of "yes, let's use Mexican-related things because we're California" thought that I just really hate. I don't know, it's like Corporate Memphis to me. You see a product like this, and you know it's an uppity California-based firm that came up with it.
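The "little ways" are real. One of the most common is that ñ has two valid Unicode encodings. As a result, the same visible name can compare unequal, have different lengths, and fail on ASCII-only systems. Here is a short illustration using only Python's standard `unicodedata` module:

```python
import unicodedata

name = "Jalapeño"

nfc = unicodedata.normalize("NFC", name)  # ñ as one precomposed code point (U+00F1)
nfd = unicodedata.normalize("NFD", name)  # n followed by a combining tilde (U+0303)

print(len(nfc), len(nfd))        # 8 9
print(nfc == nfd)                # False: same glyphs, different code points
print(len(nfc.encode("utf-8")))  # 9: ñ takes two bytes in UTF-8

try:
    nfc.encode("ascii")          # legacy ASCII-only pipelines reject the name outright
except UnicodeEncodeError as exc:
    print("ASCII encoding failed:", exc.reason)
```

The usual defense is to normalize to a single form (typically NFC) at every system boundary before comparing or storing strings.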
No worse, I suppose, than the obsession with Lord of the Rings that the authoritarian surveillance companies have: Palantir, Anduril. Then we have the non-defense/surveillance ones: Mithril, Valar, Narya, Erebor.
None, probably. Just saying Jalapeño is no worse than any other non-descriptive company name. Although at least Palantir and Anduril are aptly named for what they do. The VC firms less so.
You are correct. I don't know why I thought there were 5 Rs in strawberry, and now I look properly I can count them correctly, there are indeed 6 Rs in strawberry.
I am sorry for initially giving an incorrect answer.
Don't worry, in Europe it's the same, but for insurance and legal companies. Tons of companies have names based on Latin words, such as Civitas/Insalus/Legalia/Legalitas or whatever, which look tacky/rancid/old-fashioned from kilometers away.
> May we scale smoothly, exponentially and uneventfully through A[SI]
That sentence sounds weird to me. I can't really put my finger on why; maybe it's the combination of adverbs, or just the fact of stating the desire to scale as a company so directly. It feels (to me) like openly proclaiming their selfish goals. Or maybe I'm just misinterpreting, and they are referring to all of humanity as "We" (but knowing Broadcom's doings, and to a lesser extent OpenAI's, I am not convinced).
They don't have true competition; what they lose out on is market share with hyperscalers, since OpenAI presumably has no plans to share inference hardware with any other company right now. Plus, I don't know how NVIDIA's investment equation pans out long-term, given that OpenAI will be investing in a more purpose-built inference stack for the future.
I'm assuming they used LLMs to (help humans) do custom circuit design. Even pre-LLM, there were various automated optimization methods that didn't require humans, like genetic algorithms. It'd be cool to see a paper on how they did it.
Con or not, it is an obvious thing they have to do. Might as well promise.
IIRC, their biggest cost, which they're "hiding" in their financials through creative accounting (putting it into marketing and whatnot, in the billions), is inference. If they can't hide it in their S-1, then they have to rationalize it. One option is a) increasing prices (not gonna happen; with token-based billing, orgs are already watching their Codex spend). The other is b) lowering inference costs. You can lower those by "soft optimizing" (dumbing down) your models, but then you have the other players breathing down your neck (see the quick rise of Claude). Or you can actually optimize, in software and in hardware. We're like 5 years into the rise of LLMs, so there's not THAT much left on the table unless you write to metal you designed specifically for your models. I'm also pretty sure that avoiding the "Nvidia tax" would cover most of the R&D costs of a custom solution, at least in the long term.
50% cheaper inference without losses in fidelity would unquestionably be a massive win for OpenAI.
I mean, I'd love to be able to buy something like the 17k tok/s Taalas chip as a PCIe or M.2 card.
Imagine when we can roar along at that speed at low power. We could just have the model reason for a while about anything and everything. It reminds me of "race to idle" for MCUs, etc.
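"Race to idle" is the idea that finishing a task quickly at higher power, then dropping into a low-power idle state, can use less total energy than running slowly the whole time. This sketch compares energy over a fixed time window. Every wattage and duration in it is a made-up placeholder, not a measurement of any real chip.

```python
def energy_joules(active_w: float, active_s: float, idle_w: float, window_s: float) -> float:
    """Energy over a fixed window: run at active power until done, then idle for the remainder."""
    if active_s > window_s:
        raise ValueError("task does not finish inside the window")
    return active_w * active_s + idle_w * (window_s - active_s)


if __name__ == "__main__":
    window = 10.0  # seconds
    # Hypothetical slow chip: 50 W for the full 10 s.
    slow = energy_joules(50, 10.0, 5, window)
    # Hypothetical fast chip: 200 W for 1 s, then 5 W idle for the remaining 9 s.
    fast = energy_joules(200, 1.0, 5, window)
    print(f"slow: {slow:.0f} J, fast-then-idle: {fast:.0f} J")
```

The fast chip draws four times the power but finishes ten times sooner, so it wins (245 J vs 500 J). The advantage only holds while idle power stays low relative to active power.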
The current Taalas chip is for an 8B-parameter model (Llama 3.1 8B). I hope so much that they can get that up to the 30B range. Just imagine Gemma 4 or Qwen 3.6 at 17k tok/s.
It's odd to me that I haven't heard anything about this approach (baking LLMs/weights into silicon directly) since. It seems almost common-sense that we're going to end up there eventually. And it feels like that point is drawing ever closer now that model capabilities, if not quite plateauing out, are at least getting to a "good enough" point for a LOT of use cases.
I wonder if it's being worked on in secret, if there's something about it that makes it infeasible, or if companies are really too nervous to lock in one model like that because the next one down the line could be a huge improvement. Re: infeasibility, I have heard that the Taalas demonstration chip ran Llama 3.1 8B (a pretty horrible model) and that even that took a massive number of transistors / die area. So it might just be the case that the good models are too big to fit on silicon?
Using multiple chips seems to work fine for Cerebras and Groq, so it should also work for Taalas. It does sound challenging to reach >10K tok/s, but inter-chip latency could be below 1 µs, which is a small part of the per-token budget.
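Here is the arithmetic behind "a small part of the token budget." At a target of T tokens/s, a single sequential stream has 1/T seconds per token, and each inter-chip hop on the critical path eats into that. The hop counts and the 1 µs latency are assumptions for illustration.

```python
def hop_overhead_fraction(target_tok_s: float, hops_per_token: int, hop_latency_s: float) -> float:
    """Fraction of the per-token time budget spent on inter-chip hops (sequential, unbatched decode)."""
    budget_s = 1.0 / target_tok_s
    return hops_per_token * hop_latency_s / budget_s


if __name__ == "__main__":
    # Hypothetical: 10K tok/s target, 4 hops per token, 1 microsecond per hop.
    frac = hop_overhead_fraction(10_000, 4, 1e-6)
    print(f"per-token budget: {1e6 / 10_000:.0f} us, hop overhead: {frac:.1%}")
```

At 10K tok/s, each token gets 100 µs, so four 1 µs hops cost about 4% of the budget. The overhead grows linearly with both hop count and target speed.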
> It's odd to me that I haven't heard anything about this approach ... I wonder if it's being worked on in secret, if there's something about it that makes it infeasible
The studies and efforts are ongoing and public, and there are technical hurdles to overcome, but the relevant work goes back quite a long way, and there is heightened interest in it now.
It seems that you simply took the "hyped headlines" for the whole of the work.
> It seems that you simply took the "hyped headlines" for the whole of the work.
Well, yeah, that's what I'm saying. It's odd that there haven't been any major headlines (customer interest, competitors' announcements, etc) other than their initial demo. Good to hear it's being worked on though!
Did we not play with MNIST and place calculated bets on NNs well before Yann LeCun started the fire with the explosive success of convolutional NNs?
I'd say it pretty consistently starts in the underground.
The real revolution here is that it /could/ be done practically, overcoming the hurdles. But as far as interest in the matter is concerned, I'd say there can hardly be greater interest at this stage than in making NNs efficient. This should be absolutely evident, just as it is evident that separating memory from the processor runs against the idea of NNs, and that multiplication can be achieved purely physically.
Of course, many have seen that and started studying it. As soon as it becomes optimally practical...
> It's odd to me that I haven't heard anything about this approach since.
It has only been four months since they unveiled their first prototype. I don't understand your confusion. Chip development does not happen overnight...?
Their initial blog post laid out a roadmap, so theoretically they should have another thing to demonstrate this summer.
You are focusing on Taalas, but analogue computing, electronic NNs, compute-in-memory, etc. (the broader field that includes this approach) date back to Rosenblatt.
I'd say the original remark was more general («this approach (baking LLMs/weights into silicon directly) [... as if] worked on in secret»). That is salient, because when I looked into it weeks ago, I found a large number of attempts at CIM and, more generally, at departing from the von Neumann architecture to optimize NN implementations in hardware.
Universities are studying, startups are proposing: the «approach» is below the big-headlines level, but quite lively. Not just Taalas, and not just their way, which remains remarkable in the scene because the hardware is built, working, online, available... and amazing.
CIM does not bake the weights into silicon. The level of optimization that you can do down to the last transistor when the weights are fixed is on an entirely different level than CIM where you still need general purpose ALUs all over the place.
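Fixed weights allow this kind of aggressive specialization for a concrete reason. When the second operand is a known constant, a multiplier can be replaced by a handful of shifts and adds, and a shift by a constant is just wiring. This sketch uses canonical signed-digit (CSD) recoding, a standard constant-multiplication technique, to count how few adders a given integer weight needs. It illustrates the general idea only. It is not a description of how Taalas actually builds its chips, and the function names are hypothetical.

```python
def csd_digits(c: int) -> list[tuple[int, int]]:
    """Canonical signed-digit (non-adjacent form) recoding of a non-negative integer.

    Returns (shift, sign) pairs with sign in {+1, -1} such that c == sum(sign << shift).
    No two nonzero digits occupy adjacent bit positions.
    """
    if c < 0:
        raise ValueError("expects a non-negative constant")
    digits = []
    shift = 0
    while c:
        if c & 1:
            d = 2 - (c & 3)  # c % 4 == 1 -> +1, c % 4 == 3 -> -1
            digits.append((shift, d))
            c -= d           # remaining value is now divisible by 4
        c >>= 1
        shift += 1
    return digits


def mul_by_constant(x: int, digits: list[tuple[int, int]]) -> int:
    """Multiply x by a hard-wired constant using only shifts and additions/subtractions."""
    acc = 0
    for shift, sign in digits:
        if sign > 0:
            acc += x << shift
        else:
            acc -= x << shift
    return acc


if __name__ == "__main__":
    for weight in (7, 93):
        digits = csd_digits(weight)
        print(f"x * {weight}: {digits}, add/sub ops: {max(len(digits) - 1, 0)}")
```

For example, 7 becomes 8 - 1 (one subtractor instead of two adders), and 93 (binary 1011101) needs three add/subtract operations instead of four. A general-purpose multiplier must handle any operand, so it cannot be pruned this way.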
If that were the extent of the terms, then what could we call "baking the weights into silicon"? Setting parts of the circuits to fixed values for multiplication is like printing a Read-Only Memory. (And you compute at it: Compute In Memory.)
> CIM where you still need general purpose ALUs all over the place
If that were so, then why do taxonomists present analogue computing as part of CIM? Ohm's Law does not constitute an "ALU" the way you intend it.
Put simply, I used CIM ("Compute In Memory") for lack of a better term: for "store data where you modify data", for "beyond von Neumann's separation of data storage and processor".
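Here is the Ohm's-law point in code. In a resistive crossbar, each cell's conductance G[i][j] stores a weight, and input voltages V[i] drive the rows. Ohm's law gives a per-cell current G·V, and Kirchhoff's current law sums those currents on each column wire. The column currents therefore form a matrix-vector product with no digital ALU involved. This idealized simulation is a hypothetical sketch. It ignores noise, wire resistance, and ADC quantization, which are exactly the drawbacks that make analogue approaches hard at scale.

```python
def crossbar_column_currents(conductances: list[list[float]], voltages: list[float]) -> list[float]:
    """Idealized resistive crossbar: I_j = sum_i G[i][j] * V[i].

    conductances[i][j]: conductance (siemens) of the cell at row i, column j.
    voltages[i]: voltage (volts) driven onto row i.
    Returns the current (amperes) collected on each column wire.
    """
    n_rows = len(conductances)
    if n_rows != len(voltages):
        raise ValueError("one voltage per row is required")
    n_cols = len(conductances[0]) if n_rows else 0
    currents = [0.0] * n_cols
    for i in range(n_rows):
        for j in range(n_cols):
            # Ohm's law per cell; Kirchhoff's current law accumulates along column j.
            currents[j] += conductances[i][j] * voltages[i]
    return currents


if __name__ == "__main__":
    G = [[1e-3, 2e-3],
         [3e-3, 4e-3]]   # hypothetical weights encoded as conductances
    V = [0.5, 0.25]      # hypothetical input activations encoded as voltages
    print(crossbar_column_currents(G, V))
```

The result equals the transpose of G applied to V. The "multiplication" is the physics of each cell, and the "accumulation" is the physics of the shared wire.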
EDIT: It's just not worth arguing this point, so I've deleted my original, much longer comment. Abstract taxonomies can claim that Taalas is CIM, but this entirely and utterly misses the point, and it misses what makes Taalas' approach special. If you told a room full of chip architects to go build "CIM for AI", they would not build a totally specialized, Taalas-like chip. So the label is not sufficient and, from my point of view, just muddies the conversation. People have been doing "CIM" for decades, and yet I've never seen anyone build a totally specialized chip at the scale of Taalas. And yes, you can (in theory) build an analog version of any computer, so of course you can build analog CIM, but "analog compute" is not inherently CIM, so conflating the two is just confusing.
I can't check everything right now, but for example, the explainer from Rakesh Kumar mentions "Analogue CIM".
And I do not get your rant about "analog computing", which has everything to do with NNs (otherwise, well, prove it): NNs started with it; in fact, they basically are that. Analogue computing is a great temptation, since it would solve the inefficiency of digital NNs. Unfortunately, it has drawbacks that are massive for big NNs. Taalas' approach seems to be the best compromise.
(Note, just for housekeeping: this replied to the original parent comment, which was later edited. The parallel reply ("Pretty simply") responds to the edited comment.)
Pretty simply: the von Neumann architecture separates data storage and processing; we are now researching the alternatives in all their forms, whatever makes NNs efficient; those alternatives basically focus on reducing the distance between stored data and its processing; some simply call stored data "memory"; we already had the concept of "Compute in Memory"; now some are calling any solution that allows processing in the vicinity of storage "CIM".
It is not the traditional CIM with ALUs. Too bad. We can easily tell the difference.
In the sense of interested customers, online discussion, other companies doing the same thing, etc. Of course it takes time to get actual results, but from an outsider's perspective it's surprising that it was basically just their initial demo and that's more or less it so far. Excited to see if they come out with something this summer though!
This is just an uncut wafer; I don't think it's intended to be a wafer-scale chip.
Cerebras etch memory onto the wafer alongside the processing elements, but AFAIK OpenAI are going to be using HBM memory and a conventional chiplet design.
I call BS. It's probably a white-label wrapper around existing Broadcom IP; it's impossible to go from zero to this kind of chip in nine months. I doubt OpenAI made any significant contribution.
9 months to production is completely impossible anyway.
9 months from design to early samples is probably impossible, given that TSMC takes 3 months after tape-out to produce them. Then it's up to the customer to qualify and revise for production; TSMC doesn't do that.
I am not sure how much of the work is done by OpenAI, or whether it is basically a Broadcom chip specifically built for OpenAI models. It is a necessary step, but building a high-performance chip is not easy. Look at companies like Groq, Amazon, and Google.
Both Google and Amazon also co-design heavily with Broadcom (Amazon also with Marvell and Alchip).
Broadcom does things like physical design, IP blocks, managing the manufacturing process with TSMC, packaging, and testing. Google and Amazon handle the system architecture, performance targets, and requirements, with Broadcom acting as a consultant.
I want a super-fast LLM that is, like, Opus 4.6+ in ability.
For inference workloads, it makes a lot more sense to optimize for prefill/TTFT before maxing out memory bandwidth.
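A simple roofline check shows why prefill/TTFT and bandwidth are separate optimization targets. A phase is compute-bound when its arithmetic intensity (FLOPs per byte moved) exceeds the hardware's ratio of peak FLOPs to memory bandwidth. For a dense layer, intensity grows with the number of tokens processed per weight read. Prefill over a long prompt therefore lands on the compute side, while single-stream decode lands on the bandwidth side. This sketch counts only weight traffic (no activations or KV cache), and the hardware numbers are hypothetical.

```python
def arithmetic_intensity(tokens_per_weight_read: int, bytes_per_param: float) -> float:
    """FLOPs per byte for a dense matmul: ~2 FLOPs per param per token, weights read once."""
    return 2 * tokens_per_weight_read / bytes_per_param


def roofline_bound(intensity: float, peak_flops: float, bandwidth_bytes_s: float) -> str:
    """Classify a workload against a simple roofline model."""
    machine_balance = peak_flops / bandwidth_bytes_s  # FLOPs the chip can do per byte it can load
    return "compute-bound" if intensity > machine_balance else "bandwidth-bound"


if __name__ == "__main__":
    peak, bw = 100e12, 1e12  # hypothetical: 100 TFLOPS, 1 TB/s -> balance of 100 FLOPs/byte
    decode = arithmetic_intensity(1, 0.5)      # one token per weight pass, 4-bit weights
    prefill = arithmetic_intensity(4096, 0.5)  # 4096-token prompt in one pass
    print("decode:", roofline_bound(decode, peak, bw))
    print("prefill:", roofline_bound(prefill, peak, bw))
```

Raising bandwidth only helps the decode line. Raising usable FLOPs is what shortens prefill and therefore TTFT.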
1. https://www.investing.com/news/stock-market-news/openai-unve...
[0] https://www.goodreads.com/book/show/66863.Only_the_Paranoid_...
https://finance.yahoo.com/sectors/technology/articles/broadc...
Oh dear god. I'm actually feeling sorry for Google at that point. Good luck, you'll need it...
Nvidia is king of general-purpose training chips. But inference can be specialized.
Yes? That’s why more money will be spent on inference than training?
I’m talking absolute cost. As the number of people using AI and burning tokens goes up the amount of spend on inference goes up.
I am fairly confident that Anthropic has way way more GPUs serving Claude Code to users than they have training models. They’ve got a lot of users!!
> API price is becoming more important than SOTA capability.
Also yes? This is why custom silicon for efficient inference makes sense!
I think we’re in total agreement here :)
Jalapeño
Jalapeño
Really has a… ring to it
Make sure you all use that fancy ñ
So it comes after the IPO, and it will be featured heavily in the IPO sales brochure as a future promise?
I'm sceptical over any pre-IPO announcements.
No, the nonprofit org stays nonprofit, while the for-profit org it owns will become publicly traded.
See https://openai.com/index/evolving-our-structure/
Does anybody actually believe that?
IIRC their biggest cost they're "hiding" in their financials by doing creative accounting is inference (putting it into marketing and whatnot, in the billions)... if they can't hide it in their S-1 then they have to rationalize it, either by a) increasing the prices (not gonna happen, with token based billing orgs are already watching their codex spends) or b) lowering the inference costs. You can lower that by "soft optimizing" (dumbing down) your models but then you have the other players breathing down your neck (see quick rise of Claude), or actually optimizing, in software and in hardware. We're like 5 years into the rise of LLMs, there's not THAT much left on the table unless you write to the metal you specifically designed for your models (and I'm pretty sure the lack of "nvidia tax" would help with covering most of the r&d costs of a custom solution, at least in the long term).
50% cheaper inference without losses in fidelity would unquestionably be a massive win for OpenAI.
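To put the "Nvidia tax pays for the R&D" argument in numbers, here is a back-of-envelope break-even sketch. Every figure in it is a hypothetical placeholder, not anything from OpenAI's or Broadcom's actual books, and it ignores per-chip manufacturing cost and the time value of money.

```python
# Back-of-envelope break-even for custom inference silicon.
# All inputs are HYPOTHETICAL placeholders, not real company figures.

def months_to_break_even(rnd_cost: float, monthly_inference_spend: float,
                         savings_fraction: float) -> float:
    """Months until cumulative inference savings cover a one-off R&D cost.

    Ignores per-chip manufacturing cost and discounting (simplifying assumption).
    """
    monthly_savings = monthly_inference_spend * savings_fraction
    if monthly_savings <= 0:
        raise ValueError("savings must be positive to ever break even")
    return rnd_cost / monthly_savings


# Hypothetical: $1.5B of chip R&D, $500M/month inference spend, 50% cheaper.
print(months_to_break_even(1.5e9, 5e8, 0.5))  # prints 6.0
```

The point is only the shape of the argument: at a large enough inference bill, even a big fixed R&D cost amortizes quickly once the per-token savings are large.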
Imagine when we can roar along at that speed at low power. You could just have the model reason for a while about anything and everything. It reminds me of "race to idle" for MCUs etc.
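For anyone unfamiliar with "race to idle": over a fixed time window, finishing the work quickly at high power and then dropping to a low idle draw can use less total energy than running slowly at moderate power. A minimal sketch, with made-up power and time figures chosen only for illustration:

```python
# Race-to-idle energy comparison over a fixed window.
# Power and time values below are HYPOTHETICAL, for illustration only.

def energy_joules(active_power_w: float, active_time_s: float,
                  idle_power_w: float, window_s: float) -> float:
    """Energy used if the task runs at active power, then the chip idles."""
    if active_time_s > window_s:
        raise ValueError("task does not finish inside the window")
    return active_power_w * active_time_s + idle_power_w * (window_s - active_time_s)


# Same task, 10 s window, 1 W idle draw.
slow = energy_joules(20.0, 8.0, 1.0, 10.0)  # 160 J active + 2 J idle
fast = energy_joules(60.0, 2.0, 1.0, 10.0)  # 120 J active + 8 J idle
print(slow, fast)  # prints 162.0 128.0
```

Whether racing wins depends on the idle draw being low and the speedup outpacing the power increase; with these numbers the fast chip uses 3x the power but finishes 4x sooner.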
It's odd to me that I haven't heard anything about this approach (baking LLMs/weights into silicon directly) since. It seems almost common-sense that we're going to end up there eventually. And it feels like that point is drawing ever closer now that model capabilities, if not quite plateauing out, are at least getting to a "good enough" point for a LOT of use cases.
I wonder if it's being worked on in secret, if there's something about it that makes it infeasible, or if companies are just too nervous to lock in one model like that because the next one down the line could be a huge improvement. Re. infeasibility, I've heard that the Taalas demonstration chip ran Llama 3.1 8B (a pretty horrible model) and that even that took a massive number of transistors / amount of die area. So it might just be the case that the good models are too big to fit on silicon?
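A quick way to see why size is the worry: the bits that have to be physically stored scale linearly with parameter count. The sketch below assumes 4-bit weights (an assumption, not Taalas' actual precision) and treats "one transistor per stored bit" as a loose lower bound, which is also an assumption; real storage and multiply circuitry cost more.

```python
# How many bits must be "baked in" to hold a model's weights.
# 4-bit weights and the 70B comparison model are ASSUMPTIONS for illustration.

def weight_bits(num_params: float, bits_per_weight: int) -> float:
    """Total bits needed to store all weights at a given precision."""
    return num_params * bits_per_weight


def to_gigabytes(bits: float) -> float:
    """Convert bits to decimal gigabytes."""
    return bits / 8 / 1e9


small = weight_bits(8e9, 4)    # an 8B model at 4 bits: 3.2e10 bits
large = weight_bits(70e9, 4)   # a hypothetical 70B model at 4 bits
print(to_gigabytes(small))     # prints 4.0
print(large / small)           # prints 8.75
```

If even one transistor per bit were the true cost, an 8B model would already need tens of billions of transistors just for weights, and every step up in model size multiplies that.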
I guess that makes sense. Is this feasible, or does the added latency between chips kill any of the performance gains?
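One way to reason about the multi-chip question: if a model is split across chips in a pipeline, a single token's latency pays for every chip and every link, but with many independent requests in flight, throughput is set by the slowest stage. A minimal sketch with hypothetical microsecond figures, assuming link transfers overlap with compute:

```python
# Latency vs. throughput for a model split across chips in a pipeline.
# Stage and hop times are HYPOTHETICAL; assumes transfers overlap with compute.

def per_token_latency_us(num_chips: int, compute_us_per_chip: float,
                         hop_us: float) -> float:
    """One token traverses every chip and crosses num_chips - 1 links."""
    return num_chips * compute_us_per_chip + (num_chips - 1) * hop_us


def steady_state_throughput_tok_s(compute_us_per_chip: float, hop_us: float) -> float:
    """With enough independent streams in flight, every chip and link stays
    busy, so throughput is bounded by the slowest stage, not the path length."""
    return 1e6 / max(compute_us_per_chip, hop_us)


print(per_token_latency_us(4, 10.0, 2.0))          # prints 46.0
print(steady_state_throughput_tok_s(10.0, 2.0))    # prints 100000.0
```

So inter-chip hops mostly hurt single-stream latency; aggregate throughput survives as long as each hop is no slower than each chip's share of the compute.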
Taalas has a running demo here: https://chatjimmy.ai/
It's eye-opening: it generated an AVX-512-optimized Mersenne Twister in C in 0.076s, at 13,706 tok/s. That's too fast for the tok/s figure to be terribly accurate.
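Why the figure is shaky: at these durations, a few milliseconds of timing error (timer granularity, request overhead) move the rate a lot. The sketch back-computes the token count implied by the two quoted numbers, then applies a hypothetical ±5 ms timing error; the 5 ms is an assumption, not a measured overhead of the demo.

```python
# Sensitivity of tokens/s to timing error on very short generations.
# The +/- 5 ms error is a HYPOTHETICAL value for illustration.

def tokens_per_second(num_tokens: int, elapsed_s: float) -> float:
    return num_tokens / elapsed_s


def rate_range(num_tokens: int, measured_s: float, timing_error_s: float):
    """Rates if the true duration is measured_s +/- timing_error_s."""
    return (tokens_per_second(num_tokens, measured_s + timing_error_s),
            tokens_per_second(num_tokens, measured_s - timing_error_s))


implied_tokens = round(13_706 * 0.076)  # about 1042 tokens
low, high = rate_range(implied_tokens, 0.076, 0.005)
print(f"{low:,.0f} to {high:,.0f} tok/s")  # prints 12,864 to 14,676 tok/s
```

A 5 ms slop on a 76 ms run is roughly a 7% swing either way, so the last few digits of "13,706" shouldn't be taken literally.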
The studies and efforts are ongoing and public, and there are technical hurdles to overcome, but the relevant work goes back quite a long way and there is heightened interest in it now.
It seems that you simply mistook the "hyped headlines" for the whole body of work.
Well, yeah, that's what I'm saying. It's odd that there haven't been any major headlines (customer interest, competitors' announcements, etc) other than their initial demo. Good to hear it's being worked on though!
I'd say it pretty consistently starts underground.
The real revolution here is that it /could/ be done practically, overcoming the hurdles. As for interest in the matter, I'd say there can hardly be greater interest at this stage than in making NNs efficient. That should be absolutely evident, as evident as it is that separating memory from processor goes against the very idea of NNs, and as evident as it is that multiplication can be achieved purely physically.
Of course, many have seen that and set about studying it. As soon as it becomes optimally practical...
It has only been four months since they unveiled their first prototype. I don't understand your confusion. Chip development does not happen overnight...?
Their initial blog post laid out a roadmap, so theoretically they should have another thing to demonstrate this summer.
The person I replied to was acting as if Taalas was ancient history. I was pointing out it has only been a few months.
Universities are studying it and startups are proposing designs; the «approach» is below the big-headlines level but quite lively. It's not just Taalas, and not just their way, though theirs remains remarkable in the field because the HW is built, working, online, available... and amazing.
If that were the extent of the terms, then what could we call "baking the weights into silicon"? Setting parts of the circuit to fixed values for multiplication is like printing a Read-Only Memory. (And you compute in it: Compute In Memory.)
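To make the "weights fixed like a ROM" point concrete: when a weight is a known constant, a multiplier by that constant doesn't need a general-purpose multiplier at all. It collapses into shifted copies of the input (just wiring in hardware) fed into adders, one per set bit. This is a generic constant-multiplication technique, shown as a sketch; it is not a claim about how Taalas actually implements their chip.

```python
# Multiplying by a FIXED weight via shift-and-add, the way a hard-wired
# constant multiplier can be built. Generic technique, not Taalas' design.

def shift_add_terms(weight: int) -> list[int]:
    """Bit positions whose shifted input copies a hard-wired multiplier adds."""
    if weight < 0:
        raise ValueError("sketch handles non-negative weights only")
    return [i for i in range(weight.bit_length()) if (weight >> i) & 1]


def fixed_weight_multiply(x: int, weight: int) -> int:
    # Each set bit of the constant is a shift (free wiring) plus one adder input.
    return sum(x << i for i in shift_add_terms(weight))


print(shift_add_terms(5))           # prints [0, 2]  ->  x*5 = (x << 0) + (x << 2)
print(fixed_weight_multiply(7, 5))  # prints 35
```

The cost of each baked-in weight is then roughly its number of set bits, which is part of why low-precision weights matter so much for this approach.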
> CIM where you still need general purpose ALUs all over the place
If that were so, then why do taxonomists classify analogue computing as part of CIM? Ohm's Law does not constitute an "ALU" in the sense you mean.
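For reference, the textbook picture of "multiplication done physically" in analogue CIM is a resistive crossbar: each weight is stored as a conductance, each cell's current is I = G·V (Ohm's law), and the currents on a shared output wire simply add (Kirchhoff's current law), giving a dot product with no ALU involved. A minimal idealized sketch that ignores wire resistance, noise, DAC/ADC conversion, and negative weights (which real designs handle, e.g. with differential pairs):

```python
# Idealized analog crossbar matrix-vector multiply.
# Ignores wire resistance, device noise, ADC/DAC, and negative weights.

def crossbar_currents(conductances_s: list[list[float]],
                      voltages_v: list[float]) -> list[float]:
    """Each cell obeys I = G * V (Ohm's law); currents on each shared
    output wire sum (Kirchhoff's current law), yielding one dot product per row."""
    return [sum(g * v for g, v in zip(row, voltages_v)) for row in conductances_s]


# Two output wires, two inputs: [[1, 2], [0.5, 0]] times [3, 4]
print(crossbar_currents([[1.0, 2.0], [0.5, 0.0]], [3.0, 4.0]))  # prints [11.0, 1.5]
```

The "massive drawbacks for big NNs" show up exactly in what this sketch leaves out: device variation, noise, and conversion overhead all grow in importance as arrays get larger.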
Put simply, I used CIM ("Compute In Memory") for lack of a better term: for "store data where you modify data", for "going beyond von Neumann's separation of data storage and processor".
And I don't get your rant about "analog computing", which has everything to do with NNs (if not, well, prove it): NNs started out that way, and in fact they basically are that. Analogue computing is a great temptation since it would solve the inefficiency of digital NNs. Unfortunately, it has drawbacks that are massive for big NNs. Taalas' approach seems to be the best compromise.
Pretty simply: the von Neumann architecture separates data storage and processing; we are now researching the alternatives, in all their forms, whatever makes NNs efficient; those alternatives basically focus on reducing the distance between stored data and its processing; some simply call stored data "memory"; we already had the concept of "Compute in Memory"; and now some call all of the relevant solutions that allow processing near storage "CIM".
It is not the traditional CIM with the ALUs. Too bad. We can easily tell the difference.
Cerebras etch memory onto the wafer alongside the processing elements, but AFAIK OpenAI are going to be using HBM memory and a conventional chiplet design.
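The reason memory placement matters so much: for single-stream decoding of a dense model, roughly all weights are read once per generated token, so memory bandwidth caps tokens/s no matter how fast the compute is. A rough upper-bound sketch, assuming batch size 1, a dense model, and ignoring KV-cache traffic and caching; the bandwidth figure is hypothetical, not a spec for any particular HBM part.

```python
# Memory-bandwidth ceiling on single-stream decode speed.
# Assumes batch 1, dense model, KV cache ignored; bandwidth is HYPOTHETICAL.

def decode_tok_s_upper_bound(bandwidth_gb_s: float, params: float,
                             bytes_per_param: float) -> float:
    """Tokens/s if every weight must be streamed from memory once per token."""
    bytes_per_token = params * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token


# 8B params at 1 byte each, with a hypothetical 3000 GB/s of bandwidth.
print(decode_tok_s_upper_bound(3000, 8e9, 1))  # prints 375.0
```

Keeping weights on the wafer (or baked into the logic) attacks that denominator directly, while an HBM chiplet design attacks it by raising the numerator.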
Cerebras are addressing very specific use cases, not general purpose LLM serving, and OpenAI does already partner with them.
[1] https://taalas.com/products/
https://chatjimmy.ai
I know it's nitpicking, but when people can just reach in and take services away, like Fable/Mythos, hardware is the only thing worth buying.
9 months to production is completely impossible anyway.
9 months from design to early samples is probably impossible given that TSMC takes 3 months after tape-out to produce them. Then it’s up to the customer to qualify and revise for production. TSMC doesn’t do that.
There’s no AI that makes this happen in 9 months.
Broadcom does things like physical design, providing IP blocks, managing the manufacturing process with TSMC, and packaging and testing. Google and Amazon handle system architecture, performance targets, and requirements, with Broadcom acting as a consultant.
Cerebras stock is down nearly 20% today.
Not only is the approach overlapping, OpenAI is also Cerebras's only major customer.
If this photo is real, I wonder what could be revealed about their approach by analyzing the architecture we can see.
It's more that the "wafer as a big chip" idea (more formally, "WSE", Wafer Scale Engine) is now a reality (see Cerebras).
But in this case, the wafer will be split into a few dozen chunks.
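"A few dozen chunks" is about what the common dies-per-wafer approximation gives for large dies: gross wafer area divided by die area, minus an edge-loss term for partial dies at the rim. The sketch uses a 300 mm wafer and a hypothetical 800 mm² die (close to the roughly 858 mm² single-reticle limit); it ignores scribe lines and yield, and it is a standard estimate, not a claim about this specific photo.

```python
import math

# Standard dies-per-wafer approximation:
#   DPW ~= pi * (d/2)^2 / S  -  pi * d / sqrt(2 * S)
# Ignores scribe lines and yield. The 800 mm^2 die size is HYPOTHETICAL.

def dies_per_wafer(wafer_diameter_mm: float, die_area_mm2: float) -> int:
    radius = wafer_diameter_mm / 2
    gross = math.pi * radius * radius / die_area_mm2            # area ratio
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2)  # partial dies at the rim
    return max(0, math.floor(gross - edge_loss))


print(dies_per_wafer(300, 800))  # prints 64
```

Before yield losses, that is a few dozen candidate dies, which is the contrast with a wafer-scale engine that keeps the whole wafer as one part.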