Ask HN: MacBook vs. Dedicated GPU for LLM (news.ycombinator.com)
For those who are running LLMs on a MacBook: I want to understand how a MacBook differs from a dedicated GPU when running those models, and how to tell which models a given MacBook is capable of running.
Dedicated GPUs have less video RAM, so they can run smaller, less smart models quickly.
With the model using MLX the speed increase is night and day. Even non-MLX is good.
You also don't have the transfer costs related to moving CPU data into the GPU.
You gotta really want it right now.
It's still early!
I'm currently running those models using an RTX 5070 12GiB + RTX 5060 16GiB + RTX 3060 12GiB with a 96k context size with MTP/speculative decoding and I'm quite happy (the 5070 is about 4x faster than the 3060; the 5060 is in between them, so about 2x faster than a 3060).
- Layer splitting used. Tensor splitting is ~1.2x faster in TG, but power spikes from 150W to 380W.
E.g. when doing text transcription/OCR from images (Qwen 3.6 27B Q4_K_M by Bartowski) with a context size of ~50k I get a pp of ~460 tokens per second and a generation ranging from 35 to 45 tokens per second (using "--spec-type draft-mtp --spec-draft-n-max 2" currently with llama.cpp b6548).
On the other hand, when handling code (Qwen 3.6 27B Q5_K_M by Bartowski) with a context size of 128k I get a pp ranging from 500 to 1500 tokens per second and a generation between 25 and 40 tokens per second (using in this case as well "--spec-type draft-mtp --spec-draft-n-max 2" currently with llama.cpp b6548).
In theory, with "--split-mode layer", I think it's the slowest card that drives the overall performance anyway (I do see in "nvtop" that usually the 5070 is ~25% active, the 5060 ~50% and the 3060 ~75%).
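A minimal sketch of that intuition, assuming a token passes through the cards strictly one after another, so each card sits idle while the others work. The relative speeds (4x, 2x, 1x) come from the comment above; the layer shares (proportional to 12/16/12 GiB of VRAM) and the function name are assumptions for illustration, and the model ignores transfer time and speculative decoding.

```python
def layer_split_busy_fractions(layer_share, relative_speed):
    """Return (total_time, busy fraction per card) for sequential layer splitting.

    layer_share[i]:    fraction of the model's layers placed on card i
    relative_speed[i]: card i's speed relative to the slowest card
    """
    if len(layer_share) != len(relative_speed):
        raise ValueError("need one speed per card")
    # Time a card spends on one token is its share of the work divided by its speed.
    times = [share / speed for share, speed in zip(layer_share, relative_speed)]
    total = sum(times)
    return total, [t / total for t in times]


# Speeds from the comment: 5070 ~4x, 5060 ~2x, 3060 = 1x.
# Layer shares are an assumption: proportional to 12/16/12 GiB of VRAM.
cards = ["RTX 5070", "RTX 5060", "RTX 3060"]
total, busy = layer_split_busy_fractions([0.3, 0.4, 0.3], [4.0, 2.0, 1.0])
for name, frac in zip(cards, busy):
    print(f"{name}: busy {frac:.0%} of each token's time")
```

Under these assumptions the slowest card accounts for the largest share of each token's time, which matches the ordering of the nvtop readings even though the exact percentages differ.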
Dual 3090s are terrible airpods
Both machines will be stuck in the 30-50GB model range, but the 3090 would have faster token prefill and faster decode speeds (614GB/s on M5 Max vs 936GB/s on 1x 3090).
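A rough way to see why those bandwidth figures matter: at batch size 1, generating each token means reading roughly all the weights once, so memory bandwidth divided by model size gives a ceiling on decode speed. This is a minimal sketch, assuming a dense model and ignoring KV cache reads and other overhead. The function name and the 40 GB example size are illustrative, not from the thread.

```python
def decode_ceiling_tok_per_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on single-stream decode speed for a dense model.

    Each generated token requires streaming the full weights from memory once,
    so tokens/s cannot exceed bandwidth / model size.
    """
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


MODEL_GB = 40.0  # assumed size, inside the 30-50GB range mentioned above
for name, bandwidth in [("M5 Max", 614.0), ("RTX 3090", 936.0)]:
    ceiling = decode_ceiling_tok_per_s(bandwidth, MODEL_GB)
    print(f"{name}: at most ~{ceiling:.1f} tok/s")
```

Real numbers land below this ceiling. Prefill is a separate story: it is mostly compute-bound, so bandwidth alone does not predict it.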
Open weight models are getting good. With GLM 5.2 now chasing Opus, I'm very excited to see a smaller model's distillation.
Plus, the OLED MacBook Pro should be released by then.
Competition and innovation will hopefully make the bubble pop, and we'll get reasonably priced local hardware to run very intelligent models. Something like Taalas with GLM 5.2 would be pretty cool. Or Apple printing the latest model onto hardware: it would give a new reason to buy a new Mac every year (a new AI model with every new version).
Works out to about 1.1 exaflops of FP4. Networking is 800 Gbps.
120kW per rack.
I bought a decent bike a few months ago for a few hundred bucks. I used it for commuting and for when I went to the park.
Of course, it depends on what kind of bike you meant and what you consider decent. Hopefully that illustrates the quality of that comparison.
The MacBook Neo is also a good enough laptop.
In my opinion, a decent bike is something that wouldn’t limit you in races. No need to spend enormous amounts of money for marginal gains, but something that would do the job well. That’s an order of magnitude more expensive.
I guess it could start around 3000€
On that note though, most car salesmen are somewhat stupid, although the best ones are at least normal. (I was a Cal mechE grad and a top 0.2% nationwide car salesman.) They also slowly brainwash themselves into believing most of what they say.
I can’t help but draw parallels here.
We don't have that anymore. Specs have more or less stabilised and what you're buying now could easily last you years.
If you don’t already have a MacBook, then there’s a bit of a sweet spot for the AI experimenter right now, which is to buy a second-hand 16” MBP with an M1 Max chip and 64GB of shared RAM. Because these are about 5 years old now, they have depreciated to the point where they can be had for around £1100 / €1300 / $1500, and they make a phenomenal platform for learning because the 64GB of shared memory means you can host models up to about 48GB in size, and then task them to do interesting things with coding without ever having to worry about token burn.
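To sanity-check whether a given model fits, a back-of-the-envelope estimate is parameters times bits per weight. This is a minimal sketch under stated assumptions: the 0.75 usable fraction simply mirrors the "about 48GB of 64GB" figure above rather than any documented limit, the function names are made up for illustration, and real quantised files carry some overhead, with the KV cache for long contexts needing room on top.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def fits(params_billion, bits_per_weight, total_ram_gb, usable_fraction=0.75):
    """True if the weights fit in the share of unified memory left for the model.

    usable_fraction=0.75 is an assumption taken from the 48GB-of-64GB rule of
    thumb in the comment above; context (KV cache) is not counted here.
    """
    budget_gb = total_ram_gb * usable_fraction
    return weights_gb(params_billion, bits_per_weight) <= budget_gb


# Example sizes only: (billions of parameters, bits per weight).
for params, bits in [(27, 8), (35, 8), (122, 4)]:
    size = weights_gb(params, bits)
    verdict = "fits" if fits(params, bits, 64) else "does not fit"
    print(f"{params}B at {bits}-bit: ~{size:.0f} GB of weights, {verdict} in 64 GB")
```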
The downside is that they’re slow, and prone to having to be nudged to keep them on track, but that’s part of the fun too. The “latency” is atrocious, granted: you ask something and the machine thinks for a few minutes before saying anything, which is a different experience to using Claude. But… it does work. You can think of yourself more like a manager with a junior member of staff: set the machine running and leave it to do its thing for a couple of hours, and it can do actually useful work. This approach will likely be shouted down by some commenters here who treat Claude like some kind of expensive and quick-fire dopamine pump. You can also use a Mac like this for running diffusion models for image generation and suchlike in ComfyUI, even though, again, results will be slow. Spending more money on a more recent MBP with as much RAM as you can afford will deliver the same results faster, at a higher price.
To get the same kind of size of model you’d have to combine a couple of Nvidia 3090 24GB cards in a decent workstation with the PCIe capacity to handle them, or hack some kind of solution to hang GPUs off the back of a motherboard on ribbon cables with the GPUs running on their own PSU, which is what I’m building next… The difference is that those cards have 24GB of VRAM and cost about $1000 each second-hand, but will operate much, much faster than the M1 Max MBP, or even the most recent M5, because they have so much more bandwidth (because they’re burning 350 watts on GPU compute rather than the 140 watts total that a super-efficient MBP has for the CPU/GPU/screen/everything).
So say you had $6000 to spend today: you could buy a second-hand workstation and craft a solution with external GPUs which would completely smoke any Mac in existence, even though Macs have the edge in the size of model you can run (slowly) due to their shared memory. External GPUs and access to the Nvidia frameworks and general CUDA ecosystem win out on the performance front though. A real sweet spot is to buy an M1 Max MBP and have that as your front end to a Linux workstation full of GPUs.
But any Apple silicon MBP is a totally competent gateway drug to local agentic computing.
Google Gemini could give you an in-depth and useful discussion about this exact question.
Also beware local models tend to be slow. Also, the main optimization trick for LLM inference is running large batches (concurrent users) and you won't take advantage of this (batch=1).
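The batching point can be sketched with a simple two-term model: one decode step streams the weights once no matter how many sequences share it, while compute grows with the batch. The timings below are hypothetical placeholders, not measurements, and the function is an illustration rather than how any particular inference server works.

```python
def decode_throughput(batch, weight_read_s, compute_per_token_s):
    """Total tokens/s for one decode step that serves `batch` sequences.

    The step reads the weights once (weight_read_s) regardless of batch size,
    while compute scales with the batch. The step takes whichever is longer.
    """
    if batch < 1:
        raise ValueError("batch must be >= 1")
    step_time_s = max(weight_read_s, batch * compute_per_token_s)
    return batch / step_time_s


# Hypothetical numbers: 25 ms to stream the weights, 0.5 ms of compute per token.
for batch in (1, 8, 64):
    rate = decode_throughput(batch, weight_read_s=0.025, compute_per_token_s=0.0005)
    print(f"batch={batch}: ~{rate:.0f} tok/s in total")
```

With a single local user the hardware stays in the batch=1 regime, where most of the step is spent waiting on memory rather than computing.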
IMHO using Macs for LLMs is a fad. An expensive fad.
But it really depends on what it is you want to do. An MLX-optimised recent model will run fine and at decent speeds. Granite4.1 (a few months old), for example, takes up 2GB of memory, is insanely fast, and gives good results compared with much bigger models like gpt-oss-120b (a year old). It even runs on an M1 Mac at good speeds.
The models are only getting better.
Together with pi mono I wouldn't want to go back to Claude & Co. Speed, quality of the answers, short answer times at any time of day - once you have eaten from the fruit your definition of SOTA will change...
For reference, I have been doing software development for 30 years; I am not vibe coding the umpteenth todo list.
and for llms more RAM means access to better models
macbooks might not be as fast as a GPU with a similar amount of RAM, but they are more affordable and well integrated
last but not least: compared to a PC+GPU the macbook is either silent (air) or at least way less annoying when you care about noise
for ultimate flexibility and low noise: GPU in the cloud for when you need/want it is probably also most cost effective if you don't have workloads that need to run 24/7
The first question to ask is whether your use case requires handling personal or sensitive data.
If you're using the LLM for OpenClaw or you want to handle sensitive or medical data, a local model generally is necessary.
If it's not so sensitive, cloud providers with some sort of user-agreement guarantee on not using your data for training would be the next bet. I personally generally use Gemini or Sonnet as my cloud backup. As I understand it, OpenAI, Cloudflare (which bought Replicate) and Qwen also seem to provide such guarantees and make SOTA models available. Others like DeepSeek seem to have an opt-out setting. OpenRouter & co I avoid, except for benchmarking models with public or dummy data, as there is absolutely zero guarantee or ability to enforce terms on the providers your data might be sent to.
Gemini and Anthropic (and OpenAI) tend to be expensive: it's very easy to run up bills of 15 dollars a day or so, which puts you solidly in one-year-payback-on-a-Mac-Mini territory. At this point I decided to buy. Gemini Flash Lite 3.1 is however surprisingly good value.
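The payback arithmetic is simple enough to write down. This is a minimal sketch: the 15 dollars a day comes from the comment, while the hardware price is a placeholder chosen to show what a one-year payback implies (15 x 365 = 5475 dollars), and the optional power cost parameter is an added assumption.

```python
def break_even_days(hardware_cost, daily_api_spend, daily_power_cost=0.0):
    """Days until buying hardware costs less than continuing to pay for API usage."""
    daily_saving = daily_api_spend - daily_power_cost
    if daily_saving <= 0:
        raise ValueError("no saving per day, so the hardware never pays for itself")
    return hardware_cost / daily_saving


# 15 dollars a day is from the comment; 5475 dollars is the hardware budget
# that a one-year payback at that rate corresponds to, not a quoted price.
days = break_even_days(hardware_cost=5475.0, daily_api_spend=15.0)
print(f"break-even after {days:.0f} days")
```

Cheaper hardware or heavier API use shortens the payback proportionally. The estimate assumes the local model actually replaces the cloud usage.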
The next question is Mac or CUDA. If your expected use is serving LLMs for inference, the latest large-memory Macs give pretty good inference speed (better than DGX Spark) at a reasonable cost. I think they offer much better value than CUDA if the only use case is LLM inference & harnesses.
If you plan to also fine-tune models, experiment with other types of ML on GPUs, do computer vision stuff etc., the development tooling on CUDA is far in advance of all other platforms.
Lastly, if you choose CUDA, the question is GB10 family (DGX Spark, clusterable, with 128GB RAM et al.) or dedicated GPU workstations. What I found is that practically any serious model weighs in at requiring at least 96GB VRAM: Antirez's 2-bit quant of Deepseek 4 flash (my current daily driver) [2], the Qwen 3.5 122B A10B 4-bit quant, the Qwen 3.6 27B Dense and 35B A3B 8-bit quants etc. So you're well out of consumer GPU territory and into 1 or more RTX 6000 Pros or data-center-grade devices. Yes, you can try to hack away with multiple consumer cards or SSD streaming, but it's very fiddly and you probably have better things to do with your life.
The GB10 system, which I ultimately went with, is certainly much cheaper and can be clustered through the Special NVLink cable to get 256, 384 or 512 GB setups, but comes with severely constrained bandwidth. The Pro GPUs blow these out of the water on performance but are expensive.
Lastly, renting a cloud GPU machine doesn't really make sense except to run already-debugged fine-tuning workloads. You'll probably spend at least 4 dollars an hour for sufficient capacity which, if it's for personal use, will mostly sit idle.
1. https://srinathh.medium.com/mid-size-local-models-are-now-co...
2. https://github.com/antirez/ds4
So you need to decide what's important to you if you want to work locally: large/many models, or speed. It's the pick-two problem: speed, size, cost. Pick two, or go with cloud and accept that you also spend $$ on each development iteration.
I was hoping for the M5 Ultra by now, but it looks like that's not coming until much later this year, and at a much higher price now.