Ask HN: MacBook vs. Dedicated GPU for LLM (news.ycombinator.com)

33 points|by mzubairtahir|1d ago|65 comments|Read full story on news.ycombinator.com

For those who are using llms on macbook, Want to understand how macbook is different than dedicated GPU in running those models? and how to know how much a macbook is capable of running a model?

Comments (65)

65 shown

1. JSR_FDED|1d ago|context

MacBooks with their unified memory behave like a slow GPU with enormous amount of video RAM. So you can run large smart models slowly.
Dedicated GPUs have less video RAM so can run smaller less smart models quickly.
2. exabrial|1d ago|context

Do Mac Pros provide more headroom? noob here, noob questions
3. rho138|1d ago|context

Idk why you’re being downvoted for asking a question. Pending specs they _could_ provide more headroom for a larger model but they would still be limited by the CPU and it’s associated bus speeds.
4. JSR_FDED|1d ago|context

In what sense? Headroom for what?
5. zihotki|1d ago|context

Not that much - a bit better but still negligible. Prefill is highly parallelizable and benefits from many cores. NVidia GPU's simply have several times more cores than even Pros.
6. mzubairtahir|1d ago|context

how much memory is actually useable by gpu in macbook? as it is shared?
7. pylotlight|1d ago|context

roughly ~50–56GB, although this is somewhat configurable with iogpu.wired_limit_mb. By default, macOS reserves ~25% of memory for the system.
8. visarga|1d ago|context

Macbook M5 64GB - can run gemma-4-26b-a4b-it-4bit and Qwen3.6-35B-A3B-4bit at about 1500 tps prefix and 45 tps decode on contexts up to 100K tokens using MLX. It's faster than Claude. I was really surprised, chat quality is also similar to Claude for gemma4. Agentic works but does not compare to cloud models, you can still make agents where top level is code.
9. mzubairtahir|1d ago|context

sorry but asking again: how much memory is actually useable by gpu in macbook? as it is shared(os and apps also have to use same memory)? and it is different than dedicated gpu memory?
10. gizajob|1d ago|context

It’s completely shared so the OS and everything else takes up maybe 8GB of the RAM. On a 64GB machine you can run models about 45GB in size and still have space for those models run other tasks which themselves might need ram. To a user, the GPU appears to just use the RAM as much as it needs same as any other process running on the system. You can see what space your LLMs are taking up in Activity Monitor (or htop) and how much GPU capacity they’re using (all of it)
11. j45|1d ago|context

You can adjust the percentage available both on the MacOS side and how much the model uses.
12. cco|1d ago|context

Rule of thumb is about 70-80%.
13. EagnaIonat|1d ago|context

> MacBooks with their unified memory behave like a slow GPU with enormous amount of video RAM. So you can run large smart models slowly.
With the model using MLX the speed increase is night and day. Even non-MLX is good.
You also don't have the transfer costs related to moving CPU data into the GPU.
14. epsteingpt|1d ago|context

Both are going to be super super slow and low payback.
You gotta really want it right now.
It's still early!
15. jpgvm|1d ago|context

If you want a massive MacBook anyway then it's great. They are decent for local LLMs, awesome for local image models and it's a MacBook so AppleCare+ has your back. IMO it's a no brainer if you wanted a MacBook anyway but it's a poor choice if your reason to buy it is to run LLMs.
16. mzubairtahir|1d ago|context

are you saying because of speed or it just cant run them?
17. zepearl|1d ago|context

I agree. To run an acceptable model (e.g. Qwen/Qwen3.6-27B or google/gemma-4-31B) with a good quantization (minimum Q5) with a good context size (min 64k) you could buy 2 or even 3 GTX 5060 16GiB VRAM for ~550$ each. Fyi the much faster MoE models were useless for my usecases - e.g not able to correctly identify me/I/you, endless thinking loops, etc.
I'm currently running those models using an RTX 5070 12GiB + RTX 5060 16GiB + RTX 3060 12GiB with a 96k context size with MTP/speculative decoding and I'm quite happy (the 5070 is about 4x faster than the 3060, the 5060 is inbetween them so about 2x faster than a 3060).
18. nubg|1d ago|context

how many tokens per second do you get?
19. cybertim|1d ago|context

I bought two RTX3080s with 20GB during my holiday in china (set me back 700euros) I'm getting 800-1000 input tps and 60-100tps output with Qwen 3.6 27b Q8 (MTP, P2P, 200k context) this feels like opus4.5 level while coding (pi harness). Also easy to just host your own openai compatible api from home this way and still use your MacBook as dev station.
20. usagisushi|23h ago|context
Not the OP, but their setup must be faster than my 4060 16GB + 3060 12GB setup. Here are my numbers (typical values, N=1):
```
    Model                         pp (t/s)    tg (t/s)
    Qwen 3.6 27B            900           29
    Qwen 3.6 35B-A3B   2100          85
    Gemma 4 31B            750           28
    Gemma 4 26B-A4B   2500         90
```
- All models: UD-Q4 w/ MTP. Context size: ~100k (MoE) / ~70k (Dense).
- Layer splitting used. Tensor splitting is ~1.2x faster in TG, but power spikes from 150W to 380W.
21. zepearl|9m ago|context

In my case it somehow depends a lot on the task being performed... .
E.g. when doing text transcription/OCR from images (Qwen 3.6 27B Q4_K_M by Bartowski) with a context size of ~50k I get a pp of ~460 tokens per second and a generation ranging from 35 to 45 tokens per second (using "--spec-type draft-mtp --spec-draft-n-max 2" currently with llama.cpp b6548).
On the other hand when handling code (Qwen 3.6 27B Q5_K_M by Bartowski) with a context size of 128k I get a pp ranging between 500 to 1500 tokens per second and a generation between 25 and 40 tokens per second (using in this case as well "--spec-type draft-mtp --spec-draft-n-max 2" currently with llama.cpp b6548).
Anyway in theory with "--split-mode layer" I think that it's anyway the slowest card that drives the overall performance (I do see in "nvtop" that usually the 5070 is ~25% active, the 5060 ~50% and the 3060 ~75%).
22. eklavya|1d ago|context

How are you running these together, splitting the model somehow or did you mean different models on any one card at a time?
23. zepearl|1h ago|context

I always run just one model at once, I switch between them depending on what I do (e.g. Qwen3.6 27B Q5_K_M by Bartowski when programming with "OpenHands" and when doing OCR text transcription and transformation, Gemma4 31B-it Q5_K_M by Bartowski when chatting in Open WebUI doing general tasks).
24. brcmthrowaway|1d ago|context

Dual 3090 >>> Any Apple product.
25. gnabgib|1d ago|context

> Dual 3090 >>> Any Apple product.
Dual 3090s are terrible airpods
26. tim-tday|1d ago|context

I snorted
27. kylec|1d ago|context

Doesn't the 3090 cap out at 24GB VRAM? That's not a lot to run a local model
28. mzubairtahir|1d ago|context

but still it can run handsome models
29. EagnaIonat|1d ago|context

Latest MBP goes up to 128GB of memory.
30. bigyabai|1d ago|context

The M5 Max's GPU is slower than a single RTX 3090, though.
Both machines will be stuck in the 30-50GB model range, but the 3090 would have faster token prefill and faster decode speeds (614GB/s on M5 Max vs 936GB/s on 1x 3090).
31. brcmthrowaway|1d ago|context

https://old.reddit.com/r/LocalLLM/comments/1ug428c/4x3090_19...
32. nichch|1d ago|context

My opinion is that you should wait for 6-12 months before making a purchase either way.
Open weight models are getting good. With GLM 5.2 now chasing Opus, I'm very excited to see a smaller model's distillation.
Plus, the OLED MacBook Pro should be released by then.
33. Frannky|1d ago|context

This is my opinion too. Even if you buy hardware like a cluster of 8xGB10s or 4 A100s, they'll still be slow and a little dumber than what you're used to. We need to wait a little for better hardware. Lots of companies are pushing the frontier, so hopefully it'll come very soon.
Competition and innovation will hopefully make the bubble pop, and we'll get reasonably priced local hardware to run very intelligent models. Something like Talaas with GLM 5.2 would be pretty cool. Or Apple printing the latest model onto hardware—it would give a new reason to buy a new Mac every year (a new ai model with every new version).
34. gizajob|1d ago|context

The hardware is here today for people prepared to tolerate mild amounts of latency. It’s easy to forget that computing tasks used to often take major amounts of time - rendering an audio file, rendering a video, transcoding – all kinds of tasks took minutes or even hours of the computer spinning its fans on maximum just to deliver the result. AI and agentic AI and diffusion is the next round of that - trading a small bit of your waiting time for phenomenal power. The datacentre builders trying to get you hooked on instant responses on the LLM platforms have made you think that a “good” AI responds instantly and completely interactively - they can still be brilliant with a bit of delay. And having a competent agent doing things on my local machine, it doesn’t really matter if it takes ten minutes or an hour or six hours to complete a task while I’m out doing other things.
35. Frannky|1d ago|context

Hmm, I have access to A100s and a GB10, but if I use the models hosted there to code, I waste a lot of time waiting for answers and correcting errors. The amount of work I get done thanks to the quality and speed of frontier hosted models let me be insanely productive and have a lot of free time. I could use the slow local setup, but at what price?
36. gizajob|1d ago|context

Well if all that was taken away from you and you had to go to the bank to ask for the money to rebuild so you could become as productive as you are now, what would that cost and would the bank loan you the money?
37. FireBeyond|1d ago|context

The racks we're deploying are effectively GB300 NVL72s: 72 Blackwell Ultra GPUs 36 Grace CPUs, 20.7TB of unified HBM3e.
Works out to about 1.1exaflops of fp4. Networking is 800gbps.
120kW per rack.
38. gizajob|1d ago|context

That’s a majorly impressive computer. What’s the price of that per rack? Deploying for what?
39. FireBeyond|1d ago|context

$3-4M per rack. A variety of workloads...
40. speedgoose|1d ago|context

The OLED touchscreen MacBook is rumoured to be called MacBook Ultra now, and it has been delayed quite a few times. I will probably cost the same than a decent bike.
41. gizajob|1d ago|context

MacBook Ultra Expensive (when equipped with 256 or 512 Gb of ram)
42. cassianoleal|1d ago|context

> a decent bike
I bought a decent bike a few months ago for a few hundred bucks. I used it for commuting and for when I went to the park.
Of course, it depends on what kind of bike you meant and what you consider decent. Hopefully that illustrates the quality of that comparison.
43. speedgoose|1d ago|context

My commuting bike is okay and in this price range too.
The MacBook Neo is also a good enough laptop.
In my opinion, a decent bike is something that wouldn’t limit you in races. No need to spend enormous amounts of money for marginal gains, but something that would do the job well. That’s an order of magnitude more expensive.
44. cassianoleal|1d ago|context

So the MacBook Neo costs the same as a decent bike? What makes you think the Ultra will be in that range as well?
45. speedgoose|1d ago|context

I wouldn’t race on my commuter bike or work professionally on a MacBook Neo.
I guess it could start around 3000€
46. cassianoleal|1d ago|context

I think you're still making my point that comparing to "a decent bike" means nothing. It depends on what the bike will be used for, and what "decent" might mean to the person interpreting it.
47. cylentwolf|1d ago|context

I asked a few of my friends that are ML engineers this question and all of them said to run the LLMs in the cloud with their infrastructure because it was going to be way faster. If you just want to tinker around I would look at @JSR_FDD's comment.
48. derwiki|1d ago|context

There are more factors than speed though, like: privacy
49. throwawaytea|1d ago|context

Next you're going to tell me that car salesman recommend just leasing new cars, doctors recommend just following the standard of care, construction workers just recommend subbing it out, and your tech friends say just use AWS.
50. ramraj07|1d ago|context

If your car salesman friend gives you stupid advice it either means he is stupid or he is not your friend.
51. throwawaytea|1d ago|context

I sold cars for a decade and wouldn't even take the insanely cheap priced company demo car, just driving my 10 year old car instead.
On that note though, most car salesman are somewhat stupid, although the best ones are atleast normal. (I was a cal mechE grad and top 0.2% nationwide car salesman). They also slowly brainwash themselves into believing most of what they say.
52. al_borland|17h ago|context

My dad avoided buying a home PC for a long time, because he felt the systems he used at work in the 80s were so much more powerful than anything for the home at the time, that he didn’t see a point.
I can’t help but draw parallels here.
53. SenHeng|13h ago|context

And he wouldn't be wrong for that era. Prices also dropped pretty quickly while specs would almost double within a span of months. Whatever he could've bought in Jan would be easily outclassed by anything 12 months later.
We don't have that anymore. Specs have more or less stabilised and what you're buying now could easily last you years.
54. gizajob|1d ago|context

Local LLMs running in LM Studio on a MacBook Pro work great, if you’re prepared to wait for the answers because using an LLM locally is much much slower than having the instant results appear when using an online LLM like ChatGPT or Claude. You can also run OpenClaw on the MacBook and have that act as the front end for the LLM, to get full interactivity and have it install command line tools on your Mac to perform whatever tasks you’ve set it.
If you don’t already have a MacBook, then there’s a bit of a sweet-spot for the AI experimenter right now, which is to buy a second-hand 16” MBP with an M1 Max chip and 64GB of shared ram. Because these are about 5 years old now, they have depreciated to the point where they can be had for around £1100 / €1300 / $1500 and make a phenomenal platform for learning because the 64Gb of shared memory means you can host models up to about 48GB in size, and then task them to do interesting things with coding without ever having to worry about token burn.
The downside is that they’re slow, and prone to having to be nudged to keep them on track, but that’s part of the fun too. The “latency” is atrocious granted - you ask something and the machine thinks for a few minutes before saying anything which is a different experience to using Claude. But… it does work. You can think of yourself more like a manager with a junior member of staff and set the machine running and leave it to do its thing for a couple of hours which can be actually useful work, but this approach will likely be shouted down by some commenters here who treat Claude like some kind of expensive and quick-fire dopamine pump. Can also use a Mac like this for running diffusion models for image generation and suchlike in ComfyUI, even though, again, results will be slow. Spending more money on a more recent MBP with as much RAM as you can afford will deliver the same results more expensively in a quicker and quicker time.
To get the same kind of size of model you’d have to combine a couple of Nvidia 3090 24GB cards in a decent workstation with the PCI capacity to handle them, or hack some kind of solution to hang GPUs off the back of a motherboard on ribbon cables with the GPUs running on their own PSU, which is what I’m building next… the difference is those cards have 24GB of vram and cost about $1000 each second-hand, but will operate much much faster than the M1 Max MBP, or even the most recent M5 because they have so much more bandwidth (because they’re burning 350 watts on GPU compute rather than 140 watts total which is what a super efficient MBP has for the cpu/gpu/screen/everything).
So say you had $6000 to spend today, you could buy a second hand workstation and craft a solution with external GPUs which would completely smoke any Mac in existence, even though macs have the edge in the size of model you’d can run (slowly) due to their shared memory. External GPUs and access to the Nvidia frameworks and general CUDA ecosystem wins out on the performance front though. A real sweet spot is to buy an M1 Max MBP and have that as your front end to a Linux workstation full of GPUs.
But any apple silicon MBP is a totally competent gateway drug to local agentic computing.
Google Gemini could give you an in-depth and useful discussion about this exact question.
55. browningstreet|1d ago|context

It’s kind of amazing how steadily this question is asked in every forum where it can be asked. Kind of amazing that the answers previously given can’t reach the next person who’s going to ask it.
56. derwiki|1d ago|context

Isn’t it all rapidly evolving? I guess beyond that it’s a signal of growing interest in local LLMs
57. throwawaytea|1d ago|context

What's crazy is these people want to invest decent money to use AI/LLMs, but somehow don't think to just ask an AI chat bot which would easily answer it for them and any follow-up questions.
58. g-technology|1d ago|context

Around February or march I started looking into hardware options to help me start learning about training models and working with them. My budget was limited and an apple refurbished 32 gb Mac mini was far and away the best option for my budget. I wish it was faster but I can let it run 24/7 with no noise and minimal power draw. I just arrange long running tasks for when am asleep or at work. Then as a huge plus I have an awesome daily driver machine for whatever else I want to do
59. alecco|1d ago|context

MacBooks have lots of RAM and no PCIe bottleneck, but ~10x fewer FLOP/s than a much cheaper Nvidia GPU. Test LLMs on rented GPUs on vast.ai or other similar services (beware storage etc). Don't spend thousands before trying and knowing exactly what you get.
Also beware local models tend to be slow. Also, the main optimization trick for LLM inference is running large batches (concurrent users) and you won't take advantage of this (batch=1).
IMHO using Macs for LLMs is a fad. An expensive fad.
60. EagnaIonat|1d ago|context

With a dedicated GPU, the lag is in transferring data to the GPU. You don't have that lag in ARM.
But it really depends on what it is you want to do. An MLX optimised recent model will run fine and at decent speeds. Granite4.1 (a few months old) for example takes up 2GB of memory, insanely fast and results are good vs much bigger models like gpt-oss-120b (a year old). It even runs on an M1 mac with good speeds.
The models are only getting better.
61. dust42|1d ago|context

With a M5 16c 48GB and Qwen 3.6 35B Q4 I get up to 1900 PP/s and 80 TG/s. With an Nvidia 5090 I get 7800 PP/s and 280 TG/s.
Together with pi mono I wouldn't want to go back to Claude & Co. Speed, quality of the answers, short answer times at any time of day - once you have eaten from the fruit your definition of SOTA will change...
For reference, I do software development since 30 years, I am not vibe coding the umpteenth todo list.
62. tosh|1d ago|context

macbooks (macs in general) are a good package for llms because they come with so much RAM
and for llms more RAM means access to better models
macbooks might not be as fast as a GPU with similar amount of RAM but more affordable and well integrated
last but not least: compared to a PC+GPU the macbook is either silent (air) or at least way less annoying when you care about noise
for ultimate flexibility and low noise: GPU in the cloud for when you need/want it is probably also most cost effective if you don't have workloads that need to run 24/7
63. sfifs|1d ago|context

So a lot depends on your specific use case but mid-sized open weight models are pretty actually good now, so this is realistic [1]
The first question to ask is does your use case require handling personal or sensitive data.
If you're using the LLM for OpenClaw or you want to handle sensitive or medical data, a local model generally is necessary.
if it's not so sensitive - Cloud providers with some sort of user agreement guarantee on not using your data for training would be the next bet. I personally generally use Gemini or Sonnet as my cloud backup. As I understand, OpenAI, Cloudflare (which bought replicate) and Qwen also seem to provide such guarantees and make SOTA models available. Others like DeepSeek seem to have an opt-out setting. Open router & co I avoid except for benchmarking models with public or dummy data as there is absolutely zero guarantee or ability to enforce terms on providers where your data might be sent.
Gemini and Anthropic (and OpenAI) tend to be expensive - it's very easy to run up 15 dollars a day or so bills which puts you solidly in 1 year pay out on Mac Mini territory - at this point I decided to buy. Gemini Flash Lite 3.1 is however surprisingly good value.
the next question is Mac or CUDA. If your expected use is serving LLM models for inferences, the latest large memory Macs give pretty good inference speed (better than DGX Spark) at a reasonable cost - I think there offer much better value than CUDA if the only use case is LLM inference & harnesses.
if you plan to also fine tune models, experiment with other types of ML on GPUs, do computer vision stuff etc. the development tooling on CUDA is far in advance of all other platforms.
Lastly if you choose CUDA, the question is GB10 family (DGX Spark - cluster able with 128Gb RAM et all) or dedicated GPUs workstations. What I found is practically any serious models weighs in requiring at least 96GB VRAM - Antirez's 2 bit quant of Deepseek 4 flash (my current daily driver) [2] , the Qwen 3.5 122B A10B 4-bit quant, the Qwen 3.6 27B Dense and 35B A3B 8 but quants etc. So you're well out of the consumer GPU territory into 1 or more RTX 6000 Pros or Data center grade devices. Yes you can try to hack away with multiple consumer cards or SSD streaming but it's very fiddly and you probably have better things to do with your life.
The GB10 system - which I ultimately went with - is certainly much cheaper and can be clustered through the Special NVLink cable to get 256, 384 or 512 GB setups but comes with severely constrained bandwidth. The Pro GPUs blast these out of water on performance but are expensive.
Lastly, renting a cloud GPU machine doesn't really make sense except to run already debugged fine tuning workloads. You'll probably spend at least 4 dollar an hour for sufficient capacity which if it's personal use, will mostly sit idle.
1. https://srinathh.medium.com/mid-size-local-models-are-now-co...
2. https://github.com/antirez/ds4
64. zihotki|1d ago|context

From personal experience - it works, but you won't get a comfortable time to first token (latency is high). The reason is that prefill on Macs is bad. You need to have a lot more cores to do it quick. It's close to instant for small models on NVidia GPU's but on Macs it takes a few seconds to get the answer for a simple prompt. And the time grows proportionally with your context size.
65. hkchad|23h ago|context

It depends. I have a M5 128 so i can play around with large models and even keep several of them loaded at once and use something like llama swap to access them all via bifrost or litellm. You won't do this without some serious local GPU's with big memory. The downside is the speed, it's not fast, but fast enough to tinker and develop with worrying about ongoing cloud cost. When done developing and you need to really scale up this is when you can swap to cloud computing and get the job done faster. My $5k macbook can do more than a $50k nvidia/intel/amd setup, just not as fast.
So you need to decide whats important to you if you want to work locally, large/many models or speed. It's the pick 2 problem speed, size, cost, pick 2 or go with cloud and accept your development time is also spent $$ on each iteration.
I was hoping for the M5 ultra by now, but looks like that's not coming until much later this year for a much higher price now.