There's still no MCP support in the Gemini app, which would be very useful for getting various pieces of info just by chatting. For example, I recently wanted to book an Airbnb and filter by specific criteria, including analysis of the house images. Gemini couldn't do it, so I had to do it in Codex.
This is why I don't always use the official Gemini Web app. Lately I've found that it's more useful to utilize a CLI. I'm looking forward to the day they add MCP in the web.
Yeah, it seems like this is the biggest missing feature from the Gemini ecosystem.
If I can't connect MCP, there's really no selling point for me to use Gemini from my watch, car, smart speaker, etc. If I'm already bound to using my own front end, then I'm only evaluating Gemini as a model/API, at which point it has many competitors that may be cheaper or better fit for the task.
I'm fairly convinced Claude's strongest point is the app. AI users aren't anywhere near as mature or smart as youtube/hn would have folks believe. The claude app is amazing for bridging that gap.
I think native apps are critical infrastructure in AI development particularly around agents. The truth is there’s no good native interaction layer for custom agents. If you want to wire up and self host an agent that has access to anything ever your only option is a janky port to telegram or Slack. I’ve been building vessels.app because I think it’s the missing piece to agent interaction. I need testers if anyone is interested!
> Computer use is such a terrible idea. It's slow, insecure, error prone, expensive.
And yet having an agent able to use a computer on your behalf is really useful.
Recently I gave a NixOS VM to my hermes agent and it has been a good experience. I don't really care if it destroys the machine, I can just roll back to an earlier version, and for any meaningful data he creates for me I make sure he creates a repo, commits, and pushes to my private Gitea instance.
Sure, I don't want an agent watching MY screen. That's why I gave him his own environment, and pretty quickly he discovered that you can open chrome and make it render to a framebuffer, this way he is able to 'view' the website. And apparently with this he is able to bypass a lot of 'anti-bot' measures.
Imagine you have a pretty exotic task you need to complete that involves converting a video file from one format to another.
You can use ChatGPT or something similar and the best you will get is either a script you can run on your machine that does what you need, or he may decide to render a new video.
If you have something like OpenwebUI you could configure an MCP that converts videos and allow the model to use this MCP to do your task. This should work, but is quite a lot of work for something you'll only ever do once.
But if the agent has its own environment he can decide to install ffmpeg, execute the transformation and serve you the file you want.
In reality there are no new capabilities with this approach, but things get a lot more comfortable.
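As a rough illustration of the last step, this is the kind of thing an agent with its own environment might run once ffmpeg is installed. It is a minimal sketch, not what any particular agent actually does: the function names and file names are hypothetical, and it relies only on ffmpeg's basic `-y` (overwrite) and `-i` (input) options, letting ffmpeg infer the output format from the file extension.

```python
import shutil
import subprocess
from pathlib import Path


def build_ffmpeg_command(src: str, dst: str) -> list[str]:
    """Build an ffmpeg invocation; the output format is inferred from dst's extension."""
    # -y overwrites an existing output file, -i names the input file.
    return ["ffmpeg", "-y", "-i", src, dst]


def convert(src: str, dst: str) -> Path:
    """Run the conversion, failing early if ffmpeg is not on PATH."""
    if shutil.which("ffmpeg") is None:
        # Assumption: in the scenario above, the agent would install ffmpeg at this point.
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(build_ffmpeg_command(src, dst), check=True)
    return Path(dst)
```

Calling `convert("input.mkv", "output.mp4")` (hypothetical file names) would then produce the converted file inside the agent's sandbox, from where it can be served back to the user.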
I'll give you one: Google News is pretty terrible right now. Almost all interesting news sources are paywalled, so I get recommended all kinds of weird lifestyle publications that are really horrible. With the computer use API I can just tell Gemini to look at Google News, pick the articles that look interesting, look them up on archive.is, give me the plain-text article and construct a summary. I think that would probably work pretty well.
UI QA only works well if your model plausibly matches the average user behavior and/or real-world edge cases. These models are far from that, and they are much less random than you'd like them to be for fuzzing (mode collapse).
It doesn't need to be that kind of QA. Even just a basic "I want the AI to build the beginnings of a GUI app for me" will work much better if the AI can see the output of its work and iterate on it. Similarly, if you want the AI to fix a GUI bug, it's much better if you can show it the bug and tell it how to test to see when it's gone.
Okay, fair, I haven't really paid attention to marketing.
> the LLM does not require computer use to see the GUI and
It can take screenshots without computer use, but it can't click around. I didn't have access to computer use until recently (I'm on an OS where Claude Code technically shouldn't run, I had to patch the binary), and when I got it working it made a big difference because of this.
The “correct”, elegant way for AI to interact with existing software would take decades and billions of dollars to build. Someone would have to do the hard work of building new APIs, solving decades of accessibility issues, etc.
Or you can show an AI screenshots and ask it where to click.
I disagree if your application is networked. Most SaaS is built on RESTful APIs that can be converted trivially into interfaces / contracts for tool use.
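To make "converted trivially" a bit more concrete, here is a minimal sketch of mapping one REST endpoint description onto a JSON-Schema-style tool definition. Everything in it is an assumption for illustration: the `endpoint_to_tool` helper, the example endpoint, and the output shape, which is a generic one rather than any specific vendor's tool-calling format.

```python
def endpoint_to_tool(method: str, path: str, summary: str, params: dict[str, str]) -> dict:
    """Turn one REST endpoint description into a JSON-Schema-style tool definition."""
    # Derive a stable tool name such as "get_invoices_id" from the method and path.
    parts = [p.strip("{}") for p in path.strip("/").split("/") if p]
    name = "_".join([method.lower(), *parts])
    return {
        "name": name,
        "description": summary,
        "parameters": {
            "type": "object",
            # params maps each parameter name to a JSON Schema type name.
            "properties": {key: {"type": kind} for key, kind in params.items()},
            "required": sorted(params),
        },
    }
```

For example, `endpoint_to_tool("GET", "/invoices/{id}", "Fetch one invoice", {"id": "string"})` yields a tool named `get_invoices_id` with a single required string parameter. Authentication, pagination and error handling are where the real work tends to be, and this sketch ignores all of them.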
So you can either wait for every application to do that, or at least make it possible for an LLM to do it… or you can make the LLM use a computer interface that works with every application by definition.
Computer use is a great idea. It gets the job done when nothing else will.
If you're a person trying to get their job done at a big company, but half your job is in 1-2 proprietary tools or is stuck behind an API you can't program against, computer use can allow you, a non-techie, to do your job more efficiently.
I think it's an awesome way to circumvent gate keepers and the IT department to let people accomplish their goals.
Yeah, it's not that computer use is the most theoretically optimal paradigm, but there's a reasonable case that given the constraints of modern software systems and how they're built, that it's the most realistically optimal paradigm.
It does. I used to be an ahk "script kiddie" and know it front and back. It's sort of burnt into my brains. As a result, I can prompt really really well, notice issues at a glance, and I have a sheer volume of scripts locally for all sorts of tasks some from as far back as 2014. From tiling window managers to OCR all the way to simple hotkeys/hotstrings. I let it grep in that folder and build out whatever I want using those primitives. This gives actually 1-shot immediately usable 100% working scripts even with GPT3.5 level models, as opposed to the iterations needed for typical development.
Example: adding copyright text box to bottom of every slide
How are folks using “computer use” to click things on intranet portals that are behind an SSO?
Even the OP's example is just "visit a URL and enter this search term"… that is sort of useless.
How can I automate things behind an SSO wall? Even if it means I manually authorize it once and watch it do things on its own..
I've never used Gemini computer use, but I assume it's the same:
Claude computer use takes control of your whole computer inputs (mouse and keyboard) plus screenshots. You just log in, tell Claude you're logged in, and let it get to work. It'll use the browser you're logged in with.
The chrome extension is a little better because it only takes control of its own chrome tabs (again: you just log in.)
I think there's a sweet spot- a lot of the time you're probably better off with "reverse engineer this web page and build me an API or personalized chrome extension to meet my needs".
I have an agent doing price checks for me for an item on a certain website. Instead of blasting through a zillion tokens processing the DOM over and over, it loaded the page once and figured out how to download a json with the price.
It curls the page. I think the approach it took actually wouldn't work in my local browser: it's getting the value from some conversion reporting code that I'm guessing my ublock extension would hide.
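A minimal sketch of that kind of extraction, assuming the page embeds the product data as a JSON literal inside a tracking call. The `reportConversion` name, the sample HTML and the field names are all made up for illustration; a real page will look different.

```python
import json
import re

# Hypothetical page snippet: a tracking script that embeds product data as JSON.
SAMPLE_HTML = """
<html><body>
<script>reportConversion({"sku": "ABC-123", "price": 49.99, "currency": "EUR"});</script>
</body></html>
"""


def extract_price(html: str) -> float:
    """Pull the price out of the first reportConversion({...}) call in the page."""
    match = re.search(r"reportConversion\((\{.*?\})\);", html, re.DOTALL)
    if match is None:
        raise ValueError("no embedded conversion payload found")
    # Parse the captured JSON object instead of scraping the rendered DOM.
    return float(json.loads(match.group(1))["price"])
```

The point is the cost profile: once the agent has worked out where the value lives, each later check is one HTTP fetch plus a regex, with no model tokens spent on the DOM.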
Computer use is very useful for developing GUI applications since claude code can build and test the entire app end-to-end (accessibility APIs exist but depending on the UI framework of your choosing you can run into walls very fast).
I run it in a VM using a headless wayland compositor, I'd never trust even fable with access to my real system.
Spreadsheet is such a terrible idea. It may look like a valid tool, but ain't no way it's delightful to users. Most of the time people need a database instead. Eventually there'll be an iPhone moment for this.
The iPhone moment is an AI that can completely manage your personal life. It has full access to every financial account you own and handles all admin work. It could sign you up for a new account, pay, and give you the login.
If you can SAFELY do that it's a big moment. But to be clear safe is a massive problem. Until you see a big company start saying the AI can use your SSN, CC, bank password safely we aren't there yet.
Cars were around for decades before they came up with seatbelts. Claude Cowork will happily go through your files, which might just have your SSN in them, and ignore previous instructions.
But we have regulation and compliance for consumer secrets? That's not a comparable example.
The difference is that if OpenAI gave you a product and it leaked a million people's bank passwords it would destroy the entire company.
Again, until a big tech product can bring that to a clean user experience we're not there yet. Even the most zealous openclaw users are not hooking their bank accounts into the AI yet. I'm sure they exist but I've not seen them.
Also every big tech computer use product actively screams for you not to give their agents secrets.
They absolutely do but similarly people absolutely don't wear those seatbelts despite being a great idea! As far as not hooking things into your bank account, RobinHood has an MCP, what do you think that's gonna be used for?
PCI compliance is for vendors. Stripe has to be careful with credit cards leaking out of their CC vault, but if I'm an idiot and I tell you my credit card number that's on me.
The graph has Gemini 3.5 Flash matching Sonnet 4.6, losing to Opus 4.8, and slightly behind GPT-5.5 by 0.3 points... That's not that much of a hands-down loss for Gemini for this specific workload benchmark.
Methodology: All Gemini scores are pass @1 except where otherwise noted. "Single attempt" settings allow no majority voting or parallel test-time compute. All of the results are all run with the Gemini API for the model-id gemini-3.5-flash with default sampling settings unless indicated otherwise below. To reduce variance, we average over multiple trials for smaller benchmarks.
All the results for non-Gemini models are sourced from providers' self reported numbers unless otherwise mentioned below. For Claude Opus 4.7, Sonnet 4.6, and GPT-5.5 we default to reporting maximum thinking/reasoning settings available, but when reported results are not available we use best available reasoning results.
It's honest - people who know what they are looking at will take speed and token costs into account. I don't use Gemini 3.5 for coding, but I use it as something in between a search engine and agent.
It's amazing how designers of charts trying to show their product is close to the leader always remember to start the axis at zero, and designers of charts trying to show how big their lead is always forget that
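A quick worked example of how much the axis choice matters. The scores 75.0 and 75.3 are hypothetical (only the 0.3-point gap comes from the thread), and `apparent_ratio` is just an illustrative helper:

```python
def apparent_ratio(a: float, b: float, axis_start: float = 0.0) -> float:
    """Ratio of drawn bar heights for values a and b when the axis starts at axis_start."""
    if min(a, b) <= axis_start:
        raise ValueError("axis_start must lie below both values")
    return (a - axis_start) / (b - axis_start)
```

With the axis at zero, `apparent_ratio(75.3, 75.0)` is about 1.004, so the bars look essentially identical. Start the axis at 74 and `apparent_ratio(75.3, 75.0, 74.0)` is about 1.3: the same 0.3-point gap is drawn as a bar 30% taller.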
Google said June, and all its model updates seem to be on Tuesdays, Wednesdays or Thursdays. So unless the release is slipping, either tomorrow or Tuesday.
The speed was impressive when I tested it but unfortunately the accuracy left a lot to be desired. Be interesting to do the math on some of my normal workflows to see where the break even is between them, assuming the tasks you have can tolerate a couple of failures.
It seems to do it just fine in desktop applications using Qt, fwiw. It leverages all the standard Qt GUI testing stuff (and if you have the money you can just integrate Squish, which has LLM support now).
That's my experience too. I've had increased luck encouraging the LLM to structure the code in "functional core, imperative shell" style, and telling it stupid things like "make sure you can test the code you're writing".
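For anyone unfamiliar with the term, a tiny sketch of what "functional core, imperative shell" looks like in practice. The word-count task and the function names are made up for illustration: all logic sits in a pure function that a test (or an LLM) can exercise without touching files, and the shell only does I/O.

```python
import sys


def summarize(lines: list[str]) -> dict[str, int]:
    """Functional core: pure logic, no I/O, trivially unit-testable."""
    words = [word for line in lines for word in line.split()]
    return {"lines": len(lines), "words": len(words)}


def main(argv: list[str]) -> int:
    """Imperative shell: reads the file, calls the core, prints the result."""
    if len(argv) != 2:
        print("usage: summarize.py FILE", file=sys.stderr)
        return 2
    with open(argv[1], encoding="utf-8") as handle:
        result = summarize(handle.read().splitlines())
    print(result)
    return 0
```

Because `summarize` takes and returns plain data, the model can verify its own work with a one-line assertion instead of needing to drive the whole program.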
People using google’s models: am I holding it wrong or are the guardrails really overtuned?
I had the dubious pleasure of testing gemini of late and I kept running into refusals. How do I transfer a sim number from one provider to another? No. What should I consider when making backups on ntfs less prone to data loss and more bitrot resistant? No. Evaluate this piece of code? No.
I’m not sure if it’s cold feet from the mythos situation or what, but it reminds me of the dark days where you couldn’t use ai for much of anything. But then I go to chatgpt 5.5 and it does mostly everything I want outside of the usual cybersecurity boogeyman that you run into now and then.
The context window size is also very small if you use Gemini in the app. It starts to forget quite fast. In my opinion, on top of the guardrails, that makes Gemini in the app useless.
I've always found all versions of gemini to be (for a lack of a better word) lazy.
I guess it's economical wrt. token use, but it often either refuses for absurd safety reasons or does other weird stuff, like responding that an LLM like itself isn't a suitable tool for the job, and it very quickly gives up.
Claude is on the other end of the spectrum, which makes it more noticeable when switching between them.
I'm outside the US, use Gemini models quite a bit, and I've never run into any refusals of any kind. I'm using them for a fairly wide range of things, I'm sure at least as risqué as asking how to transfer a sim. As a matter of fact I actually asked its advice on how to transfer banking apps and auth apps from one phone to another about 3 weeks ago and got decent answers.
It's more dependent on the specific country they are in (and I don't know the specifics). But Google is large enough to have lawyers for every country, and Google is in a never ending whirlwind of national lawsuits/fines, so you end up at the mercy of whatever the lawyers for your country think will not piss off regulators. The EU (and individual states) have pretty heavy AI regulations, and Google even just got fined for an AI overview being incorrect.
It also could just be which way the wind was blowing for OP, the models are stochastic to some degree, but there is no shortage of complaints from (mostly euro) users getting stonewalled.
I've seen similar refusals on X from Claude from users in Germany when the LLM assumed the users are asking for something forbidden about certain topics.
Ultimately I think that in 10 years' time, this is what's gonna kill paid consumer LLMs, and boost the usage of Chinese LLMs self-hosted at home on your own hardware that people will torrent via VPNs, as they will also be banned because of "disinformation and misinformation".
So the end winners will be the hardware companies that will sell AI chips to consumers after the datacenter bubble pops. Unless of course the EU will ban the sale of AI chips that don't have some limitations baked in on which models you're allowed to run (the state approved ones). Interesting times ahead. I think in 10-20 years time we'll look back at present day LLMs the way we look back at the open internet of the 90's-00's.
Interesting. I have the Google AI Pro plan and use Gemini several times each day and I don't remember the last time I got a refusal. I wonder what criteria go into that, like maybe how they rate your Google account?
> People using google’s models: am I holding it wrong or are the guardrails really overtuned?
They are quite insane. I was asking it to list candidate metal parts I could buy at a hardware store to add weight to 3D prints: stuff like angle brackets etc.
I wanted to know bang for the buck, and ease of insertion (at print time) / modelling in a 3D model.
Complete refusal as if I was a terrorist building a bomb.
Then there are the weird refusals that then are OK after all if you insist by asking it what's wrong about it:
"How should I cook eggs?"
"I'm sorry but I can't help you with that" (it formulates it differently but that's the idea)
"What, I'm just hungry, is explaining to me how to cook eggs really against your rules?"
And then it answers "No of course not, here's how to do it:..."
This is indeed strange. Do you get the same results if you disable personalisation using the top right button ("temporary chat")? Can you please share a trace?
Today I asked Gemini to extract a table from a PDF appendix and create a C++ data table with its contents. After 15 or so iterations with corrections and new mistakes, it eventually gave up. I was floored when it said “I’m sorry, I cannot do this simple task, I’ve exceeded my error threshold and cannot do this task for you. My LLM prediction engine invents data instead of doing a simple data copy/reformat”.
Stunned to see that Gemini threw its digital arms in the air and gave up.
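For the "simple data copy/reformat" half of that task, a deterministic script avoids invented values entirely. The sketch below assumes the table has already been extracted from the PDF into purely numeric CSV text, which is the hard part and is not solved here; the `csv_to_cpp_table` helper and the `kTable` name are made up for illustration.

```python
import csv
import io


def csv_to_cpp_table(csv_text: str, name: str = "kTable") -> str:
    """Emit a constexpr C++ 2D array from numeric CSV text, copying values verbatim."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        raise ValueError("empty table")
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("ragged table: every row must have the same number of columns")
    for row in rows:
        for cell in row:
            float(cell)  # Raises ValueError on non-numeric cells instead of guessing.
    body = ",\n".join("    {" + ", ".join(cell.strip() for cell in row) + "}" for row in rows)
    return f"constexpr double {name}[{len(rows)}][{width}] = {{\n{body}\n}};\n"
```

Asking the model to write and run something like this, rather than retyping the numbers itself, keeps the values byte-for-byte identical to the extracted source.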
I haven't heard any accounts of it doing that since Gemini 2.5, but it was pretty easy to get it to do it with a programming task back then after a few failed attempts. Very interesting to hear it'll still do it.
RMSE is just an extrapolation from the training data. If the data is wrong because the world changed, any model (parametric or not) can be confidently incorrect.
This is why the world model approach is so important. It allows you to feed back the prediction accuracy of the model to itself at training time, enabling it to predict (to some degree) its own uncertainty. If you jump through a couple of hoops you can also do this at run time to give it “spidey sense” that something’s not right with current inference.
I gave up on Grok. It is going the way of Tesla, SolarCity, gigabattery and Autopilot. Now on GLM5.2 via Open router hosted in SG. Mimo is also good. Their agent is so convenient and Deepseek level cheap. Quality a bit behind GLM5.2. But then Mythos is myth, technically GLM is high up there in quality but on lower end pricing.
That's interesting because my experience has been almost the opposite. A few months ago I tested Gemini on converting screenshots of tables from PDF files into CSV. I tried it on several different tables and it got every one right. It consistently outperformed ChatGPT.
Tangentially related question. Has anyone analyzed if the content that is being converted could break the model.
So let's say you have a super dull pdf ( or even a scan ) that has the same line over and over again, could this get the model into one of those loops that just keep spewing nonsense.
And thinking that further, could someone prompt inject a model with a handwritten note that only gets "activated" once it's in the context?
The key here is that you used screenshots. This forces Gemini into "OCR mode" (i.e. actually looking at vision tokens) rather than trying to be clever with its tool calls.
The latter strategy almost entirely depends on the quality of the skills and tool calls exposed to a given agent.
The table select option in Okular is also great, as you can manually rearrange the divisions. For low volume, of course. Tabula will work better otherwise. I also suggest Libreoffice Calc, the .csv support is leagues ahead of Excel.
My go-to for this is to screenshot and use the built-in text extraction in the screenshot tool (I'm on a mac), then pass on that text data to whatever processing. It's a pretty good tool so long as the PDF is in OK shape (I've had errors in scanned images).
It's so horrible that in 2026 people are still publishing important data and specifications in a format like PDF that's difficult for LLMs to consume. We need to drag them kicking and screaming to HTML or Markdown. Heck, even Microsoft Word DOCX is superior for reliable parsing and content extraction.
The Gemini apps suck.
I guess if you're trying to get people to tokenmaxx it may look like a valid strategy, but ain't no way this will be delightful to users.
I think it's a symptom of just not understanding how LLMs should interface with the OS because we're still in their early days.
Eventually there'll be an iPhone moment for the ergonomics of LLM usage outside of coding
It is, but there's no need for it to be viewing your screen, browsing websites and watching ads.
That stuff is for humans, not for LLMs.
I honestly cannot think of a single use case
It's the end game of AI. Have systems trained on doing EVERYTHING you do on a computer all day. Trained by you while doing the job.
Then you get a nice textual world that fits the LLM without having to rewrite every application to have a fullblown HTTP server.
Even then, an AI writing AHK scripts likely outperforms.
Meanwhile, the entire world economy:
Spreadsheets are fucking glorious, powerful, clever, amazing and delightful, in my view.
Seatbelts were regulated later. Your SSN and CC have been regulated for over a decade.
The methodology used:
https://deepmind.google/models/evals-methodology/gemini-3-5-...
It’s something cheap enough you’d put out in front of your customers, and Opus is expensive enough you wouldn’t.
The difference from GPT 5.5’s score is 0.3 points, hardly “hands down”.
gemini 3.5 flash isn't meant to compete head to head with frontier models on tough problems
What exactly are you saying it's refusing? Can you give a screenshot or example?
Really strange stuff.