Rendered at 14:50:00 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.
leemoore 3 minutes ago [-]
GLM 5.2 feels like Opus 4.6 level. I actually think 4.6 and GLM work better in practice than opus 4.7 or 4.8 as I find both of those more erratic and seem to randomly have a super dumb turn. That random bad turn I see doesn't seem to be hitting the benchmark scores but they make 4.7 and 4.8 very hard to use for me. GLM is more stable like opus 4.6
Tiberium 4 hours ago [-]
It seems to really be a nice step-up and is getting quite close to the frontier. I wish they'd start focusing on the reasoning efficiency now, though. I have a simple (relatively) test task to evaluate LLMs: writing a simple math evaluator library in Nim (it's about 400-600 lines total max), and GLM 5.2 (xhigh which maps to max effort) spent over 15 minutes (!) reasoning, spending about 45k tokens, before it finally wrote the first file.
I know it's hard to improve on that, but now that their models are good enough at raw intelligence, I think this should become a higher priority task.
Currently on https://artificialanalysis.ai/#output-tokens GPT 5.5 xhigh spends 16k tokens total on average, GPT 5.5 high is 10k, Fable 5 33k, Opus 4.8 41k, GLM 5.2 is 42k. GPT 5.5 is extremely reasoning efficient.
Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.
benjiro29 3 hours ago [-]
GLM 5.2 Max = Opus 4.8 Max in thinking behavior. The thinking chain is so similar, and so is the amount of token usage on the output.
If you want reasonable token usage, you need to run it GLM 5.2 at High. There is little drop in quality from Max to High (for most tasks). And it cuts token usage by 2 a 2.5x. GLM 5.2, Max is really something you only need for complex tasks.
In essence, GLM 5.2 is Opus 4.8 its little brother, at a way, WAY cheaper price.
There has been really no training on Opus models going on, really, none i tell you! /sarcasm
vitalyan123 3 hours ago [-]
distillation of thinking models is not particularly effective - both "Open"AI and Misanthropic don't show you the real chain of thought, only its severely downscaled version. both do everything in their power to combat such outrageous copyright infringement, so the bulk of unethically scrapped data the Chinese have is from several generations ago.
duskdozer 3 hours ago [-]
>such outrageous copyright infringement
Sarcasm, considering the source of their own training data?
baron3dl 44 minutes ago [-]
IP for me, not thee.
orphea 3 hours ago [-]
Narrator: it was sarcasm, indeed.
ComputerGuru 33 minutes ago [-]
Supposedly there are “jailbreaks” that expose considerably more of the thinking traces.
mannanj 1 hours ago [-]
The companies that did copyright infringement and unethically scrapped data think that copyright infringement and unethically scrapping data is wrong and needs to be stopped.
Though only in particular situations, like when it’s done to them and not when they do it. Cause they have the power and are morally right and know better than you. And if you question this at all, well you’re a threat to American values and a supporter of the Chinese and leading to the break down of Democracy.
This isn’t a type of reasoning argument or manipulation tactic used by the rich throughout history to trick the naive and gullible masses or anything like that. Trust me, I’m rich and I’m morally right. /sarcasm
3 hours ago [-]
vorticalbox 4 hours ago [-]
This is a problem I find with opus is will spend so long thinking then going “but wait what if”
To point where I stop it and simple tell it to “start writing code you can work it out as you go along”
In this paper they nerf an LLMs ability to emit waffling thinking tokens like "wait", "but", "alternatively", and the models (they're old, small models in the paper) terminate reasoning faster and perform better. I bet Anthropic is tuning this on their backend.
mikeocool 3 hours ago [-]
Seriously. Whenever I read the thinking output I get mad and turn down effort to medium or low.
Just output the code and we’ll work through it!
I feel similarly about having codex review claude’s plans. I don’t think I’ve ever seen it catch a major issue. It just points out things that would have inevitably been addressed during implementation anyway.
giancarlostoro 2 hours ago [-]
I usually have Claude build a plan first, then I put it into an XML file it updates with phases, usually we talk about some of those tasks, and then once its good and I like it, I have Claude implement the plan.
Another thing I tell Claude to do is to not guess, but look at documentation, it messes up a lot less, might use some tokens reading docs, but at least it has a higher success rate code wise.
xstas1 2 hours ago [-]
XML??
giancarlostoro 2 hours ago [-]
Apparently because of how Claude is trained, even the system level prompts go through as XML, it works better with XML "prompting" so I figured I could have it write plans in XML. I need to update my ticketing tool to output XML maybe by default.
Comments later in thread say markdown works just as fine and that it’s more important to organize your plan into sections.
Also just think about it, why would a model trained on the world’s corpus of text (that isnt formatted in xml) perform better with XML? It would be a better study if that post tested markdown, org, xml, json, etc. 10 times to see if their is a difference
root-parent 1 hours ago [-]
XML stands for Xtra ML....
thinkingtoilet 3 hours ago [-]
I've been having success with Opus but you REALLY have to tame it. Long prompts that list what files to look at, relationships between entities, etc... I went from regularly hitting my daily limit to almost never hitting it. Oh, and also I was being lazy with small changes and stopping that helped a lot too. As you said, it gets in these loops where it's just churning and if you don't stop it it can go on for way too long.
epolanski 4 hours ago [-]
Fable was 20 times worse on that.
It's clear it was the vibe coding model, as like no other model before, fully turned you into his assistant instead of the other way around.
RyanHamilton 3 hours ago [-]
Could it be possible, these firms are optimizing for two things: a) Better performance. b) Gathering data from you to further improve performance later. I've also found the huge amount of planning rather than iteration frustrating. I've felt like I'm teaching a junior!
epolanski 3 hours ago [-]
I think they simply optimize around E2E benchmarks, none of those benchmarks is designed as multi turn assistance to the user, but going from a prompt straight to the final solution.
happyPersonR 55 minutes ago [-]
more thinking == more tokens === more money LOLL
3 hours ago [-]
h14h 2 hours ago [-]
Hopefully the recent work Moonshot did with Kimi K2.7 Code trickles in to the other open-model labs.
Per AA, while K2.7 Code is roughly on par w/ K2.6 in terms of intelligence, it uses half the output tokens to get there.
bertili 4 hours ago [-]
This is GLM 5.2 Max. GLM 5.2 High which use less than half[1] the tokens.
Yes, but the Artificial Analysis result is also from GLM 5.2 (max), not high.
andai 4 hours ago [-]
They have this with a lot of models, measuring only the max setting, while the one you'd actually want to use for most tasks is much lower.
epolanski 4 hours ago [-]
For the brief period with had Fable, I never had to use it above medium.
Low nailed the overwhelming majority of mundane tasks on it's own, medium was good for more complex stuff.
robmccoll 2 hours ago [-]
That's interesting. I gave nearly the same task to Gemma4 31b as a test yesterday. Write a symbolic math engine in Typescript that can perform evaluation and simple expression reductions over +-/*(). It performed the task correctly with minimal reasoning - much fewer reasoning tokens than output tokens.
rdsubhas 1 hours ago [-]
As per stats in other comments, it is frontier, not close to frontier.
cmrdporcupine 3 hours ago [-]
> Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.
GLM5.2 ends up being far more expensive than I thought it would be when I tried it on openrouter. I ground through $5 USD worth of tokens quite quickly.
And this was high, not max.
4 hours ago [-]
kristopolous 3 hours ago [-]
I have a script that ranks these based on codingindex from Artificial Analysis.
All it does is pull a json from their main table page and parses it with the fields I care about (coding).
There used to be a mailing list associated with it but eh ... there wasn't much interest. I use the script every day though.
Current partial output
score age size name
47.1 58 large Kimi K2.6
47.5 54 large DeepSeek V4 Pro (Reasoning, Max Effort)
47.5 70 - Muse Spark
47.6 132 - Claude Opus 4.6 (Non-reasoning, High Effort)
47.8 205 - Claude Opus 4.5 (Reasoning)
48.1 132 - Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
48.6 55 - GPT-5.5 (Non-reasoning)
48.7 188 - GPT-5.2 (xhigh)
50.1 29 - Qwen3.7 Max
50.7 1 large GLM-5.2 (max)
50.9 120 - Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
51.5 92 - GPT-5.4 mini (xhigh)
52.1 55 - GPT-5.5 (low)
52.5 62 - Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
53.1 132 - GPT-5.3 Codex (xhigh)
53.1 62 - Claude Opus 4.7 (Non-reasoning, High Effort)
55.5 118 - Gemini 3.1 Pro Preview
56.2 55 - GPT-5.5 (medium)
56.7 20 - Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
57.2 104 - GPT-5.4 (xhigh)
58.5 55 - GPT-5.5 (high)
59.1 55 - GPT-5.5 (xhigh)
62 8 - Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
* open models are on about a 4-7 month lag right now depending on how you want to measure it
* if this keeps up, you might see an open-weights model doing claude fable 5 level work before the new year.
if people sign up for the free mailing list (that just does this) I'll go and put it back on ... emails when new model evals drop - it was pretty useful.
papersail 2 hours ago [-]
score age size name
62.0 8 - Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
59.1 55 - GPT-5.5 (xhigh)
58.5 55 - GPT-5.5 (high)
57.2 104 - GPT-5.4 (xhigh)
56.7 20 - Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
56.2 55 - GPT-5.5 (medium)
55.5 118 - Gemini 3.1 Pro Preview
53.1 132 - GPT-5.3 Codex (xhigh)
53.1 62 - Claude Opus 4.7 (Non-reasoning, High Effort)
52.5 62 - Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
52.1 55 - GPT-5.5 (low)
51.5 92 - GPT-5.4 mini (xhigh)
50.9 120 - Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
50.7 1 large GLM-5.2 (max)
50.1 29 - Qwen3.7 Max
48.7 188 - GPT-5.2 (xhigh)
48.6 55 - GPT-5.5 (Non-reasoning)
48.1 132 - Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
47.8 205 - Claude Opus 4.5 (Reasoning)
42 minutes ago [-]
bel8 2 hours ago [-]
you left some models out like DeepSeek and Kimi, for example.
kristopolous 1 hours ago [-]
It was a truncated output from the script to demonstrate what it does ...
Because it's not in the top 20 in their benchmark, it's at #23
tcp_handshaker 2 hours ago [-]
Short comments...
- GPT 5.5 consistently the best, an opinion who gets me constant downvotes here by the Anthropic Marketeer strike force...
- China is going to eat the US lunch on AI
- What have European universities and companies been doing?
Its like if, on a parallel past/future, Nikola Tesla and
Edison would have created flying Cyberpunk machines,
while Europeans researchers, would be getting together to
request EU funds, for investigation on how to
breed faster horses.
- If Zuckerberg could be fired, after spending
a total of $235 billion on AI and having
NOTHING to show for...should he be fired?
Certhas 2 hours ago [-]
None of these models come from universities, European or otherwise.
Mistral is clearly currently not competing for Frontier Model. Whether this is due to a lack of VC Funds or a lack of technical ability or the former arising from the latter would be interesting to know.
The top models are from startups. Among the FAANG only Google managed to get a Frontier model, and they litterally invented the architecture and have more money than they can possibly spend to throw at the problem. Facebook shows that even ungodly amounts of money don't get you there though.
So why did no EU based Startups succeed while two US start ups succeeded? I agree that that's a very important question the EU should ask. The Internet revolution was driven by US companies, and now AI will be as well, with Chinese Open Weights mixed in. The EU consistently can not turn its considerable economic output into fast moving tech firms.
Quarrel 37 minutes ago [-]
Mistral have moved to actually trying to make money, and been relatively successful; at least if we lived in a normal world.
They've got a heap of contractors working to help industry adopt LLMs. It is just classic consulting work, and they'd look like a really great company if we weren't comparing them to literal $2T+ companies losing money hand-over-fist...
sschueller 1 hours ago [-]
Apertus was built by universities in Switzerland. Although not frontier it is fully open.
I'm actually more curious about IBM. Their granite series appears to be nowhere close to competitive.
They had Watson, remember, it won on jeopardy like 15 years ago? They've been at this for a long time
Maybe it's good at something else?
tekchip 1 hours ago [-]
IBM doesn't do technology they do contracts. Any "technology" is marketing stunts. They hire a bunch of "fellows" outside contractors to make a thing they can be first at or whatever, do the stunt, then get a bunch of 5-10 year contracts with customers off the stunt. They then fuck it up for that length of time but still get paid due to those contracts. After that space of time the folks theyve burned have moved on, rinse repeat. Pretty easy to look back at the timeline of "firsts" they have and see the pattern.
root-parent 1 hours ago [-]
Agree that IBM has no excuse. Specially for how long they have been trying to do AI. Although Watson was a completely different technology.
They had to start from scratch, but dont seem to have the management to be smart enough, to stop doing it in house. They could have just acquired a startup that could build a frontier model.
What is also very ironic since their whole bussiness for the last 15 years, has been buying companies a la CA Associates...
Their previous Watson branding and collapse of Watson expectations cost them one CEO, but the current CEO was part of the same team. They just dont learn....
greenavocado 1 hours ago [-]
Granite is OK for speech to text (ASR)
marcus_cemes 1 hours ago [-]
To be honest, living in Switzerland and speaking with peers, we're just exhausted by the constant AI hype. For a lot of us, the fact that Europe isn't frantically trying to scrape the entire internet and every book in existence for the next massive model isn't a bad thing. The big players are doing their thing, like with the nuclear arms race. We regulate a lot, too much a lot of the time, but sometimes that trickles down to other places too. A lot was done right, imo.
ETH Zurich and EPFL universities recently put out an open model called Apertus (was on the HN front page a few months back), it's not a frontier model, but they built it properly regarding copyright and data transparency.
It might look a bit slow or old-fashioned, but focusing on doing things ethically and legally feels like a much better path than just joining the race to scrape everything.
dr_dshiv 55 minutes ago [-]
Sir, I would suggest that if Europe fails to be economically competitive, the downstream implications on European society will produce much worse outcomes than (for instance) data transparency…
Doing things with ethical intentions does not necessarily produce outcomes that are beneficial for society at large.
marcus_cemes 10 minutes ago [-]
I'm inclined to agree with you, but you could make the same argument for exploiting natural resources and the environment. I don't think it's being done right at the moment, and it does not seem to be benefiting people as much as certain companies.
JKCalhoun 40 minutes ago [-]
"…Anthropic Marketeer strike force…"
Might also just be the result of "good will" (that the company has deftly fostered). Other companies might learn from Anthropic in that regard.
ricardobayes 1 hours ago [-]
Well Europe is famously a laggard when it comes to new tech - in parts of Switzerland, two horses were required be mounted in front to carry cars up until 1925. UK required a person to walk in front of a car and wave a red flag.
kristopolous 2 hours ago [-]
They did muse spark ... it's not garbage.
Also what are they building it for? I'd think it's to serve ads better or something like that. Maybe Muse Spark fits facebook's needs perfectly...
jansan 1 hours ago [-]
Mo Bitar said something like "Meta's LLM is the one you use if you accidentially hit the wrong button in WhatsApp. Its user base is fat-finger phone users."
kristopolous 1 hours ago [-]
Understood - they're just doing other things. Maybe custom ad rewriting for a target audience or some kind of deep analytics insight into user behavior or translations that optimizes for maximizing purchasing habits over literary accuracy ... I'm just saying their incentives are elsewhere and maybe Muse is serving them well.
I mean that is the smart move here. Focus the model on optimizing the core business. For Meta, that's not coding tools.
applicative 1 hours ago [-]
> China is going to eat the US lunch on AI
They will forever have superior weights?
JKCalhoun 37 minutes ago [-]
I would imagine it will be a fundamental breakthrough, not weights alone, that are going to usher in the next generation of AI. Perhaps China will in fact make that breakthrough. They certainly seem to have a lot of eyeballs in the field right now.
senordevnyc 1 hours ago [-]
I downvoted you for your complaining about downvotes fwiw.
And Zuck hasn't spent that much on AI yet. Half of that is projected spending for 2026.
As to whether it's all for nothing, Q1 2026 revenue was up 33% over Q1 last year, driven largely by...better AI-driven ad targeting. So the spending doesn't seem that crazy to me.
alecco 2 hours ago [-]
Consider using decrementing score order (best on top)
kristopolous 2 hours ago [-]
then I'd have to scroll up over 500 lines after running it every time to see what I care about.
add an argument (any argument) and it will be sorted as your specified. It just works as a toggle flipping the order ... so literally any string will do.
The original link has been updated accordingly with the new code.
datadrivenangel 2 hours ago [-]
Have it print paginated or just top 10?
kristopolous 2 hours ago [-]
only the small ones:
$ ./art-analysis.sh | grep small
or maybe just the qwen
$ ./art-analysis.sh | grep Qwen
only the ones in the past 30 days
$ ./art-analysis.sh | awk '$2 < 31'
I use it in pipes like this.
spwa4 2 hours ago [-]
[dead]
scrollop 9 minutes ago [-]
Would be interesting to see where gpt 5.5 pro extended is.
bodhi_mind 2 hours ago [-]
Cool project! Side note: Kind of a bad practice imo to ask people to blindly execute bash from an unknown source.
slig 2 hours ago [-]
Thanks for sharing. I'm curious: why didn't you sort with the score descending?
kristopolous 2 hours ago [-]
Because it's currently 511 lines. Why would I want to scroll up to see the stuff I care about? Don't you want the relevant stuff to be right there in front of you?
duckmysick 2 hours ago [-]
I do and that's why I pipe the output to `head -n 20` or use `LIMIT 20` in SQL.
That aside, this is a good script you're running. Thanks.
tasuki 2 hours ago [-]
But maybe you decide you want to see more. It makes perfect sense for a cli tool to output the most interesting piece of info last: then you can decide on the fly whether you want to scroll up or not.
1 hours ago [-]
fridder 2 hours ago [-]
Not OP but if you run this from the CLI it does make the ordering make a little more sense
snsnbsne 2 hours ago [-]
Because programmers can’t figure out how to have a CLI that prints in a normal order, with the newest stuff on top instead of on the bottom.
Setup a fresh new large monitor. Open CLI. Run command. Watch output at the bottom of your screen. Keep watching the bottom of your screen for the rest of the day.
Sure you can tile windows and it helps but come on. Just have the command/input section in the bottom and the “output” on top. Keep the command bit on the bottom.
2 hours ago [-]
2 hours ago [-]
2 hours ago [-]
mrngld 3 hours ago [-]
Artificial Analysis coding benchmark shows GLM5.1 on high pretty close to GPT5.5 xhigh in cost to run, with GPT5.5 on medium significantly less expensive. Compared to GPT5.5 medium GLM5.1xhigh is twice the cost and half the intelligence. They don't have GLM5.2 on there yet, but that'd a big gap to bridge.
I thought I was "holding it wrong" until DeepSWE came along -- personally it seems to match my own experiences pretty well. Really makes me wonder how legitimate some of the internet noise is about open models. There's surely some use cases for them, not everything needs the absolute frontier (GPT5.5 on low is awesome), but if you want to be near the frontier everyone needs to be honest about the fact that we're only talking about Opus, Fable, GPT5.5.
undecidabot 48 minutes ago [-]
It got 46.2 on DeepSWE in Z.ai's own run[1]. That would put it between Opus 4.7 xhigh and Opus 4.8 medium.
with open models you can get a subscription with privacy, at the same cost as codex.
openai, google and anthropic subscriptions are not available with privacy.
looking at the link there it's interesting that going from cursor cli to codex cli take gpt 5.5 from 7th to 3rd. but they didn't do open model in codex.
so, hard to say it's for sure a model benchmark. maybe open models are just shit at swe agent harness...it's not the most parsimonious explanation though.
vadansky 39 minutes ago [-]
> with open models you can get a subscription with privacy
Unless you're running it locally, aren't you just trusting some other entity?
ttul 1 hours ago [-]
DeepSWE “feels” like the right benchmark in comparison to Artificial Analysis indices and other coding benchmarks. And by their metrics, GPT-5.5 is still king in token efficiency, speed, and overall intelligence per dollar.
Fable 5 is cool and all, but we have not yet seen GPT-5.6.
cmrdporcupine 3 hours ago [-]
I gave GLM 5.2 a spin on openrouter yesterday and it was mostly fine but it racked up $5 in token use in 30 minutes of (relatively slow) work.
It's easily 4x the cost of DeepSeek V4 but I didn't actually feel the results were that much better. I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.
Having better luck with MiniMax M3, from a cost/benefit ratio.
pjerem 2 hours ago [-]
I really like DeepSeek V4 Pro. It's pretty smart and I get so much usage out of it on a $20 Ollama cloud plan.
With a good harness, that's my favorite model for any personal project. I use Opus 4.8 at work because i don't have to pay for it and of course I love it, but DeepSeek is like 80% there for one tenth of the price.
zooming 3 hours ago [-]
Try MiMo-2.5, I'm having astonishing success with it in opencode for cents per day. Not even the pro model.
re-thc 1 hours ago [-]
> I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.
GPT can find fault in everything and anything including its own work.
simonw 3 hours ago [-]
I was surprised that GLM 5.1/5.2 are not vision models - they are text input only.
That's actually pretty uncommon these days. All of the OpenAI/Anthropic/Gemini models accept images, and so do the other leading open weight families - Gemma 4, Qwen 3.6, Kimi 2.x.
In GLM's case image input would be useful because it's a model that scores very highly for tasks like web design, but without image input it can't take a screenshot and output HTML+CSS.
Don't get me wrong, GLM is a phenomenal model, but the image thing is a bit of a gap.
0xbadcafebee 1 hours ago [-]
Configure a subagent in your coding harness to spin up a new sub-session with any vision model for those tasks and feed the result back to the main model. No need for "one model that does everything"
ashenke 1 hours ago [-]
I had the same reaction with Deepseek V4 ! It would be more useful as a vision model
_pdp_ 2 hours ago [-]
I don't see this being such a big gap. There are some use-cases for sure but apart from UX/UI work it is not really needed. Besides, none of the frontier models can replicate actual images - the can approximate at least in my own experience.
simonw 2 hours ago [-]
One of my tests for a new model is dumping in a screenshot of a web page and seeing if it can recreate it from scratch in HTML and CSS.
Even the local models I run on my Mac are getting surprisingly good at that now.
tiahura 1 hours ago [-]
Using llms to generate docx. Being able to rasterize and review is an important part of the process.
1 hours ago [-]
unrvl22 4 hours ago [-]
Why aren't more people talking about this? It's literally Opus 4.7 quality stupid prices. I know providers who are offering this at unlimited tokens for $50 a month. Some are even offering API rates at 3x lower than the official ZAI api rates which are already like 10x cheaper than Opus. (Crof and Umans btw)
This is a huge blow to Anthropic/OpenAI/Google and a massive win for the rest of the world. The official API prices and speeds mean nothing for open source models.
stanac 4 hours ago [-]
> Some are even offering API rates at 3x lower than the official ZAI api rates
Looking at openrouter [1], some of the cheaper offerings are for quantized models. Not sure how much intelligence is lost in quantization. And they are not 3 times cheaper. Where did you find 3x lower prices for APIs? I am considering skipping open router and using them directly for that price.
IME, unquantised -> FP8 is pretty much lossless. What matters more is having an unquantized KV cache - using an FP8 KV cache can result in a significant drop in quality.
ComputerGuru 27 minutes ago [-]
Do infra providers reveal that level of implementation detail?
scrlk 4 minutes ago [-]
I've seen a few articles from providers talking about KV cache quantisation, but it's not something they explicitly point out like they do with weights.
benjiro29 3 hours ago [-]
Neuralwatt ... When you reverse calculate the actual energy usage / price on a token basis, the gap is large.
I do not have GLM 5.2 numbers because the whole default max setting is overkill. But GLM 5.1 numbers had it at 12x cheaper then API rates. And about 2.5x more tokens vs zai their own subscription service.
Yes, its FP8 but lets be honest, do we know for sure that even zai runs at FP16? I learned a long time ago with Claude and Codex how much cheating happens on model levels, even from the big boys.
CuriouslyC 4 hours ago [-]
Be careful about unofficial providers, a lot of them misconfigure models or stealth quantize them. For a while the difference between Kimi on the official API and most third party providers was 20-40%.
OpenRouter should be penalising or banning for this.
kilroy123 3 hours ago [-]
This is my biggest complaint about OpenRouter and I'm a fan. Might be pretty tough at scale?
orbital-decay 1 hours ago [-]
They have an "exacto" category with providers they supposedly verified
ComputerGuru 27 minutes ago [-]
That’s only for tool use.
alecco 2 hours ago [-]
Would that align with their VC-backed incentives?
unrvl22 4 hours ago [-]
the 2 I mentioned both have a fairly large following, who run benchmarks and absolutely will spot issues.
embedding-shape 4 hours ago [-]
> Why aren't more people talking about this?
Wasn't this released like 2 days ago? Everyone is still evaluating and playing around with it, things like the submission is just starting to come out. Give it some days at least before jumping to conclusions, ideally weeks.
Schiendelman 4 hours ago [-]
To answer the question in your first sentence - because it's VERY computationally (ha) expensive as a human being to keep up with all the options. It's also very hard to figure out how to run a model like this. There's no installer. If you really really care, which 99% of people do not, you have to google a guide, and then find out it's out of date...
I've tried a number of these, and the learning curve is very steep compared to "install Claude Code and pay $100/mo". There is no way saving me $50/month matters compared to figuring that out.
andai 4 hours ago [-]
But it just works with Claude Code? They have a guide on their website.
Even more pro tip: Claude Code can set this up for you haha
Schiendelman 4 hours ago [-]
Sure, I'm not saying I, a software engineer, cannot do this. I'm saying it's significant onboarding friction.
Unless this were a massive differentiator, people aren't going to be "talking about it" the way GP suggests!
fc417fc802 3 hours ago [-]
You're seriously suggesting that setting up opencode or tweaking your claude code config or etc is too much trouble to be worth saving $50 /mo? That's absurd. Doubly so when the audience in question is already using LLMs so ... just ask your existing LLM for help if it seems daunting.
Schiendelman 3 hours ago [-]
I'm not just suggesting that, I'm trying to be crystal clear: it's a gap that probably cuts TAM by 95% or more. Most LLM users are not software engineers. Even those that are don't care enough to muck with their settings to try out a model. Keep in mind I'm not answering the question "Is this hard to install?" - I'm answering the question "Why aren't people talking about this?"
donohoe 2 hours ago [-]
I would broadly agree with this (based on years of dealing directly with user-facing UX and setup steps). Small hurdles, even easy ones, create larger barriers to adoption then you’d think.
fc417fc802 3 hours ago [-]
Doesn't pass the sniff test. Casuals messing around already go to far more trouble to set up openclaw or comfyui or what have you.
Schiendelman 3 hours ago [-]
What percentage of "casuals"? ;)
neonstatic 2 hours ago [-]
"Casuals" just use the web interface from the provider, which Z.ai also has
ramraj07 32 minutes ago [-]
Thats not absurd. Do you know what software engineers make? Do you know what a Starbucks coffee costs? 50 bucks is nothing for someone in that life.
skeledrew 2 hours ago [-]
The friction is near 0 when you can ask another LLM to set it up for you.
Schiendelman 1 hours ago [-]
Here are a few frictions I see that reduce reach, in order:
1) You haven't even heard of it.
2) You have to know to look for both GLM and Z.ai. These are usually in the same article when reporting about GLM is written, at least.
3) You have to understand there could be a benefit in trying it; you have to want to try it for some reason. Their own blog post puts it below Opus 4.8 in each of the three benchmarks they used.
4) You have to figure out the pricing, which isn't obviously in the blog post...
5) When I first went to Z.ai, I got an error popup (not logged in): "You do not have permission to access this resource. Please contact your administrator for assistance." I am using a personal computer...
6) When I typed something in the resultant field and pressed enter, I got "Clear Current Chat? To start a new chat, your current conversation will be discarded. Sign in to save chats"
I think today's article helped with 1 and 2, which helps their top of funnel. But they're fighting a big uphill battle.
chen66996 3 hours ago [-]
[flagged]
chillfox 50 minutes ago [-]
install opencode, then either pay $10 for their plan, or add an openrouter api key.
gerryf2 59 minutes ago [-]
I agree with this.
I'd pay for an out of the box solution. i.e. an Installer with updates
That's as "easy" as it is for non-devs that you're complaining about.
Schiendelman 22 minutes ago [-]
I'm not complaining about anything. I'm answering a question.
cedws 4 hours ago [-]
In my org everyone is extremely Claude-pilled to the point you’d think it’s the only LLM that exists, purely because it caters to non-engineers within enterprises.
unrvl22 4 hours ago [-]
I cancelled my claude sub after realizing I can burn 300m tokens a day of this quality, for $50 a month.
knollimar 2 hours ago [-]
Isn't it closer to sonnet?
redox99 2 hours ago [-]
Definitely opus level for coding.
smith7018 2 hours ago [-]
Do you have benchmarks or at least anecdotes to back that up? I'm not arguing with you; I would just love to see some proof that open models are getting as good as Anthropic's models.
redox99 1 hours ago [-]
I've been running some test prompts comparing frontier models for webdev, particularly pretty visualizations, physics / orbital simulations, etc.
Do note that GLM is not multi modal, which can be a deal breaker. And these open models are not good outside coding.
unrvl22 1 hours ago [-]
look at benchmarks, use the model yourself. Im usually first to call BS on every chinese model that says they are as good as Opus. this is finally the first one that actually is. It is a massive jump from every other previous chinese model.
smith7018 48 minutes ago [-]
"use the model yourself"
I wish I had the time to set it up and work on side projects but unfortunately life and work have been crazy (as I'm sure many here feel). That's why I asked for anecdotes about it.
Hamuko 4 hours ago [-]
I’m not that interested in models that I can’t run on my desktop for ~0€, which is my AI budget.
andai 4 hours ago [-]
Electricity cost seems to be about $30/month for a 32B model on a GPU. It's probably better on Apple hardware.
The price, processed tokens, and output can be anything, it just depends on what GPU it is.
Nvidia GPUs are much more efficient than Apple hardware for inference(and training).
Hamuko 3 hours ago [-]
My Mac Studio uses about 60–80 watts whenever I’m running a model (as measured by the system metrics), so it’s less than 2 kWh/day at full blast. Electricity is like 0.125 €/kWh, so that 24-hour period would be <0.25 €.
Not accounting hardware in my costs, since I didn’t buy my hardware for running models. Running models is just something it can do in addition to what I got it for.
igravious 4 hours ago [-]
Cool beans. You're not the target audience then.
Hamuko 4 hours ago [-]
Did I claim I was? I just said why I and people like me are not talking about it.
simianwords 4 hours ago [-]
and he said its cool
anuramat 4 hours ago [-]
> unlimited tokens for $50 a month
link?
> Why
imho everything but opus produces unusable code (fable was even better...), eg gpt5.5 seems to write the absolute worst code that still technically solves the problem; tbh I'd be totally willing to trade "raw intelligence" for "code taste"
more labs need to figure out whatever anthropic did to destroy everybody else on frontiercode bench
CuriouslyC 1 hours ago [-]
Opus has the nickname "Slopus" in a lot of circles for a reason. It can write nice code in isolation, but the way it organizes that code and its rigor in addressing edge cases/making sure things are robust leave a lot to be desired. Opus is particularly famous for having a real problem reinventing stuff that already existed in the codebase because it wanted to get to work before exploring sufficiently.
CuriouslyC 4 hours ago [-]
I've been playing with this model a fair amount over the last 24 hours, and I can confirm it's quite capable, while being a little bit verbose (I've seen it reconsider things 3-4 times in thinking traces before deciding on a path forward), and not being quite as good as GPT5.5 at working through complex abstract requirements.
Honestly it's good enough that I feel comfortable recommending a Z.AI sub + a $20/mo OpenAI sub for all but the most AI pilled multi-orchestrators, or the die hard Claude fans. GLM writing + GPT reviewing/debugging feels pretty unlimited and minimally worse than just doing everything in GPT with the $200/mo plan.
sdesol 1 hours ago [-]
> GLM writing
This is honestly what I care bout the most now, which is how well they can write. I think we have reached a point now, if you know how to program, you can provide enough information for the models to pretty much do what you need.
What they still struggle immensely with is the writing which has too many nuances but they are truly getting better.
Havoc 3 hours ago [-]
> while being a little bit verbose
Discovered today that they set reasoning effort to max by default. So that’s probably why
andai 4 hours ago [-]
This is my workflow. And then once a day I copy paste the code into the free Claude Sonnet so it comes out actually readable.
igravious 4 hours ago [-]
After having got a taste of Fable 5 for me Opus 4.8 doesn't cut it any more -- and I don't know how to put this, I don't know if it's just me, but it's rhetorical flourishes are starting to really grate on me, never mind that it is at times deliberately weasel-wordy and economical with the truth until pressed. Opus 4.8 is definitely a stronger coding agent than DeepSeek 4.0 or Kimi 2.7 succeeding where they flounder and fail but its way of expressing itself conversationally is making me reconsider my subscription …
elwebmaster 4 hours ago [-]
You are not alone. How about GPT 5.5? Does it come close to Fable 5?
theplumber 3 hours ago [-]
GPT 5.5 xhigh is smarter than Fable but Fable like Opus 4.8 as well is faster and seems more “agentic”. It’s easy to test this. Build a fairly complex software with Claude(opus or Fable).
Review the commits with both Claude and GPT 5.5 Xhigh. You can see that Fable is still sloppy(er) compared to GPT. You can test it the other way around as well(drive the dev with GPT and review with GPT and Claude). You get the same result
Claude has an edge though and that’s on building more beautiful user interfaces.
fragmede 4 hours ago [-]
5.5 is pretty good. It's no Fable though. It is definitely better than opus tho.
CubsFan1060 3 hours ago [-]
Knowing very little about how to run these, how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?
It’s expensive, and not as capable as the frontier models, but would have some pretty big benefits around privacy and agency.
wongarsu 3 hours ago [-]
I know of multiple businesses in Europe that have been doing that for a while with 70B models, and are upgrading hardware to run the new crop of 700B-1T models (really started around Kimi K2, but buying and hosting that kind of hardware takes time)
Not everyone is willing (or even legally able) to send their trade secrets to OpenAI or Anthropic
user43928 44 minutes ago [-]
While certainly there are such cases with trade secrets, it's worth noting that even large banks typically have a provider like Azure or AWS onboarded.
There they can deploy these models while using the existing legal frameworks.
CubsFan1060 3 hours ago [-]
What kind of hardware/price does it take to run those?
bitmasher9 2 hours ago [-]
Nvidia will sell you an entire server rack ready for inference. Or maybe you can roll out your own Blackwell based system.
We’re approaching a world where running a primer frontier model is possible on a workstation, probably will have something under $30k that looks like a desktop for Nvidia’s next generation. It sounds expensive, until you look at your Anthropic bill.
It’s similar unit economics as could computing for the open models. You can save a ton on the expenses by buying the hardware, but it requires a lot of in-house expertise, and you get the most value if you keep the system operating around the clock. The big kink is open models are usually 2 quarters behind frontier, and your competitors are probably trying to get access to mythos.
wongarsu 2 hours ago [-]
For an 8-bit quant (what people call "near lossless") you are looking at something like 4xMI350X, which comes out to about $150k after adding the rest of the server. More if you go with Nvidia instead of AMD. More if you want more than maybe 8x concurrency
But prices are changing rapidly, and not for the better
MikhailTal 3 hours ago [-]
This is not a new situation. This was happening also when good vision models like alexa net were coming through, especially for OCR. Companies had choice between cloud or self hosting with GPUs. But turns out, problem is usage patterns.
Your usage will peak during certain timezone work hours(even if you are a huge multinational company most of your engineers/users tend to be from only a few locations), so then you have a bunch of gpus doing nothing the rest of the day.
especially with latency sensitive stuff, this is a decades old tradeoff problem, its not unique to llms
Havoc 3 hours ago [-]
It’s a ~750B model so still a hell of a lot of vram
Would need to be a pretty determined medium biz
moffkalast 3 hours ago [-]
So far there seems to be one major use-case for complete privacy, and that is legal work. You don't need top of the line models to search vast amounts of text in discovery and it needs to be completely confidential. There's quite a few lawyers over on r/localllama showing off their multi-GPU builds. Coincidentally they also have the vast funding required for it.
petesergeant 3 hours ago [-]
Unless you have genuine national security concerns, you’d be better off just negotiating a commercial agreement with privacy protections with a couple of existing vendors.
CubsFan1060 3 hours ago [-]
I think that's true until it isn't, which may end up being the problem. Fable/Mythos doesn't fall under the ZDR agreements with Anthropic. And I'm curious if others will follow suit.
tancop 3 hours ago [-]
if you can afford the investment you get stable low costs for years with better security (at least if your cyber team is good). its even better in regulated industries where some vendors might add a premium for hipaa/soc/pci dss compliance to the point its a lot cheaper to self host. for a smaller business its not worth it and you should just use a hosted open model.
petesergeant 2 hours ago [-]
> to the point its a lot cheaper to self host
I'm pretty skeptical, especially given typical utilization patterns. Do you have numbers, or this is just vibes?
re-thc 3 hours ago [-]
> how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?
Years.
Even Microsoft said they don't have enough for Github and need to call Amazon.
Getting a few even at decent prices is hard. Unless the shortages goes down...
hyqzz8 4 minutes ago [-]
It is a very useful model
tensegrist 4 hours ago [-]
> On the Intelligence vs. Cost per Task Pareto Frontier: GLM-5.2 is on the Pareto frontier of the Intelligence vs Cost per Task chart, with the lowest cost per task among models at its intelligence level. GLM-5.2 costs ~$0.46 per task, compared to GLM-5.1 ($0.25), Kimi K2.6 ($0.31), MiniMax-M3 ($0.18) and DeepSeek V4 Pro (max, $0.05)
am i missing something?
OtherShrezzing 3 hours ago [-]
I think they’ve just picked poor peer examples. Instead of choosing other models near 5.2 on the intelligence scale, they’ve picked some open models from further down the scale.
xiaoyu2006 4 hours ago [-]
Some models are heavily subsidized. Total params & active params are better measurement of inference cost.
simianwords 4 hours ago [-]
No models are subsidised -- there are lots of third party hosting services that will still run at breakeven/profit. (except Deepseek after discount)
stymaar 2 hours ago [-]
> No models are subsidised
We have no proof in either direction, it's not like we had access to their financial numbers in details.
And the pricing itself muddies the water, as input tokens that are already in the KV cache are practically free for the provider, whereas other tokens are expensive. So they could still make money overall thanks to people having multi-turn conversation (and as such, paying multiple times for the same token), but lose money on actual compute done.
> there are lots of third party hosting services that will still run at breakeven/profit.
How can you be sure that they are making profit directly from token price, and are not billing at marginal cost (i.e. electricity price, without counting the cost of the GPUs) and aiming to make a profit later on from the valuable training data that they are collecting in the process?
simianwords 1 hours ago [-]
> How can you be sure
You are free to believe that they are doing all this. Or you can simply believe the intuition that models are getting cheaper by the day. I can run Gemma 4 31B from my laptop today.
XCSme 3 hours ago [-]
In my tests[0] GLM-5.2 is not much better than GLM-5, and overall DeepSeek V4 Flash seems to be the better/more cost-effective choice:
I think the problem is, as can also be seen on other benchmarks, is that most models nowadays are focused more and more purely on tool calling and coding.
This means, that models are losing more and more general and domain-specific knowledge.
Look at those graphs on ARtificialAnalysis, GLM-5.1 still performs similarly or better:
I still feel like models are not getting any smarter for a few months already, they just changed their training to be focused more on some areas than others, so shifting the intelligence from one place to another, not necessarily increasing the overall intelligence or "AGI" score.
sourcecodeplz 3 hours ago [-]
man, i love dsv4-flash but i found its weaknesses in complex projects with multiple moving parts. tried kimi 2.6 and it understood and could work on the task. bigger is better..
RDTvlokip 30 minutes ago [-]
I have a question, as it happens: Do you think the benchmarks and models were trained on benchmark datasets to skew the results, even though in real-world applications we realize they're not that great?
ponyous 1 hours ago [-]
Just ran and scored 63 3d model generations (via code) across high and no reasoning. 3D Modeling benchmark quickly shows spatial, logic and code performance of the model so I think it's a very good indicator of the quality.
Here are the results compared to Gemini 3.5 Flash:
Model + config CodeErr/gen Cost/gen Median time Quality
gemini-3.5-flash, low 0.71 $0.18 68s baseline
GLM 5.2, reasoning high 0.61 $0.18 289s -6.0%
GLM 5.2, reasoning off 1.52 $0.10 126s -13.6%
Although it is cheaper, it is significantly slower, and results are worse overall. Surprisingly - high reasoning produces less code errors than gemini 3.5 flash, but when I actually look at the models they are worse.
Edit: I recently ran evals with Kimi 2.7 and MiniMax-M3 and this is clearly open source SOTA model, by far.
NiloCK 1 hours ago [-]
Very interested in this! Can you share more about the modelling method (eg, three js?), the task list, and outputs here?
I think there's probably some good juice to squeeze in terms of spacial awareness by doing a benchmark something like
- give 3d modelling task
- render and snapshot from a variety of angles
- feed to third-party vision model for a "what is this" type query
- grade on end-to-end accuracy
Bonus points for asking the vision model something like "how beautiful is this 1-10".
ponyous 23 minutes ago [-]
I don't have the eval results live yet, so I cannot share them yet.
I was benchmarking using a soon to be released new version of my AI CAD modeling software[0].
It's basically an agent that has access to tools that can execute build123d scripts, get sculpted models, blender to combine sculpts + parametric models, tools to inspect the model (visually and with code), search datasheets, ...
I tried what you recommend a while ago (asking an AI to evaluate using different angles) and the AI evaluations were extremely bad - barely any correlation to what I scored. Things have gotten better, but I don't trust it enough yet.
Here is how I score adherence (and how AI did as well, but I tried methods where it would just give back a boolean "pass" or not):
<0.2 → Poor – Misses core intent; largely irrelevant or incorrect.
<0.4 → Weak – Partially relevant; significant omissions or errors.
<0.6 → Fair – Covers main points but lacks completeness or precision.
<0.8 → Good – Mostly accurate; minor gaps or deviations.
<=1.0 → Excellent – Fully aligned; precise, comprehensive, and faithful to intent.
Here is the scenario list (prompts are much more detailed):
Where we see "top" models drop way down in score when given longer tasks.
That being said, I've had a reasonably pleasant time with GLM-5.2 so far. (And have had an OK time with DeepSeek as well.)
By the time I'm done testing all the Chinese models, they'll be obsolete :)
wongarsu 3 hours ago [-]
It's also third best overall on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable.
That's the one benchmark that allows LLMs to answer "I don't know" and punishes them for trying to bullshit their way through the questions
m-dot-reviews 2 hours ago [-]
For anyone who's interested, I've put together a simple site for sharing ratings/opinions on models at a task-specific granularity. https://model.reviews/
The idea is that benchmark score comparisons are useful for a large cross-product comparison across models + their settings, but less useful if you're looking for the best model for <your-specific-task>. So I thought having a place to review and comment could be beneficial to people.
I'm not sure how best to get the corpus bootstrapped (i.e. people will likely only visit/post on the site if there's already activity), so posting it here for anyone who'd like to contribute.
jayess 6 minutes ago [-]
I asked z.ai what z.ai is, and it said "It seems you might be referring to xAI, as "z.ai" isn't a widely known or major AI company or platform at this time."
alansaber 1 hours ago [-]
These open source models need better multi-turn capabilities. They are always lacklustre in "agent mode". Whether it's just less RL, whatever, it's a worse "product". Whereas it feels like the frontier labs have been all-in on "agentic" multi-turn reasoning for a long time now.
rahidz 4 hours ago [-]
Correct me if I'm wrong, but neither DeepSeek nor GLM have image input modality. This makes them less useful when looking at UIs, photos, screenshots, etc. doesn't it? Or do they have alternate ways of doing so?
segmondy 1 hours ago [-]
DeepSeekv4+ will have image capability, they said so in their paper. GLM whenever they decide to. Both companies have they tech and for whatever reason haven't decide to prioritize it. Both of their OCR are SOTA among all OCR models closed or open. GLM demonstrated they know how to do this, with GLM-4.6V.
0xbadcafebee 1 hours ago [-]
Configure a subagent in your coding harness for vision, add a prompt about the vision use, configure a vision model for it, modify your main agent's prompt to use the vision subagent for vision tasks. Now your non-vision model has vision support.
dryarzeg 3 hours ago [-]
Yes, you are right (as far as I'm aware). For things where you need the LLM to look at screenshots, photos or other images you can use Kimi-K2.6/K2.7 - comparable pricing, somewhat comparable performance and quality. You can even probably combine two models (e.g Kimi and GLM) in one agent, using Kimi for multimodal inputs and GLM for everything else, although 1) I'm not sure if this will not cause some kind of context poisoning with low-quality patterns for better performing model (e.g. in some cases Kimi may be worse than GLM, but GLM, when following up, may adopt the same reasoning patterns as Kimi, undermining it's own performance), and 2) I'm not quite sure if it's possible with the tools currently available (I'm not really into agentic or chatbots stuff to be honest).
mordae 4 hours ago [-]
They do not and it sucks for certain tasks.
It also means that if they actually trained with vision, they'd be on par with Anthropic models as vision seems to improve model performance across the board even for non-vision tasks.
freigeist79 15 minutes ago [-]
it helps giving them a cli vision tool (curl to openrouter vision model for example)
osti 3 hours ago [-]
Many other open source models have vision but they don't compare to GLM in terms of coding quality. So I don't think it's because of vision that the frontier models are better, it's more that they are probably just much bigger models.
adrian_b 4 hours ago [-]
That's right, but there are other recent open weights and relatively big LLMs that are multimodal, e.g. MiniMax-M3.
With open weights LLMs, it is affordable to use many different models, each for whatever it is better.
Moreover, for analyzing "UIs, photos, screenshots, etc." there are small models that can be run locally on smartphones or laptops, e.g. IBM granite-vision-4.1-4B, certain Google Gemma 4 variants and certain Qwen variants, whose output you can use as input for a big LLM, in order to accomplish some more complex task.
Havoc 3 hours ago [-]
They have a separate VL model but never tried it
robertwt7 59 minutes ago [-]
what is that moodboard and chart of hypertension in the middle of the article that isn't explained?
This is a great step up in open models however the pricing to support z.ai is not far cheaper than Claude / OpenAI subscription
_pdp_ 4 hours ago [-]
I am helpful.
DeepSeek V4 has been quite amazing in our workloads and it operates at a fraction of the cost. I have not tried GLM 5.2 but it seems that it hits a sweet spot.
LUmBULtERA 1 hours ago [-]
Your system prompt is showing.
kreddor 12 minutes ago [-]
Maybe he meant "hopeful"...
davidwritesbugs 4 hours ago [-]
I like their models, super cheap - I'm a Lite plan subscriber, and subjective performance seems to be same as lower Anthropic models, useful for lots of grunt work.
The problem is that Ziphu really __really__ struggle with capacity - everyone is complaining of timeouts or very slow speeds. I can't get direct access to the model though I see it is in OpenRouter so I may play. But the capacity issues means DeepSeek is my main provider these days
30 minutes ago [-]
dizhn 2 hours ago [-]
FYI.. This is coming with 3mil GLM 5.2 tokens right now. (Needs login. Google SSO fine) https://zcode.z.ai/en
Alifatisk 42 minutes ago [-]
Where can I read more about the coming 3mil GLM 5.2?
ramon156 4 hours ago [-]
I've made a comment before that 5.1 will sometimes get stuck looping over a simple decision or statement. It will basically contradict and then not realize that one option is the definite option. Sometimes it's two statements that aren't even exclusive. Nonetheless, a lot of tokens that get wasted from this.
I haven't extensively used 5.2 yet, but it seems a lot better.
KaoruAoiShiho 1 hours ago [-]
This is really held back by one bench (omniscience accuracy) where it's really very far behind otherwise i think it's got at least a couple of points higher.
Pragmata 3 hours ago [-]
So this basically means we will have a near opus level model able to be run locally in the next couple of months right?
QWEN 3.6 27b is already pretty good, but it should be possible to get a better option now that runs in the same hardware, right?
segmondy 2 hours ago [-]
Why wait for the next few months? There are plenty of better models that you can run today locally. Qwen3.5-397B beats Qwen3.6-27B. MiniMax2.7 is a longrun horizon monster. (I haven't given 3 much of a try yet). KimiK2.6/2.7, MiMoV2.5/MiMoV2.5-Pro and GLM5.1 will wreck Qwen3.6-27B any day on any task.
Oh, or you meant a smaller model than GLM-5.2 with similar capabilities?
segmondy 2 hours ago [-]
Probably not. Qwen3.(5|6)-27B seems like an "accidental freak". I'm not even sure they know what they did to create that. A decent amount of the team members left after that, so unfortunately, we might not be seeing another small model that packs such a punch for a while. Hopefully the team is studying their entire training recipe for that and is able to replicate. If they are, then a 50-70B dense model might give us such capabilities...
Pragmata 3 hours ago [-]
Yep! I'm running things locally on a RTX5080 + RTX1060 + 64GB DDR5 ram, and would love to get a more capable model if possible!
QWEN3.6 27b is pretty good, but i can still notice some spots where it's not as good as the frontier models.
piterrro 1 hours ago [-]
DeepSeek v4 pro is still 10x cheaper than GLM-5.2 and the quality is still enough for 95% of coding tasks.
0xbadcafebee 1 hours ago [-]
....so use DeepSeek v4 Pro for 95% of your coding tasks, and GLM 5.2 for the other 5%? You don't need to stick to one model.
enraged_camel 1 hours ago [-]
People always say stuff like this, but it is misleading. The reason it's misleading is because that remaining 5% makes a huge difference, and is where most of the value of using AI agents lies.
I'm not interested in using AI to write code that would have taken me 5-10 minutes to write myself. I use AI to debug complex bugs and develop large features that span multiple domains - stuff that normally takes hours, if not days/weeks. A model that is "enough for 95%" does not cut it for that, because the failures compound during long-horizon tasks and the thing becomes a mess.
JustSkyfall 2 hours ago [-]
The problem with these benchmarks is that the Chinese models tend to be incredible on paper, and absolutely terrible in practice :/
CuriouslyC 1 hours ago [-]
This was a problem with older Qwen/MiMo/Kimi models mostly. GLM has always been on the more robust side, and newer iterations from all those labs have improved as well. The only lab I've seen regressing this way is DeepSeek, 3.2 was fairly robust but 4.0 feels more benchmaxxed.
bel8 1 hours ago [-]
I beg to differ. I replaced a $40/mo GitHub Copilot subscription where I used Opus 4.6 and GPT 5.5 with a $10/mo opencode Go plan where I use mostly DeepSeek V4 Flash and testing MiMo 2.5.
I work on mid-sized projects currently (200k to 1kk lines of code).
Alifatisk 47 minutes ago [-]
> 1kk lines of code
Isn't that a million?
bel8 27 minutes ago [-]
Yep. I consider up to a million lines of code as mid-sized.
When I worked in banking, the codebases were often larger than a million.
Mashimo 2 hours ago [-]
I have used GLM since version 4.8 I think and do enjoy using them. More then other models like Kimi or Deepseek. Though only tested them on smaller private projects.
Alifatisk 44 minutes ago [-]
> I have used GLM since version 4.8 I think
You probably refer to GLM-4.7
segmondy 2 hours ago [-]
You are obviously lying because it shows you have no experience with. GLM since 4.5 have been crushing it. all their models since then haven't skipped a beat. 4.5/4.5-air, 4.6, 4.7, 4.8, 5, 5.1. That aside, MiMoV2.5, MiniMax from 2.0, DeepSeek from V3, Kimi since V2, Qwen since 3, Hy3 have all been amazing models. All from China, we need to get over it. China is not losing yet as far as the AI race is concerned.
Alifatisk 45 minutes ago [-]
Is there a GLM-4.8 model?
jingpostmedia 2 hours ago [-]
[flagged]
zftnb666 2 hours ago [-]
Open-weight models are winning. The gap with closed models is now measured in months, not years.
creamyhorror 4 hours ago [-]
It's a real step forward, getting closer to SOTA. It seems to be very epistemically cautious in its reasoning. I hope Deepseek and the other open-weights labs stay in the game and catch up too.
Havoc 4 hours ago [-]
It’s pretty good. More talkative than 5.1. Reminds me of deepseek 4
Their servers are melting though - getting more timeouts etc
nh43215rgb 4 hours ago [-]
> GLM-5.2 sits off the most attractive quadrant on the Intelligence vs Output Tokens chart.
That is unfortunate...
lousken 4 hours ago [-]
Cerebras really needs to have this on their API list (if they even still exist).
Marciplan 4 hours ago [-]
they went public a few weeks ago
lousken 3 hours ago [-]
That's cool and all, but they are still on GLM 4.7
0xbadcafebee 45 minutes ago [-]
Which is fine for their target market. Their latest model is Kimi K2.6, available to enterprise customers. But older models become more powerful when you have time to do more reasoning. Also many applications don't need advanced models. Cerebras is making bank from all the other use cases that SOTA providers left on the table by focusing on 0-shot intelligence over speed
eckelhesten 3 hours ago [-]
Sure, but whatever you do, don't buy their (Z.ai) lite plan.
I feel like i threw 15 dollars in the sea. I'm getting rate limited after 3-4 prompts. You get way less value than just paying 25 dollars for Claude or OpenAI models.
I had the Lite plan, I NEVER maxed out the quota because I considered these things. If I, for example, switched over to GLM-5-Turbo, then I could've easily burned through quota.
granra 2 hours ago [-]
How are you using it? I have the lite plan and I've only ever maxed my weekly usage a few hours before reset. I will concede that I'm not a super heavy LLM user but it's been really good for me.
My workflow is usually:
- read file. I want to achieve X, how do? Do not implement anything.
- I would do a, b and c
- sketch a brief implementation of your suggestion
- <code> (not writing files yet)
- instead of your approach x, wouldn't it make sense to instead do z? What would that look like?
- <code>
- nice, implement this
- starts writing files, run tests, etc.
eckelhesten 2 hours ago [-]
Try pointing it to a small codebase, or even ask it to conjure information found online.
You'll see that it quickly gives up. Thing is, they seem to count cached hits as if they were the non-cached tokens.
I wont be subscribing again thats for sure. I am not paying iPhone money for a Xiaomi.
Computer0 1 hours ago [-]
Regrettably I haven’t tried 5.2 yet but 5.1 I did not see as anything special. In practice I found it to be ~70% as good as Claude sonnet.
4 hours ago [-]
sourcecodeplz 3 hours ago [-]
1m context btw.
Alifatisk 1 hours ago [-]
And apparently, actual support for 1M context window, not just theoretical.
dsrtslnd23 3 hours ago [-]
looks like I need a GB300 workstation
Imustaskforhelp 3 hours ago [-]
I have been trying out GLM 5.2 and I am really impressed by it for the most part.
To all people on Hackernews, I am curious as to what agent harness are you using it with.
Previously I was using opencode and then I switched to using Opencode + obra/superpowers and creating custom skill.md themselves for it. I found things to take more time and intervene more but the result of it has been that I have found it to work better.
Now I have also started using oh-my-pi as well and I found it to be faster compared to Opencode.
I am unsure how much of there is a difference to it and how much of things are placebo but what is your opinion regarding the best Agent harness for GLM 5.2?
Alifatisk 60 minutes ago [-]
I just used CC with GLM, I was satisfied.
hit8run 3 hours ago [-]
Ok, it is nice to see another great open source model. Not sure what to think of all these benchmarks but GLM was already quite strong before so an update is very welcome.
kissgyorgy 3 hours ago [-]
I tried it today through Openrouter and the API is atrocious. I got multiple rate limit and random errors every turn.
Somebody wrote [1]; "I am never touching Minimax or GLM again. Their APIs had constant outages and I had to restart my runs multiple times — after burning money on the runs that failed midway." and I 100% agree.
The model might be good, but if the API is so bad, it's effectively useless.
Give it a few days and additional provider will be up and available on OpenRouter. Then the game of figuring out who’s not nuking the weights and neutering the quantization begins.
segmondy 2 hours ago [-]
The entire point of this post is that it's open weights, you can run it yourself and don't have to deal with the API issues. You really do have that choice.
ac29 33 minutes ago [-]
You could subscribe to Anthropic/OpenAI for the rest of your life for the cost it would take to host GLM5.2 locally - you need 1.5TB of VRAM just for the weights
zozbot234 20 minutes ago [-]
You don't need that much VRAM unless you're targeting a high-performance deployment that's intended to scale far beyond local use. For a lower-throughput case, you can keep the model weights on SSD at very low cost and stream them in for inference. This could actually scale reasonably well if you have something as simple as a previous-gen HEDT with a decent amount of PCIe lanes to host fast storage from.
Havoc 3 hours ago [-]
That’s what happens when you offer something decent at a fraction of the price of opus - more demand than you can serve
osti 3 hours ago [-]
I indeed got a few timeouts yesterday using the official API, I imagine for the coding plan users it'll be even worse.
I know it's hard to improve on that, but now that their models are good enough at raw intelligence, I think this should become a higher priority task.
Currently on https://artificialanalysis.ai/#output-tokens GPT 5.5 xhigh spends 16k tokens total on average, GPT 5.5 high is 10k, Fable 5 33k, Opus 4.8 41k, GLM 5.2 is 42k. GPT 5.5 is extremely reasoning efficient.
Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.
If you want reasonable token usage, you need to run it GLM 5.2 at High. There is little drop in quality from Max to High (for most tasks). And it cuts token usage by 2 a 2.5x. GLM 5.2, Max is really something you only need for complex tasks.
In essence, GLM 5.2 is Opus 4.8 its little brother, at a way, WAY cheaper price.
There has been really no training on Opus models going on, really, none i tell you! /sarcasm
Sarcasm, considering the source of their own training data?
Though only in particular situations, like when it’s done to them and not when they do it. Cause they have the power and are morally right and know better than you. And if you question this at all, well you’re a threat to American values and a supporter of the Chinese and leading to the break down of Democracy.
This isn’t a type of reasoning argument or manipulation tactic used by the rich throughout history to trick the naive and gullible masses or anything like that. Trust me, I’m rich and I’m morally right. /sarcasm
To point where I stop it and simple tell it to “start writing code you can work it out as you go along”
Seems writers block also effects LLM
In this paper they nerf an LLMs ability to emit waffling thinking tokens like "wait", "but", "alternatively", and the models (they're old, small models in the paper) terminate reasoning faster and perform better. I bet Anthropic is tuning this on their backend.
Just output the code and we’ll work through it!
I feel similarly about having codex review claude’s plans. I don’t think I’ve ever seen it catch a major issue. It just points out things that would have inevitably been addressed during implementation anyway.
Another thing I tell Claude to do is to not guess, but look at documentation, it messes up a lot less, might use some tokens reading docs, but at least it has a higher success rate code wise.
https://www.reddit.com/r/ClaudeAI/comments/1psxuv7/anthropic...
Also just think about it, why would a model trained on the world’s corpus of text (that isnt formatted in xml) perform better with XML? It would be a better study if that post tested markdown, org, xml, json, etc. 10 times to see if their is a difference
It's clear it was the vibe coding model, as like no other model before, fully turned you into his assistant instead of the other way around.
Per AA, while K2.7 Code is roughly on par w/ K2.6 in terms of intelligence, it uses half the output tokens to get there.
[1] https://z.ai/blog/glm-5.2
Low nailed the overwhelming majority of mundane tasks on it's own, medium was good for more complex stuff.
GLM5.2 ends up being far more expensive than I thought it would be when I tried it on openrouter. I ground through $5 USD worth of tokens quite quickly.
And this was high, not max.
All it does is pull a json from their main table page and parses it with the fields I care about (coding).
There used to be a mailing list associated with it but eh ... there wasn't much interest. I use the script every day though.
Current partial output
To see everything, run it like so The repo: https://github.com/day50-dev/aa-eval-emailsome key takeaways:
* open models are on about a 4-7 month lag right now depending on how you want to measure it
* if this keeps up, you might see an open-weights model doing claude fable 5 level work before the new year.
if people sign up for the free mailing list (that just does this) I'll go and put it back on ... emails when new model evals drop - it was pretty useful.
If you really want to see all of them:
https://day50.dev/output.txt
Or run the script
- GPT 5.5 consistently the best, an opinion who gets me constant downvotes here by the Anthropic Marketeer strike force...
- China is going to eat the US lunch on AI
- What have European universities and companies been doing? Its like if, on a parallel past/future, Nikola Tesla and Edison would have created flying Cyberpunk machines, while Europeans researchers, would be getting together to request EU funds, for investigation on how to breed faster horses.
- If Zuckerberg could be fired, after spending a total of $235 billion on AI and having NOTHING to show for...should he be fired?
Mistral is clearly currently not competing for Frontier Model. Whether this is due to a lack of VC Funds or a lack of technical ability or the former arising from the latter would be interesting to know.
The top models are from startups. Among the FAANG only Google managed to get a Frontier model, and they litterally invented the architecture and have more money than they can possibly spend to throw at the problem. Facebook shows that even ungodly amounts of money don't get you there though.
So why did no EU based Startups succeed while two US start ups succeeded? I agree that that's a very important question the EU should ask. The Internet revolution was driven by US companies, and now AI will be as well, with Chinese Open Weights mixed in. The EU consistently can not turn its considerable economic output into fast moving tech firms.
They've got a heap of contractors working to help industry adopt LLMs. It is just classic consulting work, and they'd look like a really great company if we weren't comparing them to literal $2T+ companies losing money hand-over-fist...
[1] https://apertvs.ai/pages/about/
They had Watson, remember, it won on jeopardy like 15 years ago? They've been at this for a long time
Maybe it's good at something else?
They had to start from scratch, but dont seem to have the management to be smart enough, to stop doing it in house. They could have just acquired a startup that could build a frontier model.
What is also very ironic since their whole bussiness for the last 15 years, has been buying companies a la CA Associates...
Their previous Watson branding and collapse of Watson expectations cost them one CEO, but the current CEO was part of the same team. They just dont learn....
ETH Zurich and EPFL universities recently put out an open model called Apertus (was on the HN front page a few months back), it's not a frontier model, but they built it properly regarding copyright and data transparency.
It might look a bit slow or old-fashioned, but focusing on doing things ethically and legally feels like a much better path than just joining the race to scrape everything.
Doing things with ethical intentions does not necessarily produce outcomes that are beneficial for society at large.
Might also just be the result of "good will" (that the company has deftly fostered). Other companies might learn from Anthropic in that regard.
Also what are they building it for? I'd think it's to serve ads better or something like that. Maybe Muse Spark fits facebook's needs perfectly...
I mean that is the smart move here. Focus the model on optimizing the core business. For Meta, that's not coding tools.
They will forever have superior weights?
And Zuck hasn't spent that much on AI yet. Half of that is projected spending for 2026.
As to whether it's all for nothing, Q1 2026 revenue was up 33% over Q1 last year, driven largely by...better AI-driven ad targeting. So the spending doesn't seem that crazy to me.
But if that's your thing, here you go: https://github.com/day50-dev/aa-eval-email/commit/1853be6461...
add an argument (any argument) and it will be sorted as your specified. It just works as a toggle flipping the order ... so literally any string will do.
The original link has been updated accordingly with the new code.
That aside, this is a good script you're running. Thanks.
Setup a fresh new large monitor. Open CLI. Run command. Watch output at the bottom of your screen. Keep watching the bottom of your screen for the rest of the day.
Sure you can tile windows and it helps but come on. Just have the command/input section in the bottom and the “output” on top. Keep the command bit on the bottom.
https://artificialanalysis.ai/agents/coding-agents?coding-ag...
I thought I was "holding it wrong" until DeepSWE came along -- personally it seems to match my own experiences pretty well. Really makes me wonder how legitimate some of the internet noise is about open models. There's surely some use cases for them, not everything needs the absolute frontier (GPT5.5 on low is awesome), but if you want to be near the frontier everyone needs to be honest about the fact that we're only talking about Opus, Fable, GPT5.5.
[1] https://z.ai/blog/glm-5.2
openai, google and anthropic subscriptions are not available with privacy.
looking at the link there it's interesting that going from cursor cli to codex cli take gpt 5.5 from 7th to 3rd. but they didn't do open model in codex.
so, hard to say it's for sure a model benchmark. maybe open models are just shit at swe agent harness...it's not the most parsimonious explanation though.
Unless you're running it locally, aren't you just trusting some other entity?
https://deepswe.datacurve.ai/
Fable 5 is cool and all, but we have not yet seen GPT-5.6.
It's easily 4x the cost of DeepSeek V4 but I didn't actually feel the results were that much better. I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.
Having better luck with MiniMax M3, from a cost/benefit ratio.
With a good harness, that's my favorite model for any personal project. I use Opus 4.8 at work because i don't have to pay for it and of course I love it, but DeepSeek is like 80% there for one tenth of the price.
GPT can find fault in everything and anything including its own work.
That's actually pretty uncommon these days. All of the OpenAI/Anthropic/Gemini models accept images, and so do the other leading open weight families - Gemma 4, Qwen 3.6, Kimi 2.x.
In GLM's case image input would be useful because it's a model that scores very highly for tasks like web design, but without image input it can't take a screenshot and output HTML+CSS.
Don't get me wrong, GLM is a phenomenal model, but the image thing is a bit of a gap.
Even the local models I run on my Mac are getting surprisingly good at that now.
This is a huge blow to Anthropic/OpenAI/Google and a massive win for the rest of the world. The official API prices and speeds mean nothing for open source models.
Looking at openrouter [1], some of the cheaper offerings are for quantized models. Not sure how much intelligence is lost in quantization. And they are not 3 times cheaper. Where did you find 3x lower prices for APIs? I am considering skipping open router and using them directly for that price.
edit:
I see, croft [2] 8bit for $0.50/$0.08/$2.20
[1]: https://openrouter.ai/z-ai/glm-5.2
[2]: https://ai.nahcrof.com/pricing
I do not have GLM 5.2 numbers because the whole default max setting is overkill. But GLM 5.1 numbers had it at 12x cheaper then API rates. And about 2.5x more tokens vs zai their own subscription service.
Yes, its FP8 but lets be honest, do we know for sure that even zai runs at FP16? I learned a long time ago with Claude and Codex how much cheating happens on model levels, even from the big boys.
(there's a table which shows comparison between vendors)
Also, it seems there's a general one as well (for all kimi models?): https://github.com/MoonshotAI/Kimi-Vendor-Verifier
Wasn't this released like 2 days ago? Everyone is still evaluating and playing around with it, things like the submission is just starting to come out. Give it some days at least before jumping to conclusions, ideally weeks.
I've tried a number of these, and the learning curve is very steep compared to "install Claude Code and pay $100/mo". There is no way saving me $50/month matters compared to figuring that out.
https://docs.z.ai/devpack/tool/claude
Here's my setup. I add this to my .bashrc
export ZAI_API_KEY="your_key_here"
alias claudez='ANTHROPIC_AUTH_TOKEN="$ZAI_API_KEY" ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic" ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]" ANTHROPIC_DEFAULT_SONNET_MODEL="glm-4.7" ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-4.7" claude'
Then I just run claudez
pro tip the same thing works with deepseek https://api-docs.deepseek.com/guides/anthropic_api
Even more pro tip: Claude Code can set this up for you haha
Unless this were a massive differentiator, people aren't going to be "talking about it" the way GP suggests!
1) You haven't even heard of it.
2) You have to know to look for both GLM and Z.ai. These are usually in the same article when reporting about GLM is written, at least.
3) You have to understand there could be a benefit in trying it; you have to want to try it for some reason. Their own blog post puts it below Opus 4.8 in each of the three benchmarks they used.
4) You have to figure out the pricing, which isn't obviously in the blog post...
5) When I first went to Z.ai, I got an error popup (not logged in): "You do not have permission to access this resource. Please contact your administrator for assistance." I am using a personal computer...
6) When I typed something in the resultant field and pressed enter, I got "Clear Current Chat? To start a new chat, your current conversation will be discarded. Sign in to save chats"
I think today's article helped with 1 and 2, which helps their top of funnel. But they're fighting a big uphill battle.
I'd pay for an out of the box solution. i.e. an Installer with updates
There's ZCode (https://zcode.z.ai). Which is like the Codex App.
That's as "easy" as it is for non-devs that you're complaining about.
Do note that GLM is not multi modal, which can be a deal breaker. And these open models are not good outside coding.
I wish I had the time to set it up and work on side projects but unfortunately life and work have been crazy (as I'm sure many here feel). That's why I asked for anecdotes about it.
https://github.com/QuantiusBenignus/Zshelf/discussions/2
Not accounting for hardware, of course :)
Nvidia GPUs are much more efficient than Apple hardware for inference(and training).
Not accounting hardware in my costs, since I didn’t buy my hardware for running models. Running models is just something it can do in addition to what I got it for.
link?
> Why
imho everything but opus produces unusable code (fable was even better...), eg gpt5.5 seems to write the absolute worst code that still technically solves the problem; tbh I'd be totally willing to trade "raw intelligence" for "code taste"
more labs need to figure out whatever anthropic did to destroy everybody else on frontiercode bench
Honestly it's good enough that I feel comfortable recommending a Z.AI sub + a $20/mo OpenAI sub for all but the most AI pilled multi-orchestrators, or the die hard Claude fans. GLM writing + GPT reviewing/debugging feels pretty unlimited and minimally worse than just doing everything in GPT with the $200/mo plan.
This is honestly what I care bout the most now, which is how well they can write. I think we have reached a point now, if you know how to program, you can provide enough information for the models to pretty much do what you need.
What they still struggle immensely with is the writing which has too many nuances but they are truly getting better.
Discovered today that they set reasoning effort to max by default. So that’s probably why
Review the commits with both Claude and GPT 5.5 Xhigh. You can see that Fable is still sloppy(er) compared to GPT. You can test it the other way around as well(drive the dev with GPT and review with GPT and Claude). You get the same result Claude has an edge though and that’s on building more beautiful user interfaces.
It’s expensive, and not as capable as the frontier models, but would have some pretty big benefits around privacy and agency.
Not everyone is willing (or even legally able) to send their trade secrets to OpenAI or Anthropic
There they can deploy these models while using the existing legal frameworks.
We’re approaching a world where running a primer frontier model is possible on a workstation, probably will have something under $30k that looks like a desktop for Nvidia’s next generation. It sounds expensive, until you look at your Anthropic bill.
It’s similar unit economics as could computing for the open models. You can save a ton on the expenses by buying the hardware, but it requires a lot of in-house expertise, and you get the most value if you keep the system operating around the clock. The big kink is open models are usually 2 quarters behind frontier, and your competitors are probably trying to get access to mythos.
But prices are changing rapidly, and not for the better
Your usage will peak during certain timezone work hours(even if you are a huge multinational company most of your engineers/users tend to be from only a few locations), so then you have a bunch of gpus doing nothing the rest of the day. especially with latency sensitive stuff, this is a decades old tradeoff problem, its not unique to llms
Would need to be a pretty determined medium biz
I'm pretty skeptical, especially given typical utilization patterns. Do you have numbers, or this is just vibes?
Years.
Even Microsoft said they don't have enough for Github and need to call Amazon.
Getting a few even at decent prices is hard. Unless the shortages goes down...
am i missing something?
We have no proof in either direction, it's not like we had access to their financial numbers in details.
And the pricing itself muddies the water, as input tokens that are already in the KV cache are practically free for the provider, whereas other tokens are expensive. So they could still make money overall thanks to people having multi-turn conversation (and as such, paying multiple times for the same token), but lose money on actual compute done.
> there are lots of third party hosting services that will still run at breakeven/profit.
How can you be sure that they are making profit directly from token price, and are not billing at marginal cost (i.e. electricity price, without counting the cost of the GPUs) and aiming to make a profit later on from the valuable training data that they are collecting in the process?
You are free to believe that they are doing all this. Or you can simply believe the intuition that models are getting cheaper by the day. I can run Gemma 4 31B from my laptop today.
[0]: https://aibenchy.com/compare/deepseek-deepseek-v4-flash-high...
This means, that models are losing more and more general and domain-specific knowledge.
Look at those graphs on ARtificialAnalysis, GLM-5.1 still performs similarly or better:
AA-Omnisicence Accuracy: https://i.snipboard.io/5DYmpx.jpg
IFBench: https://i.snipboard.io/74kg0R.jpg
I still feel like models are not getting any smarter for a few months already, they just changed their training to be focused more on some areas than others, so shifting the intelligence from one place to another, not necessarily increasing the overall intelligence or "AGI" score.
Here are the results compared to Gemini 3.5 Flash:
Although it is cheaper, it is significantly slower, and results are worse overall. Surprisingly - high reasoning produces less code errors than gemini 3.5 flash, but when I actually look at the models they are worse.Edit: I recently ran evals with Kimi 2.7 and MiniMax-M3 and this is clearly open source SOTA model, by far.
I think there's probably some good juice to squeeze in terms of spacial awareness by doing a benchmark something like
- give 3d modelling task
- render and snapshot from a variety of angles
- feed to third-party vision model for a "what is this" type query
- grade on end-to-end accuracy
Bonus points for asking the vision model something like "how beautiful is this 1-10".
I was benchmarking using a soon to be released new version of my AI CAD modeling software[0]. It's basically an agent that has access to tools that can execute build123d scripts, get sculpted models, blender to combine sculpts + parametric models, tools to inspect the model (visually and with code), search datasheets, ...
I tried what you recommend a while ago (asking an AI to evaluate using different angles) and the AI evaluations were extremely bad - barely any correlation to what I scored. Things have gotten better, but I don't trust it enough yet.
Here is how I score adherence (and how AI did as well, but I tried methods where it would just give back a boolean "pass" or not):
Here is the scenario list (prompts are much more detailed): [0]: https://grandpacad.comExcited to see if this turns out to be a Open Weight Opus 4.5 or better.
I've had models that benched poorly but performed great. And I constantly see models at near the top of AA, which are terrible.
There doesn't necessarily seem to be a lot of overlap between benchmarks and real world usage. (Let alone common sense!)
As far as they go, though, these harder benchmarks match my experience more closely:
https://deepswe.datacurve.ai/
and https://cognition.ai/blog/frontier-code
Where we see "top" models drop way down in score when given longer tasks.
That being said, I've had a reasonably pleasant time with GLM-5.2 so far. (And have had an OK time with DeepSeek as well.)
By the time I'm done testing all the Chinese models, they'll be obsolete :)
That's the one benchmark that allows LLMs to answer "I don't know" and punishes them for trying to bullshit their way through the questions
The idea is that benchmark score comparisons are useful for a large cross-product comparison across models + their settings, but less useful if you're looking for the best model for <your-specific-task>. So I thought having a place to review and comment could be beneficial to people.
I'm not sure how best to get the corpus bootstrapped (i.e. people will likely only visit/post on the site if there's already activity), so posting it here for anyone who'd like to contribute.
It also means that if they actually trained with vision, they'd be on par with Anthropic models as vision seems to improve model performance across the board even for non-vision tasks.
With open weights LLMs, it is affordable to use many different models, each for whatever it is better.
Moreover, for analyzing "UIs, photos, screenshots, etc." there are small models that can be run locally on smartphones or laptops, e.g. IBM granite-vision-4.1-4B, certain Google Gemma 4 variants and certain Qwen variants, whose output you can use as input for a big LLM, in order to accomplish some more complex task.
This is a great step up in open models however the pricing to support z.ai is not far cheaper than Claude / OpenAI subscription
DeepSeek V4 has been quite amazing in our workloads and it operates at a fraction of the cost. I have not tried GLM 5.2 but it seems that it hits a sweet spot.
I haven't extensively used 5.2 yet, but it seems a lot better.
QWEN 3.6 27b is already pretty good, but it should be possible to get a better option now that runs in the same hardware, right?
GLM-5.2 is already close to Opus-4.7 level:
https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...
QWEN3.6 27b is pretty good, but i can still notice some spots where it's not as good as the frontier models.
I'm not interested in using AI to write code that would have taken me 5-10 minutes to write myself. I use AI to debug complex bugs and develop large features that span multiple domains - stuff that normally takes hours, if not days/weeks. A model that is "enough for 95%" does not cut it for that, because the failures compound during long-horizon tasks and the thing becomes a mess.
I work on mid-sized projects currently (200k to 1kk lines of code).
Isn't that a million?
When I worked in banking, the codebases were often larger than a million.
You probably refer to GLM-4.7
Their servers are melting though - getting more timeouts etc
That is unfortunate...
I feel like i threw 15 dollars in the sea. I'm getting rate limited after 3-4 prompts. You get way less value than just paying 25 dollars for Claude or OpenAI models.
I had the Lite plan, I NEVER maxed out the quota because I considered these things. If I, for example, switched over to GLM-5-Turbo, then I could've easily burned through quota.
My workflow is usually:
- read file. I want to achieve X, how do? Do not implement anything.
- I would do a, b and c
- sketch a brief implementation of your suggestion
- <code> (not writing files yet)
- instead of your approach x, wouldn't it make sense to instead do z? What would that look like?
- <code>
- nice, implement this
- starts writing files, run tests, etc.
You'll see that it quickly gives up. Thing is, they seem to count cached hits as if they were the non-cached tokens.
I wont be subscribing again thats for sure. I am not paying iPhone money for a Xiaomi.
To all people on Hackernews, I am curious as to what agent harness are you using it with.
Previously I was using opencode and then I switched to using Opencode + obra/superpowers and creating custom skill.md themselves for it. I found things to take more time and intervene more but the result of it has been that I have found it to work better.
Now I have also started using oh-my-pi as well and I found it to be faster compared to Opencode.
I am unsure how much of there is a difference to it and how much of things are placebo but what is your opinion regarding the best Agent harness for GLM 5.2?
Somebody wrote [1]; "I am never touching Minimax or GLM again. Their APIs had constant outages and I had to restart my runs multiple times — after burning money on the runs that failed midway." and I 100% agree.
The model might be good, but if the API is so bad, it's effectively useless.
[1]: https://kasra.blog/blog/i-spent-1500-seeing-if-llms-could-ha...