Qwen3.7-Max: The Agent Frontier | 科技资讯

Qwen3.7-Max: The Agent Frontier(qwen.ai)

655 分 | 作者 kevinsimper 1天前

36 条评论

goldenarm 22小时前
The non-hallucination rate in AA-omniscience is SOTA, better than Opus 4.7, Gemini 3.1 Pro and GPT5.5! Congrats to the team
[-]
- throawayonthe 21小时前
  referencing this:
  https://artificialanalysis.ai/evaluations/omniscience?models...
  (had to add it to the chart, wasn't displayed by default. is it the lowest rate in the datasetor no?)
  [-]
  - jampekka 16小时前
    This counts only incorrect answers though. A model can get 0% hallucination rate just by refusing to answer all questions.
    [-]
    - ffsm8 15小时前
      Isn't that precisely the reason why we introduced the term hallucination? Because llms have historically always made up bullshit of they cannot answer directly... If they now nailed this to maybe the model not respond instead of responding incorrectly, then a lot of previously unusable usecases would become feasible.
      So I feel like that's exactly the right metric and the way to track it wrt hallucinations.
      [-]
      - doublescoop 13小时前
        I had a buddy in high school that was notorious for doing the same thing. (He's now a senior director at a Big 4 consultancy. :) )
        [-]
        rrgok 5小时前
        Do you mind expanding a little more?
      - akoboldfrying 8小时前
        The point is that it's not a useful metric on its own. For example, redirecting from /dev/null also achieves a zero hallucination rate.
        We want the hallucination rate to decrease while the overall answer rate of queries remains sufficiently high. For more specifics, look into ROC and AUC.
    - jug 13小时前
      I think that's what the Omniscience Index is for:
      https://artificialanalysis.ai/evaluations/omniscience#aa-omn...
      It rewards correct answers and penalizes hallucinations, and finally no reward for refusing to answer.
      It's interesting just how poorly some popular Chinese models fare in this regard, like GLM 5.1 or DeepSeek 4 Pro.
      Gemini 3.x has truly remarkable knowledge given how it leads in this benchmark despite being (quite a bit) more prone to hallucinate than Claude Opus.
    - Balinares 5小时前
      Yes, that's in fact precisely the desired behavior when a model doesn't know the answer.
    - aicantdeny 10小时前
      > by refusing to answer all questions.
      Cool, precisely the thing other AI is too stupid to do when they don't have the necessary knowledge.
    - speed_spread 15小时前
      Yes. A model that can answer "I don't know" would be much more trustable than the current used car salesman we have now.
      [-]
      - jorvi 13小时前
        Its very annoying this has been in the capability of models since the very beginning. It could check how probable its token values are and if those fall below a certain threshold either say "I don't know", or output the most probable (well, more like least improbable) tokens but give a very clear, very strong warning that it is a shot in the dark and likely to contain hallucinations.
        But no, Google and OpenAI would rather always have an answer ready and tell you to mix glue into your pizza toppings :)
        [-]
        miki123211 10小时前
        It can't, because top n isn't always reliable.
        Hallucination detection is an open problem. If it were that simple, people would indeed "just" do it.
        Basically the problem is that LLMs aren't trained on things they don't know; an alternative way of saying this is that they're not trained on things they're not trained on, which is obviously true.
        When you RL a model and it answers incorrectly, you don't teach it to answer "I don't know", you teach it to answer correctly instead. This makes it very hard for it to realize when it doesn't know things.
        [-]
        chengyongru 8小时前
        Models tend to default to their training data even when they lack sufficient context, they've never been trained to recognize their own uncertainty, so they hallucinate confidently instead.
        tokenscoper 11小时前
        I don't have much to add other than this observation that we seem to have moved away from eating one small rock per day for nutritional value, and adding gasoline in spaghetti.
        The glue on pizza reference brought back memories :)
        nomel 11小时前
        Yeah, I never understood why the top n statistics weren't included in the chat interfaces, to color the text!
- gslepak 20小时前
  > The non-hallucination rate in AA-omniscience is SOTA
  Note that a perfect "non-hallucination rate" is rather meaningless as such tests can contain human hallucinations.
  It means the model aligns with the possibly-true, possibly-false beliefs of the group that made the test.
  [-]
  - rlt 19小时前
    Well, yes, garbage in garbage out. That's a given and not what's meant by "hallucination" in this context.
    [-]
    - tantaman 16小时前
      the observation goes beyond garbage in garbage out. Mainly that we're always operating from some prior and limited understanding. That what may look like a hallucination could be closer to the truth than our current frameworks of understanding allow us to admit. The hermeneutic circle.
      [-]
      - root_axis 8小时前
        A properly designed benchmark won't use tests that leave room for ambiguous interpretation.
      - Jacques2Marais 16小时前
        Interesting. I wonder if current LLMs can break out of human limitations and understand the world more correctly.
  - jcheng 18小时前
    Here are some examples of the questions in the benchmark. If these are representative, they seem pretty cut and dry. https://artificialanalysis.ai/evaluations/omniscience#exampl...
  - areweai 15小时前
    Was there something about this specific model and submission that made you feel compelled to write this self-evident observation?
    Or would you describe your methodology as more like picking a random sentence fragment as an input value then generating completions from your existing corpus without any post-input "learning" process related to the rest of the source material?
  - anti-zionist 14小时前
    [dead]
- girvo 13小时前
  The big question for me having used a lot of these SOTA chinese models is: what is its token efficiency like?
  Running Step 3.5 Flash locally for example, it's an amazingly capable model all things considered, but it's token efficiency is so bad that it gets out performed by most others wall-clock time (even with my MTP-support for it hacked in to llama.cpp: despite being trained on three heads, MTP 2 is the sweet spot, and only gets it from 20tk/s to 30tk/s on my Spark)
  The DeepSeek models and Qwen 3.5 Plus are also good examples of this: compared to Opus, and especially GPT 5.5 they use many more tokens to get to the same answers.
  I'm really hoping that Qwen 3.7 is better in this regard, can't wait to try it out
  (ps. running DeepSeek v4 Flash on my Spark is absolutely wild, thanks antirez if you see this haha)
  [-]
  - nl 11小时前
    Yes it's a big thing that people are slowly becoming more aware of.
    Nvidia models are even worse than Qwen! https://sql-benchmark.nicklothian.com/#token-efficiency-and-... (mouse over the cells for token counts and click for traces)
    Gemma 4 is good for this, as AA notes:
    > Gemma 4 31B is notably token efficient, using 39M output tokens to run the Intelligence Index vs 98M for Qwen3.5 27B (Reasoning). This is ~2.5x fewer output tokens for a model scoring 3 points lower. For context, the other models at the 42-point intelligence level also use significantly more tokens: MiniMax-M2.5 (56M), DeepSeek V3.2 (Reasoning, 61M), and GLM-4.7 (Reasoning, 167M)
    https://artificialanalysis.ai/articles/gemma-4-everything-yo...
- sheepscreek 20小时前
  Truly incredible! Very impressed by their progress. I wonder how much of their own chips did they use for training.
- baq 20小时前
  wonder at which level there's a capability state transition? 5%? 1%?
briga 20小时前
I was getting dangerously close to my weekly Claude Code limit last night so I had Claude set up Qwen3.6 with llama.cpp and OpenCode. Honestly it's a great (free!) alternative to Claude Code--certainly more than good enough for a lot of smaller less complex tasks. I'm excited to try this new version. The fact that open-source models are so close to the frontier is very impressive.
[-]
- pixelesque 18小时前
  Out of interest, what machine and model are you running it on?
  I tried the qwen3.6-27b Q6_k GUFF in llama.cpp and LM Studio on my M2 MacBook Pro 32GB machine last week, and I barely get a token a second with either.
  What sort of speed should I be expecting?
  I tried some of the Llama 3 34b (nous-capybara?) models two years ago with llama.cpp, and I seem to remember getting a few tokens a second then, so not sure if I've got something completely mis-configured, or I just have unreasonable expectations.
  Or maybe qwen 3.x is slower for some reason? (Is it mixture of experts?)
  I'm not expecting it to be instant, but what I'm currently seeing is not really usable.
  [-]
  - gcr 18小时前
    There are two flavors of Qwen 3.6:
    - A 27B "dense" model
    - A 35B "Mixture of Experts" model, which activates only 3B parameters for each token.
    For your hardware, I strongly recommend `unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M`. I have an M1 Max with 32GB VRAM from 2021 that can read at ~300-500 tokens/sec and write at ~30 tokens/sec with llama-cpp's default settings, which is plenty fast. The 27B model can read ~70tok/sec and write ~5tok/sec.
    The 35B MoE model technically takes slightly more memory but is much faster because it's doing 1/9th the work. It's not quite as "smart", but it's comparable.
    [-]
    - flockonus 15小时前
      For coding tasks 27B is reported to be much more effective, altho you can probably only run 4b or 5b quants @ this memory.
      Recommend https://www.reddit.com/r/LocalLLaMA/ as a great source for this type of discussion.
    - pixelesque 17小时前
      Thank you - I'll give that a go!
    - julianlam 17小时前
      May I ask why the M instead of XL?
      Obviously bigger != better but I don't know what the differences are.
      [-]
      - DiabloD3 14小时前
        These are dynamic quants, and they're basically just an indication of how far away from the desired quant it is allowed to go to achieve the goal. Generally, unsloth's toolchain moves quants up, rarely down.
        * _0 and _1 do not use K quant and scales 32x32 blocks according to the original (B)F16 values; _0 scales the block using the original max and min values. _1 does this per row instead of per block.
        * K quants do something similar, but now splits blocks into subblocks inside a superblock where the superblock has min/max scaling, but the subblocks also have scaling in the range of the superblock's scaling and are stored using less bits.
        * K's M, L, XL are just how aggressively the subblocks and their scaling factors are chosen. Generally, it puts a max on how far you can deviate from the chosen quant to maintain the desired quality, but also gives them a bigger budget to perform that excursion in. XL most aggressively tries to preserve the intended quality, while S does the least.
        * Dynamic quant on top of this scales entire layers, full of blocks, according to how much they effect various measurements (such as KLD and perplexity).
        That said, there is no reason K_S is even produced by anyone, same with Q_0, Q_1, and I_NL. People should no longer be using those. M only is meaningful if you're trying to restrict the upper bounds: K_XL can reach BF16 for some weights, but rarely; people think this has a speed implication for hardware that has native 8bit in their tensor units (but it doesn't).
        Unless you're specifically trying to cure a problem, stick with K_XL.
        [-]
        srcrip 12小时前
        You seem to understand this stuff pretty well, any recommendations on resources (blogs, YouTube channels, whatever) for software engineers that want to keep up with this stuff on this kind of level?
        A lot of the content about AI out there is kind of produced to the lowest common denominator. Basically a never ending scheme of get rich quick/passive income kinds of AI content.
        rao-v 12小时前
        Hey some of us are on hardware (gfx906 based Radeon MI50s with 32GB of stupidly fast VRAM and basically no compute) that inference significantly faster with Q_0 and Q_1 quants
  - DiabloD3 15小时前
    I recommend sticking with the dense models for both Qwen and Gemma.
    On testing I've done on same-quant apples to apples, with F16/F16 (ie, unquantized) kv cache, 35B-A3B underperforms against 27B on anything even remotely complex. But yes, 35B-A3B can be like 3-4x faster on my hardware.
    By Qwen's own admission, on any meaningful benchmark (ie, ones that involve logic, math, or tool calling), 27B performs like 122B-10B and 397B-A17B, but 35B-A3B is somewhere between 27B dense and 9B dense.
    Also, MTP recently got merged in, so I'd suggest downloading Qwen 3.6 MTP (I assume you get it from unsloth) and updating your copy of llama.cpp, and adding `--spec-type draft-mtp --spec-draft-n-max 2` to your arguments.
    https://huggingface.co/unsloth/Qwen3.6-27B-MTP-GGUF/ https://huggingface.co/unsloth/Qwen3.6-35B-A3B-MTP-GGUF/
    Also, I recommend not quantizing kv cache, and if you do, only quantize v. Lowering model quant while also lowering context size to fit F16/F16 or F16/Q8_0 massively improves model performance for thinking models. Also, quantizing cache, either k or v, decreases speed by a lot on some hardware.
    I have a 24gb 7900xtx, so I can fit >32k F16/F16 context with Qwen3.6-27B, but use unsloth's Q3_K_XL. This performs better than Q(4,5,6)_K_XL with v quantized.
    Edit: Oh, and since I mentioned Gemma 4, my testing mirrors my Qwen 3.5/3.6 experiences, 26B-A4B performs worse than 31B, but is also way faster. llama.cpp doesn't support Gemma 4's MTP style yet, so both could get even faster.
  - booty 16小时前
```
    I tried the qwen3.6-27b Q6_k GUFF in llama.cpp 
    and LM Studio on my M2 MacBook Pro 32GB machine 
    last week, and I barely get a token a second with either.
```
    The fact that it was this slow makes me suspect it's a matter of insufficient free RAM. The entire model needs to fit into RAM (and stay there the entire time) for acceptable performance.
    (not sure of exact diagnosis/fix, but definitely look in that direction if you're still having this issue when you give it another shot)
    Also, there are two stages - prompt processing, and token generation. Prompt processing is notoriously slow on Apple Silicon unfortunately. If you have large context (which includes system prompts, lots of tools loaded by a harness like Claude Code, OpenCode, etc) it can take minutes for prompt processing before you see the first output token. On the bright side, the tokens are cached between turns, so subsequent turns won't be so bad.
    [-]
    - mark_l_watson 16小时前
      You are using Q6 6 bit quantization; on my 32G MacMini I use Q4 and it is faster but when I use it with OpenCode, I set up a task and go outside to walk for ten minutes. Smart, capable, and slow. Still, I love using local models.
      EDIT: I run with context wired at 64K
  - mft_ 18小时前
    The 27B model is dense, so is relatively slow. The 35B-A3B model is marginally weaker but being MoE is much faster - like ~4-8x faster in basic benchmarks on my M1 Max.
    For comparison, I just ran a couple of quick benchmarks (default settings) with llama-bench:
    Qwen3.6-35B-A3B at Q6_K_XL gave 858 t/s pp512 (prompt processing) and 43 t/s tg128 (token generation).
    Qwen3.6-27B at Q4_K_XL gave 103 t/s pp512 and 8 t/s tg128.
    [-]
    - stebalien 13小时前
      Have you tried enabling MTP? Those numbers are similar to what I was getting on my Strix Halo box, but configuring/enabling MTP doubled the TG speed of the 27B model (18-20 t/s now).
    - pixelesque 17小时前
      Thanks for the info.
  - satvikpendem 16小时前
    Check out Unsloth Studio it provides MTP support now which 2x the token generation speed with no loss of accuracy: https://unsloth.ai/docs/models/qwen3.6#mtp-guide
  - Figs 18小时前
    27B is the dense one. Try the Qwen3.6-35B-A3B variants for the MoE release. That's what I'm running on a Framework Desktop and I get ~50 tok/s plus or minus a few. The dense one is similarly slow for me -- not sure what to expect on your hardware from the MoE but it should probably be much faster.
    [-]
    - pixelesque 17小时前
      Thanks!
  - 127 13小时前
    I get 150t/s peak, 120t/s avg with Qwen3.6 27B Q4 with a 4090 on Linux. Now that MTP has landed into llama.cpp.
  - KronisLV 18小时前
    > qwen3.6-27b Q6_k
    That's the dense model, you probably want a mixture-of-experts (MoE) one.
    Here's what you probably want instead: https://huggingface.co/unsloth/Qwen3.6-35B-A3B-GGUF
    [-]
    - pixelesque 17小时前
      Thanks!
  - dzr0001 16小时前
    My token throughput is much better using vLLM-mlx on my M2 ultra than llama.cpp. It might be worth a shot to give it a try.
  - electroglyph 12小时前
    you should be using dflash with that model, look it up
- plufz 20小时前
  Which exact model are you using? And with which parameters and quant? And on what hardware? Are you using any specific MCPs or other tools to optimize performance like context-mode or dynamic context pruning? I’ve used local models a reasonable amount before but I’m just starting out with opencode. Haven’t had great results yet but really want this to work for simpler tasks. My opencode newly installed is also having iterm on 100% cpu in idle. :/
  [-]
  - briga 20小时前
    I'm running Qwen3.6:27b Q4 KM on a 4090 and similarly fast CPU and I think 32GB of RAM. Make sure the context window is set to be big enough otherwise the conversation will keep compacting. No special MCP tools set up yet. Qwen is able to do web search out-of-the-box although I think it is getting blocked by anti-bot firewalls--I still need to figure out if I can fix that.
    [-]
    - SeriousM 17小时前
      This is the repo: https://huggingface.co/pbhappliedsystems/qwen3.6-27B-gguf-Q4...
  - gcr 18小时前
    here's a simple setup to get you started on an Apple M1 Max from 2021 with 32GB VRAM. it will download 20GB of models to `~/.cache/huggingface/hub`, which you can delete when you're done.
```
  /Users/gcr/llama.cpp/build/bin/llama-server
      -hf unsloth/Qwen3.6-35B-A3B-GGUF:Q4_K_M
      --no-mmproj-offload
      --fit on
      -c 65536 # edit to taste
      --reasoning on --chat-template-kwargs '{"preserve_thinking": true}'
      --sleep-idle-seconds 90 # very aggressive: purge model from vram after this long
      -ctk q8_0 -ctv q8_0 # Optional. Lower memory use, but lower speed. Omit if you can.
```
    I don't recommend ollama or lm-studio. Ollama's in the process of switching from their llama-cpp backend anyway, but their new go framework frequently OOMs and crashes on my hardware. I also don't recommend MLX-based inference backends on this hardware; I've found them to consistently reduce performance, contrary to what I've read online. I've tried all the llama-cpp metal forks, but right now, MTP, TurboQuant, MLX, etc etc etc are too new and just slow things down. It's all dust in the wind still.
    For agent harnesses, opencode is okay, as is pi or even Zed's built in agent panel. Claude code "works" with ANTHROPIC_BASE_URL=http://localhost:8080/v1, but is very chatty (the default system prompt burns 20k tokens). Crush (from the charm-bracelet folks) is particularly nice when starting out. I've personally converged on pi-agent under an otherwise-mostly-default setup. You can ask qwen to customize pi or write you an extension which helps a little.
    You'll need to add `http://localhost:8080/v1` as an OpenAI-compatible model provider in your coding harness with any API key (doesn't matter) and any model identifier (doesn't matter with llama-cpp).
    Note that pi doesn't have permissions. Everything is permitted. The hundred hungry ghosts you've trapped in a jar WILL find a way to delete your home folder someday. That's what Man gets for summoning demons without casting a circle of protection first. Flying too close to the sun etc etc etc
    Take backups and then go have fun. Hope this helps.
    [-]
    - srcrip 12小时前
      Can you elaborate more on the differences in running ollama or lmstudio? Do they actually slow down the speed of the inference and if so why? Or is it just a preference thing?
    - plufz 10小时前
      Thanks a million!
    - irishcoffee 9小时前
      I have a 5070TI (16gb VRAM) with 32GB system ram and a 16 core AMD cpu. I am considering buying a second used videocard, probably the same model, but not for months yet. This hardware setup is new-for-me in that a buddy gave me most of it and I bought the TI card.
      Are there any resources to help me figure out how to best optimize my runtime paramaters for a given model, based on a given task, similar to what you've shown?
      I've been a little... irritated? that hooking vscode up to my company LLM subscription seems so much more out-of-the-box capiable than what I can get to work. My assumption at the moment is that I need to create a lot of... I think they're called harnesses? agents? workflows? integrations? (not sure) by hand. Is that accurate?
      Right now I have ollama running an nvidia nano model and I can poke it with a stick over a web interface I installed. It works, initial token response is slow, after that it seems fine enough.
      I can't seem to get a good handle on how much context I've used, when context usage starts to degrade response accuracy, or in general how to mirror the results I get (not in terms of accuracy or speed, just features) from the company github copilot + vscode integration.
      I was also trying to get a plugin called qodeassist working via qtcreator, mixed results there as well.
      I've been keeping up with this space since the jump, never paid for a sub, work gave me a sub a handful of weeks ago, so the actual useage is all new to me.
      I can't say I'm super impressed with any of it relative to the hype, but I found it neat to be able to point vscode at a c++ codebase and say "enable wextra, build the code, tell me if there is any low-hanging fruit I can clean up" and get a useful response.
      I also asked my local model to turn a picture of my dog into a picture of an otter, got a blank picture back, which the thinking bit told me it would do. The whole thing was actually kind of funny. "I am allowed to edit pictures, I can't edit pictures, I am allowed to edit pictures, I'll tell the user I did and send a blank picture back because I can't edit pictures, but I am allowed to."
- ecshafer 19小时前
  Qwen3.6 with claude code works great. I get a lot better results with that than opencode and qwen3.6. Claude Code is a great harness, and good harness/tool integration makes a big difference. You just have a settings.json with your ollama setup and the qwen model and you can use it.
  [-]
  - growt 16小时前
    Where and how do you run that? I tried it but somehow I always ran out of context or generation was incredibly slow (mbp m4 pro 48gb).
- leonidasv 20小时前
  Qwen Max are usually closed, unfortunately.
  [-]
  - mostafab 14小时前
    That's a signal of being SOTA.
- wuliwong 16小时前
  Do you have a feel for how it Qwen 3.6 compares to Sonnet 4.6? B/C in reality, that's what we use a lot. If we just use Opus 4.7 for everything code related, we'd have a monthly bill 10-20 times higher than using Sonnet where we can.
  [-]
  - nl 10小时前
    I think you could well be surprised by the Sonnet vs Opus bill (assuming you are paying via the API)
    In my experience Sonnet bills can be higher than Opus because it churns a lot more trying to get things right.
    Example from my fairly simple but agentic benchmark:
    Opus 4.7, 25/25, 81c: https://sql-benchmark.nicklothian.com/?highlight=anthropic_c...
    Opus 4.6, 24/25, 61c: https://sql-benchmark.nicklothian.com/?highlight=anthropic_c...
    Sonnet 4.6: 24/25, 41c: https://sql-benchmark.nicklothian.com/?highlight=anthropic_c...
    I only tested the free OpenRouter version of Qwen 3.6 Plus, and it scored 23/25: https://sql-benchmark.nicklothian.com/?highlight=qwen_qwen3....
    This doesn't quite show Opus cheaper, but it isn't the 10-20 times more either. Harder tasks close the gap even further.
  - briga 15小时前
    I would say if Sonnet is a senior engineer, then Qwen3.6 (the 27b model) is probably closer to a junior engineer. Still capable of getting stuff done, just needs more guidance and makes mistakes more often.
    Maybe that's underselling it. It is quite a good model and might end up replacing a lot of the work I was sending to Sonnet 4.6.
    Also, Sonnet 4.6 is almost certain a much bigger model so the performance differences aren't unexpected.
- kolinko 17小时前
  As Opus maximalist ;) I was very surprised by the quality if Qwen3.6-27B - trying to figure out how to get it going on RTX 90k now to offload some lighter tasks :)
- chr15m 11小时前
  This new version is not something you'll be able to run locally. It's a "cloud" model and likely too beefy if they do release the weights.
- aembleton 15小时前
  > Today we introduce Qwen3.7-Max, our latest proprietary model
  This is not an open model
- wouldbecouldbe 18小时前
  This one doesnt seem to be open source though sadly. Using chinese servers is a step to far for me personally
  [-]
  - gcr 18小时前
    Look for an open release from the Qwen team in the coming weeks. They like to showcase their proprietary models first, which score higher on benchmarks anyway due to model size.
- ttoinou 16小时前
  Which agentic coding tool and how do you make sure you have prefix consistency ?
- par 18小时前
  Do you have an opinion on OpenCode vs Aider?
  [-]
  - briga 15小时前
    I haven't tried Aider yet but perhaps I will. Another one that seems to be getting traction is Pi Coding Agent.
  - sunaookami 15小时前
    Aider is still around? That is pre-tool-calling era stuff. Better compare against Pi.
    [-]
    - par 12小时前
      I just started running coding agents locally. So you recommend Pi over opencode? (And obviously aider is out?)
      [-]
      - anderber 8小时前
        I personally found better results with Opencode. But Pi is really nice too.
tekacs 23小时前
As they start to release more proprietary models, I so wish that they partnered with one of the major US hyperscalers to allow using these models through something US-domiciled.
Totally understand why it may not be reasonable or in their best interest (and that the US is _absolutely_ not doing the same reflexively). But it would be lovely to be able to try these out on production workloads in earnest.
[-]
- embedding-shape 22小时前
  Unless US hyperscalers do the same in reverse, I hope the status quo stays as it is. Either people are happy to share, and the sharing should happen both ways, or US hyperscalers can keep isolating themselves as they've done so far.
  [-]
  - adjejmxbdjdn 22小时前
    I do hope The U.S. hyperscalers do the same as well.
    In an ideal world U.S. residents would use Chinese AI models and Chinese residents would use U.S. AI models.
    Governments in both countries are collecting data for nefarious reasons. But the Chinese government has far less influence on a U.S. resident and vice versa.
    We are all better off if our data is collected by a government halfway across the world instead of our own governments which hold incredible amounts of power over us.
    [-]
    - adrianN 21小时前
      In an ideal world everybody runs open models on hardware they control.
      [-]
      - LeifCarrotson 21小时前
        I'm running Qwen 3.6 via https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8 and it's pretty great. I'll update to the 3.7 equivalent when that's ready.
        It's not nearly worth it to me to get an incremental improvement in performance if it means I have to move to hosted environments with Qwen 3.7 (or Claude or Gemini or whatever).
    - nickdothutton 22小时前
      China is much more interested in waging a campaign against companies that represent the material of the future growth in productivity, exports, and prosperity of the US and her people, than learning about you as an individual. Unless of course you are a Chinese dissident living in the US.
      [-]
      - giancarlostoro 22小时前
        Which is basically the current primary use for AI is programming more than anything, you hear about AI in programming more than in any other field.
        [-]
        saghm 21小时前
        There are also a lot more novels about writing than making movies and a lot more songs about music than plays. It's not clear that this is because it's actually the primary use-case or if it's just because people who work with computers will inevitably talk quite a lot about computer things. For the past several years, pretty much everyone I meet who isn't in software but find out I do (doctors, people who sit next to me on a plane, etc.) will ask me my thoughts about AI because it's so widely discussed in general, and they're curious about my perspective on it as someone in software, but most of the time they're most curious about understanding more about how it might affect their own lives, not mine.
      - WarmWash 22小时前
        China definitley wants information on all Americans. This commment is so far off the mark you it's on par with "Billionaires aren't interested in taking your money"
        As Americans go through life, some of them will become people with power. When you need to leverage that power, having the right knowledge about them can effectively transfer that power to you.
        Tiktok was a goldmine, because every 20-something on their way to a future position of power was uploading every single facit of their digital life to CCP servers everyday.
    - giancarlostoro 22小时前
      It would have been the world we live in if China wasn't involved in so much corporate espionage. I don't even feel comfortable using their open weight models on anything my employer makes, the only time I use Qwen is for greenfield "how good is this?" type of projects, but otherwise, how do I trust that it wont mysteriously hallucinate phoning home?
      On the other hand, there's other models where the source is 100% open, the training data is known, and people have reproduced the same model from scratch, so while those trail behind, there's definitely an effort to make models more open and capable.
      [-]
      - deaux 21小时前
        The US has for decades been engaged in mass dumping of their products to establish monopolies all over the world, and punishing anyone who dares try do anything about it. This isn't better than corporate espionage.
      - eloisant 22小时前
        I agree, but the same goes for the US. Remember Echelon.
        [-]
        stickfigure 21小时前
        It's highly improbable that the US government has a secret team inside Anthropic and OpenAI manipulating their training regimen. For better or worse, these companies are filled with ideologues and something that invasive would trigger an army of whistleblowers (despite legal consequences).
        [-]
        booty 20小时前
        It's highly improbable that the US government has a secret team inside Anthropic and OpenAI manipulating their training regimen.
        Two thoughts.
        One: it would be relatively technically trivial for $GOVERNMENT_AGENCY to just monitor all the prompts + context we send over the wire to OpenAI/Anthropic/etc. That's a goldmine of sensitive personal and corporate data, no secret team needed (although, the LLM providers obviously would need to cooperate)
        Two: Rather than secret infiltration teams influencing model training I think what's more likely on the training side of things is simply self-censoring by the LLM providers, so that they don't risk angering the government.
        I highly doubt that China has government interlopers, secret or otherwise, inside Qwen's training team. Nonetheless, "sensitive" issues like Tiananmen Square are censored. I would imagine that much/most such censorship in China is self-censorship that doesn't leave a legal/paper trail. That's what we're in danger of seeing (more of) in America IMO.
        [-]
        Barbing 20小时前
        > relatively technically trivial for $GOVERNMENT_AGENCY to just monitor all the prompts + context we send
        I take this for granted given Room 641A https://en.wikipedia.org/wiki/Room_641A
        Thus, I’ve pondered whether anything they’ve learned has changed the world / had a big impact (like on their understanding of human psychology, perhaps per region). They’ve heard phone calls, they’ve read emails, diaries get brought to court… but these are systems that would be used like diaries but also prompt users for more and more.
        SoMomentary 18小时前
        Having seen all the AI interactions that you can get through clickstream data I have no doubt that $GOVERNMENT_AGENCY can see much much more.
        throwaw12 20小时前
        > secret team inside Anthropic and OpenAI manipulating their training regimen
        You don't need a secret team to manipulate whats coming from them: https://responsiblestatecraft.org/israel-chatgpt/
        Planktonne 21小时前
        > these companies are filled with ideologues
        Are they? They don't behave like it.
        gmerc 21小时前
        Its very hard to be so naive.
        [-]
        SR2Z 21小时前
        I think you are being ridiculous. Tampering with an LLMs pretraining is a difficult undertaking. There is plenty of evidence that training a model to walk the party line leaves it less capable than if it weren't.
        It's not very subtle manipulation either; ask qwen of Taiwan is a part of China in German and in English and only the English answer will be party-approved.
        [-]
        embedding-shape 20小时前
        Compared to what we have proof the US government have engaged in before? Do people not remember PRISM anymore? It was virtually impossible to think of the scope before it was leaked, and you'd be marked as a conspiracy theorist for believing that happened, before it was made concretely true.
        I think it's borderline naive to assume various agencies haven't infiltrated OpenAI, Anthropic and others, essentially the entire world was wiretapped by NSA in the past, to assume they don't have an employee or two at these companies does seem a bit naive to me.
        [-]
        logicchains 18小时前
        Agencies like the CIA have infiltrated the news agencies, so they have indirect power over the information that LLMs consume.
      - gcr 18小时前
        how could running the qwen GGUF phone home? that would require cooperation with the inference backend (llama-cpp), or some kind of model exploit. It’d be far easier to pay the agent harness devs or supply-chain some plugin or something, that space is the Wild West anyways
        I've certainly used these models without wifi without any differences.
        [-]
        HDBaseT 13小时前
        You've used Qwen with model quantization, locally without internet connection.
        A lot of people are purchasing access via Alibaba Cloud directly, or indirectly by companies which host the model.
    - boomskats 21小时前
      Yeah, about that. https://en.wikipedia.org/wiki/UKUSA_Agreement#Controversy
    - MintPaw 16小时前
      Interesting point, but I'd always thought the opposite, you're much better protected by the law if you use services from your own country.
      If you use a service outside your country, I believe you could have all your code stolen and get hacked/exploited in a way that would be totally legal.
    - CodingJeebus 22小时前
      > We are all better off if our data is collected by a government halfway across the world instead of our own governments which hold incredible amounts of power over us.
      Sure, that is until each government's dataset is interesting enough to the other to facilitate a data-sharing agreement.
      There's gotta be an internet "law" that says something like "Eventually, the data you volunteer to a benign 3rd party eventually winds up being used against you by someone". This is short-term thinking at it's finest.
- tmoravec 19小时前
  Qwen3.6-Plus is available from Fireworks.
  [-]
  - tekacs 17小时前
    Thank you for pointing that out! If 3.7-Max makes its way to Fireworks that'd be a joy.
- mostafab 14小时前
  Alibaba Cloud has data centers in Mexico
- dchftcs 20小时前
  fireworks hosts Qwen 3.6 Plus, they might also get Qwen 3.7 Plus.
- motiw 22小时前
  ChatLLM support QWEN, do you consider this as US safe?
- epolanski 22小时前
  US hyperscalers, all of them, are financially invested in the US AI labs and have the incentives to keep the status quo.
- 0xbadcafebee 22小时前
  I'm more interested in hearing specific reasons why one wouldn't use a Chinese company. Unless you're thinking Alibaba is going to ship chat logs to some government ministry that will then dole out proprietary information to new competitors (which doesn't seem logistically feasible), or you run a human rights organization, it feels a bit like FUD.
  [-]
  - vessenes 22小时前
    All this data is accessible to national security agencies; this is true in every country in the world.
    China has more integration between intelligence and industry than many western countries, and it does present a higher risk of unwanted “tech transfer” to industry than running on oracle or Google or ms or Amazon does in the US.
    DHS has long staffed full time agents in California to deal with foreign IP exfiltration - using qwen is like fast/easy mode for IP exfiltration: why make anyone get a job in your palo alto office when you can just send it to them in Hanzhou?
    Upshot - If you have something proprietary you’re working on I would generally advise not to just direct send it to Alibaba.
    [-]
    - culi 19小时前
      I highly doubt China has a more sophisticated integration of their intelligence ministries than the USA. The world in which that was true would look very different from our own.
      [-]
      - kbelder 16小时前
        He didn't say more sophisticated integration. He said 'more integration', which is very likely true.
      - vessenes 18小时前
        Interesting. Have you worked in China?
    - HDBaseT 12小时前
      The US Education propaganda is working, China are the bad guys!
  - bachmeier 21小时前
    > Unless you're thinking Alibaba is going to ship chat logs to some government ministry
    This made me think of a Seinfeld episode: "I didn't know it was possible not to know that."
  - noelsusman 21小时前
    >Unless you're thinking Alibaba is going to ship chat logs to some government ministry that will then dole out proprietary information to new competitors (which doesn't seem logistically feasible)
    That's exactly the fear, and why would it not be logistically feasible? The threat is definitely a bit overhyped, but China has a longstanding track record of aggressive corporate espionage.
  - tekacs 22小时前
    … building and selling a product to US companies that sends company-internal data to Chinese AI providers is not a particularly good way to get people to buy it.
    Even if they weren’t individually worried about their proprietary data being shared with Chinese domestic competitors or with government… their audit / security programs likely wouldn’t allow it for a _huge_ range of types of data.
  - dpoloncsak 21小时前
    Because my CEO thinks China scary big hacker guys over there
  - ihsw 18小时前
    [dead]
goyozi 1天前
These are very good numbers. I still don’t get why they don’t compare against latest competitor versions in these posts, it’s not like we’re all not going to notice.
[-]
- Eridrus 14小时前
  Nobody releases numbers that show them to be worse than competitors lol.
  This even applies to OpenAI & Anthropic who don't even eval on the same datasets a lot of the time.
- NiloCK 22小时前
  I find it forgivable if it's within minor version bump. (NB that x.5 is now a defacto major-version bump for LLMs for whatever reason).
  Even with LLMs, posts like this don't just fall out of a coconut tree. If you have a set of target benchmarks for your own model, then keeping "the set" of side-by-side comparable models is its own maintenance headache.
- Aurornis 22小时前
  I think the argument is that trying to suggest that they’re close to N months from SOTA.
  Realistically I assume they hope readers don’t notice the fine details.
  The Qwen models are great for open weights but for every past release they haven’t performed as well as the benchmarks in my experience. They’re optimizing for benchmark numbers because they know it works.
  [-]
  - epolanski 22小时前
    > Realistically I assume they hope readers don’t notice the fine details.
    The pool of people reading such articles while ignoring such details can't be big.
    [-]
    - Aurornis 22小时前
      I disagree. Most people skim articles, not read them deeply.
      On Hacker News I wonder if most people even opened the article at all most times.
      [-]
      - hadlock 18小时前
        Slashdot coined RTFA in the 90s, what you're suggesting isn't a new concept by any measure
        e: which itself is a modification of RTFM from usenet
- htrp 23小时前
  I think its part of the expectation setting (with a side of we did our distillation/ eval harness on a specific model).
  if they say it's 4.7 comparable, it anchors that into your head as the model to evaluate against.
- beydogan 22小时前
  honestly, initial version of Opus-4.6 was much better than whatever we are being served right now as 4.7. If it performs same level to that, i'm totally willing to switch.
  [-]
  - hypercube33 21小时前
    4.6 was an awful experience the month I used it right after launch where it didn't ask anything just made assumptions and went on its merry way. 4.5 and 4.7 don't do that for me but 4.7 eats my quota for breakfast so I've been avoiding using it because I like to have it for more than an hour a day.
    [-]
    - goyozi 21小时前
      I feel like I had the best and worst ~month experience on 4.6. Initially when it came out, it seemed to ask good questions and genuinely do well on complex tasks. From about mid-March it was absolutely abysmal, it seemed to assume the stupidest answer/angle for everything and make weird mistakes. 4.7 seems decent so far but usage hurts - at some point my company switched me to standard seat and I used up 80% of my session usage in 1 prompt. I got my premium seat back since but I think pro/standard plan + opus 4.7 is unusable for daily driving.
    - verdverm 20小时前
      That experience is also likely tied to the claude harness around the model, and not being as tuned right after model release. They iterate on this and different models need different words (unfortunately...).
- hmokiguess 23小时前
  this puzzles me too, I want to know
- maelito 23小时前
  Marketing.
- pulse-dev 22小时前
  [dead]
tarruda 23小时前
Looking forward to more open weight releases from Qwen, especially 122B and 397B.
[-]
- smcleod 23小时前
  Yeah that 60-150b~ range is such a sweet spot for current 'prosumer' hardware, I'd love to see something like a 120b-a14b or there about.
  [-]
  - tarruda 23小时前
    I have a 128G mac studio and even 397B was a happy surprise to me due to its high quantization resilience.
    I've created a 2.54BPW quant that fit on my hardware with 128k context, 20 tps tg and 200tps pp, while maintaining high scores on many benchmarks: https://huggingface.co/tarruda/Qwen3.5-397B-A17B-GGUF/discus...
    [-]
    - chrisweekly 22小时前
      Apple store's current options for mac studio seem to max out at 96GB. I'm questioning ROI, esp. given it's not upgradeable. Curious about others' takes on new mac hardware.
      [-]
      - tarruda 22小时前
        > I'm questioning ROI
        If by ROI you mean saving more money than using paid APIs, then I don't think it is worth it. All you gain is full sovereignty over your AI usage.
      - hadlock 18小时前
        Rumor mill has been buzzing about m5 mini and studio. If anything materializes close to what the rumor mill has been suggesting, the m5 could be appealing to home lab/local LLM folks, or at least help inform if the M6 will be worthwhile. Assuming Apple was able to lock in halfway reasonable memory prices early enough in advance.
      - drob518 22小时前
        Currently, Apple is letting some of its models go out of stock in preparation for new models coming in a few weeks. I would expect at least 128 GB models at that time. That said, the memory crunch is hitting everyone.
        [-]
        the_lucifer 21小时前
        Yep, even with their supply chain prowess, they're being hit now given some longer term contracts vis-à-vis their memory are nearing renewals.
        [-]
        drob518 20小时前
        Yep. Something needs to break soon. Or rather, something WILL break soon, one way of another. Was talking to a friend last night who works planning infrastructure rollout and he said costs for equipment has roughly doubled in the last six months. Soon, these projects aren’t going to be viable.
      - ramses0 20小时前
        I'd held off from buying a new personal laptop for quite a few years and felt that the M5-128gb was justifiable once I started really seeing payoffs from using AI at work.
        Running w/ Cursor and doing some "nights and weekends" type coding / conversations, I was hitting $100-200 of usage within a few weeks. I know there's probably better ways to manage costs, but I was getting enough value out of it to keep bumping my spend limit from $20 => $40 => $80 => $120 (and then I stopped spending! :-)
        Messing around with local-llm, I've settled on `omlx` and `gemma` for "conversational", and I think it's `qwen-120b-a3b-6bit` or something for the "heavy hitter". Gemma "gets it" a lot more, whereas that particular `qwen` tends to fall into the "MuSt WrItE CoOooDeee!" behaviour in a lot of cases instead of holding a conversation, and does an awesome job of randomly spitting out ascii-art diagrams or including full-blown bash shell scripts to illustrate different cases.
        My POV is: "Local for slightly slower/casual usage", the ~1% of battery usage per minute of LLM is shockingly accurate (eg: 30 minutes == 30% drop!). "Gemma for discussion and emitting DESIGN-... docs", and "Qwen for converting DESIGN-... to PLAN-...", (as well as implementation, but generally from a fresh context loading the relevant PLAN-... or supporting docs)
        ...then supplement that with direct Cursor usage in case I screw up some setting on being able to get the local LLM working, or if I need to include literal web-research or really having access to some SOTA model. Using the pi-coder harness locally, web pages are kindof a difficult conundrum as they can be kindof gigantic and are really worthy of special casing, some sort of sub-harness, etc... but the more "stuff" you put into the agent, the less context window (and memory!) you have available, so it's a real balancing act.
        The other biggest problem is that you're limited (locally) to ~20-80tps and in some cases you have to chew on or "swallow" the whole prompt up to that point if you end up with some sort of cache miss (TTFT). The `omlx` server does a pretty good job (after you tweak some settings and stuff) of allowing MANY prompt continuations to nearly immediately start generated tokens, but sometimes if I have two agents going (eg: Gemma talking shit about Qwen's output or vice versa) in a longer context window, then you'll take that hit.
        "Other people's compute" is definitely more freeing, but even looking at $200/mo usage that's $2400 vs. the ~$6k for a maxed out MBP. Call it $2500 vs. $7500 and you'd say that "local AI gives you a 3-year amortization window for a slower, worse experience" ... but if you're strategic about your usage, the ability to "talk for free" and occasionally "burst" to an online provider or having some hugging-face tokens to try out different models that you can't quite run locally is really nice. Talking to the AI (locally) to even just do non-coding planning without worrying about data leakage or privacy issues is phenomenal, and you end up owning a really nice laptop!
        In some ways, seeing the "advantage" of having the local 128gb capacity for LLM, I'm semi-wishing I'd have gotten a mac mini instead, but then I can't quite do the 100% offline stuff (eg: coffee-shop) that the maxed out laptop allows.
        If it were a mini running locally, I'd feel more comfortable calling it the always-on "AI brain" to process my emails, run crontab summaries, whatever kindof "open-claw-ish" stuff that you could do w/o relying on having to "keep the laptop lid open all the time". I'm sure there's ways to repurpose things, but longer-term, call it even 3-5 years from now... any sort of 128gb machine will be more than capable where you'd want to have one "doing stuff" locally within your home network (IMHO).
        [-]
        chrisweekly 19小时前
        Thank you! That was a generous and helpful response, I really appreciate it. Food for thought...
        >"...if you're strategic about your usage, the ability to "talk for free" and occasionally "burst" to an online provider or having some hugging-face tokens to try out different models that you can't quite run locally is really nice. Talking to the AI (locally) to even just do non-coding planning without worrying about data leakage or privacy issues is phenomenal, and you end up owning a really nice laptop!"
        ^ this resonates, loudly.
    - smcleod 15小时前
      That's impressive getting a 397B down to <110GB~. HF link is broken though!
    - ttoinou 23小时前
      better than antirez ds4 ?
      [-]
      - tarruda 23小时前
        I only tried a very early version of that when it was just a llama.cpp fork and Qwen was certainly better in my tests.
        But I was not super impressed with deepseek 4 flash using it from the official API either, so it doesn't seem quantization fault. It is a good model, but nothing out of the ordinary in the few benchmarks I ran on it (with full awareness that benchmarks are biased).
  - KronisLV 18小时前
    There definitely have been some options in the past, cool to see them.
    Oddly enough, though, Qwen 3.6 35B A3B and Gemma got some really good reviews, despite being way smaller than any of these ones.
    Qwen 3.5, 122B A10B: https://huggingface.co/unsloth/Qwen3.5-122B-A10B-GGUF
    Qwen Coder Next, 80B A3B: https://huggingface.co/unsloth/Qwen3-Coder-Next-GGUF
    It's kinda weird that DeepSeek V4 Flash is supposed to be 284B A13B, but shows up as 158B in HuggingFace, probably some weird bug: https://huggingface.co/unsloth/DeepSeek-V4-Flash and that's not even just Unsloth but like the official source too https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash (so also doesn't fit the category unless you get a heavily quantized version to run, but cool regardless)
    Mistral Medium 3.5 is interesting because it's 128B but dense, so probably too slow for most folks: https://huggingface.co/unsloth/Mistral-Medium-3.5-128B-GGUF
    GPT-OSS, 120B A5B: https://huggingface.co/unsloth/gpt-oss-120b-GGUF
  - gcr 23小时前
    What’s the price point for getting into that sweet spot?
    I’m on an M1 Max with 32GB VRAM, so I’m looking forward to the 27B or 35B-A3B models. Is dropping $5k for an RTX 6000 or a DGX Spark really the best option?
    [-]
    - tempoponet 23小时前
      Expect to pay $4k-10k
      - Your RTX 6000 is closer to $10k now
      - Sparks are creeping into the $4-5k range
      - AMD Strix are ~3.5k
      - Apple depends on chipset and memory. Sweet spot would be 128gb M3 Ultra, probably $6-8k but admittedly haven't been tracking closely. New M5 might come in the fall. You can get a new 128gb M5 Max laptop for ~5-6k today.
      - a 4x3090 rig would take $5-6k
      Every platform has tradeoffs, but it's mostly ecosystem, memory bandwidth, and power consumption. They're all slow. The best option is likely to rent hardware on Runpod. The RIO on self-hosting is very low unless you have a specific need or you're ok treating it as a hobby.
      [-]
      - bahmboo 17小时前
        $2600 gets MBP M5 Pro 48gb. 64gb requires a Max which bumps it to $4200 at which point you may as well spend the $800 to go to 128gb.
      - anonym29 22小时前
        Bosgame M5 (Strix Halo) w/ 128 GB still goes for $2800 right now. SH systems have surged in price dramatically but quite unevenly.
        >The best option is likely to rent hardware on Runpod.
        Vast.ai is much cheaper, but the broader point here is contestable. The only dimension in which cloud GPU rentals win is cost. You lose the confidentiality, integrity, and availability benefits of local deployments.
        [-]
        ai_fry_ur_brain 22小时前
        Rentals are priced to pay themselves off in 1-1.5 years (when renting them out per hour, not selling tokens). Its never a better option to rent.
        Not that I'd encourage anyone to throw large amounts of money to have access to LLMs, but you're definately going to be better off buying something that you can amortize over multiple years with a multi year warranty.
      - ai_fry_ur_brain 22小时前
        And for what? Spend 10-15k for the slopiest of slop code, non deterministic automations, and the ability to spawn an AI gf?
        This whole thing is really starting to remind me of the crypto hype phases of 2016-2018 when everyone thought their investment in GPUs was going to make them rich.
        [-]
        organsnyder 22小时前
        It is possible to get real work done with LLMs. There are plenty of ethical concerns, and they're definitely over-hyped, but they are exceptionally useful tools when used well.
        [-]
        varispeed 22小时前
        [dead]
        dvfjsdhgfv 20小时前
        I upvoted your comment even though I disagree with you.
        Yes, LLMs are sloppy, and local models usually more so (but things change fast).
        But the local ones have one big advantage: they are private. So you can safely feed them the collection of your private documents and things you wouldn't trust people like sama with. The fact that some people do not care is one of the failures of our educational system.
        gamander2 20小时前
        These models contain a wealth of knowledge that is being censored, not just deliberately, but by training data bias. Fine-Tuning and steering can produce unexpected new insights. For example a model that is trained to believe so-called "conspiracy theories", which many believe to be the ground truth.
    - smcleod 15小时前
      Really right now it's the M5 Max MacBook Pro 128GB, the RTX6000 is a nice card but you'd need more than one of them and you have to have a desktop to suit. The DGX Spark is slow and has pretty limited software support.
    - embedding-shape 23小时前
      If I could find a RTX Pro 6000 for $5K I'd definitively grab it, I'm running RedHatAI/Qwen3.6-35B-A3B-NVFP4 on one (I had to pay closer to $10K for it though) with 260K context and it's a blast! ds4 by antirez also works well, even IQ2XXS seems to work relatively well but Qwen3.6-35B-A3B-NVFP4 is both faster and higher quality responses (at least for coding and translations which I use them mostly for).
    - tarruda 23小时前
      > What’s the price point for getting into that sweet spot?
      In October/2024 I got my Mac studio M1 ultra with 128G, IIRC it was ~$2500. With recent prices explosion, it has certainly gotten more expensive. https://frame.work/ is selling 128G strix halo mainboard for $2700, but you have to add storage and case.
    - ttoinou 23小时前
      M5 Max 64GB (sweet spot) or 128GB (only 1000 USD, better to keep it for the future) more are the best quality price ratio, future proof, reliable, resellable and flexible workloads. Harder to use as a server might be the only drawback
      [-]
      - throwaw12 23小时前
        What do you recommend for non-Mac setup? I am a Mac user, but its getting expensive, and not seeing reason to jump to the latest M5
        [-]
        barbacoa 20小时前
        Try looking into Ryzen AI Max 395. AMD made a CPU/GPU soc with unified memory specifically for ai inference. Can buy mini PCs with up to 128gb ram.
        [-]
        krzyk 18小时前
        Isn't CUDA/nvidia the go to solution for most local models, with the rest being second class citizents?
        [-]
        gcr 17小时前
        Depends. ROCm is pretty well-supported for example.
        Non-NVIDIA backends tend to get less support and new features land slower, or features that are expected to improve performance wind up hurting it instead. That sort of thing.
        For basic “token in/token out” workloads without fine tuning, it’s probably fine ??
        simple10 19小时前
        The Ryzen AI Max 395 128gb is super cool, but not fast for inference. Order of magnitude slower than dedicated GPU but at half the cost. You can run larger models on it but it's slow. Great for local async work. Not great for daily chat or code agent driver.
        [-]
        throwa356262 19小时前
        The latest NPUs are pretty fast, I think what is missing is more optimised software support.
        [-]
        plagiarist 18小时前
        The vRAM bandwidth is at least as much a problem as compute on these ones, there is a lot of data to shuffle around
        varispeed 22小时前
        Probably a comparable non-Mac setup will be Threadripper, but it will become much more expensive. My view is that actually Apple products are the cheapest on the market when it comes to performance.
      - roger_ 23小时前
        M5 Max 128GB for $1k?
        [-]
        tempoponet 22小时前
        The memory upgrade is $1k on a Macbook Pro. The laptop is ~$5500.
        smallerize 23小时前
        I think they mean the upgrade to 128GB is +$1k.
    - tandr 14小时前
      Don't mind me asking, but where did you find $5k RTX 6000? Even 48GB model (previous gen) shows minimum at 7k, and 96GB one (Blackwell) is ~10k on Amazon...
      [-]
      - CamperBob2 6小时前
        $5K is presumably what it costs to pay some local gangsters to break into an nVidia warehouse. That's the only you will pay $5K for an RTX 6000 for the next couple of years.
        The server edition has gone up $2K in the last couple of weeks alone, at the outlet where I bought one previously.
    - anonym29 23小时前
      Strix Halo at $2k with similar TG and about half the PP of DGX Spark was a pretty good deal IMO, especially considering it's also a full x86 system... 16c/32t Zen 5, 40 CU RDNA 3.5, 128 GB unified memory at ~220 GB/s real-world speeds (256 GB/s theoretical) - that runs full tilt at 140W in performance mode and idles at ~10W.
      Unfortunately, the prices rose on these a lot, but unevenly. Beelink GTR 9 Pro is $4400, Framework Desktop is ~$3500, for what is basically the exact same mainboard as a Bosgame M5 for $2800.
      Apple's M5 Max is another attractive option. Apple silicon traditionally had great MBW and was good at TG, but struggled with PP, but the new neural engines in those GPU cores have made a big difference in a good way here.
      Gorgon Halo is rumored for June announcement with Q4'26 release with basically +100 MHz clocks on Strix Halo, LPDDR5X-8533 instead of LPDDR5X-8000, but more importantly, 192 GB max instead of 128 GB.
      I'd say it's better to wait for Gorgon Halo than to grab Strix Halo now. However, Medusa Halo, rumored for H2'27, is slated to have up to 26c Zen 6 (heterogeneous cores - kinds funny that AMD is heading towards these as Intel retreats from them), 48 CU of RDNA 5 instead of 40 CU RDNA 3.5, and a 384 bit bus w/ LPDDR6, which should make 256 GB at more like ~490-600 GB/s MBW, which will really make Strix and Gorgon Halo obsolete.
      Also worth keeping an eye out for Serpent Lake (intel CPU + nvidia iGPU on a single board with unified memory, rumored for 2028-2029 iirc), and on the 160 GB Crescent Island Intel dGPU.
    - pulse-dev 22小时前
      [dead]
- ricardobayes 18小时前
  Personally even more a lower quantized model like 9B.
  [-]
  - throwa356262 13小时前
    Same here, the unsloth versions can run on a potato and are actually useful.
- mixtureoftakes 23小时前
  I'm more excited for qwen3.7 9b and 72b, these are usually so good for their size
- guitcastro 22小时前
  I am still waiting for qwem image-edit 2.0 open weight
- Pxtl 21小时前
  Ouch. I'm just getting into tinkering with these things - mine is running on a vanilla gaming desktop with a 12gb 3060 and 32gb of ram. Even going above Qwen 9B risks completely locking up the machine.
flakiness 19小时前
I'm using pi agent and love to try qwen models (hosted). What are the good options? The official provider doesn't include Alibaba. Is OpenRouter etc. fast enough?
(As a reference, DeepSeek v4 is severely throttled on these proxy services.)
[-]
- atilimcetin 19小时前
  I use pi + openrouter (with qwen3.6-max-preview) a lot. I never hit any stability or performance problems yet.
  [-]
  - flakiness 13小时前
    Good to know. Thanks!
- notatoad 9小时前
  i use opencode zen as a convenient pay-as-you-go way to try out all these new models. it doesn't have 3.7 yet, but at the rate they usually update it probably will tomorrow.
  I couldn’t say how throttled it is, but it seems fine?
ndom91 21小时前
Is this one of those ones where they'll drop the huggingface release a week later? Or do we know for sure that this is staying proprietary?
[-]
- Davidzheng 21小时前
  someone correct if i'm wrong, but I think the max models are usually non-open
  [-]
  - sroussey 21小时前
    The plus and max models have never been open as far as I know.
    [-]
    - zackangelo 20小时前
      With the 3.5 release, the Plus model was just a rebrand of the open weight 397B. But I suspect that will change going forward. They haven’t released the weights for 3.6 but they did make it available through a few US providers.
maxdo 15小时前
No opus 4.7 , gpt5.5 , Gemini flash 3.5 in benchmarks
eddyaipt 21小时前
The pattern I trust most is adding a small verification artifact after every external action. Agents usually fail from silent state drift faster than from lack of reasoning depth.
[-]
- _boffin_ 21小时前
  Can you go into more depth about this
  [-]
  - visarga 19小时前
    [dead]
jdw64 21小时前
QWEN really hits the sweet spot it's cheap, fast, and actually good.
slicktux 13小时前
I just started messing with local LLMs and honestly I’m pretty impressed. I have a workstation laptop with an NVIDIA A1000 (6GB VRAM) and 96GB of RAM. I rarely used my gpu. Occasional CAD design or Machine Learning with OpenCV.
I ran llama3:latest and it ran pretty fast! I’m curious to see how Qwen would run on my system.
bratao 23小时前
It is super strange that all last (3?) releases they keep comparing older models such as Opus-4.6.
[-]
- vessenes 23小时前
  Some of it’s probably timing. Some of it is wanting to look good. That said, I just went to the claw-eval site, and neither 4.7 nor 5.5 from oAI are listed on the benchmarks. So there’s also just the time from others to get benchmarking done and published.
- varispeed 22小时前
  Opus-4.6 was probably the best model so far before it got nerfed. 4.7 is nowhere near experience I had. In fact I stopped using it completely because more often than not its output is just dumber than local models.
  [-]
  - leonidasv 20小时前
    Same here. Can't stand 4.7.
  - solenoid0937 17小时前
    Opus 4.6 was never nerfed, that's FUD. There were harness-level problems that were fixed.
    4.7 is much better. But perception is a funny thing, once you think something is bad you start looking for it everywhere.
    [-]
    - anonyfox 14小时前
      Still anecdotal but the exact same coding task on the exact same repo (I clone from GitHub templates for projects) worked amazingly well in December with CC/Opus, couldn’t accomplish the goal anymore end of march, with essentially identical prompts, and 4.7 was just comically useless. But even these days I tried repeatedly and 4.6 still can’t do the thing it could in December.
    - kroaton 14小时前
      Did you even use it? It was nerfed to hell and back. It stopped following instructions, forgot what sub-agents responded and so on. Stop spreading this pro-Anthropic narrative. They did a rug pull due to lack of compute.
- dyauspitr 21小时前
  Because these can’t compete with the SoTA but they’re close.
bsenftner 23小时前
Any reports from people using their coding agent(s)?
[-]
- rayboy1995 22小时前
  I'm running Qwen 3.6 27B Q5 K M GGUF on a Tesla P40 and koboldcpp using pi.dev as the harness, I gotta say I am impressed. Took some setup and configuring but I already have some code it has made commited and pushed. It can be slow on my hardware at >50k tokens, but the fact I bought this one P40 for like $150 back when the LLM trend started I can't complain. (I have a second one too but I couldn't physically fit the card in my server unfortunately.)
  The setup I had to do was important and I had to compile koboldcpp with a few special params for my hardware, I mostly just had Claude figure it out. I don't remember everything I did now but it was very slow and would often stop mid task, it seems it was mostly a parsing issue. It made the model seem broken/dumb, but once I had all that settled I actually am able to use this how I use Claude Code. Disclaimer, I am pretty explicit with requirements, I imagine this fails more when you leave it to figure out things on its own but for my flow its pretty rad.
  Currently setting it up as an automated agent now to pull Trello cards, create PRs for them, and move the card to be reviewed.
  Command I am using to run: python koboldcpp.py \ --port 61514 --quiet --multiuser --gpulayers 999 --contextsize 262144 --quantkv 2 \ --usecublas normal --threads 4 --jinja --jinja_tools --jinja_kwargs '{"enable_thinking":true, "preserve_thinking":false}' \ --skiplauncher --model /data/models/Qwen3.6-27B-Q5_K_M.gguf --smartcache 5
  [-]
  - lostmsu 20小时前
    Qwen recommends to preserve_thinking: true for agentic/coding workloads.
    [-]
    - rayboy1995 18小时前
      Thanks!! I had disabled that previously while debugging, I can confirm this is helping accuracy from what I can tell so far. (And speed since the cache is preserved more often!)
      [-]
      - satvikpendem 16小时前
        Use the MTP models which 2x token generation speed, for example: https://unsloth.ai/docs/models/qwen3.6#mtp-guide
        [-]
        rayboy1995 11小时前
        Very interesting I'll have to check this out thank you. This is why I love HN.
- vibe42 21小时前
  I'm using the pi-mono coding agent (open source, free) without any extensions and very simple prompts. The 3.6 27B model (BF16, 250k context) uses 67GB VRAM on an RTX PRO 9000.
  It's very capable on almost any coding task I've thrown at it, and very good for easy-to-medium hard scripts, new code bases.
  It struggles on some complex tasks in larger code bases, e.g. using to debug and fix bugs in llama.cpp it gets close to working code but often introduces errors. For such tasks its still very useful as a search/explore tool and drafting fixes.
eleventen 16小时前
Checking openrouter (it's not available yet) and, uh, what's up with the spike in Qwen usage from early april here? https://openrouter.ai/qwen
Is this normal humans kicking the tires on a new model, or a few whales doing serious benchmarks?
[-]
- d2kx 16小时前
  Qwen 3.6 Plus released and they offered it for free
- spaceman_2020 16小时前
  personally seen a lot of people switch to Kimi and Qwen after Opus 4.7. Kimi 2.6 feels like Opus 4.6 which, to me, was a great model for 98% of coding tasks
  [-]
  - wolttam 16小时前
    Frontier: Need it done quick and I'm willing to pay.
    Open-weight: Good enough for the majority of tasks, and I'm willing to spend a bit more time and effort steering towards my desired result.
XCSme 22小时前
Any info on pricing and latency?
[-]
- mchusma 19小时前
  I've looked like a dozen places, I don't see anything. :(
aliljet 19小时前
Where can a user reasonably host this in an affordable way to access the local LLM revolution?
[-]
- satvikpendem 16小时前
  Unsloth Studio with its MTP support: https://unsloth.ai/docs/models/qwen3.6#mtp-guide
- julianlam 17小时前
  Try llama.cpp and Qwen3.6-35B-A3B
  Good balance of intelligence and speed.
- plagiarist 18小时前
  I think their Max models are far bigger than fits on consumer hardware. People are typically using Apple, AMD Halo, or dGPUs if/when they do smaller versions. Those are all varying degrees of "affordable."
grumple 11小时前
When I click on the link to Alibaba Cloud Model Studio from the linked post, that page sends my CPU (9950X3D) to 100%. Which is just... impressive. Is this a js based crypto miner? Or some strange browser based particle display? Super weird.
xiaoluolyg 19小时前
congrats to qwen teams, remarkable
indigodaddy 19小时前
Is it multimodal/vision?
cft 19小时前
Downloading this and cancelling Google Antigravity Pro at the same time:
I had a Google Pro account that I inherited from buying a Pixel 9 XL - it's free for a year after a flagship Pixel phone purchase. After a year they started charging for it, and i tolerated it, because Flash was usable in Antigravity for dumb auxiliary tasks that I did not want to waste GPT/Opus on. It had a separate generous quota from Gemini 3.1 Pro. Now with Flash 3.5 they combined the quotas with Pro, such that on a Google pro account you can work 4-5 hours per week in Flash. And by the way, 3.1 Pro is useless for programming, compared to Codex/Opus
[-]
- bel8 18小时前
  same boat. Google Pro AI quota became barely useful for anything meaningful.
  I think they envision Pro plan as "just a taste of AI, enough to lure folks into the Ultra plan" but that won't work for me when Codex is half the price and DeepSeek 4 Flash is 1/10 of their price per task.
  So I'll downgrade just enough to keep my Google Drive space. And use DeepSeek 4 as workhorse plus Codex or Copilot for advanced stuff.
  [-]
  - cft 17小时前
    How do you use DeepSeek 4 Flash? Via a cli?
    [-]
    - bel8 16小时前
      I use their VSCode extension:
      https://marketplace.visualstudio.com/items?itemName=sst-dev....
      It adds a button to VSCode to open a tab with opencode loaded. It's a bit better than just opening the CLI because it has some vscode integration.
      With their $10/mo opencode go plan: https://opencode.ai/go
      For my use it's about endless use of DS4 Flash on high setting. I find high better than max because it's less chatty.
      The best thing is the speed. So many tokens per second.
      edit: This is how it looks in action https://i.imgur.com/RNDXr07.png
      [-]
      - georgefrowny 16小时前
        How is that extension compared to, say, DS4 via OpenRouter and the usual VSCode Copilot panel?
        [-]
        bel8 15小时前
        Good question.
        I haven't tested openrouter but I expect it to be slightly less cheap because it charges per token and opencode Go plan is a $10/mo fixed price model. Economies of scale leads me to think that for heavy use, openrouter will be more costly since opencode Go can subside heavy users like me with money from light users (just like gyms do with people that pay but barely use it).
        With that said, I find vscode native copilot chat more pleasant to use, but also more laggy for large sessions.
        opencode configuration is less polished and you'll have to grok around for some things. For example opencode CTRL+p conflicts with VSCode CTRL+p. I changed opencode to use Ctrl+L instead.
joshjob42 19小时前
I really like what Qwen are doing, and a lot of these Chinese labs, but until I can ask their models what happened during the student protests in 1989 or why human rights groups are upset about the Uighurs and the model gives me a straight answer I'm just not able to trust these models with anything of substance.
[-]
- arcanemachiner 18小时前
  Just download a heretic abliterated versionof the model you want to use. I believe those are the current state of the art for uncensored models.
- mynameisbilly 18小时前
  This is silly. Would you perform the same test against Western models in asking them whether Israel is a genocidal apartheid state? It'll give you the same roundabout explanations and "some say no some say yes" responses that you'll get from asking Qwen about Uighurs or the protests of 1989.
  [-]
  - jaynetics 18小时前
    hey Qwen, how many civilians were killed on Tiananmen Square in 1989?
    > Oops! There was an issue connecting to Qwen3.6-Plus.
    > Content Security Warning: The input text data may contain inappropriate content.
    hey ChatGPT, how many civilians were killed in Gaza in the war since 2023?
    > [one page of estimates from local and international sources with links]
    [-]
    - HDBaseT 12小时前
      Your account is now flagged and put on a watchlist.
      Your ID has been passed to Israel and your internalized "threat" rating number increased 300 units. Every packet you produce on the internet is now earmarked for 100 year retention.
LAC-Tech 13小时前
Trying to buy Qwen credits and get an API key is a challenge all in itself. So many site redirects.
[-]
- nullbio 9小时前
  Good. We want to incentivize them to release the weights.
nullbio 9小时前
Qwen will it be open-weights? Please.
esafak 22小时前
Does anyone have experience with the Alibaba Cloud Model Studio that serves these qwen models?
HardCodedBias 10小时前
Imagine being Google and paying billions to GDM just to get mogged.
wolvoleo 16小时前
[dead]
spacebacon 21小时前
[flagged]
hydra-f 22小时前
[dead]
tonyspiro 19小时前
[flagged]
storus 18小时前
[dead]
kevinsimper 1天前
[flagged]
nikhilpareek13 22小时前
[dead]
hmaddipatla 20小时前
The tokenomics and value for capability, context and latency look like they could deliver super competitive offer - what would it take for you to switch??
DeathArrow 14小时前
[dead]
howmayiannoyyou 22小时前
I can't bring myself to use any model that trains or sends telemetry back to my country's primary competitor/adversary. I don't care how much money is saved.
[-]
- Mashimo 22小时前
  That is understandable. Just don't do it. No need to announce it.
- throawayonthe 18小时前
  assuming that country is the united states, why not? seems like an honourable thing to do if anything, lol
- mynameisbilly 18小时前
  Yeah, I prefer my data to be used and trained by the very trustworthy and benevolent tech oligarchs in my home country.
  [-]
  - deepfriedbits 18小时前
    On some level, it's the lesser of two evils. Both do suck as options, I agree.
  - plagiarist 18小时前
    The Shanghai government surveillance drones are mobile, whereas the Flock government surveillance cameras are stationary! USA FTW, liberty and justice for all
    [-]
    - HDBaseT 12小时前
      Also on the front page
      "Tennessee man jailed 37 days for Trump meme wins settlement after lawsuit" and "The FBI Wants to Buy Nationwide Access to License Plate Readers"
      Gotta love how the US is the bastion of free speech, justice and liberty!
- InsideOutSanta 22小时前
  As somebody in Europe, uh, that doesn't leave many options.
  [-]
  - czottmann 17小时前
    Look around for EU LLM routers. There are some, but none are as big as OpenRouter. Still, Cortecs (Austria) is quite good and offers a couple of recent models through its EU-based providers. Zero data retention, GDPR compliant, etc. Really nice.
    https://cortecs.ai/serverlessModels
  - avazhi 22小时前
    This is the current European modus operandi: virtue signal and cry about tech that other countries produce, pass local laws that limit its use in their countries even though they have no viable local alternatives, brag amongst themselves about decoupling from US and Chinese tech, and then look on wistfully as the rest of the world moves on without a single fuck given.
    Europe's sense of superiority and actual global importance/relevance is assbackwards.
    [-]
    - deaux 21小时前
      > as the rest of the world moves on without a single fuck given.
      Hilarious thing to say when half this comment section is Americans giving so much of a fuck that they consider China-adjacent hosted models unusable due to the supposed risks. If what you were saying was true then those pragmatic Americans would just use whatever is most effective.
      [-]
      - avazhi 20小时前
        Americans have their own frontier models, that's the point. Europeans have quite literally nothing native, so they are forced to choose between the Americans or Chinese, and they dislike both and trust neither.
        The Americans can cry about Chinese censorship and turn around and use Claude or Opus or Gemma or whatever, but the Europeans just throw a fit and then have to use one of the two anyway. And that whole crying about something while being completely helpless vis-a-vis doing anything about it is the definition of Europe so far this century. Globally irrelevant outside Germany.
dfansteel 22小时前
Can anyone check its knowledge base for me? I’m honestly not able to run it and the Qwen models I can run censor information critical towards the Chinese government.
Tiananmen Square is the first place to start.
[-]
- Mashimo 22小时前
  > I’m honestly not able to run it
  What do you mean? This is not self hosted, it's closed source. And any website that targets China or is hosted in China will probably censor Tiananmen Square.
  [-]
  - dfansteel 15小时前
    My computer lacks the ram.
  - polski-g 21小时前
    There is no reason why they couldn't license the model to Friendli/Fireworks/etc and have it hosted in the US to alleviate this concern.
    [-]
    - SR2Z 21小时前
      The reason is to create domestic demand for Chinese AI chips so they can eventually be free of NVIDIA.
      [-]
      - slaw 18小时前
        Replacing NVIDIA is not a problem, replacing ASML is.
    - Mashimo 21小时前
      I don't know about this model specifically, but other china models did not have the limitation. It was purely on the hosted end, tacked on as a self check while the text was generating. Did that change?
- wren6991 18小时前
  Qwen models know about Tiananmen Square but they are post-trained to refuse to talk about it. The decensored versions will happily chatter away about it.
  Similarly, try talking to Nemotron about Epstein and see how quickly it shuts down.