i am a reasonably intelligent person. i have been coding for years. i can hold my own in a technical conversation. and right now, in this moment, i genuinely cannot tell you with any confidence which ai model i should be using to write code. not even close. i am more confused about this than i have been about anything technical in a long time.
here's where i am. i have cursor open. cursor lets me pick the model. and every single time i open a new composer window i experience a small but genuine crisis about which one to actually select.
claude opus 4.8. claude sonnet 4.6. gpt-5.5. gpt-5.4. grok 4.3. gemini 3.1 pro. qwen3-coder. deepseek v4-pro. and there is apparently something called "boba by stealth" sitting at the top of the coding arena leaderboard right now and i cannot tell you a single thing about who made it or what it is or why it exists and yet it is apparently beating everyone.
i have read approximately forty reddit threads about this. they all contradict each other. someone with eight hundred upvotes says opus 4.8 is the only correct answer for anything serious. the top reply says that person is wrong and gpt-5.5 has better agentic performance on multi-file refactors. third comment says both of them are cooked on long runs and gemini 3.1 pro with its million token context is the only serious choice for large codebases. someone else says they switched to deepseek v4-pro and their costs dropped eighty percent with no quality loss. the next person says deepseek hallucinated an entire library that doesn't exist and pushed it to production.
i have no framework for evaluating any of this.
because here's the thing. the benchmarks don't help. i have looked at so many benchmarks. swe-bench verified. swe-bench pro. terminal-bench 2.0. terminal-bench 2.1. livecodebench. the coding arena elo. and then i pick the model that scored highest and it does something confidently wrong that a junior dev wouldn't do, and i'm back to square one wondering if i'm prompting wrong or if the benchmark is fake or if i just got unlucky.
and it's not just the model. it's the mode. are you using agent mode. are you letting it run terminal commands autonomously. are you doing ask mode and reviewing everything first. do you have a rules file. a memory file. a custom system prompt per project. there are people with elaborate cursor setups that look like mission control and i genuinely cannot determine if they are more productive than me or just performing productivity for the content.
and then there's the routing question. because apparently you're supposed to use different models for different tasks now. opus 4.8 for long autonomous runs where judgment compounds. gpt-5.5 for dense structured reasoning and anything scientific. gemini 3.1 pro for multimodal work and long document retrieval. qwen for cost-sensitive agent loops when you need fifty tool calls and don't want to remortgage your house. people have actual decision flowcharts for this. i have seen the flowcharts. they are not making me feel better.
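stripped down, most of those flowcharts seem to be a lookup table. here is a rough sketch of that idea in python, not anyone's actual setup. the task labels and the `pick_model` helper are made up for illustration, and the model strings are just the names from the threads, not real api identifiers.

```python
from typing import Optional

# hypothetical task labels (assumption); model names are the ones people
# mention in the threads, written as plain labels rather than api ids
ROUTES = {
    "long_autonomous_run": "opus 4.8",
    "structured_reasoning": "gpt-5.5",
    "scientific": "gpt-5.5",
    "multimodal": "gemini 3.1 pro",
    "long_document_retrieval": "gemini 3.1 pro",
    "cost_sensitive_agent_loop": "qwen",
}


def pick_model(task_type: str, default: Optional[str] = None) -> Optional[str]:
    """return the model the flowchart would suggest, or `default` if the task isn't listed."""
    return ROUTES.get(task_type, default)


if __name__ == "__main__":
    for task in ("long_autonomous_run", "multimodal", "writing_a_regex"):
        print(task, "->", pick_model(task))
```

which makes the actual problem obvious: the table is trivial to write and the hard part is knowing whether any row in it is still true.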
and grok 4.3. what do i do with grok 4.3. the benchmarks put it fourth overall. fourth out of everything. that's extraordinary. and yet every time it comes up in a thread someone immediately says something that makes me put it back down again and i can't even remember what it is but the feeling sticks.
i think what happened is the capability race moved faster than anyone's ability to develop genuine intuition about the tools. two years ago this was easier. you picked claude or gpt-4 and you got on with it. now there are fifteen serious options, they are genuinely different, the differences matter for different workloads, and also the differences change every six weeks when someone drops a new version and all the advice goes stale instantly. the thread telling you that sonnet 4.5 is the coding king is four months old. four months is basically a geological era now.
and the switching cost of actually testing them properly is high. you need to use a model on real work for weeks before you have a proper feel for it. you can't benchmark it yourself in an afternoon. so you're always working with someone else's intuition, formed on different work, in a different context, possibly three model versions ago, posted by someone whose use case has nothing to do with yours.
i'm not even sure this is a solvable problem. i think it might just be the permanent condition of working in this space now. perpetual low-level confusion interrupted by brief moments of "okay this one is clearly working" before the next release drops and the discourse resets entirely.
so i'm actually asking. not rhetorically.
what are you using right now. for real work. not what sounds impressive. what model, what tool, what mode, and what are you actually building with it.
because i am genuinely lost and the benchmark threads are not helping and i would very much like to hear from people doing the actual thing.
and if anyone can explain what boba by stealth is i would appreciate that too.