GPT-5.3-Codex

(openai.com)

1501 points | by meetpateltech 1 day ago

87 comments

Rperry2174 1 day ago

Whats interesting to me is that these gpt-5.3 and opus-4.6 are diverging philosophically and really in the same way that actual engineers and orgs have diverged philosophically
With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.
With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.
that feels like a reflection of a real split in how people think llm-based coding should work...
some want tight human-in-the-loop control and others want to delegate whole chunks of work and review the result
Interested to see if we eventually see models optimize for those two philosophies and 3rd, 4th, 5th philosophies that will emerge in the coming years.
Maybe it will be less about benchmarks and more about different ideas of what working-with-ai means

[-]
- karmasimida 1 day ago
  
  > With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.
  > With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.
  Ain't the UX is the exact opposite? Codex thinks much longer before gives you back the answer.
  
  [-]
  - xd1936 1 day ago
    
    I've also had the exact opposite experience with tone. Claude Code wants to build with me, and Codex wants to go off on its own for a while before returning with opinions.
    
    [-]
    - mrkstu 1 day ago
      
      Its likely that both are steering towards the middle from their current relative extremes and converging to nearly the same place.
      
      [-]
      - gervwyk 1 day ago
        
        also my experience in using these two models. they are trying to recover from oversteer perhaps.
    - mercnz 1 day ago
      
      well with the recent delays i can easily find claude code going off on it's own for 20 minutes and have no idea what it's going to come back with. but one time it overflowed it's context on a simple question, and then used up the rest of my session window. in a way a lot of ai assistants have ime have this awkward thing where they complicate something in a non-visible and think about it for a long time burning up context before coming up with a summary based upon some misconception.
      
      [-]
      - zen4ttitude 8 hours ago
        
        For complex tasks I ask ChatGPT or Grok to define context then I take it to Claude for accurate execution. I also created a complete pipeline to use locally and enrich with skills, agents, RAG, profiles. It is slower but very good. There is no magic, the richer the context window the more precise and contained the execution.
      - esperent 1 day ago
        
        The key is a well defined task with strong guardrails. You can add these to your agents file over time or you can probably just find someone's online to copy the basics from. Any time you find it doing something you didn't expect or don't like, add guardrails to prevent that in future. Claude hooks are also useful here, along with the hookify plugin to create them for you based on the current conversation.
        
        [-]
        
        vorticalbox 1 day ago
        
        I have started using openspec for this. I find it works far better to have a proposal and a list of tasks the ai stays more focused.
        https://openspec.dev/
    - PeterStuer 23 hours ago
      
      In terms of 'tone', I have been very impressed with Qwen-code-next over the last 2 days, especially as I have it running locally on a single modest 4090.
      
      [-]
      - turtle4 20 hours ago
        
        Did you set that up following a guide or anything you could share?
        
        [-]
        
        PeterStuer 19 hours ago
        
        Easiest way I know is to just use LMStudio. Just download and press play :). Optional, but recommended, increase the context length to 262144 if you have the DRAM available. It will definitely get slower as your interaction prolongs, but (at least for me) still tolerable speed.
        
        mathrawka 20 hours ago
        
        not OP, but I got it running on my 4090 (and RAM) by following this guide: https://unsloth.ai/docs/models/qwen3-coder-next
        I see around 30 t/s
    - kamban 1 day ago
      
      Same here, CC gives me options to pick direction after the planning stage.
  - WilcoKruijer 1 day ago
    
    Yes, you’re right for 4.5 and 5.2. Hence they’re focusing on improving the opposite thing and thus are actually converging.
  - cwyers 1 day ago
    
    Codex now lets you tell the LLM tgings in the middle of its thinking without interrupting it, so you can read the thinking traces and tell it to change course if it's going off track.
    
    [-]
    - fluidcruft 1 day ago
      
      That just seems like a UI difference. I've always interrupted claude code added a comment and it's continued without much issue. Otherwise if you just type the message is queued for next. There's no real reason to prefer one over the other except it sounds like codex can't queue messages?
      
      [-]
      - int_19h 1 day ago
        
        Codex can queue messages, but the queue only gets flushed once the agent is done with whatever it was working on, whereas Claude will read messages and adjust accordingly in the middle of whatever it is doing. It sounds like OP is saying that Codex can now do this latter bit as well.
      - esperent 1 day ago
        
        The problem is if you're using subagents, the only way to interject is often to press escape multiple times which kills all the running subagents. All I wanted to do was add a minor steering guideline.
        This might be better with the new teams feature.
        
        [-]
        
        Skwrm 1 day ago
        
        They actually made a change a few weeks ago that made subagents more steerable
        When they ask approval for a tool call, press down til the selector is on "No" and press tab, then you can add any extra instructions
        
        cruffle_duffle 1 day ago
        
        That is so annoying too because it basically throws away all the work the subagent did.
        Another thing that annoys me is the subagents never output durable findings unless you explicitly tell their parent to prompt the subagent to “write their output to a file for later reuse” (or something like that anyway)
        I have no idea how but there needs to be ways to backtrack on context while somehow also maintaining the “future context”…
  - bt1a 1 day ago
    
    This is most likely an inference serving problem in terms of capacity and latency given that Opus X and the latest GPT models available in the API have always responded quickly and slowly, respectively
- ghosty141 1 day ago
  
  I'm personally 100% convinced (assuming prices stay reasonable) that the Codex approach is here to stay.
  Having a human in the loop eliminates all the problems that LLMs have and continously reviewing small'ish chunks of code works really well from my experience.
  It saves so much time having Codex do all the plumbing so you can focus on the actual "core" part of a feature.
  LLMs still (and I doubt that changes) can't think and generalize. If I tell Codex to implement 3 features he won't stop and find a general solution that unifies them unless explicitly told to. This makes it kinda pointless for the "full autonomy" approach since effecitly code quality and abstractions completely go down the drain over time. That's fine if it's just prototyping or "throwaway" scripts but for bigger codebases where longevity matters it's a dealbreaker.
  
  [-]
  - _zoltan_ 1 day ago
    
    I'm personally 100% convinced of the opposite, that it's a waste of time to steer them. we know now that agentic loops can converge given the proper framing and self-reflectiveness tools.
    
    [-]
    - sealeck 1 day ago
      
      Converge towards what though... I think the level of testing/verification you need to have an LLM output a non-trivial feature (e.g. Paxos/anything with concurrency, business logic that isn't just "fetch value from spreadsheet, add to another number and save to the database") is pretty high.
      
      [-]
      - replygirl 1 day ago
        
        in the new world, engineers have to actually be good at capturing and interpreting requirements
        
        [-]
        
        halfcat 1 day ago
        
        In this new world, why stop there? It would be even better if engineers were also medical doctors and held multiple doctorate degrees in mathematics and physics and also were rockstar sales people.
        
        [-]
        
        NamlchakKhandro 1 day ago
        
        sounds like the kinds of hyperbole someone whose just been forced to set a linter for the first time
        
        craigdalton 1 day ago
        
        As a doctor, this sounds like an engineers job.
    - zeroxfe 1 day ago
      
      > it's a waste of time to steer them
      It's not a waste of time, it's a responsibility. All things need steering, even humans -- there's only so much precision that can be extrapolated from prompts, and as the tasks get bigger, small deviations can turn into very large mistakes.
      There's a balance to strike between micro-management and no steering at all.
      
      [-]
      - adw 1 day ago
        
        The prompt is decreasingly relevant. The verification environment you have is what actually matters.
        
        [-]
        
        freakynit 1 day ago
        
        I think this all comes down to information.
        Most prompts we give are severely information-deficient. The reason LLMs can still produce acceptable results is because they compensate with their prior training and background knowledge.
        The same applies to verification: it's fundamentally an information problem.
        You see this exact dynamic when delegating work to humans. That's why good teams rely on extremely detailed specs. It's all a game of information.
        
        [-]
        
        adrianN 22 hours ago
        
        Having prompts be information deficient is the whole point of LLMs. The only complete description of a typical programming problem is the final code or an equivalent formal specification.
        
        [-]
        
        freakynit 13 hours ago
        
        Exactly the point. But, LLM's miss that human intuition part.
    - bcarv 1 day ago
      
      Does the AI agent know what your company is doing right now, what every coworker is working on, how they are doing it, and how your boss will change priorities next month without being told?
      If it really knows better, then fire everyone and let the agent take charge. lol
      
      [-]
      - hyldmo 1 day ago
        
        No, but Codex wouldn’t have asked you those questions either
        
        [-]
        
        bcarv 1 day ago
        
        For me, it still asks for confirmation at every decision when using plans. And when multiple unforeseen options appear, it asks again. I don’t think you’ve used Codex in a while.
        
        [-]
        
        hyldmo 1 day ago
        
        It asks you what your coworkers are working on and whether the thing you are working on are your boss’ number one priority?
        
        [-]
        
        bcarv 1 day ago
        
        [flagged]
        
        fHr 1 day ago
        
        skill issue
      - IMTDb 1 day ago
        
        A significant portion of engineering time is now spent ensuring that yes, the LLM does know about all of that. This context can be surfaced through skills, MCP, connectors, RAG over your tools, etc. Companies are also starting to reshape their entire processes to ensure this information can be properly and accurately surfaced. Most are still far from completing that transformation, but progress tends to happen slowly, then all at once.
        
        [-]
        
        bcarv 1 day ago
        
        [flagged]
        
        [-]
        
        generallyjosh 14 hours ago
        
        All we can do is try our best to look at the world with clear eyes, and think about where the industry's going over the next couple years
        Not how we want things to be, but how they actually are and will be
        I don't think AI for programming is a passing fad
        
        jondwillis 1 day ago
        
        Who hurt you?
        Also what are you even proposing/advocating for here?
        This meta-state-of-company context is just as capturable as anything else with the right lines of questioning and spyware and UI/UX to elicit it.
    - rapind 1 day ago
      
      Maybe some day, but as a claude code user it makes enough pretty serious screw ups, even with a very clearly defined plan, that I review everything it produces.
      You might be able to get away without the review step for a bit, but eventually (and not long) you will be bitten.
      
      [-]
      - jaggederest 1 day ago
        
        I use that to feed back into my spec development and prompting and CI harnesses, not steering in real time.
        Every mistake is a chance to fix the system so that mistake is less likely or impossible.
        I rarely fix anything in real time - you review, see issues, fix them in the spec, reset the branch back to zero and try again. Generally, the spec is the part I develop interactively, and then set it loose to go crazy.
        This feels, initially, incredibly painful. You're no longer developing software, you're doing therapy for robots. But it delivers enormous compounding gains, and you can use your agent to do significant parts of it for you.
        
        [-]
        
        Terretta 1 day ago
        
        > You're no longer developing software, you're doing therapy for robots.
        Or, really, hacking in "learning", building your knowhow-base.
        > But it delivers enormous compounding gains, and you can use your agent to do significant parts of it for you.
        Strong yes to both, so strong that it's curious Claude Code, Codex, Claude Cowork, etc., don't yet bake in an explicit knowledge evolution agent curating and evolving their markdown knowledge base:
        https://github.com/anthropics/knowledge-work-plugins
        Unlikely to help with benchmarks. Very likely to improve utility ratings (as rated by outcome improvements over time) from teams using the tools together.
        For those following along at home:
        This is the return of the "expert system", now running on a generalized "expert system machine".
        
        rapind 1 day ago
        
        I assumed you'd build such a massive set of rules (that claude often does not obey) that you'd eat up your context very quickly. I've actually removed all plugins / MCPs because they chewed up way too much context.
        
        [-]
        
        jaggederest 1 day ago
        
        It's as much about what to remove as what to add. Curation is the key. Skills also give you some levers to get the kind of context-sensitive instruction you need, though I haven't delved too deeply into them. My current total instruction set is around ~2500 tokens at the moment
      - vidarh 23 hours ago
        
        Reviewing what it produces once it thinks it has met the acceptance criteria and the test suite passes is very different from wasting time babysitting every tiny change.
        
        [-]
        
        rapind 18 hours ago
        
        True, and that's usually what I'm doing now, but to be honest I'm also giving all of it's code at least a cursory glance.
        Some of the things it occasionally does:
        - Ignores conventions (even when emphasized in the CLAUDE.md)
        - Decides to just not implement tests if gets spins out on them too much (it tells you, but only as it happens and that scrolls by pretty quick)
        - Writes badly performing code (N+1)
        - Does more than you asked (in a bad way, changing UIs or adding cruft)
        - Makes generally bad assumptions
        I'm not trying to be overly negative, but in my experience to date, you still need to babysit it. I'm interested though in the idea of using multiple models to have them perform independent reviews to at least flag spots that could use human intervention / review.
        
        [-]
        
        vidarh 4 hours ago
        
        Sure, but non of those things requires you to watch it work. They're all easy to pick up on when reviewing a finished change, which ideally should come after it's instructions have had it run linters, run sub agents that verify it has added tests, run sub agents doing a code review.
        I don't want to waste my time reviewing a change the model can still significantly improve all by itself. My time costs far more than the models.
      - _zoltan_ 5 hours ago
        
        then you're using it wrong, to be frank with you.
        you give it tools so it can compile and run the code. then you give it more tools so it can decide between iterations if it got closer to the goal or not. let it evaluate itself. if it can't evaluate something, let it write tests and benchmark itself.
        I guarantee that if the criteria is very well defined and benchmarkable, it will do the right thing in X iterations.
        (I don't do UI development. I do end-to-end system performance on two very large code bases. my tests can be measured. the measure is very simply binary: better or not. it works.)
        
        [-]
        
        dostick 2 hours ago
        
        That’s what oh-my-open-code does.
    - halfcat 1 day ago
      
      > given the proper framing
      This sounds like never. Most businesses are still shuffling paper and couldn’t give you the requirements for a CRUD app if their lives depended on it.
      You’re right, in theory, but it’s like saying you could predict the future if you could just model the universe in perfect detail. But it’s not possible, even in theory.
      If you can fully describe what you need to the degree ambiguity is removed, you’ve already built the thing.
      If you can’t fully describe the thing, like some general “make more profit” or “lower costs”, you’re in paper clip maximizer territory.
      
      [-]
      - jondwillis 1 day ago
        
        > If you can fully describe what you need to the degree ambiguity is removed, you’ve already built the thing.
        Trying to get my company to realize this right now.
        Probably the most efficient way to work, would be on a video call including the product person/stakeholder, designer, and me, the one responsible for the actual code, so that we can churn through the now incredibly fast and cheap implementation step together in pure alignment.
        You could probably do it async but it’s so much faster to not have to keep waiting for one another.
    - retinaros 18 hours ago
      
      good luck.
      
      [-]
      - _zoltan_ 5 hours ago
        
        I've been working on very complex problems with this model and the results I have have surprised people over and over again.
  - xXSLAYERXx 1 day ago
    
    I've been using codex for one week and I have been the most productive I have ever been. Small prs, tight rules, I get almost exactly what I want. Things tend to go sideways when scope creeps into my request. But I just close the PR instead of fighting with the agent. In one week: 28 prs, 26 merged. Absolutely unreal.
  - vidarh 23 hours ago
    
    I will personally never consider using an agent that can't be easily pushed toward working on its own for long periods (hours) at a time. It's a total waste of time for me to babysit the LLM.
  - sejje 1 day ago
    
    Aider was doing this a long time ago
  - Skidaddle 1 day ago
    
    But tokens are way cheaper than human labor
  - NuclearPM 1 day ago
    
    > If I tell Codex to implement 3 features he won't stop and find a general solution that unifies them unless explicitly told to
    That could easily be automated.
- utilize1808 1 day ago
  
  I think it's the opposite. Especially considering Codex started out as a web app that offers very little interactivity: you are supposed to drop a request and let it run automatously in a containerized environment; you can then follow up on it via chat --- no interactive code editing.
  
  [-]
  - Rperry2174 1 day ago
    
    Fair I agree that was true of early codex and my perception too.. but today there are two announcements that came out and thats what im referring to.
    specifically, the GPT-5.3 post explicitly leans into "interactive collaborator" langauge and steering mid execution
    OpenAI post: "Much like a colleague, you can steer and interact with GPT-5.3-Codex while it’s working, without losing context."
    OpenAI post: "Instead of waiting for a final output, you can interact in real time—ask questions, discuss approaches, and steer toward the solution"
    Claude post: "Claude Opus 4.6 is designed for longer-running, agentic work — planning complex tasks more carefully and executing them with less back-and-forth from the user."
    
    [-]
    - stingraycharles 1 day ago
      
      I think those OpenAI announcements are mainly because this hasn’t been the case for them earlier, while it has been part of Claude Code since the beginning.
      I don’t think there’s something deeply philosophical in here, especially as Claude Code is pushing stronger for asking more questions recently, introduced functionality to “chat about questions” while they’re asked, etc.
    - user34283 1 day ago
      
      When I tried 5.2 Codex in GitHub Copilot it executed some first steps like searching for the relevant files, then it output the number "2" and stopped the response.
      On further prompting it did the next step and terminated early again after printing how it would proceed.
      It's most likely just a bug in GitHub Copilot, but it seems weird to me that they add models that clearly don't even work with their agentic harness.
    - fluidcruft 1 day ago
      
      Frankly it seems to be that codex is playing catch-up with claude code and claude code is just continuing to move further ahead. The thing with claude code is it will work longer... if you want it to. It's always had good oversight and (at least for me) it builds trust slowly until you are wishing it would do more at once. When I've used codex (it has been getting better) but back in the day it would just do things and say it's done and you're just sitting there wondering "wtf are you doing?". Claude code is more the opposite where you can watch as closely as you want and often you get to a point where you have enough trust and experience with it that you know what it's going to do and don't want to bother.
- mcintyre1994 1 day ago
  
  This kind of sounds like both of them stepping into the other’s turf, to simplify a bit.
  I haven’t used Codex but use Claude Code, and the way people (before today) described Codex to me was like how you’re describing Opus 4.6
  So it sounds like they’re converging toward “both these approaches are useful at different times” potentially? And neither want people who prefer one way of working to be locked to the other’s model.
- giancarlostoro 1 day ago
  
  > With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.
  This feels wrong, I can't comment on Codex, but Claude will prompt you and ask you before changing files, even when I run it in dangerous mode on Zed, I can still review all the diffs and undo them, or you know, tell it what to change. If you're worried about it making too many decisions, you can pre-prompt Claude Code (via .claude/instructions.md) and instruct it to always ask follow up questions regarding architectural decisions.
  Sometimes I go out of my way to tell Claude DO NOT ASK ME FOR FOLLOW UPS JUST DO THE THING.
  
  [-]
  - Rperry2174 1 day ago
    
    yeah I'm mostly just talking about how they're framing it: "Claude Opus 4.6 is designed for longer-running, agentic work — planning complex tasks more carefully and executing them with less back-and-forth from the user"
    I guess its also quite interesting that how they are framing these projects are opposite from how people currently perceive them and I guess that may be a conscious choice...
    
    [-]
    - giancarlostoro 1 day ago
      
      I get what you mean now, I like that to be fair, sometimes I want Claude to tell me some architectural options, so I ask it so I can think about what my options are, sometimes I rethink my problem if I like Claudes conclusion.
- jhancock 1 day ago
  
  Good breakdown.
  I usually want the codex approach for code/product "shaping" iteratively with the ai.
  Once things are shaped and common "scaling patterns" are well established, then for things like adding a front end (which is constantly changing, more views) then letting the autonomous approach run wild can *sometimes* be useful.
  I have found that codex is better at remembering when I ask to not get carried away...whereas claude requires constant reminders.
- techbro_1a 1 day ago
  
  > With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.
  This is true, but I find that Codex thinks more than Opus. That's why 5.2 Codex was more reliable than Opus 4.5
- bob1029 1 day ago
  
  I think there is another philosophy where the agent is domain specific. Not that we have to invent an entirely new universe for every product or business, but that there is a small amount of semi-customization involved to achieve an ideal agent.
  I would much rather work with things like the Chat Completion API than any frameworks that compose over it. I want total control over how tool calling and error handling works. I've got concerns specific to my business/product/customer that couldn't possibly have been considered as part of these frameworks.
  Whether or not a human needs to be tightly looped in could vary wildly depending on the specific part of the business you are dealing with. Having a purpose-built agent that understands where additional verification needs to occur (and not occur) can give you the best of both worlds.
- dimgl 1 day ago
  
  Did you get those backwards? Codex, Gemini, etc. all wait until the requests are done to accept user feedback. Claude Code allows you to insert messages in between turns.
  
  [-]
  - aurareturn 1 day ago
    
    Codex added an experimental feature to allow steering mid task.
- aulin 1 day ago
  
  Admit I didn't follow the announcements but isn't that a matter of UI? Doesn't seem something that should be baked in the model but in the tooling around it and the instructions you give them. E.g. I've been playing with with GitHub copilot CLI (that despite the bad fame is absolutely amazing) and the same model completely changes its behavior with the prompt. You can have it answer a question promptly or send it on a multi-hour multi-agent exploration writing detailed specs with a single prompt. Or you can have it stop midway for clarification. It all depends on the instructions. Also this is particularly interesting with GitHub billing model as each prompt counts 1 request no matter how many tokens it burns.
  
  [-]
  - F7F7F7 1 day ago
    
    It depends honestly. Both are prone to doing the exact opposite of what you asked. Especially with poor context management.
    I’ve had both $200 plans and now just have Max x20 and use the $20 ChatGPT plan for an inferior Codex.
    My experience (up until today) has always been that Codex acts like that one Sr Engineer that we all know. They are kind of a dick. And will disappear into a dark hole and emerge with a circle when you asked for a pentagon. Then let you know why edges are bad for you.
    And yes, Anthropic is pivoting hard into everything agentic. I bet it’s not too long before Claude Code stops differentiating models. I had Opus blow 750k tokens on a single small task.
- cchance 1 day ago
  
  Just because you can inject steering doesn't mean they stered away from long running...
  Theres hundreds of people who upload Codex 5.2 running for hours unattended and coming back with full commits
- mdale 1 day ago
  
  I think it's just both companies building/ marketing to the strength of their competitor. As general perception has been the opposite for codex and Opus respectfully.
- sfmike 1 day ago
  
  It's the opposite? codex course corrects and is self inquisitive. opus is just wrong and need to refeed it it's wrong.
- hbarka 1 day ago
  
  How can they be diverging, LLMs are built on similar foundations aka the Transformer architecture. Do you mean the training method (RLHF) is diverging?
  
  [-]
  - iranintoavan 1 day ago
    
    I'm not OP but I suspect they are meaning the products / tooling / company direction, not necessarily the underlying LLM architecture.
- dboon 1 day ago
  
  …what? It is quite literally the opposite. This isn’t a matter of taste or perception.
- mi_lk 16 hours ago
  
  It’s the opposite way
- blurbleblurble 1 day ago
  
  Funny cause the situation was totally flipped last iteration.
- pyrolistical 1 day ago
  
  Boing vs airbus philosophy
- rippeltippel 1 day ago
  
  Grabbing popcorn...
- rozumbrada 1 day ago
  
  I read this exact comment with I would say completely the same words several times in X and I would bet my money it's LLM generated by someone who has not even tried both the tools. This AI slop even in the site like this without direct monetisation implications from fake engagement is making me sick...
- drsalt 1 day ago
  
  be rich, hire an ai guy, let him deal with it
- d--b 1 day ago
  
  I am definitely using Opus as an interactive collaborator that I steer mid-execution, stay in the loop and course correct as it works.
  I mean Opus asks a lot if he should run things, and each time you can tell it to change. And if that's not enough you can always press esc to interrupt.
- adarsh2321 1 day ago
  
  [dead]
granzymes 1 day ago

I think Anthropic rushed out the release before 10am this morning to avoid having to put in comparisons to GPT-5.3-codex!
The new Opus 4.6 scores 65.4 on Terminal-Bench 2.0, up from 64.7 from GPT-5.2-codex.
GPT-5.3-codex scores 77.3.

[-]
- the_duke 1 day ago
  
  I do not trust the AI benchmarks much, they often do not line up with my experience.
  That said ... I do think Codex 5.2 was the best coding model for more complex tasks, albeit quite slow.
  So very much looking forward to trying out 5.3.
  
  [-]
  - NitpickLawyer 1 day ago
    
    Just some anecdata++ here but I found 5.2 to be really good at code review. So I can have something crunched by cheaper models, reviewed async by codex and then re-prompt with the findings from the review. It finds good things, doesn't flag nits (if prompted not to) and the overall flow is worth it for me. Speed loss doesn't impact this flow that much.
    
    [-]
    - kilroy123 1 day ago
      
      Personally, I have Claude do the coding. Then 5.2-high do the reviewing.
      
      [-]
      - mmaunder 1 day ago
        
        I might flip that given how hard it's been for Claude to deal with longer context tasks like a coding session with iterations vs a single top down diff review.
      - seunosewa 1 day ago
        
        Then I pass the review back to Claude Opus to implement it.
        
        [-]
        
        VladVladikoff 1 day ago
        
        Just curious is this a manual process or you guys have automated these steps?
        
        [-]
        
        ricketycricket 1 day ago
        
        I have a `codex-review` skill with a shell script that uses the Codex CLI with a prompt. It tells Claude to use Codex as a review partner and to push back if it disagrees. They will go through 3 or 4 back-and-forth iterations some times before they find consensus. It's not perfect, but it does help because Claude will point out the things Codex found and give it credit.
        
        [-]
        
        bryanlarsen 1 day ago
        
        Mind sharing the skill/prompt?
        
        [-]
        
        dror 1 day ago
        
        Not the OP, but I use the same approach.
        https://gist.github.com/drorm/7851e6ee84a263c8bad743b037fb7a...
        I typically use github issues as the unit of work, so that's part of my instruction.
        
        _zoltan_ 1 day ago
        
        zen-mcp (now called pal-mcp I think) and then claude code can actually just pass things to gemini (or any other model)
        
        kilroy123 1 day ago
        
        Sometimes, depends on how big of a task. I just find 5.2 so slow.
      - _zoltan_ 1 day ago
        
        I have Opus 4.5 do everything then review it with Gemini 3.
    - StephenHerlihyy 1 day ago
      
      I don’t use OpenAI too much, but I follow a similar work flow. Use Opus for design/architecture work. Move it to Sonnet for implementation and build out. Then finally over to Gemini for review, QC and standards check. There is an absolute gain in using different models. Each has their own style and way of solving the problem just like a human team. It’s kind of awesome and crazy and a bit scary all at once.
      
      [-]
      - readyforbrunch 1 day ago
        
        How do you orchestrate this workflow? Do you define different skills that all use different models, or something else?
        
        [-]
        
        nitroedge 1 day ago
        
        You should check out the PAL MCP and then also use this process, its super solid: https://github.com/glittercowboy/get-shit-done
        The way "Phases" are handled is incredible with research then planning, then execution and no context rot because behind the scenes everything is being saved in a State.md file...
        I'm on Phase 41 of my own project and the reliability and almost absence of any error is amazing. Investigate and see if its a fit for you. The PAL MCP you can setup to have Gemini with its large context review what Claude codes.
  - aurareturn 1 day ago
    
    5.2 Codex became my default coding model. It “feels” smarter than Opus 4.5.
    I use 5.2 Codex for the entire task, then ask Opus 4.5 at the end to double check the work. It's nice to have another frontier model's opinion and ask it to spot any potential issues.
    Looking forward to trying 5.3.
    
    [-]
    - koakuma-chan 1 day ago
      
      Opus 4.5 is more creative and better at making UIs
      
      [-]
      - hypercube33 1 day ago
        
        Unless it's scroll bar theming then my God it's bad. it told me it gives up. Gemini 3 got stuck but the right prompt it did work.
  - fooker 1 day ago
    
    Yeah, these benchmarks are bogus.
    Every new model overfits to the latest overhyped benchmark.
    Someone should take this to a logical extreme and train a tiny model that scores better on a specific benchmark.
    
    [-]
    - bunderbunder 1 day ago
      
      All shared machine learning benchmarks are a little bit bogus, for a really “machine learning 101” reason: your test set only yields an unbiased performance metric if you agree to only use it once. But that just isn’t a realistic way to use a shared benchmark. Using them repeatedly is kind of the whole point.
      But even an imperfect yardstick is better than no yardstick at all. You’ve just got to remember to maintain a healthy level of skepticism is all.
      
      [-]
      - abustamam 1 day ago
        
        Is an imperfect yardstick better than no yardstick? It reminds me of documentation — the only thing worse than no documentation is wrong documentation.
        
        [-]
        
        bunderbunder 1 day ago
        
        Yes, because there’s value in a common reference for comparison. It helps to shed light on different models’ relative strengths and weaknesses. And, just like with performance benchmarks, you can learn to spot and read past the ways that people game their results. The danger is really more in when people who are less versed in the subject matter take what are ultimately just a semi tamed genre of sales pitch at face value.
        When such benchmarks aren’t available what you often get instead is teams creating their own benchmark datasets and then testing both their and existing models’ performance against it. Which is eve worse because they probably still the rest multiple times (there’s simply no way to hold others accountable on this front), but on top of that they often hyperparameter tune their own model for the dataset but reuse previously published hyperparameters for the other models. Which gives them an unfair advantage because those hyperparameters were tuned to a doffeeent dataset and may not have even been optimizing for the same task.
        
        [-]
        
        abustamam 1 day ago
        
        Thanks, that makes sense!
    - mrandish 1 day ago
      
      > Yeah, these benchmarks are bogus.
      It's not just over-fitting to leading benchmarks, there's also too many degrees of freedom in how a model is tested (harness, etc). Until there's standardized documentation enabling independent replication, it's all just benchmarketing .
      
      [-]
      - fooker 1 day ago
        
        For the current state of AI, the harness is unfortunately part of the secret sauce.
        
        [-]
        
        ndriscoll 1 day ago
        
        In what sense? Codex CLI is FOSS and works fine with other models as a backend, including those served by llama.cpp.
    - scoring1774 1 day ago
      
      This has been done: https://arxiv.org/abs/2510.04871v1
  - mmaunder 1 day ago
    
    ARG-AGI-2 leaderboard has a strong correlation with my Rust/CUDA coding experience with the models.
  - int_19h 1 day ago
    
    Codex 5.3 seems to be a lot chattier. As in, it comments in the chat about things it has done or is about to do. They don't show up as "thinking" CoT blocks, but as regular outputs, but overall the experience is somewhat more like Claude is in that you can spot the problems in model's reasoning much earlier if you keep an eye on it as it works, and steer it away.
  - jahsome 1 day ago
    
    Another day, another hn thread of "this model changes everything" followed immediately by a reply stating "actually I have the literal opposite experience and find competitor's model is the best" repeated until it's time to start the next day's thread.
    
    [-]
    - StephenHerlihyy 1 day ago
      
      What amazes me the most is the speed at which things are advancing. Go back a year or even a year before that and all these incremental improvements have compounded. Things that used to require real effort to consistently solve, either with RAGs, context/prompt engineering, have become… trivial. I totally agree with your point that each step along the way doesn’t necessarily change that much. But in the aggregate it’s sort of insane how fast everything is moving.
      
      [-]
      - Rudybega 1 day ago
        
        The denial of this overall trend on here and in other internet spaces is starting to really bother me. People need to have sober conversations about the speed of this increase and what kind of effects it's going to have on the world.
        
        [-]
        
        girvo 1 day ago
        
        The effects when extrapolated out aren’t good, IMO. Certainly bad for me, a mid 30s software engineer who’s been doing this for nearly 20 years…
        
        theLiminator 1 day ago
        
        Yeah, I really didn't believe in agentic coding until December, that was where it took off from being slightly more useful than hand crafting code to becoming extremely powerful.
        
        szundi 1 day ago
        
        [dead]
    - SatvikBeri 1 day ago
      
      I use Claude Code every day, and I'm not certain I could tell the difference between Opus 4.5 and Opus 4.0 if you gave me a blind test
    - malshe 1 day ago
      
      This pretty accurately summarizes all the long discussions about AI models on HN.
    - clhodapp 1 day ago
      
      And of course the benchmarks are from the school of "It's better to have a bad metric than no metric", so there really isn't any way to falsify anyone's opinions...
    - cactusplant7374 1 day ago
      
      Hourly occurrence on /r/codex. Model astrology is about the vibes.
    - wasmainiac 1 day ago
      
      [flagged]
      
      [-]
      - nocman 1 day ago
        
        > Who are making these claims? script kiddies? sr devs? Altman?
        AI agents, perhaps? :-D
      - locknitpicker 1 day ago
        
        > All anonymous as well. Who are making these claims? script kiddies? sr devs? Altman?
        You can take off your tinfoil hat. The same models can perform differently depending on the programming language, frameworks and libraries employed, and even project. Also, context does matter, and a model's output greatly varies depending on your prompt history.
        
        [-]
        
        andrepd 1 day ago
        
        It's hardly tinfoil to understand that companies riding a multi-trillion dollar funding wave would spend a few pennies astroturfing their shit on hn. Or overfit to benchmarks that people take as objective measurements.
      - BoredPositron 1 day ago
        
        When you keep his ramblings on twitter or company blog in mind I bet he is a shit poster here.
  - nerdsniper 1 day ago
    
    Opus 4.5 still worked better for most of my work, which is generally "weird stuff". A lot of my programming involves concepts that are a bit brain-melting for LLMs, because multiple "99% of the time, assumption X is correct" are reversed for my project. I think Opus does better at not falling into those traps. Excited to try out 5.3
    
    [-]
    - nubg 1 day ago
      
      what do you do?
      
      [-]
      - audience_mem 1 day ago
        
        He works on brain-melting stuff, the understanding of which is far beyond us.
        
        [-]
        
        nerdsniper 1 day ago
        
        It's relatively easy for people to grok, if a bit niche. Just sometimes confuses LLMs. Humans are much better at holding space for rare exceptions to usual rules than LLMs are.
- leumon 1 day ago
  
  they tested it at xhigh reasoning though, which is probably double the cost of Anthropic's model.
  Cost to Run Artificial Analysis Intelligence Index:
  GPT-5.2 Codex (xhigh): $3244
  Claude Opus 4.5-reasoning: $1485
  (and probably similar values for the newer models?)
  
  [-]
  - redox99 1 day ago
    
    With $20 gpt plan you can use xhigh no problem. With $20 Claude plan you reach the 5h limit with a single feature.
    
    [-]
    - mattkevan 1 day ago
      
      Ha, Claude Code on a pro plan often can't complete a single message before hitting the 5h limit. Not hit it once so far on Codex.
      
      [-]
      - naths88 1 day ago
        
        This, so frustrating. But CC is so much faster too.
  - Computer0 1 day ago
    
    A provider's API costs seemingly do not reflect each respective SOTA provider's subscription usage allowances.
- __jl__ 1 day ago
  
  Impressive jump for GPT-5.3-codex and crazy to see two top coding models come out on the same day...
  
  [-]
  - granzymes 1 day ago
    
    Insane! I think this has to be the shortest-lived SOTA for any model so far. Competition is amazing.
- wilg 1 day ago
  
  In my personal experience the GPT models have always been significantly better than the Claude models for agentic coding, I’m baffled why people think Claude has the edge on programming.
  
  [-]
  - dudeinhawaii 1 day ago
    
    I think for many/most programmers = 'speed + output' and webdev == "great coding".
    Not throwing shade anyone's way. I actually do prefer Claude for webdev (even if it does cringe things like generate custom CSS on every page) -- because I hate webdev and Claude designs are always better looking.
    But the meat of my code is backend and "hard" and for that Codex is always better, not even a competition. In that domain, I want accuracy and not speed.
    Solution, use both as needed!
    
    [-]
    - falloutx 1 day ago
      
      > I actually do prefer Claude for webdev
      Ah and let me guess all your frontends look like cookie cutter versions of this: https://openclaw.dog/
      
      [-]
      - Yiin 1 day ago
        
        Yes and I love it.
    - whynotminot 1 day ago
      
      > Solution, use both as needed!
      This is the way. People are unfortunately starting to divide themselves into camps on this — it’s human nature we’re tribal - but we should try to avoid turning this into a Yankees Redsox.
      Both companies are producing incredible models and I’m glad they have strengths because if you use them both where appropriate it means you have more coverage for important work.
    - flir 1 day ago
      
      That's the best theory I've heard. Or at least, it's the one that fits with my usage. I'm mostly-backend, and I'm mostly-GPT.
      (I'm also a "small steps under guidance" user rather than a "fire and forget" user, so maybe that plays into it too).
    - theLiminator 1 day ago
      
      Actually for me the killer feature isn't Claude, but is the planning mode.
      It's a very nice UX for iteratively creating a spec that I can refine.
      
      [-]
      - Dma54rhs 1 day ago
        
        Codex has it as well now.
  - soulofmischief 1 day ago
    
    GPT 5.2 codex plans well but fucks off a lot, goes in circles (more than opus 4.5) and really just lacks the breadth of integrated knowledge that makes opus feel so powerful.
    Opus is the first model I can trust to just do things, and do them right, at least small things. For larger/more complex things I have to keep either model on extremely short leashes. But the difference is enough that I canceled my GPT Pro sub so I could switch to Claude. Maybe 5.3 will change things, but I also cannot continue to ethically support Sam Altman's business.
    
    [-]
    - int_19h 1 day ago
      
      I'd say that GPT 5.2 did slightly better on the stuff that I'm working on currently compared to Opus 4.5, but it's rather niche - a fancy Lojban parser in Haskell). However Opus is much easier to steer interactively because you can see what it's doing in more detail (although 5.3 is much improved in that regard!). I wouldn't feel empty-handed with either model, and both wrote large chunks of code for this project.
      All that said, the single biggest reason why I use Codex a lot more is because the $200 plan for it is so much more generous. With Claude, I very quickly burn through the quota and then have to wait for several days or else buy more credit. With Codex, running in High reasoning mode as standard with occasional use of XHigh to write specs or debug gnarly issues, and having agents run almost around the clock in the background, I have hit the limit exactly once so far.
    - wilg 1 day ago
      
      I always use 5.2-Codex-High or 5.2-Codex-Extra High (in Cursor). The regular version is probably too dumb.
      
      [-]
      - soulofmischief 1 day ago
        
        Didn't make a difference for me. Though I will say, so far 4.6 is really pissing me off and I might downgrade back to 4.5. It just refuses to listen to what I say, the steering is awful.
  - fragmede 1 day ago
    
    How many people are building the same thing multiple times to compare model performance? I'm much more interested in getting the thing I'm building getting built, than than comparing AIs to each other.
- jronak 1 day ago
  
  Did you look at the ARC AGI 2? Codex might be overfit for terminal bench
  
  [-]
  - tedsanders 1 day ago
    
    ARC AGI 2 has a training set that model providers can choose to train on, so really wouldn't recommend using it as a general measure of coding ability.
    
    [-]
    - mrandish 1 day ago
      
      A key aspect of ARC AGI is to remain highly resistant to training on test problems which is essential for ARC AGI's purpose of evaluating fluid intelligence and adaptability in solving novel problems. They do release public test sets but hold back private sets. The whole idea is being a test where training on public test sets doesn't materially help.
      The only valid ARC AGI results are from tests done by the ARC AGI non-profit using an unreleased private set. I believe lab-conducted ARC AGI tests must be on public sets and taken on a 'scout's honor' basis that the lab self-administered the test correctly, didn't cheat or accidentally have public ARC AGI test data slip into their training data. IIRC, some time ago there was an issue when OpenAI published ARC AGI 1 test results on a new model's release which the ARC AGI non-profit was unable to replicate on a private set some weeks later (to be fair, I don't know if these issues were resolved). Edit to Add: Summary of what happened: https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...
      I have no expertise to verify how training-resistant ARC AGI is in practice but I've read a couple of their papers and was impressed by how deeply they're thinking through these challenges. They're clearly trying to be a unique test which evaluates aspects of 'human-like' intelligence other tests don't. It's also not a specific coding test and I don't know how directly ARC AGI scores map to coding ability.
    - janalsncm 1 day ago
      
      More fundamentally, ARC is for abstract reasoning. Moving blocks around on a grid. While in theory there is some overlap with SWE tasks, what I really care about is competence on the specific task I will ask it to do. That requires a lot of domain knowledge.
      As an analogy, Terence Tao may be one of the smartest people alive now, but IQ alone isn’t enough to do a job with no domain-specific training.
- nurettin 1 day ago
  
  Opus was quite useless today. Created lots of globals, statics, forward declarations, hidden implementations in cpp files with no testable interface, erasing types, casting void pointers, I had to fix quite a lot and decouple the entangled mess.
  Hopefully performance will pick up after the rollout.
  
  [-]
  - nickstinemates 1 day ago
    
    Did you give it any architecture guidance? An architecture skill that it can load to make sure it lays out things according to your taste?
    
    [-]
    - nurettin 1 day ago
      
      Yes, it has a very tight CLAUDE.md which it used to follow. Feels like this happens a couple of times a month.
xiphias2 1 day ago

,,GPT‑5.3-Codex is the first model we classify as High capability for cybersecurity-related tasks under our Preparedness Framework , and the first we’ve directly trained to identify software vulnerabilities. While we don’t have definitive evidence it can automate cyber attacks end-to-end, we’re taking a precautionary approach and deploying our most comprehensive cybersecurity safety stack to date. Our mitigations include safety training, automated monitoring, trusted access for advanced capabilities, and enforcement pipelines including threat intelligence.''
While I love Codex and believe it's amazing tool, I believe their preparedness framework is out of date. As it is more and more capable of vibe coding complex apps, it's getting clear that the main security issues will come up by having more and more security critical software vibe coded.
It's great to look at systems written by humans and how well Codex can be used against software written by humans, but it's getting more important to measure the opposite: how well humans (or their own software) are able to infiltrate complex systems written mostly by Codex, and get better on that scale.
In simpler terms: Codex should write secure software by default.

[-]
- mrkeen 1 day ago
  
  Is "high-capability" a stronger or weaker claim than "team of phd-level experts"?
  https://www.nbcnews.com/tech/tech-news/openai-releases-chatg...
  
  [-]
  - semiinfinitely 1 day ago
    
    much stronger
    
    [-]
    - Xunjin 1 day ago
      
      Don't forget that is also Harder, Better, Faster.
- trcf23 1 day ago
  
  That’s just classical OpenAI trying to make us believe they’re closing on AGI… Like all « so called » research from them and Anthropic about safety alignment and that their tech is so incredibly powerful that guardrails should be put on them.
- ActionHank 1 day ago
  
  I heard the other day that every time someone claps another vibe coded project embeds the api keys in the webpage.
  I wonder if this will continue to be the case.
- da_grift_shift 1 day ago
  
  >Our mitigations include safety training, automated monitoring, trusted access for advanced capabilities, and enforcement pipelines including threat intelligence.
  "We added some more ACLs and updated our regex"
- manmal 1 day ago
  
  Please no, I don’t need my quick prototypes hardened against every perceivable threat.
  
  [-]
  - comex 1 day ago
    
    In most cases security is not a matter of adding anything in particular, but a matter of just not making specific types of mistakes.
    
    [-]
    - dimitri-vs 1 day ago
      
      Maybe I'm being dumb but that reads very contradictory? I would say that security is explicitly a matter of adding particular things.
      
      [-]
      - Ronsenshi 1 day ago
        
        Not an OP, but seems like you might be talking about different things.
        Security could be about not adding certain things/making certain mistakes. Like not adding direct SQL queries with data inserted as part of the query string and instead using bindings or ORM.
        If you have insecure raw query that you feed into ORM that you added on top - that's not going to make query more secure.
        But on the other hand when you're securing some endpoints in APIs you do add things like authorization, input validation and parsing.
        So I think a lot depends on what you mean when you're talking about security.
        Security is security - making sure bad things don't happen and in some cases it's different approach in the code, in some cases additions to the code and in some cases removing things from the code.
    - victorbjorklund 1 day ago
      
      Is there ever a reason to store passwords in plaintext instead of as a hash? Even in a prototype.
      
      [-]
      - fragmede 1 day ago
        
        The better question is which LLM is going to make such a basic mistake?
itay-maman 1 day ago

Something that caught my eye from the announcement:
> GPT‑5.3‑Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training
I'm happy to see the Codex team moving to this kind of dogfooding. I think this was critical for Claude Code to achieve its momentum.

[-]
- codethief 1 day ago
  
  Sounds like the researchers behind https://ai-2027.com/ haven't been too far off so far.
  
  [-]
  - cootsnuck 1 day ago
    
    We'll see. The first two things that they said would move from "emerging tech" to "currently exists" by April 2026 are:
    - "Someone you know has an AI boyfriend"
    - "Generalist agent AIs that can function as a personal secretary"
    I'd be curious how many people know someone that is sincerely in a relationship with an AI.
    And also I'd love to know anyone that has honestly replaced their human assistant / secretary with an AI agent. I have an assistant, they're much more valuable beyond rote input-output tasks... Also I encourage my assistant to use LLMs when they can be useful like for supplementing research tasks.
    Fundamentally though, I just don't think any AI agents I've seen can legitimately function as a personal secretary.
    Also they said by April 2026:
    > 22,000 Reliable Agent copies thinking at 13x human speed
    And when moving from "Dec 2025" to "Apr 2026" they switch "Unreliable Agent" to "Reliable Agent". So again, we'll see. I'm very doubtful given the whole OpenClaw mess. Nothing about that says "two months away from reliable".
    
    [-]
    - zozbot234 1 day ago
      
      > Someone you know has an AI boyfriend
      MyBoyfriendIsAI is a thing
      > Generalist agent AIs that can function as a personal secretary
      Isn't that what MoltBot/OpenClaw is all about?
      So far these look like successful predictions.
      
      [-]
      - ainch 1 day ago
        
        Moltbot is an attempt to do that. Would you hire it as a personal secretary and entrust all your personal data to it?
        
        [-]
        
        danpalmer 1 day ago
        
        Only people who haven't had a secretary would think it's a personal secretary.
        Like, it can't even answer the phone.
        
        [-]
        
        fragmede 1 day ago
        
        There are plenty of companies that sell an AI assistant that answers the phone as a service, they just aren't named OpenAI or Anthropic. They'll let callers book an appointment onto your calendar, even!
        
        [-]
        
        danpalmer 1 day ago
        
        No, there are companies that sell voice activated phone trees, but no one is getting results out of unstructured, arbitrary phone call answering with actions taken by an LLM.
        I'm sure there are research demos in big companies, I'm sure some AI bro has done this with the Twilio API, but no one is seriously doing this.
        All it takes is one "can you take this to the post office", the simplest, of requests, and you're in a dead end of at best refusal, but more likely role-play.
        
        [-]
        
        PranayKumarJain 1 day ago
        
        Agreed that “unstructured arbitrary phone calls + arbitrary actions” is where things go to die.
        What does work in production (at least for SMB/customer-support style calls) is making the problem less magical: 1) narrow domain + explicit capabilities (book/reschedule/cancel, take a message, basic FAQs) 2) strict tool whitelist + typed schemas + confirmations for side effects 3) robust out-of-scope detection + graceful handoff (“I can’t do that, but I can X/Y/Z”) 4) real logs + eval/test harnesses so regressions get caught
        Once you do that, you can get genuinely useful outcomes without the role-play traps you’re describing.
        We’ve been building this at eboo.ai (voice agents for businesses). If you’re curious, happy to share the guardrails/eval setup we’ve found most effective.
        
        fragmede 1 day ago
        
        https://www.instagram.com/p/DMfpj0hM7e0/
        is obviously a staged demo but it seems pretty serious for him. He's wearing a suit and everything!
        https://www.instagram.com/p/DK8fmYzpE1E/
        seems like research by some dude (no disrespect, he doesn't seems like he's at big company though).
        https://www.instagram.com/p/DH6EaACR5-f/
        could be astroturf, but seems maybe a little bit serious.
    - Davidzheng 1 day ago
      
      It's important to remember though (this is besides the point for what you're saying) that job displacement of things like secretaries from AI do not require it to be a near perfect replacement. There are many other factors (for example if it's much cheaper and can do part of the work it can dramatically shrink demand as people can shift to an imperfect replacement in AI)
    - Rudybega 1 day ago
      
      I think they immediately corrected their median timelines for takeoff to 2028 upon releasing the article (I believe there was a math mistake or something initially), so all those dates can probably be bumped back a few months. Regardless, the trend seems fairly on track.
    - speed_spread 1 day ago
      
      People have been in love with machines for a long time. It's just that the machines didn't talk back so we didn't grant them the "partner" status. Wait for car+LLM and you'll have a killer combo.
      
      [-]
      - fragmede 1 day ago
        
        KITT, is that you?
  - YawningAngel 1 day ago
    
    I don't think generative AI is even close to making model development 50% faster
  - 0x1ceb00da 20 hours ago
    
    Is gpt5.3 200x bigger than gpt4? Looks like openai used this fanfiction as its marketing strategy
  - JackYoustra 1 day ago
    
    > researchers
    that's certainly one way to refer to Scott Alexander
    
    [-]
    - apetresc 1 day ago
      
      Scott Alexander essentially provided editing and promotion for AI 2027 (and did a great job of it, I might add). Are you unaware of the actual researchers behind the forecasting/modelling work behind it, and you thought it was actually all done by a blogger? Or are you just being dismissive for fun?
  - beernet 1 day ago
    
    Only on HN will people still doubt what is happening right in front of their eyes. I understand that putting things into perspective is important, still, the type of downplaying we can see in the comments here is not only funny but also has a dangerous dimension to it. Ironically, these are the exact same people who will claim "we should have prepared better!" once the effects become more and more visible. Dear super engineers, while I feel sorry that your job and passion become a commodity right in front of you, please stay out the way.
- aurareturn 1 day ago
  
  More importantly, this is the early steps of a model self improving itself.
  Do we still think we'll have soft take off?
  
  [-]
  - mrandish 1 day ago
    
    > Do we still think we'll have soft take off?
    There's still no evidence we'll have any take off. At least in the "Foom!" sense of LLMs independently improving themselves iteratively to substantial new levels being reliably sustained over many generations.
    To be clear, I think LLMs are valuable and will continue to significantly improve. But self-sustaining runaway positive feedback loops delivering exponential improvements resulting in leaps of tangible, real-world utility is a substantially different hypothesis. All the impressive and rapid achievements in LLMs to date can still be true while major elements required for Foom-ish exponential take-off are still missing.
    
    [-]
    - rahulyc 1 day ago
      
      Yes, but also you'll never have any early evidence of the Foom until the Foom itself happens.
      
      [-]
      - janalsncm 1 day ago
        
        If only General Relativity had such an ironclad defense of being as unfalsifiable as Foom Hypothesis is. We could’ve avoided all of the quantum physics nonsense.
        
        [-]
        
        phdlalala 17 hours ago
        
        https://zenodo.org/records/18498514 GR is nonsense!
        
        Davidzheng 1 day ago
        
        it doesn't mean it's unfalsifiable - it's a prediction about the future so you can falsify it when there's a bound on when it is going to happen. it just means there's little to no warning. I think it's a significant risk to AI progress that it can reach some sort of improvement speed > speed of warning or any threats from AI improvement
    - Davidzheng 1 day ago
      
      To me FOOM means like the hardest of hard takeoffs and improving at a sustained rate which is higher than without humans is not a takeoff at all.
  - quinncom 1 day ago
    
    Exponential growth may look like a very slow increase at first, but it's still exponential growth.
    
    [-]
    - janalsncm 1 day ago
      
      Sigmoids may look like exponential growth at first, until they saturate. Early growth alone cannot distinguish between them.
      
      [-]
      - naasking 1 day ago
        
        Intelligence must be sigmoid of course, but it may not saturate until well past human intelligence.
    - Garlef 1 day ago
      
      On the other hand: Perception of change might not be linear but logarithmic.
      (= it might take an order of magnitude of improvements to be perceived as a substantial upgrade)
      So the perceived rate of change might be linear.
      It's definitely true for some things such as wealth:
      - $2000 is a lot of you have $1000.
      - It's a substantial improvement of you have $10000.
      - It's not a lot you have $1m
      - It does not matter if you have $1b
      
      [-]
      - varjag 1 day ago
        
        $2000 is not substantial over $1b on the linear scale
    - gf000 1 day ago
      
      If it's exponential growth. It may just as well be some slow growth and continue to be so.
  - thrance 1 day ago
    
    I think the limiting factor is capital, not code. And I doubt GPTX is anymore competent at raising funds than the other, fleshy, snake oilers...
  - aaaalone 1 day ago
    
    I'm only saying no to keep optimistic tbh
    It feels crazy to just say we might see a fundamental shift in 5 years.
    But the current addition to compute and research etc. def goes in this direction I think.
  - reducesuffering 1 day ago
    
    This has already been going on for years. It's just that they were using GPT 4.5 to work on GPT 5. All this announcement mean is that they're confident enough in early GPT 5.3 model output to further refine GPT 5.3 based on initial 5.3. But yes, takeoff will still happen because of this recursive self improvement works, it's just that we're already past the inception point.
    
    [-]
    - mirsadm 1 day ago
      
      I can't tell if this is a serious conversation anymore.
      
      [-]
      - Davidzheng 1 day ago
        
        I think it's important in AI discussions to reason correctly from fundamentals and not disregard possibilities simply because they seem like fiction/absurd. If the reasoning is sound, it could well happen.
      - kristofferR 1 day ago
        
        I totally got what you felt there. We are truly living in a sci-fi world
      - reducesuffering 1 day ago
        
        “Best start believing in science fiction stories. You're in one.”
        https://x.com/TheZvi/status/2017310187309113781
    - manmal 1 day ago
      
      I guess humans were involved in all that, so how is that anything but tool use?
  - 8note 1 day ago
    
    making the specifications is still hard, and checking how well results match against specifications is still hard.
    i dont think the model will figure that out on its own, because the human in the loop is the verification method for saying if its doing better or not, and more importantly, defining better
minimaxir 1 day ago

I remember when AI labs coordinated so they didn't push major announcements on the same day to avoid cannibalizing each other. Now we have AI labs pushing major announcements within 30 minutes.

[-]
- observationist 1 day ago
  
  The labs have fully embraced the cutthroat competition, the arms race has fully shed the civilized facade of beneficient mutual cooperation.
  Dirty tricks and underhanded tactics will happen - I think Demis isn't savvy in this domain, but might end up stomping out the competition on pure performance.
  Elon, Sam, and Dario know how to fight ugly and do the nasty political boardroom crap. 26 is gonna be a very dramatic year, lots of cinematic potential for the eventual AI biopics.
  
  [-]
  - manquer 1 day ago
    
    >civilized facade of mutual cooperation
    >Dirty tricks and underhanded tactics
    As long the tactics are legal ( i.e. not corporate espionage, bribes etc), the no holds barred full free market competition is the best thing for the market and the consumers.
    
    [-]
    - ajam1507 1 day ago
      
      > As long the tactics are legal ( i.e. not corporate espionage, bribes etc), the no holds barred full free market competition is the best thing for the market and the consumers.
      The implicit assumption here is that we have constructed our laws so skillfully that the only path to win a free market competition is by producing a better product, or that all efforts will be spent doing so. This is never the case. It should be self-evident from this that there is a more productive way for companies to compete and our laws are not sufficient to create the conditions.
    - thethimble 1 day ago
      
      The consumers are getting huge wins.
      Model costs continue to collapse while capability improves.
      Competition is fantastic.
      
      [-]
      - mrandish 1 day ago
        
        > The consumers are getting huge wins.
        However, the investors currently subsidizing those wins to below cost may be getting huge losses.
        
        [-]
        
        credit_guy 1 day ago
        
        Yes, but that's the nature of the game, and they know it.
      - doom2 1 day ago
        
        > Model costs continue to collapse
        And yet RAM prices are still sky high. Game consoles are getting more expensive, not cheaper, as a result. When will competition benefit those consumers? Or consumers of desktop RAM?
        
        [-]
        
        nasreddin 1 day ago
        
        The free market has simply decided these consumers are not as relevant as the others.
        
        [-]
        
        niek_pas 1 day ago
        
        Maybe the free market is wrong.
        
        [-]
        
        anomaly_ 1 day ago
        
        It can’t be. Those uses are suboptimal, hence the users aren’t willing to pay the new prices.
        
        acjohnson55 1 day ago
        
        Not really. Investors with hundreds of billions of dollars have decided it. The process by which capital has been allocated the way it has isn't some mathematically natural or optimal thing. Our market is far from free.
        
        [-]
        
        ETH_start 23 hours ago
        
        Saying "investors with hundreds of billions decided it" makes it sound like a few people just chose the outcome, when in reality prices and capital move because millions of consumers, companies, workers, and smaller investors keep making choices every day. Big investors only make money if their decisions match what people actually want; they can't just command success. If they guess wrong, others profit by allocating money better, so having influence isn't the same as having control.
        The system isn't mathematically perfect, but that doesn't make it arbitrary. It works through an evolutionary process: bad bets lose money, better ones gain more resources.
        Any claim that the outcome is suboptimal only really means something if the claimant can point to a specific alternative that would reliably do better under the same conditions. Otherwise critics are mostly just expressing personal frustration with the outcome.
    - dwaltrip 1 day ago
      
      Sure, it can be beneficial. But don't forget that externalities are a thing.
    - wiz21c 1 day ago
      
      in the short term maybe, in the long term it depends how many winners you have. If only two, the market will be a duopoly. Customers will get better AI but will have zero power over the way the AI is produced or consumed (i.e. cO2 emission, ethics, etc will be burnt)
      
      [-]
      - manquer 1 day ago
        
        > how many winners ... duopoly
        There aren't any insurmountable large moats, plenty of open weight models that perform close enough.
        > CO₂ emissions
        Different industry that could also benefit from more competition ? Clean(er) energy is not even more expensive than dirty sources on pure $/kWh, we still do need dirty sources for workloads like base demand, peakers etc that the cheap clean sources cannot service today.
    - KoolKat23 1 day ago
      
      Yes, but not cutthroat competition that implies unsustainable, detrimental competition that kills off the industry.
  - webdevver 19 hours ago
    
    demis is more than qualified. he was politically savvy enough to stay in control of deepmind when it was acquired by google, thats up there with the altman fiasco a year or two ago.
- zozbot234 1 day ago
  
  They're also coordinating around Chinese New Year to compete with new releases of the major open/local models.
  
  [-]
  - DonHopkins 1 day ago
    
    Year of the Pelican!
    
    [-]
    - hoeoek 1 day ago
      
      simonw?
    - iujasdkjfasf 1 day ago
      
      [dead]
- tedsanders 1 day ago
  
  This goes way back. When OpenAI launched GPT-4 in 2023, both Anthropic and Google lined up counter launches (Claude and Magic Wand) right before OpenAI's standard 10am launch time.
- crorella 1 day ago
  
  The thrill of competition
- sathish316 1 day ago
  
  AI commoditized to the point where Nash Equilibrium and 35 minutes between announcements matter :)
- manquer 1 day ago
  
  Wouldn't that be illegal ? i.e. cartel to collude like that ?
  
  [-]
  - avaer 1 day ago
    
    You were downvoted but I don't understand why. This is the purpose/spirit of antitrust law [1]
    [1] https://en.wikipedia.org/wiki/United_States_antitrust_law
    
    [-]
    - manquer 1 day ago
      
      I have long since given up trying to understand voting patterns in HN :)
      ---
      Sadly it was the core of anti-trust law, since 1970s things have changed.
      The predominant view today (i.e. Chicago School view) in both judiciary and executive are influenced by Justice Bork's ideas that consumer benefit being the deciding factor over company's actions.
      Consumer benefits becomes opinions of projections by either side of a case about the future, whereas company actions like collusion, pricing fixing or M&A are hard facts with strong evidence. Today it is all vibes on how the courts (or executive) feel .
      So now we have Government sanctioned cartels like in Aviation Alliances [1] that is basically based on convoluted catch-22-esque reasoning because it favors strategic goals even though it would be a violation of the letter/spirit of the law.
      [1] https://www.transportation.gov/office-policy/aviation-policy...
- IhateAI 1 day ago
  
  A sign of the inevitible implosion !
- cedws 1 day ago
  
  I wish they’d just stop pretending to care about safety, other than a few researchers at the top they care about safety only as long as they aren’t losing ground to the competition. Game theory guarantees the AI labs will do what it takes to ensure survival. Only regulation can enforce the limits, self policing won’t work when money is involved.
  
  [-]
  - thethimble 1 day ago
    
    As long as China continues to blitz forward, regulation is a direct path to losing.
    
    [-]
    - cedws 1 day ago
      
      Define "losing."
      Europe is prematurely regarded as having lost the AI race. And yet a large portion of Europe live higher quality lives compared to their American counterparts, live longer, and don't have to worry about an elected orange unleashing brutality on them.
      
      [-]
      - thethimble 1 day ago
        
        If the world is built on AI infrastructure (models, compute, etc.) that is controlled by the CCP then the west has effectively lost.
        This may lead to better life outcomes, but if the west doesn't control the whole stack then they have lost their sovereignty.
        This is already playing out today as Europe is dependent on the US for critical tech infrastructure (cloud, mail, messaging, social media, AI, etc). There's no home grown European alternatives because Europe has failed to create an economic environment to assure its technical sovereignty.
        
        [-]
      - fakedang 1 day ago
        
        Europe has already lost the tech race - their cloud systems that their entire welfare states rely upon are all hosted on servers hosted by American private companies, which can turn them off with a flick of a switch if and when needed.
        When the welfare state, enabled by technology, falls apart, it won't take long for European society to fall apart. Except France maybe.
        
        [-]
        
        clows 1 day ago
        
        welfare state enabled by cloud services/technology?
        I'm not sure if you know less about europe or tech.
        > Except France maybe.
        sure
    - pixl97 1 day ago
      
      You mean all paths are direct paths to losing.
  - vovavili 1 day ago
    
    The last thing I would want is for excessively neurotic bureaucrats to interfere with all the mind-blowing progress we've had in the last couple of years with LLM technology.
    
    [-]
    - iujasdkjfasf 1 day ago
      
      [dead]
SunshineTheCat 1 day ago

I've always been fascinated to see significantly more people talking about using Claude than I see people talking about Codex.
I know that's anecdotal, but it just seems Claude is often the default.
I'm sure there are key differences in how they handle coding tasks and maybe Claude is even a little better in some areas.
However, the note I see the most from Claude users is running out of usage.
Coding differences aside, this would be the biggest factor for me using one over the other. After several months on Codex's $20/mo. plan (and some pretty significant usage days), I have only come close to my usage limit once (never fully exceeded it).
That (at least to me) seems to be a much bigger deal than coding nuances.

[-]
- timpera 1 day ago
  
  In my experience, OpenAI gives you unreasonable amounts of compute for €20/month. I am subscribed to both and Claude's limits are so tiny compared to ChatGPT's that it often feels like a rip-off.
  Claude also doesn't let you use a worse model after you reach your usage limits, which is a bit hard to swallow when you're paying for the service.
  
  [-]
  - replwoacause 1 day ago
    
    Same experience here. I started our devoutly using Claude but ran into some many limits that I switched back to ChatGPT and it's been night and day. I haven't even really been able to play with the Opus model on my Pro plan because it devours usage and then blocks me for X hours until it resets, costing me a work day. OpenAI has never done that to me. In fact, Codex just churned away for 2 hours on a task and I'm still using it without hitting a limit. I used to love using Claude but the limits are too prohibitive.
  - appsoftware 23 hours ago
    
    Claude when used via Github Co-Pilot is much better for useage allowance. I used Opus 4.5 for a months worth of development and only just hit 90 pct of the pro $40 per month allowance.
  - lm28469 1 day ago
    
    If their pay as you go api token prices reflect their internal costs then it makes sense, but it could also be that claude makes money while gpt sells at loss to stay on top. Claude is way more expensive overall, and way more limited with flat rate subscriptions
    opus: 5/25 gpt: 1.75/14
    
    [-]
    - int_19h 1 day ago
      
      Given how much you can use Codex on their $200 plan, I'm virtually certain that it's subsidized.
      As to why, I think in part it is because people who are willing to pay that much per month are much more likely to be using it heavily on "serious" tasks, which is, of course, a goldmine for training data - even if you can't use the inputs directly for training, just looking at various real world issues and how agents handle them (or not) is valuable, especially when all the low-hanging fruit have already been picked.
      I wouldn't even be surprised if the $20 users are actually subsidizing the $200 users.
- mrandish 1 day ago
  
  > the note I see the most from Claude users is running out of usage.
  I suspect that tells us less about model capability/efficiency and more about each company's current need to paint a specific picture for investors re: revenue, operating costs, capital requirements, cash on hand, growth rate, retention, margins etc. And those needs can change at any moment.
  Use whatever works best for your particular needs today, but expect the relative performance and value between leaders to shift frequently.
- vercaemert 1 day ago
  
  This will be a Harvard Business case study on market share.
  Claude Code was instrumental for Anthropic.
  What's interesting is that people haven't heard of it/them outside of software development circles. I work on a volunteer project, a webapp basically, and even the other developers don't know the difference between Cursor and Claude Code.
- superfrank 1 day ago
  
  I only switched to using the terminal based agents in the last week. Prior to this I was pretty much only using it through Cursor and GH Copilot. The Anthropic models when used through GH Copilot were far superior to the codex ones and I didn't really get the hype of Codex. Using them through the CLI though, Codex is much better, IMO.
  My guess is that it's potentially that and just momentum from developers who started using CC when it was far superior to Codex has allowed it to become so much more popular. Potentially, it's might be that, as it's more autonomous, it's better for true vibe-coding and it's more popular with the Twitter/LinkedIn wantrepreneur crew which meant it gets a lot of publicity which increases adoption quicker.
  
  [-]
  - jorl17 1 day ago
    
    Out of curiosity, what do you feel are the key differences between cursor + models versus something like Claude Code/Codex?
    Are you feeling the benefits of the switch? What prompted you to change?
    I've been running cursor with my own workflows (where planning is definitely a key step) and it's been great. However, the feeling of missing out, coupled with the fact I am a paying ChatGPT customer, got me to try codex. It hasn't really clicked in what way this is better, as so far it really hasn't been.
    I have this feeling that supposedly you can give these tools a bit more of a hands-off approach so maybe I just haven't really done that yet. Haven't fiddled with worktrees or anything else yet either.
    
    [-]
    - oofbey 1 day ago
      
      AFAICT it really is just a preference for terminal vs IDE. The terminal folks often believe terminal is intrinsically better and say things like “you’re still using an IDE.” Yegge makes this quite explicit in his gastown manifesto.
      I been using Unix command lines since before most people here were born. And I actively prefer cursor to the text only coding agents. I like being able to browse the code next to the chat and easily switch between sessions, see markdown rendered properly, etc.
      On fundamentals I think the differences are vanishing. They have converged on the same skills format standards. Cursor uses RAG for file lookups but Claude reads the whole file - token efficiency vs completeness. They both seem to periodically innovate some orchestration function which the other copies a few weeks later.
      I think it really is just a stylistic preference. But the Claude people seem convinced Claude is better. Having spent a bunch of time analyzing both I just don’t see it.
- AstroBen 1 day ago
  
  I'm with you. Codex's plans seems to be a much more generous offering than Claude
  I just.. can't tell a different in quality between them.. so I go for the cheapest
- fHr 1 day ago
  
  Codex is great and I hit the usage once doing multiagent full 5 hour absolute degen session for the nornal workflow alongside never hit it and now x2 useage even and now with the planmode switch back and forth absolute great.
4corners4sides 2 hours ago

A 77% score on terminal-bench 2 is really impressive. I remember reading the article about the pi coding agent (https://mariozechner.at/posts/2025-11-30-pi-coding-agent/) getting into the top ten percent of agents on that benchmark. It got about 50%. While it may still be in the top ten, that category just turned into one champion and a long of inferior offerings.
I was shocked to see that in the prompt for one of the landing pages the text “lavender to blue gradient” was included as if that’s something that anybody actually wants. It’s like going to the barber and saying “just make me look awful”.
This was my first time actually seeing what the GDPval benchmark looked like. Essentially they benchmark for all the artifacts that HR/finance might make or work on (onboarding documents, accounting spreadsheets, powerpoint presentations .etc). I think it’s good that models are trained to generate things like this well since people are going to use AI to do such anyway. If the middlemen passing AI ouputs around are going to be lazy I’m grateful that at least OpenAI researchers are cooking something behind the scenes.
bgirard 1 day ago

> Using the develop web game skill and preselected, generic follow-up prompts like "fix the bug" or "improve the game", GPT‑5.3-Codex iterated on the games autonomously over millions of tokens.
I wish they would share the full conversation, token counts and more. I'd like to have a better sense of how they normalize these comparisons across version. Is this a 3-prompt 10m token game? a 30-prompt 100m token game? Are both models using similar prompts/token counts?
I vibe coded a small factorio web clone [1] that got pretty far using the models from last summer. I'd love to compare against this.
[1] https://factory-gpt.vercel.app/

[-]
- veb 1 day ago
  
  I just wanted to say that's a pretty cool demo! I hadn't realised people were using it for things like this.
  
  [-]
  - bgirard 1 day ago
    
    Thank you. There's a demo save to get the full feel of it quickly. There's also a 2D-ASCII and 3D render you can hotswap between. The 3D models are generated with Meshy. The entire game is 'AI slop'. I intentionally did no code reviews to see where that would get me. Some prompts were very specific but other prompts were just 'add a research of your choice'.
    This was built using old versions of Codex, Gemini and Claude. I'll probably work on it more soon to try the latest models.
    
    [-]
    - gspetr 1 day ago
      
      Any estiimates on how much it cost you? In terms of total real world time, money, and time spent by the agents.
      
      [-]
      - bgirard 1 day ago
        
        About ~$300: $200 for Claude max subscription $20 for Vercel $20 for Codex $20 for Meshy
        I think these days the $200 Max subscription wouldn't be needed. I bet with these latest models you can make due with mixing two $20/mo subscriptions.
        Real time was 2 weeks of watching the agents while watching TV and playing games, waiting for limit resets, etc... Very little decided focused time.
tosh 1 day ago
Terminal Bench 2.0
```
  | Name                | Score |
  |---------------------|-------|
  | OpenAI Codex 5.3    | 77.3  |
  | Anthropic Opus 4.6  | 65.4  |
```
[-]
- greenfish6 1 day ago
  
  yea but i feel like we are over the hill on benchmaxxing, many times a model has beaten anthropic on a specific bench, but the 'feel' is that it is still not as good at coding
  
  [-]
  - falloutx 1 day ago
    
    When Anthropic beats Benchmarks its somehow earned, when OpenAi games it, its somehow about not feeling good at coding.
    
    [-]
    - apetresc 1 day ago
      
      I mean… yeah? It sounds biased or whatever, but if you actually experience all the frontier models for yourself, the conclusion that Opus just has something the others don’t is inescapable.
      
      [-]
      - manmal 1 day ago
        
        Opus is really good at bash, and it’s damn fast. Codex is catching up on that front, but it’s still nowhere near. However, Codex is better at coding - full stop.
  - AstroBen 1 day ago
    
    'feel' is no more accurate
    not saying there's a better way but both suck
    
    [-]
    - thethimble 1 day ago
      
      Speak for yourself. I've been insanely productive with Codex 5.2.
      With the right scaffolding these models are able to perform serious work at high quality levels.
      
      [-]
      - helloplanets 1 day ago
        
        He wasn't saying that both of the models suck, but that the heuristics for measuring model capability suck
      - AstroBen 1 day ago
        
        ..huh?
    - crorella 1 day ago
      
      The variety of tasks they can do and will be asked to do is too wide and dissimilar, it will be very hard to have a transversal measurement, at most we will have area specific consensus that model X or Y is better, it is like saying one person is the best coder at everything, that does not exist.
      
      [-]
      - pixl97 1 day ago
        
        Yea, we're going to need benchmarks that incorporate series of steps of development for a particular language and how good each model is at it.
        Like can the model take your plan and ask the right questions where there appear to be holes.
        How wide of architecture and system design around your language does it understand.
        How does it choose to use algorithms available in the language or common libraries.
        How often does it hallucinate features/libraries that aren't there.
        How does it perform as context get larger.
        And that's for one particular language.
    - tavavex 1 day ago
      
      The 'feel' of a single person is pretty meaningless, but when many users form a consensus over time after a model is released, it feels a lot more informative than a simple benchmark because it can shift over time as people individually discover the strong and weak points of what they're using and get better at it.
    - forrestthewoods 1 day ago
      
      At the end of the day “feel” is what people rely on to pick which tool they use.
      I’d feel unscientific and broken? Sure maybe why not.
      But at the end of the day I’m going to choose what I see with my own two eyes over a number in a table.
      Benchmarks are a sometimes useful to. But we are in prime Goodharts Law Territory.
      
      [-]
      - AstroBen 1 day ago
        
        yeah, to be honest it probably doesn't matter too much. I think the major models are very close in capabilities
        
        [-]
        
        forrestthewoods 1 day ago
        
        I don’t think this is even remotely true in practice.
        I honestly I have no idea what benchmarks are benchmarking. I don’t write JavaScript or do anything remotely webdev related.
        The idea that all models have very close performance across all domains is a moderately insane take.
        At any given moment the best model for my actual projects and my actual work varies.
        Quite honestly Opus 4.5 is proof that benchmarks are dumb. When Opus 4.5 released no one was particularly excited. It was better with some slightly large numbers but whatever. It took about a month before everyone realized “holy shit this is a step function improvement in usefulness”. Benchmarks being +15% better on SWE bench didn’t mean a damn thing.
  - karmasimida 1 day ago
    
    Your feeling is not my feeling, codex is unambiguously smarter model for me
- xyst 1 day ago
  
  Benchmarks are useless compared to real world performance.
  Real world performance for these models is a disappoint.
nananana9 1 day ago

I've been listening to the insane 100x productivity gains you all are getting with AI and "this new crazy model is a real game changer" for a few years now, I think it's about time I asked:
Can you guys point me ton a single useful, majority LLM-written, preferably reliable, program that solves a non-trivial problem that hasn't been solved before a bunch of times in publicly available code?

[-]
- pkoiralap 1 day ago
  
  In the 1930s, when electronic calculators were first introduced, there was a widespread belief that accounting as a career was finished. Instead, the opposite became true. Accounting as a profession grew, becoming far more analytical/strategic than it had been previously.
  You are correct that these models primarily address problems that have already been solved. However, that has always been the case for the majority of technical challenges. Before LLMs, we would often spend days searching Stack Overflow to find and adapt the right solution.
  Another way to look at this is through the lens of problem decomposition as well. If a complex problem is a collection of sub-problems, receiving immediate solutions for those components accelerates the path to the final result.
  For example, I was recently struggling with a UI feature where I wanted cards to follow a fan-like arc. I couldn't quite get the implementation right until I gave it to Gemini. It didn't solve the entire problem for me, but it suggested an approach involving polar coordinates and sine/cosine values. I was able to take that foundational logic turn it into a feature I wanted.
  Was it a 100x productivity gain? No. But it was easily a 2x gain, because it replaced hours of searching and waiting for a mental breakthrough with immediate direction.
  There was also a relevant thread on Hacker News recently regarding "vibe coding":
  https://news.ycombinator.com/item?id=45205232
  The developer created a unique game using scroll behavior as the primary input. While the technical aspects of scroll events are certainly "solved" problems, the creative application was novel.
  
  [-]
  - suddenlybananas 1 day ago
    
    The story you're describing doesn't seem much better than one could get from googling around and going on stackoverflow
    
    [-]
    - strokirk 1 day ago
      
      It doesn’t have to be, really. Even if it could replace 30% of documentation and SO scrounging, that’s pretty valuable. Especially since you can offload that and go take a coffee.
    - solarkraft 5 hours ago
      
      It’s better in the sense that it’s much faster. Bikes and cars don’t theoretically get you to different places than walking, but open up whole categories of what’s practically reachable.
    - pkoiralap 1 day ago
      
      I think the 'better than googling' part is less about the final code and more about the friction.
      For example, consider this game: The game creates a target that's randomly generated on the screen and have a player at the middle of the screen that needs to hit the target. When a key is pressed, the player swings a rope attached to a metal ball in circles above it's head, at a certain rotational velocity. Upon key release, the player has to let go of the rope and the ball travels tangentially from the point of release. Each time you hit the target you score.
      Now, I’m trying to calculate the tangential velocity of a projectile from a circular path, I could find the trig formulas on Stack Overflow. But with an LLM, I can describe the 'vibe' of the game mechanic and get the math scaffolded in seconds.
      It's that shift from searching for syntax to architecting the logic that feels like the real win.
      
      [-]
      - comex 1 day ago
        
        The downside is that you miss the chance to brush up on your math skills, skills that could help you understand and express more complicated requirements.
        ...This may still be worth it. In any case it will stop being a problem once the human is completely out of the loop.
        edit: but personally I hate missing out on the chance to learn something.
        
        [-]
        
        pkoiralap 1 day ago
        
        That would indeed be the case if one has never learned the stuff. And I am all in for not using AI/LLM for homework/assignments. I don't know about others, but when I was in school, they didn't let us use calculators in exams.
        Today, I know very well how to multiply 98123948 and 109823593 by hand. That doesn't mean I will do it by hand if I have a calculator handy.
        Also, ancient scholars, most notably Socrates via Plato, opposed writing because they believed it would weaken human memory, create false wisdom, and stifle interactive dialogue. But hey, turns out you learn better if you write and practice.
        
        [-]
        
        fragmede 1 day ago
        
        In later classes in school, the calculator itself didn't help. If you didn't know the material well enough, you didn't know what to put into the calculator.
- xandrius 1 day ago
  
  Why even come to this site if you're so anti-innovation?
  Today with LLMs you can literally spend 5 minutes defining what you want to get, press send, go grab a coffee and come back to a working POC of something, in literally any programming language.
  This is literally stuff of wonders and magic that redefines how we interface with computers and code. And the only thing you can think of is to ask if it can do something completely novel (that it's so hard to even quantity for humans that we don't have software patents mainly for that reason).
  And the same model can also answer you if you ask it about maths, making you an itinerary or a recipe for lasagnas. C'mon now.
  
  [-]
  - cowl 1 day ago
    
    Agree but you are talking about a POC, and he is talking about reliable, working software. this phase of LLM are perfect for POCs and there you can have 10x speedup, no question. But going from a POC to a working reliable software is where most of our time is spent anyway even without LLMS.
    With LLMs this phase becomes worse. we speedup 10x the poc time, we slow down almost as much in the next phases, because now you have a poc of 10k lines that you are not familiar with at all, that have to pay way more attention at code review, that have to bolt on security as an afterthought (a major slowdown now, so much so that there are dedicated companies whose business model has become fixing Security problems caused by LLM POCs). Next phase, POCs are almost always 99% happy path. Bolt on edge case as another after thought and because you did not write any of those 10k lines how do you even know what edge cases might be neccesary to cover? maybe you guessed it rigth, spend even more time studing the unfamiliar code.
    We use LLM extensivly now in our day to day, development has become somewhat more enjoyable but there is, at least as of now, no real increase in final delivry times, we have just redestributed where effort and time goes.
    
    [-]
    - xandrius 1 day ago
      
      At our company we use AI extensively to see if we missed edge cases and it does a pretty good job in pointing us towards places which could be handled better.
      I know we all think we are always so deep into absolutely novel territory, which only our beautiful mind can solve. But for the vast majority of work done in the world, that work is transformative. You take X + Y and you get Z. Even with brand new api, you can just slap in the documentation and navigate it in order of magnitude faster than without.
      I started using it for embedded systems doing something which I could literally find nothing about in rust but plenty in arduino/C code. The LLM allowed me to make that process so much faster.
    - manmal 1 day ago
      
      > no real increase in final delivry times
      That’s not true though. The ability to de-risk concepts within a day instead of weeks will speed up the timeline tremendously.
  - legulere 1 day ago
    
    I don't think that the user you are responding to is anti-innovation, but rather points out that the usefulness of AI is oversold.
    I'm using Copilot for Visual Studio at work. It is useful for me to speed some typing up using the auto-complete. On the other hand in agentic mode it fails to follow simple basic orders, and needs hand-holding to run. This might not be the most bleeding-edge setup, but the discrepancy between how it's sold and how much it actually helps for me is very real.
    
    [-]
    - ifwinterco 1 day ago
      
      I think copilot is widely considered to be fairly rubbish, your description of agentic coding was also my experience prior to ~Q3 2025, but things have shifted meaningfully since then
      
      [-]
      - pcloadlett3r 1 day ago
        
        Copilot has access to the latest models like Opus 4.6 in agentic mode as well. It's got certain quirks and I prefer a TUI myself but it isn't radically different.
        
        [-]
        
        fragmede 1 day ago
        
        Even at Microsoft they're using Claude Code over Copilot, so I think it's different enough.
      - Rapzid 1 day ago
        
        You are so behind the curve if you think copilot is mostly rubbish. That's a 4+ month old take.
        
        [-]
        
        ifwinterco 6 hours ago
        
        I just don't use any Microsoft software anymore, thankfully
  - svantana 1 day ago
    
    There are different kinds of innovation.
    I want AI that cures cancer and solves climate change. Instead we got AI that lets you plagiarize GPL code, does your homework for you, and roleplay your antisocial horny waifu fantasies.
    
    [-]
    - baq 1 day ago
      
      Hard problems take more time than easy problems
      
      [-]
      - svantana 1 day ago
        
        Of course, but at least DeepMind is taking a crack at the important problems
- rohit89 1 day ago
  
  > that hasn't been solved before a bunch of times in publicly available code?
  And this matters because? Most devs are not working on novel never before seen problems.
  
  [-]
  - kevstev 1 day ago
    
    Heh, I agree. There is a vast ocean of dev work that is just "upgrade criticalLib to v2.0" or adding support for a new field from the FE through to the BE.
    I can name a few times where I worked on something that you could consider groundbreaking (for some values of groundbreaking), and even that was usually more the combination of small pieces of work or existing ideas.
    As maybe a more poignant example- I used to do a lot of on-campus recruiting when I worked in HFT, and I think I disappointed a lot of people when I told them my day to day was pretty mundane and consisted of banging out Jiras, usually to support new exchanges, and/or securities we hadn't traded previously. 3% excitement, 97% unit tests and covering corner cases.
- turblety 1 day ago
  
  I'm not sure if you'd call it a productivity gain, but I have to host our infrastructure on a system that runs processes entirely in Linux userland.
  To bridge the containers in userland only, without root, I had to build: https://github.com/puzed/wrapguard
  I'm sure it's not perfect, and I'm sure there are lots of performance/productivity gains that can be made, but it's allowed us to connect our CDN based containers (which don't have root) across multiple regions, talking to each other on the same Wireguard network.
  No product existed that I could find to do this (at least none I could find), and I could never build this (within the timeframe) without the help of AI.
- epolanski 1 day ago
  
  I know for a fact I deliver more and at higher quality and while being less tired. Mental energy is also a huge factor, because after digging in code for half a day i'd be exhausted.
  People should stop focusing on vibecoding and realize how many things LLMs can do such as investigating messy codebases that took me ages of writing paper notes to connect the dots, finding information about dependencies just by giving them access to replacing painful googling and GitHub issues or outdated documentation digging, etc.
  Hell I can jump in projects I know nothing about, copy paste a Jira ticket, investigate, have it write notes, ask questions and in two hours I'm ready to implement with very clear ideas about what's going on. That was multi day work till few years ago.
  I can also have it investigate the task at hand and automatically find the many unknowns unknowns that as usual work tasks have, which means cutting deliveries and higher quality software. Getting feedback early is important.
  LLMs are super useful even if you don't make them author a single line of code.
  And yes, they are increasingly good at writing boilerplate if you have a nice and well documented codebase thus sparing you time. And in my career I've written tons of mostly boilerplate code, that was another api, another form, another table.
  And no, this is not vibe coding. I review every single line, I use all of its failures to write better architectural and coding practices docs which further improves the output at each iteration.
  Honestly I just don't get how people can miss the huge productivity bonus you get, even if you don't have it edit a singl line of code.
- revahage 1 day ago
  
  Well, it took opus 4.5 five messages to solve a trivial git problem for me. It hallucinated nonexistent flags three times. Hallucinating nonexistent flags is certainly a novel solution to my git ineptness.
  Not to be outdone, chatgpt 5.2 thinking high only needed about 8 iterations to get a mostly-working ffmpeg conversion script for bash. It took another 5 messages to translate it to run in windows, on powershell (models escaping newlines on windows properly will be pretty nuch AGI, as far as I’m concerned).
  
  [-]
  - xvector 1 day ago
    
    You've got to be doing something wrong IMO. Mind sharing your system prompt and prompt/response pairs?
- llmslave 1 day ago
  
  baffled that people are still suspicious of ai coding models
  
  [-]
  - iwanttoread 16 hours ago
    
    [dead]
- AstroBen 1 day ago
  
  the 100x gains, even 10x, are obviously ridiculous but that doesn't mean AI is useless
- Def_Os 1 day ago
  
  Yeah, I would LOVE to see attempts at significant video games that are then open-sourced for communities to work on. E.g. OpenGTA or OpenFIFA/OpenNHL.
- beernet 1 day ago
  
  Can you point me to a human written program an LLM cannot write? And no, just answering with a massively large codebase does not count because this issue is temporary.
  Some people just hate progress.
  
  [-]
  - HAL3000 1 day ago
    
    > Can you point me to a human written program an LLM cannot write?
    Sure:
    "The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.
    As one particularly challenging example, Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase (This is only the case for x86. For ARM or RISC-V, Claude’s compiler can compile completely by itself.)"[1]
    1. https://www.anthropic.com/engineering/building-c-compiler
  - svantana 1 day ago
    
    Pretty much any software that people pay for? If LLMs could clone an app, why would anyone still pay good money for the original?
  - falloutx 1 day ago
    
    Even a normal website like landonorris.com. Try copying all those effects with AI.
    Another example: Red Dead Redemption 2
    Another one: Roller coaster tycoon
    Another one: ShaderToy
    
    [-]
    - avaer 1 day ago
      
      I wish I could agree with you, but as a game dev, shader author, and occasional asm hacker, I still think AIs have demonstrated being perfectly capable of copying "those effects". It's been trained on them, of course.
      You're not gonna one-shot RD2, but neither will a human. You can one-shot particles and shader passes though.
      
      [-]
      - falloutx 1 day ago
        
        I didnt say one shot it, coding agents have been out for more than couple years and yet we cant point to single Good piece of software built by it.
        
        [-]
        
        Davidzheng 1 day ago
        
        "coding agents have been out for more than couple years"?????
        
        [-]
        
        gavinflud 1 day ago
        
        Depends on what we categorize as a coding agent. Devin was released two years ago. Cursor was about the same, and it released agent mode around 1.5 years ago. Aider has been around even longer than that I think.
        
        xvector 1 day ago
        
        "Good" is obviously subjective but this mentality is so interesting because at my big tech company most of our software today is written by agents.
        From my perspective, comments like these read as people having their head stuck in the sand (no offense, I might be missing something.)
        
        [-]
        
        falloutx 1 day ago
        
        Show me whats been built by agents
    - satvikpendem 1 day ago
      
      Why do you believe an LLM can't write these, just because they're 3D? If the assets are given (just as with a human game programmer, who has artists provide them the assets), then an LLM can write the code just the same.
      
      [-]
      - falloutx 1 day ago
        
        What? People can easily get assets, thats not a even a problem in 2026. Roller coaster tycoon's assets were done by the programmer himself. If its so easy why haven't we seen actually complex pieces of software done in couple of weeks by LLM users?
        Also try building any complex effects by prompting LLMs, you wont get any far, this is why all of the LLM coded websites look stupidly bland.
        
        [-]
        
        satvikpendem 1 day ago
        
        Not sure what you're confused about, I never said assets were hard to get, I just said that the LLM needs to be provided a folder of the assets for it to make use of them, it's not going to create them from scratch (at least not without great difficulty, because LLMs are capable of using and coding Three.js for example). I don't know the answer to your first question because I don't hang around in the 3D or game dev fields, I'm sure there are examples of vibe coded games however.
        As to your second question, it is about prompting them correctly, for example [0]. Now I don't know about you but some of those sites especially after using the frontend skill look pretty good to me. If those look bland to you then I'm not really sure what you're expecting, keeping in mind that the example you showed with the graphics are not regular sites but more design oriented, and even still nothing stops LLMs from producing such sites.
        [0] https://youtu.be/f2FnYRP5kC4
        
        [-]
        
        falloutx 1 day ago
        
        you have shown me 0 examples, I showed actual examples to the given question. Your answers have just been "AI can also do this" but gave no actual proof.
        
        [-]
        
        satvikpendem 1 day ago
        
        The examples are in the video I linked, as I said, if you don't bother to watch it then I'm not sure what to tell you. As I said for games I don't know and won't presume to search up some random vibe coded game if I don't have personal experience with how LLMs handle games, but for web development, the sites I've made and seen made look pretty good.
        Edit: I found examples [0] of games too with generated assets as well. These are all one shot so I imagine with more prompting you can get a decent game all without coding anything yourself.
        [0] https://www.youtube.com/watch?v=8brENzmq1pE
  - suddenlybananas 1 day ago
    
    And some people clearly hate humans.
  - iwanttoread 16 hours ago
    
    [dead]
- nurettin 8 hours ago
  
  I wouldn't say that anything before 11/2025 was a game changer, but after that, wow.
  That said, I wouldn't expect there to be an innovative solution to an unsolved problem written by AI or humans that has been open sourced within the past 3 months.
- dimgl 1 day ago
  
  I'm building an entire game on Unity using LLMs. It's an action RPG.
  
  [-]
  - seanmcdirmid 1 day ago
    
    Is it just a game built with LLMs or are you leaning into the cheap content gen capabilities to make the game exceptionally deep/braod?
    
    [-]
    - dimgl 1 day ago
      
      I'm building all of the systems with LLMs and using LLMs to fast track the creation of content such as storylines, characters, etc. All of the assets are mostly bought and created by me.
      
      [-]
      - seanmcdirmid 1 day ago
        
        Sounds fun! Asset creation...at least in terms of story content, should be the one area where LLMs would really shine, especially if it can somehow extend into logic and gameplay. Couple that with the ways of generating art assets (hard with an LLM, but it can do something at least), that would be cool. I hope to see these games in the future, although they might be labelled as slop unless done really well.
        
        [-]
        
        dimgl 21 hours ago
        
        Actually, LLM fiction writing is awfully bad. But it does help with ideas!
        I'm trying my hardest to make it feel high quality instead of just slop.
- jason_oster 1 day ago
  
  Personally, I’ve only been using a coding agent for a few months infrequently, so I have nothing to show for it. (It is not 100x productivity, that’s absurd.)
  But I have plenty of examples of really atrocious human written code to show you! TheDailyWtf has been documenting the phenomenon for decades.
- eviks 1 day ago
  
  Great question, here is the link from the future:
- xvector 1 day ago
  
  I work for a big tech company, most of our code today is written by agents. This includes backend infra and frontend app/UX code.
  It satisfies your relevant criteria: LLM-written, reliable, non-trivial.
  No major program is perfectly reliable so I wouldn't call it that (but we have fewer incidents vs human-written code), and "useful" is up to the reader (but our code is certainly useful to us.)
- silveraxe93 1 day ago
  
  Yeah, Claude Code.
- logicprog 1 day ago
  
  > single useful ... preferably reliable, program that solves a non-trivial problem that hasn't been solved before a bunch of times in publicly available code
  I see this originality criteria appended a lot, and
  1) I don't think it's representative of the actual requirements for something to be extremely useful and productivity-enhancing, even revolutionary, for programming. IDE features, testing, code generation, compilers — all of these things did not really directly help you produce more original solutions to original problems, and yet they were huge advances in program or productivity.
  I mean like. How many such programs are there in general?
  The vast vast majority of programs that are written are slight modifications, reorganizations, or extensions, of one or more programs that are already publicly available a bunch of times over.
  Even the ones that aren't could fairly easily be considered just recombinations of different pieces of programs that have been written and are publicly available dozens or more times over, just different parts of them combined in a different order.
  Hell, most code is a reorganization or recombination of the exact same types of patterns just in a different way corresponding to different business logic or algorithms, if you want to push it that far.
  And yet plenty of deeply unoriginal programs are very useful and fill a useful niche, so they get written anyway.
  2) Nor is it a particularly satisfiable goal. If there aren't, as a percentage, very many reliable, useful, and original programs that have been written in the decades since open source became a thing, why would we expect a five-year-old technology to have done so, especially when, obviously, the more reliable original and broadly useful programs have already written, the narrower the scope for new ones to satisfy the originality criteria?
  3) Nor is it actually something that we would expect even under the hypothesis that agents make people significantly more productive at programs. Even if agents give 100x productivity gains to writing a useful tool or service or program or improving existing ones with new features. We still wouldn't expect them to give necessarily very many much productivity gains at all to writing original programs, precisely because of their current technology is a product of deep thinking, understanding a specific domain, seeing a niche, inspiration, science, talent and luck much more than the ability to even do productive engineering.
- Rover222 1 day ago
  
  Not 100x but absolutely 4x to 5x increase in productivity for everyone on team on a large enterprise codebase that serves the military a lot of serious clients.
  To deny at least that level of productivity at this point, you have to have your head in the sand.
- mrcwinn 1 day ago
  
  No, but I have seen privately available code that matches this description.
  
  [-]
  - iwanttoread 16 hours ago
    
    [dead]
jstummbillig 1 day ago

It's so interesting that I start to feel a change, that is developing as a separate thing to capability. Previously, yeah sure, things changed but models got so outrageously better at the basic things that I simply wouldn't care.
Now... increasingly it's like changing a partner just so slightly. I can feel that something is different and it gives me pause. That's probably not a sign of the improvement diminishing. Maybe more so my capability to appreciate them.
I can see how one might get from here to the whole people being upset about 4o thing.
RivieraKid 1 day ago

Do software engineers here feel threatened by this? I certainly am. I'm surprised that this topic is almost entirely missing in these threads.

[-]
- AstroBen 1 day ago
  
  No. It turns into a complete mess without someone that knows what they're doing to steer it. It's an upgrade to autocomplete
  
  [-]
  - energy123 1 day ago
    
    Unless you're retiring in less than 5 years this is extremely short sighted.
    
    [-]
    - lurking_swe 1 day ago
      
      It’s also silly to try predicting the future 5 years from now, IMO. Historically progress is very unpredictable. It often plateaus when you least expect it.
      It’s good to be cautious and not in denial, but i usually ignore people who talk so authoritatively about the future. It’s just a waste of time. Everyone thinks they are right.
      My recommendation is have a very generous emergency fund and do your best to be effective at work. That’s the only thing you can control and the only thing that matters.
      
      [-]
      - xvector 1 day ago
        
        Or just move into technical leadership or management/executive permissions.
        In any case, everyone should be riding the AI wave! Anyone doing so should have enough to retire five years from now.
    - benry1 1 day ago
      
      I'm reading Maintenance of Everything and it has a section about the switch from artisan-crafted weapons to making uniform parts that feels comparable to this.
      French military had pioneered a way to make fully interchangeable weapon parts, but the French public fought back in fear of the jobs of the artisans who used to hand-make weapons. Over the next 20 years they completely lost their edge on the battlefield, nothing could be repaired in the field. Other countries embraced the change, could repair anything in the field with cheap and precise spare parts, and soon fostered in the industrial revolution.
      The artisans stopped being people who made weapons, the artisans became people who made machines that made weapons.
      
      [-]
      - energy123 11 hours ago
        
        > The artisans stopped being people who made weapons, the artisans became people who made machines that made weapons.
        Although many French artisans become unemployed because British industrial productivity made them uncompetitive. It was one of the causes of the French Revolution.
    - AstroBen 1 day ago
      
      What would things look like to make someone with currently ~10 years of experience unemployable?
      It's possible the job might change drastically, but I'm struggling to think of any scenario that doesn't also put most white collar professions out of work alongside me, and I don't think that's worth worrying about
      
      [-]
      - therealdrag0 1 day ago
        
        Unemployable is a charged word. A lot of outdated professions still have professionals. There are still professional horse drawn carriages.
        If the AI performance gains are 50% improvement, and companies decide they rather cut costs and pocket the difference, could be due to many factors, that leaves millions out of a job. And those performance gains are coming for many white collar jobs. I guess your premise is mass unemployment is not worth worrying about, so okay then.
        Marginal changes in productivity can make huge impacts to industries employment rates.
        
        [-]
        
        fatherwavelet 1 day ago
        
        People still pay thousands of dollars for wedding photographers even though everyone at the wedding also has a camera and many are taking their own pictures.
        I am not a software engineer and it seems to me if someone has experience as a software engineer before LLMs, they have skills no one will really be able to acquire again in the same way.
        I would expect current software engineers to eat the entire non-customer facing back office in the next ten years.
        
        [-]
        
        varjag 1 day ago
        
        > People still pay thousands of dollars for wedding photographers even though everyone at the wedding also has a camera and many are taking their own pictures.
        Wedding photography used to be the lowest in the pecking order of professional photography. Now all the photojournalists, travel magazine and corporate events photographers are as good as extinct. Even the arts market for photography been on decline for years.
        
        AstroBen 22 hours ago
        
        > I guess your premise is mass unemployment is not worth worrying about, so okay then
        My point wasn't that it's not a big deal. My point there is that if AI ends up taking a large % of white collar work you're going to have a huge portion of the population in the same boat. Maybe an overly optimistic view but that'll end up forcing change through politics
        ..I also think this is a ridiculously low % chance of happening and it would take something close to AGI to bring about. I don't know how you can use AI regularly and think we're anywhere close to that
        Contracting an incurable illness that renders me blind and thus unable to work is just as likely and not something I spend time worrying about
        > Marginal changes in productivity can make huge impacts to industries employment rates
        Maybe? We also have Jevon's paradox. Software is incredibly expensive to build right now - how many more applications for it can people find if the cost halves?
      - netdevphoenix 1 day ago
        
        > I'm struggling to think of any scenario that doesn't also put most white collar professions out of work alongside me
        You don't need to be out of a job to struggle. Just for your pay to remain the same (or lower), for your work conditions to degrade (you think jQuery spaguetti was a mess? good luck with AI spaguetti slop) or for competition to increase because now most of the devving involves tedious fixing of AI code and the actual programming heavy jobs are as fought for as dev roles at Google/Jane Street/etc.
        Devving isn't going anywhere but just like you don't punch cards anymore, you shouldn't expect your role in the coming decades to be the same as the 90s-25s period.
- llmslave 1 day ago
  
  theres alot of denial, and people that havent taken a serious look at the ai models
- epolanski 1 day ago
  
  Software developers should. Software engineers shouldn't.
  My experience is that most developers have little to no understanding about engineering at all: meaning weighting pros and cons, understanding the requirements thoroughly, having a business oriented mindset.
  Instead they think engineering is about coding practices and technologies to write better code.
  That's because they focus on the code, the craft, not money.
  
  [-]
  - netdevphoenix 1 day ago
    
    You should wonder whether any of those devs will train themselves to become engineers and whether the supply of engineers will be lower than the demand for them. Because if any of them become true, you will likely struggle to keep your employee stats relatively the same (ie you will struggle in very specific ways) unless you are the kind of person who doesn't need to interview to land a gig at a top 10 tech company.
- vatsachak 1 day ago
  
  AI is mostly garbage at creating useful abstractions. I'd feel threatened if I was a competitive programmer or IMO kid
  
  [-]
  - manmal 1 day ago
    
    Tell me you haven’t used codex-xhigh without telling me you haven’t used it. It’s bad at overall architecture and big picture. But not at useful abstractions.
- OsrsNeedsf2P 1 day ago
  
  I would feel threatened if I didn't invest in learning how to best use AI
- ReptileMan 1 day ago
  
  Jevons paradox hints that the situation is not as bleak as it sounds.
  
  [-]
  - svachalek 1 day ago
    
    I've been in this profession for 32 years now and this is my experience. Every time coding gets easier or cheaper, the response is first to lay off developers but quickly the demand for more software spikes and they need everyone back and more than ever.
    When we achieve true AGI we're truly cooked, but it won't just be software developers by definition of AGI, it will be everyone else too. But the last people in the building before they turn the lights out for good will be the software developers.
  - anthonypasq 22 hours ago
    
    i think i fundamentally agree that the demand for code is essentially infinite. Code has just been notoriously expensive and therefore it could only de deployed towards the most economically efficient activities. this is now changing.
- erfburfl 1 day ago
  
  [dead]
- worldsavior 1 day ago
  
  No. AI does not work well enough, you still need a person to look on it and CODE. It probably never will, until AGI which probably also in my opinion will never come.
  
  [-]
  - dude250711 1 day ago
    
    It's a super-special AI tier that can replace developers and other grunts, yet somehow cannot replace managers and C-suite.
    It can only replace whoever is not writing a fat cheque to it.
nickandbro 23 hours ago

I have found GPT 5.3-Codex to do exceedingly well when working with graphics rendering pipelines. They must have better training data or RL approaches than Antropic as I have given the same prompt and config to Opus 4.6 and it seems to have added unwanted rendering artifacts. This may be just an issue specific to my use case, but wonder since OpenAI is partners with MSFT, which makes lots of games, that this may be an area they heavily invested in
fishpham 1 day ago

Model card: https://cdn.openai.com/pdf/23eca107-a9b1-4d2c-b156-7deb4fbc6...
trilogic 1 day ago

When 2 multi billion giants advertise same day, it is not competition but rather a sign of struggle and survival. With all the power of the "best artificial intelligence" at your disposition, and a lot of capital also all the brilliant minds, THIS IS WHAT YOU COULD COME UP WITH?
Interesting

[-]
- sdf2erf 1 day ago
  
  Yeah they are both fighting for survival. No surprise really.
  Need to keep the hype going if they are both IPO'ing later this year.
  
  [-]
  - thethimble 1 day ago
    
    The AI market is an infinite sum market.
    Consider the fact that 7 year old TPUs are still sitting at near 100p utilization today.
  - superze 1 day ago
    
    How many IPOs can a company really do?
    
    [-]
    - anshumankmr 1 day ago
      
      IPO only ONE but followups are called FPOs.
    - re-thc 1 day ago
      
      As many as they want. They can "spin off" and then "merge" again.
- rishabhaiover 1 day ago
  
  What happened to you?
  
  [-]
  - raincole 1 day ago
    
    AI fried brains, unfortunately.
    
    [-]
    - wasmainiac 1 day ago
      
      I mean, he has a point it’s just not very eloquently written.
      
      [-]
      - trilogic 1 day ago
        
        I empathize with the situation, no elegance from them, no eloquence from me :)
- lossolo 1 day ago
  
  What's funny is that most of this "progress" is new datasets + post-training shaping the model's behavior (instruction + preference tuning). There is no moat besides that.
  
  [-]
  - Davidzheng 1 day ago
    
    "post-training shaping the models behavior" it seems from your wording that you find it not that dramatic. I rather find the fact that RL on novel environments providing steady improvements after base-model an incredibly bullish signal on future AI improvements. I also believe that the capability increase are transferring to other domains (or at least covers enough domains) that it represents a real rise in intelligence in the human sense (when measured in capabilities - not necessarily innate learning ability)
    
    [-]
    - CuriouslyC 1 day ago
      
      What evidence do you base your opinions on capability transfer on?
  - WarmWash 1 day ago
    
    >There is no moat besides that.
    Compute.
    Google didn't announce $185 billion in capex to do cataloguing and flash cards.
    
    [-]
    - causalmodels 1 day ago
      
      Google didn't buy 30% of Anthropic to starve them of compute
      
      [-]
      - WarmWash 1 day ago
        
        Probably why it's selling them TPUs.
  - riku_iki 1 day ago
    
    > is new datasets + post-training shaping the model's behavior (instruction + preference tuning). There is no moat besides that.
    sure, but acquiring/generating/creating/curating so much high quality data is still significant moat.
- iwanttoread 16 hours ago
  
  [dead]
tombert 1 day ago

Actually kind of excited for this. I've been using 5.2 for awhile now, and it's already pretty impressive if you set the context window to "high".
Something I have been experimenting with is AI-assisted proofs. Right now I've been playing with TLAPS to help write some more comprehensive correctness proofs for a thing I've been building, and 5.2 didn't seem quite up to it; I was able to figure out proofs on my own a bit better than it was, even when I would tell it to keep trying until it got it right.
I'm excited to see if 5.3 fairs a bit better; if I can get mechanized proofs working, then Fields Medal here I come!

[-]
- EnPissant 1 day ago
  
  "High" the the reasoning level. The context window never changes.
  
  [-]
  - tombert 1 day ago
    
    You're right! Still learning the details of this agentic stuff; I was pretty late to the party.
morleytj 1 day ago

The behind the scenes on deciding when to release these models has got to be pretty insanely stressful if they're coming out within 30 minutes-ish of each other.

[-]
- meisel 1 day ago
  
  I wonder if their "5.3" was continuously being updated, with regenerated benchmarks with each improvement, and they just stayed ready to release it when claude released
  
  [-]
  - morleytj 1 day ago
    
    This seems plausible. It would be shocking if these companies didn't have an automated testing suite which is recomputing these benchmarks on a regular basis, and uploading to a dashboard for supervision.
    Given that they already pre-approved various language and marketing materials beforehand there's no real reason they couldn't just leave it lined up with a function call to go live once the key players make the call.
- Havoc 1 day ago
  
  It’s also functionally not likely without some sort of insider knowledge or coordination
  
  [-]
  - morleytj 1 day ago
    
    Could be, could also be situations where things are lined up to launch in the near future and then a mad dash happens upon receiving outside news of another launch happening.
    I suppose coincidences happen too but that just seems too unlikely to believe honestly. Some sort of knowledge leakage does seem like the most likely reason.
  - sumedh 1 day ago
    
    The AI labs tell their partners when the models are coming out, the partners might be sharing the news with their sources in other AI labs.
dllrr 1 day ago

Using opus 4.6 in claude code right now. It's taking about 5x longer to think things through, if not more.

[-]
- andyferris 1 day ago
  
  The notes explicitly call out you may want to dial the effort setting back to medium to reduce latency/tokens (high being default, apparently there is a max setting too).
  
  [-]
  - gverrilla 1 day ago
    
    There's 3 options to choose from on /model: Low, medium and high effort.
ksynwa 1 day ago

Why does OpenAI have a separate model for coding (Codex) but Anthropic uses the same model for chatbots and coding?

[-]
- zozbot234 1 day ago
  
  They seem to be slowly moving away from having a separate coding model. With this release, they're calling the model Codex but expressly mention that it's also supposed to be more suitable than GPT 5.2 for general use.
  
  [-]
  - sejje 23 hours ago
    
    I remember a paper coming out a while back that said training the models to code made them much better at normal tasks. It improved their logic etc.
  - prng2021 1 day ago
    
    Where did they say that?
    
    [-]
    - zozbot234 1 day ago
      
      "The model advances both the frontier coding performance of GPT‑5.2-Codex and the reasoning and professional knowledge capabilities of GPT‑5.2, together in one model, which is also 25% faster. ... With GPT‑5.3-Codex, Codex goes from an agent that can write and review code to an agent that can do nearly anything developers and professionals can do on a computer."
      They're specifically saying that they're planning for an overall improvement over the general-purpose GPT 5.2.
zen4ttitude 8 hours ago

Does anyone know more about the benchmark? 60% accuracy gets a drumroll? How would Claude do? How would a human do? I tried the previous version and was not impressed. I went back to Claude that is very hard to beat, and versatile with context enrichment.
arjun810 22 hours ago

Our results on our rails app surprised us. Codex 5.3 far and away the best — much faster and cheaper (though cost isn’t relevant yet, since you can only access this model via a ChatGPT plan).
https://x.com/sergeykarayev/status/2019541031986032925?s=46
modeless 1 day ago

It's so difficult to compare these models because they're not running the same set of evals. I think literally the only eval variant that was reported for both Opus 4.6 and GPT-5.3-Codex is Terminal-Bench 2.0, with Opus 4.6 at 65.4% and GPT-5.3-Codex at 77.3%. None of the other evals were identical, so the numbers for them are not comparable.

[-]
- alexhans 1 day ago
  
  Isn't the best eval the one you build yourself, for your own use cases and value production?
  I encourage people to try. You can even timebox it and come up with some simple things that might look initially insufficient but that discomfort is actually a sign that there's something there. Very similar to moving from not having unit/integration tests for design or regression and starting to have them.
- rsanek 1 day ago
  
  I usually wait to see what ArtificialAnalysis says for a direct comparison.
- input_sh 1 day ago
  
  It's better on a benchmark I've never heard of!? That is groundbreaking, I'm switching immediately!
  
  [-]
  - modeless 1 day ago
    
    I also wasn't that familiar with it, but the Opus 4.6 announcement leaned pretty heavily on the TerminalBench 2.0 score to quantify how much of an improvement it was for coding, so it looks pretty bad for Anthropic that OpenAI beat them on that specific benchmark so soundly.
    Looking at the Opus model card I see that they also have by far the highest score for a single model on ARC-AGI-2. I wonder why they didn't advertise that.
    
    [-]
    - input_sh 1 day ago
      
      No way! Must be a coinkydink, no way OpenAI knew ahead of time that Anthropic was gonna put a focus on that specific useless benchmark as opposed to all the other useless benchmarks!?
      I'm firing 10 people now instead of 5!
netdevphoenix 1 day ago

How come that OpenAI and Anthropic both released their models pretty much at the same time? Does anyone know if the timing is coincidental?

[-]
- phil917 21 hours ago
  
  I would bet to be ready before the Superbowl ads
energy123 1 day ago

First impression: It's much faster for the same task.
When they hook it up to Cerebras it's going to be a head-exploding moment.
gallerdude 1 day ago

Both Opus 4.6 and GPT-5.3 one shot a Gameboy emulator for me. Guess I need a better benchmark.

[-]
- paxys 1 day ago
  
  As coding agents get "good enough" the next differentiator will be which one can complete a task in fewer tokens.
  
  [-]
  - tgtweak 1 day ago
    
    Or quicker, or more comprehensively for the same price.
  - nlh 1 day ago
    
    Or the same number of tokens in less time. Kinda feels like the CPU / modem wars of the 90s all over again - I remember those differences you felt going from a 386 -> 486 or from a 2400 -> 9600 baud modem.
    We're in the 2400 baud era for coding agents and I for one look forward to the 56k era around the corner ;)
- gf000 1 day ago
  
  Is such an emulator not part of their training data sets?
- well_ackshually 1 day ago
  
  There's hundreds of gameboy emulators available on Github they've been trained on. It's quite literally the simplest piece of emulation you could do. The fact that they couldn't do it before is an indictment of how shit they were, but a gameboy emulator should be a weekend project for anyone even ever so slightly qualified. Your benchmark was awful to begin with.
  
  [-]
  - plantain 1 day ago
    
    Your expectations are wild. Most software engineers could not write a game boy emulator - and now you need zero programming skills whatsoever to write one.
  - nasreddin 1 day ago
    
    "a gameboy emulator should be a weekend project for anyone even ever so slightly qualified" do you really believe something so ridiculous?
kingstnap 1 day ago

> GPT‑5.3-Codex was co-designed for, trained with, and served on NVIDIA GB200 NVL72 systems. We are grateful to NVIDIA for their partnership.
This is hilarious lol

[-]
- uh_uh 1 day ago
  
  How so?
  
  [-]
  - Philpax 1 day ago
    
    They're on shaky ground right now https://arstechnica.com/information-technology/2026/02/five-...
  - kingstnap 1 day ago
    
    Its kind of a suck up that more or less confirms the beef stories that were floating around this past week.
    In case you missed it. For example:
    Nvidia's $100 billion OpenAI deal has seemingly vanished - Ars Technica
    https://arstechnica.com/information-technology/2026/02/five-...
    Specifically this paragraph is what I find hilarious.
    > According to the report, the issue became apparent in OpenAI’s Codex, an AI code-generation tool. OpenAI staff reportedly attributed some of Codex’s performance limitations to Nvidia’s GPU-based hardware.
    
    [-]
    - dajonker 1 day ago
      
      There was never a $100 billion deal. Only a letter of intent which doesn't mean anything contractually.
    - esafak 1 day ago
      
      > OpenAI staff reportedly attributed some of Codex’s performance limitations to Nvidia’s GPU-based hardware.
      They should design their own hardware, then. Somehow the other companies seem to be able to produce fast-enough models.
      
      [-]
      - sumedh 1 day ago
        
        > They should design their own hardware
        They made a deal with Cerebras for fast inference.
ffitch 1 day ago

> our team was blown away > by how much Codex was able > to accelerate its own development
they forgot to add “Can’t wait to see what you do with it”
textlapse 1 day ago

I would love to see a nutritional facts label on how many prompts / % of code / ratio of human involvement needed to use the models to develop their latest models for the various parts of their systems.
karmasimida 1 day ago

For those who cared:
GPT-5.3-Codex dominates terminal coding with a roughly 12% lead (Terminal-Bench 2.0), while Opus 4.6 retains the edge in general computer use by 8% (OSWorld).
Anyone knows the difference between OSWorld vs OSWorld Verified?

[-]
- nopinsight 1 day ago
  
  From Claude 4.6 Thinking:
  OSWorld is the full 369-task benchmark. OSWorld Verified is a ~200-task subset where humans have confirmed the eval scripts reliably score success/failure — the full set has some noisy grading where correct actions can still get marked wrong.
  Scores on Verified tend to run higher, so they're not directly comparable.
ponyous 1 day ago

I think models are smart enough for most of the stuff, these little incremental changes barely matter now. What I want is the model that is fast.

[-]
- energy123 1 day ago
  
  I predict a bifurcation in usage.
  Serial usecases ("fix this syntax errors") will go on Cerebras and get 10x faster.
  Deep usecases ("solve Riemann hypothesis") will become massively parallel and go on slower inference compute.
  Teams will stitch both together because some workflows go through stages of requiring deep parallel compute ("scan my codebase for bugs and propose fixes") followed by serial compute ("dedupe and apply the 3 fixes, resolve merge conflict").
- newtwilly 1 day ago
  
  I've been using 5.1-codex-max with low reasoning (in Cursor fwiw) recently and it feels like a nice speed while still being effective. Might be worth a shot.
- derac 1 day ago
  
  This is faster if their marketing is right, it uses significantly less tokens. Gemini 3 flash is very good as well.
Robin_f 1 day ago

Anthropic mostly had an advantage in speed. It feels like with a 25% increase in speed with Codex 5.3, they are now losing that advantage as well.

[-]
- smith7018 1 day ago
  
  I just asked Opus 4.6 to debug a bug in my current changes and it went for 20 minutes before I interrupted it. Take that as you will.
  
  [-]
  - bgirard 1 day ago
    
    Doesn't feel like a useful data point without more context. For some hard bugs I'd be thrilled to wait 30 minutes for a fix, for a trivial CSS fix not so much. I've spent weeks+ of my career fix single bugs. Context is everything.
    
    [-]
    - smith7018 1 day ago
      
      Sure, but I've never experienced a 20 minute wait with CC before. It was an architectural question but it would have taken a couple minutes with a definitive answer on 4.5.
    - sejje 20 hours ago
      
      > I've spent weeks+ of my career fix single bugs.
      Same, same. It's not a useful data point at all.
      bug: llm alignment
      timeframe to fix : probably never
  - mrcwinn 1 day ago
    
    As they point out in their blog post, 4.6 intentionally thinks longer, but you can adjust it from the CLI.
cheriot 1 day ago

Page me when codex can run the right version of node. Are we all changing the system node version to match the current project again?
[shell_environment_policy]
inherit = "all"
experimental_use_profile = true
[shell_environment_policy.set]
NVM_DIR = "[redacted]"
PATH = "[redacted]"

[-]
- Sn0wCoder 16 hours ago
  
  If you are already using Volta in your project Codex will use the correct version assuming you are running in the same directory as your .json file and the json file has the” volta”:{ “node”: “xx.x.x”, “npm”: “xx.x.x”} configured. Personally use a Dockerfile to setup the container with volta installed. Need to set up Volta and configure at least one version of Node then install Codex in the docker. One caveat is you need to update codex with the initial version of node assuming it’s not the same as your project. If you are using one image per project you should never run into this but I have been using one image and firing up a container for each project, so it was great to see Codex able to use the correct version configured for the project via Volta.
  From other comments sounds like Codex using mise for internal tools can cause issues but not sure that is 100% Codex fault if the project is not already defining the node/npm version in the json “engines” entry. If it’s ignoring that entry then I guess this is a valid complaint, but not sure how Codex is supposed to guess which version of tools to use for different projects.
  Would you mind adding more details as to the exact setup where Codex is using the wrong version?
  
  [-]
  - cheriot 13 hours ago
    
    Codex is using a login shell so moving my PATH setup to .zprofile fixed it (previously was in .zshrc). Now we just need to write this on the internet enough times that future codex can suggest the fix :p
- maxkfranz 1 day ago
  It worked for me after I configured mise. I needed the mise setup in both `.zprofile` and `.zshrc` for Codex to pick it up. I think mise sets up itself in one of those by default, but Codex uses the other. I expect the same problem would present itself with nvm.
  I.e. `eval "$(/Users/max/.local/bin/mise activate zsh)"` in `.zprofile` and `.zshrc`
  Then Codex will respect whatever node you've set as default, e.g.:
```
    mise install node@24
    mise use -g node@24
```
  Codex might respect your project-local `.nvmrc` or `mise.toml` with this setup, but I'm not certain. I was just happy to get Codex to not use a version of node installed by brew (as a dependency of some other package).
  [-]
  - cheriot 13 hours ago
    
    Thanks! I moved my PATH setup to .zprofile and everything works now. Brew had added itself to .zprofile and everything else was in .zshrc.
    
    [-]
    - maxkfranz 9 hours ago
      
      Glad it worked out. And I agree it’s annoying that this doesn’t just work out of the box. It’s not like node/nvm are uncommon, so you’d think they would have ran into the issue when using their own tool.
- smarx007 1 day ago
  
  Both Claude and Gemini (the web variants, not CLI) tried to downgrade my .NET 10 projects to .NET 9 at least a few times.
dawidg81 1 day ago

May AI not write the code for me.
May I at least understand what it has "written". AI help is good but don't replace real programmers completely. I'm enough copy pasting code i don't understand. What if one day AI will fall down and there will be no real programmers to write the software. AI for help is good but I don't want AI to write whole files into my project. Then something may broke and I won't know what's broken. I've experienced it many times already. Told the AI to write something for me. The code was not working at all. It was compiling normally but the program was bugged. Or when I was making some bigger project with ChatGPT only, it was mostly working but after a longer time when I was promting more and more things, everything got broken.

[-]
- katspaugh 1 day ago
  
  Honest question: have you tried evolving your code architecture when adding features instead of just "promting more and more things"?
  
  [-]
  - dawidg81 1 day ago
    
    I've tried that too but it was almost the same, chatgpt kept forgetting many things about the code and project structure. In summary AI can get problematic for me and i get with troubles with it. This is like one of the reasons why I still prefer traditional text editor for writing code like Vim over a "software on steroids" like VS Code and things like that...
- pixl97 1 day ago
  
  > What if one day AI will fall down and there will be no real programmers to write the software.
  What if you want to write something very complex now that most people don't understand? You keep offering more money until someone takes the time to learn it and accomplish it, or you give up.
  I mean, there are still people that hammer out horseshoes over a hot fire. You can get anything you're willing to pay money for.
- nubg 1 day ago
  
  Sorry but companies will not hire you but instead a person who learned how to code with AI. Get with the times or lose.
  
  [-]
  - dawidg81 1 day ago
    
    I'm afraid of all of the modern world especially in technology, I guess if now I would "come back" to all of modern and new things: the commercialized world, AI, corporations, etc...my head would explode. I mean I can't imagine living in such world. I am not sure if everything would be alright eith myself in all this everything,This is just too much...
  - cheeze 1 day ago
    
    It's that Austin Powers clip of the guy slowly getting smooshed by the steam roller.
exabrial 1 day ago

After using Anthropic's products, I think it's going to be difficult to go back to OpenAI. It feels more like a discussion with a peer; ChatGPT has always felt like arguing with an idiot on Reddit.

[-]
- replwoacause 1 day ago
  
  I agree the tone is better on Claude but the limits suck on the Pro plan.
imasliev 1 day ago

GPT-5.2-Codex was so cool at price/value rate, hope 5.3 will not ruin the race with claude
mmaunder 1 day ago

Take a screenshot of ARG-AGI-2 leaderboard now because GPT-5.3-Codex isn't up there yet and I suspect it'll cram down Claude Opus 4.6 which rules the roost for the next few hours. King for a day.
prng2021 1 day ago

Did they post the knowledge cutoff date somewhere

[-]
- jacekm 1 day ago
  
  It's here: https://platform.claude.com/docs/en/about-claude/models/over...
  Reliable knowledge cutoff: May 2025, training data cutoff: August 2025
  
  [-]
  - brikym 1 day ago
    
    This is the thread for GPT 5.3
binsquare 1 day ago

At first try it solved a problem that 5.2 couldn't previously.
Seems to be slower/thinks longer.
koolala 1 day ago

I want to recompile a Rust project to be f32 instead of f64.
Am I better off buying 1 month of Codex, Claude, or Antigravity?
I want to have the agent continuesly recompile and fix compile errors on loop until all the bugs from switching to f32 are gone.

[-]
- int_19h 1 day ago
  
  Antigravity (and Gemini in general) is not on par with the rest when it comes to agentic coding.
  Between Codex and Claude, Codex will have much more generous limits for the same price, especially if you use top-of-the-line models (although for your task, Sonnet might actually be good enough).
- azuanrb 1 day ago
  
  Codex by a mile. Also, there's double rate limit until April. So you're paying 1 month for 2 months usage.
- TuxSH 1 day ago
  
  If I'm not mistaken Codex is free until April 2nd with the previous generous rate limits (while paying customers get 2x).
- vatsachak 1 day ago
  
  Literally just find and replace
  
  [-]
  - koolala 1 day ago
    
    find and replace is step 1 that generates all the compile errors I want it to loop through
    I'm wanting to do it on an entire programming language made in rust: https://github.com/uiua-lang/uiua
    Because there are no float32 array languages in existence today
    
    [-]
    - xyzsparetimexyz 1 day ago
      
      Why do you want a float32 array language? Anyway the free glm4.6 model that is opencode defaults to should be fine. Why pay for something to do this.
      
      [-]
      - koolala 1 day ago
        
        I want to use an array language for Real-time 3D. Float32 is faster for real-time calculations and can map memory directly to the GPU since 3D graphics runtimes are limited to float32.
- argsnd 1 day ago
  
  All of them can do it but Codex has the least frustrating usage limits.
  
  [-]
  - koolala 1 day ago
    
    When using it in VSCode? The browser system running its own container seems like it would be the most demanding on their resources. The stand-alone client is Mac-only but I don't know if it makes a difference.
    My goal is to do it within the usage I get from a $20 monthly plan.
    
    [-]
    - int_19h 1 day ago
      
      You don't have to use their container thingy though, you can run Codex (CLI or VSCode, it doesn't matter) just fine in YOLO mode in your own local containers, or VMs, or however you want to isolate it.
    - energy123 1 day ago
      
      Why would you use it in VSCode?
      OpenAI are offering double the normal usage limits for Codex for two months. Go with them and do it in the terminal or the Mac OS codex app if you have a Mac.
      
      [-]
      - koolala 1 day ago
        
        It's different to use it in the terminal vs. vscode? Don't have a mac.
        
        [-]
        
        energy123 1 day ago
        
        Sorry I wasn't aware it's available in vscode. Scratch my suggestion, then.
        
        [-]
        
        koolala 1 day ago
        
        It is confusing especially when token efficiency is on the line.
- EmilStenstrom 1 day ago
  
  Doesn't matter which one. All of them can do things like this now, given a good enough feedback loop. Which your problem has.
DrBazza 22 hours ago

It took several decades for the language server protocol and debugger server protocol (or whatever it's called). Is there a common 'agent' protocol yet? Or are these companies still in the walled-garden phase?

[-]
- arcanemachiner 22 hours ago
  
  https://agentclientprotocol.com/get-started/introduction
farazbabar 1 day ago

I have wanted to hold back from answering comments that ask for proof of real work/productivity gains because everyone works differently, has different skill levels and frankly not everyone is working on world changing stuff. I really liked a comment someone made a few of these posts ago, these models are amazing! amazing! if you don't actually need them, but if you actually do need them, you are going to find yourself in a world of hurt. I cannot agree more, I (believe) I am a good software engineer, I have developed some interesting pieces of software over the decades and usually when I got passionate about a project, I could do really interesting things within weeks, sometimes months. I will say this, I am working on some really cool stuff, stuff I cannot tell you about, or else. And my velocity is for what used to take months is days and hours for what used to take weeks. I still review everything, I understand all the gotchas of distributed systems, performance, latency/throughput, C, java, SQL, data and infra costs, I get all of it so I am able to catch these mofos when they are about to stab me in the back but man! my productivity is through the roof. And I am loving it. Just so I can avoid saying I cannot tell you I am working on, I will start something that I can share soon (as soon as decades of pent up work is done, its probably less than a few months away!). Take it with a grain of salt, and know this, these things are not your friends, they WILL stab you in the back when you least expect them, cut a corner, take a short cut, so you have to be the PHB (dilbert reference!) with actual experience to catch them slacking. Good luck.
aavci 1 day ago

To anyone trying this, does this unlock anything you tried to do with the past LLM models but failed and now you can try again? Do you find this as an incremental improvement or something that brings in new opportunities?
gwd 1 day ago

gpt-5.3-codex isn't available on the API yet. From TFA:
> We are working to safely enable API access soon.
EZ-E 1 day ago

Can someone explain the difference between this and VSCode agent chat? Except the fact that it's a separate app?

[-]
- gverrilla 1 day ago
  
  VsCode = IDE Codex/Claude Code = TUI
jdthedisciple 1 day ago

Gotta love how the game demo's page title is "threejs" – I guess the point was to demo its vibe-coding abilities anyway, but yea..
sidgarimella 1 day ago

Many are saying codex is more interactive but ironically I think that very interactivity/determinism works best when using codex remotely as a cloud agent and in highly async cases. Conversely I find opus great locally, where I can ram messages into it to try to lever its autonomy best (and interrupt/clean up)
aavci 1 day ago

“our team was blown away by how much Codex was able to accelerate its own development.”
At what point will LLMs be autonomously self creating new versions of themselves?
tyfon 1 day ago

I'm having a hard time parsing the openai website.
Anyone know if it is possible to use this model with opencode with the plus subscription?

[-]
- mbil 1 day ago
  
  It's possible to use opencode with the plus subscription using this plugin for auth [0][1]. Just tested this and it appears to work.
  [0]: https://opencode.ai/docs/ecosystem/#:~:text=Use%20your%20Cha...
  [1]: https://github.com/numman-ali/opencode-openai-codex-auth
  
  [-]
  - fisf 21 hours ago
    
    You do not need a plugin anymore.
__mharrison__ 1 day ago

I never really used Codex (found it to slow) just 5.2, which I going to be an excellent model for my work. This looks like another step up.
This week, I'm all local though, playing with opencode and running qwen3 coder next on my little spark machine. With the way these local models are progressing, I might move all my llm work locally.

[-]
- andix 1 day ago
  
  I think codex got much faster for smaller tasks in the last few months. Especially if you turn thinking down to medium.
- raffkede 1 day ago
  
  I think the slow feeling is a UI thing in codex
  
  [-]
  - __mharrison__ 1 day ago
    
    I realize my comment was unclear. I use codex the CLI all the time, but generally with this invocation: `codex --full-auto -m gpt-5.2`
    However, when I use the 5.2codex model, I've found it to be very slow and worse (hard to quantify, but I preferred straight-up 5.2 output).
GenerWork 1 day ago

I find it very, very interesting how they demoed visuals in the form of the “soft SaaS” website and mentioned how it can do user research. Codex has usually lagged behind Claude and Gemini when it comes to UX, so I’m curious to see if 5.3 will take the lead in real world use. Perhaps it’ll be available in Figma Make now?

[-]
- brokencode 1 day ago
  
  I’m hoping they add better IDE integration to track active file and selection. That’s the biggest annoyance I have in working with Codex.
kingstnap 1 day ago

That was fast!
I really do wonder whats the chain here. Did Sam see the Opus announcement and DM someone a minute later?

[-]
- Mond_ 1 day ago
  
  OpenAI has a whole history of trying to scoop other providers. This was a whole thing for Google launches, where OpenAI regularly launched something just before Google to grab the media attention.
  
  [-]
  - rsanek 1 day ago
    
    Some recent examples:
    GPT-4o vs. Google I/O (May 2024): OpenAI scheduled its "Spring Update" exactly 24 hours before Google’s biggest event of the year, Google I/O. They launched GPT-4o voice mode.
    Sora vs. Gemini 1.5 Pro (Feb 2024): Just two hours after Google announced its breakthrough Gemini 1.5 Pro model, Sam Altman tweeted the reveal of Sora (text-to-video).
    ChatGPT Enterprise vs. Google Cloud Next (Aug 2023): As Google began its major conference focused on selling AI to businesses, OpenAI announced ChatGPT Enterprise.
- NewsaHackO 1 day ago
  
  I assume some sort of corporate espionage. This is high stakes after all
- maxpert 1 day ago
  
  Tell me that you are hurt without telling me that you are hurt this applies to Sam right now
foft 1 day ago

Having used codex a fair bit I find it really struggles with … almost anything. However using the equivalent chat gpt model is fantastic. I guess it’s a matter of focus and being provided with a smaller set of code to tackle.

[-]
- sumedh 1 day ago
  
  Can you share your prompts?
rustyhancock 1 day ago

Anyone remember the dot-com era when you would see one provider claim the most miles of fibre and then later that week another would have the title?
ecshafer 1 day ago

Funny that this and Opus 4.6 released within minutes of each other. Each showing similar score improvements. Each claiming to be revolutionary.
jpau 1 day ago

Interesting that this was released without a prior GPT-5.3 release. I wonder if that means we won't see a GPT-5.3?
vatsachak 1 day ago

AI designed websites are so easy to spot that I need to actively design my UI so that it doesn't look AI
maheshrijal 1 day ago

It seems Fast!
davidmurdoch 1 day ago

I've been using 5.2 the way they're describing the new use case for 5.3 this whole time.
synergy20 1 day ago

i like the opus 4.6 announcement a lot more, concise and to the point. for the 5.3 codex, it's a long post, but still, the most important info, the context window, is nowhere to be found. thus, I'm keeping using opus.
edem 1 day ago

So can I use this from Opencode? Because Anthropic started to enforce their TOS to kill the Opencode integration

[-]
- tfehring 1 day ago
  
  OpenAI models in general, yes - `opencode auth login`, select OpenAI, then ChatGPT Pro/Plus. I just checked and 5.3-codex isn't available in opencode yet, but I assume it will be soon.
- avb 1 day ago
  
  You can also use via Opencode Zen, Github Copilot, or probably any number of other model providers that Opencode integrates with.
  Not sure why everyone stays focused on getting it from Anthropic or OpenAI directly when there are so many places to get access to these models and many others for the same or less money.
- regularfry 1 day ago
  
  I've tried opus 4.5 in opencode via the GitHub Copilot API, mostly to see if it works all. I don't think that broke any terms of service? But also I haven't checked how much more expensive I made it for myself over just calling them directly.
- rs_rs_rs_rs_rs 1 day ago
  
  You can use Anthropic models in Opencode, make an api key and you're good to do(you can even use the in house Opencode router, Zen).
  What you can't do is pretend opencode is claude code to make use of that specific claude code subscription.
- InsideOutSanta 1 day ago
  
  Yes, OpenAI said they'd allow usage of their subscriptions in opencode.
PieUser 1 day ago

How'd they both release at the same time? Insiders?
simianwords 1 day ago

Any notes on pricing?

[-]
- Tiberium 1 day ago
  
  It's not in the API yet - "We are working to safely enable API access soon.", but I assume the rate-limits won't be worse than for 5.2 Codex.
  
  [-]
  - nine_k 1 day ago
    
    Ah, "It's ready, but not yet".
    
    [-]
    - yunyu 1 day ago
      
      You can just use it outside of the API?
kopollo 1 day ago

Where is the google?

[-]
- hsaliak 1 day ago
  
  gemini-3-flash-preview will be GA soon i hope. /s
bryanhogan 1 day ago

The most important question: Can it do Svelte now?

[-]
- speedgoose 1 day ago
  
  Today is the best day to rewrite everything in React. You may not enjoy React, but AI agents do. And they are the ones writing the code.
  
  [-]
  - bryanhogan 1 day ago
    
    But human and AI agents enjoy writing Svelte even more.
    This really is a non-argument.
    
    [-]
    - speedgoose 1 day ago
      
      I don’t know, you ask above whether it can do svelte now.
- davidmurdoch 1 day ago
  
  5.2 was already very good with svelte 5, at least when you have the svelte MCP server set up.
bg24 1 day ago

I am on a max subscription for Claude, and hate the fact that OpenAI have not figured out that $20 => $200 is a big jump. Good luck to them. In terms of model, just last night, Codex 5.2 solved a problem for me which other models were going round and round. Almost same instructions. That said, I still plan to be on $100 Claude (overall value across many tasks, ability to create docs, co-work), and may bump up OpenAI subscription to the next tier should they decide to introduce one. Not going to $200 even with 5.3, unless my company pays for it.

[-]
- aerhardt 1 day ago
  
  I'm coding about 6-9h per day with Codex CLI on the $20 Plus sub, occasionally switching to extra-high reasoning and feeding it massive contexts, all tools enabled, sometimes 2-3 terminal sessions running in parallel and I've never hit limits... I operate on small-ish codebases but even so I try to work in the most local scope possible with AGENTS.md at the sub-directory levels.
  Are you really hitting limits, or are you turned off by the fact you think you will?
  
  [-]
  - bg24 1 day ago
    
    You are correct :-) I am turned off by the fact that I will hit the limit if I used more. But you gave me confidence. I guess $20 can go a long way. I think only once in the last 3 months I got rate limited in Codex.
- satvikpendem 1 day ago
  
  You should look into Kilo Pass by Kilo Code (https://kilo.ai/features/kilo-pass). It's basically a fixed subscription and your credits roll over each month, and you get free extra credits too which are used up first before paid credits. It's similar to paying for Cursor except the credits roll over which is why I'm contemplating moving to it, because I don't want to be locked into any one LLM provider the way Claude Code or Codex make you become.
  
  [-]
  - rektlessness 1 day ago
    
    I was wondering how KiloCode Kilo Pass pricing compared to OpenRouter's top-up pricing, and did some digging and discovered the main difference is that OpenRouter provides a standard API key (sk-or-...) that works in any application (LangChain, curl, your own Python apps), while Kilo Pass credits are tied to the Kilo Gateway, which is designed to power the KiloCode Extension (VS Code/JetBrains) and CLI. KiloCode does not appear to allow you generate a "Kilo API Key" to use in your external Python scripts or third-party apps. But the monthly bonus credits are sweet.
    
    [-]
    - satvikpendem 1 day ago
      
      Yes, it's for development not deployment.
- andix 1 day ago
  
  I guess the jump is on purpose. You can buy Codex credits and also use codex via the API (manual switching required).
- wiether 1 day ago
  
  I use Codex in OpenCode through the API and find the experience quite enjoyable.
  
  [-]
  - bg24 1 day ago
    
    Need to try OpenCode. Thanks.
virtualzx 1 day ago

is so fun that the two releases used almost completely non-overlapping benchmarks!
jiggawatts 1 day ago

I think this announcement says a lot about OpenAI and their relationship to partners like Microsoft and NVIDIA, not to mention the attitude of their leadership team.
On Microsoft Foundry I can see the new Codex 4.6 model right now, but GPT-5.3 is nowhere to be seen.
I have a pre-paid account directly with OpenAI that has credits, but if I use that key with the Codex CLI, it can't access 5.3 either.
The press release very prominently includes this quote: "GPT‑5.3-Codex was co-designed for, trained with, and served on NVIDIA GB200 NVL72 systems. We are grateful to NVIDIA for their partnership."
Sounds like OpenAI's ties with their vendors are fraying while at the same time they're struggling to execute on the basics like "make our own models available to our own coding agents", let alone via third-party portals like Microsoft Foundry.

[-]
- zozbot234 1 day ago
  
  GPT 5.3 is not in the API yet AIUI.
  
  [-]
  - jiggawatts 1 day ago
    
    Translation: "We've announced this great new thing that we're struggling to make available on channels we control, unlike our competitors who seem to have no issues despite a fraction of our budget!"
petetnt 1 day ago

Whoa, I think GPT-5.2-Codex was a disappointment, but GPT-5.3-Codex is definitely the future!
roya51788 1 day ago

what are the benchmarks against opus 4.6?
mrcwinn 1 day ago

According to Sam Altman, Anthropic is for "rich people." Judging by his $4 million man-baby Koeniggsegg, he must be a huge Claude Code user!

[-]
- swordsith 1 day ago
  
  He would be! If they didn't get banned from using it in the OAI offices. He's so mad.
drcongo 1 day ago

Does it insert adverts in your code?
hubraumhugo 1 day ago

Anybody else not seeing it available in Codex app or CLI yet (with Plus)?

[-]
- haneul 1 day ago
  
  My codex CLI didn’t notice version bump available, but I manually did pnpm add -g @openai/codex and 5.3 was there after.
heraldgeezer 1 day ago

Anthropic and GTP 2 new models at once?
wahnfrieden 1 day ago

Pelican seems much worse than the Opus 4.6 one (though the bicycle is more accurate):
https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...

[-]
OutOfHere 1 day ago

It is absurd to release 5.3-Codex before first releasing 5.3.
Also, there is no reason for OpenAI and Anthropic to be trying to one-up each other's releases on the same day. It is hell for the reader.

[-]
- aurareturn 1 day ago
  
  Because Claude Code is stealing the thunder so OpenAI is focusing on coding now.
  
  [-]
  - whizzter 1 day ago
    
    Yeah, Claude Code is what everyone is talking about these days and since OpenAI has always been the spending driver being 2nd or 3rd fiddle just isn't acceptable if they're gonna justify it.
  - stri8ted 1 day ago
    
    That is where the money is.
    
    [-]
    - ghosty141 1 day ago
      
      This. I think software development is the best usecase for AI yet. I use it almost daily at work and it's a huge help.
      Enterprise customers will happily pay even 100$/mo subscriptions and it has a clear value proposition that can be decently verified.
      
      [-]
      - OutOfHere 23 hours ago
        
        Revenue should not be confused with profit. The large AI companies must easily be spending more on compute than they're making from a $20-200/mo subscription. In the best case it might break even for the AI companies. There is no way that they're actually earning a profit from these subscriptions at this time.
    - OutOfHere 1 day ago
      
      It's where the revenue is, but it isn't going to be where the profit is. Developers will easily use absurdly large amounts of compute, costing the AI provider a lot more than they receive in revenue.
- int_19h 1 day ago
  
  Let them fight, that's how we as users get more tokens for less money. If only all markets were so competitive...
  
  [-]
  - OutOfHere 23 hours ago
    
    Sure. At this point, what matters more to me imho is in effect the Plan stage where I refine my crude specification, iteratively and repeatedly, until I work out all the flaws and minimally necessary details in it. This is hard to do and uses up a lot of tokens, and it is where a lot of my initial effort goes. I have literally repeatedly exhausted my token quota just in this stage alone. I can take then take this refined specification to even a dumb agent from a year ago, and it would have no trouble producing decent code for it. This refined spec is what is equally important to me as the code. Tests, both unit and integration, also go hand in hand with the spec, although they're less important to me if I am carefully reviewing every line of code, and more important when vibe coding instead.
- apetresc 1 day ago
  
  Why is it absurd?
  
  [-]
  - OutOfHere 1 day ago
    
    At the very least, it is absurd to announce a model but not release it on the same day, making it vaporware. Was it released?
    
    [-]
    - apetresc 22 hours ago
      
      Which model, 5.3 or 5.3-Codex? Yes, 5.3-Codex was announced and released. 5.3 wasn't announced. None of it is "absurd", and it also wouldn't have been "absurd" if they announce something but don't release it that same day (which they didn't do, but if they had - what exactly is absurd about that? Companies make announcements about future releases ALL the time.)
- tomashubelbauer 1 day ago
  
  I agree, I was confused about where 5.3 non Codex was. 5.2-Codex disappointed me enough that I won't be giving 5.3 Codex a try, but I'm looking forward to trying 5.3 non Codex with Pi.
  
  [-]
  - sunaookami 1 day ago
    
    GPT-5.x in general are very disappointing, the only good chat model was GPT-5 in the first week before they made "the personality warmer" and Codex in general was always kinda meh.
nubg 1 day ago

lmao so cringe that they delay releasing the model until anthropic does

[-]
- sumedh 1 day ago
  
  Its good from a marketing perspective though, steal the thunder of your competitor.
raincole 1 day ago

Almost like Anthropic and OpenAI are trying to front run each other
copilot_king 1 day ago

[dead]
xyst 1 day ago

[flagged]
mannanj 1 day ago

[flagged]
verdverm 1 day ago

[flagged]

[-]
- Mond_ 1 day ago
  
  why call him that when "saltman" is right there
  
  [-]
  - dbt00 1 day ago
    
    I can't get over Scam Altman.
  - verdverm 1 day ago
    
    The Dr Seuss reference was more appealing to me at the time
copilot_king_2 1 day ago

[flagged]
I_am_tiberius 1 day ago

I'd like to know if and how much illegal use of customer prompts are used for training.

[-]
- xlbuttplug2 1 day ago
  
  "But we anonymize prompts before training!"
  Meanwhile the prompt: Crop this photo of my passport
- renewiltord 1 day ago
  
  Oh yeah that’s in the “These Are The Illegal Things We Did” section 7.4 in the Model Card.
shibeprime 1 day ago

I know we just got a reset and a 2× bump with the native app release, but shipping 5.3 with no reset feels mismatched. If I’d known this was coming, I wouldn’t have used up the quota on the previous model.
maxpert 1 day ago

Is this me or Sam is being absolute sore loser he is and trying to steal Opus thunder?

[-]
- nickthegreek 1 day ago
  
  Why is it loser? He very well could be a sore winner here.
  
  [-]
  - koakuma-chan 1 day ago
    
    OpenAI is still the only AI company that has structured outputs. Anthropic now supports JSON schema but you can't specify array length.
    
    [-]
    - jiggawatts 1 day ago
      
      Google Gemini definitely has structured output.
      
      [-]
      - roflcopter69 1 day ago
        
        Not so fast! Check this out https://github.com/googleapis/python-genai/issues/460
        In my experience, you can only use Gemini structured outputs for the most trivial of schemas. No integer literals, no discriminated unions and many more paper cuts. So at least for me, it was completely unusable for what I do at work.
        
        [-]
        
        jiggawatts 19 hours ago
        
        That's the level of coding I expect from a bunch of Python-only ML computer scientists, but still... wow.
        On the upside, they seem to have fixed it: https://blog.google/innovation-and-ai/technology/developers-...
    - wahnfrieden 1 day ago
      
      Can you elaborate what you mean - OAI structured outputs means JSON schema doesn't it? So are you just saying they both support JSON schema but Anthropic has a limitation?
      
      [-]
      - koakuma-chan 1 day ago
        
        OpenAI, in addition to JSON schema, supports "context-free grammar"[0], i.e. regex and lark. Anthropic also supports JSON schema since a few weeks ago, but they don't support specifying the length of JSON array, so you still have to worry about the model producing invalid output.
        [0]: https://platform.openai.com/docs/guides/function-calling#con...
        One thing that pisses me off is this widespread misunderstanding that you can just fall back to function calling (Anthropic's function calling accepts JSON schema for arguments), and that it's the same as structured outputs. It is not. They just dump the JSON schema into the context without doing the actual structured outputs. Vercel's AI SDK does that and it pisses me off because doing that only confuses the model and prefilling works much better.
- OutOfHere 1 day ago
  
  They both are doing this to each other.
  BTW, loser is spelled with a single o.
- wahnfrieden 1 day ago
  
  You could also claim that Anthropic is trying to scoop OpenAI by launching minutes earlier, as OpenAI has done with Google in the past.
  For downvoters, you must be naive to think these companies are not surveilling each other through various means.
- fHr 1 day ago
  
  lol cope harder