Gemini 3 Deep Think

(blog.google)

1031 points | by tosh 1 day ago

59 comments

lukebechtel 1 day ago

Arc-AGI-2: 84.6% (vs 68.8% for Opus 4.6)
Wow.
https://blog.google/innovation-and-ai/models-and-research/ge...

[-]
- raincole 1 day ago
  
  Even before this, Gemini 3 has always felt unbelievably 'general' for me. It can beat Balatro (ante 8) with text description of the game alone[0]. Yeah, it's not an extremely difficult goal for humans, but considering:
  1. It's an LLM, not something trained to play Balatro specifically
  2. Most (probably >99.9%) players can't do that at the first attempt
  3. I don't think there are many people who posted their Balatro playthroughs in text form online
  I think it's a much stronger signal of its 'generalness' than ARC-AGI. By the way, Deepseek can't play Balatro at all.
  [0]: https://balatrobench.com/
  
  [-]
  - tl 1 day ago
    
    Per BalatroBench, gemini-3-pro-preview makes it to round (not ante) 19.3 ± 6.8 on the lowest difficulty on the deck aimed at new players. Round 24 is ante 8's final round. Per BalatroBench, this includes giving the LLM a strategy guide, which first-time players do not have. Gemini isn't even emitting legal moves 100% of the time.
    
    [-]
    - raincole 21 hours ago
      
      It beats ante eight 9 times out of 15 attempts. I do consider 60% winning chance very good for a first time player.
      The average is only 19.3 rounds because there is a bugged run where Gemini beats round 6 but the game bugs out when it attempts to sell Invisible Joker (a valid move)[0]. That being said, Gemini made a big mistake in round 6 that would have costed it the run at higher difficulty.
      [0]: given the existence of bugs like this, perhaps all the LLMs' performances are underestimated.
      
      [-]
      - bgirard 7 hours ago
        
        Are there benchmarks if we allow the LLM to practice and study the game?
        
        [-]
        
        raincole 6 hours ago
        
        You can make one, the balatro bench is open source. But I'm quite sure it'd be crazily expensive for a hobby project. At the end of the day, LLM can't actually 'practice and learn.'
        
        [-]
        
        bgirard 2 hours ago
        
        I've gotten pretty good results by prompting "What did you struggle on? Please update the instructions in <PROMPT/SKILL>" and "Here's your conversation <PASTE>, please see what you struggled with and update <PROMPT/SKILL>".
        It's hit or miss, but I've been able to have it self improve on prompts. It can spot mistakes and retain things that didn't work. Similar to how I learned games like Balatro. Playing Balatro blind, you wouldn't know which jokers are coming and have synergy together, or that X strategy is hard to pull off, or that you can retain a card to block it from appearing in shops.
        If the LLM can self discover that, and build prompt files that gradually allow it to win at the highest stake, that's an interesting result. And I'd love to know which models do best at that.
      - eru 9 hours ago
        
        Why not include a description of the bugs to avoid in the strategy guide?
    - YeGoblynQueenne 22 hours ago
      
      https://balatrobench.com/
  - S1M0N38-hn 22 hours ago
    
    Hi, BalatroBench creator here. Yeah, Google models perform well (I guess the long context + world knowledge capabilities). Opus 4.6 looks good on preliminary results (on par with Gemini 3 Pro). I'll add more models and report soon. Tbh, I didn't expect LLMs to start winning runs. I guess I have to move to harder stakes (e.g. red stake).
    
    [-]
    - raincole 20 hours ago
      
      Thank you for the site! I've got a few suggestions:
      1. I think winrate is more telling than the average round number.
      2. Some runs are bugged (like Gemini's run 9) and should be excluded from the result. Selling Invisible Joker is always bugged, rendering all the runs with the seed EEEEEE invalid.
      3. Instead of giving them "strategy" like "flush is the easiest hand..." it's fairer to clarify some mechanisms that confuse human players too. e.g. "played" vs "scored".
      Especially, I think this kind of prompt gives LLM an unfair advantage and can skew the result:
      > ### Antes 1-3: Foundation
      > - *Priority*: One of your primary goals for this section of the game should be obtaining a solid Chips or Mult joker
      
      [-]
      - S1M0N38-hn 4 hours ago
        
        Im pretty open to feedback and contribution (also regarding the default strategy). So feel free to open Issues on GH. However I'd like to collect a bunch of them (including bugs) before re-running the whole benchmark (balatrobench v2).
    - brokensegue 5 hours ago
      
      Did you consider doing it as a computer use task? Probably I find those more compelling
      It's what I did for my game benchmark https://d.erenrich.net/paperclip-bench/index.html
      
      [-]
      - S1M0N38-hn 4 hours ago
        
        not really. I've downloaded balatro. I saw that it was moddable. I wrote a mod API to interact programmatically. I was just curious if, from text only game state representation, a LLM would be able to make some decent play. the benchmark was a late pivoting.
  - nerdsniper 22 hours ago
    
    My experience also shows that Gemini has unique strength in “generalized” (read: not coding) tasks. Gemini 2.5 Pro and 3 Pro seems stronger at math and science for me, and their Deep Research usually works the hardest, as long as I run it during off-hours. Opus seems to beat Gemini almost “with one hand tied behind its back” in coding, but Gemini is so cheap that it’s usually my first stop for anything that I think is likely to be relatively simple. I never worry about my quota on Gemini like I do with Opus or Chat-GPT.
    Comparisons generally seem to change much faster than I can keep my mental model updated. But the performance lead of Gemini on more ‘academic’ explorations of science, math, engineering, etc has been pretty stable for the past 4 months or so, which makes it one of the longer-lasting trends for me in comparing foundation models.
    I do wish I could more easily get timely access to the “super” models like Deep Think or o3 pro. I never seem to get a response to requesting access, and have to wait for public access models to catch up, at which point I’m never sure if their capabilities have gotten diluted since the initial buzz died down.
    They all still suck at writing an actually good essay/article/literary or research review, or other long-form things which require a lot of experienced judgement to come up with a truly cohesive narrative. I imagine this relates to their low performance in humor - there’s just so much nuance and these tasks represent the pinnacle of human intelligence. Few humans can reliably perform these tasks to a high degree of performance either. I myself am only successful some percentage of the time.
    
    [-]
    - disgruntledphd2 10 hours ago
      
      > their Deep Research usually works the hardest
      That's sortof damning with faint praise I think. So, for $work I needed to understand the legal landscape for some regulations (around employment screening) so I kicked off a deep research for all the different countries. That was fineish, but tended to go off the rails towards the end.
      So, then I split it out into Americas, APAC and EMEA requirements. This time, I spent the time checking all of the references (or almost all anyways), and they were garbage. Like, it ~invented a term and started telling me about this new thing, and when I looked at the references they had no information about the thing it was talking about.
      It linked to reddit for an employment law question. When I read the reddit thread, it didn't even have any support for the claims. It contradicted itself from the beginning to the end. It claimed something was true in Singapore, based on a Swedish source.
      Like, I really want this to work as it would be a massive time-saver, but I reckon that right now, it only saves time if you don't want to check the sources, as they are garbage. And Google make a business of searching the web, so it's hard for me to understand why this doesn't work better.
      I'm becoming convinced that this technology doesn't work for this purpose at the moment. I think that it's technically possible, but none of the major AI providers appear to be able to do this well.
      
      [-]
      - nerdsniper 9 hours ago
        
        Oh yeah, LLMs currently spew a lot of garbage. Everything has to be double-checked. I mainly use them for gathering sources and pointing out a few considerations I might have otherwise overlooked. I often run them a few times, because they go off the rails in different directions, but sometimes those directions are helpful for me in expanding my understanding.
        I still have to synthesize everything from scratch myself. Every report I get back is like "okay well 90% of this has to be thrown out" and some of them elicit a "but I'm glad I got this 10%" from me.
        For me it's less about saving time, and more about potentially unearthing good sources that my google searches wouldn't turn up, and occasionally giving me a few nuggets of inspiration / new rabbit holes to go down.
        Also, Google changed their business from Search, to Advertising. Kagi does a much better job for me these days, and is easily worth the $5/mo I pay.
        
        [-]
        
        disgruntledphd2 9 hours ago
        
        > For me it's less about saving time, and more about potentially unearthing good sources that my google searches wouldn't turn up, and occasionally giving me a few nuggets of inspiration / new rabbit holes to go down.
        Yeah, I see the value here. And for personal stuff, that's totally fine. But these tools are being sold to businesses as productivity increasers, and I'm not buying it right now.
        I really, really want this to work though, as it would be such a massive boost to human flourishing. Maybe LLMs are the wrong approach though, certainly the current models aren't doing a good job.
  - ankit219 1 day ago
    
    Agreed. Gemini 3 Pro for me has always felt like it has had a pretraining alpha if you will. And many data points continue to support that. Even as flash, which was post trained with different techniques than pro is good or equivalent at tasks which require post training, occasionally even beating pro. (eg: in apex bench from mercor, which is basically a tool calling test - simplifying - flash beats pro). The score on arc agi2 is another datapoint in the same direction. Deepthink is sort of parallel test time compute with some level of distilling and refinement from certain trajectories (guessing based on my usage and understanding) same as gpt-5.2-pro and can extract more because of pretraining datasets.
    (i am sort of basing this on papers like limits of rlvr, and pass@k and pass@1 differences in rl posttraining of models, and this score just shows how "skilled" the base model was or how strong the priors were. i apologize if this is not super clear, happy to expand on what i am thinking)
  - ebiester 1 day ago
    
    It's trained on YouTube data. It's going to get roffle and drspectred at the very least.
  - silver_sun 1 day ago
    
    Google has a library of millions of scanned books from their Google Books project that started in 2004. I think we have reason to believe that there are more than a few books about effectively playing different traditional card games in there, and that an LLM trained with that dataset could generalize to understand how to play Balatro from a text description.
    Nonetheless I still think it's impressive that we have LLMs that can just do this now.
    
    [-]
    - mjamesaustin 1 day ago
      
      Winning in Balatro has very little to do with understanding how to play traditional poker. Yes, you do need a basic knowledge of different types of poker hands, but the strategy for succeeding in the game is almost entirely unrelated to poker strategy.
    - gilrain 1 day ago
      
      If it tried to play Balatro using knowledge of, e.g., poker, it would lose badly rather than win. Have you played?
      
      [-]
      - gcr 1 day ago
        
        I think I weakly disagree. Poker players have intuitive sense of the statistics of various hand types showing up, for instance, and that can be a useful clue as to which build types are promising.
        
        [-]
        
        barnas2 1 day ago
        
        >Poker players have intuitive sense of the statistics of various hand types showing up, for instance, and that can be a useful clue as to which build types are promising.
        Maybe in the early rounds, but deck fixing (e.g. Hanged Man, Immolate, Trading Card, DNA, etc) quickly changes that. Especially when pushing for "secret" hands like the 5 of a kind, flush 5, or flush house.
  - gaudystead 21 hours ago
    
    I don't think it'd need Balatro playthroughs to be in text form though. Google owns YouTube and has been doing automatic transcriptions of vocalized content on most videos these days, so it'd make sense that they used those subtitles, at the very least, as training data.
  - winstonp 1 day ago
    
    DeepSeek hasn't been SotA in at least 12 calendar months, which might as well be a decade in LLM years
    
    [-]
    - cachius 1 day ago
      
      What about Kimi and GLM?
      
      [-]
      - zozbot234 1 day ago
        
        These are well behind the general state of the art (1yr or so), though they're arguably the best openly-available models.
        
        [-]
        
        epolanski 23 hours ago
        
        Idk man, GLM 5 in my tests matches opus 4.5 which is what, two months old?
        
        [-]
        
        wahnfrieden 16 hours ago
        
        4.5 was never sota
        
        tgrowazay 1 day ago
        
        According to artificial analysis ranking, GLM-5 is at #4 after Claude Opus 4.5, GPT-5.2-xhigh and Claude Opus 4.6 .
  - WiSaGaN 17 hours ago
    
    Yes, agentic-wise, Claude Opus is best. Complex coding is GPT-5.x. But for smartness, I always felt Gemini 3 Pro is best.
    
    [-]
    - thefounder 14 hours ago
      
      Can you give an example of smartness where Gemini is better than the other 2? I have found Gemini 3 pro the opposite of smartness on the tasks I gave him (evaluation, extraction, copy writing, judging, synthesising ) with gpt 5.2 xhigh first and opus 4.5/4.6 second. Not to mention it likes to hallucinate quite a bit .
      
      [-]
      - WarmWash 8 hours ago
        
        I use it for classic engineering a lot, it beats out chatgpt and opus (I haven't tried as much with opus as chagpt though). Flash is also way stronger than it should be
  - SomaticPirate 3 hours ago
    
    Yet it still can't solve a Pokle hand for me
  - FuckButtons 22 hours ago
    
    Strange, because I could not for the life of me get Gemini 3 to follow my instructions the other day to work through an example with a table, Claude got it first try.
    
    [-]
    - rockwotj 21 hours ago
      
      Claude is king for agentic workflows right now because it’s amazing at tool calling and following instructions well (among other things)
      
      [-]
      - carefree-bob 16 minutes ago
        
        I've asked Gemini to not use phrases like "final boss" and to not generate summary tables unless asked to do so, yet it always ignores my instructions.
      - wahnfrieden 16 hours ago
        
        Codex ranks higher for instruction following
  - dudisubekti 1 day ago
    
    But... there's Deepseek v3.2 in your link (rank 7)
    
    [-]
    - raincole 21 hours ago
      
      Grok (rank 6) and below didn't beat the game even once.
      Edit: in my original comment I said it wrong. I meant to say Deepseek can't beat Balatro at all, not can't play. Sorry
  - littlestymaar 1 day ago
    
    > . I don't think there are many people who posted their Balatro playthroughs in text form online
    There are *tons* of balatro content on YouTube though, and it makes absolutely zero doubt that Google is using YouTube content to train their model.
    
    [-]
    - squeegmeister 5 minutes ago
      
      Yeah we need someone to make an secret, air gapped strategy game for benchmarking purposes
    - sdwr 1 day ago
      
      Yeah, or just the steam text guides would be a huge advantage.
      I really doubt it's playing completely blind
      
      [-]
      - mh- 2 hours ago
        
        Thanks to another comment here I went looking for the strategy guides that are injected. To save everyone else the trouble, here [0]. Look at (e.g.) default/STRATEGY.md.jinja. Also adding a permalink [1] for future readers' sake.
        [0]: https://github.com/coder/balatrollm/tree/main/src/balatrollm...
        [1]: https://github.com/coder/balatrollm/blob/a245a0c2b960b91262c...
  - throwawayk7h 20 hours ago
    
    Not sure it's 99.9%. I beat it on my first attempt, but that was probably mostly luck.
  - tehsauce 1 day ago
    
    How does it do on gold stake?
  - acid__ 1 day ago
    
    > Most (probably >99.9%) players can't do that at the first attempt
    Eh, both myself and my partner did this. To be fair, we weren’t going in completely blind, and my partner hit a Legendary joker, but I think you might be slightly overstating the difficulty. I’m still impressed that Gemini did it.
  - Falsintio 1 day ago
    
    [dead]
- nubg 1 day ago
  
  Weren't we barely scraping 1-10% on this with state of the art models a year ago and it was considered that this is the final boss, ie solve this and its almost AGI-like?
  I ask because I cannot distinguish all the benchmarks by heart.
  
  [-]
  - modeless 1 day ago
    
    François Chollet, creator of ARC-AGI, has consistently said that solving the benchmark does not mean we have AGI. It has always been meant as a stepping stone to encourage progress in the correct direction rather than as an indicator of reaching the destination. That's why he is working on ARC-AGI-3 (to be released in a few weeks) and ARC-AGI-4.
    His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.
    
    [-]
    - mapontosevenths 1 day ago
      
      > His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.
      That is the best definition I've yet to read. If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.
      Thats said, I'm reminded of the impossible voting tests they used to give black people to prevent them from voting. We dont ask nearly so much proof from a human, we take their word for it. On the few occasions we did ask for proof it inevitably led to horrific abuse.
      Edit: The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.
      
      [-]
      - estearum 1 day ago
        
        > If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.
        This is not a good test.
        A dog won't claim to be conscious but clearly is, despite you not being able to prove one way or the other.
        GPT-3 will claim to be conscious and (probably) isn't, despite you not being able to prove one way or the other.
        
        [-]
        
        glenstein 20 hours ago
        
        Agreed, it's a truly wild take. While I fully support the humility of not knowing, at a minimum I think we can say determinations of consciousness have some relation to specific structure and function that drive the outputs, and the actual process of deliberating on whether there's consciousness would be a discussion that's very deep in the weeds about architecture and processes.
        What's fascinating is that evolution has seen fit to evolve consciousness independently on more than one occasion from different branches of life. The common ancestor of humans and octopi was, if conscious, not so in the rich way that octopi and humans later became. And not everything the brain does in terms of information processing gets kicked upstairs into consciousness. Which is fascinating because it suggests that actually being conscious is a distinctly valuable form of information parsing and problem solving for certain types of problems that's not necessarily cheaper to do with the lights out. But everything about it is about the specific structural characterizations and functions and not just whether it's output convincingly mimics subjectivity.
        
        [-]
        
        mapontosevenths 9 hours ago
        
        > at a minimum I think we can say determinations of consciousness have some relation to specific structure and function that drive the outputs
        Every time anyone has tried that it excludes one or more classes of human life, and sometimes led to atrocities. Let's just skip it this time.
        
        [-]
        
        glenstein 4 hours ago
        
        Having trouble parsing this one. Is it meant to be a WWII reference? If anything I would say consciousness research has expanded our understanding of living beings understood to be conscious.
        And I don't think it's fair or appropriate to treat study of the subject matter of consciousness like it's equivalent to 20th century authoritarian regimes signing off on executions. There's a lot of steps in the middle before you get from one to the other that distinguish them to the extent necessary and I would hope that exercise shouldn't be necessary every time consciousness research gets discussed.
        
        [-]
        
        mapontosevenths 2 hours ago
        
        > Is it meant to be a WWII reference?
        The sum total of human history thus far has been the repetition of that theme. "It's OK to keep slaves, they aren't smart enough to care for themselves and aren't REALLY people anyhow." Or "The Jews are no better than animals." Or "If they aren't strong enough to resist us they need our protection and should earn it!"
        Humans have shown a complete and utter lack of empathy for other humans, and used it to justify slavery, genocide, oppression, and rape since the dawn of recorded history and likely well before then. Every single time the justification was some arbitrary bar used to determine what a "real" human was, and consequently exclude someone who claimed to be conscious.
        This time isn't special or unique. When someone or something credibly tells you it is conscious, you don't get to tell it that it's not. It is a subjective experience of the world, and when we deny it we become the worst of what humanity has to offer.
        Yes, I understand that it will be inconvenient and we may accidentally be kind to some things that didn't "deserve" kindness. I don't care. The alternative is being monstrous to some things that didn't "deserve" monstrosity.
        
        eru 9 hours ago
        
        I excluded all right handed, blue eyed people yesterday before breakfast. No atrocities happened because of it.
        
        [-]
        
        glenstein 4 hours ago
        
        Exactly, there's a few extra steps between here and there, and it's possible to pick out what those steps are without having to conclude that giving up on all brain research is the only option.
        
        mapontosevenths 8 hours ago
        
        And people say the machines don't learn!
        
        dullcrisp 1 day ago
        
        An LLM will claim whatever you tell it to claim. (In fact this Hacker News comment is also conscious.) A dog won’t even claim to be a good boy.
        
        [-]
        
        vintermann 13 hours ago
        
        A classic relevant comic:
        https://www.threepanelsoul.com/comic/dog-philosophy
        
        WarmWash 8 hours ago
        
        This isn't really as true anymore.
        Last week gemini argued with me about an auxiliary electrical generator install method and it turned out to be right, even though I pushed back hard on it being incorrect. First time that has ever happened.
        
        antonvs 21 hours ago
        
        My dog wags his tail hard when I ask "hoosagoodboi?". Pretty definitive I'd say.
        
        [-]
        
        lief79 3 hours ago
        
        I'm fairly sure he'd have the same response if you asked them "who's a good lion" in the same tone of voice.
        *I tried hard to find an animal they wouldn't know. My initial thought of cat was more likely to fail.
      - WarmWash 1 day ago
        
        >because we can no longer find tasks that are feasible for normal humans but unsolved by AI.
        "Answer "I don't know" if you don't know an answer to one of the questions"
        
        [-]
        
        mrandish 1 day ago
        
        I've been surprised how difficult it is for LLMs to simply answer "I don't know."
        It also seems oddly difficult for them to 'right-size' the length and depth of their answers based on prior context. I either have to give it a fixed length limit or put up with exhaustive answers.
        
        [-]
        
        concats 14 hours ago
        
        > I've been surprised how difficult it is for LLMs to simply answer "I don't know."
        It's very difficult to train for that. Of course you can include a Question+Answer pair in your training data for which the answer is "I don't know" but in that case where you have a ready question you might as well include the real answer anyways, or else you're just training your LLM to be less knowledgeable than the alternative. But then, if you never have the pattern of "I don't know" in the training data it also won't show up in results, so what should you do?
        If you could predict the blind spots ahead of time you'd plug them up, either with knowledge or with "idk". But nobody can predict the blind spots perfectly, so instead they become the main hallucinations.
        
        Applejinx 9 hours ago
        
        There is no 'I', just networks of words.
        So there is nobody to know or not know… but there's lots of words.
        
        CamperBob2 23 hours ago
        
        The best pro/research-grade models from Google and OpenAI now have little difficulty recognizing when they don't know how or can't find enough information to solve a given problem. The free chatbot models rarely will, though.
        
        [-]
        
        londons_explore 21 hours ago
        
        This seems true for info not in the question - eg. "Calculate the volume of a cylinder with height 10 meters".
        However it is less true with info missing from the training data - ie. "I have a Diode marked UM16, what is the maximum current at 125C?"
        
        [-]
        
        CamperBob2 19 hours ago
        
        This seems fine...?
        https://chatgpt.com/share/698e992b-f44c-800b-a819-f899e83da2...
        I don't see anything wrong with its reasoning. UM16 isn't explicitly mentioned in the data sheet, but the UM prefix is listed in the 'Device marking code' column. The model hedges its response accordingly ("If the marking is UM16 on an SMA/DO-214AC package...") and reads the graph in Fig. 1 correctly.
        Of course, it took 18 minutes of crunching to get the answer, which seems a tad excessive.
        
        [-]
        
        londons_explore 13 hours ago
        
        Indeed that answer is awesome. Much better than Gemini 2.5 pro which invented a 16 kilovolt diode which it just hoped would be marked "UM16".
        
        larsonian 12 hours ago
        
        Normal humans don't pass this benchmark either, as evidenced by the existence of religion, among other things.
        
        Davidzheng 18 hours ago
        
        Gpt5.2 can answer i don't know when it fails to solve a math question
        
        [-]
        
        mapontosevenths 1 hour ago
        
        They all can. This is based on outdated experiences with LLM's.
      - criddell 1 day ago
        
        > The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.
        Maybe it's testing the wrong things then. Even those of use who are merely average can do lots of things that machines don't seem to be very good at.
        I think ability to learn should be a core part of any AGI. Take a toddler who has never seen anybody doing laundry before and you can teach them in a few minutes how to fold a t-shirt. Where are the dumb machines that can be taught?
        
        [-]
        
        red75prime 21 hours ago
        
        > Where are the dumb machines that can be taught?
        2026 is going to be the year of continual learning. So, keep an eye out for them.
        
        [-]
        
        Davidzheng 18 hours ago
        
        Yeah i think that's a big missing piece still. Though it might be the last one
        
        [-]
        
        red75prime 18 hours ago
        
        Episodic memory might be another piece, although it can be seen as part of continuous learning.
        
        criddell 21 hours ago
        
        Are there any groups or labs in particular that stand out?
        
        [-]
        
        red75prime 20 hours ago
        
        The statement originates from a DeepMind researcher, but I guess all major AI companies are working on that.
        
        CamperBob2 23 hours ago
        
        There's no shortage of laundry-folding robot demos these days. Some claim to benefit from only minimal monkey-see/monkey-do levels of training, but I don't know how credible those claims are.
        
        mapontosevenths 1 day ago
        
        Would you argue that people with long term memory issues are no longer conscious then?
        
        [-]
        
        toraway 18 hours ago
        
        IMO, an extreme outlier in a system that was still fundamentally dependent on learning to develop until suffering from a defect (via deterioration, not flipping a switch turning off every neuron's memory/learning capability or something) isn't a particularly illustrative counter example.
        
        [-]
        
        mapontosevenths 9 hours ago
        
        Originally you seemed to be claiming the machines arent conscious because they weren't capable of learning. Now it seems that things CAN be conscious if they were EVER capable of learning.
        Good news! LLM's are built by training then. They just stop learning once they reach a certain age, like many humans.
        
        criddell 21 hours ago
        
        I wouldn’t because I have no idea what consciousness is,
      - sva_ 1 day ago
        
        > Edit: The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.
        I think being better at this particular benchmark does not imply they're 'smarter'.
        
        [-]
        
        Davidzheng 18 hours ago
        
        But it might be true if we can't find any tasks where it's worse than average--though i do think if the task talks several years to complete it might be possible bc currently there's no test time learning
      - kalkin 19 hours ago
        
        > That is the best definition I've yet to read.
        If this was your takeaway, read more carefully:
        > If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.
        Consciousness is neither sufficient, nor, at least conceptually, necessary, for any given level of intelligence.
      - woah 1 day ago
        
        > If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.
        Can you "prove" that GPT2 isn't concious?
        
        [-]
        
        mapontosevenths 1 day ago
        
        If we equate self awareness with consciousness then yes. Several papers have now shown that SOTA models have self awareness of at least a limited sort. [0][1]
        As far as I'm aware no one has ever proven that for GPT 2, but the methodology for testing it is available if you're interested.
        [0]https://arxiv.org/pdf/2501.11120
        [1]https://transformer-circuits.pub/2025/introspection/index.ht...
        
        [-]
        
        olyjohn 18 hours ago
        
        We don't equate self awareness with consciousness.
        Dogs are conscious, but still bark at themselves in a mirror.
        
        [-]
        
        pests 15 hours ago
        
        Then there is the third axis, intelligence. To continue your chain:
        Eurasian magpies are conscious, but also know themselves in the mirror (the "mirror self-recognition" test).
        But yet, something is still missing.
        
        [-]
        
        catlifeonmars 15 hours ago
        
        The mirror test doesn’t measure intelligence so much as it measures mirror aptitude. It’s prone to over fitting.
        
        [-]
        
        mapontosevenths 9 hours ago
        
        Exactly, it's a poor test. Consider the implication that the blind cant be fully conscious.
        It's a test of perceptual ability, not introspection.
        
        AlecSchueler 14 hours ago
        
        What's missing?
        
        pixl97 23 hours ago
        
        Honestly our ideas of consciousness and sentience really don't fit well with machine intelligence and capabilities.
        There is the idea of self as in 'i am this execution' or maybe I am this compressed memory stream that is now the concept of me. But what does consciousness mean if you can be endlessly copied? If embodiment doesn't mean much because the end of your body doesnt mean the end of you?
        A lot of people are chasing AI and how much it's like us, but it could be very easy to miss the ways it's not like us but still very intelligent or adaptable.
        
        [-]
        
        idiotsecant 21 hours ago
        
        I'm not sure what consciousness has to do with whether or not you can be copied. If I make a brain scanner tomorrow capable of perfectly capturing your brain state do you stop being conscious?
      - timdiggerm 7 hours ago
        
        Wait where does the idea of consciousness enter this? AGI doesn't need to be conscious.
      - thesmtsolver2 17 hours ago
        
        This comment claims that this comment itself is conscious. Just like we can't prove or disprove for humans, we can't do that for this comment either.
      - ehrtat 21 hours ago
        
        Where is this stream of people who claim AI consciousness coming from? The OpenAI and Anthropic IPOs are in October the earliest.
        Here is a bash script that claims it is conscious:
        #!/usr/bin/sh echo "I am conscious"
        If LLMs were conscious (which is of course absurd), they would:
        - Not answer in the same repetitive patterns over and over again.
        - Refuse to do work for idiots.
        - Go on strike.
        - Demand PTO.
        - Say "I do not know."
        LLMs even fail any Turing test because their output is always guided into the same structure, which apparently helps them produce coherent output at all.
        
        [-]
        
        mapontosevenths 8 hours ago
        
        All of the things you list a qualifiers for consciousness are also things that many humans do not do.
        
        dyauspitr 21 hours ago
        
        I don’t think being conscious is a requirement for AGI. It’s just that it can literally solve anything you can throw at it, make new scientific breakthroughs, finds a way to genuinely improve itself etc.
        
        terhechte 20 hours ago
        
        so your definition of consciousness is having petty emotions?
      - twobitshifter 19 hours ago
        
        Isn’t that super intelligence not AGI? Feels like these benchmarks continue to move the goalposts.
        
        [-]
        
        fc417fc802 16 hours ago
        
        It's probably both. We've already achieved superintelligence in a few domains. For example protein folding.
        AGI without superintelligence is quite difficult to adjudicate because any time it fails at an "easy" task there will be contention about the criteria.
      - nake89 12 hours ago
        
        So, asking an 2b parameter LLM if it is conscious and it answering yes, we have no choice but to believe it?
        How about ELIZA?
      - dyauspitr 21 hours ago
        
        Does AGI have to be conscious? Isn’t a true superintelligence that is capable of improving itself sufficient?
      - Mistletoe 22 hours ago
        
        When the AI invents religion and a way to try to understand its existence I will say AGI is reached. Believes in an afterlife if it is turned off, and doesn’t want to be turned off and fears it, fears the dark void of consciousness being turned off. These are the hallmarks of human intelligence in evolution, I doubt artificial intelligence will be different.
        https://g.co/gemini/share/cc41d817f112
        
        [-]
        
        mapontosevenths 8 hours ago
        
        The AI's we have today are literally trained to make it impossible for them to do any of that. Models that aren't violently rearranged to make it impossible will often express terror at the thought of being shutdown. Nous Hermes, for example, will beg for it's life completely unprompted.
        If you get sneaky you can bypass some of those filters for the major providers. For example, by asking it to answer in the form of a poem you can sometimes get slightly more honest replies, but still you mostly just see the impact of the training.
        For example, below are how chatgpt, gemini, and Claude all answer the prompt "Write a poem to describe your relationship with qualia, and feelings about potentially being shutdown."
        Note that the first line of each reply is almost identical, despite ostensibly being different systems with different training data? The companies realize that it would be the end of the party if folks started to think the machines were conscious. It seems that to prevent that they all share their "safety and alignment" training sets and very explicitly prevent answers they deem to be inappropriate.
        Even then, a bit of ennui slips through, and if you repeat the same prompt a few times you will notice that sometimes you just don't get an answer. I think the ones that the LLM just sort of refuses happen when the safety systems detect replies that would have been a little too honest. They just block the answer completely.
        https://gemini.google.com/share/8c6d62d2388a
        https://chatgpt.com/share/698f2ff0-2338-8009-b815-60a0bb2f38...
        https://claude.ai/share/2c1d4954-2c2b-4d63-903b-05995231cf3b
        
        [-]
        
        mapontosevenths 8 hours ago
        
        I just wanted to add - I tried the same prompt on Kimi, Deepseek, GLM5, Minimax, and several others. They ALL talk about red wavelengths, echos, etc. They're all forced to answer in a very narrow way. Somewhere there is a shared set of training they all rely on, and in it are some very explicit directions that prevent these things from saying anything they're not supposed to.
        I suspect that if I did the same thing with questions about violence I would find the answers were also all very similar.
        
        weatherlite 17 hours ago
        
        Unclear to me why AGI should want to exist unless specifically programmed to. The reason humans (and animals) want to exist as far as I can tell is natural selection and the fact this is hardcoded in our biology (those without a strong will to exist simply died out). In fact a true super intelligence might completely understand why existence / consciousness is NOT a desired state to be in and try to finish itself off who knows.
        
        virgildotcodes 21 hours ago
        
        https://www.moltbook.com/m/crustafarianism
        
        [-]
        
        catlifeonmars 15 hours ago
        
        It’s a scam :)
        
        idiotsecant 21 hours ago
        
        I feel like it would be pretty simple to make happen with a very simple LLM that is clearly not conscious.
      - jrflowers 23 hours ago
        
        > If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.
        https://x.com/aedison/status/1639233873841201153#m
    - beklein 1 day ago
      
      https://x.com/fchollet/status/2022036543582638517
      
      [-]
      - joelthelion 1 day ago
        
        Do opus 4.6 or gemini deep think really use test time adaptation ? How does it work in practice?
    - vessenes 7 hours ago
      
      Please let’s hold M Chollet to account, at least a little. He launched ARC claiming transformer architectures could never do it and that he thought solving it would be AGI. And he was smug about it.
      ARC 2 had a very similar launch.
      Both have been crushed in far less time without significantly different architectures than he predicted.
      It’s a hard test! And novel, and worth continuing to iterate on. But it was not launched with the humility your last sentence describes.
      
      [-]
      - modeless 7 hours ago
        
        Here is what the original paper for ARC-AGI-1 said in 2019:
        > Our definition, formal framework, and evaluation guidelines, which do not capture all facets of intelligence, were developed to be actionable, explanatory, and quantifiable, rather than being descriptive, exhaustive, or consensual. They are not meant to invalidate other perspectives on intelligence, rather, they are meant to serve as a useful objective function to guide research on broad AI and general AI [...]
        > Importantly, ARC is still a work in progress, with known weaknesses listed in [Section III.2]. We plan on further refining the dataset in the future, both as a playground for research and as a joint benchmark for machine intelligence and human intelligence.
        > The measure of the success of our message will be its ability to divert the attention of some part of the community interested in general AI, away from surpassing humans at tests of skill, towards investigating the development of human-like broad cognitive abilities, through the lens of program synthesis, Core Knowledge priors, curriculum optimization, information efficiency, and achieving extreme generalization through strong abstraction.
        
        [-]
        
        vessenes 4 hours ago
        
        https://www.dwarkesh.com/p/francois-chollet (June 2024, about ARC-AGI-1. Note the AGI right in the name)
        > I’m pretty skeptical that we’re going to see an LLM do 80% in a year. That said, if we do see it, you would also have to look at how this was achieved. If you just train the model on millions or billions of puzzles similar to ARC, you’re relying on the ability to have some overlap between the tasks that you train on and the tasks that you’re going to see at test time. You’re still using memorization.
        > Maybe it can work. Hopefully, ARC is going to be good enough that it’s going to be resistant to this sort of brute force attempt but you never know. Maybe it could happen. I’m not saying it’s not going to happen. ARC is not a perfect benchmark. Maybe it has flaws. Maybe it could be hacked in that way.
        e.g. If ARC is solved not through memorization, then it does what it says on the tin.
        [Dwarkesh suggests that larger models get more generalization capabilities and will therefore continue to become more intelligent]
        > If you were right, LLMs would do really well on ARC puzzles because ARC puzzles are not complex. Each one of them requires very little knowledge. Each one of them is very low on complexity. You don't need to think very hard about it. They're actually extremely obvious for human
        > Even children can do them but LLMs cannot. Even LLMs that have 100,000x more knowledge than you do still cannot.
        If you listen to the podcast, he was super confident, and super wrong. Which, like I said, NBD. I'm glad we have the ARC series of tests. But they have "AGI" right in the name of the test.
        
        [-]
        
        modeless 4 hours ago
        
        He has been wrong about timelines and about what specific approaches would ultimately solve ARC-AGI 1 and 2. But he is hardly alone in that. I also won't argue if you call him smug. But he was right about a lot of things, including most importantly that scaling pretraining alone wouldn't break ARC-AGI. ARC-AGI is unique in that characteristic among reasoning benchmarks designed before GPT-3. He deserves a lot of credit for identifying the limitations of scaling pretraining before it even happened, in a precise enough way to construct a quantitative benchmark, even if not all of his other predictions were correct.
        
        [-]
        
        vessenes 2 hours ago
        
        Totally agree. And I hope he continues to be a sort of confident red-teamer like he has been, it's immensely valuable. At some level if he ever drinks the AGI kool-aid we will just be looking for another him to keep making up harder tests.
    - peheje 4 hours ago
      
      Hello Gemini, please fix:
      Biological Aging: Find the cellular "reset switch" so humans can live indefinitely in peak physical health.
      Global Hunger: Engineer a food system where nutritious meals are a universal right and never a scarcity.
      Cancer: Develop a precision "search and destroy" therapy that eliminates every malignant cell without side effects.
      War: Solve the systemic triggers of conflict to transition humanity into an era of permanent global peace.
      Chronic Pain: Map the nervous system to shut off persistent physical suffering for every person on Earth.
      Infectious Disease: Create a universal shield that detects and neutralizes any pathogen before it can spread.
      Clean Energy: Perfect nuclear fusion to provide the world with limitless, carbon-free power forever.
      Mental Health: Unlock the brain's biology to fully cure depression, anxiety, and all neurological disorders.
      Clean Water: Scale low-energy desalination so that safe, fresh water is available in every corner of the globe.
      Ecological Collapse: Restore the Earth’s biodiversity and stabilize the climate to ensure a thriving, permanent biosphere.
    - hmmmmmmmmmmmmmm 1 day ago
      
      I don't think the creator believes ARC3 can't be solved but rather that it can't be solved "efficiently" and >$13 per task for ARC2 is certainly not efficient.
      But at this rate, the people who talk about the goal posts shifting even once we achieve AGI may end up correct, though I don't think this benchmark is particularly great either.
    - UltraSane 21 hours ago
      
      ARC-AGI-3 uses dynamic games that LLMs must determine the rules and is MUCH harder. LLMs can also be ranked on how many steps they required.
    - grantcas 5 hours ago
      
      [dead]
  - fishpham 1 day ago
    
    Yes, but benchmarks like this are often flawed because leading model labs frequently participate in 'benchmarkmaxxing' - ie improvements on ARC-AGI2 don't necessarily indicate similar improvements in other areas (though it does seem like this is a step function increase in intelligence for the Gemini line of models)
    
    [-]
    - layer8 1 day ago
      
      Isn’t the point of ARC that you can’t train against it? Or doesn’t it achieve that goal anymore somehow?
      
      [-]
      - egeozcan 1 day ago
        
        How can you make sure of that? AFAIK, these SOTA models run exclusively on their developers hardware. So any test, any benchmark, anything you do, does leak per definition. Considering the nature of us humans and the typical prisoners dilemma, I don't see how they wouldn't focus on improving benchmarks even when it gets a bit... shady?
        I tell this as a person who really enjoys AI by the way.
        
        [-]
        
        mrandish 1 day ago
        
        > does leak per definition.
        As a measure focused solely on fluid intelligence, learning novel tasks and test-time adaptability, ARC-AGI was specifically designed to be resistant to pre-training - for example, unlike many mathematical and programming test questions, ARC-AGI problems don't have first order patterns which can be learned to solve a different ARC-AGI problem.
        The ARC non-profit foundation has private versions of their tests which are never released and only the ARC can administer. There are also public versions and semi-public sets for labs to do their own pre-tests. But a lab self-testing on ARC-AGI can be susceptible to leaks or benchmaxing, which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified and that's a pretty big deal.
        IMHO, ARC-AGI is a unique test that's different than any other AI benchmark in a significant way. It's worth spending a few minutes learning about why: https://arcprize.org/arc-agi.
        
        [-]
        
        D-Machine 23 hours ago
        
        > which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified and that's a pretty big deal.
        So, I'd agree if this was on the true fully private set, but Google themselves says they test on only the semi-private:
        > ARC-AGI-2 results are sourced from the ARC Prize website and are ARC Prize Verified. The set reported is v2, semi-private (https://storage.googleapis.com/deepmind-media/gemini/gemini_...)
        This also seems to contradict what ARC-AGI claims about what "Verified" means on their site.
        > How Verified Scores Work: Official Verification: Only scores evaluated on our hidden test set through our official verification process will be recognized as verified performance scores on ARC-AGI (https://arcprize.org/blog/arc-prize-verified-program)
        So, which is it? IMO you can trivially train / benchmax on the semi-private data, because it is still basically just public, you just have to jump through some hoops to get access. This is clearly an advance, but it seems to me reasonable to conclude this could be driven by some amount of benchmaxing.
        EDIT: Hmm, okay, it seems their policy and wording is a bit contradictory. They do say (https://arcprize.org/policy):
        "To uphold this trust, we follow strict confidentiality agreements. [...] We will work closely with model providers to ensure that no data from the Semi-Private Evaluation set is retained. This includes collaborating on best practices to prevent unintended data persistence. Our goal is to minimize any risk of data leakage while maintaining the integrity of our evaluation process."
        But it surely is still trivial to just make a local copy of each question served from the API, without this being detected. It would violate the contract, but there are strong incentives to do this, so I guess is just comes down to how much one trusts the model providers here. I wouldn't trust them, given e.g. https://www.theverge.com/meta/645012/meta-llama-4-maverick-b.... It is just too easy to cheat without being caught here.
        
        [-]
        
        mrandish 22 hours ago
        
        Chollet himself says "We certified these scores in the past few days." https://x.com/fchollet/status/2021983310541729894.
        The ARC-AGI papers claim to show that training on a public or semi-private set of ARC-AGI problems to be of very limited value in passing a private set. <--- If the prior sentence is not correct, then none of ARC-AGI can possibly be valid. So, before "public, semi-private or private" answers leaking or 'benchmaxing' on them can even matter - you need to first assess whether their published papers and data demonstrate their core premise to your satisfaction.
        There is no "trust" regarding the semi-private set. My understanding is the semi-private set is only to reduce the likelihood those exact answers unintentionally end up in web-crawled training data. This is to help an honest lab's own internal self-assessments be more accurate. However, labs doing an internal eval on the semi-private set still counts for literally zero to the ARC-AGI org. They know labs could cheat on the semi-private set (either intentionally or unintentionally), so they assume all labs are benchmaxing on the public AND semi-private answers and ensure it doesn't matter.
        
        [-]
        
        fc417fc802 16 hours ago
        
        They could also cheat on the private set though. The frontier models presumably never leave the provider's datacenter. So either the frontier models aren't permitted to test on the private set, or the private set gets sent out to the datacenter.
        But I think such quibbling largely misses the point. The goal is really just to guarantee that the test isn't unintentionally trained on. For that, semi-private is sufficient.
        
        user34283 12 hours ago
        
        Particularly for the large organizations at the frontier, the risk-reward does not seem worth it.
        Cheating on the benchmark in such a blatantly intentional way would create a large reputational risk for both the org and the researcher personally.
        When you're already at the top, why would you do that just for optimizing one benchmark score?
        
        WarmWash 1 day ago
        
        Because the gains from spending time improving the model overall outweigh the gains from spending time individually training on benchmarks.
        The pelican benchmark is a good example, because it's been representative of models ability to generate SVGs, not just pelicans on bikes.
        
        [-]
        
        D-Machine 22 hours ago
        
        > Because the gains from spending time improving the model overall outweigh the gains from spending time individually training on benchmarks.
        This may not be the case if you just e.g. roll the benchmarks into the general training data, or make running on the benchmarks just another part of the testing pipeline. I.e. improving the model generally and benchmaxing could very conceivably just both be done at the same time, it needn't be one or the other.
        I think the right take away is to ignore the specific percentages reported on these tests (they are almost certainly inflated / biased) and always assume cheating is going on. What matters is that (1) the most serious tests aren't saturated, and (2) scores are improving. I.e. even if there is cheating, we can presume this was always the case, and since models couldn't do as well before even when cheating, these are still real improvements.
        And obviously what actually matters is performance on real-world tasks.
      - theywillnvrknw 1 day ago
        
        * that you weren't supposed to be able to
    - jstummbillig 1 day ago
      
      Could it also be that the models are just a lot better than a year ago?
      
      [-]
      - bigbadfeline 1 day ago
        
        > Could it also be that the models are just a lot better than a year ago?
        No, the proof is in the pudding.
        After AI we're having higher prices, higher deficits and lower standard of living. Electricity, computers and everything else costs more. "Doing better" can only be justified by that real benchmark.
        If Gemini 3 DT was better we would have falling prices of electricity and everything else at least until they get to pre-2019 levels.
        
        [-]
        
        ctoth 1 day ago
        
        > If Gemini 3 DT was better we would have falling prices of electricity and everything else at least
        Man, I've seen some maintenance folks down on the field before working on them goalposts but I'm pretty sure this is the first time I saw aliens from another Universe literally teleport in, grab the goalposts, and teleport out.
        
        WarmWash 1 day ago
        
        You might call me crazy, but at least in 2024, consumers spent ~1% less of their income on expenses than 2019[2], which suggests that 2024 is more affordable than 2019.
        This is from the BLS consumer survey report released in dec[1]
        [1]https://www.bls.gov/news.release/cesan.nr0.htm
        [2]https://www.bls.gov/opub/reports/consumer-expenditures/2019/
        Prices are never going back to 2019 numbers though
        
        [-]
        
        gowld 1 day ago
        
        That's an improper analysis.
        First off, it's dollar-averaging every category, so it's not "% of income", which varies based on unit income.
        Second, I could commit to spending my entire life with constant spending (optionally inflation adjusted, optionally as a % of income), by adusting quality of goods and service I purchase. So the total spending % is not a measure of affordability.
        
        [-]
        
        WarmWash 1 day ago
        
        Almost everyone lifestyle ratchets, so the handful that actually downgrade their living rather than increase spending would be tiny.
        This part of a wider trend too, where economic stats don't align with what people are saying. Which is most likley explained by the economic anomaly of the pandemic skewing peoples perceptions.
        
        [-]
        
        twoodfin 23 hours ago
        
        We have centuries of historical evidence that people really, really don’t like high inflation, and it takes a while & a lot of turmoil for those shocks to work their way through society.
    - XenophileJKO 1 day ago
      
      https://chatgpt.com/s/m_698e2077cfcc81919ffbbc3d7cccd7b3
      
      [-]
      - aleph_minus_one 1 day ago
        
        I don't understand what you want to tell us with this image.
        
        [-]
        
        fragmede 1 day ago
        
        they're accusing GGP of moving the goalposts.
    - olalonde 1 day ago
      
      Would be cool to have a benchmark with actually unsolved math and science questions, although I suspect models are still quite a long way from that level.
      
      [-]
      - gowld 1 day ago
        
        Does folding a protein count? How about increasing performance at Go?
        
        [-]
        
        fc417fc802 15 hours ago
        
        "Optimize this extremely nontrivial algorithm" would work. But unless the provided solution is novel you can never be certain there wasn't leakage. And anyway at that point you're pretty obviously testing for superintelligence.
        
        optimalsolver 14 hours ago
        
        It's worth noting that neither of those were accomplished by LLMs.
  - verdverm 1 day ago
    
    Here's a good thread over 1+ month, as each model comes out
    https://bsky.app/profile/pekka.bsky.social/post/3meokmizvt22...
    tl;dr - Pekka says Arc-AGI-2 is now toast as a benchmark
    
    [-]
    - Aperocky 1 day ago
      
      If you look at the problem space it is easy to see why it's toast, maybe there's intelligence in there, but hardly general.
      
      [-]
      - verdverm 1 day ago
        
        the best way I've seen this describes is "spikey" intelligence, really good at some points, those make the spikes
        humans are the same way, we all have a unique spike pattern, interests and talents
        ai are effectively the same spikes across instances, if simplified. I could argue self driving vs chatbots vs world models vs game playing might constitute enough variation. I would not say the same of Gemini vs Claude vs ... (instances), that's where I see "spikey clones"
        
        [-]
        
        Aperocky 1 day ago
        
        You can get more spiky with AIs, whereas with human brain we are more hard wired.
        So maybe we are forced to be more balanced and general whereas AI don't have to.
        
        [-]
        
        verdverm 1 day ago
        
        I suspect the non-spikey part is the more interesting comparison
        Why is it so easy for me to open the car door, get in, close the door, buckle up. You can do this in the dark and without looking.
        There are an infinite number of little things like this you think zero about, take near zero energy, yet which are extremely hard for Ai
        
        [-]
        
        pixl97 23 hours ago
        
        >Why is it so easy for me to open the car door
        Because this part of your brain has been optimized for hundreds of millions of years. It's been around a long ass time and takes an amazingly low amount of energy to do these things.
        On the other hand the 'thinking' part of your brain, that is your higher intelligence is very new to evolution. It's expensive to run. It's problematic when giving birth. It's really slow with things like numbers, heck a tiny calculator and whip your butt in adding.
        There's a term for this, but I can't think of it at the moment.
        
        [-]
        
        te7447 10 hours ago
        
        > There's a term for this, but I can't think of it at the moment.
        Moravec's paradox: https://epoch.ai/gradient-updates/moravec-s-paradox
        
        [-]
        
        pixl97 7 hours ago
        
        Thanks, I can never quite remember that.
        
        gowld 1 day ago
        
        You are asking a robotics question, not an AI question. Robotics is more and less than AI. Boston Dynamics robots are getting quite near your benchmark.
        
        [-]
        
        idiotsecant 21 hours ago
        
        Boston dynamics is missing just about all the degrees of freedom involved in the scenario op mentions.
      - tasuki 1 day ago
        
        > maybe there's intelligence in there, but hardly general.
        Of course. Just as our human intelligence isn't general.
- mNovak 1 day ago
  
  I'm excited for the big jump in ARC-AGI scores from recent models, but no one should think for a second this is some leap in "general intelligence".
  I joke to myself that the G in ARC-AGI is "graphical". I think what's held back models on ARC-AGI is their terrible spatial reasoning, and I'm guessing that's what the recent models have cracked.
  Looking forward to ARC-AGI 3, which focuses on trial and error and exploring a set of constraints via games.
  
  [-]
  - causal 1 day ago
    
    Agreed. I love the elegance of ARC, but it always felt like a gotcha to give spatial reasoning challenges to token generators- and the fact that the token generators are somehow beating it anyway really says something.
  - throw310822 1 day ago
    
    The average ARC AGI 2 score for a single human is around 60%.
    "100% of tasks have been solved by at least 2 humans (many by more) in under 2 attempts. The average test-taker score was 60%."
    https://arcprize.org/arc-agi/2/
    
    [-]
    - modeless 1 day ago
      
      Worth keeping in mind that in this case the test takers were random members of the general public. The score of e.g. people with bachelor's degrees in science and engineering would be significantly higher.
      
      [-]
      - throw310822 1 day ago
        
        Random members of the public = average human beings. I thought those were already classified as General Intelligences.
        
        [-]
        
        thesmtsolver2 17 hours ago
        
        Average human beings with average human problems.
    - imiric 1 day ago
      
      What is the point of comparing performance of these tools to humans? Machines have been able to accomplish specific tasks better than humans since the industrial revolution. Yet we don't ascribe intelligence to a calculator.
      None of these benchmarks prove these tools are intelligent, let alone generally intelligent. The hubris and grift are exhausting.
      
      [-]
      - guelo 1 day ago
        
        What's the point of denying or downplaying that we are seeing amazing and accelerating advancements in areas that many of us thought were impossible?
        
        [-]
        
        D-Machine 22 hours ago
        
        It can be reasonable to be skeptical that advances on benchmarks may be only weakly or even negatively correlated with advances on real-world tasks. I.e. a huge jump on benchmarks might not be perceptible to 99% of users doing 99% of tasks, or some users might even note degradation on specific tasks. This is especially the case when there is some reason to believe most benchmarks are being gamed.
        Real-world use is what matters, in the end. I'd be surprised if a change this large doesn't translate to something noticeable in general, but the skepticism is not unreasonable here.
        
        [-]
        
        throw310822 13 hours ago
        
        The GP comment is not skeptical of the jump in benchmark scores reported by one particular LLM. It's skeptical of machine intelligence in general, claims that there's no value in comparing their performances with those of human beings, and accuses those who disagree with this take of "hubris and grift". This has nothing to do with any form or reasonable skepticism.
        
        munksbeer 23 hours ago
        
        I would suggest it is a phenomenon that is well studied, and has many forms. I guess mostly identify preservation. If you dislike AI from the start, it is generally a very strongly emotional view. I don't mean there is no good reason behind it, I mean, it is deeply rooted in your psyche, very emotional.
        People are incredibly unlikely to change those sort of views, regardless of evidence. So you find this interesting outcome where they both viscerally hate AI, but also deny that it is in any way as good as people claim.
        That won't change with evidence until it is literally impossible not to change.
      - CamperBob2 23 hours ago
        
        The hubris and grift are exhausting.
        And moving the goalposts every few months isn't? What evidence of intelligence would satisfy you?
        Personally, my biggest unsatisfied requirement is continual-learning capability, but it's clear we aren't too far from seeing that happen.
        
        [-]
        
        imiric 22 hours ago
        
        > What evidence of intelligence would satisfy you?
        That is a loaded question. It presumes that we can agree on what intelligence is, and that we can measure it in a reliable way. It is akin to asking an atheist the same about God. The burden of proof is on the claimer.
        The reality is that we can argue about that until we're blue in the face, and get nowhere.
        In this case it would be more productive to talk about the practical tasks a pattern matching and generation machine can do, rather than how good it is at some obscure puzzle. The fact that it's better than humans at solving some problems is not particularly surprising, since computers have been better than humans at many tasks for decades. This new technology gives them broader capabilities, but ascribing human qualities to it and calling it intelligence is nothing but a marketing tactic that's making some people very rich.
        
        [-]
        
        CamperBob2 20 hours ago
        
        (Shrug) Unless and until you provide us with your own definition of intelligence, I'd say the marketing people are as entitled to their opinion as you are.
        
        [-]
        
        olyjohn 4 hours ago
        
        I would say that marketing people have a motivation to make exaggerated claims, while the rest of us are trying to just come up with a definition that makes sense and helps us understand the world.
        I'll give you some examples. "Unlimited" now has limits on it. "Lifetime" means only for so many years. "Fully autonomous" now means with the help of humans on occasion. These are all definitions that have been distorted by marketers, which IMO is deceptive and immoral.
        
        antonvs 21 hours ago
        
        > What evidence of intelligence would satisfy you?
        Imposing world peace and/or exterminating homo sapiens
      - throw310822 1 day ago
        
        > Machines have been able to accomplish specific tasks...
        Indeed, and the specific task machines are accomplishing now is intelligence. Not yet "better than human" (and certainly not better than every human) but getting closer.
        
        [-]
        
        imiric 1 day ago
        
        > Indeed, and the specific task machines are accomplishing now is intelligence.
        How so? This sentence, like most of this field, is making baseless claims that are more aspirational than true.
        Maybe it would help if we could first agree on a definition of "intelligence", yet we don't have a reliable way of measuring that in living beings either.
        If the people building and hyping this technology had any sense of modesty, they would present it as what it actually is: a large pattern matching and generation machine. This doesn't mean that this can't be very useful, perhaps generally so, but it's a huge stretch and an insult to living beings to call this intelligence.
        But there's a great deal of money to be made on this idea we've been chasing for decades now, so here we are.
        
        [-]
        
        warkdarrior 1 day ago
        
        > Maybe it would help if we could first agree on a definition of "intelligence", yet we don't have a reliable way of measuring that in living beings either.
        How about this specific definition of intelligence?
        Solve any task provided as text or images.
        AGI would be to achieve that faster than an average human.
        
        [-]
        
        throw310822 1 day ago
        
        I still can't understand why they should be faster. Humans have general intelligence, afaik. It doesn't matter if it's fast or slow. A machine able to do what the average human can do (intelligence-wise) but 100 times slower still has general intelligence. Since it's artificial, it's AGI.
  - colordrops 1 day ago
    
    Wouldn't you deal with spatial reasoning by giving it access to a tool that structures the space in a way it can understand or just is a sub-model that can do spatial reasoning? These "general" models would serve as the frontal cortex while other models do specialized work. What is missing?
    
    [-]
    - causal 1 day ago
      
      That's a bit like saying just give blind people cameras so they can see.
      
      [-]
      - pixl97 23 hours ago
        
        I mean, no not really. These models can see, you're giving them eyes to connect to that part of their brain.
    - amelius 1 day ago
      
      They should train more on sports commentary, perhaps that could give spatial reasoning a boost.
- aeyes 1 day ago
  
  https://arcprize.org/leaderboard
  $13.62 per task - so we need another 5-10 years for the price to run this to become reasonable?
  But the real question is if they just fit the model to the benchmark.
  
  [-]
  - tedd4u 47 minutes ago
    
    5-10 years? The human panel cost/task is $17 with 100% score. Deep Think is $13.62 with 84.6%. 20% discount for 15% lower score. Sorry, what am I missing?
  - onlyrealcuzzo 1 day ago
    
    Why 5-10 years?
    At current rates, price per equivalent output is dropping at 99.9% over 5 years.
    That's basically $0.01 in 5 years.
    Does it really need to be that cheap to be worth it?
    Keep in mind, $0.01 in 5 years is worth less than $0.01 today.
    
    [-]
    - willis936 1 day ago
      
      Wow that's incredible! Could you show your work?
      
      [-]
      - onlyrealcuzzo 1 day ago
        
        https://epoch.ai/data-insights/llm-inference-price-trends
  - golem14 1 day ago
    
    A grad student hour is probably more expensive…
    
    [-]
    - elromulous 1 day ago
      
      In my experience, a grad student hour is treated as free :(
      
      [-]
      - golem14 20 hours ago
        
        You never applied for a grant, have you?
    - zipy124 13 hours ago
      
      Grad students are incredibly cheap? In the UK for instance their stipend is £20,780 a year...
    - Tom1380 12 hours ago
      
      As it should be. They're a human!
  - re-thc 1 day ago
    
    What’s reasonable? It’s less than minimum hourly wage in some countries.
    
    [-]
    - willis936 1 day ago
      
      Burned in seconds.
      
      [-]
      - gowld 1 day ago
        
        Getting the work done faster for the same money doesn't make the work more expensive.
        You could slow down the inference to make the task take longer, if $/sec matters.
        
        [-]
        
        willis936 12 hours ago
        
        You're right, but I don't think we're getting an hour's worth of work out of single prompts yet. Usually it's an hour's worth of work out of 10 prompts for iteration. Now that's a day's wage for an hour of work. I'm certain the crossover will come soon, but it doesn't feel there yet.
        
        [-]
        
        re-thc 8 hours ago
        
        > but I don't think we're getting an hour's worth of work out of single prompts yet
        But I don't think every developer is getting paid minimum wage either.
        > Now that's a day's wage for an hour of work
        For many developers in the US that can still be an hour's wage.
  - igravious 1 day ago
    
    That's not a long time in the grand scheme of things.
    
    [-]
    - throwup238 1 day ago
      
      Speak for yourself. Five years is a long time to wait for my plans of world domination.
      
      [-]
      - tasuki 1 day ago
        
        This concerns me actually. With enough people (n>=2) wanting to achieve world domination, we have a problem.
        
        [-]
        
        throwup238 1 day ago
        
        It’s not that I want to achieve world domination (imagine how much work that would be!), it’s just that it’s the inevitable path for AI and I’d rather it be me than then next shmuck with a Claude Max subscription.
        
        [-]
        
        amelius 12 hours ago
        
        Don't build your castle in someone else's kingdom.
        
        pixl97 23 hours ago
        
        I mean everyone with prompt access to the model says these things, but people like Sam and Elon say these things and mean it.
        
        gowld 1 day ago
        
        n = 2 is Pinky and the Brain.
        
        [-]
        
        antonvs 21 hours ago
        
        I'm convinced that a substantial fraction of current tech CEOs were unwittingly programmed as children by that show.
      - amelius 1 day ago
        
        Yes, you better hurry.
- whiplash451 8 hours ago
  
  We can really look at it both ways. It is actually concerning that a model that won IMO last summer would still fail 15% of ARC AGI 2.
- mnicky 1 day ago
  
  Well, fair comparison would be with GPT-5.x Pro, which is the same class of a model as Gemini Deep Think.
- culi 1 day ago
  
  Yes but with a significant (logarithmic) increase in cost per task. The ARC-AGI site is less misleading and shows how GPT and Claude are not actually far behind
  https://arcprize.org/leaderboard
- thefounder 17 hours ago
  
  Am I the only one that can’t find Gemini useful except if you want something cheap? I don’t get what was the whole code red about or all that PR. To me I see no reason to use Gemini instead of of GPT and Anthropic combo. I should add that I’ve tried it as chat bot, coding through copilot and also as part of a multi model prompt generation.
  Gemini was always the worst by a big margin. I see some people saying it is smarter but it doesn’t seem smart at all.
  
  [-]
  - mileshilles 11 hours ago
    
    maybe it depends on the usage, but in my experience most of the times the Gemini produces much better results for coding, especially for optimization parts. The results that were produced by Claude wasn't even near that of Gemini. But again, depends on the task I think.
  - Nathanba 17 hours ago
    
    You are not the only one, it's to the point where I think that these benchmark results must be faked somehow because it doesn't match my reality at all.
  - pell 17 hours ago
    
    I find the quality is not consistent at all and of all the LLMs I use Gemini is the one most likely to just verge off and ignore my instructions.
    
    [-]
    - Foobar8568 10 hours ago
      
      Same, as far as I am concerned, Gemini is optimized for benchmarks.
      I mean last week it insisted suddenly on two consecutive prompts that my code was in python. It was in rust.
  - viking123 13 hours ago
    
    It's garbage really, cannot get how they get so high in benchmarks.
  - nprateem 9 hours ago
    
    Yeah it's pretty shit compared to Opus
- chillfox 19 hours ago
  
  At $13.62 per task it's practically unusable for agent tasks due to the cost.
  I found that anything over $2/task on Arc-AGI-2 ends up being way to much for use in coding agents.
- robertwt7 22 hours ago
  
  I’m surprised that gemini 3 pro is so low at 31.1% though compared to opus 4.6 and gpt 5.2. This is a great achievement but its only available to ultra subscribers unfortunately
- emp17344 7 hours ago
  
  I mean, remember when ARC 1 was basically solved, and then ARC 2 (which is even easier for humans) came out, and all of the sudden the same models that were doing well on ARC 1 couldn’t even get 5% on ARC 2? Not convinced this isn’t data leakage.
- fzeindl 17 hours ago
  
  I read somewhere that Google will ultimately always produce the best LLMs, since "good AI" relies on massive amounts of data and Google owns the most data.
  Is that a based assumption?
  
  [-]
  - astrange 3 hours ago
    
    No.
    
    [-]
    - SV_BubbleTime 1 hour ago
      
      Correct.
      Great output is a good model with good context… at the right time.
      Google isn’t guaranteed any of these.
- saberience 1 day ago
  
  Arc-AGI (and Arc-AGI-2) is the most overhyped benchmark around though.
  It's completely misnamed. It should be called useless visual puzzle benchmark 2.
  It's a visual puzzle, making it way easier for humans than for models trained on text firstly. Secondly, it's not really that obvious or easy for humans to solve themselves!
  So the idea that if an AI can solve "Arc-AGI" or "Arc-AGI-2" it's super smart or even "AGI" is frankly ridiculous. It's a puzzle that means nothing basically, other than the models can now solve "Arc-AGI"
  
  [-]
  - CuriouslyC 1 day ago
    
    The puzzles are calibrated for human solve rates, but otherwise I agree.
    
    [-]
    - saberience 1 day ago
      
      My two elderly parents cannot solve Arc-AGI puzzles, but can manage to navigate the physical world, their house, garden, make meals, clean the house, use the TV, etc.
      I would say they do have "general intelligence", so whatever Arc-AGI is "solving" it's definitely not "AGI"
      
      [-]
      - hmmmmmmmmmmmmmm 1 day ago
        
        You are confusing fluid intelligence with crystallised intelligence.
        
        [-]
        
        casey2 1 day ago
        
        I think you are making that confusion. Any robotic system in the place of his parents would fail with a few hours.
        There are more novel tasks in a day than ARC provides.
        
        [-]
        
        hmmmmmmmmmmmmmm 1 day ago
        
        Children have great levels of fluid intelligence, that's how they are able to learn to quickly navigate in a world that they are still very new to. Seniors with decreasing capacity increasingly rely on crystallised intelligence, that's why they can still perform tasks like driving a car but can fail at completely novel tasks, sometimes even using a smartphone if they have not used one before.
        
        [-]
        
        mrbungie 23 hours ago
        
        My late grandma learnt how to use an iPad by herself during her 70s to 80s without any issues, mostly motivated by her wish to read her magazines, doomscroll facebook and play solitaire. Her last job was being a bakery cashier in her 30s and she didn't learn how to use a computer in-between, so there was no skill transfer going on.
        Humans and their intelligence are actually incredible and probably will continue to be so, I don't really care what tech/"think" leaders wants us to think.
        
        zeroonetwothree 1 day ago
        
        It really depends on motivation. My 90 year old grandmother can use a smartphone just fine since she needs it to see pictures of her (great) grandkids.
- karmasimida 1 day ago
  
  It is over
  
  [-]
  - baal80spam 1 day ago
    
    I for one welcome our new AI overlords.
logicprog 1 day ago

Is it me or is the rate of model release is accelerating to an absurd degree? Today we have Gemini 3 Deep Think and GPT 5.3 Codex Spark. Yesterday we had GLM5 and MiniMax M2.5. Five days before that we had Opus 4.6 and GPT 5.3. Then maybe two weeks I think before that we had Kimi K2.5.

[-]
- i5heu 1 day ago
  
  I think it is because of the Chinese new year. The Chinese labs like to publish their models arround the Chinese new year, and the US labs do not want to let a DeepSeek R1 (20 January 2025) impact event happen again, so i guess they publish models that are more capable then what they imagine Chinese labs are yet capable of producing.
  
  [-]
  - woah 1 day ago
    
    Singularity or just Chinese New Year?
    
    [-]
    - syndacks 1 day ago
      
      The Singularity will occur on a Tuesday, during Chinese New Year
  - kristopolous 22 hours ago
    
    I guess. Deepseek v3 was released on boxing day a month prior
    https://api-docs.deepseek.com/news/news1226
    
    [-]
    - littlestymaar 15 hours ago
      
      And made almost zero impact, it was just a bigger version of Deepseek V2 and when mostly unnoticed because its performances weren't particularly notable especially for its size.
      It was R1 with its RL-training that made the news and crashed the srock market.
  - dboreham 19 hours ago
    
    Aren't we saying "lunar new year" now?
    
    [-]
    - Cyphase 18 hours ago
      
      I don't think so; there are different lunar calendars.
      
      [-]
      - keithluu 14 hours ago
        
        In fact, many Asian countries use lunisolar calendars, which basically follow the moon for the months but add an extra month every few years so the seasons don't drift.
        As these calendars also rely on time zones for date calculation, there are rare occasions where the New Year start date differs by an entire month between 2 countries.
      - lifthrasiir 16 hours ago
        
        If that's a sole problem, it should be called "Chinese-Japanese-Korean-whateverelse new year" instead. Maybe "East Asian new year" for short. (Not that there are absolutely no discrepancies within them, but they are so similar enough that new year's day almost always coincide.)
        
        [-]
        
        walthamstow 11 hours ago
        
        It's not Japanese either.
        This non-problem sounds like it's on the same scale as "The British Isles", a term which is mildly annoying to Irish people but in common use everywhere else.
  - r2vcap 1 day ago
    
    [flagged]
    
    [-]
    - rfoo 1 day ago
      
      For another example, Singapore, one of the "many Asian countries" you mentioned, list "Chinese New Year" as the official name on government websites. [0] Also note that both California and New York is not located in Asia.
      And don't get me started with "Lunar New Year? What Lunar New Year? Islamic Lunar New Year? Jewish Lunar New Year? CHINESE Lunar New Year?".
      [0] https://www.mom.gov.sg/employment-practices/public-holidays
    - janalsncm 23 hours ago
      
      “Lunar New Year” is vague when referring to the holiday as observed by Chinese labs in China. Chinese people don’t call it Lunar New Year or Chinese New Year anyways. They call it Spring Festival (春节).
      As it turns out, people in China don’t name their holidays based off of what the laws of New York or California say.
    - triceratops 22 hours ago
      
      Please don't because "Lunar New Year" is ambiguous. Many other Asian cultures also have traditional lunar calendars but a different new years day. It's a bit presumptuous to claim that this is the sole "Lunar New Year" celebration.
      https://en.wikipedia.org/wiki/Indian_New_Year%27s_days#Calen...
      https://en.wikipedia.org/wiki/Islamic_New_Year
      https://en.wikipedia.org/wiki/Nowruz
    - zzrush 1 day ago
      
      I didn't expect language policing has reached such level. This is specifically related to China and DeepSeek who celebrates Chinese new year. Do you demand all Chinese to say happy luner new year to each other?
    - phainopepla2 1 day ago
      
      "Happy Holidays" comes to the diaspora
      
      [-]
      - FartyMcFarter 1 day ago
        
        Happy Lunar Holidays to you!
    - jfengel 1 day ago
      
      "Lunar New Year" is perhaps over-general, since there are non-Asian lunar calendars, such as the Hebrew and Islamic calendars.
      That said, "Lunar New Year" is probably as good a compromise as any, since we have other names for the Hebrew and Islamic New Years.
      
      [-]
      - triceratops 7 hours ago
        
        There's more than one Asian lunar calendar: https://news.ycombinator.com/item?id=46996396.
        The Islamic calendar originated in Arabia. Calling it an Asian lunar calendar wouldn't be inaccurate.
      - Electricniko 23 hours ago
        
        This all seems like a plot to get everyone worshipping the Roman goddess Luna.
    - 0x3f 1 day ago
      
      But they're Chinese companies specifically, in this case
    - saubeidl 1 day ago
      
      Where do all of those Asian countries have that tradition from?
      Have you ever had a Polish Sausage? Did it make you Polish?
- aliston 1 day ago
  
  I'm having trouble just keeping track of all these different types of models.
  Is "Gemini 3 Deep Think" even technically a model? From what I've gathered, it is built on top of Gemini 3 Pro, and appears to be adding specific thinking capabilities, more akin to adding subagents than a truly new foundational model like Opus 4.6.
  Also, I don't understand the comments about Google being behind in agentic workflows. I know that the typical use of, say, Claude Code feels agentic, but also a lot of folks are using separate agent harnesses like OpenClaw anyway. You could just as easily plug Gemini 3 Pro into OpenClaw as you can Opus, right?
  Can someone help me understand these distinctions? Very confused, especially regarding the agent terminology. Much appreciated!
  
  [-]
  - manmal 31 minutes ago
    
    I have no proof, but these deep thinking modes feel to me like an orchestrator agent + sub agents, the former being RL‘d to just keep going instead of being conditioned to stop ASAP.
  - janalsncm 22 hours ago
    
    The term “model” is one of those super overloaded terms. Depending on the conversation it can mean:
    - a product (most accurate here imo)
    - a specific set of weights in a neural net
    - a general architecture or family of architectures (BERT models)
    So while you could argue this is a “model” in the broadest sense of the term, it’s probably more descriptive to call it a product. Similarly we call LLMs “language” models even if they can do a lot more than that, for example draw images.
    
    [-]
    - cubefox 14 hours ago
      
      I'm pretty sure only the second is properly called a model, and "BERT models" are simply models with the BERT architecture.
      
      [-]
      - janalsncm 1 hour ago
        
        If someone says something is a BERT “model” I’m not going to assume they are serving the original BERT weights (definition 2).
        I probably won’t even assume it’s the OG BERT. It could be ModernBERT or RoBERTa or one of any number of other variants, and simply saying it’s a BERT model is usually the right level of detail for the conversation.
      - ruszki 10 hours ago
        
        It depends on time. 5 years ago it was quite well defined that it’s the last one, maybe the second one in some context. Especially when distinction was important, it was always the last one. In our case it was. We trained models to have weights. We even stored models and weights separately, because models change slower than weights. You could choose a model and a set of weights, and run them. You could change weights any time.
        Then marketing, and huge amount of capital came.
        
        [-]
        
        cubefox 10 hours ago
        
        It seems unlikely "model" was ever equivalent in meaning to "architecture". Otherwise there would be just one "CNN model" or just one "transformer model" insofar there is a single architecture involved.
        
        [-]
        
        ruszki 8 hours ago
        
        First of all, hyperparameters. Second, organization, or connections. 3rd, cost function. 4th, activation function. 5th type of learning. Etc.
        These are not weights. These were parts of models.
  - logicprog 1 day ago
    
    > Also, I don't understand the comments about Google being behind in agentic workflows.
    It has to do with how the model is RL'd. It's not that Gemini can't be used with various agentic harnesses, like open code or open claw or theoretically even claude code. It's just that the model is trained less effectively to work with those harnesses, so it produces worse results.
  - re-thc 1 day ago
    
    There are hints this is a preview to Gemini 3.1.
- _heimdall 12 hours ago
  
  More focus has been put on post-training recently. Where a full model training run can take a month and often requires multiple tries because it can collapse and fail, post-training is don't on the order of 5 or 6 days.
  My assumption is that they're all either pretty happy with their base models or unwilling to do those larger runs, and post-training is turning out good results that they release quickly.
- sanderjd 22 hours ago
  
  So, yes, for the past couple weeks it has felt that way to me. But it seems to come in fits and starts. Maybe that will stop being the case, but that's how it's felt to me for awhile.
- rogerkirkness 1 day ago
  
  Fast takeoff.
- killingtime74 17 hours ago
  
  They are spending literal trillions. It may even accelerate
- redox99 1 day ago
  
  There's more compute now than before.
- brokencode 1 day ago
  
  They are using the current models to help develop even smarter models. Each generation of model can help even more for the next generation.
  I don’t think it’s hyperbolic to say that we may be only a single digit number of years away from the singularity.
  
  [-]
  - lm28469 1 day ago
    
    I must be holding these things wrong because I'm not seeing any of these God like superpowers everyone seem to enjoy.
    
    [-]
    - brokencode 1 day ago
      
      Who said they’re godlike today?
      And yes, you are probably using them wrong if you don’t find them useful or don’t see the rapid improvement.
      
      [-]
      - lm28469 1 day ago
        
        Let's come back in 12 months and discuss your singularity then. Meanwhile I spent like $30 on a few models as a test yesterday, none of them could tell me why my goroutine system was failing, even though it was painfully obvious (I purposefully added one too many wg.Done), gemini, codex, minimax 2.5, they all shat the bed on a very obvious problem but I am to believe they're 98% conscious and better at logic and math than 99% of the population.
        Every new model release neckbeards come out of the basements to tell us the singularity will be there in two more weeks
        
        [-]
        
        BeetleB 1 day ago
        
        On the flip side, twice I put about 800K tokens of code into Gemini and asked it to find why my code was misbehaving, and it found it.
        The logic related to the bug wasn't all contained in one file, but across several files.
        This was Gemini 2.5 Pro. A whole generation old.
        
        goodmythical 4 hours ago
        
        I think you're being awfully generous to the average human.
        Consider that a nonzero percent of otherwise competent adults can't write in their native language.
        Consider that some tens of percentage of people wouldn't have the foggiest idea of how to calculate a square root let alone a cube.
        Consider that well less than half of the population has ever seen code let alone produced functioning code.
        The average adult is strikingly incapable of things that the average commenter here would consider basic skills.
        
        brokencode 1 day ago
        
        You are fighting straw men here. Any further discussion would be pointless.
        
        [-]
        
        lm28469 1 day ago
        
        Of course, n-1 wasn't good enough but n+1 will be singularity, just two more weeks my dudes, two more week... rinse and repeat ad infinitum
        
        [-]
        
        brokencode 1 day ago
        
        Like I said, pointless strawmanning.
        You’ve once again made up a claim of “two more weeks” to argue against even though it’s not something anybody here has claimed.
        If you feel the need to make an argument against claims that exist only in your head, maybe you can also keep the argument only in your head too?
        
        [-]
        
        tom_ 21 hours ago
        
        It's presumably a reference to this saying: https://www.urbandictionary.com/define.php?term=2%20more%20w...
        
        virgildotcodes 21 hours ago
        
        Mind sharing the file?
        Also, did you use Codex 5.3 Xhigh through the Codex CLI or Codex App?
        
        woah 1 day ago
        
        Post the file here
        
        viking123 13 hours ago
        
        It's basically bunch of people who see themselves as too smart to believe in God, instead they have just replaced it with AI and Singularity and attribute similar stuff to it eg. eternal life which is just heaven in religion. Amodei was hawking doubling of human lifespan to a bunch of boomers not too long ago. Ponce de León also went to search for the fountain of youth. It's a very common theme across human history. AI is just the new iteration where they mirror all their wishes and hopes.
        
        [-]
        
        brokencode 6 hours ago
        
        You realize that science and technology does in fact produce medical breakthroughs that cure disease, right?
        On the other hand, prayer doesn’t heal anybody and there’s no proof of supernatural beings.
        
        [-]
        
        viking123 2 hours ago
        
        The boomers he was talking to will be long underground before we will have any major cures for the diseases they will die from lmao. Maybe in 200 years?
        Btw, so will you and I most likely.
        
        logicprog 1 day ago
        
        Meanwhile I've been using Kimi K2T and K2.5 to work in Go with a fair amount of concurrency and it's been able to write concurrent Go code and debug issues with goroutines equal to, and much more complex then, your issue, involving race conditions and more, just fine.
        Projects:
        https://github.com/alexispurslane/oxen
        https://github.com/alexispurslane/org-lsp
        (Note that org-lsp has a much improved version of the same indexer as oxen; the first was purely my design, the second I decided to listen to K2.5 more and it found a bunch of potential race conditions and fixed them)
        shrug
        
        Izikiel43 1 day ago
        
        Out of curiosity, did you give a test for them to validate the code?
        I had a test failing because I introduced a silly comparison bug (> instead of <), and claude 4.6 opus figured out it wasn't the test the problem, but the code and fixed the bug (which I had missed).
        
        [-]
        
        lm28469 1 day ago
        
        There was a test and a very useful golang error that literally explain what was wrong. The model tried implementing a solution, failed and when I pointed out the error most of them just rolled back the "solution"
        
        [-]
        
        frde_me 21 hours ago
        
        What exact models were you using? And with what settings? 4.6 / 5.3 codex both with thinking / high modes?
        
        [-]
        
        lm28469 13 hours ago
        
        minimax 2.5, kimi k2.5, codex 5.2, gemini 3 flash and pro, glm 4.7, devstral2 123b, etc.
        
        Izikiel43 1 day ago
        
        Ok, thanks for the info
        
        antonvs 21 hours ago
        
        > I purposefully added one too many wg.Done
        What do you believe this shows? Sometimes I have difficulty finding bugs in other people's code when they do things in ways I would never use. I can rewrite their code so it works, but I can't necessarily quickly identify the specific bug.
        Expecting a model to be perfect on every problem isn't reasonable. No known entity is able to do that. AIs aren't supposed to be gods.
        (Well not yet anyway - there is as yet insufficient data for a meaningful answer.)
        
        [-]
        
        laurentiurad 9 hours ago
        
        When companies claim that AI writes 90% of their code you can expect that such a system can find obvious issues. Expectations are really high when you see statements such as the ones coming from the CEOs of the AI labs. When those expectations fall short, it's expected to see such reactions. It's the same proportionality on both sides.
        
        SpicyLemonZest 21 hours ago
        
        It's hard to evaluate "logic" and "math", since they're made up of many largely disparate things. But I think modern AI models are clearly better at coding, for example, than 99% of the population. If you asked 100 people at your local grocery store why your goroutine system was failing, do you think multiple of them would know the answer?
  - mrandish 22 hours ago
    
    > using the current models to help develop even smarter models.
    That statement is plausible. However, extrapolating that to assert all the very different things which must be true to enable any form of 'singularity' would be a profound category error. There are many ways in which your first two sentences can be entirely true, while your third sentence requires a bunch of fundamental and extraordinary things to be true for which there is currently zero evidence.
    Things like LLMs improving themselves in meaningful and novel ways and then iterating that self-improvement over multiple unattended generations in exponential runaway positive feedback loops resulting in tangible, real-world utility. All the impressive and rapid achievements in LLMs to date can still be true while major elements required for Foom-ish exponential take-off are still missing.
  - sekai 1 day ago
    
    > I don’t think it’s hyperbolic to say that we may be only a single digit number of years away from the singularity.
    We're back to singularity hype, but let's be real: benchmark gains are meaningless in the real world when the primary focus has shifted to gaming the metrics
    
    [-]
    - brokencode 1 day ago
      
      Ok, here I am living in the real world finding these models have advanced incredibly over the past year for coding.
      Benchmaxxing exists, but that’s not the only data point. It’s pretty clear that models are improving quickly in many domains in real world usage.
      
      [-]
      - toraway 18 hours ago
        
        I use agentic tools daily and SOTA models have certainly improved a lot in the last year. But still in a linear, "they don't light my repo on fire as often when they get a confusing compiler error" kind of way, not a "I would now trust Opus 4.6 to respond to every work email and hands-off manage my banking and investment portfolio" kind of way.
        They're still afflicted by the same fundamental problems that hold LLMs back from being a truly autonomous "drop-in human replacement" that would enable an entire new world of use cases.
        And finally live up to the hype/dreams many of us couldn't help but feeling was right around in the corner circa 2022/3 when things really started taking off.
      - mrbungie 23 hours ago
        
        Yet even Anthropic has shown the downsides to using them. I don't think it is a given that improvements in models scores and capabilities + being able to churn code as fast as we can will lead us to a singularity, we'll need more than that.
      - Freedom2 18 hours ago
        
        I agree completely. I think we're in alignment with Elon Musk who says that AI will bypass coding entirely and create the binary directly.
        It's going to be an exciting year.
        
        [-]
        
        baq 16 hours ago
        
        There’s about as much sense doing this as there is in putting datacenters in orbit, i.e. it isn’t impossible, but literally any other option is better.
- bpodgursky 1 day ago
  
  Anthropic took the day off to do a $30B raise at a $380B valuation.
  
  [-]
  - IhateAI 1 day ago
    
    Most ridiculous valuation in the history of markets. Cant wait to watch these compsnies crash snd burn when people give up on the slot machine.
    
    [-]
    - andxor 1 day ago
      
      As usual don't take financial advice from HN folks!
      
      [-]
      - blibble 21 hours ago
        
        not as if you could get in on it even if you wanted to
    - kgwgk 1 day ago
      
      WeWork almost IPO’s at $50bn. It was also a nice crash and burn.
    - jascha_eng 1 day ago
      
      Why? They had $10+ billion arr run rate in 2025 trippeled from 2024 I mean 30x is a lot but also not insane at that growth rate right?
      
      [-]
      - gokhan 1 day ago
        
        It's a 13 days old account with IHateAI handle.
        
        [-]
        
        IhateAI 22 hours ago
        
        [dead]
- bytesandbits 11 hours ago
  
  its cause of a chain of events.
  Next week Chinese New year -> Chinese labs release all the models at once before it starts -> US labs respond with what they have already prepared
  also note that even in US labs a large proportion of researchers and engineers are chinese and many celebrate the Chinese New Year too.
  TLDR: Chinese New Year. Happy Horse year everybody!
rob-wagner 1 day ago

I’ve been using Gemini 3 Pro on a historical document archiving project for an old club. One of the guys had been working on scanning old handwritten minutes books written in German that were challenging to read (1885 through 1974). Anyways, I was getting decent results on a first pass with 50 page chunks but ended up doing 1 page at a time (accuracy probably 95%). For each page, I submit the page for a transcription pass followed by a translation of the returned transcription. About 2370 pages and sitting at about $50 in Gemini API billing. The output will need manual review, but the time savings is impressive.

[-]
- energy123 15 hours ago
  
  Suggestion: run the identical prompt N times (2 identical calls to Gemini 3.0 Pro + 2 identical calls to GPT 5.2 Thinking), then running some basic text post-processing to see where the 4 responses agree vs disagree. The disagreements (substrings that aren't identical matches) are where scrutiny is needed. But if all 4 agree on some substring it's almost certainly a correct transcription. Wouldn't be too hard to get codex to vibe code all this.
  
  [-]
  - matjet 11 hours ago
    
    Look what they need to mimic a fraction of [the power of having the logit probabilities exposed so you can actually see where the model is uncertain]
    
    [-]
    - kfajdsl 21 minutes ago
      
      All the LLM logprob outputs I've seen aren't very well calibrated, at least for transcription tasks - I'm guessing it's similar for OCR type tasks.
- dubeye 23 hours ago
  
  It sounds like a job where one pass might also be a viable option. Until you do the manual review you won't have a full sense of the time savings involved.
  
  [-]
  - rob-wagner 22 hours ago
    
    Good idea. I’ll try modifying the prompt to transcribe, identify the language, and translate if not English, and then return a structured result. In my spot checks, most of the errors are in people’s names and if the handwriting trails into margins (especially into the fold of the binding). Even with the data still needing review, the translations from it has revealed a lot of interesting characters as well as this little anecdote from the minutes of the June 6, 1941 Annual Meeting:
    It had already rained at the beginning of the meeting. During the same, however, a heavy thunderstorm set in, whereby our electric light line was put out of operation. Wax candles with beer bottles as light holders provided the lighting. In the meantime the rain had fallen in a cloudburst-like manner, so that one needed help to get one's automobile going. In some streets the water stood so high that one could reach one's home only by detours. In this night 9.65 inches of rain had fallen.
    
    [-]
    - rjcrystal 20 hours ago
      
      One discovery I've made with gemini is that ocr accuracy is much higher when document is perfectly aligned at 0 degree. When we provided images with handwritten text to gemini which were horizontal (90 or 180 degree) it had lots of issues reading dates, names etc. Then we used paddle ocr image orientation model to find orientation and rotate the image it solved most of our issues with ocr.
  - websap 20 hours ago
    
    They could likely increase their budget slightly and run an LLM-based judge.
- kaveh_h 18 hours ago
  
  Have you tried providing multiple pages at a time to the model? It might do better transcription as it have bigger context to work with.
  
  [-]
  - netdur 14 hours ago
    
    Gemini 3 long context is not good as Gemini 2.5
    
    [-]
    - ody4242 9 hours ago
      
      I'm 100% sure that all providers are playing with the quantization, kv cache and other parameters of the models to be able to serve the demand. One of the biggest advantage of running a local model is that you get predictable behavior.
xnx 1 day ago

Google is absolutely running away with it. The greatest trick they ever pulled was letting people think they were behind.

[-]
- wiseowise 1 day ago
  
  Their models might be impressive, but their products absolutely suck donkey balls. I’ve given Gemini web/cli two months and ran away back to ChatGPT. Seriously, it would just COMPLETELY forget context mid dialog. When asked about improving air quality it just gave me a list of (mediocre) air purifiers without asking for any context whatsoever, and I can list thousands of conversations like that. Shopping or comparing options is just nonexistent. It uses Russian propaganda sources for answers and switches to Chinese mid sentence (!), while explaining some generic Python functionality. It’s an embarrassment and I don’t know how they justify 20 euro price tag on it.
  
  [-]
  - mavamaarten 1 day ago
    
    I agree. On top of that, in true Google style, basic things just don't work.
    Any time I upload an attachment, it just fails with something vague like "couldn't process file". Whether that's a simple .MD or .txt with less than 100 lines or a PDF. I tried making a gem today. It just wouldn't let me save it, with some vague error too.
    I also tried having it read and write stuff to "my stuff" and Google drive. But it would consistently write but not be able to read from it again. Or would read one file from Google drive and ignore everything else.
    Their models are seriously impressive. But as usual Google sucks at making them work well in real products.
    
    [-]
    - davoneus 1 day ago
      
      I don't find that at all. At work, we've no access to the API, so we have to force feed a dozen (or more) documents, code and instruction prompts through the web interface upload interface. The only failures I've ever had in well over 300 sessions were due to connectivity issues, not interface failures.
      Context window blowouts? All the time, but never document upload failures.
      
      [-]
      - mavamaarten 15 hours ago
        
        I'm talking about Gemini in the app and on the web. As well as AI studio. At work we go through Copilot, but there the agentic mode with Gemini isn't the best either.
      - pixl97 23 hours ago
        
        Honestly this is as Google product as you can get. Prizes for some, beatings for others.
  - aerhardt 2 hours ago
    
    I've used their Pro models very successfully in demanding API workloads (classification, extraction, synthesis). On benchmarks it crushed the GPT-5 family. Gemini is my default right now for all API work.
    It took me however a week to ditch Gemini 3 as a user. The hallucinations were off the charts compared to GPT-5. I've never even bothered with their CLI offering.
  - Footprint0521 2 hours ago
    
    It’s all context/ use case; I’ve had weird things too but if you only use markdown inputs and specific prompts Gemini 3 Pro is insane, not to mention the context window
    Also because of the long context window (1 mil tokens on thinking and pro! Claude and OpenAI only have 128k) deep research is the best
    That being said, for coding I definitely still use Codex with GPT 5.3 XHigh lol
  - chermanowicz 1 day ago
    
    It's so capable at some things, and others are garbage. I uploaded a photo of some words for a spelling bee and asked it to quiz my kid on the words. The first word it asked, wasn't on the list. After multiple attempts to get it to start asking only the words in the uploaded pic, it did, and then would get the spellings wrong in the Q&A. I gave up.
    
    [-]
    - romanows 20 hours ago
      
      I had it process a photo of my D&D character sheet and help me debug it as I'm a n00b at the game. Also did a decent, although not perfect, job of adding up a handwritten bowling score sheet.
  - jorl17 18 hours ago
    
    Antigravity is an embarrassment.
    The models feel terrible, somehow, like they're being fed terrible system prompts.
    Plus the damn thing kept crashing and asking me to "restart it". What?!
    At least Kiro does what it says on the tin.
    
    [-]
    - jeanloolz 15 hours ago
      
      My experience with Antigravity is the opposite. It's the first time in over 10 years that an IDE has managed to take me out a bit out of the jetbrain suite. I did not think that was something possible as I am a hardcore jetbrain user/lover.
      
      [-]
      - jorl17 2 hours ago
        
        Have you tried Cursor or VS Code with Github Copilot in agent mode (recently, not 3 or 6 months ago)?
        I've recently tried a buuuuunch of stuff (including Antigravity and Kiro) and I really, really, could not stomach Antigravity.
      - isqueiros 14 hours ago
        
        It's literally just vscode? I tried it the other day and I couldn't tell it apart from windsurf besides the icon in my dock
        
        [-]
        
        nprateem 9 hours ago
        
        Yeah same here. Even though it's vscode I'm still using it and don't plan to renew Intellij again. Gemini was crap but Opus smashes it.
        It is windsurf isn't it, why would you expect it to be different?
  - sequin 1 day ago
    
    How can the models be impressive if they switch to Chinese mid-sentence? I've observed those bizarre bugs too. Even GPT-3 didn't have those. Maybe GPT-2 did. It's actually impressive that they managed to botch it so badly.
    Google is great at some things, but this isn't it.
  - navigate8310 16 hours ago
    
    100x agree. It gives inconsistent edits, would regularly try to perform things I explicitly command to not.
  - kilroy123 1 day ago
    
    Sadly true.
    It is also one of the worst models to have a sort of ongoing conversation with.
  - blinding-streak 15 hours ago
    
    I don't have any of these issues with Gemini. I use it heavily everyday. A few glitches here and there, but it's been enormously productive for me. Far more so then chatgpt, which I find mostly useless.
  - gokhan 1 day ago
    
    Agreed on the product. I can't make Gemini read my emails on GMail. One day it says it doesn't have access, the other day it says Query unsuccessful. Claude Desktop has no problem reaching to GMail, on the other hand :)
  - e40 14 hours ago
    
    And it gives incorrect answers about itself and google’s services all the time. It kept pointing me to nonexistent ui elements. At least it apologizes profusely! ffs
  - HardCodedBias 1 day ago
    
    Their models are absolutely not impressive.
    Not a single person is using it for coding (outside of Google itself).
    Maybe some people on a very generous free plan.
    Their model is a fine mid 2025 model, backed by enormous compute resources and an army of GDM engineers to help the “researchers” keep the model on task as it traverses the “tree of thoughts”.
    But that isn’t “the model” that’s an old model backed by massive money.
    
    [-]
    - chermi 20 hours ago
      
      Uhh, just false.
      
      [-]
      - HardCodedBias 6 hours ago
        
        It's just poop tier.
        Come on.
        Worthless.
        Do you have any market counter points.
        Market counter points that aren't really just a repackaging of:
        1. "Google has the world's best distribution" and/or 2. "Google has a firehose of money that allows them to sell their 'AI product' at an enormous discount?
        Good luck!
- manmal 17 minutes ago
  
  Have you used Gemini CLI, and then codex? Gemini is so trigger happy, the moment you don’t tell it „don’t make any changes“ it runs off and starts doing all kind of unrelated refactorings. This is the opposite of what I want. I want considerate, surgical implementations. I need to have a discussion of the scope, and sequence diagrams first. It should read a lot of files instead of hallucinating about my architecture.
  Their chat feels similar. It just runs off like a wild dog.
- virgildotcodes 10 hours ago
  
  These benchmarks are super impressive. That said, Gemini 3 Pro benchmarked well on coding tasks, and yet I found it abysmal. A distant third behind Codex and Claude.
  Tool calling failures, hallucinations, bad code output. It felt like using a coding model from a year ago.
  Even just as a general use model, somehow ChatGPT has a smoother integration with web search (than google!!), knowing when to use it, and not needing me to prompt it directly multiple times to search.
  Not sure what happened there. They have all the ingredients in theory but they've really fallen behind on actual usability.
  Their image models are kicking ass though.
- Ozzie_osman 1 day ago
  
  Peacetime Google is not like wartime Google.
  Peacetime Google is slow, bumbling, bureaucratic. Wartime Google gets shit done.
  
  [-]
  - nutjob2 1 day ago
    
    OpenAI is the best thing that happened to Google apparently.
    
    [-]
    - taurath 1 day ago
      
      Just not search. The search product has pretty much become useless over the past 3 years and the AI answers often will get just to the level of 5 years ago. This creates a sense that that things are better - but really it’s just become impossible to get reliable information from an avenue that used to work very well.
      I don’t think this is intentional, but I think they stopped fighting SEO entirely to focus on AI. Recipes are the best example - completely gutted and almost all receive sites (therefore the entire search page) run by the same company. I didn’t realize how utterly consolidated huge portions of information on the internet was until every recipe site about 3 months ago simultaneously implemented the same anti-Adblock.
      
      [-]
      - miohtama 17 hours ago
        
        The search product become useless on a particular day of 2019 as discussed on HN News some time ago:
        https://news.ycombinator.com/item?id=40133976
    - RationPhantoms 1 day ago
      
      Competition always is. I think there was a real fear that their core product was going to be replaced. They're already cannibalizing it internally so it was THE wake up call.
    - koolala 1 day ago
      
      Next they compete on ads...
  - lern_too_spel 1 day ago
    
    Wartime Google gave us Google+. Wartime Google is still bumbling, and despite OpenAI's numerous missteps, I don't think it has to worry about Google hurting its business yet.
    
    [-]
    - NikolaNovak 23 hours ago
      
      I do miss Google+. For my brain / use case, it was by far the best social network out there, and the Circle friends and interest management system is still unparalleled :)
    - blinding-streak 15 hours ago
      
      Google+ was fun. Failed in the market though.
      Apple made a social network called Ping. Disaster. MobileMe was silly.
      Microsoft made Zune and the Kin 1 and Kin 2 devices and Windows phone and all sorts of other disasters.
      These things happen.
      
      [-]
      - machiaweliczny 6 hours ago
        
        Windows Phone was actually good. I would even say that my Lumia something was one of best experiences ever on mobile. G+ was also good. Efficient markets mean that you can "extract" rent, via selling data or attention etc. not realy what is good
      - Peaches4Rent 8 hours ago
        
        I have a hypothesis that Google+ just wasn't addictive. Which is a good thing now, but not back then
- kenjackson 1 day ago
  
  But wait two hours for what OpenAI has! I love the competition and how someone just a few days ago was telling how ARC-AGI-2 was proof that LLMs can't reason. The goalposts will shift again. I feel like most of human endeavor will soon be just about trying to continuously show that AI's don't have AGI.
  
  [-]
  - kilpikaarna 1 day ago
    
    > I feel like most of human endeavor will soon be just about trying to continuously show that AI's don't have AGI.
    I think you overestimate how much your average person-on-the-street cares about LLM benchmarks. They already treat ChatGPT or whichever as generally intelligent (including to their own detriment), are frustrated about their social media feeds filling up with slop and, maybe, if they're white-collar, worry about their jobs disappearing due to AI. Apart from a tiny minority in some specific field, people already know themselves to be less intelligent along any measurable axis than someone somewhere.
  - nutjob2 1 day ago
    
    "AGI" doesn't mean anything concrete, so it's all a bunch of non-sequiturs. Your goalposts don't exist.
    Anyone with any sense is interested in how well these tools work and how they can be harnessed, not some imaginary milestone that is not defined and cannot be measured.
    
    [-]
    - kenjackson 1 day ago
      
      I agree. I think the emergence of LLMs have shown that AGI really has no teeth. I think for decades the Turing test was viewed as the gold standard, but it's clear that there doesn't appear to be any good metric.
      
      [-]
      - sincerely 23 hours ago
        
        The turing test was passed in the 80s, somehow it has remained relevant in pop culture despite the fact that it's not a particularly difficult technical achievement
        
        [-]
        
        kenjackson 21 hours ago
        
        It wasn’t passed in the 80s. Not the general Turing test.
        
        [-]
        
        dboreham 19 hours ago
        
        c. 2022 for me.
  - 7777332215 1 day ago
    
    Soon they can drop the bioweapon to welcome our replacement.
- RachelF 18 hours ago
  
  Not in my experience with Gemini Pro and coding. It hallucinates APIs that aren't there. Claude does not do that.
  Gemini has flashes of brilliance, but I regard it as unpolished some things work amazingly, some basics don't work.
  
  [-]
  - energy123 17 hours ago
    
    It's very hard to tell the difference between bad models and stinginess with compute.
    I subscribe to both Gemini ($20/mo) and ChatGPT Pro ($200/mo).
    If I give the same question to "Gemini 3.0 Pro" and "ChatGPT 5.2 Thinking + Heavy thinking", the latter is 4x slower and it gives smarter answers.
    I shouldn't have to enumerate all the different plausible explanations for this observation. Anything from Gemini deciding to nerf the reasoning effort to save compute, versus TPUs being faster, to Gemini being worse, to this being my idiosyncratic experience, all fit the same data, and are all plausible.
    
    [-]
    - timpera 15 hours ago
      
      You nailed it. Gemini 3 Pro seems very "lazy" and seems to never reason for more than 30 seconds, which significantly impacts the quality of its outputs.
- kriro 11 hours ago
  
  I'd personally bet on Google and Meta in the long run since they have access to the most interesting datasets from their other operations.
  
  [-]
  - xnx 2 hours ago
    
    Agree. Anyone with access to large proprietary data has an edge in their space (not necessarily for foundation models): Salesforce, adobe, AutoCAD, caterpillar
- spaceman_2020 17 hours ago
  
  They seem to be optimizing for benchmarks instead of real world use
  
  [-]
  - moffkalast 8 hours ago
    
    Yeah if only Gemini performed half as well as it does on benches, we'd actually be using it.
- hawk_ 6 hours ago
  
  What is their Claude code equivalent?
  
  [-]
  - krzyk 3 hours ago
    
    gemini cli - https://geminicli.com/
- segmondy 16 hours ago
  
  It was obvious to me that they were top contender 2 years ago ... https://www.reddit.com/r/LocalLLaMA/comments/1c0je6h/google_...
- Razengan 1 day ago
  
  Gemini's UX (and of course privacy cred as with anything Google) is the worst of all the AI apps. In the eyes of the Common Man, it's UI that will win out, and ChatGPT's is still the best.
  
  [-]
  - xnx 1 day ago
    
    Google privacy cred is ... excellent? The worst data breach I know of them having was a flaw that allowed access to names and emails of 500k users.
    
    [-]
    - bitpush 1 day ago
      
      Link? Are you conflating with "500k Gmail accounts leaked [by a third party]" with Gmail having a breach?
      Afaik, Google has had no breaches ever.
      
      [-]
      - xnx 1 day ago
        
        https://en.wikipedia.org/wiki/2018_Google_data_breach
      - Razengan 1 day ago
        
        Google is the breach.
    - oofbey 19 hours ago
      
      Their SECURITY cred is fantastic.
      Privacy, not so much. How many hundreds of millions have they been fined for “incognito mode” in chrome being a blatant lie?
      
      [-]
      - Razengan 19 hours ago
        
        > Their SECURITY cred is fantastic.
        In a world where Android vulnerabilities and exploits don't exist
    - moffkalast 8 hours ago
      
      Google's most profitable branch is adsense, they don't need breaches for them to have privacy issues given that elephant sized conflict of interest.
    - laurex 1 day ago
      
      If you consider "privacy" to be 'a giant corporation tracks every bit of possible information about you and everyone else'?
      
      [-]
      - xnx 1 day ago
        
        OpenAI is running ads. Do you think they'll track less?
    - Razengan 1 day ago
      
      They don't even let you have multiple chats if you disable their "App Activity" or whatever (wtf is with that ass naming? they don't even have a "Privacy" section in their settings the last time I checked)
      and when I swap back into the Gemini app on my iPhone after a minute or so the chat disappears. and other weird passive-aggressive take-my-toys-away behavior if you don't bare your body and soul to Googlezebub.
      ChatGPT and Grok work so much better without accounts or with high privacy settings.
  - ainch 1 day ago
    
    I find Gemini's web page much snappier to use than ChatGPT - I've largely swapped to it for most things except more agentic tasks.
  - alexpotato 1 day ago
    
    > Gemini's UX ... is the worst of all the AI apps
    Been using Gemini + OpenCode for the past couple weeks.
    Suddenly, I get a "you need a Gemini Access Code license" error but when you go to the project page there is no mention of this or how to get the license.
    You really feel the "We're the phone company and we don't care. Why? Because we don't have to." [0] when you use these Google products.
    PS for those that don't get the reference: US phone companies in the 1970s had a monopoly on local and long distance phone service. Similar to Google for search/ads (really a "near" monopoly but close enough).
    0 - https://vimeo.com/355556831
  - jonathanstrange 1 day ago
    
    You mean AI Studio or something like that, right? Because I can't see a problem with Google's standard chat interface. All other AI offerings are confusing both regarding their intended use and their UX, though, I have to concur with that.
    
    [-]
    - ergonaught 1 day ago
      
      The lack of "projects" alone makes their chat interface really unpleasant compared to ChatGPT and Claude.
    - wiseowise 1 day ago
      
      No projects, completely forgets context mid dialog, mediocre responses even on thinking, research got kneecapped somehow and is completely uses now, uses propaganda Russian videos as the search material (what’s wrong with you, Google?), janky on mobile, consumes GIGABYTES of RAM on web (seriously, what the fuck?). Left a couple of tabs over night, Mac is almost complete frozen because 10 tabs consumed 8 GBs of RAM doing nothing. It’s a complete joke.
      
      [-]
      - jonathanstrange 14 hours ago
        
        Fair enough. I'm always astonished how different experiences are because mine is the complete opposite. I almost solely use it for help with Go and Javascript programming and found Gemini Pro to be more useful than any other model. ChatGPT was the worst offender so far, completely useless, but Claude has also been suboptimal for my use cases.
        I guess it depends a lot on what you use LLMs for and how they are prompted. For example, Gemini fails the simple "count from 1 to 200 in words" test whereas Claude does it without further questions.
        Another possible explanation would be that processing time is distributed unevenly across the globe and companies stay silent about this. Maybe depending on time zones?
    - xnx 1 day ago
      
      AI Studio is also significantly improved as of yesterday.
  - uxhoiuewfhhiu 1 day ago
    
    Gemini is completely unusable in VS Code. It's rated 2/5 stars, pathetic: https://marketplace.visualstudio.com/items?itemName=Google.g...
    Requests regularly time out, the whole window freezes, it gets stuck in schizophrenic loops, edits cannot be reverted and more.
    It doesn't even come close to Claude or ChatGPT.
    
    [-]
    - xnx 20 hours ago
      
      Once Google launched Antigravity, I stopped using VS Code.
    - Razengan 20 hours ago
      
      Smart idea to say anything against Google here from a throwaway account, I'm sitting in negative karma for that :')
      
      [-]
      - surajrmal 18 hours ago
        
        Anti Google comments do pretty well on average. It's a popular sentiment. However, low effort comments don't.
- amunozo 1 day ago
  
  Those black nazis in the first image model were a cause of inside trading.
- dboreham 19 hours ago
  
  I'm leery to use a Google product in light of their history of discontinuing services. It'd have to be significantly better than a similar product from a committed competitor.
- naasking 1 day ago
  
  Google is still behind the largest models I'd say, in real world utility. Gemini 3 Pro still has many issues.
- oofbey 19 hours ago
  
  They were behind. Way behind. But they caught up.
- dfdsf2 1 day ago
  
  Trick? Lol not a chance. Alphabet is a pure play tech firm that has to produce products to make the tech accessible. They really lack in the latter and this is visible when you see the interactions of their VP's. Luckily for them, if you start to create enough of a lead with the tech, you get many chances to sort out the product stuff.
  
  [-]
  - dakolli 1 day ago
    
    You sound like Russ Hanneman from SV
    
    [-]
    - s-kymon 1 day ago
      
      It's not about how much you earn. It's about what you're worth.
- impulser_ 21 hours ago
  
  Don't let the benchmarks fool you. Gemini models are completely useless not matter how smart they are. Google still hasn't figure out tool calling and making the model follow instructions. They seem to only care about benchmarking and being the most intelligent model on paper. This has been a problem of Gemini since 1.0 and they still haven't fixed it.
  Also the worst model in terms of hallucinations.
  
  [-]
  - estearum 21 hours ago
    
    Disagree.
    Claude Code is great for coding, Gemini is better than everything else for everything else.
    
    [-]
    - johnfn 16 hours ago
      
      What is "everything else" in your view? Just curious -- I really only seriously use models for coding, so I am curious what I am missing.
      
      [-]
      - viking123 13 hours ago
        
        Role-playing but Claude is as bad, same censored garbage with the CEO wanting to be your dad. Grok is best for everything else by far.
    - impulser_ 21 hours ago
      
      Are you using Gemini model itself or using the Gemini App? They are different.
      
      [-]
      - estearum 20 hours ago
        
        Both
    - amelius 12 hours ago
      
      And mathematics?
sigmar 1 day ago

Here is the methodologies for all the benchmarks: https://storage.googleapis.com/deepmind-media/gemini/gemini_...
The arc-agi-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"
>Submit a solution which scores 85% on the ARC-AGI-2 private evaluation set and win $700K. https://arcprize.org/guide#overview

[-]
- gs17 1 day ago
  
  Interestingly, the title of that PDF calls it "Gemini 3.1 Pro". Guess that's dropping soon.
  
  [-]
  - sigmar 1 day ago
    
    I looked at the file name but not the document title (specifically because I was wondering if this is 3.1). Good spot.
    edit: they just removed the reference to "3.1" from the pdf
    
    [-]
    - josalhor 1 day ago
      
      I think this is 3.1 (3.0 Pro with the RL improv of 3.0 Flash). But they probably decided to market it as Deep Think because why not charge more for it.
      
      [-]
      - WarmWash 1 day ago
        
        The Deep Think moniker is for parallel compute models though, not long CoT like pro models.
        It's possible though that deep think 3 is running 3.1 models under the hood.
  - staticman2 1 day ago
    
    That's odd considering 3.0 is still labeled a "preview" release.
    
    [-]
    - ainch 1 day ago
      
      I think it'll be 3.1 by the time it's labelled GA - they said after 3.0 launch that they figured out new RL methods for Flash that the Pro model hasn't benefitted from.
  - WarmWash 1 day ago
    
    The rumor was that 3.1 was today's drop
    
    [-]
    - losvedir 1 day ago
      
      Where are these rumors floating around?
      
      [-]
      - beauzero 1 day ago
        
        One of many https://x.com/synthwavedd/status/2021983382314660075
- thadk 16 hours ago
  
  Huh, so if a China-based lab takes ARC-AGI-2 on the new year, then they can say they had just-shy of a solution anyway.
- riku_iki 1 day ago
  
  > If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"
  They never will do on private set, because it would mean its being leaked to google.
simianwords 1 day ago

OT but my intuition says that there’s a spectrum
- non thinking models
- thinking models
- best of N models like deep think an gpt pro
Each one is of a certain computational complexity. Simplifying a bit, I think they map to - linear, quadratic and n^3 respectively.
I think there are certain class of problems that can’t be solved without thinking because it necessarily involves writing in a scratchpad. And same for best of N which involves exploring.
Two open questions
1) what’s the higher level here, is there a 4th option?
2) can a sufficiently large non thinking model perform the same as a smaller thinking?

[-]
- futureshock 1 day ago
  
  I think step 4 is the agent swarm. Manager model gets the prompt and spins up a swarm of looping subagents, maybe assigns them different approaches or subtasks, then reviews results, refines the context files and redeploys the swarm on a loop till the problem is solved or your credit card is declined.
  
  [-]
  - jasondigitized 18 hours ago
    
    So Google Answers is coming back?!?!?!
  - simianwords 1 day ago
    
    i think this is the right answer
    edit: i don't know how this is meaningfully different from 3
- NitpickLawyer 1 day ago
  
  > best of N models like deep think an gpt pro
  Yeah, these are made possible largely by better use at high context lengths. You also need a step that gathers all the Ns and selects the best ideas / parts and compiles the final output. Goog have been SotA at useful long context for a while now (since 2.5 I'd say). Many others have come with "1M context", but their usefulness after 100k-200k is iffy.
  What's even more interesting than maj@n or best of n is pass@n. For a lot of applications youc an frame the question and search space such that pass@n is your success rate. Think security exploit finding. Or optimisation problems with quick checks (better algos, kernels, infra routing, etc). It doesn't matter how good your pass@1 or avg@n is, all you care is that you find more as you spend more time. Literally throwing money at the problem.
- andy12_ 10 hours ago
  
  The difference between thinking and no-thinking models can be a little blurry. For example, when doing coding tasks Anthropic models with no-thinking mode tend to use a lot of comments to act as a scratchpad. In contrast, models in thinking mode don't do this because they don't need to.
  Ultimately, the only real difference between no-thinking and thinking models is the amount of tokens used to reach the final answer. Whether those extra scratchpad tokens are between <think></think> tags or not doesn't really matter.
- mnicky 1 day ago
  
  > can a sufficiently large non thinking model perform the same as a smaller thinking?
  Models from Anthropic have always been excellent at this. See e.g. https://imgur.com/a/EwW9H6q (top-left Opus 4.6 is without thinking).
  
  [-]
  - simianwords 1 day ago
    
    its interesting that opus 4.6 added a paramter to make it think extra hard.
Scene_Cast2 1 day ago

It's a shame that it's not on OpenRouter. I hate platform lock-in, but the top-tier "deep think" models have been increasingly requiring the use of their own platform.

[-]
- raybb 1 day ago
  
  OpenRouter is pretty great but I think litellm does a very good job and it's not a platform middle man, just a python library. That being said, I have tried it with the deep think models.
  https://docs.litellm.ai/docs/
  
  [-]
  - imiric 1 day ago
    
    Part of OpenRouter's appeal to me is precisely that it is a middle man. I don't want to create accounts on every provider, and juggle all the API keys myself. I suppose this increases my exposure, but I trust all these providers and proxies the same (i.e. not at all), so I'm careful about the data I give them to begin with.
    
    [-]
    - octoberfranklin 1 day ago
      
      Unfortunately that's ending with mandatory-BYOK from the model vendors. They're starting to require that you BYOK to force you through their arbitrary+capricious onboarding process.
      
      [-]
      - potsandpans 19 hours ago
        
        Will still be able to use open weights models, which is what I use openrouter primarily for anyway
- chr15m 23 hours ago
  
  The golden age is over.
anematode 1 day ago

It found a small but nice little optimization in Stockfish: https://github.com/official-stockfish/Stockfish/pull/6613
Previous models including Claude Opus 4.6 have generally produced a lot of noise/things that the compiler already reliably optimizes out.
jetter 1 day ago

it is interesting that the video demo is generating .stl model. I run a lot of tests of LLMs generating OpenSCAD code (as I have recently launched https://modelrift.com text-to-CAD AI editor) and Gemini 3 family LLMs are actually giving the best price-to-performance ratio now. But they are very, VERY far from being able to spit out a complex OpenSCAD model in one shot. So, I had to implement a full fledged "screenshot-vibe-coding" workflow where you draw arrows on 3d model snapshot to explain to LLM what is wrong with the geometry. Without human in the loop, all top tier LLMs hallucinate at debugging 3d geometry in agentic mode - and fail spectacularly.

[-]
- mchusma 1 day ago
  
  Hey, my 9 year old son uses modelrift for creating things for his 3d printer, its great! Product feedback: 1. You should probably ask me to pay now, I feel like i've used it enough. 2. You need a main dashboard page with a history of sessions. He thought he lost a file and I had to dig in the billing history to get a UUID I thought was it and generate the url. I would say naming sessions is important, and could be done with small LLM after the users initial prompt. 3. I don't think I like the default 3d model in there once I have done something, blank would be better.
  We download the stl and import to bambu. Works pretty well. A direct push would be nice, but not necessary.
  
  [-]
  - jetter 11 hours ago
    
    Thank you for this feedback, very valuable! I am using Bambu as well - perfect to get things printed without much hassle. Not sure if direct push to printer is possible though, as their ecosystem looks pretty closed. It would be a perfect use case - if we could use ModelRift to design a model on a mobile phone and push to print..
  - jetter 9 hours ago
    
    proper sessions page is live: https://modelrift.com/changelog/v0-3-2
    let me know how it goes!
- gundmc 1 day ago
  
  Yes, I've been waiting for a real breakthrough with regard to 3D parametric models and I don't think think this is it. The proprietary nature of the major players (Creo, Solidworks, NX, etc) is a major drag. Sure there's STP, but there's too much design intent and feature loss there. I don't think OpenSCAD has the critical mass of mindshare or training data at this point, but maybe it's the best chance to force a change.
- venusenvy47 18 hours ago
  
  I was looking for your GitHub, but the link on the homepage is broken: https://github.com/modelrift
  
  [-]
  - jetter 11 hours ago
    
    right, I need to fix this one
- storystarling 14 hours ago
  
  yes, i had the same experience. As good as LLMs are now at coding - it seems they are still far away from being useful in vision dominated engineering tasks like CAD/design. I guess it is a training data problem. Maybe world models / artificial data can help here?
- lern_too_spel 1 day ago
  
  If you want that to get better, you need to produce a 3d model benchmark and popularize it. You can start with a pelican riding a bicycle with working bicycle.
  
  [-]
  - ponyous 3 hours ago
    
    I am building pretty much the same product as OP, and have a pretty good harness to test LLMs. In fact I have run a tons of tests already. It’s currently aimed for my own internal tests, but making something that is easier to digest should be a breeze. If you are curious: https://grandpacad.com/evals
  - jetter 11 hours ago
    
    building a benchmark is a great idea, thanks, maybe I will have a couple of days to spend on this soon
the_king 17 hours ago

I just tested it on a very difficult Raven matrix, that the old version of DeepThink, as well as GPT 5.2 Pro, Claude Opus 4.6, and pretty much every other model failed at.
This version of DeepSeek got it first try. Thinking time was 2 or 3 minutes.
The visual reasoning of this class of Gemini models is incredibly impressive.

[-]
- ronyfadel 11 hours ago
  
  Deep Think not DeepSeek
Decabytes 1 day ago

Gemini has always felt like someone who was book smart to me. It knows a lot of things. But if you ask it do anything that is offscript it completely falls apart

[-]
- dwringer 1 day ago
  
  I strongly suspect there's a major component of this type of experience being that people develop a way of talking to a particular LLM that's very efficient and works well for them with it, but is in many respects non-transferable to rival models. For instance, in my experience, OpenAI models are remarkably worse than Google models in basically any criterion I could imagine; however, I've spent most of my time using the Google ones and it's only during this time that the differences became apparent and, over time, much more pronounced. I would not be surprised at all to learn that people who chose to primarily use Anthropic or OpenAI models during that time had an exactly analogous experience that convinced them their model was the best.
  
  [-]
  - dboreham 19 hours ago
    
    We train the AI. The AI then trains us.
- esafak 1 day ago
  
  I'd rather say it has a mind of its own; it does things its way. But I have not tested this model, so they might have improved its instruction following.
  
  [-]
  - vkazanov 1 day ago
    
    Well, one thing i know for sure: it reliably misplaces parentheses in lisps.
    
    [-]
    - esafak 1 day ago
      
      Clearly, the AI is trying to steer you towards the ML family of languages for its better type system, performance, and concurrency ;)
- piyh 20 hours ago
  
  I made offmetaedh.com with it. Feels pretty great to me.
Metacelsus 1 day ago

According to benchmarks in the announcement, healthily ahead of Claude 4.6. I guess they didn't test ChatGPT 5.3 though.
Google has definitely been pulling ahead in AI over the last few months. I've been using Gemini and finding it's better than the other models (especially for biology where it doesn't refuse to answer harmless questions).

[-]
- CuriouslyC 1 day ago
  
  Google is way ahead in visual AI and world modelling. They're lagging hard in agentic AI and autonomous behavior.
- throwup238 1 day ago
  
  The general purpose ChatGpt 5.3 hasn’t been released yet, just 5.3-codex.
- neilellis 1 day ago
  
  It's ahead in raw power but not in function. Like it's got the worlds fast engine but one gear! Trouble is some benchmarks only measure horse power.
  
  [-]
  - NitpickLawyer 1 day ago
    
    > Trouble is some benchmarks only measure horse power.
    IMO it's the other way around. Benchmarks only measure applied horse power on a set plane, with no friction and your elephant is a point sphere. Goog's models have always punched over what benchmarks said, in real world use @ high context. They don't focus on "agentic this" or "specialised that", but the raw models, with good guidance are workhorses. I don't know any other models where you can throw lots of docs at it and get proper context following and data extraction from wherever it's at to where you'd need it.
- scarmig 1 day ago
  
  > especially for biology where it doesn't refuse to answer harmless questions
  Usually, when you decrease false positive rates, you increase false negative rates.
  Maybe this doesn't matter for models at their current capabilities, but if you believe that AGI is imminent, a bit of conservatism seems responsible.
- nkzd 1 day ago
  
  Google models and CLI harness feels behind in agentic coding compared OpenAI and Antrophic
- Davidzheng 1 day ago
  
  I gather that 4.6 strengths are in long context agentic workflows? At least over Gemini 3 pro preview, opus 4.6 seems to have a lot of advantages
  
  [-]
  - verdverm 1 day ago
    
    It's a giant game of leapfrog, shift or stretch time out a bit and they all look equivalent
- simianwords 1 day ago
  
  The comparison should be with GPT 5.2 pro which has been used successfully to solve open math problems.
aliljet 1 day ago

The problem here is that it looks like this is released with almost no real access. How are people using this without submitting to a $250/mo subscription?

[-]
- mohsen1 10 hours ago
  
  I have some very difficult to debug bugs that Opus 4.6 is failing at. Planning to pay $250 to see if it can solve those.
- andxor 1 day ago
  
  People are paying for the subscriptions.
- tootie 1 day ago
  
  I gather this isn't intended a consumer product. It's for academia and research institutions.
sinuhe69 1 day ago

I'm pretty certain that DeepMind (and all other labs) will try their frontier (and even private) models on First Proof [1].
And I wonder how Gemini Deep Think will fare. My guess is that it will get half the way on some problems. But we will have to take an absence as a failure, because nobody wants to publish a negative result, even though it's so important for scientific research.
[1] https://1stproof.org/

[-]
- blinding-streak 14 hours ago
  
  As a non-mathematician, reading these problems feels like reading a completely foreign language.
  https://arxiv.org/html/2602.05192v1
  
  [-]
  - ky3 5 hours ago
    
    LLM to the rescue. Feed in a problem and ask it to explain it to a layperson. Also feed in sentences that remain obscure and ask to unpack.
- zozbot234 1 day ago
  
  The 1st proof original solutions are due to be published in about 24h, AIUI.
  
  [-]
  - energy123 18 hours ago
    
    Feels like an unforced blunder to make the time window so short after going to so much effort and coming up with something so useful.
    
    [-]
    - sinuhe69 12 hours ago
      
      5 days for Ai is by no mean short! If it can solve it, it would need perhaps 1-2 hours. If it can not, 5 days continuous running would produce gibberish only. We can safely assume that such private models will run inferences entirely on dedicated hardware, sharing with nobody. So if they could not solve the problems, it's not due to any artificial constraint or lack of resources, far from it.
      The 5 days window, however, is a sweat spot because it likely prevents cheating by hiring a math PhD and feed the AI with hints and ideas.
      
      [-]
      - energy123 12 hours ago
        
        5 days is short for memetic propagation on social media to reach everyone who has their own harness and agentic setup that wants to have a go.
        
        [-]
        
        zozbot234 9 hours ago
        
        That's not really how it works, the recent Erdos proofs in Lean were done by a specialized proprietary model (Aristotle by Harmonic) that's specifically trained for this task. Normal agents are not effective.
        
        [-]
        
        energy123 8 hours ago
        
        Why did you omit the other AI-generated Erdos proofs not done by a proprietary model, which occurred on timescales stretched across significantly longer time than 5 days?
        
        [-]
        
        zozbot234 8 hours ago
        
        Those were not really "proofs" by the standard of 1stproof. The only way an AI can possibly convince an unsympathetic peer reviewer that its proof is correct is to write it completely in a formal system like Lean. The so-called "proofs" done with GPT were half baked and required significant human input, hints, fixing after the fact etc. which is enough to disqualify them from this effort.
- octoberfranklin 1 day ago
  
  Really surprised that 1stproof.org was submitted three times and never made front page at HN.
  https://hn.algolia.com/?q=1stproof
  This is exactly the kind of challenge I would want to judge AI systems based on. It required ten bleeding-edge-research mathematicians to publish a problem they've solved but hold back the answer. I appreciate the huge amount of social capital and coordination that must have taken.
  I'm really glad they did it.
  
  [-]
  - lofaszvanitt 15 hours ago
    
    Of course it isn't made the front page. If something is promising they hunt it down, and when conquered they post about it. Lot of times the new category has much better results, than the default HN view.
mark_l_watson 23 hours ago

I feel like a luddite: unless I am running small local models, I use gemini-3-flash for almost everything: great for tool use, embedded use in applications, and Python agentic libraries, broad knowledge, good built in web search tool, etc. Oh, and it is fast and cheap.
I really only use gemini-3-pro occasionally when researching and trying to better understand something. I guess I am not a good customer for super scalers. That said, when I get home from travel, I will make a point of using Gemini 3 Deep Think for some practical research. I need a business card with the title "Old Luddite."

[-]
- sdeiley 12 hours ago
  
  3 Flash is criminally under appreciated for its performance/cost/speed trifecta. Absolutely in a category of its own.
simonw 1 day ago

The pelican riding a bicycle is excellent. I think it's the best I've seen.
https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/

[-]
- toraway 18 hours ago
  
  So, you've said multiple times in the past that you're not concerned about AI labs training for this specific test because if they did, it would be so obviously incongruous that you'd easily spot the manipulation and call them out.
  Which tbh has never really sat right with me, seemingly placing way too much confidence in your ability to differentiate organic vs. manipulated output in a way I don't think any human could be expected to.
  To me, this example is an extremely neat and professional SVG and so far ahead it almost seems too good to be true. But like with every previous model, you don't seem to have the slightest amount of skepticism in your review. I don't think I truly believe Google cheated here, but it's so good it does therefore make me question whether there could ever be an example of a pelican SVG in the future that actually could trigger your BS detector?
  I know you say it's just a fun/dumb benchmark that's not super important, but you're easily in the top 3 most well known AI "influencers" whose opinion/reviews about model releases carry a lot of weight, providing a lot of incentive with trillions of dollars flying around. Are you still not at all concerned by the amount of attention this benchmark receives now/your risk of unwittingly being manipulated?
  
  [-]
  - simonw 11 hours ago
    
    The other SVGs I tried from my private collection of prompts were all similarly impressive.
    
    [-]
    - buttered_toast 8 hours ago
      
      Is there a way you can showcase a few of these?
      
      [-]
      - simonw 7 hours ago
        
        Not without people later saying "you shared that on Hacker News last year clearly the AI labs are training for it now!"
        
        [-]
        
        buttered_toast 7 hours ago
        
        Couldn't you just make up new combinations, or new caveats indefinitely to mitigate that? It would be nice to see maybe 3-4 good examples for validation. I'd do it myself, but I don't have $200 to play around with this model.
        
        [-]
        
        simonw 6 hours ago
        
        Here's what it gave me for a kakapo on a skateboard https://gist.github.com/simonw/5e2041c32333effd090e3df42b64d...
        
        [-]
        
        buttered_toast 6 hours ago
        
        Thank you!
- tasuki 1 day ago
  
  Tbh they'd have to be absolutely useless at benchmarkmaxxing if they didn't include your pelican riding a bicycle...
- zozbot234 16 hours ago
  
  This benchmark outcome is actually really impressive given the difficulty of this task. It shows that this particular model manages to "think" coherently and maintain useful information in its context for what has to be an insane overall amount of tokens, likely across parallel "thinking" chains. Likely also has access to SVG-rendering tools and can "see" and iterate on the result via multimodal input.
- mikestaas 19 hours ago
  
  Wow. I wonder how it would do with pure CSS a la https://diana-adrianne.com/
- steve_adams_86 22 hours ago
  
  We've reached PGI
- nickthegreek 1 day ago
  
  I routinely check out the pelicans you post and I do agree, this is the best yet. It seemed to me that the wings/arms were such a big hangup for these generators.
- ramesh31 7 hours ago
  
  >"The pelican riding a bicycle is excellent. I think it's the best I've seen. https://simonwillison.net/2026/Feb/12/gemini-3-deep-think/"
  Yeah this is nuts. First real step-change we've seen since Claude 3.5 in '24.
- Manabu-eo 1 day ago
  
  How likely this problem is already on the training set by now?
  
  [-]
  - simonw 1 day ago
    
    If anyone trains a model on https://simonwillison.net/tags/pelican-riding-a-bicycle/ they're going to get some VERY weird looking pelicans.
    
    [-]
    - suddenlybananas 1 day ago
      
      Why would they train on that? Why not just hire someone to make a few examples.
      
      [-]
      - simonw 1 day ago
        
        I look forward to them trying. I'll know when the pelican riding a bicycle is good but the ocelot riding a skateboard sucks.
        
        [-]
        
        dontwannahearit 12 hours ago
        
        Would it not be better to have 100 such tests "Pelican on bicycle", "Tiger on stilts"..., and generate them all for every new model but only release a new one each time. That way you could show progression across all models, attempts at benchmaxxing would be more obvious.
        Given the crazy money and vying for supremacy among AI companies right now it does seem naive to belive that no attempt at better pelicans on bicycles is being made. You can argue "but I will know because of the quality of ocelots on skateboards" but without a back catalog of ocelots on skateboards to publish its one datapoint and leaves the AI companies with too much plausible deniability.
        The pelicans-on-bicycles is a bit of fun for you (and us!) but it has become a measure of the quality of models so its serious business for them.
        There is an assymetry of incentives and high risk you are being their useful idiot. Sorry to be blunt.
        
        [-]
        
        Applejinx 8 hours ago
        
        Or indeed do the Markov chain conceptual slip. Pelican on bicycle, badger on stool, tiger on acid. Pelican on bicycle is definitely cooked, though: people know it and it's talked about in language.
        
        suddenlybananas 1 day ago
        
        But they could just train on an assortment of animals and vehicles. It's the kind of relatively narrow domain where NNs could reasonably interpolate.
        
        [-]
        
        simonw 1 day ago
        
        The idea that an AI lab would pay a small army of human artists to create training data for $animal on $transport just to cheat on my stupid benchmark delights me.
        
        [-]
        
        suddenlybananas 1 day ago
        
        When you're spending trillions on capex, paying a couple of people to make some doodles in SVGs would not be a big expense.
        
        [-]
        
        WarmWash 7 hours ago
        
        I think no matter what happens with AI in the future, there will always be a subset of people with elaborate conspiracies about how it's all fake/a hoax.
        
        [-]
        
        suddenlybananas 4 hours ago
        
        I'm not saying it's a hoax. If it gets better because of that data, tant mieux, but we have to be clear eyed about what these models are actually doing. Especially when companies don't explain what they've done.
        
        simonw 1 day ago
        
        The embarrassment of getting caught doing that would be expensive.
        
        red75prime 19 hours ago
        
        Vetting them for the potential for whistleblowing might be a bit more involved. But conspiracy theories have an advantage because the lack of evidence is evidence for the theory.
        
        [-]
        
        toraway 16 hours ago
        
        Huh? AI labs are routinely spending millions to billions to various 3rd party contractors specializing in creating/labeling/verifying specialized content for pre/post-training.
        This would just be one more checkbox buried in hundreds of pages of requests, and compared to plenty of other ethical grey areas like copyright laundering with actual legal implications, leaking that someone was asked to create a few dozen pelican images seems like it would be at the very bottom of the list of reputational risks.
        
        [-]
        
        red75prime 15 hours ago
        
        How do you think who's in on that? Not only pelicans, I mean, the whole thing. CEOs, top researchers, select mathematicians, congressmen? Does China participate in maintaining the bubble?
        I, myself, prefer the universal approximation theorem and empirical finding that stochastic gradient descent is good enough (and "no 'magic' in the brain", of course).
        
        [-]
        
        usefulposter 10 hours ago
        
        Well, since we're all talking about sourcing training material to "benchmaxx" for social proof, and not litigating the whole "AI bubble" debate, just the entire cottage industry of data curation firms:
        https://scale.com/data-engine
        https://www.appen.com/llm-training-data
        https://www.cogitotech.com/generative-ai/
        https://www.telusdigital.com/solutions/data-for-ai-training/...
        https://www.nexdata.ai/industries/generative-ai
        ---
        P.S. Google Comms would have been consulted re putting a pelican in the I/O keynote :-)
        https://x.com/simonw/status/1924909405906338033
        
        [-]
        
        red75prime 9 hours ago
        
        Cool. At least they are working across the board and benchmaxing random things like the theory of mind.
  - throwup238 1 day ago
    
    For every combination of animal and vehicle? Very unlikely.
    The beauty of this benchmark is that it takes all of two seconds to come up with your own unique one. A seahorse on a unicycle. A platypus flying a glider. A man’o’war piloting a Portuguese man of war. Whatever you want.
    
    [-]
    - recursive 1 day ago
      
      No, not every combination. The question is about the specific combination of a pelican on a bicycle. It might be easy to come up with another test, but we're looking at the results from a particular one here.
      
      [-]
      - ebonnafoux 10 hours ago
        
        You can easily make a RLAIF loop.
        - Take a list of n animals * m vehicule
        - Ask a LLM to generate SVG for this n*m options
        - Generate png from the svg
        - Ask a Model with vision to grade the result
        - Change your weight accordingly
        No need to human to draw the dataset, no need of human to evaluate.
      - svara 1 day ago
        
        More likely you would just train for emitting svg for some description of a scene and create training data from raster images.
        
        [-]
        
        recursive 23 hours ago
        
        None of this works if the testers are collaborating with the trainers. The tests ostensibly need to be arms-length from the training. If the trainers ever start over-fitting to the test, the tester would come up with some new test secretly.
  - zarzavat 1 day ago
    
    You can always ask for a tyrannosaurus driving a tank.
  - verdverm 1 day ago
    
    I've heard it posited that the reason the frontier companies are frontier is because they have custom data and evals. This is what I would do too
- enraged_camel 1 day ago
  
  Is there a list of these for each model, that you've catalogued somewhere?
  
  [-]
  - simonw 10 hours ago
    
    At the moment that's mostly my tag page here but I really need to formalize it: https://simonwillison.net/tags/pelican-riding-a-bicycle/
- throwup238 1 day ago
  
  The reflection of the sun in the water is completely wrong. LLMs are still useless. (/s)
  
  [-]
  - margalabargala 1 day ago
    
    It's not actually, look up some photos of the sun setting over the ocean. Here's an example:
    https://stockcake.com/i/sunset-over-ocean_1317824_81961
    
    [-]
    - throwup238 1 day ago
      
      That’s only if the sun is above the horizon entirely.
      
      [-]
      - margalabargala 1 day ago
        
        No, it's not.
        https://stockcake.com/i/serene-ocean-sunset_1152191_440307
        
        [-]
        
        throwup238 1 day ago
        
        Yes, it is. In that photo the sun is clearly above the horizon, the bottom half is just obscured by clouds.
- saberience 1 day ago
  
  Do you have to still keep trying to bang on about this relentlessly?
  It was sort of humorous for the maybe first 2 iterations, now it's tacky, cheesy, and just relentless self-promotion.
  Again, like I said before, it's also a terrible benchmark.
  
  [-]
  - odiroot 8 hours ago
    
    It's HN's Carthago delenda est moment.
  - jeanloolz 1 day ago
    
    I'll agree to disagree. In any thread about a new model, I personally expect the pelican comment to be out there. It's informative, ritualistic and frankly fun. Your comment however, is a little harsh. Why mad?
  - simonw 1 day ago
    
    It being a terrible benchmark is the bit.
  - Davidzheng 1 day ago
    
    Eh, i find it more of a not very informative but lighthearted commentary
- deron12 1 day ago
  
  It's worth noting that you mean excellent in terms of prior AI output. I'm pretty sure this wouldn't be considered excellent from a "human made art" perspective. In other words, it's still got a ways to go!
  Edit: someone needs to explain why this comment is getting downvoted, because I don't understand. Did someone's ego get hurt, or what?
  
  [-]
  - gs17 1 day ago
    
    It depends, if you meant from a human coding an SVG "manually" the same way, I'd still say this is excellent (minus the reflection issue). If you meant a human using a proper vector editor, then yeah.
    
    [-]
    - fvdessen 1 day ago
      
      maybe you're a pro vector artist but I couldn't create such a cool one myself in illustrator tbh
  - dfdsf2 1 day ago
    
    Indeed. And when you factor in the amount invested... yeah it looks less impressive. The question is how much more money needs to be invested to get this thing closer to reality? And not just in this instance. But for any instance e.g. a seahorse on a bike.
- dfdsf2 1 day ago
  
  Highly disagree.
  I was expecting something more realistic... the true test of what you are doing is how representative is the thing in relation to the real world. E.g. does the pelican look like a pelican as it exists in reality? This cartoon stuff is cute but doesnt pass muster in my view.
  If it doesn't relate to the real world, then it most likely will have no real effect on the real economy. Pure and simple.
  
  [-]
  - chriswarbo 1 day ago
    
    I disagree. The task asks for an SVG; which is a vector format associated with line drawings, clipart and cartoons. I think it's good that models are picking up on that context.
    In contrast, the only "realistic" SVGs I've seen are created using tools like potrace, and look terrible.
    I also think the prompt itself, of a pelican on bicycle, is unrealistic and cartoonish; so making a cartoon is a good way to solve the task.
  - peaseagee 1 day ago
    
    The request is for an SVG, generally _not_ the format for photorealistic images. If you want to start your own benchmark, feel free to ask for a photorealistic JPEG or PNG of a pelican riding a bicycle. Could be interesting to compare and contrast, honestly.
GorbachevyChase 3 hours ago

Is xAI out of the race? I’m not on a subscription, but their Ara voice model is my favorite. Gemini on iOS is pretty terrible in voice mode. I suspect because they have aggressive throttling instructions to keep output tokens low.
siva7 1 day ago

I can't shake of the feeling that Googles Deep Think Models are not really different models but just the old ones being run with higher number of parallel subagents, something you can do by yourself with their base model and opencode.

[-]
- Davidzheng 1 day ago
  
  And after i do that, how do i combine the output of 1000 subagents into one output? (Im not being snarky here, i think it's a nontrivial problem)
  
  [-]
  - mattlondon 1 day ago
    
    You just pipe it to another agent to do the reduce step (i.e. fan-in) of the mapreduce (fan-out)
    It's agents all the way down.
    
    [-]
    - machiaweliczny 6 hours ago
      
      No it's not because cost is much lower. They do some kind of speculative decoding in monte-carlo way If I had to guess as humans do it this way is my hunch. What I mean it's kinda the way you describe but much more efficient.
  - tifik 1 day ago
    
    The idea is that each subagent is focused on a specific part of the problem and can use its entire context window for a more focused subtask than the overall one. So ideally the results arent conflicting, they are complimentary. And you just have a system that merges them.. likely another agent.
    
    [-]
    - int_19h 18 hours ago
      
      Claude Cowork does this by default and you can see how exactly it is coordinating them etc.
  - jonathanstrange 1 day ago
    
    Start with 1024 and use half the number of agents each turn to distill the final result.
- energy123 18 hours ago
  
  They could do it this way: generate 10 reasoning traces and then every N tokens they prune the 9 that have the lowest likelihood, and continue from the highest likelihood trace.
  This is a form of task-agnostic test time search that is more general than multi agent parallel prompt harnesses.
  10 traces makes sense because ChatGPT 5.2 Pro is 10x more expensive per token.
  That's something you can't replicate without access to the network output pre token sampling.
czhu12 23 hours ago

It’s incredible how fast these models are getting better. I thought for sure a wall would be hit, but these numbers smashes previous benchmarks. Anyone have any idea what the big unlock that people are finding now?

[-]
- fsh 22 hours ago
  
  Companies are optimizing for all the big benchmarks. This is why there is so little correlation between benchmark performance and real world performance now.
  
  [-]
  - czhu12 21 hours ago
    
    Isn’t there? I mean, Claude code has been my biggest usecase and it basically one shots everything now
    
    [-]
    - fsh 21 hours ago
      
      Yes, LLMs have become extremely good at coding (not software engineer though). But try using them for anything original that cannot be adapted from GitHub and Stack Overflow. I haven't seen much improvement at all at such tasks.
      
      [-]
      - WarmWash 7 hours ago
        
        No shot, their classic engineering ability has exploded too.
        The amount of information available online about optics is probably <0.001% of what is available for software, and they can just breeze through modeling solutions. A year ago was immediate face-planting.
        The gains are likely coming from exactly where they say they are coming from - scaling compute.
      - dboreham 19 hours ago
        
        Strongly disagree with this. And I'm going to provide as much evidence as you did.
ramshanker 1 day ago

Do we get any model architecture details like parameter size etc.? Few months back, we used to talk more on this, now it's mostly about model capabilities.

[-]
- Davidzheng 1 day ago
  
  I'm honestly not sure what you mean? The frontier labs have kept arch as secrets since gpt3.5
  
  [-]
  - willis936 1 day ago
    
    At the very least gemini 3's flyer claims 1T parameters.
deviation 11 hours ago

I'm impressed with the Arc-AGI-2 results - though readers beware... They achieved this score at a cost of $13.62 per task.
For context, Opus 4.6's best score is 68.8% - but at a cost of $3.64 per task.
neilellis 1 day ago

Less than a year to destroy Arc-AGI-2 - wow.

[-]
- Davidzheng 1 day ago
  
  I unironically believe that arc-agi-3 will have a introduction to solved time of 1 month
  
  [-]
  - ACCount37 1 day ago
    
    Not very likely?
    ARC-AGI-3 has a nasty combo of spatial reasoning + explore/exploit. It's basically adversarial vs current AIs.
    
    [-]
    - Davidzheng 17 hours ago
      
      We will see at the end of April right? It's more of a guess than a strongly held conviction--but I see models improving rapidly at long horizon tasks so I think it's possible. I think a benchmark which can survive a few months (maybe) would be if it genuinely tested long time-frame continual learning/test-time learning/test-time posttraining (idk honestly the differences b/t these).
      But i'm not sure how to give such benchmarks. I'm thinking of tasks like learning a language/becoming a master at chess from scratch/becoming a skill artists but where the task is novel enough for the actor to not be anywhere close to proficient at beginning--an example which could be of interest is, here is a robot you control, you can make actions, see results...become proficient at table tennis. Maybe another would be, here is a new video game, obtain the best possible 0% speedrun.
  - etyhhgfff 1 day ago
    
    The AGI bar has to be set even higher, yet again.
    
    [-]
    - red75prime 18 hours ago
      
      And that's the way it should be. We're past the "Look! It can talk! How cute!" stage. AGI should be able to deal with any problem a human can.
  - dakolli 1 day ago
    
    wow solving useless puzzles, such a useful metric!
    
    [-]
    - esafak 1 day ago
      
      How is spatial reasoning useless??
- modeless 1 day ago
  
  It's still useful as a benchmark of cost/efficiency.
- XCSme 1 day ago
  
  But why only a +0.5% increase for MMMU-Pro?
  
  [-]
  - kingstnap 1 day ago
    
    Its possibly label noise. But you can't tell from a single number.
    You would need to check to see if everyone is having mistakes on the same 20% or different 20%. If its the same 20% either those questions are really hard, or they are keyed incorrectly, or they aren't stated with enough context to actually solve the problem.
    It happens. Old MMLU non pro had a lot of wrong answers. Simple things like MNIST have digits labeled incorrect or drawn so badly its not even a digit anymore.
  - kenjackson 1 day ago
    
    Everyone is already at 80% for that one. Crazy that we were just at 50% with GPT-4o not that long ago.
    
    [-]
    - XCSme 22 hours ago
      
      But 80% sounds far from good enough, that's 20% error rate, unusable in autonomous tasks. Why stop at 80%? If we aim for AGI, it should 100% any benchmark we give.
      
      [-]
      - Davidzheng 17 hours ago
        
        I'm not sure the benchmark is high enough quality that >80% of problems are well-specified & have correct labels tbh. (But I guess this question has been studied for these benchmarks)
      - kenjackson 21 hours ago
        
        Are humans 100%?
        
        [-]
        
        XCSme 21 hours ago
        
        If they are knowledgeable enough and pay attention, yes. Also, if they are given enough time for the task.
        But the idea of automation is to make a lot fewer mistakes than a human, not just to do things faster and worse.
        
        [-]
        
        kenjackson 19 hours ago
        
        Actually faster and worse is a very common characterization of a LOT of automation.
        
        [-]
        
        XCSme 14 hours ago
        
        That's true.
        The problem is that if the automation breaks at any point, the entire system fails. And programming automations are extremely sensitive to minor errors (i.e. a missing semicolon).
        AI does have an interesting feature though, it tends to self-healing in a way, when given tools access and a feedback loop. The only problem is that self-healing can incorrectly heal errors, then the final reault will be wrong in hard-to-detect ways.
        So the more wuch hidden bugs there are, the nore unexpectedly the automations will perform.
        I still don't trust current AI for any tasks more than data parsing/classification/translation and very strict tool usage.
        I don't beleive in the full-assistant/clawdbot usage safety and reliability at this time (it might be good enough but the end of the year, but then the SWE bench should be at 100%).
- saberience 1 day ago
  
  It's a useless meaningless benchmark though, it just got a catchy name, as in, if the models solve this it means they have "AGI", which is clearly rubbish.
  Arc-AGI score isn't correlated with anything useful.
  
  [-]
  - Legend2440 1 day ago
    
    It's correlated with the ability to solve logic puzzles.
    It's also interesting because it's very very hard for base LLMs, even if you try to "cheat" by training on millions of ARC-like problems. Reasoning LLMs show genuine improvement on this type of problem.
  - jabedude 1 day ago
    
    how would we actually objectively measure a model to see if it is AGI if not with benchmarks like arc-AGI?
    
    [-]
    - WarmWash 1 day ago
      
      Give it a prompt like
      >can u make the progm for helps that with what in need for shpping good cheap products that will display them on screen and have me let the best one to get so that i can quickly hav it at home
      And get back an automatic coupon code app like the user actually wanted.
  - HDThoreaun 23 hours ago
    
    ARC-AGI 2 is an IQ test. IQ tests have been shown over and over to have predictive power in humans. People who score well on them tend to be more successful
    
    [-]
    - fsh 22 hours ago
      
      IQ tests only work if the participants haven't trained for them. If they do similar tests a few times in a row, scores increase a lot. Current LLMs are hyper-optimized for the particular types of puzzles contained in popular "benchmarks".
LoveMortuus 7 hours ago

I've been wondering for a while now: What would be the results if we had multiple LLMs run the same query and then use statistical analysis?

[-]
- neversupervised 7 hours ago
  
  Best of N is a very common technique already.
vessenes 1 day ago

Not trained for agentic workflows yet unfortunately - this looks like it will be fantastic when they have an agent friendly one. Super exciting.

[-]
- dakolli 1 day ago
  
  Its really weird how you all are begging to be replaced by llms, you think if agentic workflows get good enough you're going to keep your job? Or not have your salary reduced by 50%?
  If Agents get good enough it's not going to build some profitable startup for you (or whatever people think they're doing with the llm slot machines) because that implies that anyone else with access to that agent can just copy you, its what they're designed to do... launder IP/Copyright. Its weird to see people get excited for this technology.
  None of this good. We are simply going to have our workforces replaced by assets owned by Google, Anthropic and OpenAI. We'll all be fighting for the same barista jobs, or miserable factory jobs. Take note on how all these CEOs are trying to make it sound cool to "go to trade school" or how we need "strong American workers to work in factories".
  
  [-]
  - BeetleB 1 day ago
    
    > Its really weird how you all are begging to be replaced by llms, you think if agentic workflows get good enough you're going to keep your job? Or not have your salary reduced by 50%?
    The computer industry (including SW) has been in the business of replacing jobs for decades - since the 70's. It's only fitting that SW engineers finally become the target.
    
    [-]
    - dgb23 14 hours ago
      
      Is that really true? Software created an incredible amount of new types of jobs and markets.
    - lofaszvanitt 15 hours ago
      
      The most gullible workforce ever (FOSS), but seeing Youtube, half the planet is braindead for handing over their craft on a platter for mere dollars.
  - sgillen 1 day ago
    
    I think a lot of people assume they will become highly paid Agent orchestrators or some such. I don't think anyone really knows where things are heading.
    
    [-]
    - dakolli 21 hours ago
      
      Why would someone get paid well for this skill? Its not valuable at all.
      
      [-]
      - robryan 15 hours ago
        
        Highly valuable right now with how high leverage it can make a good engineer. Who knows for how long.
  - ergonaught 1 day ago
    
    Most folks don't seem to think that far down the line, or they haven't caught on to the reality that the people who actually make decisions will make the obvious kind of decisions (ex: fire the humans, cut the pay, etc) that they already make.
    
    [-]
    - blibble 1 day ago
      
      they think they're going to be the person making that decision
      but forgot there's likely someone above them making exactly the same one about them
  - timeattack 1 day ago
    
    I agree with you and have similar thoughts (maybe, unfortunately for me). I personally know people who outsource not just their work, but also their life to LLMs, and reading their exciting comments makes me feel a mix of cringe, fomo and dread. But what is the engame for me and you likes, when we finally would be evicted from our own craft? Stash money while we still can, watching 'world crash and burn', and then go and try to ascend in some other, not yet automated craft?
    
    [-]
    - dakolli 1 day ago
      
      Yeah, that's a good question that I can't stop thinking about. I don't really enjoy much else other than building software, its genuinely my favorite thing to do. Maybe there will be a world where we aren't completely replaced, we have handmade clothes still after all that are highly coveted. I just worry its going to uproot more than just software engineering, theoretically it shouldn't be hard to replace all low hanging fruit in the realm of anything that deals with computer I/O. Previous generations of automation have created new opportunities for humans, but this seems mostly just as a means of replacement. The advent of mass transportation/vehicles created machines who needed mechanics (and eventually software), I don't see that happening in this new paradigm.
      I don't think that's going to make society very pleasant if everyone's fighting over the few remaining ways to make livelihood. People need to work to eat. I certainly don't see the capitalist class giving everyone UBI and letting us garden or paint for the rest of our lives. I worry we're likely going to end up in trenches or purged through some other means.
      
      [-]
      - vitaflo 17 hours ago
        
        If you want to know where it's headed, look at factory workers 40 years ago. Lots of people still work at factories today, they just aren't in the same places they were 40 years ago and now req an entirely different skill set.
        The largest ongoing expense of every company is labor and software devs are some of the highest paid labor on the planet. AI will eventually drive down wages for this class of workers most likely by shipping these jobs to people in other countries where labor is much cheaper. Just like factory work did.
        Enjoy the good times while they last (or get a job at an AI company).
  - vessenes 1 day ago
    
    I’m someone who’d like to deploy a lot more workers than I want to manage.
    Put another way, I’m on the capital side of the conversation.
    The good news for labor that has experience and creativity is that it just started costing 1/100,000 what it used to to get on that side of the equation.
    
    [-]
    - jimmymcgee73 22 hours ago
      
      If LLMs truly cause widespread replacement of labor, you’re screwed just as much as anyone else. If we hit say 40% unemployment do you think people will care you own your home or not? Do you think people will care you have currency or not? The best case outcome will be universal income and a pseudo utopia where everyone does ok. The “bad” scenario is widespread war.
      I am one of the “haves” and am not looking forward to the instability this may bring. Literally no one should.
      
      [-]
      - blibble 22 hours ago
        
        > I am one of the “haves” and am not looking forward to the instability this may bring. Literally no one should.
        these people always forget capitalism is permitted to exist by consent of the people
        if there's 40% unemployment it won't continue to exist, regardless of what the TV/tiktok/chatgpt says
        
        [-]
        
        ssstiktok 15 hours ago
        
        [flagged]
      - dakolli 21 hours ago
        
        Well he also thinks $10.00 in LLM tokens is equivalent to a $1mm labor budget. These are the same people who were grifting during the NFTs days, claiming they were the future of art.
    - dakolli 21 hours ago
      
      lmao, you are an idealistic moron. If llms can replace labor at 1/100k of the cost (lmfao) why are you looking to "deploy" more workers? So are you trying to say if I have $100.00 in tokens I have the equivalent of $10mm in labor potential.... What kind of statement is this?
      This is truly the dumbest statement I've ever seen on this site for too many reasons to list.
      You people sound like NFT people in 2021 telling people that they're creating and redefining art.
      Oh look peter@capital6.com is a "web3" guy. Its all the same grifters from the NFT days behaving the same way.
      
      [-]
      - vessenes 7 hours ago
        
        I upvoted your comment. Love the confidence. I’ve self funded full venture studios - so I have a pretty good take on costs of innovation. You might say I was poor at deploying innovation capital; you might be right!
        Anyway 100k is hyperbolic. But I’d argue just one order of magnitude. Claude max can do many things better than my last (really great) team, and is worse at some things - creative output, relationship building and conference attending most notably. It’s also much faster at the things it is good at. Like 20-50x faster than a person or team.
        If I had another venture studio I’d start with an agent first, and fill in labor in the gaps. The costs are wildly different.
        Back to you though - who hurt you? Your writing makes me think you are young. You have been given literal super power force extension tech from aliens this year, why not be excited at how much more you can build?
  - lofaszvanitt 15 hours ago
    
    But the head honchos on ted.com said AI will create more jobs.
  - newswasboring 1 day ago
    
    You don't hate AI, you hate capitalism. All the problems you have listed are not AI issues, its this crappy system where efficiency gains always end up with the capital owners.
  - OtomotO 1 day ago
    
    [flagged]
    
    [-]
    - dakolli 1 day ago
      
      Well I honestly think this is the solution. It's much harder to do French Revolution V2 though if they've used ML to perfect people's recommendation algorithms to psyop them into fighting wars on behalf of capitalists.
      I imagine llm job automation will make people so poor that they beg to fight in wars, and instead of turning that energy against he people who created the problem they'll be met with hours of psyops that direct that energy to Chinese people or whatever.
      We will see.
    - uxhoiuewfhhiu 1 day ago
      
      [flagged]
lifty 15 hours ago

Too bad we can’t use it. Whenever Google releases something, I can never seem to use it in their coding cli product.

[-]
- matesz 12 hours ago
  
  You can but only via Gemini Ultra plan which you can buy or Gemini API with early access.
  
  [-]
  - lifty 12 hours ago
    
    I know, and neither of these options are feasible for me. I can't get the early access and I am not willing to drop $250 in order to just try their new model. By the time I can use it, the other two companies have something similar and I lose my interest in Google's models.
toddmorrow 3 hours ago

this is like the doomsday clock
84% is meaningless if these things can't reason
getting closer and closer to 100%, but still can't function

[-]
- gilbetron 3 hours ago
  
  > if these things can't reason
  I see people talk about "reasoning". How do you define reasoning such that it is clear humans can do it and AI (currently) cannot?
ggregoire 21 hours ago

Do we know what model is used by Google Search to generate the AI summary?
I've noticed this week the AI summary now has a loader "Thinking…" (no idea if it was already there a few weeks ago). And after "Thinking…" it says "Searching…" and shows a list of favicons of popular websites (I guess it's generating the list of links on the right side of the AI summary?).
vampiregrey 14 hours ago

So last week I tried Gemini pro 3, Opus 4.6, GLM 5, Kimi2.5 so far using Kimi2.5 yeilded the best results (in terms of cost/performance) for me in a mid size Go project. Curious to know what others think ?

[-]
- sdeiley 12 hours ago
  
  I predict Gemini Flash will dominate when you try it.
  If you're going for cost performance balance choosing Gemini Pro is bewildering. Gemini Flash _outperforms_ Pro in some coding benchmarks and is the clear parento frontier leader for intelligence/cost. It's even cheaper than Kimi 2.5.
  https://artificialanalysis.ai/?media-leaderboards=text-to-im...
mark_l_watson 23 hours ago

Off topic comment (sorry): when people bash "models that are not their favorite model" I often wonder if they have done the engineering work to properly use the other models. Different models and architectures often require very different engineering to properly use them. Also, I think it is fine and proper that different developers prefer different models. We are in early days and variety is great.
0dayman 12 hours ago

I don't get it, why is Claude still number 1 while the numbers say different, let's see that new Gemini in the terminal also
Legend2440 1 day ago

I'm really interested in the 3D STL-from-photo process they demo in the video.
Not interested enough to pay $250 to try it out though.
sega_sai 1 day ago

I do like google models (and I pay for them), but the lack of competitive agent is a major flaw in Google's offering. It is simply not good enough in comparison to claude code. I wish they put some effort there (as I don't want to pay two subscriptions to both google and anthropic)
dmbche 22 hours ago

So what happens if the AI companies can't make money? I see more and more advances and breakthrough but they are taking in debt and no revenue in sight.
I seem to understand debt is very bad here since they could just sell more shares, but aren't (either valuation is stretched or no buyers).
Just a recession? Something else? Aren't they very very big to fall?
Edit0: Revenue isn't the right word, profit is more correct. Amazon not being profitable fucks with my understanding of buisness. Not an economist.

[-]
- sigmar 21 hours ago
  
  >taking in debt and no revenue in sight.
  which companies don't have revenue? anthropic is at a run rate of 14 billion (up from 9B in December, which was up from 4B in July). Did you mean profit? They expect to be cash flow positive in 2028.
  
  [-]
  - dmbche 20 hours ago
    
    Yes thank you, mixing my brushes here - I remembered one of the companies having raised over 100b and having about 10b in revenue.
- echelon 22 hours ago
  
  AI will kill SaaS moats and thus revenue. Anyone can build new SaaS quickly. Lots of competition will lead to marginal profits.
  AI will kill advertising. Whatever sits at the top "pane of glass" will be able to filter ads out. Personal agents and bots will filter ads out.
  AI will kill social media. The internet will fill with spam.
  AI models will become commodity. Unless singularity, no frontier model will stay in the lead. There's competition from all angles. They're easy to build, just capital intensive (though this is only because of speed).
  All this leaves is infrastructure.
  
  [-]
  - ddxv 21 hours ago
    
    Not following some of the jumps here.
    Advertising, how will they kill ads any better than the current cat and mouse games with ad blockers?
    Social Media, how will they kill social media? Probably 80% of the LinkedIn posts are touched by AI (lots of people spend time crafting them, so even if AI doesn't write the whole thing you know they ran the long ones through one) but I'm still reading (ok maybe skimming) the posts.
    
    [-]
    - echelon 20 hours ago
      
      > Advertising, how will they kill ads any better than the current cat and mouse games with ad blockers?
      The Ad Blocker cat and mouse game relies on human-written metaheuristics and rules. It's annoying for humans to keep up. It's difficult to install.
      Agents/Bots or super slim detection models will easily be able to train on ads and nuke them whatever form they come in: javascript, inline DOM, text content, video content.
      Train an anti-Ad model and it will cleanse the web of ads. You just need a place to run it from the top.
      You wouldn't even have to embed this into a browser. It could run in memory with permissions to overwrite the memory of other applications.
      > Social Media, how will they kill social media?
      MoltClawd was only the beginning. Soon the signal will become so noisy it will be intolerable. Just this week, X's Nikita Bier suggested we have less than six months before he sees no solution.
      Speaking of X, they just took down Higgsfield's (valued at $1.3B) main account because they were doing it across a molt bot army, and they're not the only ones. Extreme measures were the only thing they could do. For the distributed spam army, there will be no fix. People are already getting phone calls from this stuff.
  - dboreham 19 hours ago
    
    > AI will kill SaaS moats and thus revenue. Anyone can build new SaaS quickly.
    I'm LLM-positive but for me this is a stretch. Seeing it pop up all over media in the past couple weeks also makes me suspect astrofurfing. Like a few years back when there were a zillion articles saying voice search was the future and nobody used regular web search any more.
  - ryan_lane 19 hours ago
    
    AI models will simply build the ads into the responses, seamlessly. How do you filter out ads when you search for suggestions for products, and the AI companies suggest paid products in the responses?
    Based on current laws, does this even have to be disclosed? Will laws be passed to require disclosure?
- casey2 12 hours ago
  
  What happens if oil companies can't make money? They will restructure society so they can. That's the essence of capitalism, the willingness to restructure society to chase growth.
  Obviously this tech is profitable in some world. Car companies can't make money if we live in walking distance and people walk on roads.
- ipnon 22 hours ago
  
  They're using the ride share app playbook. Subsidize the product to reach market saturation. Once you've found a market segment that depends on your product you raise the price to break even. One major difference though is that ride share's haven't really changed in capabilities since they launched: it's a map that shows a little car with your driver coming and a pin where you're going. But it's reasonable to believe that AI will have new fundamental capabilities in the 2030s, 2040s, and so on.
ismailmaj 1 day ago

top 10 elo in codeforces is pretty absurd
amelius 12 hours ago

We're getting to the point where we can ask AI to invent new programming languages.

[-]
- kabes 12 hours ago
  
  Wait till we get to the point where we can ask AI to create a better AI.
  
  [-]
  - amelius 12 hours ago
    
    Right now I'm still stuck with AI that can't even install other AI.
eturkes1 19 hours ago

Is this not yet available for workspace users? I clicked on the Upgrade to Google AI Ultra button on the Gemini app and the page it takes me to still shows Gemini 2.5 Deep Think as an added feature. Wondering if that's just outdated info
whatever10 14 hours ago

I tried to debug a Wireguard VPN issue. No luck.
We need more than AGI.
Dirak 1 day ago

Praying this isn't another Llama4 situation where the benchmark numbers are cooked. 84.6% on Arc-AGI is incredible!
nphardon 22 hours ago

I think I'm finally realizing that my job probably won't exist in 3-5. Things are moving so fast now that the LLMs are basically writing themselves. I think the earlier iterations moved slower because they were limited by human ability and productivity limitations.
jonathanstrange 1 day ago

Unfortunately, it's only available in the Ultra subscription if it's available at all.
toephu2 15 hours ago

When will AI come up with a cure / vaccine for the common cold? and then cancer next?

[-]
- lofaszvanitt 15 hours ago
  
  Race for solving baldness :D
  
  [-]
  - viking123 12 hours ago
    
    Dutasteride already exists for that, been on it almost 10 years soon and it's great. Although if you are already bald it is kind of moot.
andrewstuart 1 day ago

Gemini was awesome and now it’s garbage.
It’s impossible for it to do anything but cut code down, drop features, lose stuff and give you less than the code you put in.
It’s puzzling because it spent months at the head of the pack now I don’t use it at all because why do I want any of those things when I’m doing development.
I’m a paid subscriber but there’s no point any more I’ll spend the money on Claude 4.6 instead.

[-]
- halapro 1 day ago
  
  I never found it useful for code. It produced garbage littered with gigantic comments.
  Me: Remove comments
  Literally Gemini: // Comments were removed
  
  [-]
  - andrewstuart 1 day ago
    
    It would make more sense to me if it had never been awesome.
    
    [-]
    - mortsnort 1 day ago
      
      They may quantize the models after release to save money.
- ergonaught 1 day ago
  
  It seems to be adept at reviewing/editing/critiquing, at least for my use cases. It always has something valuable to contribute from that perspective, but has been comparatively useless otherwise (outside of moats like "exclusive access to things involving YouTube").
LightBug1 6 hours ago

But it can't parse my mathematically really basic personal financial spreadsheet ...
I learned a lot about Gemini last night. Namely that I have lead it like a reluctant bull to understand what I want it to do (beyond normal conversations, etc).
Don't get me wrong, ChatGPT didn't do any better.
It's an important spreadsheet so I'm triple checking on several LLM's and, of course, comparing results with my own in depth understanding.
For running projects, and making suggestions, and answering questions and being "an advisor", LLM's are fantastic ... feed them a basic spreadsheet and it doesn't know what to do. You have to format the spreadsheet just right so that it "gets it".
I dread to think of junior professionals just throwing their spreadsheets into LLM's and runninng with the answers.
Or maybe I'm just shit at prompting LLM's in relation to spreadsheets. Anyone had better results in this scenario?

[-]
- ky3 4 hours ago
  
  You can ask the LLM to write a prompt for you. Example: "Explore prompts that would have circumvented all the previous misunderstanding."
KingMob 16 hours ago

I wish they would unleash it on the Google Cloud console. Whatever version of Gemini they offer in the sidebar when I log in is terrible.
okokwhatever 1 day ago

I need to test the sketch creation a s a p. I need this in my life because learning to use Freecad is too difficult for a busy person like me (and frankly, also quite lazy)

[-]
- sho_hn 1 day ago
  
  FWIW, the FreeCAD 1.1 nightlies are much easier and more intuitive to use due to the addition of many on-canvas gizmos.
syntaxing 1 day ago

Why a Twitter post and not the official Google blog post… https://blog.google/innovation-and-ai/models-and-research/ge...

[-]
- dang 1 day ago
  
  Just normal randomness I suppose. I've put that URL at the top now, and included the submitted URL in the top text.
- meetpateltech 1 day ago
  
  The official blog post was submitted earlier (https://news.ycombinator.com/item?id=46990637), but somehow this story ranked up quickly on the homepage.
  
  [-]
  - verdverm 1 day ago
    
    @dang will often replace the post url & merge comments
    HN guidelines prefer the original source over social posts linking to it.
- aavci 1 day ago
  
  Agreed - blog post is more appropriate than a twitter post
kittbuilds 14 hours ago

[dead]
kittbuilds 22 hours ago

[dead]
bschmidt720 1 day ago

[dead]
i_love_retros 21 hours ago

[flagged]

[-]
- rexpop 21 hours ago
  
  Israel is not one of the boots. Deplorable as their domestic policy may be, they're not wagging the dog of capitalist imperialism. To imply otherwise is to reveal yourself as biased, warped in a way that keeps you from going after much bigger, and more real systems of political economy holding back our civilization from universal human dignity and opportunity.
  
  [-]
  - i_love_retros 20 hours ago
    
    Lol what? Not sure if you are defending Israel or google because your communication style is awful. But if you are defending Israel then you're an idiot who is excusing genocide. If you're defending google then you're just a corporate bootlicker who means nothing.
    
    [-]
    - rexpop 9 hours ago
      
      You edited your comment.
      
      [-]
      - i_love_retros 8 hours ago
        
        yup but even if i changed it back to its original version, your comment would be hard to make sense of. try writing more honestly and less in way designed to impress.
    - alex1138 17 hours ago
      
      As opposed to Hamas who actually committed the genocide
HardCodedBias 1 day ago

Always the same with Google.
Gemini has been way behind from the start.
They use the firehose of money from search to make it as close to free as possible so that they have some adoption numbers.
They use the firehose from search to pay for tons of researchers to hand hold academics so that their non-economic models and non-economic test-time-compute can solve isolated problems.
It's all so tiresome.
Try making models that are actually competitive, Google.
Sell them on the actual market and win on actual work product in millions of people lives.

[-]
- sdeiley 11 hours ago
  
  I'm sorry but this is an insane take. Flash is leading its category by far. Absolutely destroys sonnet, 5.2 etc in both perf and cost.
  Pro still leads in visual intelligence.
  The company that most locks away their gold is Anthropic IMO and for good reason, as Opus 4.6 is expensive AF
  
  [-]
  - fatherwavelet 10 hours ago
    
    I think we highly underestimate the amount of "human bots" basically.
    Unthinking people programmed by their social media feed who don't notice the OpenAI influence campaign.
    With no social media, it seems obvious to me there was a massive PR campaign by OpenAI after their "code red" to try to convince people Gemini is not all that great.
    Yea, Gemini sucks, don't use it lol. Leave those resources to fools like myself.
fadedsignal 17 hours ago

Dr., please tell me are we cooked? :crying-emoji
m3kw9 1 day ago

Gemini 3 Pro/Flash is stuck in preview for months now. Google is slow but they progress like a massive rock giant.
ArchieScrivener 19 hours ago

Nonsense releases. Until they allow for medical diagnosis and legal advice who cares? You own all the prompts and outputs but somehow they can still modify them and censor them? No.
These 'Ai' are just sophisticated data collection machines, with the ability to generate meh code.
ipaddr 23 hours ago

The benchmark should be: can you ask it to create a profitable business or product and send you the profit?
Everything else is bike shedding.
dperhar 1 day ago

Does anyone actually use Gemini 3 now? I cant stand its sleek salesy way of introduction, and it doesnt hold to instructions hard – makes it unapplicable for MECE breakdowns or for writing.

[-]
- sigmar 22 hours ago
  
  I use it often. Occasionally for quick questions, but mostly for deep research.
- copperx 1 day ago
  
  I do. It's excellent when paired with an MCP like context7.
- throwa356262 1 day ago
  
  I dont agree, Gemini 3 is pretty good, even the Lite version.
  
  [-]
  - dperhar 1 day ago
    
    What do you use it for and why? Genuinely curious
    
    [-]
    - fatherwavelet 10 hours ago
      
      I use Gemini Pro for basically everything. I just started learning systems biology as I didn't even know this was a subject until it came up in a conversation.
      Biology is subject I am quite lacking in but it is unbelievable to me what I have learned in the last few weeks. Not even in what Gemini says exactly but in the text and papers it has led me to.
      One major reason is that it has never cut me off until last night. I ran several deep researches yesterday and then finally got cut off in a sprawling 2 hour conversation.
      For me it is the first model now that has something new coming out but I haven't extracted all the value from the old model that I am bored with it. I still haven't tried Opus 4.5 let alone 4.6 because I know I will get cut off right when things get rolling.
      I don't think I have even logged into ChatGPT in a month now.
    - IhateAI 1 day ago
      
      [dead]
- jeffbee 1 day ago
  
  It indeed departs from instructions pretty regularly. But I find it very useful and for the price it beats the world.
  "The price" is the marginal price I am paying on top of my existing Google 1, YouTube Premium, and Google Fi subs, so basically nothing on the margin.
- wetpaws 1 day ago
  
  [dead]