NewsLab
Apr 28 20:38 UTC

Ask HN: Scaling a targeted web crawler beyond 500M pages/day (news.ycombinator.com)

27 points | by honungsburk | 10 comments
I've been reading up on crawler architecture. The two most useful sources I've found are the blog post "Crawling a billion web pages in just over 24 hours, in 2025" and the Mercator paper ("Mercator: A Scalable, Extensible Web Crawler").

Both of these, and most other material I've come across, focus on crawling the broad open web rather than a targeted set of domains. Our use case, crawling product prices, is the latter. Mercator calls out DNS resolution as a major bottleneck, for example, but when you're only hitting a few hundred domains that isn't really a concern.
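To illustrate why DNS stops being a bottleneck with a small, fixed host set, here is a minimal sketch of an in-process DNS cache. The class name `DnsCache`, the helper `system_resolve`, and the fixed 300-second TTL are assumptions made for illustration and are not taken from Mercator or the blog post; a real crawler would likely honor record TTLs and use an async resolver.

```python
import socket
import time
from typing import Callable, Dict, List, Tuple


def system_resolve(host: str) -> List[str]:
    """Resolve a hostname to its unique IP addresses via the OS resolver."""
    infos = socket.getaddrinfo(host, 443, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


class DnsCache:
    """Hypothetical fixed-TTL cache: each host is resolved at most once per TTL window."""

    def __init__(
        self,
        resolve: Callable[[str], List[str]] = system_resolve,
        ttl_seconds: float = 300.0,  # assumed value, not the record's real TTL
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, List[str]]] = {}
        self.lookups = 0  # number of real resolutions performed

    def get(self, host: str) -> List[str]:
        now = self._clock()
        entry = self._entries.get(host)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: no network round trip
        addresses = self._resolve(host)
        self.lookups += 1
        self._entries[host] = (now + self._ttl, addresses)
        return addresses
```

With a few hundred hosts, the number of real lookups is bounded by the host count per TTL window regardless of how many pages are fetched, which is why the broad-crawl DNS problem mostly disappears here.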

The other gap is that both assume static HTML. For our use case we need a headless browser, and we also have to deal with Cloudflare and similar anti-bot systems.

For product prices specifically, a lot of sites publish price feeds, which simplifies things, but plenty don't, and getting good coverage still requires scraping. Our current system does about 500M pages/day and we're looking to improve its performance.
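For a sense of scale, the 500M pages/day figure can be turned into per-second rates with some back-of-the-envelope arithmetic. The domain count of 300 and the even split across domains are assumptions standing in for "a few hundred domains"; real traffic is almost certainly skewed toward the largest sites.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def pages_per_second(pages_per_day: float) -> float:
    """Average fetch rate needed to sustain a daily page count."""
    return pages_per_day / SECONDS_PER_DAY


def per_domain_rate(pages_per_day: float, domain_count: int) -> float:
    """Average per-domain rate, assuming load is spread evenly (an assumption)."""
    return pages_per_second(pages_per_day) / domain_count


if __name__ == "__main__":
    daily_pages = 500_000_000  # figure from the post
    assumed_domains = 300      # hypothetical stand-in for "a few hundred"
    print(f"{pages_per_second(daily_pages):.0f} pages/s overall")
    print(f"{per_domain_rate(daily_pages, assumed_domains):.1f} pages/s per domain")
    # prints "5787 pages/s overall" and "19.3 pages/s per domain"
```

Under those assumptions each domain sees roughly 19 requests per second on average, so the pressure lands on per-site request rates and browser rendering cost rather than on discovering new hosts.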

Does anyone here have experience in this space, or know of articles/blog posts on scaling targeted (rather than broad) crawlers with headless browsers? Any pointers appreciated.

Comments (10)

  1. 4lx87
    I'm curious, how do you deal with Cloudflare and similar anti-bot systems? Just keep shopping the job around to different proxies?
  2. fragmede
    Cloudflare reads this forum. By answering your question here, they burn that workaround. Why would someone do that? (No one bring up Warframe)
  3. faangguyindia
    It's fairly simple: you use browser profiles and visit multiple websites like a normal guy, through a residential proxy network.

    Cloudflare cannot detect you this way.

    The older your browser profile is, the less often Cloudflare bans you.

  4. faangguyindia
    If you want to access data from websites that try to prevent it, you have to use a headless browser with a residential proxy network like Bright Data (formerly Luminati).
  5. nicbou
    Our industry's understanding of consent is terrifying
  6. jeong_jeong
    It’s called Hacker News, bro
  7. ccgreg
    I'm a life-long hacker, and my crawler crawls with consent.
  8. nicbou
    The hacker ethos is to think for yourself and find ways around rigid social structures. It is not to plunder without thought.
  9. fragmede
    Have you already incorporated Common Crawl into your index?
  10. ccgreg
    Common Crawl is a sample of the web, so it's not that directly helpful for someone wanting to make a product price dataset.