Scraping
Scraping Limitations / Roadblocks:
  • How many requests the target server can handle
    • How long the server takes to handle a request
    • Sustainable requests/second
  • How many requests you can parallelise:
    • From a single process
    • From multiple processes
  • How you track what still needs to be scraped
  • Authentication
    • Watch out for
      • Password change reminders
      • Account being locked out
  • Whether the page needs JavaScript to render its content
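
The two limits above (what the server can take per second, and how many requests one process keeps in flight) can be sketched with the standard library alone. This is a minimal sketch, not a definitive implementation: `RateLimiter`, `scrape_all`, `fake_fetch`, and the default values are hypothetical names and numbers chosen for illustration, and `fake_fetch` stands in for a real HTTP call.

```python
import asyncio
import time


class RateLimiter:
    """Space out calls so that at most `rate` of them start per second."""

    def __init__(self, rate: float) -> None:
        self._interval = 1.0 / rate
        self._next_slot = 0.0
        self._lock = asyncio.Lock()

    async def wait(self) -> None:
        # Reserve the next free time slot under the lock, then sleep until it.
        async with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self._interval
        await asyncio.sleep(max(0.0, slot - now))


async def fake_fetch(url: str) -> str:
    # Assumption: placeholder for a real HTTP request; sleeps to mimic latency.
    await asyncio.sleep(0.01)
    return f"body of {url}"


async def scrape_all(urls, max_in_flight=5, max_per_second=50.0, fetch=fake_fetch):
    """Fetch every URL while capping both concurrency and request rate."""
    limiter = RateLimiter(max_per_second)
    semaphore = asyncio.Semaphore(max_in_flight)

    async def worker(url):
        async with semaphore:        # cap on simultaneous requests
            await limiter.wait()     # cap on requests started per second
            return await fetch(url)

    # gather() returns results in the same order as `urls`.
    return await asyncio.gather(*(worker(u) for u in urls))
```

Calling `asyncio.run(scrape_all(urls))` runs the whole batch from a single process. Lowering `max_per_second` protects the target server, while `max_in_flight` bounds how much work your own process holds open at once.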
Goals:
  • Maximize requests/sec
  • Minimize compute resources used
Tools:
  • Database
  • Queue
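
As a rough illustration of the queue's role in tracking what still needs to be scraped, here is a minimal in-memory sketch. `Frontier` and its methods are hypothetical names, and a real scraper would likely persist this state in the database so a crash does not lose it.

```python
from collections import deque


class Frontier:
    """In-memory queue of URLs to scrape that never queues the same URL twice."""

    def __init__(self, seeds=()):
        self._pending = deque()
        self._seen = set()
        for url in seeds:
            self.add(url)

    def add(self, url: str) -> bool:
        """Queue `url` unless it was queued before; return True if it was added."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def next(self):
        """Return the next URL to scrape, or None when nothing is pending."""
        return self._pending.popleft() if self._pending else None

    def __len__(self):
        # Number of URLs still waiting to be scraped.
        return len(self._pending)
```

Pagination pages would call `add()` for every link they discover, and workers would call `next()` until it returns `None`.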
Database:
  • Avoid NoSQL
  • Use a SQL database from the start, since you’ll most likely be exporting and querying the data
    • Easier to change field names
    • You can run SQL queries to fix data after the fact
  • One table per “type” of page
    • One table for the pagination results
    • Another table for page results
  • Another table for consolidated results
    • This can be the source of truth
    • The hard part may be figuring out which rows should exist in the consolidated table but don’t
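
The table layout above can be sketched with SQLite from the Python standard library. The table and column names (`listing_results`, `page_results`, `consolidated`, `item_id`, and so on) are assumptions made for illustration, as is the choice of SQLite. The query at the end is one possible way to find rows that should exist in the consolidated table but don't.

```python
import sqlite3

SCHEMA = """
CREATE TABLE listing_results (      -- one row per item seen on a pagination page
    item_id     TEXT PRIMARY KEY,
    list_page   INTEGER NOT NULL,
    scraped_at  TEXT NOT NULL
);
CREATE TABLE page_results (         -- one row per scraped detail page
    item_id     TEXT PRIMARY KEY,
    title       TEXT,
    scraped_at  TEXT NOT NULL
);
CREATE TABLE consolidated (         -- cleaned, merged rows: the source of truth
    item_id     TEXT PRIMARY KEY,
    title       TEXT NOT NULL
);
"""


def open_db(path=":memory:"):
    """Open a database at `path` and create the three tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def missing_from_consolidated(conn):
    """Return item ids that pagination found but that have no consolidated row."""
    rows = conn.execute(
        """
        SELECT l.item_id
        FROM listing_results AS l
        LEFT JOIN consolidated AS c ON c.item_id = l.item_id
        WHERE c.item_id IS NULL
        ORDER BY l.item_id
        """
    ).fetchall()
    return [item_id for (item_id,) in rows]
```

Anything returned by `missing_from_consolidated` either was never scraped or failed during consolidation, so it is a natural list of work to re-queue.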
Log to the console:
  • Page/ID scraped
  • Time scraped
  • Time it took to scrape the page
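
The three fields above can be emitted by a small wrapper around whatever function does the scraping. This is a minimal sketch: `scrape_with_log`, the logger name `scraper`, and the log line format are all hypothetical choices, and `scrape` is any callable you supply that takes a page id.

```python
import logging
import time
from datetime import datetime, timezone

logger = logging.getLogger("scraper")


def scrape_with_log(page_id, scrape, log=logger):
    """Run `scrape(page_id)` and log the page id, start time, and duration."""
    started_at = datetime.now(timezone.utc)
    t0 = time.perf_counter()
    result = scrape(page_id)
    elapsed = time.perf_counter() - t0
    log.info(
        "page=%s scraped_at=%s took=%.3fs",
        page_id,
        started_at.isoformat(timespec="seconds"),
        elapsed,
    )
    return result
```

To see the lines on the console, call `logging.basicConfig(level=logging.INFO)` once at startup. Watching the `took=` values drift upward is a cheap early signal that the server is slowing down or throttling you.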