Scraping
Scraping Limitations / Roadblocks:
    How many requests can their server take
      How long does it take for the server to handle a request
      Requests per second
    How many requests can you parallelise:
      From a single process
      From multiple processes
    How do you track what needs to be scraped
    Authentication
      Watch out for
        Password change reminders
        Account being locked out
    Whether the page relies on JavaScript to render its content
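The request-rate and parallelism questions above can be explored from a single process with a thread pool plus a shared rate cap. This is a minimal sketch, not a definitive implementation: `RateLimiter`, `fetch`, `scrape_all`, and the default values for `max_workers` and `max_rps` are all hypothetical, and `fetch` only simulates a server instead of making a real HTTP call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Allow at most `rate` calls per second across all threads."""

    def __init__(self, rate: float) -> None:
        self.interval = 1.0 / rate
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def wait(self) -> None:
        # Reserve the next free time slot, then sleep until it arrives.
        with self.lock:
            now = time.monotonic()
            slot = max(self.next_slot, now)
            self.next_slot = slot + self.interval
        time.sleep(max(0.0, slot - now))


def fetch(url: str) -> str:
    """Hypothetical stand-in for a real HTTP request."""
    time.sleep(0.01)  # pretend the server needs 10 ms per request
    return f"<html>{url}</html>"


def scrape_all(urls, max_workers=4, max_rps=50.0):
    """Fetch every URL with bounded parallelism and a global requests/sec cap."""
    limiter = RateLimiter(max_rps)

    def task(url):
        limiter.wait()
        return url, fetch(url)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(task, urls))
```

Tuning `max_workers` and `max_rps` separately keeps the two questions apart: how much the server tolerates, and how much one process can usefully parallelise.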
Goals:
    Maximize number of requests/sec
    Minimize compute resources used
Tools:
    Database
    Queue
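One way to combine the two tools is to keep the queue of pages still to be scraped inside the database itself. The sketch below assumes SQLite and a single worker; the `scrape_queue` table, its columns, and the helper functions are hypothetical names chosen for illustration.

```python
import sqlite3


def init_queue(conn: sqlite3.Connection) -> None:
    # status is one of: pending, in_progress, done, failed
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS scrape_queue (
            url      TEXT PRIMARY KEY,
            status   TEXT NOT NULL DEFAULT 'pending',
            attempts INTEGER NOT NULL DEFAULT 0
        )
        """
    )
    conn.commit()


def enqueue(conn: sqlite3.Connection, urls) -> None:
    # INSERT OR IGNORE keeps re-discovered URLs from being queued twice.
    conn.executemany(
        "INSERT OR IGNORE INTO scrape_queue (url) VALUES (?)",
        [(u,) for u in urls],
    )
    conn.commit()


def claim_next(conn: sqlite3.Connection):
    """Return the oldest pending URL and mark it in_progress, or None if empty."""
    row = conn.execute(
        "SELECT url FROM scrape_queue WHERE status = 'pending' "
        "ORDER BY rowid LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE scrape_queue SET status = 'in_progress', attempts = attempts + 1 "
        "WHERE url = ?",
        (row[0],),
    )
    conn.commit()
    return row[0]


def mark(conn: sqlite3.Connection, url: str, status: str) -> None:
    conn.execute("UPDATE scrape_queue SET status = ? WHERE url = ?", (status, url))
    conn.commit()
```

With several workers the select-then-update in `claim_next` would need to run inside one transaction (or be replaced by a dedicated queue service) so that two workers never claim the same URL.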
Database:
    Avoid NoSQL
    Use a SQL database from the start, since you’ll most likely be exporting/querying it
      Easier to change field names
      Run SQL queries to fix mistakes in the data
    One table per “type” of page
      One table for the pagination results
      Another table for page results
    Another table for consolidated results
      This can be the source of truth
      The hard part may be figuring out which rows should exist in the consolidated table but don’t
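The table layout above, and the "what should exist but doesn't" check, can be sketched with SQLite. All table and column names here (`listing_pages`, `detail_pages`, `consolidated`, `item_id`, and so on) are assumptions for illustration; the real schema depends on the site being scraped.

```python
import sqlite3

SCHEMA = """
CREATE TABLE listing_pages (   -- pagination results: one row per item seen in a listing
    item_id     TEXT PRIMARY KEY,
    listing_url TEXT NOT NULL
);
CREATE TABLE detail_pages (    -- page results: one row per scraped detail page
    item_id TEXT PRIMARY KEY,
    title   TEXT
);
CREATE TABLE consolidated (    -- merged view, treated as the source of truth
    item_id     TEXT PRIMARY KEY,
    listing_url TEXT NOT NULL,
    title       TEXT
);
"""


def consolidate(conn: sqlite3.Connection) -> None:
    """Copy every item that has both a listing row and a detail row."""
    conn.execute(
        """
        INSERT OR REPLACE INTO consolidated (item_id, listing_url, title)
        SELECT l.item_id, l.listing_url, d.title
        FROM listing_pages AS l
        JOIN detail_pages AS d ON d.item_id = l.item_id
        """
    )
    conn.commit()


def missing_from_consolidated(conn: sqlite3.Connection):
    """Items seen in a listing that never made it into the consolidated table."""
    rows = conn.execute(
        """
        SELECT l.item_id
        FROM listing_pages AS l
        LEFT JOIN consolidated AS c ON c.item_id = l.item_id
        WHERE c.item_id IS NULL
        ORDER BY l.item_id
        """
    ).fetchall()
    return [r[0] for r in rows]
```

The `LEFT JOIN ... IS NULL` query is the kind of fix-up check that is easy to run when everything already lives in SQL tables.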
Log to the console:
    Page/ID scraped
    Time scraped
    Time it took to scrape the page
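The three log fields above map onto Python's standard `logging` module fairly directly. In this sketch the wrapper `scrape_with_logging`, the logger name, and the `scrape` callable passed in are hypothetical; `%(asctime)s` supplies the time scraped and `time.perf_counter()` measures the duration.

```python
import logging
import time

# Log to the console with a timestamp on every line.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
log = logging.getLogger("scraper")


def scrape_with_logging(page_id, scrape):
    """Run `scrape(page_id)` and log which page was scraped and how long it took."""
    start = time.perf_counter()
    result = scrape(page_id)
    elapsed = time.perf_counter() - start
    log.info("scraped page_id=%s elapsed=%.3fs", page_id, elapsed)
    return result, elapsed
```

Logging the elapsed time per page also gives a rough answer to how long the server takes to handle a request.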