Scraping
Scraping Limitations / Roadblocks:
How many requests can the target server take?
How long does the server take to handle a request?
Requests/second
How many requests can you parallelise:
From a single process
From multiple processes
How do you track what needs to be scraped?
Authentication
Watch out for
Password change reminders
Account being locked out
Whether the page uses JavaScript to render its content
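The capacity and parallelism questions above are easiest to answer by measuring. A minimal sketch, assuming a thread pool inside a single process; measure_throughput and fake_fetch are hypothetical names, and fake_fetch only sleeps to stand in for a real HTTP request:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(fetch, urls, max_workers):
    """Call fetch(url) for every URL on a thread pool.

    Returns (results, requests_per_second). fetch is supplied by the caller.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as urls.
        results = list(pool.map(fetch, urls))
    elapsed = time.perf_counter() - start
    rate = (len(urls) / elapsed) if elapsed > 0 else float("inf")
    return results, rate


def fake_fetch(url):
    # Assumption: stand-in for a real request; sleeps to simulate latency.
    time.sleep(0.05)
    return url, 200


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(20)]
    for workers in (1, 5, 10):
        _, rate = measure_throughput(fake_fetch, urls, workers)
        print(f"{workers} workers: {rate:.1f} requests/sec")
```

Raising max_workers step by step against the real server, and stopping once requests/sec flattens or errors appear, gives a rough answer to how much the server can take.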
Goals:
Maximize the number of requests/sec
Minimize the compute resources used
Tools:
Database
Queue
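One way a queue can track what still needs to be scraped is sketched below. It assumes an in-process queue.Queue with worker threads; run_workers and the scrape callable are hypothetical, and scrape(page_id) is assumed to return the page ids it discovered:

```python
import queue
import threading


def run_workers(start_ids, scrape, num_workers=4):
    """Scrape every page id reachable from start_ids exactly once.

    Returns (done, failed): ids scraped successfully and ids whose scrape raised.
    """
    todo = queue.Queue()
    seen = set()          # every id ever queued, so nothing is scraped twice
    lock = threading.Lock()
    done, failed = [], []

    def enqueue(page_id):
        with lock:
            if page_id in seen:
                return
            seen.add(page_id)
        todo.put(page_id)

    def worker():
        while True:
            page_id = todo.get()
            if page_id is None:       # sentinel: no more work
                todo.task_done()
                break
            try:
                for new_id in scrape(page_id):
                    enqueue(new_id)
                with lock:
                    done.append(page_id)
            except Exception:
                with lock:
                    failed.append(page_id)
            finally:
                todo.task_done()

    for page_id in start_ids:
        enqueue(page_id)
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    todo.join()                       # wait until every queued id is processed
    for _ in threads:
        todo.put(None)
    for t in threads:
        t.join()
    return done, failed
```

An in-memory queue loses its state if the process dies; for long runs the same idea can be backed by a database table or an external queue so the scrape can resume.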
Database:
Avoid NoSQL
Use a SQL database from the start, since you’ll most likely be exporting and querying the data
Easier to change field names
Run SQL queries to fix bad data
One table per “type” of page
One table for the pagination results
Another table for page results
Another table for consolidated results
This can be the source of truth
The hard part may be figuring out which rows should exist in the consolidated table but don’t
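The table layout above, and the search for rows missing from the consolidated table, can be sketched with SQLite. All table and column names here (pagination_results, page_results, consolidated, item_id, title, list_page) are assumptions chosen for illustration:

```python
import sqlite3


def create_schema(conn):
    # One table per "type" of page, plus a consolidated table.
    conn.executescript("""
        CREATE TABLE pagination_results (
            item_id   TEXT PRIMARY KEY,
            list_page INTEGER NOT NULL
        );
        CREATE TABLE page_results (
            item_id TEXT PRIMARY KEY,
            title   TEXT
        );
        CREATE TABLE consolidated (
            item_id   TEXT PRIMARY KEY,
            title     TEXT,
            list_page INTEGER
        );
    """)


def consolidate(conn):
    # Merge items that appear in both the pagination and the page tables.
    conn.execute("""
        INSERT OR REPLACE INTO consolidated (item_id, title, list_page)
        SELECT p.item_id, d.title, p.list_page
        FROM pagination_results AS p
        JOIN page_results AS d ON d.item_id = p.item_id
    """)


def missing_from_consolidated(conn):
    # Items seen in pagination that never reached the consolidated table.
    rows = conn.execute("""
        SELECT p.item_id
        FROM pagination_results AS p
        LEFT JOIN consolidated AS c ON c.item_id = p.item_id
        WHERE c.item_id IS NULL
        ORDER BY p.item_id
    """).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    # Made-up sample rows: item "c" was listed but its page was never scraped.
    conn.executemany("INSERT INTO pagination_results VALUES (?, ?)",
                     [("a", 1), ("b", 1), ("c", 2)])
    conn.executemany("INSERT INTO page_results VALUES (?, ?)",
                     [("a", "Title A"), ("b", "Title B")])
    consolidate(conn)
    print(missing_from_consolidated(conn))
```

The LEFT JOIN with an IS NULL filter is what surfaces the gaps; the ids it returns can be fed back into the scrape queue.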
Log to the console:
Page/ID scraped
Time scraped
Time it took to scrape the page
Last updated
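A minimal sketch of such console logging, using Python's standard logging module. The function names and the line format are hypothetical, and "last updated" is assumed here to mean a last-updated value found on the scraped page itself:

```python
import logging
import sys
import time
from datetime import datetime, timezone

logger = logging.getLogger("scraper")


def format_scrape_line(page_id, scraped_at, duration_s, last_updated):
    # One line per page: id, when it was scraped, how long it took, last updated.
    return (f"page={page_id} scraped_at={scraped_at.isoformat()} "
            f"duration={duration_s:.3f}s last_updated={last_updated}")


def scrape_with_logging(page_id, scrape):
    """Run scrape(page_id) and log one line about it.

    Assumption: scrape returns a dict that may carry a 'last_updated' key.
    """
    start = time.perf_counter()
    result = scrape(page_id)
    duration = time.perf_counter() - start
    scraped_at = datetime.now(timezone.utc)
    logger.info(format_scrape_line(page_id, scraped_at, duration,
                                   result.get("last_updated")))
    return result


if __name__ == "__main__":
    logging.basicConfig(stream=sys.stdout, level=logging.INFO,
                        format="%(message)s")
    scrape_with_logging("item-1", lambda pid: {"last_updated": "unknown"})
```

Keeping each page on a single key=value line makes the console output easy to grep while a long scrape is running.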