Scraping
Scraping Limitations / Roadblocks:
- How many requests can their server take
  - How long does it take for the server to handle a request
  - Requests/second
- How many requests can you parallelise:
  - From a single process
  - From multiple processes
- How do you track what needs to be scraped
- Authentication
  - Watch out for
    - Password change reminders
    - Account being locked out
- If the page uses JavaScript
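One way to probe the first two points is to time the same batch of requests at different levels of parallelism. The following is a minimal sketch, not a definitive implementation: `fetch` is a hypothetical stand-in that sleeps instead of making a real HTTP call, and `scrape_all` is an illustrative name, so swap in your own request function before drawing conclusions about a real server.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(page_id: int) -> str:
    """Hypothetical stand-in for an HTTP request; sleeps to simulate server latency."""
    time.sleep(0.05)
    return f"page {page_id}"


def scrape_all(page_ids, max_workers):
    """Fetch every page with at most max_workers requests in flight.

    Returns the results (in input order) and the measured requests/second.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fetch, page_ids))
    elapsed = time.perf_counter() - start
    return results, len(results) / elapsed


if __name__ == "__main__":
    # Compare throughput from a single process at different thread counts.
    for workers in (1, 4, 8):
        _, rate = scrape_all(range(16), workers)
        print(f"{workers} workers: {rate:.1f} requests/sec")
```

Against a real site, throughput usually stops improving (or errors start appearing) at some worker count, which gives a rough answer to how many requests the server can take.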
Goals:
- Maximize number of requests/sec
- Minimize compute resources used
Tools:
- Database
- Queue
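As an illustration of how a queue can track what still needs to be scraped, here is a small sketch using Python's standard `queue` and `threading` modules. Everything named here is an assumption made for the example: `FAKE_SITE` is a hypothetical in-memory site map standing in for real HTTP responses, and `crawl` is an illustrative function name.

```python
import queue
import threading

# Hypothetical site map standing in for real responses: page -> links found on it.
FAKE_SITE = {
    "/list?page=1": ["/item/1", "/item/2", "/list?page=2"],
    "/list?page=2": ["/item/2", "/item/3"],
    "/item/1": [],
    "/item/2": [],
    "/item/3": [],
}


def crawl(start, num_workers=4):
    """Scrape every page reachable from start exactly once; return the pages scraped."""
    todo = queue.Queue()
    seen = {start}               # everything ever queued, so nothing is queued twice
    seen_lock = threading.Lock()
    scraped = []
    todo.put(start)

    def worker():
        while True:
            url = todo.get()
            if url is None:      # sentinel: no more work
                todo.task_done()
                return
            try:
                links = FAKE_SITE.get(url, [])   # a real scraper would fetch here
                scraped.append(url)
                for link in links:
                    with seen_lock:
                        if link in seen:
                            continue
                        seen.add(link)
                    todo.put(link)
            finally:
                todo.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    todo.join()                  # returns once every queued page has been processed
    for _ in threads:
        todo.put(None)
    for t in threads:
        t.join()
    return scraped


if __name__ == "__main__":
    print(sorted(crawl("/list?page=1")))
```

An in-memory queue is lost when the process dies; for long scrapes the same idea can be backed by a database table so the work list survives restarts.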
Database:
- Avoid NoSQL
  - Use a SQL database from the start, since you’ll most likely be exporting/querying it
  - Easier to change field names
  - Run SQL queries to fix bad data
- One table per “type” of page
  - One table for the pagination results
  - Another table for page results
  - Another table for consolidated results
    - This can be the source of truth
    - Hard part may be figuring out what should exist in the consolidated table, but doesn’t
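The table layout above could look roughly like this in SQLite. This is a sketch under assumptions: the table and column names (`pagination_results`, `page_results`, `consolidated`, `item_id`, and so on) are invented for illustration, and the sample rows are made up. The last query shows one way to find rows that should exist in the consolidated table but don't.

```python
import sqlite3


def build_db():
    """Create the three hypothetical tables in an in-memory database with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE pagination_results (   -- ids discovered on listing pages
            item_id   TEXT PRIMARY KEY,
            list_page INTEGER NOT NULL
        );
        CREATE TABLE page_results (         -- data scraped from each detail page
            item_id TEXT PRIMARY KEY,
            title   TEXT
        );
        CREATE TABLE consolidated (         -- joined view of both; source of truth
            item_id   TEXT PRIMARY KEY,
            list_page INTEGER,
            title     TEXT
        );
    """)
    conn.executemany("INSERT INTO pagination_results VALUES (?, ?)",
                     [("a", 1), ("b", 1), ("c", 2)])
    # Item "c" was listed but its detail page has not been scraped yet.
    conn.executemany("INSERT INTO page_results VALUES (?, ?)",
                     [("a", "Alpha"), ("b", "Beta")])
    return conn


def consolidate(conn):
    """Fill the consolidated table from items present in both source tables."""
    conn.execute("""
        INSERT OR REPLACE INTO consolidated (item_id, list_page, title)
        SELECT p.item_id, p.list_page, r.title
        FROM pagination_results AS p
        JOIN page_results AS r ON r.item_id = p.item_id
    """)


def missing_from_consolidated(conn):
    """Return ids seen in pagination that never made it into consolidated."""
    rows = conn.execute("""
        SELECT p.item_id
        FROM pagination_results AS p
        LEFT JOIN consolidated AS c ON c.item_id = p.item_id
        WHERE c.item_id IS NULL
        ORDER BY p.item_id
    """)
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_db()
    consolidate(conn)
    print(missing_from_consolidated(conn))
```

The missing ids can be fed straight back into the scrape queue.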
Log to the console:
- Page/ID scraped
- Time scraped
- Time it took to scrape the page
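A minimal sketch of logging those three fields with the standard `logging` module. The wrapper name `scrape_with_log` and the `scrape` callable passed to it are assumptions for illustration; the timestamp comes from the `%(asctime)s` field in the log format.

```python
import logging
import sys
import time

# Log to the console; asctime records when each page was scraped.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="%(asctime)s %(message)s",
)
log = logging.getLogger("scraper")


def scrape_with_log(page_id, scrape):
    """Run scrape(page_id), log the page id and how long it took, return both."""
    start = time.perf_counter()
    result = scrape(page_id)
    elapsed = time.perf_counter() - start
    log.info("page_id=%s took=%.3fs", page_id, elapsed)
    return result, elapsed


if __name__ == "__main__":
    # Hypothetical scrape function used only to demonstrate the wrapper.
    scrape_with_log(42, lambda pid: f"html for {pid}")
```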