aizatto.com
Scraping
Scraping Limitations / Roadblocks:
How many requests can their server take
How long does it take for the server to handle a request
Requests/second
How many requests can you parallelise:
From a single process
From multiple processes
How do you track what needs to be scraped
Authentication
Watch out for
Password change reminders
Account being locked out
Whether the page relies on JavaScript to render its content
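A minimal sketch of capping how many requests run in parallel from a single process, so the target server is not overwhelmed. It assumes Python's asyncio; `fake_fetch`, `scrape_all`, and `max_in_flight` are hypothetical names, and `fake_fetch` only sleeps instead of making a real HTTP request.

```python
import asyncio


async def fake_fetch(url: str, delay: float = 0.01) -> str:
    """Hypothetical stand-in for an HTTP request: sleeps, then returns a fake body."""
    await asyncio.sleep(delay)
    return f"body of {url}"


async def scrape_all(urls, max_in_flight=5, fetch=fake_fetch):
    """Fetch every URL while keeping at most max_in_flight requests running at once.

    Returns the results (in the same order as urls) and the peak number of
    simultaneous requests observed, which is handy for checking the cap.
    """
    semaphore = asyncio.Semaphore(max_in_flight)
    in_flight = 0
    peak = 0

    async def worker(url):
        nonlocal in_flight, peak
        async with semaphore:  # blocks here once the cap is reached
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fetch(url)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(worker(u) for u in urls))
    return results, peak
```

Running `asyncio.run(scrape_all(urls, max_in_flight=5))` returns the page bodies plus the observed peak. To scale past one process, the same function could be run in several processes, each with its own slice of URLs.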
Goals:
Maximize number of requests/sec
Minimize compute resources used
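As a rough back-of-the-envelope check on the requests/sec goal: if each request takes a roughly constant time and a fixed number are always in flight, throughput is about concurrency divided by latency. The function name below is hypothetical, and the estimate ignores server-side rate limits and local CPU cost.

```python
def estimated_requests_per_second(concurrency: int, avg_latency_seconds: float) -> float:
    """Upper-bound estimate: requests in flight divided by time per request."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    if avg_latency_seconds <= 0:
        raise ValueError("avg_latency_seconds must be positive")
    return concurrency / avg_latency_seconds


# 10 parallel requests at 0.5 s each gives roughly 20 requests/sec.
print(estimated_requests_per_second(10, 0.5))
```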
Tools:
Database
Queue
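One possible way to track what still needs to be scraped is a queue paired with a set of everything already seen, so the same page is never queued twice. This is an in-memory sketch only; `ScrapeQueue` and its methods are hypothetical names, and a real run would likely persist this state in the database.

```python
from collections import deque


class ScrapeQueue:
    """FIFO queue of URLs that refuses to enqueue a URL it has already seen."""

    def __init__(self):
        self._pending = deque()
        self._seen = set()

    def add(self, url: str) -> bool:
        """Queue url unless it was added before; return True if it was queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def next(self):
        """Return the next URL to scrape, or None when nothing is pending."""
        return self._pending.popleft() if self._pending else None

    def __len__(self):
        return len(self._pending)
```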
Database:
Avoid NoSQL
Use a SQL database from the start, since you’ll most likely be exporting/querying it
Easier to change field names
Run SQL queries to fix data
One table per “type” of page
One table for the pagination results
Another table for page results
Another table for consolidated results
This can be the source of truth
Hard part may be figuring out what should exist in the consolidated table, but doesn’t
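The table layout above could look something like this, using SQLite from Python's standard library. The table and column names (`pagination_results`, `page_results`, `consolidated_results`, `item_id`, and so on) are assumptions for illustration. The `LEFT JOIN` query is one way to answer the hard part: which items were seen in pagination but never made it into the consolidated table.

```python
import sqlite3

# Hypothetical schema: one table per "type" of page, plus a consolidated table.
SCHEMA = """
CREATE TABLE pagination_results (
    item_id   TEXT PRIMARY KEY,
    list_page INTEGER NOT NULL
);
CREATE TABLE page_results (
    item_id    TEXT PRIMARY KEY,
    title      TEXT,
    scraped_at TEXT NOT NULL
);
CREATE TABLE consolidated_results (
    item_id TEXT PRIMARY KEY,
    title   TEXT
);
"""

# Items listed in pagination that have no row in the consolidated table.
MISSING_FROM_CONSOLIDATED = """
SELECT p.item_id
FROM pagination_results AS p
LEFT JOIN consolidated_results AS c ON c.item_id = p.item_id
WHERE c.item_id IS NULL
ORDER BY p.item_id
"""


def find_missing(conn: sqlite3.Connection) -> list:
    """Return item ids that should be in the consolidated table but are not."""
    return [row[0] for row in conn.execute(MISSING_FROM_CONSOLIDATED)]


def demo() -> list:
    """Build an in-memory database with sample rows and list the missing items."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO pagination_results VALUES (?, ?)",
        [("a", 1), ("b", 1), ("c", 2)],
    )
    conn.execute("INSERT INTO consolidated_results VALUES (?, ?)", ("a", "Item A"))
    return find_missing(conn)
```

In this sample data, `demo()` reports items `b` and `c` as missing, since only `a` was consolidated.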
Log to the console:
Page/ID scraped
Time scraped
Time it took to scrape the page
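A small sketch of logging those three fields for each page. `scrape_with_log` and its `scrape` callback are hypothetical; the callback stands in for whatever function actually fetches and parses a page.

```python
import time
from datetime import datetime, timezone


def scrape_with_log(page_id, scrape, log=print):
    """Run scrape(page_id) and log the page id, when it was scraped, and how long it took."""
    started = time.perf_counter()
    result = scrape(page_id)
    elapsed = time.perf_counter() - started
    scraped_at = datetime.now(timezone.utc).isoformat(timespec="seconds")
    log(f"page={page_id} scraped_at={scraped_at} took={elapsed:.3f}s")
    return result
```

Passing a different `log` callable (for example a list's `append`) makes it easy to capture the lines instead of printing them.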
Last modified
2yr ago