aizatto.com

Scraping

Scraping Limitations / Roadblocks:

  • How many requests the target server can handle

    • How long the server takes to handle a request

    • Requests/second

  • How many requests can you parallelise:

    • From a single process

    • From multiple processes

  • How do you track what needs to be scraped

  • Authentication

    • Watch out for

      • Password change reminders

      • Account being locked out

  • Whether the page needs JavaScript to render its content
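
The limits on server capacity and parallelism above can be explored with a small sketch like the following. It caps the number of in-flight requests from a single process using `asyncio.Semaphore`. The names `fetch_page` and `scrape_all` are hypothetical, and `fetch_page` is a stand-in that only simulates latency; a real scraper would swap in an actual HTTP client call.

```python
import asyncio
import time


async def fetch_page(page_id):
    """Hypothetical stand-in for a real HTTP request (assumption, not a real client)."""
    await asyncio.sleep(0.01)  # simulate server latency
    return f"<html>page {page_id}</html>"


async def scrape_all(page_ids, max_in_flight=5, fetch=fetch_page):
    """Fetch every page while keeping at most max_in_flight requests running at once."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def fetch_one(page_id):
        # The semaphore blocks here once max_in_flight requests are active.
        async with semaphore:
            return page_id, await fetch(page_id)

    results = await asyncio.gather(*(fetch_one(p) for p in page_ids))
    return dict(results)


if __name__ == "__main__":
    started = time.perf_counter()
    pages = asyncio.run(scrape_all(range(20), max_in_flight=5))
    elapsed = time.perf_counter() - started
    print(f"{len(pages)} pages in {elapsed:.2f}s ({len(pages) / elapsed:.1f} requests/sec)")
```

Raising `max_in_flight` step by step while watching requests/sec and error rates is one way to find how much load the target server tolerates. Scaling across multiple processes would need a shared work queue on top of this.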

Goals:

  • Maximize requests/second

  • Minimize compute resources used

Tools:

  • Database

  • Queue
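
One way to combine the two tools, and to track what still needs to be scraped, is to use a database table as the queue. The notes do not say which queue to use, so this is only a sketch under that assumption. The table name `scrape_queue`, its columns, and the helper functions are all hypothetical.

```python
import sqlite3


def open_queue(path=":memory:"):
    """Open (or create) a SQLite-backed queue of URLs to scrape."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS scrape_queue (
               url      TEXT PRIMARY KEY,
               status   TEXT NOT NULL DEFAULT 'pending',
               attempts INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def enqueue(conn, url):
    # INSERT OR IGNORE keeps a re-discovered URL from being queued twice.
    conn.execute("INSERT OR IGNORE INTO scrape_queue (url) VALUES (?)", (url,))
    conn.commit()


def claim_next(conn):
    """Mark the oldest pending URL as in progress and return it, or None if empty."""
    row = conn.execute(
        "SELECT url FROM scrape_queue WHERE status = 'pending' ORDER BY rowid LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE scrape_queue SET status = 'in_progress', attempts = attempts + 1 "
        "WHERE url = ?",
        (row[0],),
    )
    conn.commit()
    return row[0]


def finish(conn, url, ok):
    """Record the outcome so failed URLs can be found and retried later."""
    conn.execute(
        "UPDATE scrape_queue SET status = ? WHERE url = ?",
        ("done" if ok else "failed", url),
    )
    conn.commit()
```

This version assumes a single worker process. With several processes claiming work at once, the select-then-update step would need a transaction or a dedicated queue service to avoid two workers taking the same URL.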

Database:

  • Avoid NoSQL

  • Use a SQL database from the start, since you’ll most likely be exporting/querying it

    • Easier to change field names

    • Run SQL queries to fix bad data

  • One table per “type” of page

    • One table for the pagination results

    • Another table for page results

  • Another table for consolidated results

    • This can be the source of truth

    • The hard part may be figuring out which rows should exist in the consolidated table but don’t
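
The table layout above might look like the following sketch in SQLite. All table and column names (`listing_results`, `detail_pages`, `consolidated_items`, `price`, and so on) are made up for illustration; the real columns depend on the site being scraped. The `missing_from_consolidated` query is one possible approach to the "should exist but doesn’t" problem, using a `LEFT JOIN` to find listed items with no consolidated row.

```python
import sqlite3

# Hypothetical schema: one table per page type, plus a consolidated table.
SCHEMA = """
CREATE TABLE listing_results (
    item_id   TEXT PRIMARY KEY,
    list_page INTEGER NOT NULL,
    title     TEXT
);
CREATE TABLE detail_pages (
    item_id    TEXT PRIMARY KEY,
    price      TEXT,
    scraped_at TEXT
);
CREATE TABLE consolidated_items (
    item_id TEXT PRIMARY KEY,
    title   TEXT,
    price   TEXT
);
"""


def consolidate(conn):
    """Merge listing and detail rows into the consolidated source-of-truth table."""
    conn.execute(
        """INSERT OR REPLACE INTO consolidated_items (item_id, title, price)
           SELECT l.item_id, l.title, d.price
           FROM listing_results AS l
           JOIN detail_pages AS d ON d.item_id = l.item_id"""
    )
    conn.commit()


def missing_from_consolidated(conn):
    """Return item ids seen in listings that have no consolidated row yet."""
    rows = conn.execute(
        """SELECT l.item_id
           FROM listing_results AS l
           LEFT JOIN consolidated_items AS c ON c.item_id = l.item_id
           WHERE c.item_id IS NULL
           ORDER BY l.item_id"""
    ).fetchall()
    return [row[0] for row in rows]
```

Items returned by `missing_from_consolidated` are typically ones whose detail page has not been scraped yet or failed, so they are natural candidates to re-queue.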

Log to the console:

  • Page/ID scraped

  • Time scraped

  • Time it took to scrape the page
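
The three fields above could be logged with a small wrapper like this sketch, built on the standard `logging` module. The function name `scrape_with_log`, the logger name `scraper`, and the `key=value` message format are assumptions chosen for illustration. The `scrape` argument is whatever function actually fetches and parses a page.

```python
import logging
import time
from datetime import datetime, timezone

logger = logging.getLogger("scraper")


def scrape_with_log(page_id, scrape):
    """Run scrape(page_id) and log the page id, the scrape time, and the duration."""
    scraped_at = datetime.now(timezone.utc).isoformat(timespec="seconds")
    started = time.perf_counter()
    result = scrape(page_id)
    duration = time.perf_counter() - started
    logger.info("page=%s scraped_at=%s duration=%.3fs", page_id, scraped_at, duration)
    return result


if __name__ == "__main__":
    # basicConfig attaches a console handler (stderr by default).
    logging.basicConfig(level=logging.INFO, format="%(message)s")
    scrape_with_log(42, lambda page_id: f"<html>page {page_id}</html>")
```

Keeping the fields in a fixed `key=value` layout makes it easier to grep the output later, for example to spot slow pages or estimate requests/second.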

Last updated 5 years ago
