Data Ingestion
Two-Phase Scraping Engine
The ingestion engine is built on Selenium WebDriver and uses a Lite → Heavy two-phase strategy, supporting multiple listing categories including Property and Vehicles.
Two-Phase Strategy
Category Scrapers
PropertyScraper
Property Rentals & Sales
Fields: Bedrooms, Bathrooms, Parking, Pet Friendly, Size (sqm), Available From
VehiclesScraper
Cars & Bakkies
Fields: Make, Model, Year, Mileage (km), Fuel Type, Transmission
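As a rough sketch of how these per-category fields might be modelled in code (the class and field names below are illustrative assumptions, not the project's actual schema):

# Illustrative field models for the two category scrapers (names are assumptions).
from dataclasses import dataclass


@dataclass
class PropertyListing:
    bedrooms: int | None = None
    bathrooms: int | None = None
    parking: int | None = None
    pet_friendly: bool | None = None
    size_sqm: float | None = None
    available_from: str | None = None   # e.g. "2025-07-01"


@dataclass
class VehicleListing:
    make: str | None = None
    model: str | None = None
    year: int | None = None
    mileage_km: int | None = None
    fuel_type: str | None = None        # e.g. "Petrol", "Diesel"
    transmission: str | None = None     # e.g. "Manual", "Automatic"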
Scraping Lifecycle
1. URL Collection: the Lite scrape gathers listing URLs from search pages
2. Queue Population: new URLs are added to the scrape_queue table
3. Heavy Scrape: each queued URL is visited for full detail extraction
4. Phone Extraction: log in and click 'Show number' to capture the seller's phone number
5. Image Upload: listing images are uploaded to DigitalOcean Spaces
6. Persistence: the listing is UPSERTed with its lead quality score (a queue and UPSERT sketch follows this list)
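A minimal sketch of steps 2 and 6, using SQLite for illustration; scrape_queue is named above, but the listings table, its columns, and the database engine are assumptions and may differ from the real schema:

# Sketch: queue population (step 2) and UPSERT persistence (step 6).
# Table and column names beyond scrape_queue are assumptions.
import sqlite3

conn = sqlite3.connect("scraper.db")
conn.executescript("""
    CREATE TABLE IF NOT EXISTS scrape_queue (
        url TEXT PRIMARY KEY,
        status TEXT DEFAULT 'pending'       -- pending / done / failed
    );
    CREATE TABLE IF NOT EXISTS listings (
        url TEXT PRIMARY KEY,
        title TEXT,
        seller_phone TEXT,
        lead_quality_score REAL
    );
""")


def enqueue(urls: list[str]) -> None:
    """Step 2: add newly discovered URLs, skipping ones already queued."""
    conn.executemany(
        "INSERT OR IGNORE INTO scrape_queue (url) VALUES (?)",
        [(u,) for u in urls],
    )
    conn.commit()


def upsert_listing(url: str, title: str, phone: str, score: float) -> None:
    """Step 6: insert or update the listing together with its lead quality score."""
    conn.execute(
        """
        INSERT INTO listings (url, title, seller_phone, lead_quality_score)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(url) DO UPDATE SET
            title = excluded.title,
            seller_phone = excluded.seller_phone,
            lead_quality_score = excluded.lead_quality_score
        """,
        (url, title, phone, score),
    )
    # Mark the URL as processed so interrupted runs can resume from the queue.
    conn.execute("UPDATE scrape_queue SET status = 'done' WHERE url = ?", (url,))
    conn.commit()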
Anti-Detection Strategies
User-Agent Rotation: a random User-Agent is set for each new driver session
Barebones Mode: images/CSS are blocked for faster, lighter scraping
Stealth Arguments: --disable-blink-features=AutomationControlled
Proxy Support: Webshare proxy integration for IP rotation (combined with the strategies above in the driver setup sketch below)
Error Resilience: resume support via scrape_queue tracking
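The first four strategies come together when the Chrome driver is created. A minimal Selenium sketch, assuming Chrome/ChromeDriver; the User-Agent pool and the proxy address are placeholders, not the project's actual values:

# Sketch: anti-detection Chrome setup (UA rotation, barebones mode,
# stealth flag, optional proxy). Placeholder values are assumptions.
import random

from selenium import webdriver

USER_AGENTS = [  # example pool; the real project may rotate a larger list
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/124.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/123.0 Safari/537.36",
]


def build_driver(barebones: bool = True, proxy: str | None = None) -> webdriver.Chrome:
    options = webdriver.ChromeOptions()
    # User-Agent rotation: pick a fresh UA for each driver session
    options.add_argument(f"--user-agent={random.choice(USER_AGENTS)}")
    # Stealth argument from the list above
    options.add_argument("--disable-blink-features=AutomationControlled")
    if barebones:
        # Barebones mode: block images to keep pages light; CSS/fonts can be
        # cut further via request interception (e.g. selenium-wire)
        options.add_experimental_option(
            "prefs", {"profile.managed_default_content_settings.images": 2}
        )
    if proxy:
        # e.g. "http://PROXY_HOST:PORT" (placeholder); authenticated Webshare
        # proxies typically need an extension or selenium-wire for credentials
        options.add_argument(f"--proxy-server={proxy}")
    return webdriver.Chrome(options=options)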
scraper_config.py

# Two-Phase Scraping Strategy
from dataclasses import dataclass


@dataclass
class ScraperConfig:
    target_urls: list[str]
    barebones: bool = True       # Block images/CSS
    use_proxy: bool = False      # Webshare proxy
    upload_images: bool = True   # DO Spaces upload

    # Phase 1: Lite Scrape - Collect URLs
    def lite_scrape(self) -> list[str]:
        """Collect listing URLs from search pages."""
        return self._extract_urls_from_search()

    # Phase 2: Heavy Scrape - Full Details
    def heavy_scrape(self, urls: list[str]) -> None:
        """Visit each URL for complete data extraction."""
        for url in urls:
            self._extract_listing_details(url)
            self._reveal_phone_number()      # Requires login
            self._upload_images_to_cloud()

    # (supporting _extract_* / _reveal_* / _upload_* helpers omitted here)
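For the _reveal_phone_number step, a hedged Selenium sketch of clicking a 'Show number' control on an already logged-in session; the selectors below are assumptions about the page markup, not the project's actual implementation:

# Sketch: revealing the seller's phone number (lifecycle step 4).
# The XPath/CSS selectors are assumptions about the listing page markup.
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def reveal_phone_number(driver, timeout: int = 10) -> str | None:
    """Click the 'Show number' button (requires a logged-in session) and read the result."""
    wait = WebDriverWait(driver, timeout)
    try:
        button = wait.until(
            EC.element_to_be_clickable(
                (By.XPATH, "//button[contains(., 'Show number')]")
            )
        )
        button.click()
        phone_link = wait.until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, "a[href^='tel:']"))
        )
        return phone_link.get_attribute("href").removeprefix("tel:")
    except Exception:
        # Leave the phone blank rather than failing the whole heavy scrape.
        return None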