// data engineering · 2024

County Clerk Research Pipeline

Playwright scraping, PDF ingestion, and TRC data cleaning for Texas mineral rights research

The source layer for Ranger Discovery. It automates the document-collection side of mineral rights research: scraping county clerk portals, ingesting and parsing PDFs, and processing Texas Railroad Commission records into a usable dataset.

County Clerk Scraping

County clerk portals vary significantly in structure: some are JavaScript-rendered, while others serve results through paginated form submissions. Playwright handles session management, dynamic page rendering, and rate-aware pagination across multiple Texas county systems. Raw PDFs and record metadata are then staged for ingestion.
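A minimal sketch of what rate-aware pagination can look like with Playwright's sync API. Everything portal-specific here is an assumption for illustration: the CSS selectors (`table.results`, `a.pdf-link`, `a.next-page`), the `RecordRow` fields, and the delay values are hypothetical and would differ for each county system.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class RecordRow:
    """Metadata for one search result (hypothetical field set)."""
    instrument_no: str
    doc_type: str
    pdf_url: str


def polite_delay(base_seconds: float, jitter: float = 0.5) -> float:
    """Return a delay in [base, base * (1 + jitter)] so requests are spaced unevenly."""
    return base_seconds * (1 + random.uniform(0, jitter))


def scrape_search_results(start_url: str, max_pages: int = 5,
                          base_delay: float = 2.0) -> list[RecordRow]:
    # Imported lazily so the helpers above can be used without a browser install.
    from playwright.sync_api import sync_playwright

    rows: list[RecordRow] = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()  # one context keeps session cookies across pages
        page = context.new_page()
        page.goto(start_url, wait_until="networkidle")

        for _ in range(max_pages):
            # Wait for JavaScript-rendered rows before reading them.
            page.wait_for_selector("table.results tbody tr")
            for tr in page.locator("table.results tbody tr").all():
                cells = tr.locator("td")
                href = tr.locator("a.pdf-link").first.get_attribute("href")
                rows.append(RecordRow(
                    instrument_no=cells.nth(0).inner_text().strip(),
                    doc_type=cells.nth(1).inner_text().strip(),
                    pdf_url=href or "",
                ))

            next_link = page.locator("a.next-page")
            if next_link.count() == 0:
                break  # last page reached
            time.sleep(polite_delay(base_delay))  # pause before the next request
            next_link.first.click()
            page.wait_for_load_state("networkidle")

        browser.close()
    return rows
```

Randomizing the delay rather than sleeping a fixed interval is one common way to keep load on a small public portal low; a real run would also want retries and per-county selector configuration.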

PDF Ingestion & Parsing

Mineral lease documents, deed records, and conveyance instruments are ingested via pdfplumber. Regex-based extraction pulls grantor and grantee names, legal descriptions, tract identifiers, and acreage figures into structured records. This structured extraction layer runs before any LLM is involved.
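The extraction step can be sketched as follows, assuming text-based PDFs with labeled party lines. The regex patterns and the `DeedFields` structure are hypothetical; real instruments vary widely in wording, and this sketch covers only grantor, grantee, and acreage, not legal descriptions or tract identifiers.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical patterns for illustration only.
GRANTOR_RE = re.compile(r"GRANTORS?:\s*(?P<name>.+)", re.IGNORECASE)
GRANTEE_RE = re.compile(r"GRANTEES?:\s*(?P<name>.+)", re.IGNORECASE)
ACREAGE_RE = re.compile(r"(?P<acres>\d[\d,]*(?:\.\d+)?)\s+acres", re.IGNORECASE)


@dataclass
class DeedFields:
    grantor: Optional[str]
    grantee: Optional[str]
    acreage: Optional[float]


def extract_text(pdf_path: str) -> str:
    """Concatenate the text of every page; image-only pages yield empty strings."""
    import pdfplumber  # imported lazily so parse_fields works without it

    with pdfplumber.open(pdf_path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def parse_fields(text: str) -> DeedFields:
    """Pull the first grantor, grantee, and acreage mention out of raw text."""
    grantor = GRANTOR_RE.search(text)
    grantee = GRANTEE_RE.search(text)
    acres = ACREAGE_RE.search(text)
    return DeedFields(
        grantor=grantor.group("name").strip() if grantor else None,
        grantee=grantee.group("name").strip() if grantee else None,
        acreage=float(acres.group("acres").replace(",", "")) if acres else None,
    )
```

Calling `parse_fields(extract_text(path))` yields one structured record per document. Fields that do not match come back as `None`, so gaps stay visible instead of being silently filled in.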

Texas Railroad Commission Data

TRC drilling permit datasets, wellbore records, and operator filings are downloaded and cleaned into normalized tables. This dataset provides the cross-reference layer for matching scraped deed records to active well and lease activity in the target counties.
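One way the cross-reference could work is sketched below, assuming deed grantees are matched to permit operators on county plus a normalized entity name. The matching key, the dictionary field names (`county`, `operator`, `grantee`), and the suffix list are all assumptions for illustration; the actual join logic is not described here.

```python
import re
from collections import defaultdict

# Entity suffixes dropped before comparison (hypothetical, not exhaustive).
_SUFFIXES = {"LLC", "INC", "LP", "LTD", "CO", "CORP", "COMPANY", "CORPORATION"}


def normalize_name(raw: str) -> str:
    """Uppercase, strip punctuation, collapse whitespace, drop trailing suffixes."""
    cleaned = raw.upper().replace(".", "")          # "L.L.C." -> "LLC"
    tokens = re.sub(r"[^A-Z0-9 ]+", " ", cleaned).split()
    while tokens and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def _key(county: str, name: str) -> tuple[str, str]:
    return (county.strip().upper(), normalize_name(name))


def index_permits(permits: list[dict]) -> dict[tuple[str, str], list[dict]]:
    """Group permit rows by (county, normalized operator name)."""
    index: dict[tuple[str, str], list[dict]] = defaultdict(list)
    for permit in permits:
        index[_key(permit["county"], permit["operator"])].append(permit)
    return index


def match_deeds(deeds: list[dict], permits: list[dict]) -> list[tuple[dict, list[dict]]]:
    """Pair each deed with the permits sharing its county and grantee name."""
    index = index_permits(permits)
    return [(deed, index.get(_key(deed["county"], deed["grantee"]), []))
            for deed in deeds]
```

Exact matching on normalized names is deliberately conservative: it misses spelling variants but avoids false links. With the tables loaded into PostgreSQL, the same idea would be a join on the normalized columns.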

Skills

Python · Playwright · pdfplumber · PostgreSQL · Regex extraction · Browser automation · PDF parsing · Data cleaning · Railroad Commission datasets