// technical · data engineering
W-1 Drilling Permit PDF Extraction System
Structured extraction from Texas Railroad Commission drilling permit forms
The W-1 system extracts structured permit metadata from Texas Railroad Commission W-1 drilling permit PDFs. These forms present operator name, legal description, formation, API number, and county in table layouts that carry no machine-readable structure. The extractor parses these fields and normalizes them into a typed DataFrame.
relationship to ranger
This system was an early prototype for Ranger's OCR ingestion pipeline. The PDF parsing patterns, regex extraction logic, and DataFrame normalization approach developed here directly influenced how Ranger handles mineral lease documents and RRC permit ingestion.
System Design
W-1 forms from the Railroad Commission are structured as multi-section PDFs with tabular data, checkboxes, and free-text fields. The forms are not machine-readable without explicit parsing — they are scanned or digitally generated PDFs that require layout-aware extraction.
pdfplumber is used to extract text and table coordinates from each page. Regex patterns then isolate specific fields — operator name, API number, legal description, formation name, county, and permit date — from the extracted text. Output is written to a typed DataFrame with column-level validation.
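As a rough illustration of the text-then-regex approach described above, the sketch below pulls a few fields out of one page of extracted text. The field labels ("Operator Name:", "API No.:", and so on), the pattern set, and the sample text are all assumptions made for the example, not the actual W-1 layout or the system's real patterns. Legal description is left out because it usually spans several lines and would need a layout-aware rule.

```python
import re

# Hypothetical label-based patterns; real W-1 label wording may differ.
FIELD_PATTERNS = {
    "operator": re.compile(r"Operator Name:\s*(.+)"),
    # Assumes the 10-digit hyphenated API form: state-county-well.
    "api_number": re.compile(r"API No\.?:\s*(\d{2}-\d{3}-\d{5})"),
    "county": re.compile(r"County:\s*([A-Za-z .'-]+)"),
    "formation": re.compile(r"Formation:\s*(.+)"),
    "permit_date": re.compile(r"Permit Date:\s*(\d{1,2}/\d{1,2}/\d{4})"),
}


def load_page_texts(pdf_path):
    """Return the extracted text of each page using pdfplumber."""
    # Imported lazily so the regex part of this sketch runs without pdfplumber.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can yield nothing for an empty page, hence `or ""`.
        return [page.extract_text() or "" for page in pdf.pages]


def extract_fields(text):
    """Apply every pattern to the text; fields with no match are set to None."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1).strip() if match else None
    return fields


# Invented sample text standing in for one page of pdfplumber output.
SAMPLE_TEXT = """\
Operator Name: EXAMPLE OPERATING LLC
API No.: 42-389-12345
County: REEVES
Formation: WOLFCAMP
Permit Date: 03/15/2021
"""

if __name__ == "__main__":
    print(extract_fields(SAMPLE_TEXT))
```

Keeping one compiled pattern per field in a dictionary makes it easy to add or adjust a field without touching the extraction loop, and a None value marks a field for later review instead of raising.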
Extraction Pipeline
// extraction stages
- 01 Load PDF → pdfplumber page objects
- 02 Extract raw text + table coordinates per page
- 03 Apply regex patterns per field type (operator, API, formation)
- 04 Normalize extracted strings → typed values
- 05 Write to DataFrame with column schema
- 06 Flag incomplete or ambiguous extractions
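Stages 04 through 06 could look something like the following minimal sketch. It takes a dictionary of raw strings per permit, converts each value to a typed form, lists the fields that failed, and builds a DataFrame with explicit column types. The field names, the 10-digit API assumption, the MM/DD/YYYY date format, and the `missing_fields` column are illustrative choices, not the system's actual schema.

```python
from datetime import datetime

# Hypothetical column schema for the example.
FIELD_NAMES = ("operator", "api_number", "county", "formation", "permit_date")


def normalize(raw):
    """Stage 04: convert raw strings to typed values; failures become None."""
    out = {}
    for key in ("operator", "county", "formation"):
        value = raw.get(key)
        # Collapse repeated whitespace and uppercase for consistent matching.
        out[key] = " ".join(value.split()).upper() if value else None

    # Assumes a 10-digit API number, with or without hyphens.
    api = raw.get("api_number")
    digits = "".join(ch for ch in api if ch.isdigit()) if api else ""
    out["api_number"] = (
        f"{digits[:2]}-{digits[2:5]}-{digits[5:]}" if len(digits) == 10 else None
    )

    # Assumes MM/DD/YYYY dates; anything else is treated as unparseable.
    try:
        out["permit_date"] = datetime.strptime(
            raw.get("permit_date") or "", "%m/%d/%Y"
        ).date()
    except ValueError:
        out["permit_date"] = None
    return out


def flag_incomplete(record):
    """Stage 06: name the fields that are missing after normalization."""
    return [name for name in FIELD_NAMES if record.get(name) is None]


def to_dataframe(records):
    """Stage 05: build a DataFrame with an explicit column schema."""
    # Imported lazily so the functions above run without pandas installed.
    import pandas as pd

    frame = pd.DataFrame(records, columns=[*FIELD_NAMES, "missing_fields"])
    frame["permit_date"] = pd.to_datetime(frame["permit_date"])
    text_columns = ["operator", "api_number", "county", "formation"]
    return frame.astype({column: "string" for column in text_columns})


if __name__ == "__main__":
    # Invented raw values standing in for regex output.
    raw = {"operator": " Example  Operating llc", "api_number": "4238912345",
           "county": "Reeves", "formation": "Wolfcamp", "permit_date": "03/15/2021"}
    record = normalize(raw)
    record["missing_fields"] = flag_incomplete(record)
    print(to_dataframe([record]).dtypes)
```

Storing the list of missing fields next to each row keeps incomplete permits in the output while making them easy to filter for manual review.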