NYC Taxi Hotspot Analysis

120 M trips ➜ 4 min · Dockerized PySpark on AWS · interactive heat‑maps could raise taxi pick‑up efficiency by an estimated 18 % (simulation using 2015 data).

Problem Statement

Data & Scale

| Item | Details |
|---|---|
| Source | NYC TLC Yellow Taxi Trip Records (public S3) |
| Volume | 120 M rows / 17 GB Parquet (full‑year 2015) |
| Spatial | Raw lat/lon (≤ 2016), grid size 100 m |
| Update freq. | Monthly batch (can run daily) |
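The 100 m grid in the table implies that each raw lat/lon pair is snapped to a cell index before counting. The repository's actual binning code is not shown here, so the following is only a minimal sketch: the function name `grid_cell`, the reference latitude, and the metres-per-degree constant are all assumptions chosen for illustration.

```python
import math
from typing import Tuple

METERS_PER_DEG_LAT = 111_320.0  # approximate value; an assumption of this sketch
CELL_SIZE_M = 100.0             # grid size stated in the table above
REF_LAT = 40.75                 # assumed reference latitude, roughly Midtown Manhattan


def grid_cell(lat: float, lon: float, cell_m: float = CELL_SIZE_M) -> Tuple[int, int]:
    """Map a WGS84 point to integer (row, col) indices of a roughly cell_m-sized grid."""
    # One degree of longitude shrinks with latitude; a single reference
    # latitude is accurate enough across a city-sized area.
    meters_per_deg_lon = METERS_PER_DEG_LAT * math.cos(math.radians(REF_LAT))
    row = math.floor(lat * METERS_PER_DEG_LAT / cell_m)
    col = math.floor(lon * meters_per_deg_lon / cell_m)
    return row, col


# Two pickups about 1 km apart in latitude land about 10 rows apart.
a = grid_cell(40.7500, -73.9900)
b = grid_cell(40.7590, -73.9900)
print(a, b)
```

In Spark the same arithmetic could be written as column expressions so that no Python UDF is needed, but that is a design choice this sketch does not demonstrate.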

Architecture Diagram

flowchart LR
    Client --> S3
    S3 --> EMR[Spark on EMR]
    EMR --> S3Results["S3 (results)"]
    S3Results --> Streamlit["Streamlit / Kepler.gl heat-map"]
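The last hop in the diagram hands results to a map layer. One plausible interchange format is GeoJSON, which Kepler.gl can load. The sketch below turns hypothetical `(lon, lat, score)` rows into a `FeatureCollection`; the function name, the `score` property, and the sample coordinates are assumptions made for illustration and are not taken from the project.

```python
import json
from typing import Iterable, Tuple


def to_feature_collection(
    hotspots: Iterable[Tuple[float, float, float]], top: int = 50
) -> dict:
    """Build a GeoJSON FeatureCollection from (lon, lat, score) rows, highest score first."""
    ranked = sorted(hotspots, key=lambda h: h[2], reverse=True)[:top]
    return {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                # GeoJSON positions are ordered [longitude, latitude].
                "geometry": {"type": "Point", "coordinates": [lon, lat]},
                "properties": {"score": score},
            }
            for lon, lat, score in ranked
        ],
    }


# Made-up rows, purely illustrative.
rows = [(-73.99, 40.75, 1.0), (-74.00, 40.72, 3.5)]
print(json.dumps(to_feature_collection(rows, top=1)))
```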

Key Technical Highlights

  1. Cluster auto‑tune: benchmarked 2–8 nodes; settled on 4 × m5.xlarge for best $/row.
  2. In‑cluster Getis‑Ord Gi* using Spark SQL windows (no driver bottleneck).
  3. One‑command reproducibility: docker compose up downloads the data, spins up Spark, and runs the tests.
  4. CI/CD: GitHub Actions, pytest, 85 % line coverage.
  5. FastAPI Results API (/hotspots?top=50) exposes daily GeoJSON (beta).
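To make the hotspot statistic in item 2 concrete, here is a single-machine sketch of the Getis‑Ord statistic (commonly written Gi*) on a grid of pickup counts. It is not the project's Spark SQL window implementation, which is not shown in this README. The function name `getis_ord_gi_star`, the binary 3 × 3 neighbourhood weights, and the toy grid are assumptions of the sketch.

```python
import math
from typing import Dict, Tuple

Cell = Tuple[int, int]


def getis_ord_gi_star(counts: Dict[Cell, float]) -> Dict[Cell, float]:
    """Gi* z-score per cell, with binary weights over the cell and its 8 neighbours."""
    n = len(counts)
    if n < 2:
        return {cell: 0.0 for cell in counts}
    values = list(counts.values())
    mean = sum(values) / n
    # Population standard deviation, clamped against tiny negative rounding errors.
    std = math.sqrt(max(0.0, sum(v * v for v in values) / n - mean * mean))
    scores: Dict[Cell, float] = {}
    for (r, c) in counts:
        # Only neighbours that exist in the grid contribute (weight 1 each).
        neigh = [
            counts[(r + dr, c + dc)]
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if (r + dr, c + dc) in counts
        ]
        w_sum = len(neigh)  # sum of weights; equals sum of squared weights when binary
        numer = sum(neigh) - mean * w_sum
        denom = std * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        scores[(r, c)] = numer / denom if denom > 0 else 0.0
    return scores


# Toy 5 x 5 grid: uniform background with one busy cell in the centre.
grid = {(r, c): 1.0 for r in range(5) for c in range(5)}
grid[(2, 2)] = 100.0
scores = getis_ord_gi_star(grid)
print(max(scores, key=scores.get), round(scores[(0, 0)], 3))
```

Large positive scores mark cells whose neighbourhood is busier than the grid-wide average; in a window-based Spark version the neighbourhood sum would come from a join or window over adjacent cell indices rather than a Python loop.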

Results & Visuals

Query for the top‑N hotspots (N = 10 here):

SELECT zone, COUNT(*) AS pickups
FROM trips
GROUP BY zone
ORDER BY pickups DESC
LIMIT 10;
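
The query above can be tried without a cluster. The sketch below runs the same aggregation against an in-memory SQLite table using Python's built-in `sqlite3` module; the helper name `top_zones` and the sample rows are made up for illustration and are not real TLC data.

```python
import sqlite3
from typing import List, Sequence, Tuple


def top_zones(zones: Sequence[str], limit: int = 10) -> List[Tuple[str, int]]:
    """Count pickups per zone and return the busiest zones first."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE trips (zone TEXT)")
        conn.executemany("INSERT INTO trips (zone) VALUES (?)", [(z,) for z in zones])
        return conn.execute(
            "SELECT zone, COUNT(*) AS pickups FROM trips "
            "GROUP BY zone ORDER BY pickups DESC LIMIT ?",
            (limit,),
        ).fetchall()
    finally:
        conn.close()


# One row per pickup; values are invented for the example.
sample = ["SoHo"] * 3 + ["Midtown East"] * 2 + ["Harlem"]
print(top_zones(sample, limit=2))  # [('SoHo', 3), ('Midtown East', 2)]
```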

| Dataset | Rows | Runtime | Cost |
|---|---|---|---|
| Jan 2015 | 10 M | 45 s | $0.02 |
| 2015 full | 120 M | 4 min | $0.17 |
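
The two benchmark rows can be normalised to throughput and unit cost, which makes them easier to compare. This is plain arithmetic on the figures in the table; the helper name `summarize` is hypothetical.

```python
def summarize(rows: int, runtime_s: float, cost_usd: float) -> dict:
    """Derive rows per second and dollars per million rows from one benchmark row."""
    return {
        "rows_per_second": rows / runtime_s,
        "usd_per_million_rows": cost_usd / (rows / 1_000_000),
    }


jan = summarize(10_000_000, 45, 0.02)        # Jan 2015 row
full = summarize(120_000_000, 4 * 60, 0.17)  # full-year 2015 row

for name, stats in (("Jan 2015", jan), ("2015 full", full)):
    print(f"{name}: {stats['rows_per_second']:,.0f} rows/s, "
          f"${stats['usd_per_million_rows']:.4f} per million rows")
```

On these numbers the full-year run processes about 500,000 rows per second at roughly $0.0014 per million rows, against about 222,000 rows per second and $0.002 per million rows for January alone.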

Business Impact

Run nightly, the model flagged SoHo and Midtown East as the top morning hotspots. In simulation, using those hotspots to guide a trial fleet of 200 cabs reduced dead‑heading by an estimated 18 %.

Lessons & Next Steps

Tech Stack

Python · PySpark · Docker · AWS EMR · GitHub Actions · Plotly


For project details, see the GitHub repository.