NYC Taxi Hotspot Analysis

120 M trips ➜ 4 min · Dockerized PySpark on AWS · interactive heat‑maps could raise taxi pick‑up efficiency by an estimated 18 % (simulation using 2015 data).

Problem Statement

Business pain: Dispatchers lacked real‑time insight into high‑demand zones.
Goal: Identify top pick‑up hotspots and surface them as GeoJSON / Plotly maps.
Success metric: End‑to‑end refresh in < 5 min on a $1 EMR cluster.
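
To illustrate the GeoJSON output named in the goal, here is a minimal sketch of how hotspot points might be serialized as a FeatureCollection. The function name `hotspots_to_geojson`, the `pickups` property, and the sample coordinates are assumptions for illustration; the repository's actual schema may differ.

```python
import json
from typing import Dict, List, Tuple


def hotspots_to_geojson(hotspots: List[Tuple[float, float, int]]) -> Dict:
    """Build a GeoJSON FeatureCollection from (lat, lon, pickups) tuples."""
    features = [
        {
            "type": "Feature",
            # GeoJSON (RFC 7946) orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            # "pickups" is a hypothetical property name.
            "properties": {"pickups": pickups},
        }
        for lat, lon, pickups in hotspots
    ]
    return {"type": "FeatureCollection", "features": features}


# Made-up sample values, for illustration only.
sample = [(40.7233, -74.0030, 1200), (40.7540, -73.9720, 950)]
geojson_text = json.dumps(hotspots_to_geojson(sample), indent=2)
```

Tools such as Plotly and Kepler.gl can read a FeatureCollection like this directly, which is one reason to prefer GeoJSON over an ad hoc CSV layout.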

Data & Scale

| Item | Details |
| --- | --- |
| Source | NYC TLC Yellow Taxi Trip Records (public S3) |
| Volume | 120 M rows / 17 GB Parquet (full‑year 2015) |
| Spatial | Raw lat/lon (≤ 2016), grid size 100 m |
| Update frequency | Monthly batch (can run daily) |
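
To make the 100 m grid concrete, here is a minimal sketch of how a raw pick-up coordinate could be snapped to a grid cell. The equirectangular approximation, the reference latitude, and the name `to_grid_cell` are assumptions for illustration and not necessarily what the repository does.

```python
import math
from typing import Tuple

# Assumptions for illustration: a flat-earth (equirectangular) approximation
# anchored at a reference latitude near New York City.
CELL_SIZE_M = 100.0          # grid size stated in the table above
REF_LAT_DEG = 40.75          # hypothetical reference latitude
M_PER_DEG_LAT = 111_320.0    # approximate metres per degree of latitude
M_PER_DEG_LON = M_PER_DEG_LAT * math.cos(math.radians(REF_LAT_DEG))


def to_grid_cell(lat: float, lon: float) -> Tuple[int, int]:
    """Map a lat/lon pair to integer (row, col) indices of a ~100 m cell."""
    row = math.floor(lat * M_PER_DEG_LAT / CELL_SIZE_M)
    col = math.floor(lon * M_PER_DEG_LON / CELL_SIZE_M)
    return row, col


# Two points about 56 m apart in latitude fall into adjacent rows here.
cell_a = to_grid_cell(40.7500, -73.9900)
cell_b = to_grid_cell(40.7505, -73.9900)
```

Integer cell indices make the later aggregation a plain group-by on `(row, col)`. A projected coordinate system or a hex grid would be more accurate over larger areas.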

Architecture Diagram

flowchart LR
    Client --> S3
    S3 --> EMR[Spark on EMR]
    EMR --> S3Results["S3 (results)"]
    S3Results --> Streamlit[Streamlit / Kepler.gl heat-map]

Key Technical Highlights

Cluster auto‑tune: benchmarked 2–8 nodes and settled on 4 × m5.xlarge for the lowest cost per row.
In‑cluster Getis‑Ord Gi* using Spark SQL windows (no driver bottleneck).
One‑command reproducibility: docker compose up downloads the data, spins up Spark, and runs the tests.
CI/CD: GitHub Actions, pytest, 85 % line coverage.
FastAPI Results API (/hotspots?top=50) exposes daily GeoJSON (beta).
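
For readers unfamiliar with the Getis-Ord hotspot statistic (Gi*), the sketch below computes it in plain Python on a small grid. It is not the Spark SQL window implementation the project uses. The binary 3×3 neighbourhood, the dictionary input, and the name `getis_ord_gi_star` are assumptions chosen to keep the example short.

```python
import math
from typing import Dict, Tuple

Cell = Tuple[int, int]


def getis_ord_gi_star(counts: Dict[Cell, float]) -> Dict[Cell, float]:
    """Return a Gi* z-score per cell using binary 3x3 (queen) weights.

    Each cell counts as its own neighbour, which distinguishes Gi* from Gi.
    Cells absent from `counts` are treated as outside the study area.
    """
    n = len(counts)
    if n < 2:
        return {cell: 0.0 for cell in counts}
    mean = sum(counts.values()) / n
    variance = sum(v * v for v in counts.values()) / n - mean * mean
    std = math.sqrt(max(variance, 0.0))
    scores = {}
    for (r, c) in counts:
        neigh = [
            counts[(r + dr, c + dc)]
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if (r + dr, c + dc) in counts
        ]
        w_sum = len(neigh)  # binary weights: sum of w equals sum of w^2
        numer = sum(neigh) - mean * w_sum
        denom = std * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        scores[(r, c)] = numer / denom if denom > 0 else 0.0
    return scores


# Toy 5x5 grid with one busy cell in the middle (made-up values).
grid = {(r, c): 1.0 for r in range(5) for c in range(5)}
grid[(2, 2)] = 100.0
scores = getis_ord_gi_star(grid)
```

Cells near the busy centre receive positive z-scores and the far corners negative ones. In a Spark setting, the neighbourhood sum would typically come from a self-join or window over the cell indices rather than a Python loop.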

Results & Visuals

Query top-N hotspots:

SELECT zone, COUNT(*) AS pickups
FROM trips
GROUP BY zone
ORDER BY pickups DESC
LIMIT 10;
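
The query above can be tried locally without a cluster. The sketch below runs it against a toy in-memory SQLite table. The table contents are made up, and it assumes a `zone` column has already been derived from the raw coordinates.

```python
import sqlite3

# Toy in-memory table; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (zone TEXT)")
conn.executemany(
    "INSERT INTO trips (zone) VALUES (?)",
    [("SoHo",), ("SoHo",), ("SoHo",),
     ("Midtown East",), ("Midtown East",), ("Harlem",)],
)

TOP_N_QUERY = """
SELECT zone, COUNT(*) AS pickups
FROM trips
GROUP BY zone
ORDER BY pickups DESC
LIMIT 10;
"""

# Each result row is a (zone, pickups) tuple, busiest zone first.
top_zones = conn.execute(TOP_N_QUERY).fetchall()
conn.close()
```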

| Dataset | Rows | Runtime | Cost |
| --- | --- | --- | --- |
| Jan 2015 | 10 M | 45 s | $0.02 |
| 2015 full | 120 M | 4 min | $0.17 |
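
The table's figures imply throughput and unit cost, which the short calculation below derives. It uses only the numbers reported above and reads the full-year runtime as 4 minutes. The dictionary layout and key names are illustrative.

```python
# Figures copied from the results table: (rows, runtime in seconds, cost in USD).
runs = {
    "Jan 2015": (10_000_000, 45, 0.02),
    "2015 full": (120_000_000, 4 * 60, 0.17),
}

summary = {}
for name, (rows, seconds, cost_usd) in runs.items():
    summary[name] = {
        "rows_per_second": rows / seconds,
        "usd_per_million_rows": cost_usd / (rows / 1_000_000),
    }
```

This works out to roughly 222,000 rows/s for the single month and 500,000 rows/s for the full year, at about $0.0020 and $0.0014 per million rows respectively. The larger job therefore amortizes fixed start-up overhead better.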

Business Impact

Running nightly, the model flagged SoHo and Midtown East as the top morning hotspots. Using these to guide a trial fleet of 200 cabs reduced dead‑heading by an estimated 18 % in simulation.

Live Demo / Repo Links

View code: github.com/vamshim005/nyc-taxi-hotspot
Launch demo: Interactive Plotly Heatmap
Docker quick‑start:
```
docker compose up && make run
```

Lessons & Next Steps

Add real‑time Kafka ingestion for a live hotspot dashboard.
Experiment with H3 hex grids and spatial functions on Spark 3.5.
See open TODOs

Tech Stack

Python · PySpark · Docker · AWS EMR · GitHub Actions · Plotly

For project details, see the GitHub repository.