New York City’s open data platform is an incredible source of information. By law, all public data collected and generated by the city must be made available through the portal, free for the public to use.
Datasets range from transport, housing, and motor vehicle incidents, to a Central Park squirrel census, and even park ranger reports of aggressive turtle encounters.
Geography, infrastructure, and sociology datasets like these represent real-world processes and events. Even if you have no connection to or little interest in NYC or urban areas in general, they give you a chance to work with data that looks a lot more like what you’ll encounter in a professional role than the likes of MNIST or Titanic survivors. Better still, they’re almost as easy to access.
We’re going to run through a demonstration of just how easy these datasets are to use and build some interesting visuals in the process.
To keep the code blocks as succinct as possible, here are the required modules for all the code in this post:
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from scipy.stats import gaussian_kde
import seaborn as sns
from shapely.geometry import Point, shape, box, Polygon
Make sure they’re installed if you want to replicate anything yourself.
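If any are missing, a single pip command along these lines should cover them (assuming the standard PyPI package names, which match the import names here):

```shell
pip install geopandas matplotlib numpy pandas plotly scipy seaborn shapely
```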
This is one of my favorite datasets to play around with. The data includes footprint polygons, ages, and heights for most of the buildings in NYC.
We’ll start with the data pull, kept separate from the visualization code, since we’re using this dataset for a couple of different visuals.
# Pull data
api_endpoint = 'https://data.cityofnewyork.us/resource/qb5r-6dgf.json'
limit = 1000 # Number of rows per request
offset = 0 # Starting offset
data_frames = []  # List to hold chunks of data
# Loop to fetch data iteratively
# while offset <= 100000: # uncomment this and comment while True…
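Since the loop itself is worth spelling out, here is a minimal sketch of how the paginated fetch can work. Socrata endpoints page through results with the `$limit` and `$offset` query parameters; the `fetch_all` helper and its `reader` hook below are illustrative conveniences, not part of the original code.

```python
import pandas as pd

API_ENDPOINT = 'https://data.cityofnewyork.us/resource/qb5r-6dgf.json'

def fetch_all(endpoint=API_ENDPOINT, limit=1000, max_rows=None,
              reader=pd.read_json):
    """Fetch `limit`-row pages until an empty page signals the end.

    `reader` is any callable that turns a URL into a DataFrame
    (pd.read_json handles Socrata's JSON responses directly).
    """
    frames = []   # List to hold chunks of data
    offset = 0    # Starting offset
    while True:
        chunk = reader(f'{endpoint}?$limit={limit}&$offset={offset}')
        if chunk.empty:          # No rows left: we've paged past the end
            break
        frames.append(chunk)
        offset += limit
        if max_rows is not None and offset >= max_rows:
            break                # Optional cap, e.g. max_rows=100_000
    return pd.concat(frames, ignore_index=True)
```

Calling `fetch_all()` pulls the whole dataset; passing `max_rows=100_000` caps the download, mirroring the commented-out `while offset <= 100000` alternative above.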