In this final group project, we analyze NYC taxi data on the topic "Understanding Taxi Economics". The analysis is implemented as MapReduce jobs using the Hadoop Streaming API and Python. It addresses the following questions:
- How does revenue vary across neighborhoods and how does it correlate with the median household income in the neighborhood?
- How does revenue vary over time? Are there months or seasons when taxi companies make more (or less) money?
- How long do cab drivers drive without passengers? How does this vary over time?
- Are revenues affected during major events (e.g., parades, presidential visits, storms)?
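
As an illustration of how one of these questions can map onto Hadoop Streaming, here is a minimal sketch of a mapper and reducer for revenue per month. It is not the project's actual code. It assumes the fare CSV stores `pickup_datetime` (formatted `YYYY-MM-DD HH:MM:SS`) in column 3 and `total_amount` in column 10; the names `map_line`, `reduce_pairs`, and the `map`/`reduce` command-line roles are hypothetical.

```python
import sys
from itertools import groupby


def map_line(line, datetime_col=3, amount_col=10):
    """Turn one fare CSV row into a (YYYY-MM, amount) pair, or None if unusable.

    The column positions are assumptions about the fare file layout.
    """
    fields = [f.strip() for f in line.rstrip("\n").split(",")]
    if len(fields) <= max(datetime_col, amount_col):
        return None
    try:
        amount = float(fields[amount_col])
    except ValueError:
        return None  # header row or malformed amount
    return fields[datetime_col][:7], amount


def reduce_pairs(pairs):
    """Sum amounts per key. Pairs must arrive sorted by key,
    which is the order Hadoop Streaming delivers them to a reducer."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(amount for _, amount in group)


def parse_pair(line):
    """Parse a tab-separated 'key<TAB>value' line emitted by the mapper."""
    key, value = line.rstrip("\n").split("\t", 1)
    return key, float(value)


if __name__ == "__main__":
    # Hypothetical convention: the role is passed as the first argument.
    role = sys.argv[1] if len(sys.argv) > 1 else None
    if role == "map":
        for raw in sys.stdin:
            pair = map_line(raw)
            if pair is not None:
                print("%s\t%.2f" % pair)
    elif role == "reduce":
        parsed = (parse_pair(raw) for raw in sys.stdin if raw.strip())
        for month, total in reduce_pairs(parsed):
            print("%s\t%.2f" % (month, total))
```

Saved under a hypothetical name such as `revenue_by_month.py`, the script can be checked locally before submitting it to a cluster by piping a sample through `python revenue_by_month.py map | sort | python revenue_by_month.py reduce`, which mimics the shuffle-and-sort step between the two phases.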
2013 Taxi data
Trip data: https://chriswhong.com/wp-content/uploads/2014/06/nycTaxiTripData2013.torrent
Fare data: https://chriswhong.com/wp-content/uploads/2014/06/nycTaxiFareData2013.torrent
Census data
Demographics: https://www.nyc.gov/html/dcp/html/census/demo_tables_2010.shtml
Income information: https://www.nyc.gov/html/dcp/html/census/socio_tables.shtml
Shape files for census tracts: https://www.nyc.gov/html/dcp/html/bytes/districts_download_metadata.shtml (search for "tract")
Weather data
https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets
https://www7.ncdc.noaa.gov/CDO/dataproduct (select "Surface Data, Hourly Global"; when asked to select the region, choose NY and the three main stations: Central Park, JFK, and LaGuardia)
Shaopeng Zhang
Hao Chen
Guang Xiong