The turnstiles in New York City Subway stations record how many people enter and exit the stations on different days of the week and at different times of the day. If we could count how many people enter and exit the stations, we could obtain an estimate for the ridership of different subway stations.
Naturally, the question arises why we might want to estimate the ridership of subway stations. One practical use is deciding where and when to reach the largest number of people, as the following example shows.
Let’s say there is a fictional organization called Women Tech Women Yes (WTWY), an advocate of women’s engagement in technology. They would like to reach out to people on the streets of New York to raise awareness and get people to participate in and contribute to their organization. They will also be hosting a gala in early summer. They want to collect signatures from people who may support their organization in the future and give them free tickets to the gala.
WTWY wants to place their street teams in areas of New York City with heavy foot traffic. Placing street teams in the busiest parts of the city allows them to reach more people and collect as many signatures as possible.
Understanding the subway ridership volume gives us a good sense of which NYC blocks are more populated. It may be wise to place street teams at or near the subway stations during popular times. This, however, does not necessarily mean that WTWY is going to collect as many signatures as they would like. People may not stop to fill out information, but it is an effective way to get noticed.
There are over 400 subway stations in NYC. That’s a lot of stations. Data science can help WTWY figure out a strategy for placing WTWY street teams based on the ridership of the stations.
NYC Subway Ridership
Here’s a glimpse of our investigation of New York Subway ridership. This discussion is limited to finding the stations with relatively high ridership. The actual data processing is quite involved, and we don’t provide implementation details here. Python libraries such as math, numpy, matplotlib, and especially pandas were used extensively.
New York's Metropolitan Transportation Authority (MTA) provides publicly available ridership data. We use this MTA data to account for the daily movement of people on the subway network. We downloaded a few datasets from the MTA site.
We can display a CSV file like this in a more readable tabular format as shown below. Note this is a pandas dataframe.
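As a minimal sketch of this step (not the code used for the analysis), the snippet below reads a few rows of CSV text into a pandas dataframe. The column names C/A, SCP, DATE, TIME, ENTRIES and EXITS are the ones referred to in this post; the STATION column, the station name and every value are made up for illustration.

```python
import io

import pandas as pd

# Made-up sample rows. Only the column names C/A, SCP, DATE, TIME,
# ENTRIES and EXITS come from the post; STATION and all values are assumed.
sample_csv = """C/A,SCP,STATION,DATE,TIME,ENTRIES,EXITS
A001,00-00-00,STATION A,05/07/2016,00:00:00,1000,2000
A001,00-00-00,STATION A,05/07/2016,04:00:00,1040,2010
A001,00-00-01,STATION A,05/07/2016,00:00:00,500,800
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(sample_csv))

# DATE and TIME are still plain strings at this point.
print(df.head())
```

With a downloaded file, the `io.StringIO(...)` argument would simply be replaced by the path to that file.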
MTA data are often messy and difficult to work with (this is true of a great many real-world datasets). A necessary preliminary step, therefore, is data wrangling: the process of converting or mapping data from a “raw-ish” form to a “clean” form that allows us to perform analytical or mathematical operations on the data conveniently.
A couple of examples of wrangling the MTA data for convenience:
Note the DATE and TIME columns in the data table above. The entries of these columns are in text (string) format. We convert them to a datetime object using Python’s datetime library. This helps us extract date/time-specific information from the data efficiently. For instance, we may want to order the data by date and time to follow the flow of riders on a day or over a given period.
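One way this conversion might look is sketched below. It uses pandas' `to_datetime` rather than the datetime module directly, and it assumes the DATE strings are in MM/DD/YYYY form and the TIME strings in HH:MM:SS form; the rows themselves are made up.

```python
import pandas as pd

# Made-up rows, deliberately out of chronological order.
df = pd.DataFrame({
    "DATE": ["05/07/2016", "05/07/2016", "05/06/2016"],
    "TIME": ["04:00:00", "00:00:00", "20:00:00"],
})

# Join the two text columns and parse them in one pass.
# The format string encodes the assumed MM/DD/YYYY HH:MM:SS layout.
df["DATETIME"] = pd.to_datetime(
    df["DATE"] + " " + df["TIME"], format="%m/%d/%Y %H:%M:%S"
)

# Once parsed, the rows can be ordered chronologically.
df = df.sort_values("DATETIME").reset_index(drop=True)
print(df)
```

A parsed column also exposes helpers such as `df["DATETIME"].dt.hour` and `df["DATETIME"].dt.dayofweek`, which are convenient for splitting the data by time of day or day of week.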
Each block of turnstiles is called a “Control Area” or “C/A” in the downloaded data. Each turnstile is designated with an “SCP” number. Since “C/A” and “SCP” together define a turnstile uniquely, it may be helpful to combine them in a single column (that is, add a new column to the dataframe called Turnstile).
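A small sketch of building the Turnstile column follows. The identifier values and the underscore separator are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up identifiers: two turnstiles in one control area, one in another.
df = pd.DataFrame({
    "C/A": ["A001", "A001", "B002"],
    "SCP": ["00-00-00", "00-00-01", "00-00-00"],
})

# Concatenate the two string columns; the "_" separator is an arbitrary choice.
df["Turnstile"] = df["C/A"] + "_" + df["SCP"]
print(df)
```

Note that the SCP value alone repeats across control areas in this sample, which is why both parts are needed to tell the turnstiles apart.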
In addition, the data contain duplicate entries and spurious information (e.g., negative entry or exit counts), which we need to remove or interpolate.
Basically, we want to aggregate the data so that we have the total number of entries and exits (ridership) for each day, each week, or any period of several days or weeks.
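The sketch below shows one possible way to handle these two problems by simple removal; interpolation is not shown. It assumes a Turnstile column and a parsed DATETIME column already exist, and all rows are made up.

```python
import pandas as pd

# Made-up readings for one turnstile: the 04:00 row is duplicated and
# the 08:00 row carries a negative (spurious) entry count.
df = pd.DataFrame({
    "Turnstile": ["A001_00-00-00"] * 4,
    "DATETIME": pd.to_datetime([
        "2016-05-07 00:00", "2016-05-07 04:00",
        "2016-05-07 04:00", "2016-05-07 08:00",
    ]),
    "ENTRIES": [100, 140, 140, -5],
    "EXITS": [50, 60, 60, 75],
})

# Keep a single reading per turnstile and timestamp.
clean = df.drop_duplicates(subset=["Turnstile", "DATETIME"])

# Drop rows whose counters are negative.
clean = clean[(clean["ENTRIES"] >= 0) & (clean["EXITS"] >= 0)]
clean = clean.reset_index(drop=True)
print(clean)
```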
A major complication of the MTA data is that it seems to include data from each turnstile in each of the 700+ blocks of turnstiles in the NYC Subway system, but not in any particular order. Moreover, each turnstile is equipped with a counter that does not reset at the start of each day. A close look at the ENTRIES and EXITS columns reveals that the entry/exit counters continually increase over time.
To compute the total ridership at a turnstile, we first need to order that turnstile's readings by date and time. Because the counters are cumulative, the number of entries or exits in an interval is the difference between two consecutive readings. We can then add these differences up over all turnstiles in a station to get a ridership estimate for the station.
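The procedure can be sketched as follows, under the assumption that the dataframe already has STATION, Turnstile and DATETIME columns. The counter values are made up, and the ENTRY_DIFF, EXIT_DIFF and RIDERSHIP column names are hypothetical.

```python
import pandas as pd

# Made-up cumulative counter readings for two turnstiles in one station,
# listed out of chronological order on purpose.
df = pd.DataFrame({
    "STATION": ["STATION A"] * 6,
    "Turnstile": ["A001_00-00-00"] * 3 + ["A001_00-00-01"] * 3,
    "DATETIME": pd.to_datetime(
        ["2016-05-07 08:00", "2016-05-07 00:00", "2016-05-07 04:00"] * 2
    ),
    "ENTRIES": [1070, 1000, 1040, 530, 500, 510],
    "EXITS": [2035, 2000, 2010, 820, 800, 805],
})

# Order each turnstile's readings in time.
df = df.sort_values(["Turnstile", "DATETIME"])

# Difference between consecutive readings of the same turnstile.
# The first reading of each turnstile has no predecessor and becomes NaN.
df["ENTRY_DIFF"] = df.groupby("Turnstile")["ENTRIES"].diff()
df["EXIT_DIFF"] = df.groupby("Turnstile")["EXITS"].diff()
df["RIDERSHIP"] = df["ENTRY_DIFF"] + df["EXIT_DIFF"]

# Add up all turnstiles of a station; sum() skips the NaN rows.
station_totals = df.groupby("STATION")["RIDERSHIP"].sum()
print(station_totals)
```

Grouping by turnstile before calling `diff` matters: without it, the first reading of one turnstile would be subtracted from the last reading of another.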
We can tailor our analysis based on what we would like to infer from the data. If we are interested in the ridership from a particular period, we can do this by separating the data by dates and times.
We can further investigate what fractions of the total ridership come from weekdays and weekends. We can also break the ridership down by time of day (in 4-hour intervals).
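Both breakdowns might be computed along these lines. The sketch assumes per-interval ridership has already been derived and stored in a RIDERSHIP column; the values and the DAY_TYPE and HOUR_BLOCK column names are made up for illustration.

```python
import pandas as pd

# Made-up per-interval ridership values.
df = pd.DataFrame({
    "DATETIME": pd.to_datetime([
        "2016-05-06 08:00",  # Friday
        "2016-05-06 20:00",  # Friday
        "2016-05-07 12:00",  # Saturday
        "2016-05-08 12:00",  # Sunday
    ]),
    "RIDERSHIP": [300, 500, 120, 80],
})

# dayofweek runs from 0 (Monday) to 6 (Sunday), so 5 and 6 are the weekend.
df["DAY_TYPE"] = df["DATETIME"].dt.dayofweek.map(
    lambda d: "weekend" if d >= 5 else "weekday"
)
share = df.groupby("DAY_TYPE")["RIDERSHIP"].sum() / df["RIDERSHIP"].sum()

# Label each reading with the start hour of its 4-hour block (0, 4, ..., 20).
df["HOUR_BLOCK"] = (df["DATETIME"].dt.hour // 4) * 4
by_block = df.groupby("HOUR_BLOCK")["RIDERSHIP"].sum()

print(share)
print(by_block)
```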
To find out which stations have higher average ridership during a certain period, we have to aggregate the data to the station level. Once we have the total ridership for each station, we can rank the stations by ridership volume.
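A ranking of this kind can be sketched as below, assuming daily ridership totals per station are already available. The station names and numbers are made up.

```python
import pandas as pd

# Made-up daily ridership totals for three stations over two days.
daily = pd.DataFrame({
    "STATION": ["STATION A", "STATION A", "STATION B",
                "STATION B", "STATION C", "STATION C"],
    "DATE": pd.to_datetime(["2016-05-06", "2016-05-07"] * 3),
    "RIDERSHIP": [900, 500, 1500, 1100, 300, 200],
})

# Average daily ridership per station, highest first.
avg = (
    daily.groupby("STATION")["RIDERSHIP"]
    .mean()
    .sort_values(ascending=False)
)

# Keep the top N stations (N = 2 here; the post reports a top 10).
top = avg.head(2)
print(top)
```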
Here are a few results that we’d like to share.
- The following figure shows the average number of riders at different stations over a certain period of time. These are the top 10 stations by ridership in New York, based on the dataset we used.
- The next figure shows the ridership profile of a particular station over a week. The green shaded area represents the weekend.
We found that, overall, stations in NYC have higher ridership on weekdays than on weekends. The busiest stations (not surprisingly) are located in New York’s central district in Manhattan. The top 30 stations account for almost 50% of the total subway ridership. The top stations show similar ridership volumes. Although the ranking of these stations (their ordering by ridership count) relative to each other shifts over different days or hours, they are quite consistently the top ridership stations in the NYC Subway network.