-
Process Post: MetroCard Swipes Project
December 9, 2011
One of my last projects for the WSJ was a story and interactive map of New York City showing the usage of different types of MetroCards at different subway stations.
I always mean to write process posts, describing how I did things. I think "show your work" is a great idea and have really enjoyed reading other people's posts showing their work.
I never got to write one about my foursquare check-ins project and by now I've probably forgotten too many of the details of the process to do a proper write up. Not this time. This'll be kind of a mind dump though.
False start with turnstile data
The idea for this project came from finding the dataset available on the MTA's developer website. In addition to the fare type data we ultimately ended up using, they also make available raw turnstile swipes and that was the one I first looked at. This was last July.
The turnstile data contains cumulative counts of entrances and exits, plus status report codes, for every turnstile in the system every four hours. Then there's another file matching those codes up to station names and subway lines. In a first pass at this data, I made simple line charts of each turnstile's hourly entrances and exits. They were pretty messy and only mildly interesting. I can't find the charts anymore or I'd show them here. Then the project lay dormant for about a year as other news intervened.
Real start, cleaning data
Eventually, the Greater New York section came upon the fare type data posted by the MTA and was interested in running a story based upon it. The fare type data set is a single file for each week that records, station by station, how many times each type of MetroCard was swiped.
Starting out, we didn't know what the data would show and we didn't have an entirely clear idea of what we were looking for so I decided to just clean it up and play around with the data for a bit to see if any interesting trends popped out. This is not necessarily how I would approach a data project in the future. I think we might've ended up with something even more interesting and more pointed if we had had some questions we wanted to answer with the data to start with.
The first step was to import all the data to one place instead of one separate file per week. A quick Python script to put everything into MySQL, then an export to Google Refine to fix inconsistent spellings of station names and then exporting to CSV.
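The import step can be sketched roughly as follows. This is a minimal sketch, not the original script: it assumes each weekly file is a CSV with a header row, and it uses Python's built-in sqlite3 in place of MySQL so it runs on its own. The table name, the column names and the sample rows are all made up for illustration.

```python
import csv
import io
import sqlite3


def load_weekly_files(conn, weekly_files):
    """Load several weekly fare files into one table, tagging rows with their week.

    weekly_files maps a week label to a file-like object holding CSV text
    with (hypothetical) columns STATION, FF and 30-D UNL.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fares "
        "(week TEXT, station TEXT, full_fare INTEGER, unl_30d INTEGER)"
    )
    for week, handle in weekly_files.items():
        for row in csv.DictReader(handle):
            conn.execute(
                "INSERT INTO fares VALUES (?, ?, ?, ?)",
                (week, row["STATION"].strip(), int(row["FF"]), int(row["30-D UNL"])),
            )
    conn.commit()


# Two tiny made-up weekly files standing in for the real downloads.
week_a = io.StringIO("STATION,FF,30-D UNL\nEXAMPLE ST,100,50\nSAMPLE AV,20,5\n")
week_b = io.StringIO("STATION,FF,30-D UNL\nEXAMPLE ST,110,55\n")

conn = sqlite3.connect(":memory:")
load_weekly_files(conn, {"2010-08-21": week_a, "2010-11-06": week_b})
row_count = conn.execute("SELECT COUNT(*) FROM fares").fetchone()[0]
totals = dict(conn.execute("SELECT station, SUM(full_fare) FROM fares GROUP BY station"))
```

Once everything sits in one table, spelling fixes for station names only have to be made in one place before exporting.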
Then I thought I'd try a non-geographic visualization. I'd make a grid with the weeks on one axis and the subway stations along another ordered from high traffic to low traffic. At each point in the grid there'd be a pie chart or a stacked bar showing the proportion of each type of swipe at that station in that week.
Using Processing
I decided to first try and do it in Processing as a way to learn more about it and 3D graphics.
It came out looking kind of like this.

A sea of little cylinders
So yeah. I didn't quite have the scale of the data right. 460 stations by 60ish weeks of data each? Oh that's almost 28,000 datapoints. It was not particularly comprehensible.
Maybe it'd be better to try and place stations according to their geographic location instead of in a grid, and then animate over time.
Unfortunately, the station names in the fare type files didn't match the station names anywhere else. They were a combination of the names shown on the official subway map, and when those conflicted, an added cross street.
The official file of station locations gave station locations by station name and line along with an exact lat lng for each entrance or exit.
I matched these up by hand, picking one entrance for each station. The resulting file is here.
After a few iterations, I ended up with something looking like this (using the plate carrée projection, a.k.a. x=lng, y=lat).
Each station is a stack of cylinders, largest on the bottom, with volume proportional to the number of swipes
Kind of cool. Still not exactly easy to make sense of, even though in Processing I can adjust the camera and fly around it.
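For anyone curious about the geometry, here is a rough Python sketch (not the Processing code itself) of the two ideas in play: a plate carrée mapping from lng/lat to screen coordinates, and sizing a cylinder so its volume is proportional to swipes. The bounding box, the canvas size and the fixed cylinder height are assumptions made for illustration.

```python
import math


def plate_carree(lng, lat, bounds, width, height):
    """Map lng/lat linearly to pixel x/y inside a bounding box.

    bounds is (min_lng, min_lat, max_lng, max_lat). Screen y grows downward,
    so latitude is flipped.
    """
    min_lng, min_lat, max_lng, max_lat = bounds
    x = (lng - min_lng) / (max_lng - min_lng) * width
    y = (max_lat - lat) / (max_lat - min_lat) * height
    return x, y


def cylinder_radius(swipes, height, volume_per_swipe=1.0):
    """Radius of a cylinder of the given height whose volume is proportional to swipes."""
    volume = swipes * volume_per_swipe
    return math.sqrt(volume / (math.pi * height))


# Rough, made-up bounding box around New York City, for illustration only.
nyc_bounds = (-74.3, 40.5, -73.7, 40.9)
center = plate_carree(-74.0, 40.7, nyc_bounds, width=600, height=400)
```

With a fixed height, the radius grows with the square root of the swipe count, so a station with four times the swipes gets a cylinder only twice as wide.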
Time to try another tack.
Using R
Since trying to visualize the data straight away wasn't working so well, I decided to try and analyze the data in R and find some basic summary statistics.
I use R in RStudio rstudio.org which is a really nice IDE for R. I'm almost a complete beginner at R, and it's been really helpful.
There's this really cool function, summary(dataframe), that takes some data and prints out a whole bunch of summary statistics of it. So I did:

    MTAFARES1108 <- read.csv("~/MTAFARES1108/data/MTAFARES1108_cleaned.csv")
    summary(MTAFARES1108)

and got out

    start_date        end_date          REMOTE         STATION
    2010-08-21:  466  2010-08-27:  466  R001   :   61  42ND STREET & GRAND CENTRAL:  183
    2010-11-06:  466  2010-11-12:  466  R002   :   61  23RD STREET-6TH AVENUE     :  122
    2010-11-20:  466  2010-11-26:  466  R003   :   61  25TH STREET-4TH AVENUE     :  122
    2010-11-27:  466  2010-12-03:  466  R004   :   61  34TH STREET & 6TH AVENUE   :  122
    2010-12-04:  466  2010-12-10:  466  R005   :   61  34TH STREET & 8TH AVENUE   :  122
    2010-12-11:  466  2010-12-17:  466  R006   :   61  42ND STREET & 8TH AVENUE   :  122
    (Other)   :25535  (Other)   :25535  (Other):27965  (Other)                    :27538

    FF              SEN.DIS        X7.D.AFAS.UNL  X30.D.AFAS.RMF.UNL  JOINT.RR.TKT    X7.D.UNL
    Min.   :     0  Min.   :    0  Min.   :   0   Min.   :   0.0      Min.   :   0.0  Min.   :    0
    1st Qu.:  9517  1st Qu.:  363  1st Qu.:  37   1st Qu.: 128.0      1st Qu.:   1.0  1st Qu.: 3080
    Median : 16029  Median :  664  Median :  78   Median : 262.0      Median :   5.0  Median : 6226
    Mean   : 27757  Mean   : 1217  Mean   : 113   Mean   : 412.6      Mean   : 104.4  Mean   : 8999
    3rd Qu.: 32164  3rd Qu.: 1427  3rd Qu.: 147   3rd Qu.: 490.0      3rd Qu.:  26.0  3rd Qu.:11638
    Max.   :291172  Max.   :13083  Max.   :1082   Max.   :5062.0      Max.   :7951.0  Max.   :97486

    X30.D.UNL       X14.D.RFM.UNL   X1.D.UNL         X14.D.UNL        X7D.XBUS.PASS
    Min.   :     0  Min.   :  0.00  Min.   :    0.0  Min.   :    0.0  Min.   :   0.00
    1st Qu.:  5134  1st Qu.:  0.00  1st Qu.:    0.0  1st Qu.:    0.0  1st Qu.:   7.00
    Median : 11295  Median :  2.00  Median :    8.0  Median :  108.0  Median :  20.00
    Mean   : 19833  Mean   : 12.65  Mean   :  325.9  Mean   :  665.8  Mean   :  86.85
    3rd Qu.: 26114  3rd Qu.: 16.00  3rd Qu.:  226.0  3rd Qu.:  915.5  3rd Qu.:  73.00
    Max.   :276941  Max.   :251.00  Max.   :18867.0  Max.   :21757.0  Max.   :2371.00

    TCMC            LIB.SPEC.SEN      RR.UNL.NO.TRADE  TCMC.ANNUAL.MC  MR.EZPAY.EXP
    Min.   :   0.0  Min.   :0.000000  Min.   :    0.0  Min.   :    0   Min.   :   0.0
    1st Qu.:  44.0  1st Qu.:0.000000  1st Qu.:    3.0  1st Qu.:  386   1st Qu.:  17.0
    Median : 104.0  Median :0.000000  Median :   12.0  Median :  854   Median :  49.0
    Mean   : 271.2  Mean   :0.006424  Mean   :  280.7  Mean   : 1416   Mean   : 175.3
    3rd Qu.: 283.0  3rd Qu.:0.000000  3rd Qu.:   69.0  3rd Qu.: 1701   3rd Qu.: 168.0
    Max.   :3600.0  Max.   :3.000000  Max.   :16197.0  Max.   :21629   Max.   :2890.0

    MR.EZPAY.UNL     PATH.2.T          AIRTRAIN.FF      AIRTRAIN.30.D     AIRTRAIN.10.T
    Min.   :   0.00  Min.   :    0.00  Min.   :    0.0  Min.   :    0.00  Min.   :   0.00
    1st Qu.:  12.00  1st Qu.:    0.00  1st Qu.:   22.0  1st Qu.:    0.00  1st Qu.:   0.00
    Median :  35.00  Median :    0.00  Median :   56.0  Median :    0.00  Median :   0.00
    Mean   :  92.81  Mean   :   34.48  Mean   :  274.9  Mean   :   46.38  Mean   :  13.59
    3rd Qu.: 113.00  3rd Qu.:    0.00  3rd Qu.:  161.0  3rd Qu.:    0.00  3rd Qu.:   0.00
    Max.   :1707.00  Max.   :10265.00  Max.   :47909.0  Max.   :17933.00  Max.   :6150.00

    AIRTRAIN.MTHLY   total
    Min.   :  0.000  Min.   :     1
    1st Qu.:  0.000  1st Qu.: 21130
    Median :  0.000  Median : 37422
    Mean   :  1.111  Mean   : 62134
    3rd Qu.:  0.000  3rd Qu.: 76816
    Max.   :687.000  Max.   :697709

Similarly, the plot function has a cool default when called on a dataframe that prints a whole bunch of summary plots.
    BYDATE <- aggregate(MTAFARES1108[,c(5,6,10,11,13,14,27)], list(start_date=MTAFARES1108$start_date), sum)
    BYDATE$subtotal <- rowSums(BYDATE[,c(2:7)])
    plot(BYDATE)

    BYSTATION <- aggregate(MTAFARES1108[,c(5,6,10,11,27)], list(STATION=MTAFARES1108$STATION), sum)
    BYSTATION$subtotal <- rowSums(BYSTATION[,c(2:5)])
    plot(BYSTATION)

Printed out big, these are kind of fun to look at. Each variable in the data is in a scatter plot with each other variable.
You can see some trends in these plots. The usage of full fare and seven-day unlimited cards trends up when the one- and 14-day unlimited cards are discontinued. The usage of each type of card is generally pretty well correlated with the others. At PATH stations, only full fare cards are used, so there's a set of stations without unlimited card swipes.
More data, mooore data

Select blocks that intersect a 1km radius circle.
Now I already had a database full of census data on population from making Census Map Maker so I decided to bring that in too. I wrote a Python script (GeoDjango script to be exact) to loop through all the 2010 census blocks for Manhattan, Brooklyn, Queens and the Bronx and assign each block to the closest subway station and then calculate the union polygon of the set of blocks for each subway station. Then each shape was assigned the data for that subway.

Later I limited each area to also be within 1000 meters of a subway stop to get the final shapes.
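The assignment step can be sketched in plain Python like this. It is a simplification, not the GeoDjango script: it treats each census block as a single point (say, its centroid) instead of a polygon, uses haversine distance, and skips the polygon union entirely. All ids and coordinates below are made up.

```python
import math

EARTH_RADIUS_M = 6371000.0


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two lat/lng points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def assign_blocks(blocks, stations, max_distance_m=1000.0):
    """Group block ids under their nearest station, dropping blocks beyond the cutoff.

    blocks and stations map an id to a (lat, lng) point; for blocks this
    point stands in for the block polygon.
    """
    areas = {station_id: [] for station_id in stations}
    for block_id, (blat, blng) in blocks.items():
        nearest_id, nearest_dist = None, float("inf")
        for station_id, (slat, slng) in stations.items():
            dist = haversine_m(blat, blng, slat, slng)
            if dist < nearest_dist:
                nearest_id, nearest_dist = station_id, dist
        if nearest_dist <= max_distance_m:
            areas[nearest_id].append(block_id)
    return areas


# Made-up coordinates for illustration.
stations = {"A": (40.7500, -73.9900), "B": (40.7600, -73.9800)}
blocks = {
    "block-1": (40.7505, -73.9905),  # close to A
    "block-2": (40.7595, -73.9805),  # close to B
    "block-3": (40.8500, -73.9000),  # far from both, dropped
}
areas = assign_blocks(blocks, stations)
```

The brute-force loop is fine at toy scale; with every census block in four boroughs, a spatial index in the database does the same job much faster.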
By doing this, we get the race and income data for each area around a subway. That could be interesting to look at. Exporting back into a CSV file with one row per station and then using R again gives us a couple of charts like these below. Each point represents one subway station area. The y-scale, commuters_percent is people using 30 day unlimited or TransitChek unlimited metrocards.

    Call:
    lm(formula = commuters_percent ~ whites_percent)

    Residuals:
          Min        1Q    Median        3Q       Max
    -0.236574 -0.048477 -0.004978  0.048037  0.198023

    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
    (Intercept)     0.28391    0.00543   52.28   <2e-16 ***
    whites_percent  0.15259    0.01207   12.64   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.0689 on 390 degrees of freedom
    Multiple R-squared: 0.2905, Adjusted R-squared: 0.2887
    F-statistic: 159.7 on 1 and 390 DF, p-value: < 2.2e-16

    Call:
    lm(formula = commuters_percent ~ blacks_percent)

    Residuals:
          Min        1Q    Median        3Q       Max
    -0.191503 -0.052115 -0.001902  0.054073  0.155704

    Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
    (Intercept)     0.377349   0.004894   77.10   <2e-16 ***
    blacks_percent -0.162457   0.013517  -12.02   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.06987 on 390 degrees of freedom
    Multiple R-squared: 0.2703, Adjusted R-squared: 0.2684
    F-statistic: 144.4 on 1 and 390 DF, p-value: < 2.2e-16

    Call:
    lm(formula = commuters_percent ~ median_income)

    Residuals:
          Min        1Q    Median        3Q       Max
    -0.179520 -0.058256 -0.002264  0.053546  0.206933

    Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
    (Intercept)   2.823e-01  8.225e-03  34.326  < 2e-16 ***
    median_income 1.055e-06  1.411e-07   7.476 5.11e-13 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.0765 on 390 degrees of freedom
    Multiple R-squared: 0.1253, Adjusted R-squared: 0.1231
    F-statistic: 55.89 on 1 and 390 DF, p-value: 5.109e-13

So these are pretty noisy and I didn't get all that far with an analysis. We ended up not doing anything with these regressions. I'm posting the datafile here in case anyone else wants to play around with it further.
Fare price increase
Next was to try and see if the fare increase made a difference in people's metrocard usage habits. Looking at systemwide usage, there were big dips in the number of swipes in weeks without five working days as so much usage is by people commuting. To smooth things out for comparison, I cut out holiday weeks and then picked as many from before and after as there was still data for. This turned out to be 27 weeks each.
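Picking the comparison weeks can be sketched like this in Python. It is only a sketch: the rule of trimming the longer side to the weeks nearest the change is an assumption, and the dates, the holiday weeks and the cutoff are made up for illustration.

```python
from datetime import date


def balanced_windows(weeks, holiday_weeks, change_date):
    """Return equal-length lists of non-holiday weeks before and after change_date.

    weeks is a list of week start dates; holiday_weeks is the set to exclude.
    The longer side is trimmed to the weeks closest to the change.
    """
    usable = sorted(w for w in weeks if w not in holiday_weeks)
    before = [w for w in usable if w < change_date]
    after = [w for w in usable if w >= change_date]
    n = min(len(before), len(after))
    return before[len(before) - n:], after[:n]


# Made-up week start dates, holiday weeks and cutoff.
weeks = [
    date(2010, 12, 4), date(2010, 12, 11), date(2010, 12, 18), date(2010, 12, 25),
    date(2011, 1, 1), date(2011, 1, 8), date(2011, 1, 15),
]
holiday_weeks = {date(2010, 12, 25), date(2011, 1, 1)}
before, after = balanced_windows(weeks, holiday_weeks, change_date=date(2010, 12, 30))
```

Keeping the two windows the same length means neither side of the comparison gets more weight just because more weeks of data happen to exist for it.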

With all weeks included it's really spiky. 
With holiday weeks removed, the data looks a lot smoother.
Testing to see if there was a significant change was done like this in R, one t-test at the 99% confidence level for each station and type of swipe.
    t.test(after$swipes,before$swipes,conf.level=0.99,var.equal=TRUE)
Retrospective
Ultimately, once all the data work was complete, building the map interactive itself was fairly straightforward. The part I think worked best was the annotations with descriptions of interesting points to look at. Despite having that, though, I still think this project ended up kind of falling victim to the "throwing data at readers" problem. It could've benefited from an even stronger narrative strand to tell a specific story.
As noted at the beginning, we didn't have a good sense of what story we wanted to tell or what questions we wanted to answer, so in the data analysis I kind of struggled to figure out what was interesting and what wasn't.
Someone suggested on Twitter having a feature where users could add annotations or questions to each subway stop. In retrospect, I think this would've been really useful and I wish I had built that in. So many of the stories are probably local and specific to a subway station. And there are probably interesting data anomalies at that level that people could ask about and then we could put in the reporting effort to find answers.
-
Joining the New York Times
December 4, 2011
There's some news I'm very excited to announce — I'll soon be starting a new job as part of the Interactive News Team at the New York Times.
For quite a while, the NYT has been setting the standard for the kind of work I want to do. The 2009 New York Magazine profile of the team was in large part what convinced me that a career in journalism might actually be viable for me. So many members of the team I'm joining are people whose work I greatly admire. I've already learned a ton from reading their blog posts and open source code, listening to their conference and meetup presentations, and trying to reverse engineer their work. So needless to say, I'm really excited to work with and learn even more from them.
In my time at the Wall Street Journal, I've also had the pleasure of working with many talented people. I'm especially grateful for the freedom I had to experiment and try new things.
Still, I'm sure this is the right move for me. It's clear that at the highest levels of the organization the New York Times has made the web a priority in a way that the Journal has not.
I'm extremely grateful to Aron Pilhofer and everyone else at the Times who I've talked to before now for giving me this opportunity. I can't wait to get started. My first day is December 12th.
-
Running on Django
November 27, 2011
This site is now running on Django with a pretty stripped down amount of code. In many ways, it's going backwards from WordPress, but I see this as a way to keep my Django skills sharp and to try out a few ideas of mine on a site I'm fully invested in.
-
Lessons from teaching programming
July 29, 2011
On Wednesday night I taught a workshop at CUNY Graduate School of Journalism called Intro to Data Journalism with Python. In this class, I tried to teach enough programming to analyze a university registrar's website and find the most popular time slots for classes. The course outline is here on GitHub.
I think the class went pretty well, though some people looked bored at the beginning and some left before the end. I think one issue was that when teaching a one time workshop people come in with a range of different levels of experience from beforehand. I haven't done much teaching before and so I learned some things about how I could improve it for next time.
- Better communicate the expected knowledge level coming in. The class description should have more clearly stressed that the class was for complete beginners and described what the starting material would be in more depth.
- Having an assistant (or a smaller class) would have helped to get people set up and using their computers. At the beginning of the class people needed to open up the command line and navigate to the proper directory. Having another instructor would have made this quicker. And there would be someone to float around the room during instruction to help anyone who got lost get back on track.
- In the class, I jumped into the coding part too quickly. In retrospect, people would've learned more and been more engaged if I had gone over more examples of why what I was about to teach is useful. Having specific use cases in mind would have made it easier to understand the coding part.
- I should have been more insistent on getting feedback from people about the pace of the class to make sure that people weren't falling behind.
- Finally, I should have made people type more and listen less. If I had split up the me talking part with some simple exercises or incomplete programs and asked people to finish them I think people would've been more engaged and learned more.
Anyone else have other tips for teaching programming?
-
Social media or user engagement?
July 14, 2011
Social media has quickly become a major source of traffic for news sites. (See the Pew Research study Navigating News Online published May 2011) People spend a lot of time on social sites and find a lot of relevant news through them. It seems imperative for news websites to "go where the readers are" and engage with them through social media. All the major social networking players even have special media relations teams to help news brands use their networks to the fullest.
Initial steps into social media usually seem to pay off with a big increase in site traffic. And newsrooms have been faulted before for failing to innovate enough and embrace the web. So it seems they should jump into Twitter and Facebook with both feet and join the modern web before it's too late.
Not such a good idea?
By investing effort into Facebook and Twitter, news sites give the social networks more mainstream legitimacy and consequently more new users. And by making news easily available on social networks, they help lock users into those platforms. By all means, reporters and editors should be using social networks to find sources and do the business of reporting and spreading a story. But letting any specific social networking site become a major part of a news website's strategy gives up too much control of the reader relationship and could be a dangerous mistake.
If users always interact with a news site through Facebook or Twitter, then that news site is at the mercy of the platform and a small algorithm tweak could easily send all that traffic to a competitor.
The interests of profit-seeking tech companies are at best orthogonal to those of any media company. Depending on their platforms to engage with readers would turn a news organization into a sharecropper, putting in journalistic effort but letting others reap the majority of the rewards in exchange for a pittance of pageviews.
To thrive, news sites need to own their reader relationships with social networking sites playing a secondary role. The user experience should be such that they are not substantively harmed if the social networks were to disappear (or change the rules) the next day.
The key concept is lock-in. Is the news site building user engagement in a way that increases a user's lock-in to the news site more than their lock-in to Facebook and Twitter? If not, then it's probably a mistake.
For an elections news app, it may be smart to use Facebook to provide recommendations to a user based on their friends. But the app's core manner of engaging with the user should be something independent, like the ability to pick candidates or races of interest to follow.
Print publications have long known the value of a loyal, locked-in audience of subscribers. A successful online strategy will be one that focuses on user engagement and making the news site irreplaceable for users. Social media is then just another customer acquisition channel to bring new readers in.
This is a topic I've spent a lot of time thinking about and talking with people about (including one long discussion on a rainy hike in the south Jersey Pine Barrens) but this is the first time I've tried to set the ideas down in a fixed form. I did write about social engagement for news sites from a paid content perspective a year and a half ago.
-
Teaching a class
July 11, 2011
So I'm teaching a class on "data journalism" and Python at CUNY Journalism School. This will be the third time I've done this particular session: first at the CMA College Media Convention in March and then at BCNI Philly in April. The session at CMA went pretty well, but it was much better at BCNI because all the attendees had computers and could follow along, so that's the way I'll be doing it this time.
The code I taught with before is here, if you're curious.
-
Measuring "casual" website visitors
June 15, 2011
One widely adopted kernel of wisdom about news online has become that the vast majority of traffic to a news site is made up of "casual" visitors or "fly-bys" that visit just once or twice a month. I think measurement error might be driving this statistic far higher than reality. I'm reading Matthew Hindman's report for the FCC on local news consumption (summarized and linked to from here) and it again repeats this observation.
My roommate has a habit of clearing his browser's cookies and all private data every time he closes it. Yet, he basically visits the same set of news sites every single day. If these sites are using cookies to track his visits, as is the standard way, they are overcounting their visitor numbers for him by 30 times. Let's do some rough math to observe how much impact this could have on the results of a study of that data.
Let's assume we have a site that has measured 130 unique visitors at an average of 10 pageviews per visitor for the month. In total they've got 1,300 pageviews. If 1% of their visitors browsed like my roommate does, each counted 30 times, they would actually have only about 100 unique visitors, and each person would average 13 pageviews for the month. What if 2% of people did it? Then the site really has only about 82 visitors, and the average rises to roughly 16 pageviews per person.
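Here is the rough math as a small Python sketch. The model is an assumption made for illustration: a given share of real visitors clears cookies and each of them is counted as 30 separate "uniques" in a month, while everyone else is counted once.

```python
def true_visitors(measured_uniques, clearing_share, counts_per_clearer=30):
    """Estimate real visitors when a share of them is counted many times over.

    Each cookie-clearing visitor shows up as counts_per_clearer "uniques",
    so measured = real * (1 - share) + real * share * counts_per_clearer.
    """
    inflation = (1 - clearing_share) + clearing_share * counts_per_clearer
    return measured_uniques / inflation


def pageviews_per_visitor(total_pageviews, measured_uniques, clearing_share):
    """Average pageviews per real visitor under the same assumption."""
    return total_pageviews / true_visitors(measured_uniques, clearing_share)


for share in (0.0, 0.01, 0.02):
    visitors = true_visitors(130, share)
    average = pageviews_per_visitor(1300, 130, share)
    print(f"{share:.0%} clearing cookies: {visitors:.0f} visitors, {average:.1f} pageviews each")
```

Changing counts_per_clearer shows how sensitive the result is to how often people actually wipe their cookies.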
Maybe news visitors aren't so disengaged after all.
-
The most interesting parts of "War at the Wall Street Journal" (to me)
February 21, 2011
Beyond the story lines about Murdoch and the Bancroft family, and Marcus Brauchli and Robert Thomson, Sarah Ellison's "War at the Wall Street Journal" has an interesting story line about what had made the Journal unique before the takeover and about a newspaper trying to adapt to the Internet.
About being a "second read" paper
The notion that the Journal could be a second read, famously espoused by the legendary midcentury Journal editor Barney Kilgore, was no more. No one had time to read two publications. And anyway, Murdoch didn't want to be second at anything. As smaller papers around the country faltered, Murdoch wanted to pick off their readers.
-- War at the Wall Street Journal, by Sarah Ellison. page 199
About "Journal 3.0"
[Publisher Gordon] Crovitz decided he would call the new iteration of the newspaper "Journal 3.0." He arrived at the name -- never popular in the Journal's newsroom or executive floor -- by taking particular note of the Journal's lead front-page story the day after Japan attacked Pearl Harbor: "War with Japan Means Industrial Revolution in the United States" read the headline. The story outlined the implications of the attack on the country's economy, industry, and financial markets. For Crovitz, it also marked the end of the first phase of the Journal -- "Journal 1.0," the time between the paper's founding in 1889 and December 5, 1941. During that period, the Journal reported the news like any other outlet. After that headline and under Bernard Kilgore, who became the paper's managing editor the year of the Pearl Harbor attack, the Journal started adding more analysis to its stories and expanded its coverage beyond business and finance. Crovitz defined "Journal 2.0" as starting on December 8, 1941. He planned for it to end on December 31, 2006, when he would usher in the paper's third phase.
To compete against the immediacy of the Web, Crovitz wanted the paper, instead of running stories that rehashed what people had learned the day before on their BlackBerrys, to become more analytical. Journal reporters would break news on the Web site and then examine it in the next day's paper.
-- War at the Wall Street Journal, by Sarah Ellison. page 51
About the morning news meeting
Following the Journal's tradition, the editors wouldn't talk about the biggest news of the day. Unlike every other newspaper in every jurisdiction of every country in the world, the Wall Street Journal didn't put news on its front page. The paper relegated the biggest news stories to the inside of the paper, on page A3. Epic features and investigations for Page One were mapped out weeks if not months in advance. Because of this Journal peculiarity, the morning news meeting was not a frenetic debate about the most disastrous or dramatic news events, but rather a mannered recitation of the day's "sked" of stories. In a business of attention-grabbing headlines and color photos, the paper treated its front page like a quiet haven for reflective storytelling. Breaking news was important, and the paper did plenty of it, but the craft of feature writing was the center of the paper's identity.
-- War at the Wall Street Journal, by Sarah Ellison. page 48
About "the pack"
[Murdoch] wanted the Journal to lead the media pack. It was antithetical to the Journal ethos. "Even if you're leading the pack, you're still part of the pack," Peter Kann, the Journal's former CEO, liked to say. "If there's something everyone is talking about, that should be on the front page of the Wall Street Journal," Murdoch told his aides.
-- War at the Wall Street Journal, by Sarah Ellison. page 170
-
From Print to Portal: More Online News Pricing Research
May 20, 2010
Some classmates of mine at Penn recently finished a class on Pricing Strategies in the Marketing Department taught by Professor Z. John Zhang, who studies such things, and they've written a paper titled "From Print to Portal: Pricing Strategies in the Online News Realm."
They've kindly given me permission to post it online and share it so go ahead and check it out here. (PDF Link) They give a history of the topic and discuss what many companies are doing now. In the conclusion they suggest that news sites should adopt hybrid subscription models.
The paper is a good qualitative treatment of the subject and a fresh take from some people not personally invested in the subject. This was a final paper for the class, and from what I know, none of the five team members have ties to or have worked in the industry.
-
Goodbye Dear Penn!
May 19, 2010
I am officially a graduate of the University of Pennsylvania.
This infographic I made for the DP does a fair job of summing it up.


