December 27, 2013
December has been a contemplative month, and January is going to be hectic.
I don't believe in setting goals or making resolutions for the New Year. Usually I just let the New Year pass by. But there are ideas circulating in my head that I know I want to spend more time figuring out in 2014 — these ideas are half-baked but I don't want to forget them.
One idea I'm grappling with is the relationship between data and narrative and how to communicate uncertainty.
Consider the following ways of talking about a burger:
- That hamburger has 550 calories.
- That hamburger has between 400 and 700 calories.
- That hamburger is bad for you.
- I ate that hamburger every day for a month and gained a lot of weight and felt crappy.
- That hamburger has 550 calories and to burn it off you'll have to walk for 153 minutes.
- That hamburger has 10g of saturated fat. Do you want to give yourself a heart attack?
I'm happy that data journalism looks poised for ever higher popularity in 2014 — our society is plagued with innumeracy.
But I'm also somewhat terrified.
The Certainty of Numbers
With narrative storytelling there are built-in affordances for uncertainty and ambiguity: it's one story that's representative of the whole. Telling the same story with data may be more accurate or better at capturing the whole, but it doesn't take away ambiguity and uncertainty. It just hides them.
Don't get me wrong, I'm not against data. Data does help us better understand the world, solve problems and predict what will happen.
But the data-mindset is also a very reductionist mindset that glosses over complexity by shoving it into a black box or down into the footnotes. Including numbers or using data to make a decision gives people a sense of certainty and false confidence. Like in a fantasy TV show where "magic!" can be invoked as a catch-all explanation or solution to any problem, "numbers!" can be an argument-proof explanation for any decision. "The data say so," can be a quick way to shut down disagreement.
Because narratives are ambiguous about how they describe the world, they allow for uncertainty. As anyone who has made even the simplest attempt at data analysis outside of the confines of a statistics textbook knows, data about the world is ambiguous too. But once it is cleaned up and presented, it no longer shows any trace of that messy original.
This is less a problem for those of us who actually get our hands dirty cleaning a data set or tuning a model than for those who have no understanding of that and just make decisions off of the conclusions.
There are about a million ways the data in a seemingly simple statement like "user engagement is higher with design feature X" could be wrong, but without actually doing the data analysis it's impossible to have an intuitive understanding of what those ways might be.
While not using data to make decisions might be bad, what's definitely much worse is using the wrong data to make decisions. Because then in addition to being wrong, you have the sense of false certainty from having used data.
Is there a way to change that?
Margins of Error
If so, it feels related to how the margin of error of a data point is communicated.
The human mind actually seems very good at instinctively dealing with uncertainty. When we ponder a decision we imagine all the possible future narratives that might result, ranging from wild success to humiliating defeat. We can look at something and have a good sense of whether we'll be able to jump over it or fall in.
But too often in journalism that margin of error is simply not communicated at all.
Really changes the story, doesn't it?
Measurement Error and Measurement Bias
Lately, I've been spending a lot of time wearing different activity trackers like the Fitbit, Jawbone Up, Nike Fuelband, etc. They all purport to tell me how many calories I've burned throughout the day. What none of them do is bother to communicate at all that the number they show me might be wrong. Every day it's some number with four significant figures and no hint that it's an estimate or what some of the measurement biases might be.
At the end of the day, my activity tracker might tell me that I walked 10,874 steps and burned 2,397 calories. But is it really so sure that it wasn't 10,875 steps? Or 10,876? And what's a step anyway? And does that calorie count include the set of pushups I did while my wrist was stationary? If not, that number must be low.
The alternative narrative description of my day might be waking up in the morning and doing a few sets of pushups, then walking to the subway. Some pacing around at work and then walking to dinner and back home. It's described much less "precisely". But that sort of anecdotal description preserves the narrative uncertainty that keeps it a truer description. I think there might be something hardwired into the way humans think that lets us parse anecdotes into an understanding of the world that allows for uncertainty.
Narratives. Data. Uncertainty.
These are all related in some ways but I can't quite grasp it.
Increasingly I'm recognizing that a lot of the work that I've done in the past was deeply unsatisfying because it was oblivious to the narrative context of the data I was working with.
I'd love to hear more thinking about this from others.
September 20, 2013
This is my first time doing freelance work of any sort and it was an interesting experience. The editors there were great to work with and they have some very cool things going on. I'm impressed at how organized and on top of things they are especially considering they both have day jobs.
August 12, 2012
I just spent the past three weeks in Asia, visiting Tokyo, several cities in mainland China (Shanghai, Wuxi, Wuhu, Nanjing) and then Hong Kong. This was the first time I'd been in China since 2007. My cousin had invited the whole family to Nanjing for her wedding, and my brother and I decided to tack on a few more destinations. China is changing very rapidly and it was my first time in Tokyo. A few things I noticed:
On trains and public infrastructure
The subways and trains across Asia were fantastic and by far the easiest way to get around. The subways all had cell service, TVs and multilingual announcements of all stops. Many stations had air-conditioned platforms with doors built onto them.
Traveling Nanjing to Shanghai on high speed rail (about the same distance as New York to Boston) took an hour and a half from city center to city center.
Bikeshare was everywhere. Tokyo had it in many places. Shanghai, Nanjing and Wuxi all had it. Wuhu is planning it. Most didn't seem to require any sort of advance sign up, just pay at a vending machine and ride away. In Tokyo I even saw what looked like a Vespa share station.
On restaurants in China
Several restaurants I went to in China have adopted a curious habit of charging extra for napkins and even for plates and dinnerware. Each setting at the table would have a sealed wet napkin and the plate, cup and bowl sealed in shrink wrapped plastic. Opening the wet napkin cost 1 RMB and the plates 2 RMB. The restaurants have outsourced dishwashing to companies that specialize and return dishes shrink wrapped and disinfected and are passing the cost on to customers.
Curiously though, every restaurant allowed you to bring in outside beverages, alcoholic and otherwise.
On health and food safety
In China, my relatives were very concerned about the safety of food. Any fresh fruit was peeled, including grapes and peaches. In our luggage, we brought several bags of powdered milk and baby formula for people who had requested it.
While in mainland China we experienced a string of beautiful hot blue sky days with no smog, rain or clouds. Near Wuhu many factories were shut down due to high temperature. Hong Kong was smoggy while we were there though. The South China Morning Post seemed to be leading a call for warning the public about the dangers of smog.
On shopping malls
There were an absurd number of shopping malls and department stores. And yet they somehow all seemed full of people shopping. The malls are mostly built up vertically, and the basements have giant food courts. In Japan in particular, the department store basements have "depachika", glorious food markets with stall after stall of varied food and pastries.
On suburbs, gated communities and cars
China's cities are expanding and the streets everywhere are clogged with cars, but they still haven't reached U.S. levels of sprawl yet.
New construction in the outlying areas of cities consists of apartment complexes with multiple high or low rise buildings, parking spaces, green space and maintenance buildings. They look like New York's Stuyvesant Town but on a smaller scale.
There's a real bias against buying a "used" apartment. People generally prefer new construction. Many newly constructed buildings provide the apartments unfurnished with no fixtures, appliances, floor boards, etc., to allow the new purchaser to install those to their own liking.
On banquets, table seating and drinking
When we went out to eat as a large group in China, we rarely sat in the main dining room of a restaurant. Most restaurants have a large number of private dining rooms of different sizes to accommodate the group. The dining rooms all have a large circular dining table with a lazy susan to hold the food.
Even though the table is round, seating matters. The guest of honor sits at the seat furthest from the door and facing it. There is typically some sort of aesthetically pleasing backdrop behind them. That end is the "top" of the table. In the opposite seat, at the "bottom" of the table and with their back to the door, is the person who has invited everyone to dinner, and who will confer with the waiters to order and will pay at the end.
For the rest of the seats, people sit from the top of the table to the bottom according to their position.
On Chinese tourists and tour groups
Tourism in China is booming. My aunt owns a travel agency in Wuhu and told me she has dozens of new competitors. Tour groups from China dominated many of Tokyo's tourist attractions with their matching hats and flag waving guides. Within China, tour groups were even more common. In fact, people don't seem to travel on vacation in any way other than a tour group. These tours are organized and paid for by offices for all their employees to go on vacation together, bringing along spouses and children.
On global cultural convergence
Wuhu is a fairly small city by Chinese standards, in a part of the country analogous to the Midwest. In the center of town there is a large pedestrian shopping area, and at night at the center of that shopping area there were high school age kids with a boombox blasting LMFAO's Party Rock Anthem and shuffling. Shuffling pretty darn well too.
May 26, 2012
My New York City jury summons instructs me to report to 111 Centre St, Room 1121 on Wednesday May 23rd at 9 a.m.
So at about 9:10 a.m. Wednesday morning I arrive at the jury waiting room on the 11th floor of the court building. The waiting room is long, with rows of cushioned chairs and a few small TVs mounted towards the front of the room. The TVs play a video explaining the important role a juror plays in the justice system. Every other seat in the room is filled with one of those important jurors, about 100 in total. Half watch the video, half just fidget with a phone and wait for something to happen. I find myself a seat and start to fidget with my phone.
The video ends and we wait some more. After a little while a woman arrives at the front of the room and starts talking. People shout out "here!" as she reads names of people who will be part of the first jury panel.
The rest of us wait some more. The waiting area has free wifi and vending machines and some desks with power outlets but not much more than that. People look bored.
After more waiting, a second group of jurors is selected. This time I am called and we are led down to the 10th floor where there are benches along a hallway outside the courtroom. We sit and wait some more. After a bit, we enter the courtroom single-file. Waiting for us are a prosecutor, defense attorney, defendant, judge, clerk and a stenographer.
After some instructions from the judge, twenty of us are picked to sit in and around the jury box for questioning. As we are directed to our seats, the judge writes our names on a board in front of her so she can address everyone by name and take notes about each of us.
A long, long list of questions begins to be asked of us by the judge.
She asks about our prior experiences with crime, with police, with the neighborhood the incident took place in, our neighborhoods and professions, our roommates' professions, our education, our hobbies, and what newspapers or magazines we read. We're asked about our interaction with the police and all confess to the date of our last speeding ticket. Some have relatives who have been arrested. Some have relatives who are police. I have neither.
Then both lawyers, first the prosecutor and then the defense attorney, ask us even more questions. They ask about our ability to follow the law, about how we will interpret whether someone is telling the truth or not, and about whether we would find someone guilty or not guilty in different circumstances.
Eventually we are released for lunch and told to report back at 2:30 p.m. We scatter out of the room, down the elevator and into Chinatown.
At 2:30 p.m. I am back on the benches outside the 10th floor courtroom to wait some more. The court officer calls us back into the courtroom to find out who has been selected for the jury. Of the 20 of us questioned, only four are kept. Another group is seated in the jury box to be questioned. The 16 of us go back to the jury room to wait some more.
After a shorter wait we get an announcement that they won't be needing more jurors for the next few days and that our jury service is complete!
April 25, 2012
Here at the Times, we've just launched an internal system for sharing ideas and I posted this there. But I figured others might also be interested to hear my case for why NYTimes.com should offer free online access to schools and libraries. I, of course, have no real influence over this decision.
I hope this is in the pipeline already or being considered but I think we should whitelist the IP addresses of these public institutions.
When the paywall was launched there was a lot of hue and cry over how we were restricting the public value of our journalism by putting it behind a paywall. There are a lot of people for whom $15 a month is more than they can afford and if we cut them off we become more of a tool purely of and for the elite. Public libraries are important institutions that provide access to information to large swaths of society which are underserved. They often have free copies of the printed New York Times. They should have free NYTimes.com too.
Kids in school are unlikely to have any influence on the decision to subscribe to the Times or not and they should have the opportunity to read. Lifelong habits can develop early. I know that when I was a kid and had no control of money, what I read and what software I used was purely dictated by what I could get for free. By not letting kids read for free, we risk alienating an entire generation of new readers.
Neither of these groups of readers is likely to overlap much with the set of people likely to purchase a digital subscription. And it's unlikely that people who would otherwise purchase a subscription will start trekking to a school or library every time they want to read.
Site licenses and group subscriptions might be a good solution for universities or workplaces, but for primary and secondary schools and public libraries it's likely to be beyond their budget or beyond the mind of whoever is in charge of purchasing.
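For what it's worth, the whitelisting idea is cheap to express in code. Below is a minimal sketch, assuming the site already knows each visitor's IP address; the network ranges are reserved documentation addresses used as stand-ins, and the names `ALLOWED_NETWORKS` and `has_free_access` are hypothetical, not anything the Times actually runs.

```python
import ipaddress

# Hypothetical address ranges; real ones would have to come from each institution.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),     # stand-in for a public library system
    ipaddress.ip_network("198.51.100.0/25"),  # stand-in for a school district
]

def has_free_access(client_ip):
    """Return True if the request comes from an allowlisted institution."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)
```

A paywall check could call `has_free_access(request_ip)` before counting an article against the monthly limit.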
April 11, 2012
A lot of effort at journalism innovation has been focused around the product that our readers experience. People are doing great things to take advantage of the new storytelling forms and new ways of engaging with people that the web browser and the internet have made possible.
But I want to turn some attention to the opposite side of things. What about all the myriad tasks that lead up to writing and producing a story that represent most of the work that a reporter does? Where is the innovation that makes that work faster and easier?
What tools do people currently use?
I would love to read a series of posts similar to News.me's "Getting the News" series but instead "Reporting the News" talking with a variety of different reporters going in-depth about their personal processes for reporting and writing stories.
Anecdotally, it seems that most reporters use some mix of the standard email, address book, web search, note taking and writing tools that are available to everyone.
But journalism is a specialized process and these are generalist tools. Surely there is room for improvement.
On the Apartment Hunt
Searching for an apartment in New York can be a long and painful process of navigating mercenary real estate brokers and misleading listings on multiple sites. My two roommates and I have been through this process twice.
Two years ago, we kept track of our search with a Google spreadsheet of possible apartments we could find and the status of our contact with each listing. It required a lot of manual work to remove duplicates and update information.
This year, we used a new tool called Nestio.
Nestio has no apartment listings on it; it's not a competitor for Streeteasy or Craigslist. Instead, it's a tool for people searching for apartments to organize their search. You can add links or use a bookmarklet to save listings to it. Then it crawls that listing page, saves the photos and structures the information about the listing.
You can keep track of when you are scheduled to visit each one and who the contact is for the listing. Through their mobile app you can add additional photos and notes when you visit or update and correct the information that was scraped. And there's a mailer that lets you send a form email to the listing broker with one click and get responses back to your email.
Nestio made the search process a whole lot easier because there was a single way to refer to all the information around each apartment we were considering.
It's a great tool for organizing information around a single purpose: finding a great apartment. Now where's the equivalent for reporting?
April 5, 2012
Quick sequence of interesting news to read about advertising.
First, Twitter is allowing advertisers to take tweets from their existing Twitter accounts and have them shown as "promoted" content in the timelines of people who don't follow them. Tweets are algorithmically selected based on which ones people are engaging with and then automatically inserted into the feeds of people who will hopefully find them relevant.
Second, an interview with Chris Batty, former head of ad sales at Gawker Media who is headed to The Atlantic as the publisher for their planned new business site. Here talking about sponsored posts:
Mr. Batty: I know personally I want to know how big these shale-deposit discoveries are. If you listen to one side of the debate, it solves our energy problems. If you listen to the other, it’s too polluting. Let’s get to the bottom of it.
Those are the kind of things that I think digital-publishing platforms can do really, really well relative to other media. We’re going to bring the power of the web to advertisers, not just hoard it for the purpose of aggregating enormous audience and not having a powerful enough ad system to generate the profits we need to reinvest.
Ad Age: Is fracking really the right subject to investigate with paid posts written by people with huge stakes in the outcome? Isn’t that much better handled by a reporter without as much of a vested interest?
Mr. Batty: Sure and we will do that for the benefit of the audience. But look, Shell knows a lot about the nature of these deposits. Let’s give them the power of our publishing tools to talk to our audience about it with the disclosure that this is Shell.
BuzzFeed currently earns all of its revenue from branded content—a form of advertising in which corporations create story-like units that live among a publisher’s editorial products and share the same underlying aesthetic, tone, and technology. Recent clients have included Kraft Foods, Dell, and McDonald’s.
Taken together, the three pieces linked above point to a possible way forward for advertising-supported media.
Bypassing the Media
The noise around aggregation and how the internet devalues original reporting misses the point altogether and is irrelevant to anything except authors' egos. The real threat to traditional journalism outfits is marketers going direct and bypassing the media altogether.
Historically, the high cost barrier of distribution and production of content prevented marketers from taking their message directly to the audiences they wanted to reach. The media choices people could make on any given day were finite and countable. In front of a newsstand, people would pick some number of publications to purchase and read. There was enough time to watch or listen to a fixed number of programs per day. Given that limited set, advertisers were left to buy space for their messages alongside the news articles people wanted to read and in-between the TV and radio programs people wanted to watch. Outside of a few exceptions, consumers wouldn't consciously choose to see advertising.
That barrier has now collapsed. In our online lives we make hundreds if not thousands of choices about what media to experience every single day. No one outlet has the burden of providing "completeness." If marketers can create original content that both promotes their brand and is interesting and entertaining, then that content can spread to people through all the same channels that any other news or entertainment content does.
That's better than obnoxious pushdown banner ads, homepage takeovers and interstitials.
What companies have to say is often a part of the news and the public discourse. There are wires over which companies will send press releases and which journalists monitor for story ideas. Spokespeople for companies are often quoted in stories. Why waste a reporter's time rewriting a press release or copying down a company spokesperson's statement? Why not just let them publish those statements directly?
Many companies already use their own company blog to communicate very effectively, but most don't have the ability to reach everyone they want to reach whenever they want.
Below, the headers from two pieces of content that don't originate from the publication hosting them.
December 9, 2011
I never got to write one about my foursquare check-ins project and by now I've probably forgotten too many of the details of the process to do a proper write up. Not this time. This'll be kind of a mind dump though.
False start with turnstile data
The idea for this project came from finding the dataset available on the MTA's developer website. In addition to the fare type data we ultimately ended up using, they also make available raw turnstile swipes and that was the one I first looked at. This was last July.
The turnstile data contains cumulative counts of entrances and exits and status report codes for every turnstile in the system every four hours. Then there's another file matching those codes up to station names and subway lines. In a first pass at this data, I made simple line charts of each turnstile's hourly entrances and exits. They were pretty messy and only mildly interesting. I can't find the charts anymore or I'd show them here. Then the project lay dormant for about a year as other news intervened.
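Because the counts are cumulative, charting them means first turning consecutive readings into per-interval differences. A minimal sketch of that step, assuming each reading is a tuple of (turnstile id, ISO timestamp, cumulative entrances, cumulative exits); that layout and the function name `interval_counts` are made up for illustration and are not the MTA's actual file format. It also assumes a drop in the counter means a reset and simply skips that interval.

```python
from collections import defaultdict

def interval_counts(readings):
    """Turn cumulative turnstile readings into per-interval entrance/exit counts."""
    by_turnstile = defaultdict(list)
    for turnstile_id, timestamp, entries, exits in readings:
        by_turnstile[turnstile_id].append((timestamp, entries, exits))

    result = []
    for turnstile_id, rows in by_turnstile.items():
        rows.sort()  # ISO timestamps sort chronologically as strings
        for (_, e0, x0), (t1, e1, x1) in zip(rows, rows[1:]):
            d_entries, d_exits = e1 - e0, x1 - x0
            if d_entries < 0 or d_exits < 0:
                continue  # assumed counter reset: skip this interval
            result.append((turnstile_id, t1, d_entries, d_exits))
    return result
```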
Real start, cleaning data
Eventually, the Greater New York section came upon the fare type data posted by the MTA and was interested in running a story based upon it. The fare type data set is a single file for each week and records station by station how many times each type of metrocard was swiped.
Starting out, we didn't know what the data would show and we didn't have an entirely clear idea of what we were looking for so I decided to just clean it up and play around with the data for a bit to see if any interesting trends popped out. This is not necessarily how I would approach a data project in the future. I think we might've ended up with something even more interesting and more pointed if we had had some questions we wanted to answer with the data to start with.
The first step was to import all the data to one place instead of one separate file per week. A quick Python script to put everything into MySQL, then an export to Google Refine to fix inconsistent spellings of station names and then exporting to CSV.
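The combining step can be sketched roughly as follows. This is not the original script: it skips MySQL and Google Refine entirely, keeps everything in memory, and uses a crude made-up cleanup rule (upper-case, collapse whitespace) in place of the manual spelling fixes. The `STATION` and `FF` column names match the cleaned file; the function names and sample rows are hypothetical.

```python
import csv
import io

def normalize_station(name):
    """Collapse whitespace and upper-case a station name (a hypothetical cleanup rule)."""
    return " ".join(name.upper().split())

def combine_weekly_files(weekly_files):
    """weekly_files: iterable of (week_label, open text file) pairs, one per week."""
    combined = []
    for week_label, handle in weekly_files:
        for row in csv.DictReader(handle):
            record = dict(row)
            record["STATION"] = normalize_station(record["STATION"])
            record["week"] = week_label  # remember which file the row came from
            combined.append(record)
    return combined

# Two made-up weekly files with inconsistent spellings of one station.
sample_files = [
    ("2010-08-21", io.StringIO("STATION,FF\n42nd  Street & Grand Central,10\n")),
    ("2010-08-28", io.StringIO("STATION,FF\n42ND STREET & GRAND CENTRAL,12\n")),
]
rows = combine_weekly_files(sample_files)
```

After this, both rows carry the same station name and a `week` field, so they can be grouped or exported as one table.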
Then I thought I'd try a non-geographic visualization. I'd make a grid with the weeks on one axis and the subway stations along another ordered from high traffic to low traffic. At each point in the grid there'd be a pie chart or a stacked bar showing the proportion of each type of swipe at that station in that week.
I decided to first try and do it in Processing as a way to learn more about it and 3D graphics.
It came out looking kind of like this.
So yea. I didn't quite have the scale of the data right. 460 stations by 60ish weeks of data each? Oh that's almost 28,000 datapoints. It was not particularly comprehensible.
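For reference, each cell of that grid needs two things: the share of each swipe type at one station in one week, and the station's rank by total traffic. A small sketch of both, with hypothetical function names and fare-type keys borrowed from the data's column names:

```python
def swipe_shares(counts):
    """counts: dict of fare type -> swipes at one station in one week."""
    total = sum(counts.values())
    if total == 0:
        return {fare: 0.0 for fare in counts}
    return {fare: n / total for fare, n in counts.items()}

def order_by_traffic(station_totals):
    """Return station names from highest to lowest total swipes."""
    return sorted(station_totals, key=station_totals.get, reverse=True)

# 460 stations by roughly 60 weeks is the number of grid cells to draw.
grid_cells = 460 * 60
```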
Maybe it'd be better to try and place stations according to their geographic location instead of in a grid, and then animate over time.
Unfortunately, the station names in the fare type files didn't match the station names anywhere else. They were a combination of the names shown on the official subway map, and when those conflicted, an added cross street.
The official file of station locations gave station locations by station name and line along with an exact lat lng for each entrance or exit.
I matched these up by hand, picking one entrance for each station. The resulting file is here.
After a few iterations, I ended up with something looking like this (using the plate carrée projection, a.k.a. x=lng, y=lat).
Kind of cool. Still not exactly easy to make sense of, even though in Processing I can adjust the camera and fly around it.
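The plate carrée placement is about as simple as projections get: longitude maps straight to x and latitude straight to y, scaled to the canvas. A minimal sketch in Python rather than Processing; the function name, the bounding-box argument and the canvas size are all assumptions made for illustration.

```python
def plate_carree(lng, lat, bounds, width, height):
    """Map lng/lat to pixel x/y; bounds = (min_lng, min_lat, max_lng, max_lat)."""
    min_lng, min_lat, max_lng, max_lat = bounds
    x = (lng - min_lng) / (max_lng - min_lng) * width
    # Screen y grows downward, so the northern edge maps to y = 0.
    y = (max_lat - lat) / (max_lat - min_lat) * height
    return x, y
```

At New York's latitude this stretches things east to west compared with a projection that corrects for latitude, which is fine for a rough look at the data.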
Time to try another tack.
Since trying to visualize the data straight away wasn't working so well, I decided to try and analyze the data in R and find some basic summary statistics.
I use R in RStudio (rstudio.org), which is a really nice IDE for R. I'm almost a complete beginner at R, and it's been really helpful.
There's this really cool function, summary(dataframe), that takes some data and prints out a whole bunch of summary statistics of it. So I did:
MTAFARES1108 <- read.csv("~/MTAFARES1108/data/MTAFARES1108_cleaned.csv")
summary(MTAFARES1108)
and got out
start_date         end_date           REMOTE           STATION
2010-08-21:  466   2010-08-27:  466   R001   :   61    42ND STREET & GRAND CENTRAL:  183
2010-11-06:  466   2010-11-12:  466   R002   :   61    23RD STREET-6TH AVENUE     :  122
2010-11-20:  466   2010-11-26:  466   R003   :   61    25TH STREET-4TH AVENUE     :  122
2010-11-27:  466   2010-12-03:  466   R004   :   61    34TH STREET & 6TH AVENUE   :  122
2010-12-04:  466   2010-12-10:  466   R005   :   61    34TH STREET & 8TH AVENUE   :  122
2010-12-11:  466   2010-12-17:  466   R006   :   61    42ND STREET & 8TH AVENUE   :  122
(Other)   :25535   (Other)   :25535   (Other):27965    (Other)                    :27538

          FF       SEN.DIS   X7.D.AFAS.UNL   X30.D.AFAS.RMF.UNL   JOINT.RR.TKT   X7.D.UNL
Min.      0        0         0               0.0                  0.0            0
1st Qu.   9517     363       37              128.0                1.0            3080
Median    16029    664       78              262.0                5.0            6226
Mean      27757    1217      113             412.6                104.4          8999
3rd Qu.   32164    1427      147             490.0                26.0           11638
Max.      291172   13083     1082            5062.0               7951.0         97486

          X30.D.UNL   X14.D.RFM.UNL   X1.D.UNL   X14.D.UNL   X7D.XBUS.PASS
Min.      0           0.00            0.0        0.0         0.00
1st Qu.   5134        0.00            0.0        0.0         7.00
Median    11295       2.00            8.0        108.0       20.00
Mean      19833       12.65           325.9      665.8       86.85
3rd Qu.   26114       16.00           226.0      915.5       73.00
Max.      276941      251.00          18867.0    21757.0     2371.00

          TCMC     LIB.SPEC.SEN   RR.UNL.NO.TRADE   TCMC.ANNUAL.MC   MR.EZPAY.EXP
Min.      0.0      0.000000       0.0               0                0.0
1st Qu.   44.0     0.000000       3.0               386              17.0
Median    104.0    0.000000       12.0              854              49.0
Mean      271.2    0.006424       280.7             1416             175.3
3rd Qu.   283.0    0.000000       69.0              1701             168.0
Max.      3600.0   3.000000       16197.0           21629            2890.0

          MR.EZPAY.UNL   PATH.2.T   AIRTRAIN.FF   AIRTRAIN.30.D   AIRTRAIN.10.T
Min.      0.00           0.00       0.0           0.00            0.00
1st Qu.   12.00          0.00       22.0          0.00            0.00
Median    35.00          0.00       56.0          0.00            0.00
Mean      92.81          34.48      274.9         46.38           13.59
3rd Qu.   113.00         0.00       161.0         0.00            0.00
Max.      1707.00        10265.00   47909.0       17933.00        6150.00

          AIRTRAIN.MTHLY   total
Min.      0.000            1
1st Qu.   0.000            21130
Median    0.000            37422
Mean      1.111            62134
3rd Qu.   0.000            76816
Max.      687.000          697709
Similarly, the plot function has a cool default when called on a dataframe that prints a whole bunch of summary plots.
BYDATE <- aggregate(MTAFARES1108[,c(5,6,10,11,13,14,27)], list(start_date=MTAFARES1108$start_date), sum)
BYDATE$subtotal <- rowSums(BYDATE[,c(2:7)])
plot(BYDATE)

BYSTATION <- aggregate(MTAFARES1108[,c(5,6,10,11,27)], list(STATION=MTAFARES1108$STATION), sum)
BYSTATION$subtotal <- rowSums(BYSTATION[,c(2:5)])
plot(BYSTATION)
Printed out big, these are kind of fun to look at. Each variable in the data is in a scatter plot with each other variable.
You can see some trends in these plots. The usage of full fare and seven-day unlimited cards trends up when the one and 14-day unlimited cards are discontinued. The usage of different types of cards is generally pretty well correlated. At PATH stations, only full fare cards are used so there's a set of stations without unlimited card swipes.
More data, mooore data
Now I already had a database full of census data on population from making Census Map Maker so I decided to bring that in too. I wrote a Python script (GeoDjango script to be exact) to loop through all the 2010 census blocks for Manhattan, Brooklyn, Queens and the Bronx and assign each block to the closest subway station and then calculate the union polygon of the set of blocks for each subway station. Then each shape was assigned the data for that subway.
Later I limited each area to also be within 1000 meters of a subway stop to get the final shapes.
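The assignment logic, stripped of the GIS machinery, looks roughly like the sketch below. It is not the GeoDjango script: it assumes each census block is reduced to a single representative point (say, its centroid), measures great-circle distance with the haversine formula, and skips the polygon union entirely. The function names and the dict layout are hypothetical.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two lat/lng points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lng2 - lng1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def assign_blocks(blocks, stations, max_distance_m=1000):
    """blocks, stations: dicts of id -> (lat, lng). Returns block id -> station id or None."""
    assignment = {}
    for block_id, (b_lat, b_lng) in blocks.items():
        nearest, nearest_d = None, float("inf")
        for station_id, (s_lat, s_lng) in stations.items():
            d = haversine_m(b_lat, b_lng, s_lat, s_lng)
            if d < nearest_d:
                nearest, nearest_d = station_id, d
        # Blocks farther than the cutoff from every station stay unassigned.
        assignment[block_id] = nearest if nearest_d <= max_distance_m else None
    return assignment
```

Looping over every station for every block is slow at city scale; a spatial index would be the usual fix, but the brute-force version shows the idea.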
By doing this, we get the race and income data for each area around a subway station. That could be interesting to look at. Exporting back into a CSV file with one row per station and then using R again gives us a couple of charts like these below. Each point represents one subway station area. The y-scale, commuters_percent, is the share of people using 30-day unlimited or TransitChek unlimited metrocards.
So these are pretty noisy and I didn't get all that far with an analysis. We ended up not doing anything with these regressions. I'm posting the datafile here in case anyone else wants to play around with it further.
Fare price increase
Next was to try and see if the fare increase made a difference in people's metrocard usage habits. Looking at systemwide usage, there were big dips in the number of swipes in weeks without five working days as so much usage is by people commuting. To smooth things out for comparison, I cut out holiday weeks and then picked as many from before and after as there was still data for. This turned out to be 27 weeks each.
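The week selection can be sketched like this, assuming weeks are identified by ISO date strings and that the weeks kept are the ones closest to the fare change on either side (the post doesn't say which weeks were kept, so that part is a guess). The function name and the sample dates in the usage note are placeholders.

```python
def balanced_windows(weeks, change_date, holiday_weeks):
    """Return (before, after): equal-length lists of non-holiday weeks around change_date."""
    usable = sorted(w for w in weeks if w not in holiday_weeks)
    before = [w for w in usable if w < change_date]
    after = [w for w in usable if w >= change_date]
    n = min(len(before), len(after))
    if n == 0:
        return [], []
    # Keep the n weeks nearest the change on each side.
    return before[-n:], after[:n]
```

Called with something like `balanced_windows(all_weeks, "2010-12-30", {"2010-12-25"})`, it returns two lists of the same length, which is how a figure like 27 weeks each would fall out.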
I tested whether there was a significant change in R, running one t-test at the 99% confidence level for each station and type of swipe.
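A rough Python equivalent of one such test, for a single station and swipe type, is sketched below. It assumes Welch's unequal-variance t-test, which is what R's t.test does by default, and instead of computing a p-value it compares the statistic against a fixed cutoff. The default cutoff of 2.68 is approximately the two-sided 1% critical value for about 50 degrees of freedom, which is the right neighborhood for two samples of 27 weeks; it is an approximation, not a substitute for a proper t distribution lookup. The function names are hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(before, after):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    n1, n2 = len(before), len(after)
    v1, v2 = variance(before) / n1, variance(after) / n2
    # Assumes at least one sample varies; otherwise this divides by zero.
    t = (mean(after) - mean(before)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

def significant_change(before, after, critical_t=2.68):
    """True if |t| exceeds the (approximate, assumed) 99% two-sided cutoff."""
    t, _ = welch_t(before, after)
    return abs(t) > critical_t
```

Running one test per station and swipe type means thousands of tests, so even at the 99% level some "significant" results will be flukes.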
Ultimately, once all the data work was complete, building the map interactive itself was fairly straightforward. The part I think worked best was the annotations with descriptions of interesting points to look at. Despite having that though, I still think this project ended up kind of falling victim to the "throwing data at readers" problem. It could've benefited from an even stronger narrative strand to tell a specific story.
As noted at the beginning, we didn't have a good sense of what story we wanted to tell or what questions we wanted to answer, so in the data analysis I kind of struggled to figure out what was interesting and what wasn't.
Someone suggested on Twitter having a feature where users could add annotations or questions to each subway stop. In retrospect, I think this would've been really useful and I wish I had built that in. So many of the stories are probably local and specific to a subway station. And there are probably interesting data anomalies at that level that people could ask about and then we could put in the reporting effort to find answers.
December 4, 2011
There's some news I'm very excited to announce — I'll soon be starting a new job as part of the Interactive News Team at the New York Times.
For quite a while, the NYT has been setting the standard for the kind of work I want to do. The 2009 New York Magazine profile of the team was in large part what convinced me that a career in journalism might actually be viable for me. So many members of the team I'm joining are people whose work I greatly admire. I've already learned a ton from reading their blog posts, open source code, listening to their conference and meetup presentations and trying to reverse engineer their work. So needless to say, I'm really excited to work with and learn even more from them.
In my time at the Wall Street Journal, I've also had the pleasure of working with many talented people. I'm especially grateful for the freedom I had to experiment and try new things.
Still, I'm sure this is the right move for me. It's clear that at the highest levels of the organization the New York Times has made the web a priority in a way that the Journal has not.
I'm extremely grateful to Aron Pilhofer and everyone else at the Times who I've talked to before now for giving me this opportunity. I can't wait to get started. My first day is December 12th.
November 27, 2011
This site is now running on Django with a pretty stripped down amount of code. In many ways, it's going backwards from WordPress, but I see this as a way to keep my Django skills sharp and to try out a few ideas of mine on a site I'm fully invested in.