— jenksw

The New York City subway system has done an awesome job of achieving pure evil as of late. The 7 Train, my main mode of travel, is not running between Manhattan and Queens on any weekend from early January until early April. If I want to get to Manhattan on the weekend, I have to change trains. If I want to get to Brooklyn at any point, I’m better off riding my bike. This past weekend, my wife and I went down to Dumbo to see a play by The Wooster Group and had to take a train to a bus to a ferry. The story goes that the 7 train between Manhattan and Queens uses the oldest underwater tunnel in the system. When they originally made the tunnel they left no room for anything other than the train cars, so when work has to be done, the trains can’t share the space with any work crews.

I set about doing some research. According to the ever-entertaining comments on the Sunnyside Post blog, these extended closures of the 7 Train are nothing new, and nothing beneficial ever seems to come from them. I’m beginning to see why the great new savior of the MTA packed his bags after a handful of months and headed to an entirely different continent (with a gleaming new rail system).

Anyways, using the Googles I decided to track down some numbers on which lines get closed and how often. Do some lines have a history of closing much more frequently than others? Is the 7 train riddled with an unshakeable past that it is doomed to repeat? I have no idea, but I did find some AWESOME maps.

Seriously, I found nothing. I got so taken in by the incredible selection of historic maps and technical drawings that I abandoned my fact-finding mission and just oohed and aahed over the historical maps archived at www.nycsubway.org. I’ll keep looking. I’m sure this site even has the info I’m looking for, but, c’mon, check out this treasure!

These are all details from images found HERE. Enjoy.


I enjoy collaborative environments. Being pushed and pulled by others in an effort to make something seem harmonious to an outside viewer is always rewarding. Theater is a great environment for this, the more avant the better. I’ve had the pleasure of working on countless plays, doing music, video, or both. I recently found myself working on a production that required the musicians to work from “behind the curtain” in three unique environments, one for each act.

During the first act we watched a closed-circuit monitor while sitting just on the other side of the curtain from all the action. This meant taking our cues from the audio of the actors’ voices while following their physical movement on the monitor. It also had the awesome side effect of letting us voyeuristically watch the audience members file in and occupy the twenty seats on either side of the performance area. What started out as a simple “I bet that seat gets taken first” soon evolved into a highly competitive betting pool, with everyone picking the first seat taken as well as the last.

I recorded the order in which the seats were taken for the last four shows and used Illustrator to draw up a simple graphic charting the results. You can see some definite patterns evolve as well as a few wild cards. The most difficult part of this process was watching the seats get chosen and having to remain silent because the audience was only a few feet away on the other side of the curtain.

The chart:

Below is the raw data I scribbled down as I watched my hard-earned cheddar being won by other, far luckier gamblers.


NASCAR might be the most nerd-friendly sport out there. (Yes, it’s a sport. If you really want to argue the sportiness of anything, then you aren’t a nerd and shouldn’t be reading this blog…) The entire existence of NASCAR is based on the constant input of nerds. For every driver and car there is a team of fifty to a hundred nerds grinding away at reams of numbers, trying to chisel sacred seconds off the time it takes to turn a lap.

The end result of each race is also a bountiful supply of numbers, just waiting to be analyzed, crunched, and visualized. Besides the obvious source, www.nascar.com, there are plenty of other sites that offer up a ton of data for the picking. One of my favorites is www.jayski.com. The predominantly text-based layout keeps things simple and data rich. Each week, after the race, Jayski posts a detailed breakdown of the commercials shown during that race’s broadcast. It turns out this data is supplied by another site, CawsNJaws.com. I figured this would be a great place to start flexing my data-viz muscles (or lack thereof).

I decided to take all of the commercial stats for the 2011 season and look at the time spent watching cars race versus the time spent watching products being advertised (not counting the paint on the cars). I went through each race’s stats and threw everything together in a Google Docs spreadsheet. Once everything looked adequate I saved it out as a CSV file that I could load into Processing. Then, taking into account all of the arguments made by Tufte in his book, I started spitting out charts.
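
As a rough sketch of that loading step (the file name and column layout here are made up for illustration), loadStrings() and split() are all Processing needs to pull the spreadsheet in:

void setup(){
  // hypothetical file name, exported from the Google Docs spreadsheet
  String[] rows = loadStrings("races2011.csv");
  for(int i = 0; i < rows.length; i++){
    String[] cols = split(rows[i], ',');
    // assumed columns: track, minutes of racing, minutes of commercials
    println(cols[0] + ": " + cols[1] + " min racing, " + cols[2] + " min ads");
  }
}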

Take 1

This was my first pass. It’s only the first 34 of 36 races, but you get the idea. There is a lot of data being displayed here, but there is also a lot of wasted space. The vertical bars going left to right represent each race. The light orange is the race broadcast while the red is the commercials. Near the end of the season, ESPN introduced side-by-side racing, where they would split the screen so you could watch racing at half size while it shared the screen with ad space. The theory here is that racing at half size is better than no racing at all. The changes in the background gray represent the different networks (FOX, TNT, ABC, ESPN). The width of each bar represents the length of each track (.5 mile short tracks up to the 2.5 mile superspeedways).

What’s wrong with this take? First of all, you could cut the entire bottom half of the chart off and you wouldn’t lose ANY data. It certainly looks interesting, but since the red bars are always centered, their placement on the vertical scale means nothing. Second, while the total time for both racing and advertising is represented, it is hard to compare the ratio of racing to advertising by looking at this. Some races look to have identical amounts of commercials, but then you see that the orange bars are a lot longer for one than the other, so the ratio is quite different. How to represent both? Beats me. Let’s move along.

Take 2

Above, we have the bottom half removed. Now let’s tweak the colors and bring in the label text to see what is going on…

The colors are a bit hard to tell apart.

Things are falling into place. The problem remains of how to show both the total minutes of broadcast and a comparable display of the ratio of commercials to racing. Looking at the chart, there is still some relatively unused space where the track names are displayed, so I’ll drop a vertical rule in there to show that ratio.

And for the last step I’ll insert the names of the networks to show which colors correspond to which networks.

With the last iteration we have a good amount of data being represented without a lot of “chart junk” as Tufte would say. In my opinion, it’s far from ideal. The biggest problem is that it tries to show too much data. Do we really need to see the length of each track? Does that have any bearing on the amount of commercials shown? It doesn’t seem to in this view. Of course, we don’t know that until we show it in a legible form, so I guess creating these charts is always a bit like fishing. You do everything you can to get yourself sorted and then you’re ultimately at the mercy of the data.

The biggest insight I found in creating this concerns the second Daytona race. TNT bills it as “Wide Open Coverage” and promises more racing than ads. I vaguely remember them advertising this but never took it too seriously; looking at that race on this chart, though, it appears to be true. TNT certainly delivers what they promise. Pretty cool. Somebody give them a trophy, start spraying them with beer, and let them talk about how they couldn’t have done it without the help of all their sponsors…


The Visual Display of Quantitative Information

A few weeks ago I had the pleasure of attending a one day course taught by Edward Tufte. All of the attendees received all four of his books and got to spend the day hearing Tufte talk about this business of displaying information. The books are beautiful to hold, smell, and explore, true nerd coffee table trophies. After spending years seeing these on coworkers’ shelves, I was glad to finally get a set to call my own.

Last week I decided to pick up the first book in the set, The Visual Display of Quantitative Information, and give it a more thorough perusal. I finished the book last night and am already looking forward to starting the next one. It makes sense that a man who is considered one of the top minds in this field, and teaches at Yale, could put a string of words together that others might find interesting, but I was still surprised to find the book so “readable”. I made the mistake of assuming all of the treasure in this book would be in the graphics with some supportive textual side notes scattered throughout. I’m pleased to report that I was very wrong. Maybe this Yale place knows what they are doing when it comes to finding folks to spend their time teaching others.

Instead of giving a blow-by-blow account of the book, I’ll just highlight a few interesting moments and then spend the next few posts developing various charts and graphs that hopefully address these themes.

– Tufte splits all of the ink in an infographic into “data-ink” and “non-data-ink”: increase the former while eliminating the latter. An interesting application of this notion is when Tufte looks at the x- and y-axes of a chart and simply hacks off the portions of those lines that lie outside the limits of the presented data. Why use ink to display information outside the bounds of the data set? If the x-axis runs from 0 – 100 but the values only run from 25 – 88, then the axis should start at 25 instead of 0 and end at 88 instead of 100. This seems counterintuitive to me because my brain likes nice, even reference points; it would be like having a football field that wasn’t an even 100 yards. In terms of Tufte’s theory of data-ink it does make sense, though, so I’m willing to give it a shot and see how it affects things (see the toy sketch after this list).

– Alphabetical lists are a wasted opportunity. Tufte’s theory, and I’m paraphrasing here, is that an alphabetically ordered list presents an order we are already aware of and adds nothing to the data. If it is a list of people’s names, for example, why not order them by age, or height, or weight? Any of these non-alphabetical choices lets the placement of the names give us a further understanding of the data that was not present in the alphabetical ordering. Again, I find this counterintuitive to my current methodology. What if I want to look up someone’s name in the list? This leads to another interesting point:

– Be aware of the “viewing architecture” of the graphic. How does the design and placement of data affect how the user’s eyes scan over and absorb it? In the case of an alphabetical list, maybe it is more important for the user to see an underlying relationship between the people than to be able to look up a name easily. Maybe this is why I found this book so engrossing: I would read these theories, be persuaded by Tufte’s eloquent argument, and then stop and think, “Wait a second, if it’s so obvious and logical, why is it the exact opposite of what I normally do?” This idea of Tufte constantly challenging my preconceptions was more thrilling than upsetting. Instead of saying, “He’s an idiot” or “I’m an idiot,” I kept saying, “Hmmm, can’t wait to try that out and see what happens.”
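
As promised above, here is a toy Processing sketch of that range-frame idea from the first point, using just the 25 – 88 example (the pixel positions are arbitrary); the axis line spans only the data:

void setup(){
  size(400, 120);
  background(255);
  stroke(0);
  fill(0);
  float dataMin = 25;
  float dataMax = 88;
  // map the data bounds (on a nominal 0 - 100 scale) into pixels and
  // draw the axis only across that span, not the full 0 - 100 range
  float x1 = map(dataMin, 0, 100, 20, width - 20);
  float x2 = map(dataMax, 0, 100, 20, width - 20);
  line(x1, 70, x2, 70);
  text("25", x1 - 6, 88);
  text("88", x2 - 6, 88);
}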

The one truth that resounded throughout is that one can always be more critical of how one presents data. Coming from an interactive, web-based perspective, I’m used to building things in Flash, where information can move, be dynamic, and reveal itself based on user interaction. I’m really excited about tackling some of the aforementioned theories in the land of the static. How do you represent change, time, and narrative when you have only two dimensions to work with? I don’t think I’ll ever know as much as I think I should, but at least after reading this book I can continue down the path with a robust set of critical tools to work with.

Thanks for reading.


Bill and Hillary Clinton in the NY Times 1990 - 2010

I recently became aware of the NYTimes developer API and decided to use it as an excuse to dig into Processing. The Times API lets you search through a staggering amount of data and do fun things with the results. Whenever I want to learn a new API I usually poke around the internet for some interesting tutorials. Some folks prefer to go straight into the documentation, but I like the lazier approach of following someone else’s directions and seeing where I end up. Kind of like cooking from a recipe as opposed to just opening the cupboard and winging it. I scored big in this instance and found a great tutorial by Jer Thorp at blprnt.com.

Jer walks you through all the steps necessary to create your own data visualizations using the NYTimes API. You feed in the search terms and dates, and it creates graphs representing the total number of times those terms showed up in NYTimes articles over the given span of time. The post was written in 2009, when the API was introduced. Since then, I think, they have added a ‘hits per second’ limitation to the API. This causes problems because Jer’s example uses a for-loop to grab all of the data in the setup() function of the application. The only way around this is to spread the hits to the API over a longer period of time so as not to exceed the limitation. Here is a quick tutorial comparing the two approaches. I’ll approach it from a more general perspective and then offer up my augmented version of Jer’s original files. Enjoy.

Let’s start off with the original for-loop method. In this case we have an array of names and we want to assign a random number to each name and store everything in a HashMap.

We declare our two variables, the HashMap and the array of names:

HashMap results = new HashMap();
String[] words = {"Rusty", "Terry", "Dale", "Ward", "Tony"};

Next we use a for-loop in the setup() function to populate the HashMap with random numbers tied to each name:

void setup(){
  for(int i = 0; i < words.length; i++){
   float randomData = random(50);
   println("random number " + i + ":" + randomData);
   String w = words[i];
   results.put(w, randomData);
  }
  println("Rusty: " + results.get("Rusty"));
  println("Tony: " + results.get("Tony"));
}

We don't need to do anything on each frame so we can throw an empty draw() function in there for good measure:

void draw(){

}

Running this we get something along the lines of:

random number 0:16.98148
random number 1:16.044891
random number 2:44.09457
random number 3:6.0902476
random number 4:10.756329
Rusty: 16.98148
Tony: 10.756329

That seems about right. The first name is associated with the first random number and the last name with the last. This method is great except that it doesn't give us any control over how fast the application grabs the data and places it in the HashMap. The for-loop does it all in one go, before the first frame is even drawn. What if we wanted to put one item into the HashMap every second? To get that level of control we need to tweak this general approach and use the draw() function to populate the HashMap at a preset rate, in the fashion of a timer.

In order to move things over to the draw() function we need to create a few more variables to keep track of what's going on. We need a framerate for the loading phase and a framerate for the application once the loading is complete.

int queryRate = 1;
int drawRate = 60;

If queryRate is 1, the draw() function will run once every second; if it is 10, it will load data into the HashMap ten times every second.

Next, we need to know whether everything has loaded and what item we are currently loading:

boolean stillLoading = true;
int currentCount = 0;

Now we clean out the setup() function except for the framerate setting:

void setup(){
  frameRate(queryRate);
}

Here comes the bulk of this process. The basic strategy is as follows:
- check to see if we are still loading
- if we are still loading then use the current counter to load that item
- advance the current counter
- if we have more items to load, keep loading
- otherwise, set the loading flag to false

Here is the same strategy using code:

void draw(){
  if(stillLoading == true){
    float randomData = random(50);
    String w = words[currentCount];
    results.put(w, randomData);
    currentCount++;
    if(currentCount >= words.length)
    {
      stillLoading = false;
      frameRate(drawRate);
    }
  }
}

Here is all of the code combined (with some added println methods to let us know what is going on):

HashMap results = new HashMap();
String[] words = {"Rusty", "Terry", "Dale", "Ward", "Tony"};
int queryRate = 1;
int drawRate = 60;
boolean stillLoading = true;
int currentCount = 0;

void setup(){
  frameRate(queryRate);
}

void draw(){
  if(stillLoading == true){
    float randomData = random(50);
    println("LOADING: random number " + currentCount + ":" + randomData);
    String w = words[currentCount];
    results.put(w, randomData);
    //
    currentCount++;
    if(currentCount >= words.length)
    {
      stillLoading = false;
      frameRate(drawRate);
      //
      println("We are done loading.");
      println(words[0] + ": " + results.get(words[0]));
      println(words[4] + ": " + results.get(words[4]));
    }
  } else {
    // whatever needs to happen after everything is loaded
  }
}

Running this we get an output similar to the following:

LOADING: random number 0:30.265686
LOADING: random number 1:17.98813
LOADING: random number 2:45.138313
LOADING: random number 3:34.76939
LOADING: random number 4:34.88063
We are done loading.
Rusty: 30.265686
Tony: 34.88063

When it is running you will notice how all of the "LOADING" lines appear one at a time, at a pace set by your queryRate. To use this approach with external data, we would just replace the lines that generate a random number with lines that load the data from an external source. In the case of the NYTimes DataViz project, the data comes in the form of a two-dimensional array, so there is an extra level of counters, but other than that the strategy is the same. Hopefully this is something you will find useful and applicable. Flash has a built-in Timer class and JavaScript has setTimeout()/setInterval(), but in Processing you are left to your own devices. That stripped-down nature is part of why Processing is so fun and approachable, so I'm certainly not complaining.
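
To make that swap concrete, below is a sketch of the same loader with the random number replaced by a stand-in request function. queryNYT() is hypothetical; it just marks where the real API request and JSON parsing would go:

HashMap results = new HashMap();
String[] words = {"Rusty", "Terry", "Dale", "Ward", "Tony"};
int queryRate = 1; // one request per second keeps us under a rate limit
int drawRate = 60;
boolean stillLoading = true;
int currentCount = 0;

void setup(){
  frameRate(queryRate);
}

void draw(){
  if(stillLoading){
    // one external request per frame instead of a random number
    results.put(words[currentCount], queryNYT(words[currentCount]));
    currentCount++;
    if(currentCount >= words.length){
      stillLoading = false;
      frameRate(drawRate);
    }
  } else {
    // whatever needs to happen after everything is loaded
  }
}

// hypothetical stand-in: replace this body with the real request + parsing
float queryNYT(String term){
  return random(50);
}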

The source for these two examples can be downloaded here:
timerForLoop_sjw.zip

The source for my tweaked version of the NYTimes API project, originally created by Jer Thorp:
NYT_GraphMaker_sjw.zip

Jer Thorp's original tutorial can be found here.

Thanks for reading.
-Jenks
