
Allen is a published fiction and non-fiction writer working on his second novel. He currently resides in Hanoi, Vietnam where he is traveling around SE Asia. He is an avid reader and lifelong geek interested in fiction, philosophy, and technology. Allen is a DZone Zone Leader and has posted 284 posts at DZone.

Data Visualization: A Day in the Life of an NYC Taxi

07.15.2014

Talk about data visualization is typically reserved for the big picture. Graphs and charts about coral reefs or population dynamics are fascinating, but the thing about data is that it can make the most mundane things beautiful. Take, for instance, a day in the life of a single NYC Yellow Cab.

Check it out, here.

The data for this visualization comes not from a GPS, but instead from a database of approximately 170 million taxi trips recording the pick-up and drop-off locations of each trip.

The raw data include only start and end locations for each trip. These points were run through Google's Directions API to create the routes shown in this visualization. Of course, these routes are Google's best guesses, not necessarily the ones the taxi took.

The visualization shows a green dot while picking up a passenger, a yellow dot while en route, and a red dot while dropping off a passenger. While the driver is searching for his next fare, a large yellow radius is shown, like what your GPS shows you while acquiring satellites.

What this means is that the visualization comes from interpolations of the data rather than, for instance, GPS records. The program knows only when and where the taxi picked up and dropped off passengers, and uses the Google Maps API to guess a route between the two points.
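
To illustrate the kind of interpolation described above, here is a minimal sketch; it is not the project's actual code. It assumes a hypothetical `positionAt` helper, a route given as an array of `[lat, lng]` pairs (for example, points taken from a directions response), and a taxi moving at constant speed between the recorded pickup and dropoff times.

```javascript
// Hypothetical helper: estimate where the taxi is at time t, assuming it
// moves at constant speed along the guessed route between pickup and dropoff.
// route: array of [lat, lng] pairs; times are in milliseconds.
function positionAt(route, startTime, endTime, t) {
    if (route.length === 1 || t <= startTime) return route[0];
    if (t >= endTime) return route[route.length - 1];

    // Length of each segment, using plain Euclidean distance in degrees.
    // This is only a rough approximation of distance on the ground.
    var lengths = [];
    var total = 0;
    for (var i = 1; i < route.length; i++) {
        var dLat = route[i][0] - route[i - 1][0];
        var dLng = route[i][1] - route[i - 1][1];
        var len = Math.sqrt(dLat * dLat + dLng * dLng);
        lengths.push(len);
        total += len;
    }
    if (total === 0) return route[0];

    // Distance covered so far, as a share of the whole route.
    var target = total * (t - startTime) / (endTime - startTime);

    // Walk the segments until reaching the one that contains the target distance.
    for (var k = 0; k < lengths.length; k++) {
        if (target <= lengths[k]) {
            var f = lengths[k] === 0 ? 0 : target / lengths[k];
            return [
                route[k][0] + f * (route[k + 1][0] - route[k][0]),
                route[k][1] + f * (route[k + 1][1] - route[k][1])
            ];
        }
        target -= lengths[k];
    }
    return route[route.length - 1];
}
```

The constant-speed assumption is exactly why the dots are guesses: the data say nothing about traffic, stops, or detours between the two recorded points.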

Each leg of the day (fare, no fare, fare, and so on) should be its own path in the SVG. To do this, pass 10 coordinates at a time to the Google Directions API: figure out how many total legs there are for the day, then generate queries, each with 10 coordinates. Remember that coordinate 10 of the first query will be the same as coordinate 1 of the second query.
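
The overlapping batches described above can be sketched roughly as follows. This is an illustration under assumptions, not the project's code: `batchStops` is a hypothetical helper, and `stops` stands for the day's ordered list of pickup and dropoff coordinates.

```javascript
// Hypothetical helper: split an ordered list of stops into batches of at most
// `size` coordinates, where each batch starts at the last stop of the previous
// batch so that no leg between two stops is lost.
function batchStops(stops, size) {
    if (size < 2) {
        throw new Error('size must be at least 2');
    }
    var batches = [];
    // Step by size - 1 so that consecutive batches share one coordinate.
    for (var start = 0; start < stops.length - 1; start += size - 1) {
        batches.push(stops.slice(start, start + size));
    }
    return batches;
}

// Example: 19 stops in batches of 10 gives two batches that share stop 9.
var exampleStops = [];
for (var n = 0; n < 19; n++) {
    exampleStops.push(n);
}
var exampleBatches = batchStops(exampleStops, 10);
```

The batch size is left as a parameter because the excerpt from GitHub covers 4 full/empty cycles per call, which appears to work out to 9 coordinates rather than 10.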

From GitHub:

    // Loop through each batch of trips: 4 full/empty cycles per call,
    // ending at the start of the 5th trip.
    for (var i = 0; i < numCalls; i++) {
        console.log("I:" + i);
        console.log('Building api Call ' + i + ' for taxi ' + r);

        var waypoints = [];

        // Loop through each trip in this batch.
        for (var j = 0; j < 5; j++) {
            var key = j + (4 * i);
            var d = rawData[r][key];
            console.log("key " + key);

            // If there aren't 4 full trips for this call, end early.
            // This was not including the last trip, so a dummy trip
            // (duplicate of the last trip) was added to the source CSV.
            // TODO: figure this out later.
            if (rawData[r][key + 1] == null) {
                j = 4;
            } else {
                // Each trip except the last should know the pickup time of the next trip.
                d.nextpickuptime = rawData[r][key + 1].pickuptime;
            }

            // Build the appropriate part of the API call based on which trip this is.
            switch (j) {
                case 0:
                    var origin = 'origin=' + d.pickupy + ',' + d.pickupx;
                    var waypoint = [d.dropoffy, d.dropoffx];
                    waypoints.push(waypoint);
                    break;

                case 4:
                    var destination = '&destination=' + d.pickupy + ',' + d.pickupx;
                    break;

                default:
                    waypoints.push([d.pickupy, d.pickupx]);
                    waypoints.push([d.dropoffy, d.dropoffx]);
            }
        } // loop through each trip
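
The excerpt stops before the request is sent. As a rough sketch of how the three pieces might be joined, the hypothetical `buildDirectionsUrl` helper below assumes `origin` and `destination` are the strings built in the loop and `waypoints` is the array of `[lat, lng]` pairs. The endpoint and the pipe-separated `waypoints` parameter follow the Directions API web service format; a real request would likely also need an API key, which is omitted here.

```javascript
// Hypothetical helper: join the origin, destination, and waypoints into a
// Directions API request URL. Not taken from the project's source.
function buildDirectionsUrl(origin, waypoints, destination) {
    var base = 'https://maps.googleapis.com/maps/api/directions/json?';
    var waypointParam = '';
    if (waypoints.length > 0) {
        // Each waypoint becomes "lat,lng".
        var parts = waypoints.map(function (w) {
            return w[0] + ',' + w[1];
        });
        // Waypoints are separated by "|", URL-encoded here as %7C.
        waypointParam = '&waypoints=' + parts.join('%7C');
    }
    // origin already starts with "origin=" and destination with "&destination=".
    return base + origin + destination + waypointParam;
}
```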

An obvious flaw in this implementation shows up during long breaks between fares. At one point, you can watch a taxi take a fare from Manhattan to Brooklyn at about 10:00 p.m., and then it appears to take a full 5 hours to drive back to Manhattan to pick up its next fare. In reality, there's no telling what the driver did during those 5 hours: he may have been looping all over the city desperately trying to find his next fare, or he may have just gone home to take a nap.

Nonetheless, it's a beautiful visualization and an excellent example of what you get when Big Data gets small.