Using GitHub for hosting data sets?

aendrew's picture

If you do any viz work with d3.js, you're probably familiar with the work of Mike Bostock, who is the primary developer of d3 and produces stuff for the New York Times.

The realization I had while looking through a bunch of his examples is that he effectively solves a problem most newsrooms face, namely having to deal with an aging and often-incompatible stack: he simply hosts everything on GitHub, via a gh-pages branch.

This is brilliant. No worrying that your viz will crash with too much traffic! No having to fight with your news organization's server techs to let you run something in PHP! It's a great idea.

Alas, knowledge of Git isn't very common in newsrooms, so I'm considering writing a module that uses Drupal to manage data sets and visualizations on a GitHub branch.

With this in mind -- can anyone think of a good workflow to incorporate this? I'm thinking Drupal produces everything as a really basic HTML page, with datasets and whatnot being stored in Drupal, but finished visualizations pushed to GitHub.
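
To make the "really basic HTML page" step concrete, here is a minimal sketch in Python rather than Drupal code. Everything in it is an assumption for illustration: the function names `render_page` and `write_page`, the `chart` and `data` element ids, and the one-folder-per-visualization layout are hypothetical, not something any existing module does.

```python
import html
import json
from pathlib import Path


def render_page(title, rows):
    """Return a bare-bones HTML page with the dataset embedded as JSON."""
    # Escape "</" so the embedded JSON cannot close the <script> element early.
    data = json.dumps(rows).replace("</", "<\\/")
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        '<meta charset="utf-8">\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head>\n<body>\n"
        '<div id="chart"></div>\n'
        f'<script id="data" type="application/json">{data}</script>\n'
        "</body>\n</html>\n"
    )


def write_page(out_dir, slug, title, rows):
    """Write the page to out_dir/slug/index.html and return its path."""
    target = Path(out_dir) / slug / "index.html"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(render_page(title, rows), encoding="utf-8")
    return target
```

Something like `write_page("site", "election-map", "Election map", rows)` would then leave a self-contained static file under `site/election-map/`, which is the kind of output that could be pushed to a static host.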


sinasalek's picture

I don't see the advantage of using git here, or maybe I didn't understand the problem :)
If the data are going to be flat HTML files, why not use a shared host as storage? That's what a CDN is for. Static files can be uploaded via FTP or synced via rsync.

Or even set up a static web server or cache like nginx or Varnish on the same server? They're very, very fast when it comes to serving static pages. Drupal also has the Boost module, which can cache the entire page; it's also very fast.

Drupal also has a revision system, so older versions can be preserved as well.

aendrew's picture

@sinasalek -- Those are pretty valid points; especially using a shared host, which would make my life a lot easier (Save To FTP solves the problem of creating static pages and moving them to an off-site host quite elegantly).

However, I'm imagining cases where an absurd amount of traffic hits the server at once -- an election map visualization on a major news site during election night, for instance. A shared hosting account would probably fail at that point (Or would it? Every news viz I've published has always been in a cloud of some sort), and convincing a newsroom's server team to let you run an nginx instance serving Boost pages would probably be pretty difficult. (This is actually a pretty major issue for data journalists -- support for running arbitrary pieces of code on the company servers is, from my observations anyway, pretty low.)

The user might not even have access to a shared hosting account or a server running nginx anyway -- it's possible that they might just use Drupal locally to manage and create their visualizations, with the finished product needing to be hosted somewhere free and scalable. GitHub Pages meets both criteria.

ændrew rininsland
news, photos, data, code. :: @aendrew

Sandbox project now live!

aendrew's picture

Anyone interested in doing this: I've written a sandbox module that currently works for pushing generated pages to a central gh-pages branch.

I'm currently applying for full project permission -- please drop me a line on this issue if you're using this module and/or have any thoughts!
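
For anyone curious what the push step roughly boils down to, here is a hedged Python sketch. It is not the module's code. The function names, the force-push approach, and the default commit message are all assumptions for illustration, and it presumes `git` is installed with a user name and email configured.

```python
import subprocess


def publish_commands(remote_url, branch="gh-pages", message="Publish site"):
    """Return the git commands that publish a directory to a remote branch."""
    return [
        ["git", "init"],               # harmless if the directory is already a repo
        ["git", "add", "--all"],
        ["git", "commit", "-m", message],  # exits non-zero if nothing changed
        # Assumption: the branch holds only generated output, so overwriting
        # its history with a force push is acceptable.
        ["git", "push", "--force", remote_url, f"HEAD:refs/heads/{branch}"],
    ]


def publish(site_dir, remote_url, dry_run=True):
    """Run the publish commands in site_dir, or just return them if dry_run."""
    commands = publish_commands(remote_url)
    if not dry_run:
        for cmd in commands:
            # check=True stops at the first failing git command.
            subprocess.run(cmd, cwd=site_dir, check=True)
    return commands
```

Calling `publish("site", remote_url)` with the default `dry_run=True` only returns the command list, which makes it easy to inspect what would run before letting it touch a real repository.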

