Proposed "Open Data Warehouse" Features

Posted by brianbehlendorf on March 14, 2011 at 7:57pm
Last updated by pdowney on Wed, 2011-04-13 14:44

What would make for a terrific open data server, along the lines of data.gov or data.worldbank.org? Something that could help encourage more governments, non-profits, companies, and others to publish more data more usefully?

Feel free to add your comments below each point and, based on those comments, improve the point itself. To limit scope creep, try not to add new points.

Structured way to store and access raw data sets, downloadable in the same form they were uploaded, and queryable by name, time, or text search
Catalog links to other data hosted elsewhere
Easy to set up (a system image, ideally): from zero to running with no prior Drupal install experience.
Feeds/Services: Provides JSON and RDF interfaces to all data, plus R data sets and possibly rsync. API key provisioning and management.
Apps built can discover new data – when 2011 data is added, apps built to look at 2009 and 2010 will see 2011 without modification
Tools to make it easy for novices to upload the data as e.g. CSV or Excel
Tools for data admins – librarians – to sort and restructure
Tools for any user to comment, provide scripts for processing, or mark to be notified of updates
Tools for any user to propose fixes to the data, a la github/versioning
Tools for simple visualizations - some sort of plug-in system?
Catalog of equivalent fields collected across multiple government systems (FIPS codes, Census tracts, occupation codes) so that data from these various sources can be merged easily
Tools to make it easy for novices to export the data into common formats
Include options for searching for data geographically, as well as visualizing data on easily-exportable maps (jpg, png, etc.)
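
The discovery point above (apps built for 2009 and 2010 seeing 2011 without modification) could work roughly like this: the app asks the catalog which years exist instead of hard-coding them. This is only a minimal sketch in Python. The in-memory `CATALOG` and the functions `available_years` and `total_by_year` are hypothetical names made up for illustration, not part of any existing server.

```python
# Hypothetical in-memory stand-in for a dataset catalog; a real server
# would presumably expose this listing over a JSON feed.
CATALOG = {
    "population": {
        2009: [{"region": "A", "value": 10}, {"region": "B", "value": 20}],
        2010: [{"region": "A", "value": 11}, {"region": "B", "value": 21}],
    }
}


def available_years(dataset):
    """Return the years currently published for a dataset, sorted ascending."""
    return sorted(CATALOG[dataset])


def total_by_year(dataset):
    """Sum 'value' for every published year, discovered at call time."""
    return {
        year: sum(row["value"] for row in CATALOG[dataset][year])
        for year in available_years(dataset)
    }


before = total_by_year("population")

# Publishing 2011 data requires no change to the app code above.
CATALOG["population"][2011] = [
    {"region": "A", "value": 12},
    {"region": "B", "value": 22},
]
after = total_by_year("population")
```

The app never names a year; it only iterates over what the catalog reports, so newly published data shows up on the next call.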

pdowney: I found the points above conflated the acts of uploading data, mapping the data to structure, and managing the catalog. It's not entirely clear if the proposal is for a registry of CSV files, for a repository of datasets, or for a uniform datastore. There's also an implicit role of "librarian" and control over access. Why can't we all be librarians? Why introduce API keys? Does this cover many of the issues for quality indicators for linked-data datasets? Suggest reformatting the points as follows:

Installation: Software released as Open Source Drupal module, and as a single bundled package to simplify getting started.
Upload: Provide a simple process for importing data from CSV, Excel spreadsheets, Google Charts and other formats used by novices. Constraining input to a grid, or a series of sheets of grids, of datapoints will help.
Raw Data: Store the uploaded raw data, and enable them to be downloaded and searched in the same format they were uploaded.
Mapping: Ability to identify a dataset column or row as being a type: timestamp, latitude, longitude, FIPS code, Census tract, occupation code, etc.
Resources: Ability to link to an individual data item within a dataset, ideally using a human friendly URI
Representations: Provide access to a dataset in a variety of different formats: JSON, RDF, XML, etc.
Revisions: Ability to continue accessing previous versions, unmodified.
Subscription: Be notified of updates via an RSS feed or email alert.
Access control: API key for datasets which need to be rate limited, or only accessed by registered users with given privileges.
Annotations: Ability to comment upon a dataset, highlight accuracy issues, provide scripts for processing, link to applications, images of visualisations, suggestions for improvements, links to other related datasets elsewhere on The Web.
Patches: Ability to accept issues, errata and modifications to datasets.
Librarianship: Ability to index, catalog, tag and restructure the presentation of datasets.
Visualisation: Geographical data may be presented on a map, time series on a uniform chart (akin to timetric), other visualisations may be added via extensions and plugins.
Geospatial search: Ability to select portions of a dataset by proximity to a datapoint.
Licensing: Clear terms of licensing and use of the dataset, e.g. datacommons
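
To make the Mapping and Geospatial search points above more concrete, here is a rough sketch of how they might fit together: once columns are tagged as latitude and longitude, a proximity query can be answered without knowing the uploader's column names. Everything here is an assumption for illustration. The sample CSV, the `COLUMN_TYPES` tagging, and the functions `column_for`, `haversine_km`, and `rows_near` are hypothetical, and the haversine formula is just one reasonable choice of distance measure.

```python
import csv
import io
import math

# Hypothetical uploaded CSV; column names are whatever the uploader chose.
RAW_CSV = """site,lat,lon,reading
Alpha,51.5007,-0.1246,12
Beta,51.5055,-0.0754,15
Gamma,48.8584,2.2945,9
"""

# Mapping step (assumed): someone tags columns with a known type.
COLUMN_TYPES = {"lat": "latitude", "lon": "longitude"}


def column_for(type_name):
    """Find the column that has been tagged with the given type."""
    for column, tagged in COLUMN_TYPES.items():
        if tagged == type_name:
            return column
    raise KeyError(type_name)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def rows_near(raw_csv, lat, lon, max_km):
    """Return rows whose tagged coordinates lie within max_km of (lat, lon)."""
    lat_col = column_for("latitude")
    lon_col = column_for("longitude")
    rows = csv.DictReader(io.StringIO(raw_csv))
    return [
        row for row in rows
        if haversine_km(lat, lon, float(row[lat_col]), float(row[lon_col])) <= max_km
    ]


# Select datapoints within 10 km of a point in central London.
nearby = rows_near(RAW_CSV, 51.5074, -0.1278, 10)
```

The raw upload is left untouched; the type tags live alongside it, which keeps the Raw Data and Mapping concerns separate.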
