Proposed "Open Data Warehouse" Features

What would make for a terrific open data server, along the lines of data.gov or data.worldbank.org? Something that could help encourage more governments, non-profits, companies, and others to publish more data more usefully?

Feel free to add your comments below each point and, based on those comments, improve the point itself. To limit scope creep, try not to add more points.

  1. Structured way to store and access raw data sets, downloadable in the same form they were uploaded, and queryable by name, time, or text search

  2. Catalog links to other data hosted elsewhere

  3. Easy to set up (ideally as a system image), going from zero to running with no prior Drupal install experience.

  4. Feeds/Services: provide JSON and RDF interfaces to all data, plus R data sets and possibly rsync; API key provisioning and management

  5. Apps built can discover new data – when 2011 data is added, apps built to look at 2009 and 2010 will see 2011 without modification

  6. Tools to make it easy for novices to upload the data as e.g. csv or excel

  7. Tools for data admins – librarians – to sort and restructure

  8. Tools for any user to comment, provide scripts for processing, or mark to be notified of updates

  9. Tools for any user to propose fixes to the data, a la github/versioning

  10. Tools for simple visualizations - some sort of plug-in system?

  11. Catalog of equivalent fields collected across multiple government systems (FIPS codes, Census tracts, occupation codes) so that data from these sources can be merged easily

  12. Tools to make it easy for novices to export the data into common formats

  13. Include options for searching for data geographically, as well as visualizing data on easily-exportable maps (jpg, png, etc.)
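
As an illustration of point 5, here is a minimal Python sketch of an app that asks the catalog which periods exist instead of hard-coding years. The `CATALOG` structure, the dataset name, and both function names are assumptions made up for this example; nothing here is an existing Drupal or data.gov API, and a real server would expose the catalog over HTTP.

```python
# Hypothetical in-memory catalog: dataset name -> period -> rows.
CATALOG = {
    "population": {
        "2009": [{"region": "A", "value": 10}],
        "2010": [{"region": "A", "value": 12}],
    }
}


def list_periods(catalog, dataset):
    """Return every period currently published for a dataset, sorted."""
    return sorted(catalog[dataset])


def totals_by_period(catalog, dataset):
    """Sum 'value' for each period the catalog advertises (no hard-coded years)."""
    return {
        period: sum(row["value"] for row in catalog[dataset][period])
        for period in list_periods(catalog, dataset)
    }


if __name__ == "__main__":
    print(totals_by_period(CATALOG, "population"))
    # A publisher adds 2011; the app code above is unchanged.
    CATALOG["population"]["2011"] = [{"region": "A", "value": 15}]
    print(totals_by_period(CATALOG, "population"))
```

The design choice is that the app depends only on a "list what is available" call, so newly published periods appear without any change to the app.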

pdowney: I found that the points above conflate the acts of uploading data, mapping the data to a structure, and managing the catalog. It's not entirely clear whether the proposal is for a registry of CSV files, a repository of datasets, or a uniform datastore. There's also an implicit role of "librarian" and control over access. Why can't we all be librarians? Why introduce API keys? Does this cover many of the issues for quality indicators for linked-data datasets? I suggest reformatting the points as follows:

Installation
Software released as Open Source Drupal module, and as a single bundled package to simplify getting started.
Upload
Provide a simple process for importing data from CSV, Excel spreadsheets, Google Charts and other formats used by novices. Constraining input to a grid, or a series of sheets of grids, of datapoints will help.
Raw Data
Store the uploaded raw data, and enable it to be searched and downloaded in the same format in which it was uploaded.
Mapping
Ability to identify a dataset column or row as being a type: timestamp, latitude, longitude, FIPS code, Census tract, occupation code, etc.
Resources
Ability to link to an individual data item within a dataset, ideally using a human friendly URI
Representations
Provide access to a dataset in a variety of different formats: JSON, RDF, XML, etc.
Revisions
Ability to continue accessing previous versions of a dataset, unmodified.
Subscription
Be notified of updates via an RSS feed or email alert.
Access control
API key for datasets which need to be rate limited, or only accessed by registered users with given privileges.
Annotations
Ability to comment upon a dataset, highlight accuracy issues, provide scripts for processing, link to applications, images of visualisations, suggestions for improvements, and links to other related datasets elsewhere on the Web.
Patches
Ability to accept issues, errata and modifications to datasets.
Librarianship
Ability to index, catalog, tag and restructure the presentation of datasets.
Visualisation
Geographical data may be presented on a map, and time series on a uniform chart (akin to timetric); other visualisations may be added via extensions and plugins.
Geospatial search
Ability to select portions of a dataset by proximity to a datapoint.
Licensing
Clear terms of licensing and use of the dataset, e.g. datacommons
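
To make the Mapping and Geospatial search points above more concrete, here is a rough Python sketch: columns are tagged with a type, and the tags are then used to select rows within a given distance of a point. The sample rows, the `COLUMN_TYPES` mapping, and all function names are assumptions invented for this illustration; distance uses the standard haversine formula with a mean Earth radius of 6371 km.

```python
import math

# Hypothetical dataset: rows as dicts, with approximate city coordinates.
ROWS = [
    {"name": "Station A", "lat": 51.5074, "lon": -0.1278},  # London
    {"name": "Station B", "lat": 48.8566, "lon": 2.3522},   # Paris
    {"name": "Station C", "lat": 51.4545, "lon": -2.5879},  # Bristol
]

# Hypothetical mapping step: a librarian tags columns with a type.
COLUMN_TYPES = {"lat": "latitude", "lon": "longitude"}


def column_for(column_types, wanted):
    """Return the name of the first column tagged with the wanted type."""
    for column, kind in column_types.items():
        if kind == wanted:
            return column
    raise KeyError(wanted)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def rows_within(rows, column_types, lat, lon, max_km):
    """Select rows whose tagged latitude/longitude lie within max_km of a point."""
    lat_col = column_for(column_types, "latitude")
    lon_col = column_for(column_types, "longitude")
    return [
        row for row in rows
        if haversine_km(lat, lon, row[lat_col], row[lon_col]) <= max_km
    ]


if __name__ == "__main__":
    nearby = rows_within(ROWS, COLUMN_TYPES, 51.5074, -0.1278, 200)
    print([row["name"] for row in nearby])
```

Because the search looks up columns by their tagged type and not by name, the same query would work on any dataset whose columns have been mapped, whatever the uploader called them.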
