New feature in paranoia module: sanitizing db tables based on a whitelist approach

greggles's picture

Hi,

I'm excited to announce a new feature in the paranoia module: database sanitizing based on a whitelist.

The main way to sanitize databases now is drush sql-sanitize which works on a blocklist approach. If you add a module or table to your site that stores data in the database that you want to clean up then you must actively add a hook_drush_sql_sync_sanitize that will declare the right command, but it doesn't say if it has cleaned up all the columns in the table.

In contrast, the paranoiasanitize module works by assuming that all tables should be truncated and it only does something else to the table if there is some reason for it to believe that a different query (or none at all) will leave all of the columns sanitized sufficiently.

If you want to try it out, be sure you have the dev version of the paranoia module and then enable the paranoiasanitize module:

drush -y en paranoiasanitize

It works via drush, and the everyday command people will use is:

drush paranoia-sql-sanitize-whitelist -v

Note, just like drush sql-sanitize do not run the command on a live website! It will delete a lot of data!

When you first install the module you should actually not run that command. Instead you need to build the sanitizing-whitelist hooks. There's a command to help you do that :)

drush -v paranoia-build-sanitize-whitelist

I generally pipe this second command to pbcopy so that it's in my paste buffer and then I paste the output into a site-specific module. Be sure to remove the drush command output from the top/bottom as appropriate. You also must rename one of the hooks if you have any tables that are not defined in a hook_schema.

Note that in both cases I use the "-v" option to drush to see some more output.

I'd love any feedback or reviews on the module.

If you're curious about the workflow where you'd use this module, I wrote on my blog about that a while ago: Jenkins + Drush + Dropbox = Easily share sanitized database projects

Comments

That is a nice feature. Just

jribeiro's picture

That is a nice feature.

Just to understand better, if there are no sanitize hooks, for the first instance, the module will sanitize ALL tables? Even node, system, etc?

How to avoid certain tables are not sanitized?

That's right, if there are no

greggles's picture

That's right, if there are no hooks then it will truncate all the tables in the database.

The way to avoid that is to add hook implementations with an empty array in the query element for a table.

I hope the usage will be clear if you run the build command to build your hooks. Try it out. If it's not clear, I will write up some more documentation :)

Also, if you use the "-v" command to Drush it will tell you what it's going to do before doing it and then it asks if you want to proceed, so you can say "no" as you are testing things out to make sure it works properly.

Integrate with drush sql-sanitize?

scor's picture

Was there any conversation with the Drush maintainers to replace sql-sanitize with this approach, as disruptive and destructive as it is? Re the destructive aspect (if you run the command without listing your "good" tables), I suppose you could check if there is no implementation of the hook, and abort at that point. Back to drush, if this approach would not be acceptable for drush sql-sanitize, what if running drush sql-sanitize triggered the logic behind the drush paranoia-sql-sanitize-whitelist command?

I'm very open to this being

greggles's picture

I'm very open to this being more integrated into drush itself, in the issue where I wrote this you can see me saying that :)

Bailing if there are no hook implementations is a bit risky. I'm not sure if that fits. That could happen if a module that contains those definitions is not enabled, for example, and so I think the "right" thing to do in that situation would be to truncate all the tables. I filed an issue about that idea anyway.

This module assumes you are using it in a situation where it's better to be destructive than risk leaking data.

For example, I think this would make sense to run on any *.acquia.com databases before they are migrated to test/dev environments.

My gut reaction is that, due

greg.1.anderson's picture

My gut reaction is that, due to the destructive nature of this command, it's probably better off in contrib than in Drush core. That discussion should happen in the Drush issue queue, if there is desire to move in that direction. As for hooking the sql-sanitize callbacks, while that would certainly work, it might be a bit too shocking for folks who enable paranoia for other reasons, only to discover that their sanitize logic has changed drastically.

In any event, I haven't tried this command yet, but it looks like a good idea for sites that don't have a lot of contrib modules, or where sanitization is so important that it is preferable to break (on the addition of new modules) than risk leaking any data. I imagine that today, though, most places where data security is such a concern probably have policies in place preventing exporting the database at all. Testing in this case is usually done with fixtures (placeholder data). For this reason, I think this is probably too niche for Drush core.

I'm sure this will fill a need for a number of folks, though, and I am glad that it is available in the contrib space.

thanks for the feedback

greggles's picture

Can you clarify what you mean by this:

a good idea for sites that don't have a lot of contrib modules

We have 87 contrib folders, and even more .module files and setting up this module was rather trivial.

On this point:

most places where data security is such a concern probably have policies in place preventing exporting the database at all

I think you would be very very surprised :) And I'm not talking about my current employer when I say that.