Let's implement the sanitization bazooka: Autosanitization

Posted by geek-merlin on September 18, 2012 at 1:09pm

Abstract

Wrong sanitization of user supplied strings, resulting in CSRF security issues, accounts for the vast majority of security announcements. Autosanitization (exactly: proper context stack aware autosanitization) would be the bazooka to end this once and for all. It is in reach und would be a unique feature among notable open source frameworks. The implementation requirements are described. Research is needed as of the D8 core requirements necessary to implement this in contrib land.

Anatomy of a sanitization issue

Consider a node title that is rendered like this:

<h1>Nodetitle</h1>

And now consider a malicious user that inserts malicious data so our nodetitle is rendered:

<h1>Nodetitle<script src="http://evil-blackhat.com/wejustwanttohelp.js"></h1>

Clearly that's not what we want.

The attack pattern

To understand the general pattern of such attacks we need the notion of "browser context" (which is completely unconnected to "request context", which is not our topic here, so i will sometimes simply say "context" and mean "browser context"). Formally corresponds to a parser state (or a set of them) of the browser that interprets out html.
In our example the attacker managed to escape the context "text inside h1" and enter a context that allowed her to inject a malicious script. Sanitization is a text filter that aims to prevent such context escape.

I'll start with a simple sanitization strategy and will step by step develop proper autsanitization.

Approach #1: A single sanitization function.

Why couldn't we just do this:

<?php
$unsafe = mymodule_get_userdata();
// Do some wild processing here.
print sanatize_all($unsafe);
?>

The answer is simple: It can be formally proven that you would either sanitize more or less than needed (i've seen more than one module that over-sanitizes strings so that any "<" is displayed "<" to the user).

So this leads to:

Requirement #1: To avoid undersanitization as well as oversanitization, sanitization has to be aware of the context that the unsafe string is put into.

This implies the necessity of late sanitization.

This brings us to:

Approach #2: Context aware sanitization.

This looks like

<?php
$unsafe = mymodule_get_userdata();
// Do some wild processing here.
print sanatize_all($unsafe, 'our-html-context');
?>

This is pretty much what we have now with check_plain() and check_markup().

One problem is: HTML does know some more contexts than these 2. Which means no correct sanitization. (For an point why over-sanitization also is an issue see below.)

The problem arises in the "do some wild processing" part: To check for proper sanitization a reviewer might have to follow the dataflow path of the userdata all the way through the code. In a modular approach like the drupal framework this makes security an expensive thing. (Honestly: Can you tell at once which render array elements you have to sanitize and which will be sanitized by the render system?)

This leads to:

Requirement #2: For affordable security wrong code must be recognizable locally.

Lets' try this with:

Approach #3: Smart strings.

This looks like:

<?php
$unsafe = new UnsafeWrapper(mymodule_get_userdata());
$safe = new SafeWrapper('Literal string');
// Do some wild processing here.
print $unsafe->render('our-html-context');
print $safe->render('our-html-context');
// Or we might even do
print $template->render_with_variables(array('var1' => $unsafe, 'var2' => $safe);
?>

This is a bit like what the perl language has built in with the "taint" feature: A string is a smart object and knows how to be rendered. (It might carry along some other metadata like type, language, ...)

The cost of this is the wrapper object (that is said to be not more expensive than arrays since PHP5). A bonus (of any late processing) is that any string that for some reason is not rendered does not bother the CPU with its sanitization.

In a template we need to get the context data somehow.

So instead of

<?php
$template = new Template('<h1>{title}</h1>');
?>

we need

<?php
$template = new Template('<h1>{title:this-context}</h1>', 'outer-context');
?>

with some suitable context taxonomy.

Provided some reasonable assumption about parsability (twig will help with this) and well-formed-ness of the template, the context for the template variables can be derived, given the outer template context. But even an explicit handling of template variable context might be worth the effort and surely satisfies locality requirement #2.

Do we have reached wonderland with this approach? Unfortunately not.

Why we need the context stack

There is a class of attacks that uses the way that our html is parsed in the browser. An example of this attack class is this:

<?php
$templatestring = <<<EOT
<div onclick='jQuery(this).replaceWith("{userdata}",this)'>Click me!</div>
EOT;

$userdata = '&quot;+alert(&quot;breakout!&quot;)+&quot;';

print template_render($templatestring, array('!userdata' => sanitize_doublequoted_js_literal($userdata)));
?>

What happens? Attacker can execute arbitrary javascript.

How does it work? The attack uses the fact that our html is preprocessed by the browser, undoing our sanitization and effectively resulting in this:

<?php
<div onclick='jQuery(this).replaceWith(""+alert("breakout!")+"",this)'>Click me!</div>
?>

In ¹ this is described in detail and called "browser traductions".

The conclusion is the straightforward

Requirement #1': To avoid undersanitization as well as oversanitization, sanitization has to be aware of the complete context stack that the unsafe string is put into.

How can this be implemented?

So we reached an idea of sanitization wonderland. To implement this, a templating system needs to
* keep track of the context stack
* provide this information to the render method that does the sanitization thoroughly, mimicing browser traductions.

We already have a system that is capable of doing this: Render arrays

Will it work?

I have a working proof of this in my sandbox sandbox that can be studied.

Does it matter?

As decrribed the described approach can be proven to be the only one that avoids over- as well as under-sanitization. While over-sanitization does not lead to security issues in the first place, it will lead to poor user experience. Which in real world scenarios can lead to security issues: Imagine yourself receiving an email like "Please fix display of 'AT&T' by tomorrow."

And the race will speed up: There are alredy automated approaches² to find the described security holes. They surely are already used by well-funded blackhats and i guess in the next decade they will pour down the market until they reach the scriptkiddie level.

On the other side there is a huge chance in this: Proper context stack aware autosanitization is already used in some proprietary frameworks³⁴, but by now is still not implemented in major open source frameworks¹. So there's the chance for Drupal to be the first player with this USP.

D8?

I don't think anyone sees this landing in D8 core. But it might be worth preparing what is needed in core to make this possible in contrib land. Although i did some talks at DC Munich i can not see clearly the relation of Twig and (a cuccessor of?) render arrays. This is where i hope for the joint knowledge of the community.

EDIT: Added an issue in the twig sandbox.

References

Comments

Thanks for your work on this

Posted by greggles on September 18, 2012 at 3:25pm

Thanks for your work on this issue!

I cross-posted this to the theme development issue to try to get more discussion from themers on the topic.

knaddison blog | Morris Animal Foundation

Thanks!

Posted by effulgentsia on September 18, 2012 at 6:16pm

Thanks for the writeup and your sandbox. I'll take a look at the code when I get a chance. Twig work is happening in http://drupal.org/sandbox/pixelmord/1750250. Please post an issue there linking to this writeup. The main benefit of Twig over PHPTemplate in relation to this is that instead of templates having to write:

print $foo->render('our-html-context');

it could do the same with

{{ foo }}

In other words, the glue code needed to sanitize variables can be handled by the Twig compiler and hidden from the themer.

OK!

Posted by geek-merlin on September 18, 2012 at 6:51pm

OK effulgentsia, i've created an issue in that sandbox linking here.
I'm excited to read your review of the concept.