How to extend filter_xss() to parse style properties safely?

Posted by markus_petrux on October 18, 2008 at 9:00am

Hi all,

I hope this is a good place to talk about this subject.

WYSIWYG editors such as TinyMCE are great, but when it comes to provide a secure method to filter what such a tool allows to users, I have the feeling that Drupal core filters do not help much.

If we use Full HTML, then we can do anything with our great WYSIWYG editor, however this is no no solution if we need to allow WYSIWYG capabilities to users we cannot trust. So, Full HTML filter might only be a valid solution for personal sites, or sites where all contributors can be trusted 100%.

If we use default Filtered HTML input format, then we can whitelist HTML elements, however we cannot whitelist style (or any DOM event handler) because this is hardcoded inside filter_xss(). This is great... or not so great? It's fine to filter out any javascript related attribute. If you need to allow users post javascript, then you can probably give them access to Full HTML since you should trust them 100%. However, when filtering style we are breaking most of the more common features of a WYSIWYG editor, such us setting text color, background color, text size, weight, floating, etc. So, ideally we would need a version of filter_xss that is able to parse individual properties in style attribute.

Alternatives to Drupal core filters exist. We have contrib modules such as htmlpurifier or htmLawed. Great! ...however, IMO, these modules have a few handicaps:

Dependency on third party libraries.
Big chunks of code that may impact performance since this is needed to filter content, which happens constantly, specially on sites where content changes a lot.
Generate valid and well formed HTML, but we already have the HTML Corrector in core since D6, which is tiny and probably enough to cover the issue.

Another problem with above mentioned solutions (core filters and existing contrib modules) is that you have to create (and probably maintain) whitelists of HTML elements in the filter settings itself, but also in the WYSIWYG editor settings, so you have to do the same thing twice, and be sure to keep them in sync. Otherwise, you'll be opening security threats, or get complains from users because the WYSIWYG editos let's them do something that is later filtered.

Ok, so I'm writing a module that implements an input filter that lives in the middle of both worlds. It's a modified version of filter_xss that accepts a whitelist of valid_elements in the same exact format that is used by TinyMCE (syntax is HERE). This whitelist is parsed in internal structure to make things easier at filter process time, but it is aimed to be passed to TinyMCE directly, as entered in the settings panel (the way to do this is something that will depend on Wysiwyg API module).

I have already done the parser of TinyMCE valid_elements, but when the style attribute is whitelisted, then we need to somehow filter individual style properties. Here's where I have some doubts and I would appreciate advice in order to do this as safely as possible, but also using as little resources as possible.

Here's how I'm thinking to do this. Style properties are separated by semicolons, so I could simply explode the string into an array of individual styles properties and parse them separatedly. For each style property, get the property name and check against a hardcoded whitelist. If property is not whilte listed, then filter out just that property and its value. If property is whitelisted, then rather than check for valid property syntax (which is a neverending story and may convert this module into something big as the above mentioned contrib modules), I tought it would be enough to check for bad protocols using filter_xss_bad_protocol(). This method is just simple, and it's the same method applied by filter_xss() in core to check for attribute values. ie. if we are just concerned here about security, it doesn't matter if HTML attribute syntax, or Style property syntax is correct or not, as far as it is secure 100%. Am I right thinking that we could feel safe enough if we just check for filter_xss_bad_protocol() against individual style properties, just like Drupal's Filtered HTML input format does againts individual HTML attributes?

As soon as I get these issues done, I'll create a project page and commit to Drupal's CVS. Here, I would also appreciate advice for the module name. I'm creating a dependency on TinyMCE because this is what I need now, but it might be extended in the future to support methods to enter valid_elements in the same format supported by other WYSIWYG editors such as fckeditor, etc. So, I'm thinking about wysiwyg_filter for the name of this module. Suggestions?

Comments

Sounds good

Posted by wwalc on October 18, 2008 at 10:06am

I understand your problems with filter_xss, it just strips too much than it should (inline styles). I think it would be great if it was possible to use your filter just to strip XSS and leave all styles alone. Option to strip disallowed inline styles could be somehow optional (to improve performance).

I agree with your idea that WYSIWYG module could invoke a hook asking other modules for possible (external) configuration options. As a FCKeditor module maintainer I would be interested in knowing what HTML elements (and optionally what inline styles) are allowed to remove automatically some buttons from the toolbar, same approach could be probably used by other WYSIWYG editors.

Regarding name, perhaps "xss" or "security" could be also included in it's name to better decribe what it is doing, so maybe wysiwyg_xss_filter or wysiwyg_security_filter.

Inline styles are evil

Posted by deviantintegral on October 19, 2008 at 1:47pm

Whatever is eventually decided upon, I think that allowing inline styles should be disabled by default. It makes it much harder for those in charge of the theme / design to ensure a consistent display. I've seen sites where users get "style happy", and it makes the site look incredibly bad. The focus for an editor should be to only allow XHTML elements, where attributes if needed are applied via a filter (such as the table_altrow module I hacked up to do even / odd classes).

Inline styles aren't, some users are...

Posted by markus_petrux on October 20, 2008 at 9:56am

If you enable the H2 tag, or the FONT tag, then users could also do funny things in the site. In fact, many times is as easy as writing a verylongwordmuchlongerthanthis to break many layouts.

Aside, if you enable inline styles, you don't need FONT tag, so you can get valid XHTML. This might be specially important for a magazine, for example, where many people could be writing content and you may not feel safe when you have to enable Full HTML for them, just to be able to publish what the site needs.

It all depends on the use case, I think.

Parsing inline styles will be optional :)

Posted by markus_petrux on October 18, 2008 at 10:28am

This will be triggered only if a) style is enabled in whitelist of HTML attributes, and b) style attribute is present in parsed HTML input.

Thanks for your comments. :) ...I hope you can help me turn this module into something that can also be used for fckeditor when I commit the code. I'm currently dependen on TinyMCE valid_elements syntax. Maybe that could be implemented in a way that parsing this information can be delegated to external hooks or something... well, I believe this concern will make more sense when I publish the code. :)

Another thing I'm coding right now, which extends something I said above is that rather that having a hardcoded list of allowed style properties, I'm adding this as a filter option. For the moment, it will be a group of checkboxes where style properties are organized in somehow logical groups. Something similar to this:

http://htmlhelp.com/reference/css/properties.html
http://www.blooberry.com/indexdot/css/propindex/all.htm

Though, I'm concerned about this, since there maybe style properties that could be used to cause somekind of harm to pages. ie. properties such as position, display:none, etc. That is site admins would have to know what it is safe to enable, so users cannot break site pages. Maybe a simple advice in the filter setting description would sufice?

Regular expressions to validate style properties

Posted by markus_petrux on October 19, 2008 at 2:11pm

Hi all,

I'm attaching a file that contains function wysiwyg_filter_get_style_property_groups(), which is the one I'll be using to build the UI for the filter settings form (several groups of checkboxes) as well as to get the regular expresssions that will be used to validate individual style properties.

Can someone help me looking at the regular expressions?
Is there anything too bad with this method?
Is there any place I can get HTML code to test these regular expressions?

The filter is not fully working yet. But maybe today I can upload the whole thing somewhere (sandbox?) so you can play with it, and then maybe it can be turned into something that can be useful for someone else, packed into a contrib module or whatever.

You know if you download a

Posted by dwees on October 20, 2008 at 12:03am

You know if you download a copy of Firefox, somewhere in their source code they must have this information already worked out. It may be possible to strip it from Firefox and rework it into a form understandable by PHP. This might be easier than working this out by hand...

Dave

Link to complete source below

Posted by markus_petrux on October 20, 2008 at 12:13am

http://groups.drupal.org/node/15979#comment-54547

It Just needs testing, I think. At least what's related to validating style properties. I'm not trying to write a CSS validator, just something that's secure and useful enough. filter_xss() in core does not validate HTML, it just tries to prevent security threats, so here's the same, but allowing users use inline CSS styles securely.

mozilla/layout/style/nsCSSParser.cpp

Posted by markus_petrux on October 20, 2008 at 1:29am

Just checked firefox source code and the CSS parser is thousands of lines of .cpp code.

Also checked other possible sources before, and I was unable to find something based on simple regular expressions to parse CSS style properties. So I tried.

Thoughts: This is a Drupal

Posted by sun on October 18, 2008 at 6:12pm

Thoughts:

This is a Drupal core issue. Wysiwyg API might has to provide this feature built-in (for D7-), but a separate contrib module won't help much.
What's required is structured information, i.e. arrays or objects. A semicolon-delimited list will not work out. We need to gather this information in a format that allows us to feed it into any client-side editor. While TinyMCE is a good candidate for testing new stuff, all code has to work independently from the used editor.
This filter component in Wysiwyg API has to support multiple HTML filter-alike input filters, even concurrently.
Instead of implementing support for particular filters, we should turn it around and let those modules integrate with Wysiwyg API by providing hooks. Wysiwyg API itself has to provide the hook for core's HTML filter.
Extending the HTML filter with attributes is being discussed in Filter options based on CSS attributes, not just HTML tags.
If there is no HTML filter-alike input filter configured for an input format, we can only guess that any markup is allowed. The editor should not strip anything then (besides possible XSS attacks perhaps).
On another thought, we might want to disallow use of client-side editors for the Full HTML input format (i.e. when an input format contains no HTML filter-alike input filter).
Much of this topic goes hand-in-hand with the considerations about Inline API: Instead of allowing users to use dangerous HTML markup in their posts, rich-media contents such as scripts, movies, images, or any other extended markup, can be provided via inline macro tags. This way, users are safe, and admins have full control over what can be done in contents, and how the results will look like. Also, Inline API's integration with Wysiwyg's new plugin API will make those inline contents available to any client-side editor, not just TinyMCE.

Last, but not least, my current fear is that we are focusing on this topic too early. The current plugin API rewrite will lead to major changes in the way Drupal modules can interact with client-side editors, and how client-side editors integrate with input formats.

Daniel F. Kudwien
unleashed mind

Daniel F. Kudwien
netzstrategen

...

Posted by sun on October 18, 2008 at 7:45pm

Please note that http://groups.drupal.org/node/15996 contains some closely related information now.

Daniel F. Kudwien
unleashed mind

Daniel F. Kudwien
netzstrategen

hmm...

Posted by markus_petrux on October 18, 2008 at 9:05pm

This is a Drupal core issue.

Maybe, but what I'm trying to do is for D6, so it has to be approached from a contrib module, and then... time will tell. Well, I need to do this, so I thought I could perhaps share it when it's done.

Last, but not least, my current fear is that we are focusing on this topic too early.

Well, this might be true if you think about something like this for core... or related to Wysiwyg API, which is in the works. But I need to work out something now. If this can help others in some way, then... perfect. :)

This filter component in Wysiwyg API has to support multiple HTML filter-alike input filters, even concurrently.

An easy solution could be if Wysiwyg API allows third parties alter editor settings. If it was a hook, then I could then tell the filter to fill in the valid_elements parameter for TinyMCE. Altering editor settings from external modules could be used for a lot more things, probably.

Extending the HTML filter with attributes is being discussed in Filter options based on CSS attributes, not just HTML tags.

Thanks for that link. I've been away from Drupal for more than a year, and it looks I missed a lot of interesting stuff.

To try to parse style properties securely, I'm trying to build a set regular expressions that will be used to validate property values. This means there will be a regular expression for each style property, and there is where bad stuff will be catched, hopefully.

Another thing that can be done by this filter is the nofollow stuff while parsing HTML input. That means no additional filter pass to rel="nofollow" links based on white/blacklists.

Another thing that can be done by this filter is filtering URLs used for OBJECT SRC, etc., so users would be able to post media from a whitelist of third party sites.

All these things I had to do HERE (Drupal 5 based site). But it was based on hack-ish solutions here and there. This time I need to redo the same job for Drupal 6 based sites, but I'm trying to do it better than last time.

Probably, this discussion will be easier when I have some code to show. I hope to get something soon because this is my primary goal in my daily job.

...

Posted by sun on October 18, 2008 at 10:31pm

...for D6, so it has to be approached from a contrib module, and then... time will tell.
...
...if you think about something like this for core... or related to Wysiwyg API, which is in the works. But I need to work out something now.

Well, the whole Wysiwyg API is overdue. So why not work something out for inclusion in Wysiwyg API?

...if Wysiwyg API allows third parties alter editor settings.

That's almost what I imagine for D7-, however, the bridge between input formats/filters needs to deal with handling and passing structured markup information, NOT editor settings. That's why I proposed the enhancement of hook_filter() for D7 (which could already be implemented in Wysiwyg API and contrib modules for D7-).

...fill in the valid_elements parameter for TinyMCE...
...
Another thing that can be done by this filter is the nofollow stuff while parsing HTML input...
...
Another thing that can be done by this filter is filtering URLs used for OBJECT SRC, etc...

As mentioned before, you are risking scope creep by implementing too many unrelated features and not properly abstracting the required logic due to a editor-specific implementation.

Probably, this discussion will be easier when I have some code to show.

That's for sure ;)

Daniel F. Kudwien
unleashed mind

Daniel F. Kudwien
netzstrategen

Ok, let's try to build from structured markup information

Posted by markus_petrux on October 18, 2008 at 11:45pm

To whitelist style properties is easy. I have already compiled an array of CSS2 attributes organized into groups, something similar to what's in the W3C recommendation. I'm still building the regular expressions that will be used to validate properties. Anyway, the UI here is just several groups of checkboxes from where it is possible to check/enable them or not. No matter where these attributes can be used as far as the regular expressions work and prevent from getting security threats. We're not writing a CSS validator, we just want/need to be safe.

However, how can be easily made an UI from where it is possible to whitelist HTML elements, and attributes for each one of those elements. There may be a lot of possible combinations. In addition to this, one may need to allow elements and/or attributes that do not exist in the standard. That's why I opted for working on a TinyMCE valid_elements parser, which I already have working. How would you do this?

Well, the whole Wysiwyg API is overdue. So why not work something out for inclusion in Wysiwyg API?

Why not? Though, I'm not sure if my approach would really worth, so first let's see what I can come up with around here...

PS: At this moment I'm writing and testing regular expressions for the style properties validator...

Scope creep?

Posted by sun on October 19, 2008 at 12:28am

Regular expressions? Filtering HTML is the job of HTML filter in Drupal core, HTML Purifier, or htmLawed. It's also the job of client-side editors, for example TinyMCE or FCKeditor. Both have the same purpose, but are working in different layers. However, it's not the job of such a bridge.

One possible approach in pseudo-code:

<?php
function wysiwyg_get_valid_markup($format) {
  $items = module_invoke_all('filter', 'allowed_markup', $format);
  // Special handling for core's HTML filter (D7-).
  $allowed_tags = preg_split('/\s+|<|>/', variable_get("allowed_html_$format", '<a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd>'), -1, PREG_SPLIT_NO_EMPTY);
  $items += $allowed_tags;
  return $items;
}

/**
 * Example $op allowed_markup implementation.
 */
function htmlawed_filter($op, $format) {
  switch ($op) {
    case 'allowed_markup':
      // Actually not hard-coded, of course.
      return array(
        'a' => array('id', 'class', 'href', 'style', ...),
        'div' => array('id', 'class', ...),
      );
  }
}

/**
 * OFFTOPIC: Example usage of this data.
 */
function wysiwyg_tinymce_settings($editor, $profile, $format) {
  $allowed_markup = wysiwyg_get_valid_markup($format);
  foreach ($allowed_markup as $tag => $attributes) {
    $allowed_markup[$tag] = implode('|', $attributes);
    $allowed_markup[$tag] = $tag . '[' . $allowed_markup[$tag] .']';
  }
  $init['valid_elements'] = implode(',', $allowed_markup);
}
?>

Daniel F. Kudwien
unleashed mind

Daniel F. Kudwien
netzstrategen

There are two different things here

Posted by markus_petrux on October 19, 2008 at 2:29pm

1) A filter that is able to parse allowed style properties safely.
2) Integration of this filter into WYSIWYG.

I'm trying to solve both things in one single step. First thing I need now is solving 1. Then, to achieve 2... that's another story, but it requires 1.

I'm afraid you're talking about 2 here. Well, I've been working on both things at a time because it looked to me good enough being able to parse valid_elements in TinyMCE format. Once I upload the code somewhere, then this discussions will be easier, I guess.

I'm attaching a file here that contains function wysiwyg_filter_parse_valid_elements(), which is what I've coded to parse valid_elements in TinyMCE format. At this moment, the filter settings panel allows to enter HTML elements and attributes in this format, which is stored in preparsed form with this funcion, so it's already "compiled" for the filter process function, and it is also aimed to be passed onto the editor directly.

If I had to code the filter settings form in a more generic way, then I would love to hear ideas in order to build UI that's too much cluttered.

WYSIWYG Filter module in the sandbox

Posted by markus_petrux on October 19, 2008 at 9:34pm

WYSIWYG Filter module in the sandbox: /drupal/contributions/sandbox/markus_petrux/wysiwyg_filter

It mostly works, but it needs a lot of testing and validation, specially the regular expressions used to validate style properties.

Things that might change:

The method to whitelist HTML elements and attributes. Would love to hear ideas here.
Adding filter options to specify:
- which classes can be used if class attribute is enabled (regular expressions?, whitelist?).
- which kind of ids can be used if id attribute is enabled (regular expressions?).
- Whitelist of URLs for Flash, Media, etc. (object, embed)
Integration with Wysiwyg API? How to pass on to the Wysiwyg editor information about HTML elements, attributes, etc. whitelisted in the filter settings?
Anything else?

Screenshot of the WYSIWYG Filter settings panel

Posted by markus_petrux on October 20, 2008 at 9:44am

Here's a screenshot of the WYSIWYG Filter settings panel (attached).

Please, note that using TinyMCE valid_elements syntax it is possible to describe which HTML elements and attrbiutes are allowed, also which values could be used for those attributes. This is something that I don't know how it can be done with an easy to use UI.

In contrast, the section used to whitelist style properties is pretty simple. You just check what you need, so you can only enable properties that can cause no harm, such as color, font-size, text-align... and the parser will make sure that only valid values for those attributes will be rendered.

So... assuming the power that brings TinyMCE valid_elements syntax is not fully/generally needed, AND that the UI to whitelist Style properties whitelist is fine, THEN... is it possible to build an UI to whitelist HTML elements and attributes in a similar fashion? I'm not sure, because there are a lot of HTML elements, and many attributes and possible values for each one of them. So, how it can be done?

If I was to release a project for this WYSIWYG Filter?

Posted by markus_petrux on October 22, 2008 at 8:04am

Hi,

I contacted the Drupal security team for advice from their point of view, if I was to release this as a project, which is something I wish since I believe such a filter would make things easier for WYSIWYG users. The key point is that it will allow whitelisting style properties and the filter will validate them and remove anything that doesn't match a regular expression that exists every style property in CSS2 visual media type specs.

I still need to add a couple of settings, but it is almost done what I was in need to implement, at least for what it is the input filter part of it. A method to integrate with WYSIWYG would remain pending (ie. how to pass on to TinyMCE or whatever else editor the valid_elements parameter).

If someone wishes to take a look, I've been updating my sandbox space in CVS as I go. Link is ~~posted above~~ here.

Before I go ahead and release this... any advice/suggestion?

Thanks

Project page for WYSIWYG Filter contrib module

Posted by markus_petrux on October 24, 2008 at 8:02pm

Hi,

I have been working a little bit more this week on this, optimizing the parser, fixing regular expressions, adding more options, etc. Finally, I've moved the code from my sandbox space to its own project page at d.o.

http://drupal.org/project/wysiwyg_filter

It is almost done. I still wish to hardcode some kind of blacklist for HTML elements, just to make sure no one can use tags such as SCRIPT, STYLE, HEAD, BODY, etc.

For the OBJECT and EMBED tags, I believe the best thing I can do now is to blacklist them as well, so everyone is forced to use additional Drupal filters that better know how to deal with multimedia stuff.

Feedback is much appreciated here, or within the project issues queue itself.

Thanks to all who have spent a little time on this thread. It has helped me figure out how to implement a few details. There may be more things to take into account, though, but time will tell... :)

Cheers

Pushed the WYSIWYG Filter module to RC1 state

Posted by markus_petrux on October 26, 2008 at 8:08pm

I ended up hardcoding a blacklist of HTML elements. Is makes no sense to allow BODY, FORM, etc. in postings. I also blacklisted OBJECT, EMBED, PARAM and APPLET. To use these, better use inline filters, I think.

Screenshot of the WYSIWYG Filter settings panel. Patches and/or ideas to improve the UI to whitelist HTML elements and attributes are welcome. Though, I tried to document the most common rules for the valid_elements option the best I could.

If you are the developer of a WYSIWYG editor, please let me know if there is something this module could do with its own filter options to help enhance the configuration of the WYSIWYG editor itself without additional user intervention.

Cheers

Comments of a partial observer

Posted by ezyang on January 15, 2009 at 6:16am

I fully understand the conflict of interest here (I am the developer for HTML Purifier), but I still would like to strongly against going ahead and implementing something like this on your own. Filtering HTML is already a tricky business and even when your fundamental underpinnings are right, there are still security vulnerabilities. As you add more flexibility and functionality to the filter, your attack surface grows, and the more likely things are going to be insecure.

If you do decide to go ahead, and the Drupal core team backs behind this effort, I would be willing to help security audit the filter. But there are some fundamental difficulties with your implementation that are difficult to fix. I'll outline the argument for well-formedness here:

HTML is a language that occurs in a heterogeneous browser environment. There are many browsers that will consume your HTML, and each of them has their own parsing quirks and mannerisms, as well as custom elements and attributes. The XSS cheatsheet is a great example of different syntaxes, elements and attributes coming together for very specific vulnerabilities that only apply to certain browsers.

There is, however, a domain of HTML strings that are unambiguously interpreted by all browsers (barring rendering differences): standards-compliant HTML. It's a magic wand that makes all syntax-based XSS go away. And if you compile a specific enough whitelist (which was done by HTML Purifier and htmLawed (mostly; they allow conditional comments when they shouldn't)), it makes all XSS go away.

Cheers,
Edward

P.S. I know that the prospect of HTML Purifier in Drupal's core is a snowball's chance in hell. The design and coding philosophies are way too different. But if you all ever change your minds, HTML Purifier is definitely up to the task, even as a plugin. Just make the APIs.

I already created a project for WYSIWYG Filter

Posted by markus_petrux on January 15, 2009 at 7:34am

Here's the project page: http://drupal.org/project/wysiwyg_filter

The main HTML parser is based on core's filter_xss(), but extended to check HTML elements, their properties and style attributes from a series of whitelists. It also implements a blacklist of HTML elements that the user won't be able to whitelist. Also, for certain attributes additional rules are applied as well.

WYSIWYG Filter is not as powerfull and complex as HTML Purifier, but a bit more than core's filter_xss(). I would say it lands in the middle.

It would be really great if you could take a look it. :-D

How to extend filter_xss() to parse style properties safely?

Comments

Sounds good

Inline styles are evil

Inline styles aren't, some users are...

Parsing inline styles will be optional :)

Regular expressions to validate style properties

You know if you download a

Link to complete source below

mozilla/layout/style/nsCSSParser.cpp

Thoughts: This is a Drupal

...

hmm...

...

Ok, let's try to build from structured markup information

Scope creep?

There are two different things here

WYSIWYG Filter module in the sandbox

Screenshot of the WYSIWYG Filter settings panel

If I was to release a project for this WYSIWYG Filter?

Project page for WYSIWYG Filter contrib module

Pushed the WYSIWYG Filter module to RC1 state

Comments of a partial observer

I already created a project for WYSIWYG Filter

Wysiwyg

Group organizers

Group categories

Wiki pages

New groups

Group notifications