Hi all,
I hope this is a good place to talk about this subject.
WYSIWYG editors such as TinyMCE are great, but when it comes to provide a secure method to filter what such a tool allows to users, I have the feeling that Drupal core filters do not help much.
If we use Full HTML, then we can do anything with our great WYSIWYG editor, however this is no no solution if we need to allow WYSIWYG capabilities to users we cannot trust. So, Full HTML filter might only be a valid solution for personal sites, or sites where all contributors can be trusted 100%.
If we use default Filtered HTML input format, then we can whitelist HTML elements, however we cannot whitelist style (or any DOM event handler) because this is hardcoded inside filter_xss(). This is great... or not so great? It's fine to filter out any javascript related attribute. If you need to allow users post javascript, then you can probably give them access to Full HTML since you should trust them 100%. However, when filtering style we are breaking most of the more common features of a WYSIWYG editor, such us setting text color, background color, text size, weight, floating, etc. So, ideally we would need a version of filter_xss that is able to parse individual properties in style attribute.
Alternatives to Drupal core filters exist. We have contrib modules such as htmlpurifier or htmLawed. Great! ...however, IMO, these modules have a few handicaps:
- Dependency on third party libraries.
- Big chunks of code that may impact performance since this is needed to filter content, which happens constantly, specially on sites where content changes a lot.
- Generate valid and well formed HTML, but we already have the HTML Corrector in core since D6, which is tiny and probably enough to cover the issue.
Another problem with above mentioned solutions (core filters and existing contrib modules) is that you have to create (and probably maintain) whitelists of HTML elements in the filter settings itself, but also in the WYSIWYG editor settings, so you have to do the same thing twice, and be sure to keep them in sync. Otherwise, you'll be opening security threats, or get complains from users because the WYSIWYG editos let's them do something that is later filtered.
Ok, so I'm writing a module that implements an input filter that lives in the middle of both worlds. It's a modified version of filter_xss that accepts a whitelist of valid_elements in the same exact format that is used by TinyMCE (syntax is HERE). This whitelist is parsed in internal structure to make things easier at filter process time, but it is aimed to be passed to TinyMCE directly, as entered in the settings panel (the way to do this is something that will depend on Wysiwyg API module).
I have already done the parser of TinyMCE valid_elements, but when the style attribute is whitelisted, then we need to somehow filter individual style properties. Here's where I have some doubts and I would appreciate advice in order to do this as safely as possible, but also using as little resources as possible.
Here's how I'm thinking to do this. Style properties are separated by semicolons, so I could simply explode the string into an array of individual styles properties and parse them separatedly. For each style property, get the property name and check against a hardcoded whitelist. If property is not whilte listed, then filter out just that property and its value. If property is whitelisted, then rather than check for valid property syntax (which is a neverending story and may convert this module into something big as the above mentioned contrib modules), I tought it would be enough to check for bad protocols using filter_xss_bad_protocol(). This method is just simple, and it's the same method applied by filter_xss() in core to check for attribute values. ie. if we are just concerned here about security, it doesn't matter if HTML attribute syntax, or Style property syntax is correct or not, as far as it is secure 100%. Am I right thinking that we could feel safe enough if we just check for filter_xss_bad_protocol() against individual style properties, just like Drupal's Filtered HTML input format does againts individual HTML attributes?
As soon as I get these issues done, I'll create a project page and commit to Drupal's CVS. Here, I would also appreciate advice for the module name. I'm creating a dependency on TinyMCE because this is what I need now, but it might be extended in the future to support methods to enter valid_elements in the same format supported by other WYSIWYG editors such as fckeditor, etc. So, I'm thinking about wysiwyg_filter for the name of this module. Suggestions?

Comments
Sounds good
I understand your problems with filter_xss, it just strips too much than it should (inline styles). I think it would be great if it was possible to use your filter just to strip XSS and leave all styles alone. Option to strip disallowed inline styles could be somehow optional (to improve performance).
I agree with your idea that WYSIWYG module could invoke a hook asking other modules for possible (external) configuration options. As a FCKeditor module maintainer I would be interested in knowing what HTML elements (and optionally what inline styles) are allowed to remove automatically some buttons from the toolbar, same approach could be probably used by other WYSIWYG editors.
Regarding name, perhaps "xss" or "security" could be also included in it's name to better decribe what it is doing, so maybe wysiwyg_xss_filter or wysiwyg_security_filter.
Inline styles are evil
Whatever is eventually decided upon, I think that allowing inline styles should be disabled by default. It makes it much harder for those in charge of the theme / design to ensure a consistent display. I've seen sites where users get "style happy", and it makes the site look incredibly bad. The focus for an editor should be to only allow XHTML elements, where attributes if needed are applied via a filter (such as the table_altrow module I hacked up to do even / odd classes).
Inline styles aren't, some users are...
If you enable the H2 tag, or the FONT tag, then users could also do funny things in the site. In fact, many times is as easy as writing a verylongwordmuchlongerthanthis to break many layouts.
Aside, if you enable inline styles, you don't need FONT tag, so you can get valid XHTML. This might be specially important for a magazine, for example, where many people could be writing content and you may not feel safe when you have to enable Full HTML for them, just to be able to publish what the site needs.
It all depends on the use case, I think.
Parsing inline styles will be optional :)
This will be triggered only if a)
styleis enabled in whitelist of HTML attributes, and b) style attribute is present in parsed HTML input.Thanks for your comments. :) ...I hope you can help me turn this module into something that can also be used for fckeditor when I commit the code. I'm currently dependen on TinyMCE valid_elements syntax. Maybe that could be implemented in a way that parsing this information can be delegated to external hooks or something... well, I believe this concern will make more sense when I publish the code. :)
Another thing I'm coding right now, which extends something I said above is that rather that having a hardcoded list of allowed style properties, I'm adding this as a filter option. For the moment, it will be a group of checkboxes where style properties are organized in somehow logical groups. Something similar to this:
http://htmlhelp.com/reference/css/properties.html
http://www.blooberry.com/indexdot/css/propindex/all.htm
Though, I'm concerned about this, since there maybe style properties that could be used to cause somekind of harm to pages. ie. properties such as position, display:none, etc. That is site admins would have to know what it is safe to enable, so users cannot break site pages. Maybe a simple advice in the filter setting description would sufice?
Regular expressions to validate style properties
Hi all,
I'm attaching a file that contains function
wysiwyg_filter_get_style_property_groups(), which is the one I'll be using to build the UI for the filter settings form (several groups of checkboxes) as well as to get the regular expresssions that will be used to validate individual style properties.The filter is not fully working yet. But maybe today I can upload the whole thing somewhere (sandbox?) so you can play with it, and then maybe it can be turned into something that can be useful for someone else, packed into a contrib module or whatever.
You know if you download a
You know if you download a copy of Firefox, somewhere in their source code they must have this information already worked out. It may be possible to strip it from Firefox and rework it into a form understandable by PHP. This might be easier than working this out by hand...
Dave
Link to complete source below
http://groups.drupal.org/node/15979#comment-54547
It Just needs testing, I think. At least what's related to validating style properties. I'm not trying to write a CSS validator, just something that's secure and useful enough. filter_xss() in core does not validate HTML, it just tries to prevent security threats, so here's the same, but allowing users use inline CSS styles securely.
mozilla/layout/style/nsCSSParser.cpp
Just checked firefox source code and the CSS parser is thousands of lines of .cpp code.
Also checked other possible sources before, and I was unable to find something based on simple regular expressions to parse CSS style properties. So I tried.
Thoughts: This is a Drupal
Thoughts:
Last, but not least, my current fear is that we are focusing on this topic too early. The current plugin API rewrite will lead to major changes in the way Drupal modules can interact with client-side editors, and how client-side editors integrate with input formats.
Daniel F. Kudwien
unleashed mind
Daniel F. Kudwien
netzstrategen
...
Please note that http://groups.drupal.org/node/15996 contains some closely related information now.
Daniel F. Kudwien
unleashed mind
Daniel F. Kudwien
netzstrategen
hmm...
Maybe, but what I'm trying to do is for D6, so it has to be approached from a contrib module, and then... time will tell. Well, I need to do this, so I thought I could perhaps share it when it's done.
Well, this might be true if you think about something like this for core... or related to Wysiwyg API, which is in the works. But I need to work out something now. If this can help others in some way, then... perfect. :)
An easy solution could be if Wysiwyg API allows third parties alter editor settings. If it was a hook, then I could then tell the filter to fill in the valid_elements parameter for TinyMCE. Altering editor settings from external modules could be used for a lot more things, probably.
Thanks for that link. I've been away from Drupal for more than a year, and it looks I missed a lot of interesting stuff.
To try to parse style properties securely, I'm trying to build a set regular expressions that will be used to validate property values. This means there will be a regular expression for each style property, and there is where bad stuff will be catched, hopefully.
Another thing that can be done by this filter is the nofollow stuff while parsing HTML input. That means no additional filter pass to rel="nofollow" links based on white/blacklists.
Another thing that can be done by this filter is filtering URLs used for OBJECT SRC, etc., so users would be able to post media from a whitelist of third party sites.
All these things I had to do HERE (Drupal 5 based site). But it was based on hack-ish solutions here and there. This time I need to redo the same job for Drupal 6 based sites, but I'm trying to do it better than last time.
Probably, this discussion will be easier when I have some code to show. I hope to get something soon because this is my primary goal in my daily job.
...
Well, the whole Wysiwyg API is overdue. So why not work something out for inclusion in Wysiwyg API?
That's almost what I imagine for D7-, however, the bridge between input formats/filters needs to deal with handling and passing structured markup information, NOT editor settings. That's why I proposed the enhancement of hook_filter() for D7 (which could already be implemented in Wysiwyg API and contrib modules for D7-).
As mentioned before, you are risking scope creep by implementing too many unrelated features and not properly abstracting the required logic due to a editor-specific implementation.
That's for sure ;)
Daniel F. Kudwien
unleashed mind
Daniel F. Kudwien
netzstrategen
Ok, let's try to build from structured markup information
To whitelist style properties is easy. I have already compiled an array of CSS2 attributes organized into groups, something similar to what's in the W3C recommendation. I'm still building the regular expressions that will be used to validate properties. Anyway, the UI here is just several groups of checkboxes from where it is possible to check/enable them or not. No matter where these attributes can be used as far as the regular expressions work and prevent from getting security threats. We're not writing a CSS validator, we just want/need to be safe.
However, how can be easily made an UI from where it is possible to whitelist HTML elements, and attributes for each one of those elements. There may be a lot of possible combinations. In addition to this, one may need to allow elements and/or attributes that do not exist in the standard. That's why I opted for working on a TinyMCE valid_elements parser, which I already have working. How would you do this?
Why not? Though, I'm not sure if my approach would really worth, so first let's see what I can come up with around here...
PS: At this moment I'm writing and testing regular expressions for the style properties validator...
Scope creep?
Regular expressions? Filtering HTML is the job of HTML filter in Drupal core, HTML Purifier, or htmLawed. It's also the job of client-side editors, for example TinyMCE or FCKeditor. Both have the same purpose, but are working in different layers. However, it's not the job of such a bridge.
One possible approach in pseudo-code:
<?php
function wysiwyg_get_valid_markup($format) {
$items = module_invoke_all('filter', 'allowed_markup', $format);
// Special handling for core's HTML filter (D7-).
$allowed_tags = preg_split('/\s+|<|>/', variable_get("allowed_html_$format", '<a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd>'), -1, PREG_SPLIT_NO_EMPTY);
$items += $allowed_tags;
return $items;
}
/**
* Example $op allowed_markup implementation.
*/
function htmlawed_filter($op, $format) {
switch ($op) {
case 'allowed_markup':
// Actually not hard-coded, of course.
return array(
'a' => array('id', 'class', 'href', 'style', ...),
'div' => array('id', 'class', ...),
);
}
}
/**
* OFFTOPIC: Example usage of this data.
*/
function wysiwyg_tinymce_settings($editor, $profile, $format) {
$allowed_markup = wysiwyg_get_valid_markup($format);
foreach ($allowed_markup as $tag => $attributes) {
$allowed_markup[$tag] = implode('|', $attributes);
$allowed_markup[$tag] = $tag . '[' . $allowed_markup[$tag] .']';
}
$init['valid_elements'] = implode(',', $allowed_markup);
}
?>
Daniel F. Kudwien
unleashed mind
Daniel F. Kudwien
netzstrategen
There are two different things here
1) A filter that is able to parse allowed style properties safely.
2) Integration of this filter into WYSIWYG.
I'm trying to solve both things in one single step. First thing I need now is solving 1. Then, to achieve 2... that's another story, but it requires 1.
I'm afraid you're talking about 2 here. Well, I've been working on both things at a time because it looked to me good enough being able to parse valid_elements in TinyMCE format. Once I upload the code somewhere, then this discussions will be easier, I guess.
I'm attaching a file here that contains function
wysiwyg_filter_parse_valid_elements(), which is what I've coded to parse valid_elements in TinyMCE format. At this moment, the filter settings panel allows to enter HTML elements and attributes in this format, which is stored in preparsed form with this funcion, so it's already "compiled" for the filter process function, and it is also aimed to be passed onto the editor directly.If I had to code the filter settings form in a more generic way, then I would love to hear ideas in order to build UI that's too much cluttered.
WYSIWYG Filter module in the sandbox
WYSIWYG Filter module in the sandbox: /drupal/contributions/sandbox/markus_petrux/wysiwyg_filter
It mostly works, but it needs a lot of testing and validation, specially the regular expressions used to validate style properties.
Things that might change:
Screenshot of the WYSIWYG Filter settings panel
Here's a screenshot of the WYSIWYG Filter settings panel (attached).
Please, note that using TinyMCE valid_elements syntax it is possible to describe which HTML elements and attrbiutes are allowed, also which values could be used for those attributes. This is something that I don't know how it can be done with an easy to use UI.
In contrast, the section used to whitelist style properties is pretty simple. You just check what you need, so you can only enable properties that can cause no harm, such as color, font-size, text-align... and the parser will make sure that only valid values for those attributes will be rendered.
So... assuming the power that brings TinyMCE valid_elements syntax is not fully/generally needed, AND that the UI to whitelist Style properties whitelist is fine, THEN... is it possible to build an UI to whitelist HTML elements and attributes in a similar fashion? I'm not sure, because there are a lot of HTML elements, and many attributes and possible values for each one of them. So, how it can be done?
If I was to release a project for this WYSIWYG Filter?
Hi,
I contacted the Drupal security team for advice from their point of view, if I was to release this as a project, which is something I wish since I believe such a filter would make things easier for WYSIWYG users. The key point is that it will allow whitelisting style properties and the filter will validate them and remove anything that doesn't match a regular expression that exists every style property in CSS2 visual media type specs.
I still need to add a couple of settings, but it is almost done what I was in need to implement, at least for what it is the input filter part of it. A method to integrate with WYSIWYG would remain pending (ie. how to pass on to TinyMCE or whatever else editor the valid_elements parameter).
If someone wishes to take a look, I've been updating my sandbox space in CVS as I go. Link is
posted abovehere.Before I go ahead and release this... any advice/suggestion?
Thanks
Project page for WYSIWYG Filter contrib module
Hi,
I have been working a little bit more this week on this, optimizing the parser, fixing regular expressions, adding more options, etc. Finally, I've moved the code from my sandbox space to its own project page at d.o.
http://drupal.org/project/wysiwyg_filter
It is almost done. I still wish to hardcode some kind of blacklist for HTML elements, just to make sure no one can use tags such as SCRIPT, STYLE, HEAD, BODY, etc.
For the OBJECT and EMBED tags, I believe the best thing I can do now is to blacklist them as well, so everyone is forced to use additional Drupal filters that better know how to deal with multimedia stuff.
Feedback is much appreciated here, or within the project issues queue itself.
Thanks to all who have spent a little time on this thread. It has helped me figure out how to implement a few details. There may be more things to take into account, though, but time will tell... :)
Cheers
Pushed the WYSIWYG Filter module to RC1 state
I ended up hardcoding a blacklist of HTML elements. Is makes no sense to allow BODY, FORM, etc. in postings. I also blacklisted OBJECT, EMBED, PARAM and APPLET. To use these, better use inline filters, I think.
Screenshot of the WYSIWYG Filter settings panel. Patches and/or ideas to improve the UI to whitelist HTML elements and attributes are welcome. Though, I tried to document the most common rules for the valid_elements option the best I could.
If you are the developer of a WYSIWYG editor, please let me know if there is something this module could do with its own filter options to help enhance the configuration of the WYSIWYG editor itself without additional user intervention.
Cheers
Comments of a partial observer
I fully understand the conflict of interest here (I am the developer for HTML Purifier), but I still would like to strongly against going ahead and implementing something like this on your own. Filtering HTML is already a tricky business and even when your fundamental underpinnings are right, there are still security vulnerabilities. As you add more flexibility and functionality to the filter, your attack surface grows, and the more likely things are going to be insecure.
If you do decide to go ahead, and the Drupal core team backs behind this effort, I would be willing to help security audit the filter. But there are some fundamental difficulties with your implementation that are difficult to fix. I'll outline the argument for well-formedness here:
HTML is a language that occurs in a heterogeneous browser environment. There are many browsers that will consume your HTML, and each of them has their own parsing quirks and mannerisms, as well as custom elements and attributes. The XSS cheatsheet is a great example of different syntaxes, elements and attributes coming together for very specific vulnerabilities that only apply to certain browsers.
There is, however, a domain of HTML strings that are unambiguously interpreted by all browsers (barring rendering differences): standards-compliant HTML. It's a magic wand that makes all syntax-based XSS go away. And if you compile a specific enough whitelist (which was done by HTML Purifier and htmLawed (mostly; they allow conditional comments when they shouldn't)), it makes all XSS go away.
Cheers,
Edward
P.S. I know that the prospect of HTML Purifier in Drupal's core is a snowball's chance in hell. The design and coding philosophies are way too different. But if you all ever change your minds, HTML Purifier is definitely up to the task, even as a plugin. Just make the APIs.
I already created a project for WYSIWYG Filter
Here's the project page: http://drupal.org/project/wysiwyg_filter
The main HTML parser is based on core's filter_xss(), but extended to check HTML elements, their properties and style attributes from a series of whitelists. It also implements a blacklist of HTML elements that the user won't be able to whitelist. Also, for certain attributes additional rules are applied as well.
WYSIWYG Filter is not as powerfull and complex as HTML Purifier, but a bit more than core's filter_xss(). I would say it lands in the middle.
It would be really great if you could take a look it. :-D