The eclectic musings of a bitter software engineer.

Sanitize: A whitelist-based Ruby HTML sanitizer

Wednesday December 24, 2008 @ 10:45 PM (PST)

Merry Christmas, Internets! My gift to you this year is Sanitize, a whitelist-based HTML sanitizer written in Ruby. Given a list of acceptable elements and attributes, Sanitize will remove all unacceptable HTML from a string.

Using a simple configuration syntax, you can tell Sanitize to allow certain elements, certain attributes within those elements, and even certain URL protocols within attributes that contain URLs. Any HTML elements or attributes that you don’t explicitly allow will be removed.

Because it’s based on Hpricot, a full-fledged HTML parser, rather than a bunch of fragile regular expressions, Sanitize has no trouble dealing with malformed or maliciously-formed HTML. When in doubt, Sanitize always errs on the side of caution.

Using Sanitize is easy. First, install it:

sudo gem install sanitize

Then call it like so:

require 'rubygems'
require 'sanitize'

html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

Sanitize.clean(html) # => 'foo'

By default, Sanitize removes all HTML. You can use one of the built-in configs to tell Sanitize to allow certain attributes and elements:

Sanitize.clean(html, Sanitize::Config::RESTRICTED)
# => '<b>foo</b>'

Sanitize.clean(html, Sanitize::Config::BASIC)
# => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'

Sanitize.clean(html, Sanitize::Config::RELAXED)
# => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

Or, if you’d like more control over what’s allowed, you can provide your own custom configuration:

Sanitize.clean(html, :elements => ['a', 'span'],
    :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
    :protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})
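
To make the shape of that configuration more concrete, here is a minimal sketch of how a whitelist lookup against such a hash might work. The helper `filter_attributes` and the constant `CONFIG` are hypothetical and are not part of Sanitize's API; the sketch only illustrates the idea that anything not explicitly listed is dropped.

```ruby
# Hypothetical illustration only; this is not Sanitize's actual implementation.
CONFIG = {
  :elements   => ['a', 'span'],
  :attributes => {'a' => ['href', 'title'], 'span' => ['class']}
}

# Returns the attributes that survive the whitelist, or nil when the
# element itself is not allowed.
def filter_attributes(config, element, attrs)
  return nil unless config[:elements].include?(element)

  allowed = config[:attributes].fetch(element, [])
  attrs.select { |name, _value| allowed.include?(name) }
end

# Keeps only 'href'; 'onclick' is not whitelisted for <a>.
filter_attributes(CONFIG, 'a', {'href' => 'http://foo.com/', 'onclick' => 'evil()'})

# Returns nil because <img> is not in the element whitelist.
filter_attributes(CONFIG, 'img', {'src' => 'http://foo.com/bar.jpg'})
```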

For more details, see the Sanitize Documentation.

Comments

I appreciate you donating this code to the open source community. I have one small issue. The plugin works great except with relative links. I tried adding “/” to the protocols but it does not seem to work. Any advice would be useful….

Sunday December 28, 2008 @ 08:49 PM (PST)

Good catch, Johnny. I’ve pushed a change to the git repo that adds support for relative URLs. With this change, you can allow relative URLs by including the special value :relative in a protocol config array, like so:

:protocols => {
  'a' => {'href' => ['http', 'https', :relative]}
}

The Basic and Relaxed configs have also been updated to allow relative URLs.

Sunday December 28, 2008 @ 10:35 PM (PST)

This looks very cool. I’ll definitely be using it in my next project. It’s very easy to use. Thanks and have a Happy New Year!

Wednesday December 31, 2008 @ 11:46 AM (PST)

Hello and happy new year!
Thanks for this gem.
How can I deal with HTML entities? Each one is replaced by a “?” character:

>> Sanitize.clean('&eacute;')
=> "?"

Happy new year

Thursday January 01, 2009 @ 02:41 AM (PST)

Thanks for the gem, it is awesome.
Is there any other way to use this in my Rails application apart from installing it as a gem on the machine? Maybe as a plugin or something?

suman
Thursday January 01, 2009 @ 11:12 AM (PST)

This appears to be a bug in Hpricot. I’ve pushed a workaround to the git repo. Thanks for the report!

Thursday January 01, 2009 @ 11:47 AM (PST)

I don’t use Rails myself, but you should be able to unpack the Sanitize gem (and its dependencies, Hpricot and HTMLEntities) into Rails’s vendor/gems directory. Here’s a nice howto guide.

Thursday January 01, 2009 @ 11:53 AM (PST)

Ryan,
You might want to check out Nokogiri. …Nokogiri is faster, and less buggy than Hpricot… http://github.com/tenderlove/nokogiri/tree/master

(Nokogiri’s #inner_text will strip HTML.)

…might be worth getting your Sanitize lib to work with Nokogiri.

cheers

Thursday January 01, 2009 @ 01:21 PM (PST)

Thank you Ryan, it’s working very well.

Thursday January 01, 2009 @ 01:36 PM (PST)

I am nothing if not skeptical about sanitization that does not involve a full pass through a real browser instance. The mother of all HTML sanitization hacks was probably Samy’s profile hack on MySpace.

Here is an explanation of what he did. My question is, will Sanitize correctly filter out his hack (and each of the steps we took to get there)? And second to that, shouldn’t it be part of the test cases, as it is about as pathological as you’ll get?

I’m pretty sure that the answer is going to be “no”. Sanitize is only going to filter out elements and attributes. The problem is that there are many, many ways to hide malicious things in the markup that are going to defeat this sanitizer. And to be honest, a bad sanitizer is probably worse than no sanitizer.

The problem is that you need to take into account all the ways that browsers really mess things up. I notice that you’re testing for entities in place of the colon to hide javascript:, but you have no test for an entity in place of any other character. Given how and when entities are handled, it’s reasonable for a browser to interpret &#106;avascript: like javascript:. In fact, I believe that naive email address obfuscation scripts do this (you know, those scripts that try to hide your address from spam bots).

More notably, some browsers will actually interpret the string java\nscript: as javascript:; they probably shouldn’t ignore the newline but some do (and given Samy’s success, I’ll bet it’s IE that does).

Before anyone uses this library in production, Ryan really needs to beef up his test cases. I also think you’re going to need to add something like JSLint, which can be set to use a safe subset of JavaScript. Now, JSLint is written in JavaScript, so you’ll either need to find a JS engine to work with or you’ll have to rewrite the code in Ruby; that shouldn’t be impossible, as Doug Crockford has described the methods he wrote to create JSLint.

I would also like to see a “beat the sanitizer” website set up where people can test the sanitizer against malicious code and send failure reports to Ryan (who would then fix the sanitizer and expand the test coverage accordingly). Mind you, you’d have to be careful with what you do with known malicious code that bypasses your sanitizer, since someone might send you truly malicious code, not just something that could be exploited to deliver malicious code.

Adam
Thursday January 01, 2009 @ 11:11 PM (PST)

I appreciate your skepticism, Adam, and I welcome test cases that will help me improve Sanitize. However, it’s irresponsible of you to make accusations based purely on speculation. If you think you have a way to break Sanitize, test it. If it works, let me know so I can fix it. Don’t guess, and don’t make accusations based on guesses.

You seem to be under the misconception that Sanitize is intended to make it safe to include CSS and JavaScript in your HTML. It isn’t. Sanitize is intended only to clean HTML. If you tell Sanitize to allow elements (such as <script>) or attributes (such as style) that allow code execution, you’re taking your safety into your own hands and should definitely look into AdSafe, Caja, or other sandboxing tools. By default, Sanitize strips all elements and attributes, and none of the included configurations allows unsafe elements or attributes.

For the record, here’s the result of running a string containing the Samy worm through each of Sanitize’s built-in configurations:

Sanitize.clean(samy)                               # => ""
Sanitize.clean(samy, Sanitize::Config::RESTRICTED) # => ""
Sanitize.clean(samy, Sanitize::Config::BASIC)      # => ""
Sanitize.clean(samy, Sanitize::Config::RELAXED)    # => ""

And here’s the result of each of the javascript: variations you proposed, which aren’t tested for in the Sanitize unit tests because the only character that matters to Sanitize’s protocol-filtering code is the colon. As long as the colon is recognized correctly, the protocol will be sanitized properly:

s = Sanitize.new(Sanitize::Config::RELAXED)

s.clean("<a href=\"&#106;avascript:alert('hi')\">foo</a>") # => "<a>foo</a>"
s.clean("<a href=\"java\nscript:alert('hi')\">foo</a>")    # => "<a>foo</a>"

Remember, Sanitize is based on a whitelist, not a blacklist. You don’t need to tell it what to block, you only need to tell it what to allow. When Sanitize checks for a valid protocol, it doesn’t look for variations of javascript: that need to be filtered out. It looks for a : character and then ensures that anything preceding it is in the protocol whitelist.
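
The colon-based check described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: `protocol_allowed?` is a hypothetical helper rather than Sanitize's real code, and decoding entities with `CGI.unescapeHTML` before looking for the colon is an assumption about how an entity-encoded colon might be recognized.

```ruby
require 'cgi'

# Hypothetical helper; not Sanitize's actual implementation.
# Decodes entities first so that an encoded colon (&#58;) is still seen,
# then checks whatever precedes the first colon against the whitelist.
def protocol_allowed?(url, allowed)
  decoded = CGI.unescapeHTML(url.to_s)
  scheme, colon, _rest = decoded.partition(':')

  # No colon at all: treat the value as a relative URL.
  return allowed.include?(:relative) if colon.empty?

  allowed.include?(scheme.downcase)
end

allowed = ['http', 'https', :relative]

protocol_allowed?('http://foo.com/', allowed)              # => true
protocol_allowed?('/about', allowed)                       # => true
protocol_allowed?("&#106;avascript:alert('hi')", allowed)  # => false
protocol_allowed?("java\nscript:alert('hi')", allowed)     # => false
protocol_allowed?('javascript&#58;alert(1)', allowed)      # => false
```

Note that this sketch is deliberately conservative: a relative URL with a colon in its query string would be rejected, because everything before the first colon is treated as the protocol.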

Again, skepticism is always healthy, and I thank you for that, but speculation without experimentation is useless and can result in harmful misinformation.

Friday January 02, 2009 @ 12:16 AM (PST)

By the way, I couldn’t resist the challenge. You can now test Sanitize to your heart’s content on my very own server at http://sanitize.pieisgood.org. Do let me know if you discover anything saucy.

Friday January 02, 2009 @ 01:15 AM (PST)

Very cool, Ryan; library looks good!

Quick question about some new behavior allowed by HTML 5; a minor change was made to allow any element (not just anchors) to include an HREF, such as divs and spans. How will Sanitize handle this?

I tried http://sanitize.pieisgood.org/ but it rejected all divs… didn’t delve much further yet, but may take another look later to see what I come up with.

But the question still remains: is this behavior that you’d like to whitelist by default (since it’s much like a simple link in a different tag)? That is, if you even do whitelist links.

Friday January 02, 2009 @ 09:29 AM (PST)

I had written a similar library to this for use scrubbing html emails down, and I ran across some truly weird HTML that I couldn’t get Hpricot to parse.

I ended up switching to Nokogiri, because it tossed anything that didn’t make sense. I ran the sample below through your tester and it came up with all the wacky attributes still intact, as did my own Hpricot parser.

Here’s a sample from one test case:
<table class="zarg" randomstuffhere background-image:url('http://images.webbuyersguide.com/newsletterimages/right_bg.gif') background-repeat:repeat-y>… etc. etc.

Matt Wilson
Friday January 02, 2009 @ 11:18 AM (PST)

None of the built-in configs allows href attributes on elements other than <a>, but you can easily tell Sanitize to allow any attribute on any element you want:

html = '<div href="http://foo.com/">Foo</div>'

# Allow divs with href attributes containing relative URLs or HTTP/HTTPS URLs.
config = {
  :elements   => ['div'],
  :attributes => {'div' => ['href']},
  :protocols  => {'div' => {'href' => ['http', 'https', :relative]}}
}

Sanitize.clean(html, config) # => html (unmodified)

Sanitize doesn’t actually understand anything about the semantics of HTML other than what you tell it, so it won’t have any problem dealing with HTML 5.

As for your question about whether Sanitize will whitelist such things by default: nope, Sanitize will never ever whitelist anything by default, but as HTML evolves, the included configs will be updated to take such things into account.

Friday January 02, 2009 @ 11:27 AM (PST)

Matt, when I run that example through Sanitize, it doesn’t leave it intact; it (correctly) entifies the markup, rendering it harmless but still displayable. This is one of Sanitize’s safety fallbacks when it encounters something it can’t parse.
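
As a rough picture of what "entifying" means, the sketch below escapes a shortened version of that sample with the standard library's `CGI.escapeHTML`. Whether Sanitize uses this particular call internally is not stated here, so treat it purely as an illustration of the effect: the markup stays readable as text but can no longer be interpreted as a tag.

```ruby
require 'cgi'

# Shortened version of the unparseable sample; escaping is shown with the
# standard library, which is an assumption and not Sanitize's own code path.
broken  = %q(<table class="zarg" randomstuffhere>)
escaped = CGI.escapeHTML(broken)

puts escaped
# prints: &lt;table class=&quot;zarg&quot; randomstuffhere&gt;
```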

Since the example is not even remotely valid HTML, I don’t think it’s fair to expect Hpricot (or any HTML parser) to be able to parse it. However, it is fair to expect that any worthwhile HTML sanitizer will at least sanitize it, which Sanitize does.

If you have any other wacky examples like that, I’d love to see what Sanitize does with them. In this case, though, I think it’s doing the right thing.

Friday January 02, 2009 @ 11:34 AM (PST)

This is pretty darn awesome.

Like your site, too. Nice work with the fonts and such things. :)

Sebastian (globulus)
Saturday January 03, 2009 @ 12:25 AM (PST)

While nokogiri might in some cases be faster than hpricot, there are also cases where nokogiri is exceptionally difficult to get working because of poor management of dependencies. I am sure that Ryan is aware of both nokogiri and hpricot, and that he’s made educated choices.

A whitelist-parser-based sanitizer is certainly a welcome addition to the toolbox. Thanks Ryan!

Saturday January 03, 2009 @ 04:29 AM (PST)

Thank you Ryan for your gift! Now i can parse web content using a few lines of code.
Happy new year!

Saturday January 03, 2009 @ 11:01 AM (PST)

This is exactly what I was looking for. I’ll be adding it to my web site soon, to allow visitors to use a small subset of html to markup their entries. Thanks!

Sunday January 04, 2009 @ 08:24 AM (PST)

Hey, I’ve found what may be unintended behavior when parsing particularly malformed tags. Not sure if this is best approached via Sanitize or Hpricot, but here’s what I have…

Attempting to clean a malformed href tag NULLs the entire message when using a config that allows the anchor tag and allows protocols. (Basic, Relaxed.)

I’d expect the broken tag to be jettisoned, since without a properly-formed protocol reference, there’s essentially nothing there of interest — I’d just expect the accompanying text to be preserved.

Here’s a short irb transcript…

http://pastebin.com/f7a37ddb7

I’m seeing this with the 1.0.1 gem on Debian Etch, and have also verified that your sanitize.pieisgood.org interface behaves oddly with the above attempt: it produces a “500 Server internal error” message. Actually, this comment interface chokes on it too, necessitating the pastebin link. :)

Sanitize is great for my needs otherwise, by the way. I’ve got about 250k pieces (and growing) of wildly different user-submitted content that need to be sanitized and Sanitize performs exactly as desired on all but about 15 of them. All of those are related either to PEBCAK issues like the above, or to wacky Unicode strings I haven’t quite gotten my head around yet.

Thanks very much for your work!

DaemianMack
Sunday January 04, 2009 @ 09:12 AM (PST)

Thanks Daemian. That was indeed a bug in Sanitize. I’ve pushed a fix to the git repo. Please let me know if you discover anything else like that.

Sunday January 04, 2009 @ 11:41 AM (PST)

Ah, I didn’t realize that it had escaped the html. In my case, I need a library that strips invalid markup components, not one that escapes the whole tag. I certainly understand the rationale for escaping, but that’s not what I need :).

Nice work all the same!

Matt Wilson
Friday January 09, 2009 @ 01:40 PM (PST)

The clean! method should return nil when no changes are required. However:

> Sanitize.clean!("<div id='myid' class='myclass' style='color:red'>hi</div>",
                           :elements=>'div', 
                           :attributes=>{'div'=>'id class style'})

=> "<div class=\"myclass\" id=\"myid\" style=\"color:red\">hi</div>"

No changes are required but nil is not returned, instead the type of quotes and the order of the attributes have been modified.

This could be fixed by changing the comparison at the end of the clean! method from:

return result == html ? nil : html[0, html.length] = result
to
 return result == Hpricot(html).to_s ? nil : html[0, html.length] = result

Wednesday January 14, 2009 @ 04:36 PM (PST)

Thanks Daniel. You’re right, the documentation in this case is misleading. It should say that clean! will return nil if no changes were made, not if no changes were necessary. I’ll include a fix in the next release.

Wednesday January 14, 2009 @ 05:14 PM (PST)

Hi there!
First of all thx a lot for the gem, it’s been very useful for me!
However, I’ve just tried on the server you set up the following string:

‘’

And it seems that 2 of the logos appear…

Friday January 16, 2009 @ 11:50 AM (PST)

I didn’t want to put the images here!
So sorry!

It’s just the same IMG tag repeated 4 times. Somehow 2 of them appear!

Friday January 16, 2009 @ 11:50 AM (PST)

Thanks Cristobal, I’ll investigate and get a fix out as soon as possible.

In the future, please report things like this directly to me via email before disclosing them publicly so I have a chance to provide a fix before knowledge of the vulnerability is widespread.

Friday January 16, 2009 @ 03:14 PM (PST)

Sanitize 1.0.4 is now available via RubyGems with a fix for this issue.

Friday January 16, 2009 @ 03:49 PM (PST)

I have a basic form:

<% form_for … do |f| %>
  <%= Sanitize.clean(f.text_field :title,
      Sanitize::Config::RESTRICTED) %>
<% end %>

This is incorrect. Could anyone help me out?

Petr
Friday January 23, 2009 @ 02:46 PM (PST)

I notice that Sanitize::Config::BASIC adds rel="nofollow" to links whereas Sanitize::Config::RELAXED doesn’t.

However, looking at the documentation for the two configs didn’t give me any clue how I could make my own config that would, e.g., be very similar to RELAXED but add the rel="nofollow" attribute.

Any tips?

Thursday January 29, 2009 @ 01:46 PM (PST)

The :add_attributes config param is what you’re looking for. It’s a Hash of element names, each of which is in turn a Hash of attribute names and values that should be added to all instances of that element.

Here’s the source of Sanitize::Config::BASIC so you can see how it’s done:

class Sanitize
  module Config
    BASIC = {
      :elements => [
        'a', 'b', 'blockquote', 'br', 'cite', 'code', 'dd', 'dl', 'dt', 'em',
        'i', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong', 'sub',
        'sup', 'u', 'ul'],
 
      :attributes => {
        'a'          => ['href'],
        'blockquote' => ['cite'],
        'q'          => ['cite']
      },
 
      :add_attributes => {
        'a' => {'rel' => 'nofollow'}
      },
 
      :protocols => {
        'a'          => {'href' => ['ftp', 'http', 'https', 'mailto', :relative]},
        'blockquote' => {'cite' => ['http', 'https', :relative]},
        'q'          => {'cite' => ['http', 'https', :relative]}
      }
    }
  end
end

If you’d like a version of the RELAXED config that adds rel="nofollow" to links, this should do the trick:

config = Sanitize::Config::RELAXED.merge({:add_attributes => {'a' => {'rel' => 'nofollow'}}})
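
One thing worth keeping in mind with this approach is that Ruby's `Hash#merge` is shallow: if the base config already had an :add_attributes entry, a plain merge would replace it wholesale. The sketch below uses a small stand-in hash (the `base` contents are invented for illustration and are not the real RELAXED config) to show the difference between a plain merge and a merge that combines nested hashes one level deep.

```ruby
# Stand-in for a built-in config; these values are made up for illustration.
base = {
  :elements       => ['a', 'img'],
  :add_attributes => {'img' => {'alt' => ''}}
}
override = {:add_attributes => {'a' => {'rel' => 'nofollow'}}}

# Plain merge: the whole :add_attributes value is replaced.
shallow = base.merge(override)
shallow[:add_attributes].key?('img') # => false

# Merge with a block: nested hashes are combined one level deep.
one_level = base.merge(override) do |_key, old_value, new_value|
  if old_value.is_a?(Hash) && new_value.is_a?(Hash)
    old_value.merge(new_value)
  else
    new_value
  end
end
one_level[:add_attributes].keys.sort # => ["a", "img"]
```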
Thursday January 29, 2009 @ 02:29 PM (PST)

I’ve just noticed that if I view Sanitize’s output in IE6, instances of the &apos; entity aren’t rendered as I’d expected. As it turns out, the &apos; entity is treated differently since it’s part of the XML spec. So the code would seem to be performing as specified, it’s just our (my) expectations that are off. This URL has more…

http://cssvault.com/blog/2007/10/17/internet-explorer-apos-feature/

I was all set to submit a patch for this behavior via github but grepping through sanitize’s source shows only one instance of the &apos; term — in a test — so I’m guessing this behavior might originate in hpricot. Let me know if that’s accurate and I’ll be happy to report this there and see if _why cares about IE6. (I have to, sadly, and once you start seeing ampersand entities instead of single quotes you realize just how outstandingly popular they are.)

Incidentally, looks like the title field and body field of this comment interface interpret “&amp;apos;” differently.

Daemian Mack
Thursday February 05, 2009 @ 04:49 AM (PST)

Thanks for reporting this, Daemian. I don’t think Hpricot is at fault here, though. The HTMLEntities gem seems to be the culprit. I’ve been planning to get rid of that dependency anyway by rolling the necessary functionality into Sanitize, so I’ll fix this as part of that change.

As for the comment field behavior, that’s because the title field doesn’t allow HTML (so the string “&apos;” is escaped and displayed literally) whereas the comment body does allow HTML (so the string “&apos;” is not escaped, and is interpreted as an entity by the browser).

Thursday February 05, 2009 @ 09:38 AM (PST)

I am having the same problem with &apos; in IE 7.

Tuesday February 10, 2009 @ 03:29 AM (PST)

The &apos; issue is fixed in the latest development version of Sanitize on GitHub.

Tuesday February 10, 2009 @ 09:48 AM (PST)

The latest gem (1.0.5) works with a mongrel development server but fails during a “rake” Test::Unit run. I get an error that it could not find the HTMLEntities gem even though it was installed and unpacked into vendor/gems. To work around this I put the latest source of the gem in vendor/gems (it does not require HTMLEntities) and everything works with Rails.

*The hpricot gem is also in vendor/gems as I “vendor everything”. It should work with your Rails app if you just install them as gems. When 1.0.6 of this gem is released, it should just work out of the box without this workaround.

Tuesday February 10, 2009 @ 02:57 PM (PST)

I’ve got some awful-looking HTML I’m parsing and Sanitize is doing a great job for the most part. However, there are some nested tags (<b> <b> Foo! </b> </b>) that it’s not cleaning up. Perhaps that’s outside the scope of an otherwise excellent plugin but it’d be neat if it could fix that as well :)

Saturday February 21, 2009 @ 09:50 PM (PST)

Sanitize will strip nested tags if they’re not in the whitelist, but if a tag is whitelisted, Sanitize leaves it alone, even if it’s redundant. You’re probably looking for something more along the lines of Tidy.

Saturday February 21, 2009 @ 11:26 PM (PST)

I probably am (as a bit of a newbie to Ruby). Thanks for the heads up! :)

Friday February 27, 2009 @ 06:22 AM (PST)

I’m using Sanitize and it works great (thanks!), but when it strips out a script tag, it leaves the contents of the tag in place. While this makes sense for some tags, in this case it can leave a blob of javascript visible to the end user, which is undesirable (I am processing 3rd party HTML and can’t prevent script tags in the body content). Is there a way to have sanitize remove a tag and all of its contents? Thanks!

Alx Dark
Wednesday March 04, 2009 @ 01:13 PM (PST)

By default Sanitize tries to preserve (but make safe) any non-tag content, since its primary use case is for sanitizing things like blog comments where removing the contents of a non-whitelisted tag could result in unexpected data loss.

That said, I do plan to add an option to a future version of Sanitize to allow you to specify that you want the contents of non-whitelisted tags removed completely (I’ve already received one or two patches along these lines, I just haven’t been entirely happy with them).

Wednesday March 04, 2009 @ 11:14 PM (PST)

It looks like someone asked this question already in the comments. However, I don’t see a solution. I could just be thick. I love the Sanitize gem and had no problem using most of it. One issue I’m having is that whenever I have HTML entities like &nbsp;, it turns them into a question mark. My question is thus: how do I allow HTML entities? Thanks ahead of time.

Monday March 16, 2009 @ 08:56 PM (PDT)

As of the latest release (1.0.6) Sanitize should pass all well-formed entities through untouched, assuming they’re not used in a malicious context. What version are you using?

Monday March 16, 2009 @ 09:13 PM (PDT)

Love the gem, thanks for your work!

I am having an issue with the clean! function. I am writing a validation method that needs to know when the string of html is dirty (requires cleaning), however the clean! function (as pointed out in a previous comment) returns a true value when the string is modified, but not when it needed modification.

I’ve monkey-patched the gem in a rails/config/initializer script and that works for now. Do you plan to change the api in the future or do you plan to leave the current behaviour of clean! ?

Trevor Rowe
Monday April 06, 2009 @ 03:03 PM (PDT)

The clean! method has always worked correctly; however, as discussed above, the documentation for the method in the first release of Sanitize was misleading as to the method’s purpose. The documentation in later releases is correct:

clean!(html, config = {})
Performs Sanitize#clean in place, returning html, or nil if no changes were made.

The purpose of clean! is to sanitize the given string in place rather than returning a sanitized copy of the given string. In other words, it’s a destructive version of clean.

It sounds like what you want is a method that tells you whether the given string needs to be sanitized, but doesn’t actually sanitize it. There currently isn’t a method that does this, but something like the following (which I imagine is similar to what you’ve hacked up) would work:

def is_dirty?(html)
  !Sanitize.clean!(html.dup).nil?
end

I’m curious what your use case for this is, though. If you’re trying to save processing time by not sanitizing already-clean strings, this won’t do the trick, since the only way to determine whether a string is dirty is to actually clean it (or a copy of it, as in the example above).
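
The "clean a copy and compare" idea can be tried without Sanitize installed by passing in any callable as the cleaner. In this sketch, `dirty?` and `STRIP_TAGS` are hypothetical names, and the regex-based cleaner is a deliberately crude placeholder that should not be mistaken for a safe sanitizer. It also shows why any change at all, even a cosmetic one, makes a string count as dirty.

```ruby
# Crude stand-in for a real sanitizer: removes anything that looks like a tag.
# Used only to keep the example self-contained; do not rely on it for safety.
STRIP_TAGS = lambda { |html| html.gsub(/<[^>]*>/, '') }

# A string is "dirty" if cleaning a copy of it changes anything at all.
# The original string is never modified.
def dirty?(html, cleaner)
  cleaner.call(html.dup) != html
end

dirty?('plain text', STRIP_TAGS)   # => false
dirty?('<b>bold</b>', STRIP_TAGS)  # => true
```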

Monday April 06, 2009 @ 07:17 PM (PDT)
def is_dirty?(html)
  !Sanitize.clean!(html.dup).nil?
end

Unfortunately this won’t work. Sanitize.clean! was returning html instead of nil because of 2 different situations:

  1. attributes were getting returned in a different order than they originally appeared
  2. single and double quotes were getting transformed into html entities even though it wasn’t necessary

For issue #1, I found that img tags with many attributes would sometimes come back with their attributes in a different order than they appeared in the original html. No modifications were necessary, yet the source and resultant html were different. I’m guessing it’s not limited to img tags, but applies to any tag with multiple attributes.

For issue #2 here is a small example,

  <p>I'm clean!</p>

Gets transformed into:

  <p>I&#39;m clean!</p>

As far as I know the single quote is a perfectly valid character in HTML and doesn’t (normally) need be represented as an html entity.

I guess the next question is WHY do I care if the html experienced minor modifications that don’t affect anything visual. It’s a good, valid question. I use it for model attribute validation.

I prefer not to silently modify the data I get from outside sources (I deal with data from the web and from bulk files that are imported on a regular basis). It’s important (especially for the bulk files) for me to know when the data I get is invalid so it can be fixed at the source. I need to generate a log of which fields had to be skipped and why. Neither of the above two issues would require the data to be considered invalid.

I have also been experiencing seemingly random segmentation faults:

/usr/lib64/ruby/gems/1.8/gems/hpricot-0.7/lib/hpricot/parse.rb:33: [BUG] Segmentation fault
ruby 1.8.6 (2008-03-03) [x86_64-linux]

I’m guessing it stems from Hpricot. Thoughts?

Trevor Rowe
Friday April 10, 2009 @ 02:55 PM (PDT)

If all you want to do is correct invalid HTML, leaving it untouched if it’s already valid, you want Tidy, not Sanitize. The sole purpose of Sanitize is to remove all but a safe subset of HTML from user-supplied input. If you’re using it for anything else, it’s probably not the best tool for the job. Sanitize doesn’t understand HTML; it understands whitelists, which tell it how to make HTML safe. Tidy, on the other hand, actually understands HTML (but won’t make it safe).

This is why Sanitize converts apostrophes to entities. It’s not always necessary, but it is always safer. Sanitize’s purpose is to sanitize input, which means that safety is its primary concern. The documentation for clean! (see above) says “if no changes were made” and not “if no changes were necessary” for this very reason.

The clean! method only exists in Sanitize because having a destructive alternative for a non-destructive string method is typical in Ruby classes and I thought it likely that people would ask for it if it wasn’t there. I’m starting to think it would be wiser to remove it though, since it seems to cause a significant amount of confusion.

As for the segfaults, Hpricot 0.7 seems to have been a pretty crappy release. I haven’t had a chance to test Sanitize with Hpricot 0.8 yet, but you may have better luck with it. I should have time to get up to speed on the latest Hpricot shenanigans this weekend.

Friday April 10, 2009 @ 11:26 PM (PDT)

I believe the confusion with clean! comes from the non-standard behavior. The typical behavior of destructive string methods in Ruby is to always return the modified string (see gsub!, strip!, capitalize!, etc), whereas clean! only returns when something has changed.

The “if no changes were made” phraseology in the api for clean! also implies that the Sanitize gem might have the ability to pass up making changes if they are not necessary, which (as I’ve learned) isn’t always the case. It just compares the before and after results and returns nil if they are the same.

I would suggest leaving clean! but simply return the modified contents always. Had that been the original behavior I wouldn’t probably have gone down the rabbit hole of trying to figure out if it could also be used to validate w/out modifying strings.

Thanks for your awesome work. Even if I can’t use this to validate html prior to saving to the db, it’s an excellent tool and I plan to continue using it.

Trevor Rowe
Monday April 13, 2009 @ 08:27 AM (PDT)
I believe the confusion with clean! comes from the non-standard behavior. The typical behavior of destructive string methods in Ruby is to always return the modified string (see gsub!, strip!, capitalize!, etc), whereas clean! only returns when something has changed.

Actually, the standard behavior of all three of the methods you mentioned, and of most destructive string methods in Ruby, is to return the modified string or nil if the string was not modified, which is exactly what clean! does. Take a look at the API docs: gsub!, strip!, capitalize!.
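
This is easy to confirm in irb with nothing but core String methods. The variable names below are just for illustration:

```ruby
# Each bang method returns nil when it had nothing to change.
unchanged = 'hello'.dup
unchanged.strip!           # => nil (no surrounding whitespace to remove)
unchanged.gsub!(/z/, 'y')  # => nil (the pattern never matched)
unchanged                  # => "hello"

# When a change is made, the modified receiver is returned.
padded = '  hello  '.dup
padded.strip!              # => "hello"

word = 'hello'.dup
word.capitalize!           # => "Hello"
word.capitalize!           # => nil (already capitalized)
```

Code that wants the string either way usually writes something like `result = str.strip! || str`.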

The documentation for clean! is also patterned after the documentation of standard Ruby string methods, in which the phrase “if no changes were made” is frequently used to describe this behavior.

Your suggestion that clean! always return the modified string is a little puzzling, since this is already what happens. If there is a modified string, clean! will always return it. If the string is not modified, then clean! returns nil. To do anything else would be non-standard.

It sounds like your confusion stemmed more from the fact that you misunderstood the purpose of the library than from the behavior of clean!. I’ll try to make it more explicit in the documentation that Sanitize is not intended to be a validator or a replacement for HTML Tidy.

Monday April 13, 2009 @ 10:26 AM (PDT)

Sorry for the double post.

I’m curious how everyone handles things like embed tags. For example, I want to allow some sources of embed tags (i.e. youtube, dailyshow, etc) but not all embed tags. Can sanitize do this or how else does everyone handle this?

jay
Wednesday May 13, 2009 @ 04:33 PM (PDT)

Sanitize doesn’t currently provide an option to whitelist specific URLs, but I’ll consider adding this feature.

Wednesday May 13, 2009 @ 06:21 PM (PDT)

First, I’d like to thank you for creating Sanitize. It does just the work I need in processing some RSS feed entries.

I tried to use it with Ruby 1.9.1p129 and got problems related to encoding. So I changed the encoding of the string I was sending to ASCII-8BIT using the method “force_encoding” (I had similar problems in Rails).
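
For readers unfamiliar with Ruby 1.9 encodings, the small sketch below shows what `force_encoding` does in general: it relabels the string without touching its bytes, so character-based operations start counting bytes instead. The sample string is an arbitrary choice for illustration and has nothing to do with the feed data mentioned above.

```ruby
# "caf\u00e9" is a UTF-8 string: 4 characters stored in 5 bytes.
original = "caf\u00e9"

# force_encoding only changes the label on the (copied) string.
binary = original.dup.force_encoding(Encoding::ASCII_8BIT)

original.bytes == binary.bytes  # => true (same underlying bytes)
original.length                 # => 4 (characters)
binary.length                   # => 5 (each byte is now one "character")
```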

Then I got another error, this time inside hpricot:

/usr/local/lib/ruby19/gems/1.9.1/gems/hpricot-0.8.1/lib/hpricot/traverse.rb:198:in `block in reparent': undefined method `parent=' for "":String (NoMethodError)

So I think Hpricot 0.8.1 is not 100% bug-free with 1.9.1.

My question is the following: do you think it’s worth porting Sanitize to use Nokogiri instead of Hpricot?

It would be nice to use Nokogiri (which is faster) and Ruby 1.9. I may try this when I need to optimize my app.

Hugo
Thursday May 21, 2009 @ 01:28 PM (PDT)

Neat! I have been using htmLawed (tinyurl.com/htmlawed) for my PHP projects, and this tool seems like a good equivalent for my Ruby ones.

AS Lenka
Monday June 01, 2009 @ 12:10 AM (PDT)

Has anyone adapted this to Rails’ white-list sanitizer API? Would be handy to just do:

Rails::Initializer.run do |config|
  config.action_view.white_list_sanitizer = Sanitizer.new
  config.action_view.sanitized_allowed_tags = 'table', 'tr', 'td'
  config.action_view.sanitized_allowed_attributes = 'id', 'class', 'style'
end

If not I will take a stab.

Tuesday June 23, 2009 @ 04:57 PM (PDT)

Copyright © 2002-2009 Ryan Grove. All rights reserved.
Powered by Thoth.