Merry Christmas, Internets! My gift to you this year is Sanitize, a whitelist-based HTML sanitizer written in Ruby. Given a list of acceptable elements and attributes, Sanitize will remove all unacceptable HTML from a string.
Using a simple configuration syntax, you can tell Sanitize to allow certain elements, certain attributes within those elements, and even certain URL protocols within attributes that contain URLs. Any HTML elements or attributes that you don’t explicitly allow will be removed.
Because it’s based on Nokogiri, a full-fledged HTML parser, rather than a bunch of fragile regular expressions, Sanitize has no trouble dealing with malformed or maliciously formed HTML. When in doubt, Sanitize always errs on the side of caution.
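To see why regular expressions are fragile here, consider a deliberately naive filter. This is only an illustration (the `naive_strip_script` helper is made up for this post and has nothing to do with Sanitize's internals), but it shows how a single regex pass can reassemble the very tag it was meant to remove:

```ruby
# A deliberately naive filter, for illustration only:
# delete opening and closing <script> tags with one regex pass.
def naive_strip_script(html)
  html.gsub(/<\/?script>/i, '')
end

# Well-formed input looks fine...
naive_strip_script('<script>alert(1)</script>')
# => 'alert(1)'

# ...but nested fragments are stitched back together once the
# inner tags are deleted.
tricky = '<scr<script>ipt>alert(1)</scr</script>ipt>'
naive_strip_script(tricky)
# => '<script>alert(1)</script>'
```

A parser-based approach sidesteps this class of problem because it decides what to keep from the parsed document tree, not from string patterns.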
Using Sanitize is easy. First, install it:
gem install sanitize
Then call it like so:
require 'rubygems'
require 'sanitize'
html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
Sanitize.clean(html) # => 'foo'
By default, Sanitize removes all HTML. You can use one of the built-in configs to tell Sanitize to allow certain attributes and elements:
Sanitize.clean(html, Sanitize::Config::RESTRICTED)
# => '<b>foo</b>'
Sanitize.clean(html, Sanitize::Config::BASIC)
# => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'
Sanitize.clean(html, Sanitize::Config::RELAXED)
# => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
Or, if you’d like more control over what’s allowed, you can provide your own custom configuration:
Sanitize.clean(html,
  :elements   => ['a', 'span'],
  :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
  :protocols  => {'a' => {'href' => ['http', 'https', 'mailto']}})
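If you're wondering what a protocol whitelist buys you, here's a rough, stdlib-only sketch of the idea. The `protocol_allowed?` helper is hypothetical and is not part of Sanitize's API or how Sanitize implements the check. It also ignores tricks like entity-encoded or whitespace-obfuscated schemes, which a real sanitizer has to handle:

```ruby
# Hypothetical helper for illustration; not part of Sanitize's API.
def protocol_allowed?(url, allowed_protocols)
  # Capture a leading scheme such as "http" in "http://foo.com/".
  scheme = url.strip[/\A([A-Za-z][A-Za-z0-9+.\-]*):/, 1]

  # Assumption of this sketch: URLs without a scheme (relative URLs)
  # are let through. A stricter policy could reject them instead.
  return true if scheme.nil?

  allowed_protocols.include?(scheme.downcase)
end

allowed = ['http', 'https', 'mailto']

protocol_allowed?('http://foo.com/', allowed)      # => true
protocol_allowed?('JavaScript:alert(1)', allowed)  # => false
protocol_allowed?('/relative/path', allowed)       # => true
```

The point is that anything not on the list is rejected by default, so a scheme you never thought about can't slip through.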
For more details, see the Sanitize documentation.