Sanitize 1.2.0 released

Version 1.2.0 of Sanitize, my whitelist-based HTML sanitizing library for Ruby, is now available. Consult the HISTORY file for a complete list of changes.

Introducing Transformers

This release adds a major new feature called transformers. Transformers allow you to filter and alter HTML nodes using your own custom logic, on top of (or instead of) Sanitize’s core filter. A transformer is any Ruby object that responds to call() (such as a lambda or proc) and returns either nil or a Hash containing certain optional response values.
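To make the "responds to call()" requirement concrete, here is a minimal sketch of three objects that would qualify. It uses only plain Ruby and does not load Sanitize; the variable names are made up for illustration, and every transformer here simply returns nil.

```ruby
# A lambda and a proc both respond to call().
noop_lambda = lambda { |env| nil }
noop_proc   = proc { |env| nil }

# Any other object qualifies too, as long as it defines call().
# (noop_object is a hypothetical name used only in this sketch.)
noop_object = Object.new
def noop_object.call(env)
  nil
end

transformers = [noop_lambda, noop_proc, noop_object]

# Each candidate answers true to respond_to?(:call) ...
callable = transformers.map { |t| t.respond_to?(:call) }

# ... and each returns nil when invoked with an (empty) environment Hash.
results = transformers.map { |t| t.call({}) }

# A plain String, by contrast, does not respond to call().
string_is_callable = 'not a transformer'.respond_to?(:call)
```

A lambda is usually the most convenient form, since `return` inside a lambda exits only the lambda itself.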

To use one or more transformers, pass them to the :transformers config setting:

Sanitize.clean(html, :transformers => [transformer_one, transformer_two])

Input

Each registered transformer’s call() method will be called once for each element node in the HTML, and will receive as an argument an environment Hash that contains Sanitize config information and a reference to a Nokogiri::XML::Node object.

The transformer has full access to the Nokogiri::XML::Node that’s passed into it and to the rest of the document via the node’s document() method. Any changes will be reflected instantly in the document and passed on to subsequently called transformers and to Sanitize itself. A transformer may even call Sanitize internally to perform custom sanitization if needed.
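As a rough illustration of how a change made by one transformer is visible to the next, the sketch below runs two transformers over the same nodes. To keep it runnable without any gems, it uses a tiny Struct as a stand-in for Nokogiri::XML::Node and a hand-written loop in place of Sanitize; both are assumptions for this example only, not how Sanitize works internally.

```ruby
# Stand-in for Nokogiri::XML::Node; only #name and #name= are imitated.
fake_node = Struct.new(:name)

# First transformer: rename <b> elements to <strong>, in place.
rename_bold = lambda do |env|
  node = env[:node]
  node.name = 'strong' if node.name.to_s.downcase == 'b'
  nil
end

# Second transformer: record the element names it is handed.
seen_names = []
record_name = lambda do |env|
  seen_names << env[:node].name
  nil
end

# Hand-written driver loop (an assumption for this sketch): call each
# transformer once per node, in registration order.
nodes = [fake_node.new('b'), fake_node.new('p')]
nodes.each do |node|
  env = {:node => node}
  [rename_bold, record_name].each { |transformer| transformer.call(env) }
end

# seen_names now holds the names as the second transformer saw them.
```

Because the first transformer mutates the node rather than a copy, the second one sees "strong" where the input had "b".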

Transformers are very powerful: they can completely bypass Sanitize’s built-in filtering, so a careless transformer can let through markup that Sanitize would otherwise remove.

Output

A transformer may return either nil or a Hash. A return value of nil indicates that the transformer does not wish to act on the current node in any way. A returned Hash may contain instructions that tell Sanitize to whitelist certain attributes or nodes, or to replace the current node with a new node (see the README for specifics).
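The two kinds of return value can be sketched as follows. The :whitelist_nodes key is the one used in the YouTube example in this post; the node is again a Struct stand-in rather than a real Nokogiri node, and the choice of <abbr> as the element to whitelist is arbitrary.

```ruby
# Stand-in for Nokogiri::XML::Node, so the sketch runs without any gems.
fake_node = Struct.new(:name)

# Return nil for nodes we don't care about, or a Hash asking Sanitize to
# whitelist the current node.
allow_abbr = lambda do |env|
  node = env[:node]
  return nil unless node.name.to_s.downcase == 'abbr'

  {:whitelist_nodes => [node]}
end

abbr   = fake_node.new('abbr')
script = fake_node.new('script')

abbr_result   = allow_abbr.call(:node => abbr)    # a Hash
script_result = allow_abbr.call(:node => script)  # nil
```

Returning nil early for everything that doesn't match keeps the transformer from affecting nodes it wasn't written for.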

Example: Transformer to whitelist YouTube video embeds

The following example demonstrates how to create a Sanitize transformer that safely whitelists valid YouTube video embeds without blindly allowing other kinds of embedded content, which is what would happen if you simply whitelisted all <object>, <embed>, and <param> elements:

lambda do |env|
  node      = env[:node]
  node_name = node.name.to_s.downcase
  parent    = node.parent

  # Since the transformer receives the deepest nodes first, we look for a
  # <param> element or an <embed> element whose parent is an <object>.
  return nil unless (node_name == 'param' || node_name == 'embed') &&
      parent.name.to_s.downcase == 'object'

  if node_name == 'param'
    # Quick XPath search to find the <param> node that contains the video URL.
    return nil unless movie_node = parent.search('param[@name="movie"]')[0]
    url = movie_node['value']
  else
    # Since this is an <embed>, the video URL is in the "src" attribute. No
    # extra work needed.
    url = node['src']
  end

  # Verify that the video URL is actually a valid YouTube video URL.
  # \A anchors at the start of the string; ^ would also match after a newline.
  return nil unless url =~ /\Ahttp:\/\/(?:www\.)?youtube\.com\/v\//

  # We're now certain that this is a YouTube embed, but we still need to run
  # it through a special Sanitize step to ensure that no unwanted elements or
  # attributes that don't belong in a YouTube embed can sneak in.
  Sanitize.clean_node!(parent, {
    :elements   => ['embed', 'object', 'param'],
    :attributes => {
      'embed'  => ['allowfullscreen', 'allowscriptaccess', 'height', 'src', 'type', 'width'],
      'object' => ['height', 'width'],
      'param'  => ['name', 'value']
    }
  })

  # Now that we're sure that this is a valid YouTube embed and that there are
  # no unwanted elements or attributes hidden inside it, we can tell Sanitize
  # to whitelist the current node (<param> or <embed>) and its parent
  # (<object>).
  {:whitelist_nodes => [node, parent]}
end
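The URL check is the step that keeps non-YouTube content out, so how the pattern is anchored matters. In Ruby, ^ matches at the start of any line, while \A matches only at the start of the string. The sketch below compares the two using made-up URLs; the "crafted" value is a hypothetical input, not something observed in the wild.

```ruby
line_anchored   = /^http:\/\/(?:www\.)?youtube\.com\/v\//
string_anchored = /\Ahttp:\/\/(?:www\.)?youtube\.com\/v\//

good_url = 'http://www.youtube.com/v/abc123'

# Hypothetical hostile value: some other URL, a newline, then a line that
# looks like a YouTube URL.
crafted = "http://evil.example/x.swf\nhttp://www.youtube.com/v/abc123"

good_with_line      = !!(good_url =~ line_anchored)    # true
good_with_string    = !!(good_url =~ string_anchored)  # true
crafted_with_line   = !!(crafted =~ line_anchored)     # true: ^ matches after "\n"
crafted_with_string = !!(crafted =~ string_anchored)   # false
```

Both patterns accept the ordinary URL, but only the \A version rejects the value whose YouTube-looking part starts on a second line.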

For more details on transformers, consult the README.

Installing

To install or upgrade Sanitize via RubyGems, run:

gem install sanitize