There's more to HTML escaping than &, <, >, and "

A few days ago I tweeted:

If I had a dollar for every HTML escaper that only escapes &, <, >, and ", I'd have $0. Because my account would've been pwned via XSS."

This was exaggeration for effect—there aren’t many cases where a simple XSS injection could actually empty a bank account—but I wanted to make a point.

By some coincidence, I’ve found myself working with various open source projects recently that take a half-assed approach to HTML escaping. It’s something that tends to be implemented as an afterthought, which is unfortunate because it can be critical for the security of users of these projects. I won’t name any names in this post (pull requests are forthcoming), but I will explain some of the common problems I’ve seen, why they’re problems, and what can be done to fix them.

This post is not an introduction to HTML escaping. It assumes that you already know what HTML escaping is and why it’s necessary. This post also is not a comprehensive catalog of XSS vectors; the examples here are illustrative, but they certainly aren’t the only attacks you need to worry about. The intent of this post is to explain some dangers that you may not be aware of, and to encourage you to read more about them and write safer code.

Note that this post only discusses escaping, which is something entirely different (and far less complicated) than sanitizing. HTML sanitization is a topic for another time.

Escaping < and > isn’t enough

The worst HTML escaper I’ve seen in a major open source project only escapes the < and > characters. This may actually be worse than not escaping anything at all, since it gives the illusion of security, but is trivial to defeat.

For example, let’s say I have the following template, and I’m going to replace the placeholder values, indicated in [square brackets], with HTML-escaped user input:

<a href="/user/[username]">[username]</a>

The attacker enters foo" onmouseover="alert(1) as their username. End result, even after escaping:

<a href="/user/foo" onmouseover="alert(1)">foo" onmouseover="alert(1)</a>

Because the " character wasn’t escaped and the attacker’s input was used in an attribute value, the attacker was able to inject arbitrary attributes and therefore JavaScript (which, in a real XSS attack, would probably be something more harmful than an alert).

This is a classic example of making input safer in one context—in this case, as the content of an <a> element—without considering the other contexts in which it’s likely to be used, such as inside an attribute value.

Escaping &, <, >, and " isn’t enough

The characters &, <, >, and " are the ones most commonly targeted by HTML escaper implementations. This seems to be the minimum set of characters that people think need to be escaped. Unfortunately, it’s still not safe if you don’t have complete control over where the escaped values will be used.

Consider the following template, in which the template author has used single-quoted attribute values:

<a href='/user/[username]'>[username]</a>

This is exploitable using the same attack as the previous example, but with single quotes instead of double quotes: foo' onmouseover='alert(1):

<a href='/user/foo' onmouseover='alert(1)'>foo' onmouseover='alert(1)</a>

You may be saying, “But I always use double quotes to quote attribute values!” Are you also the only person who will ever use your HTML escaper? And are you immune to typos?

Escaping &, <, >, ", and ' isn't enough

This is the character set used by PHP’s ubiquitous htmlspecialchars function, and as you may have guessed, it still falls down on attribute values for two reasons.

First, as Hacker News users DanBlake and nbpoole pointed out in a discussion of this blog post, Internet Explorer treats ` as an attribute delimiter. It may be an edge case, but it’s still a potential attack vector, so ` needs to be escaped too.

Second, HTML also allows attribute values to be completely unquoted. Believe it or not, unquoted attribute values are fairly popular (some people are too lazy to quote them, others are performance zealots who can’t bear the thought of wasting those extra bytes).

Unquoted attribute values are one of the single biggest XSS vectors there is. If you don’t quote your attribute values, you’re essentially leaving the door wide open for naughty people to inject naughty things into your HTML. Very few escaper implementations cover all the edge cases necessary to prevent unquoted attribute values from becoming XSS vectors.

Escaping &, <, >, ", ', `, , !, @, $, %, (, ), =, +, {, }, [, and ] is almost enough

All those characters up there (including the space character!) can be used to break out of an unquoted HTML attribute value. If you escape every last one of them, then you’re probably pretty close to being safe. But you’re still not so safe that you can just start throwing around user input willy nilly.

Why? Because this still doesn’t cover some context-specific cases like inserting user input into the body of an inline <script> element or using user input as part of a URL.

Context is key

If you haven’t figured it out already, the primary message I’m trying to convey here is that you must be aware of the context in which you’re working with user input. Some contexts are more susceptible to attack than others, and there’s no single magic escaping bullet that will protect you or your users in all cases.

In other words, you don’t need to escape everything all the time, but you do need to escape everything that’s important in the particular contexts in which you’re displaying user input.

But there’s still one more wrench to throw into the works…

Always specify a charset, or UTF-7 will eat your face

Even if you do everything else right, serving a page that doesn’t explicitly specify a character set can leave Internet Explorer users open to XSS, thanks to the way IE sniffs out the charset when it isn’t specified.

If an attacker is able to get your page to echo back something that looks like UTF-7 encoding early enough in the page, he may be able to trick IE into rendering the page using UTF-7. This could turn the following seemingly harmless input…

+ADw-script+AD4-alert(1)+ADw-/script+AD4-

…into something potentially harmful:

<script>alert(1)</script>

I recommend specifying a UTF-8 charset in both the Content-Type HTTP response header and a <meta> tag, since it’s easy for one or the other to get switched off or omitted inadvertently as a codebase ages (this has happened to me).

Further reading

As I mentioned in the disclaimer at the top of this post, this is not a comprehensive reference of all the things that can go wrong with HTML escaping. It’s not even a guide. It’s more of a tip-of-the-iceberg preview. Please don’t assume that, having read this post, you now know everything there is to know about HTML escaping. I can guarantee that you don’t, because I don’t.

I learned a lot from the following sources, and I highly recommend them if you’re interested in learning more: