A few days ago I tweeted:
"If I had a dollar for every HTML escaper that only escapes &, <, >, and ", I'd have $0. Because my account would've been pwned via XSS."
This was exaggeration for effect—there aren’t many cases where a simple XSS injection could actually empty a bank account—but I wanted to make a point.
By some coincidence, I’ve found myself working with various open source projects recently that take a half-assed approach to HTML escaping. It’s something that tends to be implemented as an afterthought, which is unfortunate because it can be critical for the security of users of these projects. I won’t name any names in this post (pull requests are forthcoming), but I will explain some of the common problems I’ve seen, why they’re problems, and what can be done to fix them.
This post is not an introduction to HTML escaping. It assumes that you already know what HTML escaping is and why it’s necessary. This post also is not a comprehensive catalog of XSS vectors; the examples here are illustrative, but they certainly aren’t the only attacks you need to worry about. The intent of this post is to explain some dangers that you may not be aware of, and to encourage you to read more about them and write safer code.
Note that this post only discusses escaping, which is something entirely different (and far less complicated) than sanitizing. HTML sanitization is a topic for another time.
Escaping < and > isn’t enough
The worst HTML escaper I’ve seen in a major open source project only escapes the < and > characters. This may actually be worse than not escaping anything at all, since it gives the illusion of security, but is trivial to defeat.
For example, let’s say I have the following template, and I’m going to replace the placeholder values, indicated in [square brackets], with HTML-escaped user input:
<a href="/user/[username]">[username]</a>
The attacker enters foo" onmouseover="alert(1) as their username. End result, even after escaping:
<a href="/user/foo" onmouseover="alert(1)">foo" onmouseover="alert(1)</a>
Because the " character wasn’t escaped and the attacker’s input was used in an attribute value, the attacker was able to inject arbitrary attributes and therefore JavaScript (which, in a real XSS attack, would probably be something more harmful than an alert).
This is a classic example of making input safer in one context—in this case, as the content of an <a> element—without considering the other contexts in which it’s likely to be used, such as inside an attribute value.
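To make the failure concrete, here is a minimal Python sketch. The `escape_angle_brackets` and `render_link` helpers are hypothetical stand-ins for the kind of escaper and template described above, not code from any real project.

```python
def escape_angle_brackets(value: str) -> str:
    """A deliberately inadequate escaper: it handles only < and >."""
    return value.replace("<", "&lt;").replace(">", "&gt;")


def render_link(username: str) -> str:
    """Fill the example template with the 'escaped' user input."""
    safe = escape_angle_brackets(username)
    return f'<a href="/user/{safe}">{safe}</a>'


payload = 'foo" onmouseover="alert(1)'
# The double quote passes through untouched and closes the href attribute.
print(render_link(payload))
```

The printed markup contains a separate onmouseover attribute, even though every < and > in the input would have been escaped.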
Escaping &, <, >, and " isn’t enough
The characters &, <, >, and " are the ones most commonly targeted by HTML escaper implementations. This seems to be the minimum set of characters that people think need to be escaped. Unfortunately, it’s still not safe if you don’t have complete control over where the escaped values will be used.
Consider the following template, in which the template author has used single-quoted attribute values:
<a href='/user/[username]'>[username]</a>
This is exploitable using the same attack as the previous example, but with single quotes instead of double quotes: foo' onmouseover='alert(1):
<a href='/user/foo' onmouseover='alert(1)'>foo' onmouseover='alert(1)</a>
You may be saying, “But I always use double quotes to quote attribute values!” Are you also the only person who will ever use your HTML escaper? And are you immune to typos?
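The same thing can be sketched in Python. Here `escape_four` is a hypothetical escaper covering exactly &, <, >, and ", and `render_single_quoted` is an assumed template function that happens to use single-quoted attributes.

```python
def escape_four(value: str) -> str:
    """Escape &, <, >, and " only. The & must be replaced first."""
    return (
        value.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace('"', "&quot;")
    )


def render_single_quoted(username: str) -> str:
    """A template whose author used single-quoted attribute values."""
    safe = escape_four(username)
    return f"<a href='/user/{safe}'>{safe}</a>"


payload = "foo' onmouseover='alert(1)"
# The single quote is not in the escaped set, so it ends the href value.
print(render_single_quoted(payload))
```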
Escaping &, <, >, ", and ' isn't enough
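This is the character set escaped by PHP’s ubiquitous htmlspecialchars function when it’s called with the ENT_QUOTES flag (without that flag, older PHP versions leave ' unescaped), and as you may have guessed, it still falls down on attribute values for two reasons.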
First, as Hacker News users DanBlake and nbpoole pointed out in a discussion of this blog post, Internet Explorer treats ` as an attribute delimiter. It may be an edge case, but it’s still a potential attack vector, so ` needs to be escaped too.
Second, HTML also allows attribute values to be completely unquoted. Believe it or not, unquoted attribute values are fairly popular (some people are too lazy to quote them, others are performance zealots who can’t bear the thought of wasting those extra bytes).
Unquoted attribute values are one of the biggest XSS vectors around. If you don’t quote your attribute values, you’re essentially leaving the door wide open for naughty people to inject naughty things into your HTML. Very few escaper implementations cover all the edge cases necessary to prevent unquoted attribute values from becoming XSS vectors.
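As a rough illustration, Python's standard `html.escape` with `quote=True` escapes the same five characters (&, <, >, ", and '). The `render_unquoted` template function below is a hypothetical example of a template that leaves the attribute value unquoted.

```python
import html


def render_unquoted(username: str) -> str:
    """A template that leaves the href attribute value unquoted."""
    safe = html.escape(username, quote=True)  # escapes & < > " '
    return f"<a href=/user/{safe}>{safe}</a>"


payload = "foo onmouseover=alert(1)"
# Nothing in the payload is in the escaped set, so it comes out unchanged
# and the space starts a brand new attribute.
print(render_unquoted(payload))
```

No quote characters are needed for this attack at all, which is why escaping both kinds of quotes does not help here.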
Escaping &, <, >, ", ', `, , !, @, $, %, (, ), =, +, {, }, [, and ] is almost enough
All those characters up there (including the space character!) can be used to break out of an unquoted HTML attribute value. If you escape every last one of them, then you’re probably pretty close to being safe. But you’re still not so safe that you can just start throwing around user input willy-nilly.
Why? Because this still doesn’t cover some context-specific cases like inserting user input into the body of an inline <script> element or using user input as part of a URL.
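A minimal sketch of an escaper covering the character list in this section's heading might look like the following. The name `escape_html_strict` and the choice of decimal character references are assumptions made for illustration, and as noted above this still does not make input safe inside a script element or a URL.

```python
# The characters named in the heading above, including the space.
UNSAFE_CHARS = "&<>\"'` !@$%()=+{}[]"

# Map each code point to a decimal character reference, e.g. "<" -> "&#60;".
_ESCAPES = {ord(ch): f"&#{ord(ch)};" for ch in UNSAFE_CHARS}


def escape_html_strict(value: str) -> str:
    """Replace every listed character in a single pass.

    str.translate never re-scans its own output, so the "&" that starts
    each character reference is not escaped a second time.
    """
    return value.translate(_ESCAPES)


print(escape_html_strict("foo onmouseover=alert(1)"))
```

With the space, =, and parentheses encoded, the earlier payload can no longer end an unquoted attribute value and start a new attribute.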
Context is key
If you haven’t figured it out already, the primary message I’m trying to convey here is that you must be aware of the context in which you’re working with user input. Some contexts are more susceptible to attack than others, and there’s no single magic escaping bullet that will protect you or your users in all cases.
In other words, you don’t need to escape everything all the time, but you do need to escape everything that’s important in the particular contexts in which you’re displaying user input.
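One way to picture this is to give each context its own escaper and apply them in the order the browser will undo them. The sketch below is an assumption-laden example, not a complete solution: the helper names are hypothetical, it covers only HTML text, a double-quoted attribute, and a single URL path segment, and it relies on the template always quoting its attributes.

```python
import html
from urllib.parse import quote


def escape_html(value: str) -> str:
    """For element content and double-quoted attribute values."""
    return html.escape(value, quote=True)


def escape_url_path_segment(value: str) -> str:
    """Percent-encode everything except unreserved URL characters."""
    return quote(value, safe="")


def render_link(username: str) -> str:
    # URL context first, then HTML attribute context on top of it.
    href = escape_html("/user/" + escape_url_path_segment(username))
    # Plain HTML text context for the link label.
    text = escape_html(username)
    return f'<a href="{href}">{text}</a>'


print(render_link('a b"/<'))
```

The same input ends up encoded two different ways in the same line of markup, because it appears in two different contexts.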
But there’s still one more wrench to throw into the works…
Always specify a charset, or UTF-7 will eat your face
Even if you do everything else right, serving a page that doesn’t explicitly specify a character set can leave Internet Explorer users open to XSS, thanks to the way IE sniffs out the charset when it isn’t specified.
If an attacker is able to get your page to echo back something that looks like UTF-7 encoding early enough in the page, he may be able to trick IE into rendering the page using UTF-7. This could turn the following seemingly harmless input…
+ADw-script+AD4-alert(1)+ADw-/script+AD4-
…into something potentially harmful:
<script>alert(1)</script>
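The transformation can be reproduced with Python's built-in UTF-7 codec. This is only a demonstration of the decoding step; it does not model how a browser decides which charset to use.

```python
import html

payload = "+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# A five-character HTML escaper finds nothing to escape in the payload.
escaped = html.escape(payload, quote=True)
print(escaped == payload)

# Interpreted as UTF-7, the same bytes decode to a script element.
decoded = escaped.encode("ascii").decode("utf-7")
print(decoded)
```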
I recommend specifying a UTF-8 charset in both the Content-Type HTTP response header and a <meta> tag, since it’s easy for one or the other to get switched off or omitted inadvertently as a codebase ages (this has happened to me).
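As a sketch of what declaring the charset in both places could look like, here is a bare WSGI application in Python. The `PAGE` content and the `app` name are placeholders; a real application would more likely configure this in its web framework or server.

```python
PAGE = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Example</title>
</head>
<body>Hello</body>
</html>
"""


def app(environ, start_response):
    """Serve PAGE with the charset declared in the header and the markup."""
    body = PAGE.encode("utf-8")
    headers = [
        # Declaration 1: the Content-Type HTTP response header.
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    # Declaration 2: the <meta> tag inside PAGE itself.
    return [body]
```

To try it locally, the standard library's `wsgiref.simple_server.make_server` can serve `app` on a port of your choosing.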
Further reading
As I mentioned in the disclaimer at the top of this post, this is not a comprehensive reference of all the things that can go wrong with HTML escaping. It’s not even a guide. It’s more of a tip-of-the-iceberg preview. Please don’t assume that, having read this post, you now know everything there is to know about HTML escaping. I can guarantee that you don’t, because I don’t.
I learned a lot from the following sources, and I highly recommend them if you’re interested in learning more:
Comments
Remember string literals in JS
If you include untrusted strings as string literals in inline JavaScript, be sure to escape the less-than sign as \u003C. Otherwise, an attacker can inject stuff that is sensitive to how the HTML tokenizer tokenizes over the inline script.
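A rough Python sketch of that suggestion follows. The `js_string_literal` helper is hypothetical; it leans on `json.dumps` to produce a quoted string and then rewrites <, >, and & as \u escapes. It is only meant for a value placed directly in a string literal position, and other JavaScript contexts need more care.

```python
import json


def js_string_literal(value: str) -> str:
    """Return a double-quoted JS string literal with no raw <, >, or &."""
    # json.dumps escapes quotes, backslashes, and control characters, and
    # with its default ensure_ascii=True also escapes all non-ASCII text.
    literal = json.dumps(value)
    return (
        literal.replace("<", "\\u003C")
        .replace(">", "\\u003E")
        .replace("&", "\\u0026")
    )


payload = "</script><script>alert(1)</script>"
print(f"<script>var name = {js_string_literal(payload)};</script>")
```

Because the literal never contains a raw <, the HTML tokenizer has nothing that looks like a closing script tag to latch onto.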
The Spanner
Great read, thanks for posting!
I’ve followed Gareth Heyes on his blog The Spanner for years — http://www.thespanner.co.uk/, where he shares some of the findings from his work.
Re: Remember string literals in JS
@Henri: See the OWASP guide’s section on escaping untrusted data for use in JavaScript. You’ll need to do a bit more than just escaping <, and in some contexts even fully escaped strings can still be unsafe.
Quoting attribute values is best practice
Personally I’d just consider quoting attribute values to be best practice.
“Escaping < and > isn’t enough”
I think these HTML escapers are not for attribute values. They are for spewing text into the content of an element.
Re: Quoting attribute values is best practice
@Yuhong: Attribute values should definitely always be quoted. But authors of library- and framework-level tools such as template languages can’t assume that their users will always adhere to best practices.
For that reason, authors of these tools need to be responsible about which characters they escape, and about documenting what they escape so that users aren’t left assuming they’re safe when they actually might not be. Better safe than sorry.
The text has a font issue
Something is wrong with the font on your site. The letter d has some sort of slash attached to it in Firefox 4 and Chrome 11 on Windows 7.
http://i.imgur.com/yaL97.png
Re: The text has a font issue
@Gigi: Weird. Thanks for pointing that out. I’m using Charis SIL via font-face, and it looks great on my Mac, but now that you mention it I do see the strange “d” in my Windows 7 VM.
Don't forget about "a href"
<a href="[untrusted]"> is another special context, because you have to make sure they’re not using the javascript: pseudo protocol (and if you thought it had to start with “javascript:” to trigger JS parsing, you’d be wrong).
When escaping the less-than sign, use "\u003C"
@Ryan: you misunderstood what Henri said. Henri is the developer of the HTML5 validator and the author of the HTML5 parsing engine of Firefox. Henri said: When escaping the less-than sign, use “\u003C”. He doesn’t suggest that you should only do that ;)
You forgot one!
You probably want to escape colon (:) too, in that long list of characters.
However, personally I’ve come to like client-side templating engines better. Serve an HTML template, and populate it with jQuery text() or something like Pure. Not as SEO friendly, but I believe you can generally find a balance that works.
Nice article...
… and an eye opener or as for me: a kick in the butt to read up on security in web programming. Thanks.
Strip_tags
Hi, thanks for this article.
What do you think of using a function like strip_tags (PHP)?
Re: Strip_tags
@Pascal: strip_tags() is a live grenade that’s almost guaranteed to blow up in your face. There are many, many ways to use it unsafely and very few ways to use it safely, so I recommend avoiding it altogether.
Here are just a few of the problems with strip_tags():
It doesn’t properly handle unbalanced < and > characters. Unbalanced brackets can result in > characters being left in the string.
It doesn’t escape ", ', `, or any of the other characters that are unsafe inside attribute values.
If you allow certain tags via the second argument to strip_tags(), then it will leave those tags and all of their attributes. This virtually guarantees an XSS vulnerability, since attributes can be used to execute JavaScript on almost any element.
There are more problems, but hopefully this gives you some idea of why strip_tags() is best avoided. In short, strip_tags() tries to be an HTML sanitizer, but it doesn’t do many of the very important things that an HTML sanitizer must do to actually sanitize HTML.
This is a huge issue when input moves from one user to storage and then out to other users. Can you name other instances where this can be used? As far as a user hacking his own webpage, there is no way around this, since Greasemonkey and other tools allow script injection.