HTML vs. bulk

John R Levine
Sat Jan 4 04:10:11 UTC 2003

> I'm thinking about defining a FUZ3 checksum that would ignore text
> bounded by <html>...</html> and otherwise be similar to the FUZ2
> checksum.

If you're going to do something like that, I'd parse enough of the MIME
headers to pick the plain text out of multipart/alternative and checksum
that.  MIME in general is hugely complex, but picking out one known part
is easy to do in a single simple scan.
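
For what it's worth, here is a rough Python sketch of that idea. It
leans on the standard library's email parser rather than the hand-rolled
single scan I have in mind above, and the MD5 call is only a stand-in,
since this is not the FUZ2 algorithm. The names plain_text_part and
checksum_plain_text are made up for illustration.

```python
import hashlib
from email import message_from_string


def plain_text_part(raw_message):
    """Return the first text/plain body found in the message, or None."""
    msg = message_from_string(raw_message)
    # walk() yields the message itself and then every nested MIME part,
    # so this also covers a message that is not multipart at all.
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # undoes base64 / QP
            if payload is None:
                continue
            # Assumption: fall back to latin-1 when no charset is declared.
            charset = part.get_content_charset() or "latin-1"
            return payload.decode(charset, errors="replace")
    return None


def checksum_plain_text(raw_message):
    """Checksum only the plain text part; MD5 is a placeholder, not FUZ2."""
    text = plain_text_part(raw_message)
    if text is None:
        return None
    return hashlib.md5(text.encode("utf-8")).hexdigest()
```

A message with no text/plain part at all comes back as None, so the
caller still has to decide what to checksum in that case.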

If you want to goof around with HTML, take out <!--html comments--> to
catch this month's trendy hashbuster, which sticks strings related to the
victim's address into comments in the HTML.
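
Stripping the comments could look something like the following Python
sketch. It is a plain regex, not a real HTML parser, so it only handles
well-formed <!-- ... --> pairs and leaves an unterminated comment alone.
The name strip_html_comments is made up for illustration.

```python
import re

# Non-greedy match so each comment is removed separately; DOTALL lets a
# comment span several lines.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(html):
    """Remove <!-- ... --> comments before the text is checksummed."""
    return COMMENT_RE.sub("", html)
```

Two copies of a message that differ only in the per-victim comment then
come out identical and checksum the same.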

