HTML vs. bulk

John R Levine johnl@iecc.com
3 Jan 2003 23:10:11 -0500


> I'm thinking about defining a FUZ3 checksum that would ignore text
> bounded by <html>...</html> and otherwise be similar to the FUZ2
> checksum.

If you're going to do something like that, I'd parse enough of the MIME
headers to pick the plain text out of multipart/alternative and checksum
that.  MIME in general is hugely complex, but picking out one known part
is easy to do in a single simple scan.
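
As a rough illustration of picking out the plain text part, here is a
minimal sketch. It leans on Python's standard email parser instead of the
hand-rolled single scan suggested above, and the function name
plain_text_part is made up for this example. It returns the first
text/plain part it finds, whether or not that part sits inside
multipart/alternative.

```python
from email import message_from_string


def plain_text_part(raw_message):
    """Return the first text/plain body in raw_message, or None.

    Hypothetical helper for illustration; a checksum would then be
    computed over the returned text instead of the whole message.
    """
    msg = message_from_string(raw_message)
    # walk() visits the message itself and every nested MIME part.
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        # decode=True undoes base64 / quoted-printable and yields bytes.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "ascii"
        try:
            return payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset label: fall back to a byte-preserving decode.
            return payload.decode("latin-1")
    return None
```

A message with no text/plain part yields None, so the caller still has to
decide what to checksum in the HTML-only case.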

If you want to goof around with HTML, take out <!--html comments--> to
catch this month's trendy hashbuster, which sticks strings related to the
victim's address into comments in the HTML.
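
Stripping the comments before checksumming could look something like the
sketch below. It assumes a simple regular expression is good enough for
this purpose (it is not a real HTML parser), and the name
strip_html_comments is hypothetical.

```python
import re

# Non-greedy match so each comment is removed separately; DOTALL lets a
# comment span several lines.
_HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(html):
    """Return html with every <!-- ... --> comment removed.

    An unterminated "<!--" is left in place, since the pattern needs
    the closing "-->" to match.
    """
    return _HTML_COMMENT.sub("", html)
```

With the comments gone, two copies of a message that differ only in the
per-recipient junk inside the comments reduce to the same text.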

Regards,
John Levine, johnl@iecc.com, Primary Perpetrator of "The Internet for Dummies",
Information Superhighwayman wanna-be, http://iecc.com/johnl, Sewer Commissioner
"I dropped the toothpaste", said Tom, crestfallenly.