Comment spam that got through

I get emails from this site when someone comments.

This morning, this showed up:

Name: Virtual Pharmacy
Email: [snipped]
URL: [snipped]
Hostname: 114.199.36.72.reverse.layeredtech.com (72.36.199.114)
Entry URL: http://www.semicomplete.com/blog/geekery/barcamp-sanfrancisco-2
Comment location: [snipped]

Everyone repeat, what alcohol should be consumed moderately, but what it means? Why to women
 recommend to drink more moderately than to men? What is the female alcoholism? WBR LeoP

A quick Google search for the strange trailing token, "WBR LeoP", gives a clear indication that this is comment spam (as if the content didn't give it away).

The url the spammer used points at pharmacynewsblog.com, which looks like a normal blog.

It's not.

The content is entirely Viagra-and-friends related, which is fine. However, look at this snippet of visible text from the front page:

Drug treatment may beat psychotherapy at ...

Google this phrase and you'll find that it's been plagiarized, but deliciously so. View the page source and you'll see:

<p>Drug <b class=ne>joint pain are </b>treatment <BLINK class=ne>of
purchase </BLINK>may <sup class=ne>wellbutrin at </sup>beat <small
class=ne>and paxil vs </small>psychotherapy

The CSS class 'ne' sets 'display: none', among other properties, so the injected text never shows up when the browser renders the page.

This is quite clever, and appears automated.

pharmacynewsblog.com seems to be a somewhat autogenerated spam blog that takes news postings about Viagra and the like and injects random HTML into them, with the intention of defeating anti-spam solutions. Anti-spam engines probably aren't smart enough to know that they should ignore the pieces of text that are invisible. Who knows.
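For the curious, here is a rough sketch of how a filter could recover the visible text by dropping anything inside an element whose class is known to be hidden. It assumes you already know the hidden class name ('ne' here); a real filter would have to read the stylesheet to find that out, and the names `VisibleTextExtractor` and `visible_text` are made up for this example. It also assumes hidden elements are properly closed, as they are in the snippet above.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "wbr"}


class VisibleTextExtractor(HTMLParser):
    """Collect text, skipping anything inside an element with a hidden class."""

    def __init__(self, hidden_classes):
        super().__init__()
        self.hidden_classes = set(hidden_classes)
        self.hidden_depth = 0  # greater than 0 while inside a hidden element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        # Once inside a hidden element, count every nested tag as hidden too.
        if self.hidden_depth or self.hidden_classes.intersection(classes):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.chunks.append(data)


def visible_text(html, hidden_classes=("ne",)):
    """Return the text a browser would show, with whitespace collapsed."""
    parser = VisibleTextExtractor(hidden_classes)
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of input
    return " ".join("".join(parser.chunks).split())
```

Feeding it the snippet from the spam blog should give back the plain, plagiarized headline text.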

But, back to the spam comment. I use JavaScript to poke parts of the comment form, indicating that a JavaScript-capable browser was used to submit the comment. If JavaScript is not detected, the comment is denied.
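The server side of that check can be tiny. This is only a sketch of the idea, not the plugin's actual code: the field name `secret`, the token value, and the function `check_comment` are all hypothetical. The page's JavaScript is assumed to fill the field in before the form is submitted.

```python
import hmac

# Hypothetical names: neither the field name nor the token value comes from
# the real plugin; they stand in for whatever the page's JavaScript fills in.
TOKEN_FIELD = "secret"
EXPECTED_TOKEN = "filled-in-by-javascript"


def check_comment(form):
    """Return (accepted, reason) for a submitted comment form (a dict)."""
    token = form.get(TOKEN_FIELD, "")
    # compare_digest avoids leaking the token through timing differences.
    if hmac.compare_digest(token.encode("utf-8"), EXPECTED_TOKEN.encode("utf-8")):
        return True, ""
    return False, "Invalid secret token: %r" % token
```

A client that never ran the script submits an empty or stale value and gets rejected. A client that did run it, bot or human, sails through.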

This comment got through, which means that JavaScript was enabled, which means that it was probably a web browser that did it.

Here's the Apache log snippet:

72.36.199.114 - - [29/Jan/2007:13:01:17 -0500] "GET /blog/geekery/barcamp-sanfrancisco-2.html HTTP/1.1" 200 15903 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
72.36.199.114 - - [29/Jan/2007:13:01:18 -0500] "GET /style.css HTTP/1.1" 200 2584 "http://www.semicomplete.com/blog/geekery/barcamp-sanfrancisco-2.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
72.36.199.114 - - [29/Jan/2007:13:01:20 -0500] "POST /blog/geekery/barcamp-sanfrancisco-2 HTTP/1.1" 200 16392 "http://www.semicomplete.com/blog/geekery/barcamp-sanfrancisco-2.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
72.36.199.114 - - [29/Jan/2007:13:01:21 -0500] "GET /style.css HTTP/1.1" 200 2584 "http://www.semicomplete.com/blog/geekery/barcamp-sanfrancisco-2" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

It didn't fetch any images, but it did pull style sheets, which is strange behavior for a simple spam bot that doesn't care about how a page looks. It also pulled the blog posting page first, then submitted a comment. That's further indication that this bot is either really clever, or a person is behind the wheel.
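The same questions can be asked of the log mechanically. Here's a minimal sketch, assuming the log is in Apache's combined format like the lines above; the function name `summarize_client` and the list of image suffixes are my own choices for illustration.

```python
import re

# One line of Apache "combined" log format.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Assumed set of image extensions; extend as needed.
IMAGE_SUFFIXES = (".png", ".gif", ".jpg", ".jpeg", ".ico")


def summarize_client(lines, ip):
    """Describe what one client IP fetched, in log order."""
    summary = {
        "posted": False,
        "page_before_post": False,
        "fetched_css": False,
        "fetched_images": False,
    }
    seen_page_get = False
    for line in lines:
        match = LOG_LINE.match(line)
        if not match or match.group("ip") != ip:
            continue
        method = match.group("method")
        path = match.group("path").split("?", 1)[0].lower()
        if method == "POST":
            summary["posted"] = True
            # Only counts if a page was fetched earlier in the log.
            summary["page_before_post"] = summary["page_before_post"] or seen_page_get
        elif method == "GET":
            if path.endswith(".css"):
                summary["fetched_css"] = True
            elif path.endswith(IMAGE_SUFFIXES):
                summary["fetched_images"] = True
            else:
                seen_page_get = True
    return summary
```

Run over the four lines above for 72.36.199.114, it should report a page fetch before the POST, style sheets fetched, and no images.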

If you search for the IP, 72.36.199.114, the first hit on Google is an automagically updated list of known comment spam hosts.

One anti-spam effort too easily defeated.

I often see people write their email addresses as "foo at bar dot org" in a hopeful effort to keep spammers from scraping them. Heck, mail archive systems often have (and are deployed with) options to obfuscate email addresses systematically, using the same pattern: foo at bar dot com.

All it does is hurt usability.

Googling for "* at * dot *" clearly shows lots of matches. It also matches all of the following variants, because Google searches ignore brackets and similar punctuation in words:

  • foo at bar dot com
  • foo [at] bar [dot] com
  • foo (at) bar (dot) com
  • ... etc ...
Query, scrape, replace 'at' and 'dot' as desired. I now have 54 million email addresses. What now?
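To show how little work the "replace" step is, here is a sketch that undoes the obfuscation in a chunk of already-fetched text. The function name `deobfuscate` and the exact pattern are mine, and it only handles the variants listed above. It will also produce false positives on ordinary phrases like "look at foo dot com", which a scraper would not care about.

```python
import re

# "at" / "dot" either wrapped in brackets or parens, or set off by whitespace.
AT = r"(?:\s*[\[(]\s*at\s*[\])]\s*|\s+at\s+)"
DOT = r"(?:\s*[\[(]\s*dot\s*[\])]\s*|\s+dot\s+)"

OBFUSCATED = re.compile(
    r"\b([\w.+-]+)" + AT + r"([\w-]+(?:" + DOT + r"[\w-]+)+)\b",
    re.IGNORECASE,
)
DOT_RE = re.compile(DOT, re.IGNORECASE)


def deobfuscate(text):
    """Return plain addresses for every 'foo at bar dot com' style match."""
    addresses = []
    for match in OBFUSCATED.finditer(text):
        local, domain = match.group(1), match.group(2)
        # Turn each obfuscated "dot" in the domain part back into ".".
        addresses.append(local + "@" + DOT_RE.sub(".", domain))
    return addresses
```

Two regular expressions and a loop. That is the whole defense being defeated.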

Seems like this effort only lets people fool themselves while impeding usability. It certainly won't protect you from spam. Why is this method used?

Antispam pyblosxom plugin, followup!

REJECT: Comment attempt by 210.113.83.6 rejected. Reason: Invalid secret token: 'pleaseDontSpam'
REJECT: Comment attempt by 210.120.79.179 rejected. Reason: Invalid secret token: 'pleaseDontSpam'
REJECT: Comment attempt by 200.156.25.4 rejected. Reason: Invalid secret token: 'pleaseDontSpam'
REJECT: Comment attempt by 220.125.164.243 rejected. Reason: Invalid secret token: 'pleaseDontSpam'
REJECT: Comment attempt by 69.57.136.39 rejected. Reason: Invalid secret token: 'pleaseDontSpam'
...
The list goes on. Well over 50 invalid tokens were found. The 'pleaseDontSpam' was the original secret token I used. Just goes to show that, for the moment, most spam bots don't review the page before submitting.
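Counting these by hand gets old fast. A small sketch for tallying which stale tokens bots are replaying, assuming the log lines look exactly like the ones above (the function name `tally_rejected_tokens` is made up):

```python
import re
from collections import Counter

# Matches the REJECT lines shown above and captures the offending token.
REJECT_LINE = re.compile(
    r"REJECT: Comment attempt by (?P<ip>\S+) rejected\. "
    r"Reason: Invalid secret token: '(?P<token>[^']*)'"
)


def tally_rejected_tokens(lines):
    """Count rejected comment attempts per submitted token."""
    counts = Counter()
    for line in lines:
        match = REJECT_LINE.search(line)
        if match:
            counts[match.group("token")] += 1
    return counts
```

Calling `tally_rejected_tokens(open(path))` on the plugin's log file, where `path` is wherever your log lives, gives a per-token count.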

Admittedly, 2 spams got through; I have not yet investigated why.