Re: Warning: Trolling for Dollars, Scammers hard at work {was: forged Failure Notice}
- Subject: Re: Warning: Trolling for Dollars, Scammers hard at work {was: forged Failure Notice}
- From: Norman Bauman nbauman@xxxxxxxx
- Date: Fri, 18 Jul 2003 12:46:16 -0400
At 06:32 AM 7/18/03 -0400, Robert Holmgren wrote:
>
>May I infer that the Eudora filter has a set of "hard" criteria for
>distinguishing real Email from spam?
The way I set it up it did. I have an email list with about 600
subscribers, and I was always getting bounced mail, so I set up filters
with the most common bounce phrases. (Strangely enough, they are not
standardized among servers.) I still don't understand why it's sending my
XyWrite mail to the bounce mailbox, since I filter XyWrite messages to the
XyWrite mailbox first.
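
A minimal sketch of how ordered phrase filters of this kind might behave. The bounce phrases, the mailbox names, and the stop_at_first_match switch are all hypothetical and are not taken from Eudora. It only illustrates one possible cause of the misfiling: if processing does not stop at the first matching rule, a later bounce rule can override the earlier XyWrite rule.

```python
# Hypothetical bounce phrases; real servers word these differently.
BOUNCE_PHRASES = ("delivery failed", "user unknown", "undeliverable", "mailbox full")


def route(subject, body, stop_at_first_match=True):
    """Return the mailbox a message ends up in under two ordered rules."""
    text = (subject + "\n" + body).lower()
    # Rules are checked in order: the XyWrite rule first, then the bounce rule.
    rules = [
        ("XyWrite", lambda: "xywrite" in text),
        ("Bounces", lambda: any(p in text for p in BOUNCE_PHRASES)),
    ]
    mailbox = "In"  # default mailbox when no rule matches
    for name, matches in rules:
        if matches():
            mailbox = name
            if stop_at_first_match:
                break  # later rules never see the message
    return mailbox
```

With stop_at_first_match=True, a XyWrite message that quotes a bounce notice stays in the XyWrite mailbox. With False, the later bounce rule moves it to Bounces.
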
>Bayesian filters are more sophisticated,
>and worth investigating. Bayesian analyzes the Email by picking the 15 or so
>"most interesting/unusual" words in the Email. It then compares this msg with
>a fairly large, personal corpus of (maybe 200+) Emails that you, the user, have
>read and declared to be either good or bad, and makes a determination.
Right, one of the developers has a web site, which I read with interest,
and he claimed it worked very well (for him, anyway).
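
A rough sketch of the idea described above, assuming a simplified naive-Bayes scheme: score each word by how often it appeared in mail marked good versus bad, keep the 15 or so words furthest from neutral, and combine them. The class name TinyBayes, the tokenizer, and the smoothing are illustrative assumptions, not the method of any particular program.

```python
import re
from collections import Counter


def tokens(text):
    """Split a message into a set of lowercase word-like tokens."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


class TinyBayes:
    def __init__(self, top_n=15):
        self.top_n = top_n  # how many "most interesting" words to use
        self.counts = {"good": Counter(), "bad": Counter()}
        self.totals = {"good": 0, "bad": 0}

    def train(self, text, label):
        """Add one message the user has marked 'good' or 'bad' to the corpus."""
        self.counts[label].update(tokens(text))
        self.totals[label] += 1

    def word_prob(self, word):
        """Smoothed estimate that a message containing this word is bad."""
        g = (self.counts["good"][word] + 1) / (self.totals["good"] + 2)
        b = (self.counts["bad"][word] + 1) / (self.totals["bad"] + 2)
        return b / (g + b)

    def spam_prob(self, text):
        """Combine the top_n words whose probabilities are furthest from 0.5."""
        probs = sorted(
            (self.word_prob(w) for w in tokens(text)),
            key=lambda p: abs(p - 0.5),
            reverse=True,
        )[: self.top_n]
        spam = ham = 1.0
        for p in probs:
            spam *= p
            ham *= 1.0 - p
        return spam / (spam + ham)
```

Re-marking a misclassified message is just another call to train() with the corrected label, which shifts the word scores for the next message.
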
>Thus,
>what is a "bad" message to you might be a "good" msg to a pedophile miscreant
>-- it's completely individualized.
Or in my case, since I write about urology, I don't want to miss a message
from somebody offering me $20,000 to write a report on phosphodiesterase
inhibitors, the most famous of which I will not mention because it will
activate spam filters. Nor will I mention one of the organs that urologists
frequently deal with.
>It takes a few days to build up your
>good/bad corpus, but once you've done so, the Bayesian filter almost never
>makes a mistake. I mean, I really am getting 100-150 spams/day, believe it or
>not, and the last time my filter made a mistake was, well, sometime last month.
>The mistake itself, after being manually re-marked as "bad" (or "good", as the
>case may be) adds to the intelligence of the filter: it won't make the same
>mistake, or one like it, again.
What's the program? I believe that SpamAssassin has a Bayesian filter, in
addition to customizable blacklists and whitelists. My ISP installed
SpamAssassin, but unfortunately they were not competent enough to install
it properly, which would allow me to customize it for myself. It does do a
surprisingly good (but not perfect) job of discriminating spam.
Unfortunately there are workarounds for spammers -- some spam doesn't have
any text in the message, just an html link.
I would have to admire anyone who could write a Bayesian protocol to
effectively eliminate spam. I wrote a few stories on text search programs
several years ago, and I interviewed some of the top information scientists
in the country (http://www.nasw.org/users/nbauman#txtsrch). They told me
that they could distinguish some concepts which were closely associated
with key words, especially if they were legal terms of art. For example, if
a lawyer had a case of a child injured on monkey bars, it was easy to go to
Lexis and find all the previous cases involving monkey bars. Or you could
easily find a name. But you couldn't easily find concepts.
Some vendors claimed that their programs could figure out meaningful
information about the subject of the message from the frequency, etc., of
key words in the text, even if those key words weren't directly related to
concepts. They got a *lot* of money from the CIA to develop programs to
scan international telex traffic, but the programs never worked. The idea
was, somebody was going to send a message, "Boris, could you send me 10
terrorist bombs, thanks, Abdul," and the software would catch it. (One
program actually had a list of keywords like "terrorist".)
Of course, real terrorists used code words and encryption. There was no
machine-discernible pattern that would give a clue that somebody was a
terrorist or lawbreaker. And when lawyers tested the commercial versions of
these programs in real-world settings, they never worked. There was nothing
that could improve on Lexis-type searches, and even the complex Boolean
Lexis-type searches didn't usually improve on the simple Lexis-type searches.
One of these days I have to look for a magazine that will pay me $1,000 or
$2,000 to spend a week or two writing an update.
Norman
-------------------------------------------------------
Norman Bauman
411 W. 54 St. Apt. 2D
New York, NY 10019
(212) 977-3223
http://www.nasw.org/users/nbauman
Alternate address: nbauman@xxxxxxxx
-------------------------------------------------------