How the Web-Based Email Spam Filter Works: A Primer
Introduction
The Spam Xploder spam filter, which is an integrated feature of the Web-Based
Email mail client, is a service that screens incoming mail at the server
level. Through the Web-Based Email interface, the user can train the filter,
thus gradually improving the filter's ability to detect incoming bulk mail.
About Spam Filtering
Spam filtering is the practice of detecting and intercepting unwanted bulk mail
(or "spam") before it reaches a recipient's mailbox. Generally, spam filters
detect bulk mail through the occurrence of certain phrases and known spammer
IP addresses in incoming mail. However, because distributors of spam are
increasingly innovative in their efforts to circumvent the spam filters that
protect email users' mailboxes, spam is a moving target, and developers of
spam-filtering technology are constantly being challenged in their quest to
keep the bulk-mail onslaught at bay. Thus, in order to effectively shield
email users from spam, a spam filter must be flexible. The Spam Xploder spam
filter, therefore, enables users to personalize the filter by training it to
detect and intercept mail that fits each user's particular preferences and
definition of spam.
How the Filter Functions
The server-side Spam Xploder spam filter works in conjunction with the
client-side (end user) Web-Based Email interface. In essence, the end user
utilizes the client-side interface to submit selected email messages for spam
analysis. By analyzing the messages the spam filter compiles information that
enables it to detect and intercept spam. As an increasing number of email
messages are analyzed, the filter becomes increasingly adept at intercepting
electronic mail that this particular user considers spam.
On the server end, the Spam Xploder spam filter works as follows:
- When the mail program (i.e., Web-Based Email) receives an email message,
a connection to the Spam Xploder server is established. The Mail Transfer
Agent accepts these incoming connections and receives the incoming email
message. Messages enter the system and are handed to the spam filter.
- The spam filter first strips out the "From:" and "Reply-to:" addresses
from the message. These addresses are then compared to the user-defined
whitelist of known good senders. If the address is whitelisted, filtering is
complete and the message is delivered to the user's inbox without further
ado.
- If the sender is not whitelisted, the addresses are compared to the
user-defined blacklist. If an address is blacklisted, that message is
treated as spam and delivered to the "Bulk Mail" (or equivalent) folder of
the user's email program. That completes the filtering process for that
particular message.
- If the filter concludes that a message is neither white- nor
blacklisted, the message is subjected to the spam filter's statistical
filtering analysis. The statistical filter uses a rigorous Bayesian analysis
to determine if a message is spam.
- The statistical filter starts the dissection by breaking the message
content into a list of unique tokens. A token is a word or any
string of identifiable characters, such as dollar amounts and HTML tags.
Once a complete list of tokens has been generated for a message, the list
will be analyzed.
- The analysis relies on two datasets. The first is the user dataset,
which evolves as a user trains the filter. The user dataset is a
personalized list of tokens compiled from the actual email that a user has
received. The second dataset is the general dataset, which is a list of
tokens intended to represent the average user. Each list entry in the user
and general datasets consists of a token and an indication of the
probability that an email message containing the token is bulk mail.
- The analysis consists of comparing tokens found in the message to the
user and general datasets. The user data is searched first, and the general
data is searched only when a token cannot be found in the user data. If a
token from the message is found in either dataset, the probability score for
that token is noted. The probability score for a token indicates the
probability that a message containing that token is spam. The overall
probability is a statistical score that is calculated over the entire email
message. The result of the calculation is a number between 0 and 99: the
higher the number, the higher the probability that the message is spam.
- Once a complete list of probabilities has been compiled, the
probabilities are used to calculate the message spam score. This score
reflects the overall probability that the message is spam.
- Messages determined to be spam by the statistical filter are dropped
into the "Bulk Mail" (or equivalent) folder of the user's mailbox. Messages
that are not considered spam are sent to the user's "Inbox." At this point,
message filtering is complete.
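The filtering steps above can be sketched in Python as follows. This is a minimal illustration, not Spam Xploder's actual implementation. The function names, the tokenizer pattern, the clamping of probabilities, the naive-Bayes-style combining formula, and the threshold of 90 are all assumptions made for the example. The source specifies only the order of the checks, the user-then-general dataset lookup, and the 0 to 99 score range.

```python
import math
import re

# Hypothetical tokenizer: HTML tags, dollar amounts, and words count as tokens.
TOKEN_RE = re.compile(r"</?[A-Za-z][^>]*>|\$\d+(?:\.\d+)?|[A-Za-z0-9'-]+")


def tokenize(text):
    """Return the set of unique, lowercased tokens in a message body."""
    return {tok.lower() for tok in TOKEN_RE.findall(text)}


def token_probability(token, user_data, general_data):
    """Search the user dataset first; fall back to the general dataset."""
    if token in user_data:
        return user_data[token]
    return general_data.get(token)  # None when the token is in neither dataset


def spam_score(tokens, user_data, general_data):
    """Combine per-token spam probabilities into a score from 0 to 99.

    Assumption: a naive-Bayes-style combination, computed in log space.
    """
    log_spam = 0.0
    log_good = 0.0
    for token in tokens:
        p = token_probability(token, user_data, general_data)
        if p is None:
            continue  # unknown tokens contribute no evidence
        p = min(max(p, 0.01), 0.99)  # clamp so log() never sees 0
        log_spam += math.log(p)
        log_good += math.log(1.0 - p)
    # prod(p) / (prod(p) + prod(1 - p)), rewritten to avoid underflow
    diff = max(-700.0, min(700.0, log_good - log_spam))
    prob = 1.0 / (1.0 + math.exp(diff))
    return min(99, int(prob * 100))


def classify(sender_addresses, body, whitelist, blacklist,
             user_data, general_data, threshold=90):
    """Return the destination folder for one message."""
    addresses = {addr.lower() for addr in sender_addresses}
    if addresses & whitelist:
        return "Inbox"      # whitelisted: deliver without further checks
    if addresses & blacklist:
        return "Bulk Mail"  # blacklisted: treat as spam
    score = spam_score(tokenize(body), user_data, general_data)
    return "Bulk Mail" if score >= threshold else "Inbox"
```

Here `sender_addresses` stands for the "From:" and "Reply-to:" addresses taken from the message, and each dataset is modeled as a plain dictionary that maps a token to its spam probability. A message with no known tokens scores 50 in this sketch, which falls below the assumed threshold and is delivered to the inbox.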
Training the spam filter is the process of submitting email messages for spam
analysis, thus gradually increasing the "intelligence" of the spam filter.
That way, as the spam filter compiles data, it becomes increasingly adept
at detecting incoming spam.
In training the spam filter, the user can mark a message as either "spam" or
"not spam." The process then proceeds thus:
- Messages selected for training are passed to the server, along with a
"flag" indicating whether the message should be considered spam or "good"
mail. These messages undergo a content analysis similar to that of the
statistical filter. The message is broken into tokens. Tokens are added to a
list and
counted. The list of tokens and counts is then analyzed.
- The analysis consists of comparing tokens found in the message against
the user dataset. Each token is searched for in the user data. If a token
from the message is found in the user dataset, the previous spam and good
mail counts for that token are retrieved. The counts are updated based on
the "spam"/"not spam" flag, and the new spam probability is calculated for
the token.
- If the token is not found, a new record is added to the user dataset for
the token, and the spam probability is calculated.
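The training update described above might look like the following in Python. This is a sketch under stated assumptions: the `TokenRecord` layout, the `train` function name, and the probability estimate (the add-one smoothed share of a token's occurrences that came from spam) are illustrative choices. The source does not say which formula Spam Xploder uses to recalculate a token's spam probability.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TokenRecord:
    """Per-token counts kept in the user dataset (hypothetical layout)."""
    spam_count: int = 0
    good_count: int = 0

    @property
    def spam_probability(self) -> float:
        # Assumed estimate: add-one smoothed fraction of occurrences in spam.
        return (self.spam_count + 1) / (self.spam_count + self.good_count + 2)


def train(user_data, tokens, is_spam):
    """Update the user dataset from one message flagged spam or not spam."""
    for token, count in Counter(tokens).items():
        # Reuse the existing record, or add a new one for an unseen token.
        record = user_data.setdefault(token, TokenRecord())
        if is_spam:
            record.spam_count += count
        else:
            record.good_count += count
    return user_data


# Example: one message marked "spam", then one marked "not spam".
user_data = {}
train(user_data, ["free", "offer", "free"], is_spam=True)
train(user_data, ["free", "meeting"], is_spam=False)
```

Storing counts rather than a bare probability lets each new training message adjust the estimate incrementally, which matches the description of retrieving and updating the previous spam and good mail counts.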
The spam filter's user data evolves and grows as more messages are
analyzed. More tokens are added, and the probability scores are refined until
the user has a well-defined set of personalized tokens commonly found in
his/her incoming bulk and "good" mail. This adaptive scoring ensures that each
user has a different definition of spam and good mail, thus making it very
difficult to distribute mass mailings that evade the recipients' individually
configured spam filters.
This personalized, adaptive approach is intended to reduce misclassifications
of mail, as each user teaches the system his/her personal definition of what
constitutes spam and good mail.
Bayes, Thomas
(b. 1702, London - d. 1761, Tunbridge Wells, Kent) Nonconformist theologian
and mathematician who first used probability inductively and established a
mathematical basis for probability inference (a means of calculating, from the
number of times an event has not occurred, the probability that it will occur
in future trials). He set down his findings on probability in "Essay Towards
Solving a Problem in the Doctrine of Chances" (1763), published posthumously
in the Philosophical Transactions of the Royal Society of London. The only
works he is known to have published in his lifetime are Divine Benevolence, or
an Attempt to Prove That the Principal End of the Divine Providence and
Government is the Happiness of His Creatures (1731) and An Introduction to the
Doctrine of Fluxions, and a Defence of the Mathematicians Against the
Objections of the Author of the Analyst (1736), which countered attacks by
Bishop Berkeley on the logical foundations of Newton's calculus.
Source: Encyclopædia Britannica
Bayesian:
being, relating to, or concerned with a theory (as of decision making or
statistical inference) involving the application of Bayes' theorem and the use
of probabilities based on prior knowledge and accumulated experience <bayesian
probability models>.
Source: Merriam-Webster