There are several methods for filtering email. One of them, the Bayesian filter, uses predictive behavioral analysis to estimate whether an email is junk or not.
What is the Bayesian Filter?
The Bayesian filter is named after Thomas Bayes, whose theorem provides the mathematical basis for it. Bayes’ theorem is a statistical rule that gives the probability of an event in light of related evidence. The filter analyzes the header and content of the user’s email messages and determines whether each message is spam (the email equivalent of hard-copy bulk mail) or a legitimate ‘ham’ message.
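As a rough illustration of the theorem applied to a single word, the probability that a message is spam given that it contains a word W can be written as below. The symbols S (spam), H (ham) and W (the word appears) are notation introduced here for illustration and are not taken from any particular filter.

```latex
\[
P(S \mid W) = \frac{P(W \mid S)\,P(S)}{P(W \mid S)\,P(S) + P(W \mid H)\,P(H)}
\]
```

Here P(W | S) is the fraction of spam messages that contain the word, P(W | H) is the same fraction for ham, and P(S) and P(H) are the overall proportions of spam and ham.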
How Does the Bayesian Filter Work?
A Bayesian filter is based on the probability that specific words appear in the header or content of an email. Certain words indicate a higher probability of spam. Take the word ‘Viagra’, which is likely to appear in a spammer’s message. Once the filter has seen enough spam messages containing ‘Viagra’, it ‘learns’ to associate the word with spam and raises the spam probability of future messages that contain it. When a message’s spam probability reaches a certain threshold, about 95 percent, the mail is moved to a junk folder and automatically deleted. Alternatively, emails can be moved to a quarantine location where the user can access the email and review the software’s decision.
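The idea can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular product: the word table, the neutral value for unknown words and the function names are all hypothetical, and the sketch assumes that words are independent and that spam and ham are equally likely beforehand.

```python
import math
import re

# Hypothetical per-word spam probabilities that a trained filter might hold.
WORD_SPAM_PROB = {
    "viagra": 0.99,
    "free": 0.90,
    "offer": 0.85,
    "meeting": 0.05,
    "report": 0.10,
}
UNKNOWN_WORD_PROB = 0.5  # assumption: words never seen before are neutral
SPAM_THRESHOLD = 0.95    # the "about 95 percent" cut-off mentioned above


def spam_probability(text: str) -> float:
    """Combine per-word probabilities, treating the words as independent."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    # Add up log-odds rather than multiplying many small probabilities.
    log_odds = 0.0
    for word in words:
        p = WORD_SPAM_PROB.get(word, UNKNOWN_WORD_PROB)
        log_odds += math.log(p) - math.log(1.0 - p)
    # Convert log-odds back to a probability in a numerically stable way.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)


def is_spam(text: str) -> bool:
    """Flag the message once its probability reaches the threshold."""
    return spam_probability(text) >= SPAM_THRESHOLD
```

With this table, a message such as "Free Viagra offer" ends up well above the threshold, while "Meeting report attached" stays far below it.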
The Bayesian Filter Learns
The filter does not know in advance which words or content indicate spam. Instead, it goes through a learning period in which it analyzes the user’s behaviour to see which messages that particular user generally ignores or flags as spam, and then it starts predicting the user’s behaviour.
Breakthrough in Email Protection
The objective of the initial learning period of the Bayesian filter is to reduce the incidence of false positives and false negatives. Other methods, by contrast, often employ a simple scoring filter. If a message contains specific words and elements, ‘points’ are added to that message’s score. Once a message exceeds a certain score, it is regarded as spam. While this system is easy to understand, it is an ad hoc method rather than an accurate one, and its results may change when spammers change their wording.
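For comparison, a simple scoring filter of the kind described above might look like the following sketch. The rule table, the point values and the score limit are hypothetical and chosen only to show how a small change in wording slips past fixed rules.

```python
import re

# Hypothetical hand-written rules: each listed word adds points to the score.
RULE_POINTS = {"viagra": 5, "free": 2, "winner": 3}
SCORE_LIMIT = 4  # hypothetical limit; a message above it is treated as spam


def score(text: str) -> int:
    """Add up the points of every rule word found in the message."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return sum(RULE_POINTS.get(word, 0) for word in words)


def is_spam_by_score(text: str) -> bool:
    """Regard the message as spam once its score exceeds the limit."""
    return score(text) > SCORE_LIMIT
```

Under these rules "Free Viagra" scores 7 and is regarded as spam, while "Free V1agra" scores only 2 and passes, because the altered word no longer matches any rule.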
When the filter is trained on an individual user’s input, its precision increases on a per-user basis. Different users attract different types of spam, usually depending on their online activity. Each time a user confirms a message as spam or ham, the filter can become more accurate and produce a more refined probability for the next message.
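The feedback loop can be sketched as a small per-user word counter. The class and method names below are hypothetical. The sketch assumes add-one smoothing so that unseen words stay neutral, and it treats spam and ham as equally likely beforehand. Real filters may use different estimates.

```python
import re
from collections import Counter


class WordStats:
    """Per-user word counts, updated each time the user labels a message."""

    def __init__(self) -> None:
        self.spam_counts: Counter = Counter()
        self.ham_counts: Counter = Counter()
        self.spam_messages = 0
        self.ham_messages = 0

    def train(self, text: str, is_spam: bool) -> None:
        """Record the distinct words of one message the user has labelled."""
        words = set(re.findall(r"[a-z']+", text.lower()))
        if is_spam:
            self.spam_counts.update(words)
            self.spam_messages += 1
        else:
            self.ham_counts.update(words)
            self.ham_messages += 1

    def word_spam_probability(self, word: str) -> float:
        """Estimate P(spam | word) with add-one smoothing and equal priors."""
        p_word_spam = (self.spam_counts[word] + 1) / (self.spam_messages + 2)
        p_word_ham = (self.ham_counts[word] + 1) / (self.ham_messages + 2)
        return p_word_spam / (p_word_spam + p_word_ham)
```

Before any training every word sits at 0.5. After the user marks a few messages, words seen mostly in spam move toward 1 and words seen mostly in ham move toward 0, which is the per-user refinement described above.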
Drawbacks of the Bayesian Filter
The downsides of using this filter are bypassing and poisoning. In targeted spam, spammers may add words or whole passages of legitimate-looking text that lower a message’s score or spam probability. In the long run, these words may themselves become associated with spam, which degrades the filter’s accuracy. This is called poisoning.
Spammers can also use bypassing tactics such as replacing words with images of text, deliberately misspelling words (for example, spelling Viagra as VIagra), and using homograph letters, which are characters from different character sets that look alike (for example, the Greek capital Omicron looks almost exactly like the Latin “O” but has a different character encoding).
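The Omicron example is easy to verify with Python’s standard unicodedata module. The helper name and the sample word below are made up for illustration.

```python
import unicodedata

LATIN_O = "O"            # U+004F
GREEK_OMICRON = "\u039f"  # U+039F, visually almost identical to "O"


def describe(char: str) -> str:
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(char):04X} {unicodedata.name(char)}"


# Hypothetical spoofed word: "LOAN" with the Latin O swapped for an Omicron.
plain_word = "LOAN"
spoofed_word = "L" + GREEK_OMICRON + "AN"

print(describe(LATIN_O))           # U+004F LATIN CAPITAL LETTER O
print(describe(GREEK_OMICRON))     # U+039F GREEK CAPITAL LETTER OMICRON
print(plain_word == spoofed_word)  # False
```

Because the two strings compare as different, a filter that looks words up by exact match treats the spoofed word as one it has never seen.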