1 Concentration Based Feature Construction Approach
1.1 Motivation
Feature selection plays an important role in pattern recognition, and in spam filtering in particular. Well-defined features greatly improve performance and are insensitive to changes in the content of e-mails and in users' interests. In most proposed anti-spam approaches, a word vector is formed through term-frequency analysis and serves as the feature vector for various classification algorithms. Although methods such as stop lists, word stemming, mutual information, and information gain are used to select among candidate words and to reduce dimensionality, the final feature vector is still on the order of thousands of dimensions. Heuristic approaches reduce the dimension to some extent; however, matching all pre-discovered patterns against the patterns appearing in an e-mail to be classified is time consuming, especially when the pattern library is large. Moreover, mining a proper pattern is difficult. In the natural immune system, the intrusion of pathogens can be inferred simply from the raised concentration of antibodies, a desirable property that we exploit in our spam detection system.
1.2 Overview of Our Proposed Approach
Our proposed approach consists of three parts: (1) generate a `self' gene library and a `non-self' gene library from the training samples; (2) construct a concentration vector for each training sample from these two gene libraries, and feed the concentration vectors to a classification algorithm for training; (3) use the trained classifier to predict the labels of testing samples, each characterized by its concentration vector. Here we focus on feature construction and do not emphasize the classification algorithm, which could be an SVM, an artificial neural network, or AdaBoost, to name a few.
1.3 Generation of gene libraries
Two gene libraries---a `self' gene library and a `non-self' gene library---are generated from the training samples in our proposed approach. A gene fragment in a gene library is simply a word. The `self' gene library is composed of the words most representative of non-spam e-mails. By contrast, the `non-self' gene library contains the words most representative of spam e-mails. Intuitively, a word that appears often in non-spam e-mails but seldom in spam e-mails is a good representative of non-spam e-mails. Consequently, the difference between a word's frequency in non-spam e-mails and its frequency in spam e-mails reflects the word's ``proclivity''. After this difference is calculated for each word, the words are sorted by their differences. Suppose, for instance, that the words are sorted in descending order; then two portions of words, taken from the front and the rear of the queue in certain proportions, are used to construct the `self' gene library and the `non-self' gene library, respectively. Preprocessing is used to select candidate words. Features that appear in most documents are not useful for separating the documents, because all classes have instances containing those features. For simplicity, we discard words that appear in more than 95% of the messages in the corpus; a stop word list is also used to remove trivial words.
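As an illustration, the gene-library generation above can be sketched as follows. This is a minimal Python sketch: the function names, the per-class frequency normalization, and the default 10% proportion are our assumptions, not fixed by the text.

```python
from collections import Counter

def build_gene_libraries(messages, labels, proportion=0.1, stop_words=frozenset()):
    """Build the `self' (non-spam) and `non-self' (spam) gene libraries.

    messages: list of tokenized e-mails (lists of lowercase words);
    labels: 1 for spam, 0 for non-spam; `proportion' controls how many
    words from each end of the sorted queue enter a library.
    """
    n_spam = sum(labels) or 1
    n_ham = (len(labels) - sum(labels)) or 1
    spam_df, ham_df, doc_freq = Counter(), Counter(), Counter()
    for words, label in zip(messages, labels):
        for w in set(words):
            doc_freq[w] += 1
            (spam_df if label else ham_df)[w] += 1
    # Preprocessing: drop stop words and words in more than 95% of messages.
    candidates = [w for w in doc_freq
                  if w not in stop_words and doc_freq[w] / len(messages) <= 0.95]
    # "Proclivity": frequency in non-spam minus frequency in spam.
    diff = {w: ham_df[w] / n_ham - spam_df[w] / n_spam for w in candidates}
    ranked = sorted(candidates, key=lambda w: diff[w], reverse=True)
    k = max(1, int(proportion * len(ranked)))
    self_lib = set(ranked[:k])       # front of the queue: non-spam genes
    nonself_lib = set(ranked[-k:])   # rear of the queue: spam genes
    return self_lib, nonself_lib
```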
1.4 Construction of Feature
The concentration of an e-mail with respect to a gene library is defined as the ratio of the number of the e-mail's words that appear in that gene library to the number of distinct words in the e-mail.
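The definition translates directly into code. A sketch, computing one concentration per gene library as in the two-library setup above (the function name and the tie to distinct words on both sides of the ratio are our reading of the definition):

```python
def concentration_vector(words, self_lib, nonself_lib):
    """Two-dimensional concentration feature of an e-mail.

    Each concentration is the fraction of the e-mail's distinct words
    that appear in the corresponding gene library.
    """
    distinct = set(words)
    if not distinct:
        return (0.0, 0.0)
    self_c = len(distinct & self_lib) / len(distinct)
    nonself_c = len(distinct & nonself_lib) / len(distinct)
    return (self_c, nonself_c)
```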
2. Local Concentration Based Feature Construction Approach
2.1 Background
The BIS is an adaptive, distributed system capable of discriminating ``self cells'' from ``non-self cells''; it protects the body from attacks by pathogens. Antibodies, produced by lymphocytes to detect pathogens, play a core role in the BIS. Their surfaces carry specific receptors that bind corresponding specific pathogens; thus antibodies can detect and destroy pathogens by binding them. Antibodies constantly circulate in the body and kill nearby pathogens without any central controlling node.
Inspired by the functions and principles of the BIS, the AIS was proposed in the 1990s as a novel computational intelligence model. In recent years, numerous AIS models have been designed for spam filtering. One main purpose of both the BIS and the AIS is to discriminate ``self'' from ``non-self''. In the anti-spam field, detectors (antibodies) are designed and created to discriminate spam from legitimate e-mails, and the ways detectors are created and manipulated are essential in these AIS models. In this paper, taking inspiration from the BIS and CFC, we propose an LC model and give a way of generating the LC features of messages.
2.2 Motivation
Feature extraction approaches play an important role in spam filtering and have attracted much research attention. However, most current feature extraction approaches suffer from two main problems: the high dimensionality of feature vectors and the lack of position-related information. To address these problems, we propose an LC approach inspired by the BIS, where local concentrations of antibodies determine whether the corresponding pathogens are culled from the body. Mimicking the BIS, we transform messages into feature vectors of local concentrations. Our LC approach consists of two stages---detector set (DS) generation and LC feature construction.
2.3 Term Selection and DS Generation
The terms generated by the tokenization step are first filtered using a term selection method. In term selection, the importance of each term is measured by the criterion defined by the chosen method. Unimportant (uninformative) terms are then removed, and important terms are added to the pre-selected set, which is initialized as an empty set. The purposes of term selection are to reduce the computational complexity of the LC calculation step and to reduce the noise introduced by uninformative terms, which may arise when such terms are mistaken for discriminative features.
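As a sketch, term selection might look like the following, using information gain as the importance criterion; the text leaves the concrete selection method open, so the criterion, the function names, and the message representation (each message as a set of terms) are our assumptions.

```python
import math

def entropy(p):
    """Binary entropy in bits; H(0) = H(1) = 0."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(messages, labels, term):
    """Information gain of a term for the binary spam/non-spam split."""
    n = len(messages)
    with_t = [l for m, l in zip(messages, labels) if term in m]
    without_t = [l for m, l in zip(messages, labels) if term not in m]
    h = entropy(sum(labels) / n)
    h_cond = 0.0
    for part in (with_t, without_t):
        if part:
            h_cond += len(part) / n * entropy(sum(part) / len(part))
    return h - h_cond

def select_terms(messages, labels, vocabulary, n_keep):
    """Add the n_keep most informative terms to the (initially empty)
    pre-selected set; the remaining terms are discarded as likely noise."""
    ranked = sorted(vocabulary,
                    key=lambda t: information_gain(messages, labels, t),
                    reverse=True)
    return set(ranked[:n_keep])
```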
2.4 Construction of LC Based Feature Vectors
To construct an LC-based feature vector for each message, a sliding window of length n terms is moved over the message with a step of n terms, so that there is neither a gap nor an overlap between any two adjacent windows. At each position of the window, a spam concentration and a legitimate concentration are calculated from the terms in the window and the two DSs.
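A minimal sketch of this windowing step, assuming (as a simplification) that each detector set is a plain set of terms and that a concentration is the fraction of distinct window terms matching the set:

```python
def lc_features(terms, spam_ds, legit_ds, n=5):
    """Local-concentration features: slide a window of n terms over the
    message with a step of n (no gap, no overlap); in each window compute
    a spam and a legitimate concentration against the two detector sets."""
    features = []
    for i in range(0, len(terms), n):
        distinct = set(terms[i:i + n])
        spam_c = len(distinct & spam_ds) / len(distinct)
        legit_c = len(distinct & legit_ds) / len(distinct)
        features.append((spam_c, legit_c))
    return features
```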
3. Uninterrupted Approaches for Spam Detection Based on SVM and AIS
3.1 Motivation
We take the HIS and the SVM as inspiration because certain of their properties are particularly suitable for building a spam filter. The similarities between the immune system and a spam detection system are as follows. 1) Pattern recognition: the goal of spam detection is to distinguish spam from non-spam, while the HIS differentiates self from potentially dangerous non-self. 2) Dynamic change: e-mail formats are diverse, and the content of e-mails and users' interests change constantly; tracing these changes is also a goal of spam detection. Similarly, the HIS recognizes a variety of invaders dynamically. 3) Noise tolerance: noise tolerance is desirable in pattern recognition, and the immune system is capable of recognizing different mutants of pathogens.
The SVM, for its part, is a prominent classifier with a solid theoretical foundation in statistical learning and superior generalization performance. Moreover, there are many successful applications of AIS and SVM techniques. Thus, the algorithms proposed in this paper employ the incremental SVM technique together with the concepts of detectors and memory cells for uninterrupted spam detection.
3.2 Algorithm Overview
Generally speaking, incoming e-mails arrive successively as a stream. Processing this kind of data requires an incremental update technique to incorporate new knowledge from the stream. Moreover, the content of e-mails and users' interests drift constantly, so we must forget data points that are no longer active. The key idea of our algorithm is to employ a sliding window that holds several classifiers simultaneously, each of which is updated independently by means of the incremental technique. When a new e-mail arrives, it is classified by each classifier independently according to the classification criterion, which is presented in detail in Section V-F. The label of the new e-mail is then determined by majority voting. To purge out-of-date knowledge, when a new batch arrives, the ``oldest'' classifier, i.e., the rightmost one in the window, is removed, and the remaining classifiers slide one position to the right. The ``youngest'' classifier, i.e., the leftmost one in the window, is newly generated from the previous batch only. An initialization of the window is needed to start the process. Each e-mail is transformed into a feature vector before classification. An overview of the algorithm is presented in Algorithm 1.
Algorithm 1 Algorithm for Dynamic Spam Detection
Preprocess each message in the Training Set.
Initialize the window, namely generate the classifier1.
while Algorithm is running do
if an e-mail is received then
for each classifier in the window do
classify the e-mail independently.
if the e-mail exceeds the margin then
the e-mail is kept.
else
the e-mail is discarded.
end if
if a given number of e-mails exceeding the margin is collected then
update the classifier.
end if
end for
the label of the e-mail is given by classifiers in the window in terms of voting.
if the received e-mail is the last e-mail of current batch then
Slide the window.
Generate the new classifier by the current batch only.
end if
end if
end while
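The outer loop of Algorithm 1 can be sketched as follows. This is a simplified sketch: the per-classifier incremental update (collecting margin-exceeding e-mails and retraining) is omitted, and the `fit`/`predict` interface of `make_classifier` is an assumed stand-in for the incremental SVM.

```python
from collections import deque

class SlidingWindowEnsemble:
    """Sketch of Algorithm 1's outer loop: hold several classifiers in a
    window, label each e-mail by majority voting, and on each new batch
    drop the oldest classifier and train a fresh one on that batch only."""

    def __init__(self, make_classifier, window_size=3):
        self.make_classifier = make_classifier
        self.window = deque(maxlen=window_size)  # oldest falls off when full

    def init_window(self, X, y):
        """Initialization: train the first classifier on training samples."""
        clf = self.make_classifier()
        clf.fit(X, y)
        self.window.append(clf)

    def classify(self, x):
        """Majority voting over all classifiers in the window (1 = spam)."""
        votes = [clf.predict(x) for clf in self.window]
        return 1 if sum(votes) * 2 >= len(votes) else 0

    def slide(self, X_batch, y_batch):
        """Slide the window: the youngest classifier is generated from the
        current batch only; the deque discards the oldest automatically."""
        clf = self.make_classifier()
        clf.fit(X_batch, y_batch)
        self.window.append(clf)
```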
3.3 Initialization of the Window
At the beginning, the window is empty and needs to be initialized. Initializing the window means generating the classifier1. The construction of the classifier differs slightly for different classification criteria, so the initialization processes corresponding to specific classification criteria are described in the following paragraphs, respectively. Here, part of the corpus is used as training samples to generate the classifier1. For the classification criteria of Hamming distance and included angle, the classifier comprises a Detector Set and a Memory Set. First, each mail is transformed into a binary vector in the way mentioned above. After this pre-processing, the training vectors are used to train the SVM. We use the support vectors of the trained SVM as naïve detectors to form the Detector Set. The Memory Set is the set of detectors with better performance, called memory cells. We use the Detector Set to classify the training samples according to the chosen classification criterion, i.e., Hamming distance or included angle. Naïve detectors whose number of correctly classified messages exceeds the preset threshold nm are promoted to memory cells and removed from the current Detector Set. In addition, each member of the Detector Set and the Memory Set carries a label identical to that of the corresponding support vector, and each memory cell is assigned a lifespan. Since there are two sorts of support vectors, representing non-spams and spams respectively, this way of generating detectors is an implementation of the positive-negative selection algorithm. For the SVM classification criterion, there are no Detector Set and Memory Set; we simply train the SVM on the samples.
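The promotion of naïve detectors to memory cells might be sketched as follows. The matching predicate, the lifespan value, and the data layout are our assumptions; `matches` stands in for the chosen criterion (Hamming distance or included angle under some threshold).

```python
def promote_memory_cells(detectors, samples, matches, n_m, lifespan=30):
    """Split naïve detectors into a Detector Set and a Memory Set.

    detectors: list of (vector, label) pairs built from support vectors;
    samples: list of (vector, label) training samples;
    matches(det, x): True if the detector matches the sample under the
    chosen criterion. A detector that correctly classifies more than n_m
    samples (it matches the sample and shares its label) is promoted to
    a memory cell with a lifespan and removed from the Detector Set.
    """
    detector_set, memory_set = [], []
    for vec, label in detectors:
        correct = sum(1 for x, y in samples if matches(vec, x) and y == label)
        if correct > n_m:
            memory_set.append({"vector": vec, "label": label, "lifespan": lifespan})
        else:
            detector_set.append((vec, label))
    return detector_set, memory_set
```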