Towards designing an email classification system using multi-view based semi-supervised learning



Wenjuan Li∗, Weizhi Meng∗§, Zhiyuan Tan†, Yang Xiang‡

∗ Department of Computer Science, City University of Hong Kong, Hong Kong SAR
§ Infocomm Security Department, Institute for Infocomm Research, Singapore
† School of Computing and Communications, University of Technology Sydney, Australia
‡ School of Information Technology, Deakin University, Australia
Email: {wenjuan.li@cityu.edu.hk, yuxin.meng@my.cityu.edu.hk}

Abstract—The goal of email classification is to classify user emails into spam and legitimate ones. Many supervised learning algorithms have been developed for this task, and these algorithms require a large amount of labeled training data. However, data labeling is a labor-intensive task that requires in-depth domain knowledge. Thus, only a very small proportion of the data can be labeled in practice. This bottleneck greatly degrades the effectiveness of supervised email classification systems. To address this problem, in this work, we first identify some critical issues regarding supervised machine learning-based email classification. Then we propose an effective classification model based on multi-view disagreement-based semi-supervised learning. The motivation for using multi-view and semi-supervised learning is that multi-view data can provide richer information for classification, which is often ignored in the literature, and semi-supervised learning provides the capability of coping with both labeled and unlabeled data. In the evaluation, we demonstrate that multi-view data can improve email classification compared with single-view data, and that the proposed model working with our algorithm can achieve better performance than existing similar algorithms.

Keywords-Machine Learning Applications, Email Classification, Semi-Supervised Learning, Multi-View, Network Security.

I. INTRODUCTION

Email has become an effective and essential means of communication for idea and information exchange with the rapid development of the Internet. However, due to the increasing volume of emails, spam or junk emails have become a challenge for both email users and Internet Service Providers (ISPs) [14]. These spam emails can cause many security issues if they are not properly detected. For example, spammers often send their messages as HTML mail, which can carry embedded malicious code or attachments that contain macro viruses. Even worse, spam emails may link recipients to web sites that contain scripts to collect personal information [27]. This makes email classification a hot topic in the network security community. On the whole, the major intent of email classification is to classify incoming emails and filter out spam emails.

Author Note: Weizhi Meng is the corresponding author and was previously known as Yuxin Meng.

Many supervised machine learning algorithms such as Naive Bayes [18], Decision Tree [26], k-nearest neighbor [9] and Support Vector Machine [1] have been applied to email classification. Although these supervised learning algorithms achieve good results in spam email identification, they still suffer from several issues in practice.

• Requirement of large labeled data. This is a bottleneck for supervised learning-based email classification systems, since a large number of labeled data (or instances) is needed during training. In other words, a number of training examples with ground-truth labels should be given in advance. However, in practice, only a very small proportion of the data is labeled while most of the data remains unlabeled.

• Heavy burden of expert labeling. To obtain labeled data to train a supervised learning algorithm, human effort is usually required during labeling. Due to the economic and time costs of expert labeling, as well as the large volume of unlabeled data, it is very difficult to obtain enough labeled data in practice, which in turn significantly hinders the development of supervised learning-based email classification systems.

• No response to unseen data. Moreover, with the limited number of labeled data available for training, it is very hard for supervised learning-based email classification systems to build accurate classifiers in practice. The main reason is that unseen data is widely encountered, as spammers may modify the content of an email to bypass a previously built system. Therefore, traditional supervised email classification cannot detect such modified (unseen) emails without appropriate training.

Contributions. In order to address the issues above, in this work, we propose an effective email classification model using both multi-view data and disagreement-based semi-supervised learning. The motivation for using multi-view data is that only a few works in the literature explore its effect on real email classification. The objective of using disagreement-based semi-supervised learning is to enable the email classification model to learn from both labeled and unlabeled data. The contributions of this work can be summarized as follows:

• The proposed classification model, based on both multi-view data and semi-supervised learning, can construct two feature sets according to an incoming email, called internal feature set (IFS) and external feature set (EFS). The IFS contains features that are related to email text (or body), while the EFS mainly contains features that are related to routing and forwarding.

• In the current model, we improve and implement a disagreement-based semi-supervised learning algorithm to automatically leverage unlabeled data during the classification. This algorithm can make a label decision by means of either “Average of Probabilities” or “Majority Voting”. We also compare the performance of these two methods in the evaluation.

• In the evaluation, we perform two experiments to investigate the performance of our proposed classification model using a public dataset and a real (private) dataset respectively. The experimental results, as compared to existing similar algorithms, show the effectiveness of our approach in terms of improving the accuracy of classifying emails.

The remaining parts of this paper are organized as follows. In Section II, we review related works on the use of machine learning techniques in email classification. Section III describes our proposed email classification model, the construction of the multi-view dataset and the disagreement-based semi-supervised learning algorithm. Section IV presents the experimental settings and analyzes the experimental results. Finally, we conclude our work with future directions in Section V.

II. RELATED WORK

Email classification is considered as one of the promising and commonly used methods to identify spam emails. The key point here is to distinguish the spam emails from the legitimate user emails. Many machine learning algorithms have been applied to this research topic such as supervised machine learning algorithms and semi-supervised learning algorithms.

Supervised learning algorithms: Compared to unsupervised learning, supervised machine learning algorithms, such as Naive Bayes, Decision Tree, k-nearest neighbor (KNN) and Support Vector Machine (SVM), have been widely studied in the literature.

For example, Marsono et al. [18] proposed a hardware architecture for a Naive Bayes classifier in the context of email classification for spam control. In particular, they presented a word-serial Naive Bayes classifier architecture that utilizes the Logarithmic Number System (LNS) to reduce the computational complexity, together with non-iterative binary LNS recoding using a look-up table approach. The experiment showed that their approach could handle a large number of emails per second. For the decision tree, Meizhen et al. [19] proposed a spam-behavioral recognition model and developed a Fuzzy Decision Tree based spam filter system, which computed Information Gain to analyze and select behavior features of emails. Then, Shi et al. [26] proposed a novel classification method based on decision trees and introduced ensemble learning to identify spam emails. The evaluation results on a public dataset indicated that the proposed method generally outperformed benchmark techniques such as C4.5, Naive Bayes, SVM and KNN.

For KNN and SVM, Firte et al. [9] presented an approach for spam detection filters. In particular, they developed an offline application that used the k-Nearest Neighbor (kNN) algorithm and a pre-classified email dataset for the learning process. During the experiments, this system could perform a constant update of the data set and of the list of the most frequent words that appear in the messages. Drucker et al. [7] studied the use of support vector machines in classifying emails as spam or legitimate by comparing them to three other classification algorithms: Ripper, Rocchio, and boosting decision trees. These four algorithms were tested on two different data sets, where SVM performed best when using binary features. Later, Sculley and Wachman [25] showed that online SVMs gave state-of-the-art classification performance for online spam filtering on large benchmark datasets. They also showed that nearly equivalent performance could be achieved by a Relaxed Online SVM (ROSVM) at greatly reduced computational cost. Their results were experimentally verified on email spam, blog spam, and splog detection tasks.

Later, Zhan et al. [37] proposed a stochastic learning method to model abnormal emails using weak estimators in a dynamic environment. A multivariate Bernoulli Naive Bayes (NB) classifier was employed in the training phase. The experimental results demonstrate that the proposed approach is both feasible and effective for the detection of anomalous emails. El-Alfy and Abdel-Aal [8] investigated the application of the GMDH (Group Method of Data Handling) based inductive learning approach in detecting spam messages by automatically identifying content features that effectively distinguish spam from legitimate emails. Compared with other algorithms like neural networks and Naive Bayes, their results show that their approach can provide better spam detection accuracy with false-positive rates as low as 4.3% and yet requires shorter training time. Then, Ouyang et al. [23] conducted a large scale empirical study into the effectiveness of using packet and flow features to detect email spam at an enterprise based on decision tree and Rulefit. Several other related works can be found in [10, 15, 31, 40, 43].

[Figure 1: block diagram with three phases. Feature Preparation: mail messages (labeled and unlabeled data) pass through Initialization and Feature Extraction to form two feature datasets, IFS and EFS. Training: the original labeled data form the training dataset, which is split into Subset 1 to Subset N for Classifier 1 to Classifier N; the classifiers are combined in an ensemble by the disagreement-based SSL, and newly labeled data are fed back. Classification: the decision (spam or not).]

Figure 1. The architecture of our proposed email classification model based on multi-view data and disagreement-based semi-supervised learning.

Semi-supervised learning algorithms: Given the issues with supervised learning, semi-supervised learning has gradually received more attention and has been developed to leverage unlabeled data in addition to labeled data during classification.

For instance, Cheng and Li [4] proposed a combined SVM and semi-supervised classifier to label a user’s emails. The SVM is trained with labeled public domain emails and is used to classify a user’s emails, while the semi-supervised classifier uses these emails as the training set and propagates the label information to other unlabeled emails by exploiting their distribution in feature space. They indicated that this approach could increase the classification accuracy. Then, they [5] further proposed a semi-supervised classifier ensemble aiming to label a user’s emails and facilitate the tuning process in an efficient way. They demonstrated similar results, showing that this semi-supervised ensemble could help SVM classify users’ emails with high accuracy. Gao et al. [11] proposed a semi-supervised approach, called regularized discriminant EM algorithm (RDEM), to detect image spam emails. They indicated that the cost was too high for fully supervised learning to frequently collect sufficient labeled data for training, and that their approach could leverage a small amount of labeled data and a large amount of unlabeled data for identifying spam and training a classification model simultaneously. Later, Whissell and Clarke [35] considered a specific semi-supervised spam filtering scenario in which a large amount of training data is available, but only a few true labels can be obtained for that data. They then presented two spam filtering approaches for this scenario, both starting with a clustering of the training emails. The results showed that their approaches could outperform previously published state-of-the-art semi-supervised approaches on small-sample spam filtering. Several related works about semi-supervised learning in email classification can be found in [20, 21, 36, 39], while some surveys on spam filtering can be found in [3, 30, 33].

Semi-supervised learning has proven its capability of detecting spam emails. In literature, however, we find that very limited works give attention to the use of multi-view data in classifying emails. In this paper, we therefore attempt to develop an effective email classification model combining both multi-view data and semi-supervised learning.

III. OUR PROPOSED EMAIL CLASSIFICATION MODEL

In this section, we describe the proposed email classification model in detail, present the construction of multi-view data and describe the disagreement-based semi-supervised learning algorithm respectively.

A. Email Classification Model

The main goals of our proposed classification model are twofold: 1) extracting proper features from each email that can be handled during the machine learning-based classification and constructing two attribute sets for the multi-view concept; and 2) using a disagreement-based semi-supervised learning algorithm to label and leverage unlabeled data automatically. The architecture of our proposed email classification model is presented in Fig. 1.

There are three major phases: feature preparation, training and classification. In the phase of feature preparation, the process of initialization is responsible for preprocessing all incoming email messages into our defined common format so that they can be handled by a machine learning classifier (i.e., an email will be represented by a set of features). Then the process of feature extraction collects these common features and converts them into two attribute sets: an internal feature set (IFS) and an external feature set (EFS). The IFS contains attributes that are related to email content (or body) while the EFS consists of the ones that are related to routing and forwarding. The details of constructing multi-view data will be discussed later.

Table I
THE CONSTRUCTION OF TWO FEATURE SETS IN OUR EMAIL CLASSIFICATION MODEL.

| Number | Internal Feature Set (IFS) | External Feature Set (EFS) |
|---|---|---|
| 1 | subject length | the number of receipts |
| 2 | message size | the number of replies |
| 3 | the number of attachments | the level of importance |
| 4 | type of attachments | the frequency of sending emails |
| 5 | size of attachments | the frequency of receiving emails |
| 6 | the number of words in the subject | the name length of senders |
| 7 | the number of words in the message | |
| 8 | the number of embedded images | |

In the training phase, the implemented disagreement-based semi-supervised learning algorithm can establish classifier models by using labeled multi-view instances and automatically label and leverage unlabeled data. Finally, in the phase of classification, the email classification model can make a decision by classifying email messages as either spam or legitimate emails. Note that the unlabeled data (as shown in Fig. 1) will be standardized into the previously suggested common format by passing through the phase of feature preparation in order to facilitate the use of machine learning classifiers.

B. Multi-View Data Construction

Several research studies in the area of machine learning, such as [28, 34, 41], have shown that multi-view data may improve the performance of a classifier. In addition, Mao et al. [16] applied multi-view learning to intrusion detection and showed that the multi-view method has a lower error rate than using a single view.

In literature, however, we notice that limited works give attention to the use of multi-view data in the process of email classification. To investigate this issue, one of our major goals in this paper is to explore the effect of multi-view data on email classification. In this section, we introduce our approach of constructing two multi-view feature sets for user emails, namely IFS and EFS. The detailed features within the two sets are presented in Table I.

• IFS. These features are related to email content or body, such as subject length, message size, the number of attachments, type of attachments, size of attachments, the number of words in the subject, the number of words in the message and the number of embedded images.

• EFS. Different from IFS, EFS is relevant to email routing and forwarding, such as the number of receipts, the number of replies, the level of importance, the frequency of sending emails, the frequency of receiving emails and the name length of senders.
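The two views listed above can be assembled as two separate feature records per email. The following is a minimal sketch only: the `email` dictionary, its keys and the helper `build_views` are hypothetical names introduced for illustration, and the paper does not prescribe how each feature is measured (for example, message size is taken here as the UTF-8 byte length of the body).

```python
# Hypothetical sketch: split one email record into the two views of Table I.
# Every key of the `email` dict below is an illustrative assumption.

def build_views(email):
    """Return (ifs, efs): internal and external feature dicts for one email."""
    subject, body = email["subject"], email["body"]
    attachments = email["attachments"]  # list of (type, size_in_bytes) pairs
    ifs = {
        "subject_length": len(subject),
        "message_size": len(body.encode("utf-8")),
        "num_attachments": len(attachments),
        "attachment_types": sorted({kind for kind, _ in attachments}),
        "attachment_size": sum(size for _, size in attachments),
        "subject_words": len(subject.split()),
        "message_words": len(body.split()),
        "num_embedded_images": email["num_embedded_images"],
    }
    efs = {
        "num_receipts": email["num_receipts"],
        "num_replies": email["num_replies"],
        "importance_level": email["importance_level"],
        "send_frequency": email["send_frequency"],
        "receive_frequency": email["receive_frequency"],
        "sender_name_length": len(email["sender_name"]),
    }
    return ifs, efs

# Made-up example record, not taken from either dataset in the paper.
example = {
    "subject": "Win a free prize now",
    "body": "Click the link below to claim your prize",
    "attachments": [("pdf", 2048), ("jpg", 1024)],
    "num_embedded_images": 1,
    "num_receipts": 40,
    "num_replies": 0,
    "importance_level": 1,
    "send_frequency": 120.0,
    "receive_frequency": 0.5,
    "sender_name": "Lucky Draw",
}
ifs, efs = build_views(example)
```

Keeping the two records separate, rather than concatenating them, is what allows each view to train its own learner later on.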

Feature selection and capture. Some features, like subject length, message size, size of attachments and the number of words in the message, have been studied in several research works like [14, 17] and in public spam datasets like [6]. These studies have shown the feasibility of using these features to describe an email. Building on them, in this work, we propose the above 14 features, organized into two attribute sets, to represent an email. This particular data construction makes our work different from existing work. In real deployment, the features above can be easily captured and computed by means of current email techniques (i.e., route tracking and content recording).

Multi-view. In the literature, most works explore the issue of email classification using a single attribute set, and few works discuss the multi-view method (see Section II). One of the reasons may be that a single view is much more straightforward. Motivated by other works like [16], in this paper, we aim to construct two-view data from the features proposed above in order to investigate its effect on classifying emails.

To better describe our task, let $A$ and $B$ denote two views and $(\langle a, b \rangle, c)$ denote a labeled example, where $a \in A$ and $b \in B$ are the two portions of the example, and $c$ is the label. Assume that $c \in \{0, 1\}$, where 0 denotes the negative class and 1 denotes the positive class. Further assume that there are two functions $f_A$ and $f_B$ over $A$ and $B$ such that $f_A(a) = f_B(b) = c$. This means that each example is associated with two attribute sets, each of which contains enough information for determining the label of the example [41]. Thus, given $k$ examples, we have a labeled dataset $(\langle a_k, b_k \rangle, c_k)$ ($k = 1, 2, \ldots$, where $c_k$ is known). Let $U = (\langle a_i, b_i \rangle, c_i)$ ($i = 1, 2, \ldots$, where $c_i$ is unknown) denote a large number of unlabeled data. Our task is to train a classifier to classify new examples.

C. Disagreement-based Semi-Supervised Learning

Disagreement-based semi-supervised learning can provide a mechanism to allow classifiers trained on different views to exploit unlabeled data. The learning process can be treated as a type of ensemble learning. In addition, semi-supervised learning can refer to either transductive learning or inductive learning. Transductive learning attempts to infer the correct labels for the given unlabeled data whereas the goal of inductive learning is to infer the correct mapping. In practice, a semi-supervised learning algorithm often uses transduction or induction interchangeably.

The key to disagreement-based semi-supervised learning is to generate multiple learners, let them collaborate to exploit unlabeled examples, and maintain a large disagreement between the base learners. Given multiple views, we can generate multiple learners from these views and then use them to start disagreement-based semi-supervised learning. The first algorithm of this kind is the co-training algorithm proposed by Blum and Mitchell [2]. They assumed that the data has two sufficient and redundant views (i.e., attribute sets), where each view is sufficient for training a strong learner and the views are conditionally independent of each other given the class label.

To better explain disagreement-based semi-supervised learning, let $L$ and $U$ denote a labeled dataset and an unlabeled dataset respectively, with $L = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$ and $U = \{x'_1, x'_2, \ldots, x'_m\}$. By presenting $L$ and $U$ to a learning algorithm to construct a function $X \to Y$, we can predict the labels of unseen data by using this function (where $X$ and $Y$ represent the input space and output space respectively, $x_i, x'_j \in X$, $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, m$). By considering multi-views, $L$ and $U$ can be represented as $L = \{(\langle x_1, y_1 \rangle, c_1), (\langle x_2, y_2 \rangle, c_2), \ldots, (\langle x_n, y_n \rangle, c_n)\}$ and $U = \{(\langle x'_1, y'_1 \rangle, c'_1), (\langle x'_2, y'_2 \rangle, c'_2), \ldots, (\langle x'_m, y'_m \rangle, c'_m)\}$ respectively, where the labels $c'_j$ are unknown. As mentioned earlier, several multi-view learning algorithms require independent and redundant views. Unfortunately, such a requirement can hardly be met in most scenarios [42]. In this work, we employ a method of disagreement-based co-training (ensemble) [29], which does not require independent and redundant attributes but instead uses multiple base classifiers with different learning algorithms, rather than the same base learner on different subsamples of the original labeled data.

Specifically, each classifier h is first trained on the original labeled data. Ensembles H are then established from all classifiers except one (eh) to search for a subset of high-confidence unlabeled data. These ensembles estimate the error rate for each classifier from the agreement among the classifiers. Later, a subset of U is selected by eh for h. Data that can improve the error over a pre-defined threshold are added to the labeled training dataset. In this case, each classifier has its own training dataset. Note that data which is labeled for a classifier is not deleted from the unlabeled dataset. The above training process is repeated until no more data can be labeled to improve the performance of any classifier. An outline of this co-training is shown below:

• Initialization: given L, U, H;
• For each iteration i:
– Finding the error rate for each component classifier based on disagreement among classifiers;
– Assigning labels to unlabeled instances based on agreement among ensembles;
– Sampling high-confidence examples for each component classifier;
– Building each component classifier based on newly-labeled and original labeled instances;
– Controlling the error rate for each component classifier and updating the ensemble;
– Iteration end.
• Generating the final hypothesis.

Table II
THE OLTV ALGORITHM.

Process:
1. $L_P \leftarrow$ seed, $L_N \leftarrow \emptyset$
2. Identify all pairs of correlated projections, obtaining $\alpha_i$, $\beta_i$ and $\lambda_i$.
3. For $j = 0, 1, 2, \ldots, l - 1$ do: project $\langle x_i, y_i \rangle$ into the $m$ pairs of correlated projections.
4. For $j = 1, 2, \ldots, l - 1$ do: compute $\rho_i$
5. $P \leftarrow \arg\max_{\gamma+}(\rho_i)$, $N \leftarrow \arg\min_{\gamma-}(\rho_i)$
6. For all $\langle x_j, y_j \rangle \in P$ do: $L_P \leftarrow L_P \cup (\langle x_j, y_j \rangle, 1)$
7. For all $\langle x_j, y_j \rangle \in N$ do: $L_N \leftarrow L_N \cup (\langle x_j, y_j \rangle, 0)$
8. $L \leftarrow L_P \cup L_N$, $U \leftarrow U - (P \cup N)$
Output: $L$, $U$.
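The co-training outline above can be illustrated with a deliberately simplified sketch. It is not the algorithm of [29]: the error-rate estimation and control steps are omitted, an unlabeled example is accepted for a classifier whenever the fraction of its companion classifiers that agree on a label reaches the threshold, and the `Stump` base learner, the `co_train` and `predict` helpers and the toy data are all hypothetical stand-ins for the real base classifiers and email features.

```python
from collections import Counter

class Stump:
    """Toy base learner: thresholds one feature midway between the class means."""
    def __init__(self, feature):
        self.feature, self.threshold = feature, 0.0

    def fit(self, X, y):
        pos = [x[self.feature] for x, c in zip(X, y) if c == 1]
        neg = [x[self.feature] for x, c in zip(X, y) if c == 0]
        self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

    def predict(self, x):
        return 1 if x[self.feature] > self.threshold else 0

def co_train(learners, X, y, U, threshold=0.75, max_iter=60):
    """Simplified disagreement-based co-training (no error-rate control)."""
    # Each classifier keeps its own training set, seeded with the labeled data.
    train = [(list(X), list(y)) for _ in learners]
    added = [set() for _ in learners]  # indices of U already used per classifier
    for h, (Xh, yh) in zip(learners, train):
        h.fit(Xh, yh)
    for _ in range(max_iter):
        changed = False
        for k in range(len(learners)):
            companions = learners[:k] + learners[k + 1:]  # ensemble without h
            Xh, yh = train[k]
            for j, u in enumerate(U):
                if j in added[k]:
                    continue
                votes = Counter(c.predict(u) for c in companions)
                label, count = votes.most_common(1)[0]
                if count / len(companions) >= threshold:
                    Xh.append(u)
                    yh.append(label)
                    added[k].add(j)  # U itself is left untouched
                    changed = True
        if not changed:
            break
        for h, (Xh, yh) in zip(learners, train):
            h.fit(Xh, yh)  # rebuild on original plus newly labeled data
    return learners

def predict(learners, x):
    """Final hypothesis: majority vote of the component classifiers."""
    return Counter(h.predict(x) for h in learners).most_common(1)[0][0]

# Toy data (made up): three numeric features, label 1 = spam, 0 = legitimate.
X = [(0.1, 0.2, 0.0), (0.9, 0.8, 1.0), (0.2, 0.1, 0.1), (0.8, 0.9, 0.9)]
y = [0, 1, 0, 1]
U = [(0.15, 0.1, 0.2), (0.85, 0.95, 0.8)]
learners = co_train([Stump(0), Stump(1), Stump(2)], X, y, U)
```

With three learners, each companion ensemble has two members, so a threshold of 0.75 means both companions must agree before an example is added.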

The specific co-training algorithm is described in [29]. Differently from [29], however, we employ the OLTV method [41] to generate L and U for the co-training, which can help generate a more reliably labeled dataset. The OLTV method assumes that if two sufficient views are conditionally independent given the class label, the most strongly correlated pair of projections should be in accordance with the ground truth. The specific algorithm of OLTV is described in Table II. To label unlabeled data, we employ two voting approaches: “Average of Probabilities” [29] and “Majority Voting” [13].

For the “Average of Probabilities”, let $Y = (y_1, y_2, \ldots, y_m)$ be the class labels and suppose there are $N$ classifiers in total. The prediction for a new example $x$ under this voting method is then computed as:

$$\arg\max_{y_m \in Y} \left( \frac{1}{N} \sum_{i=1}^{N} p_i(y_m \mid x) \right) \qquad (1)$$
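Equation (1) can be illustrated with a short sketch. The function name `average_of_probabilities` and the probability values are hypothetical; each row is assumed to hold the class probabilities produced by one classifier for the same example.

```python
def average_of_probabilities(prob_rows, labels):
    """prob_rows[i][m] plays the role of p_i(y_m | x) for classifier i.

    Returns the label with the highest mean probability and the averaged values.
    """
    n = len(prob_rows)
    averaged = [sum(row[m] for row in prob_rows) / n for m in range(len(labels))]
    best = max(range(len(labels)), key=lambda m: averaged[m])
    return labels[best], averaged

# Three classifiers, two classes; the numbers are made up for illustration.
probs = [[0.2, 0.8], [0.6, 0.4], [0.3, 0.7]]
label, avg = average_of_probabilities(probs, ["legitimate", "spam"])
```

Here the second classifier prefers "legitimate", but the averaged probability of "spam" is higher, so "spam" is returned.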

For the “Majority Voting”, the label chosen by the largest number of classifiers is taken as the decision, which means that the majority of the classifiers must agree before a label is assigned to the unlabeled data.
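A minimal sketch of this voting rule follows. The helper `majority_vote` is a hypothetical name, and returning `None` when no label is backed by more than half of the classifiers is an assumption made here; the paper does not state how ties are handled.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label chosen by more than half of the classifiers, else None."""
    label, count = Counter(predictions).most_common(1)[0]
    return label if count > len(predictions) / 2 else None

decided = majority_vote(["spam", "spam", "legitimate"])
undecided = majority_vote(["spam", "legitimate"])
```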

IV. EVALUATION

In this section, we evaluate our proposed email classification model using a public dataset and a real (private) dataset respectively. The objective of the first experiment is to investigate the performance of the disagreement-based algorithm, while the objective of the second experiment is to study the effect of our proposed multi-view data. The measures used in the evaluation are described as follows:

• Area under an ROC curve (AUC). This is a popular measure for comparing classifiers. It represents the expected performance as a single scalar: the larger the AUC, the better the classifier is expected to perform [24].

• False Positive (FP). This measure indicates the probability of identifying a legitimate email as a spam email.

• False Negative (FN). This measure indicates the probability of identifying a spam email as a legitimate email.

• Classification Accuracy. This measure indicates the probability of correctly identifying both spam and legitimate emails.
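The measures above can be computed as in the following sketch. The paper does not give explicit formulas, so the usual rate definitions are assumed here (FP as the fraction of legitimate emails flagged as spam, FN as the fraction of spam emails passed as legitimate), and AUC is computed from its rank interpretation. The helper names and the toy labels and scores are hypothetical.

```python
def rates(y_true, y_pred, positive=1):
    """False-positive rate, false-negative rate and accuracy from hard labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return {
        "fp_rate": fp / (fp + tn),
        "fn_rate": fn / (fn + tp),
        "accuracy": (tp + tn) / len(pairs),
    }

def auc(y_true, scores, positive=1):
    """AUC as the chance that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (made up): 1 = spam, 0 = legitimate.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.8, 0.2, 0.6, 0.1, 0.3, 0.7]
result = rates(y_true, y_pred)
area = auc(y_true, scores)
```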

A. Experiment 1

In this experiment, we mainly aim to explore the performance of the disagreement-based semi-supervised learning algorithm by comparing it to several traditional supervised learning classifiers such as Naive Bayes, IBK¹, J48 and SMO. All the base classifiers are taken from the WEKA platform [32] in order to avoid any implementation bias.

Specifically, we employ a publicly available spam email dataset, called SPAM E-mail Dataset [6], which contains 58 attributes and a total of 4601 emails, including 1813 spam emails and 2788 legitimate emails. Note that for evaluating the disagreement-based semi-supervised learning algorithm, we divided this dataset into a labeled part and an unlabeled part, where the unlabeled data consists of 600 instances randomly selected from the original dataset. In addition, we use three classifiers, Naive Bayes, IBK and J48, in the disagreement-based SSL and set the value of the pre-defined threshold to 0.75 for all classifiers. The disagreement-based semi-supervised learning algorithm uses “Majority Voting” and was run for 60 iterations. The experimental results are described in Table III.
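The random division described above can be sketched as follows. Only the counts (4601 emails, 1813 spam, 2788 legitimate, 600 unlabeled) come from the text; the helper `split_labeled_unlabeled`, the use of row indices and the fixed seed are assumptions for illustration.

```python
import random

TOTAL, SPAM, LEGITIMATE, UNLABELED = 4601, 1813, 2788, 600
assert SPAM + LEGITIMATE == TOTAL  # the class counts add up to the dataset size

def split_labeled_unlabeled(n_total, n_unlabeled, seed=0):
    """Randomly pick n_unlabeled row indices; the rest stay labeled."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    unlabeled = set(rng.sample(range(n_total), n_unlabeled))
    labeled = [i for i in range(n_total) if i not in unlabeled]
    return labeled, sorted(unlabeled)

labeled_idx, unlabeled_idx = split_labeled_unlabeled(TOTAL, UNLABELED)
```

Under this split, 4001 instances keep their labels for training while the labels of the other 600 are hidden from the learner.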

Table III
COMPARISON OF CLASSIFICATION RESULTS IN EXPERIMENT 1.

| Algorithm | FP | FN | Classification Accuracy |
|---|---|---|---|
| Naive Bayes | 0.169 | 0.248 | 0.765 |
| SMO | 0.142 | 0.223 | 0.783 |
| IBK | 0.134 | 0.215 | 0.792 |
| J48 | 0.113 | 0.187 | 0.823 |
| Our algorithm | 0.092 | 0.101 | 0.884 |

Table III shows that the disagreement-based semi-supervised learning algorithm outperforms the supervised learning algorithms in terms of false positives, false negatives and classification accuracy. For example, J48 achieves the best classification accuracy of 0.823 among the supervised learning classifiers, while the disagreement-based semi-supervised learning algorithm increases the classification accuracy to 0.884 after running 60 iterations. The results regarding FN and FP are similar. These results indicate that semi-supervised learning can enhance the overall capability of detecting spam emails by leveraging unlabeled data.

¹ In this experiment, we set k = 3 for IBK.

B. Experiment 2

As there is no public dataset for our proposed multi-view data, in this experiment, we construct a private dataset based on our defined features (see Table I), aiming to explore the performance of the proposed email classification model. This dataset contains 7133 emails recorded from two recognized institutes and was divided into a labeled dataset and an unlabeled dataset by means of random selection. The unlabeled dataset contains up to 2300 instances selected from the private dataset, while the remaining data was manually labeled by security officers from the institutes. As in Experiment 1, we use three classifiers, Naive Bayes, IBK and J48, in the disagreement-based SSL and set the value of the pre-defined threshold to 0.75 for all classifiers.

“Average of Probabilities” versus “Majority Voting”: To explore the performance of these two voting methods, we compared the classification accuracy after 60 and 100 iterations respectively. The results are presented in Table IV. The table shows that the two voting methods achieve very close classification accuracy, with “Majority Voting” obtaining a slightly better result for our proposed classification model.

Table IV
COMPARISON OF CLASSIFICATION ACCURACY USING “AVERAGE OF PROBABILITIES” AND “MAJORITY VOTING”.

| Voting Method | 60 Iterations | 100 Iterations |
|---|---|---|
| Average of Probabilities | 0.852 | 0.904 |
| Majority Voting | 0.857 | 0.913 |

Multi-view versus single-view: To demonstrate the effect of using multi-view data on email classification, we further compare our approach with the single view of EM semi-supervised learning [22]. Note that all features, as shown in Table I, are used to train the EM semi-supervised learning as a single view dataset and that our approach uses “Majority Voting” during this experiment. The detailed results are presented in Fig. 2.

The figure shows that the classification accuracy of our approach using multi-view data gradually outperforms the method which uses single view data. In addition, it is noticeable that our approach can improve the classification accuracy significantly after a few training iterations. After 60 iterations, our approach increases the classification accuracy by nearly 3% as compared to the algorithm using single view data.

[Figure 2: line plot of classification accuracy (y-axis, 0.70 to 0.86) against iteration (x-axis, 0 to 60) for two series, “Our Approach with Multi-View” and “Single View in [14]”.]

Figure 2. The comparison results of classification accuracy regarding multi-view and single view.

Table V
COMPARISON OF CLASSIFICATION RESULTS IN EXPERIMENT 2.

| Algorithm | Classification Accuracy | AUC |
|---|---|---|
| Naive Bayes | 0.702 | 0.761 |
| SMO | 0.748 | 0.779 |
| IBK | 0.773 | 0.796 |
| J48 | 0.785 | 0.823 |
| Our algorithm (60 iterations) | 0.857 | 0.913 |
| Standard Co-Training (60 iterations) [2] | 0.822 | 0.897 |
| Co-EM (60 iterations) [22] | 0.831 | 0.902 |

In addition, Table V compares the classification accuracy and AUC of our approach with those of several supervised learning algorithms. Note that all features are used to train these supervised learning algorithms as a single view dataset. It is noticeable that our classification model achieves the best result by combining multi-view data and semi-supervised learning. Moreover, the classification accuracy of every supervised learning algorithm is below 0.8 and only J48 reaches an AUC above 0.8 (0.823), which reflects the difficulty of identifying spam emails using supervised learning algorithms in real settings.

Multi-view algorithm comparison.: To further investigate the performance of our approach, we apply two other popular multi-view disagreement-based semi-supervised learning algorithms, namely Standard Co-Training [2] and Co-EM [22], to our dataset. The results are also shown in Table V. Our algorithm achieves a better classification accuracy and AUC than the other two algorithms. For instance, our algorithm achieves an accuracy of 0.857, which exceeds those of Standard Co-Training (0.822) and Co-EM (0.831) by about 0.035 and 0.026, respectively.
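The gaps between the three multi-view algorithms can be rechecked from the figures in Table V. The short sketch below does that arithmetic; the dictionary keys are illustrative names, and because the table values are rounded to three decimals, the recomputed gaps may differ in the last digit from gaps computed on unrounded results.

```python
# (accuracy, AUC) after 60 iterations, copied from Table V.
table_v = {
    "our_algorithm": (0.857, 0.913),
    "standard_co_training": (0.822, 0.897),
    "co_em": (0.831, 0.902),
}

ours_acc, ours_auc = table_v["our_algorithm"]

# Absolute gap between our algorithm and each baseline, rounded to 3 decimals.
gaps = {
    name: (round(ours_acc - acc, 3), round(ours_auc - auc_score, 3))
    for name, (acc, auc_score) in table_v.items()
    if name != "our_algorithm"
}
```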

Overall, these experimental results indicate that multi-view data can yield higher classification accuracy and AUC than single-view data, and that the construction of multi-view data in this work is promising for real email classification. Compared with the existing multi-view semi-supervised learning algorithms, our proposed email classification model is effective.

V. CONCLUSION AND FUTURE WORK

Spam emails are a major problem for Internet users, and email classification is regarded as one of the promising methods to address this issue. In the literature, many email classification systems based on supervised learning have been proposed. However, we point out that traditional supervised classification systems suffer from several issues in practice, such as the requirement of large labeled datasets, the heavy burden of expert labeling, and the inability to respond to unseen data.

In this paper, we propose an effective email classification model combining multi-view data and disagreement-based semi-supervised learning. For the multi-view data, the proposed model constructs two feature datasets from an incoming email: an internal feature set (IFS) and an external feature set (EFS). The IFS contains features related to the email text (or body), while the EFS mainly contains features related to routing and forwarding. The objective of using disagreement-based semi-supervised learning is to enable the email classification model to learn from both labeled and unlabeled data. In the evaluation, we conduct two experiments to explore the performance of our proposed email classification model and to study the effect of multi-view data on email classification. The experimental results demonstrate that our proposed email classification model improves the accuracy of classifying emails as compared to the use of single-view data, and that our algorithm is effective compared with existing multi-view semi-supervised learning algorithms.
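To make the interplay of the two views concrete, the following is a minimal sketch of a generic two-view co-training loop in the style of Standard Co-Training [2]. It is not the algorithm proposed in this paper: the nearest-centroid learner, the margin-based confidence, the feature values, and all names (`CentroidClassifier`, `co_train`) are assumptions chosen to keep the example small and dependency-free.

```python
def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class CentroidClassifier:
    """Tiny nearest-centroid learner, used only as a stand-in classifier."""

    def fit(self, X, y):
        self.centroids = {
            label: centroid([x for x, t in zip(X, y) if t == label])
            for label in set(y)
        }
        return self

    def predict_with_margin(self, x):
        # Rank labels by distance; the margin is the gap between the two
        # closest centroids and serves as a crude confidence value.
        ranked = sorted(self.centroids, key=lambda c: sq_dist(x, self.centroids[c]))
        dists = [sq_dist(x, self.centroids[c]) for c in ranked]
        margin = dists[1] - dists[0] if len(dists) > 1 else 0.0
        return ranked[0], margin


def co_train(ifs_l, efs_l, y_l, ifs_u, efs_u, rounds=3):
    """Each round, each view labels its most confident unlabeled email
    and adds it, with both feature vectors, to the shared labeled pool."""
    ifs_l, efs_l, y_l = list(ifs_l), list(efs_l), list(y_l)
    unlabeled = list(zip(ifs_u, efs_u))
    for _ in range(rounds):
        if not unlabeled:
            break
        classifiers = (
            CentroidClassifier().fit(ifs_l, y_l),  # view 0: IFS
            CentroidClassifier().fit(efs_l, y_l),  # view 1: EFS
        )
        for view, clf in enumerate(classifiers):
            if not unlabeled:
                break
            scored = [
                (clf.predict_with_margin(pair[view]), idx)
                for idx, pair in enumerate(unlabeled)
            ]
            (label, _), idx = max(scored, key=lambda s: s[0][1])
            x_ifs, x_efs = unlabeled.pop(idx)
            ifs_l.append(x_ifs)
            efs_l.append(x_efs)
            y_l.append(label)
    return CentroidClassifier().fit(ifs_l, y_l), CentroidClassifier().fit(efs_l, y_l)


# Hypothetical feature vectors: IFS = body-related, EFS = routing-related.
ifs_labeled = [[5.0, 1.0], [0.0, 0.0]]
efs_labeled = [[4.0], [1.0]]
labels = ["spam", "legitimate"]
ifs_unlabeled = [[4.0, 1.0], [1.0, 1.0]]
efs_unlabeled = [[5.0], [1.0]]

clf_ifs, clf_efs = co_train(ifs_labeled, efs_labeled, labels,
                            ifs_unlabeled, efs_unlabeled)
```

The key design point is that a label assigned confidently in one view becomes training data for the other view as well, which is how unlabeled emails contribute information in multi-view semi-supervised learning.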

To the best of our knowledge, our work is among the early efforts to discuss the use of multi-view data in email classification. There are many possible topics for our future work, which could include exploring the performance of other semi-supervised learning algorithms in our email classification model and providing a comparison study. Future work could also include investigating how to systematically construct an appropriate multi-view dataset for user emails and exploring whether there is an optimal construction.


REFERENCES

[1] O. Amayri and N. Bouguila, “A study of spam filtering using support vector machines,” Artificial Intelligence Review 34(1), 73-108, 2010.

[2] A. Blum and T. Mitchell, “Combining labeled and unlabeled data with co-training,” Proceedings of the 11th Annual Conference on Computational Learning Theory, pp. 92-100, 1998.

[3] G. Caruana and M. Li, “A survey of emerging approaches to spam filtering,” ACM Computing Surveys 44(2), pp. 1-27, 2008.

[4] V. Cheng and C.-H. Li, “Combining supervised and semi-supervised classifier for personalized spam filtering,” Proceedings of PAKDD, LNAI 4426, pp. 449-456, 2007.

[5] V. Cheng and C.-H. Li, “Personalized spam filtering with semi-supervised classifier ensemble,” Proceedings of the 2006 International Conference on Web Intelligence (WI), pp. 195-201, 2006.

[6] SPAM E-mail Dataset (Accessed on 8 September, 2013): http://web.cs.wpi.edu/~cs4445/b12/Datasets/spambase.arff

[7] H. Drucker, D. Wu, and V.N. Vapnik, “Support vector machines for spam categorization,” IEEE Transactions on Neural Networks 10(5), 1048-1054, 1999.

[8] E.M. El-Alfy and R.E. Abdel-Aal, “Using GMDH-based networks for improved spam detection and email feature analysis,” Applied Soft Computing 11, pp. 477-488, 2011.

[9] L. Firte, C. Lemnaru, and R. Potolea, “Spam detection filter using KNN algorithm and resampling,” Proceedings of the 6th International Conference on Intelligent Computer Communication and Processing (ICCP), pp. 27-33, 2010.

[10] D.M. Freeman, “Using Naive Bayes to detect spammy names in social networks,” Proceedings of the ACM Conference on Computer and Communications Security (CCS), pp. 3-12, 2013.

[11] Y. Gao, M. Yang, and A. Choudhary, “Semi supervised image spam hunter: A regularized discriminant EM approach,” Proceedings of ADMA 2009, LNAI 5678, pp. 152-164, 2009.

[12] S. Kiritchenko, S. Matwin, and S. Abu-hakima, “Email classification with temporal features,” Proceedings of the International Intelligent Information Systems (IIS), pp. 523-533, 2004.

[13] J. Kittler, M. Hatef, R.P. Duin, and J. Matas, “On Combining Classifiers,” IEEE Transactions on Pattern Analysis and Machine Intelligence 20(3), 226-239, 1998.

[14] R. Islam and Y. Xiang, “Email Classification Using Data Reduction Method,” Proceedings of the 5th International ICST Conference on Communications and Networking in China, pp. 1-5, 2010.

[15] C. Lopes, P. Cortez, P. Sousa, M. Rocha, and M. Rio, “Symbiotic filtering for spam email detection,” Expert Systems with Applications 38, pp. 9365-9372, 2011.

[16] C.-H. Mao, H.-M. Lee, D. Parikh, T. Chen, and S.-Y. Huang, “Semi-supervised co-training and active learning based approach for multi-view intrusion detection,” Proceedings of the 2009 ACM symposium on Applied Computing (SAC), pp. 2042-2048, 2009.

[17] S. Martin, A. Sewani, B. Nelson, K. Chen, and A.D. Joseph, “Analyzing Behaviorial Features for Email Classification,” Proceedings of the 2005 Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference (CEAS), pp. 1-8, 2005.

[18] M.N. Marsono, M.W. El-Kharashi, and F. Gebali, “Binary LNS-based naive Bayes hardware classifier for spam control,” Proceedings of IEEE International Symposium on Circuits and Systems, pp. 3674-3677, 2006.

[19] W. Meizhen, L. Zhitang, and Z. Sheng, “A method for spam behavior recognition based on fuzzy decision tree,” Proceedings of the 9th International Conference on Computer and Information Technology (CIT), pp. 236-241, 2009.

[20] Y. Meng, W. Li, and L.F. Kwok, “Enhancing Email Classification Using Data Reduction and Disagreement-based Semi-Supervised Learning,” Proceedings of the 2014 IEEE International Conference on Communications (ICC), 2014.

[21] M. Mojdeh and G.V. Cormack, “Semi-supervised spam filtering using aggressive consistency learning,” Proceedings of the 33rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pp. 751-752, 2010.

[22] K. Nigam, A. McCallum, and T. Mitchell, “Semi-supervised Text Classification Using EM,” In: Chapelle, O., Zien, A., and Schölkopf, B. (eds.) Semi-Supervised Learning. MIT Press: Boston (2005).

[23] T. Ouyang, S. Ray, M. Allman, and M. Rabinovich, “A large-scale empirical analysis of email spam detection through network characteristics in a stand-alone enterprise,” Computer Networks 59, pp. 101-121, 2014.

[24] S. Rosset, “Model selection via the AUC,” Proceedings of the 21st International Conference on Machine Learning (ICML), pp. 89-97, 2004.

[25] D. Sculley and G.M. Wachman, “Relaxed Online SVMs for Spam Filtering,” Proceedings of the 30th annual international ACM SIGIR conference on Research and Development in Information Retrieval (SIGIR), pp. 415-422, 2007.

[26] L. Shi, Q. Wang, X. Ma, M. Weng, and H. Qiao, “Spam email classification using decision tree ensemble,” Journal of Computational Information Systems 8(3), pp. 949-956, 2012.

[27] D. Shinder, “E-mail spam: Is it a Security Issue?” (Accessed on 30 September, 2013) http://www.windowsecurity.com/articles-tutorials/ content security/Email Spam.html.

[28] S. Sun, “A survey of multi-view machine learning,” Neural Comput & Applic 23, pp. 2031-2038, 2013.

[29] J. Tanha, M. van Someren, and H. Afsarmanesh, “Disagreement-Based Co-training,” Proceedings of the 23rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pp. 803-810, 2011.

[30] G. Tang, J. Pei, and W.-S. Luk, “Email mining: tasks, common techniques, and tools,” Knowledge and Information Systems, In Press, 2013.

[31] S.K. Trivedi and S. Dey, “Effect of feature selection methods on machine learning classifiers for detecting email spams,” Proceedings of the 2013 Research in Adaptive and Convergent Systems (RACS), pp. 35-40, 2013.

[32] The University of Waikato. WEKA-Waikato Environment for Knowledge Analysis. http://www.cs.waikato.ac.nz/ml/weka/

[33] D. Wang, D. Irani, and C. Pu, “A study on evolution of email spam over fifteen years,” Proceedings of the 9th IEEE International Conference on Collaborative Computing: Networking, Applications and Worksharing (COLLABORATECOM), pp. 1-10, 2013.

[34] W. Wang and Z.-H. Zhou, “On multi-view active learning and the combination with semi-supervised learning,” Proceedings of the 25th International Conference on Machine Learning (ICML), pp. 1152-1159, 2008.

[35] J.S. Whissell and C.L.A. Clarke, “Clustering for semi-supervised spam filtering,” Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference (CEAS), pp. 125-134, 2011.

[36] Y.-S. Wu, S. Bagchi, N. Singh, and R. Wita, “Spam detection in voice-over-IP calls through semi-supervised clustering,” Proceedings of the International Conference on Dependable Systems and Networks (DSN), pp. 307-316, 2009.

[37] J. Zhan, B.J. Oommen, and J. Crisostomo, “Anomaly Detection in Dynamic Systems Using Weak Estimators,” ACM Transactions on Internet Technology 11(1), pp. 1-16, 2011.

[38] M.-L. Zhang and Z.-H. Zhou, “Multi-label learning by instance differentiation,” Proceedings of the 22nd Conference on Artificial Intelligence (AAAI), pp. 669-674, 2007.

[39] W. Zhang, D. Zhu, Y. Zhang, G. Zhou, and B. Xu, “Harmonic functions based semi-supervised learning for web spam detection,” Proceedings of the 2011 ACM Symposium on Applied Computing (SAC), pp. 74-75, 2011.

[40] B. Zhou, Y. Yao, and J. Luo, “Cost-sensitive three-way email spam filtering,” Journal of Intelligent Information Systems 42(1), pp. 19-45, 2014.

[41] Z.-H. Zhou, D.-C. Zhan, and Q. Yang, “Semi-supervised learning with very few labeled training examples,” Proceedings of the 22nd National Conference on Artificial Intelligence (AAAI), pp. 675-680, 2007.

[42] Z.-H. Zhou and M. Li, “Tri-training: Exploiting unlabeled data using three classifiers,” IEEE Trans Knowledge and Data Engineering 17(11), pp. 1529-1541, 2005.

[43] Y. Zhang, S. Wang, P. Phillips, and G. Ji, “Binary PSO with mutation operator for feature selection using decision tree applied to spam detection,” Knowledge-Based Systems 64, pp. 22-31, 2014.