Towards improving e-mail content classification for spam control: architecture, abstraction, and strategies


Towards Improving E-mail Content Classification

for Spam Control:

Architecture, Abstraction, and Strategies

by

Muhammad Nadzir Marsono

B.Eng. (Computer), Universiti Teknologi Malaysia, 1999

M.Eng. (Electrical), Universiti Teknologi Malaysia, 2001

A Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of

Doctor of Philosophy

in the Department of Electrical and Computer Engineering

© Muhammad Nadzir Marsono, 2007

University of Victoria


Supervisory Committee

Dr. Fayez Gebali, Co-Supervisor

(Department of Electrical and Computer Engineering)

Dr. M. Watheq El-Kharashi, Co-Supervisor (Department of Electrical and Computer Engineering)

Dr. Kin Fun Li, Department Member

Dr. Sadik Dost, Outside Member (Department of Mechanical Engineering)


Abstract

This dissertation discusses techniques to improve the effectiveness and efficiency of spam control. Specifically, layer-3 e-mail content classification is proposed to allow e-mail pre-classification (for fast spam detection at receiving e-mail servers) and distributed processing at network nodes (for fast spam detection at spam control points, e.g., e-mail servers). Fast spam detection allows e-mail servicing to be prioritized at receiving e-mail servers, safeguarding non-spam e-mail deliveries even under heavy spam traffic. It also allows spam rejection during Simple Mail Transfer Protocol (SMTP) sessions for inbound and outbound spam control. This dissertation makes four contributions.

In our first contribution, we propose a hardware architecture for a naïve Bayes content classification unit for high-throughput spam detection. We use the logarithmic number system to simplify the naïve Bayes computation. To handle the fast but lossy logarithmic number system computation, we analyze the noise model of our hardware architecture. Through noise analysis, synthesis, and verification by numerical simulation, we show that the naïve Bayes classification unit, implemented on an FPGA, is capable of processing more than one hundred million features per second with very low computation noise, an order of magnitude faster than a general-purpose processor implementation.

In our second contribution, we propose e-mail content pre-classification at the network layer (layer 3) instead of at the application layer (layer 7), as is currently practiced, to allow e-mail packet pre-classification and distributed processing for effective spam detection beyond server implementations. By performing e-mail content classification at a lower abstraction level, e-mail packets can be pre-processed, without reassembly, at any network node between sender and receiver. We demonstrate that naïve Bayes e-mail content classification can be adapted for layer-3 processing. We also show that fast e-mail class estimation can be performed at receiving e-mail servers. Through simulation using e-mail data sets, we show that layer-3 e-mail content classification is capable of detecting spam with accuracy and false positive values approximately equal to those at layer 7.

In our third contribution, we propose a prioritized e-mail servicing scheme using a priority queuing approach to improve spam handling at receiving e-mail servers. In this scheme, non-spam e-mails are given higher priority than spam. Four servicing strategies for the proposed scheme are studied. We analyze the performance of this scheme under different e-mail traffic loads and service capacities. We show that the non-spam delay and loss probability can be reduced when the server is under-provisioned.

In our fourth contribution, we propose a spam handling scheme that rejects spam during Simple Mail Transfer Protocol sessions. The proposed spam handling scheme allows inbound and outbound spam control. It is capable of reducing server loads and, hence, the non-spam queuing delay and loss probability. We analyze the performance of this scheme under different e-mail traffic loads and service capacities. We show that the non-spam delay and loss probability can be reduced when the server is under-provisioned.

In this dissertation, we present four techniques to improve spam control based on e-mail content classification. We envision that our proposed approaches complement rather than replace current spam control systems. The four proposed approaches are capable of working with existing spam control systems and support proactive control of spam and other e-mail-based threats, such as phishing and e-mail worms, anywhere across the Internet.

List of Tables

3.1 The naïve Bayes classifier design settings.

3.2 The naïve Bayes classifier synthesis results.

3.3 Worst-case accumulated noise bounds.

4.1 E-mail size distribution for SpamAssassin data set.

4.2 E-mail size distribution for TREC2005 data set.

4.3 Layer-3 classification errors ϵ0 and ϵ1 for SpamAssassin and TREC2005 data sets.

4.4 Layer-3 average false positive, fp, for SpamAssassin and TREC2005 data sets.

4.5 Layer-3 average false negative, fn, for SpamAssassin and TREC2005 data sets.

4.6 Layer-3 e-mail classification accuracy evaluated as (1-AUC) for SpamAssassin and TREC2005 data sets.

4.7 Accuracy for e-mail classification using SpamAssassin data set when overlapping bytes are taken into consideration.

4.8 Accuracy for e-mail classification using TREC2005 data set when overlapping bytes are taken into consideration.

4.9 fp and fn for SpamAssassin data set when overlapping bytes are taken into consideration.

4.10 fp and fn for TREC2005 data set when overlapping bytes are taken into consideration.

5.1 Summary of priority queuing service probabilities for the prioritized e-mail servicing.

List of Figures

2.1 An e-mail with e-mail header and message body.

2.2 An SMTP session during an e-mail transfer.

2.3 An e-mail viewed at different levels of abstraction.

2.4 Structure of TCP/IP headers.

2.5 A typical spam control scheme over the network topology.

3.1 Typical operational stages of e-mail content classification.

3.2 A hardware architecture for implementing naïve Bayes classification.

3.3 A sigmoid function to evaluate P(c0|x).

3.4 Sources of computation noise in the naïve Bayes classification unit datapath.

3.5 Accumulated noise in the naïve Bayes classification unit datapath.

3.6 Operational time graph for the proposed naïve Bayes classification unit.

4.1 Proactive spam control over the Internet.

4.2 Examples of ROC curves.

4.3 Layer-3 hard-threshold false positive, fp, for SpamAssassin and TREC2005 data sets.

4.4 Layer-3 hard-threshold false negative, fn, for SpamAssassin and TREC2005 data sets.

4.5 Layer-3 ROC graphs for SpamAssassin and TREC2005 data sets.

4.6 Layer-3 e-mail abstraction and its possible evasion attacks.

5.1 The typical single-queue e-mail servicing scheme on a receiving MTA.

5.2 Modeling the typical single-queue e-mail servicing scheme using Markov chain M/M/1/B queue analysis.

5.3 The proposed prioritized e-mail servicing scheme for inbound e-mails.

5.4 Prioritizing e-mail servicing at a receiving MTA.

5.5 Model of two-queue prioritized e-mail servicing scheme.

5.6 Effects of spam a priori, ps, on the performance of SS1.

5.7 Effects of spam a priori, ps, on the performance of SS2.

5.10 Effects of queue service probability, e, on the performance of SS1.

5.11 Effects of queue occupancy thresholds, kv/kw, on the performance of SS2.

5.12 Effects of false positive, fp, on the performance of SS4.

5.13 Effects of false negative, fn, on the performance of SS4.

5.14 Examples of traffic loads when an MTA is over-provisioned and under-provisioned.

5.15 Performance of the prioritized e-mail servicing scheme under a normal traffic load.

5.16 Performance of the prioritized e-mail servicing scheme under an attack traffic load.

5.17 Computational costs of the typical single-queue e-mail servicing scheme on a receiving MTA.

6.1 Proposed spam rejection during SMTP sessions.

6.2 Single-queue spam rejection during SMTP sessions.

6.3 Two-queue spam rejection during SMTP sessions.

6.4 Effects of spam a priori, ps, on the performance of the spam rejection scheme during SMTP sessions.

6.5 Effects of zombie-relayed spam probability, pz, on the performance of the spam rejection scheme during SMTP sessions.

6.6 Retransmission overhead of spam rejection during SMTP sessions.

6.7 Effects of false positive, fp, on the performance of the spam rejection scheme during SMTP sessions.

6.8 Effects of false negative, fn, on the performance of the spam rejection scheme during SMTP sessions.

6.9 Performance of the spam rejection scheme during SMTP sessions under a normal traffic load.

6.10 Performance of the spam rejection scheme during SMTP sessions under an attack traffic load.

7.1 Towards improving spam control over the Internet: Graphical

List of Abbreviations

AI Artificial Intelligence

ASCII American Standard Code for Information Interchange

ATM Asynchronous Transfer Mode

AUC Area Under the Curve

DDoS Distributed Denial of Service

DMP Designated Mail Protocol

DoS Denial of Service

ESP E-mail Service Provider

FDDI Fiber Distributed Data Interface

FPGA Field Programmable Gate Array

Gbps Gigabits per second

GPP General Purpose Processor

IM2000 Internet Mail 2000

IMAP Internet Message Access Protocol

I/O Input-Output

IP Internet Protocol

ISP Internet Service Provider

kB kilo Byte

k-NN k-Nearest Neighbors

LE Logic Element

LNS Logarithmic Number System

LUT Look-Up Table

MIME Multipurpose Internet Mail Extension

MTA Mail Transfer Agent


MUA Mail User Agent

OSI Open System Interconnect

PDU Protocol Data Unit

PGP Pretty Good Privacy

POP Post Office Protocol

RBL Realtime Blackhole List

RFC Request For Comment

RHS Right Hand Side

ROC Receiver Operating Characteristics

S/MIME Secure MIME

SMTP Simple Mail Transfer Protocol

SPF Sender Policy Framework

SSL Secure Sockets Layer

SVM Support Vector Machine

TEOS Trusted E-mail Open Standard

TCP Transmission Control Protocol

TLS Transport Layer Security

VHDL VHSIC (Very High Speed Integrated Circuit) Hardware Description Language


List of Symbols

a E-mail arrival probability

av E-mail arrival probability for queue V

aw E-mail arrival probability for queue W

c E-mail service probability

c0 Spam e-mail class

c1 Non-spam e-mail class

cv E-mail service probability for queue V

cw E-mail service probability for queue W

D Average non-spam queuing delay for an M/M/1/B queue

D1 Average non-spam queuing delay for SMTP1 mode

D2 Average non-spam queuing delay for SMTP2 mode

Dp Average non-spam queuing delay for prioritized e-mail servicing scheme

Dv Average non-spam queuing delay for queue V

Dw Average non-spam queuing delay for queue W

e Probability that queue V is chosen for servicing

|E| Number of e-mail examples used for learning

fn False negative given θ (hard-threshold false negative)

fn Average value of fn (soft-threshold false negative)

fp False positive given θ (hard-threshold false positive)

fp Average value of fp (soft-threshold false positive)

hi Layer-3 packet classification score

kv Occupancy threshold for queue V

kw Occupancy threshold for queue W

kmax Maximum e-mail retransmission attempts before an e-mail is accepted

L Average non-spam loss probability for an M/M/1/B queue

Lp Average non-spam loss probability for prioritized e-mail servicing scheme

Lv Average non-spam loss probability for queue V

Lw Average non-spam loss probability for queue W

m Number of classification features in x

n Number of packets per e-mail

nl Total number of non-spam e-mails

ns Total number of spam e-mails

nl→s Number of non-spam e-mails classified as spam

ns→l Average of spam e-mails classified as non-spam

P(c0) A priori probability of spam in learning data set

P(c0|x) A posteriori probability of spam given x

P(c1) A priori probability of non-spam in learning data set

P(c1|x) A posteriori probability of non-spam given x

P(x) Probability of x occurring in learning data set

P(x|c0) Likelihood probability of x in c0

P(x|c1) Likelihood probability of x in c1

P3(c0|x) Layer-3 a posteriori probability of spam given x

P3(c1|x) Layer-3 a posteriori probability of non-spam given x

pa Probability that an e-mail is accepted during an SMTP session

pd Probability that an e-mail is dropped during an SMTP session

pe ωe-bit floating-point exponent

pf ωf-bit floating-point significand

ps Probability of spam in e-mail traffic

pz Probability of zombie-relayed spam from total spam e-mails

Q Average queue occupancy for an M/M/1/B queue

Q2 Average queue occupancy for SMTP2 mode

s Equilibrium distribution vector of an M/M/1/B queue

si Probability that an M/M/1/B queue contains i e-mails

T Average queue throughput for an M/M/1/B queue

T1 Average queue throughput for SMTP1 model

T2 Average queue throughput for SMTP2 model

ta Average e-mail inter-arrival time

ˇta Minimum e-mail inter-arrival time

tc Average e-mail service time

ˇtc Minimum e-mail service time

tn True negative (1-fn)

tp True positive (1-fp)

V Primary queue in two-queue priority queuing

V Naïve Bayes generative model

W Secondary queue in two-queue priority queuing

x Array of classification features for an e-mail

xi i-th feature of x

xj Array of |xj| classification features for j-th packet of an e-mail

xi,j i-th feature of xj

y Spam to non-spam likelihood logarithmic ratio

ˆy Upper bound of y

ˇy Lower bound of y

y3 Layer-3 spam to non-spam likelihood logarithmic ratio

αv Probability that queue V is near empty (contains kv e-mails or less)

αw Probability that queue W is near empty (contains kw e-mails or less)

β Cost of typical single-queue e-mail servicing scheme

βP Cost of prioritized e-mail servicing scheme

βs Cost of spam rejection scheme during SMTP sessions

∆βP Cost overhead of prioritized e-mail servicing scheme

∆βs Cost overhead of spam rejection scheme during SMTP sessions

ϵ0 Layer-3 classification error on spam e-mails only

ϵ1 Layer-3 classification error on non-spam e-mails only

ε1 Learning noise

ε2 LG round-off noise

ε3 SIGM quantization noise

ε4 SIGM round-off noise

θ Classification threshold

σ1 Total noise at LG’s output

σ2 Total noise at SIGM’s input

σ3 Total noise at SIGM’s output

τ Markov chain time step

φc Computational overhead for classifying an e-mail

φp Computational overhead for pre-classifying an e-mail packet

φq Computational overhead for queuing an e-mail

φr Computational overhead for reassembling an e-mail

ω1 LG’s LUT input width

ω2 LG’s LUT output width

ω3 ACC’s width

ω4 SIGM’s input width

ω5 SIGM’s output width

ωe Width of pe

ω4f Width of fraction part of ω4

ω4i Width of integer part of ω4

Acknowledgment

All praise be to Allah the Almighty who has given me knowledge, patience, and perseverance to finish my Ph.D. dissertation.

My deepest thanks to my supervisors, Dr. Fayez Gebali and Dr. M. Watheq El-Kharashi, for their invaluable scholarly advice, inspiration, help, and guidance that helped me through my Ph.D. dissertation work. I will always be indebted to them for all that they have done for me during my stay here in Victoria. Thank you very much for being such fantastic supervisors and good friends over these past 3 years. I would like to acknowledge the advice and support from my supervisory committee members, Dr. Kin Fun Li and Dr. Sadik Dost, as well as the external examiner, Dr. Ehab Al-Shaer from DePaul University, Chicago, USA, for making my dissertation complete and resourceful.

I would like to thank my employer, Universiti Teknologi Malaysia, Department of Civil Services, and Ministry of Higher Education of Malaysia for the study leave and financial support under scholarship No. JPA(L) A-3238549.

I would like to thank my dad Marsono, and my late mom Salamah, for the love, support, and education they provided. I would like to thank my wife Umi Kalsom, my son Adam, and my daughter Hawa for their love, continuous support, and for being friends that I cherish.

No research work is done in isolation. I would like to give special thanks to Dr. Sudhakar Ganti, Dr. Newaz Rafiq, and Haytham El-Miligi for their invaluable support and discussions. I would also like to thank my other colleagues: Abdessalam Amer, Mohamed Fayed, Khalid Khayyat, Ahmad Abdullah, and many others. Many thanks as well to my office mates: Qing Wu and Diego Sorrentino.

Dedication

To my parents who instill the importance of education above other things in this world.

To my wife for being my lover and my best friend.


Chapter 1 Introduction

E-mail is one of the most popular and frequently used means of communication due to its worldwide accessibility, relatively fast message transfer, and low sending cost. Flaws in the e-mail protocols and the increasing volume of electronic business and financial transactions directly contribute to the increase in e-mail-based threats. Spam [1, 2], phishing [3], and e-mail worms [4] are some of today’s e-mail-based threats. These threats reduce the quality of e-mail networks, increase security risks [5, 6], and cause financial losses to all Internet users [6–9].

1.1 You’ve Got Spam!

Spam is defined as unsolicited e-mails sent in bulk [10]. Spam constitutes more than two-thirds of the total e-mail traffic [11]. It contributes to a huge loss in employees’ productivity and wastes networking (i.e., communication and computation) resources. Spam costs US companies approximately $20 billion per year in lost productivity and between $600 million and $2 billion on spam control systems [12]. While sending costs are negligible, the cost of receiving spam traffic is huge for e-mail service providers (ESPs), Internet service providers (ISPs), and corporate network infrastructures. A bulk e-mail sender (or spammer) can send out thousands of spam e-mails a day at low cost and can profit even when only a fraction of the spam e-mails sent result in sales [1].

Spam continuously evolves to circumvent spam control systems and is becoming much more sophisticated [2, 5]. The association of spammers with hackers and virus writers poses a very real threat to Internet availability and security [6–8]. While half a decade ago spam was sent from spammers’ own e-mail servers (mail transfer agents or MTAs), approximately half or more of spam is now sent from compromised (zombie) systems distributed over the Internet [6, 7, 13]. Spammers also extensively abuse ESPs to avoid blocking of their domains and to take advantage of ESP safelisting (the assumption that an e-mail coming from a trusted ESP is non-spam). Illegal spam relaying increases spam distribution bandwidth and at the same time eludes some spam detection systems [8, 14].

1.2 Problem Statement

The spam problem has led to growing demands for effective spam control, which involves spam detection and spam handling, i.e., the strategies and schemes for dealing with detected spam. For spam detection, content classification techniques give the best detection accuracy [15] compared to other techniques such as spam fingerprinting [16–18], heuristics [19], and detection based on senders’ histories [20].

This dissertation focuses on spam detection using content classification techniques as an approach to improve spam control. We describe below the problems of current content-classification-based spam control investigated in this dissertation.

First, software-based spam detection on MTAs is not capable of detecting spam at high throughput and will not cope with future increases in e-mail traffic. MTAs are e-mail servers that run on general-purpose processors (GPPs). GPP-based systems cannot scale with increases in link speed [21]. Even custom GPP-based spam detectors [22, 23] are limited to throughputs of hundreds of Mbps. Specialized hardware architectures, with improved processing power, are needed for fast spam detection and to support network bandwidth growth.

Second, spam detection at the application layer (layer 7)¹ restricts where, when, and how fast spam detection can be performed. Detecting spam at layer 7 makes detection at intermediate nodes (between the sending and the receiving MTAs) infeasible due to the need for complex Transmission Control Protocol/Internet Protocol (TCP/IP) processing at link speed [25]. TCP processing, which requires reassembly, byte alignment, and state tracking [26], incurs a large computation overhead [21]. As a result, spam control is restricted to MTA implementations as an end-to-end spam control mechanism. An improved spam detection approach, at lower e-mail abstraction levels, is needed to lift the end-to-end implementation restriction and to allow fast spam detection closer to spam sources.

Third, due to the lack of outbound spam control, e-mails are effectively classless upon reception at the receiving MTA. In current spam control, e-mail classes are unknown until e-mails have been classified. Thus, all incoming e-mails are queued in a common queue and delivered (to recipients) with equal priority. For e-mail traffic that consists mainly of spam, non-spam e-mails are delayed due to the presence of spam in the queue. Furthermore, non-spam e-mails may be lost during queuing. Spam wastes MTA processing and bandwidth resources. Any attack on the common queue could disrupt server operations. A scheme at receiving MTAs to maintain e-mail delivery is needed to reduce the non-spam delay and loss due to queuing.

¹ Throughout this dissertation, we assume the seven-layer OSI model, instead of five-layer

Fourth, the lack of outbound spam control prevents effective spam containment. Between 45% [6] and 60% [27] of spam is relayed by zombie systems distributed over the Internet. Spam containment can reduce the spam volume when performed by all ESPs, ISPs, and corporate networks [13, 28]. Outbound spam control is crucial to reduce MTA loads, to avoid domain blocking (from sending e-mails), and to avoid bad publicity [14]. However, outbound spam control by ESPs, ISPs, and corporate networks is hindered by its costs [13]. A scheme to allow outbound spam control and spam containment is needed to reduce MTA loads and the spam volume over the Internet.

1.3 Contributions

Fast spam detection allows effective spam handling. To narrow down the scope of this research, we use probabilistic naïve Bayes content classification [15, 19, 29–31] as the spam detection technique. To allow fast e-mail content classification, we propose specialized hardware support for e-mail content classification and e-mail processing at a lower abstraction level. With fast spam detection, we propose two spam handling schemes for effective spam handling. The contributions of this dissertation are given below.

First, we propose a hardware architecture for a naïve Bayes classification unit for e-mail content classification. We use the logarithmic number system (LNS) to simplify naïve Bayes computation. We evaluate the noise performance of the classification unit and functionally verify it through numerical simulation. We demonstrate that the binary LNS-based naïve Bayes classification unit is capable of detecting spam with good precision and is an order of magnitude faster than a GPP-based software implementation. This work has been published in brief in [32] and is accepted for publication in full in [33].

Second, we propose and analyze e-mail content classification at layer 3. To the best of our knowledge, e-mail classification at layer 3 has not been previously proposed or evaluated using any content-based classification technique. We demonstrate that the naïve Bayes e-mail content classification technique can be adapted to allow pre-classification and fast e-mail class estimation at layer 3, without the need for reassembly. Through simulation with the SpamAssassin [34] and TREC2005 [35] data sets, we show that the accuracy and false positive values of layer-3 naïve Bayes classification are approximately equal to those at layer 7 for all evaluation settings. This work has been published in brief in [36] and is submitted for publication in full in [37].

Third, we propose an analytical model for a prioritized e-mail servicing scheme at receiving MTAs using priority queuing. We have evaluated the proposed scheme using a discrete-time Markov chain analysis. Under an assumption that all incoming e-mail packets have been pre-classified at layer 3, we have demonstrated that the prioritized e-mail servicing can reduce the non-spam delay and loss probability at the receiving MTA under high spam traffic. This work has been published in brief in [38] and is accepted for publication in full in [39].

Fourth, we propose and model a spam rejection scheme during Simple Mail Transfer Protocol (SMTP) sessions. We have modeled and evaluated our proposed scheme using discrete-time Markov chain analysis. Under an assumption that all incoming e-mail packets have been pre-classified at layer 3, we have demonstrated that rejecting spam e-mails during SMTP sessions reduces the e-mail queuing delay and the non-spam loss probability due to reduction in the number of e-mails to be processed by receiving MTAs. We showed that illegal spam relaying can be reduced per e-mail server to allow outbound spam control. This work has been accepted for publication in brief in [40] and is submitted for publication in full in [41].
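The log-domain computation and the packet-level classification named in the first two contributions can be illustrated with a minimal software sketch. It accumulates the spam to non-spam log-likelihood ratio y, maps it to P(c0|x) with a sigmoid, and checks that summing per-packet scores gives the same ratio as scoring the whole message. The token likelihoods, the prior, and all function names are hypothetical, and the sketch assumes that features do not straddle packet boundaries; it is not the hardware datapath or the exact layer-3 method of this dissertation.

```python
import math

# Hypothetical per-token likelihoods P(token | class); illustrative values only.
P_SPAM = {"free": 0.30, "offer": 0.20, "meeting": 0.02, "report": 0.03}
P_HAM = {"free": 0.05, "offer": 0.04, "meeting": 0.25, "report": 0.20}
PRIOR_SPAM = 0.5  # assumed a priori probability P(c0)


def log_ratio(tokens):
    """Sum log(P(x_i|c0) / P(x_i|c1)) over the tokens known to the model."""
    return sum(math.log(P_SPAM[t] / P_HAM[t]) for t in tokens if t in P_SPAM)


def posterior_spam(y, prior=PRIOR_SPAM):
    """Sigmoid mapping from the accumulated log ratio y to P(c0|x)."""
    z = y + math.log(prior / (1.0 - prior))
    return 1.0 / (1.0 + math.exp(-z))


def split_into_packets(tokens, per_packet):
    """Cut the token list into consecutive chunks standing in for packets."""
    return [tokens[i:i + per_packet] for i in range(0, len(tokens), per_packet)]


message = ["free", "offer", "free", "report", "offer"]
y_whole = log_ratio(message)

# Score each packet on its own, then add the packet scores.
packet_scores = [log_ratio(p) for p in split_into_packets(message, 2)]
y_packets = sum(packet_scores)

print(round(posterior_spam(y_whole), 4), round(posterior_spam(y_packets), 4))
```

Because the naïve Bayes score is a sum over independent features, the per-packet scores can be accumulated in any order, which is what makes pre-classification without reassembly plausible in this simplified setting.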


1.4 Dissertation Organization

This dissertation is organized as follows.

Chapter 2 reviews e-mail systems, spam detection techniques, and spam handling schemes. We highlight the current spam detection by content classification at layer 7 and how this technique is limited to MTA implementation. We also address the need for pre-acceptance spam handling to mitigate the spam problem and to contain spam e-mails closer to their sources.

Chapter 3 proposes a naïve Bayes classification unit using a binary LNS approach. Through synthesis and error analysis, we show that naïve Bayes classification can be performed at a high throughput and with low computation noise.

Chapter 4 proposes an e-mail content classification at layer 3 using the naïve Bayes technique. Through simulations with TREC2005 [35] and SpamAssassin [34] data sets, we show that layer-3 naïve Bayes classification exhibits false positive and overall accuracy approximately equal to layer-7 content classification over different packet lengths.

Chapter 5 proposes a prioritized e-mail servicing scheme utilizing layer-3 spam detection. Through priority queue modeling, we show that prioritizing e-mail delivery gives better performance compared to the typical single queue scheme.

Chapter 6 proposes a spam rejection scheme during SMTP sessions utilizing layer-3 spam detection. Through modeling using discrete-time Markov chain analysis, we show that the proposed spam rejection scheme gives better performance compared to the typical single queue scheme and the prioritized servicing scheme under heavy spam traffic. We also determine the reduction in e-mail traffic and MTA loadings.

Chapter 7 summarizes this dissertation, states our contributions, and suggests directions for future research.
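As a rough illustration of the single-queue baseline that Chapters 5 and 6 compare against, the following sketch computes the equilibrium distribution of a finite discrete-time queue with arrival probability a, service probability c, and buffer size B. The transition convention (at most one arrival and one departure per time step, and an arrival lost only when the queue is full and nothing departs), the use of Little's law for the delay, and the function names are assumptions made for this sketch; they are not taken from the models developed later in the dissertation.

```python
def mm1b_distribution(a, c, B):
    """Equilibrium probabilities s_0..s_B of a birth-death queue (0 < a, c < 1)."""
    up = a * (1.0 - c)    # arrival without departure: occupancy grows by one
    down = c * (1.0 - a)  # departure without arrival: occupancy shrinks by one
    rho = up / down
    # Detailed balance s_i * up = s_{i+1} * down gives s_i proportional to rho**i.
    weights = [rho ** i for i in range(B + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def mm1b_metrics(a, c, B):
    """Loss probability L, throughput T, occupancy Q, and delay D (in time steps)."""
    s = mm1b_distribution(a, c, B)
    loss = s[B] * (1.0 - c)          # arrival meets a full queue and no departure
    throughput = a * (1.0 - loss)    # accepted e-mails per time step
    occupancy = sum(i * p for i, p in enumerate(s))
    delay = occupancy / throughput   # Little's law
    return {"L": loss, "T": throughput, "Q": occupancy, "D": delay}


# Same server, light versus heavy e-mail traffic (illustrative values).
light = mm1b_metrics(a=0.3, c=0.6, B=10)
heavy = mm1b_metrics(a=0.7, c=0.6, B=10)
print(light)
print(heavy)
```

When the arrival probability exceeds what the server can absorb, both the loss probability and the delay grow sharply, which is the under-provisioned situation that the prioritized servicing and spam rejection schemes are meant to relieve for non-spam e-mails.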

Chapter 2 E-mail System and Spam Control: Review and Taxonomy

This chapter reviews the necessary background on e-mail systems and spam control. Our discussion of e-mail systems covers the e-mail format, transfer protocols, and different levels of e-mail abstraction. Our discussion of spam control is divided into two topics: spam detection and spam handling.

This chapter is organized as follows. Section 2.1 discusses e-mail systems: their formats, protocols, and flaws. Section 2.2 reviews spam control approaches and their advantages and disadvantages. Section 2.3 summarizes the chapter.

2.1 The E-mail System

This section describes the Internet mail (e-mail) system in terms of its format and its transfer protocols. We focus the discussion on SMTP, which is the workhorse of e-mail systems. We discuss how the simplicity of the e-mail format and protocols has led to the spam problem. Then, we view e-mail at different levels of abstraction and discuss the problems associated with spam control at each level.

From: Alice <alice@mail.org>
To: Bob <bob@mail.net>
Cc: Charles <charles@mail.com>
Bcc: Donna <donna@mail.ca>
Date: Wed, 27 Dec 2006 12:00:00 +0100 (CEST)
Subject: An E-mail Example

Hello,

This is an e-mail example, complete with a header and a body.

Figure 2.1: An e-mail with e-mail header and message body.

2.1.1 E-mail Format

E-mail is structured according to RFC2822 [42] and RFC1049 [43] into two different fields: e-mail header and e-mail body. Both fields are free text input formed from the author’s input. Figure 2.1 shows a plain-text e-mail example.

The e-mail header as defined in RFC2822 [42] consists of the Date field that specifies the date and the time of the e-mail. The From field specifies the author’s e-mail address. The fields To, carbon copy (Cc) and blind carbon copy (Bcc) specify the recipients of the e-mail. To and Cc fields are shown in the e-mail received by all recipients, but not Bcc. Optionally, fields Message-ID, In-Reply-To, and References are added to the e-mail by the sending MTA.
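The header and body split described above can be checked with Python's standard `email` package. This is only a small sketch: the message text is the example of Figure 2.1 as a recipient would see it (so the Bcc field is left out), and the variable names are illustrative.

```python
from email import message_from_string

# The Figure 2.1 example as received: header lines, one blank line, then the body.
RAW = (
    "From: Alice <alice@mail.org>\n"
    "To: Bob <bob@mail.net>\n"
    "Cc: Charles <charles@mail.com>\n"
    "Date: Wed, 27 Dec 2006 12:00:00 +0100 (CEST)\n"
    "Subject: An E-mail Example\n"
    "\n"
    "Hello,\n"
    "\n"
    "This is an e-mail example, complete with a header and a body.\n"
)

msg = message_from_string(RAW)
header = dict(msg.items())  # header field names mapped to their values
body = msg.get_payload()    # everything after the first blank line

for name, value in header.items():
    print(f"{name}: {value}")
print("---")
print(body)
```

The first empty line is the only separator between the two parts, which is why both can be filled with free text by the author.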

As specified in RFC2822 [42] and RFC1049 [43], e-mails are written as 7-bit ASCII text. However, ASCII encoding cannot properly represent non-English characters or non-textual data. Multipurpose Internet Mail Extension (MIME) encoding is used to represent rich text and binary data as ASCII text. MIME allows different types of objects, including formatted text and images, to be included in e-mail messages. MIME enables non-ASCII messages to be sent over ASCII encoding while remaining compatible with the RFC2822 standard.

2.1.2 E-mail Protocols

At layer 7, an e-mail is exchanged between a sender and a recipient according to certain e-mail protocols. Simple Mail Transfer Protocol (SMTP), Post Office Protocol (POP), and Internet Message Access Protocol (IMAP) are the protocols that are used to send and receive e-mails over the TCP/IP protocol suite. These protocols are discussed below.

Simple Mail Transfer Protocol (SMTP)

SMTP is used to send e-mails from a mail user agent (MUA) to an MTA, or between MTAs. This protocol is defined in RFC821 [44] and later improved in RFC2821 [45] for end-to-end mail exchange. These RFCs define a client-server protocol that ensures successful end-to-end e-mail delivery. A sending MUA (or MTA) initiates an SMTP session to which the receiving MTA should respond. Then, the e-mail is transferred between the sending and receiving MTAs through SMTP client-server conversations.

Figure 2.2 shows an example of an SMTP session that illustrates the SMTP client-server conversation between sending and receiving MTAs. Lines marked S are sent by the sending MTA, whereas lines marked R are sent by the receiving MTA. On successful connection to the receiving MTA (line 02), the sending MTA attempts an SMTP transfer using the HELO or EHLO (extended HELO) command. On receiving the reply (for which the sender does not have to wait), the sending MTA (or MUA) can start the e-mail transfer by issuing the author and recipient envelope addresses (lines 05 and 07). The envelope addresses MAIL FROM and RCPT TO in lines 05 and 07 are used to forward the e-mail to the recipient. Then, the receiving MTA verifies that the e-mail addresses of both sender and recipients are correctly formatted. Lines 11 to 19 show the e-mail data illustrated in Figure 2.1.

01 S telnet smtp.mail.net 25
     Trying <IP address>...
     Connected to smtp.mail.net (<IP address>).
     Escape character is '^]'.
02 R 220 smtp.example.net ESMTP Sendmail 8.12.11; Wed, 27 Dec 2006 12:00:00 +0100
03 S EHLO smtp.mail.org
04 R 250-smtp.mail.net Hello smtp.mail.org <IP address>, pleased to meet you
05 S MAIL FROM: <alice@mail.org>
06 R 250 2.1.0 <alice@mail.org>... Sender ok
07 S RCPT TO: <bob@mail.net>
08 R 250 2.1.5 <bob@mail.net>... Recipient ok
09 S DATA
10 R 354 Enter mail, end with "." on a line by itself
11 S From: Alice <alice@mail.org>
12 S To: Bob <bob@mail.net>
13 S Cc: Charles <charles@mail.com>
14 S Date: Wed, 27 Dec 2006 12:00:00 +0100 (CEST)
15 S Subject: An E-mail Example
16 S
17 S Hello,
18 S
19 S This is an e-mail example, complete with a header and a body.
20 S .
21 R 250 2.0.0 <queue ID> Message accepted for delivery
22 S QUIT
23 R 221 2.0.0 smtp.mail.net closing connection

Figure 2.2: An SMTP session during an e-mail transfer.
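The client side of the session in Figure 2.2 can be sketched as a function that emits the command lines in order. This is only a minimal sketch: the function names are hypothetical, server replies are not modelled beyond extracting the status code, and message lines that begin with a dot are not escaped as a real client would need to do.

```python
def smtp_client_lines(helo_name, mail_from, rcpt_to, message_lines):
    """Build the lines a sending MTA would emit for one message.

    Assumption: every command is accepted; a real client checks each reply.
    """
    lines = [f"EHLO {helo_name}", f"MAIL FROM: <{mail_from}>"]
    # One RCPT TO per envelope recipient.
    lines += [f"RCPT TO: <{rcpt}>" for rcpt in rcpt_to]
    lines.append("DATA")
    lines += list(message_lines)   # header lines, empty line, body lines
    lines.append(".")              # a lone dot ends the message data
    lines.append("QUIT")
    return lines

def reply_code(reply_line):
    """Extract the three-digit status code from a server reply line."""
    return int(reply_line[:3])

session = smtp_client_lines(
    "smtp.mail.org", "alice@mail.org", ["bob@mail.net"],
    ["From: Alice <alice@mail.org>", "To: Bob <bob@mail.net>", "", "Hello,"],
)
for line in session:
    print("S", line)
```

The envelope addresses are passed separately from the message lines, which mirrors the way MAIL FROM and RCPT TO are independent of the From and To header fields.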

Post Office Protocol (POP)

POP, described in RFC1939 [46], is used exclusively to retrieve e-mails from an MTA to a local MUA. POP uses a client-server model where local MUAs connect to MTAs and issue simple text commands. A POP client establishes a TCP connection to a POP server. Once connected and authenticated, the POP session enters the transaction state. In this state, the client can access the mailbox and e-mails are copied to the local MUA. When the client sends the QUIT command, the session enters the update state and the connection is then closed.
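The session states described above can be sketched as a small state machine. The state before authentication is called AUTHORIZATION in RFC1939; the use of the PASS command as the authenticating step and the function name `next_state` are assumptions made for this illustration.

```python
from enum import Enum

class PopState(Enum):
    AUTHORIZATION = "authorization"   # connected, not yet authenticated
    TRANSACTION = "transaction"       # mailbox can be accessed
    UPDATE = "update"                 # entered on QUIT, before closing

def next_state(state, command, ok=True):
    """Return the state after a client command (ok = server accepted it).

    Simplification: a QUIT issued before authentication just ends the
    session and is not modelled as a state change here.
    """
    if state is PopState.AUTHORIZATION and command == "PASS" and ok:
        return PopState.TRANSACTION
    if state is PopState.TRANSACTION and command == "QUIT":
        return PopState.UPDATE
    return state

state = PopState.AUTHORIZATION
for cmd in ["USER", "PASS", "RETR", "QUIT"]:
    state = next_state(state, cmd)
    print(cmd, "->", state.name)
```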

Internet Message Access Protocol (IMAP)

IMAP, described in RFC2060 [47], is used to access e-mails on MTAs remotely without necessarily moving them to a local MUA. IMAP clients can specify criteria either to download messages to local MUAs or to keep e-mails on the server while replicating copies on local MUAs. With the IMAP protocol, an e-mail user can use more than one MUA to gain access to e-mails stored on different MTAs. IMAP is also equipped with both client and server functionality for sending in addition to retrieving e-mails. This functionality allows e-mail exchange between an MUA and an MTA using the IMAP protocol. IMAP also allows clients to have multiple remote mailboxes to retrieve and send e-mails at any time.

2.1.3 Flaws in the SMTP Protocol

The simplicity of e-mail protocols, especially SMTP, is being exploited since neither message authentication nor sender validation is supported at the protocol level. The e-mail system has many inherent flaws, which are summarized below.

Push-delivery

E-mail systems using the SMTP protocol follow a push-delivery approach, where e-mails are stored by the receiving MTAs once deliveries are accepted during SMTP sessions. This puts the cost of storing unread e-mails solely on the recipients, which is a major cause of e-mail related problems [48], and exposes receiving MTAs to flooding by unwanted e-mails. The spam problem cannot be prevented in the push-delivery approach even with the best designed spam filters [49].

In contrast to the push-delivery approach, a pull-delivery Internet Mail 2000 (IM2000) protocol has been proposed [50]. In IM2000, e-mails are kept at the sending MTA until fetched by the recipient. This shifts the cost of sending e-mails (and of the related processes to control spam and other e-mail based threats) to the sender rather than the recipient. However, this protocol has been neither widely accepted nor adopted as an Internet standard.

Unauthenticated E-mails

The SMTP protocol does not provide sender authentication or sender validation mechanisms. As with the e-mail body, the e-mail header is free text input [42] and does not need to be validated. As Figure 2.2 shows, the envelope addresses do not have to be the same as the From and To fields in lines 11 and 12. Thus, e-mail headers can easily be forged. This weakness in RFC2822 is exploited by spam senders to conceal their identities, so spam cannot be identified by examining the e-mail header alone. In addition, e-mails are sent unauthenticated. Thus, the content of an e-mail can be read, modified, and deleted in transit.
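The independence of the envelope sender from the From header can be made concrete with a small check. The sketch below is illustrative only: the function name and the sample addresses are assumptions, and a mismatch is not by itself evidence of forgery, since nothing in the protocol requires the two to agree.

```python
from email import message_from_string
from email.utils import parseaddr

def header_matches_envelope(envelope_from, raw_message):
    """Compare the SMTP envelope sender with the From header address."""
    msg = message_from_string(raw_message)
    # parseaddr splits "Alice <alice@mail.org>" into (name, address).
    _, header_addr = parseaddr(msg.get("From", ""))
    return header_addr.lower() == envelope_from.lower()

RAW = "From: Alice <alice@mail.org>\nTo: Bob <bob@mail.net>\n\nHello\n"

# Same message, two different envelope senders: the server accepts both.
print(header_matches_envelope("alice@mail.org", RAW))
print(header_matches_envelope("bulk@sender.example", RAW))
```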

Transparent E-mail Transaction

Without encryption at the MUA level (between a sender and a receiver), an e-mail is visible when viewed at all abstraction levels. At the protocol level, data confidentiality is not ensured. The content of an e-mail can be read directly by a third party within or outside the organization. E-mails are sent using ASCII encoding and no encryption options are provided at the protocol level. There are attempts to secure e-mail contents using standards such as Pretty Good Privacy (PGP) [51] and secure MIME (S/MIME) [52, 53], but none has been widely adopted [54]. Another option is to encrypt the transport channel using standards such as Secure Sockets Layer (SSL) [55] or Transport Layer Security (TLS) [56]. On the negative side, when e-mails are encrypted, spam and malware remain undetectable until received by receiving MTAs.

Unguaranteed Delivery

Contrary to the belief of most e-mail users, an e-mail is not guaranteed to arrive on time or to arrive at all. At the protocol level, although SMTP does provide reception acknowledgements using the Disposition-Notification-To identifier, it is not compulsory for MUAs and MTAs to support this functionality. E-mail delivery may fail at any point between a sender and a receiver. E-mails may be lost due to MTA failure, especially when relayed through several MTAs, and the sender may not be notified of e-mail delivery failures.

2.1.4 E-mail Abstractions

SMTP, POP, and IMAP are layer-7 protocols, which operate on top of the layer-4 TCP protocol, which in turn operates on top of the layer-3 IP protocol. An e-mail is viewed differently at different layers, or simply at different levels of abstraction, as shown in Figure 2.3. At layer 7, an e-mail is viewed as a text message sent by a sender (A) to a recipient (B). At layer 4, the e-mail is viewed as a stream of bytes sent from A to B. At layer 3, the e-mail is viewed as a group of data packets sent from A to B.

Figure 2.3: An e-mail viewed at different levels of abstraction when sent from A to B. An e-mail is viewed as a text message at layer 7, as a byte stream at layer 4, and as data packets at layer 3.
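The three views can be sketched in a few lines. This is a simplified illustration, assuming an ASCII message and a fixed payload size per packet; the function names and the payload size of 8 bytes are arbitrary choices made here, and no TCP or IP headers are modelled.

```python
def as_byte_stream(message):
    """Layer-4 view: the message as one continuous sequence of bytes."""
    return message.encode("ascii")

def as_packets(stream, payload_size):
    """Layer-3 view: the byte stream cut into packet payloads."""
    return [stream[i:i + payload_size]
            for i in range(0, len(stream), payload_size)]

message = "From:A\nTo:B\n\nHello..."        # layer-7 view: a text message
stream = as_byte_stream(message)
packets = as_packets(stream, payload_size=8)

for number, payload in enumerate(packets):
    print(number, payload)
```

Each packet holds only a fragment of the message, and a header field may straddle two payloads, which is what makes classification at layer 3 harder than at layer 7.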

Layer-4 E-mail Abstraction

The layer-4 TCP protocol can be considered the most complex protocol [24]. TCP bridges the layer-3 IP protocol and (in this dissertation) the layer-7 SMTP, IMAP, and POP protocols. TCP is a lossless, connection-oriented layer-4 protocol that transports data over the Internet using the unreliable, connectionless, and lossy layer-3 IP protocol. Over 85% of Internet traffic uses TCP as the transport protocol [25, 57]. E-mail traffic constitutes 8% (with 6% SMTP) of the total TCP traffic [57, 58].

In a TCP session, a TCP connection is established between a sender and a receiver before the layer-4 byte stream is delivered. Once the connection has been established, the layer-4 byte stream is segmented into data packets and sent to the receiver. On the receiving end, once e-mail packets are received, packet integrity is verified using the packet checksum. To guarantee byte stream delivery, packet deliveries are acknowledged according to some error control mechanism. To form the e-mail byte stream, packets’ payloads are aligned and joined in a process called reassembly. Packet reassembly requires packet buffering, state tracking, and byte stream alignment to ensure the proper arrival sequence of the byte stream and to enable error control.
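The buffering and alignment involved in reassembly can be sketched as follows. This is a deliberately simplified model, not a TCP implementation: segments are given as (sequence number, payload) pairs, overlapping segments and sequence-number wraparound are ignored, and the function name is hypothetical.

```python
def reassemble(segments, initial_seq):
    """Rebuild the byte stream from (sequence_number, payload) pairs.

    Returns the contiguous bytes available so far and stops at the first gap.
    """
    buffered = {}
    for seq, payload in segments:
        if payload:
            # Keep the first copy; later duplicates are retransmissions.
            buffered.setdefault(seq, payload)
    stream = bytearray()
    expected = initial_seq
    while expected in buffered:
        payload = buffered[expected]
        stream += payload
        expected += len(payload)    # next byte the receiver is waiting for
    return bytes(stream)

# Segments arrive out of order and one of them is duplicated.
arrived = [(107, b"To:B\n"), (100, b"From:A\n"), (100, b"From:A\n"), (112, b"Hello")]
print(reassemble(arrived, initial_seq=100))
```

Even this toy version needs a buffer and per-connection state (`expected`), which hints at why reassembly is costly at high link speeds.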

E-mail packets can be distinguished from other types of data packets based on the protocol field in the IP header and the port numbers in the TCP header. Figure 2.4 shows the TCP header with the information needed to reconstruct a byte stream from e-mail packets using the Sequence and Acknowledgement numbers. To distinguish e-mail streams from other Internet traffic streams, SMTP, POP, and IMAP use destination port numbers 25, 110, and 143, respectively, with random source port numbers.
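A minimal sketch of this header-based identification is given below, using Python's `struct` module on a raw IPv4 packet. The port table uses the IANA-registered values (25 for SMTP, 110 for POP3, 143 for IMAP). The helper `build_packet` only fabricates a test packet with zeroed checksums and made-up addresses; both function names are assumptions of this example.

```python
import struct

EMAIL_PORTS = {25: "SMTP", 110: "POP", 143: "IMAP"}
TCP_PROTOCOL = 6    # value of the IP Protocol field for TCP

def classify_packet(packet):
    """Return the e-mail protocol name for an IPv4/TCP packet, else None."""
    if len(packet) < 20 or packet[0] >> 4 != 4:
        return None
    ihl = (packet[0] & 0x0F) * 4            # IP header length in bytes
    if packet[9] != TCP_PROTOCOL or len(packet) < ihl + 20:
        return None
    # First 12 bytes of the TCP header: ports, sequence and acknowledgement
    # numbers (the latter two are what reassembly would use).
    src_port, dst_port, seq, ack = struct.unpack("!HHII", packet[ihl:ihl + 12])
    return EMAIL_PORTS.get(dst_port)

def build_packet(dst_port, payload=b""):
    """Fabricate a test packet (checksums left as zero, addresses made up)."""
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40 + len(payload), 0, 0, 64,
                     TCP_PROTOCOL, 0, bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
    tcp = struct.pack("!HHIIBBHHH", 40000, dst_port, 1, 0, 5 << 4, 0x18,
                      65535, 0, 0)
    return ip + tcp + payload

print(classify_packet(build_packet(25, b"EHLO smtp.mail.org\r\n")))
```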

At layer 4, where the layer-7 SMTP, POP, and IMAP e-mail abstraction is transported, an e-mail is viewed as a continuous stream of bytes (ASCII characters), transported without errors and received in sequence by the receiver. Viewing e-mails at layer 4 is only suitable on MTAs or low-speed networks. It is as complex as viewing layer-7 e-mails, since complete packet reassembly is needed to form a complete e-mail from a collection of e-mail packets [26]. Layer-7 and layer-4 e-mail abstractions are similar and reflect the same information unless e-mails are bigger than the maximum TCP protocol data unit (PDU) size (64kB).

Layer-3 E-mail Abstraction

IP is a best-effort protocol, where packet delivery is not guaranteed in terms of packet arrival order, traversed links, packet delay, or packet loss [24]. As shown in Figure 2.4, the IP header carries the routing information from sender to receiver in the source and destination IP addresses [24]. E-mail packets cannot be distinguished from other types of data packets based on the protocol field in the IP header alone. At layer 3, layer-4 payloads can only be identified from the destination port numbers (in the TCP header carried in the IP payload). Only a partial e-mail sample, whose size depends on the packet length, is available in the IP packet payload. A complete e-mail is available only when e-mail packets have been reassembled (at layer 4).

[Figure 2.4 layout, in 32-bit rows. IP header: Version | Hdr ln | Service type | Total length; Identification | Flags | Fragment offset; TTL | Protocol | Header checksum; Source IP address; Destination IP address; Option and padding (optional). TCP header: Source port | Destination port; Sequence number; Acknowledgement number; Hdr ln | Reserved | Flags | Window; Checksum | Urgent pointer; Option and padding (optional). Payload.]

Figure 2.4: Structure of TCP/IP headers used for host routing and byte stream reconstruction.

2.2 Spam Control

With the above inherent flaws in e-mail format and protocols, spam has been a major problem for Internet users. Several approaches to detect and deal with spam have been proposed. In this section, we group spam control into three major categories: spam detection techniques, spam handling strategies, and other approaches that can be categorized neither as spam detection techniques nor as spam handling strategies.

2.2.1 Spam Detection Techniques

The first step for spam control is spam detection. We discuss several spam detection techniques below. More surveys on spam detection techniques can be found in [11, 59–63].

E-mail Content Classification

Supervised-learning content classification techniques learn the distinctions among different e-mail classes. Once trained with examples to form a generative model, a supervised-learning e-mail content classifier can recognize the exact or similar patterns observed during learning.

The accuracy of content classification using supervised-learning techniques depends on the quality, quantity, and timeliness of learning examples. E-mail traffic shows temporal variations in the a priori probability of spam [64]. The use of different learning data sets, each with a different a priori probability, makes comparison of different classifiers difficult. A classifier that works well on certain learning data sets may not perform well on different data sets. Several data sets can be used for evaluating spam content classifiers, such as the SpamAssassin [34], TREC2005 [35], PU [65], and LingSpam [65] data sets.

E-mail content classification techniques, which originated from text classification techniques, dissect e-mails to estimate their classes. The e-mail header and body may contain several informative features, which distinguish non-spam from spam e-mails. E-mail features can be extracted from the content of an e-mail, and could be characters [66], fixed-length strings [67], words [29, 30, 68], or phrases [69]. In some e-mail content classifiers, not all extracted features are used for classification. Only the most informative features are selected according to some selection criteria [70].
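The feature types listed above can be sketched with a few short extractors. These are illustrative helpers only: the tokenization rule (lowercased runs of letters, digits, and apostrophes) and the function names are assumptions of this example, not the schemes used in the cited works.

```python
import re

def word_features(text):
    """Words: lowercased runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def char_ngrams(text, n):
    """Fixed-length strings: every run of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_phrases(text, n=2):
    """Phrases: every sequence of n consecutive words."""
    words = word_features(text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

subject = "Buy NOW, buy cheap!"
print(word_features(subject))
print(char_ngrams("spam", 3))
print(word_phrases("buy cheap pills"))
```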

The naïve Bayes probabilistic technique [29,30,68,69] is the most used technique for anti-spam e-mail content classification [71]. Several variations of Bayes spam filters have been proposed [72, 73]. Most naïve Bayes classifiers follow the proposal in [30], which applies no preprocessing to either the features or the generative model. Certain Bayes classifiers take feature sequences into consideration [69], while others use extensive pruning and text processing [68].
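A minimal word-based naïve Bayes classifier is sketched below to make the idea concrete. It is a generic textbook formulation, not the method of any cited filter: the class labels, the add-one (Laplace) smoothing, and the toy training data are all assumptions of this example, and at least one training example per class is required.

```python
import math
from collections import Counter

class NaiveBayes:
    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, words, label):
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1

    def log_score(self, words, label):
        """log P(label) + sum of log P(word | label)."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total = sum(self.word_counts[label].values())
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for w in words:
            # Add-one smoothing avoids log(0) for words unseen in this class.
            score += math.log((self.word_counts[label][w] + 1) / (total + len(vocab)))
        return score

    def classify(self, words):
        return max(("spam", "ham"), key=lambda c: self.log_score(words, c))

nb = NaiveBayes()
nb.train(["cheap", "pills", "buy"], "spam")
nb.train(["buy", "now", "cheap"], "spam")
nb.train(["meeting", "tomorrow"], "ham")
nb.train(["project", "meeting", "notes"], "ham")
print(nb.classify(["cheap", "buy"]), nb.classify(["meeting", "notes"]))
```

Scores are accumulated as sums of logarithms rather than products of probabilities, which keeps the computation numerically stable when many features are used.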

Other supervised learning techniques that have been proposed for spam control are based on artificial intelligence (AI) techniques [74,75], which use an approach similar to how the human brain learns to detect patterns. As with the probabilistic techniques, the spam detection accuracy of AI techniques is highly dependent on how well they are trained. One of the problems with AI techniques is their implementation [76]. Some techniques, such as the Support Vector Machine (SVM), require longer learning time [77], although they have been shown to give better accuracy than naïve Bayes techniques [75].

Another spam detection method is the instance-based technique. It is also called the lazy-learning technique since learning is delayed until a new e-mail needs to be classified [78]. An example of this technique is k-NN (k nearest neighbors) [79]. Given an e-mail, k-NN retrieves the k most similar instances from predefined learning examples. The similarity, or the distance to the nearest neighbors, is defined by some distance measure [78]. Since learning is delayed, this technique is not able to detect spam at high speed when learning data sets are huge. An implementation with a specialized hardware architecture could be limited by the speed of memory access to the learning data set.
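The k-NN procedure can be sketched as follows. The choice of Jaccard similarity over word sets as the similarity measure, the majority vote, and the toy examples are assumptions made for this illustration; the cited work may use other measures.

```python
from collections import Counter

def jaccard(a, b):
    """Similarity of two word lists: shared words over all distinct words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def knn_classify(words, examples, k=3):
    """examples: list of (word list, label). Majority vote of the k nearest."""
    # Every stored example is compared at classification time, which is
    # why the cost grows with the size of the learning data set.
    ranked = sorted(examples, key=lambda ex: jaccard(words, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

EXAMPLES = [
    (["cheap", "pills"], "spam"),
    (["buy", "cheap", "now"], "spam"),
    (["meeting", "notes"], "ham"),
    (["project", "meeting"], "ham"),
    (["lunch", "tomorrow"], "ham"),
]
print(knn_classify(["cheap", "buy"], EXAMPLES))
```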

Another content classification technique is the rule-based (or heuristic) technique. Rule-based spam detection is a knowledge-based technique that observes the presence of certain patterns and metadata within an e-mail. Patterns that are usually associated with spam include specific words and phrases, the distribution of uppercase and special characters, and malformed headers [80]. A sophisticated rule-based system such as SpamAssassin [19] can be very effective. However, it depends only on a set of rules and is usually augmented with other spam detection techniques.
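A rule-based scorer along these lines might look like the sketch below. The rules, weights, and threshold are invented for illustration and are not taken from SpamAssassin or any cited system.

```python
import re

def uppercase_ratio(text):
    """Fraction of alphabetic characters that are uppercase."""
    letters = [c for c in text if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

# (description, weight, test) triples; all values are illustrative assumptions.
RULES = [
    ("spam phrase", 2.0,
     lambda m: bool(re.search(r"\b(free money|act now)\b", m, re.I))),
    ("mostly uppercase", 1.5, lambda m: uppercase_ratio(m) > 0.5),
    ("many exclamation marks", 1.0, lambda m: m.count("!") >= 3),
]

def rule_score(message):
    """Sum the weights of all rules that fire on the message."""
    return sum(weight for _, weight, test in RULES if test(message))

def is_spam(message, threshold=3.0):
    return rule_score(message) >= threshold

print(rule_score("ACT NOW!!! FREE MONEY"), is_spam("ACT NOW!!! FREE MONEY"))
```

Because the verdict depends entirely on the hand-written rule set, such a scorer is typically one input among several rather than the only detector.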

Spam detection by content classification has already achieved 99.9% spam detection accuracy [31]. Content classification techniques work either as stand-alone classifiers or as parts of a classifier system [81, 82]. The multi-classifier approach generally exhibits higher accuracy at the cost of higher computational complexity. Several frameworks have been developed for integrating multiple classifiers [19, 83] for MTA implementations.

Since content classification works with e-mail content (header and body), this technique can be used for spam detection anywhere in the network topology, subject to the capability to process e-mails at the link speed. Current spam detection techniques of this kind work on layer-7 e-mail, which does not allow detection at network nodes because of the TCP/IP processing required to reassemble e-mail packets (to form a complete e-mail) at wire speed. This restricts spam detection based on content classification to MTA implementations.

IP-based Authentication

The simplicity of SMTP protocol allows anyone to send an e-mail using any MTA. While user authentication is usually used to identify users, many MTAs allow open e-mail relay mostly due to errors during configuration. Open relays are manipulated by spammers to relay spam.

Blacklists (also known as Realtime Blackhole Lists, or RBLs) and whitelists are the lists of refused [84] and permitted addresses and/or domains, respectively. Spammers are known to change their Internet domains and their MTAs’ IP addresses regularly, thus limiting the effectiveness of blacklists and whitelists. A study on spam traffic shows that spam e-mails are not sent from distinct sources [20,64]. Hence, spam detection based on senders’ histories alone cannot be used effectively to detect spam. In addition, blacklists and whitelists are very rigid: non-spam domains with high outbound spam are blocked as a whole.
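A list-based decision reduces to a pair of set lookups, as in the sketch below. The precedence of the whitelist over the blacklist, the three-way result, and the sample entries are assumptions made for this example.

```python
def list_decision(sender, whitelist, blacklist):
    """Return 'accept', 'reject', or 'unknown' for a sender address.

    Lists may hold full addresses or bare domains (lowercase).
    """
    address = sender.lower()
    domain = address.rsplit("@", 1)[-1]
    if address in whitelist or domain in whitelist:
        return "accept"
    if address in blacklist or domain in blacklist:
        return "reject"
    return "unknown"    # no history: the lists cannot help

WHITELIST = {"mail.net", "alice@mail.org"}
BLACKLIST = {"mail.org", "bulk.example"}

print(list_decision("alice@mail.org", WHITELIST, BLACKLIST))
print(list_decision("carol@mail.org", WHITELIST, BLACKLIST))
print(list_decision("new@fresh-domain.example", WHITELIST, BLACKLIST))
```

The last call shows the weakness noted above: a sender on a newly registered domain matches neither list, and a blacklisted domain blocks every user on it.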

Cryptographic Authentication

Using cryptographic authentication, each e-mail is augmented with a digital signature of itself and its sender. When an e-mail is received, the receiving MTA queries the sending MTA for a public key to authenticate the e-mail and its sender. Another example is the Sender Policy Framework (SPF) [85]. It works by validating the sender by querying the sender MTA and issuing challenges to the sender to validate its identity. SPF was designed to prevent e-mail forgery, but is incapable of preventing spam relaying by zombie systems that use the sending MTAs’ SPF settings [86].

Spam Fingerprinting

Since spam is sent in thousands if not millions, spam can be detected by similarity tests. Similarity between two e-mails can be determined by matching their fingerprints, usually computed by hashing. However, cryptographic hashing techniques are not suitable for spam fingerprinting since a small change results in a different fingerprint. Instead, fuzzy spam fingerprinting techniques such as Nilsimsa [17], Vipul’s Razor [18], and Distributed Checksum Clearinghouse (DCC) [16] are used to detect spam. Nilsimsa has been shown to be able to detect similarity among two or more e-mails [87] with a high degree of robustness to message obfuscation.
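The contrast between exact and fuzzy fingerprints can be demonstrated with a short sketch. The fuzzy measure used here, Jaccard similarity over character 4-grams, is a generic stand-in chosen for brevity; it is not the Nilsimsa, Razor, or DCC algorithm, and the sample messages are made up.

```python
import hashlib

def exact_fingerprint(text):
    """Cryptographic hash: any change yields an unrelated digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text, n=4):
    """Set of character n-grams after case and whitespace normalization."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def similarity(a, b, n=4):
    """Jaccard similarity of the two shingle sets, between 0 and 1."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

original = "Buy cheap pills now at our online store"
variant = "Buy cheap pills now at our online store!"
unrelated = "Minutes of the project meeting are attached"

print(exact_fingerprint(original) == exact_fingerprint(variant))
print(round(similarity(original, variant), 2))
print(round(similarity(original, unrelated), 2))
```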

Spam fingerprinting has the added advantage of being able to detect spam without looking at the e-mail content. Moreover, it detects and stops spam in ways similar to intrusion detection and malicious software (malware) detection, by matching spam fingerprints. However, spam fingerprinting at network nodes still requires matching spam fingerprints at layer 7. The report in [2] suggests that spam e-mails are highly unique. Thus, spam fingerprinting would require large fingerprint look-ups. Hence, an implementation beyond the MTA can be very complex, depending on the e-mail traffic volume and the timeliness of the fingerprints to match.

2.2.2 Spam Handling Strategies

Spam control is not only about spam detection. While more accurate spam detection is crucial, the speed of spam detection is as crucial. The impact of spam can be alleviated using certain spam handling strategies. The possible spam handling strategies are post-acceptance and pre-acceptance strategies, which are described below.

Post-Acceptance Strategies

Figure 2.5 illustrates the currently used spam control paradigm. Spam is not only sent by spammers’ own MTAs, but also through relaying by zombies. In the current spam control paradigm, outbound spam control is rarely implemented [13] in order to reduce the spam control cost to organizations. Thus, outbound spam propagates over the Internet with minimum control until it is processed at receiving MTAs’ ingress points, either during relaying or reception. As a result, spam control is restricted to inbound e-mail traffic [13], leaving spam control the sole responsibility of the receiver.

Figure 2.5: A typical spam control scheme over the network topology. Spam e-mails are not only sent from spammers’ own MTAs, but also from zombie systems through illegal spam relaying. Spam is controlled mostly on inbound e-mail traffic. Circles represent core and edge (shaded) routers, shaded squares represent zombie systems, and white squares represent any system connected to the Internet.

Due to the lack of outbound spam control, e-mail class information is not available (within an e-mail message) to the receiver. Effectively, no e-mail class information is available to receiving MTAs during e-mail reception (before queuing) or during SMTP sessions to differentiate e-mail transfers. A possible option is to detect spam during e-mail transport (using the sender’s history and blacklisting) at relay or proxy MTAs [88] to shield the receiving MTAs.

There are no other viable options for receiving or relaying MTAs, except to accept all incoming e-mails for queuing, with equal priorities, before performing any spam control mechanisms. The delay of e-mail transfer varies depending on the queuing delay. The queuing delay is low when the incoming e-mail volume is low (with respect to service capacity). For e-mail traffic that consists mainly of spam, the non-spam e-mails are delayed due to the presence of spam in the queue. Furthermore, non-spam e-mails may be lost during queuing due to heavy spam traffic. Any attack on the queue, such as during mass-mailed worm outbreaks and denial-of-service (DoS) attacks, could slow down and even disable MTA operations.
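The effect of spam on the delay of non-spam e-mails in a single first-come, first-served queue can be illustrated with a toy calculation. The model below (one server, a fixed service time per e-mail, hand-picked arrival times, unbounded queue) is an assumption made purely for illustration and is not a measurement or a model from this dissertation.

```python
def fifo_delays(arrivals, service_time):
    """arrivals: list of (arrival_time, label), sorted by arrival time.

    Returns {label: mean time from arrival to end of service}.
    """
    free_at = 0.0                 # time at which the server becomes idle
    totals, counts = {}, {}
    for t, label in arrivals:
        start = max(t, free_at)   # wait if earlier e-mails are still queued
        free_at = start + service_time
        totals[label] = totals.get(label, 0.0) + (free_at - t)
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}

light = [(0, "ham"), (10, "ham")]
heavy = [(0, "spam"), (0, "spam"), (0, "spam"), (0, "ham"), (10, "ham")]

print(fifo_delays(light, service_time=1.0))
print(fifo_delays(heavy, service_time=1.0))
```

In the second trace, the non-spam e-mail that arrives behind three spam e-mails waits for all of them to be serviced, so its delay is set by the spam volume rather than by its own processing cost.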

Pre-acceptance Strategies

Pre-acceptance spam handling strategies are able to contain spam from spreading on both inbound and outbound traffic. If fast and accurate spam detection can be performed, spam containment is clearly the best strategy to stop spam closer to its sources [13].

In current spam control systems, pre-acceptance spam handling strategies take the form of blocking spam from known sources. This includes the use of blacklists and whitelists, IP-based authentication, and cryptographic authentication. While IP-based and cryptographic authentication can address spoofed e-mails such as phishing e-mails, they cannot prevent spam relaying through compliant settings (those of non-spam corporate MTAs and ESPs). Furthermore, spam cannot be accurately predicted by examining senders’ sending histories [20, 64].

Content classification can be used to detect spam anywhere across the Internet. However, it requires accurate spam detection techniques based on e-mail contents, i.e., content classification or spam fingerprinting, which are currently performed at layer 7. In controlling spam, some ISPs route e-mail packets to special relay or proxy MTAs, where their contents are inspected for spam. However, an MTA implementation such as [88] cannot offer fast spam detection. Apart from the complexity of the spam detection technique used, a GPP-based MTA cannot perform TCP/IP processing for high-speed links [21].

2.2.3 Other Spam Control Approaches

Spam can be prevented from occurring in the first place. We discuss spam control through legislation and through the development of new e-mail protocols that could address the flaws of current e-mail protocols.

Legislations and Regulations

Several legislations have been proposed to regulate e-mail usage and combat spam such as the 2003 USA Can-Spam Act [89]. Similar legislations were also passed in Europe and Australia. As a result, spam traffic originating from the USA decreased from 41.5% in 2004 to 26% of global spam in 2005 [27]. However, spam originating from other countries, for instance, South Korea and China increased from 12% and 9% in 2004 to 20% and 16% in 2005, respectively [27].

The Internet is borderless and connects users around the world. Without worldwide acceptance, adoption, and enforcement of these legislations, spammers could relocate to other countries to avoid enforcement. Legislation alone cannot completely eliminate spam [90]. A working combination of legislation and technology is needed to further mitigate the spam problem.

Computational Payment Approach

This approach tries to shift the cost from being borne entirely by the receiver to the sender. For non-spam senders, the cost is negligible due to the small volume of e-mails. This is not the case for spammers, who send e-mails in bulk. Techniques such as HashCash [91] and CAMRAM [92] have been proposed using this approach. The payment takes the form of computational effort, which slows down the sending of e-mails in bulk [62].
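The idea of paying with computation can be sketched as a proof-of-work stamp: the sender searches for a value whose hash has a required number of leading zero bits, and the receiver checks it with a single hash. The stamp format below (`resource:nonce`), the use of SHA-1, and the 12-bit difficulty are simplifying assumptions of this sketch and do not reproduce the actual HashCash stamp format.

```python
import hashlib
from itertools import count

def leading_zero_bits(digest):
    """Number of zero bits at the start of a digest (bytes)."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def mint(resource, bits):
    """Sender side: brute-force search, about 2**bits hashes on average."""
    for nonce in count():
        stamp = f"{resource}:{nonce}"
        if leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits:
            return stamp

def verify(stamp, resource, bits):
    """Receiver side: one hash, regardless of how long minting took."""
    return (stamp.startswith(resource + ":")
            and leading_zero_bits(hashlib.sha1(stamp.encode()).digest()) >= bits)

stamp = mint("bob@mail.net", bits=12)
print(stamp, verify(stamp, "bob@mail.net", 12))
```

The asymmetry is the point: the cost is small for one e-mail but multiplies with every recipient, so it weighs on bulk senders.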

Another technique is to use a human interaction test, in which e-mails from unknown senders are met with challenges [14] that are hard for machines but not for humans. The computational payment approach has been shown to work for ESPs in dealing with outbound spam [14]. While these techniques work at e-mail service providers (ESPs) such as Yahoo! and Hotmail, such measures may not be sufficient to control outbound spam effectively unless supported by all MTAs.

Improved E-mail Protocols

Some researchers argue that more fundamental changes to the e-mail system are needed. Current e-mail protocols give senders all the rights while giving none to receivers [50, 93]. Furthermore, the workhorse of the current e-mail system, SMTP [45], is too simplistic and flawed.

Several other protocols such as IM2000 [50], Designated Mail Protocol (DMP) [94], and Trusted E-mail Open Standard (TEOS) [89] have been proposed. IM2000 was designed to force senders to store e-mails until they are retrieved by recipients, thereby removing push delivery, which is the fundamental spam contributing factor. DMP and TEOS provide authentication mechanisms, which the SMTP protocol lacks. None of these protocols addresses all the flaws of the SMTP protocol, which hinders wide acceptance as a new e-mail standard.

Furthermore, spam is not the only e-mail-based problem. Phishing and e-mail worms are other problems closely or indirectly related to spam [7–9]. A new protocol, if accepted as an Internet standard, must address these other e-mail-based problems as well. Another thing to consider is the acceptance of the new protocol. Changes in the protocol may require changes or upgrades to current network infrastructures worldwide. Unless a new and improved protocol can be adopted seamlessly, without the need to change current network infrastructures, it may not be widely adopted as a new e-mail standard.

2.3 Chapter Summary

This chapter explored the fundamentals of e-mail systems and spam control approaches. The flaws in e-mail format and protocols require different approaches to spam control. The complexity of spam detection depends not only on the detection techniques used but also on the abstraction level at which detection is performed. If spam e-mails can be detected at a lower abstraction level, spam can be detected earlier and closer to its sources. The sooner spam can be detected, the better it can be controlled and the more effective spam handling can be in mitigating the spam problem.

One of the major issues that hinders effective spam control is the throughput of spam detection. We have described that content classification, such as probabilistic naïve Bayes content classification, gives the most accurate spam detection [15]. In the next chapter, we propose high-throughput hardware support for naïve Bayes classification. Specialized hardware support is needed for high-throughput spam detection and thus for more effective spam control.

Chapter 3

A High Throughput Naïve Bayes Content Classification Hardware Unit

This chapter presents the first contribution of this dissertation. We propose a hardware architecture for a naïve Bayes classification unit for high throughput spam detection. Other spam detection techniques, such as those based on SVM [75] and prediction by partial matching [95], offer only a marginal increase in accuracy at the cost of higher computational complexity. In this dissertation, we focus our discussion on the naïve Bayes classification technique, which is used in most spam control systems [71], either as a part of a stand-alone hardware spam detector or as a hardware support for e-mail servers.

Implementing naïve Bayes e-mail content classification requires complex arithmetic operations that introduce computation noise. This noise increases when the number of classification features is large. The choice of fixed or floating point number format strongly impacts the computation noise and system speed. The logarithmic number system (LNS) offers an attractive solution to simplify the naïve Bayes computation [96]. To allow high throughput hardware implementation, we propose a naïve Bayes classification with non-iterative binary LNS using a Look-Up Table (LUT). We analyze the noise model and the total noise bounds to verify the functionality of our architecture. Through synthesis on FPGA, we evaluate the performance of our proposed architecture.

[Figure 3.1 blocks: e-mail, feature extraction, features, feature selection, generative model, probabilities, classification, decision.]

Figure 3.1: Typical operational stages of e-mail content classification.
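The appeal of a logarithmic representation for naïve Bayes can be seen in a small numerical experiment: a long product of small probabilities underflows in floating point, while the equivalent sum of base-2 logarithms stays well within range and needs only additions. The sketch below uses ordinary double-precision `math.log2` for illustration; it does not model the LUT-based binary LNS hardware proposed in this chapter, and the probability values are made up.

```python
import math

def product_direct(probs):
    """Multiply the probabilities directly (may underflow to 0.0)."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def product_log2(probs):
    """Sum of base-2 logarithms; equals log2 of the product."""
    return sum(math.log2(p) for p in probs)

# 100 features, each with a made-up class-conditional probability.
spam_probs = [1e-5] * 100
ham_probs = [2e-5] * 100

print(product_direct(spam_probs), product_direct(ham_probs))
print(product_log2(spam_probs), product_log2(ham_probs))
```

With direct multiplication both classes collapse to zero and cannot be compared, whereas the log-domain sums still rank the two classes correctly.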

This chapter is organized as follows. Section 3.1 presents the naïve Bayes formulation using the LNS approach. We present the hardware architecture of a naïve Bayes e-mail classification hardware unit in Section 3.2. Section 3.3 presents the noise model of our classification unit. We discuss the experimental results in Section 3.4. Section 3.5 estimates the throughput of the proposed naïve Bayes classification hardware unit. We summarize this chapter in Section 3.6.

3.1 Naïve Bayes Formulation Using the Binary LNS Approach

An e-mail can be viewed as a text document that consists of an e-mail header and a free text field structured according to RFC2822 [42]. Each field may contain several informative features, which can be used for classification. The class of an e-mail can be estimated according to the naïve Bayes technique.
