
Bachelor Informatica

Designing a framework for collecting, storing and analysing email metadata

R.A. Valkering

August 19, 2014

Supervisor(s): Emiel Bruijntjes (Copernica B.V.), G.D. van Albada (UvA)

Informatica, Universiteit van Amsterdam


Abstract

This research studies the optimal design for a framework to collect, store and analyse email metadata gathered with marketing software. To do so, the goal of the framework is described and requirements are drawn up to which the framework should adhere. The design and components of the framework are then described, each with several approaches for solving the challenges identified in the design phase. Each approach is evaluated against the requirements and on how it would contribute to a maximally efficient framework. The implementation phase describes which approach was chosen for each component. Finally, the results show that the framework in theory fulfills the requirements and that it is suitable for collecting, storing and analysing email metadata in its environment.


Contents

1 Introduction
    1.1 Related work
    1.2 Contribution
2 Design considerations
    2.1 Framework and environment description
        2.1.1 Requirements
        2.1.2 Mandatory requirements
    2.2 Framework setup
        2.2.1 Storing metadata
        2.2.2 Collecting metadata
        2.2.3 Logfile analysis
        2.2.4 Result retrieval
        2.2.5 Language
3 Implementation
    3.1 Framework structure
        3.1.1 Language and libraries
    3.2 Analysis framework
        3.2.1 Place within the framework
        3.2.2 Implementation
    3.3 Servers and filesystem
        3.3.1 HDFS
        3.3.2 GlusterFS
    3.4 Metadata logging application
        3.4.1 Place within the framework
        3.4.2 Implementation
    3.5 Email metadata library
        3.5.1 Place within the framework
        3.5.2 Logging metadata
4 Experiments
    4.1 Scalability
    4.2 Accessibility from webserver
    4.3 Complexity of algorithms
    4.4 Performance
        4.4.1 Email metadata library
        4.4.2 Analysis framework
5 Conclusions
    5.1 Discussion
    5.2 Future work


CHAPTER 1

Introduction

Since the conception of the internet and the introduction of the digital mailbox, the world has seen a steep increase in the number of emails sent around the world. Companies and marketers use email as a cost-effective way to bring attention to their products. To achieve this, known customers register themselves and their email addresses in a database, which companies and marketers use to periodically address customers in online marketing campaigns.

The cost of creating an email address is virtually zero. While this makes it very easy for people to manage multiple email addresses at once or change their primary address, it also causes a relatively high churn rate¹. Email addresses expire when not used for a long time, or simply become idle, resulting in emails not being read by the owner. Over time, customer databases will contain an increasing number of such email addresses.

It is impractical to mass-mail clients using a traditional email client, so companies usually automate this process with email marketing software. While sending emails is certainly cheaper than sending paper leaflets, using such software often comes with a monthly cost, which increases with the number of emails sent per month [10]. Having an outdated customer database may result in numerous mails being sent to expired email addresses.

Copernica is a company that develops and maintains software to manage customer databases and create marketing campaigns. The software also collects statistics of sent mailings, so that clients are aware of the effectiveness of their campaign. These statistics consist of interactions between the email address and the sent email, also known as metadata. The possible metadata elements are impressions (generated when the email client loads an image from a server), clicks, abuse/spam reports, unsubscriptions and various errors.

The software makes it possible for clients to create a filter using conditional statements. Many clients attempt to reduce the number of invalid and inactive email addresses by creating filters based on collected metadata. These filters are often neither effective nor optimal and greatly increase the server load when sending mailings to a filtered customer database.

In order to reduce the amount of emails sent to expired email addresses, one can attempt to determine the validity and activity of an email address using algorithmic analysis on collected metadata. This will allow clients to easily filter invalid and inactive email addresses from their list, while reducing the server load by not using conditional statements.

This thesis will attempt to design and implement a framework that allows for analysis of collected email metadata.


1.1 Related work

In 2004, Sommerer described a method to automatically update and maintain contact information for personal address books [21]. This is done by sending contacts a transmission requesting an update and is handled using an automated reply process. The method attempts to detect invalid email addresses by sending a message and checking if the transmission was successful. For company customer databases, however, it makes more sense to analyse results of marketing campaigns, as these already interact with all known contacts.

In 2012, another email marketing company introduced a service called SafeToSend, which checks the validity of newly entered email addresses, among many other features [2]. It does so in a similar way, by detecting bounces whenever a mail is sent to the user [17]. The difference between this technique and the framework proposed in this thesis is that the framework will not limit itself to bounces and errors, but will attempt to gather information from all interactions with clients' email addresses.

1.2 Contribution

Copernica’s email marketing software contains libraries that can send mass emailings to client databases and log the interactions of email addresses with these emailings. While these logged interactions can be used for displaying email statistics, they are not yet used for gathering information about specific email addresses. Previous documented attempts to do so appear to limit the analysis to delivery errors only, rather than all possible interactions with an email address.

The goal of this thesis is to design and implement a framework that may determine the validity and activity of email addresses by analysing interactions with customer email addresses. The thesis limits itself to the framework only, which allows for the possibility of metadata analysis, and does not describe algorithms that may be used to determine the validity and activity of email addresses.

This thesis continues in chapter 2, which discusses the requirements of the framework and evaluates choices made for the framework design. Chapter 3 describes the implementation of the framework and the various parts of which it consists. Chapter 4 describes the experiments conducted and gives an overview of the results, which are discussed in chapter 5.


CHAPTER 2

Design considerations

This chapter further specifies the framework introduced in chapter 1 and lists the considerations which will be taken into account to design the framework. The first part of this chapter will give a more in-depth description of the framework which this thesis aims to describe, and will list the requirements of the framework that follow from this description. The second part of the chapter specifies the possible configurations for the framework, along with how they may fulfill the requirements.

2.1 Framework and environment description

The intended framework will collect, store and analyse email metadata, and will store the analysis results in a database. The framework will be used in the Copernica environment, which practices email marketing. The number of bulk emails sent using this software ranges from five to eight million a day. The company has experienced large growth in the past years and may continue to do so.

The framework should be able to handle collecting, storing and analysing email metadata for eight million emails a day. On average, more than half of these emails result in no interaction¹ at all. If a log is made of every email sent and of every interaction afterwards, this would amount to a maximum of twelve million interactions².

¹ Impression, click, error, spam report or unsubscribe request.
² Potentially, emails could result in multiple interactions, such as multiple errors, multiple clicks or a click and an impression, but this number is too small to have a significant impact on this estimate.

Email metadata does not have to be stored and analysed immediately, but can be collected or queued before any action is taken. Loss of data throughout the framework should be avoided if possible, but any lost data has no consequence for the reliability of the analysis, provided the loss is indiscriminate and not confined to one type of metadata.

The framework must be set up in such a way that it can use an algorithm to analyse the email metadata and determine whether email addresses are valid and active or not. This thesis will not aim to develop an algorithm that can do so, but merely to supply the framework that can support such algorithms.

2.1.1 Requirements

This section describes mandatory requirements for the implementation of the framework. The requirements describe properties of the framework without which the framework cannot be deployed in the Copernica environment. Additionally, the final design and implementation of the framework will be evaluated against these requirements.

2.1.2 Mandatory requirements

The framework and libraries must meet the following requirements:

• The framework must be scalable in both size and performance. Since the amount of data collected may increase continuously, both the storage size and performance (when analysing data) may hit limits in the future. It should be possible to scale up the storage and performance by adding new hardware.

• The framework should be usable from a PHP-based webserver. The current situation is that PHP is used to send mailings and collect metadata. The storage of metadata and the querying of analytical results should be possible via this PHP code.

• Developing additional analysis algorithms should be kept simple. One may wish to analyse the collected data in new or different ways in order to extract new results. The framework may make use of external libraries that enhance the performance of analysis tasks or make sure they work in the given work environment. Developers that create new algorithms should be supported in doing so and should not have to deal with the complexity of the underlying frameworks and libraries.

• Use of the library should not cause a significant increase in runtime. Whenever a process logs an event, such as a mail sent or a click registered, it should not have to wait for the event to be actually logged. Likewise, any process requesting data from the library should not have to endure long waiting times: the library should be able to serve millions of requests per day without drastically increasing processing time.

2.2 Framework setup

Following the description and requirements, this section will highlight the required components and the design choices that can be made. These design choices will be evaluated according to their advantages and disadvantages in light of the requirements, so that a choice can be made during the implementation.

2.2.1 Storing metadata

Logs of email interactions must be stored on one or multiple servers, so that they can be used for analysis. It must be possible for processes to concurrently store logs on the server.

In order to ensure the framework can find all logs, all servers containing logs could be part of a filesystem. This ensures that the framework does not have to be notified of a logfile’s location every time a new log is added, since the logs reside on a known server. The filesystem should take care of data redundancy and should scale well.

Processes should be able to write logs to the fileserver. One approach is to let all processes write directly to the server where the logfiles reside. This, however, requires the processes to use traditional concurrency tools, such as locks and mutexes, so files don't become corrupt if multiple processes try to append to the same file at the same time. Another possibility is that each process writes to a unique file. This will spawn many small files, but removes the need for locks and mutexes.

Another way to approach this problem is to use a queue, to which the processes can send logs. The queue is then emptied by another process, which in turn opens a single logfile and writes the logs to this file. While this both solves the concurrency problem and allows for large logfiles, which are better for mass processing, it does require a queue to be created and maintained. The process consuming logs from the queue should do so faster than other processes fill the queue.

2.2.2 Collecting metadata

Logs of email interactions may be collected by processes running on the webserver. The requirements state that such processes should have access to functions that log the metadata, but that these functions should not cause a significant increase in runtime. The additional runtime can be decreased by letting the task run in its own thread, so the process can go ahead with what it was originally doing.

2.2.3 Logfile analysis

Periodically, stored logfiles will be analysed by the framework. One requirement specified earlier in this chapter is that the framework should be scalable, and adding extra servers should increase both performance and storage size. Gaining additional performance from extra servers is only possible if analysis is done by using parallel and distributed computing, so that tasks can be split up into smaller tasks and carried out by several cores at once.

This can be achieved by either handling parallelization in the analysis code itself, or by using a framework or external software to split up the task. The first method gives more freedom in choosing the language in which analysis code is written, but requires much more work and may make it far more complex to write an analysis program, depending on the complexity of the algorithm. The second method uses existing software and frameworks and generally makes writing parallelized code less complex, though one is limited to the methods and languages supported by this framework. Since the requirements state that the development of additional analysis tasks should be kept simple, the second method is preferred for this framework, as using an existing framework can reduce the complexity of the code and may allow programmers to create parallelized programs even if they have no experience doing so.

There are several free frameworks available for processing parallelizable problems, such as MapReduce and Spark. MapReduce is a programming model proposed by Google that suggests tasks should be split into what are called map and reduce tasks, so they can be automatically parallelized [18]. The parallelization is then handled by a framework, so the programmer can concentrate on data analysis tasks without having to deal with parallelization code, as long as they follow the model. A popular open-source implementation of MapReduce is Apache Hadoop. Hadoop runs on the Java Virtual Machine and is best used for processing batches and large text files [22]. While its main programming language is Java, it also contains libraries to support a couple of other programming languages, such as Ruby, Python and C++. It is also possible to run any executable with Hadoop MapReduce by using Hadoop Streaming [7]. Hadoop MapReduce traditionally makes use of the Hadoop Distributed File System (HDFS) for data storage and analysis, though it can be configured to work with other filesystems [5].

An alternative to MapReduce is Apache Spark. Spark is a framework for cluster computing that focuses on machine learning (iterative) algorithms, which it argues MapReduce is less suitable for [23]. It backs up this claim by showing that MapReduce's running time grows considerably with the number of iterations when running a logistic regression algorithm. Spark natively supports Scala, Python and Java, although it is possible to use executables in any language with a pipe command [15].

A choice between these systems can be made based on the used programming language and the type of tasks this framework aims to handle. Since Apache Spark focuses on machine learning, it should be considered when the proposed tasks are iterative, or may be in the future.


2.2.4 Result retrieval

Once logfile analysis is complete, the results should be stored in a location that makes requesting them easy. The results can be stored in the database that the webserver uses for other tasks, so they are easily accessible. If necessary, functions should be available to speed up this process or to reduce server load by querying multiple results at once.

2.2.5 Language

The language used for written libraries and frameworks should be carefully considered, as it influences both the accessibility and the performance of the code. Since the requirements state that functions for logging data should be accessible via PHP code in the webserver, this leaves a few options open. One can either invoke PHP functions directly from the PHP webserver, invoke C, C# or C++ functions using PHP extensions, or invoke functions by executing a command with functions like proc_open.

The first approach is trivial and requires the functions to be written in PHP and available on the server. Using it requires little extra work and developing it requires no extra libraries, though it limits the code to PHP. The downside is that PHP is an interpreted language, and thus does not benefit from the performance of native code or from compiler optimisations.

The second approach is less trivial, but can potentially enhance performance. PHP is an interpreted language written in C [8] and uses the Zend Engine to interpret scripts. This engine also allows programmers to export their C code as a PHP extension [4]. This, however, requires knowledge of the Zend Engine, although there are frameworks such as Zephir³ and PHP-CPP⁴ that can be used to make creating PHP extensions easier.

The third approach, using functions like proc_open, allows programmers to create their library in any language that can be run from the command line. This approach benefits from the performance enhancements of compiled languages. Functions like proc_open are, however, often considered dangerous and have known security risks. It is unknown if the function has any effect on the processing time of the program.

³ A domain-specific language used to implement PHP extensions in C.
⁴ A recently developed library for exporting C++ code as PHP extensions.


CHAPTER 3

Implementation

The previous chapter described the various choices that could be made when designing the framework. This chapter describes the implementation of the designed framework. The chapter is split into five sections. The first section gives an overview of the framework and its components, while the remaining four sections describe the implementation of each component and the approaches chosen for problems introduced in the previous chapter.

3.1 Framework structure

This section will outline the general structure of the framework and the components that make up the framework, as well as a short description of each component. It will also explain which programming language and libraries have been used for the code and libraries that make up the framework.

The implemented framework will consist of the following components:

1. A library that contains functions to log email metadata and request data analysis results.

2. A server or server cluster configured to store data and run parallelized and distributed data analysis tasks.

3. A logger running on the server or server cluster that stores email metadata in compressed logfiles.

The flow of data is illustrated in Figure 3.1.

An email interaction, such as a mail sent or an impression received, is detected in the PHP webserver. The server calls a function from the email metadata library to log the interaction. The interaction is sent by the library to a running logger instance, which compresses the file and logs it on the server. The logfiles are periodically processed by the data analysis task, which stores the results in a database. The PHP webserver requests a result from the email metadata library, which queries it from the database and returns the result.

3.1.1 Language and libraries

The framework and libraries will mainly make use of the programming language C++. As described in the previous chapter, PHP is an interpreted language originally written in C. Using C++ instead of PHP allows libraries to benefit from better performance by being compiled as native code and from code optimisations applied during compilation. This satisfies the requirement that the framework should be high-performance.


Figure 3.1: Flow of data in the email metadata framework. (The diagram shows the process on the webserver, the email metadata library, the message queue, the logger application, the filesystem on which logfiles are stored, the metadata analysis framework and the database in which analysis results are stored, together with the data flows between them: metadata is passed to the library, published to the queue, consumed by the logger, stored on the filesystem, analysed by the framework, and the results are stored in the database and served back to the library on request.)

C++ functions can be used by PHP code when they are wrapped in an extension. Since the alternative, the function proc_open, which runs programs from the command line, is known to have security issues, using an extension appears to be the safer alternative.

The C++ functions will be wrapped in an extension with the PHP-CPP library. PHP-CPP is chosen over its alternative Zephir for two reasons. The first reason is preference: Zephir is aimed at C and PHP-CPP uses C++ for its extensions. In the Copernica work environment, C++ is preferred, and using it is also required in order to make use of several of Copernica’s internal libraries. Second, PHP-CPP is internally developed by Copernica, which means any PHP-CPP problems encountered during development of the framework can be validated and fixed more quickly.

Algorithms written in C++ and wrapped in PHP-CPP may achieve better performance. An example in which bubblesort was implemented in both PHP and C++ showed a speedup of approximately eleven times [9].
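To illustrate how such an extension takes shape, the sketch below shows a minimal PHP-CPP module that exposes a single native function to PHP. It is not Copernica's actual extension: the extension name, function name and parameters are made up for this example, and the body only marks where the real library would take over.

```cpp
#include <string>
#include <phpcpp.h>

// Hypothetical native function exposed to PHP as log_interaction($address, $type).
// A real implementation would hand the interaction to the email metadata library.
Php::Value log_interaction(Php::Parameters &params)
{
    std::string address = params[0];    // email address
    std::string type    = params[1];    // e.g. "click", "impression", "error"

    // ... pass the interaction on to the metadata library here ...
    (void)address; (void)type;

    return true;                        // value returned to the PHP caller
}

// Symbol that PHP looks for when loading the extension.
extern "C" {
    PHPCPP_EXPORT void *get_module()
    {
        static Php::Extension extension("emailmeta", "1.0");

        // Register the native function under its PHP name;
        // the exact add() signature may differ between PHP-CPP versions.
        extension.add<log_interaction>("log_interaction");

        return extension;
    }
}
```

Once the extension is loaded, PHP code on the webserver can simply call log_interaction() like any built-in function.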

3.2 Analysis framework

The central component of the framework is the cluster of servers on which logs are stored and data analysis is performed. The analysis framework consists of a filesystem and a framework suitable for parallelized and distributed processing of data. Additionally, the framework contains a logger application, which helps store compressed logfiles on the filesystem.


3.2.1 Place within the framework

The analysis framework will be installed on the servers that host the filesystem, on which email metadata will be stored.

3.2.2 Implementation

In the design considerations it was argued that the analysis framework should support parallelized and distributed processing of email metadata, as this fulfills the requirement that the framework should be scalable. For this, two major frameworks were considered: Apache Spark and Hadoop MapReduce.

Hadoop MapReduce is favored as C++ was chosen in the previous section, and Hadoop provides means to develop algorithms in C++ while using the MapReduce template. Apache Spark can be configured to make use of an executable by use of its pipe command, but this could mean extra complexity for algorithm developers.
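For illustration, the sketch below shows what such a C++ MapReduce program could look like, using the Hadoop Pipes C++ interface and the canonical counting pattern. The class names and the assumed log line layout are made up; this is not the counter program used in chapter 4.

```cpp
#include <string>
#include <vector>

#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

// Map phase: emit a count of 1 for every interaction type found in a log line.
class InteractionCountMapper : public HadoopPipes::Mapper {
public:
    InteractionCountMapper(HadoopPipes::TaskContext& /*context*/) {}

    void map(HadoopPipes::MapContext& context) override {
        // Assumed (hypothetical) log line layout: "<address> <interaction-type> ..."
        std::vector<std::string> fields =
            HadoopUtils::splitString(context.getInputValue(), " ");
        if (fields.size() >= 2) context.emit(fields[1], "1");
    }
};

// Reduce phase: sum the counts per interaction type.
class InteractionCountReducer : public HadoopPipes::Reducer {
public:
    InteractionCountReducer(HadoopPipes::TaskContext& /*context*/) {}

    void reduce(HadoopPipes::ReduceContext& context) override {
        int sum = 0;
        while (context.nextValue()) sum += HadoopUtils::toInt(context.getInputValue());
        context.emit(context.getInputKey(), HadoopUtils::toString(sum));
    }
};

int main(int argc, char** argv) {
    // The Pipes framework distributes the map and reduce tasks across the cluster.
    return HadoopPipes::runTask(
        HadoopPipes::TemplateFactory<InteractionCountMapper, InteractionCountReducer>());
}
```

The developer only writes the map and reduce bodies; splitting the input, shuffling intermediate keys and scheduling the tasks are handled by Hadoop.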

Choosing Hadoop MapReduce potentially makes the framework unsuitable for iterative algorithms (algorithms that make use of machine learning). Hadoop MapReduce can, however, still run iterative algorithms. If future tests point out that such algorithms are favorable over other algorithms, and MapReduce does not achieve the desired results, Apache Spark can be configured to work with the main Hadoop framework as an alternative to MapReduce.

MapReduce will be installed on the same servers that host the filesystem. This allows MapReduce to make use of the principle of data locality, which means it will assign tasks to the servers where the data to be processed is located. This will reduce the extra network communication overhead, which may become a bottleneck in very large clusters[16].

3.3 Servers and filesystem

There are a couple of options for a filesystem suitable for Hadoop MapReduce. Hadoop MapReduce is delivered with a copy of the Hadoop Distributed File System (HDFS) and can easily be set up on top of it. As mentioned before, it is possible to use MapReduce with another filesystem, though that does require some extra configuration to make it work.

3.3.1 HDFS

HDFS is a filesystem that is optimized for use with Hadoop and for batch processing, a task common for Hadoop. MapReduce was designed to work with HDFS and using it may give the best stability. While HDFS seems to work as well as any other filesystem, its design appears to have a bottleneck which stops the performance from scaling linearly. HDFS relies on a central namenode, which tells clients on which nodes the blocks of a file reside. Because this namenode is the only one available and it has to store the location of all files, it is likely to become a bottleneck for a large system [20].

When processing data in the order of petabytes, the workload of the namenode increases drastically and causes other nodes to have to wait longer for a response, thus increasing processing time for a single node. While the overall processing time will still decrease due to more nodes being available to process the workload, the gain per server becomes less with every server added. Though data on mounted servers is replicated two or three times, the central namenode has no backup, making it a possible single point of failure for the whole system. If it goes offline, the whole system will cease to run. This may be a problem for setups that require high availability or minimal maintenance.


3.3.2 GlusterFS

GlusterFS is an open-source distributed filesystem developed and maintained by Red Hat. It is a general-purpose filesystem which focuses on storing and quickly accessing a large number of files. GlusterFS relies on an elastic hashing algorithm, eliminating the need for a central node that tells clients where files are located. Instead, GlusterFS hashes the filename to derive the file location. This allows near-linear scalability [19].

GlusterFS can be configured to work with Hadoop MapReduce, by using a recently developed plugin that acts as a mediator between the Hadoop Filesystem interface and any other desired filesystem[1]. Using this plugin to use GlusterFS rather than HDFS increases the stability of the system by eliminating the single point of failure.

In the current setup, GlusterFS and MapReduce run on a server cluster dedicated to logfile storage and analysis. It is not required that they run on dedicated servers: they can share their resources with other processes and may run on virtual machines[6].

3.4 Metadata logging application

For the framework to meet its goal and requirements, processes on the webserver must have access to a function that stores log data on the aforementioned filesystem. In the design considerations, it was acknowledged that these processes may write concurrently, and thus a solution had to be found. Two different solutions were proposed, with and without the use of a queue.

Hadoop MapReduce performs best when using large files (64 MB - 128 MB) [22]. This makes using a queue the better option, since the other option may result in a large number of small files instead. Using a queue requires setting up a consumer application that writes the logfiles to the filesystem, a queue configured to accept messages for publishing and consuming, and a way to publish messages to this queue. The latter will be described in section 3.5, while subsections 3.4.1 and 3.4.2 focus on the queue and the consumer.

3.4.1 Place within the framework

The consumer application will run as one or multiple instances on the server on which the logfiles will reside. The queue can be set up on any server accessible via the network.

3.4.2 Implementation

There are multiple ways to implement the queue. The queue can be implemented and utilized using a Database Management System, or a message broker can be used to handle the queue. For the first approach, several DBMSs are available in the environment, most notably MongoDB and MariaDB. For the second approach, the available message broker is RabbitMQ, a broker that handles communication using AMQP (Advanced Message Queueing Protocol).

Utilizing the first approach requires the use of a database to store the queue. Messages are serialized as rows in tables, and notifying the consumer of new messages traditionally uses polling: the consumer repeatedly asks the database whether new messages are available. These practices make databases less suitable for implementing message queues, as requesting messages from tables requires a locking mechanism so that each message is only read once, making the whole process more computationally expensive [3].

Some of these disadvantages have a workaround. For example, MongoDB allows the use of a publish/subscribe system to replace the polling system, so that it is more suitable for implementing message queues [11]. Additionally, some developers have found MongoDB to be more suitable for implementing a message queue than some message brokers [13]. However, this requires use of Python for the implementation rather than C++.

The second approach requires the use of a message broker to implement a queue. As opposed to traditional databases, RabbitMQ pushes messages to consumers rather than having to use polling. Pushing is considered to be more efficient than polling and removes the need for a locking mechanism. Additionally, RabbitMQ resides in RAM, making its operations faster, and it uses Flow Control to ensure connections don’t publish messages at a higher rate than the broker can handle[12].

Given the advantages over a traditional DBMS, its compatibility with C++ and the fact that RabbitMQ is already configured in the environment, the application will make use of RabbitMQ for its message queue. The application consumes messages from the queue in order to log them. It is written as a standalone program that continuously runs in the background. It consumes messages from the set-up RabbitMQ queue whenever any are available, and then writes the logfiles to a directory specified in a configuration file. The metadata is stored in JSON format.
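A minimal sketch of such a consumer is shown below. It assumes the AMQP-CPP client library with its libev integration (the thesis does not prescribe a particular client library); the broker address, queue name and logfile path are placeholders, and the compression step discussed next is omitted.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

#include <ev.h>
#include <amqpcpp.h>
#include <amqpcpp/libev.h>

int main()
{
    // Event loop and connection to the broker (address and credentials are placeholders).
    struct ev_loop *loop = EV_DEFAULT;
    AMQP::LibEvHandler handler(loop);
    AMQP::TcpConnection connection(&handler, AMQP::Address("amqp://guest:guest@localhost/"));
    AMQP::TcpChannel channel(&connection);

    // Hypothetical queue name; in the real setup this would come from a configuration file.
    channel.declareQueue("email-metadata");

    // Append every consumed message (a JSON-encoded interaction) to a logfile.
    std::ofstream logfile("/var/log/emailmeta/current.log", std::ios::app);

    channel.consume("email-metadata").onReceived(
        [&channel, &logfile](const AMQP::Message &message, uint64_t deliveryTag, bool /*redelivered*/)
        {
            logfile << std::string(message.body(), message.bodySize()) << '\n';
            channel.ack(deliveryTag);   // acknowledge so the broker can drop the message
        });

    ev_run(loop, 0);    // keep consuming until the process is stopped
    return 0;
}
```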

The messages to be logged arrive uncompressed and are likely to contain a great deal of repetition. In order to reduce the amount of space these logs take up, messages can be compressed before they are written to file. Choosing a compression algorithm should take into consideration its compatibility with Hadoop MapReduce, along with compression speed and efficiency.

Most of the available compression algorithms are unsuitable for use with Hadoop MapReduce, as once compressed, the file cannot be split into multiple blocks, which means one file can at most be processed by one node. One algorithm that solves this problem is Bzip2. While this compression method is slower than most other formats, the advantage is that it separates its smaller internal blocks with a magic number, a 48-bit approximation of pi. This makes it possible for Hadoop to determine where blocks begin and end, and makes it possible to divide the compressed logfile across multiple nodes when analysing it.
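As an illustration of the compression step, the sketch below writes a batch of log lines as one bzip2 stream appended to a logfile using libbz2. The function name and error handling are made up; it is not the consumer's actual code.

```cpp
#include <cstdio>
#include <string>
#include <vector>

#include <bzlib.h>

// Write a batch of JSON log lines as one bzip2 stream appended to a logfile.
// Concatenated bzip2 streams still form a valid .bz2 file, and Hadoop can split
// the file on the internal block boundaries when processing it.
bool appendCompressedBatch(const std::string &path, const std::vector<std::string> &lines)
{
    FILE *file = std::fopen(path.c_str(), "ab");
    if (file == nullptr) return false;

    int bzerror = BZ_OK;
    BZFILE *bzfile = BZ2_bzWriteOpen(&bzerror, file, /*blockSize100k=*/9, /*verbosity=*/0, /*workFactor=*/0);
    if (bzerror != BZ_OK) { std::fclose(file); return false; }

    for (const std::string &line : lines) {
        std::string record = line + "\n";
        BZ2_bzWrite(&bzerror, bzfile, const_cast<char *>(record.data()), static_cast<int>(record.size()));
        if (bzerror != BZ_OK) break;
    }

    unsigned int bytesIn = 0, bytesOut = 0;
    BZ2_bzWriteClose(&bzerror, bzfile, /*abandon=*/0, &bytesIn, &bytesOut);
    std::fclose(file);
    return bzerror == BZ_OK;
}
```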

3.5 Email metadata library

One of the main components of the framework is the email metadata library, a library that contains functions to log email metadata and request analysis results.

3.5.1 Place within the framework

The library will be available as a PHP extension, wrapped with PHP-CPP. The email metadata library will be installed on all servers that handle sending emails, processing responses or delivering content. The processed responses mainly consist of error reports and abuse reports, while the content deliverer can detect clicks and impressions. Emails sent are also registered, so that the absence of interactions can be detected.

While the library does handle logging email metadata, it does not actually write the data to a log itself. This is handled by the consumer application described in the previous section.

3.5.2 Logging metadata

Many instances of the library should be able to communicate with the same logger. In order to handle the concurrency, the library places the metadata in a queue in RabbitMQ, as discussed in the previous section. Collected interactions are published to the queue, after which they will be consumed and logged by the consumer application.


When an email interaction is detected, a function of the PHP extension is called. This function checks if an instance of the email metadata library already exists and creates one if this is not the case. This is done to prevent having to recreate the entire object every time metadata is registered, thus improving the performance of the library and reducing the resources it requires. On initialization, the library spawns a thread which connects to RabbitMQ and declares a channel over which to publish messages. A process that wishes to log collected metadata can then use the library to do so. The thread runs in the background while the main object is kept in the webserver's memory. To communicate between the thread and the main process, a watcher¹ is added which can be activated to notify the main process that there are messages ready to be sent.

To register the interaction, the library stores the data in an object holding the metadata, which is then stored in a queue. The watcher is then activated and the main thread is awakened. The main thread takes messages from the internal queue and publishes them to RabbitMQ, using the AMQP protocol. The messages will then be stored on a queue on the RabbitMQ server, until the consumer application consumes them and stores them in logfiles.

One problem that arises is that threads do not continue once the parent process is terminated, which may happen when the webserver restarts or clears its data. This can result in the thread being terminated before all the messages are sent through AMQP, and thus data being lost. To avoid this, the PHP extension is configured to destruct the object and its thread before the process terminates. The destructor waits until all messages are sent, then safely closes the channel and connection and finally stops the thread.
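The interplay between the calling process, the internal queue, the background thread and the shutdown sequence can be summarised with standard C++ primitives. The sketch below is illustrative only: it replaces the watcher mechanism with a condition variable, leaves out the actual AMQP publishing, and all names are made up.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Illustrative stand-in for the email metadata library's publisher:
// callers enqueue serialized interactions, a background thread drains the
// queue and publishes them, and the destructor waits until the queue is
// empty before shutting the thread down.
class MetadataPublisher {
public:
    MetadataPublisher() : _worker([this] { run(); }) {}

    ~MetadataPublisher()
    {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _stopping = true;
        }
        _wakeup.notify_one();
        _worker.join();             // returns only after the queue has been drained
    }

    // Called by the webserver process whenever an interaction is detected.
    void log(const std::string &json)
    {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _queue.push(json);
        }
        _wakeup.notify_one();       // plays the role of the watcher: wake the thread
    }

private:
    void run()
    {
        std::unique_lock<std::mutex> lock(_mutex);
        for (;;) {
            _wakeup.wait(lock, [this] { return _stopping || !_queue.empty(); });
            while (!_queue.empty()) {
                std::string message = std::move(_queue.front());
                _queue.pop();
                lock.unlock();
                publish(message);   // would publish over AMQP in the real library
                lock.lock();
            }
            if (_stopping) return;  // queue is empty and shutdown was requested
        }
    }

    void publish(const std::string &message)
    {
        // Placeholder: the real library publishes the message to a RabbitMQ channel.
        (void)message;
    }

    std::mutex _mutex;
    std::condition_variable _wakeup;
    std::queue<std::string> _queue;
    bool _stopping = false;
    std::thread _worker;            // started last, after all other members exist
};
```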


CHAPTER 4

Experiments

This chapter will cover the experiments conducted on the framework as it has been implemented, and will display the results from these experiments. The implemented framework will also be evaluated against the requirements stated in chapter 2.

For most requirements formulated in chapter 2, the measures taken to fulfill them in the implemented framework have been discussed in the design considerations and the implementation. Some requirements, however, require extra research to argue that they have been fulfilled. In the following sections, each requirement is evaluated individually to determine whether it has been fulfilled.

4.1 Scalability

Several measures have been taken to ensure scalability of the framework. The framework makes use of Hadoop MapReduce to divide and parallelize its tasks, so that analysis tasks can be split up across nodes. Adding additional servers should theoretically make performance scale linearly. Hadoop MapReduce has been configured to work with GlusterFS instead of Hadoop Distributed File System, in order to make the system more reliable by eliminating the single point of failure, and to reduce the bottleneck that occurs at a large number of machines. GlusterFS allows users to increase storage size by mounting additional devices.

The current environment unfortunately did not allow for a reliable test to check if performance and storage size indeed scale linearly at a large number of machines.

4.2 Accessibility from webserver

The requirements state that the code used to log metadata should be accessible from the webserver. This has been guaranteed by wrapping the C++ code in a PHP extension. This allows the libraries to benefit from the enhanced performance of a compiled library, while keeping it accessible from PHP.

4.3 Complexity of algorithms

The requirement of performance scalability potentially made designing algorithms more complex, as programmers additionally had to deal with ensuring their tasks could be parallelized. The use of MapReduce forces programmers to divide their programs into one or multiple map and reduce functions. This essentially limits them in their program design, but reduces the complexity introduced by parallelization and distribution of tasks.

4.4 Performance

Several measures have been taken to theoretically both improve the performance of the framework and libraries and reduce the time processes spend waiting for the function to finish. The use of PHP extensions written in C++ allowed libraries to benefit from the performance boost of native and optimized code. Where possible, calculations have been done in separate threads, in order to reduce the wait time of processes running on the webserver. Additionally, results are calculated with the Hadoop MapReduce framework, so that processes only have to request the already available results.

4.4.1 Email metadata library

In order to calculate the performance and capacity of the email metadata library and the consumer application, a small environment was set up on a Dell Latitude E530 laptop with an Intel Core i7-320QM CPU @ 2.60GHz running Ubuntu 14.04 LTS. The environment ran Apache 2 with PHP 5.5.9 and RabbitMQ 3.2.4, which was used to queue and deliver the log messages. The consumer application was running in the background and the email metadata library was installed as a PHP extension.

The test consisted of publishing 500000 messages with 64 characters of information to RabbitMQ¹. Prior to every publish, a new instance of the Emailmeta library was created², which was then unset after the message was published. This was done to simulate 500000 different processes, each publishing one message at a time. Every 10000 messages, the computational time was measured. The test was executed three times to reduce outliers. The output of this test can be viewed in Figure 4.1. The RabbitMQ statistics of the first test can be viewed in Figure 4.2.

Figure 4.1: Email metadata library performance test results

¹ At the time of writing, it was not possible to obtain a representative collection of log messages.
² In reality, the instance of the library is kept in memory by Apache, so that the whole instance does not have to be recreated every time a message is sent, thus improving the performance.


Figure 4.2: Graph showing RabbitMQ throughput during test

From the test it can be deduced that the time taken to publish 10000 messages averaged 167.844 milliseconds, which comes down to less than 0.02 milliseconds per message. Figure 4.2 shows that the maximum publishing capacity lies around 25000 messages per second, while the consuming capacity reaches its peak at 10000 messages per second.

It should be noted that, when running the tests through the PHP command line, attempts to send over 100000 logs would cause the console to experience a slight delay before the script terminated. The reason for this is that at this rate, RabbitMQ uses Flow Control to temporarily block the channel, as it cannot handle passing messages at such a rate. When the script is terminated, PHP first performs garbage collection and only exits when this is finished. Since it cannot close the channel before all messages are published and the block is lifted, the console is forced to wait until the library's internal queue is empty. A PHP script running on Apache will never experience this delay, as the object is stored in the memory of the Apache module, so the script does not have to deal with garbage collection.

4.4.2 Analysis framework

A small test was designed to determine the capacity of the analysis framework. The analysis framework was deployed on an Ubuntu server with two terabytes of storage space and four Intel Xeon CPU E5-2403 @ 1.80GHz processors. The server was running 64-bit Ubuntu 12.04.4 LTS. The test consisted of analysing logfiles containing 50 million logs, using a simple MapReduce counter program. The test was repeated three times, each time doubling the total amount of logs analysed, and each test was executed five times. The average execution time of each test can be found in Table 4.1.

Table 4.1: Analysis framework performance test results

Amount of logs      CPU time spent (seconds)
50 · 10⁶            83.3
100 · 10⁶           173.9
200 · 10⁶           347.6
400 · 10⁶           693.6


CHAPTER 5

Conclusions

This thesis covered the design and implementation of a framework for analysing email metadata. In order to do so, the goal of the framework was described along with a set of requirements to fulfill so that the framework could be applied to the working environment of Copernica Marketing Software.

The design and components of the framework were then described, each with several approaches for solving challenges described in the design phase. Each approach was evaluated against the requirements and how they would contribute to a maximally efficient framework. Both the advantages and disadvantages of each approach were stated.

The implementation phase described which approaches were chosen from those introduced in the design phase, along with the motivation why these would best fulfill the requirements. The individual components of the framework were then implemented and deployed in a working environment. The designed framework was then evaluated to see whether it fulfilled the requirements. From the results, it can be concluded that the design choices made have contributed to the fulfillment of the requirements. This was established by theoretical analysis and by executing two tests of the performance of the email metadata library, the consumer application and the analysis framework.

The first test consisted of logging five hundred thousand email interactions through the email metadata library. The average time a process spent to log metadata was less than 0.02 milliseconds. If that time is multiplied by the estimated number of email interactions per day from chapter 2, twelve million, then the total time spent waiting by processes would amount to four minutes. This confirms that the library will most likely not be a bottleneck in the calculations done when detecting an email interaction. It can therefore be concluded that the library fulfills the requirement of high performance and low waiting time.
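Spelled out, the arithmetic behind this estimate is:

\[
12 \times 10^{6}\ \text{interactions/day} \times 0.02\ \text{ms/interaction} = 2.4 \times 10^{5}\ \text{ms} = 240\ \text{s} \approx 4\ \text{minutes}.
\]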

The second test consisted of analysing fifty to four hundred million logs using the analysis framework. The results showed near-linear growth in computation time as the number of logs increased. Extrapolating the results predicts a throughput of between 49 and 52 billion logs per day, or approximately 4000 times the estimated number of daily interactions. It can therefore be concluded that the analysis framework has sufficient performance for regular analysis of email metadata.
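The extrapolation can be reproduced from the endpoints of Table 4.1, assuming the measured rates also hold at larger volumes:

\[
\frac{400 \times 10^{6}}{693.6\ \text{s}} \times 86\,400\ \tfrac{\text{s}}{\text{day}} \approx 49.8 \times 10^{9}\ \text{logs/day},
\qquad
\frac{50 \times 10^{6}}{83.3\ \text{s}} \times 86\,400\ \tfrac{\text{s}}{\text{day}} \approx 51.9 \times 10^{9}\ \text{logs/day}.
\]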

5.1 Discussion

While in theory most requirements have been fulfilled, and some have been supported with additional experiments, it would have been better if the components of the framework had been tested using more quantitative tests. Unfortunately this was not always possible, as the equipment to do so often was not available. For example, in order to test the scalability of the system, it would have been appropriate to compare the MapReduce results of the current setup with a configuration with additional cores, but the servers available all ran on the same set of cores.

It would have been appropriate to test the full framework in the working environment of Copernica Marketing Software. At the time of writing, not all components of the framework were fully integrated with the software. Thus, even though the functioning of individual components could be validated, whether they would fully function in the complete framework and have an appropriate capacity could not be tested.

Additionally, it would have been possible to accurately determine the performance of each component compared to its alternative implementations if all design options had been implemented. This, however, would have taken a substantial amount of time and would have exceeded the scope of this thesis.

Some of the requirements specified in chapter 2 may have been redundant. The requirement of scalability may be important for a large-scale system, but the current setup has proven to be sufficient for the next couple of years, and perhaps longer if Copernica decides to periodically purge old logfiles. Then again, it is possible that the framework will be used for more types of data, in which case the scalability of the system is a useful trait.

5.2 Future work

The focus of this thesis was on describing the framework that makes it possible to analyse metadata gathered from email addresses. Many improvements can still be made to the framework and the data analysis. This section discusses some proposals for future research or development.

Little research has been done to determine which algorithm is the most suitable for calculating the validity and activity of an email address. It is yet unknown whether calculating that trait is best done with a traditional algorithm or an iterative algorithm. It is up to future researchers to find this out and attempt to design and implement such an algorithm, using the framework described in this thesis.

This framework was built for the purpose of calculating the validity and activity of email addresses, but the collected metadata can also be used for calculating other email address traits. One example is that the metadata can be used to determine the preferred user agent of a user. While internet browsers usually display HTML on webpages according to standards, for emails the layout depends on the email client's internal interpreter [14]. Once an application knows which email client an email is being sent to, it can modify and tweak the layout to optimize it for that client before sending it.

5.3 Recommendations

Following this thesis, there are several things that can be done to improve the framework and its reliability. First, not all components have yet been fully integrated into the Copernica software. A working system is required in order to fully test the stability and capability of the framework.

Second, there is currently no algorithm in place to calculate the validity and activity of email addresses. It is recommended that students or researchers with an expertise in mathematics or multivariate algorithms conduct research to devise such an algorithm.


Bibliography

[1] Apache hadoop enablement on glusterfs? https://forge.gluster.org/hadoop/pages/Home. Accessed: 2014-07-30.

[2] Blog: Our new safetosend deliverability solution: A sneak peek to a revolution in email. http://www.freshaddress.com/fresh-perspectives-blog/our-new-safetosend-deliverability-solution-a-sneak-peek-to-a-revolution-in-email/. Accessed: 2014-07-29.

[3] The database as queue anti-pattern. http://mikehadlow.blogspot.nl/2012/04/database-as-queue-anti-pattern.html. Accessed: 2014-08-18.

[4] Extension writing part i: Introduction to php and zend. http://devzone.zend.com/303/extension-writing-part-i-introduction-to-php-and-zend/. Accessed: 2014-07-30.

[5] Filesystem compatibility with apache hadoop. http://wiki.apache.org/hadoop/HCFS. Accessed: 2014-07-29.

[6] Glusterfs technical faq. http://gluster.org/community/documentation/index.php/GlusterFS_Technical_FAQ/. Accessed: 2014-08-19.

[7] Hadoop streaming. http://hadoop.apache.org/docs/r1.2.1/streaming.html. Accessed: 2014-07-29.

[8] History of php. http://php.net/manual/en/history.php.php. Accessed: 2014-07-30.

[9] How fast is a c++ extension? http://www.php-cpp.com/documentation/bubblesort. Accessed: 2014-07-30.

[10] Pricing overview — email newsletter software. https://www.copernica.com/en/tarieven. Accessed: 2014-05-22.

[11] Pub/sub with mongodb. http://blog.mongodb.org/post/29495793738/pub-sub-with-mongodb. Accessed: 2014-08-18.

[12] Rabbitmq - flow control. http://www.rabbitmq.com/memory.html. Accessed: 2014-08-18.

[13] Replacing rabbitmq with mongodb. https://blog.serverdensity.com/replacing-rabbitmq-with-mongodb/. Accessed: 2014-08-18.

[14] Some tips for email layout and responsiveness. http://artsy.github.io/blog/2014/03/17/some-tips-for-email-layout-and-responsiveness. Accessed: 2014-07-29.

[15] spark.rdd.PipedRDD. http://spark.apache.org/docs/0.6.1/api/core/spark/rdd/PipedRDD.html. Accessed: 2014-08-09.

[16] What is the hadoop distributed file system (hdfs)? http://www-01.ibm.com/software/data/infosphere/hadoop/hdfs/. Accessed: 2014-07-30.

[17] Austin C. Bliss, Robert W. Mack, William B. Kaplan, and Mark Rosenstein. System for storing and retrieving old and new electronic identifiers, November 25 2003. US Patent 6,654,789.

[18] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[19] Benjamin Depardon, Gaël Le Mahec, Cyril Séguin, et al. Analysis of six distributed file systems. 2013.

[20] Konstantin V. Shvachko. HDFS scalability: The limits to growth. ;login:, 35(2):6–16, 2010.

[21] Peter Sommerer. Method and system for automatically updating electronic mail address information within an electronic mail address database, February 17 2004. US Patent 6,694,353.

[22] Tom White. Hadoop: The Definitive Guide. O'Reilly Media, Inc., 2012.

[23] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, pages 10–10, 2010.
