Improving the Precision of the HyperLogLog Algorithm by Introducing a Bias

SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE

Thomas Kamps
10758151

MASTER INFORMATION STUDIES
Data Science
FACULTY OF SCIENCE
UNIVERSITY OF AMSTERDAM

05-07-2019

1st Examiner: Dr. Maarten Marx (UvA, FNWI, IvI)
2nd Examiner: Dr. Aljar Meesters (Copernica B.V.)


Improving the Precision of the HyperLogLog Algorithm by Introducing a Bias

Thomas Kamps

University of Amsterdam
Copernica Marketing Software

thomas@kamps.email

ABSTRACT

Determining the cardinality, or number of distinct elements, of a multiset is a well-known and often encountered problem in computer science. Deterministic and exact methods require amounts of memory proportional to the size of the collection, making them unfeasible for large collections of data. Another option is using a probabilistic approach, like the HyperLogLog algorithm. This algorithm provides an estimation of the required cardinality. However, due to its probabilistic nature, the algorithm can over- or underestimate, despite being asymptotically unbiased. This research aims to find out whether this variance can be reduced, at the cost of introducing a bias. Naturally the costs of these modifications are taken into account, in terms of lost accuracy and computational expenses. A number of modifications introducing a bias are proposed, based on an upper bound: the total number of items added to the HyperLogLog instance. This intuitively makes sense, since the number of unique items can never be higher than the total number of items in the collection. Each proposed version has been evaluated with different sets of simulated data and validated with five "real-life" data sets. Most methods produced results that are too erratic to be used in practice. Only methods to improve results for data sets with a high relative cardinality proved successful.

KEYWORDS

cardinality estimation, HyperLogLog

1 INTRODUCTION

Information technology is becoming more and more ubiquitous and dominant in our daily lives. Also, an increasing number of people are getting access to the internet, accelerating the growth of our global datasphere even more. Due to the rise of technologies like IoT, cloud and big data we produce more and more data at an almost alarming rate. This Global Datasphere is defined by [24] as all data that is being captured, created or replicated and is expected to grow from 33 zettabytes in 2018 to 175 zettabytes in 2025. This ever increasing amount of data puts increasing constraints on what can be stored and what can only be analyzed in real time. Where previously all data would be stored to be analyzed in batches later, there are more and more situations where this approach is simply not feasible. Due to these cases, among other things, there is a rising need for algorithms that can analyze large amounts of streaming data in real time.

One of the problems often encountered in computer science is finding the number of unique elements in a collection. People with database experience will likely know the "COUNT DISTINCT" operation. This problem has many use cases, for instance determining the number of unique visitors on a website or detecting malicious traffic in networks[9]. Naively, this is not a hard or complicated task. You simply keep a record of the items you have seen before; whenever an item is not in that record, you increment the count and add the item to the record. This approach delivers deterministic and exact results at a good time complexity. However, the space complexity is O(n)[1], assuming the size of the items (or their representation) to be a constant. This kind of linear complexity can quickly become an issue: for instance, when counting IPv6 addresses, each address is 128 bits, leading to a memory usage of almost 15GB when one billion unique addresses need to be counted. Counting unique items on many different machines also causes a problem, since these lists cannot easily be synchronized; doing so will likely cost a lot of network traffic and opens the system up to the risk of inconsistencies. To combat this, many approaches to the problem of cardinality estimation have been proposed. These methods, instead of providing an exact answer, provide an estimation of this answer with only a fraction of the space complexity.
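To make the linear space cost concrete, the following is a minimal C++ sketch of the naive exact approach described above; the function name and the use of std::unordered_set are illustrative choices, not taken from the thesis code.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Naive exact distinct count: keep a record of every element seen so far.
// Memory grows linearly with the number of unique elements, which is the
// O(n) space cost described in the text.
std::size_t countDistinctExact(const std::vector<std::string>& items) {
    std::unordered_set<std::string> seen;
    std::size_t distinct = 0;
    for (const auto& item : items) {
        if (seen.insert(item).second) {  // insert() reports whether the item was new
            ++distinct;                  // count it only the first time it appears
        }
    }
    return distinct;                     // equivalently, seen.size()
}
```

Every distinct item ends up stored in the set, which is exactly why the memory footprint grows with the number of unique elements rather than with the sketch size.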

In the past decades, many cardinality estimation algorithms with a probabilistic nature were developed[22]. Some of these use sampling strategies to reduce the space complexity; others use a sketch-based approach, scanning all items and reducing them into sketches that need only limited memory. One of those sketch-based algorithms is HyperLogLog[11], which has a very practical implementation (low memory requirements, high accuracy, deterministic and reproducible for a specific set of data, easily parallelizable) and is widely used. This algorithm works by using the randomization caused by hashing the elements to approximate the cardinality. The research shows a typical error of 2% while using only 1.5 kilobytes of memory for collections of more than 10^9 elements[11]. For many purposes this estimation is well within the acceptable boundaries. The estimator of the algorithm is unbiased and naturally suffers from some variance in its estimations between different sets of data. For some purposes, like trend analysis, this variance might be obstructive. If you want to analyze some trend in a data stream (e.g. unique items per day), the variance might skew your results, especially if your time window is relatively small. In this case you might want to trade off some accuracy in favour of increasing precision. This notion is supported by the well-known Stein's phenomenon [25], which states that an unbiased estimator, like the one used in the HyperLogLog algorithm, is suboptimal when it comes to precision and might improve upon the introduction of a bias.

One bias that comes to mind is introducing an upper bound to the algorithm. This upper bound can be the total number of added items, since there can never be more unique items than that. This upper bound can be implemented in multiple ways, and using it should lead to a downward bias.

This paper will answer the following research question: "Can the precision of the HyperLogLog algorithm be improved by introducing a bias and at what cost?"

This research question can be divided into the following sub-questions:

• What are feasible ways to implement such a bias?
• What is the benefit of the biases? (in terms of precision)
• What is the cost of the biases? (in terms of lost accuracy)
• What is the cost of the biases? (in terms of computational power and memory)

These questions will be answered throughout this paper. First, a selection of background information and related work is presented, to provide a proper background in the necessary concepts and to define some of the terms and techniques used. After that, the methodology used for this research is introduced, followed by its results. At the end, conclusions are provided, together with some points of discussion and suggestions for further research.

2 BACKGROUND

To have a proper understanding of this paper, some background on the evaluation measures used is necessary. The methods used are described in this section and the formulas for their calculation are given. The two principal measures discussed in this paper are precision and accuracy. The definition of accuracy adhered to by this paper is as follows[18]:

The closeness of a measured or computed value to its true value.

Whereas accuracy pertains to measurements related to a true value, precision pertains to measurements among each other. The definition adhered to by this paper is as follows[18]:

The closeness of repeated measurements of the same quantity. For a measurement technique that is free of bias, precision implies accuracy.

To quantify accuracy throughout this paper, the mean relative error (MRE) is used. When the relative error is close to zero, the measurement is accurate. The formula for its calculation is given below:

$$MRE = \frac{1}{n} \sum_{i=1}^{n} \frac{x_{est,i} - x_{true,i}}{x_{true,i}} \qquad (1)$$

To quantify the precision throughout this paper, the standard deviation of the relative errors is used. The relative error of a measurement is defined as follows:

$$RE = \frac{x_{est} - x_{true}}{x_{true}} \qquad (2)$$

A measure that can be used to give a combined indication of precision and accuracy is the normalized root-mean-square error (NRMSE). This measure is a normalized version of the root-mean-square error (RMSE)[15]. The normalization is performed so that comparisons between results of different data sets can easily be made. The formula for the NRMSE as used throughout this paper is as follows, with $\bar{x}_{true}$ denoting the mean of all true values, which is used for normalization:

$$NRMSE = \frac{1}{\bar{x}_{true}} \sqrt{\frac{\sum_{i=1}^{n} (x_{est,i} - x_{true,i})^2}{n}} \qquad (3)$$
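As an illustration of how these three measures fit together, the following C++ sketch computes the MRE, the standard deviation of the relative errors and the NRMSE from paired vectors of estimated and true cardinalities. The function and struct names are hypothetical; the standard deviation is taken over the n runs, matching equations (1)-(3).

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative computation of the evaluation measures from equations (1)-(3).
struct EvaluationMeasures {
    double mre;      // mean relative error, equation (1)
    double stdOfRE;  // standard deviation of the relative errors from equation (2)
    double nrmse;    // normalized root-mean-square error, equation (3)
};

EvaluationMeasures evaluate(const std::vector<double>& est,
                            const std::vector<double>& truth) {
    const std::size_t n = truth.size();
    std::vector<double> relErrors(n);
    double sumRE = 0.0, sumSqDiff = 0.0, sumTrue = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        relErrors[i] = (est[i] - truth[i]) / truth[i];          // equation (2)
        sumRE += relErrors[i];
        sumSqDiff += (est[i] - truth[i]) * (est[i] - truth[i]);
        sumTrue += truth[i];
    }
    const double mre = sumRE / n;
    double sumSqDev = 0.0;
    for (double re : relErrors) sumSqDev += (re - mre) * (re - mre);
    const double meanTrue = sumTrue / n;
    return EvaluationMeasures{
        mre,
        std::sqrt(sumSqDev / n),             // population std. deviation of the REs
        std::sqrt(sumSqDiff / n) / meanTrue  // RMSE normalized by the mean true value
    };
}
```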

3 RELATED WORK

In this section some work related to this research is discussed. In the first subsection a short overview of cardinality estimation techniques is provided. The second subsection covers the basics, origin and evolution of the HyperLogLog algorithm. In the third subsection some motivation is given for the introduced biases.

3.1 Cardinality estimation

The count distinct problem occurs in many use cases. It is used to identify the number of unique packets according to a certain property (i.e. source or destination IP-address) passing through routers [9, 13, 14]. A more classical use case is query optimization in large scale database systems [12, 19]. Calculating the exact number of unique elements in a multiset requires memory linearly proportional to the number of unique elements. This is intuitive, considering that to get the exact number of unique elements a register containing a representation of all unique elements needs to be kept. The space complexity of O(n) (assuming the size of item representations to be constant) is derived in more detail in the work of Alon, Matias & Szegedy[1].

Many approaches to cardinality estimation have been proposed over the course of the last decades[22]. These approaches can be roughly divided into two categories: sampling-based methods and sketch-based methods. Methods based on sampling do not use the full set of data; they sample a subset and use this subset to make an estimation about the complete set. The second approach, the sketch-based method, scans the full set of data but only stores sketches containing indicators derived from the data that are relatively small in terms of memory usage. These stored sketches for instance keep track of certain properties of the data. The sketches can then be used by an estimator function to provide an estimate based on the properties stored in them. The first category theoretically has an advantage in time complexity, since not all the data needs to be processed, but only a sample. In practice, however, it seems that to get an estimation error similar to the sketch-based algorithms, often most of the data needs to be sampled [4, 16]. The advantage in time complexity is then lost, while the space complexity only increases. An example of a sampling-based method is Adaptive Sampling[10]. The HyperLogLog algorithm that is the subject of this paper belongs to the sketch-based algorithms.

3.2 The architecture and evolution of HyperLogLog

The basis of the HyperLogLog algorithm is the observation that when a collection of elements is hashed using a hash function resulting in random, uniformly and independently distributed bits, the distribution of ones and zeros over these hashes can be used to estimate the number of elements in the collection. If we assume the bits of the hashes are uniformly distributed, the bit pattern $0^k1\ldots$ appears with probability $1/2^{k+1}$. This observation was published by Flajolet & Martin in 1985 [12] as the basis for their cardinality estimation algorithm, which became known as the Flajolet-Martin algorithm. This algorithm keeps track of the positions at which the first one-bit was seen for the collection of elements. Following the statistical pattern defined above, if the largest position at which a one-bit was first seen is six, this happens with a probability of $1/2^6 = 0.015625$, or one in sixty-four times. This leads to an estimation of sixty-four unique items that were added to the algorithm.

The Flajolet-Martin algorithm is highly memory efficient, but its performance in terms of accuracy and precision leaves room for improvement. The precision of the algorithm is relatively low, which can be explained by the fact that only one estimation is made, based on the entire set of data. This principle makes the algorithm highly vulnerable to outliers. To improve upon this, Durand & Flajolet introduced the LogLog algorithm in 2003[7]. This algorithm uses the same statistical principle as the Flajolet-Martin algorithm, but uses the first b bits of each hash to assign the element to a bin. This value of b is a parameter that can be adjusted when using the algorithm. A higher value of b will result in a lower error (when the cardinality is high enough), but a lower value means less memory is used. Every one of the $2^b$ bins keeps track of the rank of the first one-bit in the remaining bit sequence. The estimations that can be made from all these bins are combined at the end into a final estimator of the number of unique elements. Each bin needs to be at least $\log \log N_{max}$ bits, which leads to $2^b \log \log N_{max}$ bits of memory usage. Since the number of bins can be considered a constant, the space complexity of the algorithm is $O(\log \log N_{max})$. The estimates produced by this algorithm are asymptotically unbiased and have a relative error of $1.30/\sqrt{2^b}$. The following equation is used to produce the estimate for the LogLog algorithm, where m is the number of bins ($2^b$) and M is the vector of bin values. The constant $\alpha_m$ is a correction for the systematic bias that is present, depending on the number of bins m.

$$E = \alpha_m \, m \, 2^{\frac{1}{m} \sum_{j=1}^{m} M[j]} \qquad (4)$$

HyperLogLog[11] was introduced a few years after LogLog and implements some improvements. The most significant improvement is to use the harmonic mean instead of the arithmetic mean to combine the estimators of the bins. This leads to the following equation for the final estimator:

$$E = \alpha_m \, m^2 \left( \sum_{j=1}^{m} 2^{-M[j]} \right)^{-1} \qquad (5)$$

The space complexity of HyperLogLog remains unchanged in comparison to LogLog, but the relative error dropped to $1.04/\sqrt{2^b}$.

In practice, HyperLogLog is usually used with two additional corrections: a small and a large range correction. The small range correction is necessary for small cardinalities; the authors recommend applying it when $n \le \frac{5}{2} \cdot 2^b$. Below this threshold there is simply not enough data to generate a proper estimate. When the number of unique items becomes very large, hash collisions become more and more likely to occur, depending on the hash function used. To compensate for these, a large range correction can be applied.
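For reference, the following is a minimal C++ sketch of the core HyperLogLog update and the raw estimator of equation (5), without the range corrections discussed above. The class name, the use of __builtin_clz and the bias-correction constant $\alpha_m \approx 0.7213/(1 + 1.079/m)$ (valid for $m \ge 128$) are illustrative choices, not the implementation used in this research.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Minimal HyperLogLog sketch: b leading bits of the hash select a bin,
// each bin stores the maximum rank (position of the first one-bit) seen
// in the remaining bits, and equation (5) combines the bins.
class HyperLogLogSketch {
public:
    explicit HyperLogLogSketch(unsigned b)
        : b_(b), m_(1u << b), M_(m_, 0) {}

    // Add one pre-hashed 32-bit element (e.g. a MurmurHash3 value).
    void add(uint32_t hash) {
        uint32_t bin = hash >> (32 - b_);           // first b bits select the bin
        uint32_t rest = hash << b_;                 // remaining 32-b bits
        uint8_t rank = rest == 0 ? static_cast<uint8_t>(32 - b_ + 1)
                                 : static_cast<uint8_t>(__builtin_clz(rest) + 1);
        if (rank > M_[bin]) M_[bin] = rank;         // keep the maximum rank per bin
    }

    // Raw estimator E = alpha_m * m^2 * (sum_j 2^{-M[j]})^{-1}, equation (5).
    double estimate() const {
        double sum = 0.0;
        for (uint8_t v : M_) sum += std::ldexp(1.0, -static_cast<int>(v));  // 2^{-v}
        double alpha = 0.7213 / (1.0 + 1.079 / m_);  // approximation for m >= 128
        return alpha * m_ * m_ / sum;
    }

private:
    unsigned b_;
    uint32_t m_;
    std::vector<uint8_t> M_;
};
```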

HyperLogLog is widely used and further developed upon. A team of engineers at Google improved the algorithm to what is known as HyperLogLog++[17]. In this version the large range correction is no longer necessary due to a larger, 64-bit hash function. The memory requirements are reduced by using a sparse representation for the registers that can be transformed to a dense representation when needed. HyperLogLog can also be adjusted to be used in a sliding window[6], although this requires some additional memory[3]. Other adaptations of the HyperLogLog algorithm include using it in combination with a sampling strategy[5], using different hashing techniques[27] or improving the estimator function used[8].

3.3 Introducing a bias to improve precision

The estimator of the HyperLogLog algorithm as described in the previous section is asymptotically unbiased. This means that the bias of the estimator converges to zero when the number of data points goes to infinity. But the fact that this estimator is unbiased does not directly mean that it is optimal. Stein published a paper in 1956 that revealed a quite groundbreaking concept, nowadays referred to as Stein's Paradox or Stein's Phenomenon[25]. In this paper he proves that an estimator that is optimal in a univariate setting is suboptimal in a multivariate setting. This means that your estimation of the average height of Dutch men can improve when you also take into account the average snowfall in Austria. This is paradoxical because introducing a bias can improve your estimate. This observation also shows that an unbiased estimator is suboptimal when it comes to precision. This theorem therefore provides grounds for this research. Greene (1993) wrote the following:

Focusing on unbiasedness may still preclude a tolerably biased estimator with a much smaller variance.

Similar research has been done on an algorithm estimating quantiles[21], but then for the bias/instability vs. convergence speed trade-off. In that research the authors tried to optimize the convergence speed of the algorithm by introducing a bias, at the risk of causing instability.

4 METHODOLOGY

This section describes the methodology used to answer the proposed research questions. First, the simulation set-up as used throughout this research is described. After that, the performed simulations and their parameters are defined. At the end of this section the evaluation that has been performed with "real-life" data is described.

4.1 Simulation set-up

All programs created to perform the simulations are written in C++11[26]. The HyperLogLog implementation is made exactly as described in the original paper of Flajolet et al.[11], but without the small and large range corrections, since these ranges will not be simulated. The hash function used is the 32-bit x86 version of MurmurHash3[2], which is a well-known non-cryptographic hash function.

The data generation for the simulations is done using a cyclic pseudo-random number generator that produces a stream of 2^32 unique values, seeded with a time stamp to ensure different sequences are generated each run. The implemented solution is based on an implementation made by Preshing[23] and tested using TestU01[20]. The unique elements produced by the described function are then randomly replicated to produce a sequence of the desired length and are afterwards permuted, roughly following the definition of an ideal multiset as described in the original paper of Flajolet et al.[11].
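One possible construction of such a cyclic generator, loosely following the quadratic-residue technique from Preshing's article[23], is sketched below; the exact construction used in this research may differ, so the names and details here are assumptions made for illustration.

```cpp
#include <cstdint>

// Full-period generator of unique 32-bit values: x -> x^2 mod p (folded for
// the upper half) is a permutation of [0, p) when p is prime and p % 4 == 3.
// Applying the permutation twice to an incrementing index yields every
// 32-bit value exactly once before the sequence repeats.
class UniqueRandomSequence {
public:
    explicit UniqueRandomSequence(uint32_t seed) : index_(seed) {}

    uint32_t next() {
        return permute(permute(index_++));  // double permutation scrambles the ordering
    }

private:
    static constexpr uint32_t kPrime = 4294967291u;  // 2^32 - 5, prime with p % 4 == 3

    static uint32_t permute(uint32_t x) {
        if (x >= kPrime) return x;  // the few values in [p, 2^32) map to themselves
        uint64_t residue = (static_cast<uint64_t>(x) * x) % kPrime;
        return (x <= kPrime / 2) ? static_cast<uint32_t>(residue)
                                 : kPrime - static_cast<uint32_t>(residue);
    }

    uint32_t index_;
};
```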

4.2 Simulations

In this subsection the performed simulations are described. The b-parameter determining how many registers the algorithm keeps is always set to 10, amounting to 1024 registers. This is a commonly used value in practice and not excessively high or low. All versions of the algorithm were simulated with multisets of sizes 10^4, 10^5, 10^6 and 10^7. For each size, cardinalities ranging from 10 percent unique items to 100 percent unique items were simulated, in steps of 10 percent. Only cardinalities within the "small range correction" of HyperLogLog are not taken into account. Each combination of algorithm version, data size and relative cardinality was simulated 100 times, resulting in a value of 100 for n of the MRE and NRMSE evaluation measures.

Five versions of the algorithm were tested, all implementing a bias using an upper bound. These different versions are compared to the original version of HyperLogLog to investigate the differences and similarities in performance. The different implementations of the algorithm are described in the following subsections.

4.2.1 Version one. The first way a bias has been introduced is the simplest of them all. A counter tracking the total number of items added to the HyperLogLog instance will be kept. When an estimate is requested, the estimate will be capped at the total number of seen items. This leads to the following equation:

$$estimation = \min(estimation, totalItemsAdded) \qquad (6)$$

This is a bias that will, following logical reasoning, only have an impact when nearly all items are unique. Normally, the algorithm has a chance to overestimate to a number higher than the total number of added items, which is prevented by this intervention. On average this will most likely lead to a small downward bias, but less error and better precision. The impact of this intervention on the needed computational resources is low, since only one additional integer needs to be kept in memory.
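A minimal sketch of version one, layered on top of the illustrative HyperLogLogSketch class from the sketch in Section 3.2 (so not the author's actual code), could look as follows:

```cpp
#include <algorithm>
#include <cstdint>

// Version one (sketch): cap the estimate at the total number of added items.
// HyperLogLogSketch refers to the illustrative class sketched in Section 3.2.
class CappedHyperLogLog {
public:
    explicit CappedHyperLogLog(unsigned b) : hll_(b), totalAdded_(0) {}

    void add(uint32_t hash) {
        hll_.add(hash);
        ++totalAdded_;  // one extra integer is the only additional state
    }

    double estimate() const {
        // Equation (6): estimation = min(estimation, totalItemsAdded)
        return std::min(hll_.estimate(), static_cast<double>(totalAdded_));
    }

private:
    HyperLogLogSketch hll_;
    uint64_t totalAdded_;
};
```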

4.2.2 Version two. The second bias that has been introduced relies, just like the first one, on a counter keeping track of the total number of items added to the HyperLogLog instance. This version checks, at every addition that changes the estimate, whether the new estimate is larger than the total count of added items. If that is the case, the addition is reverted.

This intervention is expected to cause a small downward bias, having more effect for high relative cardinalities. Because this version implements a correction that can have an effect at any addition, the expected impact is larger than for version one.

The impact on the memory usage is the same as for version one, so there is only the need to store one additional integer. The computational load is somewhat higher, since every addition that changes the estimate requires a recalculation of that estimate. Some optimization can be applied, but nevertheless this version is computationally more expensive compared to version one and the original algorithm.
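The following sketch illustrates how version two could be implemented; the register handling mirrors the illustrative sketch from Section 3.2 and is not the author's implementation.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Version two (sketch): revert any addition that pushes the estimate above
// the total number of added items.
class GuardedHyperLogLog {
public:
    explicit GuardedHyperLogLog(unsigned b)
        : b_(b), m_(1u << b), M_(m_, 0), totalAdded_(0) {}

    void add(uint32_t hash) {
        ++totalAdded_;
        uint32_t bin = hash >> (32 - b_);
        uint32_t rest = hash << b_;
        uint8_t rank = rest == 0 ? static_cast<uint8_t>(32 - b_ + 1)
                                 : static_cast<uint8_t>(__builtin_clz(rest) + 1);
        if (rank <= M_[bin]) return;                 // estimate unchanged, nothing to check

        uint8_t oldValue = M_[bin];
        M_[bin] = rank;
        if (estimate() > static_cast<double>(totalAdded_)) {
            M_[bin] = oldValue;                      // revert the addition
        }
    }

    double estimate() const {
        double sum = 0.0;
        for (uint8_t v : M_) sum += std::ldexp(1.0, -static_cast<int>(v));
        double alpha = 0.7213 / (1.0 + 1.079 / m_);
        return alpha * m_ * m_ / sum;
    }

private:
    unsigned b_;
    uint32_t m_;
    std::vector<uint8_t> M_;
    uint64_t totalAdded_;
};
```

The full re-evaluation of the estimate on every changing addition is the computational overhead mentioned above; keeping a running value of the sum $\sum_j 2^{-M[j]}$ would be one way to optimize it.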

4.2.3 Version three. This version relies on the same principle as version two. The difference is that for this version a count of additions is kept per bin of the HyperLogLog algorithm. So for every bin it is known how many of the added items belonged to it. The number of unique items in a single bin can be estimated as the total estimate divided by the number of bins. When an addition is performed that changes the estimate, the estimation of the number of unique items added to this bin as described above is compared to the count of additions for the bin. When the estimation of unique items is higher, the addition is reverted, like in the previous version.

Since this version implements the addition count per bin, the expected effect is slightly larger compared to versions one and two. Using the addition count per bin will likely cause more skipped updates and therefore a larger effect is expected.

This version has a high impact on the memory usage of the algorithm, since it needs to keep two registers per bin, with the additional register having a higher memory requirement. This amounts to a more than doubled memory requirement. Computationally it does not differ much from the previous version, but that does mean it is more expensive compared to the original version of HyperLogLog.
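Version three can be sketched as a small variation on the version-two sketch above: keep a per-bin addition counter and compare the average per-bin estimate against it. Again, all names and structure are illustrative rather than the author's code.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

// Version three (sketch): an update to a bin is reverted when the average
// per-bin estimate (total estimate / m) exceeds the number of additions
// that bin has received.
class PerBinGuardedHyperLogLog {
public:
    explicit PerBinGuardedHyperLogLog(unsigned b)
        : b_(b), m_(1u << b), M_(m_, 0), binAdditions_(m_, 0) {}

    void add(uint32_t hash) {
        uint32_t bin = hash >> (32 - b_);
        ++binAdditions_[bin];                        // the second register per bin
        uint32_t rest = hash << b_;
        uint8_t rank = rest == 0 ? static_cast<uint8_t>(32 - b_ + 1)
                                 : static_cast<uint8_t>(__builtin_clz(rest) + 1);
        if (rank <= M_[bin]) return;

        uint8_t oldValue = M_[bin];
        M_[bin] = rank;
        double perBinEstimate = estimate() / m_;     // total estimate spread over the bins
        if (perBinEstimate > static_cast<double>(binAdditions_[bin])) {
            M_[bin] = oldValue;                      // revert, as in version two
        }
    }

    double estimate() const {
        double sum = 0.0;
        for (uint8_t v : M_) sum += std::ldexp(1.0, -static_cast<int>(v));
        double alpha = 0.7213 / (1.0 + 1.079 / m_);
        return alpha * m_ * m_ / sum;
    }

private:
    unsigned b_;
    uint32_t m_;
    std::vector<uint8_t> M_;
    std::vector<uint64_t> binAdditions_;
};
```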

4.2.4 Version four. For this version, like version three, a count of additions is kept per bin of the HyperLogLog. The difference is in the way the estimate for the bin is produced. Whereas for version three an average derived from the total estimation was used, for this version the estimation is provided using the observation from Flajolet & Martin[12] that is the basis for the HyperLogLog algorithm. This observation tells us that an estimation for the number of unique items seen is two to the power of the bin's value. So the following formula is used for the bin's estimation:

$$est_{bin} = 2^{val_{bin}} \qquad (7)$$

Because this method uses only data from the bin and is not affected by any averaging, its effect is likely larger. Therefore this method implements two ways to dampen its effect. It only subtracts one from the value of an addition, instead of skipping it altogether, when the estimation based on the value is larger than the count of additions for the bin. Also, this version has a parameter determining to which share of the bins this method should be applied. This second parameter can be used to "fine-tune" the effect of this method. Values for this parameter that were simulated are: all bins, 1/2, 1/4, 1/8, 1/16, 1/32 and 1/64.

This version is expected to have a large effect, therefore the dampening as described above is introduced. Since this version bases the estimate for a bin purely on the value of that bin, it is very sensitive to outliers. Also, the bias correction factor from the estimation function is not taken into account, just like the properties of the harmonic mean that are normally present in the total estimate.

Just like the previous version, this version has a relatively high memory requirement. However, because it does not make use of the total estimation, calculating the estimation for the bin is computationally less expensive compared to versions two and three.

4.2.5 Version five. This version uses the same estimator for the bin as version four; however, it compares it not to a count of additions kept for that specific bin, but to the total number of additions divided by the number of bins. This reduces the memory requirements drastically compared to version four. Aside from that, this version is identical to version four.

This version is expected to perform similarly to version four, since the distribution of items over the bins should be uniform. This means that the estimation of the number of additions per bin should be reasonably accurate, especially when there are many additions.
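A sketch of the per-bin check of version five is given below; version four differs only in comparing against a per-bin addition counter instead of totalAdded / m. The way the checked share of bins is selected, and all names, are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Version five (sketch): when an addition would raise a register, compare the
// per-bin estimate 2^{rank} (equation 7) against the expected number of
// additions per bin (totalAdded / m). If the per-bin estimate is larger, the
// update is dampened by storing rank - 1 instead of rank. The check is only
// applied to a configurable fraction of the bins (the dampening parameter).
class DampedHyperLogLog {
public:
    DampedHyperLogLog(unsigned b, uint32_t checkedFractionDenom)
        : b_(b), m_(1u << b), M_(m_, 0),
          totalAdded_(0), checkedBins_(m_ / checkedFractionDenom) {}

    void add(uint32_t hash) {
        ++totalAdded_;
        uint32_t bin = hash >> (32 - b_);
        uint32_t rest = hash << b_;
        uint8_t rank = rest == 0 ? static_cast<uint8_t>(32 - b_ + 1)
                                 : static_cast<uint8_t>(__builtin_clz(rest) + 1);
        if (rank <= M_[bin]) return;

        if (bin < checkedBins_) {                        // only a share of the bins is checked
            double binEstimate = std::ldexp(1.0, rank);  // 2^{rank}, the per-bin estimate
            double expectedAdditions = static_cast<double>(totalAdded_) / m_;
            if (binEstimate > expectedAdditions) {
                // Dampened update: store rank - 1 instead of rank.
                M_[bin] = std::max<uint8_t>(M_[bin], static_cast<uint8_t>(rank - 1));
                return;
            }
        }
        M_[bin] = rank;
    }

    // estimate() would be the standard estimator from equation (5), omitted here.

private:
    unsigned b_;
    uint32_t m_;
    std::vector<uint8_t> M_;
    uint64_t totalAdded_;
    uint32_t checkedBins_;
};
```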

Figure 1: The distribution of occurrences for the e-mail opens of all five tested accounts.

4.3 Evaluation with non-simulated data

To test the validity of this research, additional simulations on "real-world" data sets have been performed. All versions as described in the previous subsection were tested. The data sets used for this purpose are registered e-mail opens for several businesses. These businesses send e-mails to their customers and track the moments these e-mails are opened by the recipients to, among other things, determine the number of people interested in their mailings. What is, however, often more interesting than the total number of opens is the number of unique e-mails that were opened. Each e-mail receives a unique identifier when sent, which is registered when the e-mail is opened. Using the HyperLogLog algorithm it can be estimated how many of the e-mails that were sent are opened once or more.

Five accounts of businesses from different branches were selected to participate in this test. All the registered e-mail opens from 2018 have been gathered for these accounts. As can be seen in figures 1 & 2, the distribution of the e-mail opens is roughly similar between the accounts, with most of the e-mails only opened a couple of times. All data sets have between 47% and 70% unique items. This intuitively makes sense, since not many people open an e-mail hundreds of times. The outliers in these data sets that do so could be caused by tests or bots, or very interested customers. Each year of data is split into months, so there are twelve data sets per account to perform the tests on. Each of the versions of the HyperLogLog algorithm, including the original version, was compared to the true number of unique identifiers for all data sets.

Figure 2: The cumulative probabilities of occurrences for the e-mail opens of all five tested accounts.

5 RESULTS

The results of the performed simulations will be given per tested version. Each version will also be compared to the original version of HyperLogLog. For the simulations using data sizes 10^4 and 10^7 a graph indicating the results is shown, since these are the smallest and largest data sizes that were simulated. The final subsection is dedicated to the results of the evaluation performed on real-life data. In appendix A, figures for the other data sizes can be found. What stands out in many of the plots is that the lines follow a somewhat erratic pattern, also for the original version. It could be argued that this is caused by a relatively low number of simulations, but this also seems to be the case when more simulations are done. This was tested for the original version at data sizes 10^4 and 10^5 with 1000 simulations. The MRE is slightly lower in these cases but the somewhat erratic behavior is still there.

5.1 Version one

As can be seen in figures 3 & 4, this version has an advantage in precision at the right side of the graph, when the cardinality of the multiset is almost equal to the total size, meaning most items are unique. At lower relative cardinalities this version has no effect compared to the original HyperLogLog version. When looking at 100% relative cardinality, the standard deviation of the relative error drops from between 0.031 and 0.034 to between 0.016 and 0.020 (depending on the data size). The effect on the NRMSE is similar, lowering it from between 0.031 and 0.034 to between 0.020 and 0.023. This improvement comes at the cost of a small, downward bias. Where the MRE for the original version lies between -0.001 and 0.004 for the different data sizes, it is between -0.013 and -0.011 for this version. For the relative cardinalities of 90% and below, this version does not have any noticeable effects compared to the original version.

Figure 3: The MRE, standard deviation of the RE and NRMSE of algorithm versions one, two, three and original version simulations with a data size of 10^4.

5.2 Version two

This version appears to have an effect for cardinalities slightly lower than the first version, but the difference is small, as can be seen in figures 3 & 4. It does, however, behave more unpredictably and often has a higher error compared to version one and the original for the smaller data sizes.

5.3 Version three

This version behaves more erratically and the error is often very high, especially for smaller data sizes. This version is likely too sensitive to be used in practice, given the behavior that can be seen in figures 3 & 4.

Figure 4: The MRE, standard deviation of the RE and NRMSE of algorithm versions one, two, three and original version simulations with a data size of 10^7.

5.4 Versions four and five

Version four produces similar results compared to version five, without differences worth mentioning. Considering the fact that version five is, when viewed from a computational standpoint, much more efficient, only the results from version five will be covered in this section. The results for data sizes 10^4 and 10^7 using different parameters as described in the methodology can be found in figures 5 & 6. What becomes clear from these figures is that for this version there is a trade-off between accuracy and precision. Less dampened versions show a reduced standard deviation of the relative errors, but at the cost of a large increase in error. The behavior of this version is very similar when the smaller data size simulations are compared to the larger ones. However, the gains in precision are relatively low when compared to their cost in accuracy.

5.5 Evaluation with non-simulated data

Five data sets consisting of non-simulated data were evaluated. For each of these, the average results of the five versions with the lowest standard deviation of the relative error and the original version can be found in Table 1. The effect of the introduced biases seems minimal: for all accounts there is a version performing equally well or slightly better than the original HyperLogLog, but the differences are minimal. The best performing version differs between data sets, and a version that performs best for one data set sometimes performs worse than the original version for another.


Figure 5: The MRE, standard deviation of the RE and NRMSE of algorithm version 5 with different parameters and original version simulations with a data size of 10^4.

6 CONCLUSION AND DISCUSSION

In conclusion: five different ways of implementing biases have been tested. Versions four and five give very similar results, where version five has significant advantages when it comes to computational cost. This version has a parameter that can be used to dampen the effect of the introduced bias by only introducing it for a select fraction of the bins in the HyperLogLog instance. Using this dampening option is necessary if the loss in accuracy matters. This version presents a clear precision/accuracy trade-off, but the cost of gained precision is relatively high, considering the additional error that is introduced. This makes this version unlikely to be used in practice.

Version three unfortunately produced erratic results, often with a high error, therefore that version is deemed unusable. Versions one and two are however usable, albeit for particular data sets. Both versions one and two only require the additional storage of one integer (large enough to count all items added to the HyperLogLog), so the impact on memory efficiency is minimal. The computational impact is slightly larger for version two, since a lot of estimates need to be calculated when adding items to the HyperLogLog instance; this is however something that could be optimized. For version one it remains unchanged. Both versions only work when a very high fraction of the items that are added are unique. Version two has a slightly broader range in which it is effective, but at the cost of a higher error for smaller data sets and more unpredictable behavior. Version one, although the simplest of them all, seems the most usable in practice. It never performs worse than the original version when looking at the total error, and for high relative cardinalities it significantly improves precision and lowers the total error. This makes it a safe bet to use when the expected data might have a high relative cardinality.

Figure 6: The MRE, standard deviation of the RE and NRMSE of algorithm version 5 with different parameters and original version simulations with a data size of 10^7.

Another disadvantage of all implemented versions, except version one, is that the results they provide can differ for the same data set when the order in which the items are added is different. This can be an impractical drawback for some purposes. Another feature that makes HyperLogLog loved is the fact that it parallelizes easily, since multiple instances can easily be merged. This feature remains intact for all versions implemented in this paper.

Using the upper bound to implement a bias for the HyperLogLog algorithm in order to increase precision only proved useful for high relative cardinalities. Other options that also have an effect on lower cardinalities showed erratic behavior.

Table 1: Results of the five versions with lowest standard deviation of the relative error and original version for the non-simulated data sets.

Account 1
Version      NRMSE   Std. of RE  MRE
Original     0.0523  0.0346      0.002
Two          0.0516  0.0317      0.0098
One          0.0523  0.0346      0.002
Three        0.0598  0.0348      0.0347
Five (1/64)  0.0548  0.0349      0.0082
Five (1/16)  0.0653  0.0353      0.0266

Account 2
Version      NRMSE   Std. of RE  MRE
Original     0.038   0.0362      -0.0119
Five (1/32)  0.0341  0.0349      0.0022
Five (1/64)  0.0356  0.0356      -0.0055
Five (1/16)  0.0376  0.0357      0.0167
Five (1/8)   0.0556  0.0359      0.0459
One          0.038   0.0362      -0.0119

Account 3
Version      NRMSE   Std. of RE  MRE
Original     0.0377  0.0364      0.0074
Five (1/32)  0.0425  0.0356      0.0225
Three        0.037   0.0357      0.008
Five (1/16)  0.0521  0.0357      0.0372
One          0.0377  0.0364      0.0074
Two          0.0377  0.0364      0.0074

Account 4
Version      NRMSE   Std. of RE  MRE
Original     0.0225  0.024       -0.0025
Three        0.0223  0.024       -0.0014
One          0.0225  0.024       -0.0025
Two          0.0228  0.0244      -0.0022
Five (1/64)  0.0239  0.0246      0.0061
Five (1/32)  0.0277  0.0252      0.0138

Account 5
Version      NRMSE   Std. of RE  MRE
Original     0.0411  0.0372      -0.0079
Five (1/64)  0.039   0.0368      0.0004
Five (1/32)  0.0394  0.0371      0.0072
One          0.0411  0.0372      -0.0079
Three        0.0412  0.0375      -0.0074
Two          0.0413  0.0375      -0.0076

Future research might focus on how this erratic behavior can be reduced. One could also investigate the impact of the b parameter on the performance of biased versions of the algorithm, since for this research it has been considered a constant. Another suggestion is research into other methods of introducing a bias, besides using an upper bound provided by the number of additions, although one should be warned about the sensitive nature of the algorithm as shown by this research. What should also be taken into account is the effect of the chosen hash function, since this will significantly impact the performance of the algorithm.

REFERENCES

[1] Noga Alon, Yossi Matias, and Mario Szegedy. 1999. The space complexity of approximating the frequency moments. Journal of Computer and system sciences 58, 1 (1999), 137–147.

[2] Austin Appleby. 2012. smhasher. https://code.google.com/p/smhasher

[3] Yousra Chabchoub and Georges Hébrail. 2010. Sliding hyperloglog: Estimating cardinality in a data stream over a sliding window. In 2010 IEEE International Conference on Data Mining Workshops. IEEE, 1297–1303.

[4] Moses Charikar, Surajit Chaudhuri, Rajeev Motwani, and Vivek Narasayya. 2000. Towards estimation error guarantees for distinct values. In Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 268–279.

[5] Reuven Cohen, Liran Katzir, and Aviv Yehezkel. 2017. Cardinality estimation meets good-turing. Big data research 9 (2017), 1–8.

[6] Mayur Datar, Aristides Gionis, Piotr Indyk, and Rajeev Motwani. 2002. Maintaining stream statistics over sliding windows. SIAM journal on computing 31, 6 (2002), 1794–1813.

[7] Marianne Durand and Philippe Flajolet. 2003. Loglog counting of large cardinalities. In European Symposium on Algorithms. Springer, 605–617.

[8] Otmar Ertl. 2017. New cardinality estimation algorithms for HyperLogLog sketches. arXiv preprint arXiv:1702.01284 (2017).

[9] Cristian Estan, George Varghese, and Mike Fisk. 2003. Bitmap algorithms for counting active flows on high speed links. In Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement. ACM, 153–166.

[10] Philippe Flajolet. 1990. On adaptive sampling. Computing 43, 4 (1990), 391–400.

[11] Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. 2007. Hyperloglog: the analysis of a near-optimal cardinality estimation algorithm. In Discrete Mathematics and Theoretical Computer Science. Discrete Mathematics and Theoretical Computer Science, 137–156.

[12] Philippe Flajolet and G Nigel Martin. 1985. Probabilistic counting algorithms for data base applications. Journal of computer and system sciences 31, 2 (1985), 182–209.

[13] Éric Fusy and Frédéric Giroire. 2007. Estimating the number of active flows in a data stream over a sliding window. In Proceedings of the Meeting on Analytic Algorithmics and Combinatorics. Society for Industrial and Applied Mathematics, 223–231.

[14] Sumit Ganguly, Minos Garofalakis, Rajeev Rastogi, and Krishan Sabnani. 2007. Streaming algorithms for robust, real-time detection of ddos attacks. In 27th International Conference on Distributed Computing Systems (ICDCS’07). IEEE, 4–4.

[15] William H. Greene. 1993. Econometric analysis. Macmillan.

[16] Peter J Haas and Lynne Stokes. 1998. Estimating the number of classes in a finite population. J. Amer. Statist. Assoc. 93, 444 (1998), 1475–1487.

[17] Stefan Heule, Marc Nunkesser, and Alexander Hall. 2013. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology. ACM, 683–692.

[18] JM Kalish, RJ Beamish, EB Brothers, JM Casselman, RICC Francis, Henrik Mosegaard, Jacques Panfili, ED Prince, RE Thresher, CA Wilson, et al. 1995. Glossary for otolith studies. (1995).

[19] Zoi Kaoudi, Kostis Kyzirakos, and Manolis Koubarakis. 2010. SPARQL query optimization on top of DHTs. In International Semantic Web Conference. Springer, 418–435.

[20] Pierre L’Ecuyer and Richard Simard. 2007. TestU01: A C Library for Empirical Testing of Random Number Generators. ACM Trans. Math. Softw. 33, 4, Article 22 (Aug. 2007), 40 pages. https://doi.org/10.1145/1268776.1268777

[21] Qiang Ma, Shan Muthukrishnan, and Mark Sandler. 2013. Frugal streaming for estimating quantiles. In Space-Efficient Data Structures, Streams, and Algorithms. Springer, 77–96.

[22] Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. 2008. Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic. In Proceedings of the 11th international conference on Extending database technology: Advances in database technology. ACM, 618–629.

[23] Jeff Preshing. 2012. How to Generate a Sequence of Unique Random Integers. https://preshing.com/20121224/how-to-generate-a-sequence-of-unique-random-integers/

[24] David Reinsel, John Gantz, and John Rydning. 2017. Data age 2025: The evolution of data to life-critical. IDC White Paper (2017), 1–25.

[25] Charles Stein. 1956. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Technical Report. Stanford University, Stanford, United States.

[26] Bjarne Stroustrup. 2013. The C++ Programming Language (4th ed.). Addison-Wesley Professional.

[27] Lun Wang, Tong Yang, Hao Wang, Jie Jiang, Zekun Cai, Bin Cui, and Xiaoming Li. 2018. Fine-grained probability counting for cardinality estimation of data streams. World Wide Web (2018), 1–17.


A SIMULATION RESULTS

Figure 7: The MRE, standard deviation of the RE and NRMSE of algorithm versions one, two, three and original version simulations with a data size of 10^5.

Figure 8: The MRE, standard deviation of the RE and NRMSE of algorithm versions one, two, three and original version simulations with a data size of 10^6.

Figure 9: The MRE, standard deviation of the RE and NRMSE of algorithm version 5 with different parameters and original version simulations with a data size of 10^5.

Figure 10: The MRE, standard deviation of the RE and NRMSE of algorithm version 5 with different parameters and original version simulations with a data size of 10^6.
