Higher-order Fourier analysis and locally decodable codes


Higher-order Fourier analysis and locally

decodable codes

Mark Bebawy

June 28, 2020

Bachelor thesis Mathematics and Computer Science
Supervisors: dr. Jop Briët, dr. Guus Regts

[Cover image: schematic overview of the topics in this thesis: cyclotomic polynomials; matching vector codes over F_p^n; the Hadamard and Reed-Muller codes; private information retrieval; entropy of dual functions; the conjecture on dual functions and polynomial phases; and the minimal sample probability in Szemerédi's theorem with random differences.]

Korteweg-de Vries Institute for Mathematics
Faculty of Sciences


Abstract

This thesis studies locally decodable codes (LDCs), a special type of error-correcting code, and their applications to cryptography and to additive combinatorics. We show how LDCs can be translated into private information retrieval schemes and we study the known bounds on the codeword length of an LDC as a function of the message length. Matching vector codes are currently the best-known LDCs, and these constructions were used in a recent paper by Jop Briët and Farrokh Labib to refute a conjecture from additive combinatorics. Roughly, this conjecture says that 'dual functions' from F_p^n to C can be approximated well in terms of 'polynomial phase functions' in the L∞-norm. As such, we study an application of computer science to additive combinatorics, whereas typically, applications go in the opposite direction. The conjecture has strong connections to a stochastic refinement of a celebrated theorem of Szemerédi from additive combinatorics (in a finite field setting). If the conjecture were true, a still-open question regarding this refinement would have been settled. However, the conjecture is false. Its refutation relies on constructions of matching vector codes and on the factorisation of cyclotomic polynomials over finite fields. We give an overview of all components needed in the proof of Briët and Labib and we present this proof. We also study the limitations of their proof method. Furthermore, we build upon their counterexamples by searching for more general and stronger ones. We created a computer program that searches for counterexamples, and we found them for the first 1000 primes p, except for p ∈ {2, 3, 5}. We prove that these three cases indeed give no counterexamples. However, as the strength of the counterexamples tends to increase as p gets larger, we conjecture that for all primes p ≥ 7 there are counterexamples. In particular, we found that stronger counterexamples emerge from factoring cyclotomic polynomials over fields F_q where q is small and q has order (p − 1)/2 modulo p.

Cover image: schematic overview of the topics discussed in this thesis.

Title: Higher-order Fourier analysis and locally decodable codes
Author: Mark Bebawy, markbebawy0@gmail.com, 11667230
Supervisors: dr. Jop Briët, dr. Guus Regts
Second examiners: prof. dr. Ronald de Wolf, prof. dr. Rob Stevenson
End date: June 28, 2020

Informatics Institute University of Amsterdam

Science Park 904, 1098 XH Amsterdam

http://www.ivi.uva.nl

Korteweg-de Vries Institute for Mathematics University of Amsterdam

Science Park 904, 1098 XH Amsterdam


Contents

1. Introduction
2. Foundations
   2.1. Locally decodable codes
        2.1.1. General definitions and theorems
        2.1.2. The Hadamard code
        2.1.3. Private information retrieval and ethics
   2.2. Higher-order Fourier analysis
        2.2.1. Motivation: Szemerédi's theorem and random differences
        2.2.2. Entropy numbers, dual functions and polynomial phases
3. Matching vector codes
   3.1. General framework and encoding
   3.2. Decoding
   3.3. Composites
   3.4. Primes
        3.4.1. Mersenne primes
4. Existence of sparse decoding polynomials
   4.1. Cyclotomic polynomials
   4.2. Existence of decoding polynomials for infinitely many primes
5. Dual functions and polynomial phases conjecture refutation
   5.1. Proof
   5.2. Possible patterns in the counterexamples
   5.3. Numerical results of our search for counterexamples
6. Conclusion and further research
Bibliography
A. Reed-Muller codes
B. SageMath code
   B.1. Searching for counterexamples (cyclotomic polynomial factorisation)
   B.2. Strength analysis


1. Introduction

Additive combinatorics has many applications in theoretical computer science [6, 20, 30], but in this thesis, we study an application of computer science to additive combinatorics. Near the end of 2019, Jop Briët and Farrokh Labib used certain constructions of locally decodable codes (LDCs) to refute a conjecture that is closely related to a stochastic refinement of Szemerédi's theorem [7], which is central in additive combinatorics. The conjecture states that for all primes p, 'dual functions' can be approximated arbitrarily well by the well-understood low-degree 'polynomial phase functions' in the L∞-norm. For a vector space F_p^n over the finite field F_p = Z/pZ, dual functions are functions from F_p^n to the closed complex unit disk D = {z ∈ C | |z| ≤ 1} of the form

y ↦ (1/|F_p^n|) Σ_{w ∈ F_p^n} f_1(w) f_2(w + y) f_3(w + 2y),   f_1, f_2, f_3 : F_p^n → D.   (1.1)

Polynomial phases of degree d are of the form ψ(y) = e^{2πi P(y)/p}, where P ∈ F_p[z_1, ..., z_n] is an n-variate polynomial of degree d. We will elaborate on these objects in Chapter 2. The refutation of the conjecture also combines various areas of mathematics (higher-order Fourier analysis, convex geometry, and group theory). Since (at the time of writing) this result is so recent, and since higher-order Fourier analysis and additive combinatorics are young areas of mathematics themselves, there exists no literature that provides a concise overview of how all these different components relate to each other.
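To make the object in (1.1) concrete, the following minimal Python sketch (the thesis's accompanying code is SageMath, but plain Python suffices here) evaluates a dual function for n = 1 and p = 5, with f_1 = f_2 = f_3 taken to be a fixed additive character, a choice made purely for illustration:

```python
import cmath

p = 5  # work over F_p with n = 1, so the group is Z/pZ

def f(w):
    # A fixed additive character F_p -> D (an illustrative choice, not one
    # made in the thesis): f(w) = e^(2*pi*i*w/p).
    return cmath.exp(2j * cmath.pi * (w % p) / p)

def dual(y):
    # y -> (1/|F_p|) * sum over w of f1(w) f2(w + y) f3(w + 2y), as in (1.1),
    # here with f1 = f2 = f3 = f.
    return sum(f(w) * f(w + y) * f(w + 2 * y) for w in range(p)) / p

# Dual functions always map into the closed unit disk D; for this particular
# choice of f the averaging over w makes every value vanish.
values = [dual(y) for y in range(p)]
assert all(abs(v) <= 1 + 1e-12 for v in values)
```

Replacing f by bounded functions with no common structure generally yields nonzero dual functions; the point of the sketch is only the shape of the averaged triple product.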

The main goal of this thesis is to create this overview and to work towards the refutation of the conjecture. Furthermore, Briët and Labib found counterexamples for a specific class of prime numbers (primes p for which the order of 2 in (Z/pZ)* is strictly less than p − 1). Although this gives counterexamples for infinitely many prime numbers (see Section 4.2), it remains an open question whether there are counterexamples for all primes p. Furthermore, their counterexamples would not refute the conjecture if the polynomial phases in the conjecture were of one degree higher. This may seem like a fairly weak refutation. For these two reasons, we created a computer program that searches for counterexamples. The goal is to find stronger counterexamples and to analyse whether counterexamples exist for all primes. We searched the first 1000 primes and found counterexamples for all of those primes that are strictly larger than 5. We give a proof that for the primes p ∈ {2, 3, 5} this method gives no counterexamples, and we argue that the underlying problems for these small primes do not occur for primes p ≥ 7, which justifies the idea that there might be counterexamples for all primes p ≥ 7. Also, the strength of the counterexamples generally seemed to increase for larger primes. Based on this, we conjecture that for any prime p ≥ 7, there is a counterexample. We also found that stronger counterexamples emerge from factorising p-th cyclotomic polynomials over F_q for small primes q. Our results show that counterexamples are particularly strong when q has order (p − 1)/2 in F_p^*.

We also study locally decodable codes in their own right, as they have many applications in theoretical computer science [35], one of which is 'private information retrieval', a family of cryptographic schemes meant to safeguard the privacy of database users.

Mathematical context

In 1975, Endre Szemerédi proved a theorem about the occurrence of arithmetic progressions in subsets of the natural numbers N¹ [27]. A k-term arithmetic progression with common difference y is a sequence of the form

x, x + y, x + 2y, ..., x + (k − 1)y,   where x, y ∈ N.   (1.2)

Szemerédi proved that for all k ∈ N, any 'large enough' subset A ⊂ N contains a k-term arithmetic progression. This theorem — now known as Szemerédi's theorem — plays a central role in additive combinatorics [28].

The k = 3 case of this theorem was proven by Roth in 1953, using Fourier analysis on finite abelian groups [24]. Gowers generalised Roth's methods to give a new proof of Szemerédi's theorem in 2001 [13]. With this, he introduced a new field of mathematics, called 'higher-order Fourier analysis'. In linear Fourier analysis, one studies how functions from a finite abelian group G to C correlate with the characters of G.² Instead of characters, higher-order Fourier analysis considers higher-order analogues, called polynomial phase functions. It also became natural to consider Szemerédi's theorem in a finite field setting, that is, to consider arithmetic progressions in subsets A ⊂ F_p^n. The reason for this is that the finite field setting makes the proofs much easier, while the methods generally guide proof techniques for the natural numbers setting [14, 31]. In this thesis, we will solely work with finite fields.

An interesting stochastic refinement of Szemerédi's theorem will play a central role here. The idea is that there were no restrictions on the common difference y in (1.2). In the stochastic version, we create a random subset D ⊂ F_p^n, where for every y ∈ F_p^n we let y ∈ D with a fixed probability ρ, independently of all other elements. The central question then is the following.

What is the smallest ρ_k ∈ (0, 1] such that with 'high' probability any 'large enough' subset A ⊂ F_p^n contains a k-term arithmetic progression with common difference in the random subset D ⊂ F_p^n?

It turns out that upper bounds on ρ_k follow from understanding the set of dual functions mentioned above. For each 'pattern' (e.g., 3-term arithmetic progressions), we get a slightly different set of dual functions. Altman gave a lower bound on ρ_3 that would be optimal if the dual functions in (1.1) could be approximated arbitrarily well by quadratic phase functions in the L∞-norm [1]. It is known that this approximation can be done when the L∞-norm is replaced by the L1-norm, which motivated the belief that the approximation would also hold for the L∞-norm. Although Briët and Labib proved that there are patterns for which the conjecture is false, they also showed that arithmetic progressions are not captured by their counterexamples. Hence, it remains unknown whether the conjecture is true for dual functions emerging from arithmetic progressions (e.g., the ones in (1.1)). This should be motivation for further research.

¹ Throughout this thesis, the natural numbers N are taken as the set of positive integers {1, 2, 3, ...}.
² The characters of a finite abelian group G are the additive homomorphisms from G to the multiplicative group C* = C \ {0}.

To refute the conjecture for certain patterns, Briët and Labib used specific constructions of locally decodable codes, the 'matching vector codes'. The counterexamples are also based on the factorisation of cyclotomic polynomials over finite fields and on the existence of 'sparse decoding polynomials', which we will discuss in Chapters 3 and 4.

Computer science context

Locally decodable codes (LDCs) are constructions of error-correcting codes, allowing a sender to encode a message in such a way that a receiver can recover an arbitrary symbol of the message with good probability, even when the codeword gets partially corrupted. With an LDC, the decoder can do this by looking at only a small fraction of randomly chosen symbols of the (corrupted) codeword. This gives an efficient way of recovering a small part of the message without decoding the entire codeword, which is particularly useful when only a small part of some data is needed (e.g., when a single page needs to be recovered from an encoded library). The idea is presented schematically in Figure 1.1.


Figure 1.1: Idea of a locally decodable code: a message of length k is encoded into a codeword of length N , which becomes partially corrupted (red). With high probability, the decoder recovers an arbitrary symbol of the message (green) by reading only r symbols of the corrupted codeword (blue).

It is desirable to construct LDCs with a small codeword length N and with a low query complexity r, which is the number of symbols that need to be read from the corrupted codeword. However, there is a natural trade-off between these parameters, as generally a larger query complexity corresponds to shorter codewords. For a fixed query complexity, we study the relation between the codeword length N and the message length k in Section 2.1.

Certain types of matching vector codes (for prime numbers) are the constructions used to refute the dual functions conjecture (Section 2.2, Conjecture 1). Besides their applications to higher-order Fourier analysis, LDCs are also useful in complexity theory and cryptography [35]. We discuss one cryptographic application, private information retrieval, in Subsection 2.1.3.

Thesis roadmap

In Chapter 2, we start by providing the broad overview of all the different components from LDCs and higher-order Fourier analysis. We present the simplest LDC construction (namely the Hadamard code; the more sophisticated Reed-Muller codes are discussed in Appendix A), we give preliminary definitions, and we present the conjecture about the dual functions and its motivation. The rest of the thesis fills in the missing details. In Chapter 3, we present the matching vector codes, after which we discuss the existence of 'decoding polynomials' in Chapter 4, which are needed for the conjecture refutation. Finally, in Chapter 5 we present the refutation of Briët and Labib, where we also present our own contributions to the counterexamples. We discuss the code that we have written for these contributions in Appendix B.

Notation and conventions

In this thesis, we will use the following notation.

• We write [N] for the set {1, 2, ..., N}.
• For a finite set G and any function f : G → C, we write E_{x∈G} f(x) := (1/|G|) Σ_{x∈G} f(x).
• We write F_q for the finite field of q elements.³ We write F_q^* for the multiplicative group F_q \ {0}. We write Z_m for the additive group Z/mZ.
• For an n-dimensional vector w, we denote its entries by w(1), ..., w(n).
• We write D = {z ∈ C | |z| ≤ 1} for the closed complex unit disk.
• When we write G^D (the set of functions from D to G), we interpret an element x ∈ G^D as a |D|-long vector with entries in G, indexed by the elements of D. Thus, for any d ∈ D we write x(d) for the entry of x corresponding to the element d.
• We write g(t) = O(f(t)) (respectively g(t) = Ω(f(t))) if there exist C, t_0 > 0 such that for all t ≥ t_0 it holds that g(t) ≤ C·f(t) (respectively g(t) ≥ C·f(t)).
• We write k^{g(t)} = k^{O(f(t))} if g(t) = O(f(t)) (similarly for Ω).

³ We assume implicitly that q is a prime power, as the number of elements in any finite field is a prime power. Also, for each prime power q, there is a finite field with q elements (which is unique up to isomorphism) [19, Theorem 2.5].


2. Foundations

In this chapter, we lay down the foundations of this thesis by giving a general overview of the different components — mainly locally decodable codes and the dual functions conjecture — without going into all the details. The details will be filled in by the later chapters.

2.1. Locally decodable codes

We start with the analysis of an important concept from theoretical computer science, namely the theory of locally decodable codes (LDCs). Most of this section is based on Yekhanin’s survey about LDCs from 2012 [35].

The main difference between locally decodable codes and classical error-correcting codes is that in the latter the entire codeword is read to recover the original message, whereas with LDCs only a small fraction of the codeword is read to recover a single symbol of the message with high probability.

LDCs are better when both the codeword length and the query complexity (the number of symbols that need to be read in the decoding algorithm) are small. For our purposes, we will consider LDCs where the query complexity is a fixed constant, independent of the message length. Currently, the best constructions of these LDCs have very large codeword lengths. An active research area studies the relation between the codeword length N and the message length k for fixed query complexity over a constant-sized alphabet. Ideally, N — considered as a function of k — is a slowly growing function. It was proven in the year 2000 that 1-query LDCs do not exist [17]. For 2-query LDCs over constant-sized alphabets, it has been proven that N must be at least exponential in k [18]. The existence of 2-query LDCs with exponential codeword length shows that this bound is tight. One such code is the basic Hadamard code (discussed in Subsection 2.1.2), which is a special case of the more complicated Reed-Muller codes. Reed-Muller codes are historically relevant, as they were the best-known constructions until the rise of the matching vector codes in 2006 [33]. However, they are not very relevant for higher-order Fourier analysis, which is why we only discuss them in Appendix A. For higher query complexities, the optimal trade-off is a big open question. For 3-query LDCs, the best-known lower bound says that N is quadratic in k [3], whereas the currently best-known 3-query LDCs — the matching vector codes — achieve a subexponential codeword length [10]. Researchers are actively trying to close this gap by searching for shorter 3-query LDCs and for better lower bounds.


2.1.1. General definitions and theorems

We introduce some preliminary definitions.

Definition 2.1.1. Let x, y ∈ F_q^k. We define the Hamming distance between x and y, denoted Δ(x, y), as the fraction of coordinates where x and y differ. Thus,

Δ(x, y) = (1/k) Σ_{i=1}^{k} 1_{x(i) ≠ y(i)}.

Definition 2.1.2. For any domain D of a function f, we define a D-evaluation of f as a vector y = (f(x) | x ∈ D) consisting of the values of f evaluated at all elements of D. We will only consider D-evaluations for finite sets D. For a function f : D → R, we write y ∈ R^D for its D-evaluation, and for any d ∈ D we write y(d) = f(d).
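Definition 2.1.1 translates directly into code; the following small Python helper (the function name is ours) computes the fractional Hamming distance between two equal-length words:

```python
def hamming_distance(x, y):
    """Fractional Hamming distance from Definition 2.1.1: the fraction
    of coordinates where x and y differ."""
    assert len(x) == len(y)
    return sum(a != b for a, b in zip(x, y)) / len(x)

# The words (0,1,1,0) and (0,1,0,0) differ in one of four coordinates.
assert hamming_distance([0, 1, 1, 0], [0, 1, 0, 0]) == 0.25
```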

Next, we define a locally decodable code formally.

Definition 2.1.3. We say that a q-ary code C : F_q^k → F_q^N is (r, δ, ε)-locally decodable if there is a randomised decoding algorithm A that takes a vector in F_q^N and an i ∈ [k] as input and outputs an element of F_q such that the following conditions hold.

a) For all messages x ∈ F_q^k, for all i ∈ [k], and for all z ∈ F_q^N with Δ(C(x), z) ≤ δ, it holds that

P_A(A(z, i) = x(i)) ≥ 1 − ε.   (2.1)

b) The algorithm A reads at most r coordinates of its input vector z ∈ F_q^N.

See Figure 1.1 again for a visual representation of this definition. Note that C is a deterministic function and that the decoding algorithm A is randomised, because it is not known in advance which symbols of C(x) are corrupted to produce z ∈ F_q^N. Furthermore, codeword corruptions are adversarial, which means that the corrupted locations are deterministic. The probability in (2.1) is taken over the randomness of algorithm A and not over the corrupted codeword locations.

For our purposes, we will consider the alphabet size q, the query complexity r, the corrupted fraction δ, and the failure probability ε as fixed constants. We vary the message length k and study its effect on the codeword length N . For the constructions that we discuss in this thesis, we will also see that ε depends linearly on δ, thus the success probability of A decreases linearly as the fraction of corrupted entries of C(x) increases.

For each LDC in this thesis, we follow the same structure when proving that it is (r, δ, ε)-locally decodable. First, we give encoding and decoding algorithms. Then, we show the correctness of the decoding algorithm. We do this by proving that for every x ∈ F_q^k and every i ∈ [k] the decoding algorithm applied to C(x) returns x(i). Lastly, we show that when the decoding algorithm receives a corrupted codeword (with Hamming distance at most δ to the codeword), the probability of querying a corrupted entry is at most ε. This implies that the probability that all r queries go to clean locations is at least 1 − ε. Together with the correctness, this proves that the construction is (r, δ, ε)-locally decodable.

For the last step, we typically use an important result from probability theory, namely the union bound (also known as Boole’s inequality).

Lemma 2.1.4 (Union bound). For any events A_1, A_2, ..., A_n, we have

P(⋃_{k=1}^{n} A_k) ≤ Σ_{k=1}^{n} P(A_k).

Proof. Apply the addition rule P(A ∪ B) = P(A) + P(B) − P(A ∩ B) ≤ P(A) + P(B) inductively.

Furthermore, when studying the relation between the codeword length and message length for the Reed-Muller codes and matching vector codes, we will need upper bounds and lower bounds on the binomial coefficients.

Lemma 2.1.5. For integers 1 ≤ k ≤ n, it holds that

\binom{n}{k} ≥ (n/k)^k.

Proof. We have

\binom{n}{k} = n! / (k!(n − k)!) = n(n − 1) ⋯ (n − (k − 1)) / k! = (n/k) · ((n − 1)/(k − 1)) ⋯ ((n − (k − 1))/1).

Now, for each of these fractions (for each j ∈ {0, 1, ..., k − 1}), we have

(n − j)/(k − j) − n/k = (nk − jk − nk + nj) / (k(k − j)) = j(n − k) / (k(k − j)) ≥ 0,

because n ≥ k > j. Thus, each fraction is greater than or equal to n/k and hence \binom{n}{k} ≥ (n/k)^k.

Lemma 2.1.6. For integers 1 ≤ k ≤ n, it holds that

\binom{n}{k} < (n · e / k)^k.

Proof. We have

\binom{n}{k} = (n/k) · ((n − 1)/(k − 1)) ⋯ ((n − (k − 1))/1) = (1/k!) · n(n − 1) ⋯ (n − (k − 1)) ≤ n^k / k!.

We know e^k = Σ_{j=0}^{∞} k^j / j!, and thus

(n^k / k^k) · e^k = (n^k / k^k) Σ_{j=0}^{∞} k^j / j! > (n^k / k^k) · (k^k / k!) = n^k / k! ≥ \binom{n}{k}.
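Both bounds are easy to sanity-check numerically; a quick Python sweep over small parameters confirms the chain (n/k)^k ≤ \binom{n}{k} < (ne/k)^k:

```python
from math import comb, e

# Verify Lemmas 2.1.5 and 2.1.6 for all 1 <= k <= n < 40.
for n in range(1, 40):
    for k in range(1, n + 1):
        assert (n / k) ** k <= comb(n, k) < (n * e / k) ** k
```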

Finally, we will also use the following result from combinatorics [25, pg. 26].

Lemma 2.1.7. Let n and k be positive integers with n ≥ k. The number of ways to choose an unordered subset of size k, with repetitions allowed, from a set of size n equals

\binom{n + k − 1}{k}.
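This count can be checked against a direct enumeration, e.g. with Python's itertools:

```python
from math import comb
from itertools import combinations_with_replacement

# Multisets of size k = 3 from a ground set of size n = 5.
n, k = 5, 3
multisets = list(combinations_with_replacement(range(n), k))
assert len(multisets) == comb(n + k - 1, k)  # both equal binom(7, 3) = 35
```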

2.1.2. The Hadamard code

We discuss our first LDC, the Hadamard code, which is one of the simplest locally decodable codes. For this 2-query code, the codeword length is exponential in the message length: if the message has length k, then the codeword has length 2^k. As stated earlier, this is the shortest possible codeword length for 2-query LDCs over a constant-sized alphabet [18].

Encoding

We consider the classical Hadamard code over a binary alphabet. Define the inner product ⟨x, y⟩ := x^T y mod 2 on F_2^k.¹ The encoding is as follows.

Algorithm 1: Hadamard code (encoding)
Input: A message x ∈ F_2^k
Output: A codeword C(x) ∈ F_2^{F_2^k}
1 Let C(x) := (⟨x, y⟩ | y ∈ F_2^k);
2 return C(x);

Thus, C(x) is a 2^k-long vector where each entry corresponds to a vector y ∈ F_2^k and contains the inner product ⟨x, y⟩.

¹ Formally speaking, this nondegenerate bilinear form is not an inner product, as ⟨x, x⟩ = 0 does not necessarily imply x = 0; for example, take x = (0, 1, 1) ∈ F_2^3. However, in the context of LDCs (for instance in [4, 10]) this is called an inner product and we will follow this convention.
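Algorithm 1 can be sketched in Python as follows; representing the codeword as a dictionary keyed by the vectors y is an implementation choice of ours, matching the convention that the 2^k entries are indexed by the elements of F_2^k:

```python
from itertools import product

def hadamard_encode(x):
    """Hadamard encoding: the 2^k inner products <x, y> mod 2,
    indexed by y in F_2^k."""
    return {y: sum(a * b for a, b in zip(x, y)) % 2
            for y in product((0, 1), repeat=len(x))}

cw = hadamard_encode((1, 0, 1))
assert len(cw) == 2 ** 3          # codeword length is 2^k
assert cw[(1, 1, 0)] == 1         # <(1,0,1), (1,1,0)> = 1 mod 2
```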


Decoding

We will prove that the Hadamard code is an LDC by giving a decoder. For i ∈ [k], define the i-th standard basis vector e_i = (1_{ℓ=i} | ℓ ∈ [k]) ∈ F_2^k.

Algorithm 2: Hadamard code (decoding)
Input: A vector z ∈ F_2^{F_2^k} and an integer i ∈ [k]
Output: An element r ∈ F_2
1 Pick an element y ∈ F_2^k uniformly at random;
2 Query s_1 := z(y) and s_2 := z(y + e_i);
3 return s_1 + s_2 ∈ F_2;

We can prove the following theorem using this decoder.

Theorem 2.1.8. Fix a message length k. Then, the Hadamard code is (2, δ, 2δ)-locally decodable for all δ.

Proof. Let δ be given and let i ∈ [k]. Let x ∈ F_2^k and let z ∈ F_2^{F_2^k} with Δ(C(x), z) ≤ δ. First, we show the correctness of the decoder in Algorithm 2. For this, assume that s_1 and s_2 are not corrupted. Then, s_1 = ⟨x, y⟩ and s_2 = ⟨x, y + e_i⟩. Hence, by the linearity of the inner product,

⟨x, y⟩ + ⟨x, y + e_i⟩ = ⟨x, y⟩ + ⟨x, y⟩ + ⟨x, e_i⟩ =(*) ⟨x, e_i⟩ = x(i).

In step (*) we used the fact that this addition is done in F_2, so s + s = 0 for all s ∈ F_2. Thus, if the two queries go to non-corrupted bits, then the algorithm outputs the correct symbol x(i).

We now show that the probability of querying a corrupted symbol is at most 2δ. Let E_1 be the event that s_1 is corrupted and let E_2 be the event that s_2 is corrupted. Each query goes to a uniformly random location of z and thus has probability at most δ of being corrupted. Then, by the union bound (Lemma 2.1.4),

P(s_1 + s_2 = x(i)) ≥ P(¬(E_1 ∪ E_2)) = 1 − P(E_1 ∪ E_2) ≥ 1 − P(E_1) − P(E_2) ≥ 1 − 2δ.

As the decoder makes two queries, we conclude that the Hadamard code is (2, δ, 2δ)-locally decodable.
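The decoder and the 2δ error bound can be illustrated by a small simulation (an illustrative Python sketch, not the thesis's code). We corrupt one of the 2^k codeword entries, so δ = 1/16 for k = 4, and recover every message bit by majority vote over repeated runs of Algorithm 2:

```python
import random
from itertools import product

def hadamard_encode(x):
    # Codeword as a dictionary keyed by y in F_2^k (implementation choice).
    return {y: sum(a * b for a, b in zip(x, y)) % 2
            for y in product((0, 1), repeat=len(x))}

def hadamard_decode(z, i, k):
    """Algorithm 2: query z at a random y and at y + e_i, return the sum."""
    y = tuple(random.randrange(2) for _ in range(k))
    y_shift = tuple((b + (l == i)) % 2 for l, b in enumerate(y))
    return (z[y] + z[y_shift]) % 2

random.seed(0)
k, x = 4, (1, 0, 1, 1)
z = hadamard_encode(x)
z[(0, 0, 0, 0)] ^= 1  # adversarially flip one of the 16 entries (delta = 1/16)

# Each run is correct with probability at least 1 - 2*delta = 7/8, so a
# majority vote over many independent runs recovers every bit.
for i in range(k):
    votes = sum(hadamard_decode(z, i, k) for _ in range(501))
    assert (1 if 2 * votes > 501 else 0) == x[i]
```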

2.1.3. Private information retrieval and ethics

As an intermezzo, we will discuss an important application of LDCs in cryptography, namely private information retrieval (PIR). The main goal of this application is to safeguard a user's privacy against database owners. For this reason, we will also discuss the ethical implications of PIR. We also show how the Hadamard code can be translated into a PIR scheme.


In a PIR scheme, we have a database stored on one or more servers and we have a user that wants to retrieve a certain entry from the database. However, the user does not want to reveal to the servers which entry they retrieve. Of course, if the user queries the entire database, they have perfect privacy. However, this solution is very inefficient, as the communication overhead would be far too large for big databases. It turns out that this is the only PIR protocol that provides privacy when the database is stored on a single server [8]. If, however, a copy of the database is stored on multiple servers, and if these servers cannot communicate, then we can apply LDCs to construct PIR schemes that are non-trivial and require less communication. That is the topic of this subsection. It is noteworthy that LDCs are in fact more or less equivalent to PIR schemes, in the sense that every LDC can be translated into a PIR scheme and every PIR scheme induces an LDC [17]. However, the details of the latter are beyond the scope of this thesis.

We give the definition of a 2-server PIR scheme, which generalises naturally to r-server PIR schemes. The idea is presented schematically in Figure 2.1. In this setting, the database is represented as a q-ary string of length k (i.e., as an element of F_q^k) and it is stored on two servers S_1 and S_2.

Definition 2.1.9. A 2-server private information retrieval protocol is a three-tuple of algorithms P = (Q, A, C) executing Protocol 3 and satisfying the following conditions.

• Correctness. For any database D ∈ F_q^k and for any i ∈ [k], the user outputs D(i) with probability 1, where the probability is taken over the random strings rand.
• Privacy. Assuming the two servers cannot communicate, for j ∈ {1, 2} the distributions que_j(i, rand) are identical for all i ∈ [k]. Stated differently, each server S_1, S_2 individually learns no information about the value of i.

Protocol 3: Private information retrieval protocol with two servers
Input: A database D ∈ F_q^k, two servers S_1, S_2, and an integer i ∈ [k]
Output: D(i)
1 The user U obtains a random string rand by performing some randomised procedure;
2 U invokes Q(i, rand) to obtain a tuple of queries (que_1, que_2);
3 U sends que_1 to server S_1 and que_2 to server S_2;
4 Server S_1 returns ans_1 := A(1, D, que_1) to U and S_2 returns ans_2 := A(2, D, que_2) to U;
5 U reconstructs D(i) by executing the reconstruction algorithm C(ans_1, ans_2, i, rand);


Figure 2.1: Schematic representation of a 2-server private information retrieval protocol. A database is stored on both servers. A user sends one query to each server and receives one answer from each server. Based on these answers, the user retrieves an entry from the database, while the servers individually learn no information about the entry that was retrieved. Image obtained from [32].

We will show how to construct a PIR scheme from an LDC. First, we state the explicit translation of the Hadamard code into a 2-server PIR protocol, after which we generalise this idea to arbitrary LDCs. The 2-server PIR protocol emerging from the Hadamard code essentially consists of splitting the Hadamard decoding algorithm (Algorithm 2) into three algorithms Q, A, C. The algorithm Q generates the queries and each query is sent to a different server. Algorithm A returns the corresponding entry of the codeword and algorithm C calculates the sum of the two answers. For C : F_2^k → F_2^{F_2^k} and for D ∈ F_2^k, we write C(D)(r) for the entry of C(D) corresponding to the vector r ∈ F_2^k.

Protocol 4: 2-server private information retrieval protocol, originating from the 2-query Hadamard code
Input: The Hadamard code C : F_2^k → F_2^{F_2^k} with corresponding decoding algorithm A_C (Algorithm 2), a database D ∈ F_2^k, two servers S_1, S_2 each storing C(D), and an integer i ∈ [k]
Output: D(i)
1 The user U picks rand ∈ F_2^k uniformly at random;
2 U invokes Q(i, rand) to obtain the tuple of queries (que_1, que_2) := (rand, rand + e_i);
3 U sends que_1 to server S_1 and que_2 to server S_2;
4 Each server S_j (j ∈ {1, 2}) sends ans_j = A(j, D, que_j) = C(D)(que_j) back to U;
5 U reconstructs D(i) by executing the reconstruction algorithm C(ans_1, ans_2, i, rand) = ans_1 + ans_2;
6 return ans_1 + ans_2;

The correctness follows from the fact that the protocol simply performs the Hadamard decoding algorithm (in this case, we assume C(D) is uncorrupted, thus ans_1 + ans_2 = ⟨D, rand⟩ + ⟨D, rand + e_i⟩ = D(i)), and the privacy follows from the fact that each server individually is queried either rand or rand + e_i. Recall that the servers cannot communicate. As rand is chosen uniformly at random from F_2^k, the vector rand + e_i individually is a uniformly random element of F_2^k as well. Hence, the distributions que_1(i, rand) = rand are uniform for all i ∈ [k] (and thus identical). The same holds for que_2(i, rand) = rand + e_i. In other words, each server individually learns no information about i while the user retrieves D(i) successfully from the database.
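Protocol 4 is easy to simulate; the Python sketch below (function names are ours) separates the roles of the three algorithms Q, A, and C as in the protocol, with an uncorrupted codeword stored on both servers:

```python
import random
from itertools import product

def encode(D):
    """Hadamard codeword of the database, stored identically on both servers."""
    return {y: sum(a * b for a, b in zip(D, y)) % 2
            for y in product((0, 1), repeat=len(D))}

def pir_retrieve(server1, server2, k, i):
    """Protocol 4: two queries that individually look uniformly random."""
    rand = tuple(random.randrange(2) for _ in range(k))            # algorithm Q
    que1 = rand
    que2 = tuple((b + (l == i)) % 2 for l, b in enumerate(rand))   # rand + e_i
    ans1, ans2 = server1[que1], server2[que2]                      # algorithm A
    return (ans1 + ans2) % 2                                       # algorithm C

D = (0, 1, 1, 0, 1)
cw = encode(D)
server1, server2 = cw, dict(cw)  # two non-communicating copies of C(D)
assert all(pir_retrieve(server1, server2, len(D), i) == D[i]
           for i in range(len(D)))
```

Each server alone sees a single uniformly random vector, so neither learns anything about i, while the user always reconstructs D(i) correctly.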

The translation from any r-query LDC C : F_q^k → F_q^N — with a corresponding randomised decoding algorithm A_C — to an r-server private information retrieval protocol is very similar. Any database D ∈ F_q^k is a message of length k and thus D can be mapped to a codeword C(D) by the LDC. Each server S_1, ..., S_r stores a copy of C(D). The user U wants to retrieve entry D(i) from the database. First, the user obtains a random string rand by performing a part of the randomised decoding algorithm A_C.² The invocation Q(i, rand) produces r numbers que_1, ..., que_r, indicating which entries of C(D) need to be queried for the next step of A_C. The user sends the query que_j to server S_j and receives ans_j = A(j, D, que_j) = C(D)(que_j). Finally, the reconstruction algorithm C(ans_1, ..., ans_r, i, rand) is the completion of A_C and outputs D(i).

Correctness follows directly from the correctness of the decoder of the LDC, because the algorithms Q, A, and C together form the decoding algorithm A_C of the LDC. In this setting, we assume all codewords are uncorrupted and thus the probability of reconstructing D(i) is 1. In each LDC discussed in this thesis, the entries que_1, ..., que_r are uniformly random vectors, because of their dependence on the random vector rand. That means that for any j ∈ [r] the distributions que_j(i, rand) are identical for all i ∈ [k] (namely uniform) and thus each server S_j individually gets no information about i.

To measure how 'good' a PIR scheme is, we need the notion of communication complexity, which is the total number of bits that needs to be communicated between the servers and the user, maximised over all possible databases D ∈ F_q^k, all i ∈ [k], and all random strings rand. This communication complexity is a function of the database length k. The goal when studying PIR schemes is to minimise the communication complexity. We discuss the communication complexity of a PIR scheme that is derived from an LDC.

Proposition 2.1.10. Let P be an r-server PIR scheme which is derived from an r-query LDC C : F_q^k → F_q^N. Then, the communication complexity of P is at most O(r · log_2(Nq)).

Proof. If we have an r-query LDC C : F_q^k → F_q^N which translates to an r-server PIR scheme, the communication amounts to sending r values from [N] (indicating which entries from the codeword in F_q^N need to be read) and receiving r elements from F_q. Each element of F_q can be represented with at most ⌈log_2(q)⌉ bits and a value from [N] can be represented with at most ⌈log_2(N)⌉ bits. Hence, the total communication complexity is at most r · ⌈log_2(N)⌉ + r · ⌈log_2(q)⌉ = O(r · log_2(Nq)).

Note that N is a function of k, depending on the underlying LDC, and r and q are considered constants. This means that PIR schemes require less communication when they are derived from LDCs with shorter codeword lengths, independently of the query complexity.
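As a quick illustration of Proposition 2.1.10, the bound r · ⌈log_2(N)⌉ + r · ⌈log_2(q)⌉ can be computed directly; the parameters below are hypothetical numbers of our own choosing.

```python
from math import ceil, log2

def pir_communication_bound(r, N, q):
    """Bits exchanged by a PIR scheme derived from an r-query LDC
    C : F_q^k -> F_q^N: r indices in [N] are sent, r elements of F_q received."""
    return r * ceil(log2(N)) + r * ceil(log2(q))

# e.g. a hypothetical 3-query LDC with N = 2**20 codeword entries over F_2:
assert pir_communication_bound(3, 2**20, 2) == 3 * 20 + 3 * 1
```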

Currently, the best-known PIR scheme is a 2-server scheme — based on the matching vector codes which we discuss in Chapter 3 — with communication complexity k^{O(√(log log k / log k))}, where k is the length of the database [9]. This scheme even translates to a subexponential 2-query LDC over an alphabet of size 2^{k^{o(1)}}, where k is the message length. As this is not a constant-sized alphabet, this result does not contradict the exponential lower bound in [18].

^2 In all the LDCs discussed in this thesis, rand is a uniformly random vector from some vector space over F_q.

Ethics

When it comes to client-database relations, the focus has always been to protect the database from malicious users. Security experts ensure that unauthorised users cannot edit databases and that they cannot retrieve entries not meant for their eyes (e.g., the balance on the bank account of another user). However, users are often not protected from databases. The purpose of private information retrieval schemes is to protect the users by increasing their privacy. They allow users to retrieve data from a database without revealing to the database owner what they retrieve. The question is whether it is desirable to give database users this privacy. On the one hand, privacy is considered a human right by the United Nations and many consider privacy a high moral value. However, there are also some critiques of privacy. We discuss one of the critiques presented in [22].

The critique is called the ‘other values’ critique. This critique says that privacy is often not as important as other values. An important example of such a value is security. It is widely known that in many countries telecommunication providers are obligated to store information on the internet usage of their customers. Government intelligence agencies are then allowed to investigate this data of criminal suspects. Although this would be desirable when it can be used to fight crime, the danger is that these ‘privacy invasions’ are used too often on innocent people or that the authority to invade a citizen’s privacy is handed to the wrong people. If, for example, a large company is compelled to gather user data, which it shares with a third party without the user granting it permission to do so, this would be considered wrong. The trade-off between individual privacy and general security is quite complicated.

If private information retrieval were used on such a large scale that criminals could abuse it to hide their criminal activities, then it would be arguable that PIR schemes should not be used or maybe not even be studied. The reason for this is that with PIR schemes criminals could freely query databases (e.g., via search engines) without worrying about showing suspicious behaviour. Analysing online behaviour is an important tool for intelligence agencies to protect the public from criminal threats, and the development of good PIR schemes may dilute this tool.

Although this trade-off is important to consider, today PIR schemes are rarely used for practical purposes [16]. Furthermore, before PIR schemes would be used on a large scale, they would first be used in small settings for smaller databases (e.g., for a cloud service of a company with small databases). Before large-scale use, it would also be studied thoroughly how to handle criminal activity and how to allow the search for suspicious behaviour to find criminals, as the discussion on the trade-off between privacy and security is very prevalent in this day and age. Hence, (for now) the theoretical study of PIR schemes can continue without too much worry.


2.2. Higher-order Fourier analysis

In this section, we formulate Szemerédi's theorem and its stochastic refinement, we formalise the definition of metric entropy, and we sketch how these components relate to each other. We show how dual functions come into play and why it is desirable to understand them. Then, we conclude by stating the conjecture on the approximation of dual functions in terms of polynomial phase functions. In Chapter 5, we show how to refute this conjecture.

2.2.1. Motivation: Szemerédi's theorem and random differences

As stated in the introduction, a central theorem in additive combinatorics is Szemerédi's theorem, which says that any large enough subset of the natural numbers contains arbitrarily long arithmetic progressions [27]. It is quite usual to study additive combinatorics in a finite field setting instead of working with the natural numbers. This ‘finite field philosophy’ [14, 31] says that reformulating problems into F_p^n (for a small prime p) often makes the proofs easier and cleaner, while the techniques in this setting still guide the techniques in the more general setting. Furthermore, the finite field setting is generally more applicable to theoretical computer science.

In F_p^n, an arithmetic progression is defined as follows.

Definition 2.2.1. A k-term arithmetic progression (k-AP) with common difference y is a sequence of the form

x, x + y, x + 2y, . . . , x + (k − 1)y,  where x, y ∈ F_p^n.  (2.2)

If y ≠ 0, we call this a proper k-AP.

In this setting, Szemerédi's theorem is stated as follows [12, 29].

Theorem 2.2.2 (Szemerédi's theorem in a finite field setting). Let k ∈ N, let δ ∈ (0, 1] and let p ≥ k be a prime number. Then, there is a positive integer N = N(k, δ, p) (i.e., N depends on k, δ, and p), such that for every integer n ≥ N, any subset A ⊂ F_p^n with |A| ≥ δp^n contains a proper k-term arithmetic progression.

Random Differences

A stochastic refinement of Szemerédi's theorem concerns restricting the common difference y in (2.2) to be an element of a random subset D ⊂ F_p^n having a Bernoulli-ρ distribution. First, we define this distribution.

Definition 2.2.3. A random set D ⊂ F_p^n has a Bernoulli-ρ distribution if, for every element a ∈ F_p^n, it holds that P(a ∈ D) = ρ, independently of all other elements of F_p^n.

Note that when D has a Bernoulli-ρ distribution, the expected size of D is ρ · |F_p^n|. In this setting, we want to know under which conditions the following statement — which we will label as ♣ — holds.


Statement ♣: Szemerédi with random differences

Let n, k ∈ N, let δ, ρ ∈ (0, 1] and let p ≥ k be a prime number. Let D ⊂ F_p^n be a Bernoulli-ρ distributed random set. We say that ♣ holds if, with probability at least 1/2, any subset A ⊂ F_p^n with |A| ≥ δp^n contains a proper k-AP with common difference in D.

The choice of 1/2 for the probability is arbitrary and could be replaced by any ε ∈ (0, 1). We make this explicit choice to avoid having an extra parameter ε, as this is not needed for our purposes.
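A Bernoulli-ρ distributed set as in Definition 2.2.3 is easy to sample: flip one independent ρ-biased coin per element of F_p^n. A minimal sketch (function name our own):

```python
import itertools
import random

def bernoulli_subset(p, n, rho, seed=None):
    """Sample D ⊆ F_p^n by including each vector independently with prob. rho."""
    rng = random.Random(seed)
    return {v for v in itertools.product(range(p), repeat=n)
            if rng.random() < rho}

D = bernoulli_subset(3, 2, 0.5, seed=1)
# Every sampled element lies in F_3^2, whose size bounds |D|.
assert all(len(v) == 2 and all(0 <= c < 3 for c in v) for v in D)
assert len(D) <= 3**2
```

The expected size of the sampled set is ρ · p^n, matching the remark above.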

We know from Theorem 2.2.2 that for every positive integer k, every prime p ≥ k, and every δ ∈ (0, 1], there is a positive integer N_0 = N_0(k, δ, p), such that for every integer n ≥ N_0, any subset A ⊂ F_p^n with |A| ≥ δp^n contains a proper k-AP (with common difference in F_p^n). Thus, for any p, k, δ, and n ≥ N_0, we know that ♣ holds for ρ = 1, as in that case D = F_p^n. This motivates the following question.

Question 1. Let p be a prime number. Let k ≤ p be a positive integer and let δ ∈ (0, 1]. For each n ≥ N_0(k, δ, p), what is the smallest ρ ∈ (0, 1] for which ♣ holds?

With this question in mind, we give the following definition.

Definition 2.2.4. Fix a prime number p. For each integer k ≤ p and for each δ ∈ (0, 1], let ρ_{k,δ} denote the function

ρ_{k,δ} : {N_0, N_0 + 1, N_0 + 2, . . .} → (0, 1]

that maps n to the smallest ρ ∈ (0, 1] for which ♣ holds with these n, k, δ, and p.

We refer to ρ_{k,δ}(n) as the minimal sample probability (for k and δ) in F_p^n. Fascinatingly, understanding the function ρ_{k,δ}(n) has applications to our understanding of locally decodable codes. It turns out that there exist k-query LDCs mapping messages of length Ω(p^n · ρ_{k,δ}(n)) to codewords of length O(p^n) [6]. This means that lower bounds on ρ_{k,δ} imply the existence of k-query LDCs (for larger lower bounds, better LDCs exist, as longer messages can be mapped to the same codeword length). The best-known lower bounds on the minimal sample probability in F_p^n (where p is considered fixed and n grows to infinity) were established in 2019. Altman showed that for 3-APs in F_p^n (with p odd) we have ρ_{3,δ}(n) = Ω(p^{−n} n^2) for all δ > 0 [1]. Then, Briët showed that for k-APs we have ρ_{k,δ}(n) = Ω(p^{−n} n^{k−1}) [5]. It is in fact conjectured that these bounds are optimal, in the sense that ρ_{k,δ}(n) = O(p^{−n} n^{k−1}) as well. It turns out that this is the case if the ‘dual functions for APs’ have a small ‘metric entropy number’. This statement is tied to the conjecture that dual functions can be approximated in terms of polynomial phases, which we mentioned in the introduction. Should the conjecture be true, this would imply that these bounds are optimal. We will see this (without details) in the following subsection. Although the conjecture is not true in general, it can also be proven that the currently known counterexamples do not capture the sets of dual functions for APs. This means that it could still be possible that these specific sets of dual functions can be approximated in terms of polynomial phases to prove that this bound is optimal. This remains an open question and we discuss this further in Chapter 5.


2.2.2. Entropy numbers, dual functions and polynomial phases

In this subsection, we define the metric entropy number, the sets of dual functions, and the polynomial phase functions, and we show how these concepts tie together. We will formulate a theorem stating that upper bounds on the entropy number of dual functions imply upper bounds on the minimal sample probability ρ_{k,δ}, and we formulate the central conjecture, saying that dual functions can be approximated well by polynomial phases (which has been refuted). To define the entropy number, we first need the notions of the Minkowski sum of sets, the convex hull of a set, and the L^p-norms.

Definition 2.2.5. For two sets A, B ⊆ C^N, the Minkowski sum of A and B is the set

A + B = {a + b | a ∈ A, b ∈ B}.

Definition 2.2.6. Let A ⊆ C^N be a finite set. The convex hull of A is the set

Conv_C(A) = { Σ_{a∈A} c_a a | each c_a ∈ C, Σ_{a∈A} |c_a| ≤ 1 }.

We define the L^p-norms of functions from a finite abelian group G to the field C and the ℓ^p-norms of vectors in C^N.

Definition 2.2.7. Let G be a finite abelian group. Let f : G → C be a function and let a ∈ C^N. For p ∈ [1, ∞), define

‖f‖_{L^p(G)} := (E_{x∈G} |f(x)|^p)^{1/p},  ‖f‖_{L^∞(G)} := max_{x∈G} |f(x)|,

and

‖a‖_{ℓ^p} := (Σ_{i=1}^N |a(i)|^p)^{1/p},  ‖a‖_{ℓ^∞} := max_{i∈[N]} |a(i)|.

Remark. Note that for f ∈ C^G, we can interpret f as a function and use the L^∞-norm, but we can also interpret f as the complete G-evaluation of that function (as a vector in C^{|G|}) and use the ℓ^∞-norm. As ‖f‖_{L^∞(G)} = ‖f‖_{ℓ^∞}, we will use these interchangeably.

The entropy number of a (bounded) set A ⊂ C^N for some ε, M > 0 is defined as the size of a smallest finite set B ⊂ C^N, for which all vectors in B have ℓ^∞-norm at most M, such that for any a ∈ A there is a b ∈ Conv_C(B) for which ‖a − b‖_{ℓ^∞} ≤ ε. In other words, there exists a c ∈ εD^N such that a = b + c. Hence, A ⊂ Conv_C(B) + εD^N. This is the definition we will use.


Definition 2.2.8 (Entropy number). Let A ⊂ C^N be a bounded set and let ε, M > 0. The entropy number of A, denoted N(A, ℓ^∞, M, ε), is the size of a smallest finite set B ⊆ MD^N for which

A ⊂ Conv_C(B) + εD^N.

If such a B does not exist, we write N(A, ℓ^∞, M, ε) = ∞.

Remark. We will only consider the entropy number for finite sets A and for numbers M with A ⊆ MD^N. Thus, for our purposes, we will always have N(A, ℓ^∞, M, ε) ≤ |A| < ∞.

Lemma 2.2.9. Let A ⊂ C^N be a finite set and let I ⊂ [N] be a set of coordinates. Consider the projection of A to I, denoted A_I = {(a(i))_{i∈I} | a ∈ A}, and let ε, M > 0. Then,

N(A, ℓ^∞, M, ε) ≥ N(A_I, ℓ^∞, M, ε).

Proof. If A ⊂ Conv_C(B) + εD^N, then A_I ⊂ Conv_C(B_I) + εD^I and |B| ≥ |B_I|.

The upper bounds on ρ_{k,δ} depend on the entropy number of the set of dual functions.

Definition 2.2.10. Let G be a finite abelian group. For each ‘pattern’ i ∈ Z_{≥0}^d, we define the corresponding family of dual functions

Δ_i := {ϕ ∈ D^G | ϕ(y) = E_{w∈G} f_1(w + i(1)y) · · · f_d(w + i(d)y), f_1, . . . , f_d : G → D}.

Thus, each ϕ ∈ Δ_i is a function from G to D. Each choice of f_1, . . . , f_d gives another dual function. The triangle inequality shows that ϕ maps to D (as each f_i maps to D).

Example 1. Let A ⊂ G. Let i = (0, 1, 2, . . . , d − 1) ∈ Z_{≥0}^d and let f_1, . . . , f_d all be 1_A : G → {0, 1} with 1_A(x) = 1 if x ∈ A and otherwise 1_A(x) = 0. Then, for the corresponding ϕ ∈ D^G, we have

ϕ(y) = E_{w∈G} 1_A(w) 1_A(w + y) · · · 1_A(w + (d − 1)y),

which gives the fraction of d-APs in A with common difference y ∈ G. This set of dual functions appears frequently in additive combinatorics.
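The dual function of Example 1 can be evaluated directly from its definition. The sketch below uses the cyclic group G = Z_N as a toy stand-in; names and parameters are our own.

```python
def dual_function_ap(A, N, d):
    """phi(y) = E_w 1_A(w) 1_A(w+y) ... 1_A(w+(d-1)y) over G = Z_N:
    the fraction of starting points w whose d-AP with difference y lies in A."""
    A = set(A)
    def phi(y):
        hits = sum(all((w + j * y) % N in A for j in range(d)) for w in range(N))
        return hits / N
    return phi

phi = dual_function_ap({0, 1, 2, 3}, 7, 3)      # 3-APs in A ⊂ Z_7
assert phi(1) == 2 / 7                           # only w = 0 and w = 1 work
assert 0 <= phi(2) <= 1                          # phi maps into the unit disk D
```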

We now state how the minimal sample probability ρ_{3,δ}(n) relates to the entropy number of the dual functions for 3-APs (with i = (0, 1, 2)).

Proposition 2.2.11. Let p be an odd prime, let δ ∈ (0, 1] and let n be in the domain of ρ_{3,δ}. Let G = F_p^n. Then, there exist ε, C_0 > 0 such that

ρ_{3,δ}(n) ≤ p^{−n} C_0 log_p(N(Δ_{(0,1,2)}, ℓ^∞, 1, ε)).

This is proven by Altman in the integer setting [1], but the same proof works in F_p^n. We do not give the proof here, but this motivates the study of dual functions. We see indeed that upper bounds on the entropy number of the dual functions Δ_{(0,1,2)} with domain G = F_p^n imply upper bounds on ρ_{3,δ}(n).

The question now is how to bound the entropy number of the dual functions. One suggestion is to approximate the dual functions with low-degree ‘polynomial phase functions’, which are well-understood functions, because they are higher-degree analogues of characters.


Definition 2.2.12. Let p be a prime number and let n ∈ N. Then, a polynomial phase function of degree d is a function ψ : F_p^n → D of the form ψ(x) = e^{2πi P(x)/p}, where P ∈ F_p[z_1, . . . , z_n] is a polynomial of total degree d.

Remark. Note that the polynomial phase functions of degree 1 are exactly the characters of F_p^n [28, Section 4.1]. This is where the name ‘higher-order’ Fourier analysis comes from. In linear Fourier analysis, one studies the correlation between functions f : F_p^n → C and characters of F_p^n, while in higher-order Fourier analysis, we study their correlation with polynomial phase functions of higher degree.
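Definition 2.2.12 translates directly into code; the sketch below evaluates a polynomial phase function for a degree-2 polynomial of our own choosing.

```python
from cmath import exp, pi
from itertools import product

def poly_phase(p, P):
    """Return psi(x) = e^{2*pi*i*P(x)/p} for a polynomial P : F_p^n -> F_p."""
    return lambda x: exp(2j * pi * P(x) / p)

p = 5
P = lambda x: (x[0] * x[1] + 3 * x[0]) % p      # total degree 2 over F_5
psi = poly_phase(p, P)
# Every value lies on the unit circle, so psi maps into the disk D with |psi| = 1.
assert all(abs(abs(psi(x)) - 1) < 1e-9 for x in product(range(p), repeat=2))
```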

In 2019, Shao used higher-order Fourier analysis to prove that dual functions can in fact be approximated by polynomial phases in the following sense (see [7] for the general idea of the proof).

Proposition 2.2.13. Let d, n ∈ N. Let p ≥ d + 1 be a prime and let G = F_p^n. Let ε > 0 and let i ∈ Z_{≥0}^d be arbitrary. Then, there is an M = M(d, p, ε) > 0, such that for all ϕ ∈ Δ_i

• there exist polynomial phase functions ψ_1, . . . , ψ_r : F_p^n → D of degree d − 1,

• there exist α_1, . . . , α_r ∈ C with Σ_{i=1}^r |α_i| ≤ M,

• there exists an error-function τ : F_p^n → C with ‖τ‖_{L^1} ≤ ε,

for which

ϕ = Σ_{i=1}^r α_i ψ_i + τ.  (2.3)

Thus, for any ε > 0 dual functions can be approximated by ‘convex combinations’ of polynomial phase functions of degree d − 1, such that the difference has L^1-norm at most ε. However, by definition of the L^1-norm, there might be elements a ∈ F_p^n for which |τ(a)| = |ϕ(a) − Σ_{i=1}^r α_i ψ_i(a)| > ε, as long as ‖τ‖_{L^1} = p^{−n} Σ_{x∈F_p^n} |τ(x)| ≤ ε. It turns out that the approximation in the L^1-norm is not strong enough to say something about the entropy number of the dual functions. Proposition 2.2.13 did, however, inspire the following conjecture, saying that the same approximation can be done in the L^∞-norm.

Conjecture 1. Proposition 2.2.13 also holds for an error-function τ : F_p^n → C with ‖τ‖_{L^∞} ≤ ε.

This conjecture is the central subject of this thesis. The following proposition shows how this conjecture is related to the entropy number of the dual functions.

Proposition 2.2.14. Suppose Conjecture 1 is true. Let ε > 0. Let d, n, p, G, M be as in Proposition 2.2.13. Let i ∈ Z_{≥0}^d and let

B = {ψ : F_p^n → C | ψ is a polynomial phase of degree d − 1}.

Then,

N(Δ_i, ℓ^∞, M, ε) ≤ |B| = p^{O(n^{d−1})}.


Proof. Note that any ψ ∈ B maps a vector x ∈ F_p^n to e^{2πi P(x)/p} (for some polynomial P in F_p[z_1, . . . , z_n] of total degree d − 1). This means that |B| = p^{(n+d−1 choose d−1)}, because there are (n+d−1 choose d−1) monomials of degree at most d − 1 in n variables^3 (Lemma 2.1.7). Now, as p and d are considered constants, and as we let n → ∞, it holds that

log_p(|B|) = (n+d−1 choose d−1) ≤ (n + d − 1)^{d−1}/(d − 1)! ≤ (n + d − 1)^{d−1} = O(n^{d−1}),

where the first inequality is Lemma 2.1.6.

Thus, |B| = p^{O(n^{d−1})}. We will prove that the entropy number of Δ_i is bounded by |B|. Let N = |F_p^n| = p^n. All polynomial phase functions map to D. When interpreting the set B as a set of vectors (F_p^n-evaluations of the functions), this means that B ⊂ D^N. Let ϕ ∈ Δ_i be arbitrary. If Conjecture 1 is true, there are ψ_1, . . . , ψ_r ∈ B, there are α_1, . . . , α_r ∈ C with Σ_{i=1}^r |α_i| ≤ M, and there is a τ : F_p^n → C with ‖τ‖_{ℓ^∞} ≤ ε (identifying τ with its F_p^n-evaluation, thus τ ∈ εD^N), such that

‖ϕ − Σ_{i=1}^r α_i ψ_i‖_{ℓ^∞} = ‖τ‖_{ℓ^∞} ≤ ε.  (2.4)

We show that Δ_i ⊂ Conv_C(MB) + εD^N. Note that MB ⊂ MD^N, because B ⊂ D^N. Also, Σ_{i=1}^r α_i ψ_i = Σ_{i=1}^r (α_i/M) · Mψ_i ∈ Conv_C(MB), because Σ_{i=1}^r |α_i/M| ≤ (1/M) · M = 1. Hence, Equation (2.4) implies that ϕ ∈ Conv_C(MB) + εD^N. Therefore,

Δ_i ⊂ Conv_C(MB) + εD^N.

This means N(Δ_i, ℓ^∞, M, ε) ≤ |MB| = |B|, by definition of the entropy number.

Thus indeed, if Conjecture 1 were true for the pattern i = (0, 1, 2), Proposition 2.2.11 would show that

ρ_{3,δ}(n) ≤ p^{−n} O(n^2) = O(p^{−n} n^2),

and thus that Altman's lower bound would be optimal.

Although Briët and Labib found infinitely many primes p and patterns i ∈ Z_{≥0}^d for which they could derive a contradiction from Proposition 2.2.14 (which means Conjecture 1 cannot be true for all patterns i), they also showed that their counterexamples do not capture arithmetic progressions (patterns i = (0, 1, . . . , d − 1)). This means that Altman's lower bound could still be proven to be optimal by showing that Conjecture 1 holds for i = (0, 1, 2). The question of whether Conjecture 1 holds for APs remains unanswered and should be studied further.
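The monomial count used in the proof of Proposition 2.2.14 can be checked by brute force for small parameters; a sketch with our own helper name:

```python
from itertools import product
from math import comb

def count_monomials(n, max_deg):
    """Number of monomials in n variables of total degree at most max_deg,
    counted by enumerating exponent vectors."""
    return sum(1 for e in product(range(max_deg + 1), repeat=n)
               if sum(e) <= max_deg)

# Degree <= d-1 in n variables: (n + d - 1 choose d - 1).
for n, d in [(2, 3), (3, 4), (4, 2)]:
    assert count_monomials(n, d - 1) == comb(n + d - 1, d - 1)
```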

^3 The number of monomials of degree at most d − 1 in n variables is equal to the number of ways to choose an unordered subset of size d − 1, with repetitions allowed, from the set {1, z_1, z_2, . . . , z_n}.


3. Matching vector codes

In this chapter, we study the matching vector (MV) codes over the group Z_m, based on [10, 23] and on [35, Chapter 4]. From a computer science perspective, these codes are interesting, as for composite numbers m, these are currently the best-known LDCs in terms of short codeword lengths [10]. Furthermore, when restricting m to prime numbers, these constructions play a central role in refuting Conjecture 1, meaning that MV codes also have applications in higher-order Fourier analysis.

The existence of any matching vector code relies strongly on two components:

1. the existence of a ‘matching vector family’ over Z_m;

2. the existence of a ‘sparse decoding polynomial’.

Especially the second component will be important for refuting Conjecture 1, as the first component is well-understood.

In Section 3.1, we will define these components formally and we give the encoding algorithm of an MV code. In Section 3.2, we will define a decoder and we will prove that MV codes are locally decodable.

In Section 3.3, we restrict m to composite numbers only, as in that case we know a useful sparse decoding polynomial exists (as we shall see). The existence of a matching vector family for composite m is far from trivial and is outside the scope of this thesis. A survey on this topic can be found in [35, Chapter 5].

In Section 3.4, we restrict m to prime numbers only, as for primes the situation is the other way around. We can construct a useful and fairly straightforward matching vector family, but the existence of a sparse decoding polynomial is complicated. We will tackle that problem in Chapter 4. The explicit matching vector family will also allow us to give the relation between the codeword length N and the message length k, as this relation for MV codes depends solely on the underlying matching vector family. It turns out that N = q^{O(k^{1/t})}, where q is a prime and t is a positive integer such that m | q^t − 1. Finally, in Subsection 3.4.1 we restrict m to Mersenne primes (primes of the form 2^t − 1). For Mersenne primes, we do know that a useful sparse decoding polynomial exists and, as we already had matching vector families for general primes, this means that for any Mersenne prime we have a matching vector code.

3.1. General framework and encoding

We start by defining the two components on which the existence of MV codes relies. To define matching vector families, we need an inner product (similar to the one used for the Hadamard code in Subsection 2.1.2). For positive integers m and n, we define the following inner product on Z_m^n:

⟨u, v⟩ := u^T v mod m.

Remark. For a prime p, Z_p^n is a vector space over the finite field F_p, which we also denote by F_p^n. We will write Z_m^n for the general case (m composite or prime).

Next, we define a matching vector family over Z_m.

Definition 3.1.1. Let m ≥ 2 be a positive integer and let S ⊂ Z_m \ {0}. Two families of vectors U = {u_1, . . . , u_k} ⊂ Z_m^n, V = {v_1, . . . , v_k} ⊂ Z_m^n form an S-matching vector family of size k if

a) for all i ∈ [k] it holds that ⟨u_i, v_i⟩ = 0,

b) for all i, j ∈ [k] with i ≠ j it holds that ⟨u_i, v_j⟩ ∈ S.

Example 2. In the finite field setting, we can construct an easy matching vector family over F_p^n. For each subset B ⊂ [n] = {1, . . . , n} with |B| = p − 1, define

u_B := (1_B(i) | i ∈ [n])^T,  v_B := (1 − 1_B(i) | i ∈ [n])^T ∈ F_p^n.

Thus, u_B is the indicator vector of the set B and v_B is the indicator vector of the set B^∁ (the complement of B). This means that for any B_1, B_2 ⊆ [n], each of size p − 1, we have

⟨u_{B_1}, v_{B_2}⟩ = u_{B_1}^T v_{B_2} = Σ_{i=1}^n 1_{B_1}(i)(1 − 1_{B_2}(i)) = |{i ∈ [n] | i ∈ B_1, i ∉ B_2}| = |B_1 ∩ B_2^∁|.

It then follows immediately that u_B^T v_B = 0, because B and B^∁ have empty intersection. For B_1 ≠ B_2, it holds that

u_{B_1}^T v_{B_2} = |B_1 ∩ B_2^∁| ∈ {1, . . . , p − 1},

because B_1 and B_2^∁ have non-empty intersection and B_1 contains exactly p − 1 elements (and thus any intersection with B_1 contains at most p − 1 elements). Hence,

U = {u_B | B ⊆ [n], |B| = p − 1},  V = {v_B | B ⊆ [n], |B| = p − 1}

form an S-matching vector family over F_p^n with S = {1, . . . , p − 1}.
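The family of Example 2 can be verified exhaustively for small parameters; a sketch for p = 3 and n = 4:

```python
from itertools import combinations

p, n = 3, 4
subsets = list(combinations(range(n), p - 1))        # all B ⊆ [n] with |B| = p-1
u = {B: [1 if i in B else 0 for i in range(n)] for B in subsets}   # u_B = 1_B
v = {B: [0 if i in B else 1 for i in range(n)] for B in subsets}   # v_B = 1 - 1_B

def ip(a, b):                                        # inner product mod p
    return sum(x * y for x, y in zip(a, b)) % p

# <u_B, v_B> = 0, and <u_B1, v_B2> in S = {1, ..., p-1} for B1 != B2.
assert all(ip(u[B], v[B]) == 0 for B in subsets)
assert all(ip(u[B1], v[B2]) in range(1, p)
           for B1 in subsets for B2 in subsets if B1 != B2)
```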


Framework

We will need some results from group theory [19, Theorem 1.15], which we will state without proof.

Proposition 3.1.2. Let G be a finite cyclic group of order n (i.e., |G| = n). The following properties hold.

a) For every positive divisor k of n, G contains a unique subgroup of order k.

b) Every subgroup of G is cyclic.

c) If G = ⟨g⟩ (i.e., g generates G),^1 then all other generators are the elements g^r with gcd(r, n) = 1.

We also need a result from finite field theory [19, Theorem 2.8].

Proposition 3.1.3. For every finite field F_q, the multiplicative group F_q^* is cyclic.

We will frequently use the following definition.

Definition 3.1.4. Let p and q be distinct primes. We define the multiplicative order of q in F_p^*, denoted ord_p(q), as the smallest number t ∈ N for which q^t ≡ 1 mod p.
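The multiplicative order of Definition 3.1.4 can be computed by repeated multiplication; a minimal sketch (function name our own):

```python
def ord_mod(p, q):
    """Smallest t >= 1 with q**t ≡ 1 (mod p), assuming gcd(q, p) = 1."""
    t, x = 1, q % p
    while x != 1:
        x = (x * q) % p
        t += 1
    return t

assert ord_mod(7, 2) == 3       # 2^3 = 8 ≡ 1 mod 7
assert ord_mod(7, 3) == 6       # 3 generates F_7^*
assert pow(5, ord_mod(11, 5), 11) == 1
```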

To define sparse decoding polynomials, we first define the sparsity of a polynomial.

Definition 3.1.5. For a univariate polynomial Q(X) = Σ_{j=0}^r c_j X^j, we define the support of Q, denoted supp(Q), as the ascending sequence of powers for which the corresponding coefficients are non-zero. Thus,

supp(Q) = (j ∈ Z_{≥0} | c_j ≠ 0).

The sparsity or support size of Q is defined as the length of supp(Q). Informally, we say a polynomial is sparse when its support size is ‘small’.

Example. Let Q(X) = 5 + 2X + 7X^4. Then, supp(Q) = (0, 1, 4) and Q has sparsity 3.

We can now set up the framework of an MV code. Let m ≥ 2 be a positive integer and let q be a prime number. Choose t such that m | q^t − 1.^2 By Proposition 3.1.3, F_{q^t}^* is cyclic. Since m | q^t − 1 = |F_{q^t}^*|, by Proposition 3.1.2 we know that F_{q^t}^* has a unique subgroup G_m of order m which is cyclic; thus there exists a g ∈ F_{q^t}^* of order m that generates G_m.

A crucial element that is needed for a matching vector code is the existence of a sparse univariate polynomial that satisfies certain conditions.

^1 A group G with |G| = n is cyclic if there is a generator g ∈ G. That is, if for all x ∈ G, there is a j ∈ Z such that x = g^j. In that case, g has order n.

^2 If m is prime, we can set t = ord_m(q), which means m | q^t − 1. For composite m, we can first choose a prime q and a positive integer t arbitrarily (as the order of choice does not matter), after which we pick m to be some product of the prime factors of q^t − 1.


Definition 3.1.6. Let S ⊂ Z_m \ {0}. A univariate polynomial Q ∈ F_q[X] is called an (S, d)-decoding polynomial if Q has support size d and if Q satisfies

Q(1) ≠ 0,

Q(g^s) = 0 for all s ∈ S,

where g ∈ F_{q^t}^* is of order m as stated above. We also call this a d-sparse decoding polynomial (if S is not known), or simply a decoding polynomial.

Finally, we define the following homomorphism, which we will use in the encoding algorithm. For u ∈ Z_m^n, define the mapping from (Z_m^n, +) to (F_{q^t}^*, ·) by

a ↦ g^{⟨u, a⟩}.  (3.1)

This is a homomorphism from the additive group Z_m^n to the multiplicative group F_{q^t}^*, as for any a, b ∈ Z_m^n we have

g^{⟨u, a+b⟩} = g^{⟨u, a⟩} g^{⟨u, b⟩}.

Encoding

Throughout this chapter, we will assume the following.

Assumption 1. Let m ≥ 2 and t be positive integers, and let q be a prime such that m | q^t − 1. Let g ∈ F_{q^t}^* be of order m and let S ⊆ Z_m \ {0}, such that we have an S-matching vector family U = {u_1, . . . , u_k}, V = {v_1, . . . , v_k} ⊆ Z_m^n of size k.

The encoding works as follows.

Algorithm 5: Matching vector code (encoding, m is composite or prime)
Input: A message x = (x(1), . . . , x(k)) ∈ F_{q^t}^k
Output: A codeword C(x) ∈ (F_{q^t})^{Z_m^n}
1 Construct the function F_x : Z_m^n → F_{q^t} : a ↦ Σ_{i=1}^k x(i) g^{⟨u_i, a⟩};
2 C(x) := (F_x(t) : t ∈ Z_m^n);
3 return C(x);

Thus, the codeword is a complete Z_m^n-evaluation of F_x. Note that F_x(a) is a sum of products, all taken in the field F_{q^t}. It might seem unnatural that our message space is F_{q^t}^k, but there are methods to construct LDCs over the binary alphabet F_2 from these codes [10, 23]. We will not discuss these, as this message space suffices for our purposes.

We see that for a matching vector family of size k in Z_m^n, messages of length k are mapped to codewords of length N = m^n. This means that the relation between N and k is based on the underlying matching vector family. As the detailed study of constructions of such families is outside the scope of this thesis, we will only prove the relation between N and k for the prime setting (where we do construct an MV family). For composites, the MV family that currently gives the shortest codewords is the Grolmusz family [15], which was used by Efremenko to construct a 3-query LDC [10]. For this LDC, we have N = exp(exp(O(√(log k log log k)))), which is subexponential in k.
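For t = 1 (so that F_{q^t} = F_q and all products are ordinary products modulo q), Algorithm 5 is a direct transcription. The sketch below uses toy parameters of our own choosing: m = 6, q = 7, the generator g = 3 of order 6 in F_7^*, and the small matching family u_1 = (1, 0), u_2 = (0, 1) over Z_6^2.

```python
from itertools import product

m, q, g = 6, 7, 3                 # m | q - 1 and g = 3 has order m in F_7^*
U = [(1, 0), (0, 1)]              # matching vectors over Z_6^2
points = list(product(range(m), repeat=2))

def encode(x):
    """Algorithm 5 with t = 1: the codeword is the Z_m^n-evaluation of
    F_x(a) = sum_i x(i) * g^<u_i, a>  (mod q)."""
    def F(a):
        return sum(xi * pow(g, sum(ui * ai for ui, ai in zip(u, a)) % m, q)
                   for xi, u in zip(x, U)) % q
    return {a: F(a) for a in points}

C = encode([2, 5])
assert len(C) == m ** 2                      # codeword length N = m^n
assert C[(0, 0)] == (2 + 5) % q              # g^0 = 1 at the origin
```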


3.2. Decoding

In this section, we present a decoder for MV codes and we prove local decodability.

Assumption 2. Let m ≥ 2 be a positive integer, let q be prime and let t be a positive integer such that m | q^t − 1. Let g ∈ F_{q^t}^* be of order m and let S ⊆ Z_m \ {0}, such that we have an S-matching vector family U = {u_1, . . . , u_k}, V = {v_1, . . . , v_k} ⊆ Z_m^n of size k. Assume, for ℓ ∈ [d], that there are c_ℓ ∈ F_q^* and s_ℓ ∈ Z_{≥0} such that we have a sparse (S, d)-decoding polynomial^3

Q(X) = Σ_{ℓ=1}^d c_ℓ X^{s_ℓ} ∈ F_q[X].

The decoder works as follows.

Algorithm 6: Matching vector code (decoding, m is composite or prime)
Input: A vector z ∈ (F_{q^t})^{Z_m^n} and an integer i ∈ [k]
Output: An element of F_{q^t}
1 Pick w ∈ Z_m^n uniformly at random;
2 Obtain {r_ℓ := z(w + s_ℓ v_i) | ℓ ∈ [d]} ⊂ F_{q^t} by querying the coordinates of z corresponding to the points w + s_ℓ v_i;
3 return (Σ_{ℓ=1}^d c_ℓ r_ℓ) · (Q(1) g^{⟨u_i, w⟩})^{−1};

We see that the decoder depends strongly on the existence of the polynomial Q. Also, the product Q(1) g^{⟨u_i, w⟩} is well-defined and invertible, because 0 ≠ Q(1) ∈ F_q, 0 ≠ g^{⟨u_i, w⟩} ∈ F_{q^t}^*, and F_{q^t} is a finite field extension of F_q (which means F_{q^t} can be viewed as a vector space over F_q, see [19, Section 1.4]). Further, Σ_{ℓ=1}^d c_ℓ r_ℓ is a linear combination of vectors in the vector space F_{q^t} over F_q, so the return value of Algorithm 6 is well-defined.

To prove that this decoder is correct and that the matching vector code is indeed locally decodable, we will rely on a property of the homomorphisms g^{⟨u, a⟩}.

Lemma 3.2.1. Assume the statements in Assumption 2. Let i, j ∈ [k]. For any w ∈ Z_m^n, it holds that

Σ_{ℓ=1}^d c_ℓ g^{⟨u_j, w + s_ℓ v_i⟩} = Q(1) g^{⟨u_i, w⟩} ≠ 0 if i = j, and = 0 if i ≠ j.

Proof. We have

Σ_{ℓ=1}^d c_ℓ g^{⟨u_j, w + s_ℓ v_i⟩} = g^{⟨u_j, w⟩} Σ_{ℓ=1}^d c_ℓ g^{s_ℓ ⟨u_j, v_i⟩} = g^{⟨u_j, w⟩} Q(g^{⟨u_j, v_i⟩}).  (3.2)

Since Q is an (S, d)-decoding polynomial, we know Q(1) ≠ 0 and Q(g^s) = 0 for all s ∈ S. Also, the inner product ⟨u_j, v_i⟩ = 0 when i = j and is in S when i ≠ j. The result then follows from (3.2).

^3 We will see that for composites, the coefficients of Q are elements of F_{q^t} instead of F_q. The decoder, however, is the same for composites and primes; the only difference is that products are taken in F_{q^t}.


Theorem 3.2.2. Assume the statements in Assumption 2. The matching vector code C : F_{q^t}^k → (F_{q^t})^{Z_m^n} defined by Algorithms 5 and 6 is (d, δ, dδ)-locally decodable for all δ.

Proof. Let δ be fixed. Let x ∈ F_{q^t}^k be a message and let z ∈ (F_{q^t})^{Z_m^n} with Δ(C(x), z) ≤ δ. First, we show that the decoder (Algorithm 6) is correct. To this end, let i ∈ [k], and let w ∈ Z_m^n be chosen uniformly at random. We have to show that if {r_1, . . . , r_d} are obtained by querying uncorrupted locations, then

(Σ_{ℓ=1}^d c_ℓ r_ℓ) · (Q(1) g^{⟨u_i, w⟩})^{−1} = x(i).  (3.3)

If all queries go to clean entries, then for ℓ ∈ [d] we have r_ℓ = z(w + s_ℓ v_i) = F_x(w + s_ℓ v_i), where F_x(a) = Σ_{j=1}^k x(j) g^{⟨u_j, a⟩}, as seen in Algorithm 5. Now, with Lemma 3.2.1 we see

Σ_{ℓ=1}^d c_ℓ r_ℓ = Σ_{ℓ=1}^d c_ℓ F_x(w + s_ℓ v_i) = Σ_{ℓ=1}^d c_ℓ Σ_{j=1}^k x(j) g^{⟨u_j, w + s_ℓ v_i⟩} = Σ_{j=1}^k x(j) Σ_{ℓ=1}^d c_ℓ g^{⟨u_j, w + s_ℓ v_i⟩} = x(i) · Q(1) g^{⟨u_i, w⟩},

where the last equality is Lemma 3.2.1. As Q(1) g^{⟨u_i, w⟩} ≠ 0, we can multiply both sides by its inverse, which yields (3.3). Thus, if {r_1, . . . , r_d} are all uncorrupted, Algorithm 6 outputs the correct symbol x(i).

We now show that the probability that any of the queries goes to a corrupted location is at most dδ. For each ℓ ∈ [d], let E_ℓ be the event that z(w + s_ℓ v_i) is corrupted. Since w is chosen uniformly at random, each w + s_ℓ v_i individually is a uniformly random element of Z_m^n. As Δ(C(x), z) ≤ δ, for each ℓ ∈ [d] we have P(E_ℓ) ≤ δ. The event that any of the d queries is corrupted is the union of all events E_ℓ. Thus, applying the union bound (Lemma 2.1.4), we see P(⋃_{ℓ=1}^d E_ℓ) ≤ Σ_{ℓ=1}^d P(E_ℓ) ≤ dδ. The event that all queries are uncorrupted is the complement of ⋃_{ℓ=1}^d E_ℓ, hence

P(Algorithm 6 succeeds) ≥ 1 − P(⋃_{ℓ=1}^d E_ℓ) ≥ 1 − dδ.

Thus indeed, matching vector codes are (d, δ, dδ)-locally decodable for all δ.
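The uncorrupted case of Theorem 3.2.2 can be checked end-to-end. The sketch below uses toy parameters of our own choosing with t = 1: m = 6, q = 7, g = 3, the S-matching family u_1 = (1, 0), v_1 = (0, 1), u_2 = (0, 1), v_2 = (1, 0) with S = {1}, and the (S, 2)-decoding polynomial Q(X) = 5 + 3X over F_7 (so Q(1) = 1 ≠ 0 and Q(g) = 0), giving a decoder with d = 2 queries.

```python
from itertools import product

m, q, g = 6, 7, 3
U = [(1, 0), (0, 1)]; V = [(0, 1), (1, 0)]           # S-matching family, S = {1}
coeffs, supp = [5, 3], [0, 1]                         # Q(X) = 5 + 3X in F_7[X]
assert (5 + 3 * 1) % q != 0 and (5 + 3 * g) % q == 0  # Q(1) != 0, Q(g^s) = 0

def ip(a, b): return sum(x * y for x, y in zip(a, b)) % m

def encode(x):                                        # Algorithm 5 (t = 1)
    return {a: sum(xi * pow(g, ip(u, a), q) for xi, u in zip(x, U)) % q
            for a in product(range(m), repeat=2)}

def decode(z, i, w):                                  # Algorithm 6, queries w + s_l*v_i
    r = [z[tuple((wj + s * vj) % m for wj, vj in zip(w, V[i]))] for s in supp]
    total = sum(c * rl for c, rl in zip(coeffs, r)) % q
    Q1 = sum(coeffs) % q                              # Q(1)
    inv = pow(Q1 * pow(g, ip(U[i], w), q), q - 2, q)  # Fermat inverse in F_7
    return (total * inv) % q

x = [2, 5]
z = encode(x)
# With no corruptions, every choice of w recovers x(i) exactly.
assert all(decode(z, i, w) == x[i]
           for i in range(2) for w in product(range(m), repeat=2))
```

This mirrors the proof: with all entries clean, the decoder's success probability is 1, exactly as in the correctness part of Theorem 3.2.2.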

We see that the query complexity equals the sparsity of $Q$: MV codes of lower query complexity are obtained from sparser decoding polynomials. Furthermore, the messages that can be encoded have length $k$, where $k$ is the size of the matching vector family. Thus, larger MV families allow the encoding of longer messages.
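Algorithms 5 and 6 can also be run end-to-end in a toy setting. The sketch below is my own illustration, not code from the thesis: it uses the hypothetical $S$-matching vector family $u_i = e_i$, $v_i = \mathbf{1} - e_i$ in $\mathbb{Z}_6^4$ with $S = \{1\}$, the order-$6$ element $g = 3$ in $\mathbb{F}_7$ (so $m = 6$, $q = 7$, $t = 1$), and the $2$-sparse decoding polynomial $Q(X) = (1 - g)^{-1}(X - g)$, giving a $d = 2$-query decoder.

```python
import itertools
import random

q, m, g = 7, 6, 3                      # toy parameters: m | q - 1; g has order m in F_q
n = k = 4
U = [[int(j == i) for j in range(n)] for i in range(k)]   # u_i = e_i
V = [[int(j != i) for j in range(n)] for i in range(k)]   # v_i = 1 - e_i, S = {1}

inv = lambda a: pow(a, q - 2, q)                          # inverse in F_q
ip = lambda u, v: sum(a * b for a, b in zip(u, v)) % m    # inner product in Z_m

exps = [0, 1]                                             # Q(X) = c_1 + c_2 X
coeffs = [(-g * inv(1 - g)) % q, inv(1 - g)]

def encode(x):
    # Algorithm 5: codeword indexed by all a in Z_m^n, C(x)(a) = sum_j x(j) g^{<u_j, a>}
    return {a: sum(xj * pow(g, ip(u, a), q) for xj, u in zip(x, U)) % q
            for a in itertools.product(range(m), repeat=n)}

def decode(z, i, rng):
    # Algorithm 6: pick random w, query z at w + s_l v_i, combine using Q
    w = [rng.randrange(m) for _ in range(n)]
    r = [z[tuple((wc + s * vc) % m for wc, vc in zip(w, V[i]))] for s in exps]
    Q1 = sum(coeffs) % q               # Q(1), equal to 1 for this choice of Q
    return (sum(c * rl for c, rl in zip(coeffs, r))
            * inv(Q1 * pow(g, ip(U[i], w), q))) % q

rng = random.Random(1)
x = [2, 5, 0, 3]                       # message in F_7^4
z = encode(x)                          # 6^4 = 1296 codeword entries
assert all(decode(z, i, rng) == x[i] for i in range(k))
```

With at most a $\delta$-fraction of the $6^4$ entries of $z$ corrupted, the union bound argument above guarantees that each call to `decode` still succeeds with probability at least $1 - 2\delta$.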

3.3. Composites

Matching vector codes for composite numbers m constitute the state of the art of locally decodable codes. As stated earlier, when we use the Grolmusz family as our matching vector family (for composite m), we obtain the currently best-known MV code [10].


In this section, we restrict m to composite numbers and we show that in that case a useful sparse decoding polynomial Q exists (based on [10]). Throughout this section, we assume the following.

Assumption 3. Let $m$ be a composite number, let $q$ be a prime and let $t$ be a positive integer such that $m \mid q^t - 1$. Let $g \in \mathbb{F}_{q^t}$ be of order $m$ and let $U = \{u_1, \ldots, u_k\},\ V = \{v_1, \ldots, v_k\} \subseteq \mathbb{Z}_m^n$ be an $S$-matching vector family of size $k$.

The coefficients of the decoding polynomial $Q$ will be elements of $\mathbb{F}_{q^t}$ instead of $\mathbb{F}_q$. However, the decoder (Algorithm 6) remains the same and still returns an element of $\mathbb{F}_{q^t}$. We will call such a $Q \in \mathbb{F}_{q^t}[X]$ (of sparsity $d$ with $Q(1) \neq 0$ and $Q(g^s) = 0$ for all $s \in S$) an $(S, d)$-decoding polynomial for composites.

Lemma 3.3.1. Assume the statements in Assumption 3. Then, there exists an $(S, d)$-decoding polynomial for composites $Q \in \mathbb{F}_{q^t}[X]$ with $d \le |S| + 1$ and with $Q(1) = 1$.

Proof. Define $h(X) := \prod_{s \in S} (X - g^s) \in \mathbb{F}_{q^t}[X]$. By construction, $h(g^s) = 0$ for all $s \in S$. Also, as $\deg(h) = |S|$, $h$ has at most $|S|$ distinct roots in $\mathbb{F}_{q^t}$ [19, Theorem 1.66]. This means that the roots of $h$ in $\mathbb{F}_{q^t}$ are precisely the elements $g^s \in \mathbb{F}_{q^t}$, $s \in S$. Since $0 \notin S$, the element $1 = g^0$ is not among these roots, so $h(1) \neq 0$, meaning $h(1)$ is invertible. Now define $Q(X) := h(1)^{-1} h(X)$. Then, $Q(1) = 1$ and $Q(g^s) = 0$ for all $s \in S$. Furthermore, $Q$ has sparsity at most $|S| + 1$, because $\deg(Q) = |S|$.

We argue below why the following corollary is only useful for composite numbers m, despite the fact that it also holds in the prime setting.

Corollary 3.3.2. Assume the statements in Assumption 3 and let $Q \in \mathbb{F}_{q^t}[X]$ be as in Lemma 3.3.1. Then, the corresponding matching vector code $C : \mathbb{F}_{q^t}^k \to (\mathbb{F}_{q^t})^{\mathbb{Z}_m^n}$, defined by Algorithms 5 and 6, is $(d, \delta, d\delta)$-locally decodable with $d \le |S| + 1$ for all $\delta$.

Proof. This is an immediate consequence of Lemma 3.3.1 and Theorem 3.2.2.

We already argued that it is desirable to work with a large $S$-matching vector family, to encode longer messages. This corollary also shows that a small set $S$ is desirable, for a lower query complexity. The reason that Lemma 3.3.1 is useful for composite numbers is that for composites there are large MV families known for which the set $S$ is small. For example, if $m$ is a product of $r$ primes, then for any $n$ (there exists a constant $c > 0$ such that) the Grolmusz family gives an $S$-matching vector family in $\mathbb{Z}_m^n$ of size at least $\exp\big(c (\log n)^r / (\log\log n)^{r-1}\big)$ (superpolynomial in $n$), with $|S| = 2^r - 1$ [15]. Thus, if $m$ is a product of two primes, the $(S, 4)$-decoding polynomial from Lemma 3.3.1 gives us a 4-query LDC that encodes messages of superpolynomial length (in $n$). Efremenko found a 3-sparse decoding polynomial through an exhaustive search, which allowed him to construct a 3-query LDC with subexponential codeword length, using the Grolmusz family [10].

For prime numbers $m$, the set $S$ inherits structure from working over the finite field $\mathbb{F}_m$. When we use this structure, we can create much sparser decoding polynomials (over $\mathbb{F}_q$), inducing MV codes with lower query complexity, which is why Lemma 3.3.1 is less useful in that case. We make this more explicit in the following section.
