# DNSCrypt Setup with PDNSD

Here are notes for setting up DNSCrypt on Arch Linux, using pdnsd as a DNS cache, assuming the use of NetworkManager. I needed it one day since the network I was using blocked traffic to external DNS servers (parental controls), and the DNS server provided had an outdated entry for hmmt.co. (My dad then pointed out to me I could have just hard-coded the necessary IP address in /etc/hosts, oops.)

For the whole process, useful commands to test with are:

• nslookup hmmt.co will tell you the IP used and the server from which it came.
• dig http://www.hmmt.co gives much more detailed information to this effect. (From bind-tools.)
• dig @127.0.0.1 http://www.hmmt.co lets you query a specific DNS server (in this case 127.0.0.1).
• drill @127.0.0.1 http://www.hmmt.co behaves similarly.

First, pacman -S pdnsd dnscrypt-proxy (with sudo ostensibly, but I’ll leave that out here and henceforth).

Run systemctl edit dnscrypt-proxy.socket and fill in override.conf with

[Socket]
ListenStream=
ListenDatagram=
ListenStream=127.0.0.1:40
ListenDatagram=127.0.0.1:40


Optionally, one can also specify which server which DNS serve to use with systemctl edit dnscrypt-proxy.service. For example for cs-uswest I write

[Service]
ExecStart=
ExecStart=/usr/bin/dnscrypt-proxy \
-R cs-uswest


The empty ExecStart= is necessary, since otherwise systemctl will complain about multiple ExecStart commands.

This thus configures dnscrypt-proxy to listen on 127.0.0.1, port 40.

Now we configure pdnsd to listen on port 53 (default) for cache, and relay cache misses to dnscrypt-proxy. This is accomplished by using the following for /etc/pdnsd.conf:

global {
perm_cache = 1024;
cache_dir = "/var/cache/pdnsd";
run_as = "pdnsd";
server_ip = 127.0.0.1;
status_ctl = on;
query_method = udp_tcp;
min_ttl = 15m;       # Retain cached entries at least 15 minutes.
max_ttl = 1w;        # One week.
timeout = 10;        # Global timeout option (10 seconds).
neg_domain_pol = on;
udpbufsize = 1024;   # Upper limit on the size of UDP messages.
}

server {
label = "dnscrypt-proxy";
ip = 127.0.0.1;
port = 40;
timeout = 4;
proxy_only = on;
}

source {
owner = localhost;
file = "/etc/hosts";
}


Now it remains to change the DNS server from whatever default is used into 127.0.0.1. For NetworkManager users, it is necessary to edit /etc/NetworkManager/NetworkManager.conf to prevent it from overriding this file:

[main]
...
dns=none


This will cause resolv.conf to be written as an empty file by NetworkManager: in this case, the default 127.0.0.1 is used as the nameserver, which is what we want.

Needless to say, one finishes with

systemctl enable dnscrypt-proxy
systemctl start dnscrypt-proxy
systemctl enable pdnsd
systemctl start pdnsd


# A Sketchy Overview of Green-Tao

These are the notes of my last lecture in the 18.099 discrete analysis seminar. It is a very high-level overview of the Green-Tao theorem. It is a subset of this paper.

## 1. Synopsis

This post as in overview of the proof of:

Theorem 1 (Green-Tao)

The prime numbers contain arbitrarily long arithmetic progressions.

Here, Szemerédi’s theorem isn’t strong enough, because the primes have density approaching zero. Instead, one can instead try to prove the following “relativity” result.

Theorem (Relative Szemerédi)

Let ${S}$ be a sparse “pseudorandom” set of integers. Then subsets of ${A}$ with positive density in ${S}$ have arbitrarily long arithmetic progressions.

In order to do this, we have to accomplish the following.

• Make precise the notion of “pseudorandom”.
• Prove the Relative Szemerédi theorem, and then
• Exhibit a “pseudorandom” set ${S}$ which subsumes the prime numbers.

This post will use the graph-theoretic approach to Szemerédi as in the exposition of David Conlon, Jacob Fox, and Yufei Zhao. In order to motivate the notion of pseudorandom, we return to the graph-theoretic approach of Roth’s theorem, i.e. the case ${k=3}$ of Szemerédi’s theorem.

## 2. Defining the linear forms condition

### 2.1. Review of Roth theorem

Roth’s theorem can be phrased in two ways. The first is the “set-theoretic” formulation:

Theorem 2 (Roth, set version)

If ${A \subseteq \mathbb Z/N}$ is 3-AP-free, then ${|A| = o(N)}$.

The second is a “weighted” version

Theorem 3 (Roth, weighted version)

Fix ${\delta > 0}$. Let ${f : \mathbb Z/N \rightarrow [0,1]}$ with ${\mathbf E f \ge \delta}$. Then

$\displaystyle \Lambda_3(f,f,f) \ge \Omega_\delta(1).$

We sketch the idea of a graph-theoretic proof of the first theorem. We construct a tripartite graph ${G_A}$ on vertices ${X \sqcup Y \sqcup Z}$, where ${X = Y = Z = \mathbb Z/N}$. Then one creates the edges

• ${(x,y)}$ if ${2x+ y \in A}$,
• ${(x,z)}$ if ${x-z \in A}$, and
• ${(y,z)}$ if ${-y-2z \in A}$.

This construction is selected so that arithmetic progressions in ${A}$ correspond to triangles in the graph ${G_A}$. As a result, if ${A}$ has no 3-AP’s (except trivial ones, where ${x+y+z=0}$), the graph ${G_A}$ has exactly one triangle for every edge. Then, we can use the theorem of Ruzsa-Szemerédi, which states that this graph ${G_A}$ has ${o(n^2)}$ edges.

### 2.2. The measure ${\nu}$

Now for the generalized version, we start with the second version of Roth’s theorem. Instead of a set ${S}$, we consider a function

$\displaystyle \nu : \mathbb Z/N \rightarrow \mathbb R_{\ge 0}$

which we call a majorizing measure. Since we are now dealing with ${A}$ of low density, we normalize ${\nu}$ so that

$\displaystyle \mathbf E[\nu] = 1 + o(1).$

Our goal is to now show a result of the form:

Theorem (Relative Roth, informally, weighted version)

If ${0 \le f \le \nu}$, ${\mathbf E f \ge \delta}$, and ${\nu}$ satisfies a “pseudorandom” condition, then ${\Lambda_3(f,f,f) \ge \Omega_{\delta}(1)}$.

The prototypical example of course is that if ${A \subset S \subset \mathbb Z_N}$, then we let ${\nu(x) = \frac{N}{|S|} 1_S(x)}$.

### 2.3. Pseudorandomness for ${k=3}$

So, how can we put the pseudorandom condition? Initially, consider ${G_S}$ the tripartite graph defined earlier, and let ${p = |S| / N}$; since ${S}$ is sparse we expect ${p}$ small. The main idea that turns out to be correct is: The number of embeddings of ${K_{2,2,2}}$ in ${S}$ is “as expected”, namely ${(1+o(1)) p^{12} N^6}$. Here ${K_{2,2,2}}$ is actually the ${2}$-blow-up of a triangle. This condition thus gives us control over the distribution of triangles in the sparse graph ${G_S}$: knowing that we have approximately the correct count for ${K_{2,2,2}}$ is enough to control distribution of triangles.

For technical reasons, in fact we want this to be true not only for ${K_{2,2,2}}$ but all of its subgraphs ${H}$.

Now, let’s move on to the weighted version. Let’s consider a tripartite graph ${G}$, which we can think of as a collection of three functions

\displaystyle \begin{aligned} \mu_{-z} &: X \times Y \rightarrow \mathbb R \\ \mu_{-y} &: X \times Z \rightarrow \mathbb R \\ \mu_{-x} &: Y \times Z \rightarrow \mathbb R. \end{aligned}

We think of ${\mu}$ as normalized so that ${\mathbf E[\mu_{-x}] = \mathbf E[\mu_{-y}] = \mathbf E[\mu_{-z}] = 1}$. Then we can define

Definition 4

A weighted tripartite graph ${\mu = (\mu_{-x}, \mu_{-y}, \mu_{-z})}$ satisfies the ${3}$-linear forms condition if

\displaystyle \begin{aligned} \mathbf E_{x^0,x^1,y^0,y^1,z^0,z^1} &\Big[ \mu_{-x}(y^0,z^0) \mu_{-x}(y^0,z^1) \mu_{-x}(y^1,z^0) \mu_{-x}(y^1,z^1) \\ & \mu_{-y}(x^0,z^0) \mu_{-y}(x^0,z^1) \mu_{-y}(x^1,z^0) \mu_{-y}(x^1,z^1) \\ & \mu_{-z}(x^0,y^0) \mu_{-z}(x^0,y^1) \mu_{-z}(x^1,y^0) \mu_{-z}(x^1,y^1) \Big] \\ &= 1 + o(1) \end{aligned}

and similarly if any of the twelve factors are deleted.

Then the pseudorandomness condition is according to the graph we defined above:

Definition 5

A function ${\nu : \mathbb Z / N \rightarrow \mathbb Z}$ is satisfies the ${3}$-linear forms condition if ${\mathbf E[\nu] = 1 + o(1)}$, and the tripartite graph ${\mu = (\mu_{-x}, \mu_{-y}, \mu_{-z})}$ defined by

\displaystyle \begin{aligned} \mu_{-z} &= \nu(2x+y) \\ \mu_{-y} &= \nu(x-z) \\ \mu_{-x} &= \nu(-y-2z) \end{aligned}

satisfies the ${3}$-linear forms condition.

Finally, the relative version of Roth’s theorem which we seek is:

Theorem 6 (Relative Roth)

Suppose ${\nu : \mathbb Z/N \rightarrow \mathbb R_{\ge 0}}$ satisfies the ${3}$-linear forms condition. Then for any ${f : \mathbb Z/N \rightarrow \mathbb R_{\ge 0}}$ bounded above by ${\nu}$ and satisfying ${\mathbf E[f] \ge \delta > 0}$, we have

$\displaystyle \Lambda_3(f,f,f) \ge \Omega_{\delta}(1).$

### 2.4. Relative Szemerédi

We of course have:

Theorem 7 (Szemerédi)

Suppose ${k \ge 3}$, and ${f : \mathbb Z/n \rightarrow [0,1]}$ with ${\mathbf E[f] \ge \delta}$. Then

$\displaystyle \Lambda_k(f, \dots, f) \ge \Omega_{\delta}(1).$

For ${k > 3}$, rather than considering weighted tripartite graphs, we consider a ${(k-1)}$-uniform ${k}$-partite hypergraph. For example, given ${\nu}$ with ${\mathbf E[\nu] = 1 + o(1)}$ and ${k=4}$, we use the construction

\displaystyle \begin{aligned} \mu_{-z}(w,x,y) &= \nu(3w+2x+y) \\ \mu_{-y}(w,x,z) &= \nu(2w+x-z) \\ \mu_{-x}(w,y,z) &= \nu(w-y-2z) \\ \mu_{-w}(x,y,z) &= \nu(-x-2y-3z). \end{aligned}

Thus 4-AP’s correspond to the simplex ${K_4^{(3)}}$ (i.e. a tetrahedron). We then consider the two-blow-up of the simplex, and require the same uniformity on subgraphs of ${H}$.

Here is the compiled version:

Definition 8

A ${(k-1)}$-uniform ${k}$-partite weighted hypergraph ${\mu = (\mu_{-i})_{i=1}^k}$ satisfies the ${k}$-linear forms condition if

$\displaystyle \mathbf E_{x_1^0, x_1^1, \dots, x_k^0, x_k^1} \left[ \prod_{j=1}^k \prod_{\omega \in \{0,1\}^{[k] \setminus \{j\}}} \mu_{-j}\left( x_1^{\omega_1}, \dots, x_{j-1}^{\omega_{j-1}}, x_{j+1}^{\omega_{j+1}}, \dots, x_k^{\omega_k} \right)^{n_{j,\omega}} \right] = 1 + o(1)$

for all exponents ${n_{j,w} \in \{0,1\}}$.

Definition 9

A function ${\nu : \mathbb Z/N \rightarrow \mathbb R_{\ge 0}}$ satisfies the ${k}$-linear forms condition if ${\mathbf E[\nu] = 1 + o(1)}$, and

$\displaystyle \mathbf E_{x_1^0, x_1^1, \dots, x_k^0, x_k^1} \left[ \prod_{j=1}^k \prod_{\omega \in \{0,1\}^{[k] \setminus \{j\}}} \nu\left( \sum_{i=1}^k (j-i)x_i^{(\omega_i)} \right)^{n_{j,\omega}} \right] = 1 + o(1)$

for all exponents ${n_{j,w} \in \{0,1\}}$. This is just the previous condition with the natural ${\mu}$ induced by ${\nu}$.

The natural generalization of relative Szemerédi is then:

Theorem 10 (Relative Szemerédi)

Suppose ${k \ge 3}$, and ${\nu : \mathbb Z/n \rightarrow \mathbb R_{\ge 0}}$ satisfies the ${k}$-linear forms condition. Let ${f : \mathbb Z/N to \mathbb R_{\ge 0}}$ with ${\mathbf E[f] \ge \delta}$, ${f \le \nu}$. Then

$\displaystyle \Lambda_k(f, \dots, f) \ge \Omega_{\delta}(1).$

## 3. Outline of proof of Relative Szemerédi

The proof of Relative Szeremédi uses two key facts. First, one replaces ${f}$ with a bounded ${\widetilde f}$ which is near it:

Theorem 11 (Dense model)

Let ${\varepsilon > 0}$. There exists ${\varepsilon' > 0}$ such that if:

• ${\nu : \mathbb Z/N \rightarrow \mathbb R_{\ge 0}}$ satisfies ${\left\lVert \nu-1 \right\rVert^{\square}_r \le \varepsilon'}$, and
• ${f : \mathbb Z/N \rightarrow \mathbb R_{\ge 0}}$, ${f \le \nu}$

then there exists a function ${\widetilde f : \mathbb Z/N \rightarrow [0,1]}$ such that ${\left\lVert f - \widetilde f \right\rVert^{\square}_r \le \varepsilon}$.

Here we have a new norm, called the cut norm, defined by

$\displaystyle \left\lVert f \right\rVert^{\square}_r = \sup_{A_i \subseteq (\mathbb Z/N)^{r-1}} \left\lvert \mathbf E_{x_1, \dots, x_r} f(x_1 + \dots + x_r) 1_{A_1}(x_{-1}) \dots 1_{A_r}(x_{-r}) \right\rvert.$

This is actually an extension of the cut norm defined on a ${r}$-uniform ${r}$-partite hypergraph (not ${(r-1)}$-uniform like before!): if ${g : X_1 \times \dots \times X_r \rightarrow \mathbb R}$ is such a graph, we let

$\displaystyle \left\lVert g \right\rVert^{\square}_{r,r} = \sup_{A_i \subseteq X_{-i}} \left\lvert g(x_1, \dots, x_r) 1_{A_1}(x_{-1}) \dots 1_{A_r}(x_{-r}) \right\rvert.$

Taking ${g(x_1, \dots, x_r) = f(x_1 + \dots + x_r)}$, ${X_1 = \dots = X_r = \mathbb Z/N}$ gives the analogy.

For the second theorem, we define the norm

$\displaystyle \left\lVert g \right\rVert^{\square}_{k-1,k} = \max_{i=1,\dots,k} \left( \left\lVert g_{-i} \right\rVert^{\square}_{k-1, k-1} \right).$

Theorem 12 (Relative simplex counting lemma)

Let ${\mu}$, ${g}$, ${\widetilde g}$ be weighted ${(k-1)}$-uniform ${k}$-partite weighted hypergraphs on ${X_1 \cup \dots \cup X_k}$. Assume that ${\mu}$ satisfies the ${k}$-linear forms condition, and ${0 \le g_{-i} \le \mu_{-i}}$ for all ${i}$, ${0 \le \widetilde g \le 1}$. If ${\left\lVert g-\widetilde g \right\rVert^{\square}_{k-1,k} = o(1)}$ then

$\displaystyle \mathbf E_{x_1, \dots, x_k} \left[ g(x_{-1}) \dots g(x_{-k}) - \widetilde g(x_{-1}) \dots \widetilde g(x_{-k}) \right] = o(1).$

One then combines these two results to prove Szemerédi, as follows. Start with ${f}$ and ${\nu}$ in the theorem. The ${k}$-linear forms condition turns out to imply ${\left\lVert \nu-1 \right\rVert^{\square}_{k-1} = o(1)}$. So we can find a nearby ${\widetilde f}$ by the dense model theorem. Then, we induce ${\nu}$, ${g}$, ${\widetilde g}$ from ${\mu}$, ${f}$, ${\widetilde f}$ respectively. The counting lemma then reduce the bounding of ${\Lambda_k(f, \dots, f)}$ to the bounding of ${\Lambda_k(\widetilde f, \dots, \widetilde f)}$, which is ${\Omega_\delta(1)}$ by the usual Szemerédi theorem.

## 4. Arithmetic progressions in primes

We now sketch how to obtain Green-Tao from Relative Szemerédi. As expected, we need to us the von Mangoldt function ${\Lambda}$.

Unfortunately, ${\Lambda}$ is biased (e.g. “all decent primes are odd”). To get around this, we let ${w = w(N)}$ tend to infinity slowly with ${N}$, and define

$\displaystyle W = \prod_{p \le w} p.$

In the ${W}$-trick we consider only primes ${1 \pmod W}$. The modified von Mangoldt function then is defined by

$\displaystyle \widetilde \Lambda(n) = \begin{cases} \frac{\varphi(W)}{W} \log (Wn+1) & Wn+1 \text{ prime} \\ 0 & \text{else}. \end{cases}$

In accordance with Dirichlet, we have ${\sum_{n \le N} \widetilde \Lambda(n) = N + o(N)}$.

So, we need to show now that

Proposition 13

Fix ${k \ge 3}$. We can find ${\delta = \delta(k) > 0}$ such that for ${N \gg 1}$ prime, we can find ${\nu : \mathbb Z/N \rightarrow \mathbb R_{\ge 0}}$ which satisfies the ${k}$-linear forms condition as well as

$\displaystyle \nu(n) \ge \delta \widetilde \Lambda(n)$

for ${N/2 \le n < N}$.

In that case, we can let

$\displaystyle f(n) = \begin{cases} \delta \widetilde\Lambda(n) & N/2 \le n < N \\ 0 & \text{else}. \end{cases}$

Then ${0 \le f \le \nu}$. The presence of ${N/2 \le n < N}$ allows us to avoid “wrap-around issues” that arise from using ${\mathbb Z/N}$ instead of ${\mathbb Z}$. Relative Szemerédi then yields the result.

For completeness, we state the construction. Let ${\chi : \mathbb R \rightarrow [0,1]}$ be supported on ${[-1,1]}$ with ${\chi(0) = 1}$, and define a normalizing constant ${c_\chi = \int_0^\infty \left\lvert \chi'(x) \right\rvert^2 \; dx}$. Inspired by ${\Lambda(n) = \sum_{d \mid n} \mu(d) \log(n/d)}$, we define a truncated ${\Lambda}$ by

$\displaystyle \Lambda_{\chi, R}(n) = \log R \sum_{d \mid n} \mu(d) \chi\left( \frac{\log d}{\log R} \right).$

Let ${k \ge 3}$, ${R = N^{k^{-1} 2^{-k-3}}}$. Now, we define ${\nu}$ by

$\displaystyle \nu(n) = \begin{cases} \dfrac{\varphi(W)}{W} \dfrac{\Lambda_{\chi,R}(Wn+1)^2}{c_\chi \log R} & N/2 \le n < N \\ 0 & \text{else}. \end{cases}$

This turns out to work, provided ${w}$ grows sufficiently slowly in ${N}$.

# Formal vs Functional Series (OR: Generating Function Voodoo Magic)

Epistemic status: highly dubious. I found almost no literature doing anything quite like what follows, which unsettles me because it makes it likely that I’m overcomplicating things significantly.

## 1. Synopsis

Recently I was working on an elegant problem which was the original problem 6 for the 2015 International Math Olympiad, which reads as follows:

Problem

[IMO Shortlist 2015 Problem C6] Let ${S}$ be a nonempty set of positive integers. We say that a positive integer ${n}$ is clean if it has a unique representation as a sum of an odd number of distinct elements from ${S}$. Prove that there exist infinitely many positive integers that are not clean.

Proceeding by contradiction, one can prove (try it!) that in fact all sufficiently large integers have exactly one representation as a sum of an even subset of ${S}$. Then, the problem reduces to the following:

Problem

Show that if ${s_1 < s_2 < \dots}$ is an increasing sequence of positive integers and ${P(x)}$ is a nonzero polynomial then we cannot have

$\displaystyle \prod_{j=1}^\infty (1 - x^{s_j}) = P(x)$

as formal series.

To see this, note that all sufficiently large ${x^N}$ have coefficient ${1 + (-1) = 0}$. Now, the intuitive idea is obvious: the root ${1}$ appears with finite multiplicity in ${P}$ so we can put ${P(x) = (1-x)^k Q(x)}$ where ${Q(1) \neq 0}$, and then we get that ${1-x}$ on the RHS divides ${P}$ too many times, right?

Well, there are some obvious issues with this “proof”: for example, consider the equality

$\displaystyle 1 = (1-x)(1+x)(1+x^2)(1+x^4)(1+x^8) \dots.$

The right-hand side is “divisible” by ${1-x}$, but the left-hand side is not (as a polynomial).

But we still want to use the idea of plugging ${x \rightarrow 1^-}$, so what is the right thing to do? It turns out that this is a complete minefield, and there are a lot of very subtle distinctions that seem to not be explicitly mentioned in many places. I think I have a complete answer now, but it’s long enough to warrant this entire blog post.

Here’s the short version: there’s actually two distinct notions of “generating function”, namely a “formal series” and “functional series”. They use exactly the same notation but are two different types of objects, and this ends up being the source of lots of errors, because “formal series” do not allow substituting ${x}$, while “functional series” do.

Spoiler: we’ll need the asymptotic for the partition function ${p(n)}$.

## 2. Formal Series ${\neq}$ Functional Series

I’m assuming you’ve all heard the definition of ${\sum_k c_kx^k}$. It turns out unfortunately that this isn’t everything: there are actually two types of objects at play here. They are usually called formal power series and power series, but for this post I will use the more descriptive names formal series and functional series. I’ll do everything over ${\mathbb C}$, but one can of course use ${\mathbb R}$ instead.

The formal series is easier to describe:

Definition 1

A formal series ${F}$ is an infinite sequence ${(a_n)_n = (a_0, a_1, a_2, \dots)}$ of complex numbers. We often denote it by ${\sum a_nx^n = a_0 + a_1x + a_2x^2 + \dots}$. The set of formal series is denoted ${\mathbb C[ [x] ]}$.

This is the “algebraic” viewpoint: it’s a sequence of coefficients. Note that there is no worry about convergence issues or “plugging in ${x}$”.

On the other hand, a functional series is more involved, because it has to support substitution of values of ${x}$ and worry about convergence issues. So here are the necessary pieces of data:

Definition 2

A functional series ${G}$ (centered at zero) is a function ${G : U \rightarrow \mathbb C}$, where ${U}$ is an open disk centered at ${0}$ or ${U = \mathbb C}$. We require that there exists an infinite sequence ${(c_0, c_1, c_2, \dots)}$ of complex numbers satisfying

$\displaystyle \forall z \in U: \qquad G(z) = \lim_{N \rightarrow \infty} \left( \sum_{k=0}^N c_k z^k \right).$

(The limit is take in the usual metric of ${\mathbb C}$.) In that case, the ${c_i}$ are unique and called the coefficients of ${G}$.

This is often written as ${G(x) = \sum_n c_n x^n}$, with the open set ${U}$ suppressed.

Remark 3

Some remarks on the definition of functional series:

• This is enough to imply that ${G}$ is holomorphic (and thus analytic) on ${U}$.
• For experts: note that I’m including the domain ${U}$ as part of the data required to specify ${G}$, which makes the presentation cleaner. Most sources do something with “radius of convergence”; I will blissfully ignore this, leaving this data implicitly captured by ${U}$.
• For experts: Perhaps non-standard, ${U \neq \{0\}}$. Otherwise I can’t take derivatives, etc.

Thus formal and functional series, despite having the same notation, have different types: a formal series ${F}$ is a sequence, while a functional series ${G}$ is a function that happens to be expressible as an infinite sum within its domain.

Of course, from every functional series ${G}$ we can extract its coefficients and make them into a formal series ${F}$. So, for lack of better notation:

Definition 4

If ${F = (a_n)_n}$ is a formal series, and ${G : U \rightarrow \mathbb C}$ is a functional series whose coefficients equal ${F}$, then we write ${F \simeq G}$.

## 3. Finite operations

Now that we have formal and functional series, we can define sums. Since these are different types of objects, we will have to run definitions in parallel and then ideally check that they respect ${\simeq}$.

For formal series:

Definition 5

Let ${F_1 = (a_n)_n}$ and ${F_2 = (b_n)_n}$ be formal series. Then we set

\displaystyle \begin{aligned} (a_n)_n \pm (b_n)_n &= (a_n \pm b_n)_n \\ (a_n)_n \cdot (b_n)_n &= \left( \textstyle\sum_{j=0}^n a_jb_{n-j} \right)_n. \end{aligned}

This makes ${\mathbb C[ [x] ]}$ into a ring, with identity ${(0,0,0,\dots)}$ and ${(1,0,0,\dots)}$.

We also define the derivative ${F = (a_n)_n}$ by ${F' = ((n+1)a_{n+1})_n}$.

It’s probably more intuitive to write these definitions as

\displaystyle \begin{aligned} \sum_n a_n x^n \pm \sum_n b_n x^n &= \sum_n (a_n \pm b_n) x^n \\ \left( \sum_n a_n x^n \right) \left( \sum_n b_n x^n \right) &= \sum_n \left( \sum_{j=0}^n a_jb_{n-j} \right) x^n \\ \left( \sum_n a_n x^n \right)' &= \sum_n na_n x^{n-1} \end{aligned}

and in what follows I’ll start to use ${\sum_n a_nx^n}$ more. But officially, all definitions for formal series are in terms of the coefficients alone; these presence of ${x}$ serves as motivation only.

Exercise 6

Show that if ${F = \sum_n a_nx^n}$ is a formal series, then it has a multiplicative inverse if and only if ${a_0 \neq 0}$.

On the other hand, with functional series, the above operations are even simpler:

Definition 7

Let ${G_1 : U \rightarrow \mathbb C}$ and ${G_2 : U \rightarrow \mathbb C}$ be functional series with the same domain ${U}$. Then ${G_1 \pm G_2}$ and ${G_1 \cdot G_2}$ are defined pointwise.

If ${G : U \rightarrow \mathbb C}$ is a functional series (hence holomorphic), then ${G'}$ is defined poinwise.

If ${G}$ is nonvanishing on ${U}$, then ${1/G : U \rightarrow \mathbb C}$ is defined pointwise (and otherwise is not defined).

Now, for these finite operations, everything works as you expect:

Theorem 8 (Compatibility of finite operations)

Suppose ${F}$, ${F_1}$, ${F_2}$ are formal series, and ${G}$, ${G_1}$, ${G_2}$ are functional series ${U \rightarrow \mathbb C}$. Assume ${F \simeq G}$, ${F_1 \simeq G_1}$, ${F_2 \simeq G_2}$.

• ${F_1 \pm F_2 \simeq G_1 \pm G_2}$, ${F_1 \cdot F_2 = G_1 \cdot G_2}$.
• ${F' \simeq G'}$.
• If ${1/G}$ is defined, then ${1/F}$ is defined and ${1/F \simeq 1/G}$.

So far so good: as long as we’re doing finite operations. But once we step beyond that, things begin to go haywire.

## 4. Limits

We need to start considering limits of ${(F_k)_k}$ and ${(G_k)_k}$, since we are trying to make progress towards infinite sums and products. Once we do this, things start to burn.

Definition 9

Let ${F_1 = \sum_n a_n x^n}$ and ${F_2 = \sum_n b_n x^n}$ be formal series, and define the difference by

$\displaystyle d(F_1, F_2) = \begin{cases} 2^{-n} & a_n \neq b_n, \; n \text{ minimal} \\ 0 & F_1 = F_2. \end{cases}$

This function makes ${\mathbb C[[x]]}$ into a metric space, so we can discuss limits in this space. Actually, it is a normed vector space obtained by ${\left\lVert F \right\rVert = d(F,0)}$ above.

Thus, ${\lim_{k \rightarrow \infty} F_k = F}$ if each coefficient of ${x^n}$ eventually stabilizes as ${k \rightarrow \infty}$. For example, as formal series we have that ${(1,-1,0,0,\dots)}$, ${(1,0,-1,0,\dots)}$, ${(1,0,0,-1,\dots)}$ converges to ${1 = (1,0,0,0\dots)}$, which we write as

$\displaystyle \lim_{k \rightarrow \infty} (1 - x^k) = 1 \qquad \text{as formal series}.$

As for functional series, since they are functions on the same open set ${U}$, we can use pointwise convergence or the stronger uniform convergence; we’ll say explicitly which one we’re doing.

Example 10 (Limits don’t work at all)

In what follows, ${F_k \simeq G_k}$ for every ${k}$.

• Here is an example showing that if ${\lim_k F_k = F}$, the functions ${G_k}$ may not converge even pointwise. Indeed, just take ${F_k = 1 - x^k}$ as before, and let ${U = \{ z : |z| < 2 \}}$.
• Here is an example showing that even if ${G_k \rightarrow G}$ uniformly, ${\lim_k F_k}$ may not exist. Take ${G_k = 1 - 1/k}$ as constant functions. Then ${G_k \rightarrow 1}$, but ${\lim_k F_k}$ doesn’t exist because the constant term never stabilizes (in the combinatorial sense).
• The following example from this math.SE answer by Robert Israel shows that it’s possible that ${F = \lim_k F_k}$ exists, and ${G_k \rightarrow G}$ pointwise, and still ${F \not\simeq G}$. Let ${U}$ be the open unit disk, and set

\displaystyle \begin{aligned} A_k &= \{z = r e^{i\theta} \mid 2/k \le r \le 1, \; 0 \le \theta \le 2\pi - 1/k\} \\ B_k &= \left\{ |z| \le 1/k \right\} \end{aligned}

for ${k \ge 1}$. By Runge theorem there’s a polynomial ${p_k(z)}$ such that

$\displaystyle |p_k(z) - 1/z^{k}| < 1/k \text{ on } A_k \qquad \text{and} \qquad |p_k(z)| < 1/k \text{ on }B_k.$

Then

$\displaystyle G_k(z) = z^{k+1} p_k(z)$

is the desired counterexample (with ${F_k}$ being the sequence of coefficients from ${G}$). Indeed by construction ${\lim_k F_k = 0}$, since ${\left\lVert F_k \right\rVert \le 2^{-k}}$ for each ${k}$. Alas, ${|g_k(z) - z| \le 2/k}$ for ${z \in A_k \cup B_k}$, so ${G_k \rightarrow z}$ converges pointwise to the identity function.

To be fair, we do have the following saving grace:

Theorem 11 (Uniform convergence and both limits exist is sufficient)

Suppose that ${G_k \rightarrow G}$ converges uniformly. Then if ${F_k \simeq G_k}$ for every ${k}$, and ${\lim_k F_k = F}$, then ${F \simeq G}$.

Proof: Here is a proof, copied from this math.SE answer by Joey Zhou. WLOG ${G = 0}$, and let ${g_n(z) = \sum{a^{(n)}_kz^k}}$. It suffices to show that ${a_k = 0}$ for all ${k}$. Choose any ${0. By Cauchy’s integral formula, we have

\displaystyle \begin{aligned} \left|a_k - a^{(n)}_k\right| &= \left|\frac{1}{2\pi i} \int\limits_{|z|=r}{\frac{g(z)-g_n(z)}{z^{n+1}}\text{ d}z}\right| \\ & \le\frac{1}{2\pi}(2\pi r)\frac{1}{r^{n+1}}\max\limits_{|z|=r}{|g(z)-g_n(z)|} \xrightarrow{n\rightarrow\infty} 0 \end{aligned}

since ${g_n}$ converges uniformly to ${g}$ on ${U}$. Hence, ${a_k = \lim\limits_{n\rightarrow\infty}{a^{(n)}_k}}$. Since ${a^{(n)}_k = 0}$ for ${n\ge k}$, the result follows. $\Box$

The take-away from this section is that limits are relatively poorly behaved.

## 5. Infinite sums and products

Naturally, infinite sums and products are defined by taking the limit of partial sums and limits. The following example (from math.SE again) shows the nuances of this behavior.

Example 12 (On ${e^{1+x}}$)

The expression

$\displaystyle \sum_{n=0}^\infty \frac{(1+x)^n}{n!} = \lim_{N \rightarrow \infty} \sum_{n=0}^N \frac{(1+x)^n}{n!}$

does not make sense as a formal series: we observe that for every ${N}$ the constant term of the partial sum changes.

But this does converge (uniformly, even) to a functional series on ${U = \mathbb C}$, namely to ${e^{1+x}}$.

Exercise 13

Let ${(F_k)_{k \ge 1}}$ be formal series.

• Show that an infinite sum ${\sum_{k=1}^\infty F_k(x)}$ converges as formal series exactly when ${\lim_k \left\lVert F_k \right\rVert = 0}$.
• Assume for convenience ${F_k(0) = 1}$ for each ${k}$. Show that an infinite product ${\prod_{k=0}^{\infty} (1+F_k)}$ converges as formal series exactly when ${\lim_k \left\lVert F_k-1 \right\rVert = 0}$.

Now the upshot is that one example of a convergent formal sum is the expression ${\lim_{N} \sum_{n=0}^N a_nx^n}$ itself! This means we can use standard “radius of convergence” arguments to transfer a formal series into functional one.

Theorem 14 (Constructing ${G}$ from ${F}$)

Let ${F = \sum a_nx^n}$ be a formal series and let

$\displaystyle r = \frac{1}{\limsup_n \sqrt[n]{|c_n|}}.$

If ${r > 0}$ then there exists a functional series ${G}$ on ${U = \{ |z| < r \}}$ such that ${F \simeq G}$.

Proof: Let ${F_k}$ and ${G_k}$ be the corresponding partial sums of ${c_0x^0}$ to ${c_kx^k}$. Then by Cauchy-Hadamard theorem, we have ${G_k \rightarrow G}$ uniformly on (compact subsets of) ${U}$. Also, ${\lim_k F_k = F}$ by construction. $\Box$

This works less well with products: for example we have

$\displaystyle 1 \equiv (1-x) \prod_{j \ge 0} (1+x^{2^j})$

as formal series, but we can’t “plug in ${x=1}$”, for example,

## 6. Finishing the original problem

We finally return to the original problem: we wish to show that the equality

$\displaystyle P(x) = \prod_{j=1}^\infty (1 - x^{s_j})$

cannot hold as formal series. We know that tacitly, this just means

$\displaystyle \lim_{N \rightarrow \infty} \prod_{j=1}^N\left( 1 - x^{s_j} \right) = P(x)$

as formal series.

Here is a solution obtained only by only considering coefficients, presented by Qiaochu Yuan from this MathOverflow question.

Both sides have constant coefficient ${1}$, so we may invert them; thus it suffices to show we cannot have

$\displaystyle \frac{1}{P(x)} = \frac{1}{\prod_{j=1}^{\infty} (1 - x^{s_j})}$

as formal power series.

The coefficients on the LHS have asymptotic growth a polynomial times an exponential.

On the other hand, the coefficients of the RHS can be shown to have growth both strictly larger than any polynomial (by truncating the product) and strictly smaller than any exponential (by comparing to the growth rate in the case where ${s_j = j}$, which gives the partition function ${p(n)}$ mentioned before). So the two rates of growth can’t match.

# New algebra handouts on my website

For olympiad students: I have now published some new algebra handouts. They are:

• Introduction to Functional Equations, which cover the basic techniques and theory for FE’s typically appearing on olympiads like USA(J)MO.
• Monsters, an advanced handout which covers functional equations that have pathological solutions. It covers in detail the solutions to Cauchy functional equation.
• Summation, which is a compilation of various types of olympiad-style sums like generating functions and multiplicative number theory.

• English, notes on proof-writing that I used at the 2016 MOP (Mathematical Olympiad Summer Program).

You can download all these (and other handouts) from my MIT website. Enjoy!

# Approximating E3-LIN is NP-Hard

This lecture, which I gave for my 18.434 seminar, focuses on the MAX-E3LIN problem. We prove that approximating it is NP-hard by a reduction from LABEL-COVER.

## 1. Introducing MAX-E3LIN

In the MAX-E3LIN problem, our input is a series of linear equations ${\pmod 2}$ in ${n}$ binary variables, each with three terms. Equivalently, one can think of this as ${\pm 1}$ variables and ternary products. The objective is to maximize the fraction of satisfied equations.

Example 1 (Example of MAX-E3LIN instance)

\displaystyle \begin{aligned} x_1 + x_3 + x_4 &\equiv 1 \pmod 2 \\ x_1 + x_2 + x_4 &\equiv 0 \pmod 2 \\ x_1 + x_2 + x_5 &\equiv 1 \pmod 2 \\ x_1 + x_3 + x_5 &\equiv 1 \pmod 2 \end{aligned}

\displaystyle \begin{aligned} x_1 x_3 x_4 &= -1 \\ x_1 x_2 x_4 &= +1 \\ x_1 x_2 x_5 &= -1 \\ x_1 x_3 x_5 &= -1 \end{aligned}

A diligent reader can check that we may obtain ${\frac34}$ but not ${1}$.

Remark 2

We immediately notice that

• If there’s a solution with value ${1}$, we can find it easily with ${\mathbb F_2}$ linear algebra.
• It is always possible to get at least ${\frac{1}{2}}$ by selecting all-zero or all-one.

The theorem we will prove today is that these “obvious” observations are essentially the best ones possible! Our main result is that improving the above constants to 51% and 99%, say, is NP-hard.

Theorem 3 (Hardness of MAX-E3LIN)

The ${\frac{1}{2}+\varepsilon}$ vs. ${1-\delta}$ decision problem for MAX-E3LIN is NP-hard.

This means it is NP-hard to decide whether an MAX-E3LIN instance has value ${\le \frac{1}{2}+\varepsilon}$ or ${\ge 1-\delta}$ (given it is one or the other). A direct corollary of this is approximating MAX-SAT is also NP-hard.

Corollary 4

The ${\frac78+\varepsilon}$ vs. ${1-\delta}$ decision problem for MAX-SAT is NP-hard.

Remark 5

The constant ${\frac78}$ is optimal in light of a random assignment. In fact, one can replace ${1-\delta}$ with ${\delta}$, but we don’t do so here.

Proof: Given an equation ${a+b+c=1}$ in MAX-E3LIN, we consider four formulas ${a \lor \neg b \lor \neg c}$, ${\neg a \lor b \lor \neg c}$, ${a \lor \neg b \lor \neg c}$, ${a \lor b \lor c}$. Either three or four of them are satisfied, with four occurring exactly when ${a+b+c=0}$. One does a similar construction for ${a+b+c=1}$. $\Box$

The hardness of MAX-E3LIN is relevant to the PCP theorem: using MAX-E3LIN gadgets, Ha}stad was able to prove a very strong version of the PCP theorem, in which the verifier merely reads just three bits of a proof!

Let ${\varepsilon, \delta > 0}$. We have

$\displaystyle \mathbf{NP} \subseteq \mathbf{PCP}_{\frac{1}{2}+\varepsilon, 1-\delta}(3, O(\log n)).$

In other words, any ${L \in \mathbf{NP}}$ has a (non-adaptive) verifier with the following properties.

• The verifier uses ${O(\log n)}$ random bits, and queries just three (!) bits.
• The acceptance condition is either ${a+b+c=1}$ or ${a+b+c=0}$.
• If ${x \in L}$, then there is a proof ${\Pi}$ which is accepted with probability at least ${1-\delta}$.
• If ${x \notin L}$, then every proof is accepted with probability at most ${\frac{1}{2} + \varepsilon}$.

## 2. Label Cover

We will prove our main result by reducing from the LABEL-COVER. Recall LABEL-COVER is played as follows: we have a bipartite graph ${G = U \cup V}$, a set of keys ${K}$ for vertices of ${U}$ and a set of labels ${L}$ for ${V}$. For every edge ${e = \{u,v\}}$ there is a function ${\pi_e : L \rightarrow K}$ specifying a key ${k = \pi_e(\ell) \in K}$ for every label ${\ell \in L}$. The goal is to label the graph ${G}$ while maximizing the number of edges ${e}$ with compatible key-label pairs.

Approximating LABEL-COVER is NP-hard:

Theorem 7 (Hardness of LABEL-COVER)

The ${\eta}$ vs. ${1}$ decision problem for LABEL-COVER is NP-hard for every ${\eta > 0}$, given ${|K|}$ and ${|L|}$ are sufficiently large in ${\eta}$.

So for any ${\eta > 0}$, it is NP-hard to decide whether one can satisfy all edges or fewer than ${\eta}$ of them.

## 3. Setup

We are going to make a reduction of the following shape:

In words this means that

• “Completeness”: If the LABEL-COVER instance is completely satisfiable, then we get a solution of value ${\ge 1 - \delta}$ in the resulting MAX-E3LIN.
• “Soundness”: If the LABEL-COVER instance has value ${\le \eta}$, then we get a solution of value ${\le \frac{1}{2} + \varepsilon}$ in the resulting MAX-E3LIN.

Thus given an oracle for MAX-E3LIN decision, we can obtain ${\eta}$ vs. ${1}$ decision for LABEL-COVER, which we know is hard.

The setup for this is quite involved, using a huge number of variables. Just to agree on some conventions:

Definition 8 (“Long Code”)

A ${K}$-indexed binary string ${x = (x_k)_k}$ is a ${\pm 1}$ sequence indexed by ${K}$. We can think of it as an element of ${\{\pm 1\}^K}$. An ${L}$-binary string ${y = (y_\ell)_\ell}$ is defined similarly.

Now we initialize ${|U| \cdot 2^{|K|} + |V| \cdot 2^{|L|}}$ variables:

• At every vertex ${u \in U}$, we will create ${2^{|K|}}$ binary variables, one for every ${K}$-indexed binary string. It is better to collect these variables into a function

$\displaystyle f_u : \{\pm1\}^K \rightarrow \{\pm1\}.$

• Similarly, at every vertex ${v \in V}$, we will create ${2^{|L|}}$ binary variables, one for every ${L}$-indexed binary string, and collect these into a function

$\displaystyle g_v : \{\pm1\}^L \rightarrow \{\pm1\}.$

Picture:

Next we generate the equations. Here’s the motivation: we want to do this in such a way that given a satisfying labelling for LABEL-COVER, nearly all the MAX-E3LIN equations can be satisfied. One idea is as follows: for every edge ${e}$, letting ${\pi = \pi_e}$,

• Take a ${K}$-indexed binary string ${x = (x_k)_k}$ at random. Take an ${L}$-indexed binary string ${y = (y_\ell)_\ell}$ at random.
• Define the ${L}$-indexed binary ${z = (z_\ell)_\ell}$ string by ${z = \left( x_{\pi(\ell)} y_\ell \right)}$.
• Write down the equation ${f_u(x) g_v(y) g_v(z) = +1}$ for the MAX-E3LIN instance.

Thus, assuming we had a valid coloring of the graph, we could let ${f_u}$ and ${g_v}$ be the dictator functions for the colorings. In that case, ${f_u(x) = x_{\pi(\ell)}}$, ${g_v(y) = y_\ell}$, and ${g_v(z) = x_{\pi(\ell)} y_\ell}$, so the product is always ${+1}$.

Unfortunately, this has two fatal flaws:

1. This means a ${1}$ instance of LABEL-COVER gives a ${1}$ instance of MAX-E3LIN, but we need ${1-\delta}$ to have a hope of working.
2. Right now we could also just set all variables to be ${+1}$.

We fix this as follows, by using the following equations.

Definition 8 (Equations of reduction)

For every edge ${e}$, with ${\pi = \pi_e}$, we alter the construction and say

• Let ${x = (x_k)_k}$ be and ${y = (y_\ell)_\ell}$ be random as before.
• Let ${n = (n_\ell)_\ell}$ be a random ${L}$-indexed binary string, drawn from a ${\delta}$-biased distribution (${-1}$ with probability ${\delta}$). And now define ${z = (z_\ell)_\ell}$ by

$\displaystyle z_\ell = x_{\pi(\ell)} y_\ell n_\ell .$

The ${n_\ell}$ represents “noise” bits, which resolve the first problem by corrupting a bit of ${z}$ with probability ${\delta}$.

• Write down one of the following two equations with ${\frac{1}{2}}$ probability each:

\displaystyle \begin{aligned} f_u(x) g_v(y) g_v(z) &= +1 \\ f_u(x) g_v(y) g_v(-z) &= -1. \end{aligned}

This resolves the second issue.

This gives a set of ${O(|E|)}$ equations.

I claim this reduction works. So we need to prove the “completeness” and “soundness” claims above.

## 4. Proof of Completeness

Given a labeling of ${G}$ with value ${1}$, as described we simply let ${f_u}$ and ${g_v}$ be dictator functions corresponding to this valid labelling. Then as we’ve seen, we will pass ${1 - \delta}$ of the equations.

## 5. A Fourier Computation

Before proving soundness, we will first need to explicitly compute the probability an equation above is satisfied. Remember we generated an equation for ${e}$ based on random strings ${x}$, ${y}$, ${\lambda}$.

For ${T \subseteq L}$, we define

$\displaystyle \pi^{\text{odd}}_e(T) = \left\{ k \in K \mid \left\lvert \pi_e^{-1}(k) \cap T \right\rvert \text{ is odd} \right\}.$

Thus ${T}$ maps subsets of ${L}$ to subsets of ${K}$.

Remark 9

Note that ${|\pi^{\text{odd}}(T)| \le |T|}$ and that ${\pi^{\text{odd}}(T) \neq \varnothing}$ if ${|T|}$ is odd.

Lemma 10 (Edge Probability)

The probability that an equation generated for ${e = \{u,v\}}$ is true is

$\displaystyle \frac{1}{2} + \frac{1}{2} \sum_{\substack{T \subseteq L \\ |T| \text{ odd}}} (1-2\delta)^{|T|} \widehat g_v(T)^2 \widehat f_u(\pi^{\text{odd}}_e(T)).$

Proof: Omitted for now\dots $\Box$

## 6. Proof of Soundness

We will go in the reverse direction and show (constructively) that if there is MAX-E3LIN instance has a solution with value ${\ge\frac{1}{2}+2\varepsilon}$, then we can reconstruct a solution to LABEL-COVER with value ${\ge \eta}$. (The use of ${2\varepsilon}$ here will be clear in a moment). This process is called “decoding”.

The idea is as follows: if ${S}$ is a small set such that ${\widehat f_u(S)}$ is large, then we can pick a key from ${S}$ at random for ${f_u}$; compare this with the dictator functions where ${\widehat f_u(S) = 1}$ and ${|S| = 1}$. We want to do something similar with ${T}$.

Here are the concrete details. Let ${\Lambda = \frac{\log(1/\varepsilon)}{2\delta}}$ and ${\eta = \frac{\varepsilon^3}{\Lambda^2}}$ be constants (the actual values arise later).

Definition 11

We say that a nonempty set ${S \subseteq K}$ of keys is heavy for ${u}$ if

$\displaystyle \left\lvert S \right\rvert \le \Lambda \qquad\text{and}\qquad \widehat{f_u}(S) \ge \varepsilon^2.$

Note that there are at most ${\varepsilon^{-2}}$ heavy sets by Parseval.

Definition 12

We say that a nonempty set ${T \subseteq L}$ of labels is ${e}$-excellent for ${v}$ if

$\displaystyle \left\lvert T \right\rvert \le \Lambda \qquad\text{and}\qquad S = \pi^{\text{odd}}_e(T) \text{ is heavy.}$

In particular ${S \neq \varnothing}$ so at least one compatible key-label pair is in ${S \times T}$.

Notice that, unlike the case with ${S}$, the criteria for “good” in ${T}$ actually depends on the edge ${e}$ in question! This makes it easier to keys than to select labels. In order to pick labels, we will have to choose from a ${\widehat g_v^2}$ distribution.

Lemma 13 (At least ${\varepsilon}$ of ${T}$ are excellent)

For any edge ${e = \{u,v\}}$, at least ${\varepsilon}$ of the possible ${T}$ according to the distribution ${\widehat g_v^2}$ are ${e}$-excellent.

Proof: Applying an averaging argument to the inequality

$\displaystyle \sum_{\substack{T \subseteq L \\ |T| \text{ odd}}} (1-2\delta)^{|T|} \widehat g_v(T)^2 \left\lvert \widehat f_u(\pi^{\text{odd}}(T)) \right\rvert \ge 2\varepsilon$

shows there is at least ${\varepsilon}$ chance that ${|T|}$ is odd and satisfies

$\displaystyle (1-2\delta)^{|T|} \left\lvert \widehat f_u(S) \right\rvert \ge \varepsilon$

where ${S = \pi^{\text{odd}}_e(T)}$. In particular, ${(1-2\delta)^{|T|} \ge \varepsilon \iff |T| \le \Lambda}$. Finally by \Cref{rem:po}, we see ${S}$ is heavy. $\Box$

Now, use the following algorithm.

• For every vertex ${u \in U}$, take the union of all heavy sets, say

$\displaystyle \mathcal H = \bigcup_{S \text{ heavy}} S.$

Pick a random key from ${\mathcal H}$. Note that ${|\mathcal H| \le \Lambda\varepsilon^{-2}}$, since there are at most ${\varepsilon^{-2}}$ heavy sets (by Parseval) and each has at most ${\Lambda}$ elements.

• For every vertex ${v \in V}$, select a random set ${T}$ according to the distribution ${\widehat g_v(T)^2}$, and select a random element from ${T}$.

I claim that this works.

Fix an edge ${e}$. There is at least an ${\varepsilon}$ chance that ${T}$ is ${e}$-excellent. If it is, then there is at least one compatible pair in ${\mathcal H \times T}$. Hence we conclude probability of success is at least

$\displaystyle \varepsilon \cdot \frac{1}{\Lambda \varepsilon^{-2}} \cdot \frac{1}{\Lambda} = \frac{\varepsilon^3}{\Lambda^2} = \eta.$

(Addendum: it’s pointed out to me this isn’t quite right; the overall probability of the equation given by an edge ${e}$ is ${\ge \frac{1}{2}+\varepsilon}$, but this doesn’t imply it for every edge. Thus one likely needs to do another averaging argument.)

# Against the “Research vs. Olympiads” Mantra

There’s a Mantra that you often hear in math contest discussions: “math olympiads are very different from math research”. (For known instances, see O’Neil, Tao, and more. More neutral stances: Monks, Xu.)

It’s true. And I wish people would stop saying it.

Every time I’ve heard the Mantra, it set off a little red siren in my head: something felt wrong. And I could never figure out quite why until last July. There was some (silly) forum discussion about how Allen Liu had done extraordinarily on math contests over the past year. Then someone says:

A: Darn, what math problem can he not do?!

B: I’ll go out on a limb and say that the answer to this is “most of the problems worth asking.” We’ll see where this stands in two years, at which point the answer will almost certainly change, but research $\neq$ Olympiads.

Then it hit me.

## Ping-pong vs. Tennis

Let’s try the following thought experiment. Consider a world-class ping-pong player, call her Sarah. She has a fan-base talking about her pr0 ping-pong skills. Then someone comes along as says:

Well, table tennis isn’t the same as tennis.

To which I and everyone else reasonable would say, “uh, so what?”. It’s true, but totally irrelevant; ping-pong and tennis are just not related. Maybe Sarah will be better than average at tennis, but there’s no reason to expect her to be world-class in that too.

And yet we say exactly the same thing for olympiads versus research. Someone wins the IMO, out pops the Mantra. Even if the Mantra is true when taken literally, it’s implicitly sending the message there’s something wrong with being good at contests and not good at research.

So now I ask: just what is wrong with that? To answer this question, I first need to answer: “what is math?”.

There’s been a trick played with this debate, and you can’t see it unless you taboo the word “math”. The word “math” can refer to a bunch of things, like:

• Training for contest problems like USAMO/IMO, or
• Working on open problems and conjectures (“research”).

So here’s the trick. The research community managed to claim the name “math”, leaving only “math contests” for the olympiad community. Now the sentence

“Math contests should be relevant to math”

seems totally innocuous. But taboo the world “math”, and you get

“Olympiads should be relevant to research”

and then you notice something’s wrong. In other words, since “math” is a substring of “math contests”, it suddenly seems like the olympiads are subordinate to research. All because of an accident in naming.

Since when? Everyone agrees that olympiads and research are different things, but it does not then follow that “olympiads are useless”. Even if ping-pong is called “table tennis”, that doesn’t mean the top ping-pong players are somehow inferior to top tennis players. (And the scary thing is that in a world without the name “ping-pong”, I can imagine some people actually thinking so.)

I think for many students, olympiads do a lot of good, independent of any value to future math research. Math olympiads give high school students something interesting to work on, and even the training process for a contest such that the IMO carries valuable life lessons: it teaches you how to work hard even in the face of possible failure, and what it’s like to be competitive at an international level (i.e. what it’s like to become really good at something after years of hard work). The peer group that math contests give is also wonderful, and quite similar to the kind of people you’d meet at a top-tier university (and in some cases, they’re more or less the same people). And the problem solving ability you gain from math contests is indisputably helpful elsewhere in life. Consequently, I’m well on record as saying the biggest benefits of math contests have nothing to do with math.

There are also more mundane (but valid) reasons (they help get students out of the classroom, and other standard blurbs about STEM and so on). And as a matter of taste I also think contest problems are interesting and beautiful in their own right. You could even try to make more direct comparisons (for example, I’d guess the average arXiv paper in algebraic geometry gets less attention than the average IMO geometry problem), but that’s a point for another blog post entirely.

## The Right and Virtuous Path

Which now leads me to what I think is a culture issue.

MOP alumni prior to maybe 2010 or so were classified into two groups. They would either go on to math research, which was somehow seen as the “right and virtuous path“, or they would defect to software/finance/applied math/etc. Somehow there is always this implicit, unspoken message that the smart MOPpers do math research and the dumb MOPpers drop out.

I’ll tell you how I realized why I didn’t like the Mantra: it’s because the only time I hear the Mantra is when someone is belittling olympiad medalists.

The Mantra says that the USA winning the IMO is no big deal. The Mantra says Allen Liu isn’t part of the “smart club” until he succeeds in research too. The Mantra says that the countless time and energy put into running each year’s MOP are a waste of time. The Mantra says that the students who eventually drop out of math research are “not actually good at math” and “just good at taking tests”. The Mantra even tells outsiders that they, too, can be great researchers, because olympiads are useless anyways.

The Mantra is math research’s recruiting slogan.

And I think this is harmful. The purpose of olympiads was never to produce more math researchers. If it’s really the case that olympiads and research are totally different, then we should expect relatively few olympiad students to go into research; yet in practice, a lot of them do. I think one could make a case that a lot of the past olympiad students are going into math research without realizing that they’re getting into something totally unrelated, just because the sign at the door said “math”. One could also make a case that it’s very harmful for those that don’t do research, or try research and then decide they don’t like it: suddenly these students don’t think they’re “good at math” any more, they’re not smart enough be a mathematician, etc.

But we need this kind of problem-solving skill and talent too much for it to all be spent on computing R(6,6). Richard Rusczyk’s take from Math Prize for Girls 2014 is:

When people ask me, am I disappointed when my students don’t go off and be mathematicians, my answer is I’d be very disappointed if they all did. We need people who can think about these complex problems and solve really hard problems they haven’t seen before everywhere. It’s not just in math, it’s not just in the sciences, it’s not just in medicine — I mean, what we’d give to get some of them in Congress!

Academia is a fine career, but there’s tons of other options out there: the research community may denounce those who switch out as failures, but I’m sure society will take them with open arms.

To close, I really like this (sarcastic) comment from Steven Karp (near bottom):

Contest math is inaccessible to over 90% of people as it is, and then we’re supposed to tell those that get it that even that isn’t real math? While we’re at it, let’s tell Vi Hart to stop making videos because they don’t accurately represent math research.

Thanks first of all for the many long and thoughtful comments from everyone (both here, on Facebook, in private, and so on). It’s given me a lot to think about.

Here’s my responses to some of the points that were raised, which is necessarily incomplete because of the volume of discussion.

1. To start off, it was suggested I should explicitly clarify: I do not mean to imply that people who didn’t do well on contests cannot do well in math research. So let me say that now.

2. My favorite comment that I got was that in fact this whole post pattern matches with bravery debates.

On one hand you have lots of olympiad students who actually FEEL BAD about winning medals because they “weren’t doing real math”. But on the other hand there are students whose parents tell them to not pursue math as a major or career because of low contest scores. These students (and their parents) would benefit a lot from the Mantra; so I concede that there are indeed good use cases of the Mantra (such as those that Anonymous Chicken, betaveros describe below) and in particular the Mantra is not intrinsically bad.

Which of these use is the “common use” probably depends on which tribes you are part of (guess which one I see more?). It’s interesting in that in this case, the two sides actually agree on the basic fact (that contests and research are not so correlated).

3. Some people point out that research is a career while contests aren’t. I am not convinced by this; I don’t think “is a career” is a good metric for measuring value to society, and can think of several examples of actual jobs that I think really should not exist (not saying any names). In addition, I think that if the general public understood what mathematicians actually do for a career, they just might be a little less willing to pay us.

I think there’s an interesting discussion about whether contests / research are “valuable” or not, but I don’t think the answer is one-sided; this would warrant a whole different debate (and would derail the entire post if I tried to address it).

4. Some people point out that training for olympiads yields diminishing returns (e.g. learning Muirhead and Schur is probably not useful for anything else). I guess this is true, but isn’t it true of almost anything? Maybe the point is supposed to be “olympiads aren’t everything”, which is agreeable (see below).

5. The other favorite comment I got was from Another Chicken, who points out below that the olympiad tribe itself is elitist: they tend to wall themselves off from outsiders (I certainly do this), and undervalue anything that isn’t hard technical problems.

I concede these are real problems with the olympiad community. Again, this could be a whole different blog post.

But I think this comment missed the point of this post. It is probably fine (albeit patronizing) to encourage olympiad students to expand; but I have a big problem with framing it as “spend time on not-contests because research“. That’s the real issue with the Mantra: it is often used as a recruitment slogan, telling students that research is the next true test after the IMO has been conquered.

Changing the Golden Metric from olympiads to research seems to just make the world more egotistic than it already is.

# Vinogradov’s Three-Prime Theorem (with Sammy Luo and Ryan Alweiss)

This was my final paper for 18.099, seminar in discrete analysis, jointly with Sammy Luo and Ryan Alweiss.

We prove that every sufficiently large odd integer can be written as the sum of three primes, conditioned on a strong form of the prime number theorem.

## 1. Introduction

In this paper, we prove the following result:

Every sufficiently large odd integer ${N}$ is the sum of three prime numbers.

In fact, the following result is also true, called the “weak Goldbach conjecture”.

Theorem 2 (Weak Goldbach conjecture)

Every odd integer ${N \ge 7}$ is the sum of three prime numbers.

The proof of Vinogradov’s theorem becomes significantly simpler if one assumes the generalized Riemann hypothesis; this allows one to use a strong form of the prime number theorem (Theorem 9). This conditional proof was given by Hardy and Littlewood in the 1923’s. In 1997, Deshouillers, Effinger, te Riele and Zinoviev showed that the generalized Riemann hypothesis in fact also implies the weak Goldbach conjecture by improving the bound to ${10^{20}}$ and then exhausting the remaining cases via a computer search.

As for unconditional proofs, Vinogradov was able to eliminate the dependency on the generalized Riemann hypothesis in 1937, which is why the Theorem 1 bears his name. However, Vinogradov’s bound used the ineffective Siegel-Walfisz theorem; his student K. Borozdin showed that ${3^{3^{15}}}$ is large enough. Over the years the bound was improved, until recently in 2013 when Harald Helfgott claimed the first unconditional proof of Theorem 2, see here.

In this exposition we follow Hardy and Littlewood’s approach, i.e. we prove Theorem 1 assuming the generalized Riemann hypothesis, following the exposition of Rhee. An exposition of the unconditional proof by Vinogradov is given by Rouse.

## 2. Synopsis

We are going to prove that

$\displaystyle \sum_{a+b+c = N} \Lambda(a) \Lambda(b) \Lambda(c) \asymp \frac12 N^2 \mathfrak G(N) \ \ \ \ \ (1)$

where

$\displaystyle \mathfrak G(N) \overset{\text{def}}{=} \prod_{p \mid N} \left( 1 - \frac{1}{(p-1)^2} \right) \prod_{p \nmid N} \left( 1 + \frac{1}{(p-1)^3} \right)$

and ${\Lambda}$ is the von Mangoldt function defined as usual. Then so long as ${2 \nmid N}$, the quantity ${\mathfrak G(N)}$ will be bounded away from zero; thus (1) will imply that in fact there are many ways to write ${N}$ as the sum of three distinct prime numbers.

The sum (1) is estimated using Fourier analysis. Let us define the following.

Definition 3

Let ${\mathbb T = \mathbb R/\mathbb Z}$ denote the circle group, and let ${e : \mathbb T \rightarrow \mathbb C}$ be the exponential function ${\theta \mapsto \exp(2\pi i \theta)}$. For ${\alpha\in\mathbb T}$, ${\|\alpha\|}$ denotes the minimal distance from ${\alpha}$ to an integer.

Note that ${|e(\theta)-1|=\Theta(\|\theta\|)}$.

Definition 4

For ${\alpha \in \mathbb T}$ and ${x > 0}$ we define

$\displaystyle S(x, \alpha) = \sum_{n \le x} \Lambda(n) e(n\alpha).$

Then we can rewrite (1) using ${S}$ as a “Fourier coefficient”:

Proof: We have

$\displaystyle S(N,\alpha)^3=\sum_{a,b,c\leq N}\Lambda(a)\Lambda(b)\Lambda(c)e((a+b+c)\alpha),$

so

\displaystyle \begin{aligned} \int_{\alpha \in \mathbb T} S(N, \alpha)^3 e(-N\alpha) \; d\alpha &= \int_{\alpha \in \mathbb T} \sum_{a,b,c\leq N}\Lambda(a)\Lambda(b)\Lambda(c)e((a+b+c)\alpha) e(-N\alpha) \; d\alpha \\ &= \sum_{a,b,c\leq N}\Lambda(a)\Lambda(b)\Lambda(c)\int_{\alpha \in \mathbb T}e((a+b+c-N)\alpha) \; d\alpha \\ &= \sum_{a,b,c\leq N}\Lambda(a)\Lambda(b)\Lambda(c)I(a+b+c=N) \\ &= \sum_{a+b+c=N}\Lambda(a)\Lambda(b)\Lambda(c), \end{aligned}

as claimed. $\Box$

In order to estimate the integral in Proposition 5, we divide ${\mathbb T}$ into the so-called “major” and “minor” arcs. Roughly,

• The “major arcs” are subintervals of ${\mathbb T}$ centered at a rational number with small denominator.
• The “minor arcs” are the remaining intervals.

These will be made more precise later. This general method is called the Hardy-Littlewood circle method, because of the integral over the circle group ${\mathbb T}$.

The rest of the paper is structured as follows. In Section 3, we define the Dirichlet character and other number-theoretic objects, and state some estimates for the partial sums of these objects conditioned on the Riemann hypothesis. These bounds are then used in Section 4 to provide corresponding estimates on ${S(x, \alpha)}$. In Section 5 we then define the major and minor arcs rigorously and use the previous estimates to given an upper bound for the integral over both areas. Finally, we complete the proof in Section 6.

## 3. Prime number theorem type bounds

In this section, we collect the necessary number-theoretic results that we will need. It is in this section only that we will require the generalized Riemann hypothesis.

As a reminder, the notation ${f(x)\ll g(x)}$, where ${f}$ is a complex function and ${g}$ a nonnegative real one, means ${f(x)=O(g(x))}$, a statement about the magnitude of ${f}$. Likewise, ${f(x)=g(x)+O(h(x))}$ simply means that for some ${C}$, ${|f(x)-g(x)|\leq C|h(x)|}$ for all sufficiently large ${x}$.

### 3.1. Dirichlet characters

In what follows, ${q}$ denotes a positive integer.

Definition 6

A Dirichlet character modulo ${q}$ ${\chi}$ is a homomorphism ${\chi : (\mathbb Z/q)^\times \rightarrow \mathbb C^\times}$. It is said to be trivial if ${\chi = 1}$; we denote this character by ${\chi_0}$.

By slight abuse of notation, we will also consider ${\chi}$ as a function ${\mathbb Z \rightarrow \mathbb C^\ast}$ by setting ${\chi(n) = \chi(n \pmod q)}$ for ${\gcd(n,q) = 1}$ and ${\chi(n) = 0}$ for ${\gcd(n,q) > 1}$.

Remark 7

The Dirichlet characters form a multiplicative group of order ${\phi(q)}$ under multiplication, with inverse given by complex conjugation. Note that ${\chi(m)}$ is a primitive ${\phi(q)}$th root of unity for any ${m \in (\mathbb Z/q)^\times}$, thus ${\chi}$ takes values in the unit circle.

Moreover, the Dirichlet characters satisfy an orthogonality relation

Experts may recognize that the Dirichlet characters are just the elements of the Pontryagin dual of ${(\mathbb Z/q)^\times}$. In particular, they satisfy an orthogonality relationship

$\displaystyle \frac{1}{\phi(q)} \sum_{\chi \text{ mod } q} \chi(n) \overline{\chi(a)} = \begin{cases} 1 & n = a \pmod q \\ 0 & \text{otherwise} \end{cases} \ \ \ \ \ (3)$

and thus form an orthonormal basis for functions ${(\mathbb Z/q)^\times \rightarrow \mathbb C}$.

### 3.2. Prime number theorem for arithmetic progressions

Definition 8

The generalized Chebyshev function is defined by

$\displaystyle \psi(x, \chi) = \sum_{n \le x} \Lambda(n) \chi(n).$

The Chebyshev function is studied extensively in analytic number theory, as it is the most convenient way to phrase the major results of analytic number theory. For example, the prime number theorem is equivalent to the assertion that

$\displaystyle \psi(x, \chi_0) = \sum_{n \le x} \Lambda(n) \asymp x$

where ${q = 1}$ (thus ${\chi_0}$ is the constant function ${1}$). Similarly, Dirichlet’s theorem actually asserts that any ${q \ge 1}$,

$\displaystyle \psi(x, \chi) = \begin{cases} x + o_q(x) & \chi = \chi_0 \text{ trivial} \\ o_q(x) & \chi \neq \chi_0 \text{ nontrivial}. \end{cases}$

However, the error term in these estimates is quite poor (more than ${x^{1-\varepsilon}}$ for every ${\varepsilon}$). However, by assuming the Riemann Hypothesis for a certain “${L}$-function” attached to ${\chi}$, we can improve the error terms substantially.

Theorem 9 (Prime number theorem for arithmetic progressions)

Let ${\chi}$ be a Dirichlet character modulo ${q}$, and assume the Riemann hypothesis for the ${L}$-function attached to ${\chi}$.

1. If ${\chi}$ is nontrivial, then

$\displaystyle \psi(x, \chi) \ll \sqrt{x} (\log qx)^2.$

2. If ${\chi = \chi_0}$ is trivial, then

$\displaystyle \psi(x, \chi_0) = x + O\left( \sqrt x (\log x)^2 + \log q \log x \right).$

Theorem 9 is the strong estimate that we will require when putting good estimates on ${S(x, \alpha)}$, and is the only place in which the generalized Riemann Hypothesis is actually required.

### 3.3. Gauss sums

Definition 10

For ${\chi}$ a Dirichlet character modulo ${q}$, the Gauss sum ${\tau(\chi)}$ is defined by

$\displaystyle \tau(\chi)=\sum_{a=0}^{q-1}\chi(a)e(a/q).$

We will need the following fact about Gauss sums.

Lemma 11

Consider Dirichlet characters modulo ${q}$. Then:

1. We have ${\tau(\chi_0) = \mu(q)}$.
2. For any ${\chi}$ modulo ${q}$, ${\left\lvert \tau(\chi) \right\rvert \le \sqrt q}$.

### 3.4. Dirichlet approximation

We finally require Dirichlet approximation theorem in the following form.

Theorem 12 (Dirichlet approximation)

Let ${\alpha \in \mathbb R}$ be arbitrary, and ${M}$ a fixed integer. Then there exists integers ${a}$ and ${q = q(\alpha)}$, with ${1 \le q \le M}$ and ${\gcd(a,q) = 1}$, satisfying

$\displaystyle \left\lvert \alpha - \frac aq \right\rvert \le \frac{1}{qM}.$

## 4. Bounds on ${S(x, \alpha)}$

In this section, we use our number-theoretic results to bound ${S(x,\alpha)}$.

First, we provide a bound for ${S(x,\alpha)}$ if ${\alpha}$ is a rational number with “small” denominator ${q}$.

Lemma 13

Let ${\gcd(a,q) = 1}$. Assuming Theorem 9, we have

$\displaystyle S(x, a/q) = \frac{\mu(q)}{\phi(q)} x + O\left( \sqrt{qx} (\log qx)^2 \right)$

where ${\mu}$ denotes the Möbius function.

Proof: Write the sum as

$\displaystyle S(x, a/q) = \sum_{n \le x} \Lambda(n) e(na/q).$

First we claim that the terms ${\gcd(n,q) > 1}$ (and ${\Lambda(n) \neq 0}$) contribute a negligibly small ${\ll \log q \log x}$. To see this, note that

• The number ${q}$ has ${\ll \log q}$ distinct prime factors, and
• If ${p^e \mid q}$, then ${\Lambda(p) + \dots + \Lambda(p^e) = e\log p = \log(p^e) < \log x}$.

So consider only terms with ${\gcd(n,q) = 1}$. To bound the sum, notice that

\displaystyle \begin{aligned} e(n \cdot a/q) &= \sum_{b \text{ mod } q} e(b/q) \cdot \mathbf 1(b \equiv an) \\ &= \sum_{b \text{ mod } q} e(b/q) \left( \frac{1}{\phi(q)} \sum_{\chi \text{ mod } q} \chi(b) \overline{\chi(an)} \right) \end{aligned}

by the orthogonality relations. Now we swap the order of summation to obtain a Gauss sum:

\displaystyle \begin{aligned} e(n \cdot a/q) &= \frac{1}{\phi(q)} \sum_{\chi \text{ mod } q} \overline{\chi(an)} \left( \sum_{b \text{ mod } q} \chi(b) e(b/q) \right) \\ &= \frac{1}{\phi(q)} \sum_{\chi \text{ mod } q} \overline{\chi(an)} \tau(\chi). \end{aligned}

Thus, we swap the order of summation to obtain that

\displaystyle \begin{aligned} S(x, \alpha) &= \sum_{\substack{n \le x \\ \gcd(n,q) = 1}} \Lambda(n) e(n \cdot a/q) \\ &= \frac{1}{\phi(q)} \sum_{\substack{n \le x \\ \gcd(n,q) = 1}} \sum_{\chi \text{ mod } q} \Lambda(n) \overline{\chi(an)} \tau(\chi) \\ &= \frac{1}{\phi(q)} \sum_{\chi \text{ mod } q} \tau(\chi) \sum_{\substack{n \le x \\ \gcd(n,q) = 1}} \Lambda(n) \overline{\chi(an)} \\ &= \frac{1}{\phi(q)} \sum_{\chi \text{ mod } q} \overline{\chi(a)} \tau(\chi) \sum_{\substack{n \le x \\ \gcd(n,q) = 1}} \Lambda(n)\overline{\chi(n)} \\ &= \frac{1}{\phi(q)} \sum_{\chi \text{ mod } q} \overline{\chi(a)} \tau(\chi) \psi(x, \overline\chi) \\ &= \frac{1}{\phi(q)} \left( \tau(\chi_0) \psi(x, \chi_0) + \sum_{1 \neq \chi \text{ mod } q} \overline{\chi(a)} \tau(\chi) \psi(x, \overline\chi) \right). \end{aligned}

Now applying both parts of Lemma 11 in conjunction with Theorem 9 gives

\displaystyle \begin{aligned} S(x,\alpha) &= \frac{\mu(q)}{\phi(q)} \left( x + O\left( \sqrt x (\log qx)^2 \right) \right) + O\left( \sqrt x (\log x)^2 \right) \\ &= \frac{\mu(q)}{\phi(q)} x + O\left( \sqrt{qx} (\log qx)^2 \right) \end{aligned}

as desired. $\Box$

We then provide a bound when ${\alpha}$ is “close to” such an ${a/q}$.

Lemma 14

Let ${\gcd(a,q) = 1}$ and ${\beta \in \mathbb T}$. Assuming Theorem 9, we have

$\displaystyle S(x, a/q + \beta) = \frac{\mu(q)}{\phi(q)} \left( \sum_{n \le x} e(\beta n) \right) + O\left( (1+\|\beta\|x) \sqrt{qx} (\log qx)^2 \right).$

Proof: For convenience let us assume ${x \in \mathbb Z}$. Let ${\alpha = a/q + \beta}$. Let us denote ${\text{Err}(x, \alpha) = S(x,\alpha) - \frac{\mu(q)}{\phi(q)} x}$, so by Lemma 13 we have ${\text{Err}(x,\alpha) \ll \sqrt{qx}(\log x)^2}$. We have

\displaystyle \begin{aligned} S(x, \alpha) &= \sum_{n \le x} \Lambda(n) e(na/q) e(n\beta) \\ &= \sum_{n \le x} e(n\beta) \left( S(n, a/q) - S(n-1, a/q) \right) \\ &= \sum_{n \le x} e(n\beta) \left( \frac{\mu(q)}{\phi(q)} + \text{Err}(n, \alpha) - \text{Err}(n-1, \alpha) \right) \\ &= \frac{\mu(q)}{\phi(q)} \left( \sum_{n \le x} e(n\beta) \right) + \sum_{1 \le m \le x-1} \left( e( (m+1)\beta) - e( m\beta ) \right) \text{Err}(m, \alpha) \\ &\qquad + e(x\beta) \text{Err}(x, \alpha) - e(0) \text{Err}(0, \alpha) \\ &\le \frac{\mu(q)}{\phi(q)} \left( \sum_{n \le x} e(n\beta) \right) + \left( \sum_{1 \le m \le x-1} \|\beta\| \text{Err}(m, \alpha) \right) + \text{Err}(0, \alpha) + \text{Err}(x, \alpha) \\ &\ll \frac{\mu(q)}{\phi(q)} \left( \sum_{n \le x} e(n\beta) \right) + \left( 1+x\left\| \beta \right\| \right) O\left( \sqrt{qx} (\log qx)^2 \right) \end{aligned}

as desired. $\Box$

Thus if ${\alpha}$ is close to a fraction with small denominator, the value of ${S(x, \alpha)}$ is bounded above. We can now combine this with the Dirichlet approximation theorem to obtain the following general result.

Corollary 15

Suppose ${M = N^{2/3}}$ and suppose ${\left\lvert \alpha - a/q \right\rvert \le \frac{1}{qM}}$ for some ${\gcd(a,q) = 1}$ with ${q \le M}$. Assuming Theorem 9, we have

$\displaystyle S(x, \alpha) \ll \frac{x}{\varphi(q)} + x^{\frac56+\varepsilon}$

for any ${\varepsilon > 0}$.

Proof: Apply Lemma 14 directly. $\Box$

## 5. Estimation of the arcs

We’ll write

$\displaystyle f(\alpha) \overset{\text{def}}{=} S(N,\alpha)=\sum_{n \le N} \Lambda(n)e(n\alpha)$

for brevity in this section.

Recall that we wish to bound the right-hand side of (2) in Proposition 5. We split ${[0,1]}$ into two sets, which we call the “major arcs” and the “minor arcs.” To do so, we use Dirichlet approximation, as hinted at earlier.

In what follows, fix

\displaystyle \begin{aligned} M &= N^{2/3} \\ K &= (\log N)^{10}. \end{aligned}

### 5.1. Setting up the arcs

Definition 16

For ${q \le K}$ and ${\gcd(a,q) = 1}$, ${1 \le a \le q}$, we define

$\displaystyle \mathfrak M(a,q) = \left\{ \alpha \in \mathbb T \mid \left\lvert \alpha - \frac aq \right\rvert \le \frac 1M \right\}.$

These will be the major arcs. The union of all major arcs is denoted by ${\mathfrak M}$. The complement is denoted by ${\mathfrak m}$.

Equivalently, for any ${\alpha}$, consider ${q = q(\alpha) \le M}$ as in Theorem 12. Then ${\alpha \in \mathfrak M}$ if ${q \le K}$ and ${\alpha \in \mathfrak m}$ otherwise.

Proposition 17

${\mathfrak M}$ is composed of finitely many disjoint intervals ${\mathfrak M(a,q)}$ with ${q \le K}$. The complement ${\mathfrak m}$ is nonempty.

Proof: Note that if ${q_1, q_2 \le K}$ and ${a/q_1 \neq b/q_2}$ then ${\left\lvert \frac{a}{q_1} - \frac{b}{q_2} \right\rvert \ge \frac{1}{q_1q_2} \gg \frac{3}{qM}}$. $\Box$

In particular both ${\mathfrak M}$ and ${\mathfrak m}$ are measurable. Thus we may split the integral in (2) over ${\mathfrak M}$ and ${\mathfrak m}$. This integral will have large magnitude on the major arcs, and small magnitude on the minor arcs, so overall the whole interval ${[0,1]}$ it will have large magnitude.

### 5.2. Estimate of the minor arcs

First, we note the well known fact ${\phi(q) \gg q/\log q}$. Note also that if ${q=q(\alpha)}$ as in the last section and ${\alpha}$ is on a minor arc, we have ${q > (\log N)^{10}}$, and thus ${\phi(q) \gg (\log N)^{9}}$.

As such Corollary 3.3 yields that ${f(\alpha) \ll \frac{N}{\phi(q)}+N^{.834} \ll \frac{N}{(\log N)^9}}$.

Now,

\displaystyle \begin{aligned} \left\lvert \int_{\mathfrak m}f(\alpha)^3e(-N\alpha) \; d\alpha \right\rvert &\le \int_{\mathfrak m}\left\lvert f(\alpha)\right\rvert ^3 \; d\alpha \\ &\ll \frac{N}{(\log N)^9} \int_{0}^{1}\left\lvert f(\alpha)\right\rvert ^2 \;d\alpha \\ &=\frac{N}{(\log N)^9}\int_{0}^{1}f(\alpha)f(-\alpha) \; d\alpha \\ &=\frac{N}{(\log N)^9}\sum_{n \le N} \Lambda(n)^2 \\ &\ll \frac{N^2}{(\log N)^8}, \end{aligned}

using the well known bound ${\sum_{n \le N} \Lambda(n)^2 \ll \frac{N}{\log N}}$. This bound of ${\frac{N^2}{(\log N)^8}}$ will be negligible compared to lower bounds for the major arcs in the next section.

### 5.3. Estimate on the major arcs

We show that

$\displaystyle \int_{\mathfrak M}f(\alpha)^3e(-N\alpha) d\alpha \asymp \frac{N^2}{2} \mathfrak G(N).$

By Proposition 17 we can split the integral over each interval and write

$\displaystyle \int_{\mathfrak M} f(\alpha)^3e(-N\alpha) \; d\alpha = \sum_{q \le (\log N)^{10}}\sum_{\substack{1 \le a \le q \\ \gcd(a,q)=1}} \int_{-1/qM}^{1/qM}f(a/q+\beta)^3e(-N(a/q+\beta)) \; d\beta.$

Then we apply Lemma 14, which gives

\displaystyle \begin{aligned} f(a/q+\beta)^3 &= \left(\frac{\mu(q)}{\phi(q)}\sum_{n \le N}e(\beta n) \right)^3 \\ &+\left(\frac{\mu(q)}{\phi(q)}\sum_{n \le N}e(\beta n)\right)^2 O\left((1+\|\beta\|N)\sqrt{qN} \log^2 qN\right) \\ &+\left(\frac{\mu(q)}{\phi(q)}\sum_{n \le N}e(\beta n)\right) O\left((1+\|\beta\|N)\sqrt{qN} \log^2 qN\right)^2 \\ &+O\left((1+\|\beta\|N)\sqrt{qN} \log^2 qN\right)^3. \end{aligned}

Now, we can do casework on the side of ${N^{-.9}}$ that ${\|\beta\|}$ lies on.

• If ${\|\beta\| \gg N^{-.9}}$, we have ${\sum_{n \le N}e(\beta n) \ll \frac{2}{|e(\beta)-1|} \ll \frac{1}{\|\beta\|} \ll N^{.9}}$, and ${(1+\|\beta\|N)\sqrt{qN} \log^2 qN \ll N^{5/6+\varepsilon}}$, because certainly we have ${\|\beta\|<1/M=N^{-2/3}}$.
• If on the other hand ${\|\beta\|\ll N^{-.9}}$, we have ${\sum_{n \le N}e(\beta n) \ll N}$ obviously, and ${O(1+\|\beta\|N)\sqrt{qN} \log^2 qN) \ll N^{3/5+\varepsilon}}$.

As such, we obtain

$\displaystyle f(a/q+\beta)^3 \ll \left( \frac{\mu(q)}{\phi(q)}\sum_{n \le N}e(\beta n) \right)^3 + O\left(N^{79/30+\varepsilon}\right)$

in either case. Thus, we can write

\displaystyle \begin{aligned} &\qquad \int_{\mathfrak M} f(\alpha)^3e(-N\alpha) \; d\alpha \\ &= \sum_{q \le (\log N)^{10}} \sum_{\substack{1 \le a \le q \\ \gcd(a,q)=1}} \int_{-1/qM}^{1/qM} f(a/q+\beta)^3e(-N(a/q+\beta)) \; d\beta \\ &= \sum_{q \le (\log N)^{10}} \sum_{\substack{1 \le a \le q \\ \gcd(a,q)=1}} \int_{-1/qM}^{1/qM}\left[\left(\frac{\mu(q)}{\phi(q)}\sum_{n \le N}e(\beta n)\right)^3 + O\left(N^{79/30+\varepsilon}\right)\right]e(-N(a/q+\beta)) \; d\beta \\ &=\sum_{q \le (\log N)^{10}} \frac{\mu(q)}{\phi(q)^3} S_q \left(\sum_{\substack{1 \le a \le q \\ \gcd(a,q)=1}} e(-N(a/q))\right) \left( \int_{-1/qM}^{1/qM}\left(\sum_{n \le N}e(\beta n)\right)^3e(-N\beta) \; d\beta \right ) \\ &\qquad +O\left(N^{59/30+\varepsilon}\right). \end{aligned}

just using ${M \le N^{2/3}}$. Now, we use

$\displaystyle \sum_{n \le N}e(\beta n) = \frac{1-e(\beta N)}{1-e(\beta)} \ll \frac{1}{\|\beta\|}.$

This enables us to bound the expression

$\displaystyle \int_{1/qM}^{1-1/qM}\left (\sum_{n \le N}e(\beta n)\right) ^ 3 e(-N\beta)d\beta \ll \int_{1/qM}^{1-1/qM}\|\beta\|^{-3} d\beta = 2\int_{1/qM}^{1/2}\beta^{-3} d\beta \ll q^2M^2.$

But the integral over the entire interval is

\displaystyle \begin{aligned} \int_{0}^{1}\left(\sum_{n \le N}e(\beta n) \right)^3 e(-N\beta)d\beta &= \int_{0}^{1} \sum_{a,b,c \le N} e((a+b+c-N)\beta) \\ &\ll \sum_{a,b,c \le N} \mathbf 1(a+b+c=N) \\ &= \binom{N-1}{2}. \end{aligned}

Considering the difference of the two integrals gives

$\displaystyle \int_{-1/qM}^{1/qM}\left(\sum_{n \le N}e(\beta n) \right)^3 e(-N\beta) \; d\beta - \frac{N^2}{2} \ll q^2 M^2 + N \ll (\log N)^c N^{4/3},$

for some absolute constant ${c}$.

For brevity, let

$\displaystyle S_q = \sum_{\substack{1 \le a \le q \\ \gcd(a,q)=1}} e(-N(a/q)).$

Then

\displaystyle \begin{aligned} \int_{\mathfrak M} f(\alpha)^3e(-N\alpha) \; d\alpha &= \sum_{q \le (\log N)^{10}} \frac{\mu(q)}{\phi(q)^3}S_q \left( \int_{-1/qM}^{1/qM}\left(\sum_{n \le N}e(\beta n)\right)^3e(-N\beta) \; d\beta \right ) \\ &\qquad +O\left(N^{59/30+\varepsilon}\right) \\ &= \frac{N^2}{2}\sum_{q \le (\log N)^{10}} \frac{\mu(q)}{\phi(q)^3}S_q + O((\log N)^{10+c} N^{4/3}) + O(N^{59/30+\varepsilon}) \\ &= \frac{N^2}{2}\sum_{q \le (\log N)^{10}} \frac{\mu(q)}{\phi(q)^3} + O(N^{59/30+\varepsilon}). \end{aligned}

.

The inner sum is bounded by ${\phi(q)}$. So,

$\displaystyle \left\lvert \sum_{q>(\log N)^{10}} \frac{\mu(q)}{\phi(q)^3} S_q \right\rvert \le \sum_{q>(\log N)^{10}} \frac{1}{\phi(q)^2},$

which converges since ${\phi(q)^2 \gg q^c}$ for some ${c > 1}$. So

$\displaystyle \int_{\mathfrak M} f(\alpha)^3e(-N\alpha) \; d\alpha = \frac{N^2}{2}\sum_{q = 1}^\infty \frac{\mu(q)}{\phi(q)^3}S_q + O(N^{59/30+\varepsilon}).$

Now, since ${\mu(q)}$, ${\phi(q)}$, and ${\sum_{\substack{1 \le a \le q \\ \gcd(a,q)=1}} e(-N(a/q))}$ are multiplicative functions of ${q}$, and ${\mu(q)=0}$ unless ${q}$ is squarefree,

\displaystyle \begin{aligned} \sum_{q = 1}^\infty \frac{\mu(q)}{\phi(q)^3} S_q &= \prod_p \left(1+\frac{\mu(p)}{\phi(p)^3}S_p \right) \\ &= \prod_p \left(1-\frac{1}{(p-1)^3} \sum_{a=1}^{p-1} e(-N(a/p))\right) \\ &= \prod_p \left(1-\frac{1}{(p-1)^3}\sum_{a=1}^{p-1} (p\cdot \mathbf 1(p|N) - 1)\right) \\ &= \prod_{p|N}\left(1-\frac{1}{(p-1)^2}\right) \prod_{p \nmid N}\left(1+\frac{1}{(p-1)^3}\right) \\ &= \mathfrak G(N). \end{aligned}

So,

$\displaystyle \int_{\mathfrak M} f(\alpha)^3e(-N\alpha) \; d\alpha = \frac{N^2}{2}\mathfrak{G}(N) + O(N^{59/30+\varepsilon}).$

When ${N}$ is odd,

$\displaystyle \mathfrak{G}(N) = \prod_{p|N}\left(1-\frac{1}{(p-1)^2}\right)\prod_{p \nmid N}\left(1+\frac{1}{(p-1)^3}\right)\geq \prod_{m\geq 3}\left(\frac{m-2}{m-1}\frac{m}{m-1}\right)=\frac{1}{2},$

so that we have

$\displaystyle \int_{\mathfrak M} f(\alpha)^3e(-N\alpha) \; d\alpha \asymp \frac{N^2}{2}\mathfrak{G}(N),$

as desired.

## 6. Completing the proof

Because the integral over the minor arc is ${o(N^2)}$, it follows that

$\displaystyle \sum_{a+b+c=N} \Lambda(a)\Lambda(b)\Lambda(c) = \int_{0}^{1} f(\alpha)^3 e(-N\alpha) d \alpha \asymp \frac{N^2}{2}\mathfrak{G}(N) \gg N^2.$

Consider the set ${S_N}$ of integers ${p^k\leq N}$ with ${k>1}$. We must have ${p \le N^{\frac{1}{2}}}$, and for each such ${p}$ there are at most ${O(\log N)}$ possible values of ${k}$. As such, ${|S_N| \ll\pi(N^{1/2}) \log N\ll N^{1/2}}$.

Thus

$\displaystyle \sum_{\substack{a+b+c=N \\ a\in S_N}} \Lambda(a)\Lambda(b)\Lambda(c) \ll (\log N)^3 |S|N \ll\log(N)^3 N^{3/2},$

and similarly for ${b\in S_N}$ and ${c\in S_N}$. Notice that summing over ${a\in S_N}$ is equivalent to summing over composite ${a}$, so

$\displaystyle \sum_{p_1+p_2+p_3=N} \Lambda(p_1)\Lambda(p_2)\Lambda(p_3) =\sum_{a+b+c=N} \Lambda(a)\Lambda(b)\Lambda(c) + O(\log(N)^3 N^{3/2}) \gg N^2,$

where the sum is over primes ${p_i}$. This finishes the proof.

# First drafts of Napkin up!

EDIT: Here’s a July 19 draft that fixes some of the glaring issues that were pointed out.

This morning I finally uploaded the first drafts of my Napkin project, which I’ve been working on since December 2014. See the Napkin tab above for a listing of all drafts.

Napkin is my personal exposition project, which unifies together a lot of my blog posts and even more that I haven’t written on yet into a single coherent narrative. It’s written for students who don’t know much higher math, but are curious and already are comfortable with proofs. It’s especially suited for e.g. students who did contests like USAMO and IMO.

There are still a lot of rough edges in the draft, but I haven’t been able to find much time to work on it this whole calendar year, and so I’ve finally decided the perfect is the enemy of the good and it’s about time I brought this project out of the garage.

I’d much appreciate any comments, corrections, or suggestions, however minor. Please let me know! I do plan to keep updating this draft as I get comments, though I can’t promise that I’ll be very fast in doing so.

I. Basic Algebra and Topology
II. Linear Algebra and Multivariable Calculus
III. Groups, Rings, and More
IV. Complex Analysis
V. Quantum Algorithms
VI. Algebraic Topology I: Homotopy
VII. Category Theory
VIII. Differential Geometry
IX. Algebraic Topology II: Homology
X. Algebraic NT I: Rings of Integers
XI. Algebraic NT II: Galois and Ramification Theory
XII. Representation Theory
XIII. Algebraic Geometry I: Varieties
XIV. Algebraic Geometry II: Schemes
XV. Set Theory I: ZFC, Ordinals, and Cardinals
XVI. Set Theory II: Model Theory and Forcing

(I’ve also posted this on Reddit to try and grab a larger audience. We’ll see how that goes.)

# The Structure Theorem over PID’s

In this post I’ll describe the structure theorem over PID’s which generalizes the following results:

• Finite dimensional vector fields over ${k}$ are all of the form ${k^{\oplus n}}$,
• The classification theorem for finitely generated abelian groups,
• The Frobenius normal form of a matrix,
• The Jordan decomposition of a matrix.

## 1. Some ring theory prerequisites

Prototypical example for this section: ${R = \mathbb Z}$.

Before I can state the main theorem, I need to define a few terms for UFD’s, which behave much like ${\mathbb Z}$: Our intuition from the case ${R = \mathbb Z}$ basically carries over verbatim. We don’t even need to deal with prime ideals and can factor elements instead.

Definition 1

If ${R}$ is a UFD, then ${p \in R}$ is a prime element if ${(p)}$ is a prime ideal and ${p \neq 0}$. For UFD’s this is equivalent to the following property: if ${p = xy}$ then either ${x}$ or ${y}$ is a unit.

So for example in ${\mathbb Z}$ the set of prime elements is ${\{\pm2, \pm3, \pm5, \dots\}}$. Now, since ${R}$ is a UFD, every element ${r}$ factors into a product of prime elements

$\displaystyle r = u p_1^{e_1} p_2^{e_2} \dots p_m^{e_m}$

Definition 2

We say ${r}$ divides ${s}$ if ${s = r'r}$ for some ${r' \in R}$. This is written ${r \mid s}$.

Example 3 (Divisibility in ${\mathbb Z}$)

The number ${0}$ is divisible by every element of ${\mathbb Z}$. All other divisibility as expected.

Ques 4

Show that ${r \mid s}$ if and only if the exponent of each prime in ${r}$ is less than or equal to the corresponding exponent in ${s}$.

Now, the case of interest is the even stronger case when ${R}$ is a PID:

Proposition 5 (PID’s are Noetherian UFD’s)

If ${R}$ is a PID, then it is Noetherian and also a UFD.

Proof: The fact that ${R}$ is Noetherian is obvious. For ${R}$ to be a UFD we essentially repeat the proof for ${\mathbb Z}$, using the fact that ${(a,b)}$ is principal in order to extract ${\gcd(a,b)}$. $\Box$

In this case, we have a Chinese remainder theorem for elements.

Theorem 6 (Chinese remainder theorem for rings)

Let ${m}$ and ${n}$ be relatively prime elements, meaning ${(m) + (n) = (1)}$. Then

$\displaystyle R / (mn) \cong R/m \times R/n.$

Proof: This is the same as the proof of the usual Chinese remainder theorem. First, since ${(m,n)=(1)}$ we have ${am+bn=1}$ for some ${a}$ and ${b}$. Then we have a map

$\displaystyle R/m \times R/n \rightarrow R/(mn) \quad\text{by}\quad (r,s) \mapsto r \cdot bn + s \cdot am.$

One can check that this map is well-defined and an isomorphism of rings. (Diligent readers invited to do so.) $\Box$

Finally, we need to introduce the concept of a Noetherian ${R}$-module.

Definition 7

An ${R}$-module ${M}$ is Noetherian if it satisfies one of the two equivalent conditions:

• Its submodules obey the ascending chain condition: there is no infinite sequence of modules ${M_1 \subsetneq M_2 \subsetneq \dots}$.
• All submodules of ${M}$ (including ${M}$ itself) are finitely generated.

This generalizes the notion of a Noetherian ring: a Noetherian ring ${R}$ is one for which ${R}$ is Noetherian as an ${R}$-module.

Ques 8

Check these two conditions are equivalent. (Copy the proof for rings.)

## 2. The structure theorem

Our structure theorem takes two forms:

Theorem 9 (Structure theorem, invariant form)

Let ${R}$ be a PID and let ${M}$ be any finitely generated ${R}$-module. Then

$\displaystyle M \cong \bigoplus_{i=1}^m R/s_i$

for some ${s_i}$ satisfying ${s_1 \mid s_2 \mid \dots \mid s_m}$.

Corollary 10 (Structure theorem, primary form)

Let ${R}$ be a PID and let ${M}$ be any finitely generated ${R}$-module. Then

$\displaystyle M \cong R^{\oplus r} \oplus R/(q_1) \oplus R/(q_2) \oplus \dots \oplus R/(q_m)$

where ${q_i = p_i^{e_i}}$ for some prime element ${p_i}$ and integer ${e_i \ge 1}$.

Proof: Factor each ${s_i}$ into prime factors (since ${R}$ is a UFD), then use the Chinese remainder theorem. $\Box$

Remark 11

In both theorems the decomposition is unique up to permutations of the summands; good to know, but I won’t prove this.

## 3. Reduction to maps of free ${R}$-modules

The proof of the structure theorem proceeds in two main steps. First, we reduce the problem to a linear algebra problem involving free ${R}$-modules ${R^{\oplus d}}$. Once that’s done, we just have to play with matrices; this is done in the next section.

Suppose ${M}$ is finitely generated by ${d}$ elements. Then there is a surjective map of ${R}$-modules

$\displaystyle R^{\oplus d} \twoheadrightarrow M$

whose image on the basis of ${R^{\oplus d}}$ are the generators of ${M}$. Let ${K}$ denote the kernel.

We claim that ${K}$ is finitely generated as well. To this end we prove that

Lemma 12 (Direct sum of Noetherian modules is Noetherian)

Let ${M}$ and ${N}$ be two Noetherian ${R}$-modules. Then the direct sum ${M \oplus N}$ is also a Noetherian ${R}$-module.

Proof: It suffices to show that if ${L \subseteq M \oplus N}$, then ${L}$ is finitely generated. It’s unfortunately not true that ${L = P \oplus Q}$ (take ${M = N = \mathbb Z}$ ${L = \{(n,n) \mid n \in \mathbb Z\}}$) so we will have to be more careful.

Consider the submodules

\displaystyle \begin{aligned} A &= \left\{ x \in M \mid (x,0) \in L \right\} \subseteq M \\ B &= \left\{ y \in N \mid \exists x \in M : (x,y) \in L \right\} \subseteq N. \end{aligned}

(Note the asymmetry for ${A}$ and ${B}$: the proof doesn’t work otherwise.) Then ${A}$ is finitely generated by ${a_1}$, \dots, ${a_k}$, and ${B}$ is finitely generated by ${b_1}$, \dots, ${b_\ell}$. Let ${x_i = (a_i, 0)}$ and let ${y_i = (\ast, b_i)}$ be elements of ${L}$ (where the ${\ast}$‘s are arbitrary things we don’t care about). Then ${x_i}$ and ${y_i}$ together generate ${L}$. $\Box$

Ques 13

Deduce that for ${R}$ a PID, ${R^{\oplus d}}$ is Noetherian.

Hence ${K \subseteq R^{\oplus d}}$ is finitely generated as claimed. So we can find another surjective map ${R^{\oplus f} \twoheadrightarrow K}$. Consequently, we have a composition

$\displaystyle R^{\oplus f} \twoheadrightarrow K \hookrightarrow R^{\oplus d} \twoheadrightarrow M.$

Observe that ${M}$ is the cokernel of the composition ${T : R^{\oplus f} \rightarrow R^{\oplus d}}$, i.e. we have that

$\displaystyle M \cong R^{\oplus d} / \text{im }(T).$

So it suffices to understand the map ${T}$ well.

## 4. Smith normal form

The idea is now that we have reduced our problem to studying linear maps ${T : R^{\oplus m} \rightarrow R^{\oplus n}}$, which can be thought of as a generic matrix

$\displaystyle T = \begin{pmatrix} a_{11} & \dots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \dots & a_{nm} \end{pmatrix}$

for the standard basis ${e_1}$, \dots, ${e_m}$ of ${R^{\oplus m}}$ and ${f_1}$, \dots, ${f_n}$ of ${N}$.

Of course, as you might expect it ought to be possible to change the given basis of ${T}$ such that ${T}$ has a nicer matrix form. We already saw this in Jordan form, where we had a map ${T : V \rightarrow V}$ and changed the basis so that ${T}$ was “almost diagonal”. This time, we have two sets of bases we can change, so we would hope to get a diagonal basis, or even better.

Before proceeding let’s think about how we might edit the matrix: what operations are permitted? Here are some examples:

• Swapping rows and columns, which just corresponds to re-ordering the basis.
• Adding a multiple of a column to another column. For example, if we add ${3}$ times the first column to the second column, this is equivalent to replacing the basis

$\displaystyle (e_1, e_2, e_3, \dots, e_m) \mapsto (e_1, e_2+3e_1, e_3, \dots, e_m).$

• Adding a multiple of a row to another row. One can see that adding ${3}$ times the first row to the second row is equivalent to replacing the basis

$\displaystyle (f_1, f_2, f_3, \dots, f_n) \mapsto (f_1-3f_2, f_2, f_3, \dots, f_n).$

More generally, If ${A}$ is an invertible ${n \times n}$ matrix we can replace ${T}$ with ${AT}$. This corresponds to replacing

$\displaystyle (f_1, \dots, f_n) \mapsto (A(f_1), \dots, A(f_n))$

(the “invertible” condition just guarantees the latter is a basis). Of course similarly we can replace ${X}$ with ${XB}$ where ${B}$ is an invertible ${m \times m}$ matrix; this corresponds to

$\displaystyle (e_1, \dots, e_m) \mapsto (B^{-1}(e_1), \dots, B^{-1}(e_m))$

Armed with this knowledge, we can now approach the following result.

Theorem 14 (Smith normal form)

Let ${R}$ be a PID. Let ${M = R^{\oplus m}}$ and ${N = R^{\oplus n}}$ be free ${R}$-modules and let ${T : M \rightarrow N}$ be a linear map. Set ${k = \min(m,n)}$.

Then we can select a pair of new bases for ${M}$ and ${N}$ such that ${T}$ has only diagonal entries ${s_1}$, ${s_2}$, \dots, ${s_k}$ and ${s_1 \mid s_2 \mid \dots \mid s_k}$.

So if ${m > n}$, the matrix should take the form

$\displaystyle \begin{pmatrix} s_1 & 0 & 0 & 0 & \dots & 0 \\ 0 & s_2 & 0 & 0 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots & \dots & \vdots \\ 0 & 0 & 0 & s_n & \dots & 0 \end{pmatrix}.$

and similarly when ${m \le n}$.

Ques 15

Show that Smith normal form implies the structure theorem.

Remark 16

Note that this is not a generalization of Jordan form.

• In Jordan form we consider maps ${T : V \rightarrow V}$; note that the source and target space are the same, and we are considering one basis for the space ${V}$.
• In Smith form the maps ${T : M \rightarrow N}$ are between different modules, and we pick two sets of bases (one for ${M}$ and one for ${N}$).

Example 17 (Example of Smith normal form)

To give a flavor of the idea of the proof, let’s work through a concrete example with the following matrix with entries from ${\mathbb Z}$:

$\displaystyle \begin{pmatrix} 18 & 38 & 48 \\ 14 & 30 & 38 \end{pmatrix}.$

The GCD of all the entries is ${2}$, and so motivated by this, we perform the Euclidean algorithm on the left column: subtract the second row from the first row, then three times the first row from the second:

$\displaystyle \begin{pmatrix} 18 & 38 & 48 \\ 14 & 30 & 38 \end{pmatrix} \mapsto \begin{pmatrix} 4 & 8 & 10 \\ 14 & 30 & 38 \end{pmatrix} \mapsto \begin{pmatrix} 4 & 8 & 10 \\ 2 & 6 & 2 \end{pmatrix}.$

Now that the GCD of ${2}$ is present, we move it to the upper-left by switching the two rows, and then kill off all the entries in the same row/column; since ${2}$ was the GCD all along, we isolate ${2}$ completely:

$\displaystyle \begin{pmatrix} 4 & 8 & 10 \\ 2 & 6 & 2 \end{pmatrix} \mapsto \begin{pmatrix} 2 & 6 & 2 \\ 4 & 8 & 10 \end{pmatrix} \mapsto \begin{pmatrix} 2 & 6 & 2 \\ 0 & -4 & 6 \\ \end{pmatrix} \mapsto \begin{pmatrix} 2 & 0 & 0 \\ 0 & -4 & 6 \end{pmatrix}.$

This reduces the problem to a ${1 \times 2}$ matrix. So we just apply the Euclidean algorithm again there:

$\displaystyle \begin{pmatrix} 2 & 0 & 0 \\ 0 & -4 & 6 \end{pmatrix} \mapsto \begin{pmatrix} 2 & 0 & 0 \\ 0 & -4 & 2 \end{pmatrix} \mapsto \begin{pmatrix} 2 & 0 & 0 \\ 0 & 0 & 2 \end{pmatrix} \mapsto \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \end{pmatrix}.$

Now all we have to do is generalize this proof to work with any PID. It’s intuitively clear how to do this: the PID condition more or less lets you perform a Euclidean algorithm.

Proof: Begin with a generic matrix

$\displaystyle T = \begin{pmatrix} a_{11} & \dots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \dots & a_{nm} \end{pmatrix}$

We want to show, by a series of operations (gradually changing the given basis) that we can rearrange the matrix into Smith normal form.

Define ${\gcd(x,y)}$ to be any generator of the principal ideal ${(x,y)}$.

Claim 18 (“Euclidean algorithm”)

If ${a}$ and ${b}$ are entries in the same row or column, we can change bases to replace ${a}$ with ${\gcd(a,b)}$ and ${b}$ with something else.

Proof: We do just the case of columns. By hypothesis, ${\gcd(a,b) = xa+yb}$ for some ${x,y \in R}$. We must have ${(x,y) = (1)}$ now (we’re in a UFD). So there are ${u}$ and ${v}$ such that ${xu + yv = 1}$. Then

$\displaystyle \begin{pmatrix} x & y \\ -v & u \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} \gcd(a,b) \\ \text{something} \end{pmatrix}$

and the first matrix is invertible (check this!), as desired. $\Box$
Let ${s_1 = (a_{ij})_{i,j}}$ be the GCD of all entries. Now by repeatedly applying this algorithm, we can cause ${s}$ to appear in the upper left hand corner. Then, we use it to kill off all the entries in the first row and the first column, thus arriving at a matrix

$\displaystyle \begin{pmatrix} s_1 & 0 & 0 & \dots & 0 \\ 0 & a_{22}' & a_{23}' & \dots & a_{2n}' \\ 0 & a_{32}' & a_{33}' & \dots & a_{3n}' \\ \vdots&\vdots&\vdots&\ddots&\vdots \\ 0 & a_{m2}' & a_{m3}' & \dots & a_{mn}' \\ \end{pmatrix}.$

Now we repeat the same procedure with this lower-right ${(m-1) \times (n-1)}$ matrix, and so on. This gives the Smith normal form. $\Box$

With the Smith normal form, we have in the original situation that

$\displaystyle M \cong R^{\oplus d} / \text{im } T$

and applying the theorem to ${T}$ completes the proof of the structure theorem.

## 5. Applications

Now, we can apply our structure theorem! I’ll just sketch proofs of these and let the reader fill in details.

Corollary 19 (Finite-dimensional vector spaces are all isomorphic)

A vector space ${V}$ over a field ${k}$ has a finite spanning set of vectors. Then for some ${n}$, ${V \cong k^{\oplus n}}$.

Proof: In the structure theorem, ${k / (s_i) \in \{0,k\}}$. $\Box$

Corollary 20 (Frobenius normal form)

Let ${T : V \rightarrow V}$ where ${V}$ is a finite-dimensional vector space over an arbitrary field ${k}$ (not necessarily algebraically closed). Then one can write ${T}$ as a block-diagonal matrix whose blocks are all of the form

$\displaystyle \begin{pmatrix} 0 & 0 & 0 & \dots & 0 & \ast \\ 1 & 0 & 0 & \dots & 0 & \ast \\ 0 & 1 & 0 & \dots & 0 & \ast \\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots \\ 0 & 0 & 0 & \dots & 1 & \ast \\ \end{pmatrix}.$

Proof: View ${V}$ as a ${k[x]}$-module with action ${x \cdot v = T(v)}$. By theorem ${V \cong \bigoplus_i k[x] / (s_i)}$ for some polynomials ${s_i}$, where ${s_1 \mid s_2 \mid s_3 \mid \dots}$. Write each block in the form described. $\Box$

Corollary 21 (Jordan normal form)

Let ${T : V \rightarrow V}$ where ${V}$ is a finite-dimensional vector space over an arbitrary field ${k}$ which is algebraically closed. Prove that ${T}$ can be written in Jordan form.

Proof: We now use the structure theorem in its primary form. Since ${k[x]}$ is algebraically closed each ${p_i}$ is a linear factor, so every summand looks like ${k[x] / (x-a)^m}$ for some ${m}$. $\Box$

This is a draft of Chapter 15 of the Napkin.

# Miller-Rabin (for MIT 18.434)

This is a transcript of a talk I gave as part of MIT’s 18.434 class, the “Seminar in Theoretical Computer Science” as part of MIT’s communication requirement. (Insert snarky comment about MIT’s CI-* requirements here.) It probably would have made a nice math circle talk for high schoolers but I felt somewhat awkward having to present it to a bunch of students who were clearly older than me.

## 1. Preliminaries

### 1.1. Modular arithmetic

In middle school you might have encountered questions such as

Exercise 1

What is ${3^{2016} \pmod{10}}$?

You could answer such questions by listing out ${3^n}$ for small ${n}$ and then finding a pattern, in this case of period ${4}$. However, for large moduli this “brute-force” approach can be time-consuming.

Fortunately, it turns out that one can predict the period in advance.

Theorem 2 (Euler’s little theorem)

1. Let ${\gcd(a,n) = 1}$. Then ${a^{\phi(n)} \equiv 1 \pmod n}$.
2. (Fermat) If ${p}$ is a prime, then ${a^p \equiv a \pmod p}$ for every ${a}$.

Proof: Part (a) is a special case of Lagrange’s Theorem: if ${G}$ is a finite group and ${g \in G}$, then ${g^{|G|}}$ is the identity element. Now select ${G = (\mathbb Z/n\mathbb Z)^\times}$. Part (b) is the case ${n=p}$. $\Box$

Thus, in the middle school problem we know in advance that ${3^4 \equiv 1 \pmod{10}}$ because ${\phi(10) = 4}$. This bound is sharp for primes:

Theorem 3 (Primitive roots)

For every ${p}$ prime there’s a ${g \pmod p}$ such that ${g^{p-1} \equiv 1 \pmod p}$ but ${g^{k} \not\equiv 1 \pmod p}$ for any ${k < p-1}$. (Hence ${(\mathbb Z/p\mathbb Z)^\times \cong \mathbb Z/(p-1)}$.)

For a proof, see the last exercise of my orders handout.

We will define the following anyways:

Definition 4

We say an integer ${n}$ (thought of as an exponent) annihilates the prime ${p}$ if

• ${a^n \equiv 1 \pmod p}$ for every ${a \not\equiv 0 \pmod p}$,
• or equivalently, ${p-1 \mid n}$.

Theorem 5 (All/nothing)

Suppose an exponent ${n}$ does not annihilate the prime ${p}$. Then more than ${\frac{1}{2} p}$ of ${x \pmod p}$ satisfy ${x^n \not\equiv 1 \pmod p}$.

Proof: Much stronger result is true: in ${x^n \equiv 1 \pmod p}$ then ${x^{\gcd(n,p-1)} \equiv 1 \pmod p}$. $\Box$

### 1.2. Repeated Exponentiation

Even without the previous facts, one can still do:

Theorem 6 (Repeated exponentation)

Given ${x}$ and ${n}$, one can compute ${x^n \pmod N}$ with ${O(\log n)}$ multiplications mod ${N}$.

The idea is that to compute ${x^{600} \pmod N}$, one just multiplies ${x^{512+64+16+8}}$. All the ${x^{2^k}}$ can be computed in ${k}$ steps, and ${k \le \log_2 n}$.

### 1.3. Chinese remainder theorem

In the middle school problem, we might have noticed that to compute ${3^{2016} \pmod{10}}$, it suffices to compute it modulo ${5}$, because we already know it is odd. More generally, to understand ${\pmod n}$ it suffices to understand ${n}$ modulo each of its prime powers.

The formal statement, which we include for completeness, is:

Theorem 7 (Chinese remainder theorem)

Let ${p_1}$, ${p_2}$, \dots, ${p_m}$ be distinct primes, and ${e_i \ge 1}$ integers. Then there is a ring isomorphism given by the natural projection

$\displaystyle \mathbb Z/n \rightarrow \prod_{i=1}^m \mathbb Z/p_i^{e_i}.$

In particular, a random choice of ${x \pmod n}$ amounts to a random choice of ${x}$ mod each prime power.

For an example, in the following table we see the natural bijection between ${x \pmod{15}}$ and ${(x \pmod 3, x \pmod 5)}$.

$\displaystyle \begin{array}{c|cc} x \pmod{15} & x \pmod{3} & x \pmod{5} \\ \hline 0 & 0 & 0 \\ 1 & 1 & 1 \\ 2 & 2 & 2 \\ 3 & 0 & 3 \\ 4 & 1 & 4 \\ 5 & 2 & 0 \\ 6 & 0 & 1 \\ 7 & 1 & 2 \end{array} \quad \begin{array}{c|cc} x \pmod{15} & x \pmod{3} & x \pmod{5} \\ \hline 8 & 2 & 3 \\ 9 & 0 & 4 \\ 10 & 1 & 0 \\ 11 & 2 & 1 \\ 12 & 0 & 2 \\ 13 & 1 & 3 \\ 14 & 2 & 4 \\ && \end{array}$

## 2. The RSA algorithm

This simple number theory is enough to develop the so-called RSA algorithm. Suppose Alice wants to send Bob a message ${M}$ over an insecure channel. They can do so as follows.

• Bob selects integers ${d}$, ${e}$ and ${N}$ (with ${N}$ huge) such that ${N}$ is a semiprime and

$\displaystyle de \equiv 1 \pmod{\phi(N)}.$

• Bob publishes both the number ${N}$ and ${e}$ (the public key) but keeps the number ${d}$ secret (the private key).
• Alice sends the number ${X = M^e \pmod N}$ across the channel.
• Bob computes

$\displaystyle X^d \equiv M^{de} \equiv M^1 \equiv M \pmod N$

and hence obtains the message ${M}$.

In practice, the ${N}$ in RSA is at least ${2000}$ bits long.

The trick is that an adversary cannot compute ${d}$ from ${e}$ and ${N}$ without knowing the prime factorization of ${N}$. So the security relies heavily on the difficulty of factoring.

Remark 8

It turns out that we basically don’t know how to factor large numbers ${N}$: the best known classical algorithms can factor an ${n}$-bit number in

$\displaystyle O\left( \exp\left( \frac{64}{9} n \log(n)^2 \right)^{1/3} \right)$

time (“general number field sieve”). On the other hand, with a quantum computer one can do this in ${O\left( n^2 \log n \log \log n \right)}$ time.

## 3. Primality Testing

Main question: if we can’t factor a number ${n}$ quickly, can we at least check it’s prime?

In what follows, we assume for simplicity that ${n}$ is squarefree, i.e. ${n = p_1 p_2 \dots p_k}$ for distinct primes ${p_k}$, This doesn’t substantially change anything, but it makes my life much easier.

### 3.1. Co-RP

Here is the goal: we need to show there is a random algorithm ${A}$ which does the following.

• If ${n}$ is composite then
• More than half the time ${A}$ says “definitely composite”.
• Occasionally, ${A}$ says “possibly prime”.
• If ${n}$ is prime, ${A}$ always says “possibly prime”.

If there is a polynomial time algorithm ${A}$ that does this, we say that PRIMES is in Co-RP. Clearly, this is a very good thing to be true!

### 3.2. Fermat

One idea is to try to use the converse of Fermat’s little theorem: given an integer ${n}$, pick a random number ${x \pmod n}$ and see if ${x^{n-1} \equiv 1 \pmod n}$. (We compute using repeated exponentiation.) If not, then we know for sure ${n}$ is not prime, and we call ${x}$ a Fermat witness modulo ${n}$.

How good is this test? For most composite ${n}$, pretty good:

Proposition 9

Let ${n}$ be composite. Assume that there is a prime ${p \mid n}$ such that ${n-1}$ does not annihilate ${p}$. Then over half the numbers mod ${n}$ are Fermat witnesses.

Proof: Apply the Chinese theorem then the “all-or-nothing” theorem. $\Box$
Unfortunately, if ${n}$ doesn’t satisfy the hypothesis, then all the ${\gcd(x,n) = 1}$ satisfy ${x^{n-1} \equiv 1 \pmod n}$!

Are there such ${n}$ which aren’t prime? Such numbers are called Carmichael numbers, but unfortunately they exist, the first one is ${561 = 3 \cdot 11 \cdot 17}$.

Remark 10

For ${X \gg 1}$, there are more than ${X^{1/3}}$ Carmichael numbers at most ${X}$.

Thus these numbers are very rare, but they foil the Fermat test.

Exercise 11

Show that a Carmichael number is not a semiprime.

### 3.3. Rabin-Miller

Fortunately, we can adapt the Fermat test to cover Carmichael numbers too. It comes from the observation that if ${n}$ is prime, then ${a^2 \equiv 1 \pmod n \implies a \equiv \pm 1 \pmod n}$.

So let ${n-1 = 2^s t}$, where ${t}$ is odd. For example, if ${n = 561}$ then ${560 = 2^4 \cdot 35}$. Then we compute ${x^t}$, ${x^{2t}}$, \dots, ${x^{n-1}}$. For example in the case ${n=561}$ and ${x=245}$:

$\displaystyle \begin{array}{c|r|rrr} & \mod 561 & \mod 3 & \mod 11 & \mod 17 \\ \hline x & 245 & -1 & 3 & 7 \\ \hline x^{35} & 122 & -1 & \mathbf 1 & 3 \\ x^{70} & 298 & \mathbf 1 & 1 & 9 \\ x^{140} & 166 & 1 & 1 & -4 \\ x^{280} & 67 & 1 & 1 & -1 \\ x^{560} & 1 & 1 & 1 & \mathbf 1 \end{array}$

And there we have our example! We have ${67^2 \equiv 1 \pmod{561}}$, so ${561}$ isn’t prime.

So the Rabin-Miller test works as follows:

• Given ${n}$, select a random ${x}$ and compute powers of ${x}$ as in the table.
• If ${x^{n-1} \not\equiv 1}$, stop, ${n}$ is composite (Fermat test).
• If ${x^{n-1} \equiv 1}$, see if the entry just before the first ${1}$ is ${-1}$. If it isn’t then we say ${x}$ is a RM-witness and ${n}$ is composite.
• Otherwise, ${n}$ is “possibly prime”.

How likely is probably?

Theorem 12

If ${n}$ is Carmichael, then over half the ${x \pmod n}$ are RM witnesses.

Proof: We sample ${x \pmod n}$ randomly again by looking modulo each prime (Chinese theorem). By the theorem on primitive roots, show that the probability the first ${-1}$ appears in any given row is ${\le \frac{1}{2}}$. This implies the conclusion. $\Box$

Exercise 13

Improve the ${\frac{1}{2}}$ in the problem to ${\frac34}$ by using the fact that Carmichael numbers aren’t semiprime.

### 3.4. AKS

In August 6, 2002, it was in fact shown that PRIMES is in P, using the deterministic AKS algorithm. However, in practice everyone still uses Miller-Rabin since the implied constants for AKS runtime are large.