# Some Thoughts on Olympiad Material Design

(This is a bit of a follow-up to the solution reading post last month. Spoiler warnings: USAMO 2014/6, USAMO 2012/2, TSTST 2016/4, and hints for ELMO 2013/1, IMO 2016/2.)

I want to say a little about the process which I use to design my olympiad handouts and classes these days (and thus by extension the way I personally think about problems). The short summary is that my teaching style is centered around showing connections and recurring themes between problems.

Now let me explain this in more detail.

## 1. Main ideas

Solutions to olympiad problems can look quite different from one another at a surface level, but typically they center around one or two main ideas, as I describe in my post on reading solutions. Because details are easy to work out once you have the main idea, as far as learning is concerned you can more or less throw away the details and pay most of your attention to main ideas.

Thus whenever I solve an olympiad problem, I make a deliberate effort to summarize the solution in a few sentences, such that I basically know how to do it from there. I also make a deliberate effort, whenever I write up a solution in my notes, to structure it so that my future self can see all the key ideas at a glance and thus be able to understand the general path of the solution immediately.

The example I’ve previously mentioned is USAMO 2014/6.

Example 1 (USAMO 2014, Gabriel Dospinescu)

Prove that there is a constant ${c>0}$ with the following property: If ${a, b, n}$ are positive integers such that ${\gcd(a+i, b+j)>1}$ for all ${i, j \in \{0, 1, \dots, n\}}$, then

$\displaystyle \min\{a, b\}> (cn)^n.$

If you look at any complete solution to the problem, you will see a lot of technical estimates involving ${\zeta(2)}$ and the like. But the main idea is very simple: “consider an ${N \times N}$ table of primes and note the small primes cannot adequately cover the board, since ${\sum p^{-2} < \frac{1}{2}}$”. Once you have this main idea the technical estimates are just the grunt work that you force yourself to do if you’re a contestant (and don’t do if you’re retired like me).
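As a quick numerical sanity check of the inequality in that main idea, here is a minimal Python sketch that sums ${p^{-2}}$ over the primes below a cutoff. The cutoff `10**5` and the helper name `primes_below` are my own choices for illustration. A partial sum only illustrates the bound (the tail still has to be estimated to prove it).

```python
def primes_below(limit):
    """Sieve of Eratosthenes: return all primes p < limit."""
    sieve = [True] * limit
    sieve[0:2] = [False, False]
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            # cross out the multiples of p, starting from p*p
            sieve[p * p::p] = [False] * len(range(p * p, limit, p))
    return [p for p, is_prime in enumerate(sieve) if is_prime]

partial_sum = sum(1 / (p * p) for p in primes_below(10 ** 5))
print(partial_sum)  # stays below 1/2
```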

Thus the study of olympiad problems is reduced to the study of main ideas behind these problems.

## 2. Taxonomy

So how do we come up with the main ideas? Of course I won’t be able to answer this question completely, because therein lies most of the difficulty of olympiads.

But I do have some progress in this way. It comes down to seeing how main ideas are similar to each other. I spend a lot of time trying to classify the main ideas into categories or themes, based on how similar they feel to one another. If I see one theme pop up over and over, then I can make it into a class.

I think olympiad taxonomy is severely underrated, and generally not done correctly. The status quo is that people do bucket sorts based on the particular technical details which are present in the problem. This is correlated with the main ideas, but the two do not always coincide.

An example where technical sort works okay is Euclidean geometry. Here is a simple example: harmonic bundles in projective geometry. As I explain in my book, there are a few “basic” configurations involved:

• Midpoints and parallel lines
• The Ceva / Menelaus configuration
• Harmonic quadrilateral / symmedian configuration
• Apollonian circle (right angle and bisectors)

(For a reference, see Lemmas 2, 4, 5 and Exercise 0 here.) Thus from experience, any time I see one of these pictures inside the current diagram, I think to myself that “this problem feels projective”; and if there is a way to do so I try to use harmonic bundles on it.

An example where technical sort fails is the “pigeonhole principle”. A typical problem in such a class looks something like USAMO 2012/2.

Example 2 (USAMO 2012, Gregory Galperin)

A circle is divided into congruent arcs by ${432}$ points. The points are colored in four colors such that some ${108}$ points are colored Red, some ${108}$ points are colored Green, some ${108}$ points are colored Blue, and the remaining ${108}$ points are colored Yellow. Prove that one can choose three points of each color in such a way that the four triangles formed by the chosen points of the same color are congruent.

It’s true that the official solution uses the words “pigeonhole principle” but that is not really the heart of the matter; the key idea is that you consider all possible rotations and count the number of incidences. (In any case, such calculations are better done using expected value anyways.)
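To make the "count incidences over all rotations" idea concrete, here is a small Python sketch for a single pair of colors. The random coloring and the seed are assumptions made only so the example runs; the count itself does not depend on them. Each (red, green) pair of points is matched by exactly one rotation, so the overlaps over all ${432}$ rotations add up to ${108^2}$, and some nonzero rotation must therefore match at least ${28}$ points.

```python
import random

N, PER_COLOR = 432, 108

random.seed(0)  # arbitrary seed, only to make the run reproducible
points = list(range(N))
random.shuffle(points)
red = set(points[:PER_COLOR])
green = set(points[PER_COLOR:2 * PER_COLOR])

def overlap(shift):
    """Number of red points that land on a green point after rotating by `shift` steps."""
    return sum((r + shift) % N in green for r in red)

overlaps = [overlap(s) for s in range(N)]
total = sum(overlaps)     # each (red, green) pair is matched by exactly one shift
best = max(overlaps[1:])  # shift 0 contributes nothing, since the color classes are disjoint
print(total, best)
```

Since ${108^2 = 11664 > 27 \cdot 431}$, the average over the ${431}$ nonzero rotations exceeds ${27}$, which is where the ${28}$ comes from.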

Now why is taxonomy a good thing for learning and teaching? The reason is that building connections and seeing similarities is most easily done by simultaneously presenting several related problems. I’ve actually mentioned this already in a different blog post, but let me give the demonstration again.

Suppose I wrote down the following:

$\displaystyle \begin{array}{lll} A1 & B11 & C8 \\ A9 & B44 & C27 \\ A49 & B33 & C343 \\ A16 & B99 & C1 \\ A25 & B22 & C125 \end{array}$

You can tell what each of the ${A}$‘s, ${B}$‘s, ${C}$‘s have in common by looking for a few moments. But what happens if I intertwine them?

$\displaystyle \begin{array}{lllll} B11 & C27 & C343 & A1 & A9 \\ C125 & B33 & A49 & B44 & A25 \\ A16 & B99 & B22 & C8 & C1 \end{array}$

This is the same information, but now you have to work much harder to notice the association between the letters and the numbers they’re next to.

This is why, if you are an olympiad student, I strongly encourage you to keep a journal or blog of the problems you’ve done. Solving olympiad problems takes lots of time and so it’s worth it to spend at least a few minutes jotting down the main ideas. And once you have enough of these, you can start to see new connections between problems you haven’t seen before, rather than being confined to thinking about individual problems in isolation. (Additionally, it means you will never have to redo problems to which you forgot the solution; learn from my mistake here.)

## 3. Ten buckets of geometry

I want to elaborate more on geometry in general. These days, if I see a solution to a Euclidean geometry problem, then I mentally store the problem and solution into one (or more) buckets. I can even tell you what my buckets are:

1. Direct angle chasing
2. Power of a point / radical axis
3. Homothety, similar triangles, ratios
4. Recognizing some standard configuration (see Yufei for a list)
5. Doing some length calculations
6. Complex numbers
7. Barycentric coordinates
8. Inversion
9. Harmonic bundles or pole/polar and homography
10. Spiral similarity, Miquel points

which my dedicated fans probably recognize as the ten chapters of my textbook. (Problems may also fall in more than one bucket if for example they are difficult and require multiple key ideas, or if there are multiple solutions.)

Now whenever I see a new geometry problem, the diagram will often “feel” similar to problems in a certain bucket. Exactly what I mean by “feel” is hard to formalize — it’s a certain gut feeling that you pick up by doing enough examples. There are some things you can say, such as “problems which feature a central circle and feet of altitudes tend to fall in bucket 6”, or “problems which only involve incidence always fall in bucket 9”. But it seems hard to come up with an exhaustive list of hard rules that will do better than human intuition.

## 4. How do problems feel?

But as I said in my post on reading solutions, there are deeper lessons to teach than just technical details.

For examples of themes on opposite ends of the spectrum, let’s move on to combinatorics. Geometry is quite structured and so the themes in the main ideas tend to translate to specific theorems used in the solution. Combinatorics is much less structured and many of the themes I use in combinatorics cannot really be formalized. (Consequently, since everyone else seems to mostly teach technical themes, several of the combinatorics themes I teach are idiosyncratic, and to my knowledge are not taught by anyone else.)

For example, one of the unusual themes I teach is called Global. It’s about the idea that to solve a problem, you can just kind of “add up everything at once”, for example using linearity of expectation, or by double-counting, or whatever. In particular these kinds of approach ignore the “local” details of the problem. It’s hard to make this precise, so I’ll just give two recent examples.

Example 3 (ELMO 2013, Ray Li)

Let ${a_1,a_2,\dots,a_9}$ be nine real numbers, not necessarily distinct, with average ${m}$. Let ${A}$ denote the number of triples ${1 \le i < j < k \le 9}$ for which ${a_i + a_j + a_k \ge 3m}$. What is the minimum possible value of ${A}$?

Example 4 (IMO 2016)

Find all integers ${n}$ for which each cell of ${n \times n}$ table can be filled with one of the letters ${I}$, ${M}$ and ${O}$ in such a way that:

• In each row and column, one third of the entries are ${I}$, one third are ${M}$ and one third are ${O}$; and
• in any diagonal, if the number of entries on the diagonal is a multiple of three, then one third of the entries are ${I}$, one third are ${M}$ and one third are ${O}$.

If you look at the solutions to these problems, they have the same “feeling” of adding everything up, even though the specific techniques are somewhat different (double-counting for the former, diagonals modulo ${3}$ for the latter). Nonetheless, my experience with problems similar to the former was immensely helpful for the latter, and it’s why I was able to solve the IMO problem.

## 5. Gaps

This perspective also explains why I’m relatively bad at functional equations. There are some things I can say that may be useful (see my handouts), but much of the time these are just technical tricks. (When sorting functional equations in my head, I have a bucket called “standard fare” meaning that you “just do work”; as far as I can tell this bucket is pretty useless.) I always feel stupid teaching functional equations, because I never have many good insights to share.

Part of the reason is that functional equations often don’t have a main idea at all. Consequently it’s hard for me to do useful taxonomy on them.

Then sometimes you run into something like the windmill problem, the solution of which is fairly “novel”, not being similar to problems that come up in training. I have yet to figure out a good way to train students to be able to solve windmill-like problems.

## 6. Surprise

I’ll close by mentioning one common way I come up with a theme.

Sometimes I will run across an olympiad problem ${P}$ which I solve quickly, and think should be very easy, and yet once I start grading ${P}$ I find that the scores are much lower than I expected. Since the way I solve problems is by drawing experience from similar previous problems, this must mean that I’ve subconsciously found a general framework to solve problems like ${P}$, which is not obvious to my students yet. So if I can put my finger on what that framework is, then I have something new to say.

The most recent example I can think of when this happened was TSTST 2016/4 which was given last June (and was also a very elegant problem, at least in my opinion).

Example 5 (TSTST 2016, Linus Hamilton)

Let ${n > 1}$ be a positive integer. Prove that we must apply the Euler ${\varphi}$ function at least ${\log_3 n}$ times before reaching ${1}$.

I solved this problem very quickly when we were drafting the TSTST exam, figuring out the solution while walking to dinner. So I was quite surprised when I looked at the scores for the problem and found out that empirically it was not that easy.

After I thought about this, I have a new tentative idea. You see, when doing this problem I really was thinking about “what does this ${\varphi}$ operation do?”. You can think of ${n}$ as an infinite tuple

$\displaystyle \left(\nu_2(n), \nu_3(n), \nu_5(n), \nu_7(n), \dots \right)$

of prime exponents. Then the ${\varphi}$ can be thought of as an operation which takes each nonzero component, decreases it by one, and then adds some particular vector back. For example, if ${\nu_7(n) > 0}$ then ${\nu_7}$ is decreased by one and each of ${\nu_2(n)}$ and ${\nu_3(n)}$ are increased by one. In any case, if you look at this behavior for long enough you will see that the ${\nu_2}$ coordinate is a natural way to “track time” in successive ${\varphi}$ operations; once you figure this out, getting the bound of ${\log_3 n}$ is quite natural. (Details left as exercise to reader.)
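For readers who want to watch the exponent tuple evolve, here is a minimal Python sketch. The function names and the starting value ${n = 245}$ are my own choices for illustration, and the final loop only checks the bound numerically on a small range; it is not a proof.

```python
def factorize(n):
    """Return {prime: exponent} for n >= 1 by trial division."""
    factors = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

def phi(n):
    """Euler's totient via phi(n) = product of p^(e-1) * (p-1)."""
    result = 1
    for p, e in factorize(n).items():
        result *= p ** (e - 1) * (p - 1)
    return result

def steps_to_one(n):
    """Number of applications of phi needed to reach 1 from n."""
    steps = 0
    while n > 1:
        n = phi(n)
        steps += 1
    return steps

# Print the exponent tuple at each step, starting from 245 = 5 * 7^2.
n = 245
while n > 1:
    print(n, factorize(n))
    n = phi(n)

# Numerical check of the claimed bound 3^k >= n on a small range.
print(all(3 ** steps_to_one(m) >= m for m in range(2, 3000)))
```

Watching the exponent of ${2}$ in the printed dictionaries is a decent way to get a feel for the “track time” remark.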

Now when I read through the solutions, I found that many of them had not really tried to think of the problem in such a “structured” way, and had tried to directly solve it by, for example, trying to prove ${\varphi(n) \ge n/3}$ (which is false) or something similar to this. I realized that had the students just ignored the task “prove ${n \le 3^k}$” and spent some time getting a better understanding of the ${\varphi}$ structure, they would have had a much better chance at solving the problem. Why had I known that structural thinking would be helpful? I couldn’t quite explain it, but it had something to do with the fact that the “main object” of the question was “set in stone”; there were no “degrees of freedom” in it, and it was concrete enough that I felt like I could understand it. Once I understood how multiple ${\varphi}$ operations behaved, the bit about ${\log_3 n}$ almost served as an “answer extraction” mechanism.
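The claim that ${\varphi(n) \ge n/3}$ fails is easy to confirm by brute force. Here is a minimal sketch; the search range `200` is an arbitrary choice of mine.

```python
from math import gcd

def phi(n):
    """Euler's totient by direct count (fine for small n)."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

counterexamples = [n for n in range(1, 200) if 3 * phi(n) < n]
print(counterexamples[:5])  # the first entry is 30, where phi(30) = 8 < 10
```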

These thoughts led to the recent development of a class which I named Rigid, which is all about problems where the point is not to immediately try to prove what the question asks for, but to first step back and understand completely how a particular rigid structure (like the ${\varphi}$ in this problem) behaves, and to then solve the problem using this understanding.

# Holomorphic Logarithms and Roots

In this post we’ll make sense of a holomorphic square root and logarithm. Wrote this up because I was surprised how hard it was to find a decent complete explanation.

Let ${f : U \rightarrow \mathbb C}$ be a holomorphic function. A holomorphic ${n}$th root of ${f}$ is a function ${g : U \rightarrow \mathbb C}$ such that ${f(z) = g(z)^n}$ for all ${z \in U}$. A logarithm of ${f}$ is a function ${g : U \rightarrow \mathbb C}$ such that ${f(z) = e^{g(z)}}$ for all ${z \in U}$. The main question we’ll try to figure out is: when do these exist? In particular, what if ${f = \mathrm{id}}$?

## 1. Motivation: Square Root of a Complex Number

To start us off, can we define ${\sqrt z}$ for any complex number ${z}$?

The first obvious problem that comes up is that for any ${z \neq 0}$, there are two numbers ${w}$ such that ${w^2 = z}$. How can we pick one to use? For our ordinary square root function, we had a notion of “positive”, and so we simply took the positive root.

Let’s expand on this: given ${ z = r \left( \cos\theta + i \sin\theta \right) }$ (here ${r \ge 0}$) we should take the root to be

$\displaystyle w = \sqrt{r} \left( \cos \alpha + i \sin \alpha \right)$

such that ${2\alpha \equiv \theta \pmod{2\pi}}$; there are two choices for ${\alpha \pmod{2\pi}}$, differing by ${\pi}$.

For complex numbers, we don’t have an obvious way to pick ${\alpha}$. Nonetheless, perhaps we can also get away with an arbitrary distinction: let’s see what happens if we just choose the ${\alpha}$ with ${-\frac{1}{2}\pi < \alpha \le \frac{1}{2}\pi}$.

Pictured below are some points (in red) and their images (in blue) under this “right-half” square root. The condition on ${\alpha}$ means we are forcing the blue points to lie on the right-half plane.

Here, ${w_i^2 = z_i}$ for each ${i}$, and we are constraining the ${w_i}$ to lie in the right half of the complex plane. We see there is an obvious issue: there is a big discontinuity near the points ${z_5}$ and ${z_7}$! The nearby point ${w_6}$ has been mapped very far away. This discontinuity occurs since the points on the negative real axis are at the “boundary”. For example, given ${-4}$, we send it to ${2i}$, but we have hit the boundary: in our interval ${-\frac{1}{2}\pi < \alpha \le \frac{1}{2}\pi}$, we are at the very right edge.

The negative real axis that we must not touch is what we will later call a branch cut, but for now I call it a ray of death. It is a warning to the red points: if you cross this line, you will die! However, if we move the red circle just a little upwards (so that it misses the negative real axis) this issue is avoided entirely, and we get what seems to be a “nice” square root.

In fact, the ray of death is fairly arbitrary: it is the set of “boundary issues” that arose when we picked ${-\frac{1}{2}\pi < \alpha \le \frac{1}{2}\pi}$. Suppose we instead insisted on the interval ${0 \le \alpha < \pi}$; then the ray of death would be the positive real axis instead. The earlier circle we had now works just fine.

What we see is that picking a particular ${\alpha}$-interval leads to a different set of edge cases, and hence a different ray of death. The only thing these rays have in common is their starting point of zero. In other words, given a red circle and a restriction of ${\alpha}$, I can make a nice “square rooted” blue circle as long as the ray of death misses it.
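The dependence of the ray of death on the ${\alpha}$-interval can be seen numerically. Below is a minimal Python sketch; the function name `sqrt_in_window` is hypothetical, and for simplicity it always uses windows of the form ${(s, s + \pi]}$, which differs from the interval ${0 \le \alpha < \pi}$ above only for points on the ray itself.

```python
import cmath
import math

def sqrt_in_window(z, start):
    """Square root sqrt(r) * e^{i*alpha} of z with alpha in (start, start + pi]."""
    r, theta = cmath.polar(z)
    alpha = theta / 2
    # the two candidate angles differ by pi, so shift into the window
    while alpha <= start:
        alpha += math.pi
    while alpha > start + math.pi:
        alpha -= math.pi
    return cmath.rect(math.sqrt(r), alpha)

just_above = complex(-4, 0.01)   # two nearby points on either side
just_below = complex(-4, -0.01)  # of the negative real axis

# Window (-pi/2, pi/2]: the roots land near 2i and -2i, far apart.
print(sqrt_in_window(just_above, -math.pi / 2), sqrt_in_window(just_below, -math.pi / 2))

# Window (0, pi]: the ray of death is now the positive real axis,
# and both roots land near 2i.
print(sqrt_in_window(just_above, 0), sqrt_in_window(just_below, 0))
```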

So, what exactly is going on?

## 2. Square Roots of Holomorphic Functions

To get a picture of what’s happening, we would like to consider a more general problem: let ${f: U \rightarrow \mathbb C}$ be holomorphic. Then we want to decide whether there is a ${g : U \rightarrow \mathbb C}$ such that

$\displaystyle f(z) = g(z)^2.$

Our previous discussion when ${f = \mathrm{id}}$ tells us we cannot hope to achieve this for ${U = \mathbb C}$; there is a “half-ray” which causes problems. However, there are certainly functions ${f : \mathbb C \rightarrow \mathbb C}$ such that a ${g}$ exists. As a simplest example, ${f(z) = z^2}$ should definitely have a square root!

Now let’s see if we can fudge together a square root. Earlier, what we did was try to specify a rule to force one of the two choices at each point. This is unnecessarily strict. Perhaps we can do something like the following: start at a point ${z_0 \in U}$, pick a square root ${w_0}$ of ${f(z_0)}$, and then try to “fudge” from there the square roots of the other points. What do I mean by fudge? Well, suppose ${z_1}$ is a point very close to ${z_0}$, and we want to pick a square root ${w_1}$ of ${f(z_1)}$. While there are two choices, we also would expect ${w_0}$ to be close to ${w_1}$. Unless we are highly unlucky, this should tell us which choice of ${w_1}$ to pick. (Stupid concrete example: if I have taken the square root ${-4.12i}$ of ${-17}$ and then ask you to continue this square root to ${-16}$, which sign should you pick for ${\pm 4i}$?)

There are two possible ways we could get unlucky in the scheme above: first, if ${w_0 = 0}$, then we’re sunk. But even if we avoid that, we have to worry that we are in a situation where we run around a full loop in the complex plane, and then find that our continuous perturbation has left us in a different place than we started. For concreteness, consider the following situation, again with ${f = \mathrm{id}}$:

We started at the point ${z_0}$, with one of its square roots as ${w_0}$. We then wound a full red circle around the origin, only to find that at the end of it, the blue arc is at a different place than where it started!

The interval construction from earlier doesn’t work either: no matter how we pick the interval for ${\alpha}$, any ray of death must hit our red circle. The problem somehow lies with the fact that we have enclosed the very special point ${0}$.
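The “make it up as you go” procedure is easy to simulate. Here is a minimal Python sketch that walks along a discretized circle and, at each step, keeps whichever of the two square roots is closer to the previous one. The helper names and the step count are my own assumptions for illustration.

```python
import cmath
import math

def circle(center, radius, steps=1000):
    """Points of a counterclockwise circle, starting and ending at center + radius."""
    return [center + radius * cmath.exp(2j * math.pi * k / steps)
            for k in range(steps + 1)]

def continue_sqrt(f, path):
    """Follow a square root of f along path, always picking the nearer of the two roots."""
    w = cmath.sqrt(f(path[0]))
    for z in path[1:]:
        r = cmath.sqrt(f(z))
        w = r if abs(r - w) <= abs(r + w) else -r
    return w

def identity(z):
    return z

def square(z):
    return z * z

print(continue_sqrt(identity, circle(0, 1)))  # starts at w0 = 1, returns near -1
print(continue_sqrt(identity, circle(3, 1)))  # starts at w0 = 2, returns near 2
print(continue_sqrt(square, circle(0, 1)))    # starts at w0 = 1, returns near 1
```

Only the first loop, which encloses the origin with ${f = \mathrm{id}}$, comes back to the wrong root.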

Nevertheless, we know that if we take ${f(z) = z^2}$, then we don’t run into any problems with our “make it up as you go” procedure. So, what exactly is going on?

## 3. Covering Projections

By now, if you have read the part on algebraic topology, this should all seem very strangely familiar. The “fudging” procedure exactly describes the idea of a lifting.

More precisely, recall that there is a covering projection

$\displaystyle (-)^2 : \mathbb C \setminus \{0\} \rightarrow \mathbb C \setminus \{0\}.$

Let ${V = \left\{ z \in U \mid f(z) \neq 0 \right\}}$. For ${z \in U \setminus V}$, we already have the square root ${g(z) = \sqrt{f(z)} = \sqrt 0 = 0}$. So the burden is completing ${g : V \rightarrow \mathbb C}$.

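Then essentially, what we are trying to do is construct a lifting ${g : V \rightarrow E}$ of ${f : V \rightarrow B}$ along the covering projection ${p : E \rightarrow B}$, where ${E = B = \mathbb C \setminus \{0\}}$ and ${p = (-)^2}$, so that ${p \circ g = f}$. Our map ${p}$ can be described as “winding around twice”. From algebraic topology, we now know that this lifting exists if and only if

$\displaystyle f_\ast(\pi_1(V)) \subseteq p_\ast(\pi_1(E)),$

that is, the image of ${\pi_1(V)}$ under ${f}$ is a subset of the image of ${\pi_1(E)}$ under ${p}$. Since ${B}$ and ${E}$ are both punctured planes, we can identify them with ${S^1}$ up to homotopy.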

Ques 1

Show that the image under ${p}$ is exactly ${2\mathbb Z}$ once we identify ${\pi_1(B) = \mathbb Z}$.

That means that for any loop ${\gamma}$ in ${V}$, we need ${f \circ \gamma}$ to have an even winding number around ${0 \in B}$. This amounts to

$\displaystyle \frac{1}{2\pi i} \oint_\gamma \frac{f'}{f} \; dz \in 2\mathbb Z$

since ${f}$ has no poles.

Replacing ${2}$ with ${n}$ and carrying over the discussion gives the first main result.

Theorem 2 (Existence of Holomorphic ${n}$th Roots)

Let ${f : U \rightarrow \mathbb C}$ be holomorphic. Then ${f}$ has a holomorphic ${n}$th root if and only if

$\displaystyle \frac{1}{2\pi i}\oint_\gamma \frac{f'}{f} \; dz \in n\mathbb Z$

for every contour ${\gamma}$ in ${U}$.
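The integral in the theorem can be approximated numerically, which gives a quick way to test examples. Here is a minimal Python sketch restricted to circular contours; the function name and the number of sample points are assumptions of mine, and the derivative is supplied by hand.

```python
import cmath
import math

def log_derivative_integral(f, df, center, radius, steps=4000):
    """Approximate (1 / (2*pi*i)) * integral of f'/f over a counterclockwise circle."""
    total = 0j
    for k in range(steps):
        t = 2 * math.pi * k / steps
        z = center + radius * cmath.exp(1j * t)
        dz = 1j * radius * cmath.exp(1j * t) * (2 * math.pi / steps)
        total += df(z) / f(z) * dz
    return total / (2j * math.pi)

# f(z) = z on the unit circle: approximately 1, which is odd,
# so no square root exists on a domain containing this loop.
print(log_derivative_integral(lambda z: z, lambda z: 1, 0, 1))

# f(z) = z^2 on the same circle: approximately 2,
# consistent with the square root g(z) = z.
print(log_derivative_integral(lambda z: z * z, lambda z: 2 * z, 0, 1))

# f(z) = z on a circle that does not enclose the origin: approximately 0.
print(log_derivative_integral(lambda z: z, lambda z: 1, 3, 1))
```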

## 4. Complex Logarithms

The multivalued nature of the complex logarithm comes from the fact that

$\displaystyle \exp(z+2\pi i) = \exp(z).$

So if ${e^w = z}$, then any complex number ${w + 2\pi i k}$ is also a solution.

We can handle this in the same way as before: it amounts to a lifting of the following diagram. There is no longer a need to work with a separate ${V}$ since:

Ques 3

Show that if ${f}$ has any zeros then ${g}$ cannot possibly exist.

In fact, the map ${\exp : \mathbb C \rightarrow \mathbb C\setminus\{0\}}$ is a universal cover, since ${\mathbb C}$ is simply connected. Thus, ${p(\pi_1(\mathbb C))}$ is trivial. So in addition to being zero-free, ${f}$ cannot have any winding number around ${0 \in B}$ at all. In other words:

Theorem 4 (Existence of Logarithms)

Let ${f : U \rightarrow \mathbb C}$ be holomorphic. Then ${f}$ has a logarithm if and only if

$\displaystyle \frac{1}{2\pi i}\oint_\gamma \frac{f'}{f} \; dz = 0$

for every contour ${\gamma}$ in ${U}$.

## 5. Some Special Cases

The most common special case is

Corollary 5 (Nonvanishing Functions from Simply Connected Domains)

Let ${f : \Omega \rightarrow \mathbb C}$ be holomorphic, where ${\Omega}$ is simply connected. If ${f(z) \neq 0}$ for every ${z \in \Omega}$, then ${f}$ has both a logarithm and a holomorphic ${n}$th root.

Finally, let’s return to the question of ${f = \mathrm{id}}$ from the very beginning. What’s the best domain ${U}$ such that we can define ${\sqrt{-} : U \rightarrow \mathbb C}$? Clearly ${U = \mathbb C}$ cannot be made to work, but we can do almost as well. For note that the only zero of ${f = \mathrm{id}}$ is at the origin. Thus if we want to make a logarithm exist, all we have to do is make an incision in the complex plane that renders it impossible to make a loop around the origin. The usual choice is to delete the negative half of the real axis, our very first ray of death; we call this a branch cut, with branch point at ${0 \in \mathbb C}$ (the point which we cannot circle around). This gives

Theorem 6 (Branch Cut Functions)

There exist holomorphic functions

$\displaystyle \begin{aligned} \log &: \mathbb C \setminus (-\infty, 0] \rightarrow \mathbb C \\ \sqrt[n]{-} &: \mathbb C \setminus (-\infty, 0] \rightarrow \mathbb C \end{aligned}$

satisfying the obvious properties.

There are many possible choices of such functions (${n}$ choices for the ${n}$th root and infinitely many for ${\log}$); a choice of such a function is called a branch. So this is what is meant by a “branch” of a logarithm.

The principal branch is the “canonical” branch, analogous to the way we arbitrarily pick the positive branch to define ${\sqrt{-} : \mathbb R_{\ge 0} \rightarrow \mathbb R_{\ge 0}}$. For ${\log}$, we take the ${w}$ such that ${e^w = z}$ and the imaginary part of ${w}$ lies in ${(-\pi, \pi]}$ (since we can shift by integer multiples of ${2\pi i}$). Often, authors will write ${\text{Log } z}$ to emphasize this choice.
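Python’s standard library happens to implement this convention: `cmath.log` returns the principal value, with its branch cut along the negative real axis. The short sketch below (the sample point ${z = -1 + i}$ is an arbitrary choice) checks that shifting by multiples of ${2\pi i}$ gives other logarithms of the same ${z}$.

```python
import cmath
import math

z = complex(-1, 1)
w = cmath.log(z)  # principal value of the logarithm
print(w.imag)     # the argument of z, which is 3*pi/4 here

for k in (-1, 0, 1):
    other = w + 2j * math.pi * k  # another logarithm of the same z
    print(k, cmath.exp(other))    # each of these is approximately z again
```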

Example 7

Let ${U}$ be the complex plane minus the real interval ${[0,1]}$. Then the function ${U \rightarrow \mathbb C}$ by ${z \mapsto z(z-1)}$ has a holomorphic square root.
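
Here is a sketch of why Theorem 2 applies. Write ${\operatorname{wind}(\gamma, a)}$ for the winding number of a contour ${\gamma}$ around ${a}$ (notation I am introducing just for this computation). With ${f(z) = z(z-1)}$ we have

$\displaystyle \frac{1}{2\pi i}\oint_\gamma \frac{f'}{f} \; dz = \frac{1}{2\pi i}\oint_\gamma \left( \frac{1}{z} + \frac{1}{z-1} \right) dz = \operatorname{wind}(\gamma, 0) + \operatorname{wind}(\gamma, 1).$

Since ${\gamma}$ lies in ${U}$ it misses the connected set ${[0,1]}$, so the winding number is the same around every point of ${[0,1]}$. Hence the right-hand side equals ${2\operatorname{wind}(\gamma, 0) \in 2\mathbb Z}$.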

Corollary 8

A holomorphic function ${f : U \rightarrow \mathbb C}$ has a holomorphic ${n}$th root for all ${n \ge 1}$ if and only if it has a holomorphic logarithm.

# Facts about Lie Groups and Algebras

In Spring 2016 I was taking 18.757 Representations of Lie Algebras. Since I knew next to nothing about either Lie groups or algebras, I was forced to quickly learn about their basic facts and properties. These are the notes that I wrote up accordingly. Proofs of most of these facts can be found in standard textbooks, for example Kirillov.

## 1. Lie groups

Let ${K = \mathbb R}$ or ${K = \mathbb C}$, depending on taste.

Definition 1

A Lie group is a group ${G}$ which is also a ${K}$-manifold; the multiplication map ${G \times G \rightarrow G}$ (by ${(g_1, g_2) \mapsto g_1g_2}$) and the inversion map ${G \rightarrow G}$ (by ${g \mapsto g^{-1}}$) are required to be smooth.

A morphism of Lie groups is a map which is both a map of manifolds and a group homomorphism.

Throughout, we will let ${e \in G}$ denote the identity, or ${e_G}$ if we need further emphasis.

Note that in particular, every group ${G}$ can be made into a Lie group by endowing it with the discrete topology. This is silly, so we usually focus only on connected groups:

Proposition 2 (Reduction to connected Lie groups)

Let ${G}$ be a Lie group and ${G^0}$ the connected component of ${G}$ which contains ${e}$. Then ${G^0}$ is a normal subgroup, itself a Lie group, and the quotient ${G/G^0}$ has the discrete topology.

In fact, we can also reduce this to the study of simply connected Lie groups as follows.

Proposition 3 (Reduction to simply connected Lie groups)

If ${G}$ is connected, let ${\pi : \widetilde G \rightarrow G}$ be its universal cover. Then ${\widetilde G}$ is a Lie group, ${\pi}$ is a morphism of Lie groups, and ${\ker \pi \cong \pi_1(G)}$.

Here are some examples of Lie groups.

Example 4 (Examples of Lie groups)

• ${\mathbb R}$ under addition is a real one-dimensional Lie group.
• ${\mathbb C}$ under addition is a complex one-dimensional Lie group (and a two-dimensional real Lie group)!
• The unit circle ${S^1 \subseteq \mathbb C}$ is a real Lie group under multiplication.
• ${\text{GL }(n, K) \subset K^{\oplus n^2}}$ is a Lie group of dimension ${n^2}$. This example becomes important for representation theory: a representation of a Lie group ${G}$ is a morphism of Lie groups ${G \rightarrow \text{GL }(n, K)}$.
• ${\text{SL }(n, K) \subset \text{GL }(n, K)}$ is a Lie group of dimension ${n^2-1}$.

As geometric objects, Lie groups ${G}$ enjoy a huge amount of symmetry. For example, any neighborhood ${U}$ of ${e}$ can be “copied over” to any other point ${g \in G}$ by the natural map ${u \mapsto gu}$, which sends ${U}$ to the neighborhood ${gU}$ of ${g}$. There is another theorem worth noting, which is that:

Proposition 5

If ${G}$ is a connected Lie group and ${U}$ is a neighborhood of the identity ${e \in G}$, then ${U}$ generates ${G}$ as a group.

## 2. Haar measure

Recall the following result and its proof from representation theory:

Claim 6

For any finite group ${G}$, ${\mathbb C[G]}$ is semisimple; all finite-dimensional representations decompose into irreducibles.

Proof: Take a representation ${V}$ and equip it with an arbitrary inner form ${\left< -,-\right>_0}$. Then we can average it to obtain a new inner form

$\displaystyle \left< v, w \right> = \frac{1}{|G|} \sum_{g \in G} \left< gv, gw \right>_0$

which is ${G}$-invariant. Thus given a subrepresentation ${W \subseteq V}$ we can just take its orthogonal complement to decompose ${V}$. $\Box$

We would like to repeat this type of proof with Lie groups. In this case the notion ${\sum_{g \in G}}$ doesn’t make sense, so we want to replace it with an integral ${\int_{g \in G}}$ instead. In order to do this we use the following:

Theorem 7 (Haar measure)

Let ${G}$ be a Lie group. Then there exists a unique Radon measure ${\mu}$ (up to scaling) on ${G}$ which is left-invariant, meaning

$\displaystyle \mu(g \cdot S) = \mu(S)$

for any Borel subset ${S \subseteq G}$ and “translate” ${g \in G}$. This measure is called the (left) Haar measure.

Example 8 (Examples of Haar measures)

• The Haar measure on ${(\mathbb R, +)}$ is the standard Lebesgue measure which assigns ${1}$ to the closed interval ${[0,1]}$. Of course for any ${S}$, ${\mu(a+S) = \mu(S)}$ for ${a \in \mathbb R}$.
• The Haar measure on ${(\mathbb R \setminus \{0\}, \times)}$ is given by

$\displaystyle \mu(S) = \int_S \frac{1}{|t|} \; dt.$

In particular, ${\mu([a,b]) = \log(b/a)}$. One sees the invariance under multiplication of these intervals.

• Let ${G = \text{GL }(n, \mathbb R)}$. Then a Haar measure is given by

$\displaystyle \mu(S) = \int_S |\det(X)|^{-n} \; dX.$

• For the circle group ${S^1}$, consider ${S \subseteq S^1}$. We can define

$\displaystyle \mu(S) = \frac{1}{2\pi} \int_S d\varphi$

across complex arguments ${\varphi}$. The normalization factor of ${2\pi}$ ensures ${\mu(S^1) = 1}$.
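
The invariance claimed for the multiplicative group can be checked numerically. Here is a minimal sketch using a midpoint rule on intervals of positive reals; the function name `mu` mirrors the notation above, and the step count is an arbitrary choice of mine.

```python
import math

def mu(a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of 1/|t| over [a, b], with 0 < a < b."""
    h = (b - a) / steps
    return sum(h / abs(a + (i + 0.5) * h) for i in range(steps))

print(mu(1, 2), math.log(2))  # both approximately 0.6931
print(mu(5, 10))              # scaling [1, 2] by 5 leaves the measure unchanged
```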

Note that we have:

Corollary 9

If the Lie group ${G}$ is compact, there is a unique Haar measure with ${\mu(G) = 1}$.

This follows by just noting that if ${\mu}$ is a Radon measure on a compact space ${X}$, then ${\mu(X) < \infty}$. This now lets us deduce that

Corollary 10 (Compact Lie groups are semisimple)

${\mathbb C[G]}$ is semisimple for any compact Lie group ${G}$.

Indeed, we can now consider

$\displaystyle \left< v,w\right> = \int_G \left< g \cdot v, g \cdot w\right>_0 \; dg$

as we described at the beginning.
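
The averaging trick is easiest to see on a tiny finite example, as in Claim 6. The sketch below uses a hypothetical toy representation: the two-element group ${\{I, g\}}$ acting on ${\mathbb Q^2}$, with ${g}$ of order ${2}$ but not orthogonal for the standard dot product. All names are my own, and exact rational arithmetic is used so the equalities are literal.

```python
from fractions import Fraction

# Hypothetical toy example: G = {I, g}, where g squares to the identity.
I = ((1, 0), (0, 1))
g = ((1, 1), (0, -1))
G = [I, g]

def act(m, v):
    """Apply the 2x2 matrix m to the vector v."""
    return (m[0][0] * v[0] + m[0][1] * v[1], m[1][0] * v[0] + m[1][1] * v[1])

def dot(v, w):
    """The arbitrary starting form <v, w>_0 (here the standard dot product)."""
    return v[0] * w[0] + v[1] * w[1]

def averaged(v, w):
    """The averaged form (1/|G|) * sum over h in G of <hv, hw>_0."""
    return Fraction(sum(dot(act(h, v), act(h, w)) for h in G), len(G))

v, w = (1, 0), (0, 1)
print(dot(v, w), dot(act(g, v), act(g, w)))            # 0 and 1: not G-invariant
print(averaged(v, w), averaged(act(g, v), act(g, w)))  # 1/2 and 1/2: G-invariant

# W = span{(1, 0)} is a subrepresentation; u spans its complement for the averaged form.
u = (1, -2)
print(averaged((1, 0), u), act(g, u))  # 0 and (-1, 2), so g maps span{u} to itself
```

The complement of ${W}$ under the plain dot product, spanned by ${(0,1)}$, is not preserved by ${g}$, which is why the averaging step matters.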

## 3. The tangent space at the identity

In light of the previous comment about neighborhoods of ${e}$ generating ${G}$, we see that to get some information about the entire Lie group it actually suffices to just get “local” information of ${G}$ at the point ${e}$ (this is one formalization of the fact that Lie groups are super symmetric).

To do this one idea is to look at the tangent space. Let ${G}$ be an ${n}$-dimensional Lie group (over ${K}$) and consider ${\mathfrak g = T_eG}$ the tangent space to ${G}$ at the identity ${e \in G}$. Naturally, this is a ${K}$-vector space of dimension ${n}$. We call it the Lie algebra associated to ${G}$.

Example 11 (Lie algebras corresponding to Lie groups)

• ${(\mathbb R, +)}$ has a real Lie algebra isomorphic to ${\mathbb R}$.
• ${(\mathbb C, +)}$ has a complex Lie algebra isomorphic to ${\mathbb C}$.
• The unit circle ${S^1 \subseteq \mathbb C}$ has a real Lie algebra isomorphic to ${\mathbb R}$, which we think of as the “tangent line” at the point ${1 \in S^1}$.

Example 12 (${\mathfrak{gl}(n, K)}$)

Let’s consider ${\text{GL }(n, K) \subset K^{\oplus n^2}}$, an open subset of ${K^{\oplus n^2}}$. Its tangent space should just be an ${n^2}$-dimensional ${K}$-vector space. By identifying the components in the obvious way, we can think of this Lie algebra as just the set of all ${n \times n}$ matrices.

This Lie algebra goes by the notation ${\mathfrak{gl}(n, K)}$.

Example 13 (${\mathfrak{sl}(n, K)}$)

Recall ${\text{SL }(n, K) \subset \text{GL }(n, K)}$ is a Lie group of dimension ${n^2-1}$, hence its Lie algebra should have dimension ${n^2-1}$. To see what it is, let’s look at the special case ${n=2}$ first: then

$\displaystyle \text{SL }(2, K) = \left\{ \begin{pmatrix} a & b \\ c & d \end{pmatrix} \mid ad - bc = 1 \right\}.$

Viewing this as the level surface ${f = 1}$ of the polynomial ${f(a,b,c,d) = ad-bc}$ in ${K^{\oplus 4}}$, we compute

$\displaystyle \nabla f = \left< d, -c, -b, a \right>$

and in particular the tangent space to the identity matrix ${\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}}$ is given by the orthogonal complement of the gradient

$\displaystyle \nabla f (1,0,0,1) = \left< 1, 0, 0, 1 \right>.$

Hence the tangent plane can be identified with matrices satisfying ${a+d=0}$. In other words, we see

$\displaystyle \mathfrak{sl}(2, K) = \left\{ T \in \mathfrak{gl}(2, K) \mid \text{Tr } T = 0 \right\}.$

By repeating this example in greater generality, we discover

$\displaystyle \mathfrak{sl}(n, K) = \left\{ T \in \mathfrak{gl}(n, K) \mid \text{Tr } T = 0 \right\}.$
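One way to carry out the general computation (a sketch, swapping the explicit gradient for the standard first-order expansion of the determinant) is to note that for any matrix ${T}$,

$\displaystyle \det(I + tT) = 1 + t \,\text{Tr } T + O(t^2) \qquad \text{as } t \rightarrow 0.$

So a curve through the identity with velocity ${T}$ stays on the surface ${\det = 1}$ to first order exactly when ${\text{Tr } T = 0}$. For ${n=2}$ this recovers the condition ${a+d=0}$ found above.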

## 4. The exponential map

Right now, ${\mathfrak g}$ is just a vector space. However, by using the group structure we can get a map from ${\mathfrak g}$ back into ${G}$. The trick is “differential equations”:

Proposition 14 (Differential equations for Lie theorists)

Let ${G}$ be a Lie group over ${K}$ and ${\mathfrak g}$ its Lie algebra. Then for every ${x \in \mathfrak g}$ there is a unique homomorphism

$\displaystyle \gamma_x : K \rightarrow G$

which is a morphism of Lie groups, such that

$\displaystyle \gamma_x'(0) = x \in T_eG = \mathfrak g.$

We will write ${\gamma_x(t)}$ to emphasize the argument ${t \in K}$ being thought of as “time”. Thus this proposition should be intuitively clear: the theory of differential equations guarantees that ${\gamma_x}$ is defined and unique in a small neighborhood of ${0 \in K}$. Then, the group structure allows us to extend ${\gamma_x}$ uniquely to the rest of ${K}$, giving a trajectory across all of ${G}$. This is sometimes called a one-parameter subgroup of ${G}$, but we won’t use this terminology anywhere in what follows.

This lets us define:

Definition 15

Retain the setting of the previous proposition. Then the exponential map is defined by

$\displaystyle \exp : \mathfrak g \rightarrow G \qquad\text{by}\qquad x \mapsto \gamma_x(1).$

The exponential map gets its name from the fact that for all the examples I discussed before, it is actually just the map ${e^\bullet}$. Note that below, ${e^T = \sum_{k \ge 0} \frac{T^k}{k!}}$ for a matrix ${T}$; this is called the matrix exponential.

Example 16 (Exponential Maps of Lie algebras)

• If ${G = \mathbb R}$, then ${\mathfrak g = \mathbb R}$ too. We observe ${\gamma_x(t) = e^{tx} \in \mathbb R}$ (where ${t \in \mathbb R}$) is a morphism of Lie groups ${\gamma_x : \mathbb R \rightarrow G}$. Hence

$\displaystyle \exp : \mathbb R \rightarrow \underbrace{\mathbb R}_{=G} \qquad \exp(x) = \gamma_x(1) = e^x \in \mathbb R = G.$

• Ditto for ${\mathbb C}$.
• For ${S^1}$ and ${x \in \mathbb R}$, the map ${\gamma_x : \mathbb R \rightarrow S^1}$ given by ${t \mapsto e^{itx}}$ works. Hence

$\displaystyle \exp : \mathbb R \rightarrow S^1 \qquad \exp(x) = \gamma_x(1) = e^{ix} \in S^1.$

• For ${\text{GL }(n, K)}$, the map ${\gamma_X : K \rightarrow \text{GL }(n, K)}$ given by ${t \mapsto e^{tX}}$ works nicely (now ${X}$ is a matrix). (Note that we have to check ${e^{tX}}$ is actually invertible for this map to be well-defined.) Hence the exponential map is given by

$\displaystyle \exp : \mathfrak{gl}(n,K) \rightarrow \text{GL }(n,K) \qquad \exp(X) = \gamma_X(1) = e^X \in \text{GL }(n, K).$

• Similarly,

$\displaystyle \exp : \mathfrak{sl}(n,K) \rightarrow \text{SL }(n,K) \qquad \exp(X) = \gamma_X(1) = e^X \in \text{SL }(n, K).$

Here we had to check that if ${X \in \mathfrak{sl}(n,K)}$, meaning ${\text{Tr } X = 0}$, then ${\det(e^X) = 1}$. This can be seen by writing ${X}$ in an upper triangular basis.
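The underlying fact is the standard identity ${\det(e^X) = e^{\text{Tr } X}}$. Here is a small numerical sanity check in plain Python, as a sketch only: the helper names `mat_mul`, `mat_exp`, `det2` and the test matrices are my own choices for illustration, and the matrix exponential is just the truncated power series.

```python
import math

def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(X, terms=40):
    """Truncated series sum_{k < terms} X^k / k! (adequate for small matrices)."""
    n = len(X)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # holds X^k / k!
    for k in range(1, terms):
        term = [[e / k for e in row] for row in mat_mul(term, X)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

X = [[0.3, -1.2], [0.7, -0.3]]  # trace 0, so X lies in sl(2, R)
Y = [[0.5, 0.2], [-0.4, 0.1]]   # trace 0.6, so Y does not

det_exp_X = det2(mat_exp(X))
det_exp_Y = det2(mat_exp(Y))
print(det_exp_X)                 # should be very close to 1
print(det_exp_Y, math.exp(0.6))  # should agree closely
```

The traceless matrix lands in ${\text{SL }(2, \mathbb R)}$ up to rounding, while the other one has determinant ${e^{0.6}}$ instead.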

Actually, taking the tangent space at the identity is a functor. Consider a map ${\varphi : G_1 \rightarrow G_2}$ of Lie groups, with Lie algebras ${\mathfrak g_1}$ and ${\mathfrak g_2}$. Because ${\varphi}$ is a group homomorphism, ${G_1 \ni e_1 \mapsto e_2 \in G_2}$. Now, by manifold theory we know that a map ${f : M \rightarrow N}$ between manifolds gives a linear map between the corresponding tangent spaces, say ${Tf : T_pM \rightarrow T_{fp}N}$. For us we obtain a linear map

$\displaystyle \varphi_\ast = T \varphi : \mathfrak g_1 \rightarrow \mathfrak g_2.$

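In fact, this ${\varphi_\ast}$ fits into a commutative diagram with the exponential maps, which says that ${\varphi(\exp(x)) = \exp(\varphi_\ast(x))}$ for every ${x \in \mathfrak g_1}$.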

Here are a few more properties of ${\exp}$:

• ${\exp(0) = e \in G}$, which is immediate by looking at the constant trajectory ${\gamma_0(t) \equiv e}$.
• ${\exp'(x) = x \in \mathfrak g}$, i.e. the total derivative ${D\exp : \mathfrak g \rightarrow \mathfrak g}$ at ${0}$ is the identity. This is again by construction.
• In particular, by the inverse function theorem this implies that ${\exp}$ is a diffeomorphism in a neighborhood of ${0 \in \mathfrak g}$, onto a neighborhood of ${e \in G}$.
• ${\exp}$ commutes with the commutator. (By the above diagram.)

## 5. The commutator

Right now ${\mathfrak g}$ is still just a vector space, the tangent space. But now that there is a map ${\exp : \mathfrak g \rightarrow G}$, we can use it to put a new operation on ${\mathfrak g}$, the so-called commutator.

The idea is as follows: we want to “multiply” two elements of ${\mathfrak g}$. But ${\mathfrak g}$ is just a vector space, so we can’t do that. However, ${G}$ itself has a group multiplication, so we should pass to ${G}$ using ${\exp}$, use the multiplication in ${G}$ and then come back.

Here are the details. As we just mentioned, ${\exp}$ is a diffeomorphism near ${e \in G}$. So for ${x}$, ${y}$ close to the origin of ${\mathfrak g}$, we can look at ${\exp(x)}$ and ${\exp(y)}$, which are two elements of ${G}$ close to ${e}$. Multiplying them gives an element still close to ${e}$, so it’s equal to ${\exp(z)}$ for some unique ${z}$, call it ${\mu(x,y)}$.

One can show in fact that ${\mu}$ can be written as a Taylor series in two variables as

$\displaystyle \mu(x,y) = x + y + \frac{1}{2} [x,y] + \text{third order terms} + \dots$

where ${[x,y]}$ is a skew-symmetric bilinear map, meaning ${[x,y] = -[y,x]}$. It will be more convenient to work with ${[x,y]}$ than ${\mu(x,y)}$ itself, so we give it a name:

Definition 17

This ${[x,y]}$ is called the commutator of ${G}$.

Now we know multiplication in ${G}$ is associative, so this should give us some nontrivial relation on the bracket ${[,]}$. Specifically, since

$\displaystyle \exp(x) \left( \exp(y) \exp(z) \right) = \left( \exp(x) \exp(y) \right) \exp(z),$

we should have that ${\mu(x, \mu(y,z)) = \mu(\mu(x,y), z)}$, and this should tell us something. In fact, the claim is:

Theorem 18

The bracket ${[,]}$ satisfies the Jacobi identity

$\displaystyle [x,[y,z]] + [y,[z,x]] + [z,[x,y]] = 0.$

Proof: Although I won’t prove it, the third-order terms (and all the rest) in our expansion of ${\mu(x,y)}$ can be written out explicitly as well: for example, we actually have

$\displaystyle \mu(x,y) = x + y + \frac{1}{2} [x,y] + \frac{1}{12} \left( [x, [x,y]] + [y,[y,x]] \right) + \text{fourth order terms} + \dots.$

The general formula is called the Baker-Campbell-Hausdorff formula.

Then we can force ourselves to expand both sides of ${\mu(x, \mu(y,z)) = \mu(\mu(x,y), z)}$ using the first three terms of the BCH formula and then equate the degree three terms. The left-hand side expands initially as ${\mu\left( x, y + z + \frac{1}{2} [y,z] + \frac{1}{12} \left( [y,[y,z]] + [z,[z,y]] \right) \right)}$, and the next step would be something ugly.

This computation is horrifying and painful, so I’ll pretend I did it and tell you the end result is as claimed. $\Box$
There is a more natural way to see why this identity is the “right one”; see Qiaochu. However, with this proof I want to make the point that this Jacobi identity is not our decision: instead, the Jacobi identity is forced upon us by associativity in ${G}$.
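If you would rather see the expansion of ${\mu}$ numerically than trust me, here is a rough sketch for ${2 \times 2}$ matrices, where ${\exp}$ is the matrix exponential and the bracket is assumed to be ${[X,Y] = XY - YX}$ (as stated for ${\text{GL }(n,K)}$ in the examples that follow). All helper names, the choice of ${X}$ and ${Y}$, and the truncated series for `mat_exp` and `mat_log` are assumptions of this sketch.

```python
def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_add(A, B, scale=1.0):
    """Return A + scale * B."""
    return [[a + scale * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def bracket(A, B):
    """Matrix commutator AB - BA."""
    return mat_add(mat_mul(A, B), mat_mul(B, A), -1.0)

def identity(n):
    return [[float(i == j) for j in range(n)] for i in range(n)]

def mat_exp(X, terms=30):
    """Truncated power series for e^X."""
    result, term = identity(len(X)), identity(len(X))
    for k in range(1, terms):
        term = [[e / k for e in row] for row in mat_mul(term, X)]
        result = mat_add(result, term)
    return result

def mat_log(M, terms=30):
    """log(I + A) = A - A^2/2 + A^3/3 - ..., valid when A = M - I is small."""
    A = mat_add(M, identity(len(M)), -1.0)
    result = [[0.0] * len(M) for _ in M]
    power = identity(len(M))
    for k in range(1, terms):
        power = mat_mul(power, A)
        result = mat_add(result, power, (-1.0) ** (k + 1) / k)
    return result

def max_abs(A):
    return max(abs(e) for row in A for e in row)

t = 0.01
X = [[0.0, t], [0.0, 0.0]]
Y = [[0.0, 0.0], [t, 0.0]]

# mu(X, Y) is defined by exp(mu) = exp(X) exp(Y), for X and Y near 0.
mu = mat_log(mat_mul(mat_exp(X), mat_exp(Y)))
second = mat_add(mat_add(X, Y), bracket(X, Y), 0.5)
cubic = mat_add(bracket(X, bracket(X, Y)), bracket(Y, bracket(Y, X)))
third = mat_add(second, cubic, 1.0 / 12)

err2 = max_abs(mat_add(mu, second, -1.0))  # error of the second-order truncation
err3 = max_abs(mat_add(mu, third, -1.0))   # error after adding the cubic terms
print(err2, err3)
```

With ${t = 0.01}$ the second-order truncation is off by roughly ${t^3}$ and the third-order one by roughly ${t^4}$, which is what the shape of the series predicts.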

Example 19 (Examples of commutators attached to Lie groups)

• If ${G}$ is an abelian group, we have ${-[y,x] = [x,y]}$ by skew-symmetry and ${[x,y] = [y,x]}$ from ${\mu(x,y) = \mu(y,x)}$. Thus ${[x,y] = 0}$ in ${\mathfrak g}$ for any abelian Lie group ${G}$.
• In particular, the brackets for ${G \in \{\mathbb R, \mathbb C, S^1\}}$ are trivial.
• Let ${G = \text{GL }(n, K)}$. Then one can show that

$\displaystyle [T,S] = TS - ST \qquad \forall S, T \in \mathfrak{gl}(n, K).$

• Ditto for ${\text{SL }(n, K)}$.

In any case, with the Jacobi identity we can define a general Lie algebra as an intrinsic object with a Jacobi-satisfying bracket:

Definition 20

A Lie algebra over ${k}$ is a ${k}$-vector space equipped with a skew-symmetric bilinear bracket ${[,]}$ satisfying the Jacobi identity.

A morphism of Lie algebras is a linear map which preserves the bracket.

Note that a Lie algebra may even be infinite-dimensional (even though we are assuming ${G}$ is finite-dimensional, so that they will never come up as a tangent space).

Example 21 (Associative algebra ${\rightarrow}$ Lie algebra)

Any associative algebra ${A}$ over ${k}$ can be made into a Lie algebra by taking the same underlying vector space, and using the bracket ${[a,b] = ab - ba}$.
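For a concrete instance, take ${A}$ to be the algebra of ${3 \times 3}$ integer matrices. The following sketch (random matrices with a fixed seed; the helper names are mine) checks skew-symmetry and the Jacobi identity exactly, since everything is integer arithmetic.

```python
import random

def mat_mul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_add(A, B, sign=1):
    """Return A + sign * B."""
    return [[a + sign * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def bracket(A, B):
    """The bracket [A, B] = AB - BA on an associative algebra."""
    return mat_add(mat_mul(A, B), mat_mul(B, A), -1)

def random_matrix(n, rng):
    return [[rng.randint(-5, 5) for _ in range(n)] for _ in range(n)]

rng = random.Random(0)
x, y, z = (random_matrix(3, rng) for _ in range(3))
zero = [[0] * 3 for _ in range(3)]

# [x,[y,z]] + [y,[z,x]] + [z,[x,y]]
jacobi = mat_add(
    mat_add(bracket(x, bracket(y, z)), bracket(y, bracket(z, x))),
    bracket(z, bracket(x, y)),
)
skew = mat_add(bracket(x, y), bracket(y, x))  # [x,y] + [y,x]
print(jacobi == zero, skew == zero)
```

Of course this is just a spot check: the actual proof is a direct expansion of the twelve terms, which cancel in pairs by associativity.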

## 6. The fundamental theorems

We finish this list of facts by stating the three “fundamental theorems” of Lie theory. They are based upon the functor

$\displaystyle \mathscr{L} : G \mapsto T_e G$

we have described earlier, which is a functor

• from the category of Lie groups
• into the category of finite-dimensional Lie algebras.

The first theorem requires the following definition:

Definition 22

A Lie subgroup ${H}$ of a Lie group ${G}$ is a subgroup ${H}$ such that the inclusion map ${H \hookrightarrow G}$ is also an injective immersion.

A Lie subalgebra ${\mathfrak h}$ of a Lie algebra ${\mathfrak g}$ is a vector subspace preserved under the bracket (meaning that ${[\mathfrak h, \mathfrak h] \subseteq \mathfrak h}$).

Theorem 23 (Lie I)

Let ${G}$ be a real or complex Lie group with Lie algebra ${\mathfrak g}$. Then given a Lie subgroup ${H \subseteq G}$, the map

$\displaystyle H \mapsto \mathscr{L}(H) \subseteq \mathfrak g$

is a bijection between Lie subgroups of ${G}$ and Lie subalgebras of ${\mathfrak g}$.

Theorem 24 (The Lie functor is an equivalence of categories)

Restrict ${\mathscr{L}}$ to a functor

• from the category of simply connected Lie groups over ${K}$
• to the category of finite-dimensional Lie algebras over ${K}$.

Then

1. (Lie II) ${\mathscr{L}}$ is fully faithful, and
2. (Lie III) ${\mathscr{L}}$ is essentially surjective on objects.

If we drop the “simply connected” condition, we obtain a functor which is faithful and exact, but not full: non-isomorphic Lie groups can have isomorphic Lie algebras (one example is ${\text{SO }(3)}$ and ${\text{SU }(2)}$).

# Algebraic Topology Functors

This will be old news to anyone who does algebraic topology, but oddly enough I can’t seem to find it all written in one place anywhere, and in particular I can’t find the bit about ${\mathsf{hPairTop}}$ at all.

In algebraic topology you (for example) associate every topological space ${X}$ with a group, like ${\pi_1(X, x_0)}$ or ${H_5(X)}$. All of these operations turn out to be functors. This isn’t surprising, because as far as I’m concerned the definition of a functor is “any time you take one type of object and naturally make another object”.

The surprise is that these objects also respect homotopy in a nice way; proving this is a fair amount of the “setup” work in algebraic topology.

## 1. Homology, ${H_n : \mathsf{hTop} \rightarrow \mathsf{Grp}}$

Note that ${H_5}$ is a functor

$\displaystyle H_5 : \mathsf{Top} \rightarrow \mathsf{Grp}$

i.e. to every space ${X}$ we can associate a group ${H_5(X)}$. (Of course, replace ${5}$ by the integer of your choice.) Recall that:

Definition 1

Two maps ${f, g : X \rightarrow Y}$ are homotopic if there exists a homotopy between them.

Thus for a map we can take its homotopy class ${[f]}$ (the equivalence class under this relationship). This has the nice property that ${[f \circ g] = [f] \circ [g]}$ and so on.

Definition 2

Two spaces ${X}$ and ${Y}$ are homotopy equivalent if there exists a pair of maps ${f : X \rightarrow Y}$ and ${g : Y \rightarrow X}$ such that ${[g \circ f] = [\mathrm{id}_X]}$ and ${[f \circ g] = [\mathrm{id}_Y]}$.

In light of this, we can define

Definition 3

The category ${\mathsf{hTop}}$ is defined as follows:

• The objects are topological spaces ${X}$.
• The morphisms ${X \rightarrow Y}$ are homotopy classes of continuous maps ${X \rightarrow Y}$.

Remark 4

Composition is well-defined since ${[f \circ g] = [f] \circ [g]}$. Two spaces are isomorphic in ${\mathsf{hTop}}$ if they are homotopy equivalent.

Remark 5

As you might guess this “quotient” construction is called a quotient category.

Then the big result is that:

Theorem 6

The induced map ${f_\sharp = H_n(f)}$ of a map ${f: X \rightarrow Y}$ depends only on the homotopy class of ${f}$. Thus ${H_n}$ is a functor

$\displaystyle H_n : \mathsf{hTop} \rightarrow \mathsf{Grp}.$

The proof of this is geometric, using the so-called prism operators. In any case, as with all functors we deduce

Corollary 7

${H_n(X) \cong H_n(Y)}$ if ${X}$ and ${Y}$ are homotopy equivalent.

In particular, the contractible spaces are those spaces ${X}$ which are homotopy equivalent to a point, in which case ${H_n(X) = 0}$ for all ${n \ge 1}$.

## 2. Relative Homology, ${H_n : \mathsf{hPairTop} \rightarrow \mathsf{Grp}}$

In fact, we also defined homology groups

$\displaystyle H_n(X,A)$

for ${A \subseteq X}$. We will now show this is functorial too.

Definition 8

Let ${\varnothing \neq A \subset X}$ and ${\varnothing \neq B \subset Y}$ be subspaces, and consider a map ${f : X \rightarrow Y}$. If ${f(A) \subseteq B}$ we write

$\displaystyle f : (X,A) \rightarrow (Y,B).$

We say ${f}$ is a map of pairs, between the pairs ${(X,A)}$ and ${(Y,B)}$.

Definition 9

We say that ${f,g : (X,A) \rightarrow (Y,B)}$ are pair-homotopic if they are “homotopic through maps of pairs”.

More formally, a pair-homotopy between ${f, g : (X,A) \rightarrow (Y,B)}$ is a map ${F : [0,1] \times X \rightarrow Y}$, which we’ll write as ${F_t(x)}$, such that ${F}$ is a homotopy of the maps ${f,g : X \rightarrow Y}$ and each ${F_t}$ is itself a map of pairs.

Thus, we naturally arrive at two categories:

• ${\mathsf{PairTop}}$, the category of pairs of topological spaces, and
• ${\mathsf{hPairTop}}$, the same category except with maps considered only up to pair-homotopy.

Definition 10

As before, we say pairs ${(X,A)}$ and ${(Y,B)}$ are pair-homotopy equivalent if they are isomorphic in ${\mathsf{hPairTop}}$. An isomorphism of ${\mathsf{hPairTop}}$ is a pair-homotopy equivalence.

Then, the prism operators now let us derive

Theorem 11

We have a functor

$\displaystyle H_n : \mathsf{hPairTop} \rightarrow \mathsf{Grp}.$

The usual corollaries apply.

Now, we want an analog of contractible spaces for our pairs: i.e. pairs of spaces ${(X,A)}$ such that ${H_n(X,A) = 0}$ for ${n \ge 1}$. The correct definition is:

Definition 12

Let ${A \subset X}$. We say that ${A}$ is a deformation retract of ${X}$ if there is a map of pairs ${r : (X, A) \rightarrow (A, A)}$ which is a pair homotopy equivalence.

Example 13 (Examples of Deformation Retracts)

1. If a single point ${p}$ is a deformation retract of a space ${X}$, then ${X}$ is contractible, since the retraction ${r : X \rightarrow \{p\}}$ (when viewed as a map ${X \rightarrow X}$) is homotopic to the identity map ${\mathrm{id}_X : X \rightarrow X}$.
2. The punctured disk ${D^2 \setminus \{0\}}$ deformation retracts onto its boundary ${S^1}$.
3. More generally, ${D^{n} \setminus \{0\}}$ deformation retracts onto its boundary ${S^{n-1}}$.
4. Similarly, ${\mathbb R^n \setminus \{0\}}$ deformation retracts onto a sphere ${S^{n-1}}$.

Of course in this situation we have that

$\displaystyle H_n(X,A) \cong H_n(A,A) = 0.$

## 3. Homotopy, ${\pi_1 : \mathsf{hTop}_\ast \rightarrow \mathsf{Grp}}$

As a special case of the above, we define

Definition 14

The category ${\mathsf{Top}_\ast}$ is defined as follows:

• The objects are pairs ${(X, x_0)}$ of spaces ${X}$ with a distinguished basepoint ${x_0}$. We call these pointed spaces.
• The morphisms are maps ${f : (X, x_0) \rightarrow (Y, y_0)}$, meaning ${f}$ is continuous and ${f(x_0) = y_0}$.

Now again we mod out:

Definition 15

Two maps ${f , g : (X, x_0) \rightarrow (Y, y_0)}$ of pointed spaces are homotopic if there is a homotopy between them which also fixes the basepoints. We can then, in the same way as before, define the quotient category ${\mathsf{hTop}_\ast}$.

And lo and behold:

Theorem 16

We have a functor

$\displaystyle \pi_1 : \mathsf{hTop}_\ast \rightarrow \mathsf{Grp}.$

Same corollaries as before.

# A Sketchy Overview of Green-Tao

These are the notes of my last lecture in the 18.099 discrete analysis seminar. It is a very high-level overview of the Green-Tao theorem. It is a subset of this paper.

## 1. Synopsis

This post is an overview of the proof of:

Theorem 1 (Green-Tao)

The prime numbers contain arbitrarily long arithmetic progressions.

Here, Szemerédi’s theorem isn’t strong enough, because the primes have density approaching zero. Instead, one can try to prove the following “relative” result.

Theorem (Relative Szemerédi)

Let ${S}$ be a sparse “pseudorandom” set of integers. Then subsets ${A}$ of ${S}$ with positive density in ${S}$ have arbitrarily long arithmetic progressions.

In order to do this, we have to accomplish the following.

• Make precise the notion of “pseudorandom”.
• Prove the Relative Szemerédi theorem, and then
• Exhibit a “pseudorandom” set ${S}$ which subsumes the prime numbers.

This post will use the graph-theoretic approach to Szemerédi as in the exposition of David Conlon, Jacob Fox, and Yufei Zhao. In order to motivate the notion of pseudorandom, we return to the graph-theoretic approach of Roth’s theorem, i.e. the case ${k=3}$ of Szemerédi’s theorem.

## 2. Defining the linear forms condition

### 2.1. Review of Roth’s theorem

Roth’s theorem can be phrased in two ways. The first is the “set-theoretic” formulation:

Theorem 2 (Roth, set version)

If ${A \subseteq \mathbb Z/N}$ is 3-AP-free, then ${|A| = o(N)}$.

The second is a “weighted” version

Theorem 3 (Roth, weighted version)

Fix ${\delta > 0}$. Let ${f : \mathbb Z/N \rightarrow [0,1]}$ with ${\mathbf E f \ge \delta}$. Then

$\displaystyle \Lambda_3(f,f,f) \ge \Omega_\delta(1).$

We sketch the idea of a graph-theoretic proof of the first theorem. We construct a tripartite graph ${G_A}$ on vertices ${X \sqcup Y \sqcup Z}$, where ${X = Y = Z = \mathbb Z/N}$. Then one creates the edges

• ${(x,y)}$ if ${2x+ y \in A}$,
• ${(x,z)}$ if ${x-z \in A}$, and
• ${(y,z)}$ if ${-y-2z \in A}$.

This construction is selected so that arithmetic progressions in ${A}$ correspond to triangles in the graph ${G_A}$. As a result, if ${A}$ has no 3-AP’s (except trivial ones, where ${x+y+z=0}$), the graph ${G_A}$ has exactly one triangle for every edge. Then, we can use the theorem of Ruzsa-Szemerédi, which states that this graph ${G_A}$ has ${o(N^2)}$ edges.
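To see the correspondence concretely: a triangle ${(x,y,z)}$ gives the three elements ${2x+y}$, ${x-z}$, ${-y-2z}$ of ${A}$, which form a progression with common difference ${-(x+y+z)}$. Here is a brute-force sketch for small ${N}$; the function names and the two test sets are hypothetical choices of mine.

```python
from itertools import product

def count_triangles(N, A):
    """Count triples (x, y, z) in (Z/N)^3 forming a triangle in G_A."""
    A = {a % N for a in A}
    return sum(
        1
        for x, y, z in product(range(N), repeat=3)
        if (2 * x + y) % N in A and (x - z) % N in A and (-y - 2 * z) % N in A
    )

def count_progressions(N, A):
    """Count pairs (a, d) with a, a+d, a+2d all in A (d = 0 is allowed)."""
    A = {a % N for a in A}
    return sum(
        1
        for a, d in product(range(N), repeat=2)
        if a in A and (a + d) % N in A and (a + 2 * d) % N in A
    )

N = 7
ap_free = {0, 1, 3}  # no nontrivial 3-AP mod 7
has_ap = {0, 1, 2}   # contains the progression 0, 1, 2

for A in (ap_free, has_ap):
    # Each progression (a, d) should lift to exactly N triangles.
    print(sorted(A), count_triangles(N, A), N * count_progressions(N, A))
```

Each pair ${(a,d)}$ lifts to exactly ${N}$ triangles (one for each choice of ${x}$), so a 3-AP-free set contributes only the ${N|A|}$ trivial triangles.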

### 2.2. The measure ${\nu}$

Now for the generalized version, we start with the second version of Roth’s theorem. Instead of a set ${S}$, we consider a function

$\displaystyle \nu : \mathbb Z/N \rightarrow \mathbb R_{\ge 0}$

which we call a majorizing measure. Since we are now dealing with ${A}$ of low density, we normalize ${\nu}$ so that

$\displaystyle \mathbf E[\nu] = 1 + o(1).$

Our goal is to now show a result of the form:

Theorem (Relative Roth, informally, weighted version)

If ${0 \le f \le \nu}$, ${\mathbf E f \ge \delta}$, and ${\nu}$ satisfies a “pseudorandom” condition, then ${\Lambda_3(f,f,f) \ge \Omega_{\delta}(1)}$.

The prototypical example of course is that if ${A \subset S \subset \mathbb Z/N}$, then we let ${\nu(x) = \frac{N}{|S|} 1_S(x)}$.

### 2.3. Pseudorandomness for ${k=3}$

So, how should we phrase the pseudorandom condition? Initially, consider ${G_S}$ the tripartite graph defined earlier, and let ${p = |S| / N}$; since ${S}$ is sparse we expect ${p}$ small. The main idea that turns out to be correct is: The number of embeddings of ${K_{2,2,2}}$ in ${G_S}$ is “as expected”, namely ${(1+o(1)) p^{12} N^6}$. Here ${K_{2,2,2}}$ is actually the ${2}$-blow-up of a triangle. This condition thus gives us control over the distribution of triangles in the sparse graph ${G_S}$: knowing that we have approximately the correct count for ${K_{2,2,2}}$ is enough to control the distribution of triangles.

For technical reasons, in fact we want this to be true not only for ${K_{2,2,2}}$ but all of its subgraphs ${H}$.

Now, let’s move on to the weighted version. Let’s consider a tripartite graph ${G}$, which we can think of as a collection of three functions

$\displaystyle \begin{aligned} \mu_{-z} &: X \times Y \rightarrow \mathbb R \\ \mu_{-y} &: X \times Z \rightarrow \mathbb R \\ \mu_{-x} &: Y \times Z \rightarrow \mathbb R. \end{aligned}$

We think of ${\mu}$ as normalized so that ${\mathbf E[\mu_{-x}] = \mathbf E[\mu_{-y}] = \mathbf E[\mu_{-z}] = 1}$. Then we can define

Definition 4

A weighted tripartite graph ${\mu = (\mu_{-x}, \mu_{-y}, \mu_{-z})}$ satisfies the ${3}$-linear forms condition if

$\displaystyle \begin{aligned} \mathbf E_{x^0,x^1,y^0,y^1,z^0,z^1} &\Big[ \mu_{-x}(y^0,z^0) \mu_{-x}(y^0,z^1) \mu_{-x}(y^1,z^0) \mu_{-x}(y^1,z^1) \\ & \mu_{-y}(x^0,z^0) \mu_{-y}(x^0,z^1) \mu_{-y}(x^1,z^0) \mu_{-y}(x^1,z^1) \\ & \mu_{-z}(x^0,y^0) \mu_{-z}(x^0,y^1) \mu_{-z}(x^1,y^0) \mu_{-z}(x^1,y^1) \Big] \\ &= 1 + o(1) \end{aligned}$

and similarly if any of the twelve factors are deleted.

Then the pseudorandomness condition for ${\nu}$ is phrased in terms of the graph we defined above:

Definition 5

A function ${\nu : \mathbb Z / N \rightarrow \mathbb R_{\ge 0}}$ satisfies the ${3}$-linear forms condition if ${\mathbf E[\nu] = 1 + o(1)}$, and the tripartite graph ${\mu = (\mu_{-x}, \mu_{-y}, \mu_{-z})}$ defined by

$\displaystyle \begin{aligned} \mu_{-z}(x,y) &= \nu(2x+y) \\ \mu_{-y}(x,z) &= \nu(x-z) \\ \mu_{-x}(y,z) &= \nu(-y-2z) \end{aligned}$

satisfies the ${3}$-linear forms condition.

Finally, the relative version of Roth’s theorem which we seek is:

Theorem 6 (Relative Roth)

Suppose ${\nu : \mathbb Z/N \rightarrow \mathbb R_{\ge 0}}$ satisfies the ${3}$-linear forms condition. Then for any ${f : \mathbb Z/N \rightarrow \mathbb R_{\ge 0}}$ bounded above by ${\nu}$ and satisfying ${\mathbf E[f] \ge \delta > 0}$, we have

$\displaystyle \Lambda_3(f,f,f) \ge \Omega_{\delta}(1).$

### 2.4. Relative Szemerédi

We of course have:

Theorem 7 (Szemerédi)

Suppose ${k \ge 3}$, and ${f : \mathbb Z/N \rightarrow [0,1]}$ with ${\mathbf E[f] \ge \delta}$. Then

$\displaystyle \Lambda_k(f, \dots, f) \ge \Omega_{\delta}(1).$

For ${k > 3}$, rather than considering weighted tripartite graphs, we consider a ${(k-1)}$-uniform ${k}$-partite hypergraph. For example, given ${\nu}$ with ${\mathbf E[\nu] = 1 + o(1)}$ and ${k=4}$, we use the construction

$\displaystyle \begin{aligned} \mu_{-z}(w,x,y) &= \nu(3w+2x+y) \\ \mu_{-y}(w,x,z) &= \nu(2w+x-z) \\ \mu_{-x}(w,y,z) &= \nu(w-y-2z) \\ \mu_{-w}(x,y,z) &= \nu(-x-2y-3z). \end{aligned}$

Thus 4-AP’s correspond to the simplex ${K_4^{(3)}}$ (i.e. a tetrahedron). We then consider the two-blow-up of the simplex, and require the same uniformity on all of its subgraphs ${H}$.

Here is the compiled version:

Definition 8

A ${(k-1)}$-uniform ${k}$-partite weighted hypergraph ${\mu = (\mu_{-i})_{i=1}^k}$ satisfies the ${k}$-linear forms condition if

$\displaystyle \mathbf E_{x_1^0, x_1^1, \dots, x_k^0, x_k^1} \left[ \prod_{j=1}^k \prod_{\omega \in \{0,1\}^{[k] \setminus \{j\}}} \mu_{-j}\left( x_1^{\omega_1}, \dots, x_{j-1}^{\omega_{j-1}}, x_{j+1}^{\omega_{j+1}}, \dots, x_k^{\omega_k} \right)^{n_{j,\omega}} \right] = 1 + o(1)$

for all exponents ${n_{j,\omega} \in \{0,1\}}$.

Definition 9

A function ${\nu : \mathbb Z/N \rightarrow \mathbb R_{\ge 0}}$ satisfies the ${k}$-linear forms condition if ${\mathbf E[\nu] = 1 + o(1)}$, and

$\displaystyle \mathbf E_{x_1^0, x_1^1, \dots, x_k^0, x_k^1} \left[ \prod_{j=1}^k \prod_{\omega \in \{0,1\}^{[k] \setminus \{j\}}} \nu\left( \sum_{i=1}^k (j-i)x_i^{(\omega_i)} \right)^{n_{j,\omega}} \right] = 1 + o(1)$

for all exponents ${n_{j,\omega} \in \{0,1\}}$. This is just the previous condition with the natural ${\mu}$ induced by ${\nu}$.

The natural generalization of relative Szemerédi is then:

Theorem 10 (Relative Szemerédi)

Suppose ${k \ge 3}$, and ${\nu : \mathbb Z/N \rightarrow \mathbb R_{\ge 0}}$ satisfies the ${k}$-linear forms condition. Let ${f : \mathbb Z/N \rightarrow \mathbb R_{\ge 0}}$ with ${\mathbf E[f] \ge \delta}$, ${f \le \nu}$. Then

$\displaystyle \Lambda_k(f, \dots, f) \ge \Omega_{\delta}(1).$

## 3. Outline of proof of Relative Szemerédi

The proof of Relative Szemerédi uses two key facts. First, one replaces ${f}$ with a bounded ${\widetilde f}$ which is near it:

Theorem 11 (Dense model)

Let ${\varepsilon > 0}$. There exists ${\varepsilon' > 0}$ such that if:

• ${\nu : \mathbb Z/N \rightarrow \mathbb R_{\ge 0}}$ satisfies ${\left\lVert \nu-1 \right\rVert^{\square}_r \le \varepsilon'}$, and
• ${f : \mathbb Z/N \rightarrow \mathbb R_{\ge 0}}$, ${f \le \nu}$

then there exists a function ${\widetilde f : \mathbb Z/N \rightarrow [0,1]}$ such that ${\left\lVert f - \widetilde f \right\rVert^{\square}_r \le \varepsilon}$.

Here we have a new norm, called the cut norm, defined by

$\displaystyle \left\lVert f \right\rVert^{\square}_r = \sup_{A_i \subseteq (\mathbb Z/N)^{r-1}} \left\lvert \mathbf E_{x_1, \dots, x_r} f(x_1 + \dots + x_r) 1_{A_1}(x_{-1}) \dots 1_{A_r}(x_{-r}) \right\rvert.$

This is actually an extension of the cut norm defined on an ${r}$-uniform ${r}$-partite hypergraph (not ${(r-1)}$-uniform like before!): if ${g : X_1 \times \dots \times X_r \rightarrow \mathbb R}$ is such a graph, we let

$\displaystyle \left\lVert g \right\rVert^{\square}_{r,r} = \sup_{A_i \subseteq X_{-i}} \left\lvert \mathbf E_{x_1, \dots, x_r} g(x_1, \dots, x_r) 1_{A_1}(x_{-1}) \dots 1_{A_r}(x_{-r}) \right\rvert.$

Taking ${g(x_1, \dots, x_r) = f(x_1 + \dots + x_r)}$, ${X_1 = \dots = X_r = \mathbb Z/N}$ gives the analogy.

For the second theorem, we define the norm

$\displaystyle \left\lVert g \right\rVert^{\square}_{k-1,k} = \max_{i=1,\dots,k} \left( \left\lVert g_{-i} \right\rVert^{\square}_{k-1, k-1} \right).$

Theorem 12 (Relative simplex counting lemma)

Let ${\mu}$, ${g}$, ${\widetilde g}$ be weighted ${(k-1)}$-uniform ${k}$-partite hypergraphs on ${X_1 \cup \dots \cup X_k}$. Assume that ${\mu}$ satisfies the ${k}$-linear forms condition, and ${0 \le g_{-i} \le \mu_{-i}}$ for all ${i}$, ${0 \le \widetilde g \le 1}$. If ${\left\lVert g-\widetilde g \right\rVert^{\square}_{k-1,k} = o(1)}$ then

$\displaystyle \mathbf E_{x_1, \dots, x_k} \left[ g(x_{-1}) \dots g(x_{-k}) - \widetilde g(x_{-1}) \dots \widetilde g(x_{-k}) \right] = o(1).$

One then combines these two results to prove Relative Szemerédi, as follows. Start with ${f}$ and ${\nu}$ in the theorem. The ${k}$-linear forms condition turns out to imply ${\left\lVert \nu-1 \right\rVert^{\square}_{k-1} = o(1)}$. So we can find a nearby ${\widetilde f}$ by the dense model theorem. Then, we induce ${\mu}$, ${g}$, ${\widetilde g}$ from ${\nu}$, ${f}$, ${\widetilde f}$ respectively. The counting lemma then reduces the bounding of ${\Lambda_k(f, \dots, f)}$ to the bounding of ${\Lambda_k(\widetilde f, \dots, \widetilde f)}$, which is ${\Omega_\delta(1)}$ by the usual Szemerédi theorem.

## 4. Arithmetic progressions in primes

We now sketch how to obtain Green-Tao from Relative Szemerédi. As expected, we need to use the von Mangoldt function ${\Lambda}$.

Unfortunately, ${\Lambda}$ is biased (e.g. “all decent primes are odd”). To get around this, we let ${w = w(N)}$ tend to infinity slowly with ${N}$, and define

$\displaystyle W = \prod_{p \le w} p.$

In the ${W}$-trick we consider only primes which are ${1 \pmod W}$. The modified von Mangoldt function then is defined by

$\displaystyle \widetilde \Lambda(n) = \begin{cases} \frac{\varphi(W)}{W} \log (Wn+1) & Wn+1 \text{ prime} \\ 0 & \text{else}. \end{cases}$

In accordance with Dirichlet, we have ${\sum_{n \le N} \widetilde \Lambda(n) = N + o(N)}$.

So, we need to show now that

Proposition 13

Fix ${k \ge 3}$. We can find ${\delta = \delta(k) > 0}$ such that for ${N \gg 1}$ prime, we can find ${\nu : \mathbb Z/N \rightarrow \mathbb R_{\ge 0}}$ which satisfies the ${k}$-linear forms condition as well as

$\displaystyle \nu(n) \ge \delta \widetilde \Lambda(n)$

for ${N/2 \le n < N}$.

In that case, we can let

$\displaystyle f(n) = \begin{cases} \delta \widetilde\Lambda(n) & N/2 \le n < N \\ 0 & \text{else}. \end{cases}$

Then ${0 \le f \le \nu}$. The presence of ${N/2 \le n < N}$ allows us to avoid “wrap-around issues” that arise from using ${\mathbb Z/N}$ instead of ${\mathbb Z}$. Relative Szemerédi then yields the result.

For completeness, we state the construction. Let ${\chi : \mathbb R \rightarrow [0,1]}$ be supported on ${[-1,1]}$ with ${\chi(0) = 1}$, and define a normalizing constant ${c_\chi = \int_0^\infty \left\lvert \chi'(x) \right\rvert^2 \; dx}$. Inspired by ${\Lambda(n) = \sum_{d \mid n} \mu(d) \log(n/d)}$, we define a truncated ${\Lambda}$ by

$\displaystyle \Lambda_{\chi, R}(n) = \log R \sum_{d \mid n} \mu(d) \chi\left( \frac{\log d}{\log R} \right).$
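The identity ${\Lambda(n) = \sum_{d \mid n} \mu(d) \log(n/d)}$ which motivates this truncation is easy to test by machine. Here is a sketch using trial division; the helper names are mine, and here ${\mu}$ is the Möbius function, not the hypergraph weights from earlier.

```python
import math

def prime_factors(n):
    """Return the prime factorization of n as a dict {prime: exponent}."""
    factors = {}
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] = factors.get(p, 0) + 1
            n //= p
        p += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

def mobius(n):
    """Mobius function: 0 if n has a square factor, else (-1)^(number of primes)."""
    factors = prime_factors(n)
    if any(e > 1 for e in factors.values()):
        return 0
    return -1 if len(factors) % 2 else 1

def von_mangoldt(n):
    """log p if n is a power of a prime p (with exponent at least 1), else 0."""
    factors = prime_factors(n)
    if len(factors) == 1:
        (p,) = factors
        return math.log(p)
    return 0.0

def divisor_sum(n):
    """Right-hand side: sum over d | n of mobius(d) * log(n / d)."""
    return sum(mobius(d) * math.log(n / d) for d in range(1, n + 1) if n % d == 0)

for n in (1, 8, 12, 97):
    print(n, von_mangoldt(n), divisor_sum(n))
```

The truncated version ${\Lambda_{\chi,R}}$ replaces the weight ${\log(n/d)}$ by a smooth cutoff in ${\log d / \log R}$, so only divisors ${d \le R}$ contribute.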

Let ${k \ge 3}$, ${R = N^{k^{-1} 2^{-k-3}}}$. Now, we define ${\nu}$ by

$\displaystyle \nu(n) = \begin{cases} \dfrac{\varphi(W)}{W} \dfrac{\Lambda_{\chi,R}(Wn+1)^2}{c_\chi \log R} & N/2 \le n < N \\ 0 & \text{else}. \end{cases}$

This turns out to work, provided ${w}$ grows sufficiently slowly in ${N}$.

# Approximating E3-LIN is NP-Hard

This lecture, which I gave for my 18.434 seminar, focuses on the MAX-E3LIN problem. We prove that approximating it is NP-hard by a reduction from LABEL-COVER.

## 1. Introducing MAX-E3LIN

In the MAX-E3LIN problem, our input is a series of linear equations ${\pmod 2}$ in ${n}$ binary variables, each with three terms. Equivalently, one can think of this as ${\pm 1}$ variables and ternary products. The objective is to maximize the fraction of satisfied equations.

Example 1 (Example of MAX-E3LIN instance)

$\displaystyle \begin{aligned} x_1 + x_3 + x_4 &\equiv 1 \pmod 2 \\ x_1 + x_2 + x_4 &\equiv 0 \pmod 2 \\ x_1 + x_2 + x_5 &\equiv 1 \pmod 2 \\ x_1 + x_3 + x_5 &\equiv 1 \pmod 2 \end{aligned}$

$\displaystyle \begin{aligned} x_1 x_3 x_4 &= -1 \\ x_1 x_2 x_4 &= +1 \\ x_1 x_2 x_5 &= -1 \\ x_1 x_3 x_5 &= -1 \end{aligned}$

A diligent reader can check that we may obtain ${\frac34}$ but not ${1}$.
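A less diligent reader can brute-force all ${2^5}$ assignments instead. In the sketch below, the encoding of an equation as a pair (variable indices, right-hand side) is my own convention.

```python
from itertools import product

# ((i, j, k), b) encodes x_i + x_j + x_k = b (mod 2), with 1-indexed variables.
EQUATIONS = [
    ((1, 3, 4), 1),
    ((1, 2, 4), 0),
    ((1, 2, 5), 1),
    ((1, 3, 5), 1),
]

def value(assignment, equations):
    """Fraction of equations satisfied by a tuple of 0/1 values."""
    satisfied = sum(
        1
        for variables, rhs in equations
        if sum(assignment[v - 1] for v in variables) % 2 == rhs
    )
    return satisfied / len(equations)

def best_value(n, equations):
    """Maximum of value() over all 2^n assignments (exponential time!)."""
    return max(value(a, equations) for a in product((0, 1), repeat=n))

print(best_value(5, EQUATIONS))
```

The reason the value ${1}$ is impossible: adding all four equations gives ${0 \equiv 1 \pmod 2}$, since every variable appears an even number of times while the right-hand sides sum to ${3}$.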

Remark 2

We immediately notice that

• If there’s a solution with value ${1}$, we can find it easily with ${\mathbb F_2}$ linear algebra.
• It is always possible to get at least ${\frac{1}{2}}$ by selecting all-zero or all-one.
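The first observation can be made concrete with Gaussian elimination over ${\mathbb F_2}$. This is a minimal sketch under my own conventions (equations as pairs of 1-indexed variable tuples and a right-hand side, rows stored as bitmasks, free variables set to ${0}$); `solve_gf2` and `satisfies_all` are hypothetical names.

```python
def solve_gf2(equations, n):
    """Return a 0/1 list satisfying every equation, or None if none exists."""
    pivot_rows = []  # (pivot_bit, mask, rhs); later rows vanish at earlier pivots
    for variables, rhs in equations:
        mask = 0
        for v in variables:
            mask ^= 1 << (v - 1)
        # Eliminate the pivot variables found so far from this row.
        for bit, pmask, prhs in pivot_rows:
            if (mask >> bit) & 1:
                mask ^= pmask
                rhs ^= prhs
        if mask == 0:
            if rhs == 1:
                return None  # the row reduced to 0 = 1
            continue
        pivot_rows.append((mask.bit_length() - 1, mask, rhs))
    # Back-substitute in reverse order, with free variables left at 0.
    x = [0] * n
    for bit, mask, rhs in reversed(pivot_rows):
        val = rhs
        for j in range(n):
            if j != bit and (mask >> j) & 1:
                val ^= x[j]
        x[bit] = val
    return x

def satisfies_all(x, equations):
    return all(sum(x[v - 1] for v in vs) % 2 == rhs for vs, rhs in equations)

example = [((1, 3, 4), 1), ((1, 2, 4), 0), ((1, 2, 5), 1), ((1, 3, 5), 1)]
print(solve_gf2(example, 5))      # the full example system is inconsistent
print(solve_gf2(example[:3], 5))  # dropping the last equation makes it solvable
```

This runs in polynomial time, in contrast with the problem of finding the best assignment when no perfect one exists.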

The theorem we will prove today is that these “obvious” observations are essentially the best ones possible! Our main result is that improving the above constants to 51% and 99%, say, is NP-hard.

Theorem 3 (Hardness of MAX-E3LIN)

The ${\frac{1}{2}+\varepsilon}$ vs. ${1-\delta}$ decision problem for MAX-E3LIN is NP-hard.

This means it is NP-hard to decide whether a MAX-E3LIN instance has value ${\le \frac{1}{2}+\varepsilon}$ or ${\ge 1-\delta}$ (given it is one or the other). A direct corollary of this is that approximating MAX-SAT is also NP-hard.

Corollary 4

The ${\frac78+\varepsilon}$ vs. ${1-\delta}$ decision problem for MAX-SAT is NP-hard.

Remark 5

The constant ${\frac78}$ is optimal in light of a random assignment. In fact, one can replace ${1-\delta}$ with ${1}$, but we don’t do so here.

Proof: Given an equation ${a+b+c=1}$ in MAX-E3LIN, we consider four formulas ${a \lor \neg b \lor \neg c}$, ${\neg a \lor b \lor \neg c}$, ${\neg a \lor \neg b \lor c}$, ${a \lor b \lor c}$. Either three or four of them are satisfied, with four occurring exactly when ${a+b+c=1}$. One does a similar construction for ${a+b+c=0}$. $\Box$
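Here is a quick truth-table check of the gadget for the equation ${a+b+c \equiv 1 \pmod 2}$. As an assumption of this sketch, I take the four clauses to be the ones that each exclude a single even-parity assignment, and spell them out explicitly in the code.

```python
from itertools import product

def clauses_satisfied(a, b, c):
    """Number of satisfied clauses in the gadget for a + b + c = 1 (mod 2)."""
    clauses = [
        a or (not b) or (not c),  # fails only at (0, 1, 1)
        (not a) or b or (not c),  # fails only at (1, 0, 1)
        (not a) or (not b) or c,  # fails only at (1, 1, 0)
        a or b or c,              # fails only at (0, 0, 0)
    ]
    return sum(bool(clause) for clause in clauses)

table = {
    (a, b, c): clauses_satisfied(a, b, c)
    for a, b, c in product((0, 1), repeat=3)
}
for assignment, count in table.items():
    print(assignment, sum(assignment) % 2, count)
```

So if a fraction ${v}$ of the equations is satisfied, a fraction ${\frac34 + \frac v4}$ of the clauses is. Values ${\frac12+\varepsilon}$ and ${1-\delta}$ become ${\frac78 + \frac\varepsilon4}$ and ${1 - \frac\delta4}$, which is where the constant ${\frac78}$ in the corollary comes from.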

The hardness of MAX-E3LIN is relevant to the PCP theorem: using MAX-E3LIN gadgets, Håstad was able to prove a very strong version of the PCP theorem, in which the verifier reads just three bits of a proof!

Theorem 6 (Håstad PCP)

Let ${\varepsilon, \delta > 0}$. We have

$\displaystyle \mathbf{NP} \subseteq \mathbf{PCP}_{\frac{1}{2}+\varepsilon, 1-\delta}(3, O(\log n)).$

In other words, any ${L \in \mathbf{NP}}$ has a (non-adaptive) verifier with the following properties.

• The verifier uses ${O(\log n)}$ random bits, and queries just three (!) bits.
• The acceptance condition is either ${a+b+c=1}$ or ${a+b+c=0}$.
• If ${x \in L}$, then there is a proof ${\Pi}$ which is accepted with probability at least ${1-\delta}$.
• If ${x \notin L}$, then every proof is accepted with probability at most ${\frac{1}{2} + \varepsilon}$.

## 2. Label Cover

We will prove our main result by reducing from LABEL-COVER. Recall LABEL-COVER is played as follows: we have a bipartite graph ${G = U \cup V}$, a set of keys ${K}$ for vertices of ${U}$ and a set of labels ${L}$ for ${V}$. For every edge ${e = \{u,v\}}$ there is a function ${\pi_e : L \rightarrow K}$ specifying a key ${k = \pi_e(\ell) \in K}$ for every label ${\ell \in L}$. The goal is to label the graph ${G}$ while maximizing the number of edges ${e}$ with compatible key-label pairs.

Approximating LABEL-COVER is NP-hard:

Theorem 7 (Hardness of LABEL-COVER)

The ${\eta}$ vs. ${1}$ decision problem for LABEL-COVER is NP-hard for every ${\eta > 0}$, given ${|K|}$ and ${|L|}$ are sufficiently large in ${\eta}$.

So for any ${\eta > 0}$, it is NP-hard to decide whether one can satisfy all edges or fewer than ${\eta}$ of them.

## 3. Setup

We are going to make a reduction from LABEL-COVER to MAX-E3LIN with the following two properties:

• “Completeness”: If the LABEL-COVER instance is completely satisfiable, then we get a solution of value ${\ge 1 - \delta}$ in the resulting MAX-E3LIN.
• “Soundness”: If the LABEL-COVER instance has value ${\le \eta}$, then we get a solution of value ${\le \frac{1}{2} + \varepsilon}$ in the resulting MAX-E3LIN.

Thus given an oracle for MAX-E3LIN decision, we can obtain ${\eta}$ vs. ${1}$ decision for LABEL-COVER, which we know is hard.

The setup for this is quite involved, using a huge number of variables. Just to agree on some conventions:

Definition 8 (“Long Code”)

A ${K}$-indexed binary string ${x = (x_k)_k}$ is a ${\pm 1}$ sequence indexed by ${K}$. We can think of it as an element of ${\{\pm 1\}^K}$. An ${L}$-indexed binary string ${y = (y_\ell)_\ell}$ is defined similarly.

Now we initialize ${|U| \cdot 2^{|K|} + |V| \cdot 2^{|L|}}$ variables:

• At every vertex ${u \in U}$, we will create ${2^{|K|}}$ binary variables, one for every ${K}$-indexed binary string. It is better to collect these variables into a function

$\displaystyle f_u : \{\pm1\}^K \rightarrow \{\pm1\}.$

• Similarly, at every vertex ${v \in V}$, we will create ${2^{|L|}}$ binary variables, one for every ${L}$-indexed binary string, and collect these into a function

$\displaystyle g_v : \{\pm1\}^L \rightarrow \{\pm1\}.$

Next we generate the equations. Here’s the motivation: we want to do this in such a way that given a satisfying labelling for LABEL-COVER, nearly all the MAX-E3LIN equations can be satisfied. One idea is as follows: for every edge ${e}$, letting ${\pi = \pi_e}$,

• Take a ${K}$-indexed binary string ${x = (x_k)_k}$ at random. Take an ${L}$-indexed binary string ${y = (y_\ell)_\ell}$ at random.
• Define the ${L}$-indexed binary string ${z = (z_\ell)_\ell}$ by ${z_\ell = x_{\pi(\ell)} y_\ell}$.
• Write down the equation ${f_u(x) g_v(y) g_v(z) = +1}$ for the MAX-E3LIN instance.

Thus, assuming we had a valid labeling of the graph, we could let ${f_u}$ and ${g_v}$ be the dictator functions for the labeling. In that case, if ${v}$ has label ${\ell}$ and ${u}$ has key ${\pi(\ell)}$, then ${f_u(x) = x_{\pi(\ell)}}$, ${g_v(y) = y_\ell}$, and ${g_v(z) = x_{\pi(\ell)} y_\ell}$, so the product is always ${+1}$.

Unfortunately, this has two fatal flaws:

1. This means a ${1}$ instance of LABEL-COVER gives a ${1}$ instance of MAX-E3LIN, but we need ${1-\delta}$ to have a hope of working.
2. Right now we could also just set all variables to be ${+1}$.

We fix both flaws by using the following equations instead.

Definition 8 (Equations of reduction)

For every edge ${e}$, with ${\pi = \pi_e}$, we alter the construction and say

• Let ${x = (x_k)_k}$ and ${y = (y_\ell)_\ell}$ be random as before.
• Let ${n = (n_\ell)_\ell}$ be a random ${L}$-indexed binary string, drawn from a ${\delta}$-biased distribution (${-1}$ with probability ${\delta}$). And now define ${z = (z_\ell)_\ell}$ by

$\displaystyle z_\ell = x_{\pi(\ell)} y_\ell n_\ell .$

The ${n_\ell}$ represents “noise” bits, which resolve the first problem by corrupting a bit of ${z}$ with probability ${\delta}$.

• Write down one of the following two equations with ${\frac{1}{2}}$ probability each:

\displaystyle \begin{aligned} f_u(x) g_v(y) g_v(z) &= +1 \\ f_u(x) g_v(y) g_v(-z) &= -1. \end{aligned}

This resolves the second issue.

This gives a set of ${O(|E|)}$ equations.

I claim this reduction works. So we need to prove the “completeness” and “soundness” claims above.

## 4. Proof of Completeness

Given a labeling of ${G}$ with value ${1}$, as described we simply let ${f_u}$ and ${g_v}$ be dictator functions corresponding to this valid labelling. Then as we’ve seen, we will pass ${1 - \delta}$ of the equations.
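To see concretely why dictators pass exactly when the noise bit at the chosen label is ${+1}$, here is a small simulation sketch of the equation generator for a single edge. The dictionary encoding of ${\pi_e}$, the function names, and the example map are all hypothetical choices of mine.

```python
import random

def check_one_equation(pi, label, delta, rng):
    """Sample one equation for an edge and evaluate it on dictator functions.

    pi    : dict sending each label to its key (plays the role of pi_e)
    label : label assigned to v; the key assigned to u is then pi[label]
    Returns (equation_holds, noise_bit_at_label).
    """
    labels = sorted(pi)
    keys = sorted(set(pi.values()))  # only keys in the image of pi matter here
    key = pi[label]

    def f_u(x):  # dictator function for the chosen key
        return x[key]

    def g_v(y):  # dictator function for the chosen label
        return y[label]

    x = {k: rng.choice((-1, 1)) for k in keys}
    y = {l: rng.choice((-1, 1)) for l in labels}
    n = {l: -1 if rng.random() < delta else 1 for l in labels}  # noise bits
    z = {l: x[pi[l]] * y[l] * n[l] for l in labels}

    if rng.random() < 0.5:
        holds = f_u(x) * g_v(y) * g_v(z) == 1
    else:
        minus_z = {l: -bit for l, bit in z.items()}
        holds = f_u(x) * g_v(y) * g_v(minus_z) == -1
    return holds, n[label]

def estimate_pass_rate(pi, label, delta, trials=20000, seed=0):
    """Fraction of sampled equations that the dictator assignment satisfies."""
    rng = random.Random(seed)
    passed = sum(check_one_equation(pi, label, delta, rng)[0] for _ in range(trials))
    return passed / trials

example_pi = {0: "a", 1: "a", 2: "b", 3: "c"}  # hypothetical edge map
print(estimate_pass_rate(example_pi, label=2, delta=0.1))  # should be near 1 - delta
```

In both branches the product collapses to ${\pm n_\ell}$ for the chosen label ${\ell}$, so the equation holds precisely when ${n_\ell = +1}$, which happens with probability ${1-\delta}$.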

## 5. A Fourier Computation

Before proving soundness, we will first need to explicitly compute the probability an equation above is satisfied. Remember we generated an equation for ${e}$ based on random strings ${x}$, ${y}$, ${n}$.

For ${T \subseteq L}$, we define

$\displaystyle \pi^{\text{odd}}_e(T) = \left\{ k \in K \mid \left\lvert \pi_e^{-1}(k) \cap T \right\rvert \text{ is odd} \right\}.$

Thus ${\pi^{\text{odd}}_e}$ maps subsets of ${L}$ to subsets of ${K}$.

Remark 9

Note that ${|\pi^{\text{odd}}(T)| \le |T|}$ and that ${\pi^{\text{odd}}(T) \neq \varnothing}$ if ${|T|}$ is odd.
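The map ${\pi^{\text{odd}}}$ is easy to compute, and the two claims in the remark can be checked exhaustively on a toy example. This is only a sketch; the dictionary encoding of ${\pi}$ and the example map are made up for illustration.

```python
from itertools import combinations

def pi_odd(pi, T):
    """Keys k whose preimage under pi meets T in an odd number of labels."""
    counts = {}
    for label in T:
        counts[pi[label]] = counts.get(pi[label], 0) + 1
    return {k for k, c in counts.items() if c % 2 == 1}

def check_remark(pi):
    """Verify both claims of the remark for every subset T of the labels."""
    labels = sorted(pi)
    for size in range(len(labels) + 1):
        for T in combinations(labels, size):
            S = pi_odd(pi, T)
            assert len(S) <= len(T)
            if size % 2 == 1:
                assert S  # nonempty whenever |T| is odd
    return True

# Hypothetical example: labels 0..4 mapped to keys 'a', 'b', 'c'.
example_pi = {0: "a", 1: "a", 2: "b", 3: "c", 4: "c"}
print(pi_odd(example_pi, {0, 1, 2}))  # prints {'b'}
print(check_remark(example_pi))
```

The second claim is just parity: the sizes ${|\pi^{-1}(k) \cap T|}$ sum to ${|T|}$, so if ${|T|}$ is odd at least one of them is odd.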

Lemma 10 (Edge Probability)

The probability that an equation generated for ${e = \{u,v\}}$ is true is

$\displaystyle \frac{1}{2} + \frac{1}{2} \sum_{\substack{T \subseteq L \\ |T| \text{ odd}}} (1-2\delta)^{|T|} \widehat g_v(T)^2 \widehat f_u(\pi^{\text{odd}}_e(T)).$

Proof: Omitted for now… $\Box$

## 6. Proof of Soundness

We will go in the reverse direction and show (constructively) that if the MAX-E3LIN instance has a solution with value ${\ge\frac{1}{2}+2\varepsilon}$, then we can reconstruct a solution to LABEL-COVER with value ${\ge \eta}$. (The use of ${2\varepsilon}$ here will be clear in a moment.) This process is called “decoding”.

The idea is as follows: if ${S}$ is a small set such that ${\widehat f_u(S)}$ is large, then we can pick a key from ${S}$ at random for ${f_u}$; compare this with the dictator functions where ${\widehat f_u(S) = 1}$ and ${|S| = 1}$. We want to do something similar with ${T}$.

Here are the concrete details. Let ${\Lambda = \frac{\log(1/\varepsilon)}{2\delta}}$ and ${\eta = \frac{\varepsilon^3}{\Lambda^2}}$ be constants (the actual values arise later).

Definition 11

We say that a nonempty set ${S \subseteq K}$ of keys is heavy for ${u}$ if

$\displaystyle \left\lvert S \right\rvert \le \Lambda \qquad\text{and}\qquad \left\lvert \widehat{f_u}(S) \right\rvert \ge \varepsilon.$

Note that there are at most ${\varepsilon^{-2}}$ heavy sets by Parseval.

Definition 12

We say that a nonempty set ${T \subseteq L}$ of labels is ${e}$-excellent for ${v}$ if

$\displaystyle \left\lvert T \right\rvert \le \Lambda \qquad\text{and}\qquad S = \pi^{\text{odd}}_e(T) \text{ is heavy.}$

In particular ${S \neq \varnothing}$ so at least one compatible key-label pair is in ${S \times T}$.

Notice that, unlike the case with ${S}$, the criterion for “good” in ${T}$ actually depends on the edge ${e}$ in question! This makes it easier to select keys than to select labels. In order to pick labels, we will have to choose from a ${\widehat g_v^2}$ distribution.

Lemma 13 (At least ${\varepsilon}$ of ${T}$ are excellent)

For any edge ${e = \{u,v\}}$, at least ${\varepsilon}$ of the possible ${T}$ according to the distribution ${\widehat g_v^2}$ are ${e}$-excellent.

Proof: Applying an averaging argument to the inequality

$\displaystyle \sum_{\substack{T \subseteq L \\ |T| \text{ odd}}} (1-2\delta)^{|T|} \widehat g_v(T)^2 \left\lvert \widehat f_u(\pi^{\text{odd}}(T)) \right\rvert \ge 2\varepsilon$

shows there is at least ${\varepsilon}$ chance that ${|T|}$ is odd and satisfies

$\displaystyle (1-2\delta)^{|T|} \left\lvert \widehat f_u(S) \right\rvert \ge \varepsilon$

where ${S = \pi^{\text{odd}}_e(T)}$. In particular, ${(1-2\delta)^{|T|} \ge \varepsilon \implies |T| \le \Lambda}$. Finally by Remark 9, we see ${S}$ is heavy. $\Box$
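For completeness, here is the one-line computation behind the bound ${|T| \le \Lambda}$. I am assuming the logarithm in ${\Lambda}$ is the natural logarithm and that ${0 < \delta < \frac{1}{2}}$ (my reading of the setup; neither is stated explicitly). It uses only ${\log(1-2\delta) \le -2\delta}$:

$\displaystyle (1-2\delta)^{|T|} \ge \varepsilon \implies |T| \cdot \log\frac{1}{1-2\delta} \le \log\frac{1}{\varepsilon} \implies 2\delta |T| \le \log\frac{1}{\varepsilon} \implies |T| \le \frac{\log(1/\varepsilon)}{2\delta} = \Lambda.$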

Now, use the following algorithm.

• For every vertex ${u \in U}$, take the union of all heavy sets, say

$\displaystyle \mathcal H = \bigcup_{S \text{ heavy}} S.$

Pick a random key from ${\mathcal H}$. Note that ${|\mathcal H| \le \Lambda\varepsilon^{-2}}$, since there are at most ${\varepsilon^{-2}}$ heavy sets (by Parseval) and each has at most ${\Lambda}$ elements.

• For every vertex ${v \in V}$, select a random set ${T}$ according to the distribution ${\widehat g_v(T)^2}$, and select a random element from ${T}$.

I claim that this works.

Fix an edge ${e}$. There is at least an ${\varepsilon}$ chance that ${T}$ is ${e}$-excellent. If it is, then there is at least one compatible pair in ${\mathcal H \times T}$. Hence we conclude the probability of success is at least

$\displaystyle \varepsilon \cdot \frac{1}{\Lambda \varepsilon^{-2}} \cdot \frac{1}{\Lambda} = \frac{\varepsilon^3}{\Lambda^2} = \eta.$

(Addendum: it’s pointed out to me this isn’t quite right; the overall probability of the equation given by an edge ${e}$ is ${\ge \frac{1}{2}+\varepsilon}$, but this doesn’t imply it for every edge. Thus one likely needs to do another averaging argument.)

# First drafts of Napkin up!

EDIT: Here’s a July 19 draft that fixes some of the glaring issues that were pointed out.

This morning I finally uploaded the first drafts of my Napkin project, which I’ve been working on since December 2014. See the Napkin tab above for a listing of all drafts.

Napkin is my personal exposition project, which unifies together a lot of my blog posts and even more that I haven’t written on yet into a single coherent narrative. It’s written for students who don’t know much higher math, but are curious and already are comfortable with proofs. It’s especially suited for e.g. students who did contests like USAMO and IMO.

There are still a lot of rough edges in the draft, but I haven’t been able to find much time to work on it this whole calendar year, and so I’ve finally decided the perfect is the enemy of the good and it’s about time I brought this project out of the garage.

I’d much appreciate any comments, corrections, or suggestions, however minor. Please let me know! I do plan to keep updating this draft as I get comments, though I can’t promise that I’ll be very fast in doing so.

I. Basic Algebra and Topology
II. Linear Algebra and Multivariable Calculus
III. Groups, Rings, and More
IV. Complex Analysis
V. Quantum Algorithms
VI. Algebraic Topology I: Homotopy
VII. Category Theory
VIII. Differential Geometry
IX. Algebraic Topology II: Homology
X. Algebraic NT I: Rings of Integers
XI. Algebraic NT II: Galois and Ramification Theory
XII. Representation Theory
XIII. Algebraic Geometry I: Varieties
XIV. Algebraic Geometry II: Schemes
XV. Set Theory I: ZFC, Ordinals, and Cardinals
XVI. Set Theory II: Model Theory and Forcing

(I’ve also posted this on Reddit to try and grab a larger audience. We’ll see how that goes.)

# The Structure Theorem over PID’s

In this post I’ll describe the structure theorem over PID’s which generalizes the following results:

• Finite dimensional vector spaces over ${k}$ are all of the form ${k^{\oplus n}}$,
• The classification theorem for finitely generated abelian groups,
• The Frobenius normal form of a matrix,
• The Jordan decomposition of a matrix.

## 1. Some ring theory prerequisites

Prototypical example for this section: ${R = \mathbb Z}$.

Before I can state the main theorem, I need to define a few terms for UFD’s, which behave much like ${\mathbb Z}$: Our intuition from the case ${R = \mathbb Z}$ basically carries over verbatim. We don’t even need to deal with prime ideals and can factor elements instead.

Definition 1

If ${R}$ is a UFD, then ${p \in R}$ is a prime element if ${(p)}$ is a prime ideal and ${p \neq 0}$. For UFD’s this is equivalent to the following property: if ${p = xy}$ then either ${x}$ or ${y}$ is a unit.

So for example in ${\mathbb Z}$ the set of prime elements is ${\{\pm2, \pm3, \pm5, \dots\}}$. Now, since ${R}$ is a UFD, every nonzero element ${r}$ factors into a product of prime elements

$\displaystyle r = u p_1^{e_1} p_2^{e_2} \dots p_m^{e_m}$

where ${u}$ is a unit.

Definition 2

We say ${r}$ divides ${s}$ if ${s = r'r}$ for some ${r' \in R}$. This is written ${r \mid s}$.

Example 3 (Divisibility in ${\mathbb Z}$)

The number ${0}$ is divisible by every element of ${\mathbb Z}$. All other divisibility is as expected.

Ques 4

Show that ${r \mid s}$ if and only if the exponent of each prime in ${r}$ is less than or equal to the corresponding exponent in ${s}$.

Now, the case of interest is the even stronger case when ${R}$ is a PID:

Proposition 5 (PID’s are Noetherian UFD’s)

If ${R}$ is a PID, then it is Noetherian and also a UFD.

Proof: The fact that ${R}$ is Noetherian is obvious. For ${R}$ to be a UFD we essentially repeat the proof for ${\mathbb Z}$, using the fact that ${(a,b)}$ is principal in order to extract ${\gcd(a,b)}$. $\Box$

In this case, we have a Chinese remainder theorem for elements.

Theorem 6 (Chinese remainder theorem for rings)

Let ${m}$ and ${n}$ be relatively prime elements, meaning ${(m) + (n) = (1)}$. Then

$\displaystyle R / (mn) \cong R/m \times R/n.$

Proof: This is the same as the proof of the usual Chinese remainder theorem. First, since ${(m,n)=(1)}$ we have ${am+bn=1}$ for some ${a}$ and ${b}$. Then we have a map

$\displaystyle R/m \times R/n \rightarrow R/(mn) \quad\text{by}\quad (r,s) \mapsto r \cdot bn + s \cdot am.$

One can check that this map is well-defined and an isomorphism of rings. (Diligent readers invited to do so.) $\Box$
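For the prototypical case ${R = \mathbb Z}$ the check can be delegated to a computer. The sketch below builds ${a}$ and ${b}$ with the extended Euclidean algorithm and confirms that ${(r,s) \mapsto r \cdot bn + s \cdot am}$ hits every residue mod ${mn}$ exactly once and reduces back to ${(r,s)}$. The function names are mine.

```python
def extended_gcd(m, n):
    """Return (g, a, b) with a*m + b*n == g, where g generates the ideal (m, n)."""
    if n == 0:
        return (m, 1, 0)
    g, a, b = extended_gcd(n, m % n)
    return (g, b, a - (m // n) * b)

def crt_map(r, s, m, n):
    """The map (r, s) -> r*b*n + s*a*m (mod mn) from the proof, for R = Z."""
    g, a, b = extended_gcd(m, n)
    assert g == 1, "m and n must be relatively prime"
    return (r * b * n + s * a * m) % (m * n)

m, n = 3, 5
for r in range(m):
    for s in range(n):
        x = crt_map(r, s, m, n)
        assert (x % m, x % n) == (r, s)
        print((r, s), "->", x)
```

The point is that ${bn \equiv 1 \pmod m}$ and ${bn \equiv 0 \pmod n}$ (and symmetrically for ${am}$), so the image of ${(r,s)}$ is ${r}$ mod ${m}$ and ${s}$ mod ${n}$.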

Finally, we need to introduce the concept of a Noetherian ${R}$-module.

Definition 7

An ${R}$-module ${M}$ is Noetherian if it satisfies one of the two equivalent conditions:

• Its submodules obey the ascending chain condition: there is no infinite sequence of modules ${M_1 \subsetneq M_2 \subsetneq \dots}$.
• All submodules of ${M}$ (including ${M}$ itself) are finitely generated.

This generalizes the notion of a Noetherian ring: a Noetherian ring ${R}$ is one for which ${R}$ is Noetherian as an ${R}$-module.

Ques 8

Check these two conditions are equivalent. (Copy the proof for rings.)

## 2. The structure theorem

Our structure theorem takes two forms:

Theorem 9 (Structure theorem, invariant form)

Let ${R}$ be a PID and let ${M}$ be any finitely generated ${R}$-module. Then

$\displaystyle M \cong \bigoplus_{i=1}^m R/s_i$

for some ${s_i}$ satisfying ${s_1 \mid s_2 \mid \dots \mid s_m}$.

Corollary 10 (Structure theorem, primary form)

Let ${R}$ be a PID and let ${M}$ be any finitely generated ${R}$-module. Then

$\displaystyle M \cong R^{\oplus r} \oplus R/(q_1) \oplus R/(q_2) \oplus \dots \oplus R/(q_m)$

where ${q_i = p_i^{e_i}}$ for some prime element ${p_i}$ and integer ${e_i \ge 1}$.

Proof: Factor each ${s_i}$ into prime factors (since ${R}$ is a UFD), then use the Chinese remainder theorem. $\Box$

Remark 11

In both theorems the decomposition is unique up to permutations of the summands; good to know, but I won’t prove this.

## 3. Reduction to maps of free ${R}$-modules

The proof of the structure theorem proceeds in two main steps. First, we reduce the problem to a linear algebra problem involving free ${R}$-modules ${R^{\oplus d}}$. Once that’s done, we just have to play with matrices; this is done in the next section.

Suppose ${M}$ is finitely generated by ${d}$ elements. Then there is a surjective map of ${R}$-modules

$\displaystyle R^{\oplus d} \twoheadrightarrow M$

whose image on the basis of ${R^{\oplus d}}$ are the generators of ${M}$. Let ${K}$ denote the kernel.

We claim that ${K}$ is finitely generated as well. To this end we prove that

Lemma 12 (Direct sum of Noetherian modules is Noetherian)

Let ${M}$ and ${N}$ be two Noetherian ${R}$-modules. Then the direct sum ${M \oplus N}$ is also a Noetherian ${R}$-module.

Proof: It suffices to show that if ${L \subseteq M \oplus N}$ is a submodule, then ${L}$ is finitely generated. It’s unfortunately not true that ${L = P \oplus Q}$ for some ${P \subseteq M}$ and ${Q \subseteq N}$ (take ${M = N = \mathbb Z}$ and ${L = \{(n,n) \mid n \in \mathbb Z\}}$) so we will have to be more careful.

Consider the submodules

\displaystyle \begin{aligned} A &= \left\{ x \in M \mid (x,0) \in L \right\} \subseteq M \\ B &= \left\{ y \in N \mid \exists x \in M : (x,y) \in L \right\} \subseteq N. \end{aligned}

(Note the asymmetry for ${A}$ and ${B}$: the proof doesn’t work otherwise.) Then ${A}$ is finitely generated by ${a_1, \dots, a_k}$, and ${B}$ is finitely generated by ${b_1, \dots, b_\ell}$. Let ${x_i = (a_i, 0)}$ and let ${y_i = (\ast, b_i)}$ be elements of ${L}$ (where the ${\ast}$’s are arbitrary things we don’t care about). Then the ${x_i}$ and ${y_i}$ together generate ${L}$. $\Box$

Ques 13

Deduce that for ${R}$ a PID, ${R^{\oplus d}}$ is Noetherian.

Hence ${K \subseteq R^{\oplus d}}$ is finitely generated as claimed. So we can find another surjective map ${R^{\oplus f} \twoheadrightarrow K}$. Consequently, we have a composition

$\displaystyle R^{\oplus f} \twoheadrightarrow K \hookrightarrow R^{\oplus d} \twoheadrightarrow M.$

Observe that ${M}$ is the cokernel of the composition ${T : R^{\oplus f} \rightarrow R^{\oplus d}}$, i.e. we have that

$\displaystyle M \cong R^{\oplus d} / \text{im }(T).$

So it suffices to understand the map ${T}$ well.

## 4. Smith normal form

The idea is now that we have reduced our problem to studying linear maps ${T : R^{\oplus m} \rightarrow R^{\oplus n}}$, which can be thought of as a generic matrix

$\displaystyle T = \begin{pmatrix} a_{11} & \dots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \dots & a_{nm} \end{pmatrix}$

for the standard basis ${e_1, \dots, e_m}$ of ${R^{\oplus m}}$ and ${f_1, \dots, f_n}$ of ${R^{\oplus n}}$.

Of course, as you might expect it ought to be possible to change the given bases so that ${T}$ has a nicer matrix form. We already saw this in Jordan form, where we had a map ${T : V \rightarrow V}$ and changed the basis so that ${T}$ was “almost diagonal”. This time, we have two sets of bases we can change, so we would hope to get a diagonal matrix, or even better.

Before proceeding let’s think about how we might edit the matrix: what operations are permitted? Here are some examples:

• Swapping rows and columns, which just corresponds to re-ordering the basis.
• Adding a multiple of a column to another column. For example, if we add ${3}$ times the first column to the second column, this is equivalent to replacing the basis

$\displaystyle (e_1, e_2, e_3, \dots, e_m) \mapsto (e_1, e_2+3e_1, e_3, \dots, e_m).$

• Adding a multiple of a row to another row. One can see that adding ${3}$ times the first row to the second row is equivalent to replacing the basis

$\displaystyle (f_1, f_2, f_3, \dots, f_n) \mapsto (f_1-3f_2, f_2, f_3, \dots, f_n).$

More generally, if ${A}$ is an invertible ${n \times n}$ matrix we can replace ${T}$ with ${AT}$. This corresponds to replacing

$\displaystyle (f_1, \dots, f_n) \mapsto (A^{-1}(f_1), \dots, A^{-1}(f_n))$

(the “invertible” condition just guarantees the latter is a basis). Of course similarly we can replace ${T}$ with ${TB}$ where ${B}$ is an invertible ${m \times m}$ matrix; this corresponds to

$\displaystyle (e_1, \dots, e_m) \mapsto (B(e_1), \dots, B(e_m)).$

Armed with this knowledge, we can now approach the following result.

Theorem 14 (Smith normal form)

Let ${R}$ be a PID. Let ${M = R^{\oplus m}}$ and ${N = R^{\oplus n}}$ be free ${R}$-modules and let ${T : M \rightarrow N}$ be a linear map. Set ${k = \min(m,n)}$.

Then we can select a pair of new bases for ${M}$ and ${N}$ such that ${T}$ has only diagonal entries ${s_1, s_2, \dots, s_k}$ and ${s_1 \mid s_2 \mid \dots \mid s_k}$.

So if ${m > n}$, the matrix should take the form

$\displaystyle \begin{pmatrix} s_1 & 0 & 0 & 0 & \dots & 0 \\ 0 & s_2 & 0 & 0 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots & \dots & \vdots \\ 0 & 0 & 0 & s_n & \dots & 0 \end{pmatrix}.$

and similarly when ${m \le n}$.

Ques 15

Show that Smith normal form implies the structure theorem.

Remark 16

Note that this is not a generalization of Jordan form.

• In Jordan form we consider maps ${T : V \rightarrow V}$; note that the source and target space are the same, and we are considering one basis for the space ${V}$.
• In Smith form the maps ${T : M \rightarrow N}$ are between different modules, and we pick two sets of bases (one for ${M}$ and one for ${N}$).

Example 17 (Example of Smith normal form)

To give a flavor of the idea of the proof, let’s work through a concrete example with the following matrix with entries from ${\mathbb Z}$:

$\displaystyle \begin{pmatrix} 18 & 38 & 48 \\ 14 & 30 & 38 \end{pmatrix}.$

The GCD of all the entries is ${2}$, and so motivated by this, we perform the Euclidean algorithm on the left column: subtract the second row from the first row, then three times the first row from the second:

$\displaystyle \begin{pmatrix} 18 & 38 & 48 \\ 14 & 30 & 38 \end{pmatrix} \mapsto \begin{pmatrix} 4 & 8 & 10 \\ 14 & 30 & 38 \end{pmatrix} \mapsto \begin{pmatrix} 4 & 8 & 10 \\ 2 & 6 & 8 \end{pmatrix}.$

Now that the GCD of ${2}$ is present, we move it to the upper-left by switching the two rows, and then kill off all the entries in the same row/column; since ${2}$ was the GCD all along, we isolate ${2}$ completely:

$\displaystyle \begin{pmatrix} 4 & 8 & 10 \\ 2 & 6 & 8 \end{pmatrix} \mapsto \begin{pmatrix} 2 & 6 & 8 \\ 4 & 8 & 10 \end{pmatrix} \mapsto \begin{pmatrix} 2 & 6 & 8 \\ 0 & -4 & -6 \\ \end{pmatrix} \mapsto \begin{pmatrix} 2 & 0 & 0 \\ 0 & -4 & -6 \end{pmatrix}.$

This reduces the problem to a ${1 \times 2}$ matrix. So we just apply the Euclidean algorithm again there:

$\displaystyle \begin{pmatrix} 2 & 0 & 0 \\ 0 & -4 & -6 \end{pmatrix} \mapsto \begin{pmatrix} 2 & 0 & 0 \\ 0 & -4 & 2 \end{pmatrix} \mapsto \begin{pmatrix} 2 & 0 & 0 \\ 0 & 0 & 2 \end{pmatrix} \mapsto \begin{pmatrix} 2 & 0 & 0 \\ 0 & 2 & 0 \end{pmatrix}.$
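As an independent check on the final answer, one can use a standard fact that I am not proving in this post: over ${\mathbb Z}$, the product ${s_1 s_2 \cdots s_k}$ equals (up to sign) the GCD of all ${k \times k}$ minors of the matrix. The sketch below computes the diagonal entries that way; the function names are my own, and this is a verification trick rather than the algorithm used in the proof.

```python
from itertools import combinations
from math import gcd

def det(matrix):
    """Determinant of a small square integer matrix by cofactor expansion."""
    if len(matrix) == 1:
        return matrix[0][0]
    total = 0
    for j, entry in enumerate(matrix[0]):
        minor = [row[:j] + row[j + 1:] for row in matrix[1:]]
        total += (-1) ** j * entry * det(minor)
    return total

def invariant_factors(matrix):
    """Diagonal entries s_1 | s_2 | ... over Z (up to sign), via gcds of minors."""
    rows, cols = len(matrix), len(matrix[0])
    size = min(rows, cols)
    factors, previous = [], 1
    for k in range(1, size + 1):
        d_k = 0  # gcd of all k x k minors
        for r in combinations(range(rows), k):
            for c in combinations(range(cols), k):
                sub = [[matrix[i][j] for j in c] for i in r]
                d_k = gcd(d_k, det(sub))
        if d_k == 0:
            # All larger minors vanish too, so the remaining entries are 0.
            factors.extend([0] * (size - len(factors)))
            break
        factors.append(d_k // previous)
        previous = d_k
    return factors

print(invariant_factors([[18, 38, 48], [14, 30, 38]]))  # prints [2, 2]
```

Here the ${2 \times 2}$ minors are ${8}$, ${12}$ and ${4}$, with GCD ${4 = 2 \cdot 2}$, matching the diagonal entries found by hand.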

Now all we have to do is generalize this procedure to work with any PID. It’s intuitively clear how to do this: the PID condition more or less lets you perform a Euclidean algorithm.

Proof: Begin with a generic matrix

$\displaystyle T = \begin{pmatrix} a_{11} & \dots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \dots & a_{nm} \end{pmatrix}$

We want to show, by a series of operations (gradually changing the given basis) that we can rearrange the matrix into Smith normal form.

Define ${\gcd(x,y)}$ to be any generator of the principal ideal ${(x,y)}$.

Claim 18 (“Euclidean algorithm”)

If ${a}$ and ${b}$ are entries in the same row or column, we can change bases to replace ${a}$ with ${\gcd(a,b)}$ and ${b}$ with something else.

Proof: We do just the case of columns. By hypothesis, ${\gcd(a,b) = xa+yb}$ for some ${x,y \in R}$. We must have ${(x,y) = (1)}$ now (we’re in a UFD). So there are ${u}$ and ${v}$ such that ${xu + yv = 1}$. Then

$\displaystyle \begin{pmatrix} x & y \\ -v & u \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} \gcd(a,b) \\ \text{something} \end{pmatrix}$

and the first matrix is invertible (check this!), as desired. $\Box$
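For ${R = \mathbb Z}$ the claim can be made completely explicit. In the sketch below I pick ${u = a/\gcd(a,b)}$ and ${v = b/\gcd(a,b)}$, which is one valid choice (not the only one) satisfying ${xu + yv = 1}$; with it, the “something” even becomes ${0}$. The function names are mine, and the code assumes ${a}$ and ${b}$ are not both zero.

```python
def extended_gcd(a, b):
    """Return (g, x, y) with x*a + y*b == g, where g generates the ideal (a, b)."""
    if b == 0:
        return (a, 1, 0)
    g, x, y = extended_gcd(b, a % b)
    return (g, y, x - (a // b) * y)

def gcd_matrix(a, b):
    """Return (g, [[x, y], [-v, u]]): a determinant-1 matrix sending (a, b) to (g, 0).

    Assumes a and b are not both zero.
    """
    g, x, y = extended_gcd(a, b)
    u, v = a // g, b // g  # exact divisions, and x*u + y*v == 1
    return g, [[x, y], [-v, u]]

a, b = 18, 14
g, M = gcd_matrix(a, b)
top = M[0][0] * a + M[0][1] * b
bottom = M[1][0] * a + M[1][1] * b
determinant = M[0][0] * M[1][1] - M[0][1] * M[1][0]
print(g, M, (top, bottom), determinant)
```

The determinant is ${xu + yv = 1}$, a unit, which is exactly why the matrix is invertible over ${R}$.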

Let ${s_1 = (a_{ij})_{i,j}}$ be the GCD of all entries. Now by repeatedly applying this algorithm, we can cause ${s_1}$ to appear in the upper left hand corner. Then, we use it to kill off all the entries in the first row and the first column, thus arriving at a matrix

$\displaystyle \begin{pmatrix} s_1 & 0 & 0 & \dots & 0 \\ 0 & a_{22}' & a_{23}' & \dots & a_{2m}' \\ 0 & a_{32}' & a_{33}' & \dots & a_{3m}' \\ \vdots&\vdots&\vdots&\ddots&\vdots \\ 0 & a_{n2}' & a_{n3}' & \dots & a_{nm}' \\ \end{pmatrix}.$

Now we repeat the same procedure with this lower-right ${(n-1) \times (m-1)}$ matrix, and so on. This gives the Smith normal form. $\Box$

With the Smith normal form, we have in the original situation that

$\displaystyle M \cong R^{\oplus d} / \text{im } T$

and applying the theorem to ${T}$ completes the proof of the structure theorem.

## 5. Applications

Now, we can apply our structure theorem! I’ll just sketch proofs of these and let the reader fill in details.

Corollary 19 (Finite-dimensional vector spaces are all isomorphic)

Suppose a vector space ${V}$ over a field ${k}$ has a finite spanning set of vectors. Then for some ${n}$, ${V \cong k^{\oplus n}}$.

Proof: In the structure theorem, ${k / (s_i) \in \{0,k\}}$. $\Box$

Corollary 20 (Frobenius normal form)

Let ${T : V \rightarrow V}$ where ${V}$ is a finite-dimensional vector space over an arbitrary field ${k}$ (not necessarily algebraically closed). Then one can write ${T}$ as a block-diagonal matrix whose blocks are all of the form

$\displaystyle \begin{pmatrix} 0 & 0 & 0 & \dots & 0 & \ast \\ 1 & 0 & 0 & \dots & 0 & \ast \\ 0 & 1 & 0 & \dots & 0 & \ast \\ \vdots&\vdots&\vdots&\ddots&\vdots&\vdots \\ 0 & 0 & 0 & \dots & 1 & \ast \\ \end{pmatrix}.$

Proof: View ${V}$ as a ${k[x]}$-module with action ${x \cdot v = T(v)}$. By the structure theorem, ${V \cong \bigoplus_i k[x] / (s_i)}$ for some polynomials ${s_i}$, where ${s_1 \mid s_2 \mid s_3 \mid \dots}$. Write each block in the form described. $\Box$

Corollary 21 (Jordan normal form)

Let ${T : V \rightarrow V}$ where ${V}$ is a finite-dimensional vector space over a field ${k}$ which is algebraically closed. Then ${T}$ can be written in Jordan form.

Proof: We now use the structure theorem in its primary form. Since ${k}$ is algebraically closed each ${p_i}$ is a linear factor, so every summand looks like ${k[x] / (x-a)^m}$ for some ${a}$ and ${m}$. $\Box$

This is a draft of Chapter 15 of the Napkin.

# Miller-Rabin (for MIT 18.434)

This is a transcript of a talk I gave as part of MIT’s 18.434 class, the “Seminar in Theoretical Computer Science” as part of MIT’s communication requirement. (Insert snarky comment about MIT’s CI-* requirements here.) It probably would have made a nice math circle talk for high schoolers but I felt somewhat awkward having to present it to a bunch of students who were clearly older than me.

## 1. Preliminaries

### 1.1. Modular arithmetic

In middle school you might have encountered questions such as

Exercise 1

What is ${3^{2016} \pmod{10}}$?

You could answer such questions by listing out ${3^n}$ for small ${n}$ and then finding a pattern, in this case of period ${4}$. However, for large moduli this “brute-force” approach can be time-consuming.

Fortunately, it turns out that one can predict the period in advance.

Theorem 2 (Euler’s little theorem)

1. Let ${\gcd(a,n) = 1}$. Then ${a^{\phi(n)} \equiv 1 \pmod n}$.
2. (Fermat) If ${p}$ is a prime, then ${a^p \equiv a \pmod p}$ for every ${a}$.

Proof: Part 1 is a special case of Lagrange’s Theorem: if ${G}$ is a finite group and ${g \in G}$, then ${g^{|G|}}$ is the identity element. Now select ${G = (\mathbb Z/n\mathbb Z)^\times}$. Part 2 follows from the case ${n=p}$: multiply ${a^{p-1} \equiv 1}$ by ${a}$ when ${p \nmid a}$, and note both sides are ${0}$ when ${p \mid a}$. $\Box$

Thus, in the middle school problem we know in advance that ${3^4 \equiv 1 \pmod{10}}$ because ${\phi(10) = 4}$. This bound is sharp for primes:

Theorem 3 (Primitive roots)

For every ${p}$ prime there’s a ${g \pmod p}$ such that ${g^{p-1} \equiv 1 \pmod p}$ but ${g^{k} \not\equiv 1 \pmod p}$ for any ${0 < k < p-1}$. (Hence ${(\mathbb Z/p\mathbb Z)^\times \cong \mathbb Z/(p-1)}$.)

For a proof, see the last exercise of my orders handout.

We will define the following anyways:

Definition 4

We say an integer ${n}$ (thought of as an exponent) annihilates the prime ${p}$ if

• ${a^n \equiv 1 \pmod p}$ for every ${a \not\equiv 0 \pmod p}$,
• or equivalently, ${p-1 \mid n}$.

Theorem 5 (All/nothing)

Suppose an exponent ${n}$ does not annihilate the prime ${p}$. Then more than ${\frac{1}{2} p}$ of ${x \pmod p}$ satisfy ${x^n \not\equiv 1 \pmod p}$.

Proof: A much stronger result is true: if ${x^n \equiv 1 \pmod p}$ then ${x^{\gcd(n,p-1)} \equiv 1 \pmod p}$. $\Box$

### 1.2. Repeated Exponentiation

Even without the previous facts, one can still do:

Theorem 6 (Repeated exponentiation)

Given ${x}$ and ${n}$, one can compute ${x^n \pmod N}$ with ${O(\log n)}$ multiplications mod ${N}$.

The idea is that to compute ${x^{600} \pmod N}$, one just multiplies ${x^{512+64+16+8}}$. All the ${x^{2^k}}$ can be computed in ${k}$ steps, and ${k \le \log_2 n}$.
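Here is a minimal sketch of repeated squaring that also counts the multiplications it performs, so the ${O(\log n)}$ claim can be observed directly. The function name and the choice to return the count are mine; in practice Python’s built-in three-argument `pow` does the same job.

```python
def power_mod(x, n, N):
    """Compute x**n mod N by repeated squaring.

    Returns (value, number of modular multiplications performed).
    """
    result, square, mults = 1 % N, x % N, 0
    while n > 0:
        if n & 1:
            # The current binary digit of n is 1: fold x^(2^k) into the answer.
            result = result * square % N
            mults += 1
        n >>= 1
        if n:
            # Prepare x^(2^(k+1)) for the next binary digit.
            square = square * square % N
            mults += 1
    return result, mults

value, mults = power_mod(3, 2016, 10)
print(value, mults)
```

For ${n = 600}$, which has ${10}$ binary digits and four ${1}$’s, this uses ${9}$ squarings and ${4}$ other multiplications.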

### 1.3. Chinese remainder theorem

In the middle school problem, we might have noticed that to compute ${3^{2016} \pmod{10}}$, it suffices to compute it modulo ${5}$, because we already know it is odd. More generally, to understand arithmetic ${\pmod n}$ it suffices to understand it modulo each of the prime powers dividing ${n}$.

The formal statement, which we include for completeness, is:

Theorem 7 (Chinese remainder theorem)

Let ${p_1, p_2, \dots, p_m}$ be distinct primes, let ${e_i \ge 1}$ be integers, and let ${n = p_1^{e_1} p_2^{e_2} \dots p_m^{e_m}}$. Then there is a ring isomorphism given by the natural projection

$\displaystyle \mathbb Z/n \rightarrow \prod_{i=1}^m \mathbb Z/p_i^{e_i}.$

In particular, a random choice of ${x \pmod n}$ amounts to a random choice of ${x}$ mod each prime power.

For an example, in the following table we see the natural bijection between ${x \pmod{15}}$ and ${(x \pmod 3, x \pmod 5)}$.

$\displaystyle \begin{array}{c|cc} x \pmod{15} & x \pmod{3} & x \pmod{5} \\ \hline 0 & 0 & 0 \\ 1 & 1 & 1 \\ 2 & 2 & 2 \\ 3 & 0 & 3 \\ 4 & 1 & 4 \\ 5 & 2 & 0 \\ 6 & 0 & 1 \\ 7 & 1 & 2 \end{array} \quad \begin{array}{c|cc} x \pmod{15} & x \pmod{3} & x \pmod{5} \\ \hline 8 & 2 & 3 \\ 9 & 0 & 4 \\ 10 & 1 & 0 \\ 11 & 2 & 1 \\ 12 & 0 & 2 \\ 13 & 1 & 3 \\ 14 & 2 & 4 \\ && \end{array}$

## 2. The RSA algorithm

This simple number theory is enough to develop the so-called RSA algorithm. Suppose Alice wants to send Bob a message ${M}$ over an insecure channel. They can do so as follows.

• Bob selects integers ${d}$, ${e}$ and ${N}$ (with ${N}$ huge) such that ${N}$ is a semiprime and

$\displaystyle de \equiv 1 \pmod{\phi(N)}.$

• Bob publishes both the number ${N}$ and ${e}$ (the public key) but keeps the number ${d}$ secret (the private key).
• Alice sends the number ${X = M^e \pmod N}$ across the channel.
• Bob computes

$\displaystyle X^d \equiv M^{de} \equiv M^1 \equiv M \pmod N$

and hence obtains the message ${M}$.
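To make the protocol concrete, here is a toy sketch with tiny primes. The specific numbers ${p = 61}$, ${q = 53}$, ${e = 17}$ are just an illustrative choice of mine, and of course nothing this small is remotely secure. It uses `pow(e, -1, phi)` for the modular inverse, which needs Python 3.8 or later.

```python
from math import gcd

def make_keys(p, q, e):
    """Bob's step: build (N, e, d) from two distinct primes p, q and exponent e."""
    N = p * q
    phi = (p - 1) * (q - 1)  # phi(N) for a semiprime N = pq
    assert gcd(e, phi) == 1, "e must be invertible mod phi(N)"
    d = pow(e, -1, phi)      # d*e == 1 (mod phi(N))
    return N, e, d

def encrypt(M, e, N):
    """Alice's step: X = M^e mod N."""
    return pow(M, e, N)

def decrypt(X, d, N):
    """Bob's step: X^d = M^(de) = M mod N."""
    return pow(X, d, N)

N, e, d = make_keys(61, 53, 17)
M = 65
X = encrypt(M, e, N)
print(N, e, d, decrypt(X, d, N) == M)
```

Both `encrypt` and `decrypt` are a single modular exponentiation, which is cheap by repeated squaring even when ${N}$ has thousands of bits.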

In practice, the ${N}$ in RSA is at least ${2000}$ bits long.

The trick is that no efficient way is known for an adversary to compute ${d}$ from ${e}$ and ${N}$ without knowing the prime factorization of ${N}$. So the security relies heavily on the difficulty of factoring.

Remark 8

It turns out that we basically don’t know how to factor large numbers ${N}$: the best known classical algorithms can factor an ${n}$-bit number in

$\displaystyle O\left( \exp\left( \left( \frac{64}{9} n (\log n)^2 \right)^{1/3} \right) \right)$

time (“general number field sieve”). On the other hand, with a quantum computer one can do this in ${O\left( n^2 \log n \log \log n \right)}$ time.

## 3. Primality Testing

Main question: if we can’t factor a number ${n}$ quickly, can we at least check it’s prime?

In what follows, we assume for simplicity that ${n}$ is squarefree, i.e. ${n = p_1 p_2 \dots p_k}$ for distinct primes ${p_i}$. This doesn’t substantially change anything, but it makes my life much easier.

### 3.1. Co-RP

Here is the goal: we need to show there is a random algorithm ${A}$ which does the following.

• If ${n}$ is composite, then more than half the time ${A}$ says “definitely composite”, and only occasionally does ${A}$ say “possibly prime”.
• If ${n}$ is prime, ${A}$ always says “possibly prime”.

If there is a polynomial time algorithm ${A}$ that does this, we say that PRIMES is in Co-RP. Clearly, this is a very good thing to be true!

### 3.2. Fermat

One idea is to try to use the converse of Fermat’s little theorem: given an integer ${n}$, pick a random number ${x \pmod n}$ and see if ${x^{n-1} \equiv 1 \pmod n}$. (We compute the power using repeated squaring.) If not, then we know for sure ${n}$ is not prime, and we call ${x}$ a Fermat witness modulo ${n}$.

How good is this test? For most composite ${n}$, pretty good:

Proposition 9

Let ${n}$ be composite. Assume that there is a prime ${p \mid n}$ such that ${n-1}$ does not annihilate ${p}$. Then over half the numbers mod ${n}$ are Fermat witnesses.

Proof: Apply the Chinese theorem then the “all-or-nothing” theorem. $\Box$
Unfortunately, if ${n}$ doesn’t satisfy the hypothesis, then all ${x}$ with ${\gcd(x,n) = 1}$ satisfy ${x^{n-1} \equiv 1 \pmod n}$!

Are there such ${n}$ which aren’t prime? Such numbers are called Carmichael numbers, and unfortunately they do exist; the first one is ${561 = 3 \cdot 11 \cdot 17}$.

Remark 10

For ${X \gg 1}$, there are more than ${X^{1/3}}$ Carmichael numbers at most ${X}$.

Thus these numbers are very rare, but they foil the Fermat test.
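The contrast between an ordinary composite and a Carmichael number is easy to see by brute force. The sketch below (function names are mine, not standard) lists the Fermat witnesses modulo ${n}$ and then keeps only those coprime to ${n}$.

```python
from math import gcd

def fermat_witnesses(n):
    """Return all x in 1..n-1 with x^(n-1) != 1 (mod n)."""
    return [x for x in range(1, n) if pow(x, n - 1, n) != 1]

def coprime_fermat_witnesses(n):
    """Fermat witnesses that are also coprime to n."""
    return [x for x in fermat_witnesses(n) if gcd(x, n) == 1]

for n in (15, 561):
    print(n, len(fermat_witnesses(n)), len(coprime_fermat_witnesses(n)))
```

For ${n = 15}$ most residues are witnesses, whereas for ${n = 561}$ the only witnesses are the ${x}$ sharing a factor with ${561}$, which is precisely the failure mode described above.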

Exercise 11

Show that a Carmichael number is not a semiprime.

### 3.3. Rabin-Miller

Fortunately, we can adapt the Fermat test to cover Carmichael numbers too. It comes from the observation that if ${n}$ is prime, then ${a^2 \equiv 1 \pmod n \implies a \equiv \pm 1 \pmod n}$.

So let ${n-1 = 2^s t}$, where ${t}$ is odd. For example, if ${n = 561}$ then ${560 = 2^4 \cdot 35}$. Then we compute ${x^t}$, ${x^{2t}}$, ${\dots}$, ${x^{n-1}}$. For example in the case ${n=561}$ and ${x=245}$:

$\displaystyle \begin{array}{c|r|rrr} & \mod 561 & \mod 3 & \mod 11 & \mod 17 \\ \hline x & 245 & -1 & 3 & 7 \\ \hline x^{35} & 122 & -1 & \mathbf 1 & 3 \\ x^{70} & 298 & \mathbf 1 & 1 & 9 \\ x^{140} & 166 & 1 & 1 & -4 \\ x^{280} & 67 & 1 & 1 & -1 \\ x^{560} & 1 & 1 & 1 & \mathbf 1 \end{array}$

And there we have our example! We have ${67^2 \equiv 1 \pmod{561}}$ but ${67 \not\equiv \pm 1 \pmod{561}}$, so ${561}$ isn’t prime.

So the Rabin-Miller test works as follows:

• Given ${n}$, select a random ${x}$ and compute powers of ${x}$ as in the table.
• If ${x^{n-1} \not\equiv 1}$, stop, ${n}$ is composite (Fermat test).
• If ${x^{n-1} \equiv 1}$, see if the entry just before the first ${1}$ is ${-1}$. If it isn’t then we say ${x}$ is an RM-witness and ${n}$ is composite.
• Otherwise, ${n}$ is “possibly prime”.
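The steps above translate fairly directly into code. What follows is a minimal sketch rather than a hardened implementation: the function names and the default of 20 rounds are my own choices, and I treat the edge case ${x^t \equiv 1}$ (where there is no entry before the first ${1}$) as “no evidence of compositeness”.

```python
import random

def rm_powers(x, n):
    """Return [x^t, x^(2t), ..., x^(n-1)] mod n, where n - 1 = 2^s * t with t odd."""
    s, t = 0, n - 1
    while t % 2 == 0:
        s, t = s + 1, t // 2
    row = [pow(x, t, n)]
    for _ in range(s):
        row.append(row[-1] * row[-1] % n)
    return row

def is_witness(x, n):
    """True if x proves that the odd number n > 2 is composite."""
    row = rm_powers(x, n)
    if row[-1] != 1:
        return True                      # fails the Fermat test
    first_one = row.index(1)
    if first_one == 0:
        return False                     # x^t = 1 already: nothing to check
    return row[first_one - 1] != n - 1   # entry before the first 1 must be -1

def possibly_prime(n, rounds=20):
    """False means definitely composite; True means no witness was found."""
    if n < 4:
        return n in (2, 3)
    if n % 2 == 0:
        return False
    return not any(is_witness(random.randrange(2, n - 1), n) for _ in range(rounds))

print(rm_powers(245, 561))   # the "mod 561" column of the table, below the first row
print(is_witness(245, 561))
```

Here `rm_powers(245, 561)` reproduces the column of the table above, and `is_witness(245, 561)` reports that ${245}$ exposes ${561}$ as composite.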

How likely is “possibly”?

Theorem 12

If ${n}$ is Carmichael, then over half the ${x \pmod n}$ are RM witnesses.

Proof: We sample ${x \pmod n}$ randomly again by looking modulo each prime (Chinese theorem). By the theorem on primitive roots, show that the probability the first ${-1}$ appears in any given row is ${\le \frac{1}{2}}$. This implies the conclusion. $\Box$

Exercise 13

Improve the ${\frac{1}{2}}$ in the theorem to ${\frac34}$ by using the fact that Carmichael numbers aren’t semiprime.

### 3.4. AKS

On August 6, 2002, it was in fact shown that PRIMES is in P, using the deterministic AKS algorithm. However, in practice everyone still uses Rabin-Miller since the implied constants for AKS runtime are large.

# Things Fourier

For some reason several classes at MIT this year involve Fourier analysis. I was always confused about this as a high schooler, because no one ever gave me the “orthonormal basis” explanation, so here goes. As a bonus, I also prove a form of Arrow’s Impossibility Theorem using binary Fourier analysis, and then talk about the fancier generalizations using Pontryagin duality and the Peter-Weyl theorem.

In what follows, we let ${\mathbb T = \mathbb R/\mathbb Z}$ denote the “circle group”, thought of as the additive group of “real numbers modulo ${1}$”. There is a canonical map ${e : \mathbb T \rightarrow \mathbb C}$ sending ${\mathbb T}$ to the complex unit circle, given by ${e(\theta) = \exp(2\pi i \theta)}$.

Disclaimer: I will deliberately be sloppy with convergence issues, in part because I don’t fully understand them myself, and in part because I don’t care.

## 1. Synopsis

Suppose we have a domain ${Z}$ and are interested in functions ${f : Z \rightarrow \mathbb C}$. Naturally, the set of such functions forms a complex vector space. We like to equip the set of such functions with a positive definite inner product. The idea of Fourier analysis is to then select an orthonormal basis for this set of functions, say ${(e_\xi)_{\xi}}$, which we call the characters; the indexing ${\xi}$ are called frequencies. In that case, since we have a basis, every function ${f : Z \rightarrow \mathbb C}$ becomes a sum

$\displaystyle f(x) = \sum_{\xi} \widehat f(\xi) e_\xi(x)$

where ${\widehat f(\xi)}$ are complex coefficients of the basis; appropriately we call ${\widehat f}$ the Fourier coefficients. The variable ${x \in Z}$ is referred to as the physical variable. This is generally good because the characters are deliberately chosen to be nice “symmetric” functions, like sine or cosine waves or other periodic functions. Thus we decompose an arbitrarily complicated function into a sum of nice ones.

For convenience, we record a few facts about orthonormal bases.

Proposition 1 (Facts about orthonormal bases)

Let ${V}$ be a complex Hilbert space with inner form ${\left< -,-\right>}$ and suppose ${x = \sum_\xi a_\xi e_\xi}$ and ${y = \sum_\xi b_\xi e_\xi}$ where ${e_\xi}$ are an orthonormal basis. Then

$\displaystyle \begin{aligned} \left< x,x \right> &= \sum_\xi |a_\xi|^2 \\ a_\xi &= \left< x, e_\xi \right> \\ \left< x,y \right> &= \sum_\xi a_\xi \overline{b_\xi}. \end{aligned}$

## 2. Common Examples

### 2.1. Binary Fourier analysis on ${\{\pm1\}^n}$

Let ${Z = \{\pm 1\}^n}$ for some positive integer ${n}$, so we are considering functions ${f(x_1, \dots, x_n)}$ accepting binary values. Then the functions ${Z \rightarrow \mathbb C}$ form a ${2^n}$-dimensional vector space ${\mathbb C^Z}$, and we endow it with the inner form

$\displaystyle \left< f,g \right> = \frac{1}{2^n} \sum_{x \in Z} f(x) \overline{g(x)}.$

In particular,

$\displaystyle \left< f,f \right> = \frac{1}{2^n} \sum_{x \in Z} \left\lvert f(x) \right\rvert^2$

is the average of the squares; this establishes also that ${\left< -,-\right>}$ is positive definite.

In that case, the multilinear monomials form an orthonormal basis of ${\mathbb C^Z}$, that is the polynomials

$\displaystyle \chi_S(x_1, \dots, x_n) = \prod_{s \in S} x_s.$

Thus our frequency set is actually the subsets ${S \subseteq \{1, \dots, n\}}$. Thus, we have a decomposition

$\displaystyle f = \sum_{S \subseteq \{1, \dots, n\}} \widehat f(S) \chi_S.$

Example 2 (An example of binary Fourier analysis)

Let ${n = 2}$. Then binary functions ${\{ \pm 1\}^2 \rightarrow \mathbb C}$ have a basis given by the four polynomials

$\displaystyle 1, \quad x_1, \quad x_2, \quad x_1x_2.$

For example, consider the function ${f}$ which is ${1}$ at ${(1,1)}$ and ${0}$ elsewhere. Then we can put

$\displaystyle f(x_1, x_2) = \frac{x_1+1}{2} \cdot \frac{x_2+1}{2} = \frac14 \left( 1 + x_1 + x_2 + x_1x_2 \right).$

So the Fourier coefficients are ${\widehat f(S) = \frac 14}$ for each of the four ${S}$’s.
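The coefficients in this example can also be found mechanically from ${\widehat f(S) = \left< f, \chi_S \right>}$. Here is a small sketch using exact arithmetic; the helper names are mine, positions are indexed from ${0}$ rather than ${1}$, and since ${\chi_S}$ is real-valued the complex conjugate is omitted. It assumes Python 3.8 or later for `math.prod`.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod

def chi(S, x):
    """The character chi_S(x): product of x_s over s in S (0-indexed positions)."""
    return prod(x[s] for s in S)

def fourier_coefficients(f, n):
    """Map each subset S of {0, ..., n-1} to <f, chi_S>, the average of f(x) * chi_S(x)."""
    points = list(product((1, -1), repeat=n))
    subsets = [S for k in range(n + 1) for S in combinations(range(n), k)]
    return {S: Fraction(sum(f(x) * chi(S, x) for x in points), 2 ** n) for S in subsets}

# The function from the example: 1 at (1, 1) and 0 elsewhere.
def indicator(x):
    return 1 if x == (1, 1) else 0

coeffs = fourier_coefficients(indicator, 2)
print(coeffs)
```

The same helper works for any ${n}$, though the cost grows like ${4^n}$ since it loops over every subset and every point.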

This notion is useful in particular for binary functions ${f : \{\pm1\}^n \rightarrow \{\pm1\}}$; for these functions (and products thereof), we always have ${\left< f,f \right> = 1}$.

It is worth noting that the frequency ${\varnothing}$ plays a special role:

Exercise 3

Show that

$\displaystyle \widehat f(\varnothing) = \frac{1}{|Z|} \sum_{x \in Z} f(x).$

### 2.2. Fourier analysis on finite groups ${Z}$

This is the Fourier analysis used in this post and this post. Here, we have a finite abelian group ${Z}$, and consider functions ${Z \rightarrow \mathbb C}$; this is a ${|Z|}$-dimensional vector space. The inner product is the same as before:

$\displaystyle \left< f,g \right> = \frac{1}{|Z|} \sum_{x \in Z} f(x) \overline{g(x)}.$

Now here is how we generate the characters. We equip ${Z}$ with a non-degenerate symmetric bilinear form

$\displaystyle Z \times Z \xrightarrow{\cdot} \mathbb T \qquad (\xi, x) \mapsto \xi \cdot x.$

Experts may already recognize this as a choice of isomorphism between ${Z}$ and its Pontryagin dual. This time the characters are given by

$\displaystyle \left( e_\xi \right)_{\xi \in Z} \qquad \text{where} \qquad e_\xi(x) = e(\xi \cdot x).$

In this way, the set of frequencies is also ${Z}$, but the ${\xi \in Z}$ play very different roles from the “physical” ${x \in Z}$. (It is not too hard to check these indeed form an orthonormal basis in the function space ${\mathbb C^{\left\lvert Z \right\rvert}}$, since we assumed that ${\cdot}$ is non-degenerate.)

Example 4 (Cube roots of unity filter)

Suppose ${Z = \mathbb Z/3\mathbb Z}$, with the bilinear form given by ${\xi \cdot x = (\xi x)/3}$. Let ${\omega = \exp(\frac 23 \pi i)}$ be a primitive cube root of unity. Note that

$\displaystyle e_\xi(x) = \begin{cases} 1 & \xi = 0 \\ \omega^x & \xi = 1 \\ \omega^{2x} & \xi = 2. \end{cases}$

Then given ${f : Z \rightarrow \mathbb C}$ with ${f(0) = a}$, ${f(1) = b}$, ${f(2) = c}$, we obtain

$\displaystyle f(x) = \frac{a+b+c}{3} \cdot 1 + \frac{a + \omega^2 b + \omega c}{3} \cdot \omega^x + \frac{a + \omega b + \omega^2 c}{3} \cdot \omega^{2x}.$

In this way we derive that the transforms are

$\displaystyle \begin{aligned} \widehat f(0) &= \frac{a+b+c}{3} \\ \widehat f(1) &= \frac{a+\omega^2 b+ \omega c}{3} \\ \widehat f(2) &= \frac{a+\omega b+\omega^2c}{3}. \end{aligned}$

Exercise 5

Show that

$\displaystyle \widehat f(0) = \frac{1}{|Z|} \sum_{x \in Z} f(x).$

Olympiad contestants may recognize the previous example as a “roots of unity filter”, which is exactly the point. For concreteness, suppose one wants to compute

$\displaystyle \binom{1000}{0} + \binom{1000}{3} + \dots + \binom{1000}{999}.$

In that case, we can consider the function

$\displaystyle w : \mathbb Z/3 \rightarrow \mathbb C$

such that ${w(0) = 1}$ but ${w(1) = w(2) = 0}$. By abuse of notation we will also think of ${w}$ as a function ${w : \mathbb Z \twoheadrightarrow \mathbb Z/3 \rightarrow \mathbb C}$. Then the sum in question is

$\displaystyle \begin{aligned} \sum_n \binom{1000}{n} w(n) &= \sum_n \binom{1000}{n} \sum_{k=0,1,2} \widehat w(k) \omega^{kn} \\ &= \sum_{k=0,1,2} \widehat w(k) \sum_n \binom{1000}{n} \omega^{kn} \\ &= \sum_{k=0,1,2} \widehat w(k) (1+\omega^k)^{1000}. \end{aligned}$

In our situation, we have ${\widehat w(0) = \widehat w(1) = \widehat w(2) = \frac13}$, and we have evaluated the desired sum. More generally, we can take any periodic weight ${w}$ and use Fourier analysis in order to interchange the order of summation.
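For the skeptical reader, the filter can be checked numerically. Since ${1+\omega}$ and ${1+\omega^2}$ are primitive sixth roots of unity, their ${1000}$th powers are ${\omega^2}$ and ${\omega}$, which sum to ${-1}$; so the filter predicts the value ${(2^{1000}-1)/3}$. The sketch below compares this against direct summation (assuming Python 3.8 or later for `math.comb`).

```python
import cmath
from math import comb

# Direct evaluation with exact integer arithmetic.
direct = sum(comb(1000, n) for n in range(0, 1001, 3))

# Value predicted by the roots of unity filter:
# (1/3) * ((1 + 1)^1000 + (1 + omega)^1000 + (1 + omega^2)^1000) = (2^1000 - 1) / 3.
filtered = (2 ** 1000 - 1) // 3

# Floating-point sanity check that the two complex terms really sum to -1.
omega = cmath.exp(2j * cmath.pi / 3)
tail = (1 + omega) ** 1000 + (1 + omega ** 2) ** 1000

print(direct == filtered, abs(tail + 1))
```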

Example 6 (Binary Fourier analysis)

Suppose ${Z = \{\pm 1\}^n}$, viewed as an abelian group under pointwise multiplication hence isomorphic to ${(\mathbb Z/2\mathbb Z)^{\oplus n}}$. Assume we pick the dot product defined by

$\displaystyle \xi \cdot x = \frac{1}{2} \sum_i \frac{\xi_i - 1}{2} \cdot \frac{x_i - 1}{2}$

where ${\xi = (\xi_1, \dots, \xi_n)}$ and ${x = (x_1, \dots, x_n)}$.

We claim this coincides with the first example we gave. Indeed, let ${S \subseteq \{1, \dots, n\}}$ and let ${\xi \in \{\pm1\}^n}$ be the element which is ${-1}$ at positions in ${S}$, and ${+1}$ at positions not in ${S}$. Then the character ${\chi_S}$ from the previous example coincides with the character ${e_\xi}$ in the new notation. In particular, ${\widehat f(S) = \widehat f(\xi)}$.

Thus Fourier analysis on a finite group ${Z}$ subsumes binary Fourier analysis.

### 2.3. Fourier series for functions in ${L^2([-\pi, \pi])}$

Now we consider the space ${L^2([-\pi, \pi])}$ of square-integrable functions ${[-\pi, \pi] \rightarrow \mathbb C}$, with inner form

$\displaystyle \left< f,g \right> = \frac{1}{2\pi} \int_{[-\pi, \pi]} f(x) \overline{g(x)}.$

Sadly, this is not a finite-dimensional vector space, but fortunately it is a Hilbert space so we are still fine. In this case, an orthonormal basis must allow infinite linear combinations, as long as the sum of squares is finite.

Now, it turns out in this case that

$\displaystyle (e_n)_{n \in \mathbb Z} \qquad\text{where}\qquad e_n(x) = \exp(inx)$

is an orthonormal basis for ${L^2([-\pi, \pi])}$. Thus this time the frequency set ${\mathbb Z}$ is infinite. So every function ${f \in L^2([-\pi, \pi])}$ decomposes as

$\displaystyle f(x) = \sum_n \widehat f(n) \exp(inx)$

for some complex coefficients ${\widehat f(n)}$.

This is a little worse than our finite examples: instead of a finite sum on the right-hand side, we actually have an infinite sum. This is because our set of frequencies is now ${\mathbb Z}$, which isn’t finite. In this case the ${\widehat f}$ need not be finitely supported, but do satisfy ${\sum_n |\widehat f(n)|^2 < \infty}$.

Since the frequency set is indexed by ${\mathbb Z}$, we call this a Fourier series to reflect the fact that the index is ${n \in \mathbb Z}$.

Exercise 7

Show once again

$\displaystyle \widehat f(0) = \frac{1}{2\pi} \int_{[-\pi, \pi]} f(x).$

Often we require that the function ${f}$ satisfies ${f(-\pi) = f(\pi)}$, so that ${f}$ becomes a periodic function, and we can think of it as ${f : \mathbb T \rightarrow \mathbb C}$.

### 2.4. Summary

We summarize our various flavors of Fourier analysis in the following table.

$\displaystyle \begin{array}{llll} \hline \text{Type} & \text{Physical var} & \text{Frequency var} & \text{Basis functions} \\ \hline \textbf{Binary} & \{\pm1\}^n & \text{Subsets } S \subseteq \left\{ 1, \dots, n \right\} & \prod_{s \in S} x_s \\ \textbf{Finite group} & Z & \xi \in Z, \text{ choice of } \cdot, & e(\xi \cdot x) \\ \textbf{Fourier series} & \mathbb T \text{ or } [-\pi, \pi] & n \in \mathbb Z & \exp(inx) \\ \end{array}$

In fact, we will soon see that all these examples are subsumed by Pontryagin duality for compact groups ${G}$.

## 3. Parseval and friends

The notion of an orthonormal basis makes several “big-name” results in Fourier analysis quite lucid. Basically, we can take every result from Proposition 1, translate it into the context of our Fourier analysis, and get a big-name result.

Corollary 8 (Parseval theorem)

Let ${f : Z \rightarrow \mathbb C}$, where ${Z}$ is a finite abelian group. Then

$\displaystyle \sum_\xi |\widehat f(\xi)|^2 = \frac{1}{|Z|} \sum_{x \in Z} |f(x)|^2.$

Similarly, if ${f : [-\pi, \pi] \rightarrow \mathbb C}$ is square-integrable then its Fourier series satisfies

$\displaystyle \sum_n |\widehat f(n)|^2 = \frac{1}{2\pi} \int_{[-\pi, \pi]} |f(x)|^2.$

Proof: Recall that ${\left< f,f\right>}$ is equal to the square sum of the coefficients. $\Box$

Corollary 9 (Formulas for ${\widehat f}$)

Let ${f : Z \rightarrow \mathbb C}$, where ${Z}$ is a finite abelian group. Then

$\displaystyle \widehat f(\xi) = \frac{1}{|Z|} \sum_{x \in Z} f(x) \overline{e_\xi(x)}.$

Similarly, if ${f : [-\pi, \pi] \rightarrow \mathbb C}$ is square-integrable then its Fourier series is given by

$\displaystyle \widehat f(n) = \frac{1}{2\pi} \int_{[-\pi, \pi]} f(x) \exp(-inx).$

Proof: Recall that in an orthonormal basis ${(e_\xi)_\xi}$, the coefficient of ${e_\xi}$ in ${f}$ is ${\left< f, e_\xi\right>}$. $\Box$
Note in particular what happens if we select ${\xi = 0}$ in the above!
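Both corollaries are easy to test numerically in the finite case. The following sketch works on ${Z = \mathbb Z/5\mathbb Z}$ with the pairing ${\xi \cdot x = \xi x / 5}$ (my choice, generalizing the cube roots of unity example); the sample values of ${f}$ are arbitrary.

```python
import cmath

def e_char(xi, x, n):
    """Character e_xi(x) = exp(2 pi i xi x / n) on Z/n."""
    return cmath.exp(2j * cmath.pi * xi * x / n)

def fourier(f):
    """Coefficients hat f(xi) = (1/n) sum_x f(x) conj(e_xi(x)), for f given as a list."""
    n = len(f)
    return [sum(f[x] * e_char(xi, x, n).conjugate() for x in range(n)) / n
            for xi in range(n)]

f = [3, 1 + 2j, -4, 0.5, 2j]       # an arbitrary function on Z/5
fhat = fourier(f)

# Inversion: f(x) = sum over xi of hat f(xi) e_xi(x).
rebuilt = [sum(fhat[xi] * e_char(xi, x, 5) for xi in range(5)) for x in range(5)]

# Parseval: the sum of |hat f|^2 equals the average of |f|^2.
lhs = sum(abs(c) ** 2 for c in fhat)
rhs = sum(abs(v) ** 2 for v in f) / 5

print(abs(lhs - rhs), max(abs(a - b) for a, b in zip(rebuilt, f)))
```

Both printed quantities should be zero up to floating-point error, and `fhat[0]` is just the average of `f`.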

Corollary 10 (Plancherel theorem)

Let ${f, g : Z \rightarrow \mathbb C}$, where ${Z}$ is a finite abelian group. Then

$\displaystyle \left< f,g \right> = \sum_{\xi \in Z} \widehat f(\xi) \overline{\widehat g(\xi)}.$

Similarly, if ${f, g : [-\pi, \pi] \rightarrow \mathbb C}$ are square-integrable then

$\displaystyle \left< f,g \right> = \sum_n \widehat f(n) \overline{\widehat g(n)}.$

Proof: Guess! $\Box$

## 4. (Optional) Arrow’s Impossibility Theorem

As an application, we now prove a form of Arrow’s theorem. Consider ${n}$ voters voting among ${3}$ candidates ${A}$, ${B}$, ${C}$. Each voter specifies a tuple ${v_i = (x_i, y_i, z_i) \in \{\pm1\}^3}$ as follows:

• ${x_i = 1}$ if voter ${i}$ ranks ${A}$ ahead of ${B}$, and ${x_i = -1}$ otherwise.
• ${y_i = 1}$ if voter ${i}$ ranks ${B}$ ahead of ${C}$, and ${y_i = -1}$ otherwise.
• ${z_i = 1}$ if voter ${i}$ ranks ${C}$ ahead of ${A}$, and ${z_i = -1}$ otherwise.

Tacitly, we only consider ${3! = 6}$ possibilities for ${v_i}$: we forbid “paradoxical” votes of the form ${x_i = y_i = z_i}$ by assuming that people’s votes are consistent (meaning the preferences are transitive).

Then, we can consider a voting mechanism

$\displaystyle \begin{aligned} f : \{\pm1\}^n &\rightarrow \{\pm1\} \\ g : \{\pm1\}^n &\rightarrow \{\pm1\} \\ h : \{\pm1\}^n &\rightarrow \{\pm1\} \end{aligned}$

such that ${f(x_\bullet)}$ is the global preference of ${A}$ vs. ${B}$, ${g(y_\bullet)}$ is the global preference of ${B}$ vs. ${C}$, and ${h(z_\bullet)}$ is the global preference of ${C}$ vs. ${A}$. We’d like to avoid situations where the global preference ${(f(x_\bullet), g(y_\bullet), h(z_\bullet))}$ is itself paradoxical.

In fact, we will prove the following theorem:

Theorem 11 (Arrow Impossibility Theorem)

Assume that ${(f,g,h)}$ always avoids paradoxical outcomes, and assume ${\mathbf E f = \mathbf E g = \mathbf E h = 0}$. Then ${(f,g,h)}$ is either a dictatorship or anti-dictatorship: there exists a “dictator” ${k}$ such that

$\displaystyle f(x_\bullet) = \pm x_k, \qquad g(y_\bullet) = \pm y_k, \qquad h(z_\bullet) = \pm z_k$

where all three signs coincide.

The “independence of irrelevant alternatives” condition is reflected in the setup itself: ${f}$ depends only on ${x_\bullet}$, and similarly for ${g}$ and ${h}$. The assumption ${\mathbf E f = \mathbf E g = \mathbf E h = 0}$ provides symmetry (and e.g. excludes the possibility that ${f}$, ${g}$, ${h}$ are constant functions which ignore voter input). Unlike the usual Arrow theorem, we do not assume that ${f(+1, \dots, +1) = +1}$ (hence possibility of anti-dictatorship).

To this end, we actually prove the following result:

Lemma 12

Assume the ${n}$ voters vote independently at random among the ${3! = 6}$ possibilities. The probability of a paradoxical outcome is exactly

$\displaystyle \frac14 + \frac14 \sum_{S \subseteq \{1, \dots, n\}} \left( -\frac13 \right)^{\left\lvert S \right\rvert} \left( \widehat f(S) \widehat g(S) + \widehat g(S) \widehat h(S) + \widehat h(S) \widehat f(S) \right) .$

Proof: Define the Boolean function ${D : \{\pm 1\}^3 \rightarrow \mathbb R}$ by

$\displaystyle D(a,b,c) = ab + bc + ca = \begin{cases} 3 & a,b,c \text{ all equal} \\ -1 & a,b,c \text{ not all equal}. \end{cases}$

Thus paradoxical outcomes arise when ${D(f(x_\bullet), g(y_\bullet), h(z_\bullet)) = 3}$. Now, we compute that for randomly selected ${x_\bullet}$, ${y_\bullet}$, ${z_\bullet}$ that

$\displaystyle \begin{aligned} \mathbf E D(f(x_\bullet), g(y_\bullet), h(z_\bullet)) &= \mathbf E \sum_S \sum_T \left( \widehat f(S) \widehat g(T) + \widehat g(S) \widehat h(T) + \widehat h(S) \widehat f(T) \right) \left( \chi_S(x_\bullet)\chi_T(y_\bullet) \right) \\ &= \sum_S \sum_T \left( \widehat f(S) \widehat g(T) + \widehat g(S) \widehat h(T) + \widehat h(S) \widehat f(T) \right) \mathbf E\left( \chi_S(x_\bullet)\chi_T(y_\bullet) \right). \end{aligned}$

Now we observe that:

• If ${S \neq T}$, then ${\mathbf E \chi_S(x_\bullet) \chi_T(y_\bullet) = 0}$, since if say ${s \in S}$, ${s \notin T}$ then ${x_s}$ affects the parity of the product with 50% either way, and is independent of any other variables in the product.
• On the other hand, suppose ${S = T}$. Then

$\displaystyle \chi_S(x_\bullet) \chi_T(y_\bullet) = \prod_{s \in S} x_sy_s.$

Note that ${x_sy_s}$ is equal to ${1}$ with probability ${\frac13}$ and ${-1}$ with probability ${\frac23}$ (since ${(x_s, y_s, z_s)}$ is uniform from ${3!=6}$ choices, which we can enumerate). From this an inductive calculation on ${|S|}$ gives that

$\displaystyle \prod_{s \in S} x_sy_s = \begin{cases} +1 & \text{ with probability } \frac{1}{2}(1+(-1/3)^{|S|}) \\ -1 & \text{ with probability } \frac{1}{2}(1-(-1/3)^{|S|}). \end{cases}$

Thus

$\displaystyle \mathbf E \left( \prod_{s \in S} x_sy_s \right) = \left( -\frac13 \right)^{|S|}.$

Piecing this all together, we now have that

$\displaystyle \mathbf E D(f(x_\bullet), g(y_\bullet), h(z_\bullet)) = \sum_S \left( \widehat f(S) \widehat g(S) + \widehat g(S) \widehat h(S) + \widehat h(S) \widehat f(S) \right) \left( -\frac13 \right)^{|S|}.$

Then, we obtain that

$\displaystyle \begin{aligned} &\mathbf E \frac14 \left( 1 + D(f(x_\bullet), g(y_\bullet), h(z_\bullet)) \right) \\ =& \frac14 + \frac14\sum_S \left( -\frac13 \right)^{|S|} \left( \widehat f(S) \widehat g(S) + \widehat g(S) \widehat h(S) + \widehat h(S) \widehat f(S) \right). \end{aligned}$

Comparing this with the definition of ${D}$ gives the desired result. $\Box$
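The lemma can be checked by brute force for small ${n}$. The sketch below takes ${n = 3}$ voters with ${f = g = h}$ equal to majority vote (my choice of example, not something fixed by the lemma), counts paradoxical outcomes over all ${6^3}$ vote profiles, and compares against the formula using exact fractions. It assumes Python 3.8 or later for `math.prod`.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod

n = 3
# The 6 consistent ballots (x_i, y_i, z_i): everything except all-equal triples.
VOTES = [v for v in product((1, -1), repeat=3) if len(set(v)) > 1]

def majority(bits):
    return 1 if sum(bits) > 0 else -1

def coefficients(f):
    """Fourier coefficients hat f(S) for every subset S of {0, ..., n-1}."""
    pts = list(product((1, -1), repeat=n))
    return {S: Fraction(sum(f(x) * prod(x[s] for s in S) for x in pts), 2 ** n)
            for k in range(n + 1) for S in combinations(range(n), k)}

f = g = h = majority

# Left side: exact probability of a paradox over all 6^n vote profiles.
paradoxes = 0
for profile in product(VOTES, repeat=n):
    xs, ys, zs = zip(*profile)          # split ballots into x, y, z coordinates
    if f(xs) == g(ys) == h(zs):
        paradoxes += 1
brute = Fraction(paradoxes, 6 ** n)

# Right side: the formula from the lemma.
fh, gh, hh = coefficients(f), coefficients(g), coefficients(h)
formula = Fraction(1, 4) + Fraction(1, 4) * sum(
    Fraction(-1, 3) ** len(S) * (fh[S] * gh[S] + gh[S] * hh[S] + hh[S] * fh[S])
    for S in fh)

print(brute, formula)
```

Both sides come out to ${\frac{1}{18}}$, the classical probability of a Condorcet cycle with three voters and three candidates.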

Now for the proof of the main theorem. Since paradoxical outcomes never occur, the probability in Lemma 12 is zero, so we see that

$\displaystyle 1 = \sum_{S \subseteq \{1, \dots, n\}} -\left( -\frac13 \right)^{\left\lvert S \right\rvert} \left( \widehat f(S) \widehat g(S) + \widehat g(S) \widehat h(S) + \widehat h(S) \widehat f(S) \right).$

But now we can just use weak inequalities. We have ${\widehat f(\varnothing) = \mathbf E f = 0}$ and similarly for ${\widehat g}$ and ${\widehat h}$, so we restrict attention to ${|S| \ge 1}$. We then use the famous inequality ${|ab+bc+ca| \le a^2+b^2+c^2}$ (which is true across all real numbers) to deduce that

$\displaystyle \begin{aligned} 1 &= \sum_{S \subseteq \{1, \dots, n\}} -\left( -\frac13 \right)^{\left\lvert S \right\rvert} \left( \widehat f(S) \widehat g(S) + \widehat g(S) \widehat h(S) + \widehat h(S) \widehat f(S) \right) \\ &\le \sum_{S \subseteq \{1, \dots, n\}} \left( \frac13 \right)^{\left\lvert S \right\rvert} \left( \widehat f(S)^2 + \widehat g(S)^2 + \widehat h(S)^2 \right) \\ &\le \sum_{S \subseteq \{1, \dots, n\}} \left( \frac13 \right)^1 \left( \widehat f(S)^2 + \widehat g(S)^2 + \widehat h(S)^2 \right) \\ &= \frac13 (1+1+1) = 1 \end{aligned}$

with the last step by Parseval. So all inequalities must be sharp, and in particular ${\widehat f}$, ${\widehat g}$, ${\widehat h}$ are supported on one-element sets, i.e. they are linear in inputs. As ${f}$, ${g}$, ${h}$ are ${\pm 1}$ valued, each ${f}$, ${g}$, ${h}$ is itself either a dictator or anti-dictator function. Since ${(f,g,h)}$ is always consistent, this implies the final result.

## 5. Pontryagin duality

In fact all the examples we have covered can be subsumed as special cases of Pontryagin duality, where we replace the domain with a general group ${G}$. In what follows, we assume ${G}$ is a locally compact abelian (LCA) group, which just means that:

• ${G}$ is an abelian topological group,
• the topology on ${G}$ is Hausdorff, and
• the topology on ${G}$ is locally compact: every point of ${G}$ has a compact neighborhood.

Notice that our previous examples fall into this category:

Example 13 (Examples of locally compact abelian groups)

• Any finite abelian group ${Z}$ with the discrete topology is LCA.
• The circle group ${\mathbb T}$ is LCA and also in fact compact.
• The real numbers ${\mathbb R}$ are an example of an LCA group which is not compact.

### 5.1. The Pontryagin dual

The key definition is:

Definition 14

Let ${G}$ be an LCA group. Then its Pontryagin dual is the abelian group

$\displaystyle \widehat G \overset{\mathrm{def}}{=} \left\{ \text{continuous group homomorphisms } \xi : G \rightarrow \mathbb T \right\}.$

The maps ${\xi}$ are called characters. By equipping it with the compact-open topology, we make ${\widehat G}$ into an LCA group as well.

Example 15 (Examples of Pontryagin duals)

• ${\widehat{\mathbb Z} \cong \mathbb T}$.
• ${\widehat{\mathbb T} \cong \mathbb Z}$. The characters are given by ${\theta \mapsto n\theta}$ for ${n \in \mathbb Z}$.
• ${\widehat{\mathbb R} \cong \mathbb R}$. This is because a nonzero continuous homomorphism ${\mathbb R \rightarrow S^1}$ is determined by the fiber above ${1 \in S^1}$. (Covering projections, anyone?)
• ${\widehat{\mathbb Z/n\mathbb Z} \cong \mathbb Z/n\mathbb Z}$, characters ${\xi}$ being determined by the image ${\xi(1) \in \mathbb T}$.
• ${\widehat{G \times H} \cong \widehat G \times \widehat H}$.
• If ${Z}$ is a finite abelian group, then the previous two examples (and the structure theorem for abelian groups) imply that ${\widehat{Z} \cong Z}$, though not canonically. You may now recognize that the bilinear form ${\cdot : Z \times Z \rightarrow \mathbb T}$ is exactly a choice of isomorphism ${Z \rightarrow \widehat Z}$.
• For any LCA group ${G}$, the dual of ${\widehat G}$ is canonically isomorphic to ${G}$, id est there is a natural isomorphism

$\displaystyle G \cong \widehat{\widehat G} \qquad \text{by} \qquad x \mapsto \left( \xi \mapsto \xi(x) \right).$

This is the Pontryagin duality theorem. (It is analogous to the isomorphism ${(V^\vee)^\vee \cong V}$ for finite-dimensional vector spaces ${V}$.)

### 5.2. The orthonormal basis in the compact case

Now assume ${G}$ is LCA but also compact, and thus has a unique Haar measure ${\mu}$ such that ${\mu(G) = 1}$; this lets us integrate over ${G}$. Let ${L^2(G)}$ be the space of square-integrable functions to ${\mathbb C}$, i.e.

$\displaystyle L^2(G) = \left\{ f : G \rightarrow \mathbb C \quad\text{such that}\quad \int_G |f|^2 \; d\mu < \infty \right\}.$

Thus we can equip it with the inner form

$\displaystyle \left< f,g \right> = \int_G f\overline{g} \; d\mu.$

In that case, we get all the results we wanted before:

Theorem 16 (Characters in ${\widehat G}$ form an orthonormal basis)

Assume ${G}$ is LCA and compact. Then ${\widehat G}$ is discrete, and the characters

$\displaystyle (e_\xi)_{\xi \in \widehat G} \qquad\text{by}\qquad e_\xi(x) = e(\xi(x)) = \exp(2\pi i \xi(x))$

form an orthonormal basis of ${L^2(G)}$. Thus for each ${f \in L^2(G)}$ we have

$\displaystyle f = \sum_{\xi \in \widehat G} \widehat f(\xi) e_\xi$

where

$\displaystyle \widehat f(\xi) = \left< f, e_\xi \right> = \int_G f(x) \exp(-2\pi i \xi(x)) \; d\mu.$

The sum ${\sum_{\xi \in \widehat G}}$ makes sense since ${\widehat G}$ is discrete. In particular,

• Letting ${G = Z}$ gives “Fourier transform on finite groups”.
• The special case ${G = \mathbb Z/n\mathbb Z}$ has its own Wikipedia page.
• Letting ${G = \mathbb T}$ gives the “Fourier series” earlier.

### 5.3. The Fourier transform of the non-compact case

If ${G}$ is LCA but not compact, then Theorem 16 becomes false. On the other hand, it is still possible to define a transform, but one needs to be a little more careful. The generic example to keep in mind in what follows is ${G = \mathbb R}$.

In what follows, we fix a Haar measure ${\mu}$ for ${G}$. (This ${\mu}$ is now unique only up to scaling: since ${\mu(G) = \infty}$, we can no longer normalize it by requiring ${\mu(G) = 1}$.)

One considers this time the space ${L^1(G)}$ of absolutely integrable functions. Then one directly defines the Fourier transform of ${f \in L^1(G)}$ to be

$\displaystyle \widehat f(\xi) = \int_G f \overline{e_\xi} \; d\mu$

imitating the previous definitions in the absence of an inner product. This ${\widehat f}$ may not be ${L^1}$, but it is at least bounded. Then we manage to at least salvage:

Theorem 17 (Fourier inversion on ${L^1(G)}$)

Take an LCA group ${G}$ and fix a Haar measure ${\mu}$ on it. One can select a unique dual measure ${\widehat \mu}$ on ${\widehat G}$ such that if ${f \in L^1(G)}$, ${\widehat f \in L^1(\widehat G)}$, the “Fourier inversion formula”

$\displaystyle f(x) = \int_{\widehat G} \widehat f(\xi) e_\xi(x) \; d\widehat\mu$

holds almost everywhere. It holds everywhere if ${f}$ is continuous.

Notice the extra nuance of having to select measures, because it is no longer the case that ${G}$ has a single distinguished measure.

Despite the fact that the ${e_\xi}$ no longer form an orthonormal basis, the transformed function ${\widehat f : \widehat G \rightarrow \mathbb C}$ is still often useful. In particular, they have special names for a few special ${G}$:

• If ${G = \mathbb R}$, then ${\widehat G = \mathbb R}$, and this construction gives the poorly named “(continuous) Fourier transform”.
• If ${G = \mathbb Z}$, then ${\widehat G = \mathbb T}$, and this construction gives the poorly named “DTFT”.

### 5.4. Summary

In summary,

• Given any LCA group ${G}$, we can transform sufficiently nice functions on ${G}$ into functions on ${\widehat G}$.
• If ${G}$ is compact, then we have the nicest situation possible: ${L^2(G)}$ is an inner product space with ${\left< f,g \right> = \int_G f \overline{g} \; d\mu}$, and ${e_\xi}$ form an orthonormal basis across ${\xi \in \widehat G}$.
• If ${G}$ is not compact, then we no longer get an orthonormal basis or even an inner product space, but it is still possible to define the transform

$\displaystyle \widehat f : \widehat G \rightarrow \mathbb C$

for ${f \in L^1(G)}$. If ${\widehat f}$ is also in ${L^1(\widehat G)}$ we still get a “Fourier inversion formula” expressing ${f}$ in terms of ${\widehat f}$.

We summarize our various flavors of Fourier analysis for various ${G}$ in the following. In the first half ${G}$ is compact, in the second half ${G}$ is not.

$\displaystyle \begin{array}{llll} \hline \text{Name} & \text{Domain }G & \text{Dual }\widehat G & \text{Characters} \\ \hline \textbf{Binary Fourier analysis} & \{\pm1\}^n & S \subseteq \left\{ 1, \dots, n \right\} & \prod_{s \in S} x_s \\ \textbf{Fourier transform on finite groups} & Z & \xi \in \widehat Z \cong Z & e(\xi \cdot x) \\ \textbf{Discrete Fourier transform} & \mathbb Z/n\mathbb Z & \xi \in \mathbb Z/n\mathbb Z & e(\xi x / n) \\ \textbf{Fourier series} & \mathbb T \cong [-\pi, \pi] & n \in \mathbb Z & \exp(inx) \\ \hline \textbf{Continuous Fourier transform} & \mathbb R & \xi \in \mathbb R & e(\xi x) \\ \textbf{Discrete time Fourier transform} & \mathbb Z & \xi \in \mathbb T \cong [-\pi, \pi] & \exp(i \xi n) \\ \end{array}$

You might notice that the various names are awful. This is part of the reason I got confused as a high school student: every type of Fourier series above has its own Wikipedia article. If it were up to me, we would just use the term “${G}$-Fourier transform”, and that would make everyone’s lives a lot easier.

## 6. Peter-Weyl

In fact, if ${G}$ is a compact Lie group, even if ${G}$ is not abelian we can still give an orthonormal basis of ${L^2(G)}$ (the square-integrable functions on ${G}$). It turns out in this case the characters are attached to complex irreducible representations of ${G}$ (and in what follows all representations are complex).

The result is given by the Peter-Weyl theorem. First, we need the following result:

Lemma 18 (Compact Lie groups have unitary reps)

Any finite-dimensional (complex) representation ${V}$ of a compact Lie group ${G}$ is unitary, meaning it can be equipped with a ${G}$-invariant inner form. Consequently, ${V}$ is completely reducible: it splits into the direct sum of irreducible representations of ${G}$.

Proof: Suppose ${B : V \times V \rightarrow \mathbb C}$ is any inner product. Equip ${G}$ with a right-invariant Haar measure ${dg}$. Then we can equip ${V}$ with an “averaged” inner form

$\displaystyle \widetilde B(v,w) = \int_G B(gv, gw) \; dg.$

Then ${\widetilde B}$ is the desired ${G}$-invariant inner form. Now, the fact that ${V}$ is completely reducible follows from the fact that given a subrepresentation of ${V}$, its orthogonal complement is also a subrepresentation. $\Box$

The Peter-Weyl theorem then asserts that the finite-dimensional irreducible unitary representations essentially give an orthonormal basis for ${L^2(G)}$, in the following sense. Let ${V = (V, \rho)}$ be such a representation of ${G}$, and fix an orthonormal basis ${e_1, \dots, e_d}$ for ${V}$ (where ${d = \dim V}$). The ${(i,j)}$th matrix coefficient for ${V}$ is then given by

$\displaystyle G \xrightarrow{\rho} \mathop{\mathrm{GL}}(V) \xrightarrow{\pi_{ij}} \mathbb C$

where ${\pi_{ij}}$ is the projection onto the ${(i,j)}$th entry of the matrix. We abbreviate ${\pi_{ij} \circ \rho}$ to ${\rho_{ij}}$. Then the theorem is:

Theorem 19 (Peter-Weyl)

Let ${G}$ be a compact Lie group. Let ${\Sigma}$ denote the (pairwise non-isomorphic) irreducible finite-dimensional unitary representations of ${G}$. Then

$\displaystyle \left\{ \sqrt{\dim V} \rho_{ij} \; \Big\vert \; (V, \rho) \in \Sigma, \text{ and } 1 \le i,j \le \dim V \right\}$

is an orthonormal basis of ${L^2(G)}$.

Strictly, I should say ${\Sigma}$ is a set of representatives of the isomorphism classes of irreducible unitary representations, one for each isomorphism class.

In the special case ${G}$ is abelian, all irreducible representations are one-dimensional. A one-dimensional representation of ${G}$ is a map ${G \rightarrow \mathop{\mathrm{GL}}_1(\mathbb C) \cong \mathbb C^\times}$, but the unitary condition implies it is actually a map ${G \rightarrow S^1 \cong \mathbb T}$, i.e. it is an element of ${\widehat G}$.