USEMO Problem Development, Behind the Scenes

In this post I’m hoping to say a bit about the process that’s used for the problem selection of the recent USEMO: how one goes from a pool of problem proposals to a six-problem test. (How to write problems is an entirely different story, and deserves its own post.) I choose USEMO for concreteness here, but I imagine a similar procedure could be used for many other contests.

I hope this might be of interest to students preparing for contests to see a bit of the behind-the-scenes, and maybe helpful for other organizers of olympiads.

The overview of the entire timeline is:

    1. Submission period for authors (5-10 weeks)
    2. Creating the packet
    3. Reviewing period where volunteers try out the proposed problems (6-12 weeks)
    4. Editing and deciding on a draft of the test
    5. Test-solving of the draft of the test (3-5 weeks)
    6. Finalizing and wrap-up

Now I’ll talk about these in more detail.

Pinging for problems

The USA has the rare privilege of having an extremely dedicated and enthusiastic base of volunteers, who will make the contest happen rain or shine. When I send out an email asking for problem proposals, I never really worry that I won’t get enough people. You might have to adjust the recipe below if you have fewer hands on deck.

When you’re deciding who to invite, you have to think about a trade-off between problem security versus openness. The USEMO is not a high-stakes competition, so it accepts problems from basically anyone. On the other hand, if you are setting problems for your country’s IMO team selection test, you probably don’t want to take problems from the general public.

Submission itself is pretty straightforward: ask to have problems emailed as TeX, with a full solution. You should also ask for any information you care about to be included: the list of authors of the problem, any other collaborators who have seen or tried the problem, and a very rough estimate of the difficulty. (You shouldn’t trust the estimate too much, but I’ll explain in a moment why this is helpful.)

Ideally I try to allocate 5-10 weeks between when I open submissions for problems and when the submission period ends.

This is also a good time to see who might be interested in being a test-solver or reviewer; more on that later.

Creating the packet

Once the submission period ends, you want to then collate the problems into a packet that you can send to your reviewers. The reviewers will then rate the problems on difficulty and suitability.

A couple of minor tips for setting the packet:

  • I used to try to sort the packet roughly by difficulty, but in recent years I’ve switched to random order and never looked back. It just biases the reviewers too much to have the problem number matter. The data has been a lot better with random order.
  • Usually I’ll label the problems A-01, A-02, …, A-05 (say), C-06, C-07, …, and so on. The leading zero is deliberate: I’ve done so much IMO shortlist that if I see a problem named “C7”, it will automatically feel like it should be a hard problem, so using “C-07” makes this gut reaction go away for me.
  • It’s debatable whether you need subject classifications at all, since in some cases a problem might not fit cleanly, or the “true” classification might even give away the problem. I keep them around just because it’s administratively convenient to have vaguely similar problems grouped together and labelled, but I explicitly tell reviewers to treat the classification as a convenience rather than something to take seriously.

More substantial is the choice of which problems to include in the packet if you are so lucky to have a surplus of submissions. The problem is that reviewers only have so much time and energy and won’t give as good feedback if the packet is too long. In my experience, 20 problems is a nice target, 30 problems is strenuous, anything more than that is usually too much. So if you have more than 30 problems, you might need to cut some problems out.

Since this “early cutting” is necessarily pretty random (because you won’t be able to do all the problems yourself single-handedly), I usually prefer to do it in a slightly more egalitarian way. For example, if one person submits a lot of problems, you might only take a few problems from them, and say “we had a lot of problems, so we took the 3 of yours we liked the most for review”. (That said, you might have a sense that certain problems are really unlikely to be selected, and you might as well exclude those.)

You usually also want to make sure that you have a spread of difficulties and subjects. Actually, this is even more true if you don’t have a surplus: if it turns out that, say, you have zero easy algebra or geometry problems, that’s likely to cause problems for you later down the line, so it’s worth seeing whether you can drum up one or two of those. This is why the authors’ crude difficulty estimates can still be useful: they’re not meant for deciding the test, but they can give you “early warnings” that the packet might be lacking in some area.

One other useful thing to do, if you have the time at this point, is to edit the proposals as they come in, before adding them to the packet. This includes both copy edits for clarity and formatting, as well as more substantial changes if you can see an alternate version or formulation that you think is likely to be better than the original. The reason you want to do this step now is that you want the reviewers’ eyes on these changes: making edits halfway through the review process or after it can cause confusion and desynchronization in the reviewer data, and increases the chances of errors (because the modifications have been checked fewer times).

The packet review process

Then comes the period where your reviewers will work through the problems and submit their ratings. This is also something where you want to give reviewers a lot of time: 6-8 weeks is a comfortable amount. Between 10 and 25 reviewers is a comfortable number.

Reviewers are asked to submit a difficulty rating and a quality rating for each problem. The system I’ve been using, which seems to work pretty well, goes as follows:

  • The five possible quality ratings are “Unsuitable”, “Mediocre”, “Acceptable”, “Nice”, “Excellent”. I’ve found that this choice of five words has the right connotations to get the reviewers calibrated fairly similarly to one another.
  • For difficulty, I like to provide three checkboxes “IMO1”, “IMO2”, “IMO3”, but also tell the reviewer that they can check two boxes if, say, they think a problem could appear as either IMO1 or IMO2. That means in essence five ratings {1, 1.5, 2, 2.5, 3} are possible.

This is what I converged on as a scale that is granular enough to give reasonable numerical data without being so granular that it becomes unintuitive or confusing. (If your scale is too granular, then a person’s ratings might say more about how that person interpreted the scale than about the actual content of the problem.) For my group, five buckets seems to be the magic number; your mileage may vary!
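
As a concrete illustration of how little machinery the encoding needs, here is a tiny sketch of turning one reviewer’s checked difficulty boxes into a number in {1, 1.5, 2, 2.5, 3} by averaging the checked values. (The function and its input format are my own invention for illustration, not part of any actual form-processing code.)

def difficulty_rating(checked):
	# `checked` is the set of boxes the reviewer ticked, e.g. {"IMO1", "IMO2"}.
	values = {"IMO1": 1, "IMO2": 2, "IMO3": 3}
	marks = [values[box] for box in checked]
	if not marks:
		return None  # reviewer skipped this problem
	return sum(marks) / len(marks)

print(difficulty_rating({"IMO1", "IMO2"}))  # 1.5
print(difficulty_rating({"IMO3"}))          # 3.0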

More important is to have lots of free text boxes so that reviewers can provide more detailed comments, alternate solutions, and so on. Those are ultimately more valuable than just a bunch of numbers.

Here are a few more tips:

  • If you are not too concerned about security, it’s also nice to get discussion going between reviewers. It’s more fun for them, and the value of having reviewers talk with each other a bit tends to outweigh the risk of bias.
  • It’s usually advised to send only the problem statements first, and then only send the solutions out about halfway through. I’ve found most reviewers (myself included) appreciate the decreased temptation to look at solutions too early on.
  • One thing I often do is to have a point person for each problem, to make sure every problem is carefully attempted. This is nice, but not mandatory; the nicest problems tend to get quite a bit of attention anyways.
  • One thing I’ve had success with is adding a question on the review form that asks “what six problems would you choose if you were making the call, and why?” I’ve found I get a lot of useful perspective from hearing what people say about this.

I just use Google Forms to collect all the data. There’s a feature you can enable that requires a sign-in, so that the reviewer’s responses are saved between sessions and loaded automatically (making it possible to submit the form in multiple sittings).

Choosing the draft of the test

Now that you have the feedback, you should pick a draft of the test! This is the most delicate part, and it’s where it is nice to have a co-director or a small committee if possible, so that you can talk out loud and bounce ideas off each other.

For this stage I like to have a table with the numerical ratings as a summary of what’s available. The way you want to do this is up to you, but some bits from my workflow:

  • My table is color-coded, and it’s sorted in five different ways: by problem number, by quality rating, by difficulty rating, by subject then quality rating, by subject then difficulty rating.
  • For the quality rating, I use the weights -0.75, -0.5, 0, 1, 1.5 for Unsuitable, Mediocre, Acceptable, Nice, Excellent. This fairly contrived set of weights was chosen, based on some experience, so that the average ratings satisfy a couple of properties: I wanted the sign of the average (- or +) to match my gut feeling, and I wanted the average to not be overly sensitive to a few Unsuitable or Excellent ratings (either extreme). This weighting puts a “cliff” between Acceptable and Nice, which empirically seems to be the right place to draw the sharpest distinction. (A tiny worked example of the averaging appears right after this list.)
  • I like to include a short “name” in the table to help with remembering which problem numbers are which, e.g. “2017-vtx graph”.
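
To make the weighting concrete, here is a made-up example (the counts are invented, not real data): a problem receiving one Mediocre, two Acceptable, three Nice, and one Excellent rating averages to {(-0.5 + 0 + 0 + 1 + 1 + 1 + 1.5)/7 \approx +0.57}, while trading the three Nices for Acceptables drops it to {(-0.5 + 1.5)/7 \approx +0.14}; that swing is exactly the “cliff” between Acceptable and Nice doing its work.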

An example of what such a table might look like is given in the image below, generated from fake data. The Python script used to produce it is included as well, for anyone who wants to use it.

WT_U = -0.75
WT_M = -0.5
WT_A = 0
WT_N = 1
WT_E = 1.5

# ---- Populate with convincing looking random data ----
import random
slugs = {
		"A-01" : r"$\theta \colon \mathbb Z[x] \to \mathbb Z$",
		"A-02" : r"$\sqrt[3]{\frac{a}{b+7}}$",
		"A-03" : r"$a^a b^b c^c$",
		"C-04" : r"$a+2b+\dots+32c$",
		"C-05" : r"$2017$-vtx dinner",
		"G-06" : r"$ST$ orthog",
		"G-07" : r"$PO \perp YZ$",
		"G-08" : r"Area $5/2$",
		"G-09" : r"$XD \cap AM$ on $\Gamma$",
		"G-10" : r"$\angle PQE, \angle PQF = 90^{\circ}$",
		"N-11" : r"$5^n$ has six zeros",
		"N-12" : r"$n^2 \mid b^n+1$",
		"N-13" : r"$fff$ cycles",
		}

qualities = {}
difficulties = {}
random.seed(150)
for k in slugs.keys():
	# just somehow throw stuff at wall to get counts
	a,b,c,d,e,f = [random.randrange(0,3) for _ in range(6)]
	if c >= 1: a = 0
	if a >= 2: d,e = 1,0
	if e == 0: f = 0
	if a == 0 and b == 0: e *= 2
	qualities[k] = [WT_U] * a + [WT_M] * b + [WT_A] * (b+d+e) + [WT_N] * (c+d+e) + [WT_E] * (c+e+f)

random.seed(369)
for k in slugs.keys():
	# just somehow throw stuff at wall to get counts
	a,b,c,d,e = [random.randrange(0,5) for _ in range(5)]
	if e >= 4:
		b = 0
		c //= 2
	elif e >= 3:
		a = 0
		b //= 2
	if a >= 3:
		e = 0
		d //= 3
	elif a >= 2:
		e = 0
		d //= 2
	difficulties[k] = [1] * a + [1.5] * b + [2] * c + [2.5] * d + [3] * e

# ---- End random data population ----

	
import statistics
def avg(S):
	return statistics.mean(S) if len(S) > 0 else None
def median(S):
	return statistics.median(S) if len(S) > 0 else None

# criteria for inclusion on chart
criteria = lambda k: True

def get_color_string(x, scale_min, scale_max, color_min, color_max):
	if x is None:
		return r"\rowcolor{gray}"
	m = (scale_max+scale_min)/2
	a = min(int(100 * 2 * abs(x-m) / (scale_max-scale_min)), 100)
	color = color_min if x < m else color_max
	return r"\rowcolor{%s!%d}" %(color, a) + "\n"

def get_label(key, slugged=False):
	if slugged:
		return r"{\scriptsize \textbf{%s} %s}" %(key, slugs.get(key, ''))
	else:
		return r"{\scriptsize \textbf{%s}}" % key

## Quality rating
def get_quality_row(key, data, slugged = True):
	a = avg(data)
	s = ("$%+4.2f$" % a) if a is not None else "---"
	color_tex = get_color_string(a, WT_U, WT_E, "Salmon", "green")
	row_tex = r"%s & %d & %d & %d & %d & %d & %s \\" \
			% (get_label(key, slugged),
				data.count(WT_U),
				data.count(WT_M),
				data.count(WT_A),
				data.count(WT_N),
				data.count(WT_E),
				s)
	return color_tex + row_tex
def print_quality_table(d, sort_key = None, slugged = True):
	items = sorted(d.items(), key = sort_key)
	print(r"\begin{tabular}{lcccccr}")
	print(r"\toprule Prob & U & M & A & N & E & Avg \\ \midrule")
	for key, data in items:
		print(get_quality_row(key, data, slugged))
	print(r"\bottomrule")
	print(r"\end{tabular}")

## Difficulty rating
def get_difficulty_row(key, data, slugged = False):
	a = avg(data)
	s = ("$%.3f$" % a) if a is not None else "---"
	color_tex = get_color_string(a, 1, 3, "cyan", "orange")
	row_tex = r"%s & %d & %d & %d & %d & %d & %s \\" \
			% (get_label(key, slugged),
				data.count(1),
				data.count(1.5),
				data.count(2),
				data.count(2.5),
				data.count(3),
				s)
	return color_tex + row_tex
def print_difficulty_table(d, sort_key = None, slugged = False):
	items = sorted(d.items(), key = sort_key)
	print(r"\begin{tabular}{l ccccc c}")
	print(r"\toprule Prob & 1 & 1.5 & 2 & 2.5 & 3 & Avg \\ \midrule")
	for key, data in items:
		print(get_difficulty_row(key, data, slugged))
	print(r"\bottomrule")
	print(r"\end{tabular}")

filtered_qualities = {k:v \
		for k,v in qualities.items() if criteria(k)}
filtered_difficulties = {k:v \
		for k,v in difficulties.items() if criteria(k)}

def print_everything(name, fn = None, flip_slug = False):
	if fn is not None:
		sort_key = lambda item: fn(item[0])
	else:
		sort_key = None
	print(r"\section{" + name + "}")
	if flip_slug:
		print_quality_table(filtered_qualities, sort_key, False)
		print_difficulty_table(filtered_difficulties, sort_key, True)
	else:
		print_quality_table(filtered_qualities, sort_key, True)
		print_difficulty_table(filtered_difficulties, sort_key, False)

# Start outputting content
print(r"""\documentclass[11pt]{scrartcl}
\usepackage{booktabs}
\usepackage[sexy]{evan}
\usepackage{tikz}
\usepackage{pgfplots}
\pgfplotsset{compat=1.17}

\begin{document}
\title{Example of ratings table with randomly generated data}
\maketitle

\setlength\tabcolsep{5pt}
""")



print(r"\section{All ratings}")
print_quality_table(qualities)
print_difficulty_table(difficulties)

print("\n" + r"\newpage" + "\n")
print_everything("Beauty contest, by overall popularity",
		lambda p : (-avg(qualities[p]), p), False)
print_everything("Beauty contest, by subject and popularity",
		lambda p : (p[0], -avg(qualities[p]), p), False)
print("\n" + r"\newpage" + "\n")
print_everything("Beauty contest, by overall difficulty",
		lambda p : (-avg(difficulties[p]), p), True)
print_everything("Beauty contest, by subject and difficulty",
		lambda p : (p[0], -avg(difficulties[p]), p), True)

print("\n")
print(r"\section{Scatter plot}")
print(r"\begin{center}")
print(r"\begin{tikzpicture}")
print(r"""\begin{axis}[width=0.9\textwidth, height=22cm, grid=both,
	xlabel={Average difficulty}, ylabel={Average suitability},
	every node near coord/.append style={font=\scriptsize},
	scatter/classes={A={red},C={blue},G={green},N={black}}]""")
print(r"""\addplot [scatter,
	only marks, point meta=explicit symbolic,
	nodes near coords*={\prob},
	visualization depends on={value \thisrow{prob} \as \prob}]""")
print(r"table [meta=subj] {")
print("X\tY\tprob\tsubj")
for p in qualities.keys():
	x = avg(difficulties[p])
	y = avg(qualities[p])
	if x is None or y is None:
		continue
	print("%0.2f\t%0.2f\t%s\t%s" %(x,y,p[2:],p[0]))
print(r"};")
print(r"\end{axis}")
print(r"\end{tikzpicture}")
print(r"\end{center}")

print(r"\end{document}")

Of course, the obligatory warning: don’t rely too heavily on the numerical ratings, and put heavy weight on the text comments provided. (The numerical ratings will often have a lot of variance, anyways.)

One thing to keep in mind when choosing the problems is that the two most obvious goals are basically orthogonal. One goal is to have the most attractive problems (“art”), while the other is to have an exam which is balanced across difficulty and subject composition (“science”). These two goals will often compete with each other, and you’ll have to make judgment calls to prioritize one over the other.

A final piece of advice is to not be too pedantic. For example, I personally dislike the so-called “Geoff rule” that problems 1/2/4/5 should cover distinct subjects: I find that it is often too restrictive in practice. I also support using “fractional distributions”, in which, say, a problem can be 75% number theory and 25% combinatorics (rather than all-or-nothing), when trying to determine how to balance the exam. This leads to better, more nuanced judgments than insisting on four rigid categories.
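
If you like, this fractional bookkeeping takes only a few lines to automate; here is a toy sketch with invented labels and fractions (not a real USEMO draft) just to show the idea:

# Toy subject-balance check for a candidate six-problem draft, using
# fractional classifications. All labels and fractions below are made up.
draft = {
	"P1": {"A": 1.0},
	"P2": {"G": 1.0},
	"P3": {"N": 0.75, "C": 0.25},
	"P4": {"C": 1.0},
	"P5": {"G": 0.5, "A": 0.5},
	"P6": {"N": 1.0},
}
totals = {}
for fractions in draft.values():
	for subject, weight in fractions.items():
		totals[subject] = totals.get(subject, 0) + weight
print(totals)  # {'A': 1.5, 'G': 1.5, 'N': 1.75, 'C': 1.25}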

This is also the time to make any last edits you want to the problems, again both copy edits or more substantial edits. This gives you a penultimate draft of the exam.

Test solving

If you can, a good last quality check is to have a round of test-solving from an unbiased group of additional volunteers who haven’t already seen the packet. (For the volunteers, this is a much smaller time commitment than reviewing an entire packet, so it’s often feasible for people who couldn’t take on the full review.) You ask this last round of volunteers to try out the problems under exam-like conditions, though I find it’s not strictly necessary to insist on a full 4.5 hours or complete write-ups if relaxing that gets you more volunteers. A nice number of test-solvers is 5-10 people.

Typically this test-solving is most useful as a sanity check (e.g. to make sure the test is not obviously too difficult) and for any last minute shuffling of the problems (which often happens). I don’t advise making drastic changes at this point. It’s good as a way to get feedback on the most tricky decisions, though.

Wrap-up

After any final edits, I recommend sending a copy of the edited problems and solutions to the reviewers and test-solvers. They’re probably interested to know what problems made the cut, and you want to have eyes going through the final paper to check for ambiguities or errors.

I usually take the time to also send out some details of the selection itself: what the ratings for the problems looked like, often a sentence or two for each problem about the overall feedback, and some documentation of my thought process during the draft selection. It’s good to give people feedback on their problems; in my experience the authors usually appreciate it a lot, especially if they decide to re-submit the problem elsewhere.

And that’s the process.

USA Special Team Selection Test Series for IMO 2021

A lot of people have been asking me how team selection is going to work for the USA this year. This information was sent out to the contestants a while ago, but I understand that there’s a lot of people outside of MOP 2020 who are interested in seeing the TST problems :) so this is a quick overview of how things are going down this year.

This year there are six tests leading to the IMO 2021 team:

  • USA TSTST Day 1: November 12, 2020 (3 problems, 4.5 hours)
  • USA TSTST Day 2: December 10, 2020 (3 problems, 4.5 hours)
  • USA TSTST Day 3: January 21, 2021 (3 problems, 4.5 hours)
  • RMM Day 1: February 2021 (3 problems, 4.5 hours)
  • APMO: March 2021 (5 problems, 4 hours)
  • USAMO: April 2021 (2 days, each with 3 problems and 4.5 hours)

Everyone who was at the virtual MOP in June 2020 is invited to all three days of TSTST, and then the top scorers get to take the latter three exams as team selection tests for the IMO. Meanwhile, the RMM and EGMO teams are based on just the three days of TSTST.

Similar to past years, discussion of TSTST is allowed starting at noon Eastern time on the Monday after each day. That means you can look forward to the first set of three new problems coming out on Monday, November 16, and similarly for the other two days of TSTST.

To add to the hype, I’ll be doing a short one-hour-or-less Twitch stream at 8:00pm ET on Tuesday November 17 where I present the solutions to the TSTST problems of day 1.  If there’s demand, I’ll probably run a review session for the other two days of TSTST, as well.

EDIT: Changed stream time to Tuesday so more people have time to try the problems.

On choosing exercises

Finally, if you attempt to read this without working through a significant number of exercises (see §0.0.1), I will come to your house and pummel you with [Gr-EGA] until you beg for mercy. It is important to not just have a vague sense of what is true, but to be able to actually get your hands dirty. As Mark Kisin has said, “You can wave your hands all you want, but it still won’t make you fly.”

— Ravi Vakil, The Rising Sea: Foundations of Algebraic Geometry

When people learn new areas in higher math, they are usually required to do some exercises. I think no one really disputes this: you have to actually do math to make any progress.

However, from the teacher’s side, I want to make the case that there is some art to picking exercises, too. In the process of writing my Napkin as well as taking way too many math classes I began to see some patterns in which exercises or problems I tended to add to the Napkin, or which exercises I found helpful when learning myself. So, I want to explicitly record some of these thoughts here.

1. How not to do it

So in my usual cynicism I’ll start by saying what I think people typically do, and why I don’t think it works well. As far as I can tell, the criteria used in most classes are:

  1. The student is reasonably able to (at least in theory) eventually solve it.
  2. A student with a solid understanding of the material should be able to do it.
  3. (Optional) The result itself is worth knowing.

These criteria are good ones. My problem is that I don’t think they are sufficient.

To explain why, let me give a concrete example of something that is definitely assigned in many measure theory classes.

Okay example (completion of a measure space). Let {(X, \mathcal A, \mu)} be a measure space. Let {\overline{\mathcal A}} denote all subsets of {X} which are the union of a set in {\mathcal A} and a null set. Show that {\overline{\mathcal A}} is a sigma-algebra and that there is a unique extension of the measure {\mu} to it.

I can see why it’s tempting to give this as an exercise. It is a very fundamental result that the student should know. The proof is not too difficult, and the student will understand it better if they do it themselves than if they passively read it. And, if a student really understands measures well, they should find the exercise quite straightforward. For this reason I think this is an okay choice.

But I think we can do better.

In many classes I’ve taken, nearly all the exercises looked like this one. I think when you do this, there are a couple blind spots that sometimes get missed:

  • There’s a difference between “things you should be able to do after learning Z well” and “things you should be able to do when first learning Z”. I would argue that the above example is in the former category, but not the latter: if a student is learning about measures for the first time, my first priority would be to make sure they get a good conceptual understanding, and in particular can understand why the statement should be true. Then we can worry about actually proving it.
  • Assigning an exercise which checks whether you understand Z is not the same as actually teaching it. Okay exercises can verify that you understand something; great exercises will actively help you understand it.

2. An example that I found enlightening

In contrast, this year I was given an exercise which I thought was so instructive that I’ll post it here. It comes from algebraic geometry.

Exercise: The punctured gyrotop is the open subset {U} of {X = \mathrm{Spec} \mathbb C[x,y,z] / (xy, z)} obtained by deleting the origin {(x,y,z)} from {X}. Compute {\mathcal O_X(U)}.

It was after I did this exercise that I finally felt like I understood why distinguished open sets are so important when defining an affine scheme. For that matter, it finally clicked why sheaves on a base are worth caring about.

I had read lots and lots of words and pushed symbols around all day. I had even proved, on paper already, that {\mathcal O(U \sqcup V) = \mathcal O(U) \times \mathcal O(V)}. But I never really felt it. This exercise changed that for me, because suddenly I had an example in front of me that I could actually see.

3. Some suggested additional criteria

So here are a few suggested guidelines which I think can help pick exercises like that one.

A. They should be as concrete as possible.

This is me yelling at people to use more examples, once again. But I think having students work through examples as an exercise is just as important as (if not more important than) reading them aloud in lecture.

One other benefit of using concrete examples is that you can avoid the risk of students solving the exercise by “symbol pushing”. I think many of us know the feeling of solving some textbook exercise by just unwinding a definition and doing a manipulation, or black-boxing some theorem and blindly applying it. In this way one ends up with correct but unenlightening proofs. The issue is that nothing written down resonates with System 1, and so the result doesn’t get internalized.

When you give a concrete exercise with a specific group/scheme/whatever, there is much less chance of something like that happening. You almost have to see the example in order to work with it. I really think internalizing theorems and definitions is better done in this concrete way, rather than the more abstract or general manipulations.

B. They should be enjoyable.

Math majors are humans too. If a whole page of exercises looks boring, students are less likely to do them.

This is one place where I think people could really learn from the math contest community. When designing exams like IMO or USAMO, people fight over which problems they think are the prettiest. The nicest and most instructive exam problems are passed down from generation to generation like prized heirlooms. (Conveniently, the problems are even named, e.g. “IMO 2008/3”, which I privately think helps a ton; it gives the problems a name and face. The most enthusiastic students will often be able to recall where a good problem was from if shown the statement again.) Imagine if the average textbook exercises had even a tenth of that enthusiasm put into crafting them.

Incidentally, I think being concrete helps a lot with this. Part of the reason I enjoyed the punctured gyrotop so much was that I could immediately draw a picture of it, and I had a sense that I should be able to compute the answer, even though I wasn’t experienced enough yet to see what it was. So it was as if the exercise was leading me on the whole way.

For an example of how not to do it, here’s what I think my geometry book would look like if done wrong.

C. They should not be too tricky.

People are always dumber than you think when they first learn a subject; things which should be obvious often are not. So difficulty should be used in moderation: if you assign a hard exercise, you should assume by default that the student will not solve it, so there had better be a reason worth the extra frustration.

I should at this point also mention some advice most people won’t be able to take (because it is so time-consuming): I think it’s valuable to write full solutions for students, especially on difficult problems. When someone is learning something for the first time, that is exactly when they most need to be able to read the full details of solutions, precisely because they are not yet able to produce them themselves.

In math contests, the ideal feedback cycle is something like: a student works on a problem P, makes some progress (possibly solving it), then they look at the solution and see what they were missing or where they could have cleaned up their solution or what they could have done differently, et cetera. This lets them update their intuition or toolkit before going on. If you cut out this last step by not providing solutions, you lose the only real chance you had to give feedback to the student.

4. Memorability

I have, on more occasions than I’m willing to admit, run into the following situation. I solve some exercise in a textbook. Sometime later, I am reading about some other result, and I need some intermediate result, which looks like it could be true but which I don’t know how to prove immediately. So I look it up, and then find out it was the exercise I did (and then have to re-do the exercise again because I didn’t write up the solution).

I think you can argue that if you don’t even recognize the statement later, you didn’t learn anything from it. So I think the following is a good summarizing test: how likely is the student to actually remember it later?

USEMO sign-ups are open

I’m happy to announce that sign-ups for my new olympiad style contest, the United States Ersatz Math Olympiad (USEMO), are open now! The webpage for the USEMO is https://web.evanchen.cc/usemo.html (where sign-ups are posted).

https://web.evanchen.cc/static/usemo/usemo-logo.png

The US Ersatz Math Olympiad is a proof-based competition open to all US middle and high school students. Like many competitions, its goal is to develop interest and ability in mathematics (rather than measure it). However, it is one of the few proof-based contests with such broad eligibility. You can see more about the goals of this contest in the mission statement.

The contest will run over Memorial Day weekend:

  • Day 1 is Saturday May 23, 2020, from 12:30pm to 5:00pm ET.
  • Day 2 is Sunday May 24, 2020, from 12:30pm to 5:00pm ET.

In the future, assuming continued interest, I hope to make the USEMO into an annual tradition run in the fall.

Circular optimization

This post will mostly be focused on construction-type problems in which you’re asked to construct something satisfying property {P}.

Minor spoilers for USAMO 2011/4, IMO 2014/5.

1. What is a leap of faith?

Usually, a good thing to do whenever you can is to make “safe moves” which are implied by the property {P}. Here’s a simple example.

Example 1 (USAMO 2011)

Find an integer {n} such that the remainder when {2^n} is divided by {n} is odd.

It is easy to see, for example, that {n} itself must be odd for this to be true, and so we can make our life easier without incurring any worries by restricting our search to odd {n}. You might therefore call this an “optimization”: a kind of move that makes the problem easier, essentially for free.

But oftentimes such “safe moves” are not enough to solve the problem, and you eventually have to make “leap-of-faith moves”. For example, in the above problem, we might try to focus our attention on numbers {n = pq} for primes {p} and {q}. This does make our life easier, because we’ve zoomed in on a special type of {n} which is easy to compute with. But it runs the risk that maybe there is no such example of {n}, or that the smallest one is difficult to find.
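
For what it’s worth, you can also just let a computer take the leap for you; here is a quick brute-force sketch (mine, not part of the original write-up) that searches for the smallest example:

# Brute force: find the smallest n > 1 such that the remainder of 2^n upon division by n is odd.
def smallest_example(limit=10**4):
	for n in range(2, limit):
		if pow(2, n, n) % 2 == 1:
			return n
	return None

print(smallest_example())  # 25, since 2^25 = 33554432 leaves remainder 7 upon division by 25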

2. Circular reasoning can sometimes save the day

However, a strange type of circular reasoning can sometimes happen, in which a move that would otherwise be a leap-of-faith is actually known to be safe because you also know that the problem statement you are trying to prove is true. I can hardly do better than to give the most famous example:

Example 2 (IMO 2014)

For every positive integer {n}, the Bank of Cape Town issues coins of denomination {\frac 1n}. Given a finite collection of such coins (of not necessarily different denominations) with total value at most {99 + \frac12}, prove that it is possible to split this collection into {100} or fewer groups, such that each group has total value at most {1}.

Let’s say in this problem we find ourselves holding two coins of weight {1/6}. Perhaps we wish to put these coins in the same group, so that we have one less decision to make. However, this could rightly be viewed as a “leap-of-faith”, because there’s no logical reason why the task must remain possible after making this first move.

Except there is a non-logical reason: this is the same as trading the two coins of weight {1/6} for a single coin of weight {1/3}. Why is the task still possible? Because the problem says so: the very problem we are trying to solve includes this case, too. If the problem is going to be true, then it had better be true after we make this trade.

Thus by a perverse circular reasoning we can rest assured that our leap-of-faith here will not come back to bite us. (And in fact, this optimization is a major step of the solution.)

3. More examples of circular optimization

Here are some more examples of problems you can try that I think have a similar idea.

Problem 1

Prove that in any connected graph {G} on {2004} vertices one can delete some edges to obtain a graph (also with {2004} vertices) whose degrees are all odd.

Problem 2 (USA TST 2017)

In a sports league, each team uses a set of at most {t} signature colors. A set {S} of teams is color-identifiable if one can assign each team in {S} one of their signature colors, such that no team in {S} is assigned any signature color of a different team in {S}. For all positive integers {n} and {t}, determine the maximum integer {g(n,t)} such that: In any sports league with exactly {n} distinct colors present over all teams, one can always find a color-identifiable set of size at least {g(n,t)}.

Feel free to post more examples in the comments.

Meritocracy is the worst form of admissions except for all the other ones

I’m now going to say something explicitly that I hinted at in June: I don’t think a student deserves to make MOP more because they had a higher score than another student.

I think it’s easy to get this impression because the selection for MOP is done by score cutoffs. So it sure looks that way.

But I don’t think MOP admissions (or contests in general) are meant to be a form of judgment. My primary agenda is to run a summer program that is good for its participants, and we get funding for N of them. For that, it’s not important which N students make it, as long as they are enthusiastic and adequately prepared. (Admittedly, for a camp like MOP, “adequately prepared” is a tall order). If anything, what I would hope to select for is the people who would get the most out of attending. This is correlated with, but not exactly the same as, score.

Two corollaries:

  • I support the requirement for full attendance at MOP. I know, it sucks for those star students who qualify for two conflicting programs and then have to choose. You have my apologies (and congratulations). But if you only come for 2 of 3 weeks, you took away a spot from someone who would have attended the whole time.
  • I am grateful to the European Girls’ MO for giving MOP an opportunity to balance the gender ratio somewhat; empirically, it seems to improve the camp atmosphere if the gender ratio is not 79:1.

Anyways, given my mixed feelings on meritocracy, I sometimes wonder whether MOP should do what every other summer camp does and have an application, or even a lottery. I think the answer is no, but I’m not sure. Some reasons I can think of behind using score only:

  1. MOP does have a (secondary) goal of IMO training, and as a result the program is almost insane in difficulty. For this reason you really do need students with significant existing background and ability. I think very few summer camps should explicitly have this level of achievement as a goal, even secondarily. But I think there should be at least one such camp, and it seems to be MOP.
  2. Selection by score is transparent and fair. There is little risk of favoritism, nepotism, etc. This matters a lot to me because, basically no matter how much I try to convince them otherwise, people will take any admissions decision as some sort of judgment, so better make it impersonal. (More cynically, I honestly think if MOP switched to a less transparent admissions process, we would be dealing with lawsuits within 15 years.)
  3. For better or worse, qualifying for MOP ends up being sort of a reward, so I want to set the incentives right and put the goalpost at “do maximally well on USAMO”. I think we design the USAMO well enough that preparation teaches you valuable lessons (math and otherwise). For an example of how not to set the goalpost, take most college admissions processes.

Honestly, the core issue might really be cultural, rather than an admissions problem. I wish there was a way we could do the MOP selection as we do now without also implicitly sending the (unintentional and undesirable) message that we value students based on how highly they scored.

MOHS hardness scale

There’s a new addition to my olympiad problems and solutions archive: I created an index of many past IMO / USAMO / USA TST(ST) problems, rated according to my opinion of their difficulty. You can grab the direct link to the file below:

https://evanchen.cc/upload/MOHS-hardness.pdf

In short, the scale runs from 0M to 50M in increments of 5M, and every USAMO / IMO problem on my archive now has a rating too.

My hope is that this can be useful in a couple ways. One is that I hope it’s a nice reference for students, so that they can better make choices about what practice problems would be most useful for them to work on. The other is that the hardness scale contains a very long discussion about how I judge the difficulty of problems. While this is my own personal opinion, obviously, I hope it might still be useful for coaches or at least interesting to read about.

As long as I’m here, I should express some concern that it’s possible this document does more harm than good, too. (I held off on posting this for a few months, but eventually decided to at least try it and see for myself, and just learn from it if it turns out to be a mistake.) I think there’s something special about solving your first IMO problem or USAMO problem or whatever and suddenly realizing that these problems are actually doable — I hope it would not be diminished by me rating the problem as 0M. Maybe more information isn’t always a good thing!

Understanding with System 1

Math must be presented for System 1 to absorb and only incidentally for System 2 to verify.

I finally have a sort-of formalizable guideline for teaching and writing math, and what it means to “understand” math. I’ve been unconsciously following this for years and only now managed to write down explicitly what it is that I’ve been doing.

(This post is written from a math-centric perspective, because that’s the domain my concrete object-level examples come from. But I suspect much of it applies to communicating hard ideas in general.)

S1 and S2

The quote above refers to the System 1 and System 2 framework from Thinking, Fast and Slow. Roughly it divides the brain’s thoughts into two categories:

  • S1 is the part of the brain characterized by fast, intuitive, automatic, instinctive, emotional responses. For example, when you read the text “2+2=?”, S1 tells you (without any effort) that this equals 4.
  • S2 is the part of the brain characterized by slow, deliberative, effortful, logical responses; for example, S2 is used to count the number of words in this sentence.

(The link above gives some more examples.)

The premise of this post is that understanding math well is largely about having the concept resonate with your S1, rather than your S2. For example, let’s take groups from abstract algebra. Then I claim that

G = \{ a/b \mid a,b \text{ odd integers} \}

is a group under the usual multiplication. Now, if you have a student who’s learning group theory for the first time, the only way they could see that this is a group is to compare it against a list of the group axioms, and have their S2 verify them one by one. But experienced people don’t do this: their S1 automatically tells them that G “feels” like a group (because e.g. it’s closed and doesn’t have division-by-zero issues).
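
(For the record, the full S2 checklist here is short: closure holds because {\frac ab \cdot \frac cd = \frac{ac}{bd}} and products of odd integers are odd, the identity is {1 = \frac 11}, the inverse of {\frac ab} is {\frac ba}, which is again a ratio of odd integers, and associativity is inherited from multiplication in {\mathbb Q}. The point is that an experienced reader never consciously runs through this list.)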

I think this S1-level understanding is what it means to “get it”. Verifying a solution to a hard olympiad problem by having S2 check each individual step is straightforward in principle, albeit time-consuming. The tricky part is to get this solution to resonate with S1. Hence my advice to never read a solution line by line.

Writing for S1

What this means is that if you’re trying to teach someone an idea, then you should be focusing on trying to get their S1 to grasp it, rather than just their S2. For example, in math it’s not enough to just give a sequence of logical steps which implies the result: give it life.

Here are some examples of ways I (try to) do this.

First, giving good concrete examples. S1 reacts well when it “sees” a concrete object like G above, and can see some intuitive properties about it right away. Abstract “symbol-pushing” is usually left to S2 instead.

Second, drawing pictures, so your S1 can actually see the object. On one extreme end, you can write something like “a point $S$ lies on the polar of $T$ if and only if $T$ lies on the polar of $S$”, but it’s much better to just have a picture:

You can even do this for things that aren’t really geometrical in nature. For example, my Napkin features the following picture of cardinal collapse when forcing.

Third, write like you talk, and share your feelings. S1 is emotional. S1 wants to know that compactness is a good property for a space to have, or that non-Noetherian rings are way too big and “only weirdos care about non-Noetherian rings” (just kidding!), or that ramified primes are the “finitely many edge cases” and aren’t worth worrying about. These S1 reactions you get are the things you want to pass on. In particular, avoid standard formal college-textbook-bleed-your-eyes-dry-in-boredom style. (To be fair, not all textbooks do this; this is one reason why I like Pugh’s book so much, for example.)

Even the mechanics on the page can be made to accommodate S1 in this way. S1 can’t read a wall of text; S2 has to put in effort to do that. But S1 can pick out section headers, or bolded phrases like this one, and so on and so forth. That’s why in Napkin all the examples are in separate red boxes and all the big theorems are in blue boxes, and important philosophical points are typeset in bold centered green text. This way S1 naturally puts its attention there.

But do not force it

On the flip side, if you’re trying to learn something, there’s a common failure mode where you keep forcing S2 to do something unnatural (rather than trying to have S1 figure it out). This is the kind of thing that happens when you don’t understand what the Chinese Remainder Theorem is trying to say, so you try to fix that by repeatedly reading the proof line by line, and you still don’t really understand what is going on. Usually this ends with S2 getting tired and not actually reading the proof after the third or fourth iteration.

(For the Chinese remainder theorem the right thing to do is ask yourself why any arithmetic progression with common difference 7 must contain multiples of 3: credits to Dominic Yeo again for that. I’m not actually sure what you’re supposed to do when stuck on math in general. Usually I just ask my friends what is going on, or give up for now and come back later.)
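
(Spelling that hint out slightly, in case it’s too cryptic: since {7 \equiv 1 \pmod 3}, the terms {a, a+7, a+14} are congruent to {a, a+1, a+2} modulo {3}, so among any three consecutive terms every residue mod {3} shows up, and in particular some term is a multiple of {3}. The reason this works, namely that {7} is invertible modulo {3}, is the same mechanism that drives the Chinese Remainder Theorem.)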

Actually, I really like the advice that SSC mentions: “develop instincts, then use them”.

MOP should do a better job of supporting its students in not-June

Up to now I’ve always felt a little saddened when I see people drop out of the IMO or EGMO team selection. But really I should be asking myself what I (as a coach) could do better to make sure the students know we value their effort, even if they ultimately don’t make the team.

Because we sure do an awful job of being supportive of the students, or, well, of really doing anything at all. There’s no practice material, no encouragement, nor really any form of contact whatsoever. Just three unreasonably hard problems each month, followed by a score report about a week later, starting in December and dragging on into April.

One of a teacher’s important jobs is to encourage their students. And even though we get the best students in the USA, probably we shouldn’t skip that step entirely, especially given the level of competition we put the students through.

So, what should we do about it? Suggestions welcome.