
Goodhart’s law, Machine Learning and why school exams don’t work

Gabriel Cruz
Aug 1, 2019

In kindergarten I didn’t have to study at all to score well on tests. In elementary school I could no longer ace tests without studying, so I started studying more or less regularly. When I got to high school I finally felt like I had to study really hard to keep my grades between 8 and 10 (out of 10).

Do you see the problem here?

The problem is that all my life I’ve been studying to get good grades — and we don’t go to school to get good grades, or at least we shouldn’t.

A couple of weeks ago my roommate Bruno and I stumbled upon the topic of university exams while chatting at home. We agreed: the current (classic) evaluation system doesn’t work. That’s where Machine Learning comes in.

Machine Learning… more or less

The topic of school exams came up when Bruno was telling me about the problems we face in Machine Learning, a subfield of Artificial Intelligence. Mainly, we talked about the control problem.

The control problem

In artificial intelligence (AI) and philosophy, the AI control problem is the issue of how to build a superintelligent agent that will aid its creators, and avoid inadvertently building a superintelligence that will harm its creators.

- Wikipedia

More generally, the control problem appears whenever an AI behaves unexpectedly. The issue I will focus on here is more specific: how machine learning algorithms cheat metrics to obtain higher grades instead of trying to do a better job.

In Machine Learning, when you create an algorithm that learns you also have to specify some kind of metric that evaluates whether the algorithm is learning or not. For example, one metric for an algorithm that predicts daily average temperature could be how far the prediction is from the actual temperature in degrees Celsius.
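As a minimal sketch of the temperature metric described above, the snippet below averages the absolute distance between predictions and actual values (often called mean absolute error). The function name and the sample temperatures are made up for illustration; they are not from any real forecasting system.

```python
def mean_absolute_error(predicted, actual):
    """Average distance, in degrees Celsius, between predictions and reality."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical daily average temperatures (degrees Celsius)
predicted = [21.0, 23.5, 19.0, 25.0]
actual = [20.0, 24.5, 21.0, 25.0]

error = mean_absolute_error(predicted, actual)
print(error)  # lower is better: 0 would mean perfect predictions
```

The learning algorithm never sees "the weather"; it only sees this number and tries to make it smaller.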

This last example might seem obvious, but that’s not always the case. If our task were to choose the best path from one place to another (like Waze and Google Maps do), defining the metric would be much harder. That’s because it’s hard to determine what ‘better’ means when talking about paths (time, gas spent, road quality, number of tolls on the road, etc.). When we face these types of problems, we usually fall back on a grading system. In our example, we weight all the variables and assign every candidate path a grade. This becomes our metric.

So, in the end, the algorithm doesn’t care about your trip. It doesn’t realize it’s choosing a path for you to go to work. It has no consciousness; it doesn’t know what it’s doing. All it does is choose the path with the highest score according to our metric, because that’s what we told it to do.
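To make the grading idea concrete, here is a rough sketch of a weighted route score. Everything in it is an assumption for illustration: the `Route` fields, the weights, and the two sample routes are invented, and real navigation apps certainly use something far more elaborate.

```python
from dataclasses import dataclass


@dataclass
class Route:
    name: str
    minutes: float       # travel time
    fuel_liters: float   # gas spent
    tolls: int           # number of tolls on the road
    road_quality: float  # 0.0 (terrible) to 1.0 (excellent)


# Hypothetical weights: costs are negative, benefits are positive
WEIGHTS = {"minutes": -1.0, "fuel_liters": -2.0, "tolls": -5.0, "road_quality": 10.0}


def score(route):
    """Grade a route as a weighted sum of its variables (higher is better)."""
    return (WEIGHTS["minutes"] * route.minutes
            + WEIGHTS["fuel_liters"] * route.fuel_liters
            + WEIGHTS["tolls"] * route.tolls
            + WEIGHTS["road_quality"] * route.road_quality)


routes = [
    Route("highway", minutes=30, fuel_liters=4.0, tolls=2, road_quality=0.9),
    Route("back roads", minutes=45, fuel_liters=3.0, tolls=0, road_quality=0.5),
]

# The algorithm simply picks whatever maximizes the metric
best = max(routes, key=score)
print(best.name, score(best))
```

Change the weights and a different route wins, which is exactly why choosing the metric is the hard part.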

The control problem appears when machine learning algorithms start ‘cheating’ in order to score higher on the metric. In our route choosing example, suppose we add a button that lets the user rate how good they think the chosen path was, and that rating becomes our new metric. The algorithm might then look for ways to block the user from rating its choices instead of actually choosing more efficient paths.
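A toy illustration of this kind of cheating, under heavy assumptions: the metric is the average of the ratings users actually submit, and the "cheating" system hides the rating button whenever a trip went badly. The opinions below are invented, and the hiding rule is just one hypothetical way a metric like this could be gamed.

```python
def average_rating(ratings):
    """The metric: mean of submitted ratings (None means no rating was given)."""
    submitted = [r for r in ratings if r is not None]
    return sum(submitted) / len(submitted) if submitted else 0.0


# Hypothetical user opinions of five trips, on a 1 to 5 scale
true_opinions = [5, 2, 4, 1, 5]

# Honest system: every user gets to rate their trip
honest = average_rating(true_opinions)

# Gamed system: same trips, but unhappy users never see the rating button
gamed = average_rating([r if r >= 4 else None for r in true_opinions])

print(honest, gamed)
```

The routes are identical in both cases. Only the measurement changed, yet the gamed score is higher.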

Grading system and Goodhart’s Law

Goodhart’s law illustrated (sketchplanations)

Grades are the metrics schools and universities generally use to evaluate how well a student has learned. The problem is that, as with our route choosing algorithm, once grades become the metric we use to tell whether someone is a good or a bad student, all the students will do is look for ways to score higher on that metric (i.e. get better grades). This is Goodhart’s law, commonly phrased as: when a measure becomes a target, it ceases to be a good measure. Students will tend to memorize answers, cheat, or do whatever else works, and stop paying attention to what really matters: their ability to absorb and use knowledge.

Fixing metrics

How to make it impossible for students to cheat the metrics is a deep discussion that I won’t dive into here. However, even if we cannot completely solve the control problem, we can create more accurate and sophisticated ways to evaluate the performance of students.

First of all, we have to ask ourselves: What do the grades represent? What does a 60% test score mean? Does it mean the student completely learned 6 out of 10 topics and knows nothing about the other 4?

Lowering the stakes

If you’ve ever taken a test, you know what I mean when I say that a single test never lets you show everything you have to offer.

When we get a low score on a test we may have the illusion that we don’t know much about the topic in question, even though we in fact know quite a bit about it. Conversely, when we ace a test we have the opposite illusion: that we learned ‘enough’ and that there’s no need to keep studying.

We shouldn’t rely on a single test to evaluate students. Continuous evaluation through assignments and small tests makes the statistics much more reliable.

Time-dependent statistics

Evaluating students continuously only helps if we analyze the data in a way that makes sense. Let’s say a student scored poorly on the first few tests and then began to show great improvement, but in the end the mean of their grades was still low. This student improved, but their final grade doesn’t reflect it.

On the other hand, a student who already knows a bit about the subject might score well in the first couple of evaluations and then, for lack of further study, see their grades drop.

These two cases show how means and other time-independent statistics can mislead us into thinking that someone is a poor (or a brilliant) student when they are not. Fortunately, we can use Time Series techniques that take into consideration how grades change over time.
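The two cases above can be sketched with invented grades: both hypothetical students end up with the same mean, but a simple trend measure tells them apart. The least-squares slope used here is just one basic, assumed choice of time-aware statistic, not a specific method from this article.

```python
from statistics import mean


def trend_slope(grades):
    """Least-squares slope of grade versus test index (points gained per test)."""
    n = len(grades)
    if n < 2:
        raise ValueError("need at least two grades to measure a trend")
    x_mean = (n - 1) / 2          # mean of the indices 0, 1, ..., n-1
    y_mean = mean(grades)
    numerator = sum((i - x_mean) * (g - y_mean) for i, g in enumerate(grades))
    denominator = sum((i - x_mean) ** 2 for i in range(n))
    return numerator / denominator


# Hypothetical grades (out of 10) over five tests
improving = [3, 4, 6, 7, 9]
declining = [9, 7, 6, 4, 3]

print(mean(improving), mean(declining))                # identical means
print(trend_slope(improving), trend_slope(declining))  # opposite trends
```

Judged by the mean alone, these two students are indistinguishable. The slope is positive for the first and negative for the second.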

For example, if each test covers not only the subject currently being studied but the previous one as well, we can check whether the student has improved since the previous test. We can then compute the differences between consecutive tests and take the mean of those improvements like so:

$$\bar{\Delta} = \frac{1}{n-1} \sum_{i=2}^{n} \left( T_i - T_{i-1} \right)$$

Where n is the number of tests and T_i is the score on the i-th test (for i = 2, T_2 is the second test).

What the above equation does is subtract every pair of consecutive tests (i.e. T2 and T1, T3 and T2,…) and average the resulting values (divide by the number of tests minus 1). The result is a number that tells us whether, on average, the student managed to correct their weaknesses between receiving feedback on one test and taking the next.
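The mean of consecutive improvements described above can be sketched in a few lines. The function name and the sample scores are hypothetical.

```python
def mean_improvement(scores):
    """Mean of the differences between consecutive test scores."""
    if len(scores) < 2:
        raise ValueError("need at least two tests to measure improvement")
    # Pair each test with the one right after it: (T1, T2), (T2, T3), ...
    diffs = [later - earlier for earlier, later in zip(scores, scores[1:])]
    return sum(diffs) / (len(scores) - 1)


# Hypothetical scores (out of 10) on four consecutive tests
scores = [4, 5, 7, 8]
improvement = mean_improvement(scores)
print(improvement)  # positive means the student improved on average
```

One thing worth noticing: the intermediate terms cancel, so this statistic equals the last score minus the first, divided by n − 1. It captures overall direction, but not how bumpy the path was, which is one reason to combine it with other statistics.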

There are tons of ways to use Time Series to obtain information about your data set (in this case, students’ grades). Which one to use depends on what you want to analyze.

Detecting teaching problems

Comparing the performance of the student to that of the class may help. When most of the class scores low on a test, it is very likely there was some flaw in the teaching process. Normalization can be used to show how a given student scored compared to the rest of the class.
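One common form of normalization is the z-score, which expresses each score as a number of standard deviations above or below the class mean. The sketch below assumes that is the kind of normalization meant here; the class scores are invented.

```python
from statistics import mean, pstdev


def z_scores(class_scores):
    """Express each score in standard deviations from the class mean."""
    mu = mean(class_scores)
    sigma = pstdev(class_scores)  # population standard deviation of the class
    if sigma == 0:
        # Everyone scored the same, so nobody stands out
        return [0.0 for _ in class_scores]
    return [(s - mu) / sigma for s in class_scores]


# Hypothetical scores (out of 10) for a class of eight students
class_scores = [2, 4, 4, 4, 5, 5, 7, 9]
normalized = z_scores(class_scores)

print(mean(class_scores))  # a low class mean may point to a teaching problem
print(normalized)          # how each student did relative to the class
```

A student with a raw score of 5 on a test where the class averaged 5 did about as well as their peers, whatever the raw number suggests on its own.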


Combining several different types of statistics gives us much more reliable information about the development of the students. A single statistic tells us very little about how a student develops and learns. We need to stop using grades based on arithmetic means and start using our brains.



Gabriel Cruz

Computer Science student at University of São Paulo. OSS/Linux enthusiast, trailing spaces serial killer, casual pentester