## Overfitting

One of the things I learned in math is that a polynomial of degree N can pass through N+1 arbitrary points. A straight line goes through any two points, a parabola goes through any three points, and so forth. The practical upshot of this is that if your equation is complex enough, you can fit it to any data set.

That’s basically what happened to the geocentric model: it started out simple, with planets going around the Earth in circles. Except that some of the planets wobbled a bit. So they added more terms to the equations to account for the wobbles. Then there turned out to be more wobbles on top of the first wobbles, and more terms had to be added to the equations to take those into account, and so on until the theory collapsed under its own weight. There wasn’t any physical mechanism or cause behind the epicycles (as these wobbles were called). They were just mathematical artifacts. And so, one could argue that the theory was simpler when it had fewer epicycles and didn’t explain all of the data, but also was less wrong.

Take another example (adapted from Russell Glasser, who got it from his CS instructor): let’s say you and I order a pizza, and it comes with olives. I hate olives and you love them, so we want to cut it up in such a way that we both get slices of the same size, but your slice has as many of the olives as possible, and mine have as few as possible. (And don’t tell me we could just order a half-olive pizza; I’m using this as another example.)

We could take a photo of the pizza, feed it into an algorithm that’ll find the position of each olive and come up with the best way to slice the pizza fairly, but with a maximum of olives on your slices.

The problem is, this tells us nothing about how to slice the next such pizza that we order. Unless there’s some reason to think that the olives on the next pizza will be laid out in some similar way on the next pizza, we can’t tell the pizza parlor how to slice it up when we place our next order.

In contrast, imagine if we’d looked at the pizza and said, “Hm. Looks like the cook is sloppy, and just tossed a handful of olives on the left side, without bothering to spread them around.” Then we could ask the parlor slice to slice it into wedges, and we have good odds of winding up with three slices with extra olives and three with minimal olives. Or if we’d found that the cook puts the olives in the middle and doesn’t spread them around. Then we could ask the parlor to slice the pizza into a grid; you take the middle pieces, and I’ll take the outside ones.

But our original super-optimal algorithm doesn’t allow us to do that: by trying to perfectly account for every single olive in that one pizza, it doesn’t help us at all in trying to predict the next pizza.

In The Signal and the Noise, Nate Silver calls this overfitting. It’s often tempting to overfit, because then you can say, “See! My theory of Economic Epicycles explains 29 of the last 30 recessions, as well as 85% of the changes in the Dow Jones Industrial Average!” But is this exciting new theory right? That is, does it help us figure out what the future holds; whether we’re looking at a slight economic dip, a recession, or a full-fledged depression?

We’ve probably all heard the one about how the Dow goes up and down along with skirt hems. Or that the performance of the Washington Redskins predicts the outcome of US presidential elections. Of course, there’s no reason to think that fashion designers control Wall Street, or that football players have special insight into politics. More importantly, it goes to show that if you dig long enough, you can find some data set that matches the one you’re looking at. And in this interconnected, online, googlable world, it’s easier than ever to find some data set that matches what you want to see.

These two examples are easy to see through, because there’s obviously no causal relationship between football and politics. But we humans are good at telling convincing stories. What if I told you that pizza sales (with or without olives) can help predict recessions? After all, when people have less spending money, they eat out less, and pizza sales suffer.

I just made this up, both the pizza example and the explanation. So it’s bogus, unless by some million-to-one chance I stumbled on something right. But it’s a lot more plausible than the skirt or football examples, and thus we need to be more careful before believing it.

Update: John Armstrong pointed out that the first paragraph should say “N+1”, not “N”.

Update 2: As if on cue, Wonkette helps demonstrate the problems with trying to explain too much in this post about Glenn Beck somehow managing to tie together John Kerry’s presence or absence on a boat, his wife’s seizure, and Hillary Clinton’s answering or not answering questions about Benghazi. Probably NSFW because hey, Wonkette. But also full of Glenn Beck-ey crazy.

## The Mainstream Media’s Fluff Bias

If you watch TV news, you may have found yourself wondering why they’re going on and on and on about Paris or Britney or Lindsey or whoever, instead of, like, news that you care about.

Russell Glasser, a student at U Texas at Austin, is currently writing a thesis on this subject. His basic approach was to data-mine both Google News and Digg, the idea being that the former gives a representative sample of what the news media are talking about, and the latter gives insight into what people are actually interested in. You can read more about this, as well as a draft of the thesis, here.

The basic idea is fairly simple: let’s say that in a given month, there are 99 news stories about Britney Spears and one about Tiger Woods; but on Digg, a lot of people recommend the one Tiger Woods story and ignore the 99 Britney Spears stories. It seems reasonable to conclude that the news-reading public cares a lot more about Tiger than about Britney.

And, in fact, his conclusion is that there’s a definite bias towards fluff in the media.

The whole thing is also an interesting exercise in teasing information out of noisy data: any given datum could easily be wrong: someone might Digg every story he reads; or a newspaper might run two Paris Hilton stories in one day because the news editor and entertainment editor didn’t realize that the other one was already covering the story. But if you have a lot of such noisy data (and nowadays, thanks to the Spew, we do), then you can still tease information out of it.