There’s been a lot of talk on the statistics/machine learning/computer science blogs this week about an article in Wired called The End of Theory. Basically, everyone thinks the author, one Chris Anderson, has lost his damn mind. The piece argues that the enormous amounts of data available to modern computers, combined with advances in statistical modeling and analysis techniques, will lead to a time when the old scientific method is no longer used. The argument is that we will give up the practice of building and testing hypotheses in favor of querying huge databases for correlations. I’ll use the same passage as Ed Felten to sum up the article:
[...] The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.
Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.
But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. Consider physics: Newtonian models were crude approximations of the truth (wrong at the atomic level, but still useful). A hundred years ago, statistically based quantum mechanics offered a better picture — but quantum mechanics is yet another model, and as such it, too, is flawed, no doubt a caricature of a more complex underlying reality. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades (the “beautiful story” phase of a discipline starved of data) is that we don’t know how to run the experiments that would falsify the hypotheses — the energies are too high, the accelerators too expensive, and so on.
Among the interesting reactions, we’ve got Andrew Gelman, Drew Conway, Fernando Pereira, Cosma Shalizi, and Ed Felten. They have a range of more and less technical reasons for disagreeing, all of which are interesting and seem on-point to me. Dr. Felten’s explanation of his disagreement is the easiest to understand:
To take a simple example, suppose we discover a correlation between eating spinach and having strong muscles. Does this mean that eating spinach will make you stronger? Not necessarily; this will only be true if spinach causes strength. But maybe people in poor health, who tend to have weaker muscles, have an aversion to spinach. Maybe this aversion is a good thing because spinach is actually harmful to people in poor health. If that is true, then telling everybody to eat more spinach would be harmful. Maybe some common syndrome causes both weak muscles and aversion to spinach. In that case, the next step would be to study that syndrome. I could go on, but the point should be clear. Correlations are interesting, but if we want a guide to action — even if all we want to know is what question to ask next — we need models and experimentation. We need the scientific method.
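The "common syndrome" scenario in that passage is easy to play with numerically. Here is a minimal sketch, not anything from Felten's post: a hidden `health` variable drives both spinach eating and muscle strength, and spinach itself does nothing. The variable names, the Gaussian noise, and the sample size are all assumptions made up for illustration.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient, standard library only."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n=20000, seed=0):
    """Return (observational, experimental) spinach/strength correlations.

    Hypothetical model: `health` is a hidden common cause. Spinach has
    no causal effect on strength in either scenario.
    """
    rng = random.Random(seed)
    health = [rng.gauss(0, 1) for _ in range(n)]

    # Observational data: healthier people eat more spinach AND are stronger.
    spinach_observed = [h + rng.gauss(0, 1) for h in health]
    strength_observed = [h + rng.gauss(0, 1) for h in health]

    # Experiment: spinach intake is assigned at random, ignoring health.
    spinach_assigned = [rng.gauss(0, 1) for _ in range(n)]
    strength_experiment = [h + rng.gauss(0, 1) for h in health]

    return (
        pearson(spinach_observed, strength_observed),
        pearson(spinach_assigned, strength_experiment),
    )


if __name__ == "__main__":
    observational, experimental = simulate()
    print(f"observational correlation: {observational:.2f}")
    print(f"experimental correlation:  {experimental:.2f}")
```

Under these assumptions the observational data show a clear positive correlation even though spinach has no effect, and the correlation all but vanishes once spinach is assigned at random. Mining the first data set alone cannot tell you which world you are in; the experiment can.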
It’s true that correlations are enough if all you want to do is make money selling ads. In that case, as Anderson says, “Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.” But scientists interested in human behavior would see this argument as completely backwards. To a scientist, the behavior is not “the point,” but a place to begin. Science is a process of forming an understanding of the world we live in, and the one thing data mining doesn’t produce is understanding. It may produce actionable predictions, but it won’t explain them to you.
For instance, here’s another claim from the article:
The best practical example of this is the shotgun gene sequencing by J. Craig Venter. Enabled by high-speed sequencers and supercomputers that statistically analyze the data they produce, Venter went from sequencing individual organisms to sequencing entire ecosystems. In 2003, he started sequencing much of the ocean, retracing the voyage of Captain Cook. And in 2005 he started sequencing the air. In the process, he discovered thousands of previously unknown species of bacteria and other life-forms.
It’s great that Craig Venter is able to sequence a bunch of genomes. Everyone agrees this is a cool project. But, more than anything else, it’s a starting point. A bunch of DNA sequences on disk may produce interesting correlations, but they don’t advance our biological understanding of global ecosystems until they’ve been used to build testable hypotheses.
A couple of months ago I was hanging around in the lobby before an invited lecture on machine learning, and wandered into a conversation between the speaker and a couple of the CS faculty here at UCI. Since I’m not entirely comfortable quoting professors from memory months after the fact and without asking permission, I won’t say exactly who, but it was one of the people high up on this list. Anyway, the person in question is an expert in the fields of machine learning and data mining. So I came into the conversation late, and just caught someone repeating a claim from elsewhere that soon machine learning would make the scientific method obsolete; it was a claim very much like Anderson’s. And this professor, whose research involves thinking up clever new ways to mine data, said, “I think that’s exactly the wrong way to think about it.” I don’t remember the rest of the quote verbatim, but the gist was: In a perfect world, machine learning and data mining would become unnecessary, because we would have a sufficiently complete understanding not to have to resort to them. They are effectively stop-gap measures, which we rely on to make predictions (and, in a lot of cases, money) when we’re willing to act without having (or understanding) interpretable reasons. But we shouldn’t look forward to a world in which we can stop searching for that understanding.
June 27th, 2008 at 7:57 am
“Correlation does not equal causation.” Repeat as needed. Is he really claiming regular experimentation won’t be feasible in the future? If so, how would he explain the size and expense of the LHC? It’s either an anomaly or the way of the future.
Reminds me of that “Frosty The Snowman” song. One specific line in it should be troubling to any scientifically-minded person: “There must have been some magic in that old silk hat they found for when they placed it on his head he began to dance around.”
“Must have been”? Correlation does not equal causation!
June 27th, 2008 at 2:53 pm
The claim is not that it won’t be feasible, but that it will be unnecessary to apply the traditional scientific method.