Archive for the ‘Computer Science’ Category

Sean Carroll Piles On Wired, Totally Pwnz
Tuesday, July 1st, 2008

Over at Cosmic Variance, Sean Carroll addresses the Wired article on the “end of theory.” He makes a similar argument to mine, only he does it much better. For instance, I didn’t have this excellent one-line demolition of the whole argument: “Theory is understanding, and understanding our world is what science is all about.”

Highly recommended for examples involving Brahe, Kepler, Newton and the Large Hadron Collider.

Wired Magazine Doesn’t Understand Science
Thursday, June 26th, 2008

There’s been a lot of talk on the statistics/machine learning/computer science blogs this week about an article in Wired called The End of Theory. Basically, everyone thinks the author, one Chris Anderson, has lost his damn mind. The piece argues that the enormous amounts of data available to modern computers, combined with advances in statistical modeling and analysis techniques, will lead to a time when the old scientific method is no longer used. The argument is that we will give up the practice of building and testing hypotheses in favor of querying huge databases for correlations. I’ll use the same passage as Ed Felten to sum up the article:

[...] The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.

Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise.

But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. Consider physics: Newtonian models were crude approximations of the truth (wrong at the atomic level, but still useful). A hundred years ago, statistically based quantum mechanics offered a better picture — but quantum mechanics is yet another model, and as such it, too, is flawed, no doubt a caricature of a more complex underlying reality. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades (the “beautiful story” phase of a discipline starved of data) is that we don’t know how to run the experiments that would falsify the hypotheses — the energies are too high, the accelerators too expensive, and so on.

Among the interesting reactions, we’ve got Andrew Gelman, Drew Conway, Fernando Pereira, Cosma Shalizi, and Ed Felten. They have a range of more and less technical reasons for disagreeing, all of which are interesting and seem on-point to me. Dr. Felten’s explanation of his disagreement is the easiest to understand:

To take a simple example, suppose we discover a correlation between eating spinach and having strong muscles. Does this mean that eating spinach will make you stronger? Not necessarily; this will only be true if spinach causes strength. But maybe people in poor health, who tend to have weaker muscles, have an aversion to spinach. Maybe this aversion is a good thing because spinach is actually harmful to people in poor health. If that is true, then telling everybody to eat more spinach would be harmful. Maybe some common syndrome causes both weak muscles and aversion to spinach. In that case, the next step would be to study that syndrome. I could go on, but the point should be clear. Correlations are interesting, but if we want a guide to action — even if all we want to know is what question to ask next — we need models and experimentation. We need the scientific method.
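Felten's "common syndrome" scenario is easy to fake in a few lines. What follows is only a toy sketch: every number in it (the 30% syndrome rate, the spinach preferences, the 15-point strength penalty) is invented for illustration, and the names `simulate` and `strength_gap` are mine, not anything from his post. In this made-up world spinach has no effect on strength at all, yet the raw data still shows spinach eaters as stronger.

```python
import random
from statistics import mean


def simulate(n=20000, seed=0):
    """Generate (syndrome, eats_spinach, strength) tuples for n people.

    Assumed toy model: the syndrome lowers strength AND makes people
    avoid spinach. Spinach itself never enters the strength formula.
    """
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        syndrome = rng.random() < 0.3
        # People with the syndrome are much less likely to eat spinach.
        eats_spinach = rng.random() < (0.2 if syndrome else 0.7)
        # Strength depends only on the syndrome plus noise.
        strength = 50.0 - (15.0 if syndrome else 0.0) + rng.gauss(0.0, 5.0)
        people.append((syndrome, eats_spinach, strength))
    return people


def strength_gap(people):
    """Mean strength of spinach eaters minus mean strength of non-eaters."""
    eaters = [s for _, eats, s in people if eats]
    others = [s for _, eats, s in people if not eats]
    return mean(eaters) - mean(others)


people = simulate()
overall_gap = strength_gap(people)
gap_with_syndrome = strength_gap([p for p in people if p[0]])
gap_without_syndrome = strength_gap([p for p in people if not p[0]])

print(f"overall gap:          {overall_gap:.2f}")
print(f"gap, syndrome only:   {gap_with_syndrome:.2f}")
print(f"gap, no syndrome:     {gap_without_syndrome:.2f}")
```

With these invented parameters, the overall gap is several points in spinach's favor, but it all but vanishes once you compare people who share the same syndrome status. Of course, the code only knows to make that split because we wrote the model first, which is rather the point.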

It’s true that correlations are enough if all you want to do is make money selling ads. In that case, as Anderson says, “Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.” But scientists interested in human behavior would see this argument as completely backwards. To a scientist, the behavior is not “the point,” but a place to begin. Science is a process of forming an understanding of the world we live in, and the one thing data mining doesn’t produce is understanding. It may produce actionable predictions, but it won’t explain them to you.

For instance, here’s another claim from the article:

The best practical example of this is the shotgun gene sequencing by J. Craig Venter. Enabled by high-speed sequencers and supercomputers that statistically analyze the data they produce, Venter went from sequencing individual organisms to sequencing entire ecosystems. In 2003, he started sequencing much of the ocean, retracing the voyage of Captain Cook. And in 2005 he started sequencing the air. In the process, he discovered thousands of previously unknown species of bacteria and other life-forms.

It’s great that Craig Venter is able to sequence a bunch of genomes. Everyone agrees this is a cool project. But, more than anything else, it’s a starting point. A bunch of DNA sequences on disk may produce interesting correlations, but they don’t advance our biological understanding of global ecosystems until they’ve been used to build testable hypotheses.

A couple of months ago I was hanging around in the lobby before an invited lecture on machine learning, and wandered into a conversation between the speaker and a couple of the CS faculty here at UCI. Since I’m not entirely comfortable quoting professors from memory months after the fact and without asking permission, I won’t say exactly who, but it was one of the people high up on this list. Anyway, the person in question is an expert in the fields of machine learning and data mining. So I came into the conversation late, and just caught someone repeating a claim from elsewhere that soon machine learning would make the scientific method obsolete; it was a claim very much like Anderson’s. And this professor, whose research involves thinking up clever new ways to mine data, said, “I think that’s exactly the wrong way to think about it.” I don’t remember the rest of the quote verbatim, but the gist was: In a perfect world, machine learning and data mining would become unnecessary, because we would have a sufficiently complete understanding not to have to resort to them. They are effectively stop-gap measures, which we rely on to make predictions (and, in a lot of cases, money) when we’re willing to act without having (or understanding) interpretable reasons. But we shouldn’t look forward to a world where we can stop searching for that understanding.

Haughty
Sunday, May 18th, 2008

I promise that this won’t devolve into a scarcely-updated Mathematica blog (there’s a post about bourbon coming soon, for one), but I thought this bit from the program’s “reading and writing files” documentation was priceless:

In Mathematica’s standard notebook interface, you are directly giving input and getting output every time you press Shift+Enter. Although much more rarely needed than in more primitive languages, Mathematica also allows you to get input and generate output as side effects in a computation.

This is so amazing that I don’t even have a joke. If I were smart, like Dr. Shalizi, I might have a joke. But I’m not, so I don’t.

(I saw Cosma speak on Friday. He was really good, and actually did get in a sly crack at A New Kind of Science, though I’m pretty sure I was the only one at UCI who got it. He was talking about models of complex systems and he said something like, “It’s not good enough to simply recreate the behavior of some part of a system and say, ‘Aha, since this looks just like that, then the mechanism behind this must be the mechanism behind that.’ That is, unless you’re writing a 1200-page self-published tome.”)

“BOOM! You couldn’t figure it out, and I’m right.”
Wednesday, January 16th, 2008

There is a student in my department who is known for this tendency to derail lectures. He always has a lot of questions, and even more ideas. He has a few pet methods (hidden Markov models and neural networks, in particular), and he always thinks the problem at hand could be solved via their application — in fact, he’d usually like for the class to listen along as he thinks out loud, until he has figured out exactly how this is all going to work.

Take today, for example. The class is split equally between computer scientists and biologists. The professor explained a problem, which every computer scientist immediately realized is a pretty canonical example of the sort of problem to which the “expectation-maximization” technique is applied. Then he asked if anyone had any ideas as to how to solve it.

Most of the computer scientists in the room turned to the nearest biologist and whispered, “Yeah, you use EM,” because we are all assholes who like to show off. But no one wanted to waste everyone’s time trying to explain EM without the use of prepared slides, so we waited quietly for the professor to continue.

Which, of course, is precisely the kind of opening that The Guy With Ideas was looking for. He pounced, and started a two-minute exposition that no one followed, but which ended with, “It’s kind of like you start with a hidden Markov model, and use something sort of like EM.”

While I was trying to figure out some way to get those two minutes back, the professor said, “Why do you need an HMM?”

“Because they’re good at solving lots of problems.”

“But they’re not applicable because [... there's a good reason].”

“Yeah, but I think they are, because [... nonsense, we've already covered the reasons why you don't need that kind of machinery].”

“Anyway, you’re half right. We’re going to use EM, but not hidden Markov models.”

Then the professor began to lay the foundation for explaining EM. It’s a long story, and it will probably take the whole next lecture to complete. Five minutes in, The Guy With Ideas pipes back up.

“Wait, what if instead we set this variable to 60% and that one to 25%, and then we [... this went on for a while, and was impossible to follow]. Basically, it’s kind of like EM.”

He seriously said this like 1) he had just invented the ideas behind EM, and had to give us all an example, and 2) he hadn’t just had the same idea a few minutes ago, and 3) the professor hadn’t said that he was correct, way back then.

The professor, who, three meetings into the course, is clearly struggling to deal with the interruptions, just said, “You’re right. We’re going to use EM.”

The Guy With Ideas apparently couldn’t believe that one of his ideas was correct, because he said, “Wait, I’m right?” Then he turned to the girl beside him, pointed two fingers into her face, and yelled, “BOOM! You couldn’t figure it out, and I’m right.”

Score one for you, Guy With Ideas. You totally showed the girl next to you, and it only ate up a total of about ten minutes of class time. I’d say that’s a victory for the record books.

Criticism Which is Correct, Confusing
Sunday, December 30th, 2007

Keith is getting ready to begin a master’s program with an advisor who is interested in data mining, and this got me interested in Padhraic Smyth’s book on the subject. Dr. Smyth is one of my favorite people at UCI, and he knows from statistical machine learning, so I expected the book to be good. Glancing at the reviews, I noticed that they show an interesting structure: there are a lot of 5-star reviews, and a few 1-star reviews, and a smattering in the middle. Clearly it’s a divisive book: some people think it’s a great survey of important techniques, and some people wish it were more oriented toward producing working code for large-scale business systems.

And then there’s “Mustafa,” who just thinks it sucks. And, while I haven’t read the book, I imagine that most of his criticism is valid:

Finally .. I recevie the book .. I read the list of content and I surprised about it .. and now I know why they dont write the contents here to read before bying the book ..
This is a bad statistics book, you can read any thing in it except about Data Mining … No Cluster Analysis .. No Nural Networks .. No Rule induction No Dicecion Trees .. Nothing and nothing and nothing …
And I want to sell this bad book which Name is Data Mining … for the three lier writers.

The omission of “Nural Networks” does seem to be a glaring mistake, but surely most of that important material is made up for by the inclusion of neural networks on page 173. And don’t get me started on the authors’ decision to exclude “Dicecion Trees.” There’s simply no excuse for that.

Who are the “9 of 44 people” who found this review helpful? More importantly, how can one avoid ever having to deal with them professionally?