It seems that I have started blogging about science at the wrong time – there will soon be nothing to talk about. This is according to Wired: The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Roughly speaking, we will no longer need scientific theory, because the data themselves describe things better without it. (Incidentally, Google has all the data.)
Of course, such a revelation has made waves in the press and, of course, the blogosphere. As a theoretician I feel somewhat defensive about having a job, so I’m going to join the crowds by explaining why this is just very, very wrong. It’s quite simple really. The scientific method works as follows:
1. Observe a phenomenon (i.e. data).
2. Form a hypothesis about the cause (i.e. explain the data).
3. Think of a new way to test the hypothesis (i.e. get more data).
4. Perform the test. If it fails, go to 2. If it succeeds, go to 3.
5. Use the hypothesis as a prediction.
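The loop above can be sketched in code. This is a toy illustration only – the "phenomenon" (a hidden law y = 2x) and the candidate hypotheses (guessed slopes) are invented placeholders, not anything from the post:

```python
# Toy sketch of the scientific-method loop. The phenomenon and the
# candidate hypotheses below are invented for illustration.

def observe():
    # 1. observe a phenomenon: a few (x, y) measurements of y = 2x
    return [(1, 2), (2, 4), (3, 6)]

def form_hypothesis(data, rejected):
    # 2. explain the data: try a candidate slope we haven't rejected yet
    for slope in (1, 2, 3):
        if slope not in rejected:
            return slope

def new_test():
    # 3. think of a new way to test it: a fresh x value each time
    new_test.x += 1
    return new_test.x
new_test.x = 3

def run_test(x, slope):
    # 4. perform the test: does the prediction match the true law y = 2x?
    return slope * x == 2 * x

data = observe()
rejected = set()
hypothesis = form_hypothesis(data, rejected)
successes = 0
while successes < 3:                 # accept after repeated successes
    x = new_test()
    if run_test(x, hypothesis):      # success: test again (back to 3)
        successes += 1
    else:                            # failure: back to 2
        rejected.add(hypothesis)
        hypothesis = form_hypothesis(data, rejected)
        successes = 0

# 5. use the accepted hypothesis as a prediction
print("accepted slope:", hypothesis)
```

The wrong slope fails its first test and is replaced; the right one survives repeated testing and is accepted.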
The scientist keeps gathering data, and the hypothesis gains evidence and becomes accepted over time – unless, of course, some new piece of evidence is found that the hypothesis cannot account for. Hence science is a set of theories that can change with the evidence. The hypothesis can be used to make predictions, and provides an explanation for the phenomenon.
The new idea is that we now have more evidence than we know what to do with right from the beginning. In fact, by describing the data statistically, we can skip straight from 1 to 5: the data is the model. Predictions can be made without ever having to think about what the data mean.
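A minimal sketch of what "the data is the model" could mean in practice: predict by looking up the nearest measured point, with no hypothesis about the underlying law. The data set here is invented for illustration:

```python
# "The data is the model": prediction straight from raw measurements,
# skipping the hypothesis steps entirely. Data invented for illustration.

data = {0.0: 0.0, 1.0: 1.0, 2.0: 4.0, 3.0: 9.0}   # measured (x, y) pairs

def predict(x):
    # no explanation of the data - just return the nearest measurement
    nearest = min(data, key=lambda xi: abs(xi - x))
    return data[nearest]

print(predict(1.9))    # close to a measurement: a reasonable answer
print(predict(10.0))   # far from any measurement: answers blindly anyway
```

The second call hints at the trouble discussed next: the lookup answers just as confidently far outside the data as within it.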
This sounds great, until one thinks a little about the nature of prediction. There are two main types: interpolation and extrapolation. Interpolation means asking what happens between regions we have measurements for, and statistical models are perfect for this. Extrapolation means asking about things outside the measured data, and statistical techniques are really bad at it: two descriptions can fit the data itself equally well yet give wildly different predictions outside it. The only solution is to know what on earth is actually going on and to predict based on that – and the only way to achieve that is to build a model of the system’s behaviour, based upon repeated testing of what it does.
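The extrapolation problem is easy to demonstrate. Here are two made-up "descriptions" that fit the measured points equally well (in this case, exactly) but disagree wildly outside them; the data and both models are invented for illustration:

```python
# Two descriptions that agree perfectly on the measured data but give
# wildly different extrapolations. Data and models invented for illustration.

xs = [0, 1, 2]                        # measured inputs
ys = [0, 1, 2]                        # measured outputs

def model_a(x):
    return x                          # a straight line through the points

def model_b(x):
    return x + x * (x - 1) * (x - 2)  # a cubic through the same points

# both describe the measured data perfectly:
assert all(model_a(x) == y for x, y in zip(xs, ys))
assert all(model_b(x) == y for x, y in zip(xs, ys))

print(model_a(1.5), model_b(1.5))   # interpolation: 1.5 vs 1.125 - close
print(model_a(10), model_b(10))     # extrapolation: 10 vs 730 - wildly apart
```

Nothing in the data alone can tell you which model to trust at x = 10; only knowledge of the underlying mechanism can.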
To give an example: in my own research I consider the behaviour of bacteria in the human gut. You can’t measure it directly. People eat food, and stuff comes out the other end. That’s all you get. But you can model the gut by building an experiment with similar properties. And you can model the experiment with maths, and use that to predict what is happening in the gut. You can’t do this any other way, because the data don’t tell you anything about what is going on inside the gut itself. You can mathematically PROVE that there is not enough information in a googleplex of measurements on a live person. You need the experiment, and to account for the differences between the experiment and the body, you need the model.
Google data may be coming out of our arses, but we’ve no idea where it’s been in between.