Testing a security product can sometimes be a very hard job. I don't mean internal QA-type testing, where you look for logic or syntax flaws in a system; I am talking about validating that a technology is effective against a difficult-to-enumerate threat. If you are testing a flaw finder, you can create code with a specific number of flaws and then refine the tool until all of them are detected. Likewise, if you are writing a vulnerability scanner that searches for known holes (à la Nessus), you can construct a pool of example systems where the flaws can be found.
There are many situations where a known set of test vectors cannot be created, making the validation of a security technology somewhat hairy. What happens when the product you are testing is designed to catch threats that are rapidly evolving? Building a corpus of known threats for testing against a live system is somewhat futile if the time between when the corpus is constructed and when it is used for testing is long enough for the threat to evolve significantly. Either the live system has to be tested against live data, or the system, which most likely has been fetching updates, has to be "rolled back" to the state it was in at the time each element in the test corpus was first detected.
Consider anti-spam systems, for example. Performing an accurate test of our software is one of the difficulties my organization, Cloudmark, has had with potential customers. One thing we stress, over and over, is that the test environment has to be as close to the production environment as possible, especially from a temporal standpoint. Just as users don't expect their mail to be routinely delayed by 6 hours before being delivered, evaluators shouldn't run a 6-hour-old spam corpus through the system to evaluate its efficacy as a filter. As the time between when a message is received at a mail gateway and when it is scanned by the filter increases, the filter's measured accuracy creeps toward perfection, because the system has had time to learn about each threat, and the test is thus invalidated.
The "accuracy drift" of the anti-spam system over the course of 6 hours would be insignificant were it not for the fact that spam evolves so damned fast. If spam didn't evolve, then Bayesian filters and sender blacklists would have been the be-all and end-all solution for the problem. The past year has seen more and more customers realize, sometimes before we even talk to them at length about testing, that a live data stream is essential for evaluating a new anti-spam solution. I suspect this is because they found their previous methodology lacking when the competitor's product they bought performed more poorly in production than the test had predicted.
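To make the drift concrete, here is a minimal toy simulation (my own illustrative model, not Cloudmark's methodology). It assumes each spam campaign only becomes detectable once the filter has had some fixed learning window to observe it; replaying a stale corpus then makes every message old enough to be caught, while live traffic is not. The names `measured_catch_rate` and `learn_delay_minutes` are hypothetical.

```python
import random

random.seed(42)

def measured_catch_rate(scan_delay_minutes, learn_delay_minutes=30, n=10_000):
    """Toy model: a message is caught if its campaign is older than the
    filter's learning window at the moment the message is scanned."""
    caught = 0
    for _ in range(n):
        # Age of the campaign when the message arrives: uniform over an hour.
        # Scanning later adds the delay to the effective age.
        campaign_age = random.uniform(0, 60) + scan_delay_minutes
        if campaign_age >= learn_delay_minutes:
            caught += 1
    return caught / n

# Scanning live (no delay) vs. replaying a 6-hour-old corpus.
live = measured_catch_rate(0)     # roughly half the fresh campaigns slip by
stale = measured_catch_rate(360)  # every campaign is long since learned: 100%
```

Under these (made-up) parameters the stale-corpus test reports a perfect catch rate while the live test catches only about half, which is exactly the kind of gap that makes a delayed-corpus evaluation misleading.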
I started out this post by saying I would discuss security products, and so far I have only mentioned anti-spam systems. The issues with testing became apparent in this area first because of the number of eyes watching the performance of anti-spam systems, i.e., every e-mail user on the planet. In a later post, I will discuss why this also matters for other specific security systems and for the security space in general. For now, I am heading off to see a good friend before he leaves on an extended trip to the .eu.