In a previous post I began discussing the difficulty of evaluating security products. Essentially, the issues revolve around generating test vectors that are representative of the current threat state. If the system is attempting to counter a rapidly evolving security threat, then the time between when a test vector is generated and when the test is performed becomes critical to the fidelity of the test. For anti-spam systems, this gap determines how well a test predicts the accuracy of the solution once it is in production; spam evolves so fast that in a matter of minutes test vectors are no longer representative of the current spam state.
What about other filtration methods? In the past, anti-virus (AV) systems had to contend with several hundred new viruses a year. A set of viruses could easily be created that would be fairly representative of what a typical user would face for many days or weeks, as long as the rates of both emergence and propagation of new viruses were "low enough". This assumption, which no longer holds, worked very well when viruses were created by amateurs without a motive other than fame. Contemporary viruses are not written by kids screwing around, but by individuals attempting to build large networks of compromised home machines with the intention of leasing them out for profit. This profit motive drives a far higher rate of virus and malware production than previously seen, as exemplified by the volume of Stration/Warezov variants, which have been giving many AV companies fits in their attempts to keep these variants from propagating all over the place. By testing against even a slightly stale corpus, AV filter designers don't test against new variants, allowing them to claim far higher accuracy numbers than their products actually provide.
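To make the stale-corpus effect concrete, here is a toy sketch in Python. Everything in it is an assumption made for illustration: variants are plain integers numbered in order of appearance, the filter is a simple set of known signatures, and the mix of old and new variants in the "live" sample is invented rather than measured.

```python
# Toy model: all names and numbers are illustrative assumptions, not measurements.

def detection_rate(samples, known_signatures):
    """Return the fraction of samples whose variant ID the filter recognizes."""
    if not samples:
        raise ValueError("samples must be non-empty")
    caught = sum(1 for variant in samples if variant in known_signatures)
    return caught / len(samples)

# Variants are numbered in order of appearance; the filter knows variants 0..89.
known_signatures = set(range(90))

# Stale corpus: frozen before variants 90 and later existed.
stale_corpus = list(range(90))

# Hypothetical live traffic: half older variants, half variants that
# appeared after the corpus was frozen.
live_traffic = list(range(40, 90)) + list(range(90, 140))

stale_rate = detection_rate(stale_corpus, known_signatures)
live_rate = detection_rate(live_traffic, known_signatures)
print(f"stale corpus: {stale_rate:.0%}, live traffic: {live_rate:.0%}")
```

In this made-up setup the filter scores perfectly on the frozen corpus while missing every variant released afterward, which is the gap between claimed and delivered accuracy described above.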
What's the big deal if people don't test correctly? Well, engineers typically design and build systems to meet a specification, and they place their system under test to verify that the spec is being met. If their testing methodology is flawed, then flaws in their design go undetected. Eventually, these flaws will come to light in the public eye, as consumers start to realize that a product which claims 100% accuracy has been allowing an awfully high number of viruses to get through.
I am by no means the first person to discuss testing of security products. AV accuracy received quite a bit of attention when Consumer Reports attempted to test AV systems by using newly created viruses rather than the standard corpus. While their attempt at devising a new testing methodology was commendable, it is still not representative of how threats appear on the Internet. Using new, non-propagating viruses to test an AV system invites comparisons to the proverbial tree that falls in a forest with no one around to hear it. Additionally, it isn't the incremental changes in viruses that are difficult to catch; it is the radical evolutions in viruses, as well as the time required for the AV vendors to react, that we have to be concerned about. These are things that can't be modeled via corpus testing, only via extended testing on live traffic.
We should be asking why people don't test more frequently on live data as opposed to corpus testing. I suspect there are two reasons: labor and repeatability. With corpus testing, you hand-verify each element in the corpus as either being a virus or not once, and that cost is amortized over every test you conduct using the corpus. This isn't exactly an option with live testing, as every message that is either blocked or passed by the filter has to be hand-examined. There is also the issue of testing repeatability, where re-verification of previous results becomes difficult as the live feed evolves. Just because something is hard doesn't mean it shouldn't be done, however.
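The labor argument is simple arithmetic, and a rough sketch shows how quickly the two approaches diverge. The function names and the figures below (10,000 items per run, 20 test runs) are hypothetical, chosen only to illustrate the amortization; they are not drawn from any real test program.

```python
# Back-of-the-envelope labor model; all figures are hypothetical.

def corpus_labels_needed(corpus_size, num_tests):
    """A corpus is hand-verified once, however many tests reuse it."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return corpus_size

def live_labels_needed(messages_per_test, num_tests):
    """Every live test sees fresh traffic, so every run is verified from scratch."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return messages_per_test * num_tests

num_tests = 20
corpus_total = corpus_labels_needed(10_000, num_tests)
live_total = live_labels_needed(10_000, num_tests)

print("corpus: total labels", corpus_total, "per test", corpus_total / num_tests)
print("live:   total labels", live_total, "per test", live_total / num_tests)
```

Under these assumptions the per-test labeling cost of the corpus shrinks with every additional run, while the live cost per test stays flat and the total grows linearly.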
While systems are under live testing, the content they are filtering is being actively mutated to evade the system under test, essentially creating a multi-player noncooperative game with a limited number of participants. I will continue this discussion by examining the ramifications of this game in my next post.