Not Tetsuo, but close.
This fictional documentary covers the emergent disease Metalosis Maligna. It is so cyberpunk, it hurts.
I, like many other members of the security community, have spent the past few weeks thinking about the PatchGuard architecture that will be implemented in Vista. I resisted blogging (erg) about it because I don't want to sound like a pompous ass, but I might as well get my thoughts down on the subject rather than have them rattle around.
PatchGuard is essentially Microsoft's method for handling the volume of malware in the wild. Hooking kernel calls will become far more difficult, device drivers will have to be signed, and software that traditionally requires access to non-userland features, like firewalls and AV tools, will have to go through APIs standardized out of Redmond.
Obviously, this move raised the hackles of the traditional consumer AV organizations. Any technological edge one vendor had over another that involved interfacing with the kernel, and thereby possibly preventing more malicious software, has been eliminated. If a third-party vendor requires an avenue into the kernel that is not provided, it has to make a formal request to Microsoft for the API feature and wait for a subsequent Service Pack to deliver it.
Normalizing access to the kernel is a "good thing" from an architecture standpoint. Microsoft can't hope to manage security threats in Windows unless it reduces the attack surface, or the number of possible entry points that can be used by an attacker. Third-party vendors, however, face compression of their margins as Microsoft enters the space and technological innovation in this critical area is standardized across the industry.
At face value, this leaves us with the consumer-grade security products industry on the ropes and a vastly more secure operating system, all because of interface standardization. An opposing view comes forth when we consider the issue of "software diversity". This discipline, which I spent a fair bit of time studying, asserts that populations of systems are more secure when they are "different", or do not share common faults. In non-infosec terms, this is equivalent to diversifying a financial portfolio to reduce the risk of loss associated with correlated securities. By standardizing all security software onto essentially the same kernel interface, a new common fault, and a new target, is introduced. We won't know until Vista is widely deployed whether the drop in diversity incurred by standardizing security will offset the gains made by PatchGuard.
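The portfolio analogy can be made concrete with a small simulation. This is purely an illustrative sketch, not data from any real deployment: the fault probability, the population split, and the `compromised_fractions` helper are all assumptions of the example.

```python
import random

# Illustrative Monte Carlo sketch of the diversity argument (all numbers
# are assumptions).  Each kernel-interface "species" independently turns
# out to have an exploitable fault with probability p_fault, and every
# host behind a flawed interface is compromised at once.

def compromised_fractions(n_interfaces, p_fault=0.1, trials=10_000, seed=1):
    """One sample per trial: the fraction of the population compromised,
    with hosts spread evenly across the interfaces."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        flawed = sum(rng.random() < p_fault for _ in range(n_interfaces))
        samples.append(flawed / n_interfaces)
    return samples

mono = compromised_fractions(n_interfaces=1)     # everyone behind one interface
diverse = compromised_fractions(n_interfaces=10)

mean = lambda xs: sum(xs) / len(xs)
tail = lambda xs: sum(x > 0.5 for x in xs) / len(xs)  # P(majority falls at once)

# The expected loss is identical; what diversity buys is a thinner tail --
# the same reason a diversified portfolio keeps its expected return while
# cutting the risk of a catastrophic correlated drawdown.
print(mean(mono), mean(diverse))
print(tail(mono), tail(diverse))
```

Both populations lose about the same fraction of hosts on average, but the monoculture loses everything at once roughly one trial in ten, while the diverse population almost never loses a majority in a single event.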
Testing a security product can sometimes be a very hard job. I don't mean internal QA-type testing, where you look for logic or syntax flaws in a system; I am talking about validating that a technology is effective against a difficult-to-enumerate threat. If you are testing a flaw finder, you can seed code with a specific number of flaws and then refine the tool until all the flaws are detected. Likewise, if you are writing a vulnerability scanner to search for known holes (à la Nessus), you can construct a pool of example systems where the flaws can be found.
There are many situations where a known set of test vectors cannot be created, making the validation of a security technology somewhat hairy. What happens when the product you are testing is designed to catch threats that are rapidly evolving? Building corpora of known threats for testing against a live system is somewhat futile if the time between when a corpus is constructed and when it is used for testing is long enough for the threat to evolve significantly. Either the live system has to be tested against live data, or the system, which has most likely been fetching updates, has to be "rolled back" to the state it was in when each element in the test corpus was first detected.
Consider anti-spam systems, for example. Performing an accurate test of our software has been one of the difficulties my organization, Cloudmark, has had with potential customers. One thing we stress, over and over, is that the test environment has to be as close to the production environment as possible, especially from a temporal standpoint. Just as users don't expect their mail to be routinely delayed by 6 hours before being delivered, evaluators shouldn't run a 6-hour-old spam corpus through the system to evaluate its efficacy as a filter. As the time between when a message is received at a mail gateway and when it is scanned by the filter increases, the filter's measured accuracy approaches perfection, thus invalidating the test.
The "accuracy drift" of the anti-spam system over the course of 6 hours would be insignificant if it wasn't for the fact that spam evolves so damned fast. If spam didn't evolve, then Bayesian filters and sender blacklists would have been the end-all be-all solution for the problem. The past year has seen more and more customers realize, sometimes before we even talk to them at length about testing, that a live data stream is essential for evaluating a new anti-spam solution. I suspect this is because their previous methodology proved itself lacking: the competitor's product they bought on the strength of a corpus test performed far worse in production than the test predicted.
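A toy model makes the drift concrete. Nothing here reflects real Cloudmark measurements; the variant lifetimes, the learning delay, and the `measured_catch_rate` helper are invented for the sketch, assuming the filter keeps fetching updates while the test corpus sits frozen.

```python
import random

# Toy model of accuracy drift (all parameters are assumptions).  Spam
# variants appear continuously, and the filter learns a given variant some
# random delay after it first shows up.  A corpus frozen at capture time
# and replayed `corpus_age_hours` later hands the filter that many extra
# hours of updates against every message in it.

def measured_catch_rate(corpus_age_hours, n_messages=10_000, seed=7):
    rng = random.Random(seed)
    caught = 0
    for _ in range(n_messages):
        variant_age = rng.expovariate(1 / 2.0)  # hours since variant appeared
        learn_delay = rng.expovariate(1 / 2.0)  # hours the filter needs to learn it
        if variant_age + corpus_age_hours >= learn_delay:
            caught += 1
    return caught / n_messages

for age in (0, 1, 3, 6):
    print(f"corpus age {age}h -> measured accuracy {measured_catch_rate(age):.1%}")
```

Against live traffic (age 0) the model filter catches only about half the spam, but replaying a 6-hour-old corpus reports better than 95% — the inflated number is an artifact of the stale test, not a property of the filter.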
I started out this post by saying I would discuss security products, and so far I have only mentioned anti-spam systems. The issues with testing became apparent in this area first because of the number of eyes watching the performance of anti-spam systems, i.e. every e-mail user on the planet. In a later post, I will discuss why this also matters for both other specific security systems and the security space in general. For now, I am heading off to see a good friend before he leaves on an extended trip to the .eu.
In a previous post, I began discussing the difficulty of evaluating security products. Essentially, the issues revolve around generating test vectors that are representative of the current threat state. If the system is attempting to counter a rapidly evolving security threat, then the time between when a test vector is generated and when the test is performed becomes critical to the fidelity of the test. For anti-spam systems, this interval dominates any attempt to quantify the accuracy the solution will deliver in production; spam evolves so fast that in a matter of minutes test vectors are no longer representative of the current spam state.
What about other filtration methods? In the past, anti-virus systems had to contend with several hundred new viruses a year. A set of viruses could easily be created that would be fairly representative of what a typical user would face for many days or weeks, as long as both the rates of emergence and propagation of new viruses were "low enough". This assumption, which no longer holds, worked very well when viruses were created by amateurs with no motive other than fame. Contemporary viruses are not written by kids screwing around, but by individuals attempting to build large networks of compromised home machines with the intention of leasing them out for profit. This profit motive drives a far higher rate of virus and malware production than previously seen, as exemplified by the volume of Stration/Warezov variants, which have given many AV companies fits in their attempts to keep the program from propagating all over the place. By testing against even a slightly stale corpus, AV vendors don't test against new variants, allowing them to claim far higher accuracy numbers than their products actually provide.
What's the big deal if people don't perform testing correctly? Well, engineers typically design and build systems to meet a specification, and they place their system under test to verify that the spec is being met. If the testing methodology is flawed, then so is any confidence in the design. Eventually, these flaws will come to light in the public eye, as consumers start to realize that a product which claims 100% accuracy has been letting an awfully high number of viruses through.
I am by no means the first person to discuss testing of security products. AV accuracy received quite a bit of attention when Consumer Reports attempted to test AV systems by using newly created viruses rather than the standard corpus. While their attempt at devising a new testing methodology was commendable, it is still not representative of how threats appear on the Internet. Using new, non-propagating viruses to test an AV system invites comparison to the proverbial tree that falls in a forest with no one around to hear it. Additionally, it isn't the incremental changes in viruses that are difficult to catch; it is the radical evolutions in viruses, as well as the time required for the AV vendors to react, that we have to be concerned about. These are things that can't be modeled via corpus testing, only via extended testing on live traffic.
We should be asking why people don't test more frequently on live data as opposed to corpus testing. I suspect it comes down to two reasons: labor and repeatability. With corpus testing, you hand-verify each element of the corpus as virus or not virus exactly once, and that cost is amortized over every test you conduct using the corpus. This isn't an option with live testing, where every message the filter blocks or passes has to be hand-examined. There is also the issue of repeatability: re-verification of previous results becomes difficult as the live feed evolves. Just because something is hard doesn't mean it shouldn't be done, however.
While systems are under live testing, the content they are filtering is being actively mutated to evade the system under test, essentially creating a multi-player noncooperative game with a limited number of participants. I will continue this discussion by examining the ramifications caused by this game in my next post.
I have been commenting on the testing of security software, specifically anti-spam and anti-virus products. The main point I made in both of those posts was that testing has to be on live data feeds, regardless of how difficult the task, because the threats evolve at such a high rate that corpus-based testing quickly becomes stale and does not represent the true state of incoming traffic.
In situations where there are a limited number of security vendors and adversaries, even live testing becomes extremely difficult. Let's consider an extreme case, with a single security vendor and multiple adversaries. Every system is identical, running an up-to-date anti-virus package. (Yes, I fully realize this is a completely unrealistic example, but bear with me.) From the standpoint of the testing and user community, the accuracy of the system is perfect; no viruses are seen by the system, as they never even get an opportunity to propagate. At the same time, virus writers realize there is a huge, untapped market of machines just waiting to be compromised if they could only gain a foothold. These guys sit around and hack code until a vulnerability is found in the AV system, and upon finding one, they release a virus that exploits it in the wild.
Before the virus is released, the accuracy of the system, by any reasonable measure, is a perfect 100%: no virus has ever gotten through.
After the virus is released, havoc breaks out, aircraft fall out of the sky, and dogs and cats start living together. 5% of all computers worldwide are infected before the vendor releases a patch. Had the vendor been able to move faster, only 1% of systems would have been compromised; left to its own devices, the virus would have compromised every system connected to the net. In this situation, the accuracy of the system can be scored at least three ways: as the fraction of virus strains detected (0%, since the one novel virus was missed), as the worst-case fraction of hosts protected (0%, since unchecked the virus reaches every system), or as the fraction of hosts that actually escaped compromise (95%).
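The scenario's numbers can be plugged straight into candidate metrics. The particular formalization below is this sketch's assumption about the three measures involved, using only the 5%, 1%, and total-saturation figures from the example:

```python
# The numbers come from the scenario above; the choice of three metrics is
# an assumption of this sketch, not a formula quoted from anywhere.

strains_total, strains_detected = 1, 0    # one novel virus, never flagged
infected_actual = 0.05                    # 5% infected before the patch shipped
infected_faster = 0.01                    # 1% with a faster vendor response
infected_unchecked = 1.00                 # every host, absent any response

# (1) Per-strain detection rate: the one new virus was simply missed.
strain_accuracy = strains_detected / strains_total

# (2) Worst-case host survival: with no vendor response, every host falls.
worst_case_accuracy = 1.0 - infected_unchecked

# (3) Expectation of exploitation for a given host:
#     accuracy = 1 - P(a given host is compromised).
per_host_accuracy = 1.0 - infected_actual        # ~0.95 as events played out
per_host_accuracy_fast = 1.0 - infected_faster   # ~0.99 with a faster vendor
```

Only the third measure distinguishes the actual outcome from the faster-vendor counterfactual, which is why it responds to changes in vendor behavior that the other two ignore.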
The third of these three accuracy measures seems the most appropriate, and the most flexible given a variety of network and economic conditions and adversary styles. The measure, which is effectively the expectation of exploitation for a given host, is what anti-spam system evaluators use today. It is a slightly more sophisticated way of asking "what is the probability that a piece of spam will get through?"
From a general security standpoint, however, it captures a difficult and often ignored parameter critical to the accuracy of a security product: response time. If the window of vulnerability between when the virus first appears and when signatures are issued is shrunk, the accuracy expressed by this metric improves. In fact, zero-hour anti-virus is an emerging cottage industry in the security space. Ferris covered it back in 2004, and I talked about it at Virus Bulletin 2006.
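The grip response time has on this metric can be shown with a toy propagation model. Everything here is an assumption of the example — the growth rate, the initial infection fraction, and the `accuracy_after_window` helper — and it only illustrates how shrinking the window of vulnerability moves the number.

```python
import math

# Toy propagation model (growth rate and seed fraction are assumptions):
# a virus spreads exponentially until the vendor ships signatures
# `window_hours` after it first appears, capped once it saturates the
# entire population.

def accuracy_after_window(window_hours, initial_fraction=1e-4,
                          growth_per_hour=0.5):
    infected = min(1.0, initial_fraction *
                   math.exp(growth_per_hour * window_hours))
    return 1.0 - infected  # expectation-of-exploitation accuracy

for w in (2, 12, 24):
    print(f"{w:2d}h response window -> accuracy {accuracy_after_window(w):.2%}")
```

Under these assumptions a 2-hour response keeps per-host accuracy near 100%, a 12-hour response gives up a few points, and a full day lets the virus saturate the population, driving the metric to zero — the economic case for the zero-hour vendors.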
Many of these zero-hour technologies are currently deployed primarily in the message stream, but this probably won't last long. I suspect the technology popped up here first because of the sheer volume of e-mail-borne viruses, as well as the ease with which service providers, who ultimately end up spending money on these technologies, can quantify the cost. Mail is stored and then forwarded along, giving providers an opportunity to actually examine the number of viruses passing through, unlike web-based trojans, which just fly through on port 80. As the industry gains experience with automated means of identifying and distributing fingerprints or signatures for newly identified malware, we will see the technology spring up in other places as well.