Threat Detection Evolution: Data CollectionBy Mike Rothman
The first post in this series set the stage for the evolution of threat detection. Now that we’ve made the case for why detection must evolve, let’s work through the mechanics of what that actually means. It comes down to two functions: security data collection, and analytics of the collected data. First we’ll go through what data is helpful and where it should come from.
Threat detection requires two main types of security data. The first is internal data, security data collected from your devices and other assets within your control. It’s the stuff the PCI-DSS has been telling you to collect for years. Second is external data, more commonly known as threat intelligence. But here’s the rub: there is no useful intelligence in external threat data without context for how that data relates to your organization. But let’s not put the cart before the horse. We need to understand what security data we have before worrying about external data.
You’ve likely heard a lot about continuous monitoring because it is such a shiny and attractive term to security types. The problem we described in Vulnerability Management Evolution is that ‘continuous’ can have a bunch of different definitions, depending on who you are talking to. We have a rather constructionist view (meaning, look at the dictionary) and figure the term means “without cessation.” But in many cases, monitoring assets continually doesn’t really add much value over regular and reliable daily monitoring.
So we prefer consistent monitoring of internal resources. That may mean truly continuous, for high-profiles asset at great risk of compromise. Or possibly every week for devices/servers that don’t change much and don’t access high-value data. But the key here is to be consistent about when you collect data from resources, and to ensure the data is reliable.
There are many data sources you might collect from for detection, including:
- Logs: The good news is that pretty much all your technology assets generate logs in some way, shape, or form. Whether it’s a security or network device, a server, an endpoint, or even mobile. Odds are you can’t manage to collect data from everything, so you’ll need to choose which devices to monitor, but pretty much all devices generate log data.
- Vulnerability Data: When trying to detect a potential issue, knowing which devices are vulnerable to what can be important for narrowing down your search. If you know a certain attack targets a certain vulnerability, and you only have a handful of devices that haven’t been patched to address the vulnerability, you know where to look.
- Configuration Data: Configuration data yields similar information to vulnerability data for providing context to understand whether a device could be exploited by a specific attack.
- File Integrity: File integrity monitoring provides important information for figuring out which key files have changed. If a system file has been tampered with outside of an authorized change, it may indicate nefarious activity and should be checked out.
- Network Flows: Network flow data can identify patterns of typical (normal) network activity; which enables you to look for patterns which aren’t exactly normal and could represent reconnaissance, lateral movement, or even exfiltration.
Once you decide what data to collect, you have figure out from where and how much. This involves selecting logical collection points and where to aggregate the data. This depends on the architecture of your technology stack. Many organization opt for a centralized aggregation point to facilitate end-to-end analysis, but that is contingent on the size of the organization. Large enterprises may not be able to handle the scale of collecting everything in one place, and should consider some kind of hierarchical collection/aggregation strategy where data is stored and analyzed locally, and then a subset of the data is sent upstream for central analysis.
Finally, we need to mention the role of the cloud in collection and aggregation, because almost everything is being offered either in the cloud or as a Service nowadays. The reality is that cloud-based aggregation and analysis depend on a few things. The first is the amount of data. Moving logs or flow records is not a big deal because they are pretty small and highly compressible. Moving network packets is a much larger endeavor, and hard to shift to a cloud-based service. The other key determinant is data sensitivity – some organizations are not comfortable with their key security data outside their control in someone else’s data center/service. That’s an organizational and cultural issue, but we’ve seen a much greater level of comfort with cloud-based log aggregation over the past year, and expect it to become far more commonplace inside a 2-year planning horizon.
The other key aspect of internal data collection is integration and normalization of the data. Different data sources have different data formats, which creates a need to normalize data to compare datasets. That involves compromise in terms of granularity of common data formats, and can favor an integrated approach where all data sources are already integrated into a common security data store. Then you (as the practitioner) don’t really need to worry about making all those compromises – instead you can bet that your vendor or service provider has already done the work.
Also consider the availability of resources for dealing with these disparate data sources. The key issue, mentioned in the last post, remains the skills shortage; so starting a data aggregation/collection effort that depends on skilled resources to manage normalization and integration of data may not be the best idea. This doesn’t really have much to do with the size of the organization – it’s really about the sophistication of staff – security data integration is an advanced function that can be beyond even large organizations with less mature security efforts.
Ultimately your goal is visibility into your entire technology infrastructure. An end-to-end view of what’s happening in your environment, wherever your data is, gives you a basis for evolving your detection capabilities.
We have published a lot of research on threat intel to date, most recently a series on Applied Threat Intelligence, which summarized the three main use cases we see for external data.
There are plenty of sources of external data nowadays. The main types are:
- Commercial integrated: It seems every security vendor has a research group providing some type of intelligence. This data is usually very tightly integrated into the product or service you buy from the vendor. There may be a separate charge for intelligence, beyond the base cost of the product or service.
- Commercial standalone: Standalone threat intel is an emerging security market. These vendors typically offer an aggregation platform to collect external data and integrate it into your controls and monitoring systems. Some also gather industry-specific data because attacks tend to cluster in specific industries.
- ISAC: Information Sharing and Analysis Centers are industry-specific organizations that aggregate data from an industry and share it among members. The best known ISAC is for the financial industry, although many other industry associations are spinning up their own ISACs as well.
- OSINT: Finally there is open source intel, publicly available sources for things like malware samples and IP reputation, which can be queried and/or have intel integrated directly into user systems.
How does this external data play into an evolved threat detection capability? As mentioned above, external data without context isn’t very helpful. You don’t know which of the alerts or notifications apply to your environment, so you just create a lot of extra work to figure it out. And the idea is not to create more work.
How can you provide that context? Use the external threat data to look for specific instances of an attack. As we described in Threat Intelligence and Security Monitoring, you can use indicators from other attacks to pinpoint that activity in your network, even if you’ve never seen the attack before. Historically you were restricted to only alerting on conditions/correlations you knew about, so this is a big deal.
To use a more tangible example, let refer back to the concept of retrospection. Let’s say you didn’t know about a heretofore unknown attack (like Duqu 2.0), and received a set of indicators from a threat intelligence provider. You could then look for those indicators within your network. Even if you don’t find that specific attack immediately, you could set your monitoring system (typically a SIEM or an IPS) to watch for those indicators. Basically you can jump time, looking for attacks that haven’t happened to you – yet.
Default to the Process
As usual, it all comes back to process. We mapped that out in TI+SM and Threat Intelligence and Incident Response. You need a process to procure, collect, and utilize threat intelligence, from whatever sources it comes.
Then use external data as triggers or catalysts to mine your internal data using advanced analytics, to see if you find those indicators in your network. We’ll dig into that part of the process next.