SOC 2025: Making Sense of Security Data
By Mike Rothman
Intelligence comes from data. And there is no lack of security data, that’s for sure. Everything generates data. Servers, endpoints, networks, applications, databases, SaaS services, clouds, containers, and anything else that does anything in your technology environment. Just as there is no award for finding every vulnerability, there is no award for collecting all the security data. You want to collect the right data to make sure you can detect an attack before it becomes a breach.
As we consider what the SOC will look like in 2025, given the changing attack surface and available skills base, we've got to face reality. The sad truth is that TBs of security data sit underutilized in various data stores throughout the enterprise. It's not that security analysts don't want to use the data; they lack a consistent process to evaluate what's ingested and then analyze it continuously. But let's not put the cart before the proverbial horse. First, let's figure out what data will drive the SOC of the Future.
Security Data Foundation
The foundational sources of your security data haven’t changed much over the past decade. You start with the data from your security controls because 1) the controls are presumably detecting or blocking attacks, and 2) you still have to substantiate the controls in place for your friendly (or not so friendly) auditors. These sources include logs and alerts from your firewalls, IPSs, web proxies, email gateways, DLP systems, identity stores, etc. You may also collect network traffic, including flows and even packets.
What about endpoint telemetry from your EDR or next-gen EPP product? There is renewed interest in endpoint data because remote employees don't always traverse the corporate network, resulting in a blind spot regarding their activity and security posture. On the downside, endpoint data is plentiful and can create issues of scale and cost. The same considerations must be weighed for network packets as well.
But let’s table that discussion for a couple of sections since there is more context to discuss before truly determining whether you need to push all of the data into the security data store.
Once you get the obvious stuff in there, you need to go broader and deeper to provide the data required to evolve the SOC with advanced use cases. That means (selectively) pulling in application and database logs. You probably had an unpleasant flashback to when you tried that in the past. Your RDBMS-based SIEM fell over, and it took you three days to generate a report with all that data in there. But hear us out; you don’t need to get all the application logs, just the relevant ones.
Which brings us to the importance of threat models when planning use cases. That’s right, old-school threat models. You figure out what is most likely to be attacked in your environment (think high-value information assets) and then work backward. How would the attacker compromise the data or the device? What data would you need to detect that attack? Do you have that data? If not, how do you get it? Aggregate and then tune. Wash, rinse, repeat for additional use cases.
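The work-backward exercise above can be sketched in a few lines of code. This is a hypothetical illustration, not a product feature: the asset, attack paths, and data source names are all made up, but the logic (enumerate attack paths per high-value asset, derive the telemetry needed to detect each, and diff against what you actually collect) mirrors the process described.

```python
# Hypothetical sketch: working backward from a threat model to the
# telemetry required to detect each attack path. All names are illustrative.

THREAT_MODEL = {
    "customer-db": {  # a high-value information asset
        "attack_paths": {
            "credential theft": ["identity store logs", "VPN logs"],
            "SQL injection": ["web proxy logs", "database audit logs"],
        }
    }
}

# Data sources we currently aggregate.
COLLECTED = {"identity store logs", "web proxy logs"}

def coverage_gaps(model, collected):
    """Return, per asset, the data sources we still need to collect."""
    gaps = {}
    for asset, detail in model.items():
        needed = {src for srcs in detail["attack_paths"].values() for src in srcs}
        missing = needed - collected
        if missing:
            gaps[asset] = sorted(missing)
    return gaps

print(coverage_gaps(THREAT_MODEL, COLLECTED))
# e.g. {'customer-db': ['VPN logs', 'database audit logs']}
```

The output is the "if not, how do you get it?" list: the concrete telemetry gaps to close before the use case can fire. Re-run it as you tune and add use cases.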
We know this doesn't seem like an evolution; it's the same stuff we've been doing for over a decade, right? Not exactly, as the analytics you have at your disposal are much improved, which we'll get into later in the series. But those analytics are constrained by the availability of security data. Since you can't capture all the data, focus on the threat models and use cases that answer the questions you need answered.
Given the cloudification of seemingly everything, we need to mention two (relatively) new sources of security data: your IaaS (infrastructure as a service) providers and SaaS applications. Given the sensitivity of the data going into the cloud, over the seemingly dead bodies of the security folks who would never let that happen, you're going to need telemetry from these environments to figure out what's happening, whether those environments are at risk, and ultimately to be able to respond to potential issues. Additionally, you want to pay attention to the data moving to/from the cloud, as detecting when an adversary pivots between your environments is critical.
Is this radically different from the application and database telemetry discussed above? Not so much in content, but absolutely in location. The question then becomes which cloud security data you centralize, and how much of it, if any.
What About External Data?
Nowadays, you don’t just use your data to find attackers. You use other people’s data, or in other words, threat intelligence, which gives you the ability to look for attacks that you haven’t seen before. Threat intel isn’t new either, and threat intel platforms (TIP) are being subsumed into broader SOC platforms or evolving to focus more on security operations or analysts. There are still many sources of threat intel, some commercial and some open source. The magic is understanding which sources will be useful to you. That involves curation and evaluating the relevance of the third-party data. As we contemplate the security data that will drive the SOC, effectively leveraging threat intel is a cornerstone of the strategy.
Chilling by the (Security Data) Lake
In the early days of SIEM, there wasn’t a choice of where or how you would store your security data. You selected a SIEM, put the data in there, started with the rules and policies provided by the vendor, tuned the rules and added some more, generated the reports from the system, and hopefully found some attacks. As security tooling has evolved, now you’ve got options for how you build your security monitoring environment.
Let's start with aggregation. Or what's now called a security data lake. This new terminology indicates that it's not your grandad's SIEM. Rather, it's a place to store significantly more telemetry and make better use of it. It turns out this newfangled data lake doesn't have to be new at all. You still have the option to buy a SIEM, have it ingest and process your security data, and generate alerts. Same as it ever was, but with a shiny new name.
Alternatively, you can use a vendor’s multi-tenant cloud-based aggregation service, collecting your telemetry and doing the analytics within their cloud estate. Like other SaaS services, you get out of the business of operating the infrastructure. You also use the vendor’s analytics and other ancillary services, like SOAR, because it’s a closed environment.
Finally, you can do it yourself (DIY). The DIY option involves using a modern data store (like Mongo or Snowflake) to store your data. You find or build the connectors for ingestion. You use the platform’s analytics capabilities to design detections and generate your reports. It seems like a lot of work, but it’s definitely an option.
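To make the DIY option concrete, here is a minimal sketch of the three pieces: a connector that normalizes raw telemetry into a schema, a data store (sqlite3 stands in here for something like Snowflake or Mongo), and a detection written as a query. The schema, events, and detection rule are all illustrative assumptions.

```python
import sqlite3

# DIY data lake sketch: sqlite3 stands in for a real store like Snowflake.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, source TEXT, user TEXT, action TEXT)")

# A "connector" is just anything that maps raw telemetry into the schema.
raw_events = [
    ("2025-01-01T00:00:01", "vpn", "alice", "login_failed"),
    ("2025-01-01T00:00:02", "vpn", "alice", "login_failed"),
    ("2025-01-01T00:00:03", "vpn", "alice", "login_failed"),
    ("2025-01-01T00:00:09", "vpn", "alice", "login_success"),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", raw_events)

# A "detection" is a query: users with 3+ failed logins in the window.
rows = conn.execute("""
    SELECT user, COUNT(*) AS failures
    FROM events
    WHERE action = 'login_failed'
    GROUP BY user
    HAVING failures >= 3
""").fetchall()
print(rows)  # [('alice', 3)]
```

The work is in writing and maintaining the connectors and detections yourself, which is exactly the part a SIEM or SaaS vendor would otherwise do for you.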
The next decision is how many lakes you need. Traditionally, you'd centralize your data because it made no sense to operate two on-prem SIEMs. But given the plethora of options in terms of on-prem vs. cloud vs. managed services, it's feasible to have different security monitoring environments covering different areas of your infrastructure. This decision comes down to the operational motions occurring when an alert fires, which we'll discuss later in the series.
If you decide to centralize the security data, should it be added to your SIEM, or do you migrate to a cloud-based environment? This depends on your future platforming strategy. If the stated direction is to be cloud-first and to migrate existing data and applications to the cloud, then your decision on location is easy – choose the cloud. Then do you leave your existing SIEM in place to handle the on-prem systems? Again, that’ll depend on the operational motions.
And security isn’t the only consideration. It’s out of the scope of this research to consider application observability, but if you have moved to a model where the DevOps team operates the application and takes on some security responsibility, looking at an integrated platform that monitors for both security and performance makes sense.
For modern infrastructures (read cloud and DevOps), we generally recommend a cascading log architecture, where all of the data is aggregated within the application stack. Then security-relevant data is moved to a separate security repository, which only the security team can access. Security analytics is done here, and then alerts (and other relevant context) go to the security operations group, be it an on-prem or cloud-based environment.
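The cascading step itself is simple in principle: everything stays in the application stack's aggregation layer, and a filter forwards only security-relevant records to the security repository. A minimal sketch, with a deliberately naive relevance test and made-up categories:

```python
# Hypothetical sketch of the cascading log pattern: the application stack
# keeps everything; only security-relevant records cascade into a separate
# repository accessible only to the security team. Categories are invented.

SECURITY_RELEVANT = {"auth", "authz", "network", "audit"}

app_logs = [
    {"category": "perf", "msg": "p99 latency 120ms"},
    {"category": "auth", "msg": "failed login for bob"},
    {"category": "audit", "msg": "role change: bob -> admin"},
]

def cascade(records, relevant=SECURITY_RELEVANT):
    """Forward only security-relevant records to the security repo."""
    return [r for r in records if r["category"] in relevant]

security_repo = cascade(app_logs)
print(len(security_repo))  # the perf record stays behind
```

In practice the relevance test would be a routing rule in the log pipeline (tags, filters, or stream subscriptions) rather than an in-process list comprehension, but the data flow is the same.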
This aligns with the move to decentralize technology efforts, but will it meet the security requirements? Is it possible to do XDR (extended detection and response) if you don’t capture and centralize all security data, including network, cloud, and endpoint data? That’s going to be one of the key discussions in the next post, so let’s defer that question for the time being.
That’s the balancing act. You want to leverage data sources of all types to identify complex attacks. But it needs to be done in the most cost-efficient way.
Let's wrap up with data retention strategies. It seems security data sources never go away, and you'll have more data tomorrow than you have today. That's great for a security vendor, who gets to expand the system in perpetuity, but less so for your CFO, who's trying to figure out how to find eight figures to pay a SIEM vendor.
As security purists, we want to keep security data as long as possible. That makes sense from the perspective of finding low and slow attacks or determining the proliferation of an attack. However, there are logical constraints on storage availability and system performance (on-prem) or cost (cloud-based).
As stewards of corporate resources, you need to ask the question: do you need all that data? And more importantly, how do you make a case to keep the data flowing as budgets tighten? You need a consistent approach to evaluate the usefulness of both internal and external security data. We're going to take a page from sales and marketing to track the impact of these data sources. High-functioning go-to-market teams can quantify the value of their campaigns, events, and other marketing spend based on what contributes to closed sales. We want to take a similar approach with incidents, figuring out which security data sources contribute to impactful detections.
By adding a “data source” attribute to each incident (or even alert), you can get a sense of which security data sources provide value. You could get very granular in terms of identifying where the information source provides value (earlier is better) in the detection process and possibly determine whether it’s a primary or secondary detection source. Since most SOCs don’t know which incidents are triggered from which data sources, starting with anything will help.
Once you have the incidents instrumented, you evaluate which sources are most effective and stack rank them. We recommend you run this exercise periodically (quarterly or so) and scrutinize which sources provide value, giving you the option to double down on the stuff that works and move away from the stuff that doesn’t.
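The attribution exercise above is mostly bookkeeping, which a short sketch makes plain. The incidents and source names here are invented; the point is the shape of the data: each incident carries a list of contributing sources, and the stack rank falls out of a simple count over impactful detections.

```python
from collections import Counter

# Hypothetical sketch of data-source attribution: tag each incident with
# the telemetry sources that contributed to the detection, then stack-rank
# sources by how many impactful detections they drove.

incidents = [
    {"id": 1, "impactful": True, "sources": ["edr", "identity"]},
    {"id": 2, "impactful": True, "sources": ["edr"]},
    {"id": 3, "impactful": False, "sources": ["netflow"]},
]

def rank_sources(incidents):
    """Count contributions to impactful incidents, most valuable first."""
    counts = Counter(
        src for inc in incidents if inc["impactful"] for src in inc["sources"]
    )
    return counts.most_common()

print(rank_sources(incidents))  # [('edr', 2), ('identity', 1)]
```

Even a crude count like this gives the quarterly review something to argue about: in this toy data, netflow contributed to no impactful detections, which is exactly the conversation-starter for keeping or cutting that source.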
And you can feel good about that decision because it’s based on data. In the next post, we’ll look at the role security analytics plays as the SOC evolves.
It’s not about the data; it’s about the relationships
Both graph2vec and doc2vec came about because of the need to understand relationships at scale. Everyone can have HDFS and Parquet, but not everyone can have Knowledge Graphs. We'll get there, but let's put the focus where it needs to be. What's a source? What's a sink? Neither is as important as a source-sink relationship.
By Andre G