SOC 2025: Operationalizing the SOC
By Mike Rothman
So far in this series, we’ve discussed the challenges of security operations, making sense of security data, and refining detection/analytics, which are all critical components of building a modern, scalable SOC. Yet, there is an inconvenient fact that warrants discussion. Unless someone does something with the information, the best data and analytics don’t result in a positive security outcome.
Security success depends on consistent and effective operational motions. Sadly, this remains a commonly overlooked aspect of building the SOC. As we wrap up the series, we’re going to go from alert to action and do it effectively and efficiently, every time (consistently), which we’ll call the 3 E’s. The goal is to automate everything that can be automated, enabling the carbon (you know, humans) to focus on the things that suit them best. Will we get there by 2025? That depends on you. The technology is available; it’s a matter of whether you use it.
The 3 E’s
First, let’s be clear on the objective of security operations, which is to facilitate positive security outcomes. Achieving those outcomes means focusing on the 3 E’s.
- Effectiveness: With what’s at stake for security, you need to be right because security is asymmetric. The attackers only need to be right once, and defenders need to defeat them every time. In reality, it’s not that simple, as attackers do need to string together multiple successful attacks to achieve their mission, but that’s beside the point. A SOC that only finds stuff sometimes is not successful. You want to minimize false positives and eliminate false negatives. If an alert fires, it should identify an area of interest with sufficient context to facilitate verification and investigation.
- Efficiency: You also need to act as quickly as possible while consuming minimal resources, both because resources are limited and because significant damage (especially from an attack like ransomware) can happen in minutes. You need tooling that makes the analyst’s job easier, not harder. You also need to facilitate communication and collaboration between teams to ensure escalation happens cleanly and quickly. Breaking down the barriers between traditional operational silos becomes a critical path to streamlining operations.
- Every Time (Consistency): Finally, you need the operational motions to be designed and executed the same way, every time. But aren’t there many ways to solve a problem? Maybe. But as you scale up your security team, having specific playbooks to address issues makes it easy to onboard new personnel and ensure they achieve the first two goals: Effectiveness and Efficiency. Strive to streamline the operational motions (as associated playbooks) over time, as things change and as you learn what works in your environment.
Do you get to the 3 E’s overnight? Of course not. It takes years and a lot of effort to get there. But we can tell you that you never get there unless you start the journey.
The first step to a highly functioning SOC is being intentional. You want to determine the proper operational motions for categories of attacks before you have to address them. The more granular the playbook, the less variance you’ll get in the response and the more consistent your operations. Building the playbooks iteratively allows you to learn what works and what doesn’t, tuning and refining the playbook every time you use it. These are living documents and should be treated as such.
So how many playbooks should you define? As a matter of practice, the more playbooks, the better; but you can’t boil the ocean, especially as you get started. Begin by enumerating the situations you see most frequently. These typically include phishing, malware attacks/compromised devices, ransomware, DDoS, unauthorized account creation, and network security rule changes. To be clear, pretty much any alert could trigger a playbook, so ultimately you may get to dozens, if not hundreds. But start with maybe the top 5 alerts detected in your environment.
What goes into a playbook? Let’s look at the components of the playbook:
- Trigger: Start with the trigger, which will be an alert and have some specific contextual information to guide the next steps.
- Enrichment: Based on the type of alert, there will be additional context and information helpful to understanding the situation and streamlining the work of the analyst handling the issue. Maybe it’s DNS reputation on a suspicious IP address or an adversary profile based on the command and control traffic. You want to ensure the analyst has sufficient information to dig into the alert immediately.
- Verification: At this point, a determination needs to be made as to whether the issue warrants further investigation. What’s required to make that call? For a malware attack, maybe it’s checking the email gateway for a phishing email that arrived in the user’s inbox. Or a notification from the egress filter that a device contacted a suspicious IP address. For each trigger, you want to list the facts that will lead you to conclude this is a real issue and assess the severity.
- Action: Upon verification, what actions need to be taken? Should the device be quarantined and a forensic image of the device be captured? Should an escalation of privileges or firewall rule change get rolled back? You’ll want to determine what needs to be done and document that motion in granular detail, so there are no questions about what should be done. You’ll also look for automation opportunities, which we’ll discuss later in the post.
- Confirmation: Were the action steps successful? Confirm that the actions dictated in the playbook completed successfully. This may involve scanning the device (or service) to ensure the change was rolled back or making sure the device is no longer accessible to an attacker.
- Escalation: What’s next? Does it get routed to a 2nd tier for further verification and research? Is it sent directly to an operations team to be fixed if it can’t be automated? Can the issue be closed out because you’ve gotten the confirmation that the issue was handled? Be specific about where the information goes, in what format the supporting documentation needs to be delivered, and how you will follow up to ensure the issue has been addressed to completion.
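To make the components above concrete, here is a minimal Python sketch of a playbook as a data structure. It is an illustration under stated assumptions, not a reference implementation: the `Playbook` class, the alert fields (`device`, `dest_ip`), the outcome strings, and the in-memory `quarantined` set are all hypothetical stand-ins for whatever your tooling provides.

```python
from dataclasses import dataclass, field
from typing import Callable

Alert = dict  # hypothetical alert shape: a plain dict of fields


@dataclass
class Playbook:
    name: str
    enrich: Callable[[Alert], Alert]   # add context for the analyst
    verify: Callable[[Alert], bool]    # is this a real issue?
    act: Callable[[Alert], None]       # remediation step(s)
    confirm: Callable[[Alert], bool]   # did the action succeed?
    log: list = field(default_factory=list)

    def run(self, alert: Alert) -> str:
        """Walk one alert (the trigger) through each stage and return the outcome."""
        alert = self.enrich(alert)
        self.log.append("enriched")
        if not self.verify(alert):
            self.log.append("not verified")
            return "closed: false positive"
        self.act(alert)
        self.log.append("action taken")
        if self.confirm(alert):
            self.log.append("confirmed")
            return "closed: remediated"
        self.log.append("confirmation failed")
        return "escalated: tier 2"


# Illustrative stages for a suspicious-egress alert (all values are made up).
KNOWN_BAD_IPS = {"203.0.113.7"}  # address from a documentation-only range
quarantined = set()              # stands in for a real quarantine system


def enrich(alert):
    return {**alert, "ip_known_bad": alert["dest_ip"] in KNOWN_BAD_IPS}


def verify(alert):
    return alert["ip_known_bad"]


def act(alert):
    quarantined.add(alert["device"])


def confirm(alert):
    return alert["device"] in quarantined


egress_playbook = Playbook("suspicious-egress", enrich, verify, act, confirm)
outcome = egress_playbook.run({"device": "laptop-42", "dest_ip": "203.0.113.7"})
```

Every stage is a plain function, so each one can be swapped out or tested on its own, and the `log` list gives you a starting point for the feedback loop on what worked each time the playbook ran.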
Building playbooks is a skill, which means you’ll be pretty bad when you start. The first playbook will take you a while. The next will go a bit faster, and with each subsequent playbook, you’ll get the hang of it. By the time you’ve built the 10th, you’ll start cranking them out. Also, factor in a feedback loop, ensuring you capture what works and what doesn’t work every time the playbook runs. This practice of constant improvement is critical, given the dynamic nature of technology.
In terms of playbook design, modularity is your friend. There will be commonalities in terms of how you handle parts of the playbooks; for instance, connecting to devices or services can be standardized (via standard APIs), as can remediation actions (block this or roll back that), as well as escalations. If anything needs to be done across multiple playbooks, look to build a common module. This becomes even more important when automating the operational motions, where you can create scripts/code and reuse them across multiple playbooks.
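One way to picture the modular approach is a small registry of shared steps that several playbooks reference by name. The sketch below is hypothetical throughout: the module names (`block_ip`, `escalate_tier2`), the context fields, and the playbook definitions are illustrative, and the function bodies are placeholders where real code would call a firewall or ticketing API.

```python
from typing import Callable, Dict, List

# Registry of reusable steps, keyed by a short name (names are illustrative).
MODULES: Dict[str, Callable[[dict], dict]] = {}


def module(name: str):
    """Decorator that registers a function as a shared playbook step."""
    def register(fn):
        MODULES[name] = fn
        return fn
    return register


@module("block_ip")
def block_ip(ctx: dict) -> dict:
    # Placeholder: a real version would call a firewall or proxy API.
    return {**ctx, "blocked_ips": ctx.get("blocked_ips", []) + [ctx["ip"]]}


@module("escalate_tier2")
def escalate_tier2(ctx: dict) -> dict:
    # Placeholder: a real version would open a ticket or page an analyst.
    return {**ctx, "escalated_to": "tier2"}


# Two playbooks assembled from the same shared modules.
PLAYBOOKS: Dict[str, List[str]] = {
    "phishing": ["block_ip", "escalate_tier2"],
    "ddos": ["block_ip"],
}


def run_playbook(name: str, ctx: dict) -> dict:
    """Apply each module in the named playbook, in order, to the context."""
    for step in PLAYBOOKS[name]:
        ctx = MODULES[step](ctx)
    return ctx
```

Because `block_ip` lives in one place, a fix or an API change there carries over to every playbook that uses it.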
You may find experienced security practitioners pushing back on strict adherence to the playbook approach, and they have a point. Your rock stars don’t need that level of guidance, but it’s not about them. Achieving a consistent response requires very specific actions to be defined and executed consistently, making it more likely your less-experienced staffers will be able to execute the playbook successfully.
Automating (What Makes Sense)
Once the playbooks are stable, the operational motion will be effective (the first E). Next is to improve efficiency, which means figuring out which actions within the playbook can be automated. As mentioned above, taking a modular playbook approach allows you to introduce automation where appropriate without reinventing the wheel for common actions.
How do we start this automation journey? First, you need to orchestrate between the different systems, and then develop and deploy the automation.
- Orchestrate: Start with the playbook and define the devices and systems to be managed. Then you determine how to connect to and manage the devices. Do you need to use an API, build a script, or develop a home-grown integration? An advantage of using a commercial SOAR platform is that pre-built connectors already exist for most, if not all, end devices. These platforms also have a scripting language or visual studio to develop the connectors and automations. And let’s not forget about security, given you are managing these devices. How do you ensure proper authentication and authorization of any commands sent to the devices/services?
- Develop (and Maintain): Once you have connectivity to the device, you need to build, test and deploy the automation. Keep in mind you are in the software business once you start building automations. Maybe you can use a low-code environment, but it’s still code, so you’ve got to decide who will maintain the code. Also, consider who monitors for changes in the end device and updates the automation when new capabilities are available or needed, and how you will fix defects, especially if they break the automation.
- Deploy: It’s too bad the acronym is SOAR and not SO+B+D=R because you can’t respond to anything until you deploy the automation. You’ll need to select an execution environment and define the process to test and iterate until the automation is ready for production. That involves formal functionality testing (with actual, documented unit tests) and a burn-in period where the automation runs in monitor or debug mode to ensure it works as intended. Burn-in is critical because nothing will set back an automation program faster than bringing down systems. So the automations need to be READY before being deployed to production.
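The testing and burn-in steps above can be sketched in Python as an automation with a monitor-only switch plus documented unit tests. This is an assumption-laden illustration: `QuarantineAutomation`, its `monitor_only` flag, and the injected client's `quarantine(device_id)` method are hypothetical and do not correspond to any particular SOAR platform's API.

```python
import unittest


class QuarantineAutomation:
    """Hypothetical automation that quarantines a device via an injected client."""

    def __init__(self, client, monitor_only: bool = True):
        self.client = client              # assumed to expose quarantine(device_id)
        self.monitor_only = monitor_only  # burn-in: record intent, change nothing
        self.audit = []

    def handle(self, device_id: str) -> str:
        if self.monitor_only:
            self.audit.append(("would_quarantine", device_id))
            return "monitored"
        self.client.quarantine(device_id)
        self.audit.append(("quarantined", device_id))
        return "executed"


class FakeClient:
    """Test double standing in for the real device or EDR API."""

    def __init__(self):
        self.calls = []

    def quarantine(self, device_id):
        self.calls.append(device_id)


class QuarantineAutomationTest(unittest.TestCase):
    def test_monitor_mode_makes_no_changes(self):
        client = FakeClient()
        result = QuarantineAutomation(client, monitor_only=True).handle("laptop-42")
        self.assertEqual(result, "monitored")
        self.assertEqual(client.calls, [])

    def test_production_mode_calls_the_device(self):
        client = FakeClient()
        result = QuarantineAutomation(client, monitor_only=False).handle("laptop-42")
        self.assertEqual(result, "executed")
        self.assertEqual(client.calls, ["laptop-42"])
```

Saved as a module, the tests run with `python -m unittest`. Defaulting `monitor_only` to `True` means a new automation changes nothing until someone deliberately promotes it, and the `audit` list shows what it would have done during burn-in.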
We don’t bring up all of these sticky issues (like maintaining automation code) to scare you away from embracing automation. We raise them to make the point that security teams need additional skills in the SOC of 2025. It’s not that the days of the console expert have passed, but you’ll also need staffers with specialized coding chops. And as more and more of security becomes code, more and more of the Ops skills you’ll need will skew towards development.
So which categories of automations make sense to start? Job #1 is to build credibility, so start with functions that won’t bring down systems or otherwise cause damage. Think about alert enrichment, quarantining compromised devices, and maybe blocking egress on known-bad IP addresses. As you get some quick wins and build your credibility, you can start looking at more sophisticated operational motions. By developing the automations modularly, you can string them together to implement advanced, multi-step processes.
Although the technology to get to SOC 2025 is here today, most organizations will take the next 2-3 years to culturally accept this new approach. We fully expect most organizations to adopt a more flexible data collection and aggregation approach and introduce more sophisticated analytics in this timeframe. Automation for alert enrichment, policy changes (block known bad IP addresses), and quarantine will become commonplace. We also expect to see some aspects of security automation built directly into application stacks, especially as organizations increasingly move to the cloud and build all code using CI/CD pipelines.
But what happens then? Let’s look beyond the mid-term planning horizon to what’s in store for SOCs.
- Security Data Lakes: Many SOCs send telemetry to different places. The first is the SIEM for short-term correlation, alerting, and reporting. But many SIEMs can only maintain 60-90 days of data without killing performance or breaking the bank. Thus, telemetry is additionally sent to object storage (typically in the cloud) for longer-term storage, forensics, and cheap archival. Why not handle both objectives on one platform? An emerging approach called the security data lake indexes telemetry in object storage quickly and cost-effectively. This separates the analytics plane from the data plane, providing more scale for a lower price. Some recent entrants in the SIEM market use this model within their platforms, offering a buy (versus build) option for those interested in the approach.
- Security Justification: If you stick around long enough, you’ll see the security cost pendulum swing back and forth. First, the budget is there, and then it’s not. You’ll need to justify security expenditures in belt-tightening times, especially for threat intel and specific controls. The best way to justify spending is to substantiate value by instrumenting your SOC processes to attribute alerts and remediated issues to specific intel sources and controls. It’s kind of like how sales teams use their CRM systems to track lead sources to determine the effectiveness of marketing programs and campaigns.
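As a rough illustration of the attribution idea in the Security Justification item, the Python sketch below tallies remediated alerts by intel source and by control. The records, field names (`intel_source`, `control`, `remediated`), and feed names are all invented; in practice they would come from your case management or SOAR data.

```python
from collections import Counter

# Hypothetical closed-alert records; field names and values are illustrative.
closed_alerts = [
    {"id": 1, "intel_source": "vendor-feed-a", "control": "egress-filter", "remediated": True},
    {"id": 2, "intel_source": "vendor-feed-a", "control": "email-gateway", "remediated": True},
    {"id": 3, "intel_source": "open-source-feed", "control": "egress-filter", "remediated": False},
    {"id": 4, "intel_source": "vendor-feed-b", "control": "edr", "remediated": True},
]


def attribution(alerts, key):
    """Count remediated alerts per value of `key` (e.g. intel source or control)."""
    return Counter(a[key] for a in alerts if a["remediated"])


by_source = attribution(closed_alerts, "intel_source")
by_control = attribution(closed_alerts, "control")
```

Counts like these are only a starting point, but they give you something concrete to point to when a specific feed or control comes up for renewal.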
But first things first, there is a lot to do before we get to SOC 2025. Start by making sense of your security data and more effectively analyzing it. Being intentional about your SOC motions and systematically focusing on effectiveness and efficiency will get you to the promised land of consistency: executing security operations flawlessly every time.