Detection & Response · Apr 12, 2026 · 11 min

Cloud Incident Response on AWS: Isolation, Snapshots, and Evidence Preservation

A field-tested AWS incident response playbook for working engineers: contain a compromised instance and credentials, preserve EBS and CloudTrail evidence in a forensic account, and investigate without tipping off the attacker.

When an EC2 instance starts beaconing or an access key shows up exfiltrating S3 objects from an unfamiliar ASN, the instinct is to terminate everything and move on. In the cloud that instinct destroys exactly the evidence you need and, worse, tells a competent adversary you have spotted them. AWS incident response is a discipline of careful, ordered moves: contain the blast radius, preserve volatile and durable evidence in a place the attacker cannot reach, reconstruct a timeline, then eradicate and recover. This is the playbook we drill with engineers who run production workloads, and every step below maps to real AWS APIs you can wire into a runbook today.

Before the incident: a forensic account and isolation primitives

You cannot improvise an investigation environment mid-incident. Stand up a dedicated forensics account in your AWS Organization, locked down with an SCP so only the IR role can assume into it, and give it an isolated forensic VPC with no internet gateway and no peering to production. This account is where copied snapshots are decrypted, mounted read-only, and analyzed. Keeping forensics out of the compromised account means the attacker's stolen credentials cannot touch your evidence, and it gives legal a clean chain of custody. Pre-stage the boring primitives too: an empty quarantine security group with no inbound and no outbound rules (a new group ships with an allow-all egress rule, so revoke it explicitly), a customer-managed KMS key shared with the forensic account, and an SNS-driven runbook so responders are not copy-pasting commands under pressure.
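As a minimal sketch of pre-staging the quarantine group, the function below creates a security group and then revokes the default allow-all egress rule so the group ends up with no rules in either direction. The function name `create_quarantine_sg` and the group name `ir-quarantine` are illustrative assumptions, not part of any existing tooling; check the result with `aws ec2 describe-security-groups` before relying on it.

```shell
# Sketch only: create an empty quarantine security group in the given VPC.
# create_quarantine_sg and the name "ir-quarantine" are hypothetical.
create_quarantine_sg() {
  local vpc_id="$1" sg_id

  sg_id=$(aws ec2 create-security-group \
    --group-name ir-quarantine \
    --description "IR quarantine: no inbound, no outbound" \
    --vpc-id "$vpc_id" \
    --query GroupId --output text) || return 1

  # A new group has no inbound rules but starts with an allow-all IPv4
  # egress rule. Revoke it so the group is genuinely empty.
  aws ec2 revoke-security-group-egress \
    --group-id "$sg_id" \
    --ip-permissions '[{"IpProtocol":"-1","IpRanges":[{"CidrIp":"0.0.0.0/0"}]}]' \
    >/dev/null || return 1

  echo "$sg_id"
}

# Usage (VPC ID is a placeholder):
#   QUARANTINE_SG=$(create_quarantine_sg vpc-0123456789abcdef0)
```

Creating the group ahead of time, once per VPC, means the containment step is a single attribute change rather than a create-then-attach sequence under pressure.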

Golden rule of cloud IR: copy before you change, and never act in a way that reveals you are watching. Terminating an instance, deleting a key, or rotating a password the attacker is actively using is a tell. Snapshot and isolate quietly first, then decide on eradication once you understand the full footprint.

Contain: isolate the instance without killing it

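Containment is about removing the host's ability to do harm while keeping it running so memory and live state survive. The cleanest move is to swap every network interface to the quarantine security group, which blocks new connections at the VPC layer without a reboot. Security groups are stateful, so connections that are already tracked can persist after the swap until they close or time out; check VPC Flow Logs to confirm the host has gone quiet, and consider a deny-all network ACL on the subnet if you need a hard cut. Do not detach the volume, do not stop the instance unless you have no choice, and tag it immediately so automation and humans both know it is off-limits.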

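# Replace the instance's security groups with the empty quarantine SG.
# This blocks new connections without stopping the instance. If the instance
# has more than one ENI this call can fail; set the group on each ENI with
# modify-network-interface-attribute instead.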
INSTANCE_ID=i-0abc123def4567890
QUARANTINE_SG=sg-0quarantine000000

aws ec2 modify-instance-attribute \
  --instance-id "$INSTANCE_ID" \
  --groups "$QUARANTINE_SG"

# Tag the instance as quarantined so guardrail automation leaves it alone.
aws ec2 create-tags \
  --resources "$INSTANCE_ID" \
  --tags Key=ir:status,Value=quarantined \
         Key=ir:case,Value=CASE-2026-0412

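# Remove the instance from its Auto Scaling group so it is not terminated
# and recycled out from under you. --should-decrement-desired-capacity
# detaches WITHOUT launching a replacement (the --no- form launches one);
# the call fails if it would push the group below its minimum size.
aws autoscaling detach-instances \
  --instance-ids "$INSTANCE_ID" \
  --auto-scaling-group-name web-asg \
  --should-decrement-desired-capacity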
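The paragraph above asks for every network interface to be moved to the quarantine group, while the `modify-instance-attribute` call is aimed at the single-interface case. A hedged sketch for hosts with several ENIs follows; the function name `quarantine_all_enis` is an assumption made for illustration.

```shell
# Sketch only: put every ENI attached to an instance into the quarantine SG.
# quarantine_all_enis is a hypothetical helper name.
quarantine_all_enis() {
  local instance_id="$1" quarantine_sg="$2" eni

  for eni in $(aws ec2 describe-network-interfaces \
      --filters Name=attachment.instance-id,Values="$instance_id" \
      --query 'NetworkInterfaces[].NetworkInterfaceId' --output text); do
    # --groups replaces the full list of groups on this interface.
    aws ec2 modify-network-interface-attribute \
      --network-interface-id "$eni" \
      --groups "$quarantine_sg" || return 1
    echo "quarantined $eni"
  done
}

# Usage:
#   quarantine_all_enis "$INSTANCE_ID" "$QUARANTINE_SG"
```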

If the instance has an IAM instance profile, that role's temporary credentials may already be loaded into the attacker's tooling via IMDS. Detaching the profile stops new role calls, but credentials already vended stay valid until they expire, which is why session revocation below matters more than the profile detach alone.
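For the profile detach itself, a minimal sketch looks like the following. The function name `detach_instance_profile` is hypothetical. Note that SSM Agent typically authenticates with the instance profile, so if you plan to capture memory through SSM, run this step after the capture rather than before it.

```shell
# Sketch only: remove any IAM instance profile association from an instance.
# detach_instance_profile is a hypothetical helper name.
detach_instance_profile() {
  local instance_id="$1" assoc

  for assoc in $(aws ec2 describe-iam-instance-profile-associations \
      --filters Name=instance-id,Values="$instance_id" \
      --query 'IamInstanceProfileAssociations[].AssociationId' --output text); do
    aws ec2 disassociate-iam-instance-profile \
      --association-id "$assoc" >/dev/null || return 1
    echo "disassociated $assoc"
  done
}

# Usage:
#   detach_instance_profile "$INSTANCE_ID"
```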

Contain credentials: deactivate keys and revoke live sessions

Long-lived IAM access keys are deactivated, not deleted, so you preserve them as evidence and can correlate the key ID against CloudTrail. The harder problem is temporary STS credentials, which you cannot individually revoke. Instead you attach an inline deny policy to the principal conditioned on aws:TokenIssueTime, which instantly invalidates every session issued before now while letting you mint fresh credentials for legitimate use.

# 1. Deactivate (do not delete) the suspect long-lived access key.
aws iam update-access-key \
  --user-name svc-deploy \
  --access-key-id AKIAEXAMPLE0000000000 \
  --status Inactive

# 2. Revoke ALL active STS sessions for a role by denying any request
#    whose credentials were issued before this moment. Attach as an
#    inline policy on the role (AWS provides the AWSRevokeOlderSessions
#    pattern for exactly this).
NOW=$(date -u +%Y-%m-%dT%H:%M:%SZ)
cat > revoke.json <<JSON
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Deny",
    "Action": "*",
    "Resource": "*",
    "Condition": { "DateLessThan": { "aws:TokenIssueTime": "$NOW" } }
  }]
}
JSON

aws iam put-role-policy \
  --role-name app-server-role \
  --policy-name IR-RevokeOlderSessions \
  --policy-document file://revoke.json

For a federated or human identity, also disable the IAM Identity Center user and kill their active sessions; for the root account, treat any sign of root use as a top-severity event and rotate root credentials and MFA out of band.

Preserve: snapshot disk, capture memory, freeze the logs

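Now copy the evidence. Take an EBS snapshot of every attached volume, tag it to the case, then copy it with your forensic KMS key and share it into the forensic account, where a volume restored from it is attached and mounted read-only. Memory is more volatile and more valuable: if the instance is still running and still reachable through SSM, dump RAM with a tool like AVML to an S3 evidence bucket before you ever consider stopping the host, because a stop wipes it forever. A quarantine group with no outbound rules also blocks new connections to the SSM and S3 endpoints, so either capture memory before the swap or use a capture-only group that allows HTTPS to those VPC endpoints and nothing else.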

# Snapshot every EBS volume on the instance, tagged to the case.
for VOL in $(aws ec2 describe-volumes \
      --filters Name=attachment.instance-id,Values="$INSTANCE_ID" \
      --query 'Volumes[].VolumeId' --output text); do
  aws ec2 create-snapshot \
    --volume-id "$VOL" \
    --description "IR CASE-2026-0412 forensic copy of $VOL" \
    --tag-specifications \
      'ResourceType=snapshot,Tags=[{Key=ir:case,Value=CASE-2026-0412},{Key=ir:status,Value=evidence}]'
done

# Capture volatile memory via SSM BEFORE any stop, writing to an
# evidence bucket with Object Lock (WORM) enabled.
aws ssm send-command \
  --instance-ids "$INSTANCE_ID" \
  --document-name "AWS-RunShellScript" \
  --parameters 'commands=["avml /tmp/mem.lime","aws s3 cp /tmp/mem.lime s3://ir-evidence-case-2026-0412/ --sse aws:kms"]'
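The copy-and-share step described in the paragraph above can be sketched as below. The function name `share_snapshot_with_forensics` and its argument order are assumptions; the KMS key and forensic account ID are whatever you pre-staged, and the forensic account still needs permission on that key in the key policy before it can create a volume from the copy.

```shell
# Sketch only: re-encrypt a snapshot under the forensic KMS key and share
# the copy with the forensic account. Helper name is hypothetical.
share_snapshot_with_forensics() {
  local snap_id="$1" region="$2" kms_key="$3" forensic_account="$4" copy_id

  # The source snapshot must be complete before it can be copied.
  aws ec2 wait snapshot-completed --region "$region" \
    --snapshot-ids "$snap_id" || return 1

  copy_id=$(aws ec2 copy-snapshot \
    --region "$region" \
    --source-region "$region" \
    --source-snapshot-id "$snap_id" \
    --encrypted --kms-key-id "$kms_key" \
    --description "IR forensic copy of $snap_id" \
    --query SnapshotId --output text) || return 1

  aws ec2 wait snapshot-completed --region "$region" \
    --snapshot-ids "$copy_id" || return 1

  # Grant the forensic account permission to create volumes from the copy.
  aws ec2 modify-snapshot-attribute \
    --region "$region" \
    --snapshot-id "$copy_id" \
    --attribute createVolumePermission \
    --operation-type add \
    --user-ids "$forensic_account" || return 1

  echo "$copy_id"
}

# Usage (all values are placeholders):
#   share_snapshot_with_forensics snap-0123456789abcdef0 us-east-1 \
#     alias/ir-forensics 111122223333
```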

CloudTrail is your source of truth for who did what, but Event history in the console only retains 90 days of management events, and an attacker with enough access can stop a trail. Treat a CloudTrail Lake event data store as the durable record: it is immutable for the retention period you set, queryable in SQL, and can capture management plus data events across the org when configured to. For any trail that delivers to S3, verify log file integrity validation is on so you can prove logs were not tampered with, and confirm your evidence S3 buckets use Object Lock in compliance mode so even an admin cannot delete them during the investigation.
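Those checks are easy to script. The sketch below, with a hypothetical function name `check_evidence_controls`, reads three settings and returns non-zero if any is not in the expected state. It assumes the bucket has a default retention rule; a bucket with Object Lock enabled but no default rule will report a mode of `None`.

```shell
# Sketch only: confirm a trail is logging with integrity validation and
# that the evidence bucket defaults to Object Lock compliance mode.
# check_evidence_controls is a hypothetical helper name.
check_evidence_controls() {
  local trail_name="$1" bucket="$2" validation logging mode

  validation=$(aws cloudtrail describe-trails \
    --trail-name-list "$trail_name" \
    --query 'trailList[0].LogFileValidationEnabled' --output text) || return 1

  logging=$(aws cloudtrail get-trail-status --name "$trail_name" \
    --query IsLogging --output text) || return 1

  mode=$(aws s3api get-object-lock-configuration --bucket "$bucket" \
    --query 'ObjectLockConfiguration.Rule.DefaultRetention.Mode' \
    --output text) || return 1

  echo "validation=$validation logging=$logging object_lock_mode=$mode"

  # Text output renders booleans as True/False.
  [ "$validation" = "True" ] && [ "$logging" = "True" ] && [ "$mode" = "COMPLIANCE" ]
}

# Usage (names are placeholders):
#   check_evidence_controls org-trail ir-evidence-case-2026-0412 || echo "FIX BEFORE PROCEEDING"
```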

Investigate: build the timeline from CloudTrail Lake

Reconstruct the attack as an ordered timeline keyed on the compromised principal. Query CloudTrail Lake for every action by the suspect access key or role, then pivot on the source IP and user agent to find lateral movement. The questions you are answering: when did the credential first act anomalously, what did it touch, did it create persistence (new IAM users, access keys, login profiles, Lambda functions, or trust policy edits), and did it try to blind you by disabling GuardDuty or stopping a trail.

-- CloudTrail Lake: every action by the suspect key, newest first.
SELECT eventTime, eventName, sourceIPAddress, userAgent,
       errorCode, requestParameters
FROM   <event_data_store_id>
WHERE  userIdentity.accessKeyId = 'AKIAEXAMPLE0000000000'
AND    eventTime > '2026-04-10 00:00:00'
ORDER  BY eventTime DESC;

-- Hunt for persistence and defense-evasion in the blast window.
SELECT eventTime, eventName, userIdentity.arn, sourceIPAddress
FROM   <event_data_store_id>
WHERE  eventName IN ('CreateAccessKey','CreateUser','CreateLoginProfile',
                     'PutUserPolicy','AttachUserPolicy','PutRolePolicy',
                     'AttachRolePolicy','UpdateAssumeRolePolicy',
                     'CreateFunction20150331',
                     'StopLogging','DeleteTrail','DeleteDetector')
AND    eventTime > '2026-04-10 00:00:00'
ORDER  BY eventTime ASC;
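To run queries like these from a runbook rather than the console, a rough sketch using the CloudTrail CLI follows. The function name `run_lake_query` and the `POLL_SECONDS` variable are assumptions for illustration; large result sets are paginated, so treat the output as a first page unless your CLI version handles pagination for you.

```shell
# Sketch only: submit a CloudTrail Lake query, wait for it, print results.
# run_lake_query and POLL_SECONDS are hypothetical names.
run_lake_query() {
  local sql="$1" query_id status

  query_id=$(aws cloudtrail start-query --query-statement "$sql" \
    --query QueryId --output text) || return 1

  while :; do
    status=$(aws cloudtrail describe-query --query-id "$query_id" \
      --query QueryStatus --output text) || return 1
    case "$status" in
      FINISHED) break ;;
      FAILED|CANCELLED|TIMED_OUT)
        echo "query $query_id ended with status $status" >&2
        return 1 ;;
    esac
    sleep "${POLL_SECONDS:-2}"
  done

  aws cloudtrail get-query-results --query-id "$query_id" --output json
}

# Usage (replace the placeholder with your event data store ID):
#   run_lake_query "SELECT eventTime, eventName FROM <event_data_store_id> WHERE eventName = 'StopLogging'"
```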

Correlate the timeline against GuardDuty findings and VPC Flow Logs for the quarantined ENI to confirm exfiltration destinations and data volumes. The output of this phase is a written narrative with timestamps, the full list of touched resources, and an inventory of attacker-created artifacts. You cannot eradicate what you have not enumerated.
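For the flow log side of that correlation, a small filter is often enough to rank exfiltration destinations. The sketch below assumes records in the default (version 2) flow log format, where field 3 is the interface ID, field 4 the source address, field 5 the destination address, field 10 the byte count, and field 13 the action. The function name `top_destinations` is hypothetical, and custom log formats will need different field numbers.

```shell
# Sketch only: total accepted outbound bytes per destination for one ENI,
# reading default-format VPC Flow Log records on stdin.
# top_destinations is a hypothetical helper name.
top_destinations() {
  local eni="$1" src_ip="$2"

  awk -v eni="$eni" -v src="$src_ip" '
    # Keep only accepted records sent FROM the quarantined host on this ENI.
    $3 == eni && $4 == src && $13 == "ACCEPT" && $10 ~ /^[0-9]+$/ {
      bytes[$5] += $10
    }
    END {
      for (dst in bytes) printf "%s %.0f\n", dst, bytes[dst]
    }
  ' | sort -k2,2nr
}

# Usage (file name, ENI, and IP are placeholders):
#   top_destinations eni-0123456789abcdef0 10.0.1.5 < flowlogs.txt | head
```

Filtering on the source address matters because flow logs record both directions; without it, inbound replies would be counted against the instance's own IP.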

Eradicate and recover

Eradication removes the attacker's footholds in one coordinated move, not piecemeal, so they cannot fall back to a second key while you close the first. Delete every attacker-created IAM principal, access key, and login profile; revert tampered trust and resource policies; remove rogue Lambda functions, scheduled tasks, and any malicious AMIs or snapshots they shared out of the account. Rotate every secret the host could have read, including database credentials, API tokens, and anything in the instance's environment or Secrets Manager, on the assumption it is all burned.

  • Rebuild, do not clean: launch replacement instances from a known-good, patched AMI rather than trying to disinfect the compromised host.
  • Close the entry vector first: an unpatched CVE, an exposed credential in a repo, or an over-permissive role. Recovering onto the same hole invites a repeat within hours.
  • Restore data from a backup taken before the earliest confirmed compromise, not the most recent snapshot, which may already be poisoned.
  • Re-enable Auto Scaling and traffic only after the replacement passes validation and detections are tuned to catch a re-entry attempt.
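  • Keep all evidence snapshots, memory dumps, and the CloudTrail Lake store locked against deletion (Object Lock for S3 evidence, snapshot lock for EBS, termination protection for the event data store) until the case is formally closed.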

Finally, write the blameless postmortem while the detail is fresh: the detection gap that let dwell time accumulate, the containment steps that worked, and the guardrails (SCPs, automated quarantine, key-age alerts, mandatory IMDSv2) that would have shrunk the blast radius. The goal of every incident is to make the next one smaller, faster to contain, and easier to prove. That is the muscle we build in the AWS security labs at ShieldSync, where engineers run this entire playbook end to end against a live compromised account.
